21 comments

  • bluelightning2k
    3 minutes ago
    Is this Python only?

    More importantly: can this be used to run untrusted jobs? E.g. user-supplied or AI supplied code?

  • nik736
    1 hour ago
    The readme assumes users with darkmode outweigh users without (the logo is white, invisible without darkmode). Would be interesting to see stats from Github for this!
  • followben
    9 hours ago
    How does this compare to other pg-backed python job runners like Procrastinate [0] or Chancy [1]?

    [0] https://github.com/procrastinate-org/procrastinate/

    [1] https://github.com/TkTech/chancy

    • gabrielruttner
      5 hours ago
      Gabe here, one of the Hatchet founders. I'm not very familiar with these runners, so someone please correct me if I missed something.

      These look like great projects to get something running quickly, but likely will experience many of the challenges Alexander mentioned under load. They look quite similar to our initial implementation using FOR UPDATE and maintaining direct connections from workers to PostgreSQL instead of a central orchestrator (a separate issue that deserves its own post).

      One of the reasons for this decision was to performantly support more complex scheduling requirements and durable execution patterns -- things like dynamic concurrency [0] or rate limits [1], which can be quite tricky to implement on a worker-pull model where there will likely be contention on these orchestration tables.

      They also appear to be pure queues to run individual tasks in Python only. We've been working hard on our Python, TypeScript, and Go SDKs.

      I'm excited to see how these projects approach these problems over time!

      [0] https://docs.hatchet.run/home/concurrency

      [1] https://docs.hatchet.run/home/rate-limits

    • wcrossbow
      8 hours ago
      Another good one is pgqueuer https://github.com/janbjorge/pgqueuer
    • INTPenis
      9 hours ago
      Celery also has a Postgres backend, but maybe it's not as well integrated.
      • igor47
        8 hours ago
        It's just a results backend, you still have to run rabbitmq or redis as a broker
  • stephen
    2 hours ago
    Do queue operations (enqueue a job & mark this job as complete) happen in the same transaction as my business logic?

    Imo that's the killer feature of database-based queues, because it dramatically simplifies reasoning about retries, i.e. "did my endpoint logic commit _and_ my background operation enqueue both atomically commit, or atomically fail"?

    Same thing for performing jobs, if my worker's business logic commits, but the job later retries (b/c marking the job as committed is a separate transaction), then oof, that's annoying.

    And I might as well be using SQS at that point.
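
A minimal sketch of the atomicity described above, using SQLite from the Python standard library as a stand-in for Postgres. The `orders` and `jobs` tables and the `place_order` function are hypothetical names made up for illustration; this is not how Hatchet or any other library in this thread enqueues work.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, payload TEXT NOT NULL)"
)
conn.commit()


def place_order(conn, item, fail_after_enqueue=False):
    """Write the business row and enqueue the follow-up job in one transaction."""
    # `with conn` commits if the body succeeds and rolls back if it raises.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        payload = json.dumps({"order_id": cur.lastrowid})
        conn.execute(
            "INSERT INTO jobs (kind, payload) VALUES (?, ?)", ("send_receipt", payload)
        )
        if fail_after_enqueue:
            raise RuntimeError("simulated failure before commit")


def counts(conn):
    """Return (number of orders, number of jobs)."""
    orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    jobs = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    return orders, jobs


place_order(conn, "book")  # both rows commit together
try:
    place_order(conn, "pen", fail_after_enqueue=True)  # both rows roll back together
except RuntimeError:
    pass
```

After both calls there is exactly one order and one job: the failed call leaves neither an orphaned order nor an orphaned job behind.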

  • diarrhea
    19 hours ago
    This is very exciting stuff.

    I’m curious: When you say FOR UPDATE SKIP LOCKED does not scale to 25k queries/s, did you observe a threshold at which it became untenable for you?

    I’m also curious about the two points of:

    - buffered reads and writes

    - switching all high-volume tables to use identity columns

    What do you mean by these? Were those (part of) the solution to scale FOR UPDATE SKIP LOCKED up to your needs?

    • abelanger
      19 hours ago
      I'm not sure of the exact threshold, but the pathological case seemed to be (1) many tasks in the backlog, (2) many workers, (3) workers long-polling the task tables at approximately the same time. This would consistently lead to very high spikes in CPU and result in a runaway deterioration on the database, since high CPU leads to slower queries and more contention, which leads to higher connection overhead, which leads to higher CPU, and so on. There are a few threads online which documented very similar behavior, for example: https://postgrespro.com/list/thread-id/2505440.

      Those other points are mostly unrelated to the core queue, and more related to helper tables for monitoring, tracking task statuses, etc. But it was important to optimize these tables because unrelated spikes on other tables in the database could start getting us into a deteriorated state as well.

      To be more specific about the solutions here:

      > buffered reads and writes

      To run a task through the system, we need to write the task itself, write the instance of that task for the current retry count to the queue, write an event that the task has been queued, started, completed | failed, etc. Generally one task will correspond to many writes along the way, not all of which need to be extremely latency sensitive. So we started buffering items coming from our internal queues and flushing them once every 10ms, which helped considerably.
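
The buffering idea above could look roughly like the sketch below. It is an illustration only, not Hatchet's implementation: the `WriteBuffer` class, its parameters, and the `flush_fn` callback are all hypothetical, and in a real system `flush_fn` would presumably perform one batched database write.

```python
import threading


class WriteBuffer:
    """Collects items and hands them to `flush_fn` in batches."""

    def __init__(self, flush_fn, interval_s=0.010, max_items=1000):
        self._flush_fn = flush_fn          # called with a non-empty list of items
        self._interval_s = interval_s      # 10 ms, the interval mentioned above
        self._max_items = max_items        # flush early if the buffer grows this large
        self._items = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add(self, item):
        with self._lock:
            self._items.append(item)
            full = len(self._items) >= self._max_items
        if full:
            self.flush()

    def flush(self):
        # Swap the list under the lock, then call flush_fn outside it so that
        # slow writes do not block producers calling add().
        with self._lock:
            batch, self._items = self._items, []
        if batch:
            self._flush_fn(batch)

    def _run(self):
        # Event.wait returns False on timeout and True once close() sets the event.
        while not self._stop.wait(self._interval_s):
            self.flush()

    def close(self):
        self._stop.set()
        self._thread.join()
        self.flush()  # write whatever is left


batches = []
buf = WriteBuffer(batches.append)
for i in range(50):
    buf.add({"event": "queued", "task": i})
buf.close()
```

The trade-off is that a write can sit in memory for up to one interval, which is why this only suits writes that are not latency sensitive.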

      > switching all high-volume tables to use identity columns

      We originally had combined some of our workflow tables with our monitoring tables -- this table was called `WorkflowRun` and it was used for both concurrency queues and queried when serving the API. This table used a UUID as the primary key, because we wanted UUIDs over the API instead of auto-incrementing IDs. The UUIDs caused some headaches down the line when trying to delete batches of data and prevent index bloat.
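
One common way to get both properties is an auto-incrementing internal key plus a separate UUID column for the API. The sketch below assumes that layout; the comment above does not say whether Hatchet kept a UUID column, and the `task_runs` table is hypothetical. SQLite's `AUTOINCREMENT` stands in here for a Postgres `BIGINT GENERATED ALWAYS AS IDENTITY` column.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task_runs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- internal key, always increasing
        external_id TEXT NOT NULL UNIQUE,      -- UUID exposed over the API
        status TEXT NOT NULL
    )
    """
)
conn.commit()


def create_run(conn, status="queued"):
    """Insert a run and return the UUID a client would see."""
    external_id = str(uuid.uuid4())
    with conn:
        conn.execute(
            "INSERT INTO task_runs (external_id, status) VALUES (?, ?)",
            (external_id, status),
        )
    return external_id


def delete_batch_up_to(conn, max_id):
    """Delete a contiguous range of old rows by internal id."""
    with conn:
        cur = conn.execute("DELETE FROM task_runs WHERE id <= ?", (max_id,))
    return cur.rowcount


ids = [create_run(conn) for _ in range(5)]
deleted = delete_batch_up_to(conn, 3)
remaining = [r[0] for r in conn.execute("SELECT external_id FROM task_runs ORDER BY id")]
```

Because the internal ids follow insertion order, old rows form a contiguous range, which makes batch deletes a simple range predicate.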

      • chaz6
        1 hour ago
        Out of interest, did you try changing the value of commit_delay? This parameter allows multiple transactions to be written together under heavy load.
      • diarrhea
        10 hours ago
        Thank you! Very insightful, especially the forum link and the observation around UUIDs bloating indexes.
  • anentropic
    1 hour ago
    Quick feedback:

    Would love to see some sort of architecture overview in the docs

    The top-level docs have a section on "Deploying workers" but I think there are more components than that?

    It's cool there's a Helm chart but the docs don't really say what resources it would deploy

    https://docs.hatchet.run/self-hosting/docker-compose

    ...shows four different Hatchet services plus, unexpectedly, both a Postgres server and RabbitMQ. Can't see anywhere that describes what each one of those does

    Also in much of the docs it's not very clear where the boundary between Hatchet Cloud and Hatchet the self-hostable OSS part lies

  • lysecret
    10 hours ago
    This is awesome and I will take a closer look! One question: We ran into an issue with using Postgres as a message queue with messages that need to be toasted/have large payloads (50mb+).

    Only fix we could find was using unlogged tables and a full vacuum on a schedule. We aren’t big Postgres experts but since you are I was wondering if you have fixed this issue/this framework works well for large payloads.

    • igor47
      8 hours ago
      Don't put them in the queue. Put the large payload into an object store like s3/gcs and put a reference into the db or queue
      • szvsw
        6 hours ago
        Yep - this is also the official recommended method by Hatchet, also sometimes called payload thinning.
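
A rough sketch of the pattern, with an in-memory dict standing in for S3/GCS and a plain list standing in for the queue. All names here (`InMemoryObjectStore`, `enqueue_thin`, `handle`) are hypothetical and not part of any SDK.

```python
import hashlib
import json


class InMemoryObjectStore:
    """Stand-in for an object store such as S3 or GCS: maps keys to bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


def enqueue_thin(queue, store, task_name, payload):
    """Upload the large payload, then enqueue only a small reference to it."""
    key = "payloads/" + hashlib.sha256(payload).hexdigest()
    store.put(key, payload)
    message = json.dumps({"task": task_name, "payload_ref": key, "size": len(payload)})
    queue.append(message)
    return message


def handle(message, store):
    """Worker side: resolve the reference back into the full payload."""
    meta = json.loads(message)
    return store.get(meta["payload_ref"])


store = InMemoryObjectStore()
queue = []
big = b"x" * (5 * 1024 * 1024)  # 5 MiB payload
msg = enqueue_thin(queue, store, "process_upload", big)
```

The queued message stays a few hundred bytes at most regardless of payload size, so the queue table never has to store the large blob itself.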
  • morsecodist
    13 hours ago
    This is great timing. I am in the process of designing an event/workflow driven application and nothing I looked at felt quite right for my use case. This feels really promising. Temporal was close but it just felt like not the perfect fit. I like the open source license a lot; it gives me more confidence designing an application around it. The conditionals are also great. I have been looking for something just like CEL and despite my research I had never heard of it. It is exactly how I want my expressions implemented; I was on the verge of trying to build something like this myself.
  • kianN
    13 hours ago
    Congratulations on the v1 launch! I’ve been tinkering with hatchet for almost a year, deployed it in production about 6 months ago.

    The open source support and QuickStart are excellent. The engineering work put into the system is very noticeable!

  • latchkey
    12 hours ago
    Cool project. Every time one of these projects comes up, I'm always somewhat disappointed it isn't an open source / postgres version of GCP Cloud Tasks.

    All I ever want is a queue where I submit a message and then it hits an HTTP endpoint with that message as POST. It is such a better system than dedicated long running worker listeners, because then you can just scale your HTTP workers as needed. Pairs extremely well with autoscaling Cloud Functions, but could be anything really.

    I also find that DAGs tend to get ugly really fast because it generally involves logic. I'd prefer that logic to not be tied into the queue implementation because it becomes harder to unit test. Much easier to reason about if you have the HTTP endpoint create a new task, if it needs to.

    • abelanger
      5 hours ago
      We actually have support for that, we just haven't migrated the doc over to v1 yet: https://v0-docs.hatchet.run/home/features/webhooks. We'll send a POST request for each task.

      > It is such a better system than dedicated long running worker listeners, because then you can just scale your HTTP workers as needed.

      This depends on the use-case - with long running listeners, you get the benefit of reusing caches, database connections, and disk, and from a pricing perspective, if your task spends a lot of time waiting for i/o operations (or waiting for an event), you don't get billed separately for CPU time. A long-running worker can handle thousands of concurrently running functions on cheap hardware.

      > I also find that DAGs tend to get ugly really fast because it generally involves logic. I'd prefer that logic to not be tied into the queue implementation because it becomes harder to unit test. Much easier to reason about if you have the HTTP endpoint create a new task, if it needs to.

      We usually recommend that DAGs which require too much logic (particularly fanout to a dynamic amount of workflows) should be implemented as a durable task instead.

      • gabrielruttner
        4 hours ago
        Admittedly webhook workers aren't exactly this, since we send multiple tasks to the same endpoint, whereas I believe you can register one endpoint per task with Cloud Tasks. That said, this is not a large change.
    • jsmeaton
      5 hours ago
      Cloudtasks are excellent and I’ve been wanting something similar for years.

      I’ve been occasionally hacking away at a proof of concept built on riverqueue but have eased off for a while due to performance issues obvious with non-partitioned tables and just general laziness.

      https://github.com/jarshwah/dispatchr if curious but it doesn’t actually work yet.

    • lysecret
      9 hours ago
      Yeah, I also like this system. The only problem I was facing with it was that HTTP reads would lead to timeouts/lost connections. And task queues specifically have a 30 min execution limit. But I really like how it separates the queueing logic from the whole application/execution graph. Task queues are one of my favourite pieces of cloud infrastructure.
    • igor47
      8 hours ago
      How do you deal with cloud tasks in dev/test?
      • jerrygenser
        5 hours ago
        What we did was mock it to make the http request blocking.

        Alternatively you can use ngrok (or similar) and a test task queue that is calling your service running on localhost tunneled via ngrok.
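
The first approach, mocking the queue so that the request happens inline, might look something like this sketch. It assumes the application code talks to the queue through a small client object; `InlineTaskClient`, `create_task`, and the route are hypothetical names, not the Cloud Tasks client API.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], dict]


class InlineTaskClient:
    """Test double: instead of enqueueing, call the task handler immediately."""

    def __init__(self, routes: Dict[str, Handler]):
        self._routes = routes
        self.calls: List[Tuple[str, dict]] = []

    def create_task(self, url: str, body: dict) -> dict:
        # Record the call for assertions, then run the handler synchronously.
        self.calls.append((url, body))
        return self._routes[url](body)


def send_welcome_email(body: dict) -> dict:
    """The handler that would normally sit behind the HTTP endpoint."""
    return {"sent_to": body["email"]}


def sign_up(tasks, email: str) -> str:
    # Application code depends only on `create_task`, so a production client
    # and the test double are interchangeable.
    tasks.create_task("/tasks/welcome-email", {"email": email})
    return "ok"


tasks = InlineTaskClient({"/tasks/welcome-email": send_welcome_email})
result = sign_up(tasks, "a@example.com")
```

Running handlers inline keeps tests deterministic, at the cost of not exercising retries or delivery delays.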

  • szvsw
    5 hours ago
    I’ve been using Hatchet since the summer, and really do love it over celery. I’ve been using Hatchet for academic research experiments with embarrassingly parallel tasks - ie thousands of simultaneous tasks just with different inputs, each CPU bound and on the order of 10s-2min, totaling in the millions of tasks per experiment - and it’s been going great. I think the team is putting together a very promising product. Switching from a roll-my-own SQS+AWS batch system to Hatchet has made my research life so much better. Though part of that also probably comes from the forced improvements you get when re-designing a system a second time.

    Although there was support for pydantic validation in v0, now that the v1 SDK has arrived, I would definitely say that the #1 distinguishing feature (at least from a dx perspective) for anyone thinking of switching from Celery or working on a greenfield project is the type safety that comes with the first class pydantic support in v1. That is a huge boon in my opinion.

    Another big boon for me was the combo of both Python and TypeScript SDKs - being able to integrate things into frontend demos without having to set up a separate Python API is great.

    There are a couple rough edges around asyncio/single worker concurrency IMO - for instance, choosing between 100 workers each with capacity for 8 concurrent task runs vs 800 workers each with capacity for 1 concurrent task run. In Celery it’s a little bit easier to launch a worker node which uses separate processes to handle its concurrent tasks, whereas right now with Hatchet, that’s not possible as far as I am aware, due to how asyncio is used to handle the concurrent task runs which a single worker may be processing. If most of your work is IO bound or already asyncio friendly, this does not really affect you and you can safely use eg a worker with 8x task run capacity, but if you are CPU bound there might be some cases where you would prefer the full process isolation and feel more assured that you are maximally utilizing all your compute in a given node, and right now the best way to do that is only through horizontal scaling or 1x task workers I think. Generally, if you do not have a great mental model already of how Python handles asyncio, threads, pools, etc, the right way to think about this stuff can be a little confusing IMO, but the docs on this from Hatchet have improved. In the future though, I’d love to see an option to launch a Python worker with capacity for multiple simultaneous task runs in separate processes, even if it’s just a thin wrapper around launching separate workers under the hood.

    There are also a couple of rough edges in the dashboard right now, but the team has been fixing them, and coming from celery/flower or SQS, it’s already such an improved dashboard/monitoring experience that I can’t complain!

    It’s hard to describe, but there is just something fun about working with Hatchet for me, compared to Celery or my previous SQS system. Almost all of the design decisions just align with what I would desire, and feel natural.

  • lysecret
    9 hours ago
    I would appreciate a comparison to cloud tasks in your docs.
  • themanmaran
    18 hours ago
    How does queue observability work in hatchet? I've used pg as a queueing system before, and that was one of my favorite aspects. Just run a few SQL queries to have a dashboard for latency/throughput/etc.

    But that requires you to keep the job history around, which at scale starts to impact performance.

    • abelanger
      18 hours ago
      Yeah, part of this rewrite was separating our monitoring tables from all of our queue tables to avoid problems like table bloat.

      At one point we considered partitioning on the status of a queue item (basically active | inactive) and aggressively running autovac on the active queue items. Then all indexes for monitoring can be on the inactive partitioned tables.

      But there were two reasons we ended up going with separate tables:

      1. We started to become concerned about partitioning _both_ by time range and by status, because time range partitioning is incredibly useful for discarding data after a certain amount of time

      2. If necessary, we wanted our monitoring tables to be able to run on a completely separate database from our queue tables. So we actually store them as completely independent schemas to allow this to be possible (https://github.com/hatchet-dev/hatchet/blob/main/sql/schema/... vs https://github.com/hatchet-dev/hatchet/blob/main/sql/schema/...)

      So to answer the question -- you can query both active queues and a full history of queued tasks up to your retention period, and we've optimized the separate tables for the two different query patterns.

  • hyuuu
    17 hours ago
    i have been looking for something like this, the closest I could find by googling was celery workflow, i think you should do better marketing, I didn't even realize that hatchet existed!
  • digdugdirk
    20 hours ago
    Interesting! How does it compare with DBOS? I noticed it's not in the readme comparisons, and they seem to be trying to solve a similar problem.
    • KraftyOne
      16 hours ago
      (DBOS co-founder here) From a DBOS perspective, the biggest differences are that DBOS runs in-process instead of on an external server, and DBOS lets you write workflows as code instead of explicit DAGs. I'm less familiar with Hatchet, but here's a blog post comparing DBOS with Temporal, which also uses external orchestration for durable execution: https://www.dbos.dev/blog/durable-execution-coding-compariso...
    • abelanger
      20 hours ago
      Yep, durable execution-wise we're targeting a very similar use-case with a very different philosophy on whether the orchestrator (the part of the durable execution engine which invokes tasks) should run in-process or as a separate service.

      There's a lot to go into here, but generally speaking, running an orchestrator as a separate service is easier from a Postgres scaling perspective: it's easier to buffer writes to the database, manage connection overhead, export aggregate metrics, and horizontally scale the different components of the orchestrator. Our original v0 engine was architected in a very similar way to an in-process task queue, where each worker polls a tasks table in Postgres. This broke down for us as we increased volume.

      Outside of durable execution, we're more of a general-purpose orchestration platform -- lots of our features target use-cases where you either want to run a single task or define your tasks as a DAG (directed acyclic graph) instead of using durable execution. Durable execution has a lot of footguns if used incorrectly, and DAGs are executed in a durable way by default, so for many use-cases it's a better option.

      • darkteflon
        18 hours ago
        Hatchet looks very cool! As an interested dilettante in this space, I’d love to read a comparison with Dagster.

        Re DBOS: I understood that part of the value proposition there is bundling transactions into logical units that can all be undone if a critical step in the workflow fails - the example given in their docs being a failed payment flow. Does Hatchet have a solution for those scenarios?

        • abelanger
          17 hours ago
          Re DBOS - yep, this is exactly what the child spawning feature is meant for: https://docs.hatchet.run/home/child-spawning

          The core idea being that you write the "parent" task as a durable task, and you invoke subtasks which represent logical units of work. If any given subtask fails, you can wrap it in a `try...catch` and gracefully recover.
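
The control flow of that pattern, stripped of any durability, can be sketched in plain Python. This does not use the Hatchet SDK: the functions below are hypothetical stand-ins for subtasks, and in a real durable task each call would be a spawned child whose result is persisted.

```python
class PaymentDeclined(Exception):
    """Raised by the hypothetical payment subtask."""


def reserve_inventory(order, log):
    log.append("reserved")


def charge_card(order, log):
    if order["amount"] > order["card_limit"]:
        raise PaymentDeclined("amount exceeds card limit")
    log.append("charged")


def release_inventory(order, log):
    # Compensating step that undoes reserve_inventory.
    log.append("released")


def checkout(order):
    """Parent task: run subtasks in order and recover if one fails."""
    log = []
    reserve_inventory(order, log)
    try:
        charge_card(order, log)
    except PaymentDeclined:
        # The payment subtask failed, so undo the earlier step and stop.
        release_inventory(order, log)
        return "cancelled", log
    return "completed", log


ok = checkout({"amount": 50, "card_limit": 100})
declined = checkout({"amount": 500, "card_limit": 100})
```

The parent decides what "undo" means for each failure, rather than relying on a database rollback spanning all steps.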

          I'm not as familiar with DBOS, but in Hatchet a durable parent task and child task map directly to Temporal workflows and activities. Admittedly this pattern should be documented in the "Durable execution" section of our docs as well.

          Re Dagster - Dagster is much more oriented towards data engineering, while Hatchet is oriented more towards application engineers. As a result tools like Dagster/Airflow/Prefect are more focused on data integrations, whereas we focus more on throughput/latency and primitives that work well with your application. Perhaps there's more overlap now that AI applications are more ubiquitous? (with more data pipelines making their way into the application layer)

          • darkteflon
            17 hours ago
            Perfect - great answer and very helpful, thanks.
  • avan1
    16 hours ago
    Don't want to steal your topic but I had written a lightweight task runner to learn GoLang [0]. Would be great to have your and others' comments. It works only as a Go library.

    [0] https://github.com/oneapplab/lq

    P.S: far from being an alternative to the Hatchet product

    • abelanger
      15 hours ago
      Nice! I haven't looked closely, but some initial questions/comments:

      1. Are you ordering the jobs by any parameter? I don't see an ORDER BY in this clause: https://github.com/oneapplab/lq/blob/8c9f8af577f9e0112767eef...

      2. I see you're using a UUID for the primary key on the jobs, I think you'd be better served by an auto-inc primary key (bigserial or identity columns in Postgres) which will be slightly more performant. This won't matter for small datasets.

      3. I see you have an index on `queue`, which is good, but no index on the rest of the parameters in the processor query, which might be problematic when you have many reserved jobs.

      4. Since this is an in-process queue, it would be awesome to allow the tx to be passed to the `Create` method here: https://github.com/oneapplab/lq/blob/8c9f8af577f9e0112767eef... -- so you can create the job in the same tx when you're performing a data write.
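
Points 1, 3, and 4 can be illustrated together in a small sketch. It uses SQLite from the Python standard library rather than Go and Postgres, and the `jobs` schema, index, and function names are hypothetical; they are not taken from the lq repository. SQLite has no `FOR UPDATE SKIP LOCKED`, so this shows only the ordering, indexing, and transaction-passing ideas, not safe concurrent dequeueing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        queue TEXT NOT NULL,
        reserved_at REAL,            -- NULL while the job is still available
        payload TEXT NOT NULL
    );
    -- Covers every column the dequeue query filters and sorts on.
    CREATE INDEX jobs_ready_idx ON jobs (queue, reserved_at, id);
    """
)


def create(tx, queue, payload):
    """Enqueue on the caller's connection so the caller controls the commit."""
    tx.execute("INSERT INTO jobs (queue, payload) VALUES (?, ?)", (queue, payload))


def reserve_next(conn, queue, now):
    """Reserve the oldest available job in `queue`, or return None."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs "
            "WHERE queue = ? AND reserved_at IS NULL "
            "ORDER BY id LIMIT 1",  # explicit ordering gives FIFO behaviour
            (queue,),
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET reserved_at = ? WHERE id = ?", (now, row[0]))
    return row


with conn:  # the enqueues commit together with any other writes in this block
    create(conn, "default", "a")
    create(conn, "default", "b")
    create(conn, "emails", "c")

first = reserve_next(conn, "default", 1.0)
second = reserve_next(conn, "default", 2.0)
third = reserve_next(conn, "default", 3.0)
```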

      • avan1
        5 hours ago
        Thanks a lot for the review, which was much more than I requested. I noted all four comments and will apply them to the package. Thanks again. Also, we currently have a Laravel backend with Laravel + Redis + Horizon [0] + Supervisor as a queue runner in production, and it's working fine for us, but it would be great to be able to access Hatchet from PHP as well, since we might switch in the future. Another thing: since you mentioned handling large workloads, do you recommend Hatchet as a Kafka or RabbitMQ alternative for microservice communication?

        [0] https://laravel.com/docs/12.x/horizon

      • someone13
        11 hours ago
        I just want to say how cool it is to see you doing a non-trivial review of someone else’s thing here
  • wilted-iris
    18 hours ago
    This looks very cool! I see a lot of Python in the docs; is it usable in other languages?
  • bomewish
    18 hours ago
    Why not fix all the broken doc links and make sure you have the full sdk spec down first, ready to go? Then drop it all at once, when it’s actually ready. That’s better and more respectful of users. I love the product and want y’all to succeed but this came off as extremely unprofessional.
    • abelanger
      17 hours ago
      Really appreciate the candid feedback, and glad to hear you like the product. We ran a broken links checker against our docs, but it's possible we missed something. Is there anywhere you're seeing a broken link?

      Re SDK specs -- I assume you mean full SDK API references? We're nearly at the point where those will be published, and I agree that they would be incredibly useful.

  • krainboltgreene
    11 hours ago
    A lot of these tools show off what a full success backlog looks like, in reality I care significantly more about what failure looks like, debugging, etc.
    • lysecret
      9 hours ago
      Ha this is a really good point! I worked with so many different kinds of observability approaches and always fell back to traced logs. This might be part of the reason.
  • revskill
    11 hours ago
    Confusing docs, as there is no self-hosted setup for Postgres.