Summary
I am the maintainer of graphile_worker_rs, a Rust rewrite of graphile/worker.
A user recently reported a bug, and I think this exposes a race condition in the keyed scheduling path (add_jobs / add_job).
Original report: leo91000/graphile_worker_rs#378
Steps to reproduce
- Start PostgreSQL and initialize Graphile Worker schema.
- Run a worker with concurrency 10 and two tasks:
  - printer: very short task (e.g. sleeps ~2ms) so keyed jobs frequently become locked/running.
  - scheduler: loops many times (e.g. 100), each time calling addJob("printer", { key }, { jobKey: key, jobKeyMode: "preserve_run_at" }) with key chosen from a small keyspace (e.g. 10 keys).
- Enqueue multiple scheduler jobs concurrently (e.g. 4).
- Let it run for ~30-60 seconds.
This creates high contention on the same jobKey while some conflicting rows are locked.
Expected results
- addJob(...) should always return a valid job row.
- For replace / preserve_run_at, scheduling should not occasionally return “no row” under contention.
Actual results
Under contention, graphile_worker.add_jobs(...) can return no row for a spec because of:
ON CONFLICT (key) DO UPDATE ... WHERE jobs.locked_at IS NULL
When that WHERE condition is false, the conflict path does nothing and returns nothing for that spec.
Then add_job(...) (which selects from add_jobs(...) LIMIT 1) can return a null/empty row.
In strict clients this surfaces clearly (example from Rust/sqlx):
error occurred while decoding column "id": unexpected null; try decoding as an Option
In JS, this can manifest as rows[0] being missing from the add_job(...) result in edge cases.
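To make the failure mode concrete, here is a minimal in-memory sketch of the conflict path described above. It is a toy model, not the real graphile_worker SQL: the JobsTable class, its upsert and lock methods, and the addJob helper are all hypothetical names invented for illustration. It only mirrors the three outcomes of a keyed upsert guarded by a locked_at IS NULL condition.

```typescript
// Toy model only: hypothetical names, not the graphile_worker implementation.
interface JobRow {
  id: number;
  key: string;
  lockedAt: Date | null; // non-null while a worker holds the job
  payload: unknown;
}

class JobsTable {
  private rows = new Map<string, JobRow>();
  private nextId = 1;

  // Models: INSERT ... ON CONFLICT (key) DO UPDATE ... WHERE locked_at IS NULL RETURNING *
  upsert(key: string, payload: unknown): JobRow[] {
    const existing = this.rows.get(key);
    if (existing === undefined) {
      // No conflict: plain insert, the new row is returned.
      const row: JobRow = { id: this.nextId++, key, lockedAt: null, payload };
      this.rows.set(key, row);
      return [row];
    }
    if (existing.lockedAt === null) {
      // Conflict on an unlocked row: the update applies and the row is returned.
      existing.payload = payload;
      return [existing];
    }
    // Conflict on a locked row: the WHERE guard is false, so nothing is
    // updated and nothing is returned for this spec.
    return [];
  }

  // Simulates a worker picking up the job.
  lock(key: string): void {
    const row = this.rows.get(key);
    if (row !== undefined) row.lockedAt = new Date();
  }
}

// Models add_job taking the first row of the add_jobs result, if any.
function addJob(table: JobsTable, key: string, payload: unknown): JobRow | undefined {
  return table.upsert(key, payload)[0];
}

const table = new JobsTable();
const first = addJob(table, "k1", { n: 1 }); // insert path
const second = addJob(table, "k1", { n: 2 }); // conflict, row unlocked
table.lock("k1");
const third = addJob(table, "k1", { n: 3 }); // conflict, row locked
console.log(first?.id, second?.id, third);
```

In this model, the third call is the case that strict clients reject: the caller asked for a job and got no row back, even though a job with that key exists.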
Additional context
- Reproduced from issue https://github.com/leo91000/graphile_worker_rs/issues/378
- Reproduced against upstream SQL shape in sql/000018.sql:
  - add_job delegates to add_jobs (select * into v_job from ...add_jobs(...))
  - add_jobs uses ON CONFLICT (key) DO UPDATE ... WHERE jobs.locked_at is null
- Current repo version checked: 0.17.0-rc.0 (from package.json)
- PostgreSQL: reproduced on Docker Postgres (15/16 class behavior)