Race in add_job/add_jobs with keyed jobs can drop scheduling requests (and returns no row under contention) #580

  ### Summary

  I am the maintainer of [graphile_worker_rs](https://github.com/leo91000/graphile_worker_rs), a Rust rewrite of `graphile/worker`.

  A user recently reported a bug that I think exposes a race condition in the keyed scheduling path (`add_jobs` / `add_job`).

  Original report: https://github.com/leo91000/graphile_worker_rs/issues/378

  ### Steps to reproduce

  1. Start PostgreSQL and initialize the Graphile Worker schema.
  2. Run a worker with concurrency `10` and two tasks:
     - `printer`: a very short task (e.g. sleeps ~2ms), so keyed jobs are frequently locked/running.
     - `scheduler`: loops many times (e.g. `100`), each time calling `addJob("printer", { key }, { jobKey: key, jobKeyMode: "preserve_run_at" })` with `key` chosen from a small keyspace (e.g. 10 keys).
  3. Enqueue multiple `scheduler` jobs concurrently (e.g. 4).
  4. Let it run for ~30-60 seconds.

  This creates high contention on the same `jobKey` while some conflicting rows are locked.
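The two tasks from the steps above could look roughly like the sketch below. It is not the exact reproduction code: the `AddJob` type is a hypothetical stand-in for the `addJob` helper a task receives, the key format `key-N` is made up, and wiring the tasks into a worker with concurrency `10` is left out. `scheduler` counts how often `addJob` resolves to nothing, which is the symptom described in this issue.

```typescript
// Hypothetical minimal shape of the addJob helper used by this sketch.
interface AddJobSpec {
  jobKey?: string;
  jobKeyMode?: "replace" | "preserve_run_at" | "unsafe_dedupe";
}
type AddJob = (
  identifier: string,
  payload: unknown,
  spec?: AddJobSpec
) => Promise<unknown>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// `printer`: very short task, so keyed jobs are frequently locked/running.
async function printer(_payload: unknown): Promise<void> {
  await sleep(2);
}

// `scheduler`: repeatedly schedules `printer` with a key from a small keyspace.
// Returns the number of addJob calls that resolved to null/undefined.
async function scheduler(
  addJob: AddJob,
  iterations: number = 100,
  keyspace: number = 10
): Promise<number> {
  let missing = 0;
  for (let i = 0; i < iterations; i++) {
    const key = `key-${Math.floor(Math.random() * keyspace)}`;
    const job = await addJob(
      "printer",
      { key },
      { jobKey: key, jobKeyMode: "preserve_run_at" }
    );
    if (job === null || job === undefined) {
      missing++;
    }
  }
  return missing;
}
```

Running several `scheduler` jobs at once (step 3) is what makes different callers hit the same `jobKey` while a `printer` job for that key is locked.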

  ### Expected results

  - `addJob(...)` should always return a valid job row.
  - For `replace` / `preserve_run_at`, scheduling should not occasionally return “no row” under contention.

  ### Actual results

  Under contention, `graphile_worker.add_jobs(...)` can return no row for a spec because of this clause:

  - `ON CONFLICT (key) DO UPDATE ... WHERE jobs.locked_at IS NULL`

  When the conflicting row is locked, that `WHERE` condition is false, so the conflict path neither inserts nor updates and returns nothing for that spec.
  `add_job(...)` (which selects from `add_jobs(...) LIMIT 1`) can then return a null/empty row.
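To make the failure mode concrete, here is a small in-memory model of the conflict clause. It is only a sketch: `JobsTable`, `lock`, and the field names are invented for illustration, it models nothing but the insert/update/skip decision, and it does not reproduce the rest of the real SQL functions.

```typescript
interface Job {
  id: number;
  key: string;
  payload: unknown;
  lockedAt: Date | null;
}

// Hypothetical in-memory stand-in for the jobs table, keyed by job key.
class JobsTable {
  private nextId = 1;
  private byKey = new Map<string, Job>();

  // Models: INSERT ... ON CONFLICT (key) DO UPDATE ... WHERE jobs.locked_at IS NULL
  // Only rows that were actually inserted or updated are returned.
  addJobs(specs: { key: string; payload: unknown }[]): Job[] {
    const returned: Job[] = [];
    for (const spec of specs) {
      const existing = this.byKey.get(spec.key);
      if (existing === undefined) {
        // No conflict: plain insert.
        const job: Job = {
          id: this.nextId++,
          key: spec.key,
          payload: spec.payload,
          lockedAt: null,
        };
        this.byKey.set(spec.key, job);
        returned.push(job);
      } else if (existing.lockedAt === null) {
        // Conflict with an unlocked row: update it.
        existing.payload = spec.payload;
        returned.push(existing);
      }
      // Conflict with a locked row: WHERE is false, so nothing is returned.
    }
    return returned;
  }

  // Models add_job taking the first row of add_jobs (LIMIT 1).
  addJob(key: string, payload: unknown): Job | undefined {
    return this.addJobs([{ key, payload }])[0];
  }

  // Models a worker picking the job up.
  lock(key: string): void {
    const job = this.byKey.get(key);
    if (job !== undefined) {
      job.lockedAt = new Date();
    }
  }
}

const table = new JobsTable();
const first = table.addJob("k1", { n: 1 }); // inserted, row returned
table.lock("k1"); // a worker starts running the job
const second = table.addJob("k1", { n: 2 }); // undefined: request dropped
```

In this model the second call is silently dropped: the caller gets no row, and the new payload is never stored.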

  In strict clients this surfaces clearly (example from Rust/sqlx):
  - `error occurred while decoding column "id": unexpected null; try decoding as an Option`

  In JS, this can manifest as `rows[0]` being missing from the `add_job(...)` result in edge cases.
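Until the SQL is fixed, a client can at least make the failure explicit instead of passing `undefined` along. The helper below is a hypothetical guard, not part of any library: `requireJobRow` and its error message are made up for this sketch, and it assumes the query result is available as an array of row objects with an `id` column.

```typescript
// Hypothetical guard: fail loudly when add_job produced no usable row.
function requireJobRow<T extends { id?: unknown }>(
  rows: readonly (T | undefined)[],
  jobKey?: string
): T {
  const row = rows[0];
  if (row === undefined || row.id === undefined || row.id === null) {
    const suffix = jobKey === undefined ? "" : ` for jobKey "${jobKey}"`;
    throw new Error(
      `add_job returned no row${suffix}; the existing job may have been locked`
    );
  }
  return row;
}
```

This does not recover the dropped scheduling request. It only turns a silent `undefined` into an error the caller can retry or log.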

  ### Additional context

  - Reproduced from issue `https://github.com/leo91000/graphile_worker_rs/issues/378`
  - Reproduced against upstream SQL shape in `sql/000018.sql`:
    - `add_job` delegates to `add_jobs` (`select * into v_job from ...add_jobs(...)`)
    - `add_jobs` uses `ON CONFLICT (key) DO UPDATE ... WHERE jobs.locked_at is null`
  - Current repo version checked: `0.17.0-rc.0` (from `package.json`)
  - PostgreSQL: reproduced on Docker Postgres (15/16 class behavior)

