🐛 fix: allow reconciliation of deadline-exceeded ClusterObjectSets by joelanford · Pull Request #2643 · operator-framework/operator-controller

joelanford · 2026-04-11T17:08:56Z

Description

The skipProgressDeadlineExceededPredicate blocked all update events for COS objects
with ProgressDeadlineExceeded, which prevented archival of stuck revisions — the
lifecycle state patch was silently dropped.

This PR:

- Removes the predicate so all COS events are fully reconciled.
- Updates markAsProgressing to set ProgressDeadlineExceeded instead of
  RollingOut/Retrying when the deadline has been exceeded, preventing the reconcile
  loop the predicate was masking. Succeeded always applies; unregistered reasons panic.
- Continues reconciling after ProgressDeadlineExceeded rather than clearing the
  error and stopping requeue. This allows revisions to recover if a transient error
  resolves itself, even after the deadline was exceeded.
- Extracts durationUntilDeadline as a shared helper for deadline computation.
- Adds a deadlineAwareRateLimiter that caps exponential backoff at the deadline so
  ProgressDeadlineExceeded is set promptly even during error retries.
- Moves deadline requeue logic into requeueForDeadline, called from within
  reconcile when probes are still failing.
- Fixes e2e scenario cleanup to wait for resource deletions to complete.
- Adds an e2e test that creates a COS with a never-ready deployment, waits for
  ProgressDeadlineExceeded, archives the COS, and verifies resource cleanup.
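
The markAsProgressing behavior described above can be illustrated with a minimal sketch. The helper name `effectiveReason` and the string constants are hypothetical stand-ins, not the controller's actual identifiers, and the sketch assumes that only the three reasons named in the description are registered.

```go
package main

import "fmt"

// Reason strings mirror the names used in the PR description; the real
// constants live in the operator-controller API package and may differ.
const (
	reasonRollingOut       = "RollingOut"
	reasonRetrying         = "Retrying"
	reasonSucceeded        = "Succeeded"
	reasonDeadlineExceeded = "ProgressDeadlineExceeded"
)

// effectiveReason (hypothetical helper) picks the reason to record on the
// Progressing condition. Once the deadline has passed, RollingOut and
// Retrying are replaced by ProgressDeadlineExceeded so the status does not
// flip-flop between reasons; Succeeded always applies; anything else panics.
func effectiveReason(requested string, deadlineExceeded bool) string {
	switch requested {
	case reasonSucceeded:
		return reasonSucceeded
	case reasonRollingOut, reasonRetrying:
		if deadlineExceeded {
			return reasonDeadlineExceeded
		}
		return requested
	default:
		// Assumption: any reason outside the set above is a programming error.
		panic(fmt.Sprintf("unregistered progressing reason %q", requested))
	}
}

func main() {
	fmt.Println(effectiveReason(reasonRetrying, false))
	fmt.Println(effectiveReason(reasonRetrying, true))
	fmt.Println(effectiveReason(reasonSucceeded, true))
}
```

Keeping the reason stable once the deadline has passed is what makes it safe to drop the predicate: a reconcile that would otherwise write Retrying no longer changes the status, so it does not trigger another update event.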

Addresses feedback from:
#2610 (comment)

Reviewer Checklist

Tests: Unit Tests (and E2E Tests, if appropriate)
Comprehensive Commit Messages
Links to related GitHub Issue(s)

netlify · 2026-04-11T17:09:02Z

✅ Deploy Preview for olmv1 ready!

Name	Link
🔨 Latest commit	`ad0859e`
🔍 Latest deploy log	https://app.netlify.com/projects/olmv1/deploys/69dbcadd73719b0008d52822
😎 Deploy Preview	https://deploy-preview-2643--olmv1.netlify.app

openshift-ci · 2026-04-11T17:09:02Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR fixes a reconciliation dead-end where ClusterObjectSet (COS) updates were being dropped once a revision hit ProgressDeadlineExceeded, preventing stuck revisions from being archived and cleaned up. It removes the update-blocking predicate, makes progress-deadline handling “sticky” in status updates, and adds a deadline-aware rate limiter plus an E2E scenario to validate archival cleanup.

Changes:

Remove the ProgressDeadlineExceeded-skipping watch predicate and introduce a controller RateLimiter that caps exponential backoff at the progress deadline.
Refactor progress-deadline computation into a shared durationUntilDeadline helper and adjust progressing/retrying status updates to prefer ProgressDeadlineExceeded once exceeded.
Add an E2E scenario (and step helper) that archives a deadline-exceeded COS and verifies resource cleanup.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`internal/operator-controller/controllers/clusterobjectset_controller.go`	Removes the skip predicate, adds deadline-aware rate limiting, refactors deadline computation, and changes progressing/retrying condition behavior when the deadline is exceeded.
`test/e2e/steps/steps.go`	Adds a new Godog step to patch a COS lifecycle state to `Archived`.
`test/e2e/features/revision.feature`	Adds an E2E scenario that forces `ProgressDeadlineExceeded`, archives the COS, and asserts resources are removed.

internal/operator-controller/controllers/clusterobjectset_controller.go

camilamacedo86 · 2026-04-11T17:23:29Z

test/e2e/features/revision.feature

+    When ClusterObjectSet "${COS_NAME}" lifecycle is set to "Archived"
+    Then ClusterObjectSet "${COS_NAME}" is archived
+    And resource "configmap/test-configmap" is eventually not found
+    And resource "deployment/test-deployment" is eventually not found


This does not seem to test the same scenario, @joelanford.
Could we make sure the test covers the scenario below?

1. User installs a ClusterExtension. The CE controller creates COS-rev-1.
2. COS-rev-1 gets stuck (e.g. a Deployment never becomes ready). After ProgressDeadlineMinutes, the reconciler sets Progressing=False/ProgressDeadlineExceeded.
3. User updates the ClusterExtension. The CE controller creates COS-rev-2.
4. COS-rev-2 rolls out successfully. It patches COS-rev-1 with lifecycleState: Archived so the old revision gets cleaned up.
5. The watch predicate sees COS-rev-1 has ProgressDeadlineExceeded and drops the event.
6. COS-rev-1 never reconciles, never processes the archival, and stays stuck forever.

Is it not the same? Having a CE stamp out COS-1 and COS-2, so that the CE reconciler eventually tries to set COS-1 as archived, exercises the same thing. It just adds more ceremony around what the test already does: a standalone COS exceeds the deadline and is then archived directly by the test code.

I'll verify manually to make sure. Is there something else happening in the CE/COS interaction that would make the COS-only test in the PR different?
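
The failure mode in the scenario above can be modeled in a few lines. This is a sketch with hypothetical types and function names; it does not use the real controller-runtime predicate API and only shows why an archival patch on a deadline-exceeded COS never reached the reconciler.

```go
package main

import "fmt"

// cos is a stripped-down stand-in for a ClusterObjectSet; the field names
// are illustrative, not the real API types.
type cos struct {
	lifecycleState    string // value patched by the newer revision, e.g. "Archived"
	progressingReason string // reason on the Progressing condition
}

// skipDeadlineExceeded models the removed predicate: it returns false
// (drop the event) whenever the updated object already reports
// ProgressDeadlineExceeded, regardless of what else changed.
func skipDeadlineExceeded(updated cos) bool {
	return updated.progressingReason != "ProgressDeadlineExceeded"
}

// reconcileAll models the behavior after the fix: every update is reconciled.
func reconcileAll(cos) bool { return true }

func main() {
	// COS-rev-1 is stuck past its deadline and has just been patched to Archived.
	rev1 := cos{lifecycleState: "Archived", progressingReason: "ProgressDeadlineExceeded"}
	fmt.Println("old predicate reconciles archival:", skipDeadlineExceeded(rev1))
	fmt.Println("after fix reconciles archival:", reconcileAll(rev1))
}
```

Whether the patch comes from the CE controller or directly from test code, the update event carries the same object state, which is the basis of the argument that the standalone COS test covers the same path.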

Copilot

Pull request overview

This PR adjusts ClusterObjectSet (COS) reconciliation behavior so revisions that hit ProgressDeadlineExceeded still reconcile (enabling archival/cleanup), while making the deadline-exceeded state “sticky” to avoid the previous reconcile loop. It also adds a deadline-capped rate limiter and an e2e scenario to verify archival cleans up resources after a progress deadline is exceeded.

Changes:

Remove the watch predicate that dropped COS update events when ProgressDeadlineExceeded was set, allowing spec patches like archival to be reconciled.
Refactor progress-deadline handling (shared durationUntilDeadline, sticky ProgressDeadlineExceeded, requeue-at-deadline behavior) and add a deadline-aware rate limiter to cap backoff until the deadline.
Add an e2e scenario and a new step to patch COS lifecycle to Archived, plus improve e2e cleanup synchronization.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
test/e2e/steps/steps.go	Adds a Godog step to patch a ClusterObjectSet’s `spec.lifecycleState`.
test/e2e/steps/hooks.go	Makes scenario cleanup wait for concurrent deletions to finish.
test/e2e/features/revision.feature	Adds an e2e scenario covering archival + resource cleanup after `ProgressDeadlineExceeded`.
internal/operator-controller/controllers/clusterobjectset_controller.go	Removes the skip predicate; makes deadline handling sticky; adds deadline-aware requeue + custom rate limiter.
internal/operator-controller/controllers/clusterobjectset_controller_test.go	Updates the progress-deadline requeue expectation to match the new scheduling behavior.

test/e2e/steps/hooks.go

 			if _, err := k8sClient(args...); err != nil {
 				logger.Info("Error deleting resource", "name", res.name, "namespace", res.namespace, "stderr", stderrOutput(err))
 			}
+			wg.Done()
 		}(r)


joelanford · 2026-04-12T12:19:30Z

internal/operator-controller/controllers/clusterobjectset_controller.go

-			return true
-		},
-	}
 	c.Clock = clock.RealClock{}


Not something I changed in this PR, so it is out of scope. Regardless, callers don't call SetupWithManager in tests, so this isn't a problem in practice.

joelanford · 2026-04-12T12:20:03Z

internal/operator-controller/controllers/clusterobjectset_controller.go

+	cos := &ocv1.ClusterObjectSet{}
+	if err := r.client.Get(context.Background(), item.NamespacedName, cos); err != nil {


The rate limiter reads from the informer cache, which is in-memory and does not block.

Copilot

Pull request overview

This PR fixes a reconciliation dead-end for ClusterObjectSet (COS) revisions that hit ProgressDeadlineExceeded, ensuring they can still reconcile (e.g., to process archival) and improving deadline-related requeue behavior so deadline status is set promptly.

Changes:

Removes the update predicate that dropped COS updates when ProgressDeadlineExceeded was set.
Refactors progress-deadline handling into shared helpers (durationUntilDeadline, requeueForDeadline) and adds a deadline-aware rate limiter to cap exponential backoff at the deadline.
Extends e2e coverage with a new scenario that archives a deadline-exceeded COS and verifies resource cleanup; improves scenario cleanup to wait for deletions to complete.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
test/e2e/steps/steps.go	Adds an e2e step to patch a COS lifecycle state (used for archival in scenarios).
test/e2e/steps/hooks.go	Ensures scenario cleanup waits for concurrent deletion commands to finish.
test/e2e/features/revision.feature	Adds an e2e scenario that drives a COS into `ProgressDeadlineExceeded`, archives it, and asserts resources are removed.
internal/operator-controller/controllers/clusterobjectset_controller.go	Removes the deadline-exceeded predicate; adds deadline-aware requeue/rate-limiting and updates Progressing condition handling.
internal/operator-controller/controllers/clusterobjectset_controller_test.go	Updates expected requeue timing behavior for progress deadline handling.

internal/operator-controller/controllers/clusterobjectset_controller.go

+	remaining, hasDeadline := c.durationUntilDeadline(cos)
+	isDeadlineExceeded := hasDeadline && remaining <= 0
+


internal/operator-controller/controllers/clusterobjectset_controller.go

+// deadlineAwareRateLimiter wraps a delegate rate limiter and caps the backoff
+// duration to the time remaining until the COS progress deadline (+2s), ensuring
+// that ProgressDeadlineExceeded is set promptly even during exponential backoff.
+type deadlineAwareRateLimiter struct {
+	delegate workqueue.TypedRateLimiter[ctrl.Request]


Copilot

Pull request overview

Fixes a reconciliation dead-end where ClusterObjectSet (COS) revisions marked ProgressDeadlineExceeded would no longer reconcile, preventing actions like archival/cleanup from being processed.

Changes:

Removes the update predicate that filtered out COS updates after ProgressDeadlineExceeded, and adjusts progressing/deadline logic to avoid the prior status flip-flop loop.
Introduces shared deadline calculation (durationUntilDeadline), deadline-driven requeueing (requeueForDeadline), and a deadline-capped rate limiter to ensure the deadline condition is set promptly even under backoff.
Extends E2E coverage with a scenario that forces ProgressDeadlineExceeded, archives the COS, and asserts underlying resources are removed; adds a step to patch COS lifecycle state.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`internal/operator-controller/controllers/clusterobjectset_controller.go`	Refactors progress-deadline handling, removes the predicate, adds deadline-aware requeue and rate limiting.
`internal/operator-controller/controllers/clusterobjectset_controller_test.go`	Updates expected requeue timing to match new deadline requeue logic.
`test/e2e/steps/steps.go`	Adds a new step to patch `ClusterObjectSet.spec.lifecycleState`.
`test/e2e/steps/hooks.go`	Adjusts scenario cleanup deletion behavior.
`test/e2e/features/revision.feature`	Adds an E2E scenario covering archival/cleanup of a deadline-exceeded COS.

Comments suppressed due to low confidence (1)

test/e2e/steps/hooks.go:205

ScenarioCleanup adds --wait=false to kubectl delete, but the PR description says cleanup now waits for resource deletions to complete. As written, deletions are still fire-and-forget goroutines with explicit non-waiting, so scenarios can finish while resources remain terminating (and any follow-on checks can observe leftovers). If the intent is to wait, run deletes synchronously (or collect goroutines with a WaitGroup) and/or add an explicit kubectl wait --for=delete phase.

	for _, r := range forDeletion {
		go func(res resource) {
			args := []string{"delete", res.kind, res.name, "--ignore-not-found=true", "--wait=false"}
			if res.namespace != "" {
				args = append(args, "-n", res.namespace)
			}
			if _, err := k8sClient(args...); err != nil {
				logger.Info("Error deleting resource", "name", res.name, "namespace", res.namespace, "stderr", stderrOutput(err))

internal/operator-controller/controllers/clusterobjectset_controller.go

+	pd := cos.Spec.ProgressDeadlineMinutes
+	if pd <= 0 {
+		return -1, false
+	}
+	// Succeeded is a latch — once set, it's never cleared. A revision that
+	// has already succeeded should not be blocked by the deadline, even if
+	// it temporarily goes back to InTransition (e.g., recovery after drift).
+	if meta.IsStatusConditionTrue(cos.Status.Conditions, ocv1.ClusterObjectSetTypeSucceeded) {
+		return -1, false


internal/operator-controller/controllers/clusterobjectset_controller.go

+// requeueForDeadline returns a Result that requeues at the progress deadline
+// if one is configured and has not yet been exceeded. This ensures that
+// ProgressDeadlineExceeded is set promptly even when no object events occur.
+func (c *ClusterObjectSetReconciler) requeueForDeadline(cos *ocv1.ClusterObjectSet) ctrl.Result {
+	if remaining, hasDeadline := c.durationUntilDeadline(cos); hasDeadline && remaining > 0 {
+		return ctrl.Result{RequeueAfter: remaining}


codecov · 2026-04-12T15:05:55Z

Codecov Report

❌ Patch coverage is 89.39394% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.93%. Comparing base (dd57c28) to head (ef08a54).

Files with missing lines	Patch %	Lines
...troller/controllers/clusterobjectset_controller.go	89.39%	6 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2643      +/-   ##
==========================================
+ Coverage   68.92%   68.93%   +0.01%     
==========================================
  Files         140      140              
  Lines        9905     9929      +24     
==========================================
+ Hits         6827     6845      +18     
- Misses       2566     2571       +5     
- Partials      512      513       +1

Flag	Coverage Δ
e2e	`37.71% <0.00%> (-0.11%)`	⬇️
experimental-e2e	`52.43% <78.78%> (+0.03%)`	⬆️
unit	`53.55% <57.57%> (-0.05%)`	⬇️


Remove the skipProgressDeadlineExceededPredicate that blocked all update events for COS objects with ProgressDeadlineExceeded. This predicate prevented archival of stuck revisions because the lifecycle state patch was dropped as an update event.

To prevent the reconcile loop that the predicate was masking, markAsProgressing now sets ProgressDeadlineExceeded instead of RollingOut/Retrying when the deadline has been exceeded. Terminal reasons (Succeeded) always apply. Unregistered reasons panic.

Continue reconciling after ProgressDeadlineExceeded rather than clearing the error and stopping requeue. This allows revisions to recover if a transient error resolves itself, even after the deadline was exceeded.

Extract durationUntilDeadline as a shared helper for deadline computation. Add a deadlineAwareRateLimiter that caps exponential backoff at the deadline so ProgressDeadlineExceeded is set promptly even during error retries.

Move the deadline requeue logic into requeueForDeadline, called from within reconcile when probes are still failing.

Add an e2e test that creates a COS with a never-ready deployment, waits for ProgressDeadlineExceeded, archives the COS, and verifies cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 11, 2026 17:08

openshift-ci bot requested review from camilamacedo86 and tmshort April 11, 2026 17:09

Copilot started reviewing on behalf of joelanford April 11, 2026 17:09

joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from 9dece97 to a586063 April 11, 2026 17:12

joelanford mentioned this pull request Apr 11, 2026

🐛 fix: (boxcutter) Enable archival and spec changes for ProgressDeadlineExceeded revisions #2610

Open

Copilot AI reviewed Apr 11, 2026

camilamacedo86 reviewed Apr 11, 2026

joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from a586063 to 46dcf54 April 11, 2026 17:39

Copilot AI review requested due to automatic review settings April 12, 2026 11:59

joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from 46dcf54 to aea3d72 April 12, 2026 11:59

Copilot started reviewing on behalf of joelanford April 12, 2026 12:00

Copilot AI reviewed Apr 12, 2026

joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from aea3d72 to e1b56c2 April 12, 2026 12:13

Copilot AI review requested due to automatic review settings April 12, 2026 12:20

joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from e1b56c2 to 00c3c75 April 12, 2026 12:20

Copilot started reviewing on behalf of joelanford April 12, 2026 12:21

Copilot AI reviewed Apr 12, 2026

joelanford force-pushed the fix/cos-deadline-exceeded-archival branch 2 times, most recently from 9768528 to ef08a54 April 12, 2026 14:43

Copilot AI review requested due to automatic review settings April 12, 2026 14:43

Copilot started reviewing on behalf of joelanford April 12, 2026 14:44

Copilot AI reviewed Apr 12, 2026

joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from ef08a54 to 393ff45 April 12, 2026 16:39

joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from 393ff45 to ad0859e April 12, 2026 16:39
