OCPBUGS-31521: Don't publish duplicate DNS records#1229

Open
rfredette wants to merge 1 commit into openshift:master from rfredette:no-conflicting-dns

@rfredette
Contributor

Before attempting to publish a domain to a zone, check if that domain is already being published to the same zone.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2025
@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

	for _, existingRecord := range records.Items {
		// we only care if the domain name is published by a different record, so ignore the matching record if it
		// already exists.
		// TODO: There's got to be a better way to match the same object
Contributor

Compare by UID instead of name.

Contributor Author

This is still a relevant comment, but I moved the function, which made GitHub think it's outdated. I'll make this change in my next update.
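
As a rough illustration of the suggestion to compare by UID, the sketch below uses simplified local stand-ins. `UID`, `DNSRecord`, `sameObject`, and `conflictingRecords` are hypothetical names for this example, not the operator's types or helpers. It assumes that two objects in different namespaces can share a name while each object has its own UID.

```go
package main

// UID and DNSRecord are simplified, hypothetical stand-ins for the
// Kubernetes UID type and the operator's DNSRecord type.
type UID string

type DNSRecord struct {
	Name      string
	Namespace string
	UID       UID
	DNSName   string // stands in for Spec.DNSName
}

// sameObject reports whether a and b refer to the same API object.
// Comparing UIDs avoids treating same-named records in different
// namespaces as one object.
func sameObject(a, b *DNSRecord) bool {
	return a.UID == b.UID
}

// conflictingRecords returns the records that claim the same DNS name as
// record but are different objects.
func conflictingRecords(record *DNSRecord, existing []DNSRecord) []DNSRecord {
	var conflicts []DNSRecord
	for i := range existing {
		if sameObject(record, &existing[i]) {
			continue
		}
		if existing[i].DNSName == record.DNSName {
			conflicts = append(conflicts, existing[i])
		}
	}
	return conflicts
}
```

With a name-only comparison, a record named `default-wildcard` in another namespace would be skipped as if it were the record being reconciled; with the UID comparison it is reported as a conflict.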

}
} else if isRecordPublished {
condition, err = r.replacePublishedRecord(zones[i], record)
} else if isDomainPublished, err = domainIsAlreadyPublishedInZone(context.Background(), r.cache, record, &zones[i]); err != nil {
Contributor

@Miciah Miciah May 7, 2025

You don't need to declare isDomainPublished outside the else if clause. Better to keep isDomainPublished and err scoped to the else if clauses.

Suggested change
-} else if isDomainPublished, err = domainIsAlreadyPublishedInZone(context.Background(), r.cache, record, &zones[i]); err != nil {
+} else if isDomainPublished, err := domainIsAlreadyPublishedInZone(context.Background(), r.cache, record, &zones[i]); err != nil {

Edit: Discussed on a call. Line 388 uses the err value. This logic is a bit subtle and could use some refactoring.
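
The subtlety mentioned in the edit above can be sketched as follows. This is a minimal example with a hypothetical `check` function standing in for the real lookup; it is not the controller's code. It shows that `:=` in an `else if` initializer declares new variables scoped to the chain, so an `err` read after the chain would not see the failure, whereas `=` writes to the outer variable.

```go
package main

import "errors"

// check is a hypothetical stand-in for a call such as
// domainIsAlreadyPublishedInZone.
func check(fail bool) (bool, error) {
	if fail {
		return false, errors.New("lookup failed")
	}
	return true, nil
}

// withAssign uses "=" in the else-if initializer, so the outer err holds
// the lookup error after the chain.
func withAssign(skip, fail bool) error {
	var err error
	var published bool
	if skip {
		// Some earlier branch handled the record.
	} else if published, err = check(fail); err != nil {
		// Error handling would go here.
	} else if published {
		// Conflict handling would go here.
	}
	return err
}

// withShortDecl uses ":=", which declares a new published and a new err
// that are visible only inside the if/else-if chain. The outer err is
// never written, so the caller sees nil even when the lookup failed.
func withShortDecl(skip, fail bool) error {
	var err error
	if skip {
		// Some earlier branch handled the record.
	} else if published, err := check(fail); err != nil {
		// Error handling would go here.
	} else if published {
		// Conflict handling would go here.
	}
	return err
}
```

If code after the chain depends on `err`, as the edit notes, the short declaration silently drops the error; a refactor that handles the error inside the branch would make either form safe.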

Comment thread pkg/operator/controller/dns/controller.go Outdated
@rfredette rfredette force-pushed the no-conflicting-dns branch from 9fece4b to d38f7bf Compare May 15, 2025 17:33
func (r *reconciler) MapOnRecordDelete(ctx context.Context, o client.Object) []reconcile.Request {
	deletedRecord, ok := o.(*iov1.DNSRecord)
	if !ok {
		log.Info("failed to read DNS record")
Contributor

Suggested change
-log.Info("failed to read DNS record")
+log.Infof("Got unexpected object; expected type DNSRecord, got type %T", o)

Contributor

Or something more like this:

Suggested change
-log.Info("failed to read DNS record")
+log.Error(nil, "Got unexpected type of object", "expected", "DNSRecord", "actual", fmt.Sprintf("%T", o))
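
To show what the `%T` verb contributes in the suggestions above, here is a small standalone sketch. `DNSRecord`, `Gateway`, and `asDNSRecord` are hypothetical stand-ins defined only for this example; the message is returned as a string rather than logged so the sketch has no logging dependency.

```go
package main

import "fmt"

// DNSRecord and Gateway are hypothetical stand-in types for this example.
type DNSRecord struct{ Name string }
type Gateway struct{ Name string }

// asDNSRecord attempts the type assertion used in the map function. On
// failure it returns a message that names the dynamic type it actually
// received, which is what %T reports.
func asDNSRecord(o interface{}) (*DNSRecord, string) {
	rec, ok := o.(*DNSRecord)
	if !ok {
		return nil, fmt.Sprintf("got unexpected object; expected type *DNSRecord, got type %T", o)
	}
	return rec, ""
}
```

Passing a `*Gateway` yields a message that includes the gateway's pointer type, which is more actionable than a generic "failed to read DNS record".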

Comment on lines +86 to +88
	// When a DNS record is deleted, there may be a conflicting record that should be published. Watch exclusively for
	// deletes, and queue a reconcile request for the appropriate conflicting record, if applicable.
	if err := c.Watch(source.Kind[client.Object](operatorCache, &iov1.DNSRecord{}, handler.EnqueueRequestsFromMapFunc(reconciler.MapOnRecordDelete), predicate.Funcs{
Contributor

Make the comment a little more explicit that, yes, we have two watches on the same resource (dnsrecords), and the reason is so that we can have a predicate and mapfunc to do something special for deletes.

Comment on lines +86 to +90
	// When a DNS record is deleted, there may be a conflicting record that should be published. Watch exclusively for
	// deletes, and queue a reconcile request for the appropriate conflicting record, if applicable.
	if err := c.Watch(source.Kind[client.Object](operatorCache, &iov1.DNSRecord{}, handler.EnqueueRequestsFromMapFunc(reconciler.MapOnRecordDelete), predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool { return false },
		DeleteFunc: func(e event.DeleteEvent) bool { return true },
Contributor

Add a comment with your findings that the delete event happens when the object is actually deleted, not when it is merely marked for deletion (that is, when deletionTimestamp is set).
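
The finding described above can be modeled with a toy event stream. The types below (`Object`, `Event`, `EventKind`) are simplified local stand-ins and deliberately not the controller-runtime types; the sketch only assumes, per the finding, that setting deletionTimestamp arrives as an update and that the delete event arrives once the object is actually removed.

```go
package main

import "time"

// Object, EventKind, and Event are simplified local stand-ins, not the
// controller-runtime types.
type Object struct {
	Name              string
	DeletionTimestamp *time.Time // set when the object is marked for deletion
}

type EventKind int

const (
	Create EventKind = iota
	Update
	Delete
)

type Event struct {
	Kind   EventKind
	Object Object
}

// deleteOnly plays the role of the predicate: it admits delete events and
// rejects everything else.
func deleteOnly(e Event) bool {
	return e.Kind == Delete
}

// markForDeletion builds the update event seen when deletionTimestamp is
// set (given here as Unix seconds) but the object still exists.
func markForDeletion(name string, sec int64) Event {
	at := time.Unix(sec, 0)
	return Event{Kind: Update, Object: Object{Name: name, DeletionTimestamp: &at}}
}

// enqueued returns the names of objects whose events pass the predicate.
func enqueued(events []Event, pred func(Event) bool) []string {
	var names []string
	for _, e := range events {
		if pred(e) {
			names = append(names, e.Object.Name)
		}
	}
	return names
}
```

In this model, a create followed by a mark-for-deletion update enqueues nothing; only the final delete event triggers the map function.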

Comment thread pkg/operator/controller/dns/controller.go Outdated
Comment thread pkg/operator/controller/dns/controller.go Outdated
Comment thread pkg/operator/controller/dns/controller_test.go Outdated
// Test_publishRecordToZonesMergesStatus verifies that publishRecordToZones
// correctly merges status updates.
-func TestPublishRecordToZonesMergesStatus(t *testing.T) {
+func Test_publishRecordToZonesMergesStatus(t *testing.T) {
Contributor

@Miciah Miciah May 15, 2025

TestPublishRecordToZonesMergesStatus is an appropriate name for the test as there is no publishRecordToZonesMergesStatus function.

Member

I may be missing something here, but is there a new test case that checks whether the condition is set?

@rfredette rfredette changed the title Don't publish duplicate DNS records OCPBUGS-31521: Don't publish duplicate DNS records Aug 13, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 13, 2025
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-31521, which is invalid:

  • expected the bug to target either version "4.20." or "openshift-4.20.", but it targets "4.19.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@rfredette rfredette marked this pull request as ready for review August 13, 2025 21:21
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2025
@openshift-ci openshift-ci Bot requested review from Thealisyed and miheer August 13, 2025 21:22
Comment thread pkg/operator/controller/dns/controller.go
Comment thread pkg/operator/controller/dns/controller.go Outdated
	iov1.AddToScheme(scheme)
	fakeClient := fake.NewClientBuilder().
		WithScheme(scheme).
		WithIndex(&iov1.DNSRecord{}, dnsRecordIndexFieldName, func(o client.Object) []string {
Member

nit (no need to fix now!): would it be worth moving this indexer function into a shared utility so that both operatorCache.IndexField and the fake cache use it consistently?

Contributor

Do you mean a util function that the controller logic and test logic would share? That does make sense, though I do caution against re-using controller logic in tests if doing so could mask a defect in the controller logic.

Member

Yes, the idea was to create a shared util function, but I see your concern. It also makes sense not to share it: if the main reconciliation logic changes, a test with its own cache logic will catch the regression.

Contributor Author

The intent here is to use the same index function that was added in lines 117-122. Since the fake cache doesn't go through all the setup steps that the actual one does, it needed to be added manually. In this case, having the logic match what's used in the actual controller probably is the way to go.
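
One way to picture the shared-indexer option discussed in this thread is the toy model below. It is only a sketch: `DNSRecord`, `dnsNameIndexFunc`, and `buildIndex` are hypothetical stand-ins, and the map-based index merely imitates what a cache field index does. The point is that the same extractor function can be handed to both the real cache setup and the fake one.

```go
package main

// DNSRecord is a simplified local stand-in for the operator's DNSRecord.
type DNSRecord struct {
	Name    string
	DNSName string // stands in for Spec.DNSName
}

// dnsNameIndexFunc is a hypothetical shared extractor: it returns the
// index keys for one record, here just its DNS name.
func dnsNameIndexFunc(r *DNSRecord) []string {
	return []string{r.DNSName}
}

// buildIndex is a toy in-memory field index. It maps each key produced by
// extract to the names of the records that produced it.
func buildIndex(records []DNSRecord, extract func(*DNSRecord) []string) map[string][]string {
	idx := make(map[string][]string)
	for i := range records {
		for _, key := range extract(&records[i]) {
			idx[key] = append(idx[key], records[i].Name)
		}
	}
	return idx
}
```

The trade-off raised above still applies: sharing the extractor keeps test and controller consistent, but a bug in the extractor would then be present in both.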

@candita
Contributor

candita commented Aug 20, 2025

/assign @rikatz

@openshift-ci
Contributor

openshift-ci Bot commented Aug 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rikatz. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rikatz
Member

rikatz commented Aug 28, 2025

/cc
Re-requesting my review; I will do it first thing tomorrow morning!

@openshift-ci openshift-ci Bot requested a review from rikatz August 28, 2025 20:06
	oldestExistingRecord := iov1.DNSRecord{}
	for _, existingRecord := range otherRecords.Items {
		// Exclude records that are marked for deletion.
		if !existingRecord.DeletionTimestamp.IsZero() {
Member

Just for safety (sorry for realizing this only now): DeletionTimestamp is a nullable pointer field (https://github.com/kubernetes/apimachinery/blob/d74026bbe3beeff64c3dc7259a29be7708aa834f/pkg/apis/meta/v1/types.go#L209), so I would recommend checking whether it is nil before checking whether it is zero.

Contributor

The IsZero method has a nil check on its receiver, so I think the caller can omit the nil check?

// IsZero returns true if the value is nil or time is zero.
func (t *Time) IsZero() bool {
	if t == nil {
		return true
	}
	return t.Time.IsZero()
}

I would be happy with a unit test case in lieu of a nil check.
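
The kind of check such a unit test case might cover can be sketched with a local stand-in. `Time` below imitates the shape of the quoted method (a pointer receiver with a nil guard around an embedded `time.Time`); `Record`, `at`, and `markedForDeletion` are hypothetical names used only in this example.

```go
package main

import "time"

// Time is a local stand-in that wraps time.Time, imitating the shape of
// the type quoted above.
type Time struct{ time.Time }

// IsZero returns true if the value is nil or the time is zero. Calling a
// pointer-receiver method on a nil pointer is legal in Go as long as the
// method does not dereference the receiver before checking it.
func (t *Time) IsZero() bool {
	if t == nil {
		return true
	}
	return t.Time.IsZero()
}

// Record is a hypothetical stand-in with a nullable deletion timestamp.
type Record struct {
	DeletionTimestamp *Time
}

// at builds a *Time from Unix seconds.
func at(sec int64) *Time {
	return &Time{time.Unix(sec, 0)}
}

// markedForDeletion mirrors the check in the loop under review.
func markedForDeletion(r Record) bool {
	return !r.DeletionTimestamp.IsZero()
}
```

A test built on this would cover three cases: a nil timestamp, a non-nil zero timestamp, and a set timestamp. Only the last should count as marked for deletion.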

@Miciah
Contributor

Miciah commented Sep 22, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 22, 2025
@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-31521, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (iamin@redhat.com), skipping review request.

@rhamini3
Contributor

Marking the bug as verified, since the fix was confirmed through pre-merge testing.

  1. Create a gateway with the same ingress controller domain and check if it is published
  status:
    observedGeneration: 1
    zones:
    - conditions:
      - message: Domain name is already in use
        reason: DomainAlreadyInUse
        status: "False"
        type: Published
      dnsZone:
        tags:
          Name: ci-ln-ti0k73k-76ef8-jkss7-int
          kubernetes.io/cluster/ci-ln-ti0k73k-76ef8-jkss7: owned
    - conditions:
      - message: Domain name is already in use
        reason: DomainAlreadyInUse
        status: "False"
        type: Published
      dnsZone:
        id: Z00287062J1ITQ61DDU2Z
  2. Delete the gateway and confirm that upstream dnsrecord and routes are not affected
% oc get dnsrecord -A
NAMESPACE                    NAME               AGE
openshift-ingress-operator   default-wildcard   145m

% curl -I oauth-openshift.apps.ci-ln-ti0k73k-76ef8.aws-2.ci.openshift.org
HTTP/1.1 302 Found
content-length: 0
location: https://oauth-openshift.apps.ci-ln-ti0k73k-76ef8.aws-2.ci.openshift.org/
cache-control: no-cache

% curl -I canary-openshift-ingress-canary.apps.ci-ln-ti0k73k-76ef8.aws-2.ci.openshift.org
HTTP/1.1 302 Found
content-length: 0
location: https://canary-openshift-ingress-canary.apps.ci-ln-ti0k73k-76ef8.aws-2.ci.openshift.org/
cache-control: no-cache

/verified by rhamini3

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Sep 22, 2025
@openshift-ci-robot
Contributor

@rhamini3: This PR has been marked as verified by rhamini3.

@candita
Contributor

candita commented Nov 6, 2025

E2E test failure seen on this, reported by @rikatz in https://issues.redhat.com/browse/OCPBUGS-64675.

 --- FAIL: TestAll/serial/TestGatewayAPI/testGatewayAPIDNS (341.29s)
            --- PASS: TestAll/serial/TestGatewayAPI/testGatewayAPIDNS/multipleGatewaysSameListenerHostname (161.16s)
            --- FAIL: TestAll/serial/TestGatewayAPI/testGatewayAPIDNS/gatewayListenersWithOverlappingHostname (180.09s)

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1e2adcf]

@candita
Contributor

candita commented Jan 27, 2026

/retest

@rikatz
Member

rikatz commented Mar 19, 2026

/retest-required

@rikatz
Member

rikatz commented Mar 19, 2026

We may want to revive this soon.

@candita @rfredette do you think this can be worked on next week, if Ryan has the capacity? Otherwise I think we should prioritize it for 4.23, as this is long overdue.

thanks!!

@rfredette rfredette force-pushed the no-conflicting-dns branch from 4822baa to 0dbf8fe Compare April 8, 2026 17:37
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Apr 8, 2026
@coderabbitai

coderabbitai Bot commented Apr 8, 2026

📝 Walkthrough

The DNS controller now indexes DNSRecord.Spec.DNSName and validates domain conflicts before publishing. publishRecordToZones performs pre-publish checks that set DNSRecordPublished to InternalError or DomainAlreadyInUse on failure and avoid publishing when a conflict exists. Delete events use a specialized map function to enqueue the oldest matching non-unmanaged record for reconciliation. Tests were added/updated with a fake controller-runtime cache and conflict-focused scenarios.

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and specifically describes the main change: implementing validation to prevent publishing duplicate DNS records to the same zone. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the domain conflict check before publishing DNS records. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies standard Go test files using native testing package, not Ginkgo. Custom check applies only to Ginkgo test names, so it is not applicable. |
| Test Structure And Quality | ✅ Passed | Test code demonstrates solid quality practices with single responsibility, clean setup using helpers, meaningful assertion messages, and consistent patterns. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added in this PR; only Go unit tests were added to the DNS controller test file. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This check is not applicable to the provided pull request. The PR adds unit tests using Go's standard testing package, not Ginkgo e2e tests. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR introduces only DNS record management logic without introducing scheduling constraints, pod affinity, nodeSelectors, tolerations, or topology spread constraints. |
| Ote Binary Stdout Contract | ✅ Passed | Pull request contains no process-level stdout writes, logging initialization without stderr redirection, or other violations of OTE Binary Stdout Contract. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The pull request does not add any Ginkgo e2e tests, containing only standard Go unit tests for DNS record conflict detection. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/controller/dns/controller.go`:
- Around line 426-453: The current domainIsAlreadyPublishedToZone uses
recordIsAlreadyPublishedToZone (which reads a Published=True status) so races
allow duplicate publishes; change the guard to determine single ownership by
comparing eligible records' identity/order instead of published status: in
domainIsAlreadyPublishedToZone (and when iterating records.Items) ignore records
that are not eligible (e.g., with deletionTimestamp set or that cannot be
authoritative), then find the canonical owner for record.Spec.DNSName by
choosing the record with the earliest CreationTimestamp (tie-break by UID) and
return true (block) if any other eligible record is earlier than the current
record. Replace the call to recordIsAlreadyPublishedToZone with this
order/ownership check so Ensure cannot publish when a prior eligible record
exists.
- Around line 663-680: The current requeue logic selects the oldest non-deleting
record (oldestExistingRecord) from otherRecords.Items without checking whether
that record is eligible to publish (e.g. UnmanagedDNS), which can requeue an
ineligible record and leave a publishable record stranded; update the selection
loop to first filter otherRecords.Items to only include eligible records (skip
those with DeletionTimestamp set and skip records marked UnmanagedDNS or
otherwise known-unable-to-publish), then pick the oldest CreationTimestamp among
that filtered set (the variable oldestExistingRecord), and if no eligible
records remain return an empty []reconcile.Request rather than a request for an
ineligible record. Ensure the logic references otherRecords.Items,
oldestExistingRecord and the UnmanagedDNS/eligibility check used elsewhere in
this controller.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 3dcd6230-2887-4bd2-940f-260c2abd37f8

📥 Commits

Reviewing files that changed from the base of the PR and between b408b27 and 0dbf8fe.

📒 Files selected for processing (2)
  • pkg/operator/controller/dns/controller.go
  • pkg/operator/controller/dns/controller_test.go

Comment on lines +426 to +453
// domainIsAlreadyPublishedToZone returns true if the domain name in the
// provided DNSRecord is already published by another existing dnsRecord.
func domainIsAlreadyPublishedToZone(ctx context.Context, cache cache.Cache, record *iov1.DNSRecord, zone *configv1.DNSZone) (bool, error) {
	records := iov1.DNSRecordList{}
	if err := cache.List(ctx, &records, client.MatchingFields{dnsRecordIndexFieldName: record.Spec.DNSName}); err != nil {
		return false, err
	}

	if len(records.Items) == 0 {
		log.Info(fmt.Sprintf("No existing records found for domain %q", record.Spec.DNSName))
		return false, nil
	}

	for _, existingRecord := range records.Items {
		// we only care if the domain name is published by a different record,
		// so ignore the matching record if it already exists.
		if record.UID == existingRecord.UID {
			continue
		}
		if record.Spec.DNSName != existingRecord.Spec.DNSName {
			continue
		}
		if recordIsAlreadyPublishedToZone(&existingRecord, zone) {
			return true, nil
		}
	}
	return false, nil
}

⚠️ Potential issue | 🟠 Major

This guard still allows duplicate publishes on stale status.

A conflicting record only blocks once its cached zone condition is already Published=True. Because Ensure runs before the later status write/cache update, a second record can still pass this check and publish the same DNS name under back-to-back reconciles. If the intended policy is single ownership per DNS name, the guard needs to key off another eligible record's existence/order, not post-publish status.
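
The ordering-based guard proposed in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the controller's code: `Record` is a simplified stand-in, `Deleting` and `Unmanaged` approximate "not eligible", and `newRecord`, `eligible`, `precedes`, and `blockedByOlderRecord` are hypothetical names.

```go
package main

import "time"

// Record is a simplified local stand-in for a DNSRecord.
type Record struct {
	UID       string
	DNSName   string
	Created   time.Time
	Deleting  bool // stands in for a non-zero deletionTimestamp
	Unmanaged bool // stands in for the UnmanagedDNS policy
}

// newRecord builds a Record created at the given Unix time in seconds.
func newRecord(uid, dnsName string, createdSec int64) Record {
	return Record{UID: uid, DNSName: dnsName, Created: time.Unix(createdSec, 0)}
}

// eligible reports whether r could own a DNS name at all.
func eligible(r *Record) bool {
	return !r.Deleting && !r.Unmanaged
}

// precedes orders records by creation time, breaking ties by UID so the
// ordering is total and every reconcile agrees on the owner.
func precedes(a, b *Record) bool {
	if !a.Created.Equal(b.Created) {
		return a.Created.Before(b.Created)
	}
	return a.UID < b.UID
}

// blockedByOlderRecord returns true if another eligible record with the
// same DNS name precedes record. It does not read any published status,
// so it is unaffected by a stale status in the cache.
func blockedByOlderRecord(record *Record, all []Record) bool {
	for i := range all {
		other := &all[i]
		if other.UID == record.UID || other.DNSName != record.DNSName || !eligible(other) {
			continue
		}
		if precedes(other, record) {
			return true
		}
	}
	return false
}
```

Because the decision depends only on which records exist and their order, two back-to-back reconciles reach the same answer even before either has written its status.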

Comment on lines +663 to +680
	oldestExistingRecord := iov1.DNSRecord{}
	for _, existingRecord := range otherRecords.Items {
		// Exclude records that are marked for deletion.
		if !existingRecord.DeletionTimestamp.IsZero() {
			continue
		}
		if oldestExistingRecord.CreationTimestamp.IsZero() || existingRecord.CreationTimestamp.Before(&oldestExistingRecord.CreationTimestamp) {
			oldestExistingRecord = existingRecord
		}
	}
	return []reconcile.Request{
		{
			NamespacedName: types.NamespacedName{
				Name:      oldestExistingRecord.Name,
				Namespace: oldestExistingRecord.Namespace,
			},
		},
	}

⚠️ Potential issue | 🟠 Major

Only requeue an eligible replacement record here.

This currently picks the oldest non-deleting record even if it is UnmanagedDNS and can never publish. In a sequence like unmanaged A → published B → blocked managed C, deleting B requeues A and leaves C stranded with no follow-up event. Once you filter to records that can actually publish, also return no requests when none remain.

🐛 Possible fix
-	oldestExistingRecord := iov1.DNSRecord{}
-	for _, existingRecord := range otherRecords.Items {
+	var oldestExistingRecord *iov1.DNSRecord
+	for i := range otherRecords.Items {
+		existingRecord := &otherRecords.Items[i]
 		// Exclude records that are marked for deletion.
 		if !existingRecord.DeletionTimestamp.IsZero() {
 			continue
 		}
-		if oldestExistingRecord.CreationTimestamp.IsZero() || existingRecord.CreationTimestamp.Before(&oldestExistingRecord.CreationTimestamp) {
-			oldestExistingRecord = existingRecord
+		if existingRecord.Spec.DNSManagementPolicy == iov1.UnmanagedDNS {
+			continue
+		}
+		if oldestExistingRecord == nil || existingRecord.CreationTimestamp.Before(&oldestExistingRecord.CreationTimestamp) {
+			oldestExistingRecord = existingRecord
 		}
 	}
+	if oldestExistingRecord == nil {
+		return nil
+	}
 	return []reconcile.Request{
 		{
 			NamespacedName: types.NamespacedName{
 				Name:      oldestExistingRecord.Name,
 				Namespace: oldestExistingRecord.Namespace,
 			},
 		},
 	}

Before attempting to publish a domain to a zone, check if that domain is
already being published to the same zone.
@rfredette rfredette force-pushed the no-conflicting-dns branch from 0dbf8fe to 793a451 Compare April 29, 2026 14:29
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
pkg/operator/controller/dns/controller.go (1)

663-684: ⚠️ Potential issue | 🟡 Minor

Return empty slice when no eligible record exists.

After filtering out records that are being deleted (line 666) or are unmanaged (line 670), if no eligible records remain, oldestExistingRecord stays as a zero-value struct with empty Name and Namespace. Lines 677-684 then return a request for this empty record, which wastes reconciliation resources.

🛡️ Proposed fix
 	oldestExistingRecord := iov1.DNSRecord{}
 	for _, existingRecord := range otherRecords.Items {
 		// Exclude records that are marked for deletion.
 		if !existingRecord.DeletionTimestamp.IsZero() {
 			continue
 		}
 		// Exclude unmanaged DNS records.
 		if existingRecord.Spec.DNSManagementPolicy == iov1.UnmanagedDNS {
 			continue
 		}
 		if oldestExistingRecord.CreationTimestamp.IsZero() || existingRecord.CreationTimestamp.Before(&oldestExistingRecord.CreationTimestamp) {
 			oldestExistingRecord = existingRecord
 		}
 	}
+	// No eligible record found to requeue.
+	if oldestExistingRecord.Name == "" {
+		return []reconcile.Request{}
+	}
 	return []reconcile.Request{
 		{
 			NamespacedName: types.NamespacedName{
 				Name:      oldestExistingRecord.Name,
 				Namespace: oldestExistingRecord.Namespace,
 			},
 		},
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/controller/dns/controller.go` around lines 663 - 684, The loop
that computes oldestExistingRecord (iterating otherRecords.Items and filtering
by DeletionTimestamp and Spec.DNSManagementPolicy) can leave
oldestExistingRecord as the zero-value DNSRecord, causing a reconcile.Request
with empty Name/Namespace to be returned; update the code after the loop to
check whether oldestExistingRecord.CreationTimestamp.IsZero() (or Name/Namespace
empty) and if so return an empty []reconcile.Request{}, otherwise build and
return the reconcile.Request using oldestExistingRecord.Name and .Namespace.
Ensure the check references oldestExistingRecord, otherRecords, and does not
create a request when no eligible record exists.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 669de4c2-b3a1-4111-9891-1fb7b543fc2c

📥 Commits

Reviewing files that changed from the base of the PR and between 0dbf8fe and 793a451.

📒 Files selected for processing (2)
  • pkg/operator/controller/dns/controller.go
  • pkg/operator/controller/dns/controller_test.go

@rikatz
Member

rikatz commented Apr 29, 2026

/cc

will review again soon!

@openshift-ci openshift-ci Bot requested a review from rikatz April 29, 2026 14:58
@openshift-ci
Contributor

openshift-ci Bot commented Apr 29, 2026

@rfredette: The following tests failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-serial | 4822baa | link | true | /test e2e-aws-ovn-serial |
| ci/prow/e2e-gcp-ovn | 4822baa | link | false | /test e2e-gcp-ovn |
| ci/prow/okd-scos-e2e-aws-ovn | 4822baa | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-aws-operator-techpreview | 793a451 | link | false | /test e2e-aws-operator-techpreview |
| ci/prow/e2e-hypershift | 793a451 | link | true | /test e2e-hypershift |
| ci/prow/e2e-aws-ovn-upgrade | 793a451 | link | true | /test e2e-aws-ovn-upgrade |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rikatz
Member

rikatz commented May 5, 2026

@coderabbitai since it has been a long time since you reviewed this PR, I want you to review it again. You must report any critical or major finding, and you must review line by line. Please report any possible inconsistency or behavior that can cause disruption or corruption. Report as many findings as possible and do not leave anything to be verified on a future commit. I want a single report with EVERYTHING that must be fixed, in a single review pass.

@coderabbitai

coderabbitai Bot commented May 5, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get the full PR diff
gh pr diff 1229 --repo openshift/cluster-ingress-operator

Length of output: 15938


🏁 Script executed:

#!/bin/bash
# Get the full controller.go file
cat pkg/operator/controller/dns/controller.go

Length of output: 36999


🏁 Script executed:

#!/bin/bash
# Find and read the testutil FakeCache and FakeClientRecorder
fd -t f . pkg/operator/controller/test/
echo "---FILE LIST ABOVE---"
cat pkg/operator/controller/dns/controller_test.go | head -60
echo "---"
# Find FakeCache definition
rg -n "FakeCache|FakeClientRecorder" pkg/operator/controller/test/ -A 10

Length of output: 7056


@rikatz Here is a comprehensive, line-by-line review of all changes in this PR:


🔴 Critical Bugs

Bug 1 — mapOnRecordDelete emits a reconcile request with an empty Name/Namespace when all candidates are filtered out

File: pkg/operator/controller/dns/controller.go

// mapOnRecordDelete:
oldestExistingRecord := iov1.DNSRecord{}
for _, existingRecord := range otherRecords.Items {
    if !existingRecord.DeletionTimestamp.IsZero() { continue }
    if existingRecord.Spec.DNSManagementPolicy == iov1.UnmanagedDNS { continue }
    if oldestExistingRecord.CreationTimestamp.IsZero() || ... {
        oldestExistingRecord = existingRecord
    }
}
// ⚠️ If ALL records were filtered, oldestExistingRecord is still a zero-value DNSRecord{}
return []reconcile.Request{
    {
        NamespacedName: types.NamespacedName{
            Name:      oldestExistingRecord.Name,      // "" ← BUG
            Namespace: oldestExistingRecord.Namespace, // "" ← BUG
        },
    },
}

If every record with the same domain name is either terminating (DeletionTimestamp != nil) or unmanaged, oldestExistingRecord stays as the zero value, and a reconcile.Request for {Name:"", Namespace:""} is enqueued. This is almost certainly the root cause of the E2E panic (invalid memory address or nil pointer dereference in testGatewayAPIDNS/gatewayListenersWithOverlappingHostname).

Fix: Guard the return:

if oldestExistingRecord.Name == "" {
    return []reconcile.Request{}
}
return []reconcile.Request{{
    NamespacedName: types.NamespacedName{
        Name:      oldestExistingRecord.Name,
        Namespace: oldestExistingRecord.Namespace,
    },
}}

Bug 2 — mapOnRecordDelete does not exclude the deleted record itself by UID

File: pkg/operator/controller/dns/controller.go

for _, existingRecord := range otherRecords.Items {
    if !existingRecord.DeletionTimestamp.IsZero() { continue }
    // ⚠️ No check: existingRecord.UID != deletedRecord.UID
    ...
}

The cached list may still contain the deleted record when the mapper runs. If that record had no finalizer (or was force-deleted), it has no DeletionTimestamp and is not filtered out, so it can be re-enqueued as the "oldest" candidate, resulting in a reconcile for a record that no longer exists. Whether this can occur depends on when the informer cache drops the object relative to the delete event; an explicit UID check removes the dependency on that ordering.

Fix: Add a UID exclusion:

if existingRecord.UID == deletedRecord.UID {
    continue
}

🟠 Major Issues

Issue 3 — DomainAlreadyInUse never sets requeue = true: blocked records can get permanently stuck

File: pkg/operator/controller/dns/controller.go

} else if isDomainPublished {
    log.Info("DNS record not published: domain name already used by another DNS record", ...)
    condition = iov1.DNSZoneCondition{
        Message: "Domain name is already in use",
        Reason:  "DomainAlreadyInUse",
        Status:  string(operatorv1.ConditionFalse),
        Type:    iov1.DNSRecordPublishedConditionType,
    }
    // ⚠️ requeue is NOT set to true here
}

The design relies entirely on mapOnRecordDelete to re-enqueue the blocked record when the conflicting record is removed. If mapOnRecordDelete fails (e.g., r.cache.List returns an error), it silently returns []reconcile.Request{} and the blocked record is stuck forever in DomainAlreadyInUse with no retry mechanism.

A slow requeue (e.g., 5–10 minutes) should be added as a safety net, similar to other error paths that use RequeueAfter: 30 * time.Second.
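The safety net could look roughly like the following. This is a minimal sketch with stand-in types rather than controller-runtime's: `reconcileResult`, `requeueFor`, and both interval constants are assumptions made for illustration and do not appear in the PR.

```go
package main

import "time"

// reconcileResult is a stand-in for controller-runtime's reconcile.Result;
// only the RequeueAfter field is modeled here.
type reconcileResult struct {
	RequeueAfter time.Duration
}

const (
	// Hypothetical intervals: the short one mirrors the 30-second retry the
	// review mentions for error paths; the long one is the proposed safety net.
	errorRetryInterval     = 30 * time.Second
	domainInUseRecheckTime = 5 * time.Minute
)

// requeueFor picks a requeue delay from the reason on the Published condition.
// A zero RequeueAfter means "wait for the next watch event".
func requeueFor(reason string) reconcileResult {
	switch reason {
	case "DomainAlreadyInUse":
		// Do not rely solely on the delete mapper: recheck periodically.
		return reconcileResult{RequeueAfter: domainInUseRecheckTime}
	case "InternalError":
		return reconcileResult{RequeueAfter: errorRetryInterval}
	default:
		return reconcileResult{}
	}
}
```

With this shape, a blocked record is re-evaluated at most one interval after the conflicting record disappears, even if the delete event was missed.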


Issue 4 — context.Background() used instead of propagating the reconcile context

File: pkg/operator/controller/dns/controller.go

} else if isDomainPublished, err := domainIsAlreadyPublishedToZone(context.Background(), r.cache, record, &zones[i]); err != nil {

publishRecordToZones doesn't accept a context parameter, so context.Background() is passed. This means any cancellation or timeout set by the controller-runtime framework on the reconcile context does not propagate to this cache list operation. The function signature of publishRecordToZones should be extended to accept a context.
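To illustrate why the context should be threaded through, here is a small self-contained sketch. The `record` and `lister` types, `domainHasOtherRecord`, `staticLister`, and `demo` are hypothetical stand-ins, and the conflict check is deliberately simplified (it ignores zones); the point is only that a cancelled caller context reaches the list call.

```go
package main

import (
	"context"
	"errors"
)

// record is a stand-in for iov1.DNSRecord.
type record struct {
	UID     string
	DNSName string
}

// lister returns the records indexed under a domain name and is expected to
// honor ctx cancellation, as a real cache reader would.
type lister func(ctx context.Context, dnsName string) ([]record, error)

// domainHasOtherRecord receives the caller's ctx instead of creating
// context.Background(), so cancellation reaches the list call.
func domainHasOtherRecord(ctx context.Context, list lister, rec record) (bool, error) {
	others, err := list(ctx, rec.DNSName)
	if err != nil {
		return false, err
	}
	for _, other := range others {
		if other.UID != rec.UID {
			return true, nil
		}
	}
	return false, nil
}

// staticLister builds a lister over a fixed slice that fails once ctx is done.
func staticLister(all []record) lister {
	return func(ctx context.Context, dnsName string) ([]record, error) {
		if err := ctx.Err(); err != nil {
			return nil, err
		}
		var out []record
		for _, r := range all {
			if r.DNSName == dnsName {
				out = append(out, r)
			}
		}
		return out, nil
	}
}

// demo runs the check once with a live context and once with a cancelled one.
func demo() (found bool, liveErr error, cancelledSeen bool) {
	list := staticLister([]record{
		{UID: "u1", DNSName: "app.example.com"},
		{UID: "u2", DNSName: "app.example.com"},
	})
	rec := record{UID: "u2", DNSName: "app.example.com"}

	found, liveErr = domainHasOtherRecord(context.Background(), list, rec)

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the reconcile being cancelled before the list call
	_, err := domainHasOtherRecord(ctx, list, rec)
	cancelledSeen = errors.Is(err, context.Canceled)
	return found, liveErr, cancelledSeen
}
```

Had the helper created its own `context.Background()`, the second call in `demo` would have completed as if nothing had been cancelled.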


Issue 5 — Redundant dead-code check in domainIsAlreadyPublishedToZone

File: pkg/operator/controller/dns/controller.go

for _, existingRecord := range records.Items {
    if record.UID == existingRecord.UID { continue }
    if record.Spec.DNSName != existingRecord.Spec.DNSName { continue } // ⚠️ Dead code
    if recordIsAlreadyPublishedToZone(&existingRecord, zone) {
        return true, nil
    }
}

The cache list is already filtered by client.MatchingFields{dnsRecordIndexFieldName: record.Spec.DNSName}, so every record returned has existingRecord.Spec.DNSName == record.Spec.DNSName. The second condition is therefore never true and its continue never runs. The check should be removed to avoid confusion.


Issue 6 — Noisy log message emitted on every successful reconcile (hot path)

File: pkg/operator/controller/dns/controller.go

if len(records.Items) == 0 {
    log.Info(fmt.Sprintf("No existing records found for domain %q", record.Spec.DNSName))
    return false, nil
}

This log.Info is emitted on every single reconcile of every DNS record that has no conflicting peers — which is the normal/expected case for the vast majority of records. On a cluster with frequent reconciles, this will flood logs with noise. This should either be removed or changed to a debug-level log.


Issue 7 — No unit test for mapOnRecordDelete at all

File: pkg/operator/controller/dns/controller_test.go

mapOnRecordDelete is the most complex new function and arguably the riskiest (it contains Bug 1), yet it has no unit test coverage. The following cases should be tested:

| Scenario | Expected result |
| --- | --- |
| No matching records | []reconcile.Request{} |
| All matching records are terminating | []reconcile.Request{} ← Bug 1 scenario |
| All matching records are unmanaged | []reconcile.Request{} ← Bug 1 scenario |
| Multiple matching records, pick oldest | request for the oldest record |
| Deleted record itself is still in cache | must exclude it |
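The scenarios in the table above can be exercised against a pure selection function. The sketch below is an assumption-laden illustration, not the PR's code: `dnsRecord`, `request`, `requestsAfterDelete`, `scenario`, and `failedScenarios` are hypothetical stand-ins for the real types, and the boolean fields approximate DeletionTimestamp and DNSManagementPolicy.

```go
package main

import "time"

// dnsRecord is a stand-in for iov1.DNSRecord with only the fields the
// selection logic reads.
type dnsRecord struct {
	Name, Namespace, UID string
	Created              time.Time
	Deleting             bool // stands in for a non-zero DeletionTimestamp
	Unmanaged            bool // stands in for DNSManagementPolicy == UnmanagedDNS
}

// request is a stand-in for reconcile.Request.
type request struct {
	Name, Namespace string
}

// requestsAfterDelete returns a request for the oldest eligible record that
// shares the deleted record's domain, or an empty slice if none is eligible.
func requestsAfterDelete(deleted dnsRecord, candidates []dnsRecord) []request {
	var oldest *dnsRecord
	for i := range candidates {
		c := &candidates[i]
		if c.UID == deleted.UID || c.Deleting || c.Unmanaged {
			continue
		}
		if oldest == nil || c.Created.Before(oldest.Created) {
			oldest = c
		}
	}
	if oldest == nil {
		return []request{}
	}
	return []request{{Name: oldest.Name, Namespace: oldest.Namespace}}
}

// scenario pairs candidate records with the expected request names.
type scenario struct {
	name       string
	candidates []dnsRecord
	want       []string
}

// failedScenarios runs the five table cases and returns the names that fail.
func failedScenarios() []string {
	t0 := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	// The deleted record is the oldest of all, so a missing UID check is caught.
	gone := dnsRecord{Name: "gone", Namespace: "ns", UID: "u0", Created: t0.Add(-time.Hour)}
	older := dnsRecord{Name: "older", Namespace: "ns", UID: "u1", Created: t0}
	newer := dnsRecord{Name: "newer", Namespace: "ns", UID: "u2", Created: t0.Add(time.Hour)}
	terminating := dnsRecord{Name: "term", Namespace: "ns", UID: "u3", Created: t0, Deleting: true}
	unmanaged := dnsRecord{Name: "unmg", Namespace: "ns", UID: "u4", Created: t0, Unmanaged: true}

	cases := []scenario{
		{"no matching records", nil, nil},
		{"all terminating", []dnsRecord{terminating}, nil},
		{"all unmanaged", []dnsRecord{unmanaged}, nil},
		{"pick oldest", []dnsRecord{newer, older}, []string{"older"}},
		{"deleted record still cached", []dnsRecord{gone, newer}, []string{"newer"}},
	}
	var failed []string
	for _, s := range cases {
		got := requestsAfterDelete(gone, s.candidates)
		ok := len(got) == len(s.want)
		for i := 0; ok && i < len(got); i++ {
			ok = got[i].Name == s.want[i]
		}
		if !ok {
			failed = append(failed, s.name)
		}
	}
	return failed
}
```

Tracking the candidate through a pointer rather than a zero-value struct makes the "nothing eligible" case explicit, which is the guard Bug 1 asks for.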

Issue 8 — TOCTOU race condition between conflict check and publish

File: pkg/operator/controller/dns/controller.go

} else if isDomainPublished, err := domainIsAlreadyPublishedToZone(...); err != nil {
    ...
} else if isDomainPublished {
    ...
} else {
    condition, err = r.publishRecord(zones[i], record)
    ...
}

Between the moment domainIsAlreadyPublishedToZone returns false and r.publishRecord is called, another record could be published to the same zone and domain. This is a classic check-then-act race. While the practical frequency is low for DNS record creation, it can lead to duplicate DNS records being published if two gateways/ingresscontrollers are created concurrently with the same domain. This should at minimum be documented, and ideally caught at the DNS provider level.
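The difference between check-then-act and an atomic claim can be shown in-process. The sketch below is purely illustrative and assumes a hypothetical `claims` registry; the operator's real state lives in the API server and the DNS provider, so a mutex like this would not by itself close the race there.

```go
package main

import "sync"

// claimKey identifies a domain within a zone.
type claimKey struct {
	Zone, Domain string
}

// claims is a hypothetical in-process registry of which record owns a domain.
type claims struct {
	mu     sync.Mutex
	owners map[claimKey]string // value is the owning record's UID
}

func newClaims() *claims {
	return &claims{owners: make(map[claimKey]string)}
}

// tryClaim checks and records ownership under one lock, so two callers can
// never both observe "unclaimed" for the same key. It returns true if uid
// owns the key after the call.
func (c *claims) tryClaim(zone, domain, uid string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := claimKey{Zone: zone, Domain: domain}
	if owner, taken := c.owners[key]; taken {
		return owner == uid
	}
	c.owners[key] = uid
	return true
}

// release drops the claim if uid currently holds it.
func (c *claims) release(zone, domain, uid string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	key := claimKey{Zone: zone, Domain: domain}
	if c.owners[key] == uid {
		delete(c.owners, key)
	}
}

// raceWinners starts n goroutines that all try to claim the same domain and
// returns how many succeeded.
func raceWinners(n int) int {
	c := newClaims()
	var wg sync.WaitGroup
	var mu sync.Mutex
	winners := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			uid := string(rune('a' + id)) // a distinct UID per goroutine
			if c.tryClaim("zone-1", "app.example.com", uid) {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait()
	return winners
}
```

Because the check and the write happen under the same lock, exactly one concurrent caller wins; the separate "is it published?" and "publish" steps in the reconcile loop offer no such guarantee.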


🟡 Significant Observations

Issue 9 — dnsRecord test helper creates records without a Namespace

File: pkg/operator/controller/dns/controller_test.go

func dnsRecord(name string, ...) *iov1.DNSRecord {
    return &iov1.DNSRecord{
        ObjectMeta: metav1.ObjectMeta{
            Name: fmt.Sprintf("record-%s", name),
            UID:  uuid.NewUUID(),
            // ⚠️ No Namespace set
        },
        ...
    }
}

Real DNS records live in openshift-ingress-operator. The absence of a namespace in test helpers means that cross-namespace conflict scenarios (e.g., two records in different namespaces with the same domain name) are not tested. The index dnsRecordIndexFieldName is namespace-agnostic (it only indexes by Spec.DNSName), so a record in namespace A could block a record in namespace B — this behavior should be explicitly tested.
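The cross-namespace behavior described here can be modeled with a plain map. This is a sketch under the assumption that the index key is the DNS name alone; `indexedRecord`, `domainIndex`, and `conflictsFor` are hypothetical names, not the operator's API.

```go
package main

// indexedRecord is a stand-in for iov1.DNSRecord.
type indexedRecord struct {
	Namespace, Name, DNSName string
}

// domainIndex mimics a field index keyed only by Spec.DNSName: the namespace
// is not part of the key.
type domainIndex map[string][]indexedRecord

// add files the record under its DNS name.
func (idx domainIndex) add(r indexedRecord) {
	idx[r.DNSName] = append(idx[r.DNSName], r)
}

// conflictsFor returns every other record with the same domain, regardless
// of which namespace it lives in.
func (idx domainIndex) conflictsFor(r indexedRecord) []indexedRecord {
	var out []indexedRecord
	for _, other := range idx[r.DNSName] {
		if other.Namespace == r.Namespace && other.Name == r.Name {
			continue
		}
		out = append(out, other)
	}
	return out
}
```

A test that adds one record in each of two namespaces with the same DNS name and asserts that each sees the other as a conflict would pin down this behavior.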


Issue 10 — mapOnRecordDelete silently swallows cache.List errors

File: pkg/operator/controller/dns/controller.go

if err := r.cache.List(ctx, &otherRecords, client.MatchingFields{dnsRecordIndexFieldName: deletedRecord.Spec.DNSName}); err != nil {
    log.Error(err, "failed to list DNS records")
    return []reconcile.Request{} // ⚠️ Silently returns, no retry
}

When cache.List fails here, the error is logged but no reconcile request is returned. There is no mechanism to retry the lookup. Combined with Issue 3, if this happens when a conflicting record is deleted, the blocked record stays in DomainAlreadyInUse forever. Consider returning a reconcile for the deleted record itself as a fallback, or at minimum logging a more prominent warning.


Issue 11 — UnmanagedDNS condition lost its LastTransitionTime initialization (behavioral regression)

File: pkg/operator/controller/dns/controller.go

-condition = iov1.DNSZoneCondition{
-    ...
-    LastTransitionTime: metav1.Now(),
-}
+condition = iov1.DNSZoneCondition{
+    ...
+    // No LastTransitionTime
+}

The new InternalError and DomainAlreadyInUse conditions also omit LastTransitionTime. While mergeConditions does set it when adding new conditions, this is an inconsistency relative to replacePublishedRecord and publishRecord which initialize it explicitly. This is low risk but increases cognitive overhead.
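For readers unfamiliar with the pattern, the following sketch shows one common way a merge step can own LastTransitionTime so that callers need not set it. It is not the operator's mergeConditions; `zoneCondition`, `mergeCondition`, and `at` are hypothetical and only illustrate the idea of stamping the time on first add or on a Status change.

```go
package main

import "time"

// zoneCondition is a stand-in for iov1.DNSZoneCondition.
type zoneCondition struct {
	Type, Status, Reason string
	LastTransitionTime   time.Time
}

// mergeCondition is a hypothetical merge: it stamps LastTransitionTime when a
// condition is first added or its Status changes, and otherwise keeps the
// existing timestamp. It does not modify the input slice.
func mergeCondition(existing []zoneCondition, update zoneCondition, now time.Time) []zoneCondition {
	merged := make([]zoneCondition, 0, len(existing)+1)
	found := false
	for _, c := range existing {
		if c.Type != update.Type {
			merged = append(merged, c)
			continue
		}
		found = true
		if c.Status == update.Status {
			// Same status: keep the original transition time.
			update.LastTransitionTime = c.LastTransitionTime
		} else {
			update.LastTransitionTime = now
		}
		merged = append(merged, update)
	}
	if !found {
		update.LastTransitionTime = now
		merged = append(merged, update)
	}
	return merged
}

// at is a small helper for examples: a fixed day at the given hour, in UTC.
func at(hour int) time.Time {
	return time.Date(2025, 1, 1, hour, 0, 0, 0, time.UTC)
}
```

If the real merge behaves this way for every path, setting LastTransitionTime at the call sites is redundant, and the remaining question is only consistency between publishRecord, replacePublishedRecord, and the new conditions.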


Issue 12 — No e2e test covering the "conflicting then unblocking" lifecycle

The E2E test testGatewayAPIDNS/gatewayListenersWithOverlappingHostname panicked. Even after Bug 1 is fixed, there should be an explicit end-to-end test that verifies:

  1. Record A publishes successfully.
  2. Record B with the same domain gets DomainAlreadyInUse.
  3. Record A is deleted.
  4. Record B is eventually published (exercising mapOnRecordDelete).

This lifecycle is the entire purpose of the PR and should be covered by at least one E2E test.


Summary Table

| # | Severity | Location | Description |
| --- | --- | --- | --- |
| 1 | 🔴 Critical | mapOnRecordDelete | Returns empty-name reconcile request when all candidates filtered → E2E panic |
| 2 | 🔴 Critical | mapOnRecordDelete | Does not exclude deleted record by UID |
| 3 | 🟠 Major | publishRecordToZones | DomainAlreadyInUse never requeues → stuck records if mapOnRecordDelete fails |
| 4 | 🟠 Major | publishRecordToZones | context.Background() instead of reconcile context |
| 5 | 🟠 Major | domainIsAlreadyPublishedToZone | Redundant dead-code name check (index already filters by name) |
| 6 | 🟠 Major | domainIsAlreadyPublishedToZone | Noisy log.Info on every successful reconcile |
| 7 | 🟠 Major | controller_test.go | Zero unit tests for mapOnRecordDelete |
| 8 | 🟠 Major | publishRecordToZones | TOCTOU race between conflict check and actual publish |
| 9 | 🟡 Significant | controller_test.go | dnsRecord helper has no namespace; cross-namespace conflicts untested |
| 10 | 🟡 Significant | mapOnRecordDelete | cache.List errors silently swallowed with no retry |
| 11 | 🟡 Significant | publishRecordToZones | LastTransitionTime missing from new conditions |
| 12 | 🟡 Significant | E2E | No E2E test for full conflict-then-unblock lifecycle |

The immediate blocker is Bug 1: the missing guard in mapOnRecordDelete when all candidates are filtered out is the most likely cause of the nil-pointer panic in E2E. This must be fixed before merge.

Labels

jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
