Skip to content

IGNITE-28738 Implement basic Rolling Upgrade test using testcontainers#13184

Open
wernerdv wants to merge 26 commits into
apache:masterfrom
wernerdv:testcontainers
Open

IGNITE-28738 Implement basic Rolling Upgrade test using testcontainers#13184
wernerdv wants to merge 26 commits into
apache:masterfrom
wernerdv:testcontainers

Conversation

@wernerdv

Copy link
Copy Markdown
Contributor

Thank you for submitting the pull request to the Apache Ignite.

In order to streamline the review of the contribution
we ask you to ensure the following steps have been taken:

The Contribution Checklist

  • There is a single JIRA ticket related to the pull request.
  • The web-link to the pull request is attached to the JIRA ticket.
  • The JIRA ticket has the Patch Available state.
  • The pull request body describes changes that have been made.
    The description explains WHAT and WHY was made instead of HOW.
  • The pull request title is treated as the final commit message.
    The following pattern must be used: IGNITE-XXXX Change summary where XXXX - number of JIRA issue.
  • A reviewer has been mentioned through the JIRA comments
    (see the Maintainers list)
  • The pull request has been checked by the Teamcity Bot and
    the green visa attached to the JIRA ticket (see tab PR Check at TC.Bot - Instance 1 or TC.Bot - Instance 2)

Notes

If you need any help, please email dev@ignite.apache.org or ask anу advice on http://asf.slack.com #ignite channel.

@wernerdv wernerdv force-pushed the testcontainers branch 4 times, most recently from 0ce5477 to b7493e9 Compare May 29, 2026 05:15
@wernerdv wernerdv changed the title WIP Use testcontainers for RU tests IGNITE-28738 Implement basic Rolling Upgrade test using testcontainers Jun 1, 2026
Comment on lines +161 to +162
ignite = startGrid(configuration(container.nodeId(), container.localWorkDirectory(), addrs.values()));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is a problem based on the logs from my local run.

Prerequisites:
We have three old nodes, each running in its own Docker container. During rolling upgrade, one old container is stopped and replaced by a new local node outside Docker.

For simplicity:

  • node A: old container node, coordinator
  • node B: another old container node
  • node C: our new local node, started outside Docker

Problem:
Node C cannot join the cluster and fails with Failed to connect to any address from IP finder within join timeout

Reason:
Node C and the Docker container nodes have different views of node B address.

Node C connects to coordinator node A. Node A accepts the join request and prints Added new node to topology

After that, coordinator A sends TcpDiscoveryNodeAddedMessage around the discovery ring. In our case, node C receives this message and has to pass it to node B.

The problem is that node B declares its Docker-internal address, for example 172.19.0.3:47500

This address is valid inside the Docker network, so node A can use it. But node C is running on macOS, outside Docker. From node C's point of view, node B is reachable only through Docker port mapping, for example 192.168.0.101:<mapped-discovery-port>

As a result, node C tries to connect to node B using 172.19.0.3:47500 and cannot reach it. The discovery message cannot complete its round through the ring, so the join process is not completed. Finally, node C hits the join timeout and node A removes it from the topology

I see two solutions here:

  1. More clear. We need to start new node also in container to have common docker network and connect to cluster with thin client to manage nodes. The only problem is that we may want to do more sophisticated operations with server nodes and we won't be able to trigger java api
  2. Somehow set AddressResolver, but for now I'm not sure it will be a clear solution

@wernerdv wernerdv force-pushed the testcontainers branch 2 times, most recently from 4f23dfa to 7c2a550 Compare June 25, 2026 16:28

@anton-vinogradov anton-vinogradov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing on rolling-upgrade coverage (IEP-132) — driving old-cluster → node-by-node upgrade → data-integrity end to end is valuable, and the inline notes documenting the macOS networking pitfalls are genuinely helpful.

My main concern is that, as committed, this reads more like a manual, macOS-only proof-of-concept than a regression test, and it bundles a core change. Highlights (details inline):

  1. Not reproducible after mergeSOURCE_COMMIT_HASH is an ephemeral WIP commit not on master; post-squash it's unreachable, so the documented build/run flow can't be followed. It also makes the “upgrade” same-branch→same-branch, so no real version gap is tested. Base it on a released version + make it overridable.

  2. No automated regression value yet — the suite is orphaned (referenced nowhere) and needs Docker + a manually built image + a hand-edited hash. Effectively a manual harness. What's the CI plan?

  3. macOS-specific networking — the host-JVM-node ⇄ container topology and all its timeouts/filters (setFilterReachableAddresses, 1s socket timeouts, host.docker.internal, the 0.0.0.0/loopback split) are built around Docker Desktop on macOS. On Linux CI the behavior differs (direct 172.x routing, host.docker.internal not resolvable by default, root-owned bind-mounted persistence files that U.delete may fail to remove). Consider building a “new” image too and running an all-container cluster: uniform networking, portable across Linux/macOS, and it removes most of the bespoke glue (and the discovery change below).

  4. Production change smuggled in — the one-line TcpDiscoverySpi change alters core discovery for everyone (drops unresolved addresses). Likely a valid NPE guard, but it needs its own ticket + unit test, not a test PR.

  5. Code-level — a few real ones inline: ignored waitForCondition result, NPE-on-null unboxing in the verify loop, a debug System.out.println, trailing whitespace, fixed-port flakiness, a misleading “graceful shutdown” log.

Net: great as a PoC that the RU flow can be driven end to end; but I'd rework it toward a released base version + a CI-runnable (probably all-container) topology, and split out the discovery fix, before it lands as a committed test. Happy to help think through the all-container approach.

if (!node.equals(locNode))
addrs.removeIf(addr -> addr.getAddress().isLoopbackAddress() && locNode.socketAddresses().contains(addr));
addrs.removeIf(addr -> addr.getAddress() == null ||
(addr.getAddress().isLoopbackAddress() && locNode.socketAddresses().contains(addr)));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Core discovery behavior change bundled in a test PR. This is the only production change here, and it alters getEffectiveNodeAddresses for every deployment: it now drops any address whose getAddress() == null (an unresolved InetSocketAddress, e.g. a peer that advertised a hostname this node can't resolve).

The NPE guard itself looks legitimate — without it the original addr.getAddress().isLoopbackAddress() NPEs on an unresolved address. But (a) silently dropping unresolved addresses is a real behavior change for split-DNS / hostname-advertised topologies, and (b) it ships with no dedicated unit test, inside a “basic RU test” PR.

Please extract it into its own JIRA ticket with a focused unit test (a node whose socketAddresses() contains an unresolved address → assert no NPE + correct filtering) and a discovery-maintainer review. Also update the comment on the line above — it still only mentions “own loopback” and no longer describes the unresolved-address case.

);

/** Source commit hash. Used for docker image tag. */
private static final String SOURCE_COMMIT_HASH = "0ad4656eef09acda288cbad96f80f0138732d94a";

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is not reproducible after merge. SOURCE_COMMIT_HASH points to 0ad4656, an ephemeral WIP commit on this PR branch (3 commits behind HEAD) that is not reachable from master. After a squash-merge that SHA disappears, so nobody can build apacheignite/ignite:0ad4656… and the DEVNOTES workflow can't be followed.

It also makes the “upgrade” run from the same branch, 3 WIP commits back — effectively self→self — so no real cross-version gap is exercised, which is the whole point of a rolling-upgrade test.

Recommend basing the source side on a released version (a published apacheignite/ignite:<GA> image, or a release tag) and making it overridable via a system property instead of a hardcoded constant. The same hash currently has to be hand-synced across DEVNOTES Step 1/2 and this field — another reason to externalize it.


# Perform git checkout to the specified commit
echo -e "\nPerforming git checkout to commit: $COMMIT_HASH"
git checkout "$COMMIT_HASH"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git checkout "$COMMIT_HASH" mutates the live working tree and then runs mvnw clean install in it. If the tree has uncommitted changes the checkout aborts; and if the script is killed (SIGKILL / crash / CI cancel) before the EXIT trap fires, the user is left on a detached old commit with an old build. Safer to build the source version in an isolated git worktree (or a throwaway clone) so the working tree is never touched.

Comment thread parent/pom.xml
<spring.version>5.3.39</spring.version>
<surefire.version>3.5.6</surefire.version>
<tomcat.version>10.0.27</tomcat.version>
<testcontainers.version>2.0.5</testcontainers.version>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things on this new property: (1) 2.0.5 is a brand-new Testcontainers major — the mainstream line is 1.21.x. Please confirm 2.0.x is intentional (its API differs from 1.x) and not a typo for 1.20.5/1.21.x. (2) Minor: it's out of alphabetical order — testcontainers should come before tomcat.

@Suite.SuiteClasses({
IgniteRebalanceOnUpgradeTest.class
})
public class IgniteRollingUpgradeDockerTestSuite {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suite isn't referenced by any other suite or by the TC/CI config (grepped the tree — no references), so nothing runs it automatically; as committed it adds no regression coverage. What's the CI plan? Wiring it into TeamCity will have to handle: a Docker daemon on the agent, a prebuilt source image, the hardcoded SOURCE_COMMIT_HASH, and the macOS-specific networking below (Linux agents route to containers directly and don't resolve host.docker.internal by default).

@anton-vinogradov anton-vinogradov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empirically verified on macOS: the test does not pass as-is (10-min timeout). Concrete, environment-independent root cause inline — the injected test-classes jar is missing the plugin provider's anonymous inner class.

jar.deleteOnExit();

try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar))) {
for (String cls : TEST_CLASSES) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran this on macOS (Docker Desktop 28.0.1, JDK 17) — testRollingUpgrade currently fails: it times out after 600s, and the root cause is here, not the network.

This loop packs only the top-level classes listed in TEST_CLASSES, but TestCompatibilityPluginProvider has anonymous inner classes (return new IgnitePlugin() { … }, the Disabled*Processor lambdas, etc.), so TestCompatibilityPluginProvider$1.class is produced at compile time but never added to the injected jar. Every container node then dies on startup:

Failed to start grid: org/apache/ignite/compatibility/testframework/plugins/TestCompatibilityPluginProvider$1
  (java.lang.NoClassDefFoundError)

The container exits with code 255, never logs Node started, so Wait.forLogMessage(".*Node started.*") blocks until the test timeout. All three nodes fail identically — so the cluster never even forms.

getDeclaredClasses() won't help (anonymous classes aren't declared members); the fix has to follow the compiled .class files. For each class, locate its .class resource and also add sibling Name$*.class files, e.g.:

for (String cls : TEST_CLASSES) {
    String pkgPath = cls.substring(0, cls.lastIndexOf('.') + 1).replace('.', '/');
    String simple = cls.substring(cls.lastIndexOf('.') + 1);
    URL url = IgniteContainer.class.getClassLoader().getResource(cls.replace('.', '/') + ".class");
    if (url == null)
        throw new IOException("Class not found on classpath: " + cls);
    File dir = new File(url.toURI()).getParentFile();
    File[] files = dir.listFiles((d, n) -> n.equals(simple + ".class") || n.startsWith(simple + "$"));
    for (File f : files) {
        out.putNextEntry(new JarEntry(pkgPath + f.getName()));
        Files.copy(f.toPath(), out);
        out.closeEntry();
    }
}

(assumes a directory classpath, which holds for target/test-classes). I'm validating this fix locally now. Given the all-WIP history, it's worth double-checking the test was actually run green after TestCompatibilityPluginProvider gained the inner class.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update — verified end to end on macOS. The nested-class glob alone isn't enough: TestCompatibilityPluginProvider also references DisabledRollingUpgradeProcessor and DisabledValidationProcessor (new test-framework classes in this same package, absent from the old image), so the next node-startup failure is NoClassDefFoundError: DisabledRollingUpgradeProcessor — container exits 255 again.

With the glob fix plus adding those two classes to TEST_CLASSES, the test passes green:

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 485.8 s
BUILD SUCCESS

(all three nodes upgraded sequentially, data verified after rebalance). So the macOS networking actually works once the classes are present — the only blocker was the incomplete injection.

Takeaway: the hand-maintained TEST_CLASSES list is the fragile part — it silently rots whenever the plugin provider gains a class. Safer to inject the transitive closure (e.g. bundle the whole org.apache.ignite.compatibility.testframework.plugins package directory, nested classes included) instead of enumerating classes by hand.

anton-vinogradov and others added 4 commits June 26, 2026 11:39
…path, faster shutdown (#1)

- testClassesJar: inject each listed class together with its nested classes (the
  plugin provider's anonymous classes) and add DisabledRollingUpgradeProcessor /
  DisabledValidationProcessor. Without this the containerized nodes fail to start
  (NoClassDefFoundError) and the test times out after 600s on a clean checkout.

- OS-aware networking. Keep the macOS path (host<->container via published ports +
  ContainerAddressResolver + host.docker.internal) and add a direct Linux path:
  containers are reached at their Docker bridge IP, the host node is reached at the
  Docker network gateway. Removes the Docker-Desktop VM-proxy latency on Linux/CI.

- Container shutdown wait 60s -> 10s. Graceful SIGTERM does not complete (the image
  entrypoint does not forward it), so the node is force-stopped anyway; this saved
  ~50s per node (~423s -> ~250s end to end on macOS).

- SOURCE_COMMIT_HASH overridable via -Dru.source.commit.hash.

- assertTrue on the post-upgrade topology waitForCondition (was silently ignored);
  compare verified values as boxed Integer to avoid an NPE on a missing key.

Verified green on macOS. The Linux path compiles and its addressing primitives are
validated, but a full Linux/CI run is still required.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
- IgniteContainer.stop(): drop the unconditional "shut down gracefully"
  log. The image entrypoint does not forward SIGTERM, so the await times
  out and the node is force-stopped via super.stop(); the message
  contradicted the preceding "Proceeding with forceful stop." warning.

- IgniteContainer.testClassesJar(): null-check File.listFiles() (returns
  null on an unreadable dir) so a missing class directory fails with a
  clear IOException instead of an NPE in the loop.

- IgniteRebalanceOnUpgradeTest: compare the post-put value as a boxed
  Integer too (the get(1000)-loop already does), so a missing key fails
  the assertion instead of NPE-ing on unboxing.

- Scope the "bind communication to loopback" comment to macOS (on Linux
  it binds to 0.0.0.0) and drop a trailing-whitespace blank line.

No behavioral change on the macOS path.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

@anton-vinogradov anton-vinogradov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final review. Built locally on Java 17: the compatibility module test-compiles against testcontainers 2.0.5 and passes strict checkstyle (-Pcheckstyle, 0 errors); RAT is fine (it treats .sh as binary). I could not run the end-to-end test (no Docker on my machine, and the source image has to be built manually).

Inline I've left a handful of ready-to-apply suggestions (unused param, a misleading comment, an unclosed Files.list stream, and the missing ASF header in build_docker_image.sh) plus a few higher-level points: a unit test + log for the core TcpDiscoverySpi change, a Docker-availability guard so a stray mvn test doesn't fail hard, the WIP default for ru.source.commit.hash, and stale jars in the in-place libs swap. Nothing blocking — mostly polish. Thanks for the thorough DEVNOTES!

int port = addr.getPort();

// Each sequentially started host node binds the next port in the discovery (48500+) and
// communication (47100+) ranges; map them all to the Docker host address so the containers

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The host node's communication range is 49100+ (setLocalPort(49100 + idx) a few lines above), not 47100+ — the comment is misleading.

Suggested change
// communication (47100+) ranges; map them all to the Docker host address so the containers
// communication (49100+) ranges; map them all to the Docker host address so the containers

import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import for the try-with-resources fix below.

Suggested change
import java.util.regex.Pattern;
import java.util.regex.Pattern;
import java.util.stream.Stream;

Comment on lines +235 to +237
for (Path file : Files.list(targetLibsHostDir).toArray(Path[]::new))
if (Files.isRegularFile(file))
copyFileToContainer(forHostPath(file.toAbsolutePath().toString()), LIBS_DIR_PATH + file.getFileName().toString());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Files.list(...) returns a Stream that holds an open directory handle and must be closed (per its javadoc). Wrap it in try-with-resources (needs the java.util.stream.Stream import suggested above).

Suggested change
for (Path file : Files.list(targetLibsHostDir).toArray(Path[]::new))
if (Files.isRegularFile(file))
copyFileToContainer(forHostPath(file.toAbsolutePath().toString()), LIBS_DIR_PATH + file.getFileName().toString());
try (Stream<Path> files = Files.list(targetLibsHostDir)) {
for (Path file : files.toArray(Path[]::new))
if (Files.isRegularFile(file))
copyFileToContainer(forHostPath(file.toAbsolutePath().toString()), LIBS_DIR_PATH + file.getFileName().toString());
}

@@ -0,0 +1,156 @@
#!/bin/bash

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script is missing the ASF license header (the sibling build_distrib.sh has it). RAT classifies .sh as binary so it won't fail the build, but ASF source files are expected to carry the header.

Suggested change
#!/bin/bash
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

if (!node.equals(locNode))
addrs.removeIf(addr -> addr.getAddress().isLoopbackAddress() && locNode.socketAddresses().contains(addr));
addrs.removeIf(addr -> addr.getAddress() == null ||
(addr.getAddress().isLoopbackAddress() && locNode.socketAddresses().contains(addr)));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only production change in the PR, and it's a sensible NPE guard for unresolved addresses. A few asks:

  1. It's currently exercised only by the Docker smoke test, which isn't part of RunAll — could you add a focused unit test for getEffectiveNodeAddresses with an unresolved InetSocketAddress (getAddress() == null)?
  2. Dropping the address silently makes a fully-unresolvable node hard to diagnose; a log.debug(...) on the removed address would help.
  3. Since the rest of the PR is test infrastructure, consider landing this core change as its own commit / sub-ticket so it's reviewed independently.

/** */
@BeforeClass
public static void beforeClass() {
U.delete(LOCAL_WORK_DIR);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent surefire config uses <include>**/*.java</include>, so a plain mvn test on this module will pick up this *Test and fail hard when Docker (or the pre-built image) is unavailable, instead of being skipped. Consider guarding it so it self-skips:

import org.junit.Assume;
import org.testcontainers.DockerClientFactory;
// ...
@BeforeClass
public static void beforeClass() {
    Assume.assumeTrue("Docker is required for this test", DockerClientFactory.instance().isDockerAvailable());
    U.delete(LOCAL_WORK_DIR);
}


/** Source version image tag, overridable via {@code -Dru.source.commit.hash}. */
private static final String SOURCE_COMMIT_HASH = System.getProperty("ru.source.commit.hash",
"0ad4656eef09acda288cbad96f80f0138732d94a");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default ru.source.commit.hash points at a WIP commit of this branch rather than a released version, and apacheignite/ignite:<hash> must be built locally first. Consider defaulting to a released version tag (or clearly documenting that this baseline is throwaway).


for (Path file : Files.list(targetLibsHostDir).toArray(Path[]::new))
if (Files.isRegularFile(file))
copyFileToContainer(forHostPath(file.toAbsolutePath().toString()), LIBS_DIR_PATH + file.getFileName().toString());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note (no change required for the baseline used here): target jars are copied over the source libs/ without removing source-only jars. For a real version gap, renamed/removed jars from the old version would linger and could cause classpath conflicts. Worth a comment or an explicit libs/ cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants