Xcat-Merge Stage Fails due to JVM Hard-Coded Limitation on Array Sizes Regardless of Available RAM #215

@neeshjaa

On a moderately sized dataset (40,000 subsampled rows, 95 columns) with sample_count = 16, the xcat-merge stage fails with a curious exception:

Execution error (OutOfMemoryError) at jdk.internal.util.ArraysSupport/hugeLength (ArraysSupport.java:649).
Required array length 2147483640 + 10 is too large

Full report at:
/tmp/clojure-1762883991130383889.edn
ERROR: failed to reproduce 'xcat-merge': failed to run: clj -X gensql.structure-learning.clojurecat/merge :models data/xcat/complete :out data/xcat/xcat.merged.edn, exited with 1

The numbers involved suggest that the JVM was asked to grow an array that already held 2,147,483,640 elements by at least 10 more, which would push its length past the largest value a signed 32-bit integer can hold (2,147,483,647). Note this is a limitation on the number of array elements, not on the amount of memory that can be allocated to an array. Apparently even the most current JVM specification (JVMS/SE22) indexes arrays with int, which it defines as a 32-bit signed integer regardless of the underlying architecture (see the definition of int here), and I guess they've either seen no reason to update it or have been afraid to because of compatibility issues.
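
To sanity-check the arithmetic, here is a small sketch. It is not the JDK's code, and the class and method names are made up for illustration. It assumes only that the two numbers in the message are the current array length and the minimum number of extra elements requested.

```java
// Simplified sketch, NOT the JDK source: reproduces the arithmetic implied by
// "Required array length 2147483640 + 10 is too large".
final class ArrayGrowthCheck {

    // Array lengths are ints, so no array can hold more than Integer.MAX_VALUE elements.
    static final long MAX_INT = Integer.MAX_VALUE;

    /** Returns true if growing by minGrowth needs a length that does not fit in an int. */
    static boolean exceedsIntRange(int oldLength, int minGrowth) {
        // Widen to long first so the sum itself cannot wrap around.
        long required = (long) oldLength + (long) minGrowth;
        return required > MAX_INT;
    }

    public static void main(String[] args) {
        int oldLength = 2147483640; // first number in the error message
        int minGrowth = 10;         // second number in the error message

        System.out.println("required length: " + ((long) oldLength + minGrowth));
        System.out.println("int limit:       " + MAX_INT);
        // The same sum done in 32-bit int arithmetic wraps to a negative value.
        System.out.println("int sum wraps to " + (oldLength + minGrowth));
        System.out.println("too large:       " + exceedsIntRange(oldLength, minGrowth));
    }
}
```

The required length works out to 2,147,483,650, which is 3 more than an int can represent, so no heap setting (-Xmx or otherwise) can make the allocation succeed.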

As a result, the array size limitation is not configurable, which seems like it will lead to scalability problems in the future. I can't help wondering if someone has a version of the JVM that has been rebuilt with 64-bit integers in mind. We can't be the first people to have ever encountered this, but poking around on StackOverflow, etc., still shows most people running into virtual memory limits, not the array size limit, though they sometimes conflate the two.

Another option would be to redesign gensql.structure-learning.clojurecat/merge. I'm not sure how difficult that would be, but it seems we're going to hit this limit more often as larger datasets become normal, or (maybe) when we start working with long-lived, incrementally updated LPMs and need to do some kind of rejuvenation process.
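
One possible shape for such a redesign, purely as a sketch: the report above doesn't show the full stack trace, so I'm assuming (without having confirmed it) that the oversized array is a buffer holding the entire serialized output. If so, writing the merged result piece by piece would avoid growing any single array to the size of the whole file. Everything named below (StreamingMergeSketch, writeMerged, the toy model strings) is hypothetical and not taken from gensql.structure-learning.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;

// Hypothetical sketch: none of these names come from gensql.structure-learning.
final class StreamingMergeSketch {

    /**
     * Writes each already-serialized model to the writer in turn, wrapped in an
     * EDN-style vector. Only one model's text is handled at a time, so no single
     * buffer has to grow to the size of the whole merged output.
     */
    static void writeMerged(Iterable<String> serializedModels, Writer out) throws IOException {
        out.write("[");
        boolean first = true;
        for (String model : serializedModels) {
            if (!first) {
                out.write("\n ");
            }
            out.write(model);
            first = false;
        }
        out.write("]\n");
    }

    public static void main(String[] args) throws IOException {
        // StringWriter is only a stand-in for the demo; a real run would use a
        // buffered file writer so the output never lives in memory as one string.
        StringWriter out = new StringWriter();
        writeMerged(List.of("{:id 0}", "{:id 1}"), out);
        System.out.print(out);
    }
}
```

Whether this helps depends on where merge actually allocates the array. If the limit is hit while building an in-memory collection rather than while serializing, the data structure itself would need to be split up instead.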
