On a moderately sized dataset (40,000 subsampled rows, 95 columns) with sample_count = 16, the xcat-merge stage fails with a curious exception:
Execution error (OutOfMemoryError) at jdk.internal.util.ArraysSupport/hugeLength (ArraysSupport.java:649).
Required array length 2147483640 + 10 is too large
Full report at:
/tmp/clojure-1762883991130383889.edn
ERROR: failed to reproduce 'xcat-merge': failed to run: clj -X gensql.structure-learning.clojurecat/merge :models data/xcat/complete :out data/xcat/xcat.merged.edn, exited with 1
The numbers involved suggest that some array-backed buffer had already grown to 2,147,483,640 elements and needed room for at least 10 more, which would exceed the largest value of a signed 32-bit integer (2,147,483,647). Note that this is a limit on the number of array elements, not on the amount of memory that can be allocated to an array. Apparently even the most current JVM specification (JVMS/SE22) defines array lengths in terms of the 32-bit int type (see the definition of int in the specification), and I guess they've either seen no reason to update it or have been afraid to because of compatibility issues.
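The arithmetic behind the message can be checked directly. This is only a sketch of the numbers in the error text, not of the JDK's internal code; the class and method names (ArrayLimitDemo, requiredLength, fitsInIntIndex) are made up for illustration.

```java
public class ArrayLimitDemo {
    // Adds the two numbers from the error message using 64-bit arithmetic,
    // so the sum itself cannot overflow.
    static long requiredLength(int oldLength, int minGrowth) {
        return (long) oldLength + minGrowth;
    }

    // An array length must be a non-negative value that fits in a Java int.
    static boolean fitsInIntIndex(long length) {
        return length >= 0 && length <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int oldLength = 2147483640; // first number in the error message
        int minGrowth = 10;         // second number in the error message

        long required = requiredLength(oldLength, minGrowth);
        System.out.println("Integer.MAX_VALUE       = " + Integer.MAX_VALUE);
        System.out.println("required length         = " + required);
        System.out.println("fits in an int?           " + fitsInIntIndex(required));
        // The same sum in 32-bit arithmetic silently wraps to a negative number.
        System.out.println("32-bit sum wraps around to " + (oldLength + minGrowth));
    }
}
```

The required length (2,147,483,650) is just past Integer.MAX_VALUE (2,147,483,647), so no heap setting such as -Xmx can make the allocation succeed.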
As a result, the array size limit is not configurable, which seems like it will lead to scalability problems in the future. I can't help wondering whether someone has a version of the JVM that has been rebuilt with 64-bit array lengths in mind. We can't be the first people to have encountered this, but poking around on Stack Overflow and similar sites still shows most people running into virtual memory limits, not the array size limit, though they sometimes conflate the two.
Another option would be to redesign gensql.structure-learning.clojurecat/merge. I'm not sure how difficult that would be, but it seems we're going to hit this limit more often as larger datasets become normal, or (maybe) when we start working with long-lived, incrementally updated LPMs and need to do some kind of rejuvenation process.
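To make the redesign idea a bit more concrete, here is a rough Java sketch of one direction. It assumes, and this is only a guess on my part, that the oversized array is an in-memory buffer holding the entire merged output. The class StreamingMergeSketch, the method writeMerged, and the choice to wrap the models in a single EDN vector are all hypothetical and not taken from gensql.structure-learning.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class StreamingMergeSketch {
    // Hypothetical: writes each already-serialized model to the output as soon
    // as it is available, so no single array ever holds the whole merged result.
    // Returns the total number of model characters written.
    static long writeMerged(Iterable<String> serializedModels, Writer out) {
        long chars = 0; // long, so the running total cannot overflow an int
        try {
            out.write("[");
            for (String model : serializedModels) {
                out.write(model);
                out.write("\n");
                chars += model.length();
            }
            out.write("]");
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chars;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder models and a temporary output file, for illustration only.
        Path outPath = Files.createTempFile("xcat-merged", ".edn");
        try (Writer w = Files.newBufferedWriter(outPath, StandardCharsets.UTF_8)) {
            long n = writeMerged(List.of("{:model 1}", "{:model 2}"), w);
            System.out.println("wrote " + n + " model characters to " + outPath);
        }
    }
}
```

Whether this helps depends on where the array actually grows. If the limit is hit while reading or combining the models rather than while writing them out, a different restructuring would be needed.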