While investigating #215 I found that even if I avoid the hard-coded JVM maximum array elements issue, I still get out-of-memory errors from xcat-merge. The exception occurs even though the process is free to allocate up to 704GB of heap memory:
clj -J-Xms704G -J-Xmx704G -X gensql.structure-learning.clojurecat/merge
yet at the point of the exception it is using only 92GB of RAM on a system with 768GB available, so less than 15%:
Execution error (OutOfMemoryError) at java.util.Arrays/copyOf (Arrays.java:3537).
Requested array size exceeds VM limit
Full report at:
/tmp/clojure-17116085737351908381.edn
ERROR: failed to reproduce 'xcat-merge': failed to run: clj -J-Xms704G -J-Xmx704G -X gensql.structure-learning.clojurecat/merge :models data/xcat/complete :out data/xcat/xcat.merged.edn, exited with 1
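One thing worth ruling out from the trace above: on HotSpot, "Requested array size exceeds VM limit" is typically raised when a single array request exceeds the VM's per-array length cap, whereas plain heap exhaustion is normally reported as "Java heap space". A minimal sketch for sorting OutOfMemoryError messages by likely cause follows. The function name classify_oom is made up for illustration, and the mapping is an assumption based on the usual HotSpot message strings:

```shell
#!/bin/sh
# classify_oom: hypothetical helper, not part of GenSQL or the JDK.
# Maps an OutOfMemoryError message to a rough category. The message strings
# are the usual HotSpot ones; other JVMs may word them differently.
classify_oom() {
  case "$1" in
    *"Requested array size exceeds VM limit"*) echo "array-length-limit" ;;
    *"Java heap space"*)                       echo "heap-exhausted" ;;
    *"GC overhead limit exceeded"*)            echo "gc-thrashing" ;;
    *"Metaspace"*)                             echo "metaspace" ;;
    *)                                         echo "unknown" ;;
  esac
}

# The message from the report above.
classify_oom "Requested array size exceeds VM limit"
```

If the category is array-length-limit, raising -Xmx or ulimit values would not be expected to help, since that cap applies per array rather than to the heap as a whole.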
The initial suspect was the ulimit(1) parameter max locked memory, which limits the amount of physical RAM that a process can lock into memory. My working assumption was that when the JVM allocates objects on its heap, it attempts to lock the allocated memory into physical RAM, and fails with an out-of-memory error if it can't. The max locked memory value was 98GB (1/8 of total RAM, a default that probably derives from the kernel's initial setting for RLIMIT_MEMLOCK, though I'm not sure where it is set):
ubuntu@ip-172-31-20-126:~/GenSQL.structure-learning$ ulimit -H -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3062112
max locked memory (kbytes, -l) 97989500 <----- HERE
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 3062112
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
and that number is close enough to the observed total memory usage that this seemed a likely cause. However, increasing this limit by editing /etc/security/limits.conf to add:
* - memlock 256000000
and rebooting the instance did not help, even though the change to both the hard and soft max locked memory (memlock) limits is confirmed:
ubuntu@ip-172-31-20-126:~/GenSQL.structure-learning$ ulimit -H -l
256000000
ubuntu@ip-172-31-20-126:~/GenSQL.structure-learning$ ulimit -S -l
256000000
That is, xcat-merge still fails with the above memory error from the JVM.
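The ulimit output above describes the interactive shell, which may not be what the JVM launched by the pipeline inherits. On Linux the limits of a running process can be read from /proc/<pid>/limits. A sketch, assuming that file's usual column layout; memlock_of and the pgrep pattern in the usage note are illustrative, not taken from the project:

```shell
#!/bin/sh
# memlock_of: hypothetical helper. Prints "soft hard" for the
# "Max locked memory" row of a /proc/<pid>/limits-style file.
# Values are in bytes (or "unlimited"), unlike ulimit -l, which reports kbytes.
memlock_of() {
  awk '/^Max locked memory/ { print $4, $5 }' "$1"
}

# Example on the current shell; only meaningful where /proc is available.
if [ -r "/proc/$$/limits" ]; then
  memlock_of "/proc/$$/limits"
fi
```

For the JVM itself, something like `memlock_of /proc/$(pgrep -f clojurecat | head -n 1)/limits`, run while xcat-merge is active, would show the limits it actually inherited.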
So it appears that something else is limiting how much RAM the JVM is allowed to dynamically allocate to heap objects, even though the JVM options and the system configuration allow vastly more. It's not yet clear what the culprit is, but I'm adding this issue to record the situation before spending the rest of the day in meetings.
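A few other places a memory ceiling can come from on Linux are cgroup limits, the kernel overcommit policy, and the per-process cap on memory mappings. A sketch that dumps them for inspection follows. It assumes cgroup v2 mounted at /sys/fs/cgroup, the helper name show is made up, and none of these settings is confirmed as the cause here:

```shell
#!/bin/sh
# show: hypothetical helper that prints "<path>: <contents>", or "n/a" when
# the file is missing or unreadable (e.g. cgroup v1 hosts or non-Linux systems).
show() {
  if [ -r "$1" ]; then
    printf '%s: %s\n' "$1" "$(cat "$1")"
  else
    printf '%s: n/a\n' "$1"
  fi
}

# cgroup v2 path of the current process, e.g. "/user.slice/...".
# Left empty when /proc/self/cgroup has no "0::" entry (cgroup v1) or is absent.
cg=$(sed -n 's/^0:://p' /proc/self/cgroup 2>/dev/null || true)

show "/sys/fs/cgroup${cg}/memory.max"   # cgroup memory ceiling ("max" means none)
show /proc/sys/vm/overcommit_memory     # 0 heuristic, 1 always, 2 strict
show /proc/sys/vm/max_map_count         # max memory mappings per process
```

Running this from the same environment that launches xcat-merge matters, since a service manager or container may place the job in a different cgroup than an interactive login.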
One reason it's important to figure this out is that, up to now, the intuitive response to an out-of-memory situation has been to scale up, buying hours on instances with more RAM. This situation indicates that the real problem is some setting that limits RAM usage to a fraction of what's available on the system, yet proportional to total RAM. Scaling up therefore "looks" like a solution, but in fact we've probably just been overestimating the RAM requirements for structure learning and spending money unnecessarily.