Unexpected Out-of-Memory Exception in Xcat-Merge #216

@neeshjaa

While investigating #215, I found that even when I avoid the hard-coded JVM limit on array elements, I continue to get out-of-memory errors from xcat-merge. The exception occurs even though the process is free to allocate up to 704GB of heap memory:
clj -J-Xms704G -J-Xmx704G -X gensql.structure-learning.clojurecat/merge
yet at the point of the exception it is using only 92GB of RAM on a system with 768GB available, so less than 15%:

Execution error (OutOfMemoryError) at java.util.Arrays/copyOf (Arrays.java:3537).
Requested array size exceeds VM limit

Full report at:
/tmp/clojure-17116085737351908381.edn
ERROR: failed to reproduce 'xcat-merge': failed to run: clj -J-Xms704G -J-Xmx704G -X gensql.structure-learning.clojurecat/merge :models data/xcat/complete :out data/xcat/xcat.merged.edn, exited with 1
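
The usage figure above can be re-derived, and re-measured on a live process, with a few lines of shell. This is only a sketch: the helper names `pct_of_total`, `rss_kb`, and `mem_total_kb` are made up for illustration, and the `/proc` reads assume Linux.

```shell
#!/bin/sh
# Hypothetical helpers, not part of GenSQL.structure-learning.

# Print USED as a percentage of TOTAL (same units), to one decimal place.
pct_of_total() {
    awk -v used="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * used / total }'
}

# Resident set size of process PID in kB, as reported by the kernel.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Total physical RAM in kB, as reported by the kernel.
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Figures quoted in this report: 92GB in use on a 768GB machine.
pct_of_total 92 768    # prints 12.0
```

To measure a running merge instead of using the quoted figures, something like `pct_of_total "$(rss_kb "$JVM_PID")" "$(mem_total_kb)"` should work, where `JVM_PID` is a placeholder for the PID of the java process that `clj` starts.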

The initial suspect was the ulimit(1) parameter max locked memory, which limits how much memory a process can lock into physical RAM. Apparently, when the JVM allocates objects on its heap, it attempts to lock the allocated memory into physical RAM, and if it can't, it fails with an out-of-memory error. The max locked memory value was 98GB (about 1/8 of total RAM, a default that probably derives from the kernel's initial setting for RLIMIT_MEMLOCK, though I'm not sure where it is set):

ubuntu@ip-172-31-20-126:~/GenSQL.structure-learning$ ulimit -H -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 3062112
max locked memory           (kbytes, -l) 97989500   <----- HERE
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) unlimited
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 3062112
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

and that number is close enough to the observed total memory usage that this seemed a likely cause. However, increasing this limit by editing /etc/security/limits.conf to add:
* - memlock 256000000
and rebooting the instance did not help, even though the change to both the hard and soft max locked memory (memlock) limits is confirmed:

ubuntu@ip-172-31-20-126:~/GenSQL.structure-learning$ ulimit -H -l
256000000
ubuntu@ip-172-31-20-126:~/GenSQL.structure-learning$ ulimit -S -l
256000000

That is, xcat-merge still fails with the above memory error from the JVM.
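
The `ulimit` output above describes the interactive shell. Entries in /etc/security/limits.conf are applied at session start by pam_limits, so it may be worth confirming that the JVM process itself inherited the new value. A minimal sketch, assuming Linux; `limit_row` is a hypothetical helper name. Note that /proc/<pid>/limits reports locked memory in bytes, whereas `ulimit -l` reports kbytes.

```shell
#!/bin/sh
# Hypothetical helper: print "soft hard units" for one row of a
# /proc/<pid>/limits table read from standard input.
# NAME is the row label, e.g. "Max locked memory".
limit_row() {
    awk -v name="$1" 'index($0, name) == 1 {
        rest = substr($0, length(name) + 1)   # drop the row label
        split(rest, f, " ")                   # f[1]=soft, f[2]=hard, f[3]=units
        print f[1], f[2], f[3]
    }'
}

# Example: inspect the current shell. For the merge job, replace $$ with
# the PID of the running java process.
if [ -r "/proc/$$/limits" ]; then
    limit_row "Max locked memory" < "/proc/$$/limits"
fi
```

If the limits.conf change reached the JVM, the row should read 262144000000 bytes (256000000 kbytes × 1024) for both the soft and hard values.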

So it appears that something else is limiting the amount of RAM that the JVM is allowed to allocate dynamically to heap objects, even though the JVM options and the system configuration allow vastly more to be used. It's not yet clear what the culprit is, but I'm adding this issue to record the situation before spending the rest of the day in meetings.
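
One candidate for "something else" that is cheap to rule out is a cgroup memory limit, which applies independently of `ulimit`. The following is a sketch only, assuming a Linux host using the cgroup v2 unified hierarchy mounted at /sys/fs/cgroup; the function names are hypothetical, and nothing in this report confirms that a cgroup limit is involved.

```shell
#!/bin/sh
# Hypothetical helpers for checking cgroup v2 memory limits.

# Extract the cgroup v2 path from /proc/<pid>/cgroup content on stdin.
# The unified hierarchy is the line that starts with "0::".
cgroup_v2_path() {
    awk '/^0::/ { sub(/^0::/, ""); print }'
}

# Walk from the process's cgroup up towards the root, printing every
# memory.max found. A value of "max" means no limit at that level.
print_memory_limits() {
    path=$(cgroup_v2_path < "/proc/$1/cgroup")
    while [ -n "$path" ] && [ "$path" != "/" ]; do
        f="/sys/fs/cgroup$path/memory.max"
        if [ -r "$f" ]; then
            printf '%s: %s\n' "$path" "$(cat "$f")"
        fi
        path=$(dirname "$path")
    done
}

# Example: inspect the current shell. For the merge job, pass the PID of
# the running java process instead of $$.
if [ -r "/proc/$$/cgroup" ]; then
    print_memory_limits "$$"
fi
```

If every level prints `max`, or nothing is printed at all, cgroup memory limits can probably be crossed off the list.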

One reason it's important to figure this out is that, up to now, the intuitive response to an out-of-memory situation has been to scale up by buying hours on instances with more RAM. This situation indicates that the real problem is some setting that limits RAM usage to a fraction of what's available on the system, a fraction that is nevertheless proportional to total RAM. Scaling up therefore "looks" like a solution, but in fact we've probably just been overestimating the RAM requirements for structure learning and spending money unnecessarily.
