This workspace benchmarks PageRank memory usage on the LiveJournal graph stored
as CSR arrays in livejournal-csr.duckdb.
The benchmark expects this DuckDB file in the repository root:
livejournal-csr.duckdb
You can download it from huggingface. Use xz to uncompress.
It contains:
nodes: 3,997,962
edges: 69,362,378
directed: false
CSR tables used by the benchmark:
livejournal_metadata
livejournal_indptr_edges
livejournal_indices_edges
livejournal_mapping_user
livejournal_nodes_user
Python dependencies are managed with uv.
uv syncGraphFrames also needs Java and Spark. The runs below used:
Java: /usr/lib/jvm/java-21-openjdk-amd64
Spark: /home/ubuntu/comparison/spark-4.1.1-bin-hadoop3
GraphFrames Python package: graphframes-py 0.11.0
GraphFrames JVM package: io.graphframes:graphframes-spark4_2.13:0.11.0
PySpark: 4.1.1
icebug: 12.6
Set these before running GraphFrames:
export JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64
export SPARK_HOME=/home/ubuntu/comparison/spark-4.1.1-bin-hadoop3
export PATH="$SPARK_HOME/bin:$PATH"The benchmark script is:
benchmark_pagerank_memory.py
It launches each engine in a fresh child process and samples RSS for the whole process tree. This is important for GraphFrames because most memory is held by Spark's JVM, not the Python process.
Default PageRank settings:
resetProbability = 0.15
tol = 0.01
maxIter = unset
For icebug/NetworKit, the damping factor is set to 1 - resetProbability.
uv run python benchmark_pagerank_memory.py compare \
--engine icebug \
--output icebug-memory-results.csvTo constrain DuckDB buffering while loading the Arrow CSR arrays, pass
--duckdb-memory-limit:
uv run python benchmark_pagerank_memory.py compare \
--engine icebug \
--duckdb-memory-limit 200MB \
--output icebug-memory-duckdb-200mb-results.csvuv run python benchmark_pagerank_memory.py compare \
--engine graphframes-load \
--spark-driver-memory 16g \
--output graphframes-load-results.csvThis builds GraphFrame(v, e) from Spark DataFrames and then forces Spark work
with:
graph.vertices.count()
graph.edges.count()
graph.degrees.count()
graph.degrees.selectExpr("sum(degree)").collect()These commands call the GraphFrames PageRank path. This is the relevant mode
for Pregel/BSP-style graph algorithms. The graphframes-load mode above is a
separate load/materialization benchmark.
uv run python benchmark_pagerank_memory.py compare \
--engine graphframes \
--spark-driver-memory 16g \
--output graphframes-memory-results.csvuv run python benchmark_pagerank_memory.py compare \
--engine graphframes \
--spark-driver-memory 4g \
--output graphframes-memory-4g-results.csvuv run python benchmark_pagerank_memory.py compare \
--engine graphframes \
--spark-driver-memory 8g \
--output graphframes-memory-8g-results.csvAll rows use the same LiveJournal graph. icebug and GraphFrames PageRank rows
use resetProbability=0.15, tol=0.01, and no maxIter. The
graphframes-load row is a separate load/materialization benchmark.
| Engine | Driver heap | Status | Peak RSS | Total time | PageRank time | Notes |
|---|---|---|---|---|---|---|
| icebug/NetworKit | n/a | ok | 3.32 GiB | 28.93s | 0.25s | default DuckDB memory |
| icebug/NetworKit | n/a | ok | 2.18 GiB | 30.86s | 0.28s | DuckDB memory_limit=1GB |
| icebug/NetworKit | n/a | ok | 1.48 GiB | 32.65s | 0.24s | DuckDB memory_limit=200MB |
| GraphFrames load/materialize | 16g | ok | 11.97 GiB | 24.42s | n/a | counts plus degree aggregation |
| GraphFrames PageRank / Pregel-BSP | 16g | ok | 17.50 GiB | 231.75s | 218.37s | algorithm mode |
| GraphFrames | 8g | failed | 12.04 GiB | n/a | n/a | Java heap OOM |
| GraphFrames | 4g | failed | 12.04 GiB | n/a | n/a | Java heap OOM |
Successful run details:
| Engine | Load/prepare time | Build time | Nodes | Edges/result edges |
|---|---|---|---|---|
| icebug/NetworKit, default DuckDB memory | 1.88s load | 26.79s | 3,997,962 | 69,362,378 |
icebug/NetworKit, DuckDB memory_limit=1GB |
4.17s load | 26.42s | 3,997,962 | 69,362,378 |
icebug/NetworKit, DuckDB memory_limit=200MB |
5.67s load | 26.74s | 3,997,962 | 69,362,378 |
| GraphFrames load/materialize 16g | 7.73s prepare | 5.71s setup, 10.98s materialize | 3,997,962 | 69,362,378 |
GraphFrames load/materialization details:
vertex count: 3,997,962
edge count: 69,362,378
degree rows: 3,997,962
degree sum: 138,724,756
count time: 1.46s
degree aggregation time: 9.52s
The lower-memory GraphFrames attempts failed with:
java.lang.OutOfMemoryError: Java heap space
The OOM occurred during GraphFrames/GraphX edge partition construction and PageRank startup, before PageRank results were produced.
Computed result CSVs:
icebug-memory-results.csv
icebug-memory-results-rerun.csv
icebug-memory-duckdb-1gb-results.csv
icebug-memory-duckdb-200mb-results.csv
graphframes-load-results.csv
graphframes-memory-results.csv
graphframes-memory-4g-results.csv
graphframes-memory-8g-results.csv
The 16g GraphFrames run created temporary Parquet files under /tmp for the
converted vertices and edges. The benchmark cleans up those temporary files at
the end of the run unless --keep-edge-parquet is passed.
GraphFrames does not consume the CSR arrays directly. The benchmark first
exports vertices and edges from the CSR DuckDB tables to Parquet, then loads
those Parquet files into Spark DataFrames and builds a GraphFrame.
There are two GraphFrames modes:
graphframes PageRank / Pregel-BSP-style algorithm benchmark
graphframes-load GraphFrame load/materialization benchmark
graphframes-load does not call PageRank. It creates GraphFrame(v, e) and
then forces Spark execution with counts and degree aggregation so lazy DataFrame
planning does not hide the load and graph materialization cost.
icebug/NetworKit constructs the graph directly from the Arrow CSR buffers. The benchmark keeps those Arrow buffers alive for the lifetime of the NetworKit graph.
DuckDB's default memory behavior materially affects peak RSS while loading the
Arrow arrays. Lowering DuckDB's memory limit reduced peak icebug RSS from
3.32 GiB to 1.48 GiB, with CSR load time increasing from 1.88s to
5.67s.