GraphFrames vs icebug PageRank Memory Benchmark

This workspace benchmarks PageRank memory usage on the LiveJournal graph stored as CSR arrays in livejournal-csr.duckdb.

Data

The benchmark expects this DuckDB file in the repository root:

livejournal-csr.duckdb

You can download it from huggingface. Use xz to uncompress.

It contains:

nodes: 3,997,962
edges: 69,362,378
directed: false

CSR tables used by the benchmark:

livejournal_metadata
livejournal_indptr_edges
livejournal_indices_edges
livejournal_mapping_user
livejournal_nodes_user

Environment

Python dependencies are managed with uv.

uv sync

GraphFrames also needs Java and Spark. The runs below used:

Java: /usr/lib/jvm/java-21-openjdk-amd64
Spark: /home/ubuntu/comparison/spark-4.1.1-bin-hadoop3
GraphFrames Python package: graphframes-py 0.11.0
GraphFrames JVM package: io.graphframes:graphframes-spark4_2.13:0.11.0
PySpark: 4.1.1
icebug: 12.6

Set these before running GraphFrames:

export JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64
export SPARK_HOME=/home/ubuntu/comparison/spark-4.1.1-bin-hadoop3
export PATH="$SPARK_HOME/bin:$PATH"

Benchmark Driver

The benchmark script is:

benchmark_pagerank_memory.py

It launches each engine in a fresh child process and samples RSS for the whole process tree. This is important for GraphFrames because most memory is held by Spark's JVM, not the Python process.

Default PageRank settings:

resetProbability = 0.15
tol = 0.01
maxIter = unset

For icebug/NetworKit, the damping factor is set to 1 - resetProbability.

Reproduce

icebug

uv run python benchmark_pagerank_memory.py compare \
  --engine icebug \
  --output icebug-memory-results.csv

To constrain DuckDB buffering while loading the Arrow CSR arrays, pass --duckdb-memory-limit:

uv run python benchmark_pagerank_memory.py compare \
  --engine icebug \
  --duckdb-memory-limit 200MB \
  --output icebug-memory-duckdb-200mb-results.csv

GraphFrames, 16g driver heap

uv run python benchmark_pagerank_memory.py compare \
  --engine graphframes-load \
  --spark-driver-memory 16g \
  --output graphframes-load-results.csv

This builds GraphFrame(v, e) from Spark DataFrames and then forces Spark work with:

graph.vertices.count()
graph.edges.count()
graph.degrees.count()
graph.degrees.selectExpr("sum(degree)").collect()

GraphFrames PageRank / Pregel-BSP mode

These commands call the GraphFrames PageRank path. This is the relevant mode for Pregel/BSP-style graph algorithms. The graphframes-load mode above is a separate load/materialization benchmark.

uv run python benchmark_pagerank_memory.py compare \
  --engine graphframes \
  --spark-driver-memory 16g \
  --output graphframes-memory-results.csv

uv run python benchmark_pagerank_memory.py compare \
  --engine graphframes \
  --spark-driver-memory 4g \
  --output graphframes-memory-4g-results.csv

uv run python benchmark_pagerank_memory.py compare \
  --engine graphframes \
  --spark-driver-memory 8g \
  --output graphframes-memory-8g-results.csv

Results

All rows use the same LiveJournal graph. icebug and GraphFrames PageRank rows use resetProbability=0.15, tol=0.01, and no maxIter. The graphframes-load row is a separate load/materialization benchmark.

Engine	Driver heap	Status	Peak RSS	Total time	PageRank time	Notes
icebug/NetworKit	n/a	ok	3.32 GiB	28.93s	0.25s	default DuckDB memory
icebug/NetworKit	n/a	ok	2.18 GiB	30.86s	0.28s	DuckDB `memory_limit=1GB`
icebug/NetworKit	n/a	ok	1.48 GiB	32.65s	0.24s	DuckDB `memory_limit=200MB`
GraphFrames load/materialize	16g	ok	11.97 GiB	24.42s	n/a	counts plus degree aggregation
GraphFrames PageRank / Pregel-BSP	16g	ok	17.50 GiB	231.75s	218.37s	algorithm mode
GraphFrames	8g	failed	12.04 GiB	n/a	n/a	Java heap OOM
GraphFrames	4g	failed	12.04 GiB	n/a	n/a	Java heap OOM

Successful run details:

Engine	Load/prepare time	Build time	Nodes	Edges/result edges
icebug/NetworKit, default DuckDB memory	1.88s load	26.79s	3,997,962	69,362,378
icebug/NetworKit, DuckDB `memory_limit=1GB`	4.17s load	26.42s	3,997,962	69,362,378
icebug/NetworKit, DuckDB `memory_limit=200MB`	5.67s load	26.74s	3,997,962	69,362,378
GraphFrames load/materialize 16g	7.73s prepare	5.71s setup, 10.98s materialize	3,997,962	69,362,378

GraphFrames load/materialization details:

vertex count: 3,997,962
edge count: 69,362,378
degree rows: 3,997,962
degree sum: 138,724,756
count time: 1.46s
degree aggregation time: 9.52s

The lower-memory GraphFrames attempts failed with:

java.lang.OutOfMemoryError: Java heap space

The OOM occurred during GraphFrames/GraphX edge partition construction and PageRank startup, before PageRank results were produced.

Output Files

Computed result CSVs:

icebug-memory-results.csv
icebug-memory-results-rerun.csv
icebug-memory-duckdb-1gb-results.csv
icebug-memory-duckdb-200mb-results.csv
graphframes-load-results.csv
graphframes-memory-results.csv
graphframes-memory-4g-results.csv
graphframes-memory-8g-results.csv

The 16g GraphFrames run created temporary Parquet files under /tmp for the converted vertices and edges. The benchmark cleans up those temporary files at the end of the run unless --keep-edge-parquet is passed.

Notes

GraphFrames does not consume the CSR arrays directly. The benchmark first exports vertices and edges from the CSR DuckDB tables to Parquet, then loads those Parquet files into Spark DataFrames and builds a GraphFrame.

There are two GraphFrames modes:

graphframes      PageRank / Pregel-BSP-style algorithm benchmark
graphframes-load GraphFrame load/materialization benchmark

graphframes-load does not call PageRank. It creates GraphFrame(v, e) and then forces Spark execution with counts and degree aggregation so lazy DataFrame planning does not hide the load and graph materialization cost.

icebug/NetworKit constructs the graph directly from the Arrow CSR buffers. The benchmark keeps those Arrow buffers alive for the lifetime of the NetworKit graph.

DuckDB's default memory behavior materially affects peak RSS while loading the Arrow arrays. Lowering DuckDB's memory limit reduced peak icebug RSS from 3.32 GiB to 1.48 GiB, with CSR load time increasing from 1.88s to 5.67s.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GraphFrames vs icebug PageRank Memory Benchmark

Data

Environment

Benchmark Driver

Reproduce

icebug

GraphFrames, 16g driver heap

GraphFrames PageRank / Pregel-BSP mode

Results

Output Files

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
README.md		README.md
benchmark_pagerank_memory.py		benchmark_pagerank_memory.py
graphframes-load-results.csv		graphframes-load-results.csv
graphframes-memory-4g-results.csv		graphframes-memory-4g-results.csv
graphframes-memory-8g-results.csv		graphframes-memory-8g-results.csv
graphframes-memory-results.csv		graphframes-memory-results.csv
icebug-memory-duckdb-1gb-results.csv		icebug-memory-duckdb-1gb-results.csv
icebug-memory-duckdb-200mb-results.csv		icebug-memory-duckdb-200mb-results.csv
icebug-memory-results-rerun.csv		icebug-memory-results-rerun.csv
icebug-memory-results.csv		icebug-memory-results.csv
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

GraphFrames vs icebug PageRank Memory Benchmark

Data

Environment

Benchmark Driver

Reproduce

icebug

GraphFrames, 16g driver heap

GraphFrames PageRank / Pregel-BSP mode

Results

Output Files

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages