Merged
71 changes: 58 additions & 13 deletions benchmarks/README.md
@@ -119,7 +119,6 @@ You can also invoke the helper directly if you need to customise arguments furth
./benchmarks/compile_profile.py --profiles dev release --data /path/to/tpch_sf1
```


## Benchmark with modified configurations

### Select join algorithm
@@ -147,6 +146,19 @@ To verify that datafusion picked up your configuration, run the benchmarks with

## Comparing performance of main and a branch

For TPC-H:
```shell
./benchmarks/compare_tpch.sh main mybranch
```

For TPC-DS.
To get the data for `DATA_DIR`, follow the instructions from `./benchmarks/bench.sh data tpcds`:
```shell
DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/compare_tpcds.sh main mybranch
```
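
Because the TPC-DS comparison depends on `DATA_DIR`, it can help to confirm that the directory holds parquet files before starting two full benchmark runs. The following is a minimal sketch that mirrors the check `bench.sh` performs in `run_tpcds`; the helper name `has_parquet_data` is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not part of bench.sh): succeeds only if the directory
# exists and contains at least one .parquet file somewhere beneath it.
has_parquet_data() {
    local dir="$1"
    [ -d "${dir}" ] || return 1
    find "${dir}" -name "*.parquet" -print -quit | grep -q .
}

# Example guard before a comparison run.
if has_parquet_data "${DATA_DIR:-}"; then
    echo "TPC-DS data found in ${DATA_DIR}"
else
    echo "No parquet files found in '${DATA_DIR:-}'" >&2
fi
```

If the guard fails, `./benchmarks/bench.sh data tpcds` prints where to obtain the data.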

Alternatively, you can compare manually by following the example below:

```shell
git checkout main

@@ -299,7 +311,6 @@ This will produce output like:
└──────────────┴──────────────┴──────────────┴───────────────┘
```


# Benchmark Runner

The `dfbench` program contains subcommands to run the various
@@ -339,24 +350,28 @@ FLAGS:
```

# Profiling Memory Stats for each benchmark query

The `mem_profile` program wraps benchmark execution to measure memory usage statistics, such as peak RSS. It runs each benchmark query in a separate subprocess, capturing the child process’s stdout to print structured output.

Subcommands supported by `mem_profile` are a subset of those in `dfbench`.
Currently supported benchmarks include: Clickbench, H2o, Imdb, SortTpch, Tpch, TPCDS

Before running benchmarks, `mem_profile` automatically compiles the benchmark binary (`dfbench`) using `cargo build`. Note that the build profile used for `dfbench` is not tied to the profile used for running `mem_profile` itself. We can explicitly specify the desired build profile using the `--bench-profile` option (e.g. release-nonlto). By prebuilding the binary and running each query in a separate process, we can ensure accurate memory statistics.

Currently, `mem_profile` only supports `mimalloc` as the memory allocator, since it relies on `mimalloc`'s API to collect memory statistics.

Because it runs the compiled binary directly from the target directory, make sure your working directory is the top-level datafusion/ directory, where the target/ is also located.

The benchmark subcommand (e.g., `tpch`) and all following arguments are passed directly to `dfbench`. Be sure to specify `--bench-profile` before the benchmark subcommand.
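
The ordering requirement follows from how pass-through arguments typically work: a wrapper consumes its own options until it meets the first bare word, then forwards that word and everything after it unchanged. The sketch below illustrates the idea only; it is not `mem_profile`'s actual parser, the function name `split_args` is hypothetical, and the default profile `release` is an assumption made for the example.

```shell
# Hypothetical illustration of wrapper-style argument handling.
# Options before the first bare word belong to the wrapper; the bare word
# (the benchmark subcommand) and all following arguments are forwarded.
split_args() {
    local bench_profile="release"   # assumed default, for illustration only
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --bench-profile) bench_profile="$2"; shift 2 ;;
            *) break ;;
        esac
    done
    echo "profile=${bench_profile} forwarded=$*"
}

# Option placed before the subcommand: consumed by the wrapper.
split_args --bench-profile release-nonlto tpch --partitions 4
# Option placed after the subcommand: forwarded untouched.
split_args tpch --bench-profile release-nonlto
```

In the second call the option ends up in the forwarded arguments, which is why `--bench-profile` has to come first.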

Example:

```shell
datafusion$ cargo run --profile release-nonlto --bin mem_profile -- --bench-profile release-nonlto tpch --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet
```

Example Output:

```
Query Time (ms) Peak RSS Peak Commit Major Page Faults
----------------------------------------------------------------
@@ -385,19 +400,21 @@ Query Time (ms) Peak RSS Peak Commit Major Page Faults
```

## Reported Metrics

When running benchmarks, `mem_profile` collects several memory-related statistics using the mimalloc API:

- Peak RSS (Resident Set Size):
The maximum amount of physical memory used by the process.
This is a process-level metric collected via OS-specific mechanisms and is not mimalloc-specific.

- Peak Commit:
The peak amount of memory committed by the allocator (i.e., total virtual memory reserved).
This is mimalloc-specific. It gives a more allocator-aware view of memory usage than RSS.

- Major Page Faults:
The number of major page faults triggered during execution.
This metric is obtained from the operating system and is not mimalloc-specific.
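
As one illustration of an OS-specific mechanism (not necessarily the one `mem_profile` uses), Linux exposes a process's peak RSS as the `VmHWM` field of `/proc/<pid>/status`. A minimal, Linux-only sketch; the helper name `peak_rss_kb` is hypothetical.

```shell
# Hypothetical helper: print the peak RSS (VmHWM, in kB) of a process.
# Linux only; defaults to the current shell when no PID is given.
peak_rss_kb() {
    local pid="${1:-$$}"
    awk '/^VmHWM:/ { print $2 }' "/proc/${pid}/status"
}

# Only attempt the read where /proc is available.
if [ -r "/proc/$$/status" ]; then
    echo "peak RSS of this shell: $(peak_rss_kb) kB"
fi
```

Because the value is read per process, a tool that runs each query in its own subprocess gets one peak per query rather than a single peak for the whole run.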

# Writing a new benchmark

## Creating or downloading data outside of the benchmark
@@ -586,6 +603,34 @@ This benchmarks is derived from the [TPC-H][1] version
[2]: https://github.com/databricks/tpch-dbgen.git,
[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf

## TPCDS

Run the tpcds benchmark.

For the data, clone the `datafusion-benchmarks` repo, which contains predefined Parquet data at SF1:

```shell
git clone https://github.com/apache/datafusion-benchmarks
```

Then run the benchmark with the following command:

```shell
DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/bench.sh run tpcds
```

Alternatively, benchmark a specific query (query 30 in this example):

```shell
DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/bench.sh run tpcds 30
```

For more help:

```shell
cargo run --release --bin dfbench -- tpcds --help
```

## External Aggregation

Run the benchmark for aggregations with limited memory.
59 changes: 59 additions & 0 deletions benchmarks/bench.sh
@@ -87,6 +87,9 @@ tpch10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB),
tpch_csv10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single csv file per table, hash join
tpch_mem10: TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory

# TPC-DS Benchmarks
tpcds: TPCDS inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join

# Extended TPC-H Benchmarks
sort_tpch: Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=1)
sort_tpch10: Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=10)
@@ -220,6 +223,9 @@ main() {
tpch_csv10)
data_tpch "10" "csv"
;;
tpcds)
data_tpcds
;;
clickbench_1)
data_clickbench_1
;;
@@ -388,6 +394,7 @@ main() {
run_external_aggr
run_nlj
run_hj
run_tpcds
;;
tpch)
run_tpch "1" "parquet"
@@ -407,6 +414,9 @@ main() {
tpch_mem10)
run_tpch_mem "10"
;;
tpcds)
run_tpcds
;;
cancellation)
run_cancellation
;;
@@ -601,6 +611,24 @@ data_tpch() {
exit 1
}

# Points to TPCDS data generation instructions
data_tpcds() {
TPCDS_DIR="${DATA_DIR}"

# Check if TPCDS data directory exists
if [ ! -d "${TPCDS_DIR}" ]; then
echo ""
echo "For TPC-DS data generation, please clone the datafusion-benchmarks repository:"
echo " git clone https://github.com/apache/datafusion-benchmarks"
echo ""
return 1
fi

echo ""
echo "TPC-DS data already exists in ${TPCDS_DIR}"
echo ""
}

# Runs the tpch benchmark
run_tpch() {
SCALE_FACTOR=$1
@@ -634,6 +662,37 @@ run_tpch_mem() {
debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" -m --format parquet -o "${RESULTS_FILE}" ${QUERY_ARG}
}

# Runs the tpcds benchmark
run_tpcds() {
TPCDS_DIR="${DATA_DIR}"

# Check if TPCDS data directory exists
if [ ! -d "${TPCDS_DIR}" ]; then
echo "Error: TPC-DS data directory does not exist: ${TPCDS_DIR}" >&2
echo "" >&2
echo "Please prepare TPC-DS data first by following instructions:" >&2
echo " ./bench.sh data tpcds" >&2
echo "" >&2
exit 1
fi

# Check if directory contains parquet files
if ! find "${TPCDS_DIR}" -name "*.parquet" -print -quit | grep -q .; then
echo "Error: TPC-DS data directory exists but contains no parquet files: ${TPCDS_DIR}" >&2
echo "" >&2
echo "Please prepare TPC-DS data first by following instructions:" >&2
echo " ./bench.sh data tpcds" >&2
echo "" >&2
exit 1
fi

RESULTS_FILE="${RESULTS_DIR}/tpcds_sf1.json"
echo "RESULTS_FILE: ${RESULTS_FILE}"
echo "Running tpcds benchmark..."

debug_run $CARGO_COMMAND --bin dfbench -- tpcds --iterations 5 --path "${TPCDS_DIR}" --query_path "../datafusion/core/tests/tpc-ds" --prefer_hash_join "${PREFER_HASH_JOIN}" -o "${RESULTS_FILE}" ${QUERY_ARG}
}

# Runs the compile profile benchmark helper
run_compile_profile() {
local profiles=("$@")
58 changes: 58 additions & 0 deletions benchmarks/compare_tpcds.sh
@@ -0,0 +1,58 @@
#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Compare TPC-DS benchmarks between two branches

set -e

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

usage() {
echo "Usage: $0 <branch1> <branch2>"
echo ""
echo "Example: $0 main dev2"
echo ""
echo "Note: DATA_DIR must point to a directory containing the TPC-DS parquet data"
exit 1
}

BRANCH1=${1:-""}
BRANCH2=${2:-""}

if [ -z "$BRANCH1" ] || [ -z "$BRANCH2" ]; then
usage
fi

# Store current branch
CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)

echo "Comparing TPC-DS benchmarks: ${BRANCH1} vs ${BRANCH2}"

# Run benchmark on first branch
git checkout "$BRANCH1"
./benchmarks/bench.sh run tpcds

# Run benchmark on second branch
git checkout "$BRANCH2"
./benchmarks/bench.sh run tpcds

# Compare results
./benchmarks/bench.sh compare "$BRANCH1" "$BRANCH2"

# Return to original branch
git checkout "$CURRENT_BRANCH"
56 changes: 56 additions & 0 deletions benchmarks/compare_tpch.sh
@@ -0,0 +1,56 @@
#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# Compare TPC-H benchmarks between two branches

set -e

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

usage() {
echo "Usage: $0 <branch1> <branch2>"
echo ""
echo "Example: $0 main dev2"
exit 1
}

BRANCH1=${1:-""}
BRANCH2=${2:-""}

if [ -z "$BRANCH1" ] || [ -z "$BRANCH2" ]; then
usage
fi

# Store current branch
CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)

echo "Comparing TPC-H benchmarks: ${BRANCH1} vs ${BRANCH2}"

# Run benchmark on first branch
git checkout "$BRANCH1"
./benchmarks/bench.sh run tpch

# Run benchmark on second branch
git checkout "$BRANCH2"
./benchmarks/bench.sh run tpch

# Compare results
./benchmarks/bench.sh compare "$BRANCH1" "$BRANCH2"

# Return to original branch
git checkout "$CURRENT_BRANCH"
4 changes: 3 additions & 1 deletion benchmarks/src/bin/dfbench.rs
@@ -34,7 +34,7 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;

use datafusion_benchmarks::{
cancellation, clickbench, h2o, hj, imdb, nlj, sort_tpch, tpcds, tpch,
};

#[derive(Debug, StructOpt)]
@@ -48,6 +48,7 @@ enum Options {
Nlj(nlj::RunOpt),
SortTpch(sort_tpch::RunOpt),
Tpch(tpch::RunOpt),
Tpcds(tpcds::RunOpt),
}

// Main benchmark runner entrypoint
@@ -64,5 +65,6 @@ pub async fn main() -> Result<()> {
Options::Nlj(opt) => opt.run().await,
Options::SortTpch(opt) => opt.run().await,
Options::Tpch(opt) => Box::pin(opt.run()).await,
Options::Tpcds(opt) => Box::pin(opt.run()).await,
}
}
1 change: 1 addition & 0 deletions benchmarks/src/lib.rs
@@ -23,5 +23,6 @@ pub mod hj;
pub mod imdb;
pub mod nlj;
pub mod sort_tpch;
pub mod tpcds;
pub mod tpch;
pub mod util;