apache · comphead · Dec 8, 2025 · Nov 28, 2025 · Dec 1, 2025 · Dec 1, 2025
diff --git a/benchmarks/README.md b/benchmarks/README.md
@@ -119,7 +119,6 @@ You can also invoke the helper directly if you need to customise arguments furth
 ./benchmarks/compile_profile.py --profiles dev release --data /path/to/tpch_sf1
 ```
 
-
 ## Benchmark with modified configurations
 
 ### Select join algorithm
@@ -147,6 +146,19 @@ To verify that datafusion picked up your configuration, run the benchmarks with
 
 ## Comparing performance of main and a branch
 
+For TPC-H:
+```shell
+./benchmarks/compare_tpch.sh main mybranch
+```
+
+For TPC-DS, first get the data into `DATA_DIR` by following the instructions
+printed by `./benchmarks/bench.sh data tpcds`, then run:
+```shell
+DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/compare_tpcds.sh main mybranch
+```
+
+Alternatively, you can compare manually by following the example below:
+
 ```shell
 git checkout main
 
@@ -299,7 +311,6 @@ This will produce output like:
 └──────────────┴──────────────┴──────────────┴───────────────┘
 ```
 
-
 # Benchmark Runner
 
 The `dfbench` program contains subcommands to run the various
@@ -339,24 +350,28 @@ FLAGS:
 ```
 
 # Profiling Memory Stats for each benchmark query
+
 The `mem_profile` program wraps benchmark execution to measure memory usage statistics, such as peak RSS. It runs each benchmark query in a separate subprocess, capturing the child process’s stdout to print structured output.
 
 Subcommands supported by mem_profile are the subset of those in `dfbench`.
-Currently supported benchmarks include: Clickbench, H2o, Imdb, SortTpch, Tpch
+Currently supported benchmarks include: Clickbench, H2o, Imdb, SortTpch, Tpch, Tpcds
 
 Before running benchmarks, `mem_profile` automatically compiles the benchmark binary (`dfbench`) using `cargo build`. Note that the build profile used for `dfbench` is not tied to the profile used for running `mem_profile` itself. We can explicitly specify the desired build profile using the `--bench-profile` option (e.g. release-nonlto). By prebuilding the binary and running each query in a separate process, we can ensure accurate memory statistics.
 
 Currently, `mem_profile` only supports `mimalloc` as the memory allocator, since it relies on `mimalloc`'s API to collect memory statistics.
 
-Because it runs the compiled binary directly from the target directory, make sure your working directory is the top-level datafusion/ directory, where the target/ is also located. 
+Because it runs the compiled binary directly from the target directory, make sure your working directory is the top-level datafusion/ directory, where the target/ is also located.
+
+The benchmark subcommand (e.g., `tpch`) and all following arguments are passed directly to `dfbench`. Be sure to specify `--bench-profile` before the benchmark subcommand.
 
-The benchmark subcommand (e.g., `tpch`) and all following arguments are passed directly to `dfbench`. Be sure to specify `--bench-profile` before the benchmark subcommand. 
+Example:
 
-Example: 
 ```shell
 datafusion$ cargo run --profile release-nonlto --bin mem_profile -- --bench-profile release-nonlto tpch --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet
 ```
+
 Example Output:
+
 ```
 Query     Time (ms)     Peak RSS  Peak Commit  Major Page Faults
 ----------------------------------------------------------------
@@ -385,19 +400,21 @@ Query     Time (ms)     Peak RSS  Peak Commit  Major Page Faults
 ```
 
 ## Reported Metrics
+
 When running benchmarks, `mem_profile` collects several memory-related statistics using the mimalloc API:
 
-- Peak RSS (Resident Set Size): 
-The maximum amount of physical memory used by the process.
-This is a process-level metric collected via OS-specific mechanisms and is not mimalloc-specific.
+- Peak RSS (Resident Set Size):
+  The maximum amount of physical memory used by the process.
+  This is a process-level metric collected via OS-specific mechanisms and is not mimalloc-specific.
 
 - Peak Commit:
-The peak amount of memory committed by the allocator (i.e., total virtual memory reserved).
-This is mimalloc-specific. It gives a more allocator-aware view of memory usage than RSS.
+  The peak amount of memory committed by the allocator (i.e., total virtual memory reserved).
+  This is mimalloc-specific. It gives a more allocator-aware view of memory usage than RSS.
 
 - Major Page Faults:
-The number of major page faults triggered during execution.
-This metric is obtained from the operating system and is not mimalloc-specific.
+  The number of major page faults triggered during execution.
+  This metric is obtained from the operating system and is not mimalloc-specific.
+
 # Writing a new benchmark
 
 ## Creating or downloading data outside of the benchmark
@@ -586,6 +603,34 @@ This benchmarks is derived from the [TPC-H][1] version
 [2]: https://github.com/databricks/tpch-dbgen.git,
 [2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
 
+## TPCDS
+
+Run the TPC-DS benchmark.
+
+For data, clone the `datafusion-benchmarks` repo, which contains pre-generated Parquet data at scale factor 1 (SF1).
+
+```shell
+git clone https://github.com/apache/datafusion-benchmarks
+```
+
+Then run the benchmark with the following command:
+
+```shell
+DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/bench.sh run tpcds
+```
+
+Alternatively, benchmark a specific query (here, query 30):
+
+```shell
+DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/ ./benchmarks/bench.sh run tpcds 30
+```
+
+For more options, see the help output:
+
+```shell
+cargo run --release --bin dfbench -- tpcds --help
+```
+
 ## External Aggregation
 
 Run the benchmark for aggregations with limited memory.

diff --git a/benchmarks/bench.sh b/benchmarks/bench.sh
@@ -87,6 +87,9 @@ tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB),
 tpch_csv10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single csv file per table, hash join
 tpch_mem10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
 
+# TPC-DS Benchmarks
+tpcds:                  TPCDS inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
+
 # Extended TPC-H Benchmarks
 sort_tpch:              Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=1)
 sort_tpch10:            Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=10)
@@ -220,6 +223,9 @@ main() {
                 tpch_csv10)
                     data_tpch "10" "csv"
                     ;;
+                tpcds)
+                    data_tpcds
+                    ;;
                 clickbench_1)
                     data_clickbench_1
                     ;;
@@ -388,6 +394,7 @@ main() {
                     run_external_aggr
                     run_nlj
                     run_hj
+                    run_tpcds
                     ;;
                 tpch)
                     run_tpch "1" "parquet"
@@ -407,6 +414,9 @@ main() {
                 tpch_mem10)
                     run_tpch_mem "10"
                     ;;
+                tpcds)
+                    run_tpcds
+                    ;;
                 cancellation)
                     run_cancellation
                     ;;
@@ -601,6 +611,24 @@ data_tpch() {
     exit 1
 }
 
+# Points to TPCDS data generation instructions
+data_tpcds() {
+    TPCDS_DIR="${DATA_DIR}"
+
+    # Check if TPCDS data directory exists
+    if [ ! -d "${TPCDS_DIR}" ]; then
+        echo ""
+        echo "For TPC-DS data generation, please clone the datafusion-benchmarks repository:"
+        echo "  git clone https://github.com/apache/datafusion-benchmarks"
+        echo ""
+        return 1
+    fi
+
+    echo ""
+    echo "TPC-DS data already exists in ${TPCDS_DIR}"
+    echo ""
+}
+
 # Runs the tpch benchmark
 run_tpch() {
     SCALE_FACTOR=$1
@@ -634,6 +662,37 @@ run_tpch_mem() {
     debug_run $CARGO_COMMAND --bin dfbench -- tpch --iterations 5 --path "${TPCH_DIR}" --prefer_hash_join "${PREFER_HASH_JOIN}" -m --format parquet -o "${RESULTS_FILE}" ${QUERY_ARG}
 }
 
+# Runs the tpcds benchmark
+run_tpcds() {
+    TPCDS_DIR="${DATA_DIR}"
+
+    # Check if TPCDS data directory exists
+    if [ ! -d "${TPCDS_DIR}" ]; then
+        echo "Error: TPC-DS data directory does not exist: ${TPCDS_DIR}" >&2
+        echo "" >&2
+        echo "Please prepare TPC-DS data first by following instructions:" >&2
+        echo "  ./bench.sh data tpcds" >&2
+        echo "" >&2
+        exit 1
+    fi
+
+    # Check if directory contains parquet files
+    if ! find "${TPCDS_DIR}" -name "*.parquet" -print -quit | grep -q .; then
+        echo "Error: TPC-DS data directory exists but contains no parquet files: ${TPCDS_DIR}" >&2
+        echo "" >&2
+        echo "Please prepare TPC-DS data first by following instructions:" >&2
+        echo "  ./bench.sh data tpcds" >&2
+        echo "" >&2
+        exit 1
+    fi
+
+    RESULTS_FILE="${RESULTS_DIR}/tpcds_sf1.json"
+    echo "RESULTS_FILE: ${RESULTS_FILE}"
+    echo "Running tpcds benchmark..."
+
+    debug_run $CARGO_COMMAND --bin dfbench -- tpcds --iterations 5 --path "${TPCDS_DIR}" --query_path "../datafusion/core/tests/tpc-ds" --prefer_hash_join "${PREFER_HASH_JOIN}" -o "${RESULTS_FILE}" ${QUERY_ARG}
+}
+
 # Runs the compile profile benchmark helper
 run_compile_profile() {
     local profiles=("$@")

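Review note: the directory and parquet checks in `run_tpcds` above can be tried in isolation before pointing `DATA_DIR` at real data. The following is a minimal sketch, not code from this PR; the helper name `has_parquet_files` and the file name `store_sales.parquet` are assumptions made for illustration. It reuses the same `find ... -print -quit | grep -q .` probe that `run_tpcds` uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bench.sh): succeeds only when the
# directory exists and contains at least one *.parquet file.
has_parquet_files() {
    local dir="$1"
    [ -d "${dir}" ] || return 1
    # -quit stops find at the first match; grep -q . succeeds on any output.
    find "${dir}" -name "*.parquet" -print -quit | grep -q .
}

# Probe a throwaway directory before and after adding a parquet file.
tmp_dir=$(mktemp -d)
if has_parquet_files "${tmp_dir}"; then before="found"; else before="missing"; fi
touch "${tmp_dir}/store_sales.parquet"
if has_parquet_files "${tmp_dir}"; then after="found"; else after="missing"; fi
echo "before=${before} after=${after}"
rm -rf "${tmp_dir}"
```

Folding both checks into one helper like this would also let `data_tpcds` and `run_tpcds` share the validation, though that is a design suggestion rather than something the PR does.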
diff --git a/benchmarks/compare_tpcds.sh b/benchmarks/compare_tpcds.sh
@@ -0,0 +1,58 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Compare TPC-DS benchmarks between two branches
+
+set -e
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+usage() {
+    echo "Usage: $0 <branch1> <branch2>"
+    echo ""
+    echo "Example: $0 main dev2"
+    echo ""
+    echo "Note: DATA_DIR must point to the TPC-DS parquet data, see: bench.sh data tpcds"
+    exit 1
+}
+
+BRANCH1=${1:-""}
+BRANCH2=${2:-""}
+
+if [ -z "$BRANCH1" ] || [ -z "$BRANCH2" ]; then
+    usage
+fi
+
+# Store current branch
+CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
+
+echo "Comparing TPC-DS benchmarks: ${BRANCH1} vs ${BRANCH2}"
+
+# Run benchmark on first branch
+git checkout "$BRANCH1"
+./benchmarks/bench.sh run tpcds
+
+# Run benchmark on second branch
+git checkout "$BRANCH2"
+./benchmarks/bench.sh run tpcds
+
+# Compare results
+./benchmarks/bench.sh compare "$BRANCH1" "$BRANCH2"
+
+# Return to original branch
+git checkout "$CURRENT_BRANCH"
diff --git a/benchmarks/compare_tpch.sh b/benchmarks/compare_tpch.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Compare TPC-H benchmarks between two branches
+
+set -e
+
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+
+usage() {
+    echo "Usage: $0 <branch1> <branch2>"
+    echo ""
+    echo "Example: $0 main dev2"
+    exit 1
+}
+
+BRANCH1=${1:-""}
+BRANCH2=${2:-""}
+
+if [ -z "$BRANCH1" ] || [ -z "$BRANCH2" ]; then
+    usage
+fi
+
+# Store current branch
+CURRENT_BRANCH=$(git rev-parse --abbrev-ref HEAD)
+
+echo "Comparing TPC-H benchmarks: ${BRANCH1} vs ${BRANCH2}"
+
+# Run benchmark on first branch
+git checkout "$BRANCH1"
+./benchmarks/bench.sh run tpch
+
+# Run benchmark on second branch
+git checkout "$BRANCH2"
+./benchmarks/bench.sh run tpch
+
+# Compare results
+./benchmarks/bench.sh compare "$BRANCH1" "$BRANCH2"
+
+# Return to original branch
+git checkout "$CURRENT_BRANCH"
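Review note: both compare scripts above use `set -e`, so if a benchmark run fails the script exits before the final `git checkout "$CURRENT_BRANCH"` and the working tree stays on the other branch. One possible way to guard against that is an `EXIT` trap. This is a sketch under stated assumptions, not part of the PR: the function name `run_on_branches` is hypothetical, and it assumes the script starts on a named branch rather than a detached HEAD.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): run a command on each of two
# branches and always return to the starting branch, even when a run fails.
run_on_branches() {
    local branch1="$1" branch2="$2"
    shift 2
    (
        original=$(git rev-parse --abbrev-ref HEAD) || exit 1
        # The EXIT trap fires when this subshell ends, on success or failure.
        trap 'git checkout --quiet "${original}"' EXIT
        for branch in "${branch1}" "${branch2}"; do
            git checkout --quiet "${branch}" || exit 1
            "$@" || exit 1
        done
    )
}

# Example invocation (not executed here):
#   run_on_branches main mybranch ./benchmarks/bench.sh run tpch
```

The subshell keeps the trap local to the helper, so it does not replace any `EXIT` trap the calling script may already have installed.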
diff --git a/benchmarks/src/bin/dfbench.rs b/benchmarks/src/bin/dfbench.rs
@@ -34,7 +34,7 @@ static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
 static ALLOC: mimalloc::MiMalloc = mimalloc::MiMalloc;
 
 use datafusion_benchmarks::{
-    cancellation, clickbench, h2o, hj, imdb, nlj, sort_tpch, tpch,
+    cancellation, clickbench, h2o, hj, imdb, nlj, sort_tpch, tpcds, tpch,
 };
 
 #[derive(Debug, StructOpt)]
@@ -48,6 +48,7 @@ enum Options {
     Nlj(nlj::RunOpt),
     SortTpch(sort_tpch::RunOpt),
     Tpch(tpch::RunOpt),
+    Tpcds(tpcds::RunOpt),
 }
 
 // Main benchmark runner entrypoint
@@ -64,5 +65,6 @@ pub async fn main() -> Result<()> {
         Options::Nlj(opt) => opt.run().await,
         Options::SortTpch(opt) => opt.run().await,
         Options::Tpch(opt) => Box::pin(opt.run()).await,
+        Options::Tpcds(opt) => Box::pin(opt.run()).await,
     }
 }
diff --git a/benchmarks/src/lib.rs b/benchmarks/src/lib.rs
@@ -23,5 +23,6 @@ pub mod hj;
 pub mod imdb;
 pub mod nlj;
 pub mod sort_tpch;
+pub mod tpcds;
 pub mod tpch;
 pub mod util;