
Add metrics configs #320

Merged: lzxddz merged 1 commit into eloqdata:main from lzxddz:add-metrics on Nov 20, 2025

lzxddz (Collaborator) commented Nov 19, 2025

Summary by CodeRabbit

  • New Features
    • Enhanced metrics monitoring capabilities across storage configurations
    • Added comprehensive metrics tracking including memory usage, cache hit rates, transaction performance, remote requests, and log service analytics
    • Metrics are fully configurable with adjustable collection intervals and performance thresholds

@lzxddz lzxddz requested a review from xiexiaoy November 19, 2025 11:02
@lzxddz lzxddz linked an issue Nov 19, 2025 that may be closed by this pull request
coderabbitai Bot commented Nov 19, 2025

Walkthrough

The PR adds comprehensive metrics configuration across multiple EloQ DSS RocksDB deployment variants and extends the EloqGlobalOptions class with a new enableLogServiceMetrics flag. Configuration files introduce new metrics blocks with toggles for memory usage, cache hit rate, and transaction metrics. The KV engine conditionally includes log service metrics headers based on compile-time guards and initializes the flag at runtime.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| EloQ DSS Configuration Files - Metrics Addition: concourse/artifact/ELOQDSS_ROCKSDB/eloqdoc.conf, concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc.conf, concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_a.conf, concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_b.conf, concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_c.conf | Added metrics configuration blocks with enableMetrics, metricsPort, enableMemoryUsage, collectMemoryUsageRound, enableCacheHitRate, enableTxMetrics, collectTxDurationRound, enableBusyRoundMetrics, busyRoundThreshold, enableRemoteRequestMetrics, and enableLogServiceMetrics options. Also added forkHostManager: false option to relevant sections. |
| EloqGlobalOptions Header: src/mongo/db/modules/eloq/src/eloq_global_options.h | Added public boolean data member enableLogServiceMetrics with default value true. |
| EloqGlobalOptions Implementation: src/mongo/db/modules/eloq/src/eloq_global_options.cpp | Added parsing logic for storage.eloq.metrics.enableLogServiceMetrics option and set enableLogServiceMetrics to enableMetrics && value in the store() method. |
| EloqKVEngine Implementation: src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp | Added conditional include guard for log_service_metrics.h (selecting between log_service and eloq_log_service based on OPEN_LOG_SERVICE define). Added runtime initialization of metrics::enable_log_service_metrics from eloqGlobalOptions.enableLogServiceMetrics in the EloqKVEngine constructor. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • Review areas requiring extra attention:
    • Verify conditional include logic in eloq_kv_engine.cpp for correct preprocessor guard handling and ensure OPEN_LOG_SERVICE is consistently defined across build configurations
    • Confirm metrics configuration values (ports, thresholds, collection intervals) are consistent and reasonable across all five configuration files
    • Ensure enableLogServiceMetrics default value alignment with existing metrics enable patterns

Possibly related PRs

Suggested reviewers

  • xiexiaoy
  • yi-xmu

Poem

🐰 Metrics bloom in config files bright,
Log service flags now shine with light,
RocksDB clusters hop with glee,
Performance data runs wild and free,
One hop, two hop—collect we go! 📊

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add metrics configs' accurately describes the main change across all modified files, which consistently add new metrics configuration blocks. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_b.conf (1)

53-65: Same metrics port conflict concern as cluster_a.

This node also uses metricsPort: 18081. See the comment on eloqdoc_cluster_a.conf regarding potential port conflicts in single-machine cluster deployments.

🧹 Nitpick comments (1)
concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_c.conf (1)

53-65: Consider adding inline documentation for metrics options.

The new metrics section lacks the detailed inline comments present in other sections of this configuration file (e.g., the txService section at lines 20-47 has extensive documentation). Adding comments would improve maintainability and help operators understand what each metric option does.

Example documentation style to match existing sections:

   metrics:
     # Metrics options.
     enableMetrics: true
+    # Port to expose Prometheus-compatible metrics endpoint.
     metricsPort: 18081
+    # Collect memory usage statistics.
     enableMemoryUsage: true
+    # Frequency of memory usage collection (in rounds/iterations).
     collectMemoryUsageRound: 10000
+    # Track RocksDB cache hit rate.
     enableCacheHitRate: true
+    # Collect transaction processing metrics.
     enableTxMetrics: true
+    # Frequency of transaction duration collection (in rounds/iterations).
     collectTxDurationRound: 100
+    # Track rounds that exceed the busy threshold.
     enableBusyRoundMetrics: true
+    # Threshold in milliseconds to consider a round as 'busy'.
     busyRoundThreshold: 10
+    # Collect metrics for remote DSS requests.
     enableRemoteRequestMetrics: true
+    # Collect metrics from the log service layer.
     enableLogServiceMetrics: true
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 539d5ea and bd814d5.

📒 Files selected for processing (8)
  • concourse/artifact/ELOQDSS_ROCKSDB/eloqdoc.conf (1 hunks)
  • concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc.conf (2 hunks)
  • concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_a.conf (1 hunks)
  • concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_b.conf (1 hunks)
  • concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_c.conf (1 hunks)
  • src/mongo/db/modules/eloq/src/eloq_global_options.cpp (2 hunks)
  • src/mongo/db/modules/eloq/src/eloq_global_options.h (1 hunks)
  • src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-09-25T11:58:50.446Z
Learnt from: githubzilla
Repo: eloqdata/eloqdoc PR: 211
File: src/mongo/db/modules/eloq/cmake/build_log_service.cmake:116-119
Timestamp: 2025-09-25T11:58:50.446Z
Learning: The build_log_service.cmake file is specifically for "open log" functionality and only supports the ROCKSDB log state type (LOG_STATE_TYPE_RKDB). The full log state configuration with cloud variants (ROCKSDB_CLOUD_S3, ROCKSDB_CLOUD_GCS) is handled in build_eloq_log_service.cmake.

Applied to files:

  • src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp
  • src/mongo/db/modules/eloq/src/eloq_global_options.h
📚 Learning: 2025-09-25T12:09:01.276Z
Learnt from: githubzilla
Repo: eloqdata/eloqdoc PR: 211
File: src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp:95-115
Timestamp: 2025-09-25T12:09:01.276Z
Learning: The log states (LOG_STATE_TYPE_RKDB, LOG_STATE_TYPE_RKDB_S3, LOG_STATE_TYPE_RKDB_GCS) in the EloqDoc codebase are mutually exclusive - only one can be active at a time. The LOG_STATE_TYPE_RKDB_CLOUD macro is a helper that should only be defined for cloud variants (S3/GCS) and not when regular RKDB is active.

Applied to files:

  • src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp
📚 Learning: 2025-09-25T12:24:06.434Z
Learnt from: githubzilla
Repo: eloqdata/eloqdoc PR: 211
File: src/mongo/db/modules/eloq/cmake/build_eloq_log_service.cmake:26-80
Timestamp: 2025-09-25T12:24:06.434Z
Learning: The user githubzilla implemented a fix in commit fe98aaf to address the MEMORY state incorrectly triggering RocksDB discovery in build_eloq_log_service.cmake. This was a control flow issue where MEMORY builds were falling into an else branch that still attempted RocksDB discovery.

Applied to files:

  • concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc.conf
🧬 Code graph analysis (1)
src/mongo/db/modules/eloq/src/eloq_global_options.cpp (1)
src/mongo/db/modules/eloq/src/eloq_global_options.h (1)
  • enableLogServiceMetrics (161-162)
🔇 Additional comments (9)
concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_c.conf (1)

58-58: Clarify units for collection intervals and thresholds.

The numeric values for collectMemoryUsageRound, collectTxDurationRound, and busyRoundThreshold lack unit specifications in comments, making it difficult to assess whether the values are appropriate for this deployment.

Please confirm:

  • What units do collectMemoryUsageRound (10000) and collectTxDurationRound (100) use? Are these iteration counts, time intervals, or transaction counts?
  • What units does busyRoundThreshold (10) use? Milliseconds, microseconds, or something else?
  • Are these values appropriate for a cloud S3-backed deployment with nodeMemoryLimitMB: 16384?

Also applies to: 61-61, 63-63

concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc.conf (1)

61-73: LGTM! Comprehensive metrics configuration added.

The metrics block is well-structured with clear options for different metric types. The enableLogServiceMetrics flag aligns with the code changes in EloqKVEngine and EloqGlobalOptions.

src/mongo/db/modules/eloq/src/eloq_global_options.h (1)

161-161: LGTM! New log service metrics flag added.

The new enableLogServiceMetrics option follows the same pattern as other metrics flags and has a sensible default value.

src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (2)

124-128: LGTM! Conditional header inclusion correctly implemented.

The guarded include properly selects between log_service and eloq_log_service paths based on the OPEN_LOG_SERVICE build configuration, which aligns with the mutually exclusive log state architecture.

Based on learnings


531-531: LGTM! Log service metrics flag correctly initialized.

The runtime initialization of metrics::enable_log_service_metrics from the global options is properly placed alongside other metrics configuration and follows the established pattern.
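
The two eloq_kv_engine.cpp changes described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the real #include lines and header contents are not shown in this conversation, so an enum constant stands in for "which source tree was selected", and it is assumed (not confirmed) that OPEN_LOG_SERVICE picks the log_service tree. EloqGlobalOptionsSketch, LogServiceTree, and initLogServiceMetrics are hypothetical names; only metrics::enable_log_service_metrics and enableLogServiceMetrics come from the review.

```cpp
#include <cassert>

// Stand-in for the guarded include: the real code is said to pick
// log_service_metrics.h from one of two trees based on OPEN_LOG_SERVICE.
// Assumption: OPEN_LOG_SERVICE selects the log_service tree.
enum class LogServiceTree { OpenLogService, EloqLogService };

#if defined(OPEN_LOG_SERVICE)
constexpr LogServiceTree kLogServiceTree = LogServiceTree::OpenLogService;
#else
constexpr LogServiceTree kLogServiceTree = LogServiceTree::EloqLogService;
#endif

// Human-readable name of the tree chosen at compile time.
inline const char* logServiceTreeName() {
    return kLogServiceTree == LogServiceTree::OpenLogService ? "log_service"
                                                             : "eloq_log_service";
}

// Stand-in for the flag that the selected header is said to expose.
namespace metrics {
bool enable_log_service_metrics = false;
}

// Hypothetical stand-in for the relevant member of EloqGlobalOptions.
struct EloqGlobalOptionsSketch {
    bool enableLogServiceMetrics = true;  // default value per the PR summary
};

// Mirrors the runtime initialization described for the EloqKVEngine constructor:
// copy the parsed option into the log service's global flag.
inline void initLogServiceMetrics(const EloqGlobalOptionsSketch& opts) {
    metrics::enable_log_service_metrics = opts.enableLogServiceMetrics;
}

// Usage example: the engine-side flag follows the option value.
inline void initLogServiceMetricsExample() {
    EloqGlobalOptionsSketch opts;
    opts.enableLogServiceMetrics = false;
    initLogServiceMetrics(opts);
    assert(!metrics::enable_log_service_metrics);
}
```

Because exactly one branch of the preprocessor guard is compiled, the flag has a single definition regardless of which log service build is active.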

concourse/artifact/ELOQDSS_ROCKSDB/eloqdoc.conf (2)

27-28: LGTM! Cluster configuration option added.

The forkHostManager option is consistently added across deployment configurations.


33-45: LGTM! Metrics configuration block added.

The metrics configuration provides comprehensive control over different metric types and aligns with the new enableLogServiceMetrics global option.

src/mongo/db/modules/eloq/src/eloq_global_options.cpp (2)

648-653: LGTM! Option definition follows established pattern.

The new enableLogServiceMetrics option is properly defined with appropriate description and default value, consistent with other metrics options.


1130-1133: LGTM! Parsing logic correctly implements master switch behavior.

The enableMetrics && value pattern ensures that enableLogServiceMetrics is only active when the master enableMetrics flag is also enabled, which is consistent with how other metrics options are parsed (e.g., lines 1094-1095, 1103-1104).
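
The master-switch behavior described here can be sketched as below. This is a minimal illustration, not the code from eloq_global_options.cpp: MetricsOptions and storeLogServiceMetrics are hypothetical names, and the sketch assumes the configured value has already been parsed into a bool.

```cpp
#include <cassert>

// Hypothetical stand-in for the metrics-related members of EloqGlobalOptions.
struct MetricsOptions {
    bool enableMetrics = false;
    bool enableLogServiceMetrics = true;  // default value per the PR summary
};

// Apply the "enableMetrics && value" pattern: the sub-flag is honored only
// when the master switch is on.
inline void storeLogServiceMetrics(MetricsOptions& opts, bool configuredValue) {
    opts.enableLogServiceMetrics = opts.enableMetrics && configuredValue;
    // Invariant: the sub-flag is never on while the master switch is off.
    assert(opts.enableMetrics || !opts.enableLogServiceMetrics);
}
```

With this pattern, setting enableLogServiceMetrics: true in a config file has no effect unless enableMetrics is also true, so operators can turn off all metrics with a single option.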

Comment on lines +53 to +65
metrics:
  # Metrics options.
  enableMetrics: true
  metricsPort: 18081
  enableMemoryUsage: true
  collectMemoryUsageRound: 10000
  enableCacheHitRate: true
  enableTxMetrics: true
  collectTxDurationRound: 100
  enableBusyRoundMetrics: true
  busyRoundThreshold: 10
  enableRemoteRequestMetrics: true
  enableLogServiceMetrics: true

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Verify metrics port allocation for multi-node cluster deployments.

All cluster configuration files (cluster_a, cluster_b, cluster_c) use metricsPort: 18081. If these nodes run on the same machine, this will cause port conflicts. Consider using distinct ports (e.g., 18081, 18082, 18083) for each cluster node.


🏁 Script executed:

#!/bin/bash
# Check metrics port configuration across all cluster config files
echo "Checking metricsPort in cluster configuration files..."
rg -n "metricsPort:" concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster*.conf

Length of output: 339


Verification confirms the observation is accurate. Let me check if there's architectural context indicating whether these nodes are deployed on the same machine or distributed:


🏁 Script executed:

#!/bin/bash
# Check for deployment instructions and architecture documentation
echo "Looking for deployment-related files and documentation..."
fd -e md -e txt -e yaml -e yml . concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/ | head -20

echo -e "\n\nSearching for any deployment or setup instructions..."
rg -n "cluster_a|cluster_b|cluster_c|deployment|distributed" concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/ --type=md --type=txt -C2 | head -40

Length of output: 330


Let me search for the cluster config files more broadly:


🏁 Script executed:

#!/bin/bash
# Search for cluster config files across the repository
echo "Searching for cluster config files..."
fd -e conf "cluster" . 2>/dev/null

echo -e "\n\nSearching for eloqdoc files..."
fd "eloqdoc" . 2>/dev/null

echo -e "\n\nRepository root structure..."
ls -la | head -20

Length of output: 1968


Let me check documentation and deployment context:


🏁 Script executed:

#!/bin/bash
# Check README for deployment context
echo "=== README content (first 100 lines) ==="
head -100 README.md

echo -e "\n\n=== Looking for deployment/setup scripts ==="
find concourse -type f \( -name "*.sh" -o -name "Dockerfile" -o -name "docker-compose*" \) 2>/dev/null

echo -e "\n\n=== Checking base eloqdoc.conf for context ==="
head -60 concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc.conf | tail -20

Length of output: 7544


Let me check for deployment documentation or examples:


🏁 Script executed:

#!/bin/bash
# Check for documentation about cluster usage
echo "=== Files in concourse directory ==="
find concourse -type f -name "*.md" -o -name "README*" -o -name "*.txt" | head -20

echo -e "\n\n=== Checking concourse artifacts directory structure ==="
ls -la concourse/artifact/

echo -e "\n\n=== Check if there's any comment in cluster configs about usage ==="
head -20 concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_a.conf

echo -e "\n\n=== Check all three cluster configs for any differences ==="
echo "cluster_a ports section:"
sed -n '50,70p' concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_a.conf
echo -e "\ncluster_b ports section:"
sed -n '50,70p' concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_b.conf
echo -e "\ncluster_c ports section:"
sed -n '50,70p' concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_c.conf

Length of output: 2851


Update metrics port allocation across all three cluster configuration files to prevent conflicts.

Verification confirms the concern: all three cluster configs use metricsPort: 18081 and reference dssPeerNode: 127.0.0.1:9100 (localhost), with distinct local directories (eloqdoc-cloud-a, eloqdoc-cloud-b, eloqdoc-cloud-c). This indicates same-machine deployment, where identical metrics ports will cause binding conflicts.

Assign distinct ports to each cluster node:

  • eloqdoc_cluster_a.conf: 18081
  • eloqdoc_cluster_b.conf: 18082
  • eloqdoc_cluster_c.conf: 18083
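
A check like the one performed manually above could be automated along these lines. This is a sketch only: findPortConflicts is a hypothetical helper, it assumes the metricsPort values have already been extracted from each config file, and a reported conflict only matters when the nodes actually share a host.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Group config files by their metricsPort and return only the ports that
// are claimed by more than one file.
std::map<int, std::vector<std::string>> findPortConflicts(
    const std::map<std::string, int>& portByConfig) {
    std::map<int, std::vector<std::string>> byPort;
    for (const auto& entry : portByConfig) {
        byPort[entry.second].push_back(entry.first);
    }
    std::map<int, std::vector<std::string>> conflicts;
    for (const auto& entry : byPort) {
        if (entry.second.size() > 1) {
            conflicts.insert(entry);
        }
    }
    return conflicts;
}

// Usage example: the values as merged (all 18081) versus the suggested ports.
inline void findPortConflictsExample() {
    const std::map<std::string, int> merged = {
        {"eloqdoc_cluster_a.conf", 18081},
        {"eloqdoc_cluster_b.conf", 18081},
        {"eloqdoc_cluster_c.conf", 18081}};
    assert(findPortConflicts(merged).size() == 1);

    const std::map<std::string, int> suggested = {
        {"eloqdoc_cluster_a.conf", 18081},
        {"eloqdoc_cluster_b.conf", 18082},
        {"eloqdoc_cluster_c.conf", 18083}};
    assert(findPortConflicts(suggested).empty());
}
```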
🤖 Prompt for AI Agents
In concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_a.conf around
lines 53-65 the metricsPort is set to 18081 which conflicts with the other two
cluster configs when deployed on the same host; ensure this file keeps
metricsPort: 18081 and update the other two cluster config files so they use
unique ports (eloqdoc_cluster_b.conf -> metricsPort: 18082 and
eloqdoc_cluster_c.conf -> metricsPort: 18083) to avoid binding conflicts on the
same machine.

@lzxddz lzxddz merged commit d65b5ce into eloqdata:main Nov 20, 2025
3 checks passed
Development

Successfully merging this pull request may close these issues.

[Feature] Add metrics

2 participants