Add metrics configs#320
Conversation
WalkthroughThe PR adds comprehensive metrics configuration across multiple EloQ DSS RocksDB deployment variants and extends the EloqGlobalOptions class with a new enableLogServiceMetrics flag. Configuration files introduce new metrics blocks with toggles for memory usage, cache hit rate, and transaction metrics. The KV engine conditionally includes log service metrics headers based on compile-time guards and initializes the flag at runtime. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~15 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_b.conf (1)
53-65: Same metrics port conflict concern as cluster_a.This node also uses
metricsPort: 18081. See the comment oneloqdoc_cluster_a.confregarding potential port conflicts in single-machine cluster deployments.
🧹 Nitpick comments (1)
concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_c.conf (1)
53-65: Consider adding inline documentation for metrics options.The new metrics section lacks the detailed inline comments present in other sections of this configuration file (e.g., the txService section at lines 20-47 has extensive documentation). Adding comments would improve maintainability and help operators understand what each metric option does.
Example documentation style to match existing sections:
metrics: # Metrics options. enableMetrics: true + # Port to expose Prometheus-compatible metrics endpoint. metricsPort: 18081 + # Collect memory usage statistics. enableMemoryUsage: true + # Frequency of memory usage collection (in rounds/iterations). collectMemoryUsageRound: 10000 + # Track RocksDB cache hit rate. enableCacheHitRate: true + # Collect transaction processing metrics. enableTxMetrics: true + # Frequency of transaction duration collection (in rounds/iterations). collectTxDurationRound: 100 + # Track rounds that exceed the busy threshold. enableBusyRoundMetrics: true + # Threshold in milliseconds to consider a round as 'busy'. busyRoundThreshold: 10 + # Collect metrics for remote DSS requests. enableRemoteRequestMetrics: true + # Collect metrics from the log service layer. enableLogServiceMetrics: true
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
concourse/artifact/ELOQDSS_ROCKSDB/eloqdoc.conf(1 hunks)concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc.conf(2 hunks)concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_a.conf(1 hunks)concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_b.conf(1 hunks)concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_c.conf(1 hunks)src/mongo/db/modules/eloq/src/eloq_global_options.cpp(2 hunks)src/mongo/db/modules/eloq/src/eloq_global_options.h(1 hunks)src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp(2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: githubzilla
Repo: eloqdata/eloqdoc PR: 211
File: src/mongo/db/modules/eloq/cmake/build_log_service.cmake:116-119
Timestamp: 2025-09-25T11:58:50.446Z
Learning: The build_log_service.cmake file is specifically for "open log" functionality and only supports the ROCKSDB log state type (LOG_STATE_TYPE_RKDB). The full log state configuration with cloud variants (ROCKSDB_CLOUD_S3, ROCKSDB_CLOUD_GCS) is handled in build_eloq_log_service.cmake.
Learnt from: githubzilla
Repo: eloqdata/eloqdoc PR: 211
File: src/mongo/db/modules/eloq/cmake/build_eloq_log_service.cmake:26-80
Timestamp: 2025-09-25T12:24:06.434Z
Learning: The user githubzilla implemented a fix in commit fe98aaf to address the MEMORY state incorrectly triggering RocksDB discovery in build_eloq_log_service.cmake. This was a control flow issue where MEMORY builds were falling into an else branch that still attempted RocksDB discovery.
📚 Learning: 2025-09-25T11:58:50.446Z
Learnt from: githubzilla
Repo: eloqdata/eloqdoc PR: 211
File: src/mongo/db/modules/eloq/cmake/build_log_service.cmake:116-119
Timestamp: 2025-09-25T11:58:50.446Z
Learning: The build_log_service.cmake file is specifically for "open log" functionality and only supports the ROCKSDB log state type (LOG_STATE_TYPE_RKDB). The full log state configuration with cloud variants (ROCKSDB_CLOUD_S3, ROCKSDB_CLOUD_GCS) is handled in build_eloq_log_service.cmake.
Applied to files:
src/mongo/db/modules/eloq/src/eloq_kv_engine.cppsrc/mongo/db/modules/eloq/src/eloq_global_options.h
📚 Learning: 2025-09-25T12:09:01.276Z
Learnt from: githubzilla
Repo: eloqdata/eloqdoc PR: 211
File: src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp:95-115
Timestamp: 2025-09-25T12:09:01.276Z
Learning: The log states (LOG_STATE_TYPE_RKDB, LOG_STATE_TYPE_RKDB_S3, LOG_STATE_TYPE_RKDB_GCS) in the EloqDoc codebase are mutually exclusive - only one can be active at a time. The LOG_STATE_TYPE_RKDB_CLOUD macro is a helper that should only be defined for cloud variants (S3/GCS) and not when regular RKDB is active.
Applied to files:
src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp
📚 Learning: 2025-09-25T12:24:06.434Z
Learnt from: githubzilla
Repo: eloqdata/eloqdoc PR: 211
File: src/mongo/db/modules/eloq/cmake/build_eloq_log_service.cmake:26-80
Timestamp: 2025-09-25T12:24:06.434Z
Learning: The user githubzilla implemented a fix in commit fe98aaf to address the MEMORY state incorrectly triggering RocksDB discovery in build_eloq_log_service.cmake. This was a control flow issue where MEMORY builds were falling into an else branch that still attempted RocksDB discovery.
Applied to files:
concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc.conf
🧬 Code graph analysis (1)
src/mongo/db/modules/eloq/src/eloq_global_options.cpp (1)
src/mongo/db/modules/eloq/src/eloq_global_options.h (1)
enableLogServiceMetrics(161-162)
🔇 Additional comments (9)
concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_c.conf (1)
58-58: Clarify units for collection intervals and thresholds.The numeric values for
collectMemoryUsageRound,collectTxDurationRound, andbusyRoundThresholdlack unit specifications in comments, making it difficult to assess whether the values are appropriate for this deployment.Please confirm:
- What units do
collectMemoryUsageRound(10000) andcollectTxDurationRound(100) use? Are these iteration counts, time intervals, or transaction counts?- What units does
busyRoundThreshold(10) use? Milliseconds, microseconds, or something else?- Are these values appropriate for a cloud S3-backed deployment with nodeMemoryLimitMB: 16384?
Also applies to: 61-61, 63-63
concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc.conf (1)
61-73: LGTM! Comprehensive metrics configuration added.The metrics block is well-structured with clear options for different metric types. The
enableLogServiceMetricsflag aligns with the code changes inEloqKVEngineandEloqGlobalOptions.src/mongo/db/modules/eloq/src/eloq_global_options.h (1)
161-161: LGTM! New log service metrics flag added.The new
enableLogServiceMetricsoption follows the same pattern as other metrics flags and has a sensible default value.src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (2)
124-128: LGTM! Conditional header inclusion correctly implemented.The guarded include properly selects between
log_serviceandeloq_log_servicepaths based on theOPEN_LOG_SERVICEbuild configuration, which aligns with the mutually exclusive log state architecture.Based on learnings
531-531: LGTM! Log service metrics flag correctly initialized.The runtime initialization of
metrics::enable_log_service_metricsfrom the global options is properly placed alongside other metrics configuration and follows the established pattern.concourse/artifact/ELOQDSS_ROCKSDB/eloqdoc.conf (2)
27-28: LGTM! Cluster configuration option added.The
forkHostManageroption is consistently added across deployment configurations.
33-45: LGTM! Metrics configuration block added.The metrics configuration provides comprehensive control over different metric types and aligns with the new
enableLogServiceMetricsglobal option.src/mongo/db/modules/eloq/src/eloq_global_options.cpp (2)
648-653: LGTM! Option definition follows established pattern.The new
enableLogServiceMetricsoption is properly defined with appropriate description and default value, consistent with other metrics options.
1130-1133: LGTM! Parsing logic correctly implements master switch behavior.The
enableMetrics && valuepattern ensures thatenableLogServiceMetricsis only active when the masterenableMetricsflag is also enabled, which is consistent with how other metrics options are parsed (e.g., lines 1094-1095, 1103-1104).
| metrics: | ||
| # Metrics options. | ||
| enableMetrics: true | ||
| metricsPort: 18081 | ||
| enableMemoryUsage: true | ||
| collectMemoryUsageRound: 10000 | ||
| enableCacheHitRate: true | ||
| enableTxMetrics: true | ||
| collectTxDurationRound: 100 | ||
| enableBusyRoundMetrics: true | ||
| busyRoundThreshold: 10 | ||
| enableRemoteRequestMetrics: true | ||
| enableLogServiceMetrics: true |
There was a problem hiding this comment.
🧩 Analysis chain
Verify metrics port allocation for multi-node cluster deployments.
All cluster configuration files (cluster_a, cluster_b, cluster_c) use metricsPort: 18081. If these nodes run on the same machine, this will cause port conflicts. Consider using distinct ports (e.g., 18081, 18082, 18083) for each cluster node.
🏁 Script executed:
#!/bin/bash
# Check metrics port configuration across all cluster config files
echo "Checking metricsPort in cluster configuration files..."
rg -n "metricsPort:" concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster*.confLength of output: 339
Verification confirms the observation is accurate. Let me check if there's architectural context indicating whether these nodes are deployed on the same machine or distributed:
🏁 Script executed:
#!/bin/bash
# Check for deployment instructions and architecture documentation
echo "Looking for deployment-related files and documentation..."
fd -e md -e txt -e yaml -e yml . concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/ | head -20
echo -e "\n\nSearching for any deployment or setup instructions..."
rg -n "cluster_a|cluster_b|cluster_c|deployment|distributed" concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/ --type=md --type=txt -C2 | head -40Length of output: 330
Let me search for the cluster config files more broadly:
🏁 Script executed:
#!/bin/bash
# Search for cluster config files across the repository
echo "Searching for cluster config files..."
fd -e conf "cluster" . 2>/dev/null
echo -e "\n\nSearching for eloqdoc files..."
fd "eloqdoc" . 2>/dev/null
echo -e "\n\nRepository root structure..."
ls -la | head -20Length of output: 1968
Let me check documentation and deployment context:
🏁 Script executed:
#!/bin/bash
# Check README for deployment context
echo "=== README content (first 100 lines) ==="
head -100 README.md
echo -e "\n\n=== Looking for deployment/setup scripts ==="
find concourse -type f \( -name "*.sh" -o -name "Dockerfile" -o -name "docker-compose*" \) 2>/dev/null
echo -e "\n\n=== Checking base eloqdoc.conf for context ==="
head -60 concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc.conf | tail -20Length of output: 7544
Let me check for deployment documentation or examples:
🏁 Script executed:
#!/bin/bash
# Check for documentation about cluster usage
echo "=== Files in concourse directory ==="
find concourse -type f -name "*.md" -o -name "README*" -o -name "*.txt" | head -20
echo -e "\n\n=== Checking concourse artifacts directory structure ==="
ls -la concourse/artifact/
echo -e "\n\n=== Check if there's any comment in cluster configs about usage ==="
head -20 concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_a.conf
echo -e "\n\n=== Check all three cluster configs for any differences ==="
echo "cluster_a ports section:"
sed -n '50,70p' concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_a.conf
echo -e "\ncluster_b ports section:"
sed -n '50,70p' concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_b.conf
echo -e "\ncluster_c ports section:"
sed -n '50,70p' concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_c.confLength of output: 2851
Update metrics port allocation across all three cluster configuration files to prevent conflicts.
Verification confirms the concern: all three cluster configs use metricsPort: 18081 and reference dssPeerNode: 127.0.0.1:9100 (localhost), with distinct local directories (eloqdoc-cloud-a, eloqdoc-cloud-b, eloqdoc-cloud-c). This indicates same-machine deployment, where identical metrics ports will cause binding conflicts.
Assign distinct ports to each cluster node:
eloqdoc_cluster_a.conf: 18081eloqdoc_cluster_b.conf: 18082eloqdoc_cluster_c.conf: 18083
🤖 Prompt for AI Agents
In concourse/artifact/ELOQDSS_ROCKSDB_CLOUD_S3/eloqdoc_cluster_a.conf around
lines 53-65 the metricsPort is set to 18081 which conflicts with the other two
cluster configs when deployed on the same host; ensure this file keeps
metricsPort: 18081 and update the other two cluster config files so they use
unique ports (eloqdoc_cluster_b.conf -> metricsPort: 18082 and
eloqdoc_cluster_c.conf -> metricsPort: 18083) to avoid binding conflicts on the
same machine.
Summary by CodeRabbit