Skip to content

fix(pd): preserve dtype in Paddle RepFlow dynamic aggregation#5712

Open
wanghan-iapcm wants to merge 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-pd-aggregate-dtype
Open

fix(pd): preserve dtype in Paddle RepFlow dynamic aggregation#5712
wanghan-iapcm wants to merge 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-pd-aggregate-dtype

Conversation

@wanghan-iapcm

@wanghan-iapcm wanghan-iapcm commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Problem

The Paddle aggregate() helper in deepmd/pd/model/network/utils.py allocated its output with paddle.zeros([num_owner, data.shape[1]]) without specifying a dtype, so it fell back to Paddle's default floating dtype (float32). It then cast the input data to output.dtype before index_add_. For float64 Paddle RepFlow/DPA3 models, the dynamic-selection aggregation therefore accumulated descriptor updates in float32, silently downcasting intermediate values before returning them to the descriptor path.

Fix

Allocate the output with dtype=data.dtype so the input precision is preserved (and the subsequent data.astype(output.dtype) becomes a no-op for matching dtypes).

Test

A new test aggregates float64 input and asserts the result stays float64 (and has the expected values). On the current code the output is float32; after the fix it is float64. This exercises the shared output = paddle.zeros(..., dtype=data.dtype) allocation via the summation path.

Note on verification

Verified locally with paddlepaddle==3.3.1. The CI target is a newer nightly (paddlepaddle==3.4.0.dev20260310), but the behavior fixed here is version-agnostic: paddle.zeros without dtype defaults to float32 in all versions, and dtype=data.dtype corrects it regardless. The test is scoped to the summation path to keep it independent of an unrelated Tensor.where API difference in the older local Paddle.

Fix #5688

Summary by CodeRabbit

  • Bug Fixes

    • Preserved the original numeric dtype when aggregating values, preventing unintended dtype fallback during summation.
    • Kept aggregation output on the same device as the input data for more consistent behavior.
  • Tests

    • Added coverage to verify aggregation keeps float64 precision and produces the expected summed results.

aggregate() allocated its output with paddle.zeros(...) without a dtype, so it
fell back to paddle's default float (float32) and then cast the input data to
that dtype before index_add_. For float64 RepFlow/DPA3 models this silently
downcast descriptor updates to float32.

Allocate the output with dtype=data.dtype so the input precision is preserved.
Add a test that the summation path keeps float64.

Fix deepmodeling#5688
@wanghan-iapcm wanghan-iapcm requested a review from njzjz July 2, 2026 06:39
@dosubot dosubot Bot added the bug label Jul 2, 2026
@github-actions github-actions Bot added the Python label Jul 2, 2026
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The aggregate function in the Paddle network utils now explicitly initializes its output tensor with dtype=data.dtype instead of relying on the default dtype. A new unit test verifies that float64 dtype is preserved through the aggregation operation.

Changes

Dtype Preservation Fix

Layer / File(s) Summary
Preserve dtype in aggregate output
deepmd/pd/model/network/utils.py, source/tests/pd/model/test_aggregate.py
The zero-initialized output tensor in aggregate now explicitly uses dtype=data.dtype to avoid falling back to a default dtype; a new test module verifies aggregate preserves float64 dtype and produces correct summed values.

Estimated code review effort: 1 (Trivial) | ~5 minutes

Related issues: #5688 — Preserve dtype in Paddle RepFlow dynamic aggregation, addressed by ensuring the aggregate output tensor inherits the dtype of the input data.

Suggested labels: bug, paddle, tests

Suggested reviewers: (none identified from provided context)

🐰 A tensor once forgot its type,
Defaulting quietly, out of sight,
Now dtype travels hand in hand,
float64 sums exactly as planned,
A tiny fix, a test to boot—
The bug is squashed, dtype takes root!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is concise and accurately describes the main change: preserving dtype in Paddle RepFlow aggregation.
Linked Issues check ✅ Passed The code change and new test satisfy issue #5688 by preserving input dtype during aggregation and verifying float64 behavior.
Out of Scope Changes check ✅ Passed The only code and test changes are directly related to dtype preservation in aggregate(), with no unrelated additions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pd/model/network/utils.py`:
- Around line 42-44: The `output = paddle.zeros(...)` allocation in `utils.py`
does not explicitly guarantee it is created on the same device as `data` and
`owners`, so update this helper to create `output` inside the matching device
context used by `index_add_` (or otherwise bind it to the input tensor’s device)
while preserving `data.dtype`; if you do not enforce that, revise the
surrounding comment to remove the “same device” claim.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5aaf4211-1938-4e86-9c67-35cf3547050a

📥 Commits

Reviewing files that changed from the base of the PR and between 9bd1f16 and 5a6c07f.

📒 Files selected for processing (2)
  • deepmd/pd/model/network/utils.py
  • source/tests/pd/model/test_aggregate.py

Comment on lines +42 to +44
# make sure this operation is done on the same device of data and owners,
# and preserve the input dtype instead of falling back to the default float
output = paddle.zeros([num_owner, data.shape[1]], dtype=data.dtype)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does paddle.zeros accept a "place" keyword argument to control output tensor device in current stable Paddle releases?

💡 Result:

No, the standard stable releases of PaddlePaddle do not provide a "place" keyword argument for the paddle.zeros function to control the output tensor device [1][2][3][4]. In current stable versions, paddle.zeros accepts the arguments: shape, dtype, and name [2][3][4]. The function creates the tensor on the currently set global device, which can be managed using paddle.device.set_device [5][6]. While some internal development branches (represented in recent development documentation) may show an expanded signature including a "device" parameter [5], this is not part of the standard stable API. To create a tensor on a specific device in stable Paddle, it is common practice to use context management (e.g., paddle.device.device_guard) [7] or to create the tensor on the default device and then use the Tensor.to method to move it to the desired device [8].

Citations:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the target file and show the relevant section with line numbers.
git ls-files deepmd/pd/model/network/utils.py
echo '---'
sed -n '1,140p' deepmd/pd/model/network/utils.py

echo '--- SEARCH: device guards / set_device usage near this utility'
rg -n "set_device|device_guard|paddle\.zeros|index_add_" deepmd/pd/model/network -S

Repository: deepmodeling/deepmd-kit

Length of output: 5831


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the file with line numbers if present.
if [ -f deepmd/pd/model/network/utils.py ]; then
  nl -ba deepmd/pd/model/network/utils.py | sed -n '1,140p'
fi

Repository: deepmodeling/deepmd-kit

Length of output: 201


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the target function and nearby device-context handling.
nl -ba deepmd/pd/model/network/utils.py | sed -n '1,180p'

echo '---'
rg -n "set_device|device_guard|place=|paddle\.zeros|index_add_" deepmd/pd/model/network/utils.py deepmd/pd/model/network -S

Repository: deepmodeling/deepmd-kit

Length of output: 201


Make the device guarantee explicit paddle.zeros here only follows the current device; it does not enforce a match with data/owners. If this helper can be called with tensors on a non-default device, create output under the matching device context before index_add_, or drop the “same device” wording from the comment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deepmd/pd/model/network/utils.py` around lines 42 - 44, The `output =
paddle.zeros(...)` allocation in `utils.py` does not explicitly guarantee it is
created on the same device as `data` and `owners`, so update this helper to
create `output` inside the matching device context used by `index_add_` (or
otherwise bind it to the input tensor’s device) while preserving `data.dtype`;
if you do not enforce that, revise the surrounding comment to remove the “same
device” claim.

@codecov

codecov Bot commented Jul 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.11%. Comparing base (0e5c170) to head (5a6c07f).
⚠️ Report is 15 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5712      +/-   ##
==========================================
- Coverage   81.97%   81.11%   -0.86%     
==========================================
  Files         959      981      +22     
  Lines      105748   109859    +4111     
  Branches     4102     4235     +133     
==========================================
+ Hits        86684    89113    +2429     
- Misses      17573    19220    +1647     
- Partials     1491     1526      +35     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Code scan] Preserve dtype in Paddle RepFlow dynamic aggregation

2 participants