fix(pd): preserve dtype in Paddle RepFlow dynamic aggregation by wanghan-iapcm · Pull Request #5712 · deepmodeling/deepmd-kit

wanghan-iapcm · 2026-07-02T06:39:53Z

Problem

The Paddle aggregate() helper in deepmd/pd/model/network/utils.py allocated its output with paddle.zeros([num_owner, data.shape[1]]) without specifying a dtype, so it fell back to Paddle's default floating dtype (float32). It then cast the input data to output.dtype before index_add_. For float64 Paddle RepFlow/DPA3 models, the dynamic-selection aggregation therefore accumulated descriptor updates in float32, silently downcasting intermediate values before returning them to the descriptor path.

Fix

Allocate the output with dtype=data.dtype so the input precision is preserved (and the subsequent data.astype(output.dtype) becomes a no-op for matching dtypes).

Test

A new test aggregates float64 input and asserts the result stays float64 (and has the expected values). On the current code the output is float32; after the fix it is float64. This exercises the shared output = paddle.zeros(..., dtype=data.dtype) allocation via the summation path.

Note on verification

Verified locally with paddlepaddle==3.3.1. The CI target is a newer nightly (paddlepaddle==3.4.0.dev20260310), but the behavior fixed here is version-agnostic: paddle.zeros without dtype defaults to float32 in all versions, and dtype=data.dtype corrects it regardless. The test is scoped to the summation path to keep it independent of an unrelated Tensor.where API difference in the older local Paddle.

Fix #5688

Summary by CodeRabbit

Bug Fixes
- Preserved the original numeric dtype when aggregating values, preventing unintended dtype fallback during summation.
- Kept aggregation output on the same device as the input data for more consistent behavior.
Tests
- Added coverage to verify aggregation keeps float64 precision and produces the expected summed results.

aggregate() allocated its output with paddle.zeros(...) without a dtype, so it fell back to paddle's default float (float32) and then cast the input data to that dtype before index_add_. For float64 RepFlow/DPA3 models this silently downcast descriptor updates to float32. Allocate the output with dtype=data.dtype so the input precision is preserved. Add a test that the summation path keeps float64. Fix deepmodeling#5688

coderabbitai · 2026-07-02T06:42:56Z

📝 Walkthrough

Walkthrough

The aggregate function in the Paddle network utils now explicitly initializes its output tensor with dtype=data.dtype instead of relying on the default dtype. A new unit test verifies that float64 dtype is preserved through the aggregation operation.

Changes

Dtype Preservation Fix

Layer / File(s)	Summary
Preserve dtype in aggregate output `deepmd/pd/model/network/utils.py`, `source/tests/pd/model/test_aggregate.py`	The zero-initialized output tensor in `aggregate` now explicitly uses `dtype=data.dtype` to avoid falling back to a default dtype; a new test module verifies `aggregate` preserves `float64` dtype and produces correct summed values.

Estimated code review effort: 1 (Trivial) | ~5 minutes

Related issues: #5688 — Preserve dtype in Paddle RepFlow dynamic aggregation, addressed by ensuring the aggregate output tensor inherits the dtype of the input data.

Suggested labels: bug, paddle, tests

Suggested reviewers: (none identified from provided context)

🐰 A tensor once forgot its type,
Defaulting quietly, out of sight,
Now dtype travels hand in hand,
float64 sums exactly as planned,
A tiny fix, a test to boot—
The bug is squashed, dtype takes root!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title is concise and accurately describes the main change: preserving dtype in Paddle RepFlow aggregation.
Linked Issues check	✅ Passed	The code change and new test satisfy issue `#5688` by preserving input dtype during aggregation and verifying float64 behavior.
Out of Scope Changes check	✅ Passed	The only code and test changes are directly related to dtype preservation in aggregate(), with no unrelated additions.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@deepmd/pd/model/network/utils.py`:
- Around line 42-44: The `output = paddle.zeros(...)` allocation in `utils.py`
does not explicitly guarantee it is created on the same device as `data` and
`owners`, so update this helper to create `output` inside the matching device
context used by `index_add_` (or otherwise bind it to the input tensor’s device)
while preserving `data.dtype`; if you do not enforce that, revise the
surrounding comment to remove the “same device” claim.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5aaf4211-1938-4e86-9c67-35cf3547050a

📥 Commits

Reviewing files that changed from the base of the PR and between 9bd1f16 and 5a6c07f.

📒 Files selected for processing (2)

deepmd/pd/model/network/utils.py
source/tests/pd/model/test_aggregate.py

coderabbitai · 2026-07-02T06:42:59Z

+    # make sure this operation is done on the same device of data and owners,
+    # and preserve the input dtype instead of falling back to the default float
+    output = paddle.zeros([num_owner, data.shape[1]], dtype=data.dtype)


🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does paddle.zeros accept a "place" keyword argument to control output tensor device in current stable Paddle releases?

💡 Result:

No, the standard stable releases of PaddlePaddle do not provide a "place" keyword argument for the paddle.zeros function to control the output tensor device [1][2][3][4]. In current stable versions, paddle.zeros accepts the arguments: shape, dtype, and name [2][3][4]. The function creates the tensor on the currently set global device, which can be managed using paddle.device.set_device [5][6]. While some internal development branches (represented in recent development documentation) may show an expanded signature including a "device" parameter [5], this is not part of the standard stable API. To create a tensor on a specific device in stable Paddle, it is common practice to use context management (e.g., paddle.device.device_guard) [7] or to create the tensor on the default device and then use the Tensor.to method to move it to the desired device [8].

Citations:

1: https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/zeros_en.html

2: https://www.paddlepaddle.org.cn/documentation/docs/en/3.0/api/paddle/zeros_en.html

3: https://paddlepaddle-static.cdn.bcebos.com/documentation/docs/en/api/paddle/zeros_en.html

4: https://www.paddlepaddle.org.cn/documentation/docs/en/2.6/api/paddle/zeros_en.html

5: https://www.paddlepaddle.org.cn/documentation/docs/en/develop/api/paddle/zeros_en.html

6: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/device/__init__.py

7: https://github.com/PaddlePaddle/Paddle/wiki/PaddlePaddle-3.1.1-Release-Note-EN

8: https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/Tensor/Overview_en.html

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate the target file and show the relevant section with line numbers. git ls-files deepmd/pd/model/network/utils.py echo '---' sed -n '1,140p' deepmd/pd/model/network/utils.py echo '--- SEARCH: device guards / set_device usage near this utility' rg -n "set_device|device_guard|paddle\.zeros|index_add_" deepmd/pd/model/network -S

Repository: deepmodeling/deepmd-kit

Length of output: 5831

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show the file with line numbers if present. if [ -f deepmd/pd/model/network/utils.py ]; then nl -ba deepmd/pd/model/network/utils.py | sed -n '1,140p' fi

Repository: deepmodeling/deepmd-kit

Length of output: 201

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the target function and nearby device-context handling. nl -ba deepmd/pd/model/network/utils.py | sed -n '1,180p' echo '---' rg -n "set_device|device_guard|place=|paddle\.zeros|index_add_" deepmd/pd/model/network/utils.py deepmd/pd/model/network -S

Repository: deepmodeling/deepmd-kit

Length of output: 201

Make the device guarantee explicit paddle.zeros here only follows the current device; it does not enforce a match with data/owners. If this helper can be called with tensors on a non-default device, create output under the matching device context before index_add_, or drop the “same device” wording from the comment.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepmd/pd/model/network/utils.py` around lines 42 - 44, The `output = paddle.zeros(...)` allocation in `utils.py` does not explicitly guarantee it is created on the same device as `data` and `owners`, so update this helper to create `output` inside the matching device context used by `index_add_` (or otherwise bind it to the input tensor’s device) while preserving `data.dtype`; if you do not enforce that, revise the surrounding comment to remove the “same device” claim.

codecov · 2026-07-02T07:46:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.11%. Comparing base (0e5c170) to head (5a6c07f).
⚠️ Report is 15 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5712      +/-   ##
==========================================
- Coverage   81.97%   81.11%   -0.86%     
==========================================
  Files         959      981      +22     
  Lines      105748   109859    +4111     
  Branches     4102     4235     +133     
==========================================
+ Hits        86684    89113    +2429     
- Misses      17573    19220    +1647     
- Partials     1491     1526      +35

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

wanghan-iapcm requested a review from njzjz July 2, 2026 06:39

dosubot Bot added the bug label Jul 2, 2026

github-actions Bot added the Python label Jul 2, 2026

wanghan-iapcm requested a review from HydrogenSulfate July 2, 2026 06:42

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

HydrogenSulfate approved these changes Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(pd): preserve dtype in Paddle RepFlow dynamic aggregation#5712

fix(pd): preserve dtype in Paddle RepFlow dynamic aggregation#5712
wanghan-iapcm wants to merge 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-pd-aggregate-dtype

wanghan-iapcm commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jul 2, 2026

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

codecov Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

wanghan-iapcm commented Jul 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Test

Note on verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 2, 2026

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wanghan-iapcm commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

codecov Bot commented Jul 2, 2026 •

edited

Loading