update-index: add --refresh-stat-only#2125
Conversation
When refresh_index() is invoked with REFRESH_REALLY (e.g. via "git update-index --really-refresh"), the documented behaviour is that the "assume unchanged" bit on cache entries is disregarded so that stale stat data on those entries is still refreshed. The preload pass runs before the single-threaded refresh loop and is intended to mark up-to-date entries quickly so the slow path only has to deal with the leftovers. However, preload_thread() unconditionally called ie_match_stat() with CE_MATCH_RACY_IS_DIRTY|CE_MATCH_IGNORE_FSMONITOR and never with CE_MATCH_IGNORE_VALID, so it honoured the "assume unchanged" bit. When a modified file's entry was marked assume-unchanged, preload would conclude the entry was clean and call ce_mark_uptodate(); the subsequent --really-refresh loop would then skip the entry (because ce_uptodate(ce) is true) and never report it as needing an update. This only manifests when preload is active, so it has been latent in default configurations. It is observable today via GIT_TEST_PRELOAD_INDEX=1. Plumb the refresh flags through to the preload threads via a new refresh_flags field on struct thread_data, and have preload_thread() add CE_MATCH_IGNORE_VALID to its match options when REFRESH_REALLY is in effect. Update refresh_index() to pass "flags & REFRESH_REALLY" to preload_index() instead of a bare 0. Add a regression test under t2106 that forces preload on and confirms that "update-index --really-refresh" reports a modified assume-unchanged entry as needing update. Signed-off-by: George Giorgidze <giorgidze@meta.com>
When a working tree is copied from another machine, or restored from
a tarball, container image, or CI cache on the same machine, the
files may be byte-for-byte identical while cached stat data in the
index no longer matches. Backup and sync tools can preserve mtimes,
but fields like inode and device numbers are filesystem-local, so
large repositories can still end up paying for expensive refresh
checks on every "git status".
Git already has runtime configuration for reducing which stat fields
are checked, such as core.checkStat=<minimal|default>. That affects
how future checks interpret cached stat data, but it does not provide
a one-shot way to update the index's cached stat data to match the
current filesystem without also rehashing file contents. Setting
core.checkStat=minimal is "sticky": it weakens every subsequent
operation in the repository for the duration of the configuration,
rather than performing a single, bounded correction at a well-defined
point.
A similar idea was discussed on the list in January 2017 under the
name "--assume-content-unchanged"; see the thread starting at
<20170105112359.GN8116@chrystal.oracle.com>. The concern raised there
was that exposing a way to update cached stat data without content
comparison opens the index to abuse: an interactive user could skip a
slow refresh, lie to Git about the worktree, then file a bug after a
later merge corrupts a file. That concern is taken seriously here,
and this proposal is deliberately narrower than the 2017 one:
* It is a one-shot action, not a sticky configuration or per-entry
bit. The name --refresh-stat-only reflects that: it describes
what the command does in a single invocation, not a trust state
attached to entries (contrast with --assume-unchanged).
* The trust assertion is intended for closed-loop callers (CI cache
restore, container provisioning, backup/restore tooling) where
the worktree and the index were produced or verified together by
the same process. It is not a knob for interactive users to reach
for when "git status" feels slow.
* The failure mode is named directly in the documentation: if the
worktree does not in fact match the index, affected entries will
appear clean while the recorded object ID remains stale. The user
must type the flag, having read the warning. This is a narrower
contract than core.checkStat=minimal, which silently affects
every subsequent operation.
Container-based CI has become the dominant deployment model in the
years since that 2017 discussion. The current workaround -- setting
core.checkStat=minimal in every job step, or accepting the cost of
full content rehashing -- is operationally fragile: it requires every
step in every pipeline to set and preserve the configuration, and it
permanently weakens stat semantics for every command those steps
run. A single explicit invocation at restore time is a tighter, more
local fix.
Teach git update-index --refresh-stat-only to refresh only cached
stat information. It follows the existing refresh machinery, but
skips ie_modified() and treats racy entries as dirty by stat instead
of resolving them by content. Like --really-refresh, it ignores the
"assume unchanged" setting, so stale stat data on those entries is
still updated; that behaviour is documented alongside the flag.
The preload pass is extended to recognise REFRESH_STAT_ONLY (on top
of REFRESH_REALLY, which was wired up in the preceding commit) so
that assume-unchanged entries are not marked uptodate before the main
refresh path can update them.
Add tests covering object ID preservation, missing-file handling with
and without --ignore-missing, assume-unchanged override, and quiet
output.
Signed-off-by: George Giorgidze <giorgidze@meta.com>
Welcome to GitGitGadgetHi @giorgidze, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests. Please make sure that either:
You can CC potential reviewers by adding a footer to the PR description with the following syntax: NOTE: DO NOT copy/paste your CC list from a previous GGG PR's description, Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:
It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code. Contributing the patchesBefore you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form Both the person who commented An alternative is the channel Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment If you want to see what email(s) would be sent for a After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail). If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the curl -g --user "<EMailAddress>:<Password>" \
--url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txtTo iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description): To send a new iteration, just add another PR comment with the contents: Need help?New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join. You may also be able to find help in real time in the developer IRC channel, |
|
/allow |
|
/preview |
|
Error: User giorgidze is not yet permitted to use GitGitGadget |
This two-patch series adds "git update-index --refresh-stat-only", a
one-shot way to update the index's cached stat data to match the
current filesystem without rehashing file contents.
When a working tree is produced or restored by means other than a
normal checkout -- a CI cache restore, container provisioning, a
tarball extraction, or a copy from another machine -- the files may
be byte-for-byte identical while filesystem-local stat fields like
inode and device numbers no longer match. Today the available
workarounds are to (a) pay for full content rehashing on the next
"git status", or (b) set core.checkStat=minimal, which is sticky
and weakens every subsequent operation. Neither composes well with
modern container-based CI, where every job step would otherwise
need to set and preserve the configuration.
A similar idea ("--assume-content-unchanged") was discussed in
January 2017; see the thread starting at
20170105112359.GN8116@chrystal.oracle.com. The concern raised
there was that exposing a way to update cached stat data without
content comparison opens the index to abuse. The flag in this
series is deliberately narrower than the 2017 proposal:
attached to entries (contrast --assume-unchanged);
provisioning, backup/restore tooling) that produced or verified
the worktree atomically;
the next content check -- is named directly in the docs, and
the flag must be typed explicitly.
The series is organised so the bug fix is reviewable on its own:
1/2 preload-index: respect --really-refresh override of
assume-unchanged
2/2 update-index: add --refresh-stat-only
CC: Junio C Hamano gitster@pobox.com