
Improve speed of header deid with lookup tables and caching#289

Merged
vsoch merged 13 commits into pydicom:master from ReeceStevens:improve-deid-speed
Oct 7, 2025

Conversation

@ReeceStevens
Contributor

Description

Related issues: None

I have been using deid in a WebAssembly execution context with Pyodide. In this configuration, performance issues are exacerbated, and I was seeing prohibitively slow DICOM header de-identification. I spent some time profiling the header deid functionality and identified three main bottlenecks:

  1. Regex generation was taking a significant amount of time
  2. field.name_contains (the inner loop within case 2 of expand_field_expression) was taking a very large portion of the overall runtime
  3. get_fields was being run over and over

This PR consists of three commits, each tailored to one of the above points, plus one extra commit to fix remaining bugs and get all tests passing.

Performance gains here are, of course, dependent on the input DICOM files as well as the deid recipe that is used. In my test setup with 69 input DICOM files, I observed the following speed improvements:

  • origin/master: 35.589 seconds
  • After pre-compiling regexes: 26.727 seconds
  • After using lookup tables for get_fields: 7.344 seconds
  • After adding caching to get_fields: 5.304 seconds
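The first optimization is the standard pattern of hoisting regex compilation out of hot loops. A minimal, self-contained sketch of the idea (the field names and pattern here are made up for illustration, not taken from the deid codebase):

```python
import re

# Hypothetical field names standing in for DICOM keyword strings.
FIELD_NAMES = ["PatientName", "PatientID", "StudyDate"] * 10_000

def match_interpreted(pattern, names):
    # re.search re-resolves the pattern string on every call.
    return [n for n in names if re.search(pattern, n)]

def match_precompiled(pattern, names):
    # Compile once outside the loop; reuse the compiled object inside it.
    compiled = re.compile(pattern)
    return [n for n in names if compiled.search(n)]

# Both produce identical results; only the per-call overhead differs.
assert match_interpreted("^Patient", FIELD_NAMES) == match_precompiled("^Patient", FIELD_NAMES)
```

Python does cache compiled patterns internally, but each `re.search` call still pays a cache lookup plus argument handling, which adds up in tight loops over thousands of fields.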

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • My code follows the style guidelines of this project

Open questions

In order to get caching working, I had to add a __hash__ method to pydicom.FileDataset. This is obviously not ideal, but there wasn't another way to get the caching performance boost. I had originally thought I could just wrap FileDataset in a proxy class, but the overhead of creating the proxy class turned out to be enough to slow things down about as much as a cache miss.

During profiling, it was identified that repeated regex lookups inside
loops were taking a significant amount of time. To reduce this, regex
expressions were pre-compiled outside of loops, and plain string
comparison was used when possible to sidestep the performance overhead
of regex matching for simple things like string equality comparison.
The `get_fields_with_lookup` function was added to augment the
`get_fields` function with a lookup table that allows quick
identification of exact tag matches.

This optimization significantly reduced the amount of time spent in the
exact matching stage of `expand_field_expression`. For top-level DICOM
tag searches, the search in "Case 2" would call `name_contains` on every
single tag in the DICOM dataset. With lookup tables, we can look up a
contender field based on the values for which we're checking exact
matches-- this becomes a key lookup problem rather than searching all
fields for a matching identifier.
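The lookup-table idea can be sketched like this (a deliberately simplified illustration; the real `get_fields_with_lookup` builds tables over `DicomField` objects with more attributes):

```python
from collections import defaultdict

# Hypothetical simplified fields: uid -> (tag, name).
fields = {
    "0010,0010": ("0010,0010", "PatientName"),
    "0010,0020": ("0010,0020", "PatientID"),
}

# Build lookup tables once, keyed on each attribute we exact-match against.
lookup_tables = {"tag": defaultdict(list), "name": defaultdict(list)}
for uid, (tag, name) in fields.items():
    lookup_tables["tag"][tag].append(uid)
    lookup_tables["name"][name.lower()].append(uid)

# An exact match is now an O(1) key lookup instead of a scan over all fields.
assert lookup_tables["name"]["patientname"] == ["0010,0010"]
assert lookup_tables["tag"]["0010,0020"] == ["0010,0020"]
```

Note the key normalization when building the table: the case-sensitivity bug Simlomb reports later in this thread came from skipping exactly that step.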
Member

@vsoch vsoch left a comment

This is excellent! See my questions and comments. Akin to the other, we will need to bump the version and changelog. If you like we can merge the other with a bump to the version, and release both changes under that version.

# only way to enable the use of caching without incurring significant
# performance overhead. Note that adding a proxy class around this
# decreases performance substantially (50% slowdown measured).
FileDataset.__hash__ = lambda self: id(self)
Member

Is this something we should suggest for upstream (in that it might help other projects), or just appropriate to put here?

Contributor Author

It's a good question... from my understanding, hash functions are typically supposed to be based on the value of an object (as opposed to its identity). From what I've read, objects by default use their id() as their hash method until you define an __eq__ method on the class, at which point you then have to define your own hash.

I can't think of any reasonable drawback to just using id() as the hash in practice, except that using datasets as dictionary keys might be a little funny:

some_dict = {}
ds1 = pydicom.dcmread('./somefile.dcm')
some_dict[ds1] = ds1.filename

# Later, read the same file again
ds2 = pydicom.dcmread('./somefile.dcm')
ds2 in some_dict # evaluates to False: ds2 is a distinct object, so its id()-based hash differs

It could be worth proposing upstream to pydicom and we could just see if the maintainer is open to the change. But maybe we could just have it here for now until we figure out the next steps on the pydicom side.

Member

We definitely can. Ping @darcymason to discuss if there is interest.

Member

Sorry, missed that I was pinged on this until just now. Just skimmed the conversation briefly so let me know if I've missed other discussion with my comment:

My understanding is that in Python only immutable objects should be hashable, which is why objects like list are not usable as dict keys. So Dataset/FileDataset shouldn't have that in pydicom, at least, as they are clearly mutable. Perhaps a FrozenDataset subclass could be made, but it seems like too rare a case to bother adding in pydicom. It would require all data elements to be defined on instantiation, and right now that can't be done nicely with keyword/value pairs.
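The behavior described here is easy to verify in plain Python, no pydicom required (a small demonstration, not project code):

```python
# User-defined classes hash by identity (effectively id()-based) by default.
class Plain:
    pass

p = Plain()
assert {p: 1}[p] == 1  # usable as a dict key out of the box

# Defining __eq__ without __hash__ sets __hash__ to None, making instances
# unhashable -- the same reason mutable builtins like list can't be dict keys.
class WithEq:
    def __eq__(self, other):
        return True

try:
    hash(WithEq())
    raised = False
except TypeError:
    raised = True
assert raised
```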

Contributor Author

@darcymason thank you for the insight and thoughts here! That makes sense to me, and I agree it doesn't make sense to make FileDataset hashable generally.

My concern with the approach in this PR is that this hash workaround will "leak" for people who use this library, and it could potentially cause some confusion or at least inconsistent behavior of FileDatasets. I pushed a change (726c3c8) which limits the scope of this change just to when we call the cached function, so hopefully this addresses the leakage concerns here. I'm aware this is a "hack" of sorts, but unfortunately constructing another dataset around this to more gracefully manage this problem appears to really hurt performance. Hopefully this change will at least just encapsulate it so it doesn't leak outside.

Contributor Author

(switched the override to a context manager for better error handling 01f564c)

# Contains

def name_contains(self, expression, whole_string=False):
def name_contains(self, expression):
Member

I appreciate this refactor - the whole_string update was not to my liking.

Contributor

@ReeceStevens please remove whole_string from the documentation, too.

I'm just curious: is there a way to specify whether the expression is a regex or a plain string? If everything is treated as a regex then speed may be an issue (though perhaps not if the expression is precompiled), and there may be complications with special characters. For example, private tags contain special characters in their names (parentheses, dots, pluses, etc. in the private creator string), which would confuse regular expressions.

It might make sense to interpret all strings as simple strings by default and use a "regex:" expander to indicate that the string should be interpreted as a regular expression.

This discussion should not hold up the PR, but I just wanted to get some feedback before I submit a new issue (just in case I missed something obvious).
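For what it's worth, the proposed "regex:" expander could be prototyped in a few lines (purely a sketch of the idea; this is not deid's actual recipe syntax or API):

```python
import re

def build_matcher(expression):
    """Interpret 'regex:...' as a pattern; treat anything else as literal text."""
    if expression.startswith("regex:"):
        pattern = re.compile(expression[len("regex:"):])
        return lambda name: pattern.search(name) is not None
    # Literal substring comparison sidesteps regex special characters,
    # e.g. private creator strings containing '(', '.', or '+'.
    return lambda name: expression in name

literal = build_matcher("SIEMENS (priv+)")
assert literal("SIEMENS (priv+) 0019,xx10")  # parentheses/plus taken literally

regex = build_matcher("regex:^Patient.*Name$")
assert regex("PatientName")
assert not regex("OtherPatientIDs")
```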

Member

It might make sense to interpret all strings as simple strings by default and use a "regex:" expander to indicate that the string should be interpreted as a regular expression.

The question I have is what a typical user searches for. If it's a few letters, either a regular expression or a plain string works. If it's "match this pattern", we want a regular expression. The case that isn't well handled is "match this exactly, but it looks like a pattern".

Contributor Author

@lassoan good catch, updated in 8edf5d6.

- ELEMENT_OFFSET: 2-digit hexadecimal element number (last 8 bits of full element)
"""
regexp_expression = f"^{expression}$" if whole_string else expression
if type(expression) is str:
Member

Any reason to not use isinstance(expression, str) here?

Contributor Author

This was just an oversight on my part-- I'll switch to using isinstance here for the type comparisons! (I've been switching between TypeScript and Python a lot recently)
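The practical difference matters here because pydicom's classes subclass one another (FileDataset extends Dataset): isinstance accepts subclass instances, while an exact type check rejects them. A standalone illustration with stand-in classes:

```python
class Dataset:                 # stand-in for pydicom.Dataset
    pass

class FileDataset(Dataset):    # stand-in for pydicom.FileDataset
    pass

fds = FileDataset()

assert isinstance(fds, Dataset)       # subclass instances pass isinstance
assert type(fds) is not Dataset       # but fail the exact-type comparison
assert isinstance("expression", str)  # the idiomatic check for strings
```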

Contributor Author

Addressed in dc6fc50

or re.search(regexp_expression, self.stripped_tag)
or re.search(regexp_expression, self.element.name)
or re.search(regexp_expression, self.element.keyword)
expression.search(self.name.lower())
Member

This is much easier to read with the compiled regex.

# if no contenders provided, use top level of dicom headers
if contenders is None:
contenders = get_fields(dicom)
contenders, contender_lookup_tables = get_fields_with_lookup(dicom)
Member

Usually needing to return >1 related thing is a pattern or sign for a class. In the future we might consider a class here that has easy accessibility to the tables and then getting a particular item.

Contributor Author

That's a good idea!

Contributor Author

Added in 0843c36. It was nice to remove all these lookup table variables lying around

if expander.lower() in ["endswith", "startswith", "contains"]:
if field.name_contains(expression):
fields[uid] = field
if type(field) is str and string_matches_expander(expander, expression, field):
Member

isinstance here again?

Contributor Author


return fields

def field_matches_expander(expander, expression_string, expression_re, field):
Member

Could you please make sure all functions have docstrings (you can easily convert the comments I think).

Contributor Author

Resolved in 37adf90

"""
skip = skip or []
seen = seen or []
fields, new_seen, new_skip = get_fields_inner(
Member

Possibly another opportunity for a class or Dataclass.

Contributor Author

I changed this function to be a "private" function and kept the interface as-is for simplicity, since this is only used within the get_fields_with_lookup function.

"element_keyword": defaultdict(list),
}
for uid, field in fields.items():
if type(field) is not DicomField:
Member

isinstance? See https://switowski.com/blog/type-vs-isinstance/. it may only matter (for speed) for older versions of Python, which unfortunately are still present on many of our clusters... 🙃

Contributor Author

if not self.fields:
self.fields = get_fields(
if not self.fields or not self.fields_by_name:
self.fields, self.lookup_tables = get_fields_with_lookup(
Member

We might want to reset the lookup tables when the class is initialized (or on any update to parse a different file, for example). I'm trying to think of whether there is a case where we might generate the lookup table for one dicom dataset and then load another (and have the tables mixed up, unintentionally combining patient data).

Contributor Author

Added as a part of 0843c36!

@ReeceStevens
Contributor Author

Thanks for the review and feedback @vsoch ! I believe I've addressed all outstanding comments, all tests are passing, and pre-commit hooks are all green.

If you like we can merge the other with a bump to the version, and release both changes under that version.

That sounds like a plan to me-- I'll add the changelog adjustments and version bump in the other PR.

@ReeceStevens ReeceStevens requested a review from vsoch September 24, 2025 10:57
Member

@vsoch vsoch left a comment

I'll want to do another pass - but here are some thoughts for early discussion/review!

if field.lower() in expanders:
if field.lower() == "all":
fields = contenders
fields = contenders.fields
Member

Just to check - the intention here isn't to make a copy that we can edit without the class (contenders) fields changing? If the attribute is mutable, I think they still will. E.g.,

class Test:
    def __init__(self, fields):
        self.fields = fields


myfields = {'1': 1, '2': [2, 2, 2], '3': '3', '4': {1: 1}}
test = Test(myfields)
fields = test.fields

fields['2'].pop()
# 2

# Note the list under '2' is now shorter in the class, too
test.fields
# {'1': 1, '2': [2, 2], '3': '3', '4': {1: 1}}

# Note the nested dictionary updated as well
fields['4'][2] = 2
test.fields
# {'1': 1, '2': [2, 2], '3': '3', '4': {1: 1, 2: 2}}

Contributor Author

I did not necessarily intend for this to be a deep copy-- I was trying to produce the same behavior as the previous implementation, which did pass along the direct reference to the contenders. I'm happy to put a deepcopy here if you'd like, though!

Member

I think the decision depends on whether it is safe to potentially modify the fields that were passed. Perhaps the deepcopy would be a safe thing to add?

Contributor Author

I think that makes sense. I didn't notice any performance hit from adding this either.

self.lookup_tables[table_name][key].append(field)

def remove(self, uid):
if uid in self.fields:
Member

Nit: the pattern I usually prefer (to avoid a level of nesting) is:

if uid not in self.fields:
    return
# same logic with one fewer nested level

def remove(self, uid):
if uid in self.fields:
field = self.fields[uid]
del self.fields[uid]
Member

I would put this deletion line after everything in case something goes wrong.

Contributor Author

Yeah, that's a good call!

del self.fields[uid]
for table_name, lookup_keys in self._get_field_lookup_keys(field).items():
for key in lookup_keys:
if field in self.lookup_tables[table_name][key]:
Member

Same pattern here - if the field isn't found, then continue (remove a level of nesting).

"""
skip = skip or []
seen = seen or []
fields, new_seen, new_skip = _get_fields_inner(
Member

Is this outer wrapper needed to support the function having the cache wrapper?

Contributor Author

That's right-- the function decorated with @cache requires all of its arguments to be hashable, so we do this conversion of lists to tuples in the wrapper function.
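The wrapper pattern being described can be sketched as follows (hypothetical simplified signatures; the real `get_fields` walks a pydicom dataset):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # equivalent to functools.cache on Python 3.9+
def _get_fields_inner(dataset, skip, seen):
    # Arguments arrive as hashable tuples so they can serve as cache keys.
    return tuple(f for f in dataset if f not in skip and f not in seen)

def get_fields(dataset, skip=None, seen=None):
    """Public wrapper: accepts lists, converts them to tuples for the cache."""
    return list(_get_fields_inner(dataset, tuple(skip or ()), tuple(seen or ())))

fields = get_fields(("PatientName", "PatientID"), skip=["PatientID"])
assert fields == ["PatientName"]
```

A second call with the same arguments hits the cache in `_get_fields_inner` rather than recomputing.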

@ReeceStevens
Contributor Author

Thanks @vsoch -- pushed some updates in response to your feedback. Will wait to hear back on any other thoughts you've got.

Comment on lines 244 to 247
if field.lower() in expanders:
if field.lower() == "all":
fields = contenders
fields = deepcopy(contenders.fields)
return fields
Contributor Author

@vsoch I'm noticing here that fields may be undefined if we pass the first conditional (field.lower() in expanders) but not the second (field.lower() == "all"). I was going to update this but wasn't sure of the intention here. Should this case return with an empty dictionary?

Member

The intention of that block was to support:

len(dicom.dir())
# 34   <-- we have 34 fields
fields = expand_field_expression('all', dicom)
len(fields)
# 34   <-- we selected all 34 fields

Albeit the check is redundant since there is just one expander. We might want to do:

if field.lower() in expanders and field.lower() == 'all':
    return deepcopy(contenders.fields)

I might have had the idea to support other expanders (to explain the list), but right now "all" is it. We certainly don't need the check for "in expanders" but I wonder how we might check if the user provided a string there that doesn't make sense (we'd want to raise an error).

Contributor Author

Ah, that makes sense. It seems challenging to detect if a user passed in something that doesn't make sense here, since we're technically grabbing anything in a field that precedes a ":" character and interpreting it as an expander, right? So if a user actually intended on passing in a value that includes a colon, we would see an unexpected "expander" value but we shouldn't throw an error?

At any rate, when I left this comment I hadn't zeroed in on the fact that the expanders list only contained the value of "all", so that negates any concern I had about this particular code block-- and your explanation about future no-arg expanders makes sense.

@vsoch
Member

vsoch commented Sep 28, 2025

@ReeceStevens this PR LGTM. Do you have a second person that might be able to review?

@ReeceStevens
Contributor Author

ReeceStevens commented Sep 28, 2025

Thanks @vsoch ! I don’t have any other folks on my end available to review right now, unfortunately. I’ll be testing this in production really soon though so I’ll definitely make sure any feedback gets routed this way.

Let me know if you have any other requirements you want fulfilled before merging!

@vsoch
Member

vsoch commented Sep 28, 2025

That would work for me @ReeceStevens - if you want to test in production and report back, if it goes smoothly we can merge here. Does that work?

@ReeceStevens
Contributor Author

@vsoch sure, that's fine by me. I'll get back to you at the end of this week and let you know what the production run reveals!

@Simlomb
Contributor

Simlomb commented Sep 29, 2025

@ReeceStevens I was very interested in this PR and tested it as well. It worked quite well, with a great speed improvement! I just noticed something I wasn't able to fully explain.
I used a recipe that I had used many times without issues. Some private tags in my recipe were specified with lowercase letters (for the group and element), and they didn't get processed. I modified the recipe to use only capital letters and it worked fine.
This can be tested easily by modifying the recipe at deid/examples/deid/deid.dicom-private-creator-syntax; the relevant test is test_remove_single_tag_private_creator_syntax_3 in deid/tests/test_remove_action.py

@ReeceStevens
Contributor Author

@Simlomb thank you for checking this out and highlighting this issue! You're right, it looks like when I implemented this lookup table approach I did not preserve the case-insensitive nature of the field lookup. Thank you also for providing an easy-to-run test case :) I've pushed another commit which resolves this issue-- I'm now seeing the test pass even if the private tag values are lower-case in the recipe.

@ReeceStevens
Contributor Author

@vsoch Just checking in, after a week of running this in production I haven't had any errors reported (other than the one @Simlomb pointed out, which is now fixed). Let me know if there's anything else you need prior to merge!

@vsoch
Member

vsoch commented Oct 3, 2025

Great news! We just need a bump to the version (maybe a larger one this time) and the corresponding entry to the changelog.

@Simlomb
Contributor

Simlomb commented Oct 6, 2025

@Simlomb thank you for checking this out and highlighting this issue! You're right, it looks like when I implemented this lookup table approach I did not preserve the case-insensitive nature of the field lookup. Thank you also for providing an easy-to-run test case :) I've pushed another commit which resolves this issue-- I'm now seeing the test pass even if the private tag values are lower-case in the recipe.

@ReeceStevens That's great! Thanks for implementing a fix so rapidly!

@lassoan
Contributor

lassoan commented Oct 6, 2025

It would be great to get this merged.

We just need a bump to the version (maybe a larger one this time) and the corresponding entry to the changelog

@ReeceStevens could you do this? These seem to be the last two things holding up the merge.

@ReeceStevens
Contributor Author

@vsoch I think we already did this in the prior PR (see 86e7b68). Do you need anything additional?

@ReeceStevens
Contributor Author

This was considered a "patch" change since there was no interface change, but I'm fine switching to a 0.5.0 release instead of 0.4.7 if you'd prefer that.

@lassoan
Contributor

lassoan commented Oct 6, 2025

I guess what @vsoch would like is an update to CHANGELOG.md and deid/version.py. These files are not updated yet in this PR. Bumping the version to 0.4.8 sounds appropriate for a patch (no need for 0.5.0).

@vsoch
Member

vsoch commented Oct 6, 2025

You are both right! We don't need to explicitly bump the version (apologies I forgot about that) but we should have a CHANGELOG note under the current.

@ReeceStevens
Contributor Author

Thanks @lassoan and @vsoch. Just wanted to clarify because I had included a changelog note for this PR in the previous PR-- see line 18 in the commit I linked to previously:

 - Improve performance of header deid with caching and lookup tables [#289](https://github.com/pydicom/deid/pull/289)

I did this because I thought you had asked for it explicitly in your original PR review comment here, but now I realize I might have misread it:

Akin to the other, we will need to bump the version and changelog. If you like we can merge the other with a bump to the version, and release both changes under that version.

Either way, is there any specific implementation note you'd like on the changelog beyond what's already there?

@ReeceStevens
Contributor Author

@vsoch I pushed a couple of tweak commits here in response to feedback from other folks in this thread, but all tests are still passing and performance improvements are holding steady.

@vsoch vsoch merged commit 95c5612 into pydicom:master Oct 7, 2025
3 checks passed
@vsoch
Member

vsoch commented Oct 7, 2025

Thank you for your excellent work!

I did this because I thought you had asked for it explicitly in your original PR review comment here, but now I realize I might have misread it:

You didn't misread it - you got exactly what I had intended.

