Improve speed of header deid with lookup tables and caching #289
vsoch merged 13 commits into pydicom:master
Conversation
During profiling, it was identified that repeated regex lookups inside loops were taking a significant amount of time. To reduce this, regex expressions were pre-compiled outside of loops, and plain string comparison was used when possible to sidestep the performance overhead of regex matching for simple things like string equality comparison.
The `get_fields_with_lookup` function was added to augment the `get_fields` function with a lookup table that allows quick identification of exact tag matches. This optimization significantly reduced the amount of time spent in the exact matching stage of `expand_field_expression`. For top-level DICOM tag searches, the search in "Case 2" would call `name_contains` on every single tag in the DICOM dataset. With lookup tables, we can look up a contender field based on the values for which we're checking exact matches-- this becomes a key lookup problem rather than searching all fields for a matching identifier.
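The lookup-table idea can be sketched in miniature as follows. This is a simplified model with illustrative field names and structure, not the actual deid implementation:

```python
from collections import defaultdict

def build_lookup(fields):
    """Index fields by the values checked for exact matches.

    `fields` maps uid -> (tag, name); the real DicomField has more
    attributes, but the indexing idea is the same.
    """
    tables = {"tag": defaultdict(list), "name": defaultdict(list)}
    for uid, (tag, name) in fields.items():
        tables["tag"][tag].append(uid)
        tables["name"][name.lower()].append(uid)
    return tables

fields = {
    "uid1": ("0010,0010", "PatientName"),
    "uid2": ("0010,0020", "PatientID"),
}
tables = build_lookup(fields)

# An exact-match search becomes a dictionary key lookup instead of a
# scan over every field in the dataset:
tables["name"]["patientid"]  # ["uid2"]
```

The key point is that the table is built once per dataset, so each subsequent exact-match query is O(1) instead of O(number of fields).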
vsoch
left a comment
This is excellent! See my questions and comments. Akin to the other, we will need to bump the version and changelog. If you like we can merge the other with a bump to the version, and release both changes under that version.
deid/dicom/fields.py
Outdated
# only way to enable the use of caching without incurring significant
# performance overhead. Note that adding a proxy class around this
# decreases performance substantially (50% slowdown measured).
FileDataset.__hash__ = lambda self: id(self)
Is this something we should suggest for upstream (in that it might help other projects), or just appropriate to put here?
It's a good question... from my understanding, hash functions are typically supposed to be based on the value of an object (as opposed to its identity). From what I've read, objects by default use their id() as their hash method until you define an __eq__ method on the class, at which point you then have to define your own hash.
I can't think of any reasonable drawback to just using id() as the hash in practice, except that using datasets as dictionary keys might be a little funny:
some_dict = {}
ds1 = pydicom.dcmread('./somefile.dcm')
some_dict[ds1] = ds1.filename
# Later, read the same file again
ds2 = pydicom.dcmread('./somefile.dcm')
ds2 in some_dict # evaluates to False

It could be worth proposing upstream to pydicom and we could just see if the maintainer is open to the change. But maybe we could just have it here for now until we figure out the next steps on the pydicom side.
We definitely can. Ping @darcymason to discuss if there is interest.
Sorry, missed that I was pinged on this until just now. Just skimmed the conversation briefly so let me know if I've missed other discussion with my comment:
My understanding is that in Python only immutable objects should be hashable, which is why objects like list are not usable as dict keys. So Dataset/FileDataset shouldn't have that in pydicom at least, as they are clearly mutable. Perhaps a FrozenDataset subclass could be made, but it seems like a rare case to bother adding in pydicom. It would require all data elements to be defined on instantiation, and right now that can't be done nicely with keyword/value pairs.
@darcymason thank you for the insight and thoughts here! That makes sense to me, and I agree it doesn't make sense to make FileDataset hashable generally.
My concern with the approach in this PR is that this hash workaround will "leak" for people who use this library, and it could potentially cause some confusion or at least inconsistent behavior of FileDatasets. I pushed a change (726c3c8) which limits the scope of this change just to when we call the cached function, so hopefully this addresses the leakage concerns here. I'm aware this is a "hack" of sorts, but unfortunately constructing another dataset around this to more gracefully manage this problem appears to really hurt performance. Hopefully this change will at least just encapsulate it so it doesn't leak outside.
(switched the override to a context manager for better error handling in 01f564c)
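A context-manager approach to a temporary `__hash__` override could look like the following sketch. `FileDataset` here is a hypothetical stand-in class (not pydicom's), and the actual commit may differ:

```python
from contextlib import contextmanager

class FileDataset:
    """Stand-in for pydicom.FileDataset: defining __eq__ makes the
    class unhashable unless __hash__ is also defined."""
    def __eq__(self, other):
        return self is other

_SENTINEL = object()

@contextmanager
def identity_hash(cls):
    """Temporarily hash instances by identity, restoring the original
    state on exit even if the body raises."""
    original = cls.__dict__.get("__hash__", _SENTINEL)
    cls.__hash__ = lambda self: id(self)
    try:
        yield
    finally:
        if original is _SENTINEL:
            del cls.__hash__
        else:
            cls.__hash__ = original

ds = FileDataset()
with identity_hash(FileDataset):
    h = hash(ds)  # usable as a cache key only inside the block
```

The try/finally guarantees the override never leaks past the block, which addresses the earlier concern about confusing behavior for downstream users of the library.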
# Contains

- def name_contains(self, expression, whole_string=False):
+ def name_contains(self, expression):
I appreciate this refactor - the whole_string update was not to my liking.
@ReeceStevens please remove whole_string from the documentation, too.
I'm just curious, is there a way to specify whether the expression is a regex or a regular string? If everything is regex then speed may be an issue (maybe it is not bad if the expression is precompiled) and there may be complications with special characters. For example, private tags contain special characters in their names (parenthesis, dot, plus, etc. in the private creator string), which would confuse regular expressions.
It might make sense to interpret all strings as simple strings by default and use a "regex:" expander to indicate that the string should be interpreted as a regular expression.
This discussion should not hold up the PR, but I just wanted to get some feedback before I submit a new issue (just in case I missed something obvious).
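The proposed "regex:" prefix could work along these lines. This is a hypothetical sketch of the suggestion from this thread, not an existing deid expander, and `matches` is an invented helper name:

```python
import re

def matches(expression, value):
    """Treat expressions as plain strings unless prefixed with 'regex:'."""
    if expression.startswith("regex:"):
        return re.search(expression[len("regex:"):], value) is not None
    return expression == value

# Plain strings match literally, so special characters in private
# creator names are safe:
matches("Creator (1.0)+", "Creator (1.0)+")   # True
matches("regex:Patient.*", "PatientName")     # True
```

With this split, simple equality checks avoid the regex engine entirely, and users who want pattern matching opt in explicitly.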
It might make sense to interpret all strings as simple strings by default and use a "regex:" expander to indicate that the string should be interpreted as a regular expression.
The question I have is what a typical user searches for. If it's a few letters, either a regular expression or a plain string works. If it's "match this pattern," we want a regular expression. The case that isn't well handled is "match this exactly, but it looks like a pattern."
deid/dicom/fields.py
Outdated
- ELEMENT_OFFSET: 2-digit hexadecimal element number (last 8 bits of full element)
"""
regexp_expression = f"^{expression}$" if whole_string else expression
if type(expression) is str:
Any reason to not use isinstance(expression, str) here?
This was just an oversight on my part-- I'll switch to using isinstance here for the type comparisons! (I've been switching between TypeScript and Python a lot recently)
or re.search(regexp_expression, self.stripped_tag)
or re.search(regexp_expression, self.element.name)
or re.search(regexp_expression, self.element.keyword)
expression.search(self.name.lower())
This is much easier to read with the compiled regex.
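For illustration, the general pattern being praised here, compiling once and reusing the pattern object instead of re-parsing the expression on every call, looks like:

```python
import re

field_names = ["PatientName", "PatientID", "StudyDate"]

# Compile once, outside the loop...
pattern = re.compile("patient")

# ...then reuse the compiled pattern object for every field:
hits = [name for name in field_names if pattern.search(name.lower())]
# hits == ["PatientName", "PatientID"]
```

`re.search(expr, s)` also caches compiled patterns internally, but hoisting the `re.compile` out of hot loops skips the cache lookup and keeps the code explicit about where compilation happens.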
deid/dicom/fields.py
Outdated
# if no contenders provided, use top level of dicom headers
if contenders is None:
    contenders = get_fields(dicom)
    contenders, contender_lookup_tables = get_fields_with_lookup(dicom)
Usually needing to return more than one related thing is a sign that a class is warranted. In the future we might consider a class here that provides easy access to the tables as well as to getting a particular item.
That's a good idea!
Added in 0843c36. It was nice to remove all these lookup table variables lying around.
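A minimal sketch of what such a class might look like; names here are illustrative, and the actual change is in 0843c36:

```python
from collections import defaultdict

class FieldContenders:
    """Keeps the fields dict and its lookup tables together so they
    stay in sync on every add and remove."""

    def __init__(self):
        self.fields = {}
        self.by_name = defaultdict(list)  # one of several lookup tables

    def add(self, uid, name):
        self.fields[uid] = name
        self.by_name[name.lower()].append(uid)

    def remove(self, uid):
        if uid not in self.fields:
            return
        name = self.fields[uid].lower()
        if uid in self.by_name[name]:
            self.by_name[name].remove(uid)
        del self.fields[uid]

contenders = FieldContenders()
contenders.add("uid1", "PatientName")
contenders.by_name["patientname"]  # ["uid1"]
```

Bundling the tables with the fields means callers can no longer get one without the other, which removes the risk of the two drifting apart.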
deid/dicom/fields.py
Outdated
if expander.lower() in ["endswith", "startswith", "contains"]:
    if field.name_contains(expression):
        fields[uid] = field
if type(field) is str and string_matches_expander(expander, expression, field):

return fields


def field_matches_expander(expander, expression_string, expression_re, field):
Could you please make sure all functions have docstrings (you can easily convert the comments I think).
deid/dicom/fields.py
Outdated
"""
skip = skip or []
seen = seen or []
fields, new_seen, new_skip = get_fields_inner(
Possibly another opportunity for a class or Dataclass.
I changed this function to be a "private" function and kept the interface as-is for simplicity, since this is only used within the get_fields_with_lookup function.
deid/dicom/fields.py
Outdated
    "element_keyword": defaultdict(list),
}
for uid, field in fields.items():
    if type(field) is not DicomField:
isinstance? See https://switowski.com/blog/type-vs-isinstance/. It may only matter (for speed) on older versions of Python, which unfortunately are still present on many of our clusters... 🙃
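For illustration, the behavioral difference between the two checks, beyond any speed considerations:

```python
class Base:
    pass

class Child(Base):
    pass

c = Child()
isinstance(c, Base)   # True: respects inheritance
type(c) is Base       # False: exact type check only
```

`isinstance` is the idiomatic choice when subclasses should be accepted; `type(x) is T` only makes sense when an exact type match is genuinely required.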
deid/dicom/parser.py
Outdated
if not self.fields:
    self.fields = get_fields(
if not self.fields or not self.fields_by_name:
    self.fields, self.lookup_tables = get_fields_with_lookup(
We might want to reset the lookup tables when the class is initialized (or on any update to parse a different file, for example). I'm trying to think of whether there is a case where we might generate the lookup table for one dicom dataset and then load another (and have the tables mixed up and unintentionally combine patient data).
Thanks for the review and feedback @vsoch! I believe I've addressed all outstanding comments, all tests are passing, and pre-commit hooks are all green.

That sounds like a plan to me-- I'll add the changelog adjustments and version bump in the other PR.
vsoch
left a comment
I'll want to do another pass - but here are some thoughts for early discussion/review!
deid/dicom/fields.py
Outdated
if field.lower() in expanders:
    if field.lower() == "all":
        fields = contenders
        fields = contenders.fields
Just to check - the intention here isn't to make a copy that we can edit and have the class (contenders) fields not change? If the attribute is mutable I think it will still. E.g.,
class Test:
    def __init__(self, fields):
        self.fields = fields

myfields = {'1': 1, '2': [2, 2, 2], '3': '3', '4': {1: 1}}
test = Test(myfields)
fields = test.fields
fields['2'].pop()
# 2
test.fields
# {'1': 1, '2': [2, 2], '3': '3', '4': {1: 1}}
# note the list under '2' is smaller: the class's dictionary was updated through the alias
fields['4'][2] = 2
test.fields
# {'1': 1, '2': [2, 2], '3': '3', '4': {1: 1, 2: 2}}
I did not necessarily intend for this to be a deep copy-- I was trying to produce the same behavior as the previous implementation, which did pass along the direct reference to the contenders. I'm happy to put a deepcopy here if you'd like, though!
I think the decision depends on whether it is safe to potentially modify the fields that were passed. Perhaps the deepcopy would be a safe thing to add?
I think that makes sense. I didn't notice any performance hit from adding this either.
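A small illustration of why the deepcopy resolves the aliasing demonstrated earlier (stand-in class, not the deid code):

```python
from copy import deepcopy

class Contenders:
    def __init__(self, fields):
        self.fields = fields

contenders = Contenders({"1": [1, 2, 3]})
fields = deepcopy(contenders.fields)
fields["1"].pop()

contenders.fields  # still {"1": [1, 2, 3]}: the copy is independent
```

Because `deepcopy` recursively copies nested mutable values, mutations on the returned dict can no longer reach back into the class's state.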
deid/dicom/fields.py
Outdated
self.lookup_tables[table_name][key].append(field)


def remove(self, uid):
    if uid in self.fields:
Nit: the pattern I usually prefer (to avoid a level of nesting) is:
if uid not in self.fields:
    return
# same logic with one fewer nested level
deid/dicom/fields.py
Outdated
def remove(self, uid):
    if uid in self.fields:
        field = self.fields[uid]
        del self.fields[uid]
I would put this deletion line after everything in case something goes wrong.
Yeah, that's a good call!
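Putting the two review suggestions together (guard clause plus deletion last), a sketch with a minimal stand-in class:

```python
class FieldStore:
    """Minimal stand-in to show the remove() shape being discussed."""

    def __init__(self, fields):
        self.fields = dict(fields)

    def remove(self, uid):
        if uid not in self.fields:
            return  # guard clause: one less level of nesting
        field = self.fields[uid]
        # ... lookup-table cleanup using `field` would happen here ...
        # deletion last, so a failure above leaves self.fields intact
        del self.fields[uid]

store = FieldStore({"a": "PatientName"})
store.remove("missing")  # no-op, no KeyError
store.remove("a")
```

Deferring the `del` means an exception during table cleanup cannot leave the store half-updated, with the field gone from `fields` but still present in the lookup tables.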
deid/dicom/fields.py
Outdated
del self.fields[uid]
for table_name, lookup_keys in self._get_field_lookup_keys(field).items():
    for key in lookup_keys:
        if field in self.lookup_tables[table_name][key]:
Same pattern here - if the field isn't found, then continue (remove a level of nesting).
deid/dicom/fields.py
Outdated
"""
skip = skip or []
seen = seen or []
fields, new_seen, new_skip = _get_fields_inner(
Is this outer wrapper needed to support the function having the cache wrapper?
That's right-- the function decorated with @cache requires all its inputs and its return type to be hashable, so we do this conversion of lists to tuples in the wrapper function.
Thanks @vsoch -- pushed some updates in response to your feedback. Will wait to hear back on any other thoughts you've got.
if field.lower() in expanders:
    if field.lower() == "all":
        fields = contenders
        fields = deepcopy(contenders.fields)
return fields
@vsoch I'm noticing here that fields may be undefined if we pass the first conditional (field.lower() in expanders) but not the second (field.lower() == "all"). I was going to update this but wasn't sure of the intention here. Should this case return with an empty dictionary?
The intention of that block was to support:
len(dicom.dir())
# 34  <-- we have 34 fields
fields = expand_field_expression('all', dicom)
len(fields)
# 34  <-- we selected all 34 fields

Albeit the check is redundant since there is just one expander. We might want to do:

if field.lower() in expanders and field.lower() == 'all':
    return deepcopy(contenders.fields)

I might have had the idea to support other expanders (to explain the list), but right now "all" is it. We certainly don't need the check for "in expanders" but I wonder how we might check if the user provided a string there that doesn't make sense (we'd want to raise an error).
Ah, that makes sense. It seems challenging to detect if a user passed in something that doesn't make sense here, since we're technically grabbing anything in a field that precedes a ":" character and interpreting it as an expander, right? So if a user actually intended on passing in a value that includes a colon, we would see an unexpected "expander" value but we shouldn't throw an error?
At any rate, when I left this comment I hadn't zeroed in on the fact that the expanders list only contained the value of "all", so that negates any concern I had about this particular code block-- and your explanation about future no-arg expanders makes sense.
@ReeceStevens this PR LGTM. Do you have a second person that might be able to review?
Thanks @vsoch! I don't have any other folks on my end available to review right now, unfortunately. I'll be testing this in production really soon though, so I'll definitely make sure any feedback gets routed this way. Let me know if you have any other requirements you want fulfilled before merging!
That would work for me @ReeceStevens - if you want to test in production and report back, and it goes smoothly, we can merge here. Does that work?
@vsoch sure, that's fine by me. I'll get back to you at the end of this week and let you know what the production run reveals!
@ReeceStevens I was very interested by this PR and I tested things too. It worked quite well, with a great optimization of the speed! I just noticed something which I wasn't able to fully explain.
@Simlomb thank you for checking this out and highlighting this issue! You're right, it looks like when I implemented this lookup table approach I did not preserve the case-insensitive nature of the field lookup. Thank you also for providing an easy-to-run test case :) I've pushed another commit which resolves this issue-- I'm now seeing the test pass even if the private tag values are lower-case in the recipe.
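A sketch of the kind of case normalization such a fix involves; this is illustrative, not the actual commit:

```python
from collections import defaultdict

lookup = defaultdict(list)

def index_field(name, uid):
    # normalize case at insertion time...
    lookup[name.lower()].append(uid)

def find(expression):
    # ...and again at query time, so recipe casing doesn't matter
    return lookup[expression.lower()]

index_field("MyPrivateCreator", "uid1")
find("myprivatecreator")  # ["uid1"]
find("MYPRIVATECREATOR")  # ["uid1"]
```

Normalizing on both sides of the table restores the case-insensitive matching the pre-lookup-table code had, since dictionary key lookups are otherwise exact.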
Great news! We just need a bump to the version (maybe a larger one this time) and the corresponding entry to the changelog.

@ReeceStevens That's great! Thanks for implementing a fix so rapidly!
It would be great to get this merged.
@ReeceStevens could you do this? These seem to be the last two things holding up the merge.
This was considered a "patch" change since there was no interface change, but I'm fine switching to a 0.5.0 release instead of 0.4.7 if you'd prefer that.
I guess what @vsoch would like is an update to CHANGELOG.md and deid/version.py. These files are not updated yet in this PR. Bumping the version to 0.4.8 sounds appropriate for a patch (no need for 0.5.0).
You are both right! We don't need to explicitly bump the version (apologies, I forgot about that) but we should have a CHANGELOG note under the current version.
Thanks @lassoan and @vsoch. Just wanted to clarify, because I had included a changelog note for this PR in the previous PR-- see line 18 in the commit I linked to previously. I did this because I thought you had asked for it explicitly in your original PR review comment here, but now I realize I might have misread it.

Either way, is there any specific implementation note you'd like on the changelog beyond what's already there?
@vsoch I pushed a couple of tweak commits here in response to feedback from other folks in this thread, but all tests are still passing and performance improvements are holding steady.
Thank you for your excellent work!
You didn't misread it - you got exactly what I had intended.
Description
Related issues: None
I have been using deid in a webassembly execution context with Pyodide. In this configuration, performance issues are exacerbated, and I have been noticing prohibitively slow performance on DICOM header de-identification. I spent some time profiling the header deid functionality and identified three main areas where performance was getting limited:

- repeated regex lookups inside loops were taking a significant amount of time
- field.name_contains was taking up a very large portion of the overall runtime (the inner loop within case 2 of expand_field_expression)
- get_fields was being run over and over

This PR consists of three commits, each tailored to one of the above points-- plus one extra commit to fix up remaining bugs and get all tests passing.
Performance gains here are, of course, dependent on the input DICOM files as well as the deid recipe that is used. In my test setup with 69 input DICOM files, I observed the following speed improvements:

- origin/master: 35.589 seconds
- get_fields: 7.344 seconds
- get_fields: 5.304 seconds

Checklist
Open questions
In order to get caching working, I had to add a __hash__ property to pydicom.FileDataset. This is obviously not ideal, but there wasn't another way to get the caching performance boost. I had originally thought I could just wrap FileDataset in a proxy class, but the overhead of creating the proxy class seemed to be enough to slow things down as much as just getting a cache miss.