Improve speed of header deid with lookup tables and caching #289
vsoch merged 13 commits into pydicom:master
Conversation
During profiling, it was identified that repeated regex lookups inside loops were taking a significant amount of time. To reduce this, regex expressions were pre-compiled outside of loops, and plain string comparison was used when possible to sidestep the performance overhead of regex matching for simple things like string equality comparison.
The `get_fields_with_lookup` function was added to augment the `get_fields` function with a lookup table that allows quick identification of exact tag matches. This optimization significantly reduced the amount of time spent in the exact matching stage of `expand_field_expression`. For top-level DICOM tag searches, the search in "Case 2" would call `name_contains` on every single tag in the DICOM dataset. With lookup tables, we can look up a contender field based on the values for which we're checking exact matches-- this becomes a key lookup problem rather than searching all fields for a matching identifier.
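The lookup-table idea can be sketched in miniature as follows. This is a simplified model with illustrative field names and structure, not the actual deid implementation:

```python
from collections import defaultdict

def build_lookup(fields):
    """Index fields by the values checked for exact matches.

    `fields` maps uid -> (tag, name); the real DicomField has more
    attributes, but the indexing idea is the same.
    """
    tables = {"tag": defaultdict(list), "name": defaultdict(list)}
    for uid, (tag, name) in fields.items():
        tables["tag"][tag].append(uid)
        tables["name"][name.lower()].append(uid)
    return tables

fields = {
    "uid1": ("0010,0010", "PatientName"),
    "uid2": ("0010,0020", "PatientID"),
}
tables = build_lookup(fields)

# An exact-match search becomes a dictionary key lookup instead of a
# scan over every field in the dataset:
tables["name"]["patientid"]  # ["uid2"]
```

The key point is that the table is built once per dataset, so each subsequent exact-match query is O(1) instead of O(number of fields).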
vsoch
left a comment
This is excellent! See my questions and comments. Akin to the other, we will need to bump the version and changelog. If you like we can merge the other with a bump to the version, and release both changes under that version.
deid/dicom/fields.py
Outdated
# only way to enable the use of caching without incurring significant
# performance overhead. Note that adding a proxy class around this
# decreases performance substantially (50% slowdown measured).
FileDataset.__hash__ = lambda self: id(self)
Is this something we should suggest for upstream (in that it might help other projects), or just appropriate to put here?
It's a good question... from my understanding, hash functions are typically supposed to be based on the value of an object (as opposed to its identity). From what I've read, objects by default use their id() as their hash method until you define an __eq__ method on the class, at which point you then have to define your own hash.
I can't think of any reasonable drawback to just using id() as the hash in practice, except that using datasets as dictionary keys might be a little funny:
some_dict = {}
ds1 = pydicom.dcmread('./somefile.dcm')
some_dict[ds1] = ds1.filename
# Later, read the same file again
ds2 = pydicom.dcmread('./somefile.dcm')
ds2 in some_dict # evaluates to False

It could be worth proposing upstream to pydicom and we could just see if the maintainer is open to the change. But maybe we could just have it here for now until we figure out the next steps on the pydicom side.
We definitely can. Ping @darcymason to discuss if there is interest.
Sorry, missed that I was pinged on this until just now. Just skimmed the conversation briefly so let me know if I've missed other discussion with my comment:
My understanding is that in Python only immutable objects should be hashable, which is why objects like list are not usable as dict keys. So Dataset/FileDataset shouldn't have that in pydicom at least, as they are clearly mutable. Perhaps a FrozenDataset subclass could be made, but it seems like a rare case to bother adding in pydicom. It would require all data elements to be defined on instantiation, and right now that can't be done nicely with keyword/value pairs.
@darcymason thank you for the insight and thoughts here! That makes sense to me, and I agree it doesn't make sense to make FileDataset hashable generally.
My concern with the approach in this PR is that this hash workaround will "leak" for people who use this library, and it could potentially cause some confusion or at least inconsistent behavior of FileDatasets. I pushed a change (726c3c8) which limits the scope of this change just to when we call the cached function, so hopefully this addresses the leakage concerns here. I'm aware this is a "hack" of sorts, but unfortunately constructing another dataset around this to more gracefully manage this problem appears to really hurt performance. Hopefully this change will at least just encapsulate it so it doesn't leak outside.
(switched the override to a context manager for better error handling in 01f564c)
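A context-manager approach to a temporary `__hash__` override could look like the following sketch. `FileDataset` here is a hypothetical stand-in class (not pydicom's), and the actual commit may differ:

```python
from contextlib import contextmanager

class FileDataset:
    """Stand-in for pydicom.FileDataset: defining __eq__ makes the
    class unhashable unless __hash__ is also defined."""
    def __eq__(self, other):
        return self is other

_SENTINEL = object()

@contextmanager
def identity_hash(cls):
    """Temporarily hash instances by identity, restoring the original
    state on exit even if the body raises."""
    original = cls.__dict__.get("__hash__", _SENTINEL)
    cls.__hash__ = lambda self: id(self)
    try:
        yield
    finally:
        if original is _SENTINEL:
            del cls.__hash__
        else:
            cls.__hash__ = original

ds = FileDataset()
with identity_hash(FileDataset):
    h = hash(ds)  # usable as a cache key only inside the block
```

The try/finally guarantees the override never leaks past the block, which addresses the earlier concern about confusing behavior for downstream users of the library.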
# Contains

- def name_contains(self, expression, whole_string=False):
+ def name_contains(self, expression):
I appreciate this refactor - the whole_string update was not to my liking.
@ReeceStevens please remove whole_string from the documentation, too.
I'm just curious, is there a way to specify whether the expression is a regex or a regular string? If everything is regex then speed may be an issue (maybe it is not bad if the expression is precompiled) and there may be complications with special characters. For example, private tags contain special characters in their names (parenthesis, dot, plus, etc. in the private creator string), which would confuse regular expressions.
It might make sense to interpret all strings as simple strings by default and use a "regex:" expander to indicate that the string should be interpreted as a regular expression.
This discussion should not hold up the PR, but I just wanted to get some feedback before I submit a new issue (just in case I missed something obvious).
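The proposed "regex:" prefix could work along these lines. This is a hypothetical sketch of the suggestion from this thread, not an existing deid expander, and `matches` is an invented helper name:

```python
import re

def matches(expression, value):
    """Treat expressions as plain strings unless prefixed with 'regex:'."""
    if expression.startswith("regex:"):
        return re.search(expression[len("regex:"):], value) is not None
    return expression == value

# Plain strings match literally, so special characters in private
# creator names are safe:
matches("Creator (1.0)+", "Creator (1.0)+")   # True
matches("regex:Patient.*", "PatientName")     # True
```

With this split, simple equality checks avoid the regex engine entirely, and users who want pattern matching opt in explicitly.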
It might make sense to interpret all strings as simple strings by default and use a "regex:" expander to indicate that the string should be interpreted as a regular expression.
The question I have is what a typical user searches for. If it's a few letters, either a regular expression or a plain string works. If it's "match this pattern," we want a regular expression. The case that isn't well handled is "match this exactly, but it looks like a pattern."
deid/dicom/fields.py
Outdated
- ELEMENT_OFFSET: 2-digit hexadecimal element number (last 8 bits of full element)
"""
regexp_expression = f"^{expression}$" if whole_string else expression
if type(expression) is str:
Any reason to not use isinstance(expression, str) here?
This was just an oversight on my part-- I'll switch to using isinstance here for the type comparisons! (I've been switching between TypeScript and Python a lot recently)
or re.search(regexp_expression, self.stripped_tag)
or re.search(regexp_expression, self.element.name)
or re.search(regexp_expression, self.element.keyword)
expression.search(self.name.lower())
This is much easier to read with the compiled regex.
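For illustration, the general pattern being praised here, compiling once and reusing the pattern object instead of re-parsing the expression on every call, looks like:

```python
import re

field_names = ["PatientName", "PatientID", "StudyDate"]

# Compile once, outside the loop...
pattern = re.compile("patient")

# ...then reuse the compiled pattern object for every field:
hits = [name for name in field_names if pattern.search(name.lower())]
# hits == ["PatientName", "PatientID"]
```

`re.search(expr, s)` also caches compiled patterns internally, but hoisting the `re.compile` out of hot loops skips the cache lookup and keeps the code explicit about where compilation happens.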
deid/dicom/fields.py
Outdated
# if no contenders provided, use top level of dicom headers
if contenders is None:
    contenders = get_fields(dicom)
    contenders, contender_lookup_tables = get_fields_with_lookup(dicom)
Usually needing to return more than one related thing is a sign that a class is warranted. In the future we might consider a class here that provides easy access to the tables as well as to getting a particular item.
That's a good idea!
Added in 0843c36. It was nice to remove all these lookup table variables lying around.
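A minimal sketch of what such a class might look like; names here are illustrative, and the actual change is in 0843c36:

```python
from collections import defaultdict

class FieldContenders:
    """Keeps the fields dict and its lookup tables together so they
    stay in sync on every add and remove."""

    def __init__(self):
        self.fields = {}
        self.by_name = defaultdict(list)  # one of several lookup tables

    def add(self, uid, name):
        self.fields[uid] = name
        self.by_name[name.lower()].append(uid)

    def remove(self, uid):
        if uid not in self.fields:
            return
        name = self.fields[uid].lower()
        if uid in self.by_name[name]:
            self.by_name[name].remove(uid)
        del self.fields[uid]

contenders = FieldContenders()
contenders.add("uid1", "PatientName")
contenders.by_name["patientname"]  # ["uid1"]
```

Bundling the tables with the fields means callers can no longer get one without the other, which removes the risk of the two drifting apart.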
deid/dicom/fields.py
Outdated
if expander.lower() in ["endswith", "startswith", "contains"]:
    if field.name_contains(expression):
        fields[uid] = field
if type(field) is str and string_matches_expander(expander, expression, field):

return fields


def field_matches_expander(expander, expression_string, expression_re, field):
Could you please make sure all functions have docstrings (you can easily convert the comments I think).
deid/dicom/fields.py
Outdated
"""
skip = skip or []
seen = seen or []
fields, new_seen, new_skip = get_fields_inner(
Possibly another opportunity for a class or Dataclass.
I changed this function to be a "private" function and kept the interface as-is for simplicity, since this is only used within the get_fields_with_lookup function.
deid/dicom/fields.py
Outdated
    "element_keyword": defaultdict(list),
}
for uid, field in fields.items():
    if type(field) is not DicomField:
isinstance? See https://switowski.com/blog/type-vs-isinstance/. It may only matter (for speed) on older versions of Python, which unfortunately are still present on many of our clusters... 🙃
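For illustration, the behavioral difference between the two checks, beyond any speed considerations:

```python
class Base:
    pass

class Child(Base):
    pass

c = Child()
isinstance(c, Base)   # True: respects inheritance
type(c) is Base       # False: exact type check only
```

`isinstance` is the idiomatic choice when subclasses should be accepted; `type(x) is T` only makes sense when an exact type match is genuinely required.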
deid/dicom/parser.py
Outdated
if not self.fields:
    self.fields = get_fields(
if not self.fields or not self.fields_by_name:
    self.fields, self.lookup_tables = get_fields_with_lookup(
We might want to reset the lookup tables when the class is initialized (or on any update to parse a different file, for example). I'm trying to think of whether there is a case where we might generate the lookup table for one dicom dataset and then load another (and have the tables mixed up and unintentionally combine patient data).
Thanks for the review and feedback @vsoch! I believe I've addressed all outstanding comments, all tests are passing, and pre-commit hooks are all green.

That sounds like a plan to me-- I'll add the changelog adjustments and version bump in the other PR.
vsoch
left a comment
I'll want to do another pass - but here are some thoughts for early discussion/review!
deid/dicom/fields.py
Outdated
if field.lower() in expanders:
    if field.lower() == "all":
        fields = contenders
        fields = contenders.fields
Just to check - the intention here isn't to make a copy that we can edit and have the class (contenders) fields not change? If the attribute is mutable I think it will still. E.g.,
class Test:
    def __init__(self, fields):
        self.fields = fields

myfields = {'1': 1, '2': [2, 2, 2], '3': '3', '4': {1: 1}}
test = Test(myfields)
fields = test.fields
fields['2'].pop()
# 2
test.fields
# {'1': 1, '2': [2, 2], '3': '3', '4': {1: 1}}
# note the list under '2' is smaller: the class's dictionary was updated through the alias
fields['4'][2] = 2
test.fields
# {'1': 1, '2': [2, 2], '3': '3', '4': {1: 1, 2: 2}}
I did not necessarily intend for this to be a deep copy-- I was trying to produce the same behavior as the previous implementation, which did pass along the direct reference to the contenders. I'm happy to put a deepcopy here if you'd like, though!
I think the decision depends on whether it is safe to potentially modify the fields that were passed. Perhaps the deepcopy would be a safe thing to add?
I think that makes sense. I didn't notice any performance hit from adding this either.
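A small illustration of why the deepcopy resolves the aliasing demonstrated earlier (stand-in class, not the deid code):

```python
from copy import deepcopy

class Contenders:
    def __init__(self, fields):
        self.fields = fields

contenders = Contenders({"1": [1, 2, 3]})
fields = deepcopy(contenders.fields)
fields["1"].pop()

contenders.fields  # still {"1": [1, 2, 3]}: the copy is independent
```

Because `deepcopy` recursively copies nested mutable values, mutations on the returned dict can no longer reach back into the class's state.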
deid/dicom/fields.py
Outdated
self.lookup_tables[table_name][key].append(field)


def remove(self, uid):
    if uid in self.fields:
Nit: the pattern I usually prefer (to avoid a level of nesting) is:
if uid not in self.fields:
    return
# same logic with one fewer nested level
deid/dicom/fields.py
Outdated
def remove(self, uid):
    if uid in self.fields:
        field = self.fields[uid]
        del self.fields[uid]
I would put this deletion line after everything in case something goes wrong.
Yeah, that's a good call!
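Putting the two review suggestions together (guard clause plus deletion last), a sketch with a minimal stand-in class:

```python
class FieldStore:
    """Minimal stand-in to show the remove() shape being discussed."""

    def __init__(self, fields):
        self.fields = dict(fields)

    def remove(self, uid):
        if uid not in self.fields:
            return  # guard clause: one less level of nesting
        field = self.fields[uid]
        # ... lookup-table cleanup using `field` would happen here ...
        # deletion last, so a failure above leaves self.fields intact
        del self.fields[uid]

store = FieldStore({"a": "PatientName"})
store.remove("missing")  # no-op, no KeyError
store.remove("a")
```

Deferring the `del` means an exception during table cleanup cannot leave the store half-updated, with the field gone from `fields` but still present in the lookup tables.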
deid/dicom/fields.py
Outdated
del self.fields[uid]
for table_name, lookup_keys in self._get_field_lookup_keys(field).items():
    for key in lookup_keys:
        if field in self.lookup_tables[table_name][key]:
Same pattern here - if the field isn't found, then continue (remove a level of nesting).
deid/dicom/fields.py
Outdated
"""
skip = skip or []
seen = seen or []
fields, new_seen, new_skip = _get_fields_inner(
Is this outer wrapper needed to support the function having the cache wrapper?
That's right-- the function decorated with @cache requires all its inputs and its return type to be hashable, so we do this conversion of lists to tuples in the wrapper function.
Thanks @vsoch -- pushed some updates in response to your feedback. Will wait to hear back on any other thoughts you've got.
if field.lower() in expanders:
    if field.lower() == "all":
        fields = contenders
        fields = deepcopy(contenders.fields)
return fields
@vsoch I'm noticing here that fields may be undefined if we pass the first conditional (field.lower() in expanders) but not the second (field.lower() == "all"). I was going to update this but wasn't sure of the intention here. Should this case return with an empty dictionary?
The intention of that block was to support:
len(dicom.dir())
# 34  <-- we have 34 fields
fields = expand_field_expression('all', dicom)
len(fields)
# 34  <-- we selected all 34 fields

Albeit the check is redundant since there is just one expander. We might want to do:

if field.lower() in expanders and field.lower() == 'all':
    return deepcopy(contenders.fields)

I might have had the idea to support other expanders (to explain the list), but right now "all" is it. We certainly don't need the check for "in expanders" but I wonder how we might check if the user provided a string there that doesn't make sense (we'd want to raise an error).
Ah, that makes sense. It seems challenging to detect if a user passed in something that doesn't make sense here, since we're technically grabbing anything in a field that precedes a ":" character and interpreting it as an expander, right? So if a user actually intended on passing in a value that includes a colon, we would see an unexpected "expander" value but we shouldn't throw an error?
At any rate, when I left this comment I hadn't zeroed in on the fact that the expanders list only contained the value of "all", so that negates any concern I had about this particular code block-- and your explanation about future no-arg expanders makes sense.
@ReeceStevens this PR LGTM. Do you have a second person that might be able to review?
Thanks @vsoch! I don't have any other folks on my end available to review right now, unfortunately. I'll be testing this in production really soon though, so I'll definitely make sure any feedback gets routed this way. Let me know if you have any other requirements you want fulfilled before merging!
That would work for me @ReeceStevens - if you want to test in production and report back, and it goes smoothly, we can merge here. Does that work?
@vsoch sure, that's fine by me. I'll get back to you at the end of this week and let you know what the production run reveals!
@ReeceStevens I was very interested by this PR and I tested things too. It worked quite well, with a great optimization of the speed! I just noticed something which I wasn't able to fully explain.
@Simlomb thank you for checking this out and highlighting this issue! You're right, it looks like when I implemented this lookup table approach I did not preserve the case-insensitive nature of the field lookup. Thank you also for providing an easy-to-run test case :) I've pushed another commit which resolves this issue-- I'm now seeing the test pass even if the private tag values are lower-case in the recipe.
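A sketch of the kind of case normalization such a fix involves; this is illustrative, not the actual commit:

```python
from collections import defaultdict

lookup = defaultdict(list)

def index_field(name, uid):
    # normalize case at insertion time...
    lookup[name.lower()].append(uid)

def find(expression):
    # ...and again at query time, so recipe casing doesn't matter
    return lookup[expression.lower()]

index_field("MyPrivateCreator", "uid1")
find("myprivatecreator")  # ["uid1"]
find("MYPRIVATECREATOR")  # ["uid1"]
```

Normalizing on both sides of the table restores the case-insensitive matching the pre-lookup-table code had, since dictionary key lookups are otherwise exact.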
Great news! We just need a bump to the version (maybe a larger one this time) and the corresponding entry to the changelog.

@ReeceStevens That's great! Thanks for implementing a fix so rapidly!
It would be great to get this merged.
@ReeceStevens could you do this? These seem to be the last two things holding up the merge.
This was considered a "patch" change since there was no interface change, but I'm fine switching to a 0.5.0 release instead of 0.4.7 if you'd prefer that.
I guess what @vsoch would like is an update to CHANGELOG.md and deid/version.py. These files are not updated yet in this PR. Bumping the version to 0.4.8 sounds appropriate for a patch (no need for 0.5.0).
You are both right! We don't need to explicitly bump the version (apologies, I forgot about that) but we should have a CHANGELOG note under the current version.
Thanks @lassoan and @vsoch. Just wanted to clarify, because I had included a changelog note for this PR in the previous PR-- see line 18 in the commit I linked to previously. I did this because I thought you had asked for it explicitly in your original PR review comment here, but now I realize I might have misread it.

Either way, is there any specific implementation note you'd like on the changelog beyond what's already there?
@vsoch I pushed a couple of tweak commits here in response to feedback from other folks in this thread, but all tests are still passing and performance improvements are holding steady.
Thank you for your excellent work!
You didn't misread it - you got exactly what I had intended.
Description
Related issues: None
I have been using deid in a webassembly execution context with Pyodide. In this configuration, performance issues are exacerbated, and I have been noticing prohibitively slow performance on DICOM header de-identification. I spent some time profiling the header deid functionality and identified three main areas where performance was getting limited:

- repeated regex lookups inside loops were taking a significant amount of time
- field.name_contains was taking up a very large portion of the overall runtime (the inner loop within case 2 of expand_field_expression)
- get_fields was being run over and over

This PR consists of three commits, each tailored to one of the above points-- plus one extra commit to fix up remaining bugs and get all tests passing.
Performance gains here are, of course, dependent on the input DICOM files as well as the deid recipe that is used. In my test setup with 69 input DICOM files, I observed the following speed improvements:

- origin/master: 35.589 seconds
- get_fields: 7.344 seconds
- get_fields: 5.304 seconds

Checklist
Open questions
In order to get caching working, I had to add a __hash__ property to pydicom.FileDataset. This is obviously not ideal, but there wasn't another way to get the caching performance boost. I had originally thought I could just wrap FileDataset in a proxy class, but the overhead of creating the proxy class seemed to be enough to slow things down as much as just getting a cache miss.