This repository was archived by the owner on Feb 11, 2026. It is now read-only.
B 22658 enhancements#13
Merged
…t. WARNING: This commit currently thread locks and is not 1-2 processor friendly
traskowskycaci
approved these changes
Mar 19, 2025
traskowskycaci
left a comment
Significant performance increase here. Very nice!
danieljordan-caci
approved these changes
Mar 19, 2025
B-22658
Summary
This enhancement adds parallel processing at the table extraction level. Previously, we introduced parallel processing so that more than 1 table could be extracted at a time, but that introduced a new bottleneck. At the end of a full database extraction, if a very large table remained (audit_history, for example), only 1 processor would be extracting it from SQL to JSON. That was slow enough to cause timeouts while the other 5 processors sat unused. Now all processors are used to convert every batch in the table to JSON, which is then uploaded to S3. To do this, we queue all tables and batches to the CPU. By queueing them all without limit, we let the processors pick up every job as they become available.
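As a rough illustration of the queueing idea described above, the following is a minimal sketch and not the code from this PR. It assumes Python and a `concurrent.futures` pool; `FAKE_DB`, `batch_to_json`, `extract_all`, and `run` are hypothetical names, and a thread pool stands in for whatever worker mechanism the real extractor uses.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for the SQL source: table name -> list of row batches.
FAKE_DB = {
    "audit_history": [[{"id": 1}, {"id": 2}], [{"id": 3}]],
    "users": [[{"id": 10}]],
}


def batch_to_json(table, batch_index):
    """Worker job: convert one batch of rows to newline-delimited JSON."""
    rows = FAKE_DB[table][batch_index]
    body = "\n".join(json.dumps(row) for row in rows)
    return table, batch_index, body


def extract_all(executor):
    """Queue every (table, batch) pair up front, then collect results as they finish."""
    futures = [
        executor.submit(batch_to_json, table, i)
        for table, batches in FAKE_DB.items()
        for i in range(len(batches))
    ]
    results = {}
    for future in as_completed(futures):
        table, i, body = future.result()
        results[(table, i)] = body
    return results


def run(max_workers=6):
    # 6 workers mirrors the processor count mentioned for a full extraction.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return extract_all(pool)
```

Because every batch of every table is submitted before any result is awaited, a large table no longer ends up on a single worker at the tail of the run; its batches are spread across whichever workers are free.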
At the parent table level, the “manager” awaits a response from the worker and does not queue a job until the worker event triggers it to act. This prevents thread hogging and lets the workers extract SQL to JSON without interruption. When the lower-level manager receives its event, it uploads the buffer if it has reached the 50MB threshold or if this is the last batch. We then clear the buffer from memory immediately so the next worker can send its SQL -> JSON data up for the next upload.
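The upload rule in the paragraph above (flush at 50MB or on the last batch, then clear the buffer) can be sketched roughly as follows. This is an assumption-laden illustration rather than the PR's implementation: `BufferedUploader` and its `upload` callable are hypothetical, and 50MB is taken here as 50 * 1024 * 1024 bytes.

```python
# Assumed interpretation of the 50MB buffer threshold from the description.
BUFFER_LIMIT_BYTES = 50 * 1024 * 1024


class BufferedUploader:
    """Accumulates JSON bytes and uploads once the limit is hit or on the last batch."""

    def __init__(self, upload, limit=BUFFER_LIMIT_BYTES):
        self._upload = upload  # hypothetical callable, e.g. a wrapper around an S3 put
        self._limit = limit
        self._chunks = []
        self._size = 0

    def add_batch(self, json_bytes, is_last):
        """Called when a worker reports a finished batch."""
        self._chunks.append(json_bytes)
        self._size += len(json_bytes)
        if self._size >= self._limit or is_last:
            self._flush()

    def _flush(self):
        if not self._chunks:
            return
        self._upload(b"".join(self._chunks))
        # Drop the buffered data right away so memory is free for the next batch.
        self._chunks = []
        self._size = 0
```

A usage example with a small limit: `BufferedUploader(print, limit=10)` would hold a 5-byte batch, upload once a second batch pushes the total to 10 bytes or more, and upload any remainder when `is_last` is true.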
Links
Link to a fresh database extraction in Loadtest (6 processors)
https://us-gov-west-1.console.amazonaws-us-gov.com/cloudwatch/home?region=us-gov-west-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252Fdata-warehouse-data-warehouse/log-events/2025$252F03$252F19$252F$255B$2524LATEST$255D111ca372e0d3471bb1edd8c60e795515
Link to an incremental extraction in Loadtest (2 processors)
https://us-gov-west-1.console.amazonaws-us-gov.com/cloudwatch/home?region=us-gov-west-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252Fdata-warehouse-data-warehouse/log-events/2025$252F03$252F19$252F$255B$2524LATEST$255D96b4ac03e9ac4d4d8d6fe2d8c8a30923
Per the docs for database extraction for Advana, a full, fresh extraction must use 6 processors. An incremental extraction works fine with 2 processors in loadtest; we'll see once it is in stg/prd whether it needs more. We sent a notice to Advana that we'd like to test it in stg.