Skip to content

feat: add progress_data to worker_metadata#202

Closed
gasperzgonec wants to merge 8 commits into
mainfrom
gasperz/ISS-252036
Closed

feat: add progress_data to worker_metadata#202
gasperzgonec wants to merge 8 commits into
mainfrom
gasperz/ISS-252036

Conversation

@gasperzgonec

@gasperzgonec gasperzgonec commented May 19, 2026

Copy link
Copy Markdown
Contributor

This PR flattens extraction progress metadata into worker_metadata and adds timestamp bounds the platform can use to detect incremental syncs that are not advancing.

  • NormalizedAttachment.created_date and NormalizedAttachment.modified_date (optional): attachments can now contribute source timestamps the same way normalized items do.
  • Repo.dateRanges: each repo now tracks the oldest/newest valid created_date and modified_date seen across uploaded batches.
  • Flat worker_metadata progress fields: extraction progress/done events now send the latest extracted repo’s item_type, oldest_created_date, newest_created_date, oldest_modified_date, and newest_modified_date.
  • State window fields in worker_metadata: oldest_state_date and newest_state_date are copied from event_context.extract_from / extract_to on emitted callback payloads.

Connected Issues

Checklist

  • Tests added/updated and ran with npm run test OR no tests needed.
  • Ran backwards compatibility tests with npm run test:backwards-compatibility.
  • Code formatted and checked with npm run lint.
  • Tested airdrop-template linked to this PR.
  • Documentation updated and provided a link to PR / new docs OR no docs needed.

Migration note

If your connector normalizes attachments, populate created_date and/or modified_date on each NormalizedAttachment using the source system timestamp in RFC3339 format.

const normalizedAttachment: NormalizedAttachment = {
  id: attachment.id,
  url: attachment.downloadUrl,
  file_name: attachment.name,
  parent_id: attachment.parentId,
  created_date: attachment.createdAt,
  modified_date: attachment.updatedAt,
};

Notes:

  • Attachment timestamp fields are optional, so existing connectors keep working.
  • Attachment repos only contribute created/modified bounds when those fields are populated.
  • Invalid timestamp strings are ignored and do not corrupt the tracked ranges.
  • Normalized items already support created_date / modified_date; no connector-side change is needed there.

What worker_metadata now contains

worker_metadata is part of the callback payload. It is not inside event_data.

Repo progress fields

These fields are attached only for:

  • DATA_EXTRACTION_PROGRESS
  • DATA_EXTRACTION_DONE
  • ATTACHMENT_EXTRACTION_PROGRESS
  • ATTACHMENT_EXTRACTION_DONE

Shape:

worker_metadata: {
  item_type?: string;
  oldest_created_date?: string;
  newest_created_date?: string;
  oldest_modified_date?: string;
  newest_modified_date?: string;
  oldest_state_date?: string;
  newest_state_date?: string;
  adaas_library_version?: string;
}

Behavior:

  • The repo progress fields describe only the latest extracted repo.
  • Values are emitted as RFC3339 strings.
  • Unset bounds are omitted.
  • Loader events and non-progress extraction events do not include repo progress fields.

Example:

{
  "worker_metadata": {
    "item_type": "issues",
    "oldest_created_date": "2024-01-01T00:00:00.000Z",
    "newest_created_date": "2024-05-01T00:00:00.000Z",
    "oldest_modified_date": "2024-02-01T00:00:00.000Z",
    "newest_modified_date": "2024-06-01T00:00:00.000Z",
    "oldest_state_date": "2024-01-01T00:00:00.000Z",
    "newest_state_date": "2024-06-01T00:00:00.000Z",
    "adaas_library_version": "..."
  }
}

How repo progress fields are computed

  1. On each Repo.upload(), the SDK scans uploaded items for valid created_date and modified_date values.
  2. It updates that repo’s running dateRanges.creationDate and dateRanges.modifiedDate.
  3. On the extraction progress/done events above, WorkerAdapter emits the bounds for the repo identified by the latest uploaded itemType.

How state window fields are computed

  • oldest_state_date is copied from event_context.extract_from.
  • newest_state_date is copied from event_context.extract_to.
  • These values are injected by control-protocol.emit() and override caller-provided state-date values.

Why this exists

The platform can compare the emitted repo timestamp bounds together with the absolute extraction window. If repeated runs keep reporting the same latest-item bounds and state window, that is a signal that the incremental sync is not making forward progress.

Started returning `progress_data` in `worker_metadata`.
@gasperzgonec gasperzgonec requested review from a team and radovanjorgic as code owners May 19, 2026 11:03

@radovanjorgic radovanjorgic left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No tests at all? :/

Comment thread src/repo/repo.ts
const itemsToUpload = batch || this.items;

if (itemsToUpload.length > 0) {
for (const item of itemsToUpload) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I have few questions:

  1. Do we really need to do this for each item? How is that from performance perspective? Let's say you have 100+ repos at once each with 5000 items in it?
  2. What if timeout comes at this point?
  3. Can we offload this work to backend? After storing the files, extractor-adapter/snap-in manager scans them and picks progress from there.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the #3, we'll need @GasperSenk 's input, but I don't think so.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the #1 and #2, I think this is fine.
This logic takes O(n) in time and O(1) in memory use.
There's just integer comparisons, and it should be fast enough to not block any isTimeout checks.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, adapter doesn't know the normalization function to find the correct field name for the dates.

Comment thread src/repo/repo.ts Outdated
if (itemsToUpload.length > 0) {
for (const item of itemsToUpload) {
if (
item != null &&

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply just if (item?.created_date)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't work before I added created_date to the NormalizedAttachment.
Good catch!

Comment thread src/repo/repo.ts Outdated
'created_date' in item &&
item.created_date != null
) {
const created_date = new Date(item['created_date']).getTime();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use snake case for variable names please.

Comment thread src/repo/repo.ts Outdated
Comment on lines +28 to +29
min: 0,
max: 0,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do min and max mean here? Maybe we should use oldest and newest instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are min and max timestamps. Due to them being numbers, we decided to go with the min and max.
Discussed in the ISS comments.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might make sense to use newest and oldest, we do that in the backend, but I see why you would use min and max when you are dealing with longs.

Comment thread src/common/control-protocol.ts Outdated
eventType: ExtractorEventType | LoaderEventType;
data?: EventData;
worker_metadata?: {
progress_data: Record<string, { min: number; max: number }>;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only a single entry?

newEventType == ExtractorEventType.AttachmentExtractionProgress)
) {
for (const repo of this.repos) {
itemTimestamps[repo.itemType] = repo.itemTimestamps;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We said we only want one entry for progress.
The latest from the list.

Comment thread src/repo/repo.ts Outdated
public uploadedArtifacts: Artifact[];
public itemTimestamps: { min: number; max: number } = {
min: 0,
max: 0,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We said to have a single record type and contain created and modified dates.

@radovanjorgic

Copy link
Copy Markdown
Collaborator

Closing because we opened new PR which merges #202 and #205

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants