Skip to content

Adding last page limit and include config to set the max results per page#1

Merged
joaopamaral merged 12 commits into
masterfrom
add_trim_results_config
Jan 20, 2023
Merged

Adding last page limit and include config to set the max results per page#1
joaopamaral merged 12 commits into
masterfrom
add_trim_results_config

Conversation

@joaopamaral
Copy link
Copy Markdown

Description of change

Last page check

This new check verifies if the request reached the last page and avoids the following pagination error (GitHub API limits the result to 40k records):

{
  "message": "In order to keep the API fast for everyone, pagination is limited for this resource. Check the rel=last link relation in the Link response header to see how far back you can traverse.",
  "documentation_url": "https://docs.github.com/v3/#pagination"
}

Max records per page configurable

The other change includes an option to define the max results per page and set by default the max results allowed by the Github API to reduce the number of requests.

Manual QA steps

Risks

Rollback steps

  • revert this branch

@joaopamaral joaopamaral requested review from a team and leomattic January 12, 2023 17:54
@joaopamaral joaopamaral marked this pull request as draft January 12, 2023 18:50
@joaopamaral joaopamaral marked this pull request as ready for review January 12, 2023 19:54
Copy link
Copy Markdown

@leomattic leomattic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks @joaopamaral !
🚀

Comment thread tap_github/client.py Outdated
Comment thread tap_github/client.py Outdated
Comment thread tests/unittests/test_custom_domain.py Outdated
Comment thread tests/unittests/test_custom_domain.py Outdated
Comment thread tests/unittests/test_custom_domain.py Outdated
Copy link
Copy Markdown

@Khrol Khrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While your code works fine, you somehow try to complicate things.

@joaopamaral
Copy link
Copy Markdown
Author

joaopamaral commented Jan 19, 2023

Pagination error test:

  • Max per page 99 records for stargazers schema with more than 40K records (last page bug)
    • Logging a Warning and saving the file without crashing the extraction
2023-01-19T16:24:43.380361Z [info]INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.16736173629760742, "tags": {"endpoint": "https://api.github.com/repositories/45717250/stargazers?per_page=99&page=404", "http_status_code": 200, "status": "succeeded"}} cmd_type=extractor name=tap-github run_id=65355086-3d86-47d7-8675-e12c3bcc25d3 state_id=2023-01-19T162307--tap-github--target-parquet stdio=stderr
2023-01-19T16:24:43.535120Z [info]WARNING HTTP-error-code: 422, Error: In order to keep the API fast for everyone, pagination is limited for this resource. Check the rel=last link relation in the Link response header to see how far back you can traverse.. Please refer 'https://docs.github.com/v3/#pagination' for more details.This is a known issue when the results exceed 40k and the last page is not full (it will trim the results to get only the available by the API). cmd_type=extractor name=tap-github run_id=65355086-3d86-47d7-8675-e12c3bcc25d3 state_id=2023-01-19T162307--tap-github--target-parquet stdio=stderr
2023-01-19T16:24:43.535671Z [info]INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.10387468338012695, "tags": {"endpoint": "https://api.github.com/repositories/45717250/stargazers?per_page=99&page=405", "http_status_code": 422, "status": "succeeded"}} cmd_type=extractor name=tap-github run_id=65355086-3d86-47d7-8675-e12c3bcc25d3 state_id=2023-01-19T162307--tap-github--target-parquet stdio=stderr
2023-01-19T16:24:43.536061Z [info]INFO METRIC: {"type": "counter", "metric": "record_count", "value": 13364, "tags": {"endpoint": "stargazers"}} cmd_type=extractor name=tap-github run_id=65355086-3d86-47d7-8675-e12c3bcc25d3 state_id=2023-01-19T162307--tap-github--target-parquet stdio=stderr
2023-01-19T16:24:44.535253Z [info]INFO Wrote file output/stargazers/20230119_162443-562665.gz.parquet from stargazers stream cmd_type=loader name=target-parquet run_id=65355086-3d86-47d7-8675-e12c3bcc25d3 state_id=2023-01-19T162307--tap-github--target-parquet stdio=stderr
2023-01-19T16:24:44.564375Z [info]INFO Wrote 1 files
  • Max per page 100 records for stargazers schema with more than 40K records
    • No Warning
2023-01-19T16:29:52.660372Z [info     ] INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.17248058319091797, "tags": {"endpoint": "https://api.github.com/repositories/45717250/stargazers?per_page=100&page=400", "http_status_code": 200, "status": "succeeded"}} cmd_type=extractor name=tap-github run_id=d3e99078-e120-414a-9ccf-199b416caa5f state_id=2023-01-19T162818--tap-github--target-parquet stdio=stderr
2023-01-19T16:29:52.710340Z [info     ] INFO METRIC: {"type": "counter", "metric": "record_count", "value": 13669, "tags": {"endpoint": "stargazers"}} cmd_type=extractor name=tap-github run_id=d3e99078-e120-414a-9ccf-199b416caa5f state_id=2023-01-19T162818--tap-github--target-parquet stdio=stderr
2023-01-19T16:29:53.684673Z [info     ] INFO Wrote file output/stargazers/20230119_162952-743109.gz.parquet from stargazers stream cmd_type=loader name=target-parquet run_id=d3e99078-e120-414a-9ccf-199b416caa5f state_id=2023-01-19T162818--tap-github--target-parquet stdio=stderr
2023-01-19T16:29:53.700253Z [info     ] INFO Wrote 1 files 

@joaopamaral joaopamaral requested a review from Khrol January 19, 2023 16:35
Comment thread tap_github/client.py
# Don't raise a NotFoundException
return None

if error_code == 422 and PAGINATION_EXCEED_MSG in response_json.get('message', ''):
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error 422 can be raised for other reasons, that is why we are comparing the message to check if it's related to pagination.

Comment thread tap_github/client.py
timer.tags[metrics.Tag.http_status_code] = resp.status_code
rate_throttling(resp, self.max_sleep_seconds, self.min_remain_rate_limit)
if resp.status_code == 404:
if resp.status_code == 404 or resp.status_code == 422:
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to check the response message again because if the error code is 422 and it's not a pagination error, the raise_for_error method will raise an exception.

Comment thread tap_github/client.py
Call rest API and return the response in case of status code 200.
"""
with metrics.http_request_timer(source) as timer:
with metrics.http_request_timer(url) as timer:
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already expecting the endpoint, it makes it easier to identify the request in logs.

Copy link
Copy Markdown

@Khrol Khrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Way better now and more clear what's going on.
Thank you, LGTM.

@joaopamaral joaopamaral merged commit 1cddabe into master Jan 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants