[WIP] fix: E2E performance pipeline flakyness and other improvements by chrispader · Pull Request #60155 · Expensify/App

chrispader · 2025-04-12T09:20:28Z

Explanation of Change

Fixes issues with the order and number of GH comments (for split-up output files) and the flakiness of some performance metrics.

Fixed Issues

$
PROPOSAL:

Tests

Verify that no errors appear in the JS console

Offline tests

QA Steps

// TODO: These must be filled out, or the issue title must include "[No QA]."

Verify that no errors appear in the JS console

PR Author Checklist

I linked the correct issue in the ### Fixed Issues section above
I wrote clear testing steps that cover the changes made in this PR
- I added steps for local testing in the Tests section
- I added steps for the expected offline behavior in the Offline steps section
- I added steps for Staging and/or Production testing in the QA steps section
- I added steps to cover failure scenarios (i.e. verify an input displays the correct error message if the entered data is not correct)
- I turned off my network connection and tested it while offline to ensure it matches the expected behavior (i.e. verify the default avatar icon is displayed if app is offline)
- I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
I included screenshots or videos for tests on all platforms
I ran the tests on all platforms & verified they passed on:
- Android: Native
- Android: mWeb Chrome
- iOS: Native
- iOS: mWeb Safari
- MacOS: Chrome / Safari
- MacOS: Desktop
I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
I followed proper code patterns (see Reviewing the code)
- I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
- I verified that comments were added to code that is not self explanatory
- I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
- I verified any copy / text shown in the product is localized by adding it to src/languages/* files and using the translation method
  - If any non-english text was added/modified, I used JaimeGPT to get English > Spanish translation. I then posted it in #expensify-open-source and it was approved by an internal Expensify engineer. Link to Slack message:
- I verified all numbers, amounts, dates and phone numbers shown in the product are using the localization methods
- I verified any copy / text that was added to the app is grammatically correct in English. It adheres to proper capitalization guidelines (note: only the first word of header/labels should be capitalized), and is either coming verbatim from figma or has been approved by marketing (in order to get marketing approval, ask the Bug Zero team member to add the Waiting for copy label to the issue)
- I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
- I verified the JSDocs style guidelines (in STYLE.md) were followed
If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
I followed the guidelines as stated in the Review Guidelines
I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
I verified any variables that can be defined as constants (ie. in CONST.ts or at the top of the file that uses the constant) are defined as such
I verified that if a function's arguments changed that all usages have also been updated correctly
If any new file was added I verified that:
- The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
If a new CSS style is added I verified that:
- A similar style doesn't already exist
- The style can't be created with an existing StyleUtils function (i.e. StyleUtils.getBackgroundAndBorderStyle(theme.componentBG))
If the PR modifies code that runs when editing or sending messages, I tested and verified there is no unexpected behavior for all supported markdown - URLs, single line code, code blocks, quotes, headings, bold, strikethrough, and italic.
If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
If the PR modifies a component related to any of the existing Storybook stories, I tested and verified all stories for that component are still working as expected.
If the PR modifies a component or page that can be accessed by a direct deeplink, I verified that the code functions as expected when the deeplink is used - from a logged in and logged out account.
If the PR modifies the UI (e.g. new buttons, new UI components, changing the padding/spacing/sizing, moving components, etc) or modifies the form input styles:
- I verified that all the inputs inside a form are aligned with each other.
- I added Design label and/or tagged @Expensify/design so the design team can review the changes.
If a new page is added, I verified it's using the ScrollView component to make it scrollable when more elements are added to the page.
I added unit tests for any new feature or bug fix in this PR to help automatically prevent regressions in this user flow.
If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.

Screenshots/Videos

Android: Native

Android: mWeb Chrome

iOS: Native

iOS: mWeb Safari

MacOS: Chrome / Safari

MacOS: Desktop

mountiny · 2025-04-14T12:28:14Z

Is this ready?

melvin-bot · 2025-04-14T12:28:30Z

@JS00001 Please copy/paste the Reviewer Checklist from here into a new comment on this PR and complete it. If you have the K2 extension, you can simply click: [this button]

JS00001 · 2025-04-14T13:10:33Z


-const MAX_CHARACTERS_PER_FILE = 65536;
-const FILE_SIZE_SAFETY_MARGIN = 1000;
+// This is the maximum number of characters allowed to post to a GitHub comment body through the GitHub CLI


Are these comments necessary? And the commented out const, can that be removed?
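
For readers following the diff: the two constants suggest that the output is split so that each piece fits into one GitHub comment body. A minimal sketch of that idea, assuming a hypothetical `splitIntoChunks` helper that is not taken from this PR; only the two constant names and values come from the diff above.

```typescript
// Values from the diff above; everything else in this sketch is an assumption.
const MAX_CHARACTERS_PER_FILE = 65536;
const FILE_SIZE_SAFETY_MARGIN = 1000;
const MAX_CHUNK_LENGTH = MAX_CHARACTERS_PER_FILE - FILE_SIZE_SAFETY_MARGIN;

// Hypothetical helper: groups lines into chunks whose joined length stays within maxLength.
// A single line longer than maxLength is kept whole as its own (oversized) chunk.
function splitIntoChunks(lines: string[], maxLength: number = MAX_CHUNK_LENGTH): string[] {
    const chunks: string[] = [];
    let currentLines: string[] = [];
    let currentLength = 0;

    for (const line of lines) {
        // +1 accounts for the newline that joins this line to the previous one
        const addedLength = currentLines.length === 0 ? line.length : line.length + 1;

        if (currentLines.length > 0 && currentLength + addedLength > maxLength) {
            // The current chunk is full: close it and start a new one with this line
            chunks.push(currentLines.join('\n'));
            currentLines = [line];
            currentLength = line.length;
        } else {
            currentLines.push(line);
            currentLength += addedLength;
        }
    }

    if (currentLines.length > 0) {
        chunks.push(currentLines.join('\n'));
    }
    return chunks;
}
```

Splitting on line boundaries rather than at a fixed character offset keeps markdown table rows intact, which is one reason a safety margin is useful.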

JS00001 · 2025-04-14T13:12:44Z

          if ls "./Host_Machine_Files/\$WORKING_DIRECTORY"/output2.md 1> /dev/null 2>&1; then
            # Print all the split files
-            for file in "./Host_Machine_Files/\$WORKING_DIRECTORY/output"*; do
+            for file in $(ls "./Host_Machine_Files/\$WORKING_DIRECTORY/output"* | sort -V); do


is this sorting just for cleaner output?
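
Going by the commit title "fix: print output files in order if more than 9", the sort matters once there are ten or more split files, because plain lexicographic ordering puts `output10.md` before `output2.md`. A small sketch of the difference, assuming hypothetical file names and a hypothetical `sortByFileIndex` helper (the workflow itself uses `sort -V` in shell):

```typescript
// Hypothetical file names, mirroring the output*.md pattern from the workflow snippet above.
const files = ['output10.md', 'output2.md', 'output1.md', 'output9.md'];

// Default sort compares strings character by character, so "10" sorts before "2".
const lexicographic = files.slice().sort();
// → ['output1.md', 'output10.md', 'output2.md', 'output9.md']

// Extracts the numeric index in front of ".md"; files without an index sort first.
function fileIndex(name: string): number {
    const match = /(\d+)\.md$/.exec(name);
    return match ? Number(match[1]) : 0;
}

// Hypothetical helper: orders files by their numeric index, similar in effect to `sort -V` here.
function sortByFileIndex(names: string[]): string[] {
    return names.slice().sort((a, b) => fileIndex(a) - fileIndex(b));
}

const byIndex = sortByFileIndex(files);
// → ['output1.md', 'output2.md', 'output9.md', 'output10.md']
```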

chrispader · 2025-04-14T13:26:52Z

> Is this ready?

No, I'm still actively working on it. This is just a draft PR for now. I still have to address the potential "flakiness" of the E2E performance pipelines.

JS00001 · 2025-05-08T12:21:56Z

is this still being worked on?

chrispader · 2025-05-08T13:57:07Z

> is this still being worked on?

Yes, I'll continue working on this over the next few days.

chrispader · 2025-05-12T17:22:20Z

A quick update on my investigation into this issue

Most if not all of the recent flaky performance regression reports could be caused by "missed network cache hits". If we look into, e.g., the Logcat logs of this AWS DeviceFarm job, we can see a lot of "!!! Missed cache hit for url: ..." logs.

Problem

In the E2E performance pipeline, we store the responses to network requests in a cache so that we can mock those requests during the performance measurements, since slow requests would distort the measurements. The network cache works by hashing the URL and payload of a request and later looking up the result by comparing hashes. In a "warmup" run, we actually perform the network requests and store the results in the cache. In the actual test run, we then look up the result in the cache using the hash of the new request.

This currently does not work for all types of requests, since the payload often does not fully match the one from the warmup run, e.g. with GetMissingOnyxMessages or ReconnectApp. In those cases we get "missed cache hits": we log an error and fall back to the actual fetch call.
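
To illustrate the failure mode, here is a minimal sketch of a hash-keyed cache. The function names, the example URL, the `updateIDFrom` field, and the hash function are all assumptions for illustration, not the E2E suite's actual code.

```typescript
type CachedResponse = {status: number; body: string};

// Simple deterministic string hash (djb2); a stand-in for whatever hash the real suite uses.
function hashString(input: string): string {
    let hash = 5381;
    for (let i = 0; i < input.length; i++) {
        hash = ((hash << 5) + hash + input.charCodeAt(i)) | 0;
    }
    return (hash >>> 0).toString(16);
}

// The cache key depends on both the URL and the exact serialized payload.
function cacheKey(url: string, payload: Record<string, unknown>): string {
    return hashString(`${url}|${JSON.stringify(payload)}`);
}

const networkCache: Record<string, CachedResponse> = {};

// Warmup run: the real request is performed and its response is stored.
function storeResponse(url: string, payload: Record<string, unknown>, response: CachedResponse): void {
    networkCache[cacheKey(url, payload)] = response;
}

// Test run: the response is looked up by the hash of the new request.
function lookupResponse(url: string, payload: Record<string, unknown>): CachedResponse | undefined {
    return networkCache[cacheKey(url, payload)];
}

// Hypothetical request: the payload carries a value that differs between runs.
const url = 'https://example.com/api/GetMissingOnyxMessages';
storeResponse(url, {updateIDFrom: 100}, {status: 200, body: '{}'});

const hit = lookupResponse(url, {updateIDFrom: 100}); // same payload: found in the cache
const miss = lookupResponse(url, {updateIDFrom: 101}); // payload changed: undefined, i.e. a "missed cache hit"
```

Any field that changes between the warmup run and the test run changes the key, so the lookup fails even though the request is logically the same.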

Solution

To fix this problem, we were thinking about changing the caching mechanism altogether: restore the initial Onyx state on each iteration and then deterministically replay the network request results from the warmup run.

This would work as follows:

1. We run two warmup runs before each test. The first warmup run is just to log in the user.
2. At the end of the first warmup run, we save/export the Onyx state by dumping the SQLite database into a .sql file, which we can use to re-create the state later. (This functionality does not exist yet in Onyx.)
3. The second warmup run (already logged in) then performs all the network requests and stores their results in the network cache.
4. At the beginning of each test iteration (we currently do 60), we first restore the database to the initial state and only then run the RN app and start measuring performance.

TODO

To make this work, we would need to work on the following tasks:

- Create the mechanism directly in Onyx to allow exporting/importing state into some platform-specific format or JSON (or do this directly from the E2E suite code, which is less clean).
- Implement the restoring on each test iteration and rewrite the caching mechanism to store network request results in order rather than by a hash, so that we can then get the results deterministically.
- Test all of this locally and on the actual AWS DeviceFarm.
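
The "in order rather than by a hash" idea from the list above could be sketched roughly as follows. The class name `OrderedReplayCache` and its methods are hypothetical; this is an illustration of the concept under the assumption that requests are issued in the same order on every iteration, not a proposed implementation.

```typescript
type RecordedResponse = {url: string; body: string};

// Hypothetical replay cache: responses are stored in request order during the warmup run
// and handed back in the same order during each test iteration. Payloads are ignored.
class OrderedReplayCache {
    private recorded: RecordedResponse[] = [];

    private cursor = 0;

    // Warmup run: append each real response in the order the requests were made.
    record(url: string, body: string): void {
        this.recorded.push({url, body});
    }

    // Start of a test iteration (after the database was restored): replay from the beginning.
    reset(): void {
        this.cursor = 0;
    }

    // Test run: return the next recorded response, checking only that the URL lines up.
    next(url: string): string {
        const entry = this.recorded[this.cursor];
        if (!entry) {
            throw new Error(`No recorded response left for ${url}`);
        }
        if (entry.url !== url) {
            throw new Error(`Expected a request to ${entry.url} but got ${url}`);
        }
        this.cursor += 1;
        return entry.body;
    }
}
```

Because the lookup no longer depends on the payload, differing payload fields cannot cause a miss. The trade-off is that the approach only holds if the app starts every iteration from the same state, which is why it is paired with restoring the database.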

What are your thoughts around this? cc @mountiny @hannojg @kirillzyusko

mountiny · 2025-05-12T18:17:49Z

@chrispader since almost everything in the app works offline and we care about the frontend performance, could we just test everything offline with the optimistic data once the user signs in? Then we could just not care about the cache.

mountiny · 2025-05-12T18:18:22Z

Then also we could just import the state with session too and we could skip the authentication completely too

chrispader · 2025-05-13T10:41:14Z

> @chrispader since almost everything in the app works offline and we care about the frontend performance, could we just test everything offline with the optimistic data once the user signs in? then we could just not care about the cache

@mountiny if that's an option and fine with you, then yes, let's go for it. This will reduce complexity a lot.

I still want to mention that for some flows and use cases we might then not be able to do E2E performance regression testing, e.g. things related to chat pagination/scrolling and loading new messages, or maybe extended search with data that is not stored locally yet.

mountiny · 2025-05-15T18:47:30Z

I think the main issue might be if new collection/ is introduced or changed we would need to update the onyx state in the tests too and that can change the times. But maybe if we find a way to control the state size too, should be ok

chrispader · 2025-06-20T11:18:22Z

@mountiny I will try to wrap this up over the weekend. I was testing the current E2E tests and some of them didn't really work properly in offline mode.

E.g. we show placeholder views instead of the list in the search screen when the user hasn't been online yet, which causes the tests to hang forever and kind of makes them obsolete.

Do you think we should change any of the actual app behavior for the purpose of fixing the E2E performance tests? I feel like this could be a scenario where we would actually want to always show some results in search, rather than just saying "You are offline, there are no results" or something similar.

mountiny · 2025-06-20T15:31:26Z

I don't think we should be changing that now

perunt · 2025-08-21T12:27:14Z

After catching up on the discussion, here are my thoughts:

On the caching issue:
Since the problem is basically how we're comparing requests, fuzzy matching could work well here. We could keep both approaches for now: use the current strict matching as the default, but switch to fuzzy matching for those problematic endpoints where random fields (timestamps, session IDs, etc.) keep breaking our cache hits. This way we can fix the immediate flakiness without having to rebuild everything from scratch.
On going fully offline:
I get the appeal of eliminating network flakiness entirely, but Chris has a good point: we'd lose coverage for a lot of important flows. Take linking, for example. Even with a perfectly hydrated Onyx state, we still need mock responses for the navigation and validation to actually work. These tests would basically be dead.
If we go fully offline, we're essentially disabling these tests until the app has true offline-first support everywhere, which feels like we're trading one problem for another.

Instead of picking one extreme, why not split our tests based on what they actually need? Some can run offline, some need network mocking, and we keep the approach that makes sense for each group.
@mountiny, are you okay with losing test coverage for these network-dependent features if we go offline-only? Or should we try this hybrid approach instead?
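
One way the fuzzy matching mentioned above could look: strict matching stays the default, and only listed commands drop their volatile fields before the key is built. The `VOLATILE_FIELDS` table, the field names, and `matchKey` are hypothetical and only meant as a sketch of the idea.

```typescript
type Payload = Record<string, unknown>;

// Hypothetical per-command list of fields that are ignored when matching requests.
// Commands not listed here keep strict matching on the full payload.
const VOLATILE_FIELDS: Record<string, string[]> = {
    ReconnectApp: ['timestamp', 'sessionID'],
};

// Builds the cache key for a request, leaving out any volatile fields for that command.
function matchKey(command: string, payload: Payload): string {
    const ignored = Object.prototype.hasOwnProperty.call(VOLATILE_FIELDS, command) ? VOLATILE_FIELDS[command] : [];
    const stable: Payload = {};

    // Keys are sorted so that property order does not affect the resulting key
    Object.keys(payload)
        .sort()
        .forEach((field) => {
            if (ignored.indexOf(field) === -1) {
                stable[field] = payload[field];
            }
        });

    return `${command}|${JSON.stringify(stable)}`;
}

// Fuzzy: the two ReconnectApp requests differ only in a volatile field, so the keys match.
const warmupKey = matchKey('ReconnectApp', {timestamp: 1, policyID: 'abc'});
const testRunKey = matchKey('ReconnectApp', {policyID: 'abc', timestamp: 2});

// Strict (default): any difference in the payload produces a different key.
const strictKeyA = matchKey('OpenReport', {timestamp: 1});
const strictKeyB = matchKey('OpenReport', {timestamp: 2});
```

Keeping the ignore list explicit per command means a new volatile field shows up as a visible cache miss rather than being silently tolerated everywhere.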

mountiny · 2025-08-21T13:15:47Z

That all makes sense! Right now, though, I think we should take a step back and look at where the E2E tests are, what exactly we want from them, and whether they have so far been solving the problems we wanted them to solve.

Going to have to discuss that with the team so please keep this still on hold

mountiny · 2025-09-11T22:12:36Z

I will close this one for now

chrispader added 2 commits April 12, 2025 11:10

fix: print output files in order if more than 9

9523831

fix: also split summary table for meaningless changes in split files

2d0c059

mountiny marked this pull request as ready for review April 14, 2025 12:27

mountiny requested a review from a team as a code owner April 14, 2025 12:27

mountiny requested review from mountiny and removed request for a team April 14, 2025 12:28

melvin-bot Bot requested a review from JS00001 April 14, 2025 12:28

mountiny assigned chrispader Apr 14, 2025

JS00001 reviewed Apr 14, 2025

chrispader marked this pull request as draft April 14, 2025 13:26

fix: extract metric names to const variable

0f5b056

chrispader added 5 commits May 11, 2025 22:30

Merge branch 'main' into @chrispader/another-e2e-performance-pipeline…

357df52

…-follow-up

fix: improve meaningless changes index string (from to)

d25fb90

fix: further centralize performance metric and mark names

d9060ff

add comments

05a8b32

Update Performance.tsx

02acb7d

chrispader changed the title ~~fix: Yet another E2E performance pipeline follow up~~ fix: E2E performance pipeline flakyness and other improvements May 12, 2025

chrispader added 2 commits May 14, 2025 15:27

feat: implement interactive mode in testRunner

f993b05

add interactive mode log

c372a34

JS00001 changed the title ~~fix: E2E performance pipeline flakyness and other improvements~~ [WIP] fix: E2E performance pipeline flakyness and other improvements May 15, 2025

chrispader added 2 commits May 20, 2025 17:23

add emoji for interactive mode

9b1b759

fix: run tests in offline mode

10a9b55

mountiny mentioned this pull request Jun 2, 2025

[NoQA] Skip the e2e tests until they are fixed #63259

Merged

51 tasks

chrispader added 2 commits June 5, 2025 15:51

Merge branch 'main' into @chrispader/another-e2e-performance-pipeline…

323b315

…-follow-up

Merge branch 'main' into @chrispader/another-e2e-performance-pipeline…

4f0b8ca

…-follow-up

mountiny closed this Sep 11, 2025
