Skip to content

tests: Move random data generation methods from CometCastSuite to new DataGenerator class#426

Merged
viirya merged 5 commits into
apache:mainfrom
andygrove:datagen
May 15, 2024
Merged

tests: Move random data generation methods from CometCastSuite to new DataGenerator class#426
viirya merged 5 commits into
apache:mainfrom
andygrove:datagen

Conversation

@andygrove

@andygrove andygrove commented May 14, 2024

Copy link
Copy Markdown
Member

Which issue does this PR close?

N/A

Rationale for this change

Start moving random data generation out of CometCastSuite and into a class that can be used from other test suites.

What changes are included in this PR?

  • Refactor data generator
  • Mark CAST string -> integer as incompatible because the data gen changes exposed new issues

How are these changes tested?

Existing tests

}
}

// TODO Commented out to work around scalafix since this is currently unused.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this commented out code since it is no longer needed

@advancedxy advancedxy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes looks good to me except one minor comment.

However, the CI is failing, you may need to take a look at that.

import testImplicits._

/** Create a data generator using a fixed seed so that tests are reproducible */
private val gen = new DataGenerator(new Random(42))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to make this instance a field in the Companion object of RandomDataGenerator?

So that other test suites can all use the same random data generator by default.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure I understand what you are suggesting, I think you are saying to add this:

object DataGenerator {
  val DEFAULT = new DataGenerator(new Random(42))
}

And then reference that from the test suite?

private val gen = DataGenerator.DEFAULT

I think this could lead to some non-deterministic behavior. For example, running CometCastSuite on its own from my IDE would start with the freshly created generator, but running the same suite after after test suites have run would result in different inputs.

I think it would be better for each test to create a new instance of the generator with a fixed seed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could define it as a def rather than a val to force a new creation each time. WDYT?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new suggested approach, that’s better and should be used.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @advancedxy. I have implemented your suggestion.

@andygrove

Copy link
Copy Markdown
Member Author

However, the CI is failing, you may need to take a look at that.

Yes, this change effectivey changed the seed and it has exposed new compatibility issues in existing tests. I am investigating.

@andygrove andygrove requested a review from viirya May 14, 2024 17:01
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 34.18%. Comparing base (14494d3) to head (cf779df).
Report is 9 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #426      +/-   ##
============================================
+ Coverage     34.02%   34.18%   +0.16%     
+ Complexity      857      850       -7     
============================================
  Files           116      116              
  Lines         38565    38552      -13     
  Branches       8517     8523       +6     
============================================
+ Hits          13120    13178      +58     
+ Misses        22691    22609      -82     
- Partials       2754     2765      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

case DataTypes.ByteType | DataTypes.ShortType | DataTypes.IntegerType |
DataTypes.LongType =>
Compatible()
Incompatible(Some("Not all invalid inputs are detected"))

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed #431 to fix this

private val gen = DataGenerator.DEFAULT

/** Number of random data items to generate in each test */
private val dataSize = 10000

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the size is increased 10x?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. At 1000 I saw string -> short fail, but not the other string -> integer tests. When I increased this to 10000 then all of the string -> integer tests failed, so seems like we get better coverage at this level

Double.NaN,
Double.PositiveInfinity,
Double.NegativeInfinity,
0.0d) ++

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we cover

      1.0d,
       -1.0d,
       Int.MinValue.toDouble,
       Int.MaxValue.toDouble,
       0.0d,
       -0.0d

?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. But I think it could be addressed in a follow-up pr.

Double.NaN,
Double.PositiveInfinity,
Double.NegativeInfinity,
0.0d) ++

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. But I think it could be addressed in a follow-up pr.

@viirya viirya merged commit fcf7d5b into apache:main May 15, 2024
@viirya

viirya commented May 15, 2024

Copy link
Copy Markdown
Member

Merged. Thanks @andygrove

@andygrove andygrove deleted the datagen branch May 15, 2024 19:01
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
… DataGenerator class (apache#426)

* add new DataGenerator class

* move more data gen methods

* ignore some failing tests, update compat docs

* address feedback

* fix regression

(cherry picked from commit fcf7d5b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants