[SPARK-16101][SQL] Refactoring CSV data source to be consistent with JSON data source#13988
[SPARK-16101][SQL] Refactoring CSV data source to be consistent with JSON data source#13988HyukjinKwon wants to merge 6 commits into
Conversation
|
I still need to correct some nits and check the consistency with JSON data source but I opened this just to check if it breaks anything. I will submit some more commits soon. (and will also update the PR description to be in more details maybe). |
|
Test build #61523 has finished for PR 13988 at commit
|
|
Hi @rxin, I think the change in this PR might be still pretty big. Should I maybe make this separate into two PRs for both reading and writing parts? |
|
Test build #61587 has finished for PR 13988 at commit
|
|
Test build #61588 has finished for PR 13988 at commit
|
|
(@rxin gentle ping..) |
|
Test build #62011 has finished for PR 13988 at commit
|
|
Test build #62014 has finished for PR 13988 at commit
|
|
retest this please |
|
I updated the PR description. I hope this is helpful for reviewing. |
|
Test build #62015 has finished for PR 13988 at commit
|
|
Test build #62016 has finished for PR 13988 at commit
|
There was a problem hiding this comment.
@HyukjinKwon we ran in to this issue where csv writes ints for DateType instead of date string. (https://issues.apache.org/jira/browse/SPARK-16597)
There was a problem hiding this comment.
Actually, I opened another PR here, #13912. Maybe it is about that PR.
There was a problem hiding this comment.
ah thanks, commented on that PR. Glad to see someone showing some love to Spark's csv datasource!
|
@rxin Could you take a look please? |
|
Test build #63425 has finished for PR 13988 at commit
|
|
Test build #63482 has finished for PR 13988 at commit
|
|
Test build #64254 has finished for PR 13988 at commit
|
|
Test build #64635 has finished for PR 13988 at commit
|
|
cc @hvanhovell Do you mind if I ask to review this please? I remember the initial proposal was reviewed by you. If this seems too big to review, I can split this into reading path and writing path. |
|
This is also loosely related with https://issues.apache.org/jira/browse/SPARK-15463. After this one is merged, we could resemble the implementation of JSON one easily rather then introducing another refactoring. |
|
Test build #65716 has finished for PR 13988 at commit
|
a6d85b6 to
ac94e67
Compare
|
Test build #66037 has finished for PR 13988 at commit
|
|
Test build #66035 has finished for PR 13988 at commit
|
|
Test build #66038 has finished for PR 13988 at commit
|
|
@hvanhovell If this change looks too big, I will split this into reading path and writing path if you confirm please. |
97eec87 to
697f276
Compare
|
Test build #66939 has finished for PR 13988 at commit
|
|
Test build #66936 has finished for PR 13988 at commit
|
697f276 to
346b1d2
Compare
|
Test build #67391 has finished for PR 13988 at commit
|
|
I will try to split this into two PRs for read path and write path. Would that sound okay to you both @rxin and @hvanhovell? |
346b1d2 to
72e7dcc
Compare
| schema.foreach(field => verifyType(field.dataType)) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
These below just came from CSVRelation.
| @@ -39,22 +38,43 @@ private[csv] object CSVInferSchema { | |||
| * 3. Replace any null types with string type | |||
| */ | |||
| def infer( | |||
There was a problem hiding this comment.
This argument change is kind of a important change to introduce similar functionalities with JSON. (e,g., creating a dataframe from RDD[String] or Dataset[String]).
| case datum => | ||
| Try(datum.toDouble) | ||
| .getOrElse(NumberFormat.getInstance(Locale.US).parse(datum).doubleValue()) | ||
| private def makeSafeHeader( |
There was a problem hiding this comment.
This just came from CSVFileFormat.
|
|
||
| val isCommentSet = this.comment != '\u0000' | ||
|
|
||
| def asWriterSettings: CsvWriterSettings = { |
There was a problem hiding this comment.
These just came from CSVParser.
| writerSettings.setHeaders(schema.fieldNames: _*) | ||
| private val gen = new CsvWriter(writer, writerSettings) | ||
|
|
||
| // A `ValueConverter` is responsible for converting a value of an `InternalRow` to `String`. |
There was a problem hiding this comment.
These below just mostly came from CSVRelation.
| private type ValueConverter = String => Any | ||
|
|
||
| var numMalformedRecords = 0 | ||
| val row = new GenericInternalRow(requiredSchema.length) |
There was a problem hiding this comment.
Now, we reuse the single row.
There was a problem hiding this comment.
Also, it separates numMalformedRecords when it calls parse (...) which looked weird before.
| * each element represents a column) and turns it into either one resulting row or no row (if the | ||
| * the record is malformed). | ||
| */ | ||
| def parse(input: String): Option[InternalRow] = { |
There was a problem hiding this comment.
Here, I separate the parsing mode logics (withParseMode) and actual converting logics (parse).
There was a problem hiding this comment.
Also, the argument change (matching it up to JacksonParser) is also important. We could avoid additional refactoring when introducing the same funtionalities with JacksonParser, (e.g., loading it from RDD[String] or Dataset[String], from_json and to_json functions).
There was a problem hiding this comment.
For example, PR - 13300 introduces such refactoring.
|
Test build #70917 has finished for PR 13988 at commit
|
|
Test build #70919 has finished for PR 13988 at commit
|
|
Test build #70920 has finished for PR 13988 at commit
|
|
Test build #70923 has finished for PR 13988 at commit
|
|
Hi @cloud-fan, do you mind if I ask to check whether it looks making sense? |
| options: CSVOptions) extends Logging { | ||
| def this(schema: StructType, options: CSVOptions) = this(schema, schema, options) | ||
|
|
||
| val valueConverters = makeConverters(schema, options) |
There was a problem hiding this comment.
Some changes about converting here came from CSVTypeCast.
|
can you split this into smaller PRs? it's really painful to review such a big refactor-only PR. |
|
Sure! Let me split this into reading and writing ones. Thank you for yout comments. Let me close this for now. |
What changes were proposed in this pull request?
This PR refactors CSV data source to be consistent with JSON data source.
This PR removes classes
CSVParserand introduces new classesUnivocityParserandUnivocityGeneratorto be consistent with JSON data source (JacksonParser,JacksonGenerator). Also,CSVRelationis merged withCSVFileFormatjust likeJsonFileFormat.This is a rough look of this change:
CSVOptions- reading/writing settings that can be created from options fromCsvParser.UnivocityGenerator- writing logics fromCSVRelationandCsvParserUnivocityParser- parsing logics inCSVTypeCast,CsvParserandCSVRelationCSVFileFormat-CSVOutputWriterFactoryandCsvOutputWriterinCSVRelationThis PR makes the methods in classes have consistent arguments with JSON ones.
UnivocityGeneratorandJacksonGeneratorUnivocityParserresemblesJacksonParser.csv.CSVInferSchemaandjson.InferSchemaThis PR also makes the classes put in together in a consistent manner with JSON.
CsvFileFormatJsonFileFormatAlso, this re-write existing CSV parsing logics to re-use the row, separate parsing mode logic/convering logics and etc.
How was this patch tested?
Existing tests should cover this.