Describe the bug, including details regarding any error messages, version, and platform.
I thought I would gather the same issues below..
Right now if you have any dates or times in a CSV file, the CSV reader will only successfully parse them if they are in a ISO format like YYYY-MM-DD or HH:MM:SS. Alternative formats like DD-MM-YYYY or MM/DD/YYYY or -H:MM:SS will just fail..
There is a timestamp_parsers option in CSV.ConvertOptions, but those formats only work on timestamp[x] columns.
The error below is common because only YYYY-MM-DD formatted strings are convertable to dates.
CSV conversion error to date32[day]: invalid value '01-20-2000'
Adding "%m-%d-%Y %H:%M:%S" or "%m-%d-%Y" to timestamp_parsers doesn't do anything
since this is a date32[day] column.
#26224
#28303
#33357
#37180
I think the best solution is to add date_parsers and time_parsers options to CSV.ConvertOptions..
I have a current hack I implemented to be able to parse DATEs out of CSV files..
If the dataset schema being used to read a CSV file has any column data types that start with "DATE"..
Change the schema and replace any date columns with a timestamp type.
Include the alternative Date formats in timestamp_parsers.. i.e. "%d-%m-%Y", "%m/%d/%Y", etc..
Read the CSV file which will read the date string values in as timestamps..
Convert the timestamp[s] columns of the result back to date32/64[day] using pyarrow.compute.cast().
Code to swap out date columns with timestamp columns in a schema for dataset api
new_fields = []
for field in self.arrow_schema.names:
new_field = self.arrow_schema.field(field)
if str(new_field.type).startswith("date"):
new_fields.append(pa.field(field, pa.timestamp("s")))
else:
new_fields.append(self.arrow_schema.field(field))
new_schema = pa.schema(new_fields)
Expression code to cast timestamp columns to dates when reading CSV files using dataset.to_table
# convert column list into column dict selection
if isinstance(columns, List):
columns = {column: dataset.field(column) for column in columns}
# cast timestamps to date32 or date64 in schema definition
columns = {
column: (
dataset.field(column).cast(
str(self.arrow_schema.field(column).type)
)
if column in self.arrow_schema.names
and str(self.arrow_schema.field(column).type).startswith("date")
else expr
)
for column, expr in columns.items()
}
Component(s)
C++, Python
Describe the bug, including details regarding any error messages, version, and platform.
I thought I would gather the same issues below..
Right now if you have any dates or times in a CSV file, the CSV reader will only successfully parse them if they are in a ISO format like YYYY-MM-DD or HH:MM:SS. Alternative formats like DD-MM-YYYY or MM/DD/YYYY or -H:MM:SS will just fail..
There is a timestamp_parsers option in CSV.ConvertOptions, but those formats only work on timestamp[x] columns.
The error below is common because only YYYY-MM-DD formatted strings are convertable to dates.
CSV conversion error to date32[day]: invalid value '01-20-2000'
Adding "%m-%d-%Y %H:%M:%S" or "%m-%d-%Y" to timestamp_parsers doesn't do anything
since this is a date32[day] column.
#26224
#28303
#33357
#37180
I think the best solution is to add date_parsers and time_parsers options to CSV.ConvertOptions..
I have a current hack I implemented to be able to parse DATEs out of CSV files..
If the dataset schema being used to read a CSV file has any column data types that start with "DATE"..
Change the schema and replace any date columns with a timestamp type.
Include the alternative Date formats in timestamp_parsers.. i.e. "%d-%m-%Y", "%m/%d/%Y", etc..
Read the CSV file which will read the date string values in as timestamps..
Convert the timestamp[s] columns of the result back to date32/64[day] using pyarrow.compute.cast().
Code to swap out date columns with timestamp columns in a schema for dataset api
Expression code to cast timestamp columns to dates when reading CSV files using dataset.to_table
Component(s)
C++, Python