Skip to content

CSV reader cannot parse dates or times #41488

Description

@davlee1972

Describe the bug, including details regarding any error messages, version, and platform.

I thought I would gather the same issues below..

Right now if you have any dates or times in a CSV file, the CSV reader will only successfully parse them if they are in a ISO format like YYYY-MM-DD or HH:MM:SS. Alternative formats like DD-MM-YYYY or MM/DD/YYYY or -H:MM:SS will just fail..
There is a timestamp_parsers option in CSV.ConvertOptions, but those formats only work on timestamp[x] columns.

The error below is common because only YYYY-MM-DD formatted strings are convertable to dates.
CSV conversion error to date32[day]: invalid value '01-20-2000'
Adding "%m-%d-%Y %H:%M:%S" or "%m-%d-%Y" to timestamp_parsers doesn't do anything
since this is a date32[day] column.

#26224
#28303
#33357
#37180

I think the best solution is to add date_parsers and time_parsers options to CSV.ConvertOptions..

I have a current hack I implemented to be able to parse DATEs out of CSV files..
If the dataset schema being used to read a CSV file has any column data types that start with "DATE"..
Change the schema and replace any date columns with a timestamp type.
Include the alternative Date formats in timestamp_parsers.. i.e. "%d-%m-%Y", "%m/%d/%Y", etc..
Read the CSV file which will read the date string values in as timestamps..
Convert the timestamp[s] columns of the result back to date32/64[day] using pyarrow.compute.cast().

Code to swap out date columns with timestamp columns in a schema for dataset api

        new_fields = []
        for field in self.arrow_schema.names:
            new_field = self.arrow_schema.field(field)
            if str(new_field.type).startswith("date"):
                new_fields.append(pa.field(field, pa.timestamp("s")))
            else:
                new_fields.append(self.arrow_schema.field(field))
        new_schema = pa.schema(new_fields)

Expression code to cast timestamp columns to dates when reading CSV files using dataset.to_table

                # convert column list into column dict selection
                if isinstance(columns, List):
                    columns = {column: dataset.field(column) for column in columns}

                # cast timestamps to date32 or date64 in schema definition
                columns = {
                    column: (
                        dataset.field(column).cast(
                            str(self.arrow_schema.field(column).type)
                        )
                        if column in self.arrow_schema.names
                        and str(self.arrow_schema.field(column).type).startswith("date")
                        else expr
                    )
                    for column, expr in columns.items()
                }

Component(s)

C++, Python

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions