Maintain field nullability through a join #45557

@Fokko

Describe the enhancement requested

Consider the following code:

Python 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> 
>>> arrow_schema = pa.schema(
...     [
...         pa.field("city", pa.string(), nullable=False),
...         pa.field("population", pa.int32(), nullable=False),
...     ]
... )
>>> 
>>> # Write some data
>>> df = pa.Table.from_pylist(
...     [
...         {"city": "Amsterdam", "population": 921402},
...         {"city": "San Francisco", "population": 808988},
...     ],
...     schema=arrow_schema,
... )
>>> 
>>> joined = df.join(df, "city", join_type="inner")
>>> 
>>> joined
pyarrow.Table
city: string
population: int32
population: int32
----
city: [["Amsterdam","San Francisco"]]
population: [[921402,808988]]
population: [[921402,808988]]
>>> df
pyarrow.Table
city: string not null
population: int32 not null
----
city: [["Amsterdam","San Francisco"]]
population: [[921402,808988]]

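We do an inner join of two tables whose fields are all declared not null, but every field in the output is nullable. An inner join cannot introduce nulls, so when a field is not null on the side it comes from (and, for the join key, on both sides), the corresponding output field can be marked not null too.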

I would be happy to see if I can add this with some pointers to the relevant code.

Component(s)

Python
