Open
Description
Describe the enhancement requested
Consider the following code:
Python 3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>>
>>> arrow_schema = pa.schema(
...     [
...         pa.field("city", pa.string(), nullable=False),
...         pa.field("population", pa.int32(), nullable=False),
...     ]
... )
>>>
>>> # Write some data
>>> df = pa.Table.from_pylist(
...     [
...         {"city": "Amsterdam", "population": 921402},
...         {"city": "San Francisco", "population": 808988},
...     ],
...     schema=arrow_schema,
... )
>>>
>>> joined = df.join(df, "city", join_type="inner")
>>>
>>> joined
pyarrow.Table
city: string
population: int32
population: int32
----
city: [["Amsterdam","San Francisco"]]
population: [[921402,808988]]
population: [[921402,808988]]
>>> df
pyarrow.Table
city: string not null
population: int32 not null
----
city: [["Amsterdam","San Francisco"]]
population: [[921402,808988]]

We do an inner join on two tables whose fields are all declared not null, but the fields of the output are nullable. An inner join cannot introduce nulls, so when a field is not null on both sides, the output field can be marked not null as well.
I would be happy to try adding this myself, given some pointers to the relevant code.
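To make the requested rule concrete, here is a minimal, library-independent sketch of how output nullability could be derived from the inputs. It is not pyarrow's implementation: `Field` and `join_output_fields` are hypothetical names introduced for illustration, and the sketch ignores key-column coalescing and simply lists the fields of both sides. The join type strings mirror the ones `Table.join` accepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field: a name plus a nullable flag."""
    name: str
    nullable: bool


def join_output_fields(left, right, join_type):
    """Return the output fields of a join, with nullability propagated.

    A side is "padded" when its rows may be missing for an unmatched row on
    the other side; only then can the join introduce nulls on that side.
    """
    padded = {
        "inner": (False, False),       # only matched rows survive
        "left outer": (False, True),   # unmatched left rows get null right columns
        "right outer": (True, False),  # unmatched right rows get null left columns
        "full outer": (True, True),    # either side may be filled with nulls
    }
    if join_type not in padded:
        raise ValueError(f"unsupported join type: {join_type!r}")
    pad_left, pad_right = padded[join_type]

    # An output field is nullable if the input field already was,
    # or if the join itself may pad that side with nulls.
    out = [Field(f.name, f.nullable or pad_left) for f in left]
    out += [Field(f.name, f.nullable or pad_right) for f in right]
    return out


left = [Field("city", False), Field("population", False)]
right = [Field("population", False)]

inner = join_output_fields(left, right, "inner")
left_outer = join_output_fields(left, right, "left outer")
```

Under this rule the inner join above keeps every field not null, while a left outer join would make only the right-hand `population` nullable.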
Component(s)
Python