Merged
Changes from 1 commit
Commits
183 commits
4a4b4af
Merge branch 'branch-0.17' into branch-0.18
shwina Dec 11, 2020
223f2b5
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Dec 15, 2020
abd6ad2
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Dec 17, 2020
18863b5
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 4, 2021
0fbdd31
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 5, 2021
dc9b943
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 5, 2021
d586aa7
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 7, 2021
996fda8
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into b…
shwina Jan 8, 2021
2808a5c
Add a compute_hash_join_indices that returns just the join indices
shwina Jan 11, 2021
ef0baee
Don't need common_columns stuff for join that returns a gathermap
shwina Jan 11, 2021
18f3074
Add hash_join_impl methods that return gathermaps
shwina Jan 11, 2021
70abf48
Add overloads to public hash_join class
shwina Jan 11, 2021
13dff67
Add top-level join APIs that return gathermaps
shwina Jan 11, 2021
3300fe1
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 12, 2021
7ed694c
Use device_uvector instead of device_vector in join
shwina Jan 12, 2021
636c2ea
Undo some API changes
shwina Jan 12, 2021
b79da68
Add join_result
shwina Jan 13, 2021
380aa59
Add APIs that return join_result
shwina Jan 13, 2021
3cbb2b4
Remove column_in_common
shwina Jan 13, 2021
53ae7c9
Add an inner join API that returns gathermaps
shwina Jan 14, 2021
fde172b
Add remaining APIs to return gathermaps
shwina Jan 14, 2021
4a286dd
Add gathermap join test
shwina Jan 18, 2021
c756db9
Replace -1 with INT_MIN
shwina Jan 18, 2021
6a3d23e
Make join_result columns instead of column_views
shwina Jan 20, 2021
5dfc2a0
Replace join_result with a pair of columns
shwina Jan 20, 2021
362829b
Add gathermap test for outer join
shwina Jan 20, 2021
4e4380c
Add and pass full join gathermap test
shwina Jan 20, 2021
339a13d
Begin Python-side refactor
shwina Jan 21, 2021
2b07802
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 25, 2021
0d5a19c
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Jan 28, 2021
fdbdc12
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Feb 1, 2021
5dd5d29
Merge branch 'branch-0.18' of https://github.com/rapidsai/cudf into g…
shwina Feb 5, 2021
6b20429
Merge branch 'branch-0.19' into gathermap-based-join-apis
shwina Feb 8, 2021
044eac1
Add left_semi and left_anti join APIs that return gathermaps
shwina Feb 8, 2021
555d5ec
Add Cython bindings
shwina Feb 8, 2021
56ae616
full -> outer
shwina Feb 9, 2021
dd05121
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Feb 9, 2021
d447924
Progress
shwina Feb 9, 2021
484512e
More progress on py refactor
shwina Feb 9, 2021
5227582
Remove breakpoint
shwina Feb 10, 2021
9cd870e
Fix neg index handling
shwina Feb 10, 2021
8e4f193
Use nullify gather in join
shwina Feb 10, 2021
29fe140
Handle outer joins better
shwina Feb 10, 2021
b634055
Fix index construction
shwina Feb 10, 2021
cd53d6c
Fix sorting behaviour
shwina Feb 10, 2021
75f1efd
Fix Index.join
shwina Feb 10, 2021
1f5d6ad
Progress on semi/anti joins
shwina Feb 10, 2021
de30520
Add simple join test
shwina Feb 10, 2021
66a0de5
Semi-join fix
shwina Feb 11, 2021
ca72295
Only combine key columns in outer join if they have the same name
shwina Feb 11, 2021
ee2242d
Handle when both _on and _index are provided
shwina Feb 11, 2021
e531725
Fix sorting join result
shwina Feb 11, 2021
c8b4948
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Feb 11, 2021
674095c
whitespace
shwina Feb 12, 2021
cbd9dc3
Make construct_join_output_df work with column views
shwina Feb 12, 2021
3f3c3cb
Get rid of hash_join::left_join
shwina Feb 12, 2021
01415fc
More join C++ cleanup
shwina Feb 12, 2021
6185492
Even more cleaning
shwina Feb 17, 2021
d736d1c
More join tests
shwina Feb 18, 2021
b58591d
Fix all join tests
shwina Feb 18, 2021
be560bb
Python regressions
shwina Feb 18, 2021
efb60d6
Revert
shwina Feb 18, 2021
fe6d0b8
Invalid -> Unkown
shwina Feb 18, 2021
547027c
Don't mutate lhs/rhs
shwina Feb 18, 2021
5f93d23
Fix join tests
shwina Feb 19, 2021
b7bf821
Fix semi/anti join trivial cases
shwina Feb 19, 2021
50a2fb2
When testing join results, use a helper that sorts values
shwina Feb 19, 2021
ff0ae79
Totally broken commit
shwina Feb 19, 2021
07cd052
Cleanup
shwina Feb 20, 2021
bd6bf77
Warnings
shwina Feb 20, 2021
a40063e
Cleanup
shwina Feb 22, 2021
ccef9d0
Cleanup
shwina Feb 22, 2021
210244b
Cleanup
shwina Feb 22, 2021
b57348c
Add typing for join helpers
shwina Feb 22, 2021
5c2c9b3
Typing for Join class
shwina Feb 22, 2021
558aa15
Simplify joiner API
shwina Feb 22, 2021
3184896
Example doc
shwina Feb 22, 2021
d3535dc
Refactor join APIs to return a device_uvector
shwina Feb 25, 2021
3b0a2a5
Merge tag 'branch-0.19-latest' of https://github.com/rapidsai/cudf in…
shwina Mar 1, 2021
b82181d
docs
shwina Mar 3, 2021
77d2bfd
Finish up docs?
shwina Mar 3, 2021
0bf34e8
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 4, 2021
26a3fb0
Fix join tests
shwina Mar 4, 2021
8a60d62
Refactor join APIs to work with unique_ptr<rmm::device_uvector>>
shwina Mar 5, 2021
387a953
Update join Cython
shwina Mar 5, 2021
6cd6433
Need to resize the gathermap
shwina Mar 5, 2021
c67dcce
Doc
shwina Mar 5, 2021
30c22ed
Changelog
shwina Mar 5, 2021
f73199d
Add helper to convert gather_map_type->Column
shwina Mar 9, 2021
393c06a
Update python/cudf/cudf/core/frame.py
shwina Mar 9, 2021
e91f554
Cannot specify both column and index
shwina Mar 9, 2021
0185896
Vaildate how
shwina Mar 9, 2021
b232f85
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 9, 2021
1eb495d
Can't use a set
shwina Mar 9, 2021
4f1f072
Avoid function local import
shwina Mar 10, 2021
4aa8fec
False -> NotImplementedError
shwina Mar 10, 2021
ae0e5f9
Update cpp/include/cudf/join.hpp
shwina Mar 10, 2021
f47cf7e
Reuse some join logic
shwina Mar 10, 2021
2a201c3
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 10, 2021
230ca08
Formatting
shwina Mar 10, 2021
498a621
Update cpp/include/cudf/join.hpp
shwina Mar 11, 2021
2de26f3
Docs?
shwina Mar 11, 2021
d6f128c
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 11, 2021
b7d8d8a
Use mr
shwina Mar 11, 2021
9efc761
Docs
shwina Mar 15, 2021
8779bc7
Simplify suffix handling
shwina Mar 16, 2021
4c651ac
Simplify joiner requirements
shwina Mar 17, 2021
b4f4d7c
Do less work in SemiJoin._merge_results
shwina Mar 17, 2021
d353c92
Doc
shwina Mar 17, 2021
580a346
Doc
shwina Mar 17, 2021
328dafd
Return None from semi_join
shwina Mar 17, 2021
297d20a
Init common_type
shwina Mar 17, 2021
e388dd6
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 18, 2021
935648b
Move validation directly into set_by_label and use a raw dict to stor…
vyasr Mar 19, 2021
806a3ef
Remove all references to OrderedColumnDict.
vyasr Mar 19, 2021
40a7b17
Move validation to separate method and use in both set_by_label and c…
vyasr Mar 19, 2021
a1c576e
Format with black.
vyasr Mar 19, 2021
788d9d6
Expose parameter to make validation optional.
vyasr Mar 19, 2021
6a64285
Coerce constructor input to dict before calling items.
vyasr Mar 19, 2021
e7d0981
Make construction safe.
vyasr Mar 19, 2021
c39932c
Final cleanup and documentation.
vyasr Mar 19, 2021
4ff09fc
Address style issues.
vyasr Mar 19, 2021
35c63ec
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 22, 2021
9433582
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into f…
shwina Mar 22, 2021
74f2884
Merge remote-tracking branch 'origin/branch-0.19' into feature/optimi…
vyasr Mar 22, 2021
0178127
CA fix
shwina Mar 22, 2021
5c0f202
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 22, 2021
c8d2364
Don't validate on gathers
shwina Mar 22, 2021
efea63d
Prioritize numeric columns
shwina Mar 22, 2021
898a3d8
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 22, 2021
c3b6444
Lazily compute and delete column length on demand.
vyasr Mar 22, 2021
01b2cf5
Remove redundant clear cache in setitem.
vyasr Mar 22, 2021
8899258
Remove mypy annotation for column length.
vyasr Mar 22, 2021
c6cd415
Optimize casting logic
shwina Mar 22, 2021
3507785
Merge branch 'feature/optimize_accessor_copy' of github.com:vyasr/cud…
shwina Mar 22, 2021
7f8e1cd
Undo
shwina Mar 22, 2021
f2e4609
Don't validate when copying type metadata
shwina Mar 22, 2021
5d378c2
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 22, 2021
83cc407
ImportError
shwina Mar 22, 2021
72598fb
Prioritize numeric dtypes in is_numerical_dtype
shwina Mar 22, 2021
fa220b6
Add unsafe CA ctor
shwina Mar 22, 2021
6572cd3
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 22, 2021
f7dc417
Revert "Prioritize numeric dtypes in is_numerical_dtype"
shwina Mar 22, 2021
3760077
Revert "Prioritize numeric dtypes in is_numerical_dtype"
shwina Mar 22, 2021
01cdfcf
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 22, 2021
de9ca28
Change error message back so that tests pass.
vyasr Mar 23, 2021
e35d03b
Faster is_numerical_dtype
shwina Mar 23, 2021
e2fd533
Faster is_numerical_dtype
shwina Mar 23, 2021
9044d62
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 23, 2021
64ca702
Even faster is_numerical_dtype
shwina Mar 23, 2021
749edf1
Enable fast path for constructing a Buffer from a DeviceBuffer
shwina Mar 23, 2021
7526e4a
Merge branch 'feature/optimize_accessor_copy' into join-bench
shwina Mar 23, 2021
ca772b8
Small fix
shwina Mar 23, 2021
739ec57
Add validation option to insert and standardize error message.
vyasr Mar 23, 2021
498b70e
Fix style.
vyasr Mar 23, 2021
3cd012b
Merge remote-tracking branch 'vyasr/feature/optimize_accessor_copy' i…
shwina Mar 23, 2021
660afa6
Merge branch 'various-py-optimizations' into join-bench
shwina Mar 23, 2021
f8ac22f
Merge branch 'gathermap-based-join-apis' into join-bench
shwina Mar 23, 2021
c28866c
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into v…
shwina Mar 23, 2021
01e13fa
Undo formatting change
shwina Mar 23, 2021
89a0301
Add TODO
shwina Mar 23, 2021
26f4cc8
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 23, 2021
f2036eb
Merge branch 'various-py-optimizations' into join-bench
shwina Mar 23, 2021
5e73de7
init->create + doc
shwina Mar 24, 2021
e0c50b5
Merge branch 'various-py-optimizations' into gathermap-based-join-apis
shwina Mar 24, 2021
fa880c1
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 24, 2021
58bdecd
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 25, 2021
ed1b434
Merge branch 'join-bench' into gathermap-based-join-apis
shwina Mar 25, 2021
ca116a3
Only gather the index if necessary
shwina Mar 25, 2021
ce03918
Don't copy type metadata for the index unless we need to
shwina Mar 25, 2021
b7c6b19
Use validate=False in a few more places
shwina Mar 25, 2021
671a0e0
Import
shwina Mar 26, 2021
797087b
Review
shwina Mar 26, 2021
5ad531f
Coerce to tuple first
shwina Mar 26, 2021
f7e94fb
Replace hasattr with isinstance
shwina Mar 26, 2021
1cb9448
Handle renamed indexes
shwina Mar 26, 2021
cc89360
Fix to names setter
shwina Mar 26, 2021
4ca1238
Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into g…
shwina Mar 26, 2021
9cebf2e
Update cpp/src/join/hash_join.cu
shwina Mar 26, 2021
1584b86
Better example
shwina Mar 26, 2021
3977b79
Remove std::moves
shwina Mar 26, 2021
67919a3
Merge branch 'gathermap-based-join-apis' of github.com:shwina/cudf in…
shwina Mar 26, 2021
7bf6561
Fix formatting error
shwina Mar 26, 2021
Add typing for join helpers
shwina committed Feb 22, 2021
commit b57348c88543a01a2ae618d375874924d7b07897
3 changes: 3 additions & 0 deletions python/cudf/cudf/core/column/column.py
@@ -991,6 +991,9 @@ def distinct_count(
raise NotImplementedError(msg)
return cpp_distinct_count(self, ignore_nulls=dropna)

def can_cast_safely(self, to_dtype: Dtype) -> bool:
return False
Contributor


Why is it necessary to have this method in the base class?

Contributor Author


Not doing so will likely lead to typing errors:

def some_cudf_function(col: ColumnBase):
    if col.can_cast_safely:
        # something

...unless we can restrict the input to a specific column type, which we often can't or don't:

def some_cudf_function(col: NumericalColumn):
    if col.can_cast_safely:
        # something

Maybe a reasonable compromise is to raise a NotImplementedError here?

Contributor


Oh, I see. Yeah, I think NotImplementedError is the way to go. That follows the pattern of other functions in this class, like binary_operator, which only really make sense when implemented by a subclass.
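
For illustration, the compromise discussed above might look roughly like the sketch below. The class names `ToyColumnBase` and `ToyNumericalColumn`, the string dtype names, and the range table are hypothetical stand-ins rather than cudf code. The point is only that the base class declares the method (so callers typed against the base class type-check) while raising `NotImplementedError` until a subclass overrides it.

```python
from typing import Dict, List, Tuple

# Hypothetical toy ranges used only for this sketch.
_INT_RANGES: Dict[str, Tuple[int, int]] = {
    "int8": (-128, 127),
    "int16": (-32768, 32767),
}


class ToyColumnBase:
    """Stand-in for a column base class (not the real cudf ColumnBase)."""

    def can_cast_safely(self, to_dtype: str) -> bool:
        # Declared here so that callers typed against the base class
        # type-check, but only subclasses give it a real meaning.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement can_cast_safely"
        )


class ToyNumericalColumn(ToyColumnBase):
    def __init__(self, values: List[int]) -> None:
        self.values = list(values)

    def can_cast_safely(self, to_dtype: str) -> bool:
        # Toy rule: a cast is safe if every value fits the target range.
        if to_dtype not in _INT_RANGES:
            return False
        lo, hi = _INT_RANGES[to_dtype]
        return all(lo <= v <= hi for v in self.values)


def describe_cast(col: ToyColumnBase, to_dtype: str) -> str:
    # Accepts any column; unsupported column types surface explicitly
    # instead of silently reporting False.
    try:
        ok = col.can_cast_safely(to_dtype)
    except NotImplementedError:
        return "unsupported"
    return "safe" if ok else "unsafe"
```

Compared with returning `False` from the base class, this makes "not supported for this column type" distinguishable from "supported, but the cast is unsafe".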


def astype(self, dtype: Dtype, **kwargs) -> ColumnBase:
if is_categorical_dtype(dtype):
return self.as_categorical_column(dtype, **kwargs)
4 changes: 4 additions & 0 deletions python/cudf/cudf/core/index.py
@@ -12,6 +12,7 @@
from pandas._config import get_option

import cudf
from cudf._typing import DtypeObj
from cudf.core.abc import Serializable
from cudf.core.column import (
CategoricalColumn,
@@ -65,6 +66,9 @@ def _to_frame(this_index, index=True, name=None):


class Index(Frame, Serializable):

dtype: DtypeObj

def __new__(
cls,
data=None,
75 changes: 40 additions & 35 deletions python/cudf/cudf/core/join/_join_helpers.py
@@ -1,13 +1,20 @@
# Copyright (c) 2021, NVIDIA CORPORATION.
from __future__ import annotations

import warnings
from typing import TYPE_CHECKING, Any

import numpy as np
import pandas as pd

import cudf
from cudf.core.dtypes import CategoricalDtype

if TYPE_CHECKING:
from cudf._typing import Dtype
from cudf.core.column import ColumnBase
from cudf.core.frame import Frame


class _Indexer:
# Indexer into a column (either a data column or index level).
@@ -21,35 +28,38 @@ class _Indexer:
# >>> _Indexer("a", column=True).get(df) # returns column "a" of df
# >>> _Indexer("b", index=True).get(df) # returns index level "b" of df

def __init__(self, name, column=False, index=False):
def __init__(self, name: Any, column=False, index=False):
self.name = name
self.column, self.index = column, index

def get(self, obj):
def get(self, obj: Frame) -> ColumnBase:
# get the column from `obj`
if self.column:
return obj._data[self.name]
else:
if obj._index is not None:
return obj._index._data[self.name]
raise KeyError()

def set(self, obj, value):
def set(self, obj: Frame, value: ColumnBase):
# set the colum in `obj`
if self.column:
obj._data[self.name] = value
else:
if obj._index is not None:
obj._index._data[self.name] = value
raise KeyError()

def get_numeric_index(self, obj):
def get_numeric_index(self, obj: Frame) -> int:
# get the position of the column in `obj`
# (counting any index columns)
if self.column:
index_nlevels = obj.index.nlevels if obj._index is not None else 0
index_nlevels = obj._index.nlevels if obj._index is not None else 0
return index_nlevels + tuple(obj._data).index(self.name)
else:
return obj.index.names.index(self.name)
if obj._index is not None:
return obj._index.names.index(self.name)
raise KeyError()
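
As a rough, self-contained sketch of how the typed `_Indexer` above is meant to behave: `ToyFrame` and `ToyIndexer` are hypothetical stand-ins that model a frame as plain dicts, not the cudf classes. The flattened diff does not show indentation, so the sketch assumes `KeyError` is raised only when an index level is requested but the frame has no index.

```python
from typing import Any, Dict, List, Optional


class ToyFrame:
    """Hypothetical stand-in for Frame: named columns plus an optional index."""

    def __init__(
        self, data: Dict[str, List[Any]], index: Optional["ToyFrame"] = None
    ) -> None:
        self._data = data
        self._index = index


class ToyIndexer:
    """Refers to either a data column or an index level by name."""

    def __init__(self, name: Any, column: bool = False, index: bool = False):
        self.name = name
        self.column, self.index = column, index

    def get(self, obj: ToyFrame) -> List[Any]:
        if self.column:
            return obj._data[self.name]
        if obj._index is not None:
            return obj._index._data[self.name]
        raise KeyError(self.name)

    def set(self, obj: ToyFrame, value: List[Any]) -> None:
        if self.column:
            obj._data[self.name] = value
        elif obj._index is not None:
            obj._index._data[self.name] = value
        else:
            # Assumed control flow: fail only when there is no index.
            raise KeyError(self.name)

    def get_numeric_index(self, obj: ToyFrame) -> int:
        # Position of the column, counting index levels first.
        if self.column:
            nlevels = len(obj._index._data) if obj._index is not None else 0
            return nlevels + tuple(obj._data).index(self.name)
        if obj._index is not None:
            return tuple(obj._index._data).index(self.name)
        raise KeyError(self.name)


df = ToyFrame({"a": [1, 2], "c": [3, 4]}, index=ToyFrame({"b": [10, 20]}))
col_a = ToyIndexer("a", column=True).get(df)
level_b = ToyIndexer("b", index=True).get(df)
pos_c = ToyIndexer("c", column=True).get_numeric_index(df)
```

Here `pos_c` counts the single index level `"b"` before the data columns, mirroring the "counting any index columns" comment in the diff.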


def _match_join_keys(lcol, rcol, how):
def _match_join_keys(lcol: ColumnBase, rcol: ColumnBase, how: str) -> Dtype:
# cast the keys lcol and rcol to a common dtype

ltype = lcol.dtype
@@ -59,7 +69,7 @@ def _match_join_keys(lcol, rcol, how):
if isinstance(ltype, CategoricalDtype) or isinstance(
rtype, CategoricalDtype
):
return _match_join_categorical_keys(lcol, rcol, how)
return _match_categorical_dtypes(ltype, rtype, how)

if pd.api.types.is_dtype_equal(ltype, rtype):
return ltype
@@ -91,38 +101,31 @@ def _match_join_keys(lcol, rcol, how):
return None
Contributor


I think reaching this point is really cause to raise. If the dtypes were the same, it would already have returned ltype, right?

Contributor Author


See comment above :-)
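
To make the two options in this thread concrete, here is a minimal sketch contrasting a helper that returns `None` when no common dtype is found with one that raises. The promotion table and both function names are hypothetical and much simpler than the real dtype logic. This is not what the PR does, only an illustration of the trade-off.

```python
from typing import Optional

# Hypothetical toy promotion order; real dtype promotion is far richer.
_NUMERIC_RANK = {"int32": 0, "int64": 1, "float64": 2}


def common_type_or_none(ltype: str, rtype: str) -> Optional[str]:
    """Return a common dtype name, or None when there is no match."""
    if ltype == rtype:
        return ltype
    if ltype in _NUMERIC_RANK and rtype in _NUMERIC_RANK:
        # Pick whichever of the two ranks higher in the toy table.
        return max(ltype, rtype, key=_NUMERIC_RANK.__getitem__)
    return None


def common_type_or_raise(ltype: str, rtype: str) -> str:
    """Same lookup, but falling through is treated as an error."""
    result = common_type_or_none(ltype, rtype)
    if result is None:
        raise TypeError(
            f"no common dtype for join keys: {ltype!r} and {rtype!r}"
        )
    return result
```

Returning `None` pushes the decision to every caller, whereas raising keeps the failure next to its cause. Which is appropriate depends on whether callers have a sensible fallback.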



def _match_join_categorical_keys(lcol, rcol, how):
def _match_categorical_dtypes(ltype: Dtype, rtype: Dtype, how: str) -> Dtype:
# cast the keys lcol and rcol to a common dtype
# when at least one of them is a categorical type

l_is_cat = isinstance(lcol.dtype, CategoricalDtype)
r_is_cat = isinstance(rcol.dtype, CategoricalDtype)

if l_is_cat and r_is_cat:
if isinstance(ltype, CategoricalDtype) and isinstance(
rtype, CategoricalDtype
):
# if both are categoricals, logic is complicated:
return _match_join_categorical_keys_both(lcol, rcol, how)
elif l_is_cat or r_is_cat:
if l_is_cat and how in {"left", "leftsemi", "leftanti"}:
return lcol.dtype
common_type = (
lcol.dtype.categories.dtype
if l_is_cat
else rcol.dtype.categories.dtype
)
return common_type
else:
raise ValueError("Neither operand is categorical")
return _match_categorical_dtypes_both(ltype, rtype, how)

if isinstance(ltype, CategoricalDtype):
if how in {"left", "leftsemi", "leftanti"}:
return ltype
common_type = ltype.categories.dtype
elif isinstance(rtype, CategoricalDtype):
common_type = rtype.categories.dtype
return common_type


def _match_join_categorical_keys_both(lcol, rcol, how):
# cast lcol and rcol to a common type when they are *both*
# categorical types.
#
def _match_categorical_dtypes_both(
ltype: CategoricalDtype, rtype: CategoricalDtype, how: str
) -> Dtype:
# The commontype depends on both `how` and the specifics of the
# categorical variables to be merged.

ltype, rtype = lcol.dtype, rcol.dtype

# when both are ordered and both have the same categories,
# no casting required:
if ltype == rtype:
@@ -151,7 +154,9 @@ def _match_join_categorical_keys_both(lcol, rcol, how):

if how == "inner":
# cast to category types -- we must cast them back later
return _match_join_keys(ltype.categories, rtype.categories, how)
return _match_join_keys(
ltype.categories._values, rtype.categories._values, how
)
elif how in {"left", "leftanti", "leftsemi"}:
# always cast to left type
return ltype
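
The branch for exactly one categorical operand, as refactored above to work on dtypes rather than columns, can be sketched in isolation as follows. `ToyCategorical` is a hypothetical stand-in for `CategoricalDtype` that records only the dtype of its categories, and plain strings stand in for other dtypes. The both-categorical case is deliberately left out, since part of that logic is not visible in this diff. The explicit `ValueError` for the neither-categorical case follows the pre-refactor code.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ToyCategorical:
    """Hypothetical stand-in for CategoricalDtype."""

    categories_dtype: str


ToyDtype = Union[str, ToyCategorical]
LEFT_JOINS = {"left", "leftsemi", "leftanti"}


def match_one_categorical(
    ltype: ToyDtype, rtype: ToyDtype, how: str
) -> ToyDtype:
    l_is_cat = isinstance(ltype, ToyCategorical)
    r_is_cat = isinstance(rtype, ToyCategorical)

    if l_is_cat and r_is_cat:
        # Out of scope for this sketch.
        raise NotImplementedError("both operands are categorical")

    if l_is_cat:
        if how in LEFT_JOINS:
            # Left-style joins keep the left categorical dtype.
            return ltype
        # Otherwise fall back to the dtype of the categories.
        return ltype.categories_dtype

    if r_is_cat:
        return rtype.categories_dtype

    raise ValueError("Neither operand is categorical")
```

Working on dtypes instead of columns keeps the helper free of data access, which is what makes the `Dtype` annotations in the diff straightforward.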