Skip to content

ARROW-11979: Combine limit into SortOptions - REQUEST FOR COMMENT#9722

Closed
alamb wants to merge 1 commit into
apache:masterfrom
alamb:alamb/sort_in_options
Closed

ARROW-11979: Combine limit into SortOptions - REQUEST FOR COMMENT#9722
alamb wants to merge 1 commit into
apache:masterfrom
alamb:alamb/sort_in_options

Conversation

@alamb

@alamb alamb commented Mar 16, 2021

Copy link
Copy Markdown
Contributor

The sort_limit kernel was added by @sundy-li in #9602

While writing some doc examples in #9721, it occured to me we could potentially simplify the API so I figured I would offer a proposed PR for comment

Rationale

Since we already have a SortOptions structure that controls sorting options, we could also add the limit to that structure rather than adding a new sort_limit function and still avoid changing the API

Changes

Move the limit option to SortOptions

/// If Some(limit), only first `limit` elements in the sort order
/// in the output. Any data data after the limit will be
/// discarded.
pub limit: Option<usize>,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the proposal

assert_eq!(output, expected)
}

// TODO remove this function

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will rewrite the tests properly (with a single SortOptions parameter) if people like this API change, but I didn't bother until I know it is worthwhile

@alamb

alamb commented Mar 16, 2021

Copy link
Copy Markdown
Contributor Author

@Dandandan I wonder if you have any comments on this approach

@jorgecarleitao

jorgecarleitao commented Mar 16, 2021

Copy link
Copy Markdown
Member

I think that SortOptions is used on a per-column basis; it corresponds to how each column should be ordered (nulls first, descending). limit reflects how many entries should be outputted by sort, which IMO is column-independent.

This is probably more evident in lex_sort, where a SortOption per column is passed.

@alamb

alamb commented Mar 16, 2021

Copy link
Copy Markdown
Contributor Author

I think that SortOptions is used on a per-column basis; it corresponds to how each column should be ordered (nulls first, descending). limit reflects how many entries should be outputted by sort, which IMO is column-independent.

@jorgecarleitao that is a good point. Since the Arrow sort kernel only operates on a single column, I think the distinction is not relevant.

However, since lexsort and DataFusion sort use the same SortOptions struct for each coulmn, that is a good reason not to add a limit option to them (as then it could be inconsistent across columns, potentially)

@sundy-li

sundy-li commented Mar 16, 2021

Copy link
Copy Markdown
Contributor

I think that SortOptions is used on a per-column basis; it corresponds to how each column should be ordered (nulls first, descending). limit reflects how many entries should be outputted by sort, which IMO is column-independent.

Totally agree with that.

select * from table order by a ASC, b DESC limit 10

There are 2 sort_options, yet the limit is the only one for the global results.

@github-actions

Copy link
Copy Markdown

@alamb

alamb commented Mar 16, 2021

Copy link
Copy Markdown
Contributor Author

I am convinced that this PR is not a good idea at this time. Closing. Thanks for the feedback @sundy-li and @jorgecarleitao 👍

@alamb alamb closed this Mar 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants