Skip to content

Left join could use bitmap for left join instead of Vec<bool>#342

Closed
boaz-codota wants to merge 1 commit into
apache:masterfrom
boaz-codota:master
Closed

Left join could use bitmap for left join instead of Vec<bool>#342
boaz-codota wants to merge 1 commit into
apache:masterfrom
boaz-codota:master

Conversation

@boaz-codota

Copy link
Copy Markdown

Which issue does this PR close?

Closes #240 .

Rationale for this change

Described in the issue.

What changes are included in this PR?

Described in the issue.

Are there any user-facing changes?

No user facing changes.

@boaz-codota

Copy link
Copy Markdown
Author

@Dandandan I hope I understood the requested change correctly. Was not familiar with bitvec before, but I used the docs and I think I implemented it correctly

@Dandandan

Copy link
Copy Markdown
Contributor

@Dandandan I hope I understood the requested change correctly. Was not familiar with bitvec before, but I used the docs and I think I implemented it correctly

Yes, this is awesome, exactly what I meant! Impressive how little is changed.

I will do some benchmarking tonight or tomorrow to see if it's changing anything (but I believe this part is not the most performance sensitive anyway).

@alamb @andygrove @jorgecarleitao
What do you think about adding bitvec as dependency? It might be useful later for other datastructures / algorithms too.

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

Merging #342 (00b4a84) into master (1702d6c) will decrease coverage by 0.00%.
The diff coverage is 71.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #342      +/-   ##
==========================================
- Coverage   75.72%   75.71%   -0.01%     
==========================================
  Files         143      143              
  Lines       23832    23835       +3     
==========================================
+ Hits        18046    18047       +1     
- Misses       5786     5788       +2     
Impacted Files Coverage Δ
datafusion/src/physical_plan/hash_join.rs 86.14% <71.42%> (-0.26%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1702d6c...00b4a84. Read the comment docs.

@jorgecarleitao

Copy link
Copy Markdown
Member

Thanks a lot @boaz-codota !

Is there any performance difference? imo It would be good to measure before adding a dependency.

Note that arrow also has an BooleanBufferBuilder or something that can create a bitmap.

@Dandandan

Copy link
Copy Markdown
Contributor

Thanks a lot @boaz-codota !

Is there any performance difference? imo It would be good to measure before adding a dependency.

Note that arrow also has an BooleanBufferBuilder or something that can create a bitmap.

The idea is here is not performance per se, but memory savings. As this is an in-memory join, any saved memory is welcome, as it will increase the chance a given join will fit in memory.

Using a boolean array buffer builder makes sense to me to avoid the dependency.

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we already have Arrow as dependency which has bitmaps, I like @jorgecarleitao 's idea of using code from that crate instead (such as https://docs.rs/arrow/4.0.0/arrow/bitmap/struct.Bitmap.html).

@boaz-codota

Copy link
Copy Markdown
Author

@alamb @jorgecarleitao I'm trying to replace the usage of bitvec with arrow's structures. Both BooleanBufferBuilder & Bitmap seems to behave like immutable data structures, which does not support the required use case (specifically: https://github.com/apache/arrow-datafusion/pull/342/files#diff-44d49c7778aa0c300afacdd7d89b0729ffaedd932d1ac34f3ef8db6b6cdfd73aR1084). So I'd love to change my code to use arrow's structures, but those two does not seem to support what is required.
I looked on other array structures under arrow, but none of them seem to fit as well (https://docs.rs/arrow/4.0.0/arrow/array/index.html).
Let me know how I should advance from here

@Dandandan

Copy link
Copy Markdown
Contributor

I agree both the datastructures don't seem very useful for this use case. I don't think there is a mutable structure like bitvec in Arrow...
WDYT @alamb @jorgecarleitao

@jorgecarleitao

Copy link
Copy Markdown
Member

It is just buried in the catacombs of the builder.rs 💀💀💀 BooleanBufferBuilder is its name.

@jorgecarleitao

Copy link
Copy Markdown
Member

Ah, sorry, I see; you need mutable access to an item in the middle of the builder; yeap, that is currently not supported. It assumes that it is always pushed/advanced.

@Dandandan

Copy link
Copy Markdown
Contributor

Yes, I think access to the middle only could be done by accessing / mutating the contents, in which case you are already close to rolling your own bitvec API.

@jorgecarleitao

Copy link
Copy Markdown
Member

To be fair, it is fairly easy to enable that on the BooleanBufferBuilder; we need to add a method to set the bit using bit_utils::set_bit. Buffer for validity is effectively an immutable bitvec, and BooleanBufferBuilder is the corresponding mutable version; we just haven't added that functionality.

I find the bitvec a bit overkill for a single use-case of a bitmap, specially when we already have a "bitvec" on arrow, but I am also fine pushing this through and leave this problem for later.

@alamb

alamb commented May 17, 2021

Copy link
Copy Markdown
Contributor

I also am hesitant to add a new dependency to DataFusion without some sort of compelling performance or memory improvement

@boazberman

Copy link
Copy Markdown
Contributor

I'll try to tackle adding the set_bit to Arrow's BooleanBufferBuilder later this week, hopefully contributing to arrow-rs will be as easy as contributing to this repo

@boazberman

Copy link
Copy Markdown
Contributor

I'm stuck on this, I've implemented the code which I think that should work, but it is not working as expected. I opened the PR to get help.
apache/arrow-rs#383
cc @alamb @Dandandan @jorgecarleitao

@alamb

alamb commented Jun 14, 2021

Copy link
Copy Markdown
Contributor

FYI arrow 4.3.0 has been released with the code in BooleanBufferBuilder

@boaz-codota

Copy link
Copy Markdown
Author

I'll refactor this as soon as I have some free time. Thanks @alamb !

@boaz-codota

Copy link
Copy Markdown
Author

@alamb tried to refactor the code now, noticed that there is no get_bit equivalent of set_bit in BooleanBufferBuilder and calling finish to make receive a Buffer is consuming the content of the BooleanBufferBuilder. I'm not sure whether or not I can consume the BooleanBufferBuilder within HashJoinStream.poll_next. Can I? If not, I guess I'd add another PR to BooleanBufferBuilder that exposes get_bit

@alamb

alamb commented Jun 29, 2021

Copy link
Copy Markdown
Contributor

I'm not sure whether or not I can consume the BooleanBufferBuilder within HashJoinStream.poll_next. Can I?

I am not sure

If not, I guess I'd add another PR to BooleanBufferBuilder that exposes get_bit

One thing that might be helpful would be to prototype out this PR using a local checkout / fork of the arrow-rs repo so that you can get everything working and find other potential "gotchas" without having to wait on the arrow-rs release cycle

Once you figured out what you needed then we could merge it according in parts

@boaz-codota

Copy link
Copy Markdown
Author

Implemented in an arrow native way here: #884

unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
HairstonE pushed a commit to HairstonE/datafusion that referenced this pull request Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Left join could use bitmap for left join instead of Vec<bool>

6 participants