Skip to content

Conversation

@timsaucer
Copy link
Member

@timsaucer timsaucer commented Nov 13, 2025

Which issue does this PR close?

Rationale for this change

This PR adds the concept of a library_marker_id to the FFI crate. The reason for this is described in the README text as part of the PR. In our current use of the FFI library we get into issues where we round trip FFI structs back to their original library that now have FFI wrappers when they are no longer needed.

What changes are included in this PR?

  • Adds a method to find the memory address of a static constant in the library.
  • Adds a check in methods that are creating Foreign FFI structs to see if we are actually in the local library or are actually foreign.
  • Replaces the From<> for Foreign to From<> for Arc<dyn ...>. The actual use case is essentially always to use these structs as their implementation of the trait they back.
  • Adds unit tests and a method to mock being in foreign code
  • Removed an unused function call in FFI_ScalarUDF

Are these changes tested?

  • Code is tested against existing unit tests and with datafusion-python.
  • Added unit tests
  • Coverage report compared to main:
coverage-report

Are there any user-facing changes?

This does change the API for the FFI functions.

@timsaucer timsaucer added api change Changes the API exposed to users of the crate ffi Changes to the ffi crate labels Nov 13, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 13, 2025
@timsaucer timsaucer force-pushed the feat/ffi-add-library-marker branch from adff895 to 27d983c Compare November 13, 2025 16:28
@timsaucer timsaucer force-pushed the feat/ffi-add-library-marker branch from 27d983c to bc0968b Compare November 20, 2025 18:04
@timsaucer timsaucer requested a review from Copilot November 20, 2025 21:13
@timsaucer timsaucer marked this pull request as ready for review November 20, 2025 21:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves FFI (Foreign Function Interface) handling by introducing a library_marker_id mechanism to detect when FFI objects originate from the same library, avoiding unnecessary wrapper overhead during round trips. This optimization allows the system to return original implementations directly when FFI structures are passed back to their originating library.

Key changes:

  • Added library_marker_id field to all FFI structs to identify the originating library
  • Modified From implementations to return Arc<dyn Trait> instead of Foreign* wrapper types, bypassing wrappers when detecting local library origin
  • Updated all test code and examples to use the new direct trait conversion pattern

Reviewed Changes

Copilot reviewed 25 out of 26 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
datafusion/ffi/src/lib.rs Adds get_library_marker_id() function and test mock to identify library origin
datafusion/ffi/README.md Documents the library marker ID concept and its purpose
docs/source/library-user-guide/upgrading.md Provides migration guide for API changes from Foreign* to Arc<dyn Trait>
datafusion/ffi/src/execution_plan.rs Adds library marker support and bypass logic to execution plans
datafusion/ffi/src/table_provider.rs Updates table provider with marker checks and helper methods
datafusion/ffi/src/catalog_provider*.rs Implements library marker checks for catalog providers
datafusion/ffi/src/schema_provider.rs Adds marker-based bypass for schema providers
datafusion/ffi/src/ud*.rs Updates all user-defined function types with library marker logic
datafusion/ffi/src/udaf/accumulator*.rs Implements bypass for accumulator types with null pointer safety
datafusion/ffi/src/udwf/partition_evaluator.rs Adds marker checks to partition evaluators
datafusion/ffi/tests/*.rs Updates test code to use new Arc<dyn Trait> conversion pattern
datafusion-examples/examples/ffi/ffi_module_loader/* Updates example to use new API pattern

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@timsaucer
Copy link
Member Author

Thanks for the review @martin-g . Some of those are identical to the review from copilot, which I'm guessing you're using a AI bot to assist. It would be helpful to not get the duplicates if you can adjust it in that way.

@timsaucer timsaucer changed the title Reduce FFI wrappers when round tripping code FFI: add support for identifying when code is called locally Nov 22, 2025
@timsaucer timsaucer changed the title FFI: add support for identifying when code is called locally FFI: return underlying trait type when converting from FFI structs Nov 22, 2025
@timsaucer
Copy link
Member Author

@alamb Thank you for the review! I believe I've addressed all of your comments. The changes I've pushed are only in documentation.

@renato2099
Copy link
Contributor

renato2099 commented Nov 23, 2025

Hi @timsaucer ,

Cool work! thanks!
I have a question about the overall approach, but please excuse me if this a naive/silly question , I am still trying to get familiar with the approach and the code base :)

Right now we are relying on memory address of a static to determine if we are in a foreign/local context, and although that raw memory check makes it now, I am not sure if having some sort of global registry would help future proofing the whole improvement. The global registry could still hold the raw memory pointer but also some UUID generated when a new library gets registered. Do you think that would help?

@timsaucer
Copy link
Member Author

Right now we are relying on memory address of a static to determine if we are in a foreign/local context, and although that raw memory check makes it now, I am not sure if having some sort of global registry would help future proofing the whole improvement. The global registry could still hold the raw memory pointer but also some UUID generated when a new library gets registered. Do you think that would help?

This is a good question. I did play around with doing some kind of registry, but I don't think it necessarily fits the model we currently have of how these structures are created and used.

Take the ExecutionPlan for instance. Those typically only exist during their execution and then get dropped when the strea is complete. If we had registered those, we'd probably have an Arc<dyn ExecutionPlan> floating around in the registry. That would lead to library memory bloat.

There are other things one could use such a registry for. I haven't really thought of any so I'm not sure how your suggestion would make the appoach more future proof. Do you have a use case in mind already? If there's value add in doing so, we can rework this PR but it would likely be more complex because I suspect each one of these structs would need a custom registry.

I think the advantage of what we have here is that it is very simple. The address for the static variable is guaranteed to be unique from one library or application to another.

One thing this PR doesn't do and it would be awesome if I could figure out a way is to figure out if we have libraries that can access each other's internal memory structures. That's near impossible because Rust doesn't have a stable ABI so the same exact code built with two different versions of rustc could end up with different struct layouts in memory. If there was some way that we could compile a unique ID of the compiler with all it's options and check against that, then we could open up a way for libraries to jump into each other without all of this FFI mess. There is no way that I know of to create such a thing, however.

@timsaucer
Copy link
Member Author

About that last suggestion, I did ask an AI agent and it did generate some code. So it may be possible, but that would also be of very limited utility as it would lock providers into building with the same toolchain. For purely internal development, they can also guarantee that already. But for anyone who wants to share libraries with one another in an open source project I don't think it's really worth the effort.

timsaucer and others added 3 commits November 23, 2025 21:10
Co-authored-by: Renato Marroquin <renatoj.marroquin@gmail.com>
Co-authored-by: Renato Marroquin <renatoj.marroquin@gmail.com>
Co-authored-by: Renato Marroquin <renatoj.marroquin@gmail.com>
/// the same library. It is possible that the interplay between
/// foreign and local functions calls create one FFI struct that
/// references another. It is helpful to determine if a foreign
/// struct is truly foreign or in the same library. If we are in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please elaborate what foreign means in this context?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some text

/// This function works by checking the address of the library
/// marker. Each library that implements the FFI code will have
/// a different address for the marker. By checking the marker
/// address we can determine if a struct is truly Foreign or is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// address we can determine if a struct is truly Foreign or is
/// address we can determine if a struct is truly foreign or is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so foreign means the function pointer comes from another library? can it be the same library but different version? like jvm hell?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be a second instance of the same library and that would give a different address. In practice I don't think this is a problem, though. The code would still work, they just wouldn't get the benefit of this marker id work. It would work as previous to this PR.

I don't think it's a problem in practice because it would mean importing the same library twice. The canonical example of datafusion-python doesn't do that. There may be an edge case somewhere. Related is this discussion, but that is a lot of effort for minimal return IMO.

type Error = DataFusionError;

fn try_from(ffi_props: FFI_PlanProperties) -> Result<Self, Self::Error> {
if (ffi_props.library_marker_id)() == crate::get_library_marker_id() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we can add some comments here? afaiu it is the place where the plan properties are asserted as been foreign or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not an assertion, but a check. I think there's plenty of documentation in (1) docstring for library_marker_id (2) docstring for get_library_marker_id and (3) the README.md that I don't think we need an additional inline doc for every time we call this function.

impl From<&FFI_SchemaProvider> for Arc<dyn SchemaProvider + Send> {
fn from(provider: &FFI_SchemaProvider) -> Self {
Self(provider.clone())
if (provider.library_marker_id)() == crate::get_library_marker_id() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

_usually_ do not need to worry about memory management. When interacting with
foreign code, this is not necessarily true. If you review the structures in
this crate, you will find that many of them implement the `Drop` trait and
perform a foreign call.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would probably nice to remind 'Box::leak' and 'forget' when Rust delegates the destruction of allocated object to the callee site like Python?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you're asking for. And I don't think rust is delegating the destruction of the object to Python, but to the Rust code on the other side of the boundary which Python wraps around.

Copy link
Contributor

@comphead comphead Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think sometimes ago I found an article https://www.ralphminderhoud.com/blog/rust-ffi-wrong-way/ that states Rust when created a pointer to an object then shouldn't deallocate it, rather delegating the ownership to the callee site, C or Python in this case

#[no_mangle]
pub unsafe extrn "C" fn file_data_read(f: *mut FileData, path: *const libc::c_char) -> bool { 
    // Get pointer to heap allocated `FileData`
    let mut file_data = Box::from_raw(f);

    // Build path from c-string path argument
    let p = PathBuf::from(CStr::from_ptr(path).to_str().unwrap_or_default());

    // Read the data and provide some minimal error handling
    let res = file_data.read(&p).is_ok();

    // Forget the memory so Rust doesn't deallocate when `file_data` is dropped
    std::mem::forget(file_data);

    res 
} 

I was thinking it might be agood knowledge to anyone reading FFI documentation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exact text would you like to see? We account for the memory forgetting in the Box::into_raw during FFI struct creation and the corresponding Box::from_raw when we do the release().

I'm not quite sure what additional text you think would be helpful.

The author of that blog was focused on language interoperability, which is something we've decided is not a current goal of this crate. Rather we're focused on Rust backed implementations. The Python work simply exposes these structs via PyCapsule - we don't actually support direct interaction of these methods with python.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not trying to be difficult, but I really don't understand what parts need additional explanation. I acknowledge as the author of most of this work it is likely something is obvious to me that isn't clear to others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never mind, it was quite a surprise for me when I first time found how leak or forget are used in FFI, I thought it might be beneficial for the reader, but you are right, this example would be quite out of the context for this doc.

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @timsaucer I'm not great FFi expert, but it seems fine for me, just couple of minor questions and thanks @alamb and @martin-g for the reviews.

@timsaucer
Copy link
Member Author

Thank you for the review @comphead . I've tried to answer your questions, but I also think we talked past each other a bit in the comments. Please let me know if there's anything you think should block this PR from merging!

@timsaucer timsaucer added this pull request to the merge queue Nov 24, 2025
Merged via the queue into apache:main with commit 1cc9bcd Nov 24, 2025
31 checks passed
@renato2099
Copy link
Contributor

This is a good question. I did play around with doing some kind of registry, but I don't think it necessarily fits the model we currently have of how these structures are created and used.

Take the ExecutionPlan for instance. Those typically only exist during their execution and then get dropped when the strea is complete. If we had registered those, we'd probably have an Arc<dyn ExecutionPlan> floating around in the registry. That would lead to library memory bloat.

this is a fair point 👍

There are other things one could use such a registry for. I haven't really thought of any so I'm not sure how your suggestion would make the appoach more future proof. Do you have a use case in mind already? If there's value add in doing so, we can rework this PR but it would likely be more complex because I suspect each one of these structs would need a custom registry.

I was thinking on a scenario where linking libraries would get us into trouble, but I guess that is also not applicable for the work here.

thanks for the discussion @timsaucer !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate documentation Improvements or additions to documentation ffi Changes to the ffi crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants