add vortex-geo crate and WKB extension type#7722
Conversation
a445f7d to
1c99386
Compare
Merging this PR will not alter performance
|
3382e4f to
05dcff9
Compare
| duckdb_logical_type duckdb_vx_create_geometry(const char *crs) { | ||
| D_ASSERT(crs); | ||
| auto geom = | ||
| (*crs == '\0') ? duckdb::LogicalType::GEOMETRY() : duckdb::LogicalType::GEOMETRY(std::string(crs)); | ||
| auto copy = duckdb::make_uniq<duckdb::LogicalType>(std::move(geom)); | ||
| return reinterpret_cast<duckdb_logical_type>(copy.release()); | ||
| } |
There was a problem hiding this comment.
These C++ changes are because DuckDB upstream doesn't expose the full geometry type stuff over the C API.
| let blob = unsafe { cpp::duckdb_get_blob(self.as_ptr()) }; | ||
| let slice = | ||
| unsafe { std::slice::from_raw_parts(blob.data.cast::<u8>(), blob.size.as_()) }; |
There was a problem hiding this comment.
in certain situations this could lead to UB depending on the allocator, b/c if duckdb_malloc returns NULL then you'd call slice::from_raw_parts(NULL) which is UB.
see the new take_blob function below
i only fixed this b/c i was going to repeat this for the GEOMETRY branch and figured i'd fix it in both places instead
| } | ||
|
|
||
| pub unsafe fn set_vector_buffer(&self, buffer: &VectorBufferRef) { | ||
| pub unsafe fn set_vector_buffer(&mut self, buffer: &VectorBufferRef) { |
There was a problem hiding this comment.
these methods should've always been &mut for soundness reasons, they were just missing before
| } | ||
|
|
||
| #[test] | ||
| fn test_geometry() { |
There was a problem hiding this comment.
see this for (rather simple) example of querying geometry data from DuckDB
| type NativeValue<'a> = Wkb<'a>; | ||
|
|
||
| fn id(&self) -> ExtId { | ||
| ExtId::new_static("geo.wkb") |
There was a problem hiding this comment.
lets think about this id a little more.
What namespace do we want?
There was a problem hiding this comment.
yep this is just a placeholder. we can do vortex.geo.wkb, idk we don't have great rules around this rn
There was a problem hiding this comment.
i updated this to vortex.geo.wkb, wdyt?
|
|
||
| fn temporal_to_duckdb(temporal: TemporalMetadata) -> VortexResult<LogicalType> { | ||
| let duckdb_type = match temporal { | ||
| TemporalMetadata::Timestamp(unit, None) => match unit { | ||
| TimeUnit::Nanoseconds => DUCKDB_TYPE::DUCKDB_TYPE_TIMESTAMP_NS, | ||
| TimeUnit::Microseconds => DUCKDB_TYPE::DUCKDB_TYPE_TIMESTAMP, | ||
| TimeUnit::Milliseconds => DUCKDB_TYPE::DUCKDB_TYPE_TIMESTAMP_MS, | ||
| TimeUnit::Seconds => DUCKDB_TYPE::DUCKDB_TYPE_TIMESTAMP_S, | ||
| _ => vortex_bail!("Invalid TimeUnit {} for timestamp", unit), | ||
| }, | ||
| TemporalMetadata::Timestamp(unit, Some(tz)) => { | ||
| if tz.as_ref() != "UTC" { | ||
| vortex_bail!("Invalid timezone for timestamp_tz {tz}, must be UTC"); | ||
| } | ||
| if unit != &TimeUnit::Microseconds { | ||
| vortex_bail!( | ||
| "Invalid TimeUnit {} for timestamp_tz, must be Microseconds", | ||
| unit | ||
| ); | ||
| } | ||
| DUCKDB_TYPE::DUCKDB_TYPE_TIMESTAMP_TZ | ||
| } | ||
| TemporalMetadata::Date(unit) => match unit { | ||
| TimeUnit::Days => DUCKDB_TYPE::DUCKDB_TYPE_DATE, | ||
| _ => vortex_bail!("Invalid TimeUnit {} for date", unit), | ||
| }, | ||
| TemporalMetadata::Time(unit) => match unit { | ||
| TimeUnit::Microseconds => DUCKDB_TYPE::DUCKDB_TYPE_TIME, | ||
| TimeUnit::Nanoseconds => DUCKDB_TYPE::DUCKDB_TYPE_TIME_NS, | ||
| _ => vortex_bail!("Invalid TimeUnit {} for time", unit), | ||
| }, | ||
| }; | ||
|
|
||
| Ok(LogicalType::new(duckdb_type)) | ||
| } | ||
|
|
There was a problem hiding this comment.
shall we also move these into a ext mod?
|
Is this still blocked on arrow stuff @a10y ? Is that moving or do you want it taken over. Keen for this to merge this time! |
|
Arrow stuff is merged so this is unblocked now. I will find some time this week to fix this up and get merged |
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Resolve conflict in vortex-duckdb/src/convert/dtype.rs by keeping both new match arms: DUCKDB_TYPE_GEOMETRY (from this branch) and DUCKDB_TYPE_VARIANT (from develop). Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
…aths Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Resolved conflict in vortex-duckdb/cpp/value.cpp and value.h by adopting develop's removal of duckdb_diagnostics.h wrapping (PR #7747) while preserving the new geometry value APIs and geometry_crs.hpp include. Signed-off-by: Andrew Duffy <andrew@a10y.dev>
| if ext.ext_dtype().is::<AnyTemporal>() { | ||
| return temporal::new_exporter(TemporalArray::try_from(ext)?, ctx); | ||
| } | ||
|
|
||
| if ext.ext_dtype().is::<WellKnownBinary>() { | ||
| return geo::new_wkb_exporter(WellKnownBinaryData::try_from(ext)?, ctx); | ||
| } |
There was a problem hiding this comment.
mind moving this out of the base canonical exporter.
There was a problem hiding this comment.
hmm, how should i thread it thru then? geometry is a core type in duckdb
There was a problem hiding this comment.
if array.is_ext() {
extension::new_exporter(array...)
}
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Develop bumped workspace prost to 0.14.4 via lockfile maintenance, but vortex-geo only exists on this branch so its Cargo.lock entry still pointed at 0.14.3. Fix the lockfile so `cargo check --locked` succeeds after merging develop. Signed-off-by: Andrew Duffy <andrew@a10y.dev>
… module `vortex_array` is only in vortex-duckdb's [dev-dependencies], so lib code can't reference it directly. Switch to the `vortex::array::*` re-exports that the neighboring exporter modules already use. Also drop the unused `ConversionCache` import and add the missing `use crate::exporter::geo;` so the bare `geo::new_wkb_exporter(...)` call resolves to the sibling module instead of an external `geo` crate. Signed-off-by: Andrew Duffy <andrew@a10y.dev>
## Summary Part of #7686 Following up on #7722 , which added the WellKnownBinary extension type and handling for exporting to DuckDB vectors. This PR adds support for import/export to Arrow for the extension type. ## Testing Unit tests are added to exercise both code paths --------- Signed-off-by: Andrew Duffy <andrew@a10y.dev>
We are no longer using the lockfiles, but I accidentally added this one as part of #7722 Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Summary
Part of the implementation of #7686
This PR adds a new crate,
vortex-geo, which will hold all extension types, custom layouts, and encodings necessary to store geospatial vector datasets in Vortex files.The goal of this crate is to enable integration with DuckDB Spatial, GeoDataFusion, SedonaDB, and Iceberg v3 geo types.
This initial PR implements an extension type for the Well-Known Binary encoding (WKB). This encoding is the most common format for geospatial data for analytics, it's what both GeoParquet and DuckDB use to represent geometry types.
We also wire this into vortex DuckDB extension to support converting Geometry columns between Vortex and DuckDB formats.
API Changes
Adds new crate
vortex-geowith extension typeWellKnownBinary, (geo.wkb)Adds support for geometry columns to DuckDB as well, e.g. you can now do something like
It won't be very performant yet, not until we support better layouts + stats for geometry.
Testing
We have unit tests for metadata serde, round trip conversion between Vortex <> DuckDB geometry format, and an additional E2E test that demonstrates reading a Vortex file with geometry data and providing it to the DuckDB Spatial extension.