Skip to content

feat: implement posgres style encode/decode#6821

Merged
alamb merged 12 commits into
apache:mainfrom
ozgrakkurt:impl-encoding-functions
Jul 5, 2023
Merged

feat: implement posgres style encode/decode#6821
alamb merged 12 commits into
apache:mainfrom
ozgrakkurt:impl-encoding-functions

Conversation

@ozgrakkurt

@ozgrakkurt ozgrakkurt commented Jul 1, 2023

Copy link
Copy Markdown
Contributor

closes #3125

Need help with test failure, couldn't understand what it means:
thread 'sql::expr::test_encoding_expressions' panicked at 'called `Result::unwrap()` on an `Err` value: Context("simplify_expressions", Internal("Optimizer rule 'simplify_expressions' failed, due to generate a different schema, original schema: DFSchema { fields: [DFField { qualifier: None, field: Field { name: \"encode(Utf8(\\\"tom\\\"),Utf8(\\\"base64\\\"))\", data_type: Binary, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {} }, new schema: DFSchema { fields: [DFField { qualifier: None, field: Field { name: \"encode(Utf8(\\\"tom\\\"),Utf8(\\\"base64\\\"))\", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {} }"))', datafusion/core/tests/sql/mod.rs:556:38

@github-actions github-actions Bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates core Core DataFusion crate labels Jul 1, 2023

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you thank you @ozgrakkurt

Something is not quite right

I tried playing around with these functions in datafusion-cli and I got some strange results

Like I expect this to return 'foo' but it did not (I see one of your test uses arrow_cast maybe?):

❯ select decode(encode('foo', 'hex'), 'hex');
+-----------------------------------------------------+
| decode(encode(Utf8("foo"),Utf8("hex")),Utf8("hex")) |
+-----------------------------------------------------+
| 666f6f                                              |
+-----------------------------------------------------+
1 row in set. Query took 0.003 seconds.

I also don't think the code that takes ColumnarValue::Array is covered by tests

Thus I think this PR needs more tests, ideally via sqllogictests -- perhaps in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/sqllogictests/test_files/functions.slt

I think that would make the behavior of these functions clearer.

Cases needing coverage

  1. Scalar input (of the string/largestring/binary/largebinary) -- real values, empty '' string, and NULL
  2. table input for each of the types (maybe a single table with different columns?
  3. both hex and base64 arguments

Need update to the feature flags

Note that when I initially tried to test it out, I got the following error (I think because the main datafusion feature flags were not updated)

DataFusion CLI v27.0.0
❯ select encode('foo', 'base64');
Optimizer rule 'simplify_expressions' failed
caused by
Internal error: function encode requires compilation with feature flag: encoding_expressions.. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

I had to add this to Cargo:

diff --git a/datafusion/core/Cargo.toml b/datafusion/core/Cargo.toml
index 468219470..4e0e2fde5 100644
--- a/datafusion/core/Cargo.toml
+++ b/datafusion/core/Cargo.toml
@@ -38,10 +38,11 @@ path = "src/lib.rs"
 avro = ["apache-avro", "num-traits", "datafusion-common/avro"]
 compression = ["xz2", "bzip2", "flate2", "zstd", "async-compression"]
 crypto_expressions = ["datafusion-physical-expr/crypto_expressions", "datafusion-optimizer/crypto_expressions"]
-default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "compression"]
+default = ["crypto_expressions", "encoding__expressions", "regex_expressions", "unicode_expressions", "compression"]
 # Enables support for non-scalar, binary operations on dictionaries
 # Note: this results in significant additional codegen
 dictionary_expressions = ["datafusion-physical-expr/dictionary_expressions", "datafusion-optimizer/dictionary_expressions"]
+encoding__expressions = ["datafusion-physical-expr/encoding_expressions"]
 # Used for testing ONLY: causes all values to hash to the same value (test for collisions)
 force_hash_collisions = []
 pyarrow = ["datafusion-common/pyarrow"]

Once I did that and tested it out it worked great:

❯ create table foo as values ('foo', 'base64');
0 rows in set. Query took 0.002 seconds.
❯ select * from foo;
+---------+---------+
| column1 | column2 |
+---------+---------+
| foo     | base64  |
+---------+---------+
1 row in set. Query took 0.002 seconds.
❯ select encode(column1, 'base64') from foo;
+------------------------------------+
| encode(foo.column1,Utf8("base64")) |
+------------------------------------+
| Zm9v                               |
+------------------------------------+
1 row in set. Query took 0.003 seconds.
❯ select encode(column1, column2) from foo;
Internal error: Encode using dynamically decided method is not yet supported. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
❯ select encode(column1, 'hex') from foo;
+---------------------------------+
| encode(foo.column1,Utf8("hex")) |
+---------------------------------+
| 666f6f                          |
+---------------------------------+

Comment on lines +309 to +311
ColumnarValue::Array(_) => Err(DataFusionError::Internal(
"Encode using dynamically decided method is not yet supported".into(),
)),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given a user can cause this error I don't think it is internal (aka it isn't a bug in datafusion, which is what internal errors are for).

Also I think it would help to make the error a bit more user friendly

Suggested change
ColumnarValue::Array(_) => Err(DataFusionError::Internal(
"Encode using dynamically decided method is not yet supported".into(),
)),
ColumnarValue::Array(_) => Err(DataFusionError::NotYetImplemented(
"Second argument to encode must be a constant: Encode using dynamically decided method is not yet supported".into(),
)),

"Unsupported data type {other:?} for function decode",
))),
},
ColumnarValue::Array(_) => Err(DataFusionError::Internal(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this should be a different error and have a nicer message

@ozgrakkurt

ozgrakkurt commented Jul 2, 2023

Copy link
Copy Markdown
Contributor Author

666f6f

This is probably because decode returns binary. It has to return binary since that value isn't guaranteed to be valid utf8 and encode always returns utf8 since the output is guaranteed to be utf8 and it can be converted to binary if wanted.

Working on the error and test issues

@github-actions github-actions Bot added the sqllogictest SQL Logic Tests (.slt) label Jul 2, 2023
@ozgrakkurt

ozgrakkurt commented Jul 2, 2023

Copy link
Copy Markdown
Contributor Author

@alamb
All done except I couldn't find how to force a value to be largebinary or largeutf8 in sqllogictest, can you help me with this?

edit: I forgot the errors, doing it now

@ozgrakkurt

Copy link
Copy Markdown
Contributor Author

@alamb I think I got everything, can you check?

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @ozgrakkurt

I also made a PR to add these functions to the user guide: #6856

@alamb alamb merged commit 4bb54e4 into apache:main Jul 5, 2023
@ozgrakkurt

Copy link
Copy Markdown
Contributor Author

Looks great to me -- thank you @ozgrakkurt

I also made a PR to add these functions to the user guide: #6856

no problem, thanks as well!

2010YOUY01 pushed a commit to 2010YOUY01/arrow-datafusion that referenced this pull request Jul 5, 2023
* feat: add encode, decode functions

* add test

* add licenses

* fix return types

* delete files

* toml fmt

* add logical expr

* fix NULL case, add test for NULL and empty

* add sqllogic tests

* update error msgs, run cargo update in cli dir

* update sqllogictest

* add more tests
alamb pushed a commit to alamb/datafusion that referenced this pull request Jul 6, 2023
* feat: add encode, decode functions

* add test

* add licenses

* fix return types

* delete files

* toml fmt

* add logical expr

* fix NULL case, add test for NULL and empty

* add sqllogic tests

* update error msgs, run cargo update in cli dir

* update sqllogictest

* add more tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for hex encoding/decoding for Binary in sql queries

2 participants