ARROW-6582: [R] Arrow to R fails with embedded nuls in strings #8365
nealrichardson wants to merge 6 commits into
[1] "person" "woman" "ma" "camera" "tv"
😆
|
if what we want is that the nul is kept, based on @bkietz comment from #8536
cpp11::unwind_protect([&] {
  if (array->null_count()) {
    // need to watch for nulls
    arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
                                              array->offset(), n);
    for (int i = 0; i < n; i++, null_reader.Next()) {
      if (null_reader.IsSet()) {
        SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
      } else {
        SET_STRING_ELT(data, start + i, NA_STRING);
      }
    }
  } else {
    for (int i = 0; i < n; i++) {
      SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
    }
  }
});
with:
private:
  SEXP unsafe_r_string(const std::string& s) const {
    return Rf_mkCharLenCE(s.c_str(), s.size(), CE_UTF8);
  }
this builds on knowing that i.e. it assumes utf-8 but since it does not use the known size, it searches for
cc @jimhester, is this on purpose that this constructor uses |
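To make the truncation concern concrete, here is a minimal sketch that needs neither R nor Arrow. It only compares what a size-aware reader and a nul-scanning reader see for the same bytes. The helper names `visible_length_cstr`, `visible_length_sized`, and `make_example` are hypothetical and exist only for this illustration; they are not part of cpp11, Arrow, or R's C API.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: the length seen by any API that receives only a
// `const char*` and scans forward until the first nul terminator.
std::size_t visible_length_cstr(const std::string& s) {
  return std::strlen(s.c_str());
}

// Hypothetical helper: the length seen by an API that is handed the size
// explicitly alongside the pointer.
std::size_t visible_length_sized(const std::string& s) {
  return s.size();
}

// Example input: 7 bytes, with a nul embedded after "camer".
std::string make_example() {
  return std::string("camer\0a", 7);
}
```

For the example input the nul-scanning view stops after `camer`, while the size-aware view covers all 7 bytes, which is the gap between silently truncating and seeing the embedded nul at all.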
|
@nealrichardson is the intent that we do get to the |
|
Oh hmm, it is not on purpose, I think it was just copy pasted from the That being said having an embedded Which erroring maybe is the intent here, so perhaps switching this to |
|
Arguably failing is better than silently truncating, but that puts us back at the original user report. I see our options as:
|
|
I think failing asap is better, either with the current code, or with an
StringArrayType* string_array = static_cast<StringArrayType*>(array.get());
auto unsafe_r_string = [](const std::string& s) {
  return Rf_mkCharCE(s.c_str(), CE_UTF8);
};
cpp11::unwind_protect([&] {
  if (array->null_count()) {
    // need to watch for nulls
    arrow::internal::BitmapReader null_reader(array->null_bitmap_data(),
                                              array->offset(), n);
    for (int i = 0; i < n; i++, null_reader.Next()) {
      if (null_reader.IsSet()) {
        SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
      } else {
        SET_STRING_ELT(data, start + i, NA_STRING);
      }
    }
  } else {
    for (int i = 0; i < n; i++) {
      SET_STRING_ELT(data, start + i, unsafe_r_string(string_array->GetString(i)));
    }
  }
});
return Status::OK(); |
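As a rough illustration of "failing asap", the sketch below checks a string for an embedded nul using its known length and raises an ordinary C++ exception before any conversion is attempted. This is an assumption-laden stand-in, not the code in this PR: `check_no_embedded_nul` is a hypothetical name, and the choice of `std::invalid_argument` is arbitrary.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws if `s` contains a nul byte anywhere within its
// known length; returns normally otherwise.
void check_no_embedded_nul(const std::string& s) {
  const std::size_t pos = s.find('\0');
  if (pos != std::string::npos) {
    throw std::invalid_argument("embedded nul in string at byte offset " +
                                std::to_string(pos));
  }
}
```

A check like this could run on each value before it is handed to a constructor that scans for a terminator, so that the failure is explicit rather than a silent truncation.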
|
@romainfrancois that looks good to me. I'd recommend using |
|
What you describe (including using GetView) is essentially what we now have on master: https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L290-L321 The difference is that we moved back to If |
|
It does look like
cpp11::cpp_eval('Rf_mkCharLenCE("camer\\0a", 7, CE_UTF8)')
#> Error in f(): embedded nul in string: 'camer\0a'
Created on 2020-11-13 by the reprex package (v0.3.0.9001) |
8180ee2 to 09446f3
|
It would be good to get this resolved for 3.0. I pushed a naive fix: if |
4142e80 to 9c20721
|
@nealrichardson
1) I'll push an implementation of this
2) unfortunately, unwind_exceptions can't really be caught. They are used by cpp11 to get C++ stack unwinding correct, but if one is currently in flight then the R runtime has already been informed that |
nealrichardson left a comment
Thanks for doing this better :)
|
@jimhester, @nealrichardson, @bkietz @dianaclarke @romainfrancois Just wanted to say thanks for working on this. I reported it a long time ago and have just been periodically watching the developments slowly progress. I'm excited to see that there will be a resolution! Cheers! |
No description provided.