ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values #1651
cpcloud wants to merge 31 commits into
Since we'll probably want to use libre2 for analytics, we should see at some point if we can replace the Boost regexen with libre2.
@kou Do you have any idea why in this build: https://travis-ci.org/apache/arrow/jobs/345443821 OS X isn't finding the correct symbol? Is there some installation step for Here's the error message:
@wesm I'll open a JIRA for it.
Ah, looks like it was added in ARROW-29.
I'm not sure if this is the correct behavior here. I need to look into what other systems do with a string of all zeros for precision and scale.
wesm left a comment
This doesn't look like too much fun, thanks for slogging through this! I left some stylistic comments and a few other notes.
I wonder if we should make some global state that is initialized when the library is loaded
aren't these dchecks already done?
Yep they are done in the Import* functions. I'll remove these.
I kept these DCHECKS since these functions are returning Status but I removed the messages.
what's the rationale for this, the symbol linking issue?
import pyarrow.parquet was segfaulting, I assumed because we're statically linking boost in the parquet build and dynamically in the arrow build. This only shows up when using the regex library.
I see, we should be consistent about which we do across the libraries. Part of why I wish we were building all these libraries in a monorepo setting
Ugh, Python, what did we do to deserve this? =)
these dchecks are performed twice -- should this be just DCHECK_OK on each of these?
I introduced a DCHECK_OK macro and used it here and in a few other places.
This should never happen by design, right?
Yep, I was guarding against potential uses of it after the fact so that arrow crashes with a useful error message to the developer.
I suppose I could relax this and just do nothing if the value is NaN.
is there some TMP magic that makes this abstraction zero-cost, or does this add overhead?
So, operator[](const std::string) returns a const_reference to a sub_match object, which defines a conversion operator to std::string. sub_match has first and second members, which are bidirectional iterators used to construct a string like std::string(match.first, match.second). Alternatively, we can use results["SIGN"].str(). The main difference is that the first uses __builtin_memcpy, while the second calls reserve and then ultimately __builtin_memset N times. I suspect that one memcpy of N bytes is cheaper than N calls to memset on individual elements.
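For readers following the named-group lookup mentioned here (results["SIGN"]), a rough Python analogue may help. This is only a sketch: the pattern, the group names other than SIGN, and the split_decimal helper are assumptions invented for illustration, not the regex or code used in this PR.

```python
import re

# Hypothetical pattern for illustration only; not the regex used in this PR.
DECIMAL_PATTERN = re.compile(
    r"""
    (?P<SIGN>[+-])?                   # optional sign
    (?P<WHOLE>\d*)                    # digits before the decimal point
    (?:\.(?P<FRACTION>\d*))?          # optional fractional digits
    (?:[eE](?P<EXPONENT>[+-]?\d+))?   # optional exponent
    """,
    re.VERBOSE,
)


def split_decimal(text):
    """Split a decimal string into its named components."""
    match = DECIMAL_PATTERN.fullmatch(text)
    # Reject strings with no digits at all, such as "" or ".".
    if match is None or not (match["WHOLE"] or match["FRACTION"]):
        raise ValueError(f"not a decimal string: {text!r}")
    return {
        "sign": match["SIGN"] or "+",
        "whole": match["WHOLE"],
        "fraction": match["FRACTION"] or "",
        "exponent": int(match["EXPONENT"] or 0),
    }
```

Each named group is looked up by name on the match object, much like results["SIGN"] on the C++ side; groups that did not participate in the match come back as None and are given defaults here.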
I reckon we'll want to replace this with libre2 at some point. It's also a lot faster than boost::regex: http://lh3lh3.users.sourceforge.net/reb.shtml
Yep, I'll make a JIRA for it.
FWIW, I don't believe it's necessary to use this NULLPTR macro outside of headers.
can you add a test here with an explicit decimal type sufficient to accommodate the data?
This should ignore NaNs
The Update method now ignores NaNs
I'm going to change this to ignore NaNs
Umm. I have never seen this error. I may not be able to help because I don't have macOS. What are the outputs of the following commands?
% nm /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib | grep boost
% nm /usr/local/opt/boost/lib/libboost_regex-mt.dylib
% strings /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib | grep boost
% strings /usr/local/opt/boost/lib/libboost_regex-mt.dylib | grep boost
% otool -L /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib
% otool -L /usr/local/opt/boost/lib/libboost_regex-mt.dylib
Shouldn't that be conditioned on ARROW_CI_C_GLIB_AFFECTED?
@pitrou This is already conditioned on in .travis.yml just before this script is called. Is it really necessary to condition on it again?
Not really, though given the filename it might be better to avoid further mistakes :-)
@wesm @pitrou this is passing on travis: https://travis-ci.org/cpcloud/arrow/builds/347872453
Sweet, here is the Appveyor build: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.587. Going to take a quick look through and then merge.
This PR closes the following JIRAs
ARROW-2145: [Python] Decimal conversion not working for NaN values
ARROW-2153: [C++/Python] Decimal conversion not working for exponential notation
ARROW-2157: [Python] Decimal arrays cannot be constructed from Python lists
ARROW-2160: [C++/Python] Fix decimal precision inference
ARROW-2177: [C++] Remove support for specifying negative scale values in DecimalType
I originally separated these fixes into a few smaller PRs, but it turned out
that the issues were all related, so I fixed them all in one PR.
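To make the interaction between these issues concrete, here is a minimal sketch of precision and scale inference using only Python's standard decimal module. It is not the implementation in this PR: infer_precision_and_scale is a hypothetical helper, and the choices to skip non-finite values and to pad with zeros rather than emit a negative scale are assumptions drawn from the JIRA titles above.

```python
from decimal import Decimal


def infer_precision_and_scale(values):
    """Return the smallest (precision, scale) that fits every finite value.

    Hypothetical helper for illustration: NaNs are skipped and a negative
    scale is never produced.
    """
    max_integer_digits = 0
    max_scale = 0
    for value in values:
        if not value.is_finite():
            continue  # ignore NaN (and infinities)
        _, digits, exponent = value.as_tuple()
        if exponent >= 0:
            # e.g. Decimal("1E+3"): count the trailing zeros as integer
            # digits instead of reporting a scale of -3
            integer_digits = len(digits) + exponent
            scale = 0
        else:
            scale = -exponent
            integer_digits = max(len(digits) - scale, 0)
        max_integer_digits = max(max_integer_digits, integer_digits)
        max_scale = max(max_scale, scale)
    # A decimal type needs at least one digit of precision.
    precision = max(max_integer_digits + max_scale, 1)
    return precision, max_scale
```

Under these assumptions, a value in exponential notation such as Decimal("1E+3") contributes four integer digits and a scale of zero, and a list mixing NaN with Decimal("123.4") and Decimal("1.234") needs three integer digits plus three fractional digits.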