ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values by cpcloud · Pull Request #1651 · apache/arrow

cpcloud · 2018-02-23T17:50:03Z

This PR closes the following JIRAs:

ARROW-2145: [Python] Decimal conversion not working for NaN values
ARROW-2153: [C++/Python] Decimal conversion not working for exponential notation
ARROW-2157: [Python] Decimal arrays cannot be constructed from Python lists
ARROW-2160: [C++/Python] Fix decimal precision inference
ARROW-2177: [C++] Remove support for specifying negative scale values in DecimalType

I originally separated these fixes into a few smaller PRs, but it turned out
that the issues were all related, so I fixed them all in one PR.
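
As a rough illustration of the precision and scale inference these JIRAs touch on, here is a minimal sketch using only Python's standard decimal module. The function name infer_precision_scale is hypothetical and this is not the code in this PR. It only shows why exponent notation and NaN values trip up a naive digit count, assuming scale is never negative (in the spirit of ARROW-2177).

```python
from decimal import Decimal


def infer_precision_scale(value):
    """Return (precision, scale) for a finite Decimal, or None for NaN.

    Hypothetical helper for illustration only; not Arrow's implementation.
    """
    if value.is_nan():
        # NaN carries no digits, so it cannot contribute a precision.
        return None
    if value.is_infinite():
        raise ValueError("infinity has no precision or scale")

    _, digits, exponent = value.as_tuple()
    if exponent >= 0:
        # Decimal("1E+3") is 1000: the exponent adds trailing zeros,
        # so precision grows and scale stays at zero (never negative).
        return len(digits) + exponent, 0

    scale = -exponent
    # Decimal("0.001") stores one digit but needs three fractional places,
    # so precision must be at least as large as scale.
    return max(len(digits), scale), scale


if __name__ == "__main__":
    for text in ("123.45", "1E+3", "1.5E-3", "NaN"):
        print(text, infer_precision_scale(Decimal(text)))
```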

wesm · 2018-02-25T00:45:05Z

Since we'll probably want to use libre2 for analytics, we should see at some point if we can replace the Boost regexen with libre2

cpcloud · 2018-02-26T14:39:19Z

@kou Do you have any idea why in this build: https://travis-ci.org/apache/arrow/jobs/345443821 OS X isn't finding the correct symbol? Is there some installation step for brew that I need to add?

Here's the error message:

dyld: Symbol not found: __ZNK5boost16re_detail_10650131cpp_regex_traits_implementationIcE17transform_primaryEPKcS4_
  Referenced from: /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib
  Expected in: /usr/local/opt/boost/lib/libboost_regex-mt.dylib

cpcloud · 2018-02-26T14:42:10Z

@wesm I'll open a JIRA for it.

cpcloud · 2018-02-26T14:43:17Z

Ah, looks like it was added in ARROW-29.

cpcloud · 2018-02-26T21:42:31Z

I'm not sure if this is the correct behavior here. I need to look into what other systems do with a string of all zeros for precision and scale.
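
As one data point for that comparison, the snippet below prints how Python's own decimal module represents zeros written in different ways. It only shows CPython's Decimal behavior (a single stored zero digit, with the written scale kept in the exponent) and says nothing about what Arrow should do.

```python
from decimal import Decimal

# Inspect the (digits, exponent) pair Python keeps for various spellings of zero.
for text in ("0", "000", "0.000", "0E+2"):
    _, digits, exponent = Decimal(text).as_tuple()
    print(f"{text!r}: digits={digits}, exponent={exponent}")
```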

wesm

This doesn't look like it was much fun; thanks for slogging through it! I left some stylistic comments and a few other notes.

wesm · 2018-02-25T00:47:12Z

I wonder if we should make some global state that is initialized when the library is loaded

aren't these dchecks already done?

Yep they are done in the Import* functions. I'll remove these.

I kept these DCHECKS since these functions are returning Status but I removed the messages.

wesm · 2018-02-26T21:12:22Z

what's the rationale for this, the symbol linking issue?

import pyarrow.parquet was segfaulting. I assume that's because we link Boost statically in the Parquet build and dynamically in the Arrow build. This only shows up when using the regex library.

I see. We should be consistent about which one we do across the libraries. This is part of why I wish we were building all these libraries in a monorepo setting.

wesm · 2018-02-26T21:14:41Z

Ugh, Python, what did we do to deserve this? =)

wesm · 2018-02-26T21:16:50Z

these dchecks are performed twice -- should this be just DCHECK_OK on each of these?

I introduced a DCHECK_OK macro and used it here and in a few other places.

wesm · 2018-02-26T21:18:57Z

This should never happen by design, right?

Yep, I was guarding against potential uses of it after the fact so that arrow crashes with a useful error message to the developer.

I suppose I could relax this and just do nothing if the value is nan.

wesm · 2018-02-26T21:41:24Z

is there some TMP magic that makes this abstraction zero-cost, or does this add overhead?

So, operator[](const std::string) returns a const_reference to a sub_match object, which defines a conversion operator to std::string. sub_match has first and second members, which are bidirectional iterators used to construct a string like std::string(match.first, match.second). The alternative is results["SIGN"].str(). The main difference is that the first uses __builtin_memcpy, while the second calls reserve and then ultimately __builtin_memset N times. I suspect that one memcpy of N bytes is cheaper than N calls to memset on individual elements.
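
For readers who have not seen the named-group style this thread refers to, here is a loose Python analogue using the standard re module. Only the SIGN group name comes from the discussion above; WHOLE, FRACTION, EXPONENT and the split_decimal helper are hypothetical, and the pattern is not the Boost expression used in the PR.

```python
import re

# Hypothetical pattern: optional sign, digits, optional fraction, optional exponent.
DECIMAL_PATTERN = re.compile(
    r"(?P<SIGN>[+-]?)"
    r"(?P<WHOLE>\d*)"
    r"(?:\.(?P<FRACTION>\d*))?"
    r"(?:[eE](?P<EXPONENT>[+-]?\d+))?"
)


def split_decimal(text):
    """Split a decimal string into its named parts, or raise ValueError."""
    match = DECIMAL_PATTERN.fullmatch(text)
    # Every group is optional, so also require at least one digit somewhere.
    if match is None or not (match["WHOLE"] or match["FRACTION"]):
        raise ValueError(f"not a decimal string: {text!r}")
    return {
        "sign": match["SIGN"],
        "whole": match["WHOLE"],
        # Groups that did not participate in the match come back as None.
        "fraction": match["FRACTION"] or "",
        "exponent": match["EXPONENT"] or "",
    }


if __name__ == "__main__":
    print(split_decimal("-1.23E+4"))
```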

wesm · 2018-02-26T21:43:05Z

I reckon we'll want to replace this with libre2 at some point. It's also a lot faster than boost::regex: http://lh3lh3.users.sourceforge.net/reb.shtml

Yep, I'll make a JIRA for it.

wesm · 2018-02-26T21:44:12Z

FWIW, I don't believe it's necessary to use the NULLPTR macro outside of headers.

Cool I'll fix

wesm · 2018-02-26T21:49:14Z

can you add a test here with an explicit decimal type sufficient to accommodate the data?

wesm · 2018-02-26T21:49:24Z

cpcloud · 2018-02-26T22:13:57Z

This should ignore nans

The Update method now ignores nans
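
To make the "ignore NaNs" behavior concrete, here is a minimal Python sketch of an accumulator that tracks the precision and scale needed for a sequence of values while skipping NaN. The class name DecimalMetadataSketch and its fields are hypothetical; the C++ Update method in this PR may differ in detail. The sketch assumes the combined precision is the largest count of integer digits plus the largest scale seen.

```python
from decimal import Decimal


class DecimalMetadataSketch:
    """Hypothetical accumulator; not the C++ class from this PR."""

    def __init__(self):
        self.integer_digits = 0  # most digits seen left of the decimal point
        self.scale = 0           # most digits seen right of the decimal point

    def update(self, value):
        if value.is_nan():
            # NaN has no digits, so it must not affect the inferred type.
            return
        if value.is_infinite():
            raise ValueError("infinity cannot be stored as a decimal")
        _, digits, exponent = value.as_tuple()
        if exponent >= 0:
            integer_digits, scale = len(digits) + exponent, 0
        else:
            scale = -exponent
            integer_digits = max(len(digits) - scale, 0)
        self.integer_digits = max(self.integer_digits, integer_digits)
        self.scale = max(self.scale, scale)

    @property
    def precision(self):
        return self.integer_digits + self.scale


if __name__ == "__main__":
    meta = DecimalMetadataSketch()
    for item in (Decimal("123.4"), Decimal("NaN"), Decimal("0.005")):
        meta.update(item)
    print(meta.precision, meta.scale)
```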

cpcloud · 2018-02-26T22:54:51Z

I'm going to change to ignore nans

kou · 2018-02-27T01:34:44Z

Hmm. I have never seen this error. I may not be able to help you because I don't have macOS.

What is the output of the following commands?

% nm /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib | grep boost
% nm /usr/local/opt/boost/lib/libboost_regex-mt.dylib
% strings /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib | grep boost
% strings /usr/local/opt/boost/lib/libboost_regex-mt.dylib | grep boost
% otool -L /Users/travis/build/apache/arrow/cpp-install/lib/libarrow.0.dylib
% otool -L /usr/local/opt/boost/lib/libboost_regex-mt.dylib

pitrou · 2018-03-01T09:37:52Z

Shouldn't that be conditioned on ARROW_CI_C_GLIB_AFFECTED?

@pitrou This condition is already checked in .travis.yml just before this script is called. Is it really necessary to check it again?

Not really, though given the filename it might be better to avoid further mistakes :-)

cpcloud · 2018-03-01T19:48:39Z

@wesm @pitrou this is passing on travis: https://travis-ci.org/cpcloud/arrow/builds/347872453

wesm · 2018-03-01T22:24:29Z

Sweet, here is the Appveyor build: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.587. Going to take a quick look through and then merge

wesm

+1, thanks @cpcloud!

This was referenced Feb 23, 2018

ARROW-2145/ARROW-2157: [Python] Decimal conversion not working for NaN values #1610

Closed

ARROW-2153/ARROW-2160: [C++/Python] Fix decimal precision inference #1618

Closed

wesm reviewed Feb 26, 2018

Comment thread cpp/src/arrow/python/builtin_convert.cc Outdated

cpcloud Feb 26, 2018

I'm going to change to ignore nans

cpcloud mentioned this pull request Feb 28, 2018

ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array #1681

Closed

pitrou reviewed Mar 1, 2018

cpcloud added 17 commits March 1, 2018 10:30

ARROW-2145: [Python] Decimal conversion not working for NaN values

8e816ec

IWYU

f562378

Revert header change

8893a45

Revert test change

0665f6e

Install libboost-regex-dev on travis

e6ac864

Use shared boost on parquet CI build

50e35d6

Install boost with c++11 option

8be22a6

Show boost install

7c7270a

Install boost first

77a41ee

NULLPTR to nullptr

4c74c63

DCHECK_OK

d905202

DCHECK_OK

281f798

DCHECK_OK

1df6923

DCHECK_Ok

db664f2

Fix order of operands

092a962

Check return value of PyList_SetItem

418754f

Add DecimalMetadata::Update test for ignoring NaN values

b24ff25

cpcloud added 13 commits March 1, 2018 10:30

Ignore nans in decimal metadata update

3190b1a

Refactor import decimal and acquire the gil before importing

a05b316

Formatting

4e6db3c

boost osx debugging

29e1ebc

DCHECK_OK for release builds

b4bcfd9

More script debugging

78cbf51

Fix boost root

03ee999

Perms

ae5db5f

Silence cmake complaints about boost version

99505a9

Add tests to accommodate decimal values

00be578

Brewfile

ab3e4a5

Pass version as argument

0d45688

Args must be a ruby Hash

1fc2a96

Make sure we only install if glibc is affected

97fcb96

wesm approved these changes Mar 1, 2018

wesm closed this in bfac60d Mar 1, 2018

asfimport mentioned this pull request Mar 1, 2018

[Python] Decimal conversion not working for NaN values #18112

Closed
