Skip to content

ARROW-1579: [Java] Adding containerized Spark Integration tests#1319

Closed
BryanCutler wants to merge 8 commits into
apache:masterfrom
BryanCutler:docker-spark-integration-ARROW-1579
Closed

ARROW-1579: [Java] Adding containerized Spark Integration tests#1319
BryanCutler wants to merge 8 commits into
apache:masterfrom
BryanCutler:docker-spark-integration-ARROW-1579

Conversation

@BryanCutler

Copy link
Copy Markdown
Member

This adds configuration files for running containerized integration tests with Spark for Java and Python. The image is based off of maven with openJDK-8 and will include a copy of Arrow when built on the parent directory to ARROW_HOME. Running the container will do the following:

  1. Install Arrow Java artifacts to m2 repo in container
  2. Checkout current Spark master branch
  3. Update Spark POM to use the installed version of Arrow and package
  4. Run Arrow related Java/Scala Spark tests
  5. Run pyarrow related Python Spark tests

Successful completion of the running container will indicate that there are no breaking changes between current Spark code and the Arrow code

@BryanCutler

BryanCutler commented Nov 14, 2017

Copy link
Copy Markdown
Member Author

This is currently a WIP, the Scala/Java tests are able to run

Left TODO:

  • Run PySpark Tests
  • Verify working with docker-compose and existing volumes in arrow/dev
  • Check why Zinc is unable to run in mvn build, need to enable port 3030?
  • Speed up pyarrow build using conda prefix as toolchain

@BryanCutler

Copy link
Copy Markdown
Member Author

I don't prior experience with Docker, so any suggestions are welcome! @heimir-sverrisson @wesm please take a look when you have a chance. One question I have from looking at the Dask integration tests is that are we assuming that pyarrow is already built prior to creating the docker image?

@wesm

wesm commented Nov 14, 2017

Copy link
Copy Markdown
Member

I think it needs to be built as part of the Docker build. We should try to create some reusable scripts like our Travis CI setup to help with automating certain common tasks (setting up the C++ toolchain, building things, etc.)

@xhochy xhochy left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment about better usage of docker's ability to cache layer of the image creation.

Comment thread dev/spark_integration/Dockerfile Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add apache-arrow as late as a possible. From the point on where you have added it, docker's caching will not be effective anymore. For example, once you move the apt-get line before the add, a subsequent build of the image will not run apt-get again but use the cached layer on top of the maven image that already is at the state after the apt-get call.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great tip, thanks @xhochy ! apt-get takes a long time and I didn't know why it only sometimes ended up being cached.

@wesm wesm changed the title [WIP] ARROW-1579: [TESTS] Adding containerized Spark Integration tests [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests Nov 18, 2017
@wesm

wesm commented Jan 18, 2018

Copy link
Copy Markdown
Member

@BryanCutler or @icexelloss could we get this set up? I'd like to look into setting up nightly integration test runs on a server someplace (we have a shared machine that @cpcloud and I have been using for nightly builds, and the pandas team users for benchmarks) so we can get warned immediately if something gets broken

@BryanCutler

Copy link
Copy Markdown
Member Author

I'll try to pick this up again in the next few days @wesm. One question though, when we do an API breaking change, like rename MapVector, then this will fail until Spark is patched. Do we need to keep a branch with these kind of patches until they can be applied upstream?

@wesm

wesm commented Jan 18, 2018

Copy link
Copy Markdown
Member

I think that's fine. The failing nightly integration test will help with creating some urgency to get the downstream project fixed

@BryanCutler BryanCutler force-pushed the docker-spark-integration-ARROW-1579 branch from 5d398e8 to 95eb22a Compare January 26, 2018 01:37
@BryanCutler

Copy link
Copy Markdown
Member Author

This is pretty much running for me now, but I'm getting a strange test failure when running pyspark tests

======================================================================
FAIL: test_unsupported_datatype (pyspark.sql.tests.ArrowTests)
----------------------------------------------------------------------
AttributeError: 'LooseVersion' object has no attribute 'version'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/apache-arrow/spark/python/pyspark/sql/tests.py", line 3404, in test_unsupported_datatype
    df.toPandas()
AssertionError: "Unsupported data type" does not match "'LooseVersion' object has no attribute 'version'"

Not sure why because the tests pass normally, maybe something with the conda setup? I'll try some more later..

@xhochy

xhochy commented Jan 28, 2018

Copy link
Copy Markdown
Member

@BryanCutler I had a look into the build and found some errors with the Python/C++ scripts (will do a PR for that) but I also cannot get mvn install to work correctly. Any ideas why it installs to this location?

[INFO] Installing /apache-arrow/arrow/java/tools/target/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar to /apache-arrow/arrow/java/?/.m2/repository/org/apache/arrow/arrow-tools/0.9.0-SNAPSHOT/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar

@BryanCutler

Copy link
Copy Markdown
Member Author

@xhochy that is strange, until I switched the image to be based on FROM maven:3.5.2-jdk-8-slim I was seeing similar problems with paths containing ? but since then it's been working fine and installs under root

[INFO] Installing /apache-arrow/arrow/java/tools/target/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar to /root/.m2/repository/org/apache/arrow/arrow-tools/0.9.0-SNAPSHOT/arrow-tools-0.9.0-SNAPSHOT-jar-with-dependencies.jar

I did follow this docker post-installation step, maybe that has something to do with it? https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user

I've also been running with my local .m2 repo mounted in the container with docker run -v /home/bryan/.m2:/root/.m2 spark-arrow to save from always downloading dependencies, but I tried skipping that still don't get any errors.

Besides the strange path, what is the exact error that it gives you for mvn install?

@BryanCutler

Copy link
Copy Markdown
Member Author

Looks like the errors I'm getting is because pa.__version__ return None

>>> import pyarrow as pa
>>> from distutils.version import LooseVersion
>>> LooseVersion(pa.__version__) < LooseVersion('0.8.0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bryan/miniconda2/envs/pyarrow-dev/lib/python2.7/distutils/version.py", line 296, in __cmp__
    return cmp(self.version, other.version)
AttributeError: LooseVersion instance has no attribute 'version'
>>> pa.__version__
>>>

Is there something in the build I need to set to get a valid version? cc @wesm

@BryanCutler

BryanCutler commented Jan 29, 2018

Copy link
Copy Markdown
Member Author

from pyarrow.__init__.py it looks like maybe I should install setuptools_scm but then I get the following version:

>>> pa.__version__
'0.1.1.dev1231+g0a49022'

I would expect something like '0.8.***', where is it getting the above number from?

@wesm

wesm commented Jan 29, 2018

Copy link
Copy Markdown
Member

Not my area of expertise (@xhochy or @cpcloud would know better) -- usually that relates to not having fetched all the tags

@felixcheung felixcheung left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe some native toolchain is missing when switching to maven:3.5.2-jdk-8-slim

popd

# Build Spark with Arrow
SPARK_REPO=https://github.com/apache/spark.git

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should pull from apache git?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @felixcheung ! Yeah, it would probably be best to use Apache, I'll change it

@BryanCutler

Copy link
Copy Markdown
Member Author

@xhochy any pointers on how to get a correct pyarrow version? If you are able to do a PR to this branch for the errors you found, that would be great - I can make you a collaborator if that would be easier.

@wesm

wesm commented Jan 30, 2018

Copy link
Copy Markdown
Member

@BryanCutler I think all committers on GitBox have write access to your branch, so we can push directly

@xhochy

xhochy commented Feb 4, 2018

Copy link
Copy Markdown
Member

@BryanCutler The main change I did was to move source activate pyarrow-dev to the beginning of the file, set export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX and removed LD_LIBRARY_PATH and PYTHONPATH from the file. This should pick up the conda environment and speed up Arrow builds a lot. That should actually not affect maven builds.

@BryanCutler

BryanCutler commented Feb 9, 2018

Copy link
Copy Markdown
Member Author

Ok, I finally got this to build all and pass all tests! There are still a couple of issues to work out though, I'll discuss below..

Btw, to get the correct pyarrow.__version__ from the dev env, you do need to have all git tags fetched and install setuptools_scm from pip or conda. @xhochy , setuptools_scm wasn't listed in any of the developer docs I could find, should it be added to the list of dependent packages for setting up a conda env?

Comment thread python/pyarrow/__init__.py Outdated
try:
import setuptools_scm
__version__ = setuptools_scm.get_version('../')
__version__ = setuptools_scm.get_version(root='../../', relative_to=__file__)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy and @wesm , I needed to change this because it would only give a version if run under ARROW_HOME/python directory. So when running Spark tests, on importing pyarrow it would return None. Making it relative to the __file__ seemed to fix it for all cases. I can make this a separate JIRA if you think that would be better.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the suggestion above to get rid of this line.

@BryanCutler

Copy link
Copy Markdown
Member Author

@xhochy , I could not get Arrow C++ to build with export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX, I would get a linking error with gflags like "undefined reference google::FlagRegisterer::FlagRegisterer". I thought maybe it was because I wasn't using g++ 4.9, but I had no luck trying to get 4.9 installed since the base image I'm using is Ubuntu 16.04. Have you ever run into this? It seemed like it was some kind of template constructor that it couldn't find..

@wesm

wesm commented Feb 9, 2018

Copy link
Copy Markdown
Member

We'll need to add some flags to ARROW_CXXFLAGS to disable the gcc5 ABI. I can try to take a look in a bit or this weekend

@xhochy xhochy left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To use the conda toolchain, set "ARROW_CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0" this switches the code then to the correct C++ ABI.

# limitations under the License.
#

# Set up environment and working directory

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call set -e here, then a command failure in the script will lead to a failure of the whole script. Then you can get rid of the if [[ $? -ne 0 ]]; then blocks later on

# Build pyarrow and install inplace
pushd arrow/python
python setup.py clean
python setup.py build_ext --build-type=release --inplace

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use --inplace but rather run python setup.py build_ext --build-type=release install to install the extension into the environment.

popd

# Clean up
echo "Cleaning up.."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for these two line, at the end the environment is thrown away anyways

Comment thread python/pyarrow/__init__.py Outdated
try:
import setuptools_scm
__version__ = setuptools_scm.get_version('../')
__version__ = setuptools_scm.get_version(root='../../', relative_to=__file__)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the suggestion above to get rid of this line.

@BryanCutler

Copy link
Copy Markdown
Member Author

Thanks @xhochy , I'll try all this out!

@BryanCutler

Copy link
Copy Markdown
Member Author

I still get the linking error after setting "ARROW_CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0"

[ 78%] Linking CXX executable ../../../release/json-integration-test
CMakeFiles/json-integration-test.dir/json-integration-test.cc.o: In function `_GLOBAL__sub_I__ZN5arrow4test25MakeRandomInt32PoolBufferElPNS_10MemoryPoolEPSt10shared_ptrINS_10PoolBufferEEj':
json-integration-test.cc:(.text.startup+0x1de): undefined reference to `google::FlagRegisterer::FlagRegisterer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
json-integration-test.cc:(.text.startup+0x296): undefined reference to `google::FlagRegisterer::FlagRegisterer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
json-integration-test.cc:(.text.startup+0x34e): undefined reference to `google::FlagRegisterer::FlagRegisterer<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, char const*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
collect2: error: ld returned 1 exit status

@wesm

wesm commented Feb 13, 2018

Copy link
Copy Markdown
Member

Looks like that flag needs to be passed through to the ExternalProject declarations. Try using CMAKE_CXX_FLAGS instead of ARROW_CXXFLAGS? If this doesn't work I can dig in later today

@BryanCutler

Copy link
Copy Markdown
Member Author

thanks @wesm , that seemed to do the trick by adding it to the cmake command line (it didn't work as an env var)
cmake -DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" -DARROW_PYTHON=on -DARROW_HDFS=on -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..

@BryanCutler

BryanCutler commented Feb 14, 2018

Copy link
Copy Markdown
Member Author

Well, I'm able to run the build with ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX but I do need to keep -DCMAKE_INSTALL_PREFIX=$ARROW_HOME for the Python build to work (is this right?), and then LD_LIBRARY_PATH=${ARROW_HOME}/lib:${CONDA_BASE}/lib:${LD_LIBRARY_PATH} to locate libarrow.so.

Now, I get an error with import pyarrow because of some undefined symbol:
ImportError: /home/ubuntu/miniconda/envs/pyarrow-dev/lib/python2.7/site-packages/pyarrow-0.8.1.dev116+g7bf7b2e9-py2.7-linux-x86_64.egg/pyarrow/lib.so: undefined symbol: _ZN5arrow9timestampENS_8TimeUnit4typeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

I don't think this is from an arrow library, but I can't tell where it's coming from, any ideas?

@wesm

wesm commented Feb 14, 2018

Copy link
Copy Markdown
Member

Now you need to set the PYARROW_CXXFLAGS environment variable with the gcc5 abi flag

@BryanCutler BryanCutler changed the title [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests ARROW-1579: [Java] Adding containerized Spark Integration tests Feb 14, 2018
@BryanCutler

Copy link
Copy Markdown
Member Author

This is working using ARROW_BUILD_TOOLCHAIN set to the conda evn, and I think this could be merged now. Is somebody else able to verify that it is working before merging?

@wesm

wesm commented Feb 15, 2018

Copy link
Copy Markdown
Member

Sweet, I can take this for a spin later today, or if @cpcloud wants to look that also works

@wesm

wesm commented Feb 16, 2018

Copy link
Copy Markdown
Member

Build failed for me with

Scanning dependencies of target arrow_static
Scanning dependencies of target arrow_shared
make[2]: *** No rule to make target '/home/ubuntu/miniconda/envs/pyarrow-dev/lib/libbrotlidec.a', needed by 'release/libarrow.so.0.0.0'.  Stop.
CMakeFiles/Makefile2:693: recipe for target 'src/arrow/CMakeFiles/arrow_shared.dir/all' failed
make[1]: *** [src/arrow/CMakeFiles/arrow_shared.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 52%] Linking CXX static library ../../release/libarrow.a
[ 52%] Built target arrow_static
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2

Haven't dug too far into what's wrong. Can we document how to run this someplace other than the Dockerfile?

@BryanCutler

Copy link
Copy Markdown
Member Author

Thanks @wesm , I think that is because the brotli libraries were recently updated and #1554 should fix that. Could you check if you have the latest master checked out? The docker script just takes what is in your current dir - not sure if that is what we want or not.

@BryanCutler

Copy link
Copy Markdown
Member Author

yeah, we should have it better documented. I copied how the api docs script was done and it's run with docker compose, but I'm not quite sure how to run it with through that. I just used the regular docker run cmd

@xhochy

xhochy commented Feb 16, 2018

Copy link
Copy Markdown
Member

As this is part of a docker-compose script, this should be run using docker-compose build && docker-compose run spark_integration. Then it will use the options declared in the docker-compose.yml (mostly the mount of the current working dir)

@xhochy

xhochy commented Feb 16, 2018

Copy link
Copy Markdown
Member

To get past the build issues I had to rebase locally. Once the script has run through I will report back.

@BryanCutler

Copy link
Copy Markdown
Member Author

I probably should have rebased here earlier, if anyone else is going to try this out let me know and I can do that first.

@xhochy

xhochy commented Feb 16, 2018

Copy link
Copy Markdown
Member

Sadly I get the following error on master:

[warn] Class org.apache.avro.reflect.Stringable not found - continuing with a stub.
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:65: not found: type NullableMapVector
[error]       case (StructType(_), vector: NullableMapVector) =>
[error]                                    ^
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:66: value size is not a member of org.apache.arrow.vector.ValueVector
[error]         val children = (0 until vector.size()).map { ordinal =>
[error]                                        ^
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:67: value getChildByOrdinal is not a member of org.apache.arrow.vector.ValueVector
[error]           createFieldWriter(vector.getChildByOrdinal(ordinal))
[error]                                    ^
[error] /apache-arrow/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala:318: not found: type NullableMapVector
[error]     val valueVector: NullableMapVector,
[error]                      ^
[warn] Class org.apache.avro.reflect.Stringable not found - continuing with a stub.
[warn] two warnings found
[error] four errors found
[error] Compile failed at Feb 16, 2018 6:49:56 PM [50.120s]

@BryanCutler

Copy link
Copy Markdown
Member Author

That error is from #1490 being merged yesterday - what I was talking about in #1319 (comment) but I was hoping to at least get this running for a little while first before breaking Spark!

I guess we could either rebase this just before #1490 to ensure it works or I can provide a patch for Spark to update it for that breaking change in Java?

@xhochy

xhochy commented Feb 16, 2018

Copy link
Copy Markdown
Member

We definitely need an upstream patch for Spark but I'm happy with merging this PR as-is as it shows that we can now test Spark with it. @wesm is this ok for you?

@wesm

wesm commented Feb 17, 2018

Copy link
Copy Markdown
Member

Works for me. Merging now

@wesm wesm closed this in 62b9eb2 Feb 17, 2018
@BryanCutler

Copy link
Copy Markdown
Member Author

Thanks @wesm @xhochy and @felixcheung ! Since it can sometimes take a while to get Spark updated, ff we get to the point were this is ready to be put in the nightly builds, maybe I could submit a PR to patch Spark and we could configure the docker build to point to that.

@BryanCutler BryanCutler deleted the docker-spark-integration-ARROW-1579 branch November 19, 2018 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants