Skip to content

GH-34630: [C++] Second block of refactoring to move acero out of libarrow#34575

Merged
icexelloss merged 6 commits into
apache:mainfrom
ildipo:refactor_5
Mar 18, 2023
Merged

GH-34630: [C++] Second block of refactoring to move acero out of libarrow#34575
icexelloss merged 6 commits into
apache:mainfrom
ildipo:refactor_5

Conversation

@ildipo

@ildipo ildipo commented Mar 15, 2023

Copy link
Copy Markdown
Contributor

Rationale for this change

This is the second and last block of refactoring before moving acero out of libarrow. This PR removes the remaining dependencies from libarrow into the compute/exec directory.

What changes are included in this PR?

In detail:

  • compute/exec/expression is moved inside compute since it is needed in several kernels
  • all compute/exec specific include files are removed from compute/api.h
  • light_array_exec_test.cc i(introduced in the previous refactoring) s merged back into light_array_test
  • testing generators are refactored to remove the need for exec/options.h
  • some utilities are moved from compute/exec/util to compute/util
  • updated python and r libraries usage of moved files

Are these changes tested?

Yes using existing tests (some of which have been updated due to refactoring).

Are there any user-facing changes?

no

@github-actions

Copy link
Copy Markdown

->FailOnError()
->SourceNode(kRowsPerBatch, kNumBatches);

auto generator = gen::Gen({{"x", gen::Step()}})->FailOnError();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was using generator.h with compute/exec/options.h. I removed that include (and the associated exec options from generator, that is SourceNode()) and expanded it in this test, which was the only place where it was used.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am planning on starting to use these utilities in more places. However, we can introduce exec::gen::Gen (or something like that) which wraps the generator and adds exec-specific utilities. So I think this is tolerable for now.

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 15, 2023
@icexelloss

Copy link
Copy Markdown
Contributor

@ildipo High level looks good to me. Is the general idea of moving "expression" from "compute/exec" to "compute" such that it is consider part of "libarrow" and not "libarrow_acero"?

@icexelloss

icexelloss commented Mar 15, 2023

Copy link
Copy Markdown
Contributor

I also removed most of the reviewers - I think the changes to C/Python/R binding is trivial but if @westonpace think this need review from C/R maintainers we can add back.

@ildipo

ildipo commented Mar 15, 2023

Copy link
Copy Markdown
Contributor Author

Yes I feel what is in expression.h is used in many other places (e.g. inside kernels) so it belongs to libarrow

@ildipo

ildipo commented Mar 15, 2023

Copy link
Copy Markdown
Contributor Author

I'm not sure why one check failed (apparently it was cancelled???)

@icexelloss

icexelloss commented Mar 15, 2023

Copy link
Copy Markdown
Contributor

@ildipo Error is this

FAILED: src/arrow/compute/CMakeFiles/arrow-compute-internals-test.dir/Unity/unity_0_cxx.cxx.obj 
"C:\Program Files\Git\usr\bin\ccache.exe" C:\PROGRA~2\MICROS~2\2019\ENTERP~1\VC\Tools\MSVC\1429~1.301\bin\Hostx64\x64\cl.exe  /nologo /TP -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_HDFS -DARROW_MIMALLOC -DARROW_WITH_BENCHMARKS_REFERENCE -DARROW_WITH_LZ4 -DARROW_WITH_RE2 -DARROW_WITH_SNAPPY -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DBOOST_ALL_NO_LIB -DGMOCK_LINKED_AS_SHARED_LIBRARY=1 -DGTEST_LINKED_AS_SHARED_LIBRARY=1 -DURI_STATIC_BUILD -DUTF8PROC_STATIC -D_CRT_SECURE_NO_WARNINGS -D_ENABLE_EXTENDED_ALIGNED_STORAGE -ID:\a\arrow\arrow\build\cpp\src -ID:\a\arrow\arrow\cpp\src -ID:\a\arrow\arrow\cpp\src\generated -external:ID:\a\arrow\arrow\cpp\thirdparty\flatbuffers\include -external:ID:\a\arrow\arrow\build\cpp\rapidjson_ep\src\rapidjson_ep-install\include -external:ID:\a\arrow\arrow\cpp\thirdparty\hadoop\include -external:ID:\a\arrow\arrow\build\cpp\boost_ep-prefix\src\boost_ep -external:I"C:\Program Files\OpenSSL-Win64\include" -external:ID:\a\arrow\arrow\build\cpp\orc_ep-install\include -external:ID:\a\arrow\arrow\build\cpp\lz4_ep-install\include -external:ID:\a\arrow\arrow\build\cpp\zlib_ep\src\zlib_ep-install\include -external:ID:\a\arrow\arrow\build\cpp\zstd_ep-install\include -external:ID:\a\arrow\arrow\build\cpp\snappy_ep\src\snappy_ep-install\include -external:ID:\a\arrow\arrow\build\cpp\protobuf_ep-install\include -external:ID:\a\arrow\arrow\build\cpp\utf8proc_ep-install\include -external:ID:\a\arrow\arrow\build\cpp\re2_ep-install\include -external:ID:\a\arrow\arrow\build\cpp\xsimd_ep\src\xsimd_ep-install\include -external:ID:\a\arrow\arrow\build\cpp\mimalloc_ep\src\mimalloc_ep\include\mimalloc-2.0 -external:ID:\a\arrow\arrow\build\cpp\googletest_ep-prefix\include -external:W0 /D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING /W3 /EHsc /wd5105 /bigobj /utf-8 /W3 /wd4365 /wd4267 /wd4838 /wd4800 /wd4996 /wd4065   /MDd /Z7 /Ob0 /Od /RTC1 /WX -std:c++17 /showIncludes /Fosrc\arrow\compute\CMakeFiles\arrow-compute-internals-test.dir\Unity\unity_0_cxx.cxx.obj /Fdsrc\arrow\compute\CMakeFiles\arrow-compute-internals-test.dir\ /FS -c D:\a\arrow\arrow\build\cpp\src\arrow\compute\CMakeFiles\arrow-compute-internals-test.dir\Unity\unity_0_cxx.cxx
D:\a\arrow\arrow\cpp\src\arrow/compute/exec/test_util.h(49): error C2375: 'arrow::compute::ExecBatchFromJSON': redefinition; different linkage
D:/a/arrow/arrow/cpp/src/arrow/compute/light_array_test.cc(37): note: see declaration of 'arrow::compute::ExecBatchFromJSON'

Looks like some linking errors but on windows only

@icexelloss

Copy link
Copy Markdown
Contributor

Yes I feel what is in expression.h is used in many other places (e.g. inside kernels) so it belongs to libarrow

Got it - makes sense to me. @westonpace does that sound about right?

@westonpace

Copy link
Copy Markdown
Member

Got it - makes sense to me. @westonpace does that sound about right?

Yes, expression is pretty stable. So if the goal is a stable "compute" module then I think moving it inside compute should be safe.

Thinking beyond just "stability" I think there is a pretty reasonable separation in Arrow between "compute functions" and "the execution engine" and I think it makes sense for Expression to live in the former. There is also some desire for a Substrait API for expression evaluation. In theory that could be done without "exec" but that would require splitting the substrait module into two pieces and I don't think that's worth our time (e.g. the expression evaluation API can just depend on "exec" even though it doesn't strictly need to)

@westonpace

Copy link
Copy Markdown
Member

I'll try and get a closer look at this tonight but it might be tomorrow.

@westonpace westonpace left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor things but this looks correct to me. Thanks!

#include "arrow/compute/exec.h" // IWYU pragma: export
#include "arrow/compute/exec/exec_plan.h" // IWYU pragma: export
#include "arrow/compute/exec/groupby.h" // IWYU pragma: export
#include "arrow/compute/exec.h" // IWYU pragma: export

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably be split someday (no need to do it in this PR). ExecContext / ExecSpan / ExecResult / ExecValue / CallFunction / GetFunctionExecutor belong in compute. ExecBatch belongs in exec. SelectionVector could probably be removed at this point and added back in later if we ever decide to do something with it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix the other things in this PR and then make another one to move ExecBatch

->FailOnError()
->SourceNode(kRowsPerBatch, kNumBatches);

auto generator = gen::Gen({{"x", gen::Step()}})->FailOnError();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am planning on starting to use these utilities in more places. However, we can introduce exec::gen::Gen (or something like that) which wraps the generator and adds exec-specific utilities. So I think this is tolerable for now.

Comment thread cpp/src/arrow/compute/type_fwd.h Outdated
@@ -50,14 +50,6 @@ struct KernelState;

struct Declaration;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Declaration is part of exec too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, moving in exec/type_fwd.h

Comment thread cpp/src/arrow/testing/generator.cc Outdated
Comment on lines +372 to +373
std::shared_ptr<DataGenerator> Target() { return target_; }

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed anywhere?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing

@github-actions github-actions Bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Mar 16, 2023
@ildipo ildipo requested a review from westonpace March 16, 2023 20:15
@ildipo

ildipo commented Mar 16, 2023

Copy link
Copy Markdown
Contributor Author

Updated the PR with the changes detailed above.

@assignUser

Copy link
Copy Markdown
Member

@github-actions crossbow submit -g cpp

@github-actions

Copy link
Copy Markdown

Revision: a3c7b38

Submitted crossbow builds: ursacomputing/crossbow @ actions-b5e5c878a5

Task Status
test-alpine-linux-cpp Github Actions
test-build-cpp-fuzz Github Actions
test-conda-cpp Github Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp Github Actions
test-debian-10-cpp-amd64 Github Actions
test-debian-10-cpp-i386 Github Actions
test-debian-11-cpp-amd64 Github Actions
test-debian-11-cpp-i386 Github Actions
test-fedora-35-cpp Github Actions
test-ubuntu-18.04-cpp Github Actions
test-ubuntu-18.04-cpp-release Github Actions
test-ubuntu-18.04-cpp-static Github Actions
test-ubuntu-20.04-cpp Github Actions
test-ubuntu-20.04-cpp-20 Github Actions
test-ubuntu-20.04-cpp-bundled Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions
test-ubuntu-22.04-cpp Github Actions

@ildipo

ildipo commented Mar 17, 2023

Copy link
Copy Markdown
Contributor Author

The above failures seem unrelated to this PR.

@icexelloss

icexelloss commented Mar 17, 2023

Copy link
Copy Markdown
Contributor

@westonpace

There seems to be a failure in recent CI runs (test-alpine-linux-cpp):

 68/103 Test  #60: arrow-dataset-file-json-test ..............***Failed   61.16 sec
Running arrow-dataset-file-json-test, redirecting output into /build/cpp/build/test-logs/arrow-dataset-file-json-test.txt (attempt 1/1)
/arrow/cpp/build-support/run-test.sh: line 88: 28113 Segmentation fault      (core dumped) 

Also some valgrind failures on the following tests:
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=45749&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181

I suspect those are not caused by this PR - do we have these failures on default branch as well?

Edit: Didn't see @ildipo 's comment above saying these seem unrelated. From the change set I don't think this PR would cause the failures above and I think this is safe to merge.

@icexelloss icexelloss left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@icexelloss

Copy link
Copy Markdown
Contributor

@westonpace @ildipo I approved. Is there anything you would like to check/change? If not I will merge this end of day today.

@ildipo

ildipo commented Mar 17, 2023

Copy link
Copy Markdown
Contributor Author

nothing else to add to this PR

@ildipo

ildipo commented Mar 17, 2023

Copy link
Copy Markdown
Contributor Author

just don't close the associated issue - I think github will close it automatically

@assignUser assignUser changed the title GH-15280: [C++] Second block of refactoring to move acero out of libarrow GH-34630: [C++] Second block of refactoring to move acero out of libarrow Mar 18, 2023
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #34630 has been automatically assigned in GitHub to PR creator.

@assignUser

Copy link
Copy Markdown
Member

Please create separate issues and track them in an umbrella issue to avoid having to re-open issues to avoid confusion now and during changelog creation.

I have created #34630 for this PR @ildipo please comment "take" on it to have it assigned (automatic assignment failed as you have not interacted with the issue - a GH spam protection)

@icexelloss icexelloss merged commit 8524572 into apache:main Mar 18, 2023
@kou

kou commented Mar 18, 2023

Copy link
Copy Markdown
Member

@icexelloss Could you use https://github.com/apache/arrow/blob/main/dev/merge_arrow_pr.py instead of GitHub's merge button as much as possible? The script does some more preparations.

@icexelloss

icexelloss commented Mar 18, 2023 via email

Copy link
Copy Markdown
Contributor

@ursabot

ursabot commented Mar 18, 2023

Copy link
Copy Markdown

Benchmark runs are scheduled for baseline = 682c112 and contender = 8524572. 8524572 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.45% ⬆️0.15%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.01% ⬆️0.06%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 85245722 ec2-t3-xlarge-us-east-2
[Finished] 85245722 test-mac-arm
[Finished] 85245722 ursa-i9-9960x
[Finished] 85245722 ursa-thinkcentre-m75q
[Finished] 682c112c ec2-t3-xlarge-us-east-2
[Failed] 682c112c test-mac-arm
[Finished] 682c112c ursa-i9-9960x
[Finished] 682c112c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented Mar 18, 2023

Copy link
Copy Markdown

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++] Second block of refactoring to move acero out of libarrow

6 participants