Skip to content

ARROW-6825: [C++] Rework CSV reader IO around readahead iterator#5727

Closed
pitrou wants to merge 2 commits into
apache:masterfrom
pitrou:ARROW-6825-delimited
Closed

ARROW-6825: [C++] Rework CSV reader IO around readahead iterator#5727
pitrou wants to merge 2 commits into
apache:masterfrom
pitrou:ARROW-6825-delimited

Conversation

@pitrou

@pitrou pitrou commented Oct 24, 2019

Copy link
Copy Markdown
Member

Make the delimiting chunker a common facility used by CSV and JSON.

@pitrou

pitrou commented Oct 24, 2019

Copy link
Copy Markdown
Member Author

@bkietz

@github-actions

Copy link
Copy Markdown

@bkietz bkietz self-requested a review October 24, 2019 15:18
@pitrou

pitrou commented Oct 24, 2019

Copy link
Copy Markdown
Member Author

@pitrou pitrou force-pushed the ARROW-6825-delimited branch 2 times, most recently from f487461 to c2804d0 Compare October 24, 2019 15:43

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very elegant. Just a few comments

Comment thread cpp/src/arrow/util/delimiting.h Outdated
Comment thread cpp/src/arrow/util/delimiting.h Outdated
Comment thread cpp/src/arrow/util/delimiting.h Outdated
Comment thread cpp/src/arrow/util/delimiting.h Outdated
Comment thread cpp/src/arrow/util/delimiting.h Outdated
Comment thread cpp/src/arrow/json/chunker.h Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is empty enough to fold into json/reader.cc

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunker.h is small, but chunker.cc is non-trivial IMHO.

Comment thread cpp/src/arrow/util/delimiting.h Outdated
Comment thread cpp/src/arrow/csv/chunker.cc Outdated
Comment thread cpp/src/arrow/util/delimiting.h Outdated
Comment thread cpp/src/arrow/json/chunker.cc Outdated
@fsaintjacques fsaintjacques self-requested a review October 24, 2019 16:07
Make the delimiting chunker a common facility used by CSV and JSON.
@pitrou pitrou force-pushed the ARROW-6825-delimited branch from c2804d0 to 68a5a02 Compare November 4, 2019 13:36
@pitrou

pitrou commented Nov 4, 2019

Copy link
Copy Markdown
Member Author

@bkietz I think I've addressed all your comments.
@fsaintjacques Are you still willing to review?

@bkietz bkietz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bkietz bkietz closed this in 21ca13a Nov 4, 2019
@pitrou pitrou deleted the ARROW-6825-delimited branch November 4, 2019 17:10
// Two blocks
auto csv1 = MakeCSVData({"ab,cd\n"});
auto csv2 = MakeCSVData({"ef,"});
AssertParseFinal(parser, {csv1, csv2});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou I'm getting 'ambiguous function call' compilation errors with this and AssertParseOk(line 241). Not sure why this was not caught in CI builds. I am building with tests, gandiva, jni ON. Could you please take a look and let me know if I should set any compiler flags? Thanks.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you post the full compiler output?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`
/Users/travis/build/dremio/arrow-build/arrow/cpp/src/arrow/csv/parser_test.cc:241:5: error: call to 'AssertParseOk' is ambiguous

AssertParseOk(parser, {csv1, csv2});

^~~~~~~~~~~~~

/Users/travis/build/dremio/arrow-build/arrow/cpp/src/arrow/csv/parser_test.cc:138:6: note: candidate function

void AssertParseOk(BlockParser& parser, const std::string& str) {

 ^

/Users/travis/build/dremio/arrow-build/arrow/cpp/src/arrow/csv/parser_test.cc:144:6: note: candidate function

void AssertParseOk(BlockParser& parser, const std::vectorutil::string_view& data) {

 ^

/Users/travis/build/dremio/arrow-build/arrow/cpp/src/arrow/csv/parser_test.cc:393:3: error: call to 'AssertParseFinal' is ambiguous

AssertParseFinal(parser, {csv1, csv2});

^~~~~~~~~~~~~~~~

/Users/travis/build/dremio/arrow-build/arrow/cpp/src/arrow/csv/parser_test.cc:150:6: note: candidate function

void AssertParseFinal(BlockParser& parser, const std::string& str) {

 ^

/Users/travis/build/dremio/arrow-build/arrow/cpp/src/arrow/csv/parser_test.cc:156:6: note: candidate function

void AssertParseFinal(BlockParser& parser, const std::vectorutil::string_view& data) {

 ^

`

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try the following patch?

diff --git a/cpp/src/arrow/csv/parser_test.cc b/cpp/src/arrow/csv/parser_test.cc
index f988f3ce2..418340f54 100644
--- a/cpp/src/arrow/csv/parser_test.cc
+++ b/cpp/src/arrow/csv/parser_test.cc
@@ -135,16 +135,25 @@ Status ParseFinal(BlockParser& parser, const std::string& str, uint32_t* out_siz
   return parser.ParseFinal(util::string_view(str), out_size);
 }
 
+std::vector<util::string_view> ViewsFromStrings(const std::vector<std::string>& data) {
+  std::vector<util::string_view> views(data.size());
+  for (size_t i = 0; i < data.size(); ++i) {
+    views[i] = data[i];
+  }
+  return views;
+}
+
 void AssertParseOk(BlockParser& parser, const std::string& str) {
   uint32_t parsed_size = static_cast<uint32_t>(-1);
   ASSERT_OK(Parse(parser, str, &parsed_size));
   ASSERT_EQ(parsed_size, str.size());
 }
 
-void AssertParseOk(BlockParser& parser, const std::vector<util::string_view>& data) {
+void AssertParseOk(BlockParser& parser, const std::vector<std::string>& data) {
   uint32_t parsed_size = static_cast<uint32_t>(-1);
-  ASSERT_OK(parser.Parse(data, &parsed_size));
-  ASSERT_EQ(parsed_size, TotalViewLength(data));
+  auto views = ViewsFromStrings(data);
+  ASSERT_OK(parser.Parse(views, &parsed_size));
+  ASSERT_EQ(parsed_size, TotalViewLength(views));
 }
 
 void AssertParseFinal(BlockParser& parser, const std::string& str) {
@@ -153,10 +162,11 @@ void AssertParseFinal(BlockParser& parser, const std::string& str) {
   ASSERT_EQ(parsed_size, str.size());
 }
 
-void AssertParseFinal(BlockParser& parser, const std::vector<util::string_view>& data) {
+void AssertParseFinal(BlockParser& parser, const std::vector<std::string>& data) {
   uint32_t parsed_size = static_cast<uint32_t>(-1);
-  ASSERT_OK(parser.ParseFinal(data, &parsed_size));
-  ASSERT_EQ(parsed_size, TotalViewLength(data));
+  auto views = ViewsFromStrings(data);
+  ASSERT_OK(parser.ParseFinal(views, &parsed_size));
+  ASSERT_EQ(parsed_size, TotalViewLength(views));
 }
 
 void AssertParsePartial(BlockParser& parser, const std::string& str,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same error with the patch

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enclosing elements of initializer list in braces seems to work. I pushed a patch - https://github.com/apache/arrow/pull/5791/files . Please merge if it looks alright. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants