Skip to content

Apply operation, columnToIndex and fastCSV#5

Merged
Kriyszig merged 9 commits into
masterfrom
apply
Jul 5, 2019
Merged

Apply operation, columnToIndex and fastCSV#5
Kriyszig merged 9 commits into
masterfrom
apply

Conversation

@Kriyszig
Copy link
Copy Markdown
Owner

@Kriyszig Kriyszig commented Jun 24, 2019

  • apply: Applies a function to a row/column of DataFrame
  • columnToIndex: Converts a column of data to a level of row indexes
  • fastCSV: The old parser was subpar to put it lightly. I saw a post which I couldn't find taking a scenario of a csv 2,000,000 X 5. I wrote a script to generate a mock CSV with the same specification and from_csv couldn't do it. Then I reduced 2,000,000 to 200,000 and from_csv still couldn't parse it fast enough. It took almost 6 minutes to parse 100,000 rows. The time just increases exponentially with from_csv. Hence I researched a bit and got fastCSV working. The benchmarks can be found here

I will replace from_csv with fastCSV after I extend it's functionality to match that of from_csv. Should be finished by the end of the week.

cc/ @thewilsonator - Ready for review 👾
If you find anything out of place, or anything that can be improved, please leave a review and I'll make the necessary changes at the earliest of my convenience :octocat:

Kriyszig added 7 commits June 24, 2019 14:37
* apply - apply a function to a row/column. Overloaded to apply to the entire DataFrame
* Added a helper template to evaluate RowType when column needs to be dropped
* Added unittests to verify correct behavior
* Drop row/column from DataFrame
* Convert a column of data to indexing layer
* Added unittests to verify correct behavior
* Added unittests for drop on heterogeneous DataFrame
* Faster than current parser
Documentation upadted with the developments of this week
Comment thread source/magpie/dataframe.d Outdated
Comment thread source/magpie/dataframe.d
@params: index - integer or string indexes of rows
+/
void apply(alias Fn, int axis, T)(T index)
if(is(T == int[]) || is(T == string[][]))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to add a constraint to Fn to make sure that it takes and returns a typeof(data[0][0])

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I have kept the implementation very forgiving.
I'll probably add a __compile soon to check if Fn(data[i]) is possible.

Comment thread source/magpie/helper.d Outdated
Comment thread source/magpie/helper.d Outdated
Comment thread source/magpie/helper.d Outdated
Comment thread source/magpie/dataframe.d Outdated
Comment thread source/magpie/dataframe.d Outdated

foreach(i, ele; lines[columnDepth .. $])
{
if(ele.length > 0)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for conditionals like this you can reduce the indenting by writing

if(ele.length == 0)
    continue;

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in dec86d1
I'll eventually scrub the entire code base looking for cases like this 👍

Comment thread source/magpie/dataframe.d Outdated
@thewilsonator
Copy link
Copy Markdown

It seems you could do with a utility function to compare predicates on the lines of two files, see the unit tests for fastCSV

@Kriyszig
Copy link
Copy Markdown
Owner Author

Thanks for the review :octocat:
Will make the necessary changes soon

Comment thread source/magpie/helper.d Outdated
Kriyszig added 2 commits June 25, 2019 13:28
Display used to check the length of the complete data. Now max data chacked = 50
* Used the right way to build a hash map
* Removed un-necessary rem
* using range for file reading
* Some documentation fixes
@Kriyszig Kriyszig merged commit bb0ec6f into master Jul 5, 2019
@Kriyszig Kriyszig deleted the apply branch July 10, 2019 17:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants