Apply operation, columnToIndex and fastCSV by Kriyszig · Pull Request #5 · Kriyszig/magpie

Kriyszig · 2019-06-24T19:56:00Z

apply: Applies a function to a row/column of DataFrame
columnToIndex: Converts a column of data to a level of row indexes
fastCSV: The old parser was subpar, to put it lightly. I saw a post (which I can no longer find) describing a scenario with a 2,000,000 x 5 CSV. I wrote a script to generate a mock CSV with the same specification, and from_csv couldn't parse it. I then reduced 2,000,000 to 200,000 and from_csv still couldn't parse it fast enough. It took almost 6 minutes to parse 100,000 rows. The time just increases exponentially with from_csv. Hence I researched a bit and got fastCSV working. The benchmarks can be found here.
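
For anyone who wants to reproduce a test input of this shape, the following is a minimal sketch of a mock CSV generator. It is not the script mentioned above; the name mockCsv, the header names and the integer cell values are all assumptions made for illustration.

```d
import std.array : appender;
import std.conv : to;

// Hypothetical generator (not the script used for the benchmarks):
// builds a CSV with one header row and `rows` data rows of `cols` integers.
string mockCsv(size_t rows, size_t cols)
{
    auto buf = appender!string();

    // Header row: col0,col1,...
    foreach (c; 0 .. cols)
    {
        if (c > 0) buf.put(',');
        buf.put("col" ~ c.to!string);
    }
    buf.put('\n');

    // Data rows: consecutive integers, row by row
    foreach (r; 0 .. rows)
    {
        foreach (c; 0 .. cols)
        {
            if (c > 0) buf.put(',');
            buf.put((r * cols + c).to!string);
        }
        buf.put('\n');
    }
    return buf.data;
}

unittest
{
    assert(mockCsv(2, 3) == "col0,col1,col2\n0,1,2\n3,4,5\n");
}
```

Writing the result of mockCsv(2_000_000, 5) to a file would give an input with the dimensions described above.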

I will replace from_csv with fastCSV after I extend its functionality to match that of from_csv. This should be finished by the end of the week.

cc/ @thewilsonator - Ready for review 👾
If you find anything out of place, or anything that can be improved, please leave a review and I'll make the necessary changes as soon as I can.

thewilsonator · 2019-06-25T04:51:43Z

+    @params: index - integer or string indexes of rows
+    +/
+    void apply(alias Fn, int axis, T)(T index)
+        if(is(T == int[]) || is(T == string[][]))


You probably want to add a constraint to Fn to make sure that it takes and returns a typeof(data[0][0]).

For now I have kept the implementation very forgiving.
I'll probably add a __traits(compiles, ...) check soon to verify that Fn(data[i]) is possible.
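
A constraint of the kind discussed in this thread could look roughly like the sketch below. It uses a free function over a plain array rather than the DataFrame member shown in the diff; applyAll, twice and len are hypothetical names, and T stands in for typeof(data[0][0]).

```d
// Hypothetical free-function stand-in for DataFrame.apply, for illustration only.
// The constraint accepts Fn only if it can be called with a T and its
// result is implicitly convertible back to T.
void applyAll(alias Fn, T)(T[] data)
    if (is(typeof(Fn(T.init)) : T))
{
    foreach (ref ele; data)
        ele = Fn(ele);
}

// Takes and returns the element type: accepted
double twice(double x) { return 2 * x; }

// Takes a string: cannot be called with a double, so it is rejected
size_t len(string s) { return s.length; }

unittest
{
    double[] data = [1.0, 2.0, 3.0];
    applyAll!twice(data);
    assert(data == [2.0, 4.0, 6.0]);

    // The mismatch is caught at compile time instead of deep inside apply
    static assert(!__traits(compiles, applyAll!len(data)));
}
```

The same check can be written as __traits(compiles, Fn(T.init)) if only callability matters and the return type should stay unconstrained.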

thewilsonator · 2019-06-25T05:13:10Z

+
+        foreach(i, ele; lines[columnDepth .. $])
+        {
+            if(ele.length > 0)


For conditionals like this, you can reduce the indentation by writing:

if(ele.length == 0) continue;

Done in dec86d1
I'll eventually scrub the entire code base looking for cases like this 👍
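
For reference, the early-continue pattern from this thread might look like the following in a loop of the same shape. This is only a sketch: parseRows and the comma split are assumptions, not the code changed in dec86d1.

```d
import std.array : split;

// Hypothetical helper: splits every non-empty line after the header rows.
string[][] parseRows(string[] lines, size_t columnDepth)
{
    string[][] rows;
    foreach (ele; lines[columnDepth .. $])
    {
        // Guard clause: skip empty lines up front so the loop body
        // does not need to sit inside an if block
        if (ele.length == 0) continue;

        rows ~= ele.split(",");
    }
    return rows;
}

unittest
{
    auto rows = parseRows(["h1,h2", "1,2", "", "3,4"], 1);
    assert(rows == [["1", "2"], ["3", "4"]]);
}
```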

thewilsonator · 2019-06-25T05:19:10Z

It seems you could do with a utility function to compare predicates on the lines of two files; see the unit tests for fastCSV.
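
One possible shape for such a utility, as a sketch only: linesMatch is a hypothetical name, and the assumption is that the tests want to know whether corresponding lines of two files satisfy a binary predicate (equality by default).

```d
import std.algorithm.comparison : equal;
import std.stdio : File;

// Hypothetical helper: true when both files have the same number of lines
// and every pair of corresponding lines satisfies pred.
bool linesMatch(alias pred = "a == b")(string pathA, string pathB)
{
    auto a = File(pathA).byLine();
    auto b = File(pathB).byLine();
    return equal!pred(a, b);
}

unittest
{
    import std.file : remove, tempDir, write;
    import std.path : buildPath;

    auto p1 = buildPath(tempDir, "linesmatch_a.csv");
    auto p2 = buildPath(tempDir, "linesmatch_b.csv");
    write(p1, "a,b\n1,2\n");
    write(p2, "a,b\n3,4\n");
    scope (exit) { remove(p1); remove(p2); }

    assert(linesMatch(p1, p1));
    assert(!linesMatch(p1, p2));
    // A custom predicate: only the line lengths have to agree
    assert(linesMatch!((x, y) => x.length == y.length)(p1, p2));
}
```

A unit test could then write the same DataFrame through two code paths and assert linesMatch on the resulting files, instead of repeating the line-by-line comparison in every test.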

Kriyszig · 2019-06-25T05:24:39Z

Thanks for the review.
I will make the necessary changes soon.

Display used to check the length of the complete data. Now max data checked = 50.

* Used the right way to build a hash map
* Removed unnecessary rem
* Using range for file reading
* Some documentation fixes

Kriyszig added 7 commits June 24, 2019 14:37

Adding apply to DataFrame

54e1ffe

* apply - apply a function to a row/column. Overloaded to apply to the entire DataFrame

Helper for drop

c89271f

* Added a helper template to evaluate RowType when column needs to be dropped
* Added unittests to verify correct behavior

Added drop for DataFrame

f57ebe5

* Drop row/column from DataFrame

Convert a column of data to new level of row index

9605930

* Convert a column of data to indexing layer
* Added unittests to verify correct behavior
* Added unittests for drop on heterogeneous DataFrame

Experimental: fastCSV()

7574b5e

* Faster than current parser

Added vectorizer in helper

3c301c7

Updated documentation

8ad4425

Documentation updated with the developments of this week

Kriyszig added the Ready for Review! label Jun 24, 2019