Preliminary Aggregate Ops and optimization for Homogeneous DataFrames by Kriyszig · Pull Request #11 · Kriyszig/magpie

Kriyszig · 2019-08-02T18:09:26Z

This PR adds some preliminary aggregate operations and reduces the traversal overhead when targeting a single column of a DataFrame by replacing the TypeTuple with an array. (The same is done for Group as well.)

cc/ @thewilsonator, ready for review (^_^)

* Homogeneous cases can use arrays to save static traversal overhead
* Unit tests are modified only where the type of data was being asserted

* Removed static traversal for homogeneous DataFrame
* No changes made in unit tests (non-breaking)

* Used isHomogeneousType to reduce static traversal
* Unit tests remain unchanged (non-breaking)

* Requested by a user in the forum, a less verbose way to assign index
* Added unittests to check consistency and correct behavior

* Aggregate operation implemented for columns of DataFrame
* Added unittests to check for correct behavior

* Aggregate operation on DataFrame column
* Added preliminary unittest

* Implemented Aggregate operation on Group columns
* Added preliminary unittest to ensure correct behavior

* Aggregate now supports operation along Group row
* Added tests to check for correct behavior

* Updated README documenting the new features

thewilsonator · 2019-08-05T01:05:08Z

+        static if(isHomogeneousType)
+            auto sortdata = transposed(data);
+        else
+            auto ref sortdata = data;


does that even work? Try alias sortdata = data;

I tested it on the online ide: https://run.dlang.io/is/d8K9sr
Even alias does the job: https://run.dlang.io/is/X9Xx1O

alias afaik doesn't allocate the data again, so it's indeed a better option
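To make the difference concrete, here is a minimal sketch, not code from this PR. `Frame`, `sortInPlace` and `sortedCopy` are hypothetical names, and the member is a static array so that the copy made by `auto` is visible (for a dynamic array `auto` would copy only the slice, not the elements).

```d
import std.algorithm.sorting : sort;

// Hypothetical stand-in for a type that owns some data.
struct Frame
{
    int[3] data = [3, 1, 2];

    // `alias` only introduces a second name for the member. Nothing is
    // copied, so sorting through the alias reorders `data` itself.
    void sortInPlace()
    {
        alias view = data;
        sort(view[]);
    }

    // `auto` declares a new variable. For a static array that is a full
    // copy, so `data` is left untouched.
    int[3] sortedCopy()
    {
        auto copy = data;
        sort(copy[]);
        return copy;
    }
}
```

Calling `sortedCopy` leaves `data` as `[3, 1, 2]`, while `sortInPlace` changes it to `[1, 2, 3]`.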

thewilsonator · 2019-08-05T01:16:29Z

                }

-                static foreach(j; 0 .. RowType.length)
+                static if(isHomogeneousType)


create an auxiliary function that does something like

void homogenousDispatch(alias func)
{
    static if(isHomogeneousType)
        func(dataIndex);
    else
        static foreach(j; 0 .. RowType.length)
            if(j == dataIndex)
                func(dataIndex);
}

and then call it like

homogenousDispatch!( (j) {
    size_t maxsize1 = 0, maxsize2 = 0;
    // ...
});

Ditto throughout.

If you've got a better name for it please use it instead.
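For reference, a self-contained sketch of the underlying idea: a run-time column index can index an array directly in the homogeneous case, but has to be matched against each compile-time position otherwise. `Table`, `toArr` and `columnLength` are illustrative names, not magpie's API, and the homogeneity check via `NoDuplicates` is an assumption about how `isHomogeneousType` could be computed.

```d
import std.meta : NoDuplicates, staticMap;

alias toArr(T) = T[];

// Hypothetical container: one column per entry in Types.
struct Table(Types...)
{
    // True when every column has the same element type.
    enum isHomogeneousType = NoDuplicates!Types.length == 1;

    static if (isHomogeneousType)
        toArr!(Types[0])[Types.length] data; // plain array of columns
    else
        staticMap!(toArr, Types) data;       // one field per column type

    size_t columnLength(size_t col)
    {
        static if (isHomogeneousType)
        {
            // Run-time index works directly on an array.
            return data[col].length;
        }
        else
        {
            // A tuple needs a compile-time index, so unroll over all
            // positions and pick the matching one at run time.
            static foreach (j; 0 .. Types.length)
                if (j == col)
                    return data[j].length;
            assert(0, "column index out of range");
        }
    }
}
```

The homogeneous branch avoids generating one comparison per column, which is the overhead this PR is trying to remove.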

Figured out the dispatcher at long last 😅

thewilsonator · 2019-08-05T01:17:53Z

@@ -187,25 +188,47 @@ public:
                    maxGap = (maxGap > lenCol)? maxGap: lenCol;


https://dlang.org/phobos/std_algorithm_comparison.html#.max

Ditto throughout.
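A small sketch of the suggested replacement, with `widest` as a made-up helper name: `std.algorithm.comparison.max` does the same job as the hand-written ternary.

```d
import std.algorithm.comparison : max;

// Hypothetical helper: largest value in a list of column widths.
size_t widest(const size_t[] lengths)
{
    size_t maxGap = 0;
    foreach (lenCol; lengths)
        maxGap = max(maxGap, lenCol); // instead of (maxGap > lenCol) ? maxGap : lenCol
    return maxGap;
}
```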

thewilsonator · 2019-08-05T01:20:47Z

+            static if(kind == Universal)
+                Slice!(Type*, 1, kind) ret = slice!(Type)(elementCountTill[pos[0] + 1] - elementCountTill[pos[0]]).universal;
+            else static if(kind == Canonical)
+                Slice!(Type*, 1, kind) ret = slice!(string)(elementCountTill[pos[0] + 1] - elementCountTill[pos[0]]).canonical;


this should either be Slice!(string*, 1, kind) ret = slice!(string) or Slice!(Type*, 1, kind) ret = slice!(Type)

thewilsonator · 2019-08-05T01:25:23Z

+auto aggregate(int axis, T, Ops...)(T df, Ops ops)
+    if(isDataFrame!T)
+{
+    static if(axis)


if calling this with axis==0 does nothing, it probably makes no sense and should be disallowed.

auto aggregate(int axis,...)(...) if(isDataFrame!T && axis)

Scratch that: axis is unused, remove it.

thewilsonator · 2019-08-05T01:42:57Z

    else static assert(0, "Invalid join type. Available join type: left, right, outer, inner(default)");
 }

+enum AggregateOP


I would strongly suggest not making this a restricted set of ops, e.g. this misses variance.

auto aggregate(T, Ops...)(T df) and then use with

auto x = df.aggregate!(min, max);

this will simplify the implementation to

static foreach(i; 0 .. Ops.length)
{
    static foreach(j; 0 .. df.RowType.length)
        opres = Ops[i](df.data[j]);
    ret.indx.row.index[0][i] = Ops[i].stringof;
}

note that you still do i passes through the data, whereas you should be able to do all aggregation statistics with one pass. You should look at fold and take some inspiration from there, particularly the example arr.fold!(min, max).
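A minimal sketch of the one-pass idea using Phobos `fold`. `minMax` is a hypothetical wrapper; only `fold`, `min`, `max` and `tuple` are real Phobos symbols here.

```d
import std.algorithm.comparison : max, min;
import std.algorithm.iteration : fold;
import std.typecons : tuple;

// Computes minimum and maximum in a single traversal. With several
// functions, fold returns a Tuple holding one accumulator per function.
// Without an explicit seed the column must be non-empty.
auto minMax(T)(T[] column)
{
    return column.fold!(min, max);
}
```

For example, `minMax([3, 9, 1, 7])` equals `tuple(1, 9)`, and the element type stays `int`.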

thewilsonator · 2019-08-05T01:44:42Z

+{
+    static if(axis)
+    {
+        DataFrame!(float, df.RowType.length) ret;


I'm slightly concerned with the change of data types here, a data frame of ints shouldn't get converted to a df of floats just because I computed the max of a column or row.
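One way to avoid the forced conversion is to let the result type follow from the operation itself. The sketch below assumes nothing about magpie's internals; `applyOp`, `colMax` and `colMean` are hypothetical names.

```d
import std.algorithm.comparison : max;
import std.algorithm.iteration : fold, sum;

// The return type is whatever `op` yields for this column type.
auto applyOp(alias op, T)(T[] column)
{
    return op(column);
}

// Maximum keeps the element type of the column.
auto colMax(T)(T[] c) { return c.fold!max; }

// A mean genuinely needs a floating point result.
double colMean(T)(T[] c) { return c.sum / cast(double) c.length; }

static assert(is(typeof(applyOp!colMax([1, 2, 3])) == int));
static assert(is(typeof(applyOp!colMean([1, 2, 3])) == double));
```

With this shape an `int` column only becomes floating point when the chosen operation requires it.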

thewilsonator · 2019-08-05T01:44:52Z

+        foreach(i; 0 .. df.RowType.length)
+            ret.data[i].length = Ops.length;
+
+        double opres;


thewilsonator · 2019-08-05T01:46:14Z

+
+import std.algorithm: map, reduce;
+
+double count(T)(T[] arr)


you should try to make aggregate work directly with Phobos algorithms (and mir's if you can), without the need for this shim.

* asSlice had wrongly specified Slice iterator types for Slices other than Universal

* Functions predefined have been removed
* Aggregate moved inside DataFrame
* Aggregate accepts functions from user

* Added some internal and external templates to evaluate optimized types
* Used compiles trait to check for correct execution method
* Added some more unittest

* Added aggregate for Group
* Added unittest
* Removed unused parameter in aggregateType template

thewilsonator · 2019-08-07T04:14:38Z

-                    {
-                        maxGap = maxsize2;
-                    }
+                        maxsize2 = data[dataIndex][$ - bottom .. $].map!(e => to!string(e).length).reduce!max;   


Can I suggest making T[].map!(e => to!string(e).length).reduce!max a function and then calling it on the computed ranges, e.g.

/* use a better name */
auto transform(T)(T[] arr)
{
    return arr.map!(e => to!string(e).length).reduce!max;
}
...
if(bottom > data[dataIndex].length)
    maxsize2 = transform(data[dataIndex]);
else if (bottom > 0)
    maxsize2 = transform(data[dataIndex][$ - bottom .. $]);

also computing to!string(e) just for the length seems wasteful if you don't use the result. This is totally fine for getting stuff up and running quickly, but you should do some profiling to determine the bottlenecks of this library.

I suspect GC allocation may be one of them, but do the profiling anyway because it is entirely possible I'm wrong and that the optimiser inlines everything and discards the useless allocation (use LDC for that, DMD won't do much to help with that).
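If profiling does point at the temporary strings, one possible alternative is to format into a sink that only counts characters. This is a sketch, not a measured improvement: `printedLength` and `maxPrintedLength` are hypothetical names, and whether `formattedWrite` itself allocates for a given type would still need checking.

```d
import std.algorithm.comparison : max;
import std.algorithm.iteration : fold, map;
import std.format : formattedWrite;

// Length of the "%s" representation of `value`, counted through an
// output sink instead of building a string first.
size_t printedLength(T)(T value)
{
    size_t n;
    void count(const(char)[] chunk) { n += chunk.length; }
    formattedWrite(&count, "%s", value);
    return n;
}

// Widest printed element of a column. The explicit seed makes an
// empty column yield 0 instead of failing.
size_t maxPrintedLength(T)(T[] column)
{
    return column.map!(e => printedLength(e)).fold!max(size_t(0));
}
```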

* Used a function to reduce code repetition
* Used std.algorithm: max to compute maximum

* Previously, to_csv didn't have an option to set the precision used when saving floating point values. std.conv: to rounded floating point numbers with more than 3 decimal places to 3 decimal places.

* Two variations of at: one takes i1 at runtime and the other takes both arguments as parameters

* Dispatcher calls the function supplying params statically or at run time depending on the type of DataFrame

thewilsonator · 2019-08-09T00:20:34Z

-            return data[i2][i1];
-        else
+
+        auto auxDispatch(alias Fn)(size_t indx)


No need to declare this multiple times, just use the one at the top. If these are on different types, move this to your utils file and parameterise on the type.

The whole point was to reduce code, remember ;)

ditto throughout.

I tried this but it leads to this error:

Error: function `magpie.dataframe.DataFrame!(int, 2).DataFrame.auxDispatch!(maxGapCalc).auxDispatch` cannot get frame pointer to magpie.dataframe.DataFrame!(int, 2).DataFrame.display.maxGapCalc!-1L.maxGapCalc

The function seems to be accessible only within the function scope and couldn't be referenced outside, hence I had to bring the dispatcher into each function.
Is there a workaround for this?

Try making auxDispatch static.

making it static gives this:

Error: function `magpie.dataframe.DataFrame!(int, 2).DataFrame.display.maxGapCalc!-1L.maxGapCalc` is a nested function and cannot be accessed from magpie.dataframe.DataFrame!(int, 2).DataFrame.auxDispatch!(maxGapCalc).auxDispatch

telling nested function cannot be accessed outside.
I fiddled around this for a while however everything led to the same error

Hmm, maybe try asking on the learn forums. There definitely should be a way to do it.

I was trying to make a sample example for dlang learn and this worked somehow
https://run.dlang.io/is/EvwVVq
So I was wondering what is the difference between this example and the one I was working with and I noticed if the nested function is a template function, it cannot send a pointer outside.

I noticed it can if you instantiate it at the call site, i.e. this works (but is not what you want):

-void nested(int a)
+void nested()(int a)
...
-dispatcher!(nested)(5);
+dispatcher!(nested!())(5);

struct S
{
    int data;
    int pos = 5;
private:
    mixin template auxdispatch(alias F/*, alias indx*/)
    {
        auto auxdispatch(int x /*= indx*/)
        {
            return F(x);
        }
    }
public:
    void outer()
    {
        void nested()(int a) { data += a; }
        mixin auxdispatch!(nested/*,pos*/);
        auxdispatch(pos);
        import std.stdio;
        writeln(data);
    }
}

void main()
{
    S s;
    s.outer();
}

will do the trick, the commented out sections are what I would have liked to have done but a DMD bug prevents it from compiling.

Thanks a lot for the above example @thewilsonator
Added mixin template in 403aa24

thewilsonator · 2019-08-09T00:22:41Z

-                }
-            }
+
+        void assignAux(ptrdiff_t si = -1)(size_t ri = 0) @property


ditto. there is an identical one just above.

* Dispatcher is converted to mixin template to work around nested function pointer
* Added dispatcher to Groups

thewilsonator · 2019-08-13T00:47:39Z

-                        thisGap = tmp;
-                    }
+                    thisGap = max(thisGap, tmp);
                }


You have this same block of code repeated three times with different indices, you should make that a (nested) function.

thewilsonator · 2019-08-13T00:51:00Z

+            static if(si > -1)
+                alias i = si;
+            else
+                size_t i = ri;


you could probably refactor this into a mixin (template) too.

doing this would make the intent of the function much clearer.

thewilsonator · 2019-08-13T00:57:20Z

+    +/
+    auto aggregate(int axis, Ops...)() @property
+    {
+        static if(axis)


There is a lot of similar code in the two branches, try to push this static if deeper into the function.

in particular do

static if (axis)
{
    import std.meta: staticMap;
    alias Resolve(T) = suitableType!(resolverInternal!(T, Ops));
    alias DFTypes = staticMap!(Resolve, RowType);
}
else
    alias DFTypes = aggregateType!(Ops);
DataFrame!(true, DFTypes) ret;
init!(axis)(ret);
...

thewilsonator · 2019-08-13T00:58:57Z

+    assert(approxEqual(doubledf.data[1][0], 5.6, 1e-8) && approxEqual(doubledf.data[1][1], 8, 1e-8));
+}
+
+private auto customFunc(T)(T[] arr)


move this inside the unittest that uses it and make it static

Done in e30f57a 👍

thewilsonator · 2019-08-13T01:03:06Z

+
+    auto aggregate(int axis, Ops...)() @property
+    {
+        static if(axis)


ditto, there is a lot of similar code and it makes it difficult to see what is different.

I put all the initialisation into nested function

thewilsonator · 2019-08-13T01:03:44Z

+    assert(approxEqual(dub.data[1][0], 2, 1e-8) && approxEqual(dub.data[1][1], 3, 1e-8) && approxEqual(dub.data[1][2], 5, 1e-8) && approxEqual(dub.data[1][3], 6, 1e-8));
+}
+
+private auto customFunc(T)(T[] arr)


ditto, move this inside the unittest.

Addressed in e30f57a 👍

* Nested function to initialize return values
* Made unittest related functions static

thewilsonator · 2019-08-15T01:11:42Z

+        auto gapCalc(T)(T[] arr)
+        {
+            static if(is(T == string))
+                return arr.map!(e => e.length).reduce!max;


This static if branch is not needed as to!string on a string is a no-op. just use the line below.

thewilsonator · 2019-08-15T01:23:11Z

+                        }
+
+            ret.grpIndex.generateCodes();
+            return ret;


These two lines are duplicate in each branch.

thewilsonator · 2019-08-15T01:28:12Z

+            }
+
+            ret.indx.generateCodes();
+            return ret;


thewilsonator · 2019-08-15T01:33:13Z

+            alias Resolve(T) = suitableType!(resolverInternal!(T, Ops));
+            alias ResolvedTypes = staticMap!(Resolve, GrpRowType);
+            Group!(ResolvedTypes) ret;
+            init!(axis)(ret);


thewilsonator · 2019-08-15T01:41:12Z

+    static if(!isHomogeneous!(GrpRowType))
+        alias GrpType = staticMap!(toArr, GrpRowType);
+    else
+        alias GrpType = toArr!(GrpRowType[0])[GrpRowType.length];


this code is duplicate with dataframe, move it to utils, and do

alias GrpType = homogeneousTypesFor!(GrpRowType);

if you need isHomogeneousType use a mixin template.
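A possible shape for such a shared helper is sketched below. The name `homogeneousTypesFor` comes from the comment above; the body, including the `NoDuplicates` check, is an assumption about how it could be written, mirroring the two branches in the diff.

```d
import std.meta : NoDuplicates, staticMap;

alias toArr(T) = T[];

// Storage type for a set of column types: a static array of slices when
// all columns share one type, otherwise one slice type per column.
template homogeneousTypesFor(Types...)
{
    static if (NoDuplicates!Types.length == 1)
        alias homogeneousTypesFor = toArr!(Types[0])[Types.length];
    else
        alias homogeneousTypesFor = staticMap!(toArr, Types);
}

static assert(is(homogeneousTypesFor!(int, int, int) == int[][3]));
```

Both DataFrame and Group could then declare their `data` field through the same alias instead of repeating the `static if`.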

thewilsonator · 2019-08-15T01:42:23Z

+
+            assert(0);
+        }
+    }


all of this code is duplicate too.

thewilsonator · 2019-08-15T01:47:51Z

+            mixin auxDispatch!(mixinAux);
+            auxDispatch(pos[1]);
        }
        else static if(T.length == 1 && isArray!(T[0]))


The only thing different between this branch and the one above is

-mixin("data[i][j] " ~ op ~ "= elements.data[j - elementCountTill[pos[0]]];");
+mixin("data[i][j] " ~ op ~ "= elements.data[j - elementCountTill[pos[0]]].get!(GrpRowType[typepos]);");

you should merge them.

thewilsonator · 2019-08-15T01:50:24Z

            }
+
+            mixin auxDispatch!(assignAux);
+            auxDispatch(pos);


the line count increase for doing this here is quite large and the entire point of it was to increase readability which I don't think this does.

thewilsonator · 2019-08-15T02:31:41Z

One thing I want you to think about is the amount of similarity and differences between Group and DataFrame and if the differences actually need to be different at all.

For example, both can be of homogeneous types (or not), both contain a field that depends on that property, and both have an index. But the alias for the (in)homogeneous type is FrameType in DataFrame and GrpType in Group (the field is called data in both, which is good), and the index is called indx in DataFrame and grpIndex in Group.

This means that if all I care about, for the thing I have as a template parameter, is typeof(data) and the index, I must still treat them differently.

This affects you too. The implementation of aggregate for Group and DataFrame is almost identical: almost, but not quite. Some of this is by necessity, they are different data types after all, but having to implement it twice is not a desirable property of the code.

In D this is usually done as a free function that then gets called via Uniform Function Call Syntax, that is to say, one writes

struct DataFrame(...)
{
    //...
}

struct Group(...)
{
    //...
}

auto aggregate(T, int axis, Ops...)(ref T t) // T is a dataframe or a Group
{
    //...
}

//User code
void func()
{
    DataFrame!(whatever) df;
    Group!(whatever) grp;
    auto aggdf = df.aggregate!(0,max);
    auto agggrp = grp.aggregate!(0,max);
}

I want you to do that, and then for every difference between DataFrame and Group in the implementation of aggregate, ask yourself "does this difference need to exist?".

This separation of behaviour from data is very widely used in D. It is, for example, the basis of the range: a data structure provides the implementation of how to iterate over it, and then all the algorithms in Phobos can be used with that data structure (provided the constraints are met).

(Sometimes it is necessary to specialise an algorithm for a data structure because of performance but that is rare, and you should do profiling to make sure that it is worth doing)
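A compilable sketch of the free-function approach, under simplifying assumptions: the `DataFrame` and `Group` below are toy stand-ins with an `int[][] data` field (not magpie's types), the `axis` parameter is left out, and the eponymous-template wrapper is borrowed from how Phobos declares `map` so that the operations can be passed explicitly while the container type is deduced.

```d
import std.algorithm.comparison : max, min;
import std.algorithm.iteration : fold, map;
import std.array : array;
import std.typecons : tuple;

// Toy stand-ins that only share the name and type of `data`.
struct DataFrame { int[][] data; }
struct Group { int[][] data; }

// One implementation for every type that exposes `data` as int[][].
template aggregate(Ops...)
{
    auto aggregate(T)(ref T t)
        if (is(typeof(T.init.data) : int[][]))
    {
        // One pass per column, all operations folded together.
        return t.data.map!(col => col.fold!Ops).array;
    }
}
```

Usage through UFCS then looks the same for both types: `df.aggregate!(min, max)` gives one `(min, max)` tuple per column, and `grp.aggregate!max` gives one maximum per column.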

* Moved same code out of if statements
* Moved auxDispatch to helpers
* Squashed similar if-else branch

Kriyszig · 2019-08-15T17:21:22Z

I was on the track of doing exactly that: an aggregate that takes either Group or DataFrame but with the custom function changes, I ran into this trouble:

https://run.dlang.io/is/88rXa3

At that point I had implemented aggregate as an overload so I moved each of them into the respective structs and optimized them individually. Some parts are indeed very similar - If the above mentioned problem can be circumvented, I can move the aggregate into a single function by this weekend

* Alias input for aggregate
* Updated snippets

Kriyszig added 10 commits July 29, 2019 20:37

Converted DataFrames to use arrays instead of TypeTuples for data

0640344

Group uses an array instead of a TypeTuple in Homogeneous cases

6ea39d2

Reduced traversal for homogeneous DataFrame

2e27351

Reduced foreach traversal in Group

90c334f

Assign index using array like method

43c35b1

Aggregate Operation on DataFrame columns

5276c4b

Aggregate operation on DataFrame Column

3dd2729

Aggregate operation on Group columns

9277611

Aggregate operation on Group rows

187f466

Documentation: Added documentation of Aggregate and Index assignment

b49b59d

Kriyszig added Ready for Review! Aggregate Optimization labels Aug 2, 2019

thewilsonator reviewed Aug 5, 2019

Kriyszig added 4 commits August 5, 2019 14:26

Bugfix: Fixed wrongly specified Slice type in asSlice

26e04d3

Refactor: Aggregate now accepts functions from user

5781070

Fixes to allow custom functions in aggregate

b93db09

Aggregate operation refactor for Group

6a74274

thewilsonator reviewed Aug 7, 2019

Kriyszig added 3 commits August 7, 2019 23:12

Use auxiliary function to compute gaps for display

6349057

Added precision setting for to_csv

d7cfa6f

Replace auto with alias to represent data for sorting in Group

744818a

Kriyszig added 2 commits August 8, 2019 22:38

at overload for runtime access

2e3807c

Added dispatcher to reduce code redundancy

10013d9

thewilsonator reviewed Aug 9, 2019

Convert dispatcher to mixin template

403aa24

thewilsonator reviewed Aug 13, 2019

Used auxiliary function to initialize aggregate return values

e30f57a

thewilsonator reviewed Aug 15, 2019

Comment thread source/magpie/dataframe.d Outdated

}

ret.indx.generateCodes();

return ret;

thewilsonator Aug 15, 2019

ditto

thewilsonator reviewed Aug 15, 2019

Comment thread source/magpie/group.d Outdated

assert(0);

}

}

thewilsonator Aug 15, 2019

all of this code is duplicate too.

thewilsonator reviewed Aug 15, 2019

Reduce code repetition

e3e063c

Kriyszig mentioned this pull request Aug 19, 2019

Filter Operation on DataFrame #12

Merged

Kriyszig added the 48hrs no objection -> merge label Aug 19, 2019

Update documentation after specs change

45a4969

Kriyszig merged commit 663cd5b into master Aug 25, 2019
