Skip to content

Preliminary Aggregate Ops and optimiaztion for Homogeneous DataFrames #11

Merged
Kriyszig merged 23 commits into
masterfrom
homogeneous
Aug 25, 2019
Merged

Preliminary Aggregate Ops and optimiaztion for Homogeneous DataFrames #11
Kriyszig merged 23 commits into
masterfrom
homogeneous

Conversation

@Kriyszig
Copy link
Copy Markdown
Owner

@Kriyszig Kriyszig commented Aug 2, 2019

This PR adds some preliminary aggregate operations and reducing of traversal overhead when targeting a single column of DataFrame by replacing TypeTuple with an array. (Same thing done for Group as well)
Implemented some preliminary aggregate operations.

cc/ @thewilsonator, ready for review (^_^)

Kriyszig added 10 commits July 29, 2019 20:37
* Homogeneous cases can use arrays to save static traversal overhead
* UnitTests are modified only at places where type assertion of data
was being carried out
* Removed static traversal for homogeneous DataFrame
* No changes made in unit-tests (non-breaking)
* Used isHomogenenousType to reduce static traversal
* Unit tests remain unchanged (Non-Breaking)
* Requested by a used in the forum, a less verbose way to assign index
* Added unittests to check consistency and correct behavior
* Aggregate operation implemented for columns of DataFrame
* Added unittests to check for correct behavior
* Aggregate operation on DataFrame column
* Added preliminary unittest
* Implemented Aggregate operation on Group columns
* Added preliminary unittest to ensure correct behavior
* Aggregate now supports operation along Group row
* Added tests to check for correct behavior
* Updated README documenting the new features
Comment thread source/magpie/group.d Outdated
static if(isHomogeneousType)
auto sortdata = transposed(data);
else
auto ref sortdata = data;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does that even work? Try alias sortdata = data;

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it on the online ide: https://run.dlang.io/is/d8K9sr
Even alias does the job: https://run.dlang.io/is/X9Xx1O

alias afaik doesn't allocate the data again so it's indeed a a better option

Comment thread source/magpie/dataframe.d Outdated
}

static foreach(j; 0 .. RowType.length)
static if(isHomogeneousType)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create an auxiliary function that does something like

void homogenousDispatch(alias func)
{
     static if(isHomogeneousType)
         func(dataIndex);
    else
        static foreach(j; 0 .. RowType.length)
              if(j == dataIndex)
                     func(dataIndex);
}

and the call it like

homogenousDispatch!(
    (j)
    {
          size_t maxsize1 = 0, maxsize2 = 0;
          // ...
    });

Ditto throughout.

If you've got a better name for it please use it instead.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figured out dispatcher long at last 😅

Comment thread source/magpie/dataframe.d Outdated
@@ -187,25 +188,47 @@ public:
maxGap = (maxGap > lenCol)? maxGap: lenCol;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread source/magpie/group.d Outdated
static if(kind == Universal)
Slice!(Type*, 1, kind) ret = slice!(Type)(elementCountTill[pos[0] + 1] - elementCountTill[pos[0]]).universal;
else static if(kind == Canonical)
Slice!(Type*, 1, kind) ret = slice!(string)(elementCountTill[pos[0] + 1] - elementCountTill[pos[0]]).canonical;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should either be Slice!(string*, 1, kind) ret = slice!(string) or Slice!(Type*, 1, kind) ret = slice!(Type)`

Comment thread source/magpie/operation.d Outdated
auto aggregate(int axis, T, Ops...)(T df, Ops ops)
if(isDataFrame!T)
{
static if(axis)
Copy link
Copy Markdown

@thewilsonator thewilsonator Aug 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if calling this with axis==0 does nothing, it probably makes no sense and should be disallowed.

auto aggregate(int axis,...)(...) if(isDataFrame!T && axis)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scratch that axis is unused remove it.

Comment thread source/magpie/operation.d Outdated
else static assert(0, "Invalid join type. Available join type: left, right, outer, inner(default)");
}

enum AggregateOP
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would strongly suggest to make this not a restricted set of ops e.g. this misses variance.

auto aggregate(T, Ops...)(T df) and then use with

auto x = df.aggregate!(min, max);

this will simplify the implementation to

static foreach(i;0 ..  Ops.length)
{
    static foreach(j; 0 .. df.RowType.length)
         opres = Ops[i](df.data[j]);
         ret.indx.row.index[0][i] = Ops[i].stringof;
}

note that you still do i passes through the data, whereas you should be able to do all aggregation statistics with one pass. You should look at fold and take some inspiration from there, particularly the example arr.fold!(min, max).

Comment thread source/magpie/operation.d Outdated
{
static if(axis)
{
DataFrame!(float, df.RowType.length) ret;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly concerned with the change of data types here, a data frame of ints shouldn't get converted to a df of floats just because I computed the max of a column or row.

Comment thread source/magpie/operation.d Outdated
foreach(i; 0 .. df.RowType.length)
ret.data[i].length = Ops.length;

double opres;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Comment thread source/magpie/ops.d Outdated

import std.algorithm: map, reduce;

double count(T)(T[] arr)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should try to make aggregate work directly with phobos (and mir's if you can) algorithms directly without the need for this shim.

* asSlice had wrongly specified Slice iterator types for Slices other
than Universal
* Functions predefined have been removed
* Aggregate moved inside DataFrame
* Aggregate accepts functions from user
* Added some internal and external templates to evaluate optimized
types
* Used compiles trait to check for correct execution method
* Added some more unittest
* Added aggregate for Group
* Added unittest
* Removed unused parameter in aggregateType template
Comment thread source/magpie/dataframe.d Outdated
{
maxGap = maxsize2;
}
maxsize2 = data[dataIndex][$ - bottom .. $].map!(e => to!string(e).length).reduce!max;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can I suggest making T[]..map!(e => to!string(e).length).reduce!max a function and then calling it on the computed ranges.
e.g.

/*use a better name*/
auto transform(T)(T[] arr){ return arr.map!(e => to!string(e).length).reduce!max;}
...
if(bottom > data[dataIndex].length)
    maxsize2 = transform(data[dataIndex]);
else if (bottom > 0) 
    maxsize2 = transform(data[dataIndex][$ - bottom .. $]);

also computing to!string(e) just for the length seems wasteful if you don't use the result. This is totally fine for getting stuff up and running quickly, but you should do some profiling to determine the bottlenecks of this library.

I suspect GC allocation may be one of them, but do the profiling anyway because it i entirely possible I'm wrong and that the optimiser inlines everything and discards the useless allocation (use LDC for that, DMD won't do much to help with that).

* Used fucnnction to reduce code repitition
* Used std.algorithms: max to compute maximum
* Previously, to_csv didn't have an option to set precision
to save floating point. std.con: to rounded floating numbers
larger than 3 decimal places to 3 decimal places.
* two variation of at - one takes i1 at runtime and the other
takes both arguments as parameters
* Dispatcher calls the function supplying params statically
or at run time depending on the type of DataFrame
Comment thread source/magpie/dataframe.d Outdated
return data[i2][i1];
else

auto auxDispatch(alias Fn)(size_t indx)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to declare this multiple times just use the one at the top. If these are on different types move this to you utils file and parameterise on the type.

The whole point was to reduce code remember ;)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto throughout.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this but it leads to this error:

Error: function `magpie.dataframe.DataFrame!(int, 2).DataFrame.auxDispatch!(maxGapCalc).auxDispatch` cannot get frame pointer to magpie.dataframe.DataFrame!(int, 2).DataFrame.display.maxGapCalc!-1L.maxGapCalc

The function seems to accessible only within the function scope and couldn't be referenced outside hence I had to bring dispatcher into each function.
Is there a workaround for this?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try making auxDispatch static.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making it static gives this:

Error: function `magpie.dataframe.DataFrame!(int, 2).DataFrame.display.maxGapCalc!-1L.maxGapCalc` is a nested function and cannot be accessed from magpie.dataframe.DataFrame!(int, 2).DataFrame.auxDispatch!(maxGapCalc).auxDispatch

telling nested function cannot be accessed outside.
I fiddled around this for a while however everything led to the same error

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe try asking on the learn forums. There definitely should be a way to do it.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to make a sample example for dlang learn and this worked somehow
https://run.dlang.io/is/EvwVVq
So I was wondering what is the difference between this example and the one I was working with and I noticed if the nested function is a template function, it cannot send a pointer outside.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed it can if you instantiate it at the call site, i.e. this works (but is not what you want):

-void nested(int a)
+void nested()(int a)
...
-dispatcher!(nested)(5);
+dispatcher!(nested!())(5);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct S
{
    int data;
    int pos = 5;
private:  
    mixin template auxdispatch(alias F/*, alias indx*/)
    {
        auto auxdispatch(int x /*= indx*/)
        {
            return F(x);
        }
    }
 public:   
    void outer()
    {
        void nested()(int a)
        {
            data += a;
        }
        
        mixin auxdispatch!(nested/*,pos*/);
		auxdispatch(pos);
        import std.stdio;
        writeln(data);
    }
}

void main()
{
    S s;
    s.outer();
}

will do the trick, the commented out sections are what I would have liked to have done but a DMD bug prevents it from compiling.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the above example @thewilsonator
Added mixin template in 403aa24

Comment thread source/magpie/dataframe.d
}
}

void assignAux(ptrdiff_t si = -1)(size_t ri = 0) @property
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto. there is an identical one just above.

* Dispather is converted to mixin template to work around nested function pointer
* Added dispatcher to Groups
Comment thread source/magpie/dataframe.d
thisGap = tmp;
}
thisGap = max(thisGap, tmp);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have this same block of code repeated three times with different different indices, you should make that a (nested) function.

Comment thread source/magpie/dataframe.d
static if(si > -1)
alias i = si;
else
size_t i = ri;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could probably refactor this into a mixin (template) too.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doing this would make the intent of the function much clearer.

Comment thread source/magpie/dataframe.d
+/
auto aggregate(int axis, Ops...)() @property
{
static if(axis)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of similar code in the two branches, try to push this static if deeper into the function.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in particular do

static if (axis)
{
    import std.meta: staticMap;
    alias Resolve(T) = suitableType!(resolverInternal!(T, Ops));
    alais DFTypes = staticMap!(Resolve, RowType);
}
else 
     alais DFTypes =aggregateType!(Ops)

DataFrame!(true, DFTypes) ret;
init!(axis)(ret);
...

Comment thread source/magpie/dataframe.d Outdated
assert(approxEqual(doubledf.data[1][0], 5.6, 1e-8) && approxEqual(doubledf.data[1][1], 8, 1e-8));
}

private auto customFunc(T)(T[] arr)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this inside the unittest that uses it and make it static

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in e30f57a 👍

Comment thread source/magpie/group.d

auto aggregate(int axis, Ops...)() @property
{
static if(axis)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, there is a lot of similar code and it makes it difficult to see what is different.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put all the initialisation into nested function :octocat:

Comment thread source/magpie/group.d Outdated
assert(approxEqual(dub.data[1][0], 2, 1e-8) && approxEqual(dub.data[1][1], 3, 1e-8) && approxEqual(dub.data[1][2], 5, 1e-8) && approxEqual(dub.data[1][3], 6, 1e-8));
}

private auto customFunc(T)(T[] arr)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, move this inside the unittest.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in e30f57a 👍

* Nested function to initialize return values
* Made unittest related functions static
Comment thread source/magpie/dataframe.d
auto gapCalc(T)(T[] arr)
{
static if(is(T == string))
return arr.map!(e => e.length).reduce!max;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This static if branch is not needed as to!string on a string is a no-op. just use the line below.

Comment thread source/magpie/group.d Outdated
}

ret.grpIndex.generateCodes();
return ret;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines are duplicate in each branch.

Comment thread source/magpie/dataframe.d Outdated
}

ret.indx.generateCodes();
return ret;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment thread source/magpie/group.d
alias Resolve(T) = suitableType!(resolverInternal!(T, Ops));
alias ResolvedTypes = staticMap!(Resolve, GrpRowType);
Group!(ResolvedTypes) ret;
init!(axis)(ret);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Comment thread source/magpie/group.d
static if(!isHomogeneous!(GrpRowType))
alias GrpType = staticMap!(toArr, GrpRowType);
else
alias GrpType = toArr!(GrpRowType[0])[GrpRowType.length];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is duplicate with dataframe, move it to utils, and do

alias GrpType = homogeneousTypesFor!(GrpRowType);

if you need isHomogeneousType use a mixin template.

Comment thread source/magpie/group.d Outdated

assert(0);
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of this code is duplicate too.

Comment thread source/magpie/group.d Outdated
mixin auxDispatch!(mixinAux);
auxDispatch(pos[1]);
}
else static if(T.length == 1 && isArray!(T[0]))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing different between this branch and the one above is

-mixin("data[i][j] " ~ op ~ "= elements.data[j - elementCountTill[pos[0]]];");
+mixin("data[i][j] " ~ op ~ "= elements.data[j - elementCountTill[pos[0]]].get!(GrpRowType[typepos]);");

you should merge them.

Comment thread source/magpie/dataframe.d
}

mixin auxDispatch!(assignAux);
auxDispatch(pos);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the line count increase for doing this here is quite large and the entire point of it was to increase readability which I don't think this does.

@thewilsonator
Copy link
Copy Markdown

One thing I want you to think about is the amount of similarity and differences between Group and DataFrame and if the differences actually need to be different at all.

For example they can both be of homogeneous types (or not) and they both contain a field that depend on that property and they both have an index. But, for the data frame the alias for the (in)homogeneous type is FrameType in data frame and GrpType (the field is both called data which is good) and the index is called indx for data frames and grpIndex.

This means if all I care about the thing I have as a template is typeof(data) and the index, I must treat them differently.

This affect you too. The implementation of aggregate for Groups & DataFrames is almost iIdentical, almost, but not quite. Now some of this is by necessity, they are different data types after all, but having to implement it twice is not a desirable property of the code.

In D this is usually done as a free function that then gets called via Uniform Function Call Syntax, that is to say, one writes

struct DataFrame(...)
{
    //...
}

struct Group(...)
{
    //...
}

auto aggregate(T, int axis, Ops...)(ref T t) // T is a dataframe or a Group
{
    //...
}

//User code
void func()
{
    Dataframe!(whatever) df;
    Group!(whatever) grp;
    auto aggdf = df.aggregate!(0,max);
    auto agggrp = grp.aggregate!(0,max);
}

I want you to do that, and then for every difference between DataFrame and Group in the implementation of aggregate, ask yourself "does this difference need to exist?".

This separation of behaviour from data is very widely used in D, it is for example the basis of the range: a data structure provides the implementation of how to iterate over it, and then all the algorithms on phobos can be used with that data structure (providing the constraints are met).

(Sometimes it is necessary to specialise an algorithm for a data structure because of performance but that is rare, and you should do profiling to make sure that it is worth doing)

* Moved same code out of if statements
* Moved auxDispatch to helpers
* Squashed similar if-else branch
@Kriyszig
Copy link
Copy Markdown
Owner Author

I was on the track of doing exactly that: an aggregate that takes either Group or DataFrame but with the custom function changes, I ran into this trouble:

https://run.dlang.io/is/88rXa3

At that point I had implemented aggregate as an overload so I moved each of them into the respective structs and optimized them individually. Some parts are indeed very similar - If the above mentioned problem can be circumvented, I can move the aggregate into a single function by this weekend

* Alias input for aggregate
* Updated snippets
@Kriyszig Kriyszig merged commit 663cd5b into master Aug 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants