Preliminary Aggregate Ops and optimiaztion for Homogeneous DataFrames #11
Conversation
* Homogeneous cases can use arrays to save static traversal overhead * UnitTests are modified only at places where type assertion of data was being carried out
* Removed static traversal for homogeneous DataFrame * No changes made in unit-tests (non-breaking)
* Used isHomogenenousType to reduce static traversal * Unit tests remain unchanged (Non-Breaking)
* Requested by a used in the forum, a less verbose way to assign index * Added unittests to check consistency and correct behavior
* Aggregate operation implemented for columns of DataFrame * Added unittests to check for correct behavior
* Aggregate operation on DataFrame column * Added preliminary unittest
* Implemented Aggregate operation on Group columns * Added preliminary unittest to ensure correct behavior
* Aggregate now supports operation along Group row * Added tests to check for correct behavior
* Updated README documenting the new features
| static if(isHomogeneousType) | ||
| auto sortdata = transposed(data); | ||
| else | ||
| auto ref sortdata = data; |
There was a problem hiding this comment.
does that even work? Try alias sortdata = data;
There was a problem hiding this comment.
I tested it on the online ide: https://run.dlang.io/is/d8K9sr
Even alias does the job: https://run.dlang.io/is/X9Xx1O
alias afaik doesn't allocate the data again so it's indeed a a better option
| } | ||
|
|
||
| static foreach(j; 0 .. RowType.length) | ||
| static if(isHomogeneousType) |
There was a problem hiding this comment.
create an auxiliary function that does something like
void homogenousDispatch(alias func)
{
static if(isHomogeneousType)
func(dataIndex);
else
static foreach(j; 0 .. RowType.length)
if(j == dataIndex)
func(dataIndex);
}
and the call it like
homogenousDispatch!(
(j)
{
size_t maxsize1 = 0, maxsize2 = 0;
// ...
});
Ditto throughout.
If you've got a better name for it please use it instead.
There was a problem hiding this comment.
Figured out dispatcher long at last 😅
| @@ -187,25 +188,47 @@ public: | |||
| maxGap = (maxGap > lenCol)? maxGap: lenCol; | |||
There was a problem hiding this comment.
https://dlang.org/phobos/std_algorithm_comparison.html#.max
Ditto throughout.
| static if(kind == Universal) | ||
| Slice!(Type*, 1, kind) ret = slice!(Type)(elementCountTill[pos[0] + 1] - elementCountTill[pos[0]]).universal; | ||
| else static if(kind == Canonical) | ||
| Slice!(Type*, 1, kind) ret = slice!(string)(elementCountTill[pos[0] + 1] - elementCountTill[pos[0]]).canonical; |
There was a problem hiding this comment.
this should either be Slice!(string*, 1, kind) ret = slice!(string) or Slice!(Type*, 1, kind) ret = slice!(Type)`
| auto aggregate(int axis, T, Ops...)(T df, Ops ops) | ||
| if(isDataFrame!T) | ||
| { | ||
| static if(axis) |
There was a problem hiding this comment.
if calling this with axis==0 does nothing, it probably makes no sense and should be disallowed.
auto aggregate(int axis,...)(...) if(isDataFrame!T && axis)
There was a problem hiding this comment.
scratch that axis is unused remove it.
| else static assert(0, "Invalid join type. Available join type: left, right, outer, inner(default)"); | ||
| } | ||
|
|
||
| enum AggregateOP |
There was a problem hiding this comment.
I would strongly suggest to make this not a restricted set of ops e.g. this misses variance.
auto aggregate(T, Ops...)(T df) and then use with
auto x = df.aggregate!(min, max);
this will simplify the implementation to
static foreach(i;0 .. Ops.length)
{
static foreach(j; 0 .. df.RowType.length)
opres = Ops[i](df.data[j]);
ret.indx.row.index[0][i] = Ops[i].stringof;
}
note that you still do i passes through the data, whereas you should be able to do all aggregation statistics with one pass. You should look at fold and take some inspiration from there, particularly the example arr.fold!(min, max).
| { | ||
| static if(axis) | ||
| { | ||
| DataFrame!(float, df.RowType.length) ret; |
There was a problem hiding this comment.
I'm slightly concerned with the change of data types here, a data frame of ints shouldn't get converted to a df of floats just because I computed the max of a column or row.
| foreach(i; 0 .. df.RowType.length) | ||
| ret.data[i].length = Ops.length; | ||
|
|
||
| double opres; |
|
|
||
| import std.algorithm: map, reduce; | ||
|
|
||
| double count(T)(T[] arr) |
There was a problem hiding this comment.
you should try to make aggregate work directly with phobos (and mir's if you can) algorithms directly without the need for this shim.
* asSlice had wrongly specified Slice iterator types for Slices other than Universal
* Functions predefined have been removed * Aggregate moved inside DataFrame * Aggregate accepts functions from user
* Added some internal and external templates to evaluate optimized types * Used compiles trait to check for correct execution method * Added some more unittest
* Added aggregate for Group * Added unittest * Removed unused parameter in aggregateType template
| { | ||
| maxGap = maxsize2; | ||
| } | ||
| maxsize2 = data[dataIndex][$ - bottom .. $].map!(e => to!string(e).length).reduce!max; |
There was a problem hiding this comment.
can I suggest making T[]..map!(e => to!string(e).length).reduce!max a function and then calling it on the computed ranges.
e.g.
/*use a better name*/
auto transform(T)(T[] arr){ return arr.map!(e => to!string(e).length).reduce!max;}
...
if(bottom > data[dataIndex].length)
maxsize2 = transform(data[dataIndex]);
else if (bottom > 0)
maxsize2 = transform(data[dataIndex][$ - bottom .. $]);
also computing to!string(e) just for the length seems wasteful if you don't use the result. This is totally fine for getting stuff up and running quickly, but you should do some profiling to determine the bottlenecks of this library.
I suspect GC allocation may be one of them, but do the profiling anyway because it i entirely possible I'm wrong and that the optimiser inlines everything and discards the useless allocation (use LDC for that, DMD won't do much to help with that).
* Used fucnnction to reduce code repitition * Used std.algorithms: max to compute maximum
* Previously, to_csv didn't have an option to set precision to save floating point. std.con: to rounded floating numbers larger than 3 decimal places to 3 decimal places.
* two variation of at - one takes i1 at runtime and the other takes both arguments as parameters
* Dispatcher calls the function supplying params statically or at run time depending on the type of DataFrame
| return data[i2][i1]; | ||
| else | ||
|
|
||
| auto auxDispatch(alias Fn)(size_t indx) |
There was a problem hiding this comment.
no need to declare this multiple times just use the one at the top. If these are on different types move this to you utils file and parameterise on the type.
The whole point was to reduce code remember ;)
There was a problem hiding this comment.
I tried this but it leads to this error:
Error: function `magpie.dataframe.DataFrame!(int, 2).DataFrame.auxDispatch!(maxGapCalc).auxDispatch` cannot get frame pointer to magpie.dataframe.DataFrame!(int, 2).DataFrame.display.maxGapCalc!-1L.maxGapCalc
The function seems to accessible only within the function scope and couldn't be referenced outside hence I had to bring dispatcher into each function.
Is there a workaround for this?
There was a problem hiding this comment.
making it static gives this:
Error: function `magpie.dataframe.DataFrame!(int, 2).DataFrame.display.maxGapCalc!-1L.maxGapCalc` is a nested function and cannot be accessed from magpie.dataframe.DataFrame!(int, 2).DataFrame.auxDispatch!(maxGapCalc).auxDispatch
telling nested function cannot be accessed outside.
I fiddled around this for a while however everything led to the same error
There was a problem hiding this comment.
Hmm, maybe try asking on the learn forums. There definitely should be a way to do it.
There was a problem hiding this comment.
I was trying to make a sample example for dlang learn and this worked somehow
https://run.dlang.io/is/EvwVVq
So I was wondering what is the difference between this example and the one I was working with and I noticed if the nested function is a template function, it cannot send a pointer outside.
There was a problem hiding this comment.
I noticed it can if you instantiate it at the call site, i.e. this works (but is not what you want):
-void nested(int a)
+void nested()(int a)
...
-dispatcher!(nested)(5);
+dispatcher!(nested!())(5);There was a problem hiding this comment.
struct S
{
int data;
int pos = 5;
private:
mixin template auxdispatch(alias F/*, alias indx*/)
{
auto auxdispatch(int x /*= indx*/)
{
return F(x);
}
}
public:
void outer()
{
void nested()(int a)
{
data += a;
}
mixin auxdispatch!(nested/*,pos*/);
auxdispatch(pos);
import std.stdio;
writeln(data);
}
}
void main()
{
S s;
s.outer();
}
will do the trick, the commented out sections are what I would have liked to have done but a DMD bug prevents it from compiling.
There was a problem hiding this comment.
Thanks a lot for the above example @thewilsonator
Added mixin template in 403aa24
| } | ||
| } | ||
|
|
||
| void assignAux(ptrdiff_t si = -1)(size_t ri = 0) @property |
There was a problem hiding this comment.
ditto. there is an identical one just above.
* Dispather is converted to mixin template to work around nested function pointer * Added dispatcher to Groups
| thisGap = tmp; | ||
| } | ||
| thisGap = max(thisGap, tmp); | ||
| } |
There was a problem hiding this comment.
You have this same block of code repeated three times with different different indices, you should make that a (nested) function.
| static if(si > -1) | ||
| alias i = si; | ||
| else | ||
| size_t i = ri; |
There was a problem hiding this comment.
you could probably refactor this into a mixin (template) too.
There was a problem hiding this comment.
doing this would make the intent of the function much clearer.
| +/ | ||
| auto aggregate(int axis, Ops...)() @property | ||
| { | ||
| static if(axis) |
There was a problem hiding this comment.
There is a lot of similar code in the two branches, try to push this static if deeper into the function.
There was a problem hiding this comment.
in particular do
static if (axis)
{
import std.meta: staticMap;
alias Resolve(T) = suitableType!(resolverInternal!(T, Ops));
alais DFTypes = staticMap!(Resolve, RowType);
}
else
alais DFTypes =aggregateType!(Ops)
DataFrame!(true, DFTypes) ret;
init!(axis)(ret);
...
| assert(approxEqual(doubledf.data[1][0], 5.6, 1e-8) && approxEqual(doubledf.data[1][1], 8, 1e-8)); | ||
| } | ||
|
|
||
| private auto customFunc(T)(T[] arr) |
There was a problem hiding this comment.
move this inside the unittest that uses it and make it static
|
|
||
| auto aggregate(int axis, Ops...)() @property | ||
| { | ||
| static if(axis) |
There was a problem hiding this comment.
ditto, there is a lot of similar code and it makes it difficult to see what is different.
There was a problem hiding this comment.
I put all the initialisation into nested function ![]()
| assert(approxEqual(dub.data[1][0], 2, 1e-8) && approxEqual(dub.data[1][1], 3, 1e-8) && approxEqual(dub.data[1][2], 5, 1e-8) && approxEqual(dub.data[1][3], 6, 1e-8)); | ||
| } | ||
|
|
||
| private auto customFunc(T)(T[] arr) |
There was a problem hiding this comment.
ditto, move this inside the unittest.
* Nested function to initialize return values * Made unittest related functions static
| auto gapCalc(T)(T[] arr) | ||
| { | ||
| static if(is(T == string)) | ||
| return arr.map!(e => e.length).reduce!max; |
There was a problem hiding this comment.
This static if branch is not needed as to!string on a string is a no-op. just use the line below.
| } | ||
|
|
||
| ret.grpIndex.generateCodes(); | ||
| return ret; |
There was a problem hiding this comment.
These two lines are duplicate in each branch.
| } | ||
|
|
||
| ret.indx.generateCodes(); | ||
| return ret; |
| alias Resolve(T) = suitableType!(resolverInternal!(T, Ops)); | ||
| alias ResolvedTypes = staticMap!(Resolve, GrpRowType); | ||
| Group!(ResolvedTypes) ret; | ||
| init!(axis)(ret); |
| static if(!isHomogeneous!(GrpRowType)) | ||
| alias GrpType = staticMap!(toArr, GrpRowType); | ||
| else | ||
| alias GrpType = toArr!(GrpRowType[0])[GrpRowType.length]; |
There was a problem hiding this comment.
this code is duplicate with dataframe, move it to utils, and do
alias GrpType = homogeneousTypesFor!(GrpRowType);
if you need isHomogeneousType use a mixin template.
|
|
||
| assert(0); | ||
| } | ||
| } |
There was a problem hiding this comment.
all of this code is duplicate too.
| mixin auxDispatch!(mixinAux); | ||
| auxDispatch(pos[1]); | ||
| } | ||
| else static if(T.length == 1 && isArray!(T[0])) |
There was a problem hiding this comment.
The only thing different between this branch and the one above is
-mixin("data[i][j] " ~ op ~ "= elements.data[j - elementCountTill[pos[0]]];");
+mixin("data[i][j] " ~ op ~ "= elements.data[j - elementCountTill[pos[0]]].get!(GrpRowType[typepos]);");you should merge them.
| } | ||
|
|
||
| mixin auxDispatch!(assignAux); | ||
| auxDispatch(pos); |
There was a problem hiding this comment.
the line count increase for doing this here is quite large and the entire point of it was to increase readability which I don't think this does.
|
One thing I want you to think about is the amount of similarity and differences between Group and DataFrame and if the differences actually need to be different at all. For example they can both be of homogeneous types (or not) and they both contain a field that depend on that property and they both have an index. But, for the data frame the alias for the (in)homogeneous type is This means if all I care about the thing I have as a template is This affect you too. The implementation of In D this is usually done as a free function that then gets called via Uniform Function Call Syntax, that is to say, one writes struct DataFrame(...)
{
//...
}
struct Group(...)
{
//...
}
auto aggregate(T, int axis, Ops...)(ref T t) // T is a dataframe or a Group
{
//...
}
//User code
void func()
{
Dataframe!(whatever) df;
Group!(whatever) grp;
auto aggdf = df.aggregate!(0,max);
auto agggrp = grp.aggregate!(0,max);
}I want you to do that, and then for every difference between This separation of behaviour from data is very widely used in D, it is for example the basis of the range: a data structure provides the implementation of how to iterate over it, and then all the algorithms on phobos can be used with that data structure (providing the constraints are met). (Sometimes it is necessary to specialise an algorithm for a data structure because of performance but that is rare, and you should do profiling to make sure that it is worth doing) |
* Moved same code out of if statements * Moved auxDispatch to helpers * Squashed similar if-else branch
|
I was on the track of doing exactly that: an aggregate that takes either Group or DataFrame but with the custom function changes, I ran into this trouble: https://run.dlang.io/is/88rXa3 At that point I had implemented aggregate as an overload so I moved each of them into the respective structs and optimized them individually. Some parts are indeed very similar - If the above mentioned problem can be circumvented, I can move the aggregate into a single function by this weekend |
* Alias input for aggregate * Updated snippets
This PR adds some preliminary aggregate operations and reducing of traversal overhead when targeting a single column of DataFrame by replacing TypeTuple with an array. (Same thing done for Group as well)
Implemented some preliminary aggregate operations.
cc/ @thewilsonator, ready for review (^_^)