[SPARK-2873] [SQL] using ExternalAppendOnlyMap to resolve OOM when aggregating #2029
guowei2 wants to merge 1 commit into
|
Can one of the admins verify this patch? |
|
@marmbrus |
|
People usually just summarize the benchmark itself and its results in the description of the PR. For example: #1439 |
|
|
The benchmark results above are disappointing. The size of an AppendOnlyMap depends on the number of keys, because values with the same key are merged. I don't think using ExternalAppendOnlyMap is a good approach here, because it is too expensive when records with the same key are spilled to disk over and over again. Besides, users can easily avoid the OOM by raising spark.sql.shuffle.partitions to reduce the number of keys per partition. I think the logic of ExternalAppendOnlyMap should be optimized. Join seems to have a similar problem; putting both the left and the right table into an ExternalAppendOnlyMap is expensive too. |
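The cost described in this comment can be illustrated with a minimal sketch. This is not Spark's ExternalAppendOnlyMap: the class `SpillingAggMap`, the `max_keys` limit (a stand-in for a memory threshold), and the record and key counts in `demo` are all hypothetical and chosen only for illustration. It shows that when the in-memory map cannot hold every key, partial aggregates for the same key are written to disk repeatedly and have to be merged again at the end.

```python
import heapq
import os
import pickle
import tempfile


class SpillingAggMap:
    """Toy external aggregation map (hypothetical; not Spark's implementation)."""

    def __init__(self, merge, max_keys):
        self._merge = merge          # combines two values of the same key
        self._max_keys = max_keys    # stand-in for a memory limit
        self._mem = {}               # in-memory partial aggregates
        self._spills = []            # paths of key-sorted spill files
        self.spilled_records = 0     # (key, partial) pairs written to disk

    def insert(self, key, value):
        if key in self._mem:
            # Key already in memory: merge in place, the map does not grow.
            self._mem[key] = self._merge(self._mem[key], value)
            return
        if len(self._mem) >= self._max_keys:
            self._spill()
        self._mem[key] = value

    def _sorted_mem(self):
        return sorted(self._mem.items(), key=lambda kv: kv[0])

    def _spill(self):
        # Write the current partial aggregates as one key-sorted run.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            for item in self._sorted_mem():
                pickle.dump(item, f)
                self.spilled_records += 1
        self._spills.append(path)
        self._mem = {}

    @staticmethod
    def _read(path):
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    return

    def results(self):
        """Merge all sorted runs and combine partials that share a key."""
        runs = [self._read(p) for p in self._spills]
        runs.append(iter(self._sorted_mem()))
        merged = heapq.merge(*runs, key=lambda kv: kv[0])
        try:
            first = next(merged, None)
            if first is None:
                return
            cur_key, acc = first
            for key, partial in merged:
                if key == cur_key:
                    acc = self._merge(acc, partial)
                else:
                    yield cur_key, acc
                    cur_key, acc = key, partial
            yield cur_key, acc
        finally:
            for path in self._spills:
                os.remove(path)
            self._spills = []


def demo(num_records=50000, num_keys=1500, max_keys=500):
    """Count records per key; return the counts and the number of spilled pairs."""
    agg = SpillingAggMap(merge=lambda a, b: a + b, max_keys=max_keys)
    for i in range(num_records):
        agg.insert(i % num_keys, 1)
    return dict(agg.results()), agg.spilled_records
```

Calling `demo()` with a `max_keys` smaller than `num_keys` makes the same keys hit disk many times, while `demo(max_keys=1500)` never spills. Reducing the number of keys each task sees, as raising spark.sql.shuffle.partitions does, has the same effect as the second call.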
|
What were the actual results of the benchmark? It is acceptable for there to be some performance hit here. In cases where there are too many keys, it's much better to spill to disk than to OOM, though you have a good point about just adding more partitions. |
|
Can one of the admins verify this patch? |
|
I ran a micro-benchmark locally with 50,000 records and 1,500 keys.
|
|
Thanks for working on this, but we are trying to clean up the PR queue (in order to make it easier for us to review). Thus, I think we should close this issue for now and reopen it when it's ready for review. |
A new PR cloned from PR 1822.
Fixes a number of problems.
Reuses the CompactBuffer from Spark Core to save memory and pointer dereferences, as in PR 1993.
Hive UDAFs do not support external aggregation, because the Hive AggregationBuffer would need to be serializable and the Hive GenericUDAFEvaluator has no method for merging two evaluators.
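The two requirements named in the last point can be sketched generically. The example below is an assumption-laden illustration in Python, not Hive's or Spark's API: `AvgBuffer` and `aggregate_with_spill` are hypothetical names. It shows why a buffer that is spilled must be serializable (here via JSON) and why partial buffers must be mergeable to produce the final result.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AvgBuffer:
    """Hypothetical aggregation buffer for an average (not a Hive class)."""
    total: float = 0.0
    count: int = 0

    def update(self, value):
        # Fold one input value into the buffer.
        self.total += value
        self.count += 1

    def merge(self, other):
        # Combine two partial buffers; needed once partials live in separate runs.
        self.total += other.total
        self.count += other.count

    def result(self):
        return self.total / self.count if self.count else None


def aggregate_with_spill(values, spill_every):
    """Average `values`, simulating a spill after every `spill_every` inputs."""
    spilled = []          # serialized partial buffers (stand-in for disk)
    buf = AvgBuffer()
    for i, v in enumerate(values, start=1):
        buf.update(v)
        if i % spill_every == 0:
            # Spilling requires the buffer to be serializable.
            spilled.append(json.dumps(asdict(buf)))
            buf = AvgBuffer()
    for blob in spilled:
        # Reading back requires a way to merge partial buffers.
        buf.merge(AvgBuffer(**json.loads(blob)))
    return buf.result()
```

Without both `merge` and a serialized form, the spilled partials could not be recombined, which is the limitation the description attributes to Hive UDAFs.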