Conversation
This is a pretty nice speed win.

The new strategy consists of stacking new candidates as if they formed a hash chain. Then, only if there is a need to actually consult the chain, they are batch-updated, before starting the match search itself. This is supposed to be beneficial when skipping positions, which happens a lot when using a lazy strategy.

The baseline performance for `btlazy2` on my laptop is:

    15#calgary.tar :   3265536 ->   955985 (3.416),  7.06 MB/s , 618.0 MB/s
    15#enwik7      :  10000000 ->  3067341 (3.260),  4.65 MB/s , 521.2 MB/s
    15#silesia.tar : 211984896 -> 58095131 (3.649),  6.20 MB/s , 682.4 MB/s

(only level 15 remains for `btlazy2`, as this strategy is squeezed between `lazy2` and `btopt`)

After this patch, and keeping all parameters identical, speed is increased by a pretty good margin (+30-50%), but compression ratio suffers a bit:

    15#calgary.tar :   3265536 ->   958060 (3.408),  9.12 MB/s , 621.1 MB/s
    15#enwik7      :  10000000 ->  3078318 (3.249),  6.37 MB/s , 525.1 MB/s
    15#silesia.tar : 211984896 -> 58444111 (3.627),  9.89 MB/s , 680.4 MB/s

That's because I kept `1<<searchLog` as the maximum number of candidates to update. But for a hash chain, this represents the total number of candidates in the chain, while for the binary tree, it represents the maximum depth of searches. Keep in mind that a lot of candidates won't even be visited in the btree, since they are filtered out by the binary sort. As a consequence, in the new implementation, the effective depth of the binary tree is substantially shorter. To compensate, it's enough to increase the `searchLog` value.

Here is the result after adding just +1 to `searchLog` (level 15 setting in this patch):

    15#calgary.tar :   3265536 ->   956311 (3.415),  8.32 MB/s , 611.4 MB/s
    15#enwik7      :  10000000 ->  3067655 (3.260),  5.43 MB/s , 535.5 MB/s
    15#silesia.tar : 211984896 -> 58113144 (3.648),  8.35 MB/s , 679.3 MB/s

That is, almost the same compression ratio as before, but with a noticeable speed increase (+20-30%). This modification makes `btlazy2` more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
fuzz tests artifacts
new code supposed to be easier to auto-vectorize
the chain of unsorted candidates could grow beyond lowLimit.
The unsorted_mark is handled like any index, which fails after a rescale.
we want the dictionary table to be fully sorted, not just lazily filled. Dictionary loading is a bit more intensive, but it saves cpu cycles for match search during compression.
since no compiler attempts to auto-vectorize it.
now selected for levels 13, 14 and 15. Also: dropped the requirement for a monotonic memory budget increase across compression levels, which was previously required so that ZSTD_estimateCCtxSize() could guarantee that a memory budget for level L is large enough for any level <= L. This condition is now ensured at run time inside ZSTD_estimateCCtxSize().
It used to stop on reaching extDict, for simplification. As a consequence, there was a small loss of compression each time the round buffer restarted from the beginning. It's not a large difference though, just a few hundred bytes on silesia. This patch fixes it.
terrelln
left a comment
I haven't looked at zstd_lazy.c yet.
lib/compress/zstd_compress.c
Outdated
        ZSTD_reduceTable(zc->hashTable, hSize, reducerValue);
    }

    if (zc->appliedParams.cParams.strategy != ZSTD_btlazy2) {
Should this be deleted? It looks like for any strategy that is neither ZSTD_btlazy2 nor ZSTD_fast, the chain table will be reduced twice.
You are right, this is messed up.
As a mitigation, I suspect that the second reduce is a no-op, since it will discover it can't reduce indexes much further.
Still, it's useless cpu cycles spent.
lib/compress/zstd_compress.c
Outdated
        ZSTD_reduceTable(zc->chainTable, chainSize, reducerValue); }
    if (zc->appliedParams.cParams.strategy != ZSTD_fast) {
        U32 const chainSize = (U32)1 << zc->appliedParams.cParams.chainLog;
        if (zc->appliedParams.cParams.strategy != ZSTD_btlazy2)
Should this be == ZSTD_btlazy2?
Yes, indeed, it should.
And now I'm wondering why the associated test has not crashed ...
OK, both issues must be fixed.
Though indeed, it's not as impactful as I initially feared.
The situation was worse when ZSTD_DUBT_UNSORTED_MARK was defined as 0xFFFFFFFF: in that case, after a reduce, the resulting index pointed nowhere.
Now, since it's defined as 1, a reduce operation squashes it to 0, which means "no further candidate". The impact is much smaller, and I could not create a corruption or crash with it. But that could just be due to a lack of samples. ZSTD_preserveUnsortedMark() simply ensures there is no unforeseen side effect after a reduce.
following suggestions from @terrelln. Also added some comments to present logic behind ZSTD_preserveUnsortedMark().
a pointer calculation was wrong in a corner case
terrelln
left a comment
If I understand the delayed tree correctly, it works like this:
- Update like a hash chain, using the left pointer, and set the right pointer to `ZSTD_DUBT_UNSORTED_MARK`. This puts unsorted positions before the rest of the sorted tree.
- When inserting a position, we first fix up the tree of `matchIndex`. To do that we traverse the unsorted chain, storing "parent pointers" in the right positions where `ZSTD_DUBT_UNSORTED_MARK` was. Then we sort the chain in reverse order, so that `ZSTD_insertDUBT1()` doesn't see any of the unsorted positions.
- Then we search for the largest match and add `current` to the newly fixed up tree.
    match = base + matchIndex;
    if ( (!extDict)
      || (matchIndex+matchLength >= dictLimit)  /* both in current segment*/
      || (current < dictLimit) /* both in extDict */) {
Could matchLength potentially be cut at the end of the extDict in this case, even though the match really extends further?
Yes.
This is a simplification. It avoids creating some new counting logic, which would be quite complex to handle due to buffer changes, while the expected improvement is fairly limited (negligible).
    /** ZSTD_insertBt1() : add one or multiple positions to tree.
     *  ip : assumed <= iend-8 .
     * @return : nb of positions added */
    static U32 ZSTD_insertBt1(ZSTD_CCtx* zc,
Just to make sure, this is just copied from the old ZSTD_lazy.c with no other changes, right?
Yes. Since ZSTD_insertBt1() is no longer provided by zstd_lazy.c, it needs to be implemented directly within zstd_opt.c.
    /* batch sort stacked candidates */
    matchIndex = previousCandidate;
    while (matchIndex) {  /* will end on matchIndex == 0 */
        U32* const nextCandidateIdxPtr = bt + 2*(matchIndex&btMask) + 1;
Any reason to have this variable instead of just *(bt + 2*(matchIndex&btMask) + 1)?
Not really. It was just easier to think about it this way.
It can be coalesced into `nextCandidateIdx` if that matters.
Yes, this is a correct description.
lib/compress/zstd_lazy.c
Outdated
    }

    static size_t ZSTD_insertBtAndFindBestMatch (
Should this also be renamed to DUBT?
A new paramgrill round determined that the new strategy could also be applied to levels 13 and 14, so the compression level table has been adapted accordingly.

Note that this patch will conflict with #964, since they work on the same files. #964's usefulness is less clear though, since it merely moves the speed/ratio trade-off around, whereas this patch produces a net gain, albeit on a limited scope (`btlazy2`).