Conversation
This is a pretty nice speed win.

The new strategy consists of stacking new candidates as if they formed a hash chain. Then, only if there is a need to actually consult the chain, they are batch-updated, before starting the match search itself. This is supposed to be beneficial when skipping positions, which happens a lot when using a lazy strategy.

The baseline performance for `btlazy2` on my laptop is:

    15#calgary.tar :   3265536 ->   955985 (3.416),  7.06 MB/s , 618.0 MB/s
    15#enwik7      :  10000000 ->  3067341 (3.260),  4.65 MB/s , 521.2 MB/s
    15#silesia.tar : 211984896 -> 58095131 (3.649),  6.20 MB/s , 682.4 MB/s

(only level 15 remains for `btlazy2`, as this strategy is squeezed between `lazy2` and `btopt`)

After this patch, and keeping all parameters identical, speed is increased by a pretty good margin (+30-50%), but compression ratio suffers a bit:

    15#calgary.tar :   3265536 ->   958060 (3.408),  9.12 MB/s , 621.1 MB/s
    15#enwik7      :  10000000 ->  3078318 (3.249),  6.37 MB/s , 525.1 MB/s
    15#silesia.tar : 211984896 -> 58444111 (3.627),  9.89 MB/s , 680.4 MB/s

That's because I kept `1<<searchLog` as the maximum number of candidates to update. But for a hash chain, this represents the total number of candidates in the chain, while for the binary tree, it represents the maximum depth of searches. Keep in mind that a lot of candidates won't even be visited in the btree, since they are filtered out by the binary sort. As a consequence, in the new implementation, the effective depth of the binary tree is substantially shorter. To compensate, it's enough to increase the `searchLog` value.

Here is the result after adding just +1 to `searchLog` (level 15 setting in this patch):

    15#calgary.tar :   3265536 ->   956311 (3.415),  8.32 MB/s , 611.4 MB/s
    15#enwik7      :  10000000 ->  3067655 (3.260),  5.43 MB/s , 535.5 MB/s
    15#silesia.tar : 211984896 -> 58113144 (3.648),  8.35 MB/s , 679.3 MB/s

That is, almost the same compression ratio as before, but with a noticeable speed increase (+20-30%). This modification makes `btlazy2` more competitive.
A new round of paramgrill will be necessary to determine which levels are impacted and could adopt the new strategy.
fuzz tests artifacts
new code supposed to be easier to auto-vectorize
the chain of unsorted candidates could grow beyond lowLimit.
The unsorted_mark is handled like any index, which fails after a rescale.
we want the dictionary table to be fully sorted, not just lazily filled. Dictionary loading is a bit more intensive, but it saves cpu cycles for match search during compression.
since no compiler attempts to auto-vectorize it.
now selected for levels 13, 14 and 15. Also: dropped the requirement for a monotonic memory budget increase across compression levels, which was previously required so that ZSTD_estimateCCtxSize() could guarantee that a memory budget for level L is large enough for any level <= L. This condition is now ensured at run time inside ZSTD_estimateCCtxSize().
It used to stop on reaching extDict, for simplification. As a consequence, there was a small loss of compression each time the round buffer restarted from the beginning. It's not a large difference though, just a few hundred bytes on silesia. This patch fixes it.
terrelln
left a comment
I haven't looked at zstd_lazy.c yet.
lib/compress/zstd_compress.c
Outdated
        ZSTD_reduceTable(zc->hashTable, hSize, reducerValue);
    }

    if (zc->appliedParams.cParams.strategy != ZSTD_btlazy2) {
Should this be deleted? It looks like for any strategy that is neither ZSTD_btlazy2 nor ZSTD_fast, the chain table will be reduced twice.
You are right, this is messed up.
As a mitigation, I suspect that the second reduce is a no-op, since it will discover it can't reduce indexes much further.
Still, it's useless cpu cycles spent.
lib/compress/zstd_compress.c
Outdated
        ZSTD_reduceTable(zc->chainTable, chainSize, reducerValue); }
    if (zc->appliedParams.cParams.strategy != ZSTD_fast) {
        U32 const chainSize = (U32)1 << zc->appliedParams.cParams.chainLog;
        if (zc->appliedParams.cParams.strategy != ZSTD_btlazy2)
Should this be == ZSTD_btlazy2?
Yes, indeed, it should.
And now I'm wondering why the associated test has not crashed ...
OK, both issues must be fixed.
Though indeed, it's not as impactful as I initially feared.
The situation was worse when ZSTD_DUBT_UNSORTED_MARK was defined as 0xFFFFFFFF: in that case, after a reduce, the resulting index pointed nowhere.
Now, since it's defined as 1, a reduce operation squashes it to 0, which means "no further candidate". The impact is much smaller, and I could not create a corruption or crash with it. But that could just be due to a lack of samples. ZSTD_preserveUnsortedMark() simply ensures there is no unforeseen side effect after a reduce.
following suggestions from @terrelln. Also added some comments to present logic behind ZSTD_preserveUnsortedMark().
a pointer calculation was wrong in a corner case
terrelln
left a comment
If I understand the delayed tree correctly, it works like this:
- Update like a hash chain, using the left pointer, and set the right pointer to `ZSTD_DUBT_UNSORTED_MARK`. This puts unsorted positions before the rest of the sorted tree.
- When inserting a position, we first fix up the tree of `matchIndex`. To do that we traverse the unsorted chain, storing "parent pointers" in the right positions where `ZSTD_DUBT_UNSORTED_MARK` was. Then we sort the chain in reverse order, so that `ZSTD_insertDUBT1()` doesn't see any of the unsorted positions.
- Then we search for the largest match and add `current` to the newly fixed up tree.
    match = base + matchIndex;
    if ( (!extDict)
      || (matchIndex+matchLength >= dictLimit)  /* both in current segment*/
      || (current < dictLimit) /* both in extDict */) {
Could matchLength potentially be cut at the end of the extDict in this case, even though the match really extends further?
Yes.
This is a simplification. It avoids creating some new counting logic, which would be quite complex to handle due to buffer changes, while the expected improvement is fairly limited (negligible).
    /** ZSTD_insertBt1() : add one or multiple positions to tree.
     *  ip : assumed <= iend-8 .
     * @return : nb of positions added */
    static U32 ZSTD_insertBt1(ZSTD_CCtx* zc,
Just to make sure, this is just copied from the old ZSTD_lazy.c with no other changes, right?
Yes. Since ZSTD_insertBt1() is no longer provided by zstd_lazy.c, it needs to be implemented directly within zstd_opt.c.
    /* batch sort stacked candidates */
    matchIndex = previousCandidate;
    while (matchIndex) {  /* will end on matchIndex == 0 */
        U32* const nextCandidateIdxPtr = bt + 2*(matchIndex&btMask) + 1;
Any reason to have this variable instead of just *(bt + 2*(matchIndex&btMask) + 1)?
Not really. It was just easier to think about it this way.
It can be coalesced into `nextCandidateIdx` if that matters.
Yes, this is a correct description.
lib/compress/zstd_lazy.c
Outdated
    }

    static size_t ZSTD_insertBtAndFindBestMatch (
Should this also be renamed to DUBT?
A new paramgrill round determined that the new strategy could also be applied to levels 13 and 14, so the compression level table has been adapted accordingly.

Note that this patch will conflict with #964, since they work on the same files. #964's usefulness is less clear though, since it merely moves the speed/ratio trade-off around, whereas this patch produces a net gain, albeit on a limited scope (`btlazy2`).