
Delayed update Binary Tree for Lazy strategy #970

Merged
Cyan4973 merged 19 commits into dev from dubtlazy
Jan 12, 2018

Conversation

@Cyan4973 (Author):

This is a pretty nice speed win.

The new strategy consists of stacking new candidates as if the table were a hash chain.
Then, only if the chain actually needs to be consulted, the candidates are batch-updated into the tree,
before the match search itself starts.
This is expected to be beneficial when skipping positions,
which happens a lot with the lazy strategy.

The baseline performance for btlazy2 on my laptop is:

15#calgary.tar       :   3265536 ->    955985 (3.416),  7.06 MB/s , 618.0 MB/s
15#enwik7            :  10000000 ->   3067341 (3.260),  4.65 MB/s , 521.2 MB/s
15#silesia.tar       : 211984896 ->  58095131 (3.649),  6.20 MB/s , 682.4 MB/s

(only level 15 remains for btlazy2, as this strategy is squeezed between lazy2 and btopt)

After this patch, and keeping all parameters identical,
speed is increased by a pretty good margin (+30-50%),
but compression ratio suffers a bit:

15#calgary.tar       :   3265536 ->    958060 (3.408),  9.12 MB/s , 621.1 MB/s
15#enwik7            :  10000000 ->   3078318 (3.249),  6.37 MB/s , 525.1 MB/s
15#silesia.tar       : 211984896 ->  58444111 (3.627),  9.89 MB/s , 680.4 MB/s

That's because the patch keeps `1<<searchLog` as the maximum number of unsorted candidates to update.
For a hash chain, this represents the total number of candidates in the chain,
while for a binary tree, it represents the maximum depth of searches.
Keep in mind that many candidates won't even be visited in the binary tree,
since they are filtered out by the sort operation.

As a consequence, in the new implementation,
the effective depth of the binary tree is substantially shorter.

To compensate, it's enough to increase the searchLog value.
Here is the result after adding just +1 to searchLog (the level 15 setting in this patch):

15#calgary.tar       :   3265536 ->    956311 (3.415),  8.32 MB/s , 611.4 MB/s
15#enwik7            :  10000000 ->   3067655 (3.260),  5.43 MB/s , 535.5 MB/s
15#silesia.tar       : 211984896 ->  58113144 (3.648),  8.35 MB/s , 679.3 MB/s

i.e., almost the same compression ratio as before,
but with a noticeable speed increase (+20-30%).

This modification makes btlazy2 more competitive.
A new paramgrill round determined it could also be applied to levels 13 and 14, so the compression level table has been adapted accordingly.

Note that this patch will conflict with #964, since they touch the same files.
#964's usefulness is less clear though, since it merely moves the speed/ratio trade-off around,
whereas this patch produces a net gain, albeit on a limited scope (btlazy2).

Commit notes:

  - fuzz tests artifacts
  - new code supposed to be easier to auto-vectorize
  - the chain of unsorted candidates could grow beyond lowLimit; the unsorted_mark is handled like any index, which fails after a rescale
  - we want the dictionary table to be fully sorted, not just lazily filled; dictionary loading is a bit more intensive, but it saves cpu cycles for match search during compression
  - since no compiler attempts to auto-vectorize it
  - now selected for levels 13, 14 and 15

Also: dropped the requirement for a monotonic memory budget increase across compression levels,
which was required for ZSTD_estimateCCtxSize()
in order to ensure that the memory budget for level L is large enough for any level <= L.
This condition is now ensured at run time inside ZSTD_estimateCCtxSize().
It used to stop on reaching extDict, for simplification.
As a consequence, there was a small loss of performance each time the round buffer restarted from the beginning.
It's not a large difference though, just several hundred bytes on silesia.
This patch fixes it.
@terrelln left a comment:

I haven't looked at zstd_lazy.c yet.

ZSTD_reduceTable(zc->hashTable, hSize, reducerValue);
}

if (zc->appliedParams.cParams.strategy != ZSTD_btlazy2) {
@terrelln:

Should this be deleted? It looks like for a strategy that isn't ZSTD_btlazy2 or ZSTD_fast, the chain table will be reduced twice.

@Cyan4973 (Author):

You are right, this is messed up.
As a mitigation, I suspect the second reduce is a no-op, since it will discover it can't reduce indexes much further.
Still, it wastes cpu cycles.

ZSTD_reduceTable(zc->chainTable, chainSize, reducerValue); }
if (zc->appliedParams.cParams.strategy != ZSTD_fast) {
U32 const chainSize = (U32)1 << zc->appliedParams.cParams.chainLog;
if (zc->appliedParams.cParams.strategy != ZSTD_btlazy2)
@terrelln:

Should this be == ZSTD_btlazy2?

@Cyan4973 (Author):

Yes, indeed, it should.
And now I'm wondering why the associated test has not crashed ...

@Cyan4973 (Author):

OK, both issues must be fixed.
Though indeed, it's not as impactful as I initially feared.

The situation was worse when ZSTD_DUBT_UNSORTED_MARK was defined as 0xFFFFFFFF, in which case, after a reduce, the resulting index pointed nowhere.

Now that it's defined as 1, a reduce operation squashes it to 0, which means "no further candidate". The impact is much smaller, and I could not create a corruption or crash issue with it. But that could just be due to a lack of samples. ZSTD_preserveUnsortedMark() just ensures there is no unforeseen side effect after a reduce.

following suggestions from @terrelln.
Also added some comments to present the logic behind ZSTD_preserveUnsortedMark().
a pointer calculation was wrong in a corner case
@terrelln left a comment:

If I understand the delayed tree correctly, it works like this:

  1. Update like a hash chain, using the left pointer, and set the right pointer to ZSTD_DUBT_UNSORTED_MARK. This puts unsorted positions before the rest of the sorted tree.
  2. When inserting a position, we first fix up the tree of matchIndex. To do that we traverse the unsorted chain, storing "parent pointers" in the right positions where ZSTD_DUBT_UNSORTED_MARK was. Then we sort the chain in reverse order, so that ZSTD_insertDUBT1() doesn't see any of the unsorted positions.
  3. Then we search for the largest match and add current to the newly fixed up tree.

match = base + matchIndex;
if ( (!extDict)
|| (matchIndex+matchLength >= dictLimit) /* both in current segment*/
|| (current < dictLimit) /* both in extDict */) {
@terrelln:

Could matchLength be potentially cut to the end of the extDict in this case, where it really extends further?

@Cyan4973 (Author):

Yes.
This is a simplification. It avoids creating new counting logic, which would be quite complex to handle due to buffer changes, while the expected improvement is fairly limited (negligible).

/** ZSTD_insertBt1() : add one or multiple positions to tree.
* ip : assumed <= iend-8 .
* @return : nb of positions added */
static U32 ZSTD_insertBt1(ZSTD_CCtx* zc,
@terrelln:

Just to make sure, this is just copied from the old ZSTD_lazy.c with no other changes, right?

@Cyan4973 (Author):

Yes. Since ZSTD_insertBt1() is no longer provided by zstd_lazy.c, it needs to be implemented directly within zstd_opt.c.

/* batch sort stacked candidates */
matchIndex = previousCandidate;
while (matchIndex) { /* will end on matchIndex == 0 */
U32* const nextCandidateIdxPtr = bt + 2*(matchIndex&btMask) + 1;
@terrelln:

Any reason to have this variable instead of just *(bt + 2*(matchIndex&btMask) + 1)?

@Cyan4973 (Author):

Not really. It was just easier to think of it this way.
It can be coalesced into nextCandidateIdx if that matters.

@Cyan4973 (Author):

Yes, this is a correct description.

@terrelln left a comment:

Looks good to me!

}


static size_t ZSTD_insertBtAndFindBestMatch (
@terrelln:

Should this also be renamed to DUBT?

@Cyan4973 merged commit 9d65a5c into dev on Jan 12, 2018
@Cyan4973 deleted the dubtlazy branch on February 11, 2018