Skip to content

Add and use Liquid::C::BlockBody for the block body#58

Merged
dylanahsmith merged 5 commits into
masterfrom
c-block-body
Sep 30, 2020
Merged

Add and use Liquid::C::BlockBody for the block body#58
dylanahsmith merged 5 commits into
masterfrom
c-block-body

Conversation

@dylanahsmith
Copy link
Copy Markdown
Contributor

@dylanahsmith dylanahsmith commented Sep 8, 2020

Problem

Currently liquid-c is mostly used to speed up parsing (evaluating expressions is the exception). However, there is a big opportunity to reduce the overhead of both parsing and rendering by moving to a C representation for block bodies, since we can represent it as a set of instructions that can be efficiently executed directly from C rather than indirectly through another VM and would also allow us to serialize and deserialize it to avoid parsing overhead for pre-compiled templates. The block body is the ideal starting place, since it can encapsulate everything it contains, allowing builtin functionality to be compiled directly into the block bodies instructions.

@macournoyer created a proof of concept liquid VM as part of hack days, which showed significant potential improvements from a liquid VM. Even after modifying it to use a Liquid::Context rather than a simple hash for the context, it was still showing a 4.4x improvement for deserializing and rendering the liquid VM code compared to parsing and rendering liquid using liquid-c. Loading a marshal dump of the Liquid::Template instead of parsing liquid only made it 4.0x slower than liquid VM code, which seems to indicate that we should serialize to a format that can (at least mostly) be executed without deserialization.

Solution

This pull request depends on Shopify/liquid#1289 to provide Block#new_body and Document#new_body to override in order to return a Liquid::C::BlockBody when liquid-c is enabled. These two methods also have access to the parse context, so we still don't need to support profiling for liquid-c's block body.

The code implementing the Liquid::BlockBody#parse override was refactored to implement Liquid::C::BlockBody#parse, so we can build the Liquid::C::BlockBody directly without the overhead of creating a Liquid::BlockBody#nodelist array. This also allows us to avoid copying the raw template strings, which was previously needed to create individual ruby strings, since we can instead just write the output directly from slices of a shared string that contains all the raw template strings.

To keep the scope of this PR more minimal, I haven't included block body serialization and have not added compilation of variables or tags. So the VM currently only has 3 instructions: OP_LEAVE, OP_WRITE_RAW and OP_WRITE_NODE.

The VM keeps track of both an instruction pointer and a constant pointer, which are incremented as instructions or constants are used, so the instructions will be easy to used directly from a serialized state and we won't have to worry about complexities of encoding a constant references (e.g. a byte constant index is too limiting and a multi-byte index can introduce either extra alignment or decoding concerns). We can always iterate on the VM design or even compile directly to machine code in the future.

To make this easier to adopt Liquid::C::BlockBody, I added Liquid::C::BlockBody#nodelist, although I will also be refactoring hot code paths using that in Shopify to not depend on it for performance reasons. As part of our storefront rewrite we have some code to debug differences in the rendered output, which is quite coupled to the liquid parse tree node ruby objects, but doesn't seem to be performance sensitive, so I added a :disable_liquid_c_nodes parse option so this can be used without disabling liquid-c globally (which wouldn't be thread-safe).

Benchmark

On master:

              parse:    146.973  (± 4.1%) i/s -      1.470k in  10.020119s
             render:    135.523  (± 3.0%) i/s -      1.365k in  10.083170s
     parse & render:     64.774  (± 3.1%) i/s -    648.000  in  10.010060s

On this PR's branch:

              parse:    145.177  (± 2.1%) i/s -      1.456k in  10.032703s
             render:    157.193  (± 5.1%) i/s -      1.575k in  10.048723s
     parse & render:     68.667  (± 2.9%) i/s -    690.000  in  10.060160s

Check List

Copy link
Copy Markdown
Contributor

@macournoyer macournoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work 👏

A few comments, no blockers, as long as we can change the bytecode later.

Comment thread ext/liquid_c/block.c
static void block_body_add_write_raw(block_body_t *body, const char *string, size_t size)
{
block_body_write_opcode(body, OP_WRITE_RAW);
size_t slice[2] = { (size_t)string, size };
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using slices instead of copying strings is an awesome idea!

But I think storing the address of the pointer will prevent body->constants from being serializable. Could you store the starting index instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will already be constants that we won't use directly from the serialized format, such as uncompiled tags or expression constants (e.g. strings, big integers, constant ranges). In the serialized format, there will end up being a split between these constants to reduce deserialization overhead, so that at least the data for constants like raw strings can be used without deserializing.

I don't think it would be that much overhead for resolving the pointers to raw strings at deserialization, but as mentioned in #58 (comment), I do plan on revisiting the decision of using a moving constant pointer.

Comment thread ext/liquid_c/vm.c
return;
case OP_WRITE_RAW:
{
const char *text = (const char *)*const_ptr++;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice and simple way to avoid index overflow 👍

However, it will break once we introduce loops (for, etc.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do plan on revisiting the moving constant pointer when adding jumps. It could be implemented by have the jump target be a pair of instruction and constant pointer offsets, so there is a trade-off between jump overhead and constant lookup overhead.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. First time seeing this approach...

But when you'll add a pointer offset, I think it will be equivalent to the traditional constants[i], and you'll end up needing multi-byte index too.

Comment thread lib/liquid/c.rb Outdated
alias_method :ruby_new_body, :new_body

def new_body
if parse_context.disable_liquid_c_nodes || parse_context[:profile]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does profiling disables this only because instrumentation is missing from C::BlockBody at the moment?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Liquid's template render profiling support already wasn't supported in liquid-c. I just chose to keep it that way. Note that this is different from stackprof, which we typically use for profiling at Shopify. I would like to revisit profiling in the future, but not in a way that ties us to calling a render_node ruby method that requires a node object to be passed to it.

@dylanahsmith dylanahsmith changed the base branch from master to avoid-expression-match-on-dynamic-variable September 25, 2020 22:26
Base automatically changed from avoid-expression-match-on-dynamic-variable to master September 28, 2020 15:20
@felix-d
Copy link
Copy Markdown

felix-d commented Sep 28, 2020

As part of our storefront rewrite we have some code to debug differences in the rendered output, which is quite coupled to the liquid parse tree node ruby objects, but doesn't seem to be performance sensitive, so I added a :disable_liquid_c_nodes parse option so this can be used without disabling liquid-c globally (which wouldn't be thread-safe).

We don't use liquid tracking anymore. It can be 🔥

@dylanahsmith
Copy link
Copy Markdown
Contributor Author

Oh right, I removed the use of :disable_liquid_c_nodes in the storefront renderer PR, but didn't completely remove the option here. I've done that now.

Copy link
Copy Markdown
Contributor

@pushrax pushrax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, this was easy to follow and looks quite efficient. Also just had some minor non-blocking comments.

Comment thread ext/liquid_c/block.c Outdated
Comment thread ext/liquid_c/block.c
VALUE obj = TypedData_Make_Struct(klass, block_body_t, &block_body_data_type, body);
body->instructions = c_buffer_allocate(8);
block_body_add_leave(body);
body->constants = c_buffer_allocate(8 * sizeof(VALUE));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you choose 8 here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was chosen fairly arbitrarily based on the current situation where we could end up with a lot of small block bodies for things like if tags. I tried to keep it as a multiple of 2, since that seemed like it would make it more likely that the memory allocator will re-use buffer allocations for new or expanding buffers and because it could use whole pages for larger memory allocations. However, I haven't tested any of these assumptions yet.

We will probably want to revisit the default Liquid::Document's block body as we start compiling more functionality into VM instructions, since then we will be able to compile builtin block tags into their parent block body.

Comment thread ext/liquid_c/block.c Outdated
Comment thread ext/liquid_c/block.c
tags = rb_funcall(self, intern_registered_tags, 0);

VALUE tag_class = rb_funcall(tags, intern_square_brackets, 1, tag_name);
VALUE tag_class = rb_funcall(tag_registry, intern_square_brackets, 1, tag_name);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you already updated Core to avoid this being a breaking change for SectionTemplate 👍

In theory external Liquid users could rely on this, but it seems pretty unlikely and we just need to note it in the changelog.

We should probably delete BlockBody#registered_tags entirely from Liquid.

Comment thread ext/liquid_c/block.c
if (*size_ptr) {
*size_ptr = 0; // effectively a no-op
body->render_score--;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memmove also seems appropriate here. I guess when we serialize we can elide the extra operation there.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in #59 (comment)

I'm not concerned about further optimizing this edge case. Primarily these blank strings are handled in this way to minimize impact on code that isn't relying on this feature while keeping backwards compatibility,

Specifically, once we have jump instructions for control flow tags, we would also be forced to update all the affected jump offsets. That will be easier to introduce without having to deal with jump offset relocation, but afterwards an extra pass could be added that is more clearly beneficial, since we could replace pessimistically large forward jump instructions (e.g. a decoded 24-bit offset or aligned 32-bit offset) with smaller jump instructions (e.g. a byte offset) where possible.

Comment thread ext/liquid_c/block.c
VALUE nodelist = rb_attr_get(self, intern_ivar_nodelist);
if (nodelist != Qnil)
return nodelist;
nodelist = rb_ary_new_capa(body->instructions.size / sizeof(VALUE));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is just the capacity it doesn't matter, but it looks like this is including an extra index for the OP_LEAVE.

Comment thread ext/liquid_c/vm.c Outdated
Comment thread ext/liquid_c/vm.c
VALUE interrupts = rb_ivar_get(context, id_ivar_interrupts);
Check_Type(interrupts, T_ARRAY);

while (true) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to do some tracking for how many bytes of the instruction buffer have been consumed, to gracefully handle a corrupted buffer with a missing OP_LEAVE? Perhaps by computing last_ip at the start and doing a single comparison per iteration.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to not silently try to handle a corrupt buffer, since that seems likely to lead to more problems. However, I could add an assertion to help with debugging a problem like that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea would be to do something like an rb_bug in that case, yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants