Compile variable nodes into the Liquid::C::BlockBody VM code #59

Merged: dylanahsmith merged 5 commits into master from vm-variable on Oct 7, 2020

Contributor

dylanahsmith commented Sep 16, 2020

Depends on #58 and Shopify/liquid#1294

This is a continuation of #58 that compiles strict-parseable Liquid variables directly into the block body VM code for performance.

For simplicity, variables and expressions in tags do not yet leverage this VM code and the variable VM code still relies on the existing expression evaluation code (e.g. context_evaluate of Liquid::VariableLookup and Liquid::RangeLookup). These will be optimized in following PR(s).

All the commits except the (currently) last one (Compile variable nodes into the Liquid::C::BlockBody VM code) are refactors:

  • I've moved the VM instructions and constants out of block_body_t into a vm_assembler_t struct, since we will want to embed the VM code in a future Liquid::C::Variable. In the future, I intend to wrap the assembler in a Liquid::C::Assembler as a safe API for compiling tags from ruby code (edit: extracted to #69, "Extract a vm_assembler_t struct from block_body_t")
  • I changed the c_buffer struct to store pointers to the end of the data and the end of the capacity, instead of storing the size and capacity directly. This is because we are mostly getting and updating the data end pointer, which is especially true when using this for the VM stack, where we reserve space for the block and don't have to check the capacity for each write/push. (edit: extracted to #70, "Refactor c_buffer to make it suitable to re-use for a stack")
  • I've also added support to c_buffer to be initialized with size 0 (without an allocation), which is the one case where we can't just double the capacity until we have the requested extra capacity. (edit: extracted to #70, "Refactor c_buffer to make it suitable to re-use for a stack")
  • I introduced a Liquid::C::VM that is attached to the Liquid::Context through a hidden instance variable, so the stack can be shared with nested blocks
  • Leveraging Shopify/liquid#1294 ("Refactor to support liquid-c VM compilation of variables") to only run the liquid integration tests, which avoids having to be fully compatible with the deprecated Liquid::BlockBody#nodelist
  • I've added a liquid_vm_next_instruction function to reduce duplication for code iterating instructions by using it for both Liquid::C::BlockBody#remove_blank_strings and Liquid::C::BlockBody#nodelist. This way I can further leverage it for variable error handling. (edit: extracted to #72, "Add liquid_vm_next_instruction to reduce duplication")
  • Drop support for ruby 2.4, which is no longer supported upstream
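
For readers following the c_buffer bullets above, here is a minimal sketch of the end-pointer layout and the zero-size special case. It is not the liquid-c code: the sketch_buffer_* names and fields are made up for illustration, and plain realloc stands in for whatever allocator the extension uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: end pointers instead of size/capacity integers. */
typedef struct {
    uint8_t *data;
    uint8_t *data_end;     /* one past the last byte written */
    uint8_t *capacity_end; /* one past the last byte allocated */
} sketch_buffer_t;

/* A zero-size buffer needs no allocation at all. */
static sketch_buffer_t sketch_buffer_init(void)
{
    sketch_buffer_t buf = { NULL, NULL, NULL };
    return buf;
}

static size_t sketch_buffer_size(const sketch_buffer_t *buf)
{
    return buf->data ? (size_t)(buf->data_end - buf->data) : 0;
}

static size_t sketch_buffer_capacity(const sketch_buffer_t *buf)
{
    return buf->data ? (size_t)(buf->capacity_end - buf->data) : 0;
}

/* Ensure room for `extra` more bytes, doubling the capacity as needed. */
static void sketch_buffer_reserve(sketch_buffer_t *buf, size_t extra)
{
    size_t size = sketch_buffer_size(buf);
    size_t capacity = sketch_buffer_capacity(buf);
    size_t required = size + extra;
    if (required <= capacity)
        return;

    if (capacity == 0) {
        /* Doubling 0 never grows, so start at the requested size. */
        capacity = required;
    } else {
        do {
            capacity *= 2;
        } while (capacity < required);
    }

    uint8_t *data = realloc(buf->data, capacity);
    if (!data)
        abort();
    buf->data = data;
    buf->data_end = data + size;
    buf->capacity_end = data + capacity;
}

/* Append one byte; the common path only reads and bumps data_end. */
static void sketch_buffer_push(sketch_buffer_t *buf, uint8_t byte)
{
    if (buf->data_end == buf->capacity_end)
        sketch_buffer_reserve(buf, 1);
    assert(sketch_buffer_size(buf) < sketch_buffer_capacity(buf));
    *buf->data_end++ = byte;
}
```

In this sketch the hot path in sketch_buffer_push compares and bumps a single pointer, which is the access pattern the second bullet describes.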

The VM stack is one of the significant additions from this PR. I chose to re-use the stack for nested blocks, which means that it needs to be expandable, hence the use of c_buffer_t for its memory allocation. We can't globally determine the maximum stack size, since this stack could even be re-used in a template partial, but this PR does calculate the maximum stack size needed for a block body so that we can reserve stack space at the start of block body render and not have to check for sufficient capacity throughout the block body render.
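
The reserve-once idea can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: sketch_stack_t and its functions are hypothetical, long stands in for Ruby's VALUE, and the maximum depth is assumed to have been computed for the block body at parse time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical expandable stack shared by nested block bodies. */
typedef struct {
    long *data;
    long *data_end;     /* current top of stack */
    long *capacity_end; /* end of the allocation */
} sketch_stack_t;

/* Called once at the start of a block body render with that body's
 * precomputed maximum stack depth. */
static void sketch_stack_reserve(sketch_stack_t *stack, size_t max_depth)
{
    size_t used = stack->data ? (size_t)(stack->data_end - stack->data) : 0;
    size_t capacity = stack->data ? (size_t)(stack->capacity_end - stack->data) : 0;
    size_t required = used + max_depth;
    if (required <= capacity)
        return;

    long *data = realloc(stack->data, required * sizeof(long));
    if (!data)
        abort();
    stack->data = data;
    stack->data_end = data + used;
    stack->capacity_end = data + required;
}

/* No capacity branch here: the reserve call above already made room.
 * The assert only documents the invariant in debug builds. */
static void sketch_stack_push(sketch_stack_t *stack, long value)
{
    assert(stack->data_end < stack->capacity_end);
    *stack->data_end++ = value;
}

static long sketch_stack_pop(sketch_stack_t *stack)
{
    assert(stack->data_end > stack->data);
    return *--stack->data_end;
}
```

A nested block would call sketch_stack_reserve again with its own maximum depth, which grows the shared stack from the current depth rather than from zero.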

Another design choice I made for this PR is to do error handling for the whole block body render, rather than for each variable render. This way we can reduce the state saving cost of rb_rescue from liquid code that doesn't encounter variable render errors. In order to recover from the exception, we just restore the stack size to what it was at the start of the block body render and iterate the instructions to jump just past the variable write.
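
To make the recovery step concrete, here is a toy sketch. Everything in it is an assumption for illustration: setjmp/longjmp stands in for rb_rescue and Ruby exceptions, the SKETCH_OP_* opcodes are invented, and the output is just a running sum. Only the shape is meant to match the description above: one handler for the whole body, restore the stack size, then walk the instructions to just past the variable write.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch only. */
enum {
    SKETCH_OP_LEAVE,     /* end of block body */
    SKETCH_OP_PUSH,      /* one byte operand: the value to push */
    SKETCH_OP_DIV,       /* pops b then a, pushes a / b, fails when b == 0 */
    SKETCH_OP_POP_WRITE  /* pops a value and "writes" it to the output */
};

typedef struct {
    long stack[16];
    size_t stack_size;
    long output;        /* stand-in for the output buffer: sum of written values */
    size_t errors;
    const uint8_t *ip;  /* kept in the VM so recovery can see where it stopped */
    jmp_buf on_error;
} sketch_vm_t;

static void sketch_vm_init(sketch_vm_t *vm)
{
    vm->stack_size = 0;
    vm->output = 0;
    vm->errors = 0;
    vm->ip = NULL;
}

static void sketch_run_until_error(sketch_vm_t *vm)
{
    for (;;) {
        switch (*vm->ip++) {
        case SKETCH_OP_LEAVE:
            return;
        case SKETCH_OP_PUSH:
            assert(vm->stack_size < 16);
            vm->stack[vm->stack_size++] = *vm->ip++;
            break;
        case SKETCH_OP_DIV: {
            long b = vm->stack[--vm->stack_size];
            long a = vm->stack[--vm->stack_size];
            if (b == 0)
                longjmp(vm->on_error, 1); /* stand-in for a raised exception */
            vm->stack[vm->stack_size++] = a / b;
            break;
        }
        case SKETCH_OP_POP_WRITE:
            vm->output += vm->stack[--vm->stack_size];
            break;
        }
    }
}

static void sketch_render(sketch_vm_t *vm, const uint8_t *code)
{
    const size_t base = vm->stack_size; /* stack size at the start of the body */
    vm->ip = code;
    /* The handler is armed once, and re-armed only after an actual error. */
    while (setjmp(vm->on_error) != 0) {
        vm->errors++;
        vm->stack_size = base;
        /* Step instruction by instruction to just past the variable write. */
        while (*vm->ip != SKETCH_OP_POP_WRITE)
            vm->ip += (*vm->ip == SKETCH_OP_PUSH) ? 2 : 1;
        vm->ip++;
    }
    sketch_run_until_error(vm);
}
```

A body that never fails pays for a single setjmp here, which mirrors the motivation for handling errors per block body rather than per variable. The sketch assumes well-formed code in which every variable ends with a write instruction.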

Since variable lookup and expression evaluation still happen as they did before, the most significant change for performance is the filter invocation, which can happen without allocating an arguments array by leveraging the VM stack and calling the filter function directly using rb_funcallv. The filter invokable? check is still done at runtime (we would need to pass filters as a parse option to do it at parse time), but it is still optimized by doing the check directly from C using a hash with symbol keys.
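
The no-allocation call can be pictured with a small stand-alone sketch. It does not use the Ruby C API: a plain function pointer taking (argc, argv) stands in for rb_funcallv, long stands in for VALUE, and the sketch_* names are hypothetical. The point is only that the input and arguments already sit contiguously on the stack, so a pointer into the stack can serve as the argument array.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a filter: receives its input plus argc extra arguments. */
typedef long (*sketch_filter_fn)(long input, size_t argc, const long *argv);

typedef struct {
    long data[32];
    size_t size;
} sketch_arg_stack_t;

static void sketch_push(sketch_arg_stack_t *stack, long value)
{
    assert(stack->size < 32);
    stack->data[stack->size++] = value;
}

/* Pop n values but hand back a pointer to where they still live.
 * The slots stay valid only until the next push. */
static const long *sketch_pop_n_in_place(sketch_arg_stack_t *stack, size_t n)
{
    assert(n <= stack->size);
    stack->size -= n;
    return &stack->data[stack->size];
}

/* Expects the input followed by argc arguments on top of the stack. */
static void sketch_invoke_filter(sketch_arg_stack_t *stack,
                                 sketch_filter_fn filter, size_t argc)
{
    const long *slots = sketch_pop_n_in_place(stack, argc + 1);
    long result = filter(slots[0], argc, slots + 1); /* no argument array allocated */
    sketch_push(stack, result);
}

/* Example filter for the sketch: input + first argument. */
static long sketch_filter_plus(long input, size_t argc, const long *argv)
{
    assert(argc == 1);
    return input + argv[0];
}
```

Pushing 40 and 2 and then invoking sketch_filter_plus with argc 1 leaves a single value, 42, on the stack.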

Benchmark

Before this PR (on the #58 branch)

              parse:    146.972  (± 2.0%) i/s -      1.470k in  10.007250s
             render:    162.266  (± 2.5%) i/s -      1.635k in  10.081595s
     parse & render:     70.513  (± 2.8%) i/s -    708.000  in  10.047458s

after

              parse:    163.206  (± 1.8%) i/s -      1.632k in  10.002316s
             render:    172.302  (± 4.1%) i/s -      1.734k in  10.080641s
     parse & render:     78.371  (± 1.3%) i/s -    784.000  in  10.006170s

Contributor

macournoyer left a comment

All comments are refactoring & optimization ideas. This is shaping up nicely 👏 Performance improvements are adding up fast!

Great decision on going the stackless route (re-using the stack for nested blocks) 👍

However, I'm still doubtful about the constant pointer approach.

Comment thread ext/liquid_c/block.c
if (*ip == OP_WRITE_RAW) {
    size_t *size_ptr = &const_ptr[1];
    if (*size_ptr) {
        *size_ptr = 0; // effectively a no-op
Contributor

Could you instead introduce an OP_NOP instruction, and replace the whole instruction?

Contributor Author

Yes, although it would then have unused constant arguments. I imagine that an OP_NOP would be faster, but I'm not concerned about further optimizing this edge case. Primarily these blank strings are handled in this way to minimize impact on code that isn't relying on this feature while keeping backwards compatibility, so this seemed simpler than a no-op instruction.

Contributor

> keeping backwards compatibility

For #nodelist? I guess this discussion will answer #58 (comment) too

Contributor Author

Regarding backwards compatibility, I was referring to removing/ignoring blank strings from blank blocks. I think our only remaining dependency on nodelist is for tags.

Comment thread ext/liquid_c/vm.c
}
}

void liquid_vm_next_instruction(const uint8_t **ip_ptr, const size_t **const_ptr_ptr)
Contributor

This got me confused at first. I think it should be renamed to liquid_vm_next_instruction_and_const.

But I still don't get the advantage of this approach over passing the constant index via an operand.

And I think it is painting us in a corner for many optimizations (block_body_remove_blank_strings is an example).

Contributor Author

These constants are pointer-sized operands to the instruction, similar to the byte-sized ones.

In hindsight, the instruction pointer should have been a struct with a pointer to both, to provide a bit of a zero-cost abstraction around them. We are also missing an abstraction for accessing those operands from the block body code. However, this PR is at least a step in the right direction by moving the vm assembler out of block.c.

However, all of this is still internal to liquid-c, so we can definitely change the VM instructions.

Contributor

It's not really an operand if it's not in the bytecode, in an instruction. So it's a bit like having two sets of instructions. The problem we'll have is keeping them in sync all the time if we modify the bytecode.

Not sure what you mean RE the struct with a pointer.

But all good, you're right, step in the right direction!

Contributor

I think the idea is that rather than using a separate ip_ptr and const_ptr_ptr the code would use an ip struct that has equivalent ip_ptr and const_ptr_ptr members. When calling any instruction-related functions you would always pass the ip struct around.

In the end the semantics of "operand" are somewhat arbitrary. At the moment the constants buffer just has pointers/small bits of metadata in it, and these could be interleaved into the instructions buffer and everything would be quite equivalent, with slightly different performance for some tasks.

@dylanahsmith what are the reasons you ended up going with the separate constants buffer?

Contributor Author

Yes, that is what I meant for the instruction pointer struct.

I want the constants separate, because they will require serialization and deserialization, unlike the byte code which is already serialized and doesn't require deserialization for execution. Ruby's in-memory VM code interleaves the pointers immediately into the instruction, which ended up leading to pointer size instructions (to avoid alignment issues) and requires RubyVM::InstructionSequence.load_from_binary to deserialize the instructions.

Of course, we can also have a constant table, which is ideal for constants that get re-used (e.g. filter names, variable names and object keys are common to re-use) and allows only those ruby constants to be deserialized. I want to switch to that after adding serialization support.
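
As a rough picture of the "instruction pointer struct" discussed in this thread, the sketch below keeps the byte-code cursor and the constants cursor in one value and advances both from a per-opcode table. The opcode set, the sketch_* names and the table are assumptions for illustration (the operand and constant counts loosely follow the ones mentioned in this review), not the layout liquid-c actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of opcodes for this sketch. */
enum sketch_opcode {
    SKETCH_OP_LEAVE,
    SKETCH_OP_WRITE_RAW,  /* two constants: text pointer and size */
    SKETCH_OP_PUSH_CONST, /* one constant */
    SKETCH_OP_HASH_NEW,   /* one byte operand: number of pairs */
    SKETCH_OP_MAX
};

/* Both cursors travel together, so they cannot drift apart. */
typedef struct {
    const uint8_t *ip;       /* position in the byte code */
    const size_t *const_ptr; /* position in the separate constants buffer */
} sketch_ip_t;

typedef struct {
    uint8_t consts;   /* pointer-sized constants consumed */
    uint8_t operands; /* byte-sized operands following the opcode */
} sketch_instruction_def_t;

static const sketch_instruction_def_t sketch_defs[SKETCH_OP_MAX] = {
    [SKETCH_OP_LEAVE]      = { 0, 0 },
    [SKETCH_OP_WRITE_RAW]  = { 2, 0 },
    [SKETCH_OP_PUSH_CONST] = { 1, 0 },
    [SKETCH_OP_HASH_NEW]   = { 0, 1 },
};

/* Return the position of the instruction after the one at `pos`. */
static sketch_ip_t sketch_next_instruction(sketch_ip_t pos)
{
    assert(*pos.ip < SKETCH_OP_MAX);
    const sketch_instruction_def_t *def = &sketch_defs[*pos.ip];
    pos.ip += 1 + def->operands;
    pos.const_ptr += def->consts;
    return pos;
}
```

Passing the struct by value keeps call sites short, and a compiler can typically keep both pointers in registers, which is the sense in which such a wrapper could be close to zero-cost.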

Comment thread ext/liquid_c/vm.c

if (new_size > capacity) {
    do {
        capacity *= 2;
Contributor

Doubling seems excessive. Max Fixnum value can be stored in 10 bytes. Let's say the output buffer is 10KB, doubling for the sake of adding 10 bytes?

Maybe this could be rewritten to use rb_str_buf_cat_ascii, and let Ruby handle the resizing? Writing the fixnum to a char number_str[11] on the stack.

Or replace the whole thing w/ rb_str_buf_append(output, rb_fix2str(fixnum));?

Contributor Author

This is unlikely to be the last 10 bytes, so doubling is meant to amortize the cost of resizing the output string.

> Let's say the output buffer is 10KB, doubling for the sake of adding 10 bytes?
>
> Maybe this could be rewritten to use rb_str_buf_cat_ascii, and let Ruby handle the resizing?

You would actually have approximately the same behaviour. I created https://github.com/dylanahsmith/ruby-string-buffer a long time ago to make it easier to inspect the capacity.

irb> str = "1" * (10 * 1024);
irb> StringBuffer.capacity(str)
=> 10240
irb> str << "2";
irb> StringBuffer.capacity(str)
=> 20481

which actually allocates 1 extra byte, which keeps the allocation from being page aligned.

> Writing the fixnum to a char number_str[11] on the stack.

We would still have to resize the output buffer to copy it from the stack to the output buffer, so this approach avoids the copy step.

> Or replace the whole thing w/ rb_str_buf_append(output, rb_fix2str(fixnum));?

That would trade off performance for simplicity, since it would allocate a new string object to write the result to, as well as having to copy the result to the output buffer. The primary reason for writing write_fixnum was to avoid that string allocation overhead.
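
For illustration, writing the digits straight into the destination might look like the sketch below. It is not the PR's write_fixnum: the sketch_output_t buffer, its size/capacity fields and the initial capacity of 16 are assumptions, and a plain long stands in for a Ruby Fixnum. It reserves the exact length once (doubling the capacity to amortize growth) and then fills the digits backwards in place, so there is no temporary string and no copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable output buffer. */
typedef struct {
    char *data;
    size_t size;
    size_t capacity;
} sketch_output_t;

/* Make room for `extra` more bytes, doubling to amortize reallocations. */
static void sketch_output_reserve(sketch_output_t *out, size_t extra)
{
    assert(out->size <= out->capacity);
    size_t required = out->size + extra;
    if (required <= out->capacity)
        return;

    size_t capacity = out->capacity ? out->capacity : 16;
    while (capacity < required)
        capacity *= 2;

    char *data = realloc(out->data, capacity);
    if (!data)
        abort();
    out->data = data;
    out->capacity = capacity;
}

/* Append the decimal form of `number` directly into the output buffer. */
static void sketch_write_long(sketch_output_t *out, long number)
{
    /* Work on the magnitude as unsigned so the most negative value is safe. */
    unsigned long magnitude =
        number < 0 ? 0UL - (unsigned long)number : (unsigned long)number;

    size_t digits = 1;
    for (unsigned long rest = magnitude; rest >= 10; rest /= 10)
        digits++;
    size_t length = digits + (number < 0 ? 1 : 0);

    sketch_output_reserve(out, length);

    /* Fill from the last digit backwards, in place. */
    char *end = out->data + out->size + length;
    do {
        *--end = (char)('0' + magnitude % 10);
        magnitude /= 10;
    } while (magnitude != 0);
    if (number < 0)
        *--end = '-';

    out->size += length;
}
```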

Comment thread ext/liquid_c/vm.c
size_t hash_size = *ip++;
size_t num_keys_and_values = hash_size * 2;
VALUE hash = rb_hash_new();
VALUE *args_ptr = vm_stack_pop_n_use_in_place(vm, num_keys_and_values);
Contributor

This falls into place so nicely, wow! Everything about this block of code is awesome 👌

// since this could be called in the middle of parsing
const uint8_t *end_ip = code->instructions.data_end;
while (ip < end_ip) {
    switch (*ip++) {
Contributor

This type of loop is done in a few places, and could be refactored using a table storing info about each instruction:

typedef struct instruction_def_t {
    size_t consts;
    size_t operands;
} instruction_def_t;

instruction_def_t instructions_defs[OP_MAX] = {
  /* OP_LEAVE */ { .consts = 0, .operands = 0 },
  /* OP_WRITE_RAW */ { .consts = 2, .operands = 0 },
  /* OP_WRITE_NODE */ { .consts = 1, .operands = 0 },
  /* OP_POP_WRITE_VARIABLE */ { .consts = 0, .operands = 0 },
  /* OP_PUSH_CONST */ { .consts = 1, .operands = 0 },
  /* OP_HASH_NEW */ { .consts = 0, .operands = 1 },
  /* OP_FILTER */ { .consts = 0, .operands = 1 },
  /* OP_PUSH_EVAL_EXPR */ { .consts = 1, .operands = 0 },
  /* OP_RENDER_VARIABLE_RESCUE */ { .consts = 0, .operands = 0 },
};

Then you no longer have to review each of those loops each time an instruction is added or modified:

switch (*ip++) {
  // Special cases here
  case ...:
    break;
  default:
    // ip was already advanced past the opcode, so ip[-1] is the opcode byte
    const_ptr += instructions_defs[ip[-1]].consts;
    ip += instructions_defs[ip[-1]].operands;
}

Contributor Author

I introduced liquid_vm_next_instruction in this PR, which is now used in 3 places to reduce duplication. The only remaining two, vm_assembler_gc_mark and vm_render_until_error, have to handle all the opcodes. I think the only remaining duplication is between vm_assembler_gc_mark and liquid_vm_next_instruction, but they differ in that vm_assembler_gc_mark also needs to know which constants are ruby objects, so we would need to encode that into the instruction definition as well.

I was considering adding the table based approach to support the disassembler, so perhaps I could re-use it for other purposes if I don't switch to another approach for marking.

Contributor

Yeah just an idea. I've seen it used often in VMs. Here's the one in .NET Core: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/inc/opcode.def

Could be useful for building a debugger & disassembler later.

Contributor

pushrax left a comment

Partially done this review - will finish tomorrow.

Comment thread ext/liquid_c/block.c
  if (nodelist != Qnil)
      return nodelist;
- nodelist = rb_ary_new_capa(body->instructions.size / sizeof(VALUE));
+ nodelist = rb_ary_new_capa(body->render_score);
Contributor

Ah this addresses #58 (comment) which you can now ignore.

Comment thread ext/liquid_c/block.c
- const char *text = (const char *)*const_ptr++;
- size_t size = *const_ptr++;
+ const char *text = (const char *)const_ptr[0];
+ size_t size = const_ptr[1];
Contributor

This is definitely a bit faster to read.

Comment thread ext/liquid_c/c_buffer.c
buffer->data = xrealloc(buffer->data, new_capacity);
buffer->capacity = new_capacity;
capacity *= 2;
} while (capacity < required_capacity);
Contributor

There's a non-branching bit twiddling trick for this https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2

Though that micro-optimization is irrelevant in this code given it's about to do a heap allocation :) Just a fun bit of art.
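
For the curious, the linked trick looks roughly like this for 32-bit values (sketch_round_up_pow2 is just a name chosen here). Note that it rounds up to the next power of two, which only matches the doubling loop when the starting capacity is itself a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next power of two without branching.
 * 0 maps to 0, and values above 2^31 would wrap around to 0. */
static uint32_t sketch_round_up_pow2(uint32_t v)
{
    assert(v <= UINT32_C(0x80000000));
    v--;          /* so exact powers of two map to themselves */
    v |= v >> 1;  /* smear the highest set bit into every lower bit */
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v++;          /* all-ones below the top bit, plus one, is a power of two */
    return v;
}
```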

Comment thread ext/liquid_c/parser.c
{
    switch (p->cur.type) {
        case TOKEN_IDENTIFIER:
        case TOKEN_OPEN_SQUARE:
Contributor

I think this should have TOKEN_OPEN_ROUND for ranges. Also potentially TOKEN_DOTDOT for completeness though it doesn't look necessary based on the places this is called.

I confirmed that a range is a valid argument to a filter and is a valid variable.

Contributor Author

I wasn't too concerned about this function identifying all constants, since it is only used as part of an optimization for calling filters with constant arguments. I did add support for constant ranges in the corresponding try_parse_constant_expression function in the follow-up PR (#60), so I could backport that case to this PR.

Contributor Author

> Also potentially TOKEN_DOTDOT for completeness though it doesn't look necessary based on the places this is called.

TOKEN_DOTDOT isn't a valid token for starting a constant expression, so handling it with the default case is fine.

Comment thread ext/liquid_c/variable.c

static inline void parse_and_compile_expression(parser_t *p, vm_assembler_t *code)
{
bool is_const = will_parse_constant_expression_next(p);
Contributor

Is this mostly an optimization over checking for #evaluate?

Contributor Author

It was mostly a forward-thinking approach, since we want to avoid even creating a VariableLookup or RangeLookup when parsing expressions in the future. However, I did also think it would be faster than calling rb_respond_to.
