Compile variable nodes into the Liquid::C::BlockBody VM code#59
Conversation
macournoyer
left a comment
There was a problem hiding this comment.
All comments are refactoring & optimization ideas. This is shaping up nicely 👏 Performance improvements are adding up fast!
Great decision on going the stackless route (re-using the stack for nested blocks) 👍
However, I'm still doubtful about the constant pointer approach.
| if (*ip == OP_WRITE_RAW) { | ||
| size_t *size_ptr = &const_ptr[1]; | ||
| if (*size_ptr) { | ||
| *size_ptr = 0; // effectively a no-op |
There was a problem hiding this comment.
Could you instead introduce a OP_NOP instruction, and replace the whole instruction?
There was a problem hiding this comment.
Yes, although it would then have an unused constant arguments. I imagine that a OP_NOP would be faster, but I'm not concerned about further optimizing this edge case. Primarily these blank strings are handled in this way to minimize impact on code that isn't relying on this feature while keeping backwards compatibility, so this seemed simpler than a no-op instruction.
There was a problem hiding this comment.
keeping backwards compatibility
For #nodelist? I guess this discussion will answer #58 (comment) too
There was a problem hiding this comment.
Regarding backwards compatibility, I was referring to removing/ignoring blank strings from blank blocks. I think our only remaining dependency on nodelist is for tags.
| } | ||
| } | ||
|
|
||
| void liquid_vm_next_instruction(const uint8_t **ip_ptr, const size_t **const_ptr_ptr) |
There was a problem hiding this comment.
This got me confused at first. I think it should be renamed to liquid_vm_next_instruction_and_const.
But I still don't get the advantage of this approach over passing the constant index via an operand.
And I think it is painting us in a corner for many optimizations (block_body_remove_blank_strings is an example).
There was a problem hiding this comment.
These constants are pointer sized operands to the instruction, similar to one the byte sized ones.
In hindsight, the instruction pointer should be been a struct with a pointer to both to provide a bit of a zero-cost abstraction around them. We are also missing an abstraction for accessing those operands from the block body code. However, this PR is at least a step in the right directly by moving the vm assembler out of block.c.
However, all of this is still internal to liquid-c, so we can definitely change the VM instructions.
There was a problem hiding this comment.
It's not really an operand if it's not in the bytecode, in an instruction. So it's a bit like having two sets of instructions. The problem we'll have is keeping them in sync all the time if we modify the bytecode.
Not sure what you mean RE the struct with a pointer.
But all good, you're right, step in the right direction!
There was a problem hiding this comment.
I think the idea is that rather than using a separate ip_ptr and const_ptr_ptr the code would use an ip struct that has equivalent ip_ptr and const_ptr_ptr members. When calling any instruction-related functions you would always pass the ip struct around.
In the end the semantics of "operand" are somewhat arbitrary, at the moment the constants buffer just has pointers/small bits of metadata in it and could be interleaved into the instructions buffer and everything would be quite equivalent, with slightly different performance for some tasks.
@dylanahsmith what are the reasons you ended up going with the separate constants buffer?
There was a problem hiding this comment.
Yes, that is what I meant for the instruction pointer struct.
I want the constants separate, because they will require serialization and deserialization, unlike the byte code which is already serialized and doesn't require deserialization for execution. Ruby's in-memory VM code interleaves the pointers immediately into the instruction, which ended up leading to pointer size instructions (to avoid alignment issues) and requires RubyVM::InstructionSequence.load_from_binary to deserialize the instructions.
Of course, we can also have a constant table, which is ideal for constants that get re-used (e.g. filter names, variable names and object keys are common to re-use) and allows only those ruby constants to be deserialized. I want to switch to that after adding serialization support.
|
|
||
| if (new_size > capacity) { | ||
| do { | ||
| capacity *= 2; |
There was a problem hiding this comment.
Doubling seems excessive. Max Fixnum value can be stored in 10 bytes. Lets say the output buffer is 10KB, doubling for the sake of adding 10 bytes?
Maybe this could rewritten to use rb_str_buf_cat_ascii, and let Ruby handle the resizing? Writing the fixnum to a char number_str[11] on the stack.
Or replace the whole thing w/ rb_str_buf_append(output, rb_fix2str(fixnum));?
There was a problem hiding this comment.
This is unlikely to be the last 10 bytes, so doubling is meant to amortize the cost of resizing the output string.
Lets say the output buffer is 10KB, doubling for the sake of adding 10 bytes?
Maybe this could rewritten to use
rb_str_buf_cat_ascii, and let Ruby handle the resizing?
You would actually have approximately the same behaviour. I created https://github.com/dylanahsmith/ruby-string-buffer a long time ago to make it easier to inspect the capacity.
irb> str = "1" * (10 * 1024);
irb> StringBuffer.capacity(str)
=> 10240
irb> str << "2";
irb> StringBuffer.capacity(str)
=> 20481
which actually allocates 1 extra byte, which keeps the allocation from being page aligned.
Writing the fixnum to a
char number_str[11]on the stack.
We would still have to resize the output buffer to copy it from the stack to the output buffer, so this approach avoids the copy step.
Or replace the whole thing w/
rb_str_buf_append(output, rb_fix2str(fixnum));?
That would trade-off performance for simplicity, since it would allocate a new string object to write the result to as well as having to copy the result to the output buffer. The primary reason for writing write_fixnum was to avoid that string allocation overhead.
| size_t hash_size = *ip++; | ||
| size_t num_keys_and_values = hash_size * 2; | ||
| VALUE hash = rb_hash_new(); | ||
| VALUE *args_ptr = vm_stack_pop_n_use_in_place(vm, num_keys_and_values); |
There was a problem hiding this comment.
This falls into place so nicely, wow! Everything about this block of code is awesome 👌
| // since this could be called in the middle of parsing | ||
| const uint8_t *end_ip = code->instructions.data_end; | ||
| while (ip < end_ip) { | ||
| switch (*ip++) { |
There was a problem hiding this comment.
This type of loop is done in a few places, could be refactored using a table storing info about each instructions:
typedef struct instruction_def_t {
size_t consts;
size_t operands;
} instruction_def_t;
instructions_def_t instructions_defs[OP_MAX] = {
/* OP_LEAVE */, { .consts = 0, operands = 0 },
/* OP_WRITE_RAW */, { .consts = 2, operands = 0 },
/* OP_WRITE_NODE */, { .consts = 1, operands = 0 },
/* OP_POP_WRITE_VARIABLE */, { .consts = 0, operands = 0 },
/* OP_PUSH_CONST */, { .consts = 1, operands = 0 },
/* OP_HASH_NEW */, { .consts = 0, operands = 1 },
/* OP_FILTER */, { .consts = 0, operands = 1 },
/* OP_PUSH_EVAL_EXPR */, { .consts = 1, operands = 0 },
/* OP_RENDER_VARIABLE_RESCUE */, { .consts = 0, operands = 0 },
};Then you no longer have to review each of those loops each time an instruction is added or modified:
switch (*ip++) {
// Special cases here
case ...:
break;
default:
const_ptr += instructions_defs[ip-1].consts;
ip += instructions_defs[ip-1].operands;
}There was a problem hiding this comment.
I introduced liquid_vm_next_instruction in this PR which is now used in 3 places to reduce duplication. The only remaining two, vm_assembler_gc_mark and vm_render_until_error have to handle all the opcodes. I think the only remaining duplication is between vm_assembler_gc_mark and liquid_vm_next_instruction, but they differ in that vm_assembler_gc_mark also needs to know which constants are ruby objects, so we would need to encode that into the instruction definition as well.
I was considering adding the table based approach to support the disassembler, so perhaps I could re-use it for other purposes if I don't switch to another approach for marking.
There was a problem hiding this comment.
Yeah just an idea. I've seen it used often in VMs. Here's the one in .NET Core: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/inc/opcode.def
Could be useful for building a debugger & disassembler later.
f88d1a7 to
19be073
Compare
37dfbfa to
3540250
Compare
ea00f8b to
5a5b34a
Compare
233322a to
e00cdf9
Compare
5a5b34a to
9d8a3d7
Compare
pushrax
left a comment
There was a problem hiding this comment.
Partially done this review - will finish tomorrow.
| } | ||
| } | ||
|
|
||
| void liquid_vm_next_instruction(const uint8_t **ip_ptr, const size_t **const_ptr_ptr) |
There was a problem hiding this comment.
I think the idea is that rather than using a separate ip_ptr and const_ptr_ptr the code would use an ip struct that has equivalent ip_ptr and const_ptr_ptr members. When calling any instruction-related functions you would always pass the ip struct around.
In the end the semantics of "operand" are somewhat arbitrary, at the moment the constants buffer just has pointers/small bits of metadata in it and could be interleaved into the instructions buffer and everything would be quite equivalent, with slightly different performance for some tasks.
@dylanahsmith what are the reasons you ended up going with the separate constants buffer?
| if (nodelist != Qnil) | ||
| return nodelist; | ||
| nodelist = rb_ary_new_capa(body->instructions.size / sizeof(VALUE)); | ||
| nodelist = rb_ary_new_capa(body->render_score); |
There was a problem hiding this comment.
Ah this addresses #58 (comment) which you can now ignore.
| if (*ip == OP_WRITE_RAW) { | ||
| size_t *size_ptr = &const_ptr[1]; | ||
| if (*size_ptr) { | ||
| *size_ptr = 0; // effectively a no-op |
There was a problem hiding this comment.
keeping backwards compatibility
For #nodelist? I guess this discussion will answer #58 (comment) too
| const char *text = (const char *)*const_ptr++; | ||
| size_t size = *const_ptr++; | ||
| const char *text = (const char *)const_ptr[0]; | ||
| size_t size = const_ptr[1]; |
There was a problem hiding this comment.
This is definitely a bit faster to read.
| buffer->data = xrealloc(buffer->data, new_capacity); | ||
| buffer->capacity = new_capacity; | ||
| capacity *= 2; | ||
| } while (capacity < required_capacity); |
There was a problem hiding this comment.
There's a non-branching bit twiddling trick for this https://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
Though that micro-optimization is irrelevant in this code given it's about to do a heap allocation :) Just a fun bit of art.
| { | ||
| switch (p->cur.type) { | ||
| case TOKEN_IDENTIFIER: | ||
| case TOKEN_OPEN_SQUARE: |
There was a problem hiding this comment.
I think this should have TOKEN_OPEN_ROUND for ranges. Also potentially TOKEN_DOTDOT for completeness though it doesn't look necessary based on the places this is called.
I confirmed that a range is a valid argument to a filter and is a valid variable.
There was a problem hiding this comment.
I wasn't too concerned about this function identifying all constants, since it is only used as part of an optimization for calling filters with constant arguments. I did add support for constant ranges in the corresponding try_parse_constant_expression function in the follow-up PR (#60), so I could backport that case to this PR.
There was a problem hiding this comment.
Also potentially TOKEN_DOTDOT for completeness though it doesn't look necessary based on the places this is called.
TOKEN_DOTDOT isn't a valid token for starting a constant expression, so handling it with the default case is fine.
|
|
||
| static inline void parse_and_compile_expression(parser_t *p, vm_assembler_t *code) | ||
| { | ||
| bool is_const = will_parse_constant_expression_next(p); |
There was a problem hiding this comment.
Is this mostly an optimization over checking for #evaluate?
There was a problem hiding this comment.
It was mostly a forward thinking approach, since we want to avoid even creating a VariableLookup or RangeLookup when parsing expressions in the future. However, I did also think it would be faster than calling rb_respond_to
9d8a3d7 to
239b210
Compare
c05a7ca to
4be2f77
Compare
239b210 to
eeae3a1
Compare
eeae3a1 to
a2c628d
Compare
a2c628d to
25be6e4
Compare
25be6e4 to
c32997e
Compare
c32997e to
818b8a0
Compare
818b8a0 to
6f561d6
Compare
Depends on #58 and Shopify/liquid#1294This is a continuation of #58 that compiles the strict parseable liquid variables directly in the block body VM code for performance.
For simplicity, variables and expressions in tags do not yet leverage this VM code and the variable VM code still relies on the existing expression evaluation code (e.g.
context_evaluateof Liquid::VariableLookup and Liquid::RangeLookup). These will be optimized in following PR(s).All the commits except the (currently) last one (Compile variable nodes into the Liquid::C::BlockBody VM code) are refactors:
I've moved the VM instructions and constants out of block_body_t into a vm_assembler_t struct, since we will want to embed the VM code in a future Liquid::C::Variable. In the future, I intend to wrap the assembler in a Liquid::C::Assembler as a safe API for compiling tags from ruby code(edit: extracted to Extract a vm_assembler_t struct from block_body_t #69)I changed the c_buffer struct to store pointers for the end of the data and capacity, instead of storing the sizes and capacity directly. This is because we are mostly getting and updating the data end pointer, which is especially true in order to use this for the VM stack where we reserve space for the block and don't have to check the capacity for each write/push.(edit: extracted to Refactor c_buffer to make it suitable to re-use for a stack #70)I've also added support to c_buffer to be initialized with size 0 (without an allocation) which is the one case where we can't just double the capacity until we have the requested extra capacity.(edit: extracted to Refactor c_buffer to make it suitable to re-use for a stack #70)I've added a liquid_vm_next_instruction function to reduce duplication for code iterating instructions by using it for both Liquid::C::BlockBody#remove_blank_strings and Liquid::C::BlockBody#nodelist. This way I can further leverage it for variable error handling.(edit: extracted to Add liquid_vm_next_instruction to reduce duplication #72)The VM stack is one of the significant additions from this PR. I chose to re-use the stack for nested blocks, which means that it needs to be expandable, hence the use of c_buffer_t for its memory allocation. We can't globally determine the maximum stack size, since this stack could even be re-used in a template partial, but this PR does calculate the maximum stack size needed for a block body so that we can reserve stack space at the start of block body render and not have to check for sufficient capacity throughout the block body render.
Another design choice I made for this PR is to do error handling for the whole block body render, rather than for each variable render. This way we can reduce the state saving cost of
rb_rescuefrom liquid code that doesn't encounter variable render errors. In order to recover from the exception, we just restore the stack size to what it was at the start of the block body render and iterate the instructions to jump just past the variable write.Since the variable lookup and expression evaluation is still happen as it did before, the most significant change for performance is the filter invocation, which can happen without allocating an arguments array by leveraging the VM stack calling the filter function directly using
rb_funcallv. The filterinvokable?is still being done at runtime (we need to pass filters as a parse option to do it at parse time) but is still optimized by doing the check directly from C using a hash with symbol keys.Benchmark
Before this PR (on the #58 branch)
after