ZJIT: Method Inlining #967
Description
Background
Method call inlining is a common and profitable optimization for compilers. In Ruby, nearly all operations are method calls, giving us an opportunity to affect performance broadly. At the most basic level, avoiding the setup and clean-up of method calls saves execution time and memory. Perhaps more important, however, is that inlining a method gives the optimizer more context to make better optimization decisions. Interprocedural optimization is expensive, so its application is quite limited. Inlining a method body into another method turns an interprocedural problem into an intraprocedural one, and intraprocedural optimization is something ZJIT does extensively today.
Constraining Polymorphism
Inline caches have been an immensely powerful tool for speeding up applications in dynamic programming languages. Research has shown that at any given call site, the set of receiver types that a method is called on is typically very small (often just one). An inline cache avoids the expensive method resolution phase by caching the method handle locally and guarding it with a quick type check (e.g., a reference comparison between the cached shape and the current receiver shape).
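The guard-and-cache pattern can be sketched in plain Ruby. This is a toy model: `CallSiteCache` and its fields are illustrative, not ZJIT's actual data structures, and a real JIT works on compiled code rather than Ruby objects.

```ruby
# Toy model of a monomorphic inline cache for a single call site.
class CallSiteCache
  def initialize
    @cached_class  = nil
    @cached_method = nil
  end

  # Dispatch `name` on `receiver`, caching the resolved method handle.
  def call(receiver, name, *args)
    if receiver.class.equal?(@cached_class)
      # Fast path: a cheap identity check stands in for full method lookup.
      @cached_method.bind_call(receiver, *args)
    else
      # Slow path: full method resolution, then fill the cache.
      @cached_class  = receiver.class
      @cached_method = receiver.class.instance_method(name)
      @cached_method.bind_call(receiver, *args)
    end
  end
end
```

After the first (slow) call, subsequent calls with the same receiver class take only the fast path.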
However, a compiled method must function correctly at all call sites. Across many call sites, we tend to see more diversity in both the receiver type and the argument types. So, while a given call site may be monomorphic, the compiled call target may be polymorphic or even megamorphic.
Previous work, focused on Ruby, has demonstrated that call target splitting can reduce the level of polymorphism. Here, we duplicate a method and allow the multiple copies to be compiled, with each copy only needing to support a smaller subset of call sites. Taken to the extreme, each call site would have its own copy of the call target. That, in turn, very frequently makes the call target monomorphic, which tends to have a cascading effect. The compiler can make bigger bets, producing more optimized code.
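The effect of splitting can be illustrated in plain Ruby. This is conceptual only: ZJIT would perform the duplication on compiled code, and the hand-written `_for_site` copies below are stand-ins for that.

```ruby
# One shared target, called from two sites with different receiver types,
# makes the single compiled copy of `double` polymorphic.
def double(x)
  x + x
end

ints   = [1, 2, 3].map { |n| double(n) }   # site A: always sees Integer
floats = [1.5, 2.5].map { |f| double(f) }  # site B: always sees Float

# Splitting duplicates the target so each copy serves one call site and
# stays monomorphic; the compiler can then specialize each copy for the
# single receiver type it will ever see.
def double_for_site_a(x)  # only ever sees Integer
  x + x
end

def double_for_site_b(x)  # only ever sees Float
  x + x
end
```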
Method inlining provides a natural opportunity to support splitting. I propose that ZJIT inlines unoptimized methods. I.e., instead of taking the body of a method that has already been processed by the optimization loop, we create a fresh copy of the unoptimized method by going back to the ISEQ list. Once the body of that method has been inlined into the caller and the CFG cleaned up to meet our (extended) basic block invariants, the combined list of instructions can be processed all at once.
There is a trade-off here: each inlined method will require memory space and, being compiled independently, will spend more time in the compiler. The benefit of this approach, however, is that we can specialize the inlined method precisely for its use by the caller. This will typically allow the compiled code to be smaller than it would be if it needed to support other call sites. There's a snowball effect as we inline more of the call graph.
Inlining Policy
Inlining can be a finicky optimization. In many cases it improves performance, while in others it can have no impact at all, or worse, can slow down execution. The more we inline, the slower we make the compilation process. If we inline too much, we start to have a noticeable impact on the process's memory usage. Larger methods can cause thrashing in the CPU's instruction cache. Unbounded recursion could prevent the application from ever running at all.
To limit this effect, we will need an inlining policy, which determines whether a given method should be inlined. We can make that decision based on data known to us at compile time, such as:
- Size of the target method in instructions
- Profile info that could tell us how frequently a method is called or a branch taken
- Method name and receiver type
- Total amount inlined so far
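A policy built on these inputs might look like the sketch below. The thresholds and names (`INLINE_MAX_ISEQ_SIZE`, `should_inline?`, etc.) are hypothetical placeholders, not tuned values or ZJIT API.

```ruby
# Illustrative thresholds; real values would come from tuning.
INLINE_MAX_ISEQ_SIZE  = 30    # max target method size, in instructions
INLINE_MAX_TOTAL_SIZE = 500   # inlining budget for the whole compilation unit

# Decide whether to inline one call target, given compile-time data.
def should_inline?(target_size:, call_count:, inlined_so_far:)
  return false if target_size > INLINE_MAX_ISEQ_SIZE               # target too big
  return false if inlined_so_far + target_size > INLINE_MAX_TOTAL_SIZE  # budget spent
  call_count > 0                                                   # only profiled-hot calls
end
```

A real policy would also fold in the method name, receiver type, and branch profiles listed above.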
The inlining policy will need to be tuned over time. I suggest as a first step that we adopt the heuristics of other optimizing Ruby runtimes. Presumably, they've done some amount of work in devising their heuristics. While different VMs and compilers have different concerns, at a high level, inlining is more about the runtime behavior of Ruby code than about the VM's execution of that code.
Considerations
Inlining in Ruby can be a bit tricky. We have methods defined statically and others defined dynamically at runtime. We allow classes to be reopened and methods to be redefined at runtime. We have blocks and procs, which are special types of methods, and in addition to returning, they can break. We have set_trace_func and TracePoint. We have the ability to fetch and manipulate backtraces. We have binding. Anything that needs access to the frame is going to need special handling since inlining logically removes that data. We have a rich set of metaprogramming methods. We have throw and raise, which affect control flow. We also have inspection tools, such as debuggers and profilers.
We also have methods implemented in C and Ruby — we will only inline Ruby methods into Ruby methods. That means to make the most of inlining, we'll want to have Ruby-based alternative implementations of common core library methods (e.g., we've previously done this with Array#each). As a community, and at Shopify in particular, we'll want to shift away from native extensions where a gem will suffice. For native extensions that exist only to bind to native libraries, we'll want to look hard at FFI and Fiddle.
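As a rough illustration, an inlinable Ruby-level Array#each might look like the sketch below. This is simplified: CRuby's actual Ruby implementation relies on Primitive helpers, and `each_ruby` is a hypothetical name used here to avoid clobbering the builtin.

```ruby
class Array
  # Simplified Ruby-level iteration, the kind of method ZJIT could inline
  # into a caller, unlike a C-implemented Array#each.
  def each_ruby
    return to_enum(:each_ruby) { size } unless block_given?
    i = 0
    while i < size
      yield self[i]
      i += 1
    end
    self
  end
end
```

Once the loop body is plain Ruby, the caller's block can be inlined into it, eliminating the block dispatch on every iteration.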
I suggest we ignore most, if not all, of the special cases in this first pass to the extent that we can detect them, leaving them as regular method calls. Put another way, our initial inlining policy should be quite conservative in what it admits. We will inline simple methods and blocks that have straightforward control flow and that don't need to materialize frames. We can always tweak from there. Boiling the ocean, as it were, would result in difficult-to-debug code propagating throughout the compiler. Handling the simple cases first will give us a firm foundation to build upon.
Compilation Strategy
If all we do is take a method's body and insert it into the caller, we haven't gained a whole lot. Yes, we can reduce method call overhead, and that has value. But inlining gives us extra context to make better decisions in speculative optimization. Moreover, we should loop our optimization pass so we can revisit earlier phases after later phases have simplified or transformed the HIR for a method. Classically, we'd loop until we reach a fixed point, but we could establish an upper bound (either hard-coded or a config option) on the loop iteration count. Notably, we must defer inlining until after type specialization in order to correctly identify the call target. However, we'd be inlining an unspecialized method body, so we'll ultimately need to run type specialization a second time to properly process the inlined method.
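The looped pipeline with an iteration cap can be sketched as follows. The HIR representation and pass functions here are stand-ins, not ZJIT internals.

```ruby
MAX_OPT_ITERATIONS = 5  # illustrative upper bound on pipeline iterations

# Run the passes repeatedly until the HIR stops changing (a fixed point)
# or the iteration cap is hit, whichever comes first.
# `passes` is an ordered list of HIR -> HIR functions.
def run_pipeline(hir, passes)
  MAX_OPT_ITERATIONS.times do
    before = hir
    passes.each { |pass| hir = pass.call(hir) }
    break if hir == before  # fixed point reached early
  end
  hir
end
```

With a toy constant-folding pass over a tiny instruction list, a second iteration picks up work the first one exposed:

```ruby
fold = ->(hir) { hir.map { |op| op == [:add, 1, 2] ? [:const, 3] : op } }
run_pipeline([[:add, 1, 2], [:ret]], [fold])  # => [[:const, 3], [:ret]]
```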
When the runtime state changes such that our optimized code is invalid (e.g., an inlined method is redefined), we need to unwind our inlined method, restore state, and side-exit to the interpreter. This is a process we call deoptimization. It implies that we have a way to detect an invalidation and a mechanism to recover our state. Beyond a performance hit, there should be no observable change to the runtime behavior.
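A toy model of invalidation-driven deoptimization is sketched below. The names are illustrative, and ZJIT's real mechanism operates on compiled machine code, not Ruby objects.

```ruby
# Toy model: optimized code runs only while its assumptions hold;
# otherwise every call side-exits to a generic fallback path.
class CompiledMethod
  def initialize(&fast_path)
    @fast_path = fast_path
    @valid     = true
  end

  # Called when an assumption is broken, e.g. an inlined method is redefined.
  def invalidate!
    @valid = false
  end

  def call(receiver, *args, fallback:)
    if @valid
      @fast_path.call(receiver, *args)    # optimized, inlined path
    else
      fallback.call(receiver, *args)      # deoptimized: interpreter path
    end
  end
end
```

Both paths must produce the same observable result; only the performance differs.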
Once we've thrown away a compilation, we can either alter our existing profile or start from scratch (it's unclear at this time which would be more profitable). With the new data, we may find it makes sense to attempt compiling again. In pathological cases, the runtime behavior never stabilizes or cycles through so many states that we'd label it megamorphic. To avoid decompilation loops, we'll want to track how many times we've inlined a method and had to throw it away.
Related Projects
In order for inlining to really make sense, we need to have lightweight frames. We can start the inlining project without lightweight frames, but having to inline the heavy frame logic will constrain performance.
Having a Ruby-based core library will be necessary to get the full advantage of ZJIT and method inlining. The Rubinius project started down this path many years ago. TruffleRuby has carried on from there. Chris Seaton made the case for a Ruby-based core library at RubyConf 2022. It may make sense to coordinate with the JRuby and TruffleRuby teams. One challenge there, however, is each VM is likely to need a different set of primitives in different places, even if the overall implementation is Ruby.
Carrying on from migrating native code to Ruby, we'll want to have a faster FFI. Currently, binding to a native library and invoking a foreign function is simply slower than using a native extension. Often that performance hit isn't significant, but when it is, FFI becomes a poor choice. Even in cases where it wouldn't be a hot spot, it's hard to expect a slower solution to gain traction.
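For reference, binding and invoking a foreign function with Fiddle (from Ruby's standard distribution) looks like the following, using libc's strlen as the example. The per-call overhead of this path is what makes FFI slower than a native extension today.

```ruby
require "fiddle"

# Open the current process's symbols; libc is already loaded.
libc = Fiddle.dlopen(nil)

# size_t strlen(const char *s);
strlen = Fiddle::Function.new(
  libc["strlen"],          # symbol address
  [Fiddle::TYPE_VOIDP],    # argument types
  Fiddle::TYPE_SIZE_T      # return type
)

strlen.call("hello")  # => 5
```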