Codebase list libmawk / bullseye-backports/main doc / developer / vm.html

Tree @bullseye-backports/main (Download .tar.gz)

vm.html @bullseye-backports/mainraw · history · blame

<h1> The Virtual Machine (VM) </h1>

The VM lives in execute.c function mawk_execute_().
In principle it takes an array of instructions (INST), executing them
one by one, operating on the current <a href="stack.html"> evaluation
stack</a> (or stack for short). Most of the execution really goes in
order as the compiler prepares up everything to achieve linearity.
This makes the execution loop relatively simple and efficient: it is
just a large <i>while(dont_need_to_exit) execute_next_instruction;</i>.
The actual instruction execution is a real large switch on the instruction
An usual example on how this is implemented in practice can be
obtained by <i>lmawk -Wdump -f test.awk</i> on some simple arithmetic
<table border=1>
<tr><th> awk <th> asm (VM instructions)
BEGIN { a = 3 + 4 * 5 }
000 pusha	a
002 pushd	3
004 pushd	4
006 pushd	5
008 mul
009 add
010 assign
011 pop
012 exit0
First the lvalue (target variable, left side) of the assignment is pushed,
then the expression (right side). The stack is, from top to down: {5, 4, 3, a}.
The top of the stack is 5, the second element is 4 by the time <i>mul</i> runs.
<i>Mul</i> will replace these two elements by 20, decreasing the stack size by
one, leaving the result on the top. Next <i>add</i> does a similar job,
replacing the top two items of the stack {20, 3} with their sum, 23. At
the end <i>assign</i> runs on the stack {23, a}, removing both items, copying
the value 23 to the global variable <i>a</i>. At the end it also puts
the result on the top of the stack, leaving the stack as {23} - this is
the result (output) of the assignment operation. Since the script doesn't
need to use the result, it runs a <i>pop</i> that removes and discards
the top item, leaving the stack empty. Since the script didn't have main
or END parts, the script can quit at this point, executing the <i>exit0</i>
instruction (exiting with value 0 - the implicit exit).
NOTE: currently there's absolutely no optimization in the parser: everything
is calculated as written in the script and some values are saved just to be
discarded by the next instruction.
An interesting and important feature of execute_() is that it can save all
states and return to the caller <b>at any point</b> of the execution, i.e.
between any two instruction in the code. It can also resume execution from
the next instruction. This provides the host application full control over
scheduling the script, while the script can be built of sequential, blocking

<h2> Jumps and conditions </h2>

There are a few instructions that have to break linear execution flow, tho:
	<li> <a href="parser_if.html"> if()</a> needs to make a jump depending on some condition
	<li> conditional code (e.g. <i>/^A/ { ... }</i>) is the same
	<li> range conditional code (e.g. <i>/^A/,/^B/ { ... }</i>) is still the same with some extra complications (internal state tracking whether the script is within the range)
	<li> <a href="parser_for.html">loops</a> obviously need to do a similar conditional jump back
	<li> function calls have to temporarily suspend executing the current code and jump to the function code from which a <i>ret</i> or <i>ret0</i> instruction jumps back
Some of the above are implemented using conditional and unconditional jumps
to direct addresses (first column on the asm). For example a simple if
is compiled to contain 2 jumps:
<table border=1>
<tr><th> awk <th> asm (VM instructions)
	if (bool)
		a = 6
		a = 7
000 pushi	bool
002 jz		012
004 pusha	a
006 pushd	6
008 assign
009 pop
010 jmp		018
012 pusha	a
014 pushd	7
016 assign
017 pop
018 exit0
The first one is a conditional jump, "jump if [top of the stack is] zero"
(<i>jz</i>) - this makes the VM jump to the <i>else</i> branch at address 10.
The <i>then</i> branch ends in an unconditional jump to the next instruction
after the if (which is the implicit exit in this example), bypassing
the code of the <i>else</i> branch.
A jump is carried out by a simple modification of the "next instruction"
pointer before running the next iteration of the execution loop.

<h2> Recursion: function calls </h2>

A slightly more complicated mechanism is used when jumps are of recursive
nature: the code has to jump to somewhere to do some work and then
return here and continue execution from the next instruction. A typical
example on this is executing user functions.
The original mawk implementation simply called mawk_execute_() recursively.
This meant the C compiler took care of saving all internal states on the
C stack for the detour. However, this wouldn't allow the code to be suspended
during such detour as it would be problematic to rebuild the C stack on a resume.
Thus libmawk's mawk_execute_() does not recurse on C level but on VM level.
For example when a function is called (using the <i>call</i> instruction):
	<li> 1. prepare the eval stack for the function (as mawk did): push arguments and local variables
	<li> 2. push execution state on top of the eval stack
	<li> 3. unconditional jump to the function body
Upon a <i>ret</i> instruction from the function:
	<li> 1. save and remove the return value from the top of the stack
	<li> 2. restore previous execution state from the top of the stack
	<li> 3. remove all arguments and local variables from the stack (remove the frame)
	<li> 4. push the return value on top of the stack
	<li> 5. continue execution normally - because step 2., this will run the instruction right after the <i>call</i>

<h3> additional cases of recursion </h3>
A range pattern is recursive as well: it needs to evaluate one or two pattern
matching before it decides whether to execute the action and/or update
the state. The range check starts with instruction _RANGE_CHK which
encodes expression code offsets and state in the next few instruction slots. It
recurses to evaluate expressions which are terminated by the _RANGE_STOP command.
Entering an expression evaluation is similar to a function call while
_RANGE_STOP is very similar to a ret.

	<h3> deep recursion </h3>
	At any time the eval stack has to have enough space after sp for
	evaluation the longest awk expression. Any user function recursion will
	bump sp leaving less room for expressions and further recursion. Relocating
	the stack (with a realloc()) is not a good idea as there might be cell
	pointers pointing to stack elements all around.
	Instead, mawk limits expression length in compile time to a fixed
	maximum. If entering a new function would not leave at list this
	amount of eval stack above sp, "deep recursion" is performed. This
	starts by allocating an entire new stack for the call. Call stacking
	saves enough pointers so that the code can switch back to the
	previous stack easily. The allocation is done using zmalloc(), the overhead
	is minimal. Since the original stack/stacks is/are kept intact, any
	pointer stays valid. sp points into the new stack block and will increase
	there until another deep recursion.
	This wastes some stack space on the old stack (potentially max expression
	length minus one slot) but guarantees that:
	 - checks and special things need to be done only at entering/leaving functions
	 - even that happens rarely as a stack block is large enough to host
	   many functions besides the longest expression
	 - the stack can grow as big as it likes to, without having to allocate
	   one large block of memory
	 - all allocation is done from normal instance memory - allocation limit,
	   and auto cleanup at the end are granted

<h2> Resuming execute_() </h2>

Since the far most common thing in an embedded setup is to resume a
code interrupted by execution limit or a blocking getline, mawk_execute_() is
doing that by default. The top few slots in the eval stack is always a full
state dump, the same thing used in recursion. Entering mawk_execute_() pops this
section and initializes all internal states from it. When execution needs
to be interrupted, mawk_execute_() saves internal states onto the top
of the stack.

<h2> Entering execute_() (fresh start) </h2>

Entering in run state involves setting up internal states pointing to
the beginning of the code in question, pushing these states on top of
the stack and calling mawk_execute_() which will "resume" from these states.
Similar thing happens when the application calls an awk function.
It may be that the execution is interrupted in the middle of running of
a large block of code, for example in BEGIN. The top of the stack holds
the current execution state so that mawk_execute_() will be able to
continue execution. The application may decide to run an awk
function before resuming the code: this operation would push a new
set of execution state on top of the stack and call mawk_execute_().
When the current state finishes at the _RET instruction, mawk_execute_()
would take the next frame from the stack and would automatically resume
execution of the interrupted BEGIN block. This would cause the return value
of the function to be lost and would attempt to resume BEGIN as a side effect
of the function call!
To avoid such confusion, any new enter to mawk_execute_() is required to
push two sets of states: an EXEST_EXIT and the actual state it wants
to "resume" at (start execution at). When mawk_execute_() hits the _RET
instruction in the above example, it does pop the next frame, but that
frame would be the EXEST_EXIT which would cause it to interrupt immediately.
This leaves the stack exactly as it looked like before the function call, 
and the application later may decide to resume execution.
Fresh start entries:
	<li> running BEGIN (last stage of initialization)
	<li> running main on new input, in non-interrupted state
	<li> running END (first stage of uninitialization)
	<li> an awk function is called by the application