<HTML>
<BODY>
<H1> awk script execution </H1>
Libmawk runs the bytecode of the script in a <a href="developer/vm">virtual machine</a>.
The VM takes the bytecode as a series of instructions that operate on data
stored on the execution stack and in global states of the script instance
(libmawk_state_t).
<p>
An instance does only one thing at a time, although that
one thing may be interrupted and resumed at any time. It is
always one of these:
<ul>
	<li> running BEGIN
	<li> running END
	<li> running main (the stdin read and pattern match rules)
	<li> running an awk function called from the application
	<li> nothing: empty stack; before starting or after finishing any of the above activities
</ul>
<p>
BEGIN, END, main and awk functions are the four entry points for executing
the script. Normally BEGIN is run right after setting up the script, then
main is run on all input and END is run when the script exits, right
before uninitialization of the script instance. This is a 1:1 copy
of the standard way awk works. The fourth, calling awk functions directly
from the application, is an extra entry point.
<p>
The script does nothing unless the application commands it to. Some
calls of the simplified API do this automatically, but the raw API (staged
init/uninit) always lets the app decide when to <b>start</b> running the script.
This document uses the term <i>execution transaction</i> for what begins when
the application calls the API to start running a script.
<p>
Any execution-related call is non-blocking: it returns after a
reasonable time spent running the script and, with a run limit set, never
gets stuck in an infinite loop. When such an API call returns, the return value
is a mawk_exec_result_t that indicates the reason for the return:
<ul>
	<li> 1. if the script attempts to read a file/pipe that would block, it interrupts execution and returns (with <i>MAWK_EXER_INT_READ</i>) instead
	<li> 2. when reaching the run limit (a given number of instructions has been executed), the script is interrupted (return value is <i>MAWK_EXER_INT_RUNLIMIT</i>)
	<li> 3. the script may finish executing the current execution transaction (<i>MAWK_EXER_DONE</i> or <i>MAWK_EXER_FUNCRET</i>)
	<li> 4. the script may decide to exit (<i>MAWK_EXER_EXIT</i>)
</ul>
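<p>
The four return paths can be summarised as a dispatch on the result code. The
sketch below is only an illustration: the <i>toy_</i>/<i>TOY_</i> names are
made up here and mirror the constants named above; the real
mawk_exec_result_t and its values come from the libmawk headers.

```c
#include <assert.h>

/* Toy stand-in for mawk_exec_result_t (assumption: names only mirror
   the constants mentioned in the text, values are arbitrary). */
typedef enum {
	TOY_EXER_INT_READ,      /* path 1: input would block */
	TOY_EXER_INT_RUNLIMIT,  /* path 2: run limit reached */
	TOY_EXER_DONE,          /* path 3: BEGIN/main/END finished */
	TOY_EXER_FUNCRET,       /* path 3: awk function returned */
	TOY_EXER_EXIT           /* path 4: script called exit */
} toy_exec_result_t;

/* What the application is expected to do next (hypothetical helper type). */
typedef enum {
	TOY_ACT_FEED_INPUT,     /* supply more input, then resume */
	TOY_ACT_RESUME,         /* resume, or decide to stop */
	TOY_ACT_FINISHED,       /* the transaction is complete */
	TOY_ACT_SCRIPT_EXITED   /* all open transactions are gone */
} toy_action_t;

toy_action_t toy_next_action(toy_exec_result_t r)
{
	switch (r) {
		case TOY_EXER_INT_READ:     return TOY_ACT_FEED_INPUT;
		case TOY_EXER_INT_RUNLIMIT: return TOY_ACT_RESUME;
		case TOY_EXER_DONE:
		case TOY_EXER_FUNCRET:      return TOY_ACT_FINISHED;
		case TOY_EXER_EXIT:         return TOY_ACT_SCRIPT_EXITED;
	}
	assert(!"unknown result code");
	return TOY_ACT_FINISHED;
}
```

Keeping the decision in one function makes it harder to forget one of the
interrupted cases at a call site.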
<p>
<i>Execution transactions</i> are collected on the evaluation stack. If
the application requests an execution and the API call returns before
finishing, the transaction is still active. The application is
free to initiate a new <i>execution transaction</i> without
first finishing the previous one. However, the VM will always resume and
progress the most recent <i>execution transaction</i>. This means
<i>execution transactions</i> are nested. When the top, most recent
<i>execution transaction</i> finishes (return path 3), the next resume request
continues with the previous transaction.
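<p>
The nesting rule can be modelled with a small stack. This is a toy model, not
libmawk code: all names are invented for the illustration, and each
transaction is reduced to a counter of resumes it needs before finishing.

```c
#include <assert.h>

#define TOY_MAX_TX 8

typedef struct {
	int id;          /* positive label chosen by the application */
	int steps_left;  /* resumes needed before this transaction finishes */
} toy_tx_t;

typedef struct {
	toy_tx_t tx[TOY_MAX_TX];
	int used;        /* number of unfinished transactions */
} toy_txstack_t;

/* Start a new transaction on top of any unfinished ones. */
void toy_tx_start(toy_txstack_t *st, int id, int steps)
{
	assert(st->used < TOY_MAX_TX);
	assert(id > 0 && steps > 0);
	st->tx[st->used].id = id;
	st->tx[st->used].steps_left = steps;
	st->used++;
}

/* Resume: only the most recent transaction makes progress.
   Returns the id of the transaction that finished in this call,
   0 if the top transaction is still active, -1 if the stack is empty. */
int toy_tx_resume(toy_txstack_t *st)
{
	toy_tx_t *top;

	if (st->used == 0)
		return -1;
	top = &st->tx[st->used - 1];
	top->steps_left--;
	if (top->steps_left > 0)
		return 0;
	st->used--;      /* pop; the previous transaction becomes the top */
	return top->id;
}
```

Starting a "function" transaction while "main" is unfinished means every
resume works on the function until it is popped; only then does main continue.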
<p>
Note, however, that the script has global state. The most obvious one
is the exit state: if the script runs exit(), all open
transactions are discarded. For example, consider a script that is running its
main part, processing the input. While the application is in this phase, the topmost
transaction is always a "running main" transaction that returned
previously because there was no more input to process. If the
application calls an awk function that decides to exit(), that
discards not only the function transaction but the pending "running main"
transaction as well. The next time the application requests a resume,
the END section starts running.
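<p>
The effect of exit on the open transactions can be sketched as a tiny state
machine. Everything here is hypothetical and simplified: the names are
invented, END is assumed to finish within a single resume, and the instance is
assumed to be non-runnable once END has run.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
	int open_tx;       /* number of unfinished transactions */
	int exit_pending;  /* set once the script has called exit */
	int end_done;      /* set once END has been run */
} toy_exit_state_t;

typedef enum {
	TOY_RAN_TOP,       /* the topmost transaction progressed */
	TOY_RAN_END,       /* exit mode: END was run */
	TOY_NOT_RUNNABLE,  /* END already ran, nothing can be resumed */
	TOY_IDLE           /* empty stack, nothing to do */
} toy_resumed_t;

/* The script executes exit from BEGIN, main or a function:
   every open transaction is discarded at once. */
void toy_script_exit(toy_exit_state_t *s)
{
	assert(s != NULL);
	s->open_tx = 0;
	s->exit_pending = 1;
}

/* What a resume request does in each global state. */
toy_resumed_t toy_exit_resume(toy_exit_state_t *s)
{
	assert(s != NULL);
	if (s->end_done)
		return TOY_NOT_RUNNABLE;
	if (s->exit_pending) {
		s->end_done = 1;
		return TOY_RAN_END;
	}
	return (s->open_tx > 0) ? TOY_RAN_TOP : TOY_IDLE;
}
```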


<h2> return path 1.: <i>MAWK_EXER_INT_READ</i> </h2>
Assume stdin is a FIFO between the application and the script. The
first example script prefixes each line:
<pre>
{
	print "prefix:", $0
}
</pre>
The application fills the FIFO with some data that may contain one or
more full records, potentially ending with a partial (unterminated)
record. If the application resumes the script, it will try to
read all full records and process them. It will interrupt
execution and return MAWK_EXER_INT_READ the first time a full
record can't be read. This always happens "before the {}".
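<p>
The behaviour described above can be imitated with a toy FIFO in plain C. This
is not how libmawk is implemented; the <i>toy_</i> names are invented, records
are assumed to be newline-terminated, and the "script" is hard-wired to the
prefix example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_FIFO_SIZE 256

typedef struct {
	char buf[TOY_FIFO_SIZE];
	size_t len;   /* bytes currently waiting in the FIFO */
} toy_fifo_t;

/* Application side: append raw bytes, which may end in a partial record.
   Returns 0 on success, -1 if the data does not fit. */
int toy_fifo_write(toy_fifo_t *f, const char *data)
{
	size_t n = strlen(data);

	if (n > TOY_FIFO_SIZE - f->len)
		return -1;
	memcpy(f->buf + f->len, data, n);
	f->len += n;
	return 0;
}

/* Script side: run { print "prefix:", $0 } on every complete record.
   Returns the number of records processed; returning at all corresponds
   to the "would block" interruption, with any partial record left
   untouched in the FIFO. */
int toy_resume_main(toy_fifo_t *f, FILE *out)
{
	int records = 0;
	char *nl;

	assert(out != NULL);
	while ((nl = memchr(f->buf, '\n', f->len)) != NULL) {
		size_t rec_len = (size_t)(nl - f->buf);

		fprintf(out, "prefix: %.*s\n", (int)rec_len, f->buf);
		/* drop the record together with its terminator */
		f->len -= rec_len + 1;
		memmove(f->buf, nl + 1, f->len);
		records++;
	}
	return records;
}
```

Writing "one\ntwo\nthr" and resuming processes two records and leaves "thr"
pending; a second resume without new input processes nothing.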
<p>
A slightly more complicated script prefixes odd and even lines differently:
<pre>
{
	print "odd:", $0
	getline
	print "even:", $0
}
</pre>
This script may return with MAWK_EXER_INT_READ either before {}
or in the getline instruction. This means the application must not
assume that main is between two runs of the block when it returns: it may
be in the middle of one. (In the actual VM main starts with an implicit
getline, so there's no difference between the two cases.)
<p>
A similar situation is when an awk function is executing getline on a FIFO:
the application that calls the function shall not expect that the function
finishes and produces its return value in the initial execution request.
Instead the request will create a new <i>execution transaction</i> and
multiple resume calls may be needed until the function actually returns.
<p>
The application must keep filling the FIFO between resume calls:
if the script is waiting for input and no new input has arrived, the
resume call will return immediately.


<h2> return path 2.: <i>MAWK_EXER_INT_RUNLIMIT</i> </h2>
When runlimit is set, the VM returns after executing a certain number of
instructions. The application shall decide whether to simply resume or
to stop executing the script.
<p>
This feature is useful when the application is implemented as a single
threaded async loop: running a blocking script would block the entire loop.
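<p>
A minimal sketch of how a run limit slices execution, using a toy VM that only
counts instructions. The names and the convention that a limit of 0 means
"no limit" are assumptions of this example, not libmawk facts.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOY_INT_RUNLIMIT, TOY_RUN_DONE } toy_run_result_t;

typedef struct {
	unsigned long remaining;  /* instructions the script still has to run */
	unsigned long runlimit;   /* max instructions per call; 0: unlimited */
} toy_runvm_t;

/* Execute until the script finishes or the per-call budget is used up. */
toy_run_result_t toy_run_resume(toy_runvm_t *vm)
{
	unsigned long budget = vm->runlimit;

	assert(vm != NULL);
	while (vm->remaining > 0) {
		if (vm->runlimit != 0) {
			if (budget == 0)
				return TOY_INT_RUNLIMIT;
			budget--;
		}
		vm->remaining--;  /* "execute" one instruction */
	}
	return TOY_RUN_DONE;
}

/* One pass of a single-threaded loop per slice. Returns the number of
   slices used, or -1 if the application gives up after max_slices. */
int toy_run_drive(toy_runvm_t *vm, int max_slices)
{
	int slices;

	for (slices = 1; slices <= max_slices; slices++) {
		if (toy_run_resume(vm) == TOY_RUN_DONE)
			return slices;
		/* other event sources of the loop would be served here */
	}
	return -1;
}
```

The max_slices cut-off stands for the application's decision to stop
executing a script that keeps hitting the limit.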


<h2> return path 3.: <i>MAWK_EXER_DONE</i> or <i>MAWK_EXER_FUNCRET</i> </h2>
When BEGIN, main or END finishes, <i>MAWK_EXER_DONE</i> is returned. When
an awk function called by the application returns, <i>MAWK_EXER_FUNCRET</i>
is returned and the retc argument is filled with the return value cell
(which may be of cell type NOINIT in case there was no return value).
<p>
The application <b>shall never</b> expect that the initial call that
created the new <i>execution transaction</i> will end in
<i>MAWK_EXER_DONE</i> or <i>MAWK_EXER_FUNCRET</i>; when it does not,
a subsequent resume call eventually will.

<h2> return path 4.: <i>MAWK_EXER_EXIT</i> </h2>
Similar to <i>MAWK_EXER_DONE</i>, but means the script called exit.
This is legal even from an awk function call, in which case the
function will never have a return value (as the code cannot be resumed
any more). Normal awk rules apply: calling exit() from BEGIN or main
(or subsequent functions, called by the script or the application) puts
the script in exit mode and the next resume will run END. Calling exit from
END exits immediately, leaving the script in a non-runnable state.


<h2> conclusion: script execution </h2>
It is safe to assume that an execution call returns with the
transaction concluded if, and only if:
<ul>
	<li> the script is not allowed to use getline on FIFOs (which can not be guaranteed!) or there are no FIFOs or otherwise blocking input (i.e. all files are plain files); and
	<li> there is no run limit configured
</ul>
<p>
Since these are not guaranteed in most common use cases, the application
code should be prepared to:
<ul>
	<li> start executing the code and check if it's already finished
	<li> resume until it actually does finish
	<li> if the script returned <i>MAWK_EXER_INT_READ</i>: fill FIFOs or if that's not possible stop resuming as there won't be any progress
</ul>
<p>
Thus the following C pseudo-code should be used:
<pre>
TODO
</pre>
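<p>
A minimal sketch of the three steps listed above, run against a toy VM that
replays a fixed sequence of result codes. None of the <i>toy_</i> names exist
in libmawk; in a real application toy_exec() would be replaced by the start
and resume calls of the API, and toy_fill_input() by whatever feeds the FIFOs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for mawk_exec_result_t (names mirror the text only). */
typedef enum {
	TOY_EXER_INT_READ, TOY_EXER_INT_RUNLIMIT,
	TOY_EXER_DONE, TOY_EXER_FUNCRET, TOY_EXER_EXIT
} toy_exec_result_t;

typedef struct {
	const toy_exec_result_t *script;  /* results to replay, in order */
	size_t len, pos;
	int input_chunks;  /* how many times the FIFO can still be filled */
} toy_vm_t;

/* Stands for both the initial execution request and a resume. */
toy_exec_result_t toy_exec(toy_vm_t *vm)
{
	assert(vm->pos < vm->len);
	return vm->script[vm->pos++];
}

/* Returns nonzero if new input could be supplied. */
int toy_fill_input(toy_vm_t *vm)
{
	if (vm->input_chunks <= 0)
		return 0;
	vm->input_chunks--;
	return 1;
}

/* Start, check, then resume until the transaction concludes.
   Returns DONE, FUNCRET or EXIT on conclusion, or INT_READ if the
   script waits for input that the application cannot provide. */
toy_exec_result_t toy_run_to_completion(toy_vm_t *vm)
{
	toy_exec_result_t r = toy_exec(vm);  /* start the transaction */

	for (;;) {
		switch (r) {
			case TOY_EXER_DONE:
			case TOY_EXER_FUNCRET:
			case TOY_EXER_EXIT:
				return r;                /* finished */
			case TOY_EXER_INT_READ:
				if (!toy_fill_input(vm))
					return r;            /* no progress possible */
				break;
			case TOY_EXER_INT_RUNLIMIT:
				break;                   /* simply resume */
		}
		r = toy_exec(vm);                /* resume */
	}
}
```

An application embedded in an event loop would return to the loop instead of
spinning in for(;;), and call the resume step again from the next iteration.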