API: Virtual array vs. function

1. Introduction: virtual arrays

Libmawk implements support for virtual arrays: arrays that do not simply
store data in a hash. This feature is designed for maximal flexibility:
awk arrays have a set of hooks (function pointers) and whenever the
bytecode interpreter has to read, list or modify an array, it will
call the hook functions. The original array implementation is just
a set of hooks that simply store data without side effects. The ENVIRON[]
array is set up with another set of hooks that keep the process environment
in sync with the awk array (and call the original hooks for the actual data
storage).
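
The hook idea can be sketched in C as below. This is only an illustration
under stated assumptions: the type varr_t and the functions plain_get(),
plain_set() and environ_set() are hypothetical names, not libmawk's real API,
the storage is a tiny fixed-size table instead of a hash, and the list hook
is left out for brevity. The only real library call is POSIX setenv().

```c
/* Hypothetical sketch: these names are NOT libmawk's real API. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for setenv() */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define VARR_MAX 16   /* fixed capacity keeps the sketch short */

typedef struct varr varr_t;
struct varr {
    /* hooks the interpreter would call on every array access */
    const char *(*get)(varr_t *a, const char *idx);
    void (*set)(varr_t *a, const char *idx, const char *val);
    /* storage used by the default hooks; pointers are stored, not copied */
    const char *keys[VARR_MAX];
    const char *vals[VARR_MAX];
    int used;
};

/* Default hook: plain lookup, no side effects. */
static const char *plain_get(varr_t *a, const char *idx)
{
    for (int i = 0; i < a->used; i++)
        if (strcmp(a->keys[i], idx) == 0)
            return a->vals[i];
    return "";   /* a missing element reads as an empty string */
}

/* Default hook: plain store, no side effects. */
static void plain_set(varr_t *a, const char *idx, const char *val)
{
    for (int i = 0; i < a->used; i++) {
        if (strcmp(a->keys[i], idx) == 0) {
            a->vals[i] = val;
            return;
        }
    }
    assert(a->used < VARR_MAX);   /* the sketch does not grow its table */
    a->keys[a->used] = idx;
    a->vals[a->used] = val;
    a->used++;
}

/* ENVIRON-style hook: sync the process environment first, then call the
   original hook for the actual data storage. */
static void environ_set(varr_t *a, const char *idx, const char *val)
{
    setenv(idx, val, 1);      /* the side effect */
    plain_set(a, idx, val);   /* the data storage */
}
```

In this sketch a plain array would be initialized with plain_get/plain_set
and an ENVIRON-like array with plain_get/environ_set; the caller only ever
goes through a->get and a->set and never needs to know which one it has.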

2. API: arrays are function calls at the end of the day

This also means that an application developer has two alternative paths
to provide bindings to the application code:
 - explicit function calls
 - implicit function calls through an awk array

For example, if the application wants to expose direct I/O port access,
it may:
 - implement functions io_out(port, value) and io_in(port)
 - implement an array IO[port]

However, hidden side effects are usually not desirable. An awk program that
calls functions not declared in the awk code makes it clear that these
functions are external, whereas an array looks like an ordinary global awk
variable, so its side effects are not obvious.
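
A minimal C sketch of the two binding paths for the I/O port example follows.
It assumes a fake in-memory port space so that it runs anywhere; io_out() and
io_in() mirror the function names used above, while io_array_set(),
io_array_get() and their string-based signatures are hypothetical and do not
reflect how libmawk actually registers functions or array hooks.

```c
/* Hypothetical sketch: signatures are assumptions, not libmawk's API. */
#include <assert.h>
#include <stdlib.h>

/* Stand-in for real hardware: a fake 64k I/O space. */
static unsigned char fake_ports[0x10000];

/* Path 1: explicit functions, seen from awk as io_out(port, value)
   and io_in(port). */
static void io_out(unsigned port, unsigned value)
{
    fake_ports[port & 0xFFFF] = (unsigned char)value;
}

static unsigned io_in(unsigned port)
{
    return fake_ports[port & 0xFFFF];
}

/* Path 2: hooks behind an awk array IO[port]. Awk indices and values are
   strings here, so the hooks parse them and make the same calls. */
static void io_array_set(const char *idx, const char *val)
{
    assert(idx != NULL && val != NULL);
    io_out((unsigned)strtoul(idx, NULL, 0), (unsigned)strtoul(val, NULL, 0));
}

static unsigned io_array_get(const char *idx)
{
    assert(idx != NULL);
    return io_in((unsigned)strtoul(idx, NULL, 0));
}
```

Both paths end in the same io_out()/io_in() calls; the difference is only in
what the awk source looks like: io_out(888, 170) is visibly an external call,
while IO[888] = 170 reads like a plain assignment.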

3. How to choose?

While technically the two API designs are equivalent in the sense
that both end up in function calls, there are always considerations that
may make one better than the other for a specific application. Below are
the pros and cons:

Function calls, pro:
 - easy to recognize external calls in the awk code
 - can implement much more than lookup, set and list
 - may provide faster listing

Function calls, con:
 - longer awk source
 - when there are multiple different implementations under different names,
   an awk function that needs to operate on all of them may need to take a
   function name prefix and make dynamic function calls, which makes the
   awk code more complicated

Array, pro:
 - simple and short awk code, especially for listing ("for in")
 - when there are multiple different implementations, passing one of them
   to an awk function (by reference) hides the differences, keeping the awk
   function code simple
 - the whole set of data can be handled together: output of split, generic
   array print or load functions

Array, con:
 - hidden side effects (hurts the readability of the awk source)
 - always have to implement lookup, set and list

NOTE: a major advantage of arrays is listing (the "for in" construct).
In mawk this is implemented by saving a list of all indices that exist at
the time of entering the loop. While most of the time this means duplicating
string references only, it may still be slow and may take a considerable
amount of memory if the array is large. What counts as large may vary, but
generally 10^6 indices may already cause memory allocations in the megabyte
range.
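
To put a number on the note above, here is a hedged C sketch of the
"snapshot the indices on loop entry" idea. It assumes one pointer per index
is copied; snapshot_indices() and snapshot_bytes() are hypothetical helpers,
and mawk's real per-index cost may differ. Under that assumption, on a
platform with 8-byte pointers, 10^6 indices need about 8 MB for the
reference list alone.

```c
/* Hypothetical sketch: not mawk's actual implementation. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy the index references (not the strings) that exist right now.
   The caller iterates over the copy and frees it afterwards. */
static const char **snapshot_indices(const char *const *live, size_t n)
{
    const char **snap;

    assert(live != NULL || n == 0);
    snap = malloc(n * sizeof *snap);
    if (snap != NULL && n > 0)
        memcpy(snap, live, n * sizeof *snap);
    return snap;
}

/* Memory needed for the snapshot of n indices, one pointer each. */
static size_t snapshot_bytes(size_t n)
{
    return n * sizeof(const char *);
}
```

Because the loop walks the copy, changes made to the array inside the loop
body do not affect which indices are visited; the price is the extra
allocation and copy at loop entry.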

In practice the following considerations could easily decide the question:
 A. if there are more operations than lookup/set/list, use functions
 B. if only set/get is required, check whether listing looks useful;
    if it does, go for arrays (where listing is done via "for in").
    Similarly, if there is a function based API for set/get/list, reconsider
    using an array instead of custom listing, unless the arrays are large
    (see the NOTE above)
 C. are generic array functions useful in common applications? If so, arrays
    may be better. Generic array functions include:
     - awk function that prints all indices of an array
     - awk function doing some complex lookup, e.g. regex search on all indices
     - loading the array from a string using split(); useful when indices
       are small integers, typically counting from 0 or 1
 D. would there be alternative implementations and generic awk functions
    operating on them depending on their arguments? If so, arrays may be better
    as they can be passed by reference
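
Points C and D can be illustrated with a C analogue of "a generic function
that receives an array by reference". Everything here is an assumption made
for the sketch: listable_t, its two hooks, the sample NIC and disk index
lists and count_prefix() are hypothetical and only stand in for an awk
function doing a lookup over all indices of whatever array it is given.

```c
/* Hypothetical sketch: a C analogue, not awk or libmawk code. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal listing interface shared by all implementations. */
typedef struct {
    size_t (*count)(void);
    const char *(*index_at)(size_t i);
} listable_t;

/* Implementation 1: network interface names (sample data). */
static const char *nic_names[] = { "eth0", "eth0:1", "lo" };
static size_t nic_count(void) { return sizeof nic_names / sizeof nic_names[0]; }
static const char *nic_index_at(size_t i) { return nic_names[i]; }

/* Implementation 2: disk names (sample data). */
static const char *disk_names[] = { "sda", "sdb" };
static size_t disk_count(void) { return sizeof disk_names / sizeof disk_names[0]; }
static const char *disk_index_at(size_t i) { return disk_names[i]; }

/* Generic code: counts the indices that start with prefix. It only uses
   the hooks, so it works on any implementation passed by reference. */
static size_t count_prefix(const listable_t *arr, const char *prefix)
{
    size_t n = 0, plen;

    assert(arr != NULL && prefix != NULL);
    plen = strlen(prefix);
    for (size_t i = 0; i < arr->count(); i++)
        if (strncmp(arr->index_at(i), prefix, plen) == 0)
            n++;
    return n;
}
```

With a function based API the generic code would instead need to be told
which set of functions to call, which is the "function name prefix and
dynamic function calls" drawback listed among the cons of function calls.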

4. Examples

According to these, the I/O port example is better implemented with functions
as arrays offer no benefit in any of the above points:
 A. has only set and get; at this point arrays are as good as functions
 B. "for in" listing is not a typical application: array has no benefit
 C. printing all ports is rarely useful; complex lookups are not common;
    loading I/O space with split() is not useful;
    no obvious example of generic array code being useful on an IO array:
    array has no benefit
 D. having multiple alternative I/O spaces and passing one of these
    to an awk function as an array is unlikely: array has no benefit

An example where an array is more suitable is an API for network interfaces
(ifconfig): arrays NIC_IP[], NIC_NETMASK[], NIC_MTU[], etc., indexed by
the name of the NIC:
 A. has lookup, set and list; array is as good as functions
 B. it's a reasonable application to list all interfaces: "for in" is useful,
    array looks better
 C. printing all interfaces makes sense; complex lookup
    (e.g. "all alias interfaces") makes sense; loading the array may make
    sense (e.g. restoring network settings), although split() would not work
    here; array looks better
 D. no obvious alternative arrays to be passed as arguments; array has no
    benefit

In points A and D arrays are not better than functions (but not worse either),
but in B and C arrays definitely have an advantage for this application.