API: Virtual array vs. function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Introduction: virtual arrays

Libmawk implements support for virtual arrays: arrays that do not simply
store data in a hash. This feature is designed for maximal flexibility:
awk arrays have a set of hooks (function pointers) and whenever the
bytecode interpreter has to read, list or modify an array, it calls the
hook functions. The original array implementation is just a set of hooks
that store data without side effects. The ENVIRON[] array is set up with
another set of hooks that syncs the environment with the awk array (and
calls the original hooks for the actual data storage).

2. API: arrays are function calls at the end of the day

This also means that an application developer has two alternative paths
to provide bindings to the application code:
 - explicit function calls
 - implicit function calls through an awk array

For example, if the application wants to expose direct I/O port access,
it may:
 - implement functions io_out(port, value) and io_in(port)
 - implement an array IO[port]

However, having side effects behind an array access is usually not
desirable. An awk program that uses functions not declared in the awk
code makes it clear that these functions are external, whereas an array
may be just a global awk variable, so its side effects are not obvious.

3. how to choose?

While technically the two ways of the API design are equivalent in the
sense that both end up in function calls, there are always considerations
that may make one better than the other for a specific application.
Below are the pros and cons:

Function calls, pro:
 - easy to recognize external calls in the awk code
 - can implement much more than lookup, set and list
 - may provide faster listing

Function calls, con:
 - longer awk source
 - when there are multiple different implementations under different
   names, an awk function that needs to operate on all of them may need
   to get a function name prefix and do dynamic function calls, which
   makes the awk code look more complicated

Array, pro:
 - simple and short awk code, especially for listing ("for in")
 - when there are multiple different implementations, passing one of
   them to an awk function (by reference) hides the differences, keeping
   the awk function code simple
 - the whole set of data can be handled together: output of split,
   generic array print or load functions

Array, con:
 - hidden side effects (risks awk source readability)
 - always have to implement lookup, set and list

NOTE: a major advantage of arrays is listing (the "for in" construction).
In mawk this is implemented by saving a list of all indices that exist at
the time of entering the loop. While most of the time this means
duplicating string references only, it may still be slow and may take a
considerable amount of memory if the array is large. What counts as large
may vary, but generally 10^6 indices may already cause memory allocations
in the megabyte range.

In practice the following considerations could easily decide the question:

 A. if there are more operations than lookup/set/list, use functions

 B. if only set/get is required, check whether listing looks useful; if
    it does, go for arrays (where listing is done via "for in").
    Similarly: if there's a function based API for set/get/list,
    reconsider using an array instead of custom listing. This does not
    apply if the arrays are large (see the NOTE above).

 C. are generic array functions useful in common applications? If so,
    arrays may be better.
    Generic array functions include:
     - awk function that prints all indices of an array
     - awk function doing some complex lookup, e.g. regex search on all
       indices
     - loading the array from a string using split(); useful when
       indices are small integers, typically counting from 0 or 1

 D. would there be alternative implementations and generic awk functions
    operating on them depending on their arguments? If so, arrays may be
    better as they can be passed by reference

4. examples

According to these, the I/O port example is better implemented with
functions, as arrays offer no benefit in any of the above points:

 A. has only set and get; at this point arrays are as good as functions

 B. "for in" listing is not a typical application: array has no benefit

 C. printing all ports is rarely useful; complex lookups are not common;
    loading I/O space with split() is not useful; no obvious example of
    generic array code being useful on an IO array: array has no benefit

 D. having multiple alternative I/O spaces and passing one of these to
    an awk function as an array is not probable: array has no benefit

An example where an array is more suitable is an interface for network
interfaces (ifconfig): arrays NIC_IP[], NIC_NETMASK[], NIC_MTU[], etc.,
indexed by the name of the NIC:

 A. has lookup, set and list; array is as good as functions

 B. it's a reasonable application to list all interfaces: "for in" is
    useful, array looks better

 C. printing all interfaces makes sense; complex lookup (e.g. "all alias
    interfaces") makes sense; loading the array may make sense (e.g.
    restoring network settings), although split() wouldn't work; array
    looks better

 D. no obvious alternative arrays to be passed in arg; array has no
    benefit

In points A. and D. arrays are not better than functions (but not worse
either), but in B. and C. arrays definitely have an advantage for this
app.
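
The hook mechanism described in section 1, which underlies the whole
array-vs-function trade-off, can be sketched in a few lines of C. This is
only an illustration under stated assumptions: the names varray_t,
varray_assign(), varray_lookup(), plain_set(), env_set() and the
fixed-size storage are all hypothetical and are not the libmawk API.
The sketch shows a plain set of hooks that only stores data, and an
ENVIRON-like set of hooks that syncs the process environment (using
POSIX setenv()) and then calls the plain hook for the actual storage.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define VA_MAX 16   /* tiny fixed capacity keeps the sketch short */
#define VA_LEN 64   /* max length of an index or value, including \0 */

typedef struct varray varray_t;

/* Hypothetical hook table; names are illustrative, not libmawk's. */
struct varray {
    const char *(*get)(varray_t *a, const char *idx);
    int (*set)(varray_t *a, const char *idx, const char *val);
    char idx[VA_MAX][VA_LEN];
    char val[VA_MAX][VA_LEN];
    int used;
};

/* Plain hooks: store and fetch data, no side effects. */
static const char *plain_get(varray_t *a, const char *idx)
{
    int i;
    for (i = 0; i < a->used; i++)
        if (strcmp(a->idx[i], idx) == 0)
            return a->val[i];
    return "";  /* awk-like: a missing element reads as empty string */
}

static int plain_set(varray_t *a, const char *idx, const char *val)
{
    int i;
    if (strlen(idx) >= VA_LEN || strlen(val) >= VA_LEN)
        return -1;
    for (i = 0; i < a->used; i++)
        if (strcmp(a->idx[i], idx) == 0)
            break;
    if (i == a->used) {          /* new index: append if there is room */
        if (a->used == VA_MAX)
            return -1;
        strcpy(a->idx[i], idx);
        a->used++;
    }
    strcpy(a->val[i], val);
    return 0;
}

/* ENVIRON-like hook: sync the environment, then delegate the storage. */
static int env_set(varray_t *a, const char *idx, const char *val)
{
    if (setenv(idx, val, 1) != 0)
        return -1;
    return plain_set(a, idx, val);
}

void varray_init_plain(varray_t *a)
{
    a->get = plain_get;
    a->set = plain_set;
    a->used = 0;
}

void varray_init_env(varray_t *a)
{
    a->get = plain_get;
    a->set = env_set;
    a->used = 0;
}

/* What an interpreter would do for arr[idx] = val and for reading
   arr[idx]: call through the hooks without knowing what is behind. */
int varray_assign(varray_t *a, const char *idx, const char *val)
{
    assert(a != NULL && idx != NULL && val != NULL);
    return a->set(a, idx, val);
}

const char *varray_lookup(varray_t *a, const char *idx)
{
    assert(a != NULL && idx != NULL);
    return a->get(a, idx);
}
```

From the caller's point of view both arrays look the same: after
varray_init_env(&e), a call to varray_assign(&e, "FOO", "bar") also
changes the process environment, while the same call on an array set up
with varray_init_plain() only stores the value. This is exactly the
hidden side effect that the "Array, con" list warns about.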