This is the updated base of a paper I wrote with Horst.
It provides a good introduction to Dillo's internals.
(Highly recommended if you plan to patch or develop in Dillo)
It may not be exactly accurate (it's quite old), but it explains
the theory behind in some detail, so it's more than recommended
reading material.
--Jcid
-----------------------------------------------------
Parallel network programming of the Dillo web browser
-----------------------------------------------------
Jorge Arellano-Cid <jcid@dillo.org>
Horst H. von Brand <vonbrand@inf.utfsm.cl>
--------
Abstract
--------
Network programs face several delay sources when sending or
retrieving data. This is particularly problematic in programs
which interact directly with the user, most notably web browsers.
We present a hybrid approach using threads communicated through
pipes and signal driven I/O, which allows a non-blocking main
thread and overlapping waiting times.
------------
Introduction
------------
The Dillo project didn't start from scratch but mainly working
on the code base of gzilla (a light web browser written by Raph
Levien). As the project went by, the code of the whole source was
standardized, and the networking engine was replaced with a new,
faster design. Later, the parser was changed, the streamed
handling of data and its error control was put under the control
of the CCC (Concomitant Control Chain), and the old widget system
was replaced with a new one (Dw). The source code is currently
regarded as "very stable beta", and is available at
<http://www.dillo.org>. Dillo is a project licensed under the GNU
General Public License.
This paper covers basic design aspects of the hybrid approach
that the Dillo web browser uses to solve several latency
problems. After introducing the main delay-sources, the main
points of the hybrid design will be addressed.
-------------
Delay sources
-------------
Network programs face several delay-sources while sending or
retrieving data. In the particular case of a web browser, they
are found in:
DNS querying:
The time required to solve a name.
Initiating the TCP connection:
The three way handshake of the TCP protocol.
Sending the query:
The time spent uploading queries to the remote server.
Retrieving data:
The time spent expecting and receiving the query answer.
Closing the TCP connection:
The four packet-sending closing sequence of the TCP protocol.
In a WAN context, every single item of this list has an
associated delay that is non deterministic and often measured in
seconds. If we add several connections per browsed page (each one
requiring at least the 4 last steps), the total latency can be
considerable.
-----------------------------------
The traditional (blocking) approach
-----------------------------------
The main problems with the blocking approach are:
When issuing an operation that can't be completed
immediately, the process is put to sleep waiting for
completion, and the program doesn't do any other
processing in the meantime.
When waiting for a specific socket operation to complete,
packets that belong to other connections may be arriving,
and have to wait for service.
Web browsers handle many small transactions,
if waiting times are not overlapped
the latency perceived by the user can be very annoying.
If the user interface is just put to sleep during network
operations, the program becomes unresponsive, confusing
and perhaps alarming the user.
Not overlapping waiting times and processing makes
graphical rendering (which is arguably the central function
of a browser) unnecessarily slow.
---------------------
Dillo's hybrid design
---------------------
Dillo uses threads and signal driven I/O extensively to
overlap waiting times and computation. Handling the user
interface in a thread that never blocks gives a good interactive
``feel.'' The use of GTK+, a sophisticated widget framework for
graphical user interfaces, helped very much to accomplish this
goal. All the interface, rendering and I/O engine was built upon
its facilities.
The design is said to be ``hybrid'' because it uses threads
for DNS querying and reading local files, and signal driven I/O
for TCP connections. The threaded DNS scheme is potentially
concurrent (this depends on underlying hardware), while the I/O
handling (both local files and remote connections) is
definitively parallel.
To simplify the structure of the browser, local files are
encapsulated into HTTP streams and presented to the rest of the
browser as such, in exactly the same way a remote connection is
handled. To create this illusion, a thread is launched. This
thread opens a pipe to the browser, it then synthesizes an
appropriate HTTP header, sends it together with the file to the
browser proper. In this way, all the browser sees is a handle,
the data on it can come from a remote connection or from a local
file.
To handle a remote connection is more complex. In this case,
the browser asks the cache manager for the URL. The name in the
URL has to be resolved through the DNS engine, a socket TCP
connection must be established, the HTTP request has to be sent,
and finally the result retrieved. Each of the steps mentioned
could give rise to errors, which have to be handled and somehow
communicated to the rest of the program. For performance reasons,
it is critical that responses are cached locally, so the remote
connection doesn't directly hand over the data to the browser;
the response is passed to the cache manager which then relays it
to the rest of the browser. The DNS engine caches DNS responses,
and either answers them from the cache or by querying the DNS.
Querying is done in a separate thread, so that the rest of the
browser isn't blocked by long waits here.
The activities mentioned do not happen strictly in the order
stated above. It is even possible that several URLs are being
handled at the same time, in order to overlap waiting and
downloading. The functions called directly from the user
interface have to return quickly to maintain interactive
response. Sometimes they return connection handlers that haven't
been completely set up yet. As stated, I/O is signal-driven, when
one of the descriptors is ready for data transfer (reading or
writing), it wakes up the I/O engine.
Data transfer between threads inside the browser is handled by
pipes, shared memory is little used. This almost obviates the
need for explicit synchronization, which is one of the main areas
of complexity and bugs in concurrent programs. Dillo handles its
threads in a way that its developers can think of it as running
on a single thread of control. This is accomplished by making the
DNS engine call-backs happen within the main thread, and by
isolating file loading with pipes.
Using threads in this way has three big advantages:
The browser doesn't block when one of its child threads
blocks. In particular, the user interface is responsive
even while resolving a name or downloading a file.
Developers don't need to deal with complex concurrent
concerns. Concurrency is hard to handle, and few developers
are adept at this. This gives access a much larger pool of
potential developers, something which can be critical
in an open-source development project.
By making the code mostly sequential, debugging the code
with traditional tools like gdb is possible. Debugging
parallel programs is very hard, and appropriate tools are
hard to come by.
Because of simplicity and portability concerns, DNS querying
is done in a separate thread. The standard C library doesn't
provide a function for making DNS queries that don't block. The
alternative is to implement a new, custom DNS querying function
that doesn't block. This is certainly a complex task, integrating
this mechanism into the thread structure of the program is much
simpler.
Using a thread and a pipe to read a local file adds a
buffering step to the process (and a certain latency), but it has
a couple of significative advantages:
By handling local files in the same way as remote
connections, a significant amount of code is reused.
A preprocessing step of the file data can be added easily,
if needed. In fact, the file is encapsulated into an HTTP
data stream.
-----------
DNS queries
-----------
Dillo handles DNS queries with threads, letting a child thread
wait until the DNS server answers the request. When the answer
arrives, a call-back function is called, and the program resumes
what it was doing at DNS-request time. The interesting thing is
that the call-back happens in the main thread, while the child
thread simply exits when done. This is implemented through a
server-channel design.
The server channel
------------------
There is one thread for each channel, and each channel can
have multiple clients. When the program requests an IP address,
the server first looks for a cached match; if it hits, the client
call-back is invoked immediately, but if not, the client is put
into a queue, a thread is spawned to query the DNS, and a GTK+
idle client is set to poll the channel 5~times per second for
completion, and when it finally succeeds, every client of that
channel is serviced.
This scheme allows all the further processing to continue on
the same thread it began: the main thread.
------------------------
Handling TCP connections
------------------------
Establishing a TCP connection requires the well known
three-way handshake packet-sending sequence. Depending on network
traffic and several other issues, significant delay can occur at
this phase.
Dillo handles the connection by a non blocking socket scheme.
Basically, a socket file descriptor of AF_INET type is requested
and set to non-blocking I/O. When the DNS server has resolved the
name, the socket connection process begins by calling connect(2);
{We use the Unix convention of identifying the manual section
where the concept is described, in this case
section 2 (system calls).}
which returns immediately with an EINPROGRESS error.
After the connection reaches the EINPROGRESS ``state,'' the
socket waits in background until connection succeeds (or fails),
when that happens, a callback function is awaked to perform the
following steps: set the I/O engine to send the query and expect
its answer (both in background).
The advantage of this scheme is that every required step is
quickly done without blocking the browser. Finally, the socket
will generate a signal whenever I/O is possible.
----------------
Handling queries
----------------
In the case of a HTTP URL, queries typically translate into a
short transmission (the HTTP query) and a lengthy retrieval
process. Queries are not always short though, specially when
requesting forms (all the form data is attached within the
query), and also when requesting CGI programs.
Regardless of query length, query sending is handled in
background. The thread that was initiated at TCP connecting time
has all the transmission framework already set up; at this point,
packet sending is just a matter of waiting for the
write signal (G_IO_OUT) to come and then sending the data. When
the socket gets ready for transmission, the data is sent using
IO_write.
--------------
Receiving data
--------------
Although conceptually similar to sending queries, retrieving
data is very different as the data received can easily exceed the
size of the query by many orders of magnitude (for example when
downloading images or files). This is one of the main sources of
latency, the retrieval can take several seconds or even minutes
when downloading large files.
The data retrieving process for a single file, that began by
setting up the expecting framework at TCP connecting time, simply
waits for the read signal (G_IO_IN). When it happens, the
low-level I/O engine gets called, the data is read into
pre-allocated buffers and the appropriate call-backs are
performed. Technically, whenever a G_IO_IN event is generated,
data is received from the socket file descriptor, by using the
IO_read function. This iterative process finishes upon EOF (or on
an error condition).
----------------------
Closing the connection
----------------------
Closing a TCP connection requires four data segments, not an
impressive amount but twice the round trip time, which can be
substantial. When data retrieval finishes, socket closing is
triggered. There's nothing but a IO_close_fd call on the socket's
file descriptor. This process was originally designed to split
the four segment close into two partial closes, one when query
sending is done and the other when all data is in. This scheme is
not currently used because the write close also stops the reading
part.
The low-level I/O engine
------------------------
Dillo I/O is carried out in the background. This is achieved
by using low level file descriptors and signals. Anytime a file
descriptor shows activity, a signal is raised and the signal
handler takes care of the I/O.
The low-level I/O engine ("I/O engine" from here on) was
designed as an internal abstraction layer for background file
descriptor activity. It is intended to be used by the cache
module only; higher level routines should ask the cache for its
URLs. Every operation that is meant to be carried out in
background should be handled by the I/O engine. In the case of
TCP sockets, they are created and submitted to the I/O engine for
any further processing.
The submitting process (client) must fill a request structure
and let the I/O engine handle the file descriptor activity, until
it receives a call-back for finally processing the data. This is
better understood by examining the request structure:
typedef struct {
gint Key; /* Primary Key (for klist) */
gint Op; /* IORead | IOWrite | IOWrites */
gint FD; /* Current File Descriptor */
gint Flags; /* Flag array */
glong Status; /* Number of bytes read, or -errno code */
void *Buf; /* Buffer place */
size_t BufSize; /* Buffer length */
void *BufStart; /* PRIVATE: only used inside IO.c! */
void *ExtData; /* External data reference (not used by IO.c) */
void *Info; /* CCC Info structure for this IO */
GIOChannel *GioCh; /* IO channel */
guint watch_id; /* glib's event source id */
} IOData_t;
To request an I/O operation, this structure must be filled and
passed to the I/O engine.
'Op' and 'Buf' and 'BufSize' MUST be provided.
'ExtData' MAY be provided.
'Status', 'FD' and 'GioCh' are set by I/O engine internal
routines.
When there is new data in the file descriptor, 'IO_callback'
gets called (by glib). Only after the I/O engine finishes
processing the data are the upper layers notified.
The I/O engine transfer buffer
------------------------------
The 'Buf' and 'BufSize' fields of the request structure
provide the transfer buffer for each operation. This buffer must
be set by the client (to increase performance by avoiding copying
data).
On reads, the client specifies the amount and where to place
the retrieved data; on writes, it specifies the amount and source
of the data segment that is to be sent. Although this scheme
increases complexity, it has proven very fast and powerful. For
instance, when the size of a document is known in advance, a
buffer for all the data can be allocated at once, eliminating the
need for multiple memory reallocations. Even more, if the size is
known and the data transfer is taking the form of multiple small
chunks of data, the client only needs to update 'Buf' and
BufSize' to point to the next byte in its large preallocated
reception buffer (by adding the chunk size to 'Buf'). On the
other hand, if the size of the transfer isn't known in advance,
the reception buffer can remain untouched until the connection
closes, but the client must then accomplish the usual buffer
copying and reallocation.
The I/O engine also lets the client specify a full length
transfer buffer when sending data. It doesn't matter (from the
client's point of view) if the data fits in a single packet or
not, it's the I/O engine's job to divide it into smaller chunks
if needed and to perform the operation accordingly.
------------------------------------------
Handling multiple simultaneous connections
------------------------------------------
The previous sections describe the internal work for a single
connection, the I/O engine handles several of them in parallel.
This is the normal downloading behavior of a web page. Normally,
after retrieving the main document (HTML code), several
references to other files (typically images) and sometimes even
to other sites (mostly advertising today) are found inside the
page. In order to parse and complete the page rendering, those
other documents must be fetched and displayed, so it is not
uncommon to have multiple downloading connections (every one
requiring the whole fetching process) happening at the same time.
Even though socket activity can reach a hectic pace, the
browser never blocks. Note also that the I/O engine is the one
that directs the execution flow of the program by triggering a
call-back chain whenever a file descriptor operation succeeds or
fails.
A key point for this multiple call-back chained I/O engine is
that every single function in the chain must be guaranteed to
return quickly. Otherwise, the whole system blocks until it
returns.
-----------
Conclusions
-----------
Dillo is currently in very stable beta state. It already shows
impressive performance, and its interactive ``feel'' is much
better than that of other web browsers.
The modular structure of Dillo, and its reliance on GTK1 allow
it to be very small. Not every feature of HTML-4.01 has been
implemented yet, but no significant problems are foreseen in
doing this.
The fact that Dillo's central I/O engine is written using
advanced features of POSIX and TCP/IP networking makes its
performance possible, but on the other hand this also means that
only a fraction of the interested hackers are able to work on it.
A simple code base is critical when trying to attract hackers
to work on a project like this one. Using the GTK+ framework
helped both in creating the graphical user interface and in
handling the concurrency inside the browser. By having threads
communicate through pipes the need for explicit synchronization
is almost completely eliminated, and with it most of the
complexity of concurrent programming disappears.
A clean, strictly applied layering approach based on clear
abstractions is vital in each programming project. A good,
supportive framework is of much help here.