Web Autonomy and Non-determinism

The Internet is composed of autonomous nodes and suffers from various forms of temporary and permanent failure. These attributes are permanent features of any global computer network - there can be no panacea for the Internet. Thus, the semantics of the Web domain are inherently non-deterministic. This non-determinism takes the form of uncertainty, since there are very few assumptions one can make before invoking a Web query. This uncertainty complicates the task of Web automation - the writing of programs that operate over or interact with the Web.

The relationship between a Web application and the Web medium is akin to the relationship between a desktop application and the underlying file-system. There is, however, a fundamental difference between the two due to the greater degree of determinism associated with local file-system access. Program failures arising from file-system access are absolute, and are usually repeatable.

Conversely, failure in accessing the Internet is intermittent, and is not absolute in that it often requires interpretation in the light of a number of other factors. Moreover, the frequency and intermittence of failure mean that enforcing program preconditions with respect to Web access is not viable - no assumptions can be made before Web access. When executing over a file-system, by contrast, the existence of a program's dependent files is usually a precondition for successful execution. To summarise, in forming Web queries there is uncertainty as to whether the named host exists and can be contacted, how long the query will take to complete (if it completes at all), the rate at which the transfer will proceed, and whether the document returned has the structure expected of it.

Although each of these has an analog in local file-system access, for each case the Web uncertainty is orders of magnitude greater. On host lookup via a Domain Name System (DNS) server, an intermittent error may result in failure to resolve a particular hostname. Human browsers learn to recognise the symptoms of this behaviour and simply retry the lookup two or three times before giving up and assuming that the hostname does not exist. With a file-system, however, a failed attempt to open a file is unlikely to succeed no matter how many times it is retried.
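
By way of illustration, the following is a minimal sketch in Python of this retry heuristic, assuming the standard socket library; the number of attempts and the delay between them are illustrative rather than prescriptive.

    # Retry hostname resolution a few times before concluding that the
    # host does not exist; attempt count and delay are illustrative.
    import socket
    import time

    def resolve_with_retries(hostname, attempts=3, delay=1.0):
        """Try to resolve a hostname, retrying on intermittent failure."""
        for attempt in range(attempts):
            try:
                return socket.gethostbyname(hostname)
            except socket.gaierror:
                # Resolution failed; the error may be intermittent,
                # so wait briefly and try again.
                if attempt < attempts - 1:
                    time.sleep(delay)
        # Only after repeated failures is the host assumed not to exist.
        return None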

When reading a file into memory, programmers rarely consider the time that the read operation will take to complete, since it is usually so small as to be irrelevant. However, for a Web query this time is indeterminate, and can often be lengthy, possibly to the extent that the query may never complete. Thus, time is of major significance. Human browsers often have a vague notion of 'timeout' for Web transfers, generally developed as a result of previous experience with accessing particular URLs, domains, or even countries.
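
A sketch of how such a timeout might be imposed programmatically, assuming Python's standard urllib; the timeout value is purely illustrative.

    import socket
    import urllib.request
    import urllib.error

    def fetch_with_timeout(url, timeout=10.0):
        """Fetch a URL, giving up if the transfer takes too long."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout):
            # The query may never complete; a timeout is treated as failure.
            return None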

An important observation to make at this point is that, despite the non-determinism of Web fetches, they are, like file-system accesses, still synchronous operations. Due to the length of time involved in a Web fetch, human browsers can be observed downloading several documents concurrently in order to achieve better performance. The documents may be unrelated, or may contain the same or similar information, in which case the surfer will simply accept the first document to be completely transferred. Such concurrency must be assumed to be an important part of any Web automation scheme.
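
This behaviour can be sketched as follows, assuming Python's standard concurrent.futures and urllib libraries: several (hypothetical) sources of the same information are fetched in parallel, and whichever transfer completes first is accepted.

    from concurrent.futures import ThreadPoolExecutor, as_completed
    import urllib.request

    def fetch(url, timeout=30.0):
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()

    def first_successful(urls):
        """Start all fetches concurrently; return the first to succeed."""
        if not urls:
            return None, None
        with ThreadPoolExecutor(max_workers=len(urls)) as pool:
            futures = {pool.submit(fetch, url): url for url in urls}
            for future in as_completed(futures):
                try:
                    return futures[future], future.result()
                except Exception:
                    continue  # That source failed; wait for the others.
        return None, None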

Transfer rate is usually irrelevant when accessing local file-systems. However, the rate of a Web transfer may quicken, slow down, or even drop to zero. Experienced Web surfers are good at interpreting these symptoms, and may terminate the transfer, retry it, or seek alternative sources of the same information.
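
One way of making such an interpretation programmatically is sketched below, again assuming Python's urllib and chunked reads; the minimum acceptable rate and the measurement window are illustrative thresholds, not recommendations.

    import time
    import urllib.request

    def rate_monitored_fetch(url, min_rate=1024.0, window=10.0):
        """Read a document in chunks, abandoning transfers whose
        throughput drops below min_rate bytes per second."""
        data = bytearray()
        try:
            with urllib.request.urlopen(url, timeout=window) as response:
                window_start = time.monotonic()
                window_bytes = 0
                while True:
                    chunk = response.read(8192)
                    if not chunk:
                        return bytes(data)   # Transfer completed.
                    data.extend(chunk)
                    window_bytes += len(chunk)
                    elapsed = time.monotonic() - window_start
                    if elapsed >= window:
                        if window_bytes / elapsed < min_rate:
                            return None      # Rate too low; give up.
                        window_start = time.monotonic()
                        window_bytes = 0
        except OSError:
            return None                      # Stalled or broken connection.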

Programmers usually assume that if a file was previously written in a specific format, then it can be read according to that same format. This is particularly true in the case where the programmer exclusively controls access to that file. However, Web host autonomy means that the structure of Web documents can never be presupposed. This is only an inconvenience for human browsers, but is potentially catastrophic for applications dependent on remote documents.
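
A defensive sketch of this principle follows, in which the structure of a fetched document is validated before it is used; the CSV format and field names here are entirely hypothetical.

    import csv
    import io

    EXPECTED_FIELDS = {"title", "url", "date"}   # hypothetical schema

    def parse_listing(raw_bytes):
        """Parse a remote CSV listing, rejecting documents whose
        structure no longer matches what the program expects."""
        try:
            text = raw_bytes.decode("utf-8")
        except UnicodeDecodeError:
            return None          # Not even decodable as expected.
        reader = csv.DictReader(io.StringIO(text))
        fields = reader.fieldnames
        if fields is None or not EXPECTED_FIELDS.issubset(fields):
            return None          # Structure has changed; refuse to proceed.
        return list(reader)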

In general, assumptions about the state of the Web can never be made, since there is inherent uncertainty in every access. If the development of a reliable Web application is to be achieved, the task of coping with this uncertainty must pervade all aspects of its design. Dealing with a broad class of non-deterministic behaviour is a complex task that exacerbates programmer errors. Abstractions must be sought over uncertainty and failure that allow applications to progress in the face of frequent, intermittent failure, and uncertainty as to whether failure has even occurred. In essence, the human 'skill' of browsing is the diagnosis of failure from visible symptoms in the context of many other factors such as geography, time of day, and perceived network congestion. This 'vague' interpretation of uncertainty and failure from observation is difficult to reproduce in contemporary programming languages that were originally designed to operate over more deterministic file-systems. It is the purpose of this proposal to outline a programming mechanism that facilitates this task.

Copyright Keith Sibson (keith@cs.strath.ac.uk), 1998.