Seafood is the name of a software suite developed as a TERENA small project for the analysis of data extracted from web cache log files. Log files using the widespread Squid-style native format are processed offline and stored in an SQL database. A web-based user interface to access typical selections is part of Seafood. The long-term and easy-to-access storage allows for trend analysis as well as complex views and correlations between different data sets and members of a cache mesh.
Keywords: Log file analysis; relational database; SQL; Squid
Seafood is a software suite for the offline analysis of access.log files from the widespread Squid web cache and similar formats, as provided by various cache vendors like NetCache, Inktomi and CacheFlow. The resulting summary can be viewed as a text file containing tables in the tradition of Calamaris. Another output option gathers data extracts into an SQL database in order to create a long-term, easy-to-access store. The necessary scripts for collecting summaries and retrieving graphical results via a web user interface are part of the Seafood software suite. Using an SQL database allows for trend analysis, complex views and correlations between different data sets, as well as providing a source for data mining.
A large number of administrators in today's caching community maintain cache servers based on the Squid software. The development of the Squid software during the last years has resulted in new versions with impressive improvements in performance, flexibility and stability. In contrast to the development of the caching software itself, there is a lack of accompanying tools to administer a Squid cache and to measure its effectiveness. The purpose of this project was to develop tools to gather and extract statistics from a number of Squid caches and present the results in a visual way.
The Squid software generates log files which can enable the administrator to find out interesting information about the caching service - if the necessary numbers can be successfully extracted from the sheer amount of data. Due to the size of the log files, and for performance issues in generating long-term statistics, many cache maintainers only look into these log files occasionally to get an impression of the current situation.
The log file is usually processed offline, which offers a number of advantages. Online processing would require constant polling of the data of interest from the Squid Cache whilst the cache was operating. Offline processing allows the administrator to arbitrarily select the time period and the detail of the output whilst processing the file.
Still, most solutions suffer from the volume of possible and interesting data. Also, most data becomes more interesting if combined with other related data. Thus the log file parsing results must be easy to incorporate into an SQL database. From this database, different sets of simple data can be put together into a more complex view of the behaviour of a cache or set of caches, all with simple SQL statements. However, most users, and even administrators, prefer easy-to-handle tools. Thus a custom-tailored web interface, based on the database, should be able to produce graphs on the fly from the data sets selected by the user.
Currently, there are a few log file processors available, but all suffer from limitations. Some processors will give numbers; some will generate long ASCII based reports, whilst others are designed to produce images with coloured graphs. Additionally, there is no Squid statistics tool which supports an interface to standard databases.
A very well-known processor is 'Calamaris', which is implemented in Perl. Each weekday the 10 caches in the DFN caching service accumulate well over 4 GB of log file data. Calamaris spends over 18 hours on a high-performance workstation to process this data. Other log file processors might be faster, but less detailed in their output.
For performance reasons, and to get an impression of Calamaris' potential, a prototype implementation of Calamaris was developed in C++. This port is substantially faster, and might be sped up even further for multiprocessor and/or multihost environments, but it currently lacks support for the up-to-date version 2.x of Squid.
A promising prototype port of the well-known Calamaris tool to the better-performing C++ has already been completed, as far as Squid 1.x log files are concerned. So far, this prototype only produces textual results. The proposed project is to look into parsing Squid 2.x and (perhaps) NetCache results, and to return the results in a way which is easy to incorporate into a database. Furthermore, a prototype web interface module is to be prepared which shows how different on-the-fly views of the data are achieved.
The scope of the work consists of several issues:
In co-operation with the TF-CACHE group, acquire a set of interesting, yet simple, information to be extracted from the log files, preferably based on the information already available in Calamaris.
Port the C++ version to parse Squid-2 log files and preferably also NetCache log files. The results from the previous phase should be incorporated into this phase.
Design an efficient scheme to store results in a relational database. A linear growth of the database might or might not be desirable; the topic will have to be discussed with TF-CACHE. If a linear growth is not desirable, a scheme to reduce older data will have to be developed.
Implementation of a prototype web interface to extract interesting views from the database and generate on-the-fly graphs to present the results.
The TERENA small project "Extended Cache Statistics" was accepted on 4 December 1998, and ran from 1 April 1999 till 31 December 1999. The project was conducted by staff working for the DFN Caching project at the Institute for Computer Networks and Distributed Systems of the University of Hanover, Germany. The DFN caching project is sponsored by the German Ministry of Education, Science, Research and Technology (BMBF) through the German Research Network Association (DFN Verein). The Trans-European Research and Education Networking Association (TERENA) sponsors the "Extended Cache Statistics" project, which is aimed at the exploration of data by correlation through a database. A side product is the long-term availability of data. The project was proposed by TF-CACHE, the TERENA Caching task force.
The project "Extended Cache Statistics" focuses on delivering a long term view of data obtained from cache log files originating from cache servers running Squid or possibly NetCache software. The purpose of the specification is to outline a mandatory set of data to retrieve from the log files. Note that problems related to the extraction of the data are not addressed at this point. The database design will also be addressed later.
With the help of a SQL database, different views to the various performance data can be obtained. Two intervals I1 and I2 have to be defined during the database setup. The first interval I1 defines a period of time for which most of the data is being aggregated, e.g. a daily period. For the performance evaluation, a finer granularity is necessary, resulting in an interval I2 for the performance, e.g. an hourly view. After initial assignment, the intervals cannot be changed.
All data should be put into the database in a form which allows for maximum precision. For instance, the percentage calculations should be based on the stored absolute values. The conversion into percentages will be done during the retrieval and output phase, e.g. by a matching SQL statement. Almost all the data examined needs a similar set of sums to be stored into the different tables: the number of requests, the volume transferred, and the duration of the transfers.
Although the data will not be stored in any particular order within the database, the output needs to match contradictory sorting criteria. For most administrators, two alternative kinds of views are believed to be of use: sorted by the number of requests and sorted by the transferred volume.
The duration mentioned above can be used to calculate the bandwidth requirement of the transmitted data. However, one has to be aware that the duration from a Squid log file is a highly inaccurate value, and thus results based on calculations containing this value should be regarded with suspicion. The value stored into the database is usually a sum of single durations. Unless mentioned otherwise, the sum is independent of hits or misses.
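To make the estimate concrete (my own formulation, not part of the original specification), the mean bandwidth demand over an interval can be approximated as

\[ \text{bandwidth} \approx \frac{\sum_i \text{size}_i}{\sum_i \text{duration}_i} \]

where the sums run over the requests of the interval. Because concurrent requests have overlapping durations, this figure tends to underestimate the true peak demand, adding to the inaccuracy noted above.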
As shown in figure 2.1, three kinds of traffic can be observed. All queries to the cache are logged. Thus, the traffic from the client side matches the complete traffic as seen in the log file. The difference to the server side traffic is the volume saved by the cache.
Figure 2.1: Traffic relations.
The well-known Calamaris  log file processor terms the client side traffic as "incoming". Using a different perspective, the direction of the data stream can also be viewed as flowing from the source, the web server, to the sink, the browser. Hence, from this point of view, one would call the traffic leaving the cache "outgoing". In order to reduce misunderstandings, the neutral phrase client side will be used in this document.
In order to avoid further misunderstandings, a few more phrases are defined here. For instance, the client querying this cache can be either a browser or another cache. Dependent caches are sometimes called children.
The server side sees only that part of the traffic which cannot be satisfied by the cache. At the server side, queries can either be sent to other caches or to the origin site. When talking about cache hierarchies, the other caches can either be on the same hierarchy level as the current cache, or they are on a higher level upward in the hierarchy. Both kinds of caches are called neighbours (Squid calls it neighbor, using the US spelling) or peers. In order to distinguish the hierarchical level of the queried cache, caches on the same level of the hierarchy are called siblings. Caches on a higher level are called parents.
Due to the frequent changes of the names used for different hits and for the hierarchy codes, the definition of both should be kept variable, preferably as part of a configuration file. Thus it is hoped to maintain a certain degree of upward compatibility and flexibility in order to cope with future changes. A variable definition should also simplify the parsing of foreign log file formats. Erroneous requests should be recognized by their HTTP status code, as Squid-2 no longer logs the special ERR code. The scheme of keeping the definitions variable can also be used to accommodate particular needs in the definition of "what is a hit" which may vary from administrator to administrator.
Although Squid-1 is no longer supported, it still has a considerable user base. Therefore a certain degree of downward compatibility seems desirable. However, Squid-2 will be the default.
Of importance is all data which enables the administrator to look at a related group of caches as a whole. The report should account for the amount of traffic flowing into the group, the amount going out of the group and the amount of traffic generated by the group itself. In addition, the inter cache communication may use some bandwidth, and this needs to be looked at. On the other hand, an administrator of a downstream dependent cache might be more interested in how much bandwidth or latency a cache saves him. The sections shown in this chapter constitute minimum requirements.
The server side traffic can be viewed as three different kinds of traffic: direct traffic to the origin servers, traffic to parent caches, and traffic to sibling caches.
The peak values are maintained in order to give a rough overview of the general cache behaviour. (Note: these peak values are not the same as the classical Calamaris-1 approach for finding peak values.) For reasons of backward compatibility, the … should be counted as is seen fit.
The results generated are presented as bar charts. The diagrams are intended as an executive summary. The distinction between size and requests as seen in Calamaris-2, as well as the further distinction between direct, parent and sibling traffic also seems desirable.
This section focuses on TCP traffic. Many cache administrators express an interest in the domains for which their cache is queried. From the 2nd level domains the top-level domains (TLD) could possibly be generated using appropriate SQL statements. However, the almost unlimited number of 2nd level domains forces one to store only a subset of the results. For this reason, the TLDs cannot be determined from the 2nd level domains.
The two artificial top-level domains <numeric> and <error> were introduced. The <numeric> domain is used for dotted quad destinations and single big integer destinations. The latter may not be truly portable, but are frequently seen in cache log files, since many operating systems can automagically convert single big integers into correct Internet addresses. The <error> domain is used for all unparsable domains, and for all top-level domains which were not explicitly configured. Since URLs are basically user input, they are prone to errors, and should be regarded with suspicion.
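As an illustration of the single big integer form (a hypothetical example, not code from Seafood), the following sketch prints the dotted quad equivalent of such a destination:

#include <cstdint>
#include <cstdio>

// Hypothetical example: the URL host 3232235777 denotes the same
// address as the dotted quad 192.168.1.1.
int main() {
    std::uint32_t host = 3232235777u;                 // single big integer destination
    std::printf("%u.%u.%u.%u\n",
                (host >> 24) & 0xffu, (host >> 16) & 0xffu,
                (host >> 8) & 0xffu, host & 0xffu);   // prints 192.168.1.1
    return 0;
}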
This section focuses on the client side TCP traffic. Output is sorted by the rules laid out in the introduction. As with the domains, the MIME types and subtypes suffer a similar proliferation at the 2nd level. For this reason, the counted data is limited to those types and subtypes specifically mentioned in the configuration file. Any media type or subtype not listed in the configuration file is counted as …
There are different methods to access an object on a web server, as laid out in the respective standards. At the time of writing, Squid-2 knew about the following methods: …, and ICP_QUERY. The latter is used for inter cache communication using the ICP protocol. Most of the methods are non-cacheable. Unknown methods are logged as …
This section answers the question "who asks me". It will look at the complete traffic in a coarse overview, sorted by the HIT or MISS status. For TCP, a further distinction into HIT, MISS and ERR is used. Please note that queries can originate from browsers as well as caches. Queries from browsers and caches are collectively referred to as querying instances.
This section focuses on the question "whom do I ask". It looks at the TCP traffic generated by this cache. Usually, caches at the top-level of a hierarchy go to the source in a direct fashion. A certain percentage of the queries can also be answered with the help of peers. Thus, a distinction between direct, parent and sibling traffic is necessary.
Additionally, for each peer the way an object was retrieved from the peer must be distinguished. Usually, an object will be found on a peer employing an ICP query, and transferred via HTTP. As new inter cache communication protocols arise, peer objects can also be found with the help of those protocols, e.g. cache digests. Further protocols like the Cache Array Routing Protocol (CARP) are also supported. Due to the variety of inter cache communication protocols, the handling of the generated traffic must be kept versatile in order to accommodate future methods of inter cache communication. Also note that the traffic generated by inter cache communication is usually only recorded in the log files of the peering caches, not in the log file of the retrieving cache.
Two different views are possible. One is sorted by the hierarchy code as returned by Squid. The other view additionally prints statistics for all peers queried. Essentially, the same basic table is being referred to.
In a similar way to Calamaris-1, the traffic generated by each querying instance should be listed separately. It might be possible to combine this view with the one mentioned in section 2.3.5. It is of interest to separate the TCP and UDP traffic and to further separate the TCP traffic into HIT, MISS or OTHER. Once again, the problem of vastly growing tables is noted, but not addressed at this point.
Whereas section 2.3 lists the mandatory minimum requirements, extensions constitute those features which are "nice to have".
When going directly to the origin site, it might be of interest to know which border gateways of the network were used, and to what extent. The destination AS number can in turn be resolved into the border gateway address with the help of routing protocols outside the scope of this project. The yield is the amount of traffic on the external links of the network.
In order to determine the AS number, the host name of the destination first must be resolved into a numerical address. By converting the host address into a class C network address, the network can be resolved with the help of a local whois mirror and thus turned into an AS number.
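A minimal sketch of that reduction step (an assumption about the layout, not the Seafood implementation):

#include <cstdint>
#include <cstdio>
#include <string>

// Sketch: reduce a resolved host address to its class C network, so that
// all hosts on the same /24 share one whois look-up and one cache entry.
std::string classCNetwork(std::uint32_t host) {
    std::uint32_t net = host & 0xffffff00u;    // zero the host octet
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%u.%u.%u.0",
                  (net >> 24) & 0xffu, (net >> 16) & 0xffu, (net >> 8) & 0xffu);
    return buf;                                // e.g. "130.75.5.0"
}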
Looking at the TCP traffic for the three classes HIT, MISS and OTHER, the object size and request duration can be summed up using fixed-size classes. The classes are calculated using the logarithm to the base of 2, although zero is a valid input, and thus needs special treatment. Three kinds of diagrams can be generated from this statistical distribution.
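One possible reading of the class calculation, sketched in C++ (the exact rounding used by Seafood may differ):

#include <cmath>

// Class 0 holds the value 0; class k holds values v with 2^(k-1) <= v < 2^k.
int sizeClass(double value) {
    if (value <= 0.0)
        return 0;                                  // zero needs special treatment
    return 1 + static_cast<int>(std::floor(std::log2(value)));
}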
The problems arising from inaccurate duration as seen in the log file are noted.
Of interest for all TCP-based queries is the gateway protocol as seen in the URL fed to the cache. The protocol is the first part of the URL before the "://", as defined in RFC 2396.
Like the tables in sections 2.2 and 2.3, it is possible for this table to grow without bounds. A possible solution might be to limit the number of protocols parsed, and record all valid names in the configuration file. All otherwise unknown protocols are counted as <unknown>. As a standard, the … and gopher should be part of distinct sums.
The Seafood log file analyser is known to parse Squid-2 access.log files. Older versions of Squid might be supported, but were not tested. Adaptation to older log files only needs some changes to the configuration file. If you are familiar with the Calamaris log file analyser, you will recognize the textual output.
Please refer to Deliverable 2-3 for details on the compilation and installation process. Seafood is known to be compilable by the GNU C++ 2.8 series, the new g++ 2.95, or a native C++ compiler on a Unix platform which understands bool, for-scoped loop variables, templates and essential STL constructs. Egcs might do the trick, too. Seafood was successfully compiled and tested on the systems mentioned in table 3.1. With Irix, only the native compiler will work.
|Operating system||Compilers|
|Solaris 2.6, 7||SUN Workshop Compiler 5.0, g++ 2.8.1|
|Irix 6.2, 6.5||Irix CC 7.2|
|Linux 2.2 w/ glibc2||g++ 2.8.1, g++ 2.95.1|
The Seafood log file processor can read compressed files. Compression support is an option you can turn off at compile time. Seafood is also capable of detecting the number of CPUs currently online, and will use an external decompressor in a separate process if there is more than one CPU available and if the input is seekable. The decompressors will be searched for using the run-time PATH environment variable.
The environment variables are inherited from the parent process, with the exception of the locale, which will be set to "C" at the start of Seafood. Using the internal decompressor is useful for single CPU machines, and for trying to decompress when reading from pipes or other non-seekable sources.
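The CPU detection could look roughly like the following sketch on the platforms from table 3.1 (an assumption on my part; the sysconf name varies slightly between systems):

#include <unistd.h>

// Sketch: query the number of CPUs currently online. The sysconf name is
// _SC_NPROCESSORS_ONLN on Solaris and Linux; Irix uses _SC_NPROC_ONLN.
long onlineCpus() {
    long n = sysconf(_SC_NPROCESSORS_ONLN);    // -1 if the system cannot tell
    return n > 0 ? n : 1;                      // fall back to a single CPU
}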
Seafood makes use of GNU software. For instance, the Makefile is written for a GNU-style make. For the parsing of the configuration file, flex and bison are used. The standard Unix variants lex and yacc might or might not do the job; they were never tested. Past experience showed that it is just too difficult to accommodate every possible flavour of lex.
If you are planning on parsing Squid-2 log files, you won't need to make many changes to the seafood.conf configuration file. However, please read the comments in seafood.conf, and configure it to match your needs. There are probably a few things you would like to change, e.g. the location of your nearest whois mirror. If you are located in Europe, you might want to try the RIPE whois service.
Install the Seafood binary at a place of your convenience. Currently, you need to store the configuration file seafood.conf in the same directory as the Seafood binary. Alternatively, you can use the -f conffile command line option to supply a different location for the configuration file. Table 3.2 shows the currently understood configuration options.
|Option||Value type||Default|
|debug_level||<int> [ , <int> ]||0,0|
|irr_server||<string> | none||"whois.ripe.net"|
|method_list||<id list>||see below|
|hierarchy_list||<id list>||see below|
|status_list||<id list>||see below|
|scheme_list||<id list>||see below|
|mediatype_list||<string list>||see below|
|media_subtype||<string> <string list>||see below|
|domain_list||<string list>||see below|
The configuration file uses a Squid-like style. The grammar is (almost) format free and case sensitive. Comments are started with a hash (#) character and extend to the end of the line. All configuration items need to be terminated with a semicolon - this is more like Pascal than C. The <bool> type may take either the value true or false, regardless of case. The <int> type should be clear. A <string> is a double quoted string which must not extend over line boundaries. A quote character may be introduced into the string by prefixing it with a backslash. A backslash must for that reason also be escaped by a backslash.
For further details on the configuration syntax and configuration details, refer to Deliverable 2-3. Some configuration items contain list values. Lists also include an aliasing feature which allows the creation of a kind of symbolic link to a previously mentioned list item. Some lists are rather specialized:
The method list is by default empty, but the supplied file contains all methods known to RFC 2616 (HTTP/1.1), Squid-2 (PURGE) and those extensions used by RFC 2518 (WebDAV). You should use the methods from the supplied configuration file. There are no modifiers or …
The hierarchy list is by default empty. The configuration file should list all possible values for the hierarchy tag in column 9 of the access.log file. Additionally, each item may have a modifier direct, peer, parent, or none appended to it. Some people prefer to count a PARENT_HIT the same as a SIBLING_HIT, as there is virtually no difference between those two. You can configure the appropriate list item to suit your needs.
The status list concerns the status tag from column 3 of the access.log file. The supplied configuration file lists the status items known to Squid-2.2. Each item must have one of the modifiers tcp, udp or none. Additionally, each item considered a hit should have the hit modifier attached to it, too. Please note that not all status tags ending in hit imply not going to the origin server. For the very reason that your notion of a hit might be different from mine, hits are kept configurable.
The scheme list is by default empty. The configuration file supplies a string list with reasonable values. You should include the error scheme in order to be able to distinguish between Squid generated errors, logged with the scheme name error, and other kinds of errors like unknown schemes.
The media list of the configuration file contains the media types assigned by the IANA, and some additional media types seen in local log files. The media type is the part before the slash in the HTTP content type header. As this value is basically user input (e.g. from authors of web pages and server side includes), many strange things can be seen here. Because media types are user input without Squid sanity checks, a warning option was not deemed feasible (yet).
For each media type, you can define the sub types you are interested in. Each media type may have only one sub type list attached to it. If you use more than one, currently the new list will overwrite the old list. All sub types not mentioned will be counted as sub type …
The domain list contains all top-level domains currently known. Please check that none are missed. The virtual top-level domain <numeric> counts all those URL hosts entered as dotted quad or as big integer number. The virtual top-level domain <empty> represents malformed URLs. The <unknown> value will be used for anything not in the list.
The configuration options log_fqdn and irr_server control the use of external look-up services. If you set the Seafood option log_fqdn to true, the clients in the client request tables will be printed symbolically. While most clients are usually local, this will take a small amount of time. Using the irr_server has a high time consumption. If you are using AS based look-ups, you should have applied the log_ip_on_direct patch by Henrik Nordström. Otherwise, the destination address must be looked up before the AS number can be determined. Also, the whois server might be quite slow. Setting the irr_server to the value none disables the AS look-up feature.
Both DNS look-ups and AS look-ups will be cached for later use. The cached information resides in main memory during a run, and will be stored in separate GDBM files between runs. Please note that the GDBM caches can grow quite large.
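For illustration, storing a cached answer into a GDBM file might look like the following sketch (the key and value layout are assumptions, not taken from the Seafood sources):

#include <gdbm.h>
#include <string>

// Keep a look-up result in a GDBM file so it survives between runs.
bool storeLookup(GDBM_FILE db, const std::string& host, const std::string& answer) {
    datum key = { const_cast<char*>(host.c_str()),   static_cast<int>(host.size()) };
    datum val = { const_cast<char*>(answer.c_str()), static_cast<int>(answer.size()) };
    return gdbm_store(db, key, val, GDBM_REPLACE) == 0;
}

// Usage sketch:
//   GDBM_FILE db = gdbm_open("dns-cache.gdbm", 0, GDBM_WRCREAT, 0644, 0);
//   storeLookup(db, "www.example.org", "192.0.2.1");
//   gdbm_close(db);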
The interval for which any DNS or AS look-up will be cached can be determined separately for DNS and AS as well as for successful and failed look-ups. Due to the blocking nature of the look-up calls, external helper programs such as irrhelper are started. The number of those helpers can be configured separately for DNS and AS look-ups. The number of helper processes is limited to between 1 (minimum) and 255 (maximum), with some further limitations imposed by the operating system.
There exist five kinds of results which have an abundant or nearly unlimited amount of data: The top-level domains, the 2nd level domains, the UDP clients, the TCP clients and the AS results. Those results can be limited in their number of entries, using a configuration option. The default is not to limit any result.
The top-level domain results are further limited to the number of explicitly configured domains. The number of 2nd level domains is nearly boundless, and for a typical weekday on our cache servers, the number of 2nd level domains found can exceed 30,000. In a multi-level caching hierarchy, the top-level caches may only see a small number of sibling caches, and thus produce small client results. But local university caches usually see all browsers from almost all computers in the network. Therefore, a limitation of the client tables is necessary. The AS limitations are only taken into account, if an AS result will actually be produced.
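A sketch of such a top-N cut over accumulated counters (containers and types are assumptions for the illustration, not the Seafood data structures):

#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Copy the counters into a vector and keep only the n largest entries.
std::vector<std::pair<std::string, long>>
topN(const std::unordered_map<std::string, long>& counts, std::size_t n) {
    std::vector<std::pair<std::string, long>> rows(counts.begin(), counts.end());
    std::size_t keep = std::min(n, rows.size());
    std::partial_sort(rows.begin(), rows.begin() + keep, rows.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    rows.resize(keep);                         // everything beyond the top N is dropped
    return rows;
}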
Seafood can be run on one or many of your log files. Each run will summarize all log files poured into it. Each log file may be plain, gzip compressed or bzip2 compressed. Before running Seafood, check your configuration file, and make really sure that you have configured all the options you want.
A typical run, assuming a Bourne or compatible shell, would look like this:
$ ./seafood /some/where/access.log > result 2> errors &
$ tail -f errors
On stderr, warning messages, informational messages and other material will show up. The informational messages start out with a # character. All actual warnings start with the line number of the offending log file line.
The messages start out with the run-time endianness detection for the net mask code. Only big- and little-endian architectures are recognized. Next, the operating system imposed limits on file descriptors and number of processes are checked in order to determine the number of external helper programs. In the third step, the DNS and AS caches are connected to their respective GDBM files, and read into main memory. If using compressed input on multi-CPU machines, an external uncompressor will be started. Single CPU machines will use an internal decompression function. Please note that the uncompress algorithm is not available as an internal decompressor. The message above is from a two CPU machine. Further informational messages are a kind of progress indicator. They may be interrupted with actual warnings.
45643: neither TCP nor UDP, counting as TCP
+ timestamp=933546744, duration=3, client="xxx.xxx.xxx.xxx"
+ status="NONE/400", size=2406, method="GET"
+ url="http:///2.htm"
+ ident="-", hier="NONE/-", mime="-"
The above message is a warning about line 45643. The log line constitutes neither TCP nor UDP traffic, but will be counted as TCP. It is just an informational warning that your results may be off, though in this case, counting it as TCP is correct.
After the log file has been processed, the output will be generated. Each section of the output has a log line associated with it. Since there can be textual tabular output as well as a database intermediate file, you might see the output progress indicators twice, once for each output format. If the DNS and IRR helpers were used, the internal cache contents are dumped into their respective textual database files. DONE signals the finish of the output.
Do not be disturbed if Seafood seems to sit waiting on one of the positions mentioned in table 3.3.
|Position||Explanation|
|2lds...||Seafood is sorting several tens of thousands of domains. On slower machines this may take a while.|
|asn...||If you configured to use IRR, Seafood is querying the IRR. Since the IRR server may be slow in answering, up to 30 s per request, this may eat a considerable amount of time. Also, if your destination is not logged as dotted quad, the destination must be looked up before its AS number can be determined.|
|clients...||Seafood usually reverse resolves the client IP address into something symbolic.|
There are a number of textual tables written to stdout after a successful parsing. Due to possible look-ups, the generation of the sorted tables may take some time, too. If you are familiar with Calamaris, you will recognize the output format.
The overview section contains a simple table just displaying the sums of all TCP, all UDP and SUM traffic, including hit rates.
The status section contains a UDP table, which is sorted by HIT and MISS, and a TCP table, sorted by hit, miss and none.
The hierarchy section displays an overview of the server side connections with the different hierarchy tags sorted by direct, parent and peer. Please note that even though peer mostly means sibling, for some hierarchy codes it also includes parents. The tag lines contain a hit rate, which is nonsense there, but serves as a validator for your configuration file. The second hierarchy table is sorted by the peer or parent contacted, and contains the hierarchy tags used with that particular peer.
The hierarchy information is followed by the request method table. RFC 2518 (WebDAV) is part of the configuration file, and its methods will thus be counted. Unknown methods are counted as …
The URL scheme is part of another table, sorted by requests. Unknown schemes are counted as …
The top-level domains as configured are displayed in the following section. The first table is sorted by requests, the second one by volume. The number of domains shown can be limited to the top-N.
The 2nd level domains are displayed in two tables. Again, the first table is sorted by requests, the second one by volume. Both tables can be limited in their extent to a top-N chart.
If you configured an IRR server, your direct hierarchy
destination hosts will be grouped by the destination Autonomous System
Number (ASN) the server resides in. Two tables are generated, sorted by
requests and sorted by volume. The virtual entry
<NOIRR> implies that for the given host, an origin AS
could not be determined. Both tables can be limited to the top-N.
The AS information will be obtained by querying an external helper
irrhelper. For each request, the helper
process opens a connection to the configured whois server, sends the
query, parses the answers and closes its connection. Any complete whois
accessible mirror of an IRR can be queried.
Waiting for any non-local whois server is really slow, up to 30 seconds per request. For instance, processing a log file took 12 seconds, while waiting for the whois server to answer took 705 seconds. Using your own mirror is recommended; instructions and software can be found at the RADB site. Such a whois mirror server can also be used for Squid, see as_whois_server in your squid.conf file. Refer to the ACL schemes …
The media types and sub types are a run-time configurable option. Again, two tables, sorted by requests and sorted by volume, are the output. The tables are sorted primarily by the media type sums, and secondarily by the sub type accumulation.
The textual output contains a summary of ports which were used. In order to reduce the output volume, the ports are grouped by 1024 ports. Port 80 is mentioned separately.
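My reading of the grouping rule, as a small sketch (not the Seafood code):

// Port 80 gets its own counter, all other ports are grouped in blocks of 1024.
int portBucket(int port) {
    if (port == 80)
        return -1;              // separate bucket for plain HTTP
    return port / 1024;         // 0..1023 -> 0, 1024..2047 -> 1, ...
}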
The UDP client side table is just sorted by requests. The table displays the UDP traffic and the part of the hit traffic.
This table is a little awkward: its contents are sorted by the requests of each client, and for each client, the hit, miss and none amounts are shown.
The performance data are associated with the term peaks for historical reasons. For each peak_interval from the configuration file, all requests during that interval are grouped and displayed consecutively. The first table deals with requests, the second table with volume. The third and final table deals with the request duration, even if you configured to use bandwidth for the other tables.
The first three columns display the UDP sum, the UDP hit amount and the UDP hit rate relative to the UDP amount. The next three columns do the same for TCP. The final six columns deal with direct, parent and peer traffic, and the relative numbers based on the TCP amount. Please note that the sum of the last four percentage columns in a line does not yield 100. The missing part is the none traffic, which is not logged (yet).
The duration table omits the UDP columns, as ICP requests are usually logged as taking no time - even if that is not true, and even if other log software like Calamaris assumes them to take a microsecond each.
If you configured to show the distribution tables, three further tables will be part of the result. The first table shows the distribution in time and size of TCP hit, the second of TCP miss and the third of TCP none requests which include some errors. The interior of the tables is the distribution in time and size. The final sum row of each table is the accumulated distribution in size, and the final sum column of each table the accumulated distribution in time. A special program can be used to generate 3D diagrams and VRMLs from the data.
The final table displays some statistics about the analyser, e.g. how much time it spent in the parser loop and how much time it needed overall. Of interest may be the lps value in parentheses, which is the number of lines parsed per second.
The URL is user-entered input, and may thus be grossly malformed. Figure 3.1 shows the finite state automaton which extracts the scheme, host name and port from a URL. The start state is state 0. The final state 5 is not explicitly shown.
Figure 3.1: Finite state automaton for parsing URLs.
The notation at the arcs may look a little weird. My apologies for not using standard notation. The Mealy automaton performs its actions during the transition to a new state. If the arc is only labelled with a single character, this character will be eaten without doing anything with it. Otherwise the action is associated with the mentioned character or character class, and separated by a slash.
The character class alnum contains alphanumeric characters. The virtual character class else contains all characters which are not matched by any other arc leaving the node. The actions are abbreviated, too: s means to add the character to the scheme, h to add it to the host and p to add it to the port. E() empties the arguments.
The numbering of the state nodes is arbitrary and was chosen in order to simplify the implementation of the automaton. The weird arc from state 3 to state 2 was added in order to be able to parse the host name and port of tunnelled connections, e.g. to be able to parse the following log line:
[...] CONNECT some.host.domain:443 - DIRECT/some.host.domain -
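The following simplified sketch performs a comparable extraction (it is not the Seafood automaton, and it ignores the path entirely):

#include <cctype>
#include <string>

// Split a URL into scheme, host and port, accepting both
// "scheme://host:port/path" and the bare "host:port" form used by CONNECT.
struct UrlParts { std::string scheme, host, port; };

UrlParts splitUrl(const std::string& url) {
    UrlParts p;
    std::string::size_type i = 0, n = url.size();
    std::string::size_type sep = url.find("://");
    if (sep != std::string::npos) {            // scheme present
        p.scheme = url.substr(0, sep);
        i = sep + 3;
    }
    while (i < n && url[i] != ':' && url[i] != '/')
        p.host += url[i++];                    // collect the host name
    if (i < n && url[i] == ':') {
        ++i;
        while (i < n && std::isdigit(static_cast<unsigned char>(url[i])))
            p.port += url[i++];                // collect the port digits
    }
    return p;
}

Fed the CONNECT line above, this sketch yields an empty scheme, the host some.host.domain and the port 443.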
This section will deal with a few choice internals. For more implementation details, please refer to appendix C.
Seafood is capable of reading plain textual input and compressed input in a variety of formats. If more than one CPU is available, decompression takes place using an external decompressor which is run as an input filter. If there is only one CPU available, decompression is achieved internally using appropriate library calls. The compression formats understood are gzip, bzip2 and compress. The latter is only available on multi-CPU platforms. Due to its design, Seafood can be extended to use decryption.
The parsing API of the Seafood software provides high-level functions to return different integer flavours, floats and words from the input stream. In order to achieve maximum speed, care was taken to touch each character only once during parsing.
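In the same spirit, a single-pass unsigned integer reader might be sketched as follows (an illustration only, not the actual Seafood parsing API; overflow handling is omitted):

// Read a decimal number from a character buffer, touching each digit once.
const char* parseUnsigned(const char* s, unsigned long& value) {
    value = 0;
    while (*s >= '0' && *s <= '9') {
        value = value * 10 + static_cast<unsigned long>(*s - '0');
        ++s;                                   // this character is never revisited
    }
    return s;                                  // points at the first non-digit
}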
The symbol tables which contain status tags like TCP_MEM_HIT or hierarchy tags are dynamically generated at run-time from the information supplied in the configuration file seafood.conf. For efficiency reasons, the symbol tables used for matching are implemented using trie data structures.
When comparing the match function of a trie with that of a generic hash table, both need time proportional to the length of the word looked at. But while a trie has by then arrived at the stored (token) value, a generic hash still needs to run its collision handling first.
When looking at the matching speed of words not in either container, the trie excels even further. As soon as the first letter not in the trie is encountered, the trie will 'know' a mismatch. A generic hash table on the other hand will need to touch all letters of a mismatch and possibly do some collision avoidance in order to conclude a mismatch.
The drawback of a trie is that the insertion of new elements takes considerably more time than insertion into a generic hash. Therefore, tries are used for static information which is read in once from the configuration file during start-up, and which is just matched against, while hash maps are used to store dynamic content found during the parsing of the log file.
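A minimal trie for such static token matching could be sketched as follows (illustrative only; the actual Seafood trie is not reproduced here):

#include <memory>
#include <string>

// A mismatch is detected at the first character that has no child node.
struct TrieNode {
    std::unique_ptr<TrieNode> child[256];
    int token = -1;                            // -1 means "no token ends here"
};

void insert(TrieNode& root, const std::string& word, int token) {
    TrieNode* node = &root;
    for (unsigned char c : word) {
        if (!node->child[c])
            node->child[c] = std::make_unique<TrieNode>();
        node = node->child[c].get();
    }
    node->token = token;
}

int match(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (unsigned char c : word) {
        node = node->child[c].get();
        if (!node)
            return -1;                         // first unknown character: mismatch
    }
    return node->token;
}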
When porting the first Perl prototype to C++, constructs as high-level as those Perl supplies to its users were needed. The necessary constructs included associative arrays which could be symbolically indexed, excessive use of regular expressions, and the basic data type of a string. Nevertheless, char* pointers and character vectors are kept for performance reasons. The string classes are primarily used for indexing associative arrays and returning symbolic information from functions. All maps are indexed by a String and may use an arbitrary value class.
All table names start with the sf_ prefix in order to avoid collisions with other tables already defined in the current namespace. Figure 4.1 shows the physical model of the set of tables which depend on the daily interval I1. For the sake of a less abstract notation, Postgres data types were used instead of SQL types. As mentioned during the database design phase, the volume results should be stored using at least 64-bit integers. The design was optimised in favour of fewer tables, as it is easier to select and group results from a single table than to aggregate from different tables. A single character was chosen as the Boolean type for predicates. Some tables contain redundant data in order to simplify post processing of the data.
Figure 4.1: Physical model of tables depending on the daily interval.
During development, Oracle 7.3, Oracle 8i and PostgreSQL 6.3.* were used. Other database products were also considered during the design of the tables and relationships. The schema can easily be converted to different databases.
Figure 4.1 shows a circular arrangement of tables around the central table sf_stamp1. The primary key fragment I1ID is an abstract number. The table has an alternate key consisting of the combination of the cache host name and the beautified start time stamp FIRST of interval I1, the real key for the central table. The real start time stamp is contained in the … column. The I1ID provides efficient glue between the tables shown in figure 4.1.
Since there is no generic support for sequences, row identifiers, or triggers across database products, the interval id generation is provided by the insertion script. If the database product allows, the dependency of the outer tables' interval id on the central id should be modelled using a foreign key or a references constraint on the outer tables' id column. A redesign should focus on a powerful, generic, abstract database and leave these details to be programmed, if necessary, in the database actually used. In this regard, the database design is neither as optimal as it could be nor as abstract as it should be.
The central table sf_stamp1 contains the timestamp information about the daily interval for which Seafood was run. It also stores the name of the cache it was run for, or the logical name of the cache group. Part of the table is also the offset to UTC, in order to provide daylight saving transition information for output processes. Finally, the interval for which the sum was run, and the configured interval, are part of the central table.
All tables on the ring contain the (R,S,T) triple: the number of requests (R), the transferred volume (S) and the accumulated transfer duration (T).
Please note that the UDP related data usually has no noticeable transfer time, at least from the view of the log file.
If one is using a different database product, e.g. one which supports native arithmetic types, R and S should be assigned eight byte integers, and T an eight byte float. Even though Seafood only uses a four byte integer for its internal request counters, the database may focus on a much larger interval. For this reason, the sum of a year's worth of requests might overflow some internal database registers, and some database products report the wrong results instead of an error.
Some tables contain the related (HITR,HITS, HITT) triples. These triples
refer to the number of HITs as perceived by Seafood.
Please note that a HIT triple always contains values less than or equal to those of the plain (R,S,T) triple in the same table and row. Furthermore, the
sf_tcp_client table contains a (MISSR,MISSS,MISST) triple for
MISSes and an (ERRR,ERRS,ERRT) triple for errors and those non-HITs with a
hierarchy code of NONE.
Starting on the left side of the ring, generic TCP and UDP counts are combined into one table, sf_status. The glue and the status code themselves are not sufficient keys when trying to distinguish misses from errors, e.g. a TCP_MISS status might be the answer to a miss as well as to an error. Therefore, the HMN_FLAG needs to be part of the primary key. The IS_TCP predicate, on the other hand, is a piece of redundant data, as the status code sufficiently distinguishes between the transport protocols. A future design might keep the status code in a separate table, and implement the predicate as a stored procedure.
The top of figure 4.1 shows the tables dealing with the server side traffic of the Squid cache. The hierarchy code is part of the primary key. For a sibling and a parent, the host contacted is also part of the key. Siblings and parents are collapsed into one sf_peer table. The IS_PARENT predicate is redundant again, because the hierarchy code sufficiently distinguishes between parent and sibling. When going directly to the origin site, the destination hosts are neglected in the database, even though Seafood knows about them internally.
On the right side of the ring, the request method, the scheme, the top-level domain, the 2nd-level domain, and the MIME types are counted. All five tables contain the HIT count besides the plain count. In the method table, the scheme table, the top-level domain table and the MIME type table, only those items which were configured in seafood.conf by the time Seafood was run become part of the primary key. Furthermore, the number of entries in the domain tables for a particular cache host and timestamp is limited to the top N domains as configured with Seafood. The top-level and 2nd-level domain tables could theoretically be combined. But as the administrator is expected to store only the top N 2nd-level domains, the top-level domain information is not completely contained in the 2nd-level domain table.
The direct hosts can be combined into networks and autonomous systems, if a whois service is available and Seafood was configured accordingly. The sf_as table contains the destination AS numbers of origin servers which were visited directly. Additionally, the description of the AS as seen during the Seafood run is part of the database. There is no HIT information in the sf_as table, because the destination AS information is obtained from the direct fetches.
Finally, the client tables, e.g. sf_tcp_client, deal with the client side of the cache. Part of the primary key is the client host address. The format of the client host, symbolic or numeric, is taken as-is from the Seafood provided output. Note that the number of clients is limited to the top N clients as configured in Seafood. Especially for the TCP clients, it seems feasible to provide a cut-off threshold of some percentage of the requests in order to weed out clients whose access is denied. Please note that such an alternative is not part of the current project.
Figure 4.2 shows the remaining tables, which deal with the hourly interval I2. Again the start timestamp and the cache hostname form an alternate key for the interval id. Please note that this time the interval id I2ID is used.
Figure 4.2: Remaining tables.
Only the performance statistics table
sf_peak depends on
the second stamp table. For each hourly time stamp, it collects the UDP and
TCP data in HITs and MISSes, as well as the hierarchy data. Not part of the
table is the count for objects with the hierarchy code NONE.
The sf_meta table records the highest numbers the interval IDs currently have. The insertion script makes use of these numbers, increases them, and generates appropriate SQL statements using the new interval IDs. A redesign should assume the availability of sequences, and leave this detail to the concrete implementation for a given database.
The examples shown in this section are based on Oracle 7.3, however, due to the fact that a scripting language like Perl is used, you should be able to easily modify it to suit your needs. The database scripts use Perl's generic functions of the abstract DBI interface. The interface itself, using a connect string, maps the calls to the underlying concrete database driver (DBD) which communicates with the database. Java and ODBC allow for similar abstract database programming, but were not considered in this project.
In order to use Seafood, you will need to create the tables in your database, once only. Also, the meta information table needs to be filled with the default pair (0,0). For Oracle, a Perl script is part of the Seafood software suite. You must modify the create-ora.sql script before starting it to suit the concrete database product you are using. For instance, PostgreSQL does not know about foreign keys, but is able to use a … constraint during table construction.
After creating the necessary tables, data can be filled in. Seafood can create a so-called database intermediate file, which is an unsorted textual file meant to be read by an external Perl script which feeds the sums into the appropriate tables within your database. The example scripts are based on Oracle 7.3 and PostgreSQL. The script needs some environment variables and script variables to be set in order to work correctly. Refer to Deliverable 3 for details.
The insertion script starts by loading the concrete database driver and connecting to the database. In the next step, it tries to verify that all necessary tables are to be found in the data dictionary of your system. Finally, it reads the database intermediate file and stores the data into the database. If an error is detected, the insertion will be rolled back and the script aborted. Otherwise, success will be reported and the changes committed.
At the moment, you are unable to insert the same file twice. A more accurate insertion should be possible, e.g. checking the first stamp and cache host from the time stamp tables. If detected, updates should be used, either by adding to or overwriting existing data. But at the moment, only insertion of new rows can be used.
In case you want to remove all tables from your database, you just need to drop the appropriate tables. Some products like PostGreSQL are not capable of cascading dependent drops, so you might need to expand the uninstall SQL script in order to drop automatically constructed indices.
Attention! Running the uninstall script will remove all Seafood related data from your database! You could lose years worth of data.
Table 4.1 shows the programs and scripts described so far. Please note that the Oracle dialect is used in all the examples. Scripts for other database dialects were also supplied, but not tested.
|Program||Description||Language|
||log file analyzer.||C++, binary|
||Database table creation script.||SQL|
||Database destruction script.||SQL|
||Database value insertion script, which takes the results from Seafood and puts them into the database.||Perl|
||Helper module for the insertion script.||Perl|
Besides querying for the plain results of a single table stored within the database, the data of greater interest can be found by combining and grouping more than one table. Figure 5.1 is an example of a powerful combination.
Figure 5.1: Example for a combined extract from 10 cache hosts.
The web-based interface is divided into six separate areas. The first two areas deal with the selection of the interval of interest. When a user requests the CGI page from a web server, the CGI script queries the database for the extent of the stored data, and uses the smallest and largest time stamps available as defaults.
Figure 5.2: The web based interface.
The third part lets a user select a cache host. The cache host list is also generated by a query to the database. Please note that at least one cache host must be selected. The fourth section deals with the canvas size and the media type of the output. The media types can easily be extended to any gnuplot supported graphics format.
The fifth section determines the kind of query. The prototypical implementation contains nine different options, which will be covered in the following sections. The kinds of query can easily be extended to options of other interests by modifying the Perl scripts. Also, a selection by requests and by volume is part of the selection. The final column either submits the choices to the next stage, or resets the form to its default.
The following sections describe the different kinds of query which were selectable in area five.
The first kind of query in the selection menu is a simple selection from the peaks table. The peak table is kept at the finer granularity I2, which allows a good overview of daily or weekly behaviour comparisons of up to two caches. The selection is straight-forward: for the specified interval, a graph with the requests and HITs is drawn for each selected host.
The non-GET method option allows for a quick sum of all methods which were not GET methods. Please note that a more flexible interface would allow the user to specify the methods to exclude. Also, note that ICP_QUERY is excluded, as it constitutes about 75% of the queries we see, and can be determined from UDP traffic (not in the web GUI).
Within the selected interval, for each selected cache host, the sum of its non-GET queries and the respective hits are graphed. A logarithmic scale was chosen to accommodate the two regions of interest. The values on the logarithmic scale are offset by 1 in order to overcome the log(0) problem.
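Expressed as a formula (my notation; the report does not spell it out), the plotted value for a count $x$ is

\[ y = \log(x + 1) \]

so that empty intervals appear as zero rather than as an undefined value.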
In order to determine the amount of traffic generated by a cache mesh as a client to itself, e.g. fetching hits from neighbours or obtaining objects from parents, the special TCP client option allows selecting a subset from the TCP client table. A more powerful interface would allow the user to specify the set of hosts to select. For the sake of a prototype, the search set is hard coded into the query, but can easily be changed. Please note that when porting the queries to suit your needs, you should always match both the symbolic and the numeric host names.
The IN, OUT and AMONG option allows for a more powerful combination than the ones previously shown. It allows the evaluation of the efficiency of a cache mesh, the savings as OUT-IN, and the overhead - the traffic AMONG the caches. Refer to figure 5.1 for an example. Unfortunately, the traffic bypassing the cache mesh, labelled with a question mark in figure 5.3, cannot be computed, as it does not appear in the cache log files.
Figure 5.3: Viewing a cache mesh as a source.
Please note that the AMONG traffic is - in our case - completely limited to the backbone. The information for OUT is determined from the TCP clients table with the exception of the caches participating in the mesh themselves. The information used for IN is determined from the direct hierarchy table. AMONG is selected from the peer hierarchy table, which contains both parent and sibling calls.
The savings OUT-IN are graphed separately in order to make the savings comparable to the overhead. Figure 5.1 shows the savings and overhead for the DFN caches starting from October 1999 till the end of the year.
There are two options dealing with the top-level domain names. The frequent option uses a fixed set of frequently encountered top-level domain names. The top-N option on the other hand dynamically determines the six most common top-level domains for the specified interval from the top-level domain table. Please note that viewing by requests and viewing by volume might yield a different set of top-N domains for an identical interval.
When viewing the examples, please note that our cache mesh is partitioned into hosts serving *.com requests and hosts serving the logical !.com domain.
Figure 5.4: Example for top-N *.com and !.com domains by request.
The frequent MIME type option selects a fixed set of often encountered MIME types from the MIME table. A more adaptable GUI would allow the user to select the types.
The AS option selects the top six most frequently used destinations which were visited by the selected cache hosts. Note that the AS table is created from direct hierarchy destinations, with the destination host condensed into the AS number. Therefore the AS diagram will give some insight into the extent to which exterior links are used.
A set of different client options allows for separate diagrams for TCP and UDP client traffic, further separated into HIT and summary traffic. For the specified interval and cache host set, the top six clients are determined and plotted.
The hierarchy performance, separated into parent, sibling and direct traffic, can be plotted using the last option. For the selected interval and cache host set, the traffic is plotted into three graphs using a logarithmic scale. The values could have been derived from the peaks table for the finer interval I2 granularity, but in order to stay comparable with other interval I1 based output, the graphs are selected from the hierarchy tables.
A web-based GUI was chosen due to its implementation independent access to the database underneath. Figure 5.5 shows the data flow diagram.
Figure 5.5: CGI data flow of web GUI.
Whenever a user requests the selector's URL, a CGI program
sel.cgi will read the smallest and largest timestamp in the
database, and create a set of possible cache hosts from the entries in the
database. The HTML formatted form is sent to the user's browser, which will
in turn enter the user's selections.
By pressing the submit button, the user's choice of data is sent to the
ctor.cgi script. If the submitted data is consistent, the
database will be queried, usually several times per option. The query
results are constructed into intermediate data files to be fed into a
modified version of gnuplot. While generating the data files, a control file for gnuplot is also created. Gnuplot in turn will create
the image file. All files reside in a temporary directory.
Finally, control is returned to
ctor.cgi, which then reads
the temporary image file and copies it to its connection to the
browser. Some cache control headers are added in the reply. All temporary
files are unlinked. Care was taken to unlink these files even in the case of
errors, though abnormal termination on a signal might leave some of them behind.
A modified version of gnuplot is used which is capable of parsing UTC time stamps, and which can convert the time stamps into a human readable format. A patch is included in the source distribution, and was submitted to the maintainer.
The selection script is the front-end to the database interface. Using two queries after start-up, the script tries to determine the minimum and maximum time stamp available, and the set of cache hosts. The form selectors for the time stamps are created and defaulted to the time stamps found in the database. The rest of the script just prints the form and explanatory messages.
The constructor script is a rather monolithic creator of images. The
script starts by sanitizing the user submitted values. If the values are
OK, more data structures are set up for later perusal. In the next step,
the script connects to the database, creates the gnuplot control file, and
calls a work function to obtain the results of a query. After the function
finishes, the database connection can be closed, and gnuplot will be
called. If gnuplot returns successfully, the image will be copied onto
stdout. Please note that the database calls are bracketed within an
eval block in order to catch errors.
The work functions do not take any arguments. Each function is expected to return one or more gnuplot control statements to be written into the control file on return from the function. Each function will do one or more selections from the database, and arrange the answers in data files suitable for parsing by gnuplot. The data files are created in a temporary folder. The names of the data files are part of the gnuplot control return value from the work function.
The previous sections showed the abstract selection from tables, and the updated database design. This section will deal with the concrete implementation of the SQL queries used to obtain values from the database. All examples are shown for volume selection. The selection by request is part of the source.
Some Perl variables are visible in the queries. A better implementation would be using database stored procedures, and calls to them with variable arguments. Still, the current implementation is also flexible enough for a prototype. Refer to table 5.1 for commonly used variables. Further variables are described in the sections.
|Variable||Meaning|
|$start||Start time stamp of GUI selected interval.|
|$final||Final time stamp of GUI selected interval.|
|$cache[$i]||One of the GUI selected cache hosts within a loop.|
|$cacheset||All GUI selected cache hosts.|
The example is shown by describing the action taken for the "IN, OUT and AMONG" query. It works on a set of selected cache hosts simultaneously. The output is just four graphs, showing the OUT traffic, the IN traffic, the savings (OUT minus IN) and the AMONG overhead.
The selection of the OUT values is similar to the previous section. For all clients, with the exception of the DFN caches, the sum of all client traffic for each time stamp is calculated. Even though only some top-N client caches are put by name in the database, the "more[..]" pseudo cache contains the sum of caches not mentioned explicitly. Therefore, the OUT value is complete.
select st.first, sum(tc.s)
  from sf_stamp1 st, sf_tcp_client tc
 where st.i1id = tc.i1id
   and st.cache in ( $cacheset )
   and st.first between $start and $final
   and not ( tc.host like \'%.win-ip.dfn.de\' or tc.host like \'193.174.75.%\' )
 group by st.first
The IN values are selected from external traffic going directly to the source. Due to the DFN caches being top-level caches of a hierarchy - they have no parents outside themselves - the IN value is complete. Users at a different level of a cache hierarchy might want to add their parent traffic. On the other hand, in many cases the direct traffic is the most expensive.
select st.first, sum(hd.s)
  from sf_stamp1 st, sf_hier_direct hd
 where st.i1id = hd.i1id
   and st.cache in ( $cacheset )
   and st.first between $start and $final
 group by st.first
The savings are calculated for each time stamp as OUT minus IN for the current stamp. The overhead AMONG is obtained from the sibling and parent traffic in the sf_hier_peer table:
select st.first, sum(hp.s)
  from sf_stamp1 st, sf_hier_peer hp
 where st.i1id = hp.i1id
   and st.cache in ( $cacheset )
   and st.first between $start and $final
 group by st.first
The project proposal refers to correlating data, i.e. trends not immediately obvious from the basic data. The goals of "data mining" are to detect, interpret and predict qualitative and quantitative patterns in the data. The proposed "Extended Cache Statistics" will certainly not do all this, but it is believed that storing well-selected basic data is the first step in the right direction, and it is hoped that Seafood is able to provide at least some glimpses of what is possible with the data.
The Seafood software suite and its documentation can be obtained from … The database scripts from section 4 can be found in the dbase folder. The web interface scripts can be found in the public_html folder. Further information and pointers to documentation and binaries can also be obtained at the project home page …
|Tel.:||+49 511 762 4726|
|Fax:||+49 511 762 3003|
Jens-S. Vöckler is currently a staff member at the Institute
for Computer Networks and Distributed Systems at the University of
Hanover, Germany. His current working areas focus on web-related
technologies. He is involved in the project "Conception of a Caching
Infrastructure in the German Research Network". Jens received his degree
in electrical engineering in 1997 from Hanover University. As part of his
Master's thesis, he developed a proxy and cache for SAD/SAP session
announcements using Java.
Helmut Pralle, University Professor, Dr.-Ing., studied
engineering sciences at the Technische Hochschule Karlsruhe and Hanover.
Afterwards he started to work as an assistant at the Institute for
Applied mathematics in the Division for Electronic Calculation and at the
Computer Centre of the Technical University of Hanover. He received his
doctorate in Control Theory in 1967. From 1968 to 1971 he was Head of the Computer Centre of the University of Hanover, and since 1971 he has been Director of the Regional Computing Centre for Lower Saxony. In 1986 he took over the Chair of the Institute for Computer Networks and Distributed Systems at the Department of Electrical Engineering and Information Technology at the University of Hanover. Since 1996 he has been a member of the administrative council and the executive committee of the DFN Association. He delivers expert reports
and vocational guidance for the German Research Council (DFG), ministries
and the industry. His major interests include distributed information
systems and high-speed data communications.