User level Network Performance Monitoring programme

Authors: Hans Blom and Cees de Laat, Computational Physics, Physics department, University of Utrecht, Netherlands

Introduction

In the second half of 1999 we observed a discrepancy between the perceived performance of the TEN-155 network as monitored by DANTE and what was usable by some of its users. To help identify possible causes, the University of Utrecht developed a distributed measurement environment which periodically (once per hour) observes thruput and delay behaviour from several edges of the network.

Most network monitoring tools provide information on how much data was transported by core networks. Other tools exist to measure round trip times and packet losses. Many of those tools do not visualise what an end user is really interested in: the goodput. Goodput is defined as the actual useful data transferred from or to the user application. Packet losses and retransmissions do not count towards the goodput, although they may increase the round trip time or, more importantly, the task completion time in the applications. Reports from core networks may show high-volume data transfers, but these numbers are useless if it is not known what part of that volume is retransmission.

The network monitoring package RTPL, developed at Utrecht University, is capable of visualising end user Quality of Service aspects of the (research) network. Currently measurements are performed for several applications, including the European research network; DAS, a distributed computer cluster system scattered over five universities in the Netherlands; and the user sites of the collaboratory experiment Dynacore (RE 4005). In the Dynacore project the goal was to see whether the current backbones can support the network requirements of a collaboratory. In the Distributed ASCI Supercomputer (DAS) project the goal was to compare the QoS obtained from an ATM based MBS service with the "normal" over-provisioned Internet service. In Dynacore we found that the actually available network resources were insufficient to exploit any form of collaboratory, although the situation has improved considerably recently. In the DAS project we can show that the use of a QoS service is very beneficial for the distributed computing application.

Description of RTPL

As explained in the Introduction, the RTPL (Remote Thruput Ping Load) package is used for periodic network performance measurements between a specified set of hosts. A so-called control host starts the measurements at each host with a remote shell command (ssh, rsh). The results of the tests are also sent back via this remote shell. This scheme is displayed in the figure below.
Figure: Hosts scheme. The control host starts the measurements at all hosts i with remote shells. Host i executes the measurements to the hosts j. The results are sent back by host i to the control host.
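The control host scheme can be illustrated with a minimal Python sketch. This is not the RTPL implementation, which is written in Perl; the script name `measure.pl`, the host names and the function names are hypothetical. For simplicity the sketch also lets every host i measure to all other hosts j, whereas the monitor itself only tests a selection of connections.

```python
import subprocess
from typing import Callable, Dict, List


def build_remote_command(host: str, script: str, targets: List[str]) -> List[str]:
    """Return the argv that starts the measurement script on `host` via ssh."""
    # BatchMode makes ssh fail instead of prompting for a password,
    # which is what an unattended hourly run needs.
    return ["ssh", "-o", "BatchMode=yes", host, script, *targets]


def run_command(argv: List[str]) -> str:
    """Run a command and return what it wrote to standard output."""
    done = subprocess.run(argv, capture_output=True, text=True, timeout=600)
    return done.stdout


def collect_results(
    hosts: List[str],
    script: str,
    runner: Callable[[List[str]], str] = run_command,
) -> Dict[str, str]:
    """Start the measurements at every host i and gather the returned output."""
    results = {}
    for host_i in hosts:
        # Hypothetical simplification: host i measures to all other hosts j.
        targets = [host_j for host_j in hosts if host_j != host_i]
        results[host_i] = runner(build_remote_command(host_i, script, targets))
    return results
```

The `runner` argument exists only so that the remote shell can be replaced by a stub when trying the sketch without network access.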
At each host i the tests listed below are performed. The tests are executed by generally available and commonly used programs.
Thruput
Definition (RFC 1242): The maximum rate at which none of the offered frames are dropped by the device.
The standard, public domain command netperf is used for the thruput measurements.
Round trip
The round trip time to a remote host. The measurement procedure, using ICMP, is described in RFC 2681.
The round trip time is measured with the system command ping.
Load
The load is expressed as the number of fully active processes at a host. It is not a network quantity, but it can help to relate performance decreases to a heavy machine load.
The load is measured at the current host i, using the system command uptime.
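As an illustration of the load test, the following Python sketch extracts the load averages from a line of `uptime` output. It is a sketch under assumptions, not the Perl code of the package: the function name is hypothetical, and the regular expression only covers the two common output styles ("load average: 0.15, 0.30, 1.25" and "load averages: 0.10 0.20 0.30").

```python
import re
from typing import Tuple

# Accepts "load average:" as well as "load averages:", with or without commas.
_LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)")


def parse_load(uptime_output: str) -> Tuple[float, float, float]:
    """Return the 1, 5 and 15 minute load averages found in uptime output."""
    match = _LOAD_RE.search(uptime_output)
    if match is None:
        raise ValueError("no load average found in uptime output")
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen
```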
The collection of the results at the control host and the measurements at the hosts participating in the tests are performed by scripts written in the scripting language Perl.

The Perl script at the control host collects the results of the measurements for each host i and stores the results in ZIP compressed data files. The ZIP compression is used to reduce disk space and download time (see below).
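The effect of storing the results in ZIP compressed files can be sketched in Python with the standard `zipfile` module. The layout used here (one member per host with a `.dat` suffix) and the function names are assumptions made for illustration; the actual file format of the package is not described in this text.

```python
import io
import zipfile
from typing import Dict

SUFFIX = ".dat"  # hypothetical member suffix, not taken from the package


def pack_results(results: Dict[str, str]) -> bytes:
    """Pack one text member per host into a ZIP archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for host, text in sorted(results.items()):
            archive.writestr(host + SUFFIX, text)
    return buffer.getvalue()


def unpack_results(blob: bytes) -> Dict[str, str]:
    """Read the per-host members back from a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return {
            name[: -len(SUFFIX)]: archive.read(name).decode("utf-8")
            for name in archive.namelist()
        }
```

Measurement data of this kind is highly repetitive text, so the archive is typically much smaller than the raw files, which is what reduces the disk space and the download time.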

The presentation of these results is Web based: a Java Applet is used to load the data from the files into the memory of the browser. Please note that the functionality to read (ZIP compressed) data files from a Web browser is a Java feature.

JavaScript, the scripting language embedded in the HTML pages, is used to dynamically present various HTML tables of the data to the user. That is, the user selects a view of the data and the HTML code is generated on demand by JavaScript. JavaScript calls Applet methods directly to obtain the data required for the HTML table to display. Direct calls of Applet methods from JavaScript are supported by most Web browsers. The Applet can also be used to present various plots of the data to the user. These plots are displayed in a new window.

The following data files are available to be viewed via the Web:

At the control host there are also Perl scripts available to manipulate the measured data in various ways. These scripts can, among other things, be used to:

Site Configuration

A description of the sites participating in the European monitor is given in the table below. The names displayed in the column with the header "Title" correspond to the host titles used in the HTML tables of the network performance monitor.

Title | Institute | Location | Hostname (NRN) | System (Rate [Mbit/s])
------|-----------|----------|----------------|-----------------------
EUCS | Computing Services, The University of Edinburgh | Edinburgh, United Kingdom | qasbah.ucs.ed.ac.uk (JANET) | Red Hat Linux 6.0 (??)
MCC | Manchester Computing, University of Manchester | Manchester, United Kingdom | nessie.mcc.ac.uk (JANET) | FreeBSD 3.4 (??)
ULB | Université Libre de Bruxelles | Bruxelles, Belgium | sun7.iihe.ac.be (BELNET) | Sun Solaris 2.6 (??)
SARA | Academic Computing Services Amsterdam | Amsterdam, Netherlands | 192.87.106.120 (SURFnet) | Red Hat Linux 5.1 (100)
UU-36 | Institute of Computational Physics, Utrecht University | Utrecht, Netherlands | hst3736.phys.uu.nl (SURFnet) | Sun Solaris 7 (100)
ZAM | Central Institute for Applied Mathematics, Forschungszentrum Jülich | Jülich, Germany | zam472-b.zam.kfa-juelich.de (DFN) | Sun Solaris 7 (100)
CERN | CERN | Geneva, Switzerland | sunstats.cern.ch (CERN via SWITCH / TEN-155) | Sun Solaris 2.6 (100)
CIC | Computing and Information Centre, Czech Technical University in Prague | Prague, Czech Republic | nms.cvut.cz (CESNET) | Sun Solaris 2.5 (??)
 
Table: Description of the sites participating in the European monitor. The names displayed under the header "Title" correspond to the host titles used in the HTML tables of the network performance monitor.
The figure below gives the topology map of the part of the TEN-155 network that connects the NRNs used and the participating sites. Note that this map does not give a correct geographical representation of the displayed locations. For more information, see also the topology map of the TEN-155 network from DANTE.
Figure: Topology map of the part of the TEN-155 network that connects the NRNs used and the participating sites.
Please note that the thruput and round trip measurements are not executed for the full matrix that is formed when all sites are pairwise connected: tests are only performed at connections between sites which are selected such that:

The performance tests are executed with the application parameters given in the table below.
 
Table: Measurement parameters used in the performance tests.

Round trip

Description | Value
------------|------
Blocksize | 64 bytes
Total # measurements | 40
The first # values not used | 2

Thruput (measured with netperf)

Description | Value
------------|------
Test duration | 10 seconds
Test type to perform | TCP_STREAM
Send size | 32768 bytes
Local socket send / receive buffer sizes | 32768 bytes
Remote socket send / receive buffer sizes | 32768 bytes
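To make the parameters above concrete, the Python sketch below builds a netperf command line from them and summarises a series of round trip times while discarding the first two values. It is an illustration under assumptions: the exact invocation used by the package is not given in this text, the function names and the example host name are hypothetical, and the netperf options shown (-H, -l, -t, and the test-specific -m, -s, -S) are those described in the netperf manual.

```python
from statistics import mean
from typing import Dict, List, Sequence


def netperf_command(
    remote_host: str,
    duration_s: int = 10,
    send_size: int = 32768,
    socket_buffer: int = 32768,
) -> List[str]:
    """Build a netperf TCP_STREAM command from the thruput parameters."""
    return [
        "netperf",
        "-H", remote_host,        # remote host running netserver
        "-l", str(duration_s),    # test duration in seconds
        "-t", "TCP_STREAM",       # test type to perform
        "--",                     # test-specific options follow
        "-m", str(send_size),     # send size in bytes
        "-s", str(socket_buffer), # local socket send / receive buffer sizes
        "-S", str(socket_buffer), # remote socket send / receive buffer sizes
    ]


def summarise_round_trips(rtts_ms: Sequence[float], skip: int = 2) -> Dict[str, float]:
    """Drop the first `skip` values, then return count, minimum and mean."""
    used = list(rtts_ms[skip:])
    if not used:
        raise ValueError("no round trip values left after skipping")
    return {"count": len(used), "min": min(used), "mean": mean(used)}
```

The first values of a ping series often include one-off costs such as address resolution, which is a plausible reason for not using them; the text itself does not state the motivation.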

Thruput Results

In this section the average thruput values, calculated as a function of the hour at which the measurements were performed, are presented. The mean values are determined separately for the workdays of the week (Monday - Friday) and for the weekend. The average thruput results are calculated for the interval week 06 - week 13. Please note that these average thruput values are also available from the Web presentation of this package, but only on a monthly basis. Therefore, we prefer to recalculate these averages offline for the available weeks. The package contains scripts which can be used to (re)calculate these averages.
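The averaging described above can be sketched in a few lines of Python. This is an illustration, not the Perl script from the package; the function name and the (timestamp, thruput) sample format are assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, Tuple


def hourly_averages(
    samples: Iterable[Tuple[datetime, float]],
) -> Dict[Tuple[str, int], float]:
    """Average thruput per hour of the day, separately for workdays and weekends."""
    buckets = defaultdict(list)
    for when, thruput in samples:
        # datetime.weekday(): Monday is 0, Sunday is 6.
        day_type = "weekend" if when.weekday() >= 5 else "workday"
        buckets[(day_type, when.hour)].append(thruput)
    return {key: mean(values) for key, values in buckets.items()}
```

Restricting the input to the samples of week 06 - week 13 then gives the averages for that interval.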

The results are presented in the form of plots of the thruput measurements between the sites, specified in the site table, that use the same TEN-155 connection. The plots for the connections listed below are displayed. Please note that only a selection of the TEN-155 connections is listed here. The connections via / to Paris, Bruxelles and Prague are excluded. Only the busiest connections with the most redundant host information are displayed. However, all connections will be evaluated in the final report.

London <=> Amsterdam
The workday and weekend average thruput values are displayed in the corresponding figures below.
The TEN-155 route between EUCS and CERN is: London - Amsterdam - Geneva.
Amsterdam <=> Geneva
The workday and weekend average thruput values are displayed in the corresponding figures below.
Amsterdam <=> Frankfurt
The workday and weekend average thruput values are displayed in the corresponding figures below.
Frankfurt <=> Geneva
The workday and weekend average thruput values are displayed in the corresponding figures below.
Additionally, the workday thruput values for the SURFnet connection between UU-36 and SARA are displayed.

Figure: Average workday thruput values for the connection London <=> Amsterdam.

Figure: Average weekend thruput values for the connection London <=> Amsterdam.

Figure: Average workday thruput values for the connection Amsterdam <=> Geneva.

Figure: Average weekend thruput values for the connection Amsterdam <=> Geneva.

Figure: Average workday thruput values for the connection Amsterdam <=> Frankfurt.

Figure: Average weekend thruput values for the connection Amsterdam <=> Frankfurt.

Figure: Average workday thruput values for the connection Frankfurt <=> Geneva.

Figure: Average weekend thruput values for the connection Frankfurt <=> Geneva.

Figure: Average workday thruput values for the SURFnet connection between UU-36 and SARA.

Below, some conclusions are drawn from the displayed thruput plots. Before doing so, we would like to emphasize that with the setup used it is not always possible to point precisely to where the causes of performance decreases are situated, because test sites were not available at all PoPs. However, the main goal of this project was to get an end user perspective of the European network performance. Further analysis and measurements should help to bring more clarity.

The following conclusions can be given:

Roundtrip Results

In this section some typical minimum round trip values are presented in the form of a figure taken from the network performance Web pages (see the figure below). The minimum round trip values from/to UU-36 are presented for week 13, 2000. The displayed week is typical.



 
Figure: The minimum round trip values for the connections to / from UU-36. Every sixth row of the original table is displayed here.
From the figure above it follows that the round trip times are in reasonable correspondence with the geographical distance. However, some connections occasionally show large delays. The explanation may be that a non-default route is followed. Smaller increases in delay may be caused by waiting times in the NRN routers.
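The relation between round trip time and distance can be made explicit with a rough lower bound. The Python sketch below assumes that signals travel through optical fibre at about 200 km per millisecond (roughly two thirds of the speed of light in vacuum); the function name and the example path length are hypothetical, and real fibre routes are longer than the geographical distance between two sites.

```python
def min_round_trip_ms(path_km: float, speed_km_per_ms: float = 200.0) -> float:
    """Lower bound on the round trip time over a path of the given length."""
    if path_km < 0:
        raise ValueError("path length must not be negative")
    # The signal has to travel the path twice: there and back.
    return 2.0 * path_km / speed_km_per_ms


# A path of 1000 km cannot be traversed there and back in less than 10 ms.
print(min_round_trip_ms(1000.0))
```

Measured minimum values that stay close to this bound indicate a direct route; values well above it point to a longer route or to waiting times in the routers along the path.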



More information on RTPL: