Thursday, 23 June 2016

Angler

In my last couple of posts, I've been looking at how UDP network packets are received by the Linux kernel. While diving through the source code, we saw that a number of statistics are available for monitoring receive errors, buffer overruns, and queue depths.

In the course of investigating network throughput issues in our systems at LMAX, we have written some tooling for monitoring the available statistics. The result of that work is a small utility that provides an interface for monitoring system-wide or socket-specific statistics from a Java program.

The code is available in the Angler GitHub repository.


Who is it for?


This utility may be of use to you if you are interested in metrics and alerting around network throughput on Linux. Currently, only UDP socket monitoring is available, though we have plans to add similar functionality for TCP sockets.

Angler works by reading and parsing files in the /proc/ filesystem, and reporting metrics back to your application. It is then up to the user to decide how to handle the data. Perhaps the correct action is simply to report the numbers to a time-series database for charting or threshold alerting. Another valid use-case would be to apply back-pressure to a publishing system in the event of buffer overflow or increasing queue depth.

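The back-pressure idea can be sketched in a few lines. This is only an illustration, not part of Angler: BackPressureSignal and its threshold are hypothetical names, and it assumes the application feeds it the receive queue depth and cumulative drop count for a socket on every poll.

```java
/** Hypothetical helper (not part of Angler): decides when a publisher should slow down. */
final class BackPressureSignal {
    private final long maxQueueDepth; // threshold chosen by the application (an assumption)
    private long lastDrops = -1;      // -1 means no sample has been seen yet

    BackPressureSignal(long maxQueueDepth) {
        this.maxQueueDepth = maxQueueDepth;
    }

    /**
     * Feed one sample. Returns true if the queue is deeper than the threshold,
     * or if the cumulative drop count has grown since the previous sample.
     */
    boolean shouldSlowDown(long receiveQueueDepth, long drops) {
        boolean newDrops = lastDrops >= 0 && drops > lastDrops;
        lastDrops = drops;
        return newDrops || receiveQueueDepth > maxQueueDepth;
    }
}
```

The drop count is cumulative, so the sketch compares it with the previous sample rather than with zero; the first sample only establishes a baseline.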
Angler is designed for use in latency-sensitive systems, and is garbage-free in steady state. It can, of course, be used in systems where garbage-collection is not an issue.


Available statistics


Angler offers an API to monitor individual sockets specified by either a host:port combination (an instance of java.net.InetSocketAddress), or all sockets listening on a particular IP address (an instance of java.net.InetAddress).

To begin monitoring a socket, use one of the beginMonitoring methods on UdpSocketMonitor.

Once a socket monitoring request has been made, available socket statistics will be provided to the application on the next invocation of the monitor's poll method.

The callback method is invoked for each monitored socket, reporting the receive queue depth and drop count.

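For readers curious about where those two numbers come from, here is a minimal sketch of decoding one row of /proc/net/udp. It is not Angler's code or API: UdpProcLine is a hypothetical class, it allocates freely (unlike Angler), and it assumes the IPv4 column layout with rx_queue in the fifth column, drops in the thirteenth, and a little-endian host.

```java
/** Hypothetical helper (not part of Angler): decodes one data row of /proc/net/udp. */
final class UdpProcLine {
    final String localAddress;    // dotted-quad IPv4 address
    final int localPort;
    final long receiveQueueDepth; // rx_queue column
    final long drops;             // drops column

    private UdpProcLine(String localAddress, int localPort, long receiveQueueDepth, long drops) {
        this.localAddress = localAddress;
        this.localPort = localPort;
        this.receiveQueueDepth = receiveQueueDepth;
        this.drops = drops;
    }

    static UdpProcLine parse(String row) {
        String[] f = row.trim().split("\\s+");
        String[] local = f[1].split(":");  // local_address as HEXIP:HEXPORT
        String[] queues = f[4].split(":"); // tx_queue:rx_queue, both hexadecimal
        long ip = Long.parseLong(local[0], 16);
        // Assumption: little-endian host, so the lowest byte is the first octet.
        String address = (ip & 0xFF) + "." + ((ip >> 8) & 0xFF) + "."
                + ((ip >> 16) & 0xFF) + "." + ((ip >> 24) & 0xFF);
        return new UdpProcLine(
                address,
                Integer.parseInt(local[1], 16),
                Long.parseLong(queues[1], 16),
                Long.parseLong(f[12]));    // drops is printed in decimal
    }
}
```

Given a row whose local_address is 0100007F:3039 and whose queue column is 00000000:00000A00, this yields 127.0.0.1, port 12345 and a receive queue depth of 2560.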



System-wide statistics are available from /proc/net/softnet_stat and /proc/net/snmp. See previous posts for more information on exactly what is reported in these files.


The softnet data is provided by SoftnetStatsMonitor, which reports the counters from /proc/net/softnet_stat to a callback method.

Changes in these numbers can indicate that the Linux worker threads are not getting enough time to dequeue incoming packets from the network device.

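As a rough illustration of the underlying data, the sketch below decodes one line of /proc/net/softnet_stat, which holds one line per CPU with hexadecimal columns. SoftnetLine is a hypothetical class, not Angler's implementation, and it only reads the first three columns (packets processed, packets dropped, and time-squeeze count).

```java
/** Hypothetical helper (not part of Angler): decodes one per-CPU line of /proc/net/softnet_stat. */
final class SoftnetLine {
    final long processed;   // column 0: packets processed
    final long dropped;     // column 1: packets dropped because the backlog queue was full
    final long timeSqueeze; // column 2: times the softirq ran out of budget with work remaining

    private SoftnetLine(long processed, long dropped, long timeSqueeze) {
        this.processed = processed;
        this.dropped = dropped;
        this.timeSqueeze = timeSqueeze;
    }

    static SoftnetLine parse(String line) {
        String[] f = line.trim().split("\\s+");
        // Every column is an unpadded-width hexadecimal counter.
        return new SoftnetLine(
                Long.parseLong(f[0], 16),
                Long.parseLong(f[1], 16),
                Long.parseLong(f[2], 16));
    }
}
```

Because the counters are cumulative, a monitoring loop would normally compare each sample with the previous one and alert on growth in the dropped or time-squeeze columns.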

SNMP data is provided by SystemNetworkManagementMonitor, which reports it through a callback.

These statistics report a global view of receive errors, which could be caused by buffer overruns, memory exhaustion or other factors.

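To show roughly what sits behind these numbers, the sketch below pulls the UDP counters out of the contents of /proc/net/snmp. That file stores each protocol as a header line of names followed by a line of values, both prefixed with the protocol name. SnmpUdpStats is a hypothetical class, not Angler's implementation, and it takes the file contents as a list of lines so that it makes no assumption about how the file is read.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Angler): extracts the Udp counters from /proc/net/snmp. */
final class SnmpUdpStats {
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> stats = new LinkedHashMap<>();
        String[] names = null;
        for (String line : lines) {
            if (!line.startsWith("Udp:")) {
                continue; // skips other protocols, including "UdpLite:"
            }
            String[] fields = line.trim().split("\\s+");
            if (names == null) {
                names = fields; // first "Udp:" line holds the column names
                continue;
            }
            // Second "Udp:" line holds the values, in the same order as the names.
            for (int i = 1; i < Math.min(names.length, fields.length); i++) {
                stats.put(names[i], Long.parseLong(fields[i]));
            }
            break;
        }
        return stats;
    }
}
```

Looking values up by column name (for example InErrors or RcvbufErrors) rather than by position keeps the sketch tolerant of kernels that append extra columns.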
A complete example of these methods can be found in the ExampleApplication.


Production use


At LMAX, we have been using Angler in production for some time, so we consider it production-ready. We poll the files in /proc/ up to 100 times per second on some services, in order to get a more fine-grained view of receive buffer depths. So far, we have not encountered any issues with this approach; a careful review of the kernel source code responsible for supplying the statistics indicates only a very small chance of lock contention.

Version 1.0.3 is currently available on Maven Central.

Contributions and feedback are welcome!
