Linux in 60 Seconds

This table provides a methodology for quickly iterating through performance metrics of a typical Linux system with the goal of quickly identifying bottlenecks.

Tool	Rationale
`uptime`	Load averages to identify if load is increasing or decreasing (compare 1, 5, and 15 minute averages)
`dmesg -T	tail`
`vmstat -SM 1`	System-side statistics: run queue length, swapping, overall CPU usage
`mpstat -P ALL 1`	Per-CPU balance: a single busy CPU can indicate poor thread scaling
`pidstat 1`	Per-process CPU usage: identify unexpected CPU consumers, and user/system CPU time for each process
`iostat -sxz 1`	Disk I/O statistics: IOPS and throughput, average wait time, percent busy
`free -m`	Memory usage including the file system cache
`sar -n DEV 1`	Network device I/O: packets and throughput
`sar -n TCP,ETCP 1`	TCP statistics: connection rates, retransmits
`top`	Check overview

uptime

The uptime command is primarily used to get a quick glance at system load averages to quickly understand the characteristics of the current load (i.e. spurious or long-term). The three load average numbers represent 1-minute, 5-minute, and 15-minute intervals respectively.

In Linux, load average refers to system-wide demand of CPUs, disks, and other key resources. Load is measured as the current resource usage (utilization) plus the queued requests (saturation). The average is an exponentially damped moving average.

Note that if the load averages are decreasing (i.e. 15 > 5 > 1), it's possible the sought after event has already occurred and not repeated itself. This helps to identify where in the timeline to pay attention.

dmesg

Tailing the kernel ring buffer is helpful for quickly orienting yourself to where the bottleneck might be located. While this practice may not always produce usable results, it can quickly show memory pressure through OOM errors which can reduce the time needed to locate the issue.

vmstat

The below tables can help in understanding the output.

Procs

Column	Description
`r`	The number of runnable processes (running or waiting for run time)
`b`	The number of processes waiting for I/O to complete

The r column can also be referred to as the run-queue length.

Memory

Column	Description
`swdp`	The amount of swapped memory used
`free`	The amount of idle memory
`buff`	The amount of memory used as buffers
`cache`	The amount of memory used as cache (paging)
`si`	The amount of memory swapped in from disk
`so`	The amount of memory swapped to disk

In most Linux systems, free will be relatively small and most of the usable memory will be allocated to the page cache (cache). This is an intentional design philosophy which prefers utilizing free RAM instead of letting it be wasted. The -a flag can be included which breaks up the page cache into inactive and active memory.

The buffer cache is used for block device caching whereas the page cache is used for general file system caching (most modern systems have the buffer cache contained within the page cache).

If the si and so values are constantly in flux, it indicates that memory swapping is occurring and indiciates memory pressure is being applied.

IO

Column	Description
`bi`	Blocks received from a block device (blocks/s)
`bo`	Blocks sent to a block device (blocks/s)

A large increase in either one of these values can indicate a process is writing or reading a large amount of data to disks (which can show up as load).

System

Column	Description
`in`	The number of interrupts per second, including the clock
`cs`	The number of context switches per second

CPU

Column	Description
`us`	Time spent running non-kernel code (user time, including nice time)
`sy`	Time spent running kernel code (system time)
`id`	Time spent idle
`wa`	Time spent waiting for IO
`st`	Time stolen from a virtual machine

mptstat

The following tables documents the columns resulting from this command:

Column	Description
`CPU`	Logical CPU ID, or `all` for summary
`%usr`	User-time, excluding `%nice`
`%nice`	User-time for processes with a nice'd priority
`%sys`	System-time (kernel)
`%iowait`	I/O wait
`%irq`	Hardware interrupt CPU usage
`%soft`	Software interrupt CPU usage
`%steal`	Time spent servicing other tenants
`%guest`	CPU time spent in guest virtual machines
`%gnice`	CPU time to run a niced guest
`%idle`	Idle

Of note are the %usr, %sys, and %idle columns which can indiciate the ration between user/kernel CPU usage. Additionally, a single hot CPU (%usr + %sys) can indicate a large single-threaded workload.

pidstat

This tool presents CPU utilization information as broken down by process. By default it only shows active (non-idle) processes. Passing the -p ALL flag will instead force it to show statistics for all processes.

The -d flag converts the output to reflect disk I/O statistics instead.

iostat

This tool provides per-disk I/O statistics. By default, without any additional flags or arguments, a summary-since-boot for CPU and disk statistics is printed. The following table details the short extended (-sx) columns presented:

Column	Description
`Device`	The device (or partition) name as listed in /dev
`tps`	Transactions per second (IOPS)
`kB/s`	Kbytes per second
`rqm/s`	Requests queued and merged per second
`await`	Average I/O response time, including time queued in the OS and the I/O response time of the device
`aqu-sz`	Average number of requests both waiting in the driver request queue and active on the device
`areq-sz`	Average request size in Kbytes
`%util`	Percent of time the device was busy processing I/O requests (utilization)

The most important metric in terms of delivered performance is await. What values constitutes good or bad is subjective depending on the workload. A high await can be caused by queueing, larger I/O sizes, random I/O on rotational disks, and device errors.

Nonzero counts in rqm/s indiciate a sequential workload. Small-sizes in areq-sz indicate a random I/O workload.

By omitting the -s flag, the columns are further broken down by read/writes. This can be helpful, especially considering writes are less impactful (due to write-back caching) and can otherwise skew the aggregated value.

free

This tool provides summarized information about memory usage. Most of these statistics reside in vmstat, however, of particular note is the available column which indiciates how much memory is available for reclamation, specifically memory being used in the cache which can be released (i.e. not dirty).

sar

This tool is very configurable and can provide a wide spectrum of data. The full breadth of what is possible is too much to detail here, refer to man sar for more details.

top

This is likely the most popular tool used by beginners for investigating the performance state of a machine. Of note is the %CPU column which shows CPU usage percentage (not normalized, meaning not average over all CPUs), and the TIME+ column which shows total CPU usage time.