Linux in 60 Seconds
This table provides a methodology for quickly working through the performance metrics of a typical Linux system, with the goal of identifying bottlenecks.
Tool | Rationale |
---|---|
uptime | Load averages to identify if load is increasing or decreasing (compare 1, 5, and 15 minute averages) |
`dmesg -T \| tail` | Kernel errors, including OOM events |
vmstat -SM 1 | System-wide statistics: run queue length, swapping, overall CPU usage |
mpstat -P ALL 1 | Per-CPU balance: a single busy CPU can indicate poor thread scaling |
pidstat 1 | Per-process CPU usage: identify unexpected CPU consumers, and user/system CPU time for each process |
iostat -sxz 1 | Disk I/O statistics: IOPS and throughput, average wait time, percent busy |
free -m | Memory usage including the file system cache |
sar -n DEV 1 | Network device I/O: packets and throughput |
sar -n TCP,ETCP 1 | TCP statistics: connection rates, retransmits |
top | Check overview |
uptime
The `uptime` command gives a quick view of the system load averages, which
helps characterize the current load (e.g. transient or long-running). The three
load average numbers cover the last 1, 5, and 15 minutes respectively.
In Linux, load average refers to system-wide demand of CPUs, disks, and other key resources. Load is measured as the current resource usage (utilization) plus the queued requests (saturation). The average is an exponentially damped moving average.
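To make the damping concrete, here is a minimal sketch of one update step of an exponentially damped moving average. It assumes a 5-second sampling interval and a 60-second window, which matches how the kernel's 1-minute figure is commonly described; the function and parameter names are illustrative only.

```python
import math

def damped_load(previous, active_tasks, window_s=60.0, interval_s=5.0):
    """One update step of an exponentially damped moving average."""
    # Assumed constants: a sample every 5 s, damped over a 60 s window.
    decay = math.exp(-interval_s / window_s)
    return previous * decay + active_tasks * (1.0 - decay)

# Start from an idle system, then hold 4 active tasks for one minute
# (12 samples at the assumed 5-second interval).
load = 0.0
for _ in range(12):
    load = damped_load(load, active_tasks=4)

print(round(load, 2))
```

After a full minute of constant demand, the 1-minute figure has covered only about 63% of the distance to the steady value, which is why load averages lag behind sudden changes.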
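Note that if the load averages are decreasing (i.e. the 15-minute value is greater than the 5-minute value, which in turn is greater than the 1-minute value), the event being investigated may have already occurred and not repeated itself. Comparing the three values helps identify where in the timeline to pay attention.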
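The comparison can be sketched as a small helper that reads text in the `/proc/loadavg` format (the first three fields are the 1, 5, and 15 minute averages). The `tolerance` value and the sample line are assumptions made for illustration.

```python
def load_trend(loadavg_text, tolerance=0.05):
    """Classify the load trend from text in /proc/loadavg format."""
    one, five, fifteen = (float(x) for x in loadavg_text.split()[:3])
    # tolerance is an arbitrary margin to ignore insignificant differences.
    if one > five + tolerance and five > fifteen + tolerance:
        return "increasing"
    if one < five - tolerance and five < fifteen - tolerance:
        return "decreasing"
    return "steady or mixed"

sample = "0.20 0.90 2.10 1/189 12345"  # hypothetical sample line
print(load_trend(sample))  # prints "decreasing"
```

On a live system the same function could be fed the contents of `/proc/loadavg` directly.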
dmesg
Tailing the kernel ring buffer helps to quickly orient yourself toward where the bottleneck might be located. This practice will not always produce usable results, but it can quickly reveal memory pressure through OOM (out-of-memory) errors, which reduces the time needed to locate the issue.
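As a rough sketch, OOM-related lines can be filtered out of captured `dmesg` output with a simple pattern match. The two substrings below are ones the kernel commonly logs around OOM kills, but exact wording varies by kernel version, so treat the pattern as an assumption; the sample lines are invented.

```python
import re

# Assumed patterns; kernel message wording differs between versions.
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory", re.IGNORECASE)

def find_oom_lines(dmesg_lines):
    """Return the lines that look like OOM killer activity."""
    return [line for line in dmesg_lines if OOM_PATTERN.search(line)]

# Hypothetical sample of `dmesg -T` output.
sample = [
    "[Mon Jan  1 10:00:00 2024] eth0: link up",
    "[Mon Jan  1 10:05:12 2024] myapp invoked oom-killer: gfp_mask=0x100cca, order=0",
    "[Mon Jan  1 10:05:12 2024] Out of memory: Killed process 4242 (myapp)",
]

for line in find_oom_lines(sample):
    print(line)
```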
vmstat
The tables below can help in understanding the output.
Procs
Column | Description |
---|---|
r | The number of runnable processes (running or waiting for run time) |
b | The number of processes waiting for I/O to complete |

The `r` column can also be referred to as the run-queue length.
Memory
Column | Description |
---|---|
swpd | The amount of swapped memory used |
free | The amount of idle memory |
buff | The amount of memory used as buffers |
cache | The amount of memory used as cache (paging) |
si | The amount of memory swapped in from disk |
so | The amount of memory swapped to disk |
In most Linux systems, `free` will be relatively small and most of the usable
memory will be allocated to the page cache (`cache`). This is an intentional
design philosophy: put otherwise free RAM to use rather than let it go to
waste. Including the `-a` flag reports inactive and active memory in place of
the `buff` and `cache` columns.
The buffer cache is used for block device caching whereas the page cache is used for general file system caching (most modern systems have the buffer cache contained within the page cache).
If the `si` and `so` values are consistently nonzero, memory is being swapped
in and out, which indicates that the system is under memory pressure.
IO
Column | Description |
---|---|
bi | Blocks received from a block device (blocks/s) |
bo | Blocks sent to a block device (blocks/s) |
A large increase in either one of these values can indicate a process is writing or reading a large amount of data to disks (which can show up as load).
System
Column | Description |
---|---|
in | The number of interrupts per second, including the clock |
cs | The number of context switches per second |
CPU
Column | Description |
---|---|
us | Time spent running non-kernel code (user time, including nice time) |
sy | Time spent running kernel code (system time) |
id | Time spent idle |
wa | Time spent waiting for IO |
st | Time stolen from a virtual machine |
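The column descriptions above can be applied mechanically. The following is a minimal sketch that parses captured `vmstat` output by its header row and flags two conditions: a run queue longer than the CPU count (a common rule of thumb for CPU saturation, assumed here) and any swap activity. The sample output and the CPU count of 4 are invented for illustration.

```python
def parse_vmstat(text):
    """Return one dict per sample row, keyed by vmstat column name."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    # The column header is the line whose first field is "r".
    header_idx = next(i for i, f in enumerate(lines) if f and f[0] == "r")
    header = lines[header_idx]
    return [dict(zip(header, map(int, row)))
            for row in lines[header_idx + 1:] if row]

def vmstat_flags(row, cpu_count):
    """Apply two rule-of-thumb checks to one parsed row."""
    flags = []
    if row["r"] > cpu_count:
        flags.append("run queue exceeds CPU count")
    if row["si"] > 0 or row["so"] > 0:
        flags.append("swapping")
    return flags

# Hypothetical output of `vmstat -SM 1` (one sample).
SAMPLE = """
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0      0    512     64   2048    0    3     0   120 1500 3000 90  8  2  0  0
"""

rows = parse_vmstat(SAMPLE)
print(vmstat_flags(rows[0], cpu_count=4))
```

Parsing by header name rather than by position keeps the sketch working if a given `vmstat` version prints additional columns.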
mpstat
The following table documents the columns resulting from this command:
Column | Description |
---|---|
CPU | Logical CPU ID, or all for summary |
%usr | User-time, excluding %nice |
%nice | User-time for processes with a nice'd priority |
%sys | System-time (kernel) |
%iowait | I/O wait |
%irq | Hardware interrupt CPU usage |
%soft | Software interrupt CPU usage |
%steal | Time spent servicing other tenants |
%guest | CPU time spent in guest virtual machines |
%gnice | CPU time to run a niced guest |
%idle | Idle |
Of note are the `%usr`, `%sys`, and `%idle` columns, which can indicate the
ratio between user and kernel CPU usage. Additionally, a single hot CPU
(`%usr` + `%sys`) can indicate a large single-threaded workload.
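A minimal sketch of the hot-CPU check follows. It takes a mapping of CPU ID to busy percentage (`%usr` + `%sys`) and reports CPUs that are busy while all the others are mostly quiet. Both thresholds and the sample values are arbitrary assumptions, not values taken from `mpstat` itself.

```python
def hot_cpus(busy_by_cpu, hot_threshold=90.0, quiet_threshold=30.0):
    """Return CPUs that are busy while every other CPU is mostly quiet.

    busy_by_cpu maps a logical CPU ID to %usr + %sys.
    """
    hot = [cpu for cpu, busy in busy_by_cpu.items() if busy >= hot_threshold]
    others = [busy for cpu, busy in busy_by_cpu.items() if cpu not in hot]
    # Only report an imbalance: hot CPUs next to otherwise quiet ones.
    if hot and others and max(others) <= quiet_threshold:
        return hot
    return []

# Hypothetical per-CPU busy percentages.
sample = {0: 97.5, 1: 3.0, 2: 4.1, 3: 2.2}
print(hot_cpus(sample))  # prints [0]
```

When every CPU is busy the function returns an empty list, since that pattern points to general CPU load rather than poor thread scaling.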
pidstat
This tool presents CPU utilization information broken down by process. By
default it only shows active (non-idle) processes. Passing the `-p ALL` flag
will instead force it to show statistics for all processes.
The `-d` flag converts the output to reflect disk I/O statistics instead.
iostat
This tool provides per-disk I/O statistics. By default, without any additional
flags or arguments, a summary-since-boot for CPU and disk statistics is printed.
The following table details the short extended (`-sx`) columns presented:
Column | Description |
---|---|
Device | The device (or partition) name as listed in /dev |
tps | Transactions per second (IOPS) |
kB/s | Kbytes per second |
rqm/s | Requests queued and merged per second |
await | Average I/O response time, including time queued in the OS and the I/O response time of the device |
aqu-sz | Average number of requests both waiting in the driver request queue and active on the device |
areq-sz | Average request size in Kbytes |
%util | Percent of time the device was busy processing I/O requests (utilization) |
The most important metric in terms of delivered performance is `await`. What
constitutes a good or bad value depends on the workload. A high `await` can be
caused by queueing, larger I/O sizes, random I/O on rotational disks, and
device errors.
Nonzero counts in `rqm/s` indicate a sequential workload. Small sizes in
`areq-sz` indicate a random I/O workload.
By omitting the `-s` flag, the columns are further broken down by reads and
writes. This can be helpful, especially considering writes are less impactful
(due to write-back caching) and can otherwise skew the aggregated value.
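The `rqm/s` and `areq-sz` heuristics can be expressed as a small helper that takes one row of `iostat -sx` output as a dictionary. The 16 KB cutoff for a "small" request is an arbitrary assumption chosen for illustration, and the sample row is invented.

```python
def describe_io(row, small_request_kb=16.0):
    """Apply rough workload heuristics to one `iostat -sx` row."""
    hints = []
    if row["rqm/s"] > 0:
        hints.append("merges present: possibly sequential")
    # small_request_kb is an assumed cutoff, not an iostat constant.
    if row["areq-sz"] <= small_request_kb:
        hints.append("small requests: possibly random")
    return hints

# Hypothetical row for a single device.
sample = {
    "tps": 850.0, "kB/s": 3400.0, "rqm/s": 0.0,
    "await": 6.2, "aqu-sz": 5.3, "areq-sz": 4.0, "%util": 92.0,
}
print(describe_io(sample))  # prints ['small requests: possibly random']
```

These hints are only a starting point; as noted above, whether the accompanying `await` is acceptable depends on the workload.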
free
This tool provides summarized information about memory usage. Most of these
statistics also appear in `vmstat`; of particular note, however, is the
`available` column, which estimates how much memory can be made available to
applications without swapping. It counts free memory plus memory that can be
reclaimed, such as cache that can be released (i.e. is not dirty).
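On Linux, `free` takes its figures from `/proc/meminfo`, where the `available` column corresponds to the `MemAvailable` field. The sketch below parses text in that format and computes the available share of total memory; the sample values are invented.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its numeric value (kB for most)."""
    values = {}
    for line in text.strip().splitlines():
        name, _, rest = line.partition(":")
        values[name.strip()] = int(rest.split()[0])
    return values

# Hypothetical excerpt in /proc/meminfo format.
SAMPLE = """
MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:   12288000 kB
Cached:         11000000 kB
"""

info = parse_meminfo(SAMPLE)
available_pct = 100.0 * info["MemAvailable"] / info["MemTotal"]
print(f"{available_pct:.1f}% available")  # prints "75.0% available"
```

In this sample `MemFree` is small while `MemAvailable` is large, which is the typical picture on a healthy system with a full page cache.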
sar
This tool is very configurable and can provide a wide spectrum of data. The full
breadth of what is possible is too much to detail here; refer to `man sar` for
more details.
top
This is likely the most popular tool used by beginners for investigating the
performance state of a machine. Of note is the `%CPU` column, which shows CPU
usage as a percentage (by default not normalized, meaning not averaged over all
CPUs, so a multithreaded process can exceed 100%), and the `TIME+` column,
which shows the total CPU time the task has consumed.
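To compare a non-normalized `%CPU` value against whole-machine capacity, it can be divided by the number of logical CPUs. A minimal sketch, with a hypothetical reading of 250% on an 8-CPU machine:

```python
def normalized_cpu(top_cpu_percent, cpu_count):
    """Convert a per-CPU-scaled %CPU value to a share of all CPUs."""
    return top_cpu_percent / cpu_count

# Hypothetical values: a process at 250% on a machine with 8 logical CPUs.
print(normalized_cpu(250.0, 8))  # prints 31.25
```

On a live system, `os.cpu_count()` from the Python standard library can supply the CPU count.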