Out of memory, or OOM, is a state in which an operating system is running critically low on memory, with undesirable consequences for the programs and processes running on it. A user runs into this situation because Linux can allocate more memory to processes than the system actually has. This might seem like strange behavior, however, in most distributions processes are allowed to request more memory from the kernel than the system is configured with. This behavior is dictated by the value of "/proc/sys/vm/overcommit_memory", which is set to "0" by default. The Linux kernel overcommits on the assumption that processes will not use all of the memory they have been allocated, either immediately or in the long run.
user@xxxxxxxxxxx:~$ cat /proc/sys/vm/overcommit_memory
0
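For reference, "0" selects the kernel's default heuristic overcommit, "1" always grants allocation requests, and "2" disables overcommit so that the commit limit is derived from swap plus "vm.overcommit_ratio" percent of RAM. A quick way to read both settings at once, shown here as a sketch:
sysctl vm.overcommit_memory vm.overcommit_ratio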
The good news is that in such a situation the kernel will employ a mechanism called the "OOM killer", which ends up calling the "oom_kill_process()" function (visible in the tracebacks later in this article). This function is important to know because it sends the kill or termination signal, also known as "SIGKILL", to the process the kernel considers the best victim, typically a low-priority process consuming a large amount of memory relative to its "oom_score_adj" setting. This prevents the kernel from going into a complete panic and crashing the OS. However, these low-priority processes can be routinely used ones like SSH or Apache, thereby affecting the user's day-to-day operations. If the memory utilization of a host stays above roughly 80%, the host is at risk of running out of memory, which might leave the user unable to deploy new virtual machines or containers. The OOM killer will generally trigger only when the system is configured to overcommit memory; with overcommit disabled, allocation requests fail upfront instead.
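The kernel exposes each process's current badness score and its adjustment through procfs, which is useful for judging who the next victim would be. A minimal sketch, where "1234" is a placeholder PID:
cat /proc/1234/oom_score        # badness score the OOM killer would use
cat /proc/1234/oom_score_adj    # adjustment, in the range -1000 to 1000
echo -500 | sudo tee /proc/1234/oom_score_adj   # make this process a less likely victim
A value of -1000 exempts a process entirely, while positive values such as the 100 and 900 seen in the logs below make a process a more likely target.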
Common symptoms of OOM on an affected host include virtual machine reboots, UI unresponsiveness, or significant delays in loading the UI. A few factors should be considered when encountering an OOM incident.
1. "Cgroup" memory processes can get killed by the Linux kernel as each process takes more memory than it is allocated.
2. Random processes can get terminated by the Linux kernel as victims of low memory availability.
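As a sketch of the first case, a cgroup's memory limit and current usage can be read straight from the cgroup filesystem. The exact path depends on whether the host uses cgroup v1 or v2, and "<cgroup>" below is a placeholder for the actual cgroup name:
cat /sys/fs/cgroup/memory/<cgroup>/memory.limit_in_bytes   # cgroup v1: configured limit
cat /sys/fs/cgroup/memory/<cgroup>/memory.usage_in_bytes   # cgroup v1: current usage
cat /sys/fs/cgroup/<cgroup>/memory.max                     # cgroup v2: configured limit
cat /sys/fs/cgroup/<cgroup>/memory.current                 # cgroup v2: current usage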
In this article, I will only be talking about OOM on the Linux operating system. A few commonly seen log messages that can indicate a process running into memory exhaustion, depending on the kind of exhaustion, are:
Out of Memory: Killed process [pid] [name]
java.lang.OutOfMemoryError: Java heap space
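Note that the second message comes from the JVM itself once its configured heap is exhausted, not from the kernel's OOM killer. As a purely illustrative example (the jar name is a placeholder), the heap ceiling is controlled by the standard -Xmx flag:
java -Xmx512m -jar app.jar
Raising -Xmx only helps if the host actually has memory to back it; otherwise it simply moves the failure from the JVM to the kernel.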
An OOM traceback message can vary, depending on whether the process killed by the OOM killer is a "cgroup" process or a random process. A command frequently used by administrators to investigate the kernel log and find the OOM traceback is "dmesg". Example of using "dmesg" to look for an OOM traceback:
dmesg | grep -B 5 -A 50 "Call Trace"
Example of the kernel log messages in case of a "cgroup" process termination by the oom killer:
[244032.152975] python2.7 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=100
[244032.152978] CPU: 1 PID: 9122 Comm: python2.7 Tainted: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[244032.152979] Hardware name: xxxxxxxxxxx, BIOS xxx xx/xx/xxxx
[244032.152980] Call Trace:
[244032.152989]  dump_stack+0x50/0x6b
[244032.152993]  dump_header+0x4a/0x200
[244032.152996]  oom_kill_process+0xd7/0x110
[244032.152998]  out_of_memory+0x105/0x500
[244032.153001]  mem_cgroup_out_of_memory+0xb5/0xd0
[244032.153003]  try_charge+0x766/0x7c0
[244032.153007]  ? __alloc_pages_nodemask+0x160/0x320
[244032.153009]  mem_cgroup_try_charge+0x70/0x190
[244032.153011]  mem_cgroup_try_charge_delay+0x1c/0x40
[244032.153015]  __handle_mm_fault+0xda5/0x1330
[244032.153017]  handle_mm_fault+0xb0/0x1e0
[244032.153021]  __do_page_fault+0x28d/0x4c0
[244032.153023]  do_page_fault+0x30/0x110
[244032.153027]  page_fault+0x39/0x40
[244032.153029] RIP: 0033:0x7f5bfb5f6a10
dmesg | grep "Memory cgroup out of memory: Kill*"
[244032.153223] Memory cgroup out of memory: Killed process 9122 (python2.7) total-vm:1320040kB, anon-rss:820808kB, file-rss:14344kB, shmem-rss:0kB, UID:1000 pgtables:2200kB oom_score_adj:100
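When the victim is a cgroup process, it is also worth checking how often that cgroup has hit its limit. A sketch, again with "<cgroup>" as a placeholder path:
cat /sys/fs/cgroup/<cgroup>/memory.events                 # cgroup v2: look at the oom and oom_kill counters
cat /sys/fs/cgroup/memory/<cgroup>/memory.oom_control     # cgroup v1: shows under_oom and oom_kill_disable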
Example of the kernel log messages in case of a random process termination by the oom killer:
dmesg | grep -B 5 -A 50 "Call Trace"
[299443.100998] popen_helper invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[299443.101000] CPU: 10 PID: 9311 Comm: popen_helper Tainted: x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[299443.101001] Hardware name: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, BIOS x.xx xx/xx/xxxx
[299443.101002] Call Trace:
[299443.101010]  dump_stack+0x50/0x6b
[299443.101014]  dump_header+0x4a/0x200
[299443.101016]  oom_kill_process+0xd7/0x110
[299443.101018]  out_of_memory+0x105/0x500
[299443.101020]  __alloc_pages_slowpath+0x9d3/0xd30
[299443.101022]  __alloc_pages_nodemask+0x2d8/0x320
[299443.101024]  pagecache_get_page+0xb4/0x230
[299443.101025]  filemap_fault+0x571/0x880
[299443.101028]  ? page_add_file_rmap+0x12e/0x180
[299443.101031]  ? alloc_set_pte+0xf7/0x590
[299443.101034]  ? xas_load+0x8/0x80
[299443.101035]  ? xas_find+0x16c/0x1b0
[299443.101036]  ? filemap_map_pages+0x18c/0x380
[299443.101057]  ext4_filemap_fault+0x2c/0x40 [ext4]
[299443.101058]  __do_fault+0x53/0xe4
[299443.101060]  __handle_mm_fault+0xce0/0x1330
[299443.101064]  handle_mm_fault+0xb0/0x1e0
[299443.101066]  __do_page_fault+0x28d/0x4c0
[299443.101068]  do_page_fault+0x30/0x110
[299443.101069]  page_fault+0x39/0x40
[299443.101071] RIP: 0033:0x7f5719d8b7b1
dmesg | grep "Out of memory: Kill*"
[133909.531159] Out of memory: Killed process 11692 (python2.7) total-vm:188444kB, anon-rss:45012kB, file-rss:864kB, shmem-rss:0kB, UID:1000 pgtables:400kB oom_score_adj:900
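The oom_score_adj of 900 in the message above is a large part of why this process was chosen. As a rough, purely illustrative one-liner, the per-process badness scores in procfs can be sorted to see which running processes the OOM killer would favor next:
for pid in $(ls /proc | grep -E '^[0-9]+$'); do printf '%6s %7s %s\n' "$(cat /proc/$pid/oom_score 2>/dev/null)" "$pid" "$(cat /proc/$pid/comm 2>/dev/null)"; done | sort -rn | head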
OOM can be the result of one or more of the following issues (a few commands for investigating them are sketched after the list):
Memory allocation handling issues by the kernel
Memory leak
An imbalance in memory consumption, with one service leaving the other services starved for memory
An issue with the hardware memory or a DIMM module
Lack of available memory at the hardware level
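As a starting point for narrowing these down, the following commands (a sketch, not an exhaustive checklist) show overall memory and swap usage, the biggest resident consumers, short-term paging activity, and the physically installed DIMMs:
free -h                                                   # overall memory and swap usage
ps aux --sort=-rss | head                                 # processes holding the most resident memory
vmstat 1 5                                                # brief sample of memory, swap, and paging activity
sudo dmidecode --type memory | grep -i -E 'size|speed'    # installed DIMM modules (requires root)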
The easiest way to recover from OOM issues is to allocate additional swap space. Swap space is commonly configured during the OS installation process, but there are ways to extend it after installation as well. However, in the case of memory unavailability at the hardware level, allocating additional swap is not an option, and administrators will have to explore DIMM capacity upgrades. Changing the overcommit policy through the "/proc/sys/vm/overcommit_memory" setting is another option. Code fixes might address memory handling issues on the host. If a certain service is acting as a memory hog, the system can be rebooted so that the service or process releases its memory back to the system. Individual services can also be restarted instead; however, one must exercise caution while doing so in a distributed architecture.
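As a sketch of the first two options (the 2G size and the /swapfile path are arbitrary choices for illustration), a swap file can be added at runtime and the overcommit policy tightened with sysctl; an /etc/fstab entry and a sysctl configuration file are needed for either change to survive a reboot:
sudo fallocate -l 2G /swapfile        # reserve space for the swap file
sudo chmod 600 /swapfile              # restrict access before enabling it
sudo mkswap /swapfile                 # format it as swap
sudo swapon /swapfile                 # enable it immediately
swapon --show                         # verify the new swap is active
sudo sysctl vm.overcommit_memory=2    # disable overcommit (tune vm.overcommit_ratio alongside)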
This was a relatively brief introduction to out of memory events or OOM events on Linux kernels and how one should approach debugging the system to find out the root cause. Keep an eye out for more such debugging fun!