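According to the Linux kernel Kdump documentation, the tools listed below (kexec-tools, makedumpfile and crash-utility) can be used to capture a kernel crash and analyze it on the host.
Per the Kdump documentation, two kernels and two file systems need to be prepared.
Two Linux kernels
- System kernel - the Linux kernel used for normal operation.
- Dump-capture kernel - the kernel that is booted when the system kernel crashes, in order to capture the state of the crashed kernel.
Two file systems
- The normal rootfs, which must contain kexec-tools. kexec is used to load the dump-capture kernel, which is then started when the system kernel crashes.
- The dump-capture file system, which can be very small. Its main job is to dump the core file of the crash, and it can include the makedumpfile tool.
Notes
One point needs particular attention: the versions of these tools appear to be tightly coupled to the Linux kernel version. In my tests, the latest Linux at the time (v6.9) did not work well with these tools. The versions used here are: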
- Linux kernel - v6.8
- kexec-tools - v2.0.29
- makedumpfile - v1.7.5
- crash-utility - v8.0.5
Linux kernel configuration
As mentioned above, two kernels are needed: a system kernel and a dump-capture kernel. To use kdump, the system kernel needs the following options:
CONFIG_KEXEC=y
CONFIG_SYSFS=y
CONFIG_DEBUG_INFO=y
To analyze the kernel dump with crash-utility, one more option has to be disabled:
# CONFIG_DEBUG_INFO_REDUCED is not set
Otherwise crash fails with the following error:
Type "apropos word" to search for commands related to "word"...
WARNING: CONFIG_DEBUG_INFO_REDUCED=y
crash: ../linux/vmlinux: no debugging data available
crash: neither runqueue nor rq structures exist
The main purpose of the dump-capture kernel is to copy the contents of /proc/vmcore out to the host system, so it needs far fewer options than the system kernel. The following options are required:
CONFIG_CRASH_DUMP=y
CONFIG_PROC_VMCORE=y
CONFIG_RELOCATABLE=y
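The system kernel and the dump-capture kernel can of course share one configuration. On a real system they would normally differ, because the dump-capture kernel should use as few resources as possible; for example, booting a single CPU is enough.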
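The option lists above can be checked mechanically against a kernel .config. The following is a minimal Python sketch, not part of the original setup: the helper names (parse_config, check_config) and the grouping into SYSTEM_KERNEL and CAPTURE_KERNEL are my own, and it only covers the options named in this article.

```python
import re

# Options this article lists for each kernel role (grouping names are hypothetical).
SYSTEM_KERNEL = {"CONFIG_KEXEC": "y", "CONFIG_SYSFS": "y", "CONFIG_DEBUG_INFO": "y"}
CAPTURE_KERNEL = {"CONFIG_CRASH_DUMP": "y", "CONFIG_PROC_VMCORE": "y", "CONFIG_RELOCATABLE": "y"}
# Must stay unset if crash-utility is to read the debug info.
MUST_BE_UNSET = ["CONFIG_DEBUG_INFO_REDUCED"]


def parse_config(text):
    """Return {option: value} for each 'CONFIG_X=value' line.

    Lines of the form '# CONFIG_X is not set' do not match and are skipped.
    """
    opts = {}
    for line in text.splitlines():
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if m:
            opts[m.group(1)] = m.group(2)
    return opts


def check_config(text, required, forbidden=()):
    """Return a list of human-readable problems; an empty list means the config passes."""
    opts = parse_config(text)
    problems = []
    for name, want in required.items():
        if opts.get(name) != want:
            problems.append(f"{name} should be {want}")
    for name in forbidden:
        if opts.get(name) == "y":
            problems.append(f"{name} should not be set")
    return problems
```

A typical use would be `check_config(open(".config").read(), SYSTEM_KERNEL, MUST_BE_UNSET)` in the system kernel build tree and `check_config(..., CAPTURE_KERNEL)` for the dump-capture kernel.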
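The system kernel runs at the same address as before, so its existing configuration needs no further change. The address range in which the dump-capture kernel runs, however, has to be specified by us. This is done with the crashkernel option on the boot command line: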
CONFIG_BOOTARGS="console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda1 rw ip=dhcp debug loglevel=9 crashkernel=512M "
During a normal boot, the system kernel then reserves a block of memory for the dump-capture kernel to use:
#0 __parse_crashkernel(cmdline = 0xFFFF800081C85008 "console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda1 rw ip=dhcp debug loglevel=9 crashkernel=512M crash_kexec_post_notifiers=1", system_ram = 4278190080, crash_size = (long long unsigned int*) 0xFFFF800082523D30, crash_base = (long long unsigned int*) 0xFFFF800082523D28, suffix = (const char*) 0x0) at crash_core.c:269
#1 parse_crashkernel(cmdline = 0xFFFF800081C85008 "console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda1 rw ip=dhcp debug loglevel=9 crashkernel=512M crash_kexec_post_notifiers=1",
, crash_size = (long long unsigned int*) 0xFFFF800082523D30, crash_base = (long long unsigned int*) 0xFFFF800082523D28, low_size = (long long unsigned int*) 0xFFFF800082523D20, high = 0xFFFF800082523D1F "") at crash_core.c:324
#2 arch_reserve_crashkernel() at init.c:109
#3 bootmem_init() at init.c:112
#4 request_standard_resources() at setup.c:359
#5 get_boot_config_from_initrd(_size = (size_t*) 0x0) at main.c:898
#6 setup_command_line(command_line = (char*) 0x0) at main.c:634
#7 __primary_switched() at head.S:524
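How much to reserve depends mainly on the configuration of the dump-capture kernel. Because the same kernel configuration is used for both kernels here, the reservation is fairly large.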
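To make the size argument concrete, here is a small Python sketch that extracts the reservation size from a command line. It is a simplification I wrote for illustration, not the kernel's parser: it only handles the plain `crashkernel=size[@offset]` form with decimal numbers and K/M/G suffixes, and the function names are hypothetical.

```python
import re

_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert '512M', '64K', '1G' or a plain decimal byte count to bytes."""
    m = re.fullmatch(r"(\d+)([KMG]?)", text.upper())
    if not m:
        raise ValueError(f"unsupported size: {text!r}")
    return int(m.group(1)) * _SUFFIX[m.group(2)]


def parse_crashkernel(cmdline):
    """Return (size, offset) for the first crashkernel= argument, or None if absent.

    offset is None when no '@offset' part is given, which leaves the
    placement of the reservation to the kernel.
    """
    for arg in cmdline.split():
        if arg.startswith("crashkernel="):
            value = arg[len("crashkernel="):]
            size, _, offset = value.partition("@")
            return parse_size(size), (parse_size(offset) if offset else None)
    return None
```

With the command line used in this article, `parse_crashkernel` yields a size of 512 MiB and no fixed offset, which fits the boot log shown later, where the kernel picks the base address itself.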
Once parsing is finished, the result is stored in the following variables. When kexec loads the dump-capture kernel, this is where it finds the space.
struct resource crashk_res = {
.name = "Crash kernel",
.start = 0,
.end = 0,
.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
.desc = IORES_DESC_CRASH_KERNEL
};
struct resource crashk_low_res = {
.name = "Crash kernel",
.start = 0,
.end = 0,
.flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
.desc = IORES_DESC_CRASH_KERNEL
};
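On a running system, the reservation recorded in crashk_res should show up as a "Crash kernel" entry in /proc/iomem. The sketch below looks that entry up. It assumes the usual `start-end : name` layout of /proc/iomem with an inclusive end address; the function name is my own, and the sample text in the usage note is a made-up excerpt built from the address range in this article's boot log. When read without root privileges, /proc/iomem may report all addresses as zero.

```python
def find_crash_kernel(iomem_text):
    """Return (start, end_inclusive) of the 'Crash kernel' entry, or None if absent."""
    for line in iomem_text.splitlines():
        span, sep, name = line.partition(" : ")
        if sep and name.strip() == "Crash kernel":
            start, _, end = span.strip().partition("-")
            return int(start, 16), int(end, 16)
    return None


def region_size(region):
    """Size in bytes of an inclusive (start, end) region."""
    start, end = region
    return end - start + 1
```

For example, given a line such as `  db600000-fb5fffff : Crash kernel`, `region_size(find_crash_kernel(text))` evaluates to 512 MiB. On the target, `text` would come from `open("/proc/iomem").read()`.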
File system
The file system is built with buildroot. Since buildroot already supports kexec-tools and makedumpfile, enabling the following options is enough:
BR2_PACKAGE_KEXEC=y
BR2_PACKAGE_KEXEC_ZLIB=y
BR2_PACKAGE_MAKEDUMPFILE=y
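If the two systems use different rootfs images, add kexec-tools to the normal system and makedumpfile to the dump-capture system.
To be able to load the dump-capture system, the Image of the dump-capture Linux kernel has to be placed in the rootfs of the normal system. In the experiment below it is placed at /root/Image.
To keep the experiment simple, the same file system is used for both.
To get the core dump out after a kernel crash, the experiment below first copies the core dump into the file system and then copies it to the host. The rootfs therefore needs a fairly large amount of free space.
If the system has networking, the core dump can instead be copied out over the network. There are many ways to do this; use whichever is most convenient.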
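Loading the dump-capture kernel
With both systems prepared, the system boots as usual. The only change is that the Linux boot arguments passed by u-boot need the crashkernel option: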
CONFIG_BOOTARGS="console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda1 rw ip=dhcp debug loglevel=9 crashkernel=512M "
After a successful boot, the following message appears:
[ 0.000000] crashkernel reserved: 0x00000000db600000 - 0x00000000fb600000 (512 MB)
This physical address range has been reserved for the dump-capture Linux kernel.
Because the dump-capture Linux kernel is already at /root/Image, it can now be loaded into the range above with the following command on the target:
kexec -p /root/Image --reuse-cmdline --append="maxcpus=1 reset_devices"
The output, with -d added for debug messages, is:
# kexec -p /root/Image --reuse-cmdline --append="maxcpus=1 reset_devices" -d
kernel symbol _text vaddr = ffff800080000000
kernel symbol _stext vaddr = ffff800080010000
kernel symbol __init_begin vaddr = ffff800081b80000
arch_process_options:178: command_line: console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda1 rw ip=dhcp debug loglevel=9 maxcpus=1 reset_devices
arch_process_options:180: initrd: (null)
arch_process_options:182: dtb: (null)
arch_process_options:185: console: (null)
Try gzip decompression.
Try LZMA decompression.
elf_arm64_probe: Not an ELF executable.
[ 8103.741670] kernel: (____ptrval____) kernel_size: 0x29b2200
[ 8103.741816] Crash PT_LOAD ELF header. phdr=(____ptrval____) vaddr=0xffff000000000000, paddr=0x80000000, sz=0x5b600000 e_phnum=11 p_offset=0x80000000
[ 8103.741981] Crash PT_LOAD ELF header. phdr=(____ptrval____) vaddr=0xffff00007b600000, paddr=0xfb600000, sz=0x3a00000 e_phnum=12 p_offset=0xfb600000
[ 8103.742145] Crash PT_LOAD ELF header. phdr=(____ptrval____) vaddr=0xffff000800000000, paddr=0x880000000, sz=0x80000000 e_phnum=13 p_offset=0x880000000
[ 8103.742315] Loaded elf core header at 0xfb5f0000 bufsz=0x1000 memsz=0x1000
[ 8103.743956] RNG is not initialised: omitting "kaslr-seed" property
[ 8103.744036] RNG is not initialised: omitting "rng-seed" property
[ 8103.745069] Loaded dtb at 0xfb400000 bufsz=0x2eec memsz=0x3000
[ 8103.745166] Loaded kernel at 0xdb600000 bufsz=0x29b2200 memsz=0x2a60000
[ 8103.745291] nr_segments = 3
[ 8103.745354] segment[0]: buf=0x(____ptrval____) bufsz=0x29b2200 mem=0xdb600000 memsz=0x2a60000
[ 8103.820075] segment[1]: buf=0x(____ptrval____) bufsz=0x1000 mem=0xfb5f0000 memsz=0x1000
[ 8103.820205] segment[2]: buf=0x(____ptrval____) bufsz=0x2eec mem=0xfb400000 memsz=0x3000
[ 8103.848444] machine_kexec_post_load:119:
[ 8103.848521] kexec kimage info:
[ 8103.848581] type: 1
[ 8103.848646] head: 4
[ 8103.848708] kern_reloc: 0x0000000000000000
[ 8103.848785] el2_vectors: 0x0000000000000000
[ 8103.848860] kexec_file_load: type:1, start:0xdb600000 head:0x4 flags:0xe
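Because --reuse-cmdline is used, the dtb and rootfs are the ones the normal system was booted with. The command does, however, modify the memory node in the device tree so that it covers only the range reserved by crashkernel, and it removes the crashkernel option from the boot command line.
These changes are easy to understand. With the original device tree, the dump-capture kernel would overwrite the traces left by the previously running system. And if crashkernel were still on the boot command line, the dump-capture kernel would try to reserve that memory again and would have nowhere left to run.
The device tree, rootfs and boot command line can of course be adjusted to suit your own setup.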
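As a sanity check on the load log above, the three segments can be verified to lie inside the reserved range without overlapping each other. The Python sketch below does that arithmetic. The addresses are copied from the "crashkernel reserved" line and the segment[...] lines; the helper names and the role labels in the comments are mine.

```python
# [start, end) taken from: crashkernel reserved: 0xdb600000 - 0xfb600000 (512 MB)
RESERVED = (0xDB600000, 0xFB600000)

# (mem, memsz) copied from the segment[0..2] lines of the kexec output
SEGMENTS = [
    (0xDB600000, 0x2A60000),  # kernel Image ("Loaded kernel at ...")
    (0xFB5F0000, 0x1000),     # ELF core header ("Loaded elf core header at ...")
    (0xFB400000, 0x3000),     # device tree blob ("Loaded dtb at ...")
]


def inside(segment, region):
    """True if the segment [mem, mem + memsz) fits within the region [start, end)."""
    mem, memsz = segment
    start, end = region
    return start <= mem and mem + memsz <= end


def overlaps(a, b):
    """True if two (mem, memsz) segments share at least one byte."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def check_layout(segments, region):
    """True if every segment is inside the region and no two segments overlap."""
    if not all(inside(s, region) for s in segments):
        return False
    return not any(
        overlaps(a, b)
        for i, a in enumerate(segments)
        for b in segments[i + 1:]
    )
```

`check_layout(SEGMENTS, RESERVED)` holds for the values in this log: the kernel sits at the start of the reservation, while the dtb and the ELF core header sit near its end.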
Triggering a system crash
You could write a module to simulate a kernel crash. For simplicity, the following command is used to trigger one here:
echo c > /proc/sysrq-trigger
The result is:
# echo c > /proc/sysrq-trigger
[38426.348804] sysrq: Trigger a crash
[38426.348871] Kernel panic - not syncing: sysrq triggered crash
[38426.348947] CPU: 0 PID: 147 Comm: sh Kdump: loaded Not tainted 6.8.0 #2
[38426.349070] Hardware name: FVP Base RevC (DT)
[38426.349143] Call trace:
[38426.349198] dump_backtrace+0x90/0xe8
[38426.349340] show_stack+0x18/0x24
[38426.349479] dump_stack_lvl+0x48/0x60
[38426.349623] dump_stack+0x18/0x24
[38426.349763] panic+0x380/0x3b4
[38426.349884] sysrq_reset_seq_param_set+0x0/0x98
[38426.350024] __handle_sysrq+0xe4/0x1cc
[38426.350158] write_sysrq_trigger+0xdc/0xf0
[38426.350298] proc_reg_write+0x9c/0xf0
[38426.350417] vfs_write+0xd4/0x360
[38426.350555] ksys_write+0x74/0x10c
[38426.350695] __arm64_sys_write+0x1c/0x28
[38426.350840] invoke_syscall+0x48/0x110
[38426.350976] el0_svc_common.constprop.0+0x40/0xe0
[38426.351121] do_el0_svc+0x1c/0x28
[38426.351253] el0_svc+0x34/0xb4
[38426.351353] el0t_64_sync_handler+0x120/0x12c
[38426.351468] el0t_64_sync+0x190/0x194
[38426.351572] SMP: stopping secondary CPUs
[38426.351647] Kernel Offset: disabled
[38426.351706] CPU features: 0x0,00000000,0062cd4a,3346773f
[38426.351790] Memory Limit: none
[38426.352166] Starting crashdump kernel...
[38426.352229] Bye!
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd0f0]
[ 0.000000] Linux version 6.8.0 (qixxu01@a015921) (aarch64-none-linux-gnu-gcc (Arm GNU Toolchain 13.2.rel1 (Build arm-13.7)) 13.2.1 20231009, GNU ld (Arm GNU Toolchain 13.2.rel1 (Build arm-13.7)) 2.41.0.20231009) #1 SMP PREEMPT Wed Jul 31 17:32:39 CST 2024
[ 0.000000] KASLR disabled due to lack of seed
[ 0.000000] Machine model: FVP Base RevC
[ 0.000000] earlycon: pl11 at MMIO 0x000000001c090000 (options '')
[ 0.000000] printk: legacy bootconsole [pl11] enabled
[ 0.000000] efi: UEFI not found.
[ 0.000000] OF: fdt: Reserving 4 KiB of memory at 0xfb5f0000 for elfcorehdr
[ 0.000000] Reserved memory: created DMA memory pool at 0x0000000018000000, size 8 MiB
[ 0.000000] OF: reserved mem: initialized node vram@18000000, compatible id shared-dma-pool
[ 0.000000] OF: reserved mem: 0x0000000018000000..0x00000000187fffff (8192 KiB) nomap non-reusable vram@18000000
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] NUMA: Faking a node at [mem 0x00000000db600000-0x00000000fb5fffff]
[ 0.000000] NUMA: NODE_DATA [mem 0xfb4ec9c0-0xfb4eefff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00000000db600000-0x00000000fb5fffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00000000db600000-0x00000000fb5fffff]
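As shown, the dump-capture system starts right after "Bye!", and it runs in the physical address range reserved earlier by crashkernel. Once it has booted, a new file, /proc/vmcore, appears under /proc.
This file is the core dump we need.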
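Before copying anything, it can be worth confirming that /proc/vmcore really looks like an ELF core file. The following Python sketch inspects only the first 18 bytes of the file. It assumes a 64-bit ELF image, as the crash_prepare_elf64_headers name in the kernel suggests; the function name is my own. A dump written by makedumpfile in compressed mode is, as far as I know, not ELF, so I would not expect it to pass this check.

```python
import struct

ET_CORE = 4  # e_type value for core files in the ELF specification


def is_elf64_core(header):
    """Check ELF magic, 64-bit class and e_type == ET_CORE in the first 18 bytes."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        return False
    if header[4] != 2:  # EI_CLASS: 2 means ELFCLASS64
        return False
    # EI_DATA: 1 means little endian, anything else is treated as big endian
    byte_order = "<" if header[5] == 1 else ">"
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type == ET_CORE
```

In the dump-capture system this would be used as `is_elf64_core(open("/proc/vmcore", "rb").read(18))`.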
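How is this file generated? When kexec loaded the dump-capture kernel earlier, information about the running system kernel was recorded in advance in a block of the memory reserved for the crash kernel. See: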
#0 crash_prepare_elf64_headers(mem = (struct crash_mem*) 0xFFFF0008002F9240, need_kernel_map = 1, addr = (void**) 0xFFFF8000834D3C18, sz = (long unsigned int*) 0xFFFF8000834D3C20) at crash_core.c:477
#1 prepare_elf_headers(addr = (void**) 0xFFFF8000834D3C18, sz = (long unsigned int*) 0xFFFF8000834D3C20) at machine_kexec_file.c:77
#2 image_load(image = (struct kimage*) 0xFFFF00080001AC00, kernel
, kernel_len = 43721216, initrd = (char*) 0x0, initrd_len = 0, cmdline = 0xFFFF0008002F8C00 "console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda1 rw ip=dhcp debug loglevel=9 crash_kexec_post_notifiers=1 maxcpus=1 reset", cmdline_len
) at kexec_image.c:103
#3 PTR_ERR(ptr = <Value currently has no location>) at kexec_file.c:73
#4 kimage_file_prepare_segments(image = (struct kimage*) 0xFFFF00080001AC00, kernel_fd = <Value currently has no location>, initrd_fd = <Value optimised away by compiler>, cmdline_ptr = <Value optimised away by compiler>, cmdline_len = 139, flags = 6) at kexec_file.c:259
#5 kimage_file_alloc_init(rimage
, kernel_fd = <Value currently has no location>, initrd_fd = <Value currently has no location>, cmdline_ptr = <Value currently has no location>, cmdline_len = 139, flags = 6) at kexec_file.c:307
#6 __do_sys_kexec_file_load(kernel_fd = 43721216, initrd_fd = <Value optimised away by compiler>, cmdline_len = 139, cmdline_ptr = <Value optimised away by compiler>, flags = 6) at kexec_file.c:308
#7 __arm64_sys_kexec_file_load(regs ERROR(CMD367-IMG96):
During boot, the dump-capture kernel then exposes this memory as a file under /proc. See the function vmcore_init.
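The three "Crash PT_LOAD ELF header" lines in the kexec output can be reproduced by hand: they appear to be the system RAM ranges with the crash kernel reservation cut out, so that the dump describes the old kernel's memory but not the memory the dump-capture kernel itself runs in. The Python sketch below redoes that subtraction. The RAM ranges are taken from the "Early memory node ranges" lines of the system kernel's boot log; the helper name is mine, and this only mirrors the arithmetic, not the kernel's implementation.

```python
def exclude(ranges, hole):
    """Remove the half-open range `hole` from each half-open range in `ranges`."""
    h_start, h_end = hole
    out = []
    for start, end in ranges:
        if h_end <= start or end <= h_start:
            out.append((start, end))  # no overlap, keep unchanged
            continue
        if start < h_start:
            out.append((start, h_start))  # part below the hole
        if h_end < end:
            out.append((h_end, end))  # part above the hole
    return out


# System RAM as half-open ranges (boot log: 0x80000000-0xfeffffff, 0x880000000-0x8ffffffff)
RAM = [(0x80000000, 0xFF000000), (0x880000000, 0x900000000)]
# Reservation from: crashkernel reserved: 0xdb600000 - 0xfb600000
CRASH = (0xDB600000, 0xFB600000)

# (paddr, size) pairs, directly comparable with paddr= and sz= in the PT_LOAD lines
PT_LOADS = [(start, end - start) for start, end in exclude(RAM, CRASH)]
```

The resulting pairs match the paddr and sz values printed in the log: 0x80000000 with 0x5b600000, 0xfb600000 with 0x3a00000, and 0x880000000 with 0x80000000.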
Copying vmcore to the host
The simplest way to copy the vmcore file is:
cp /proc/vmcore /root/vmcore
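Then, on the host, copy the vmcore out of the rootfs image to the host disk. Here loop17 is the loop device that losetup happened to allocate in this run; yours may differ: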
sudo losetup -f -P rootfs/grub-busybox.img
sudo mount /dev/loop17p1 tmp
sudo cp tmp/root/vmcore .
# Unmount and clean up when done
sudo umount tmp
sudo losetup -d /dev/loop17
A major drawback of this approach is that the vmcore file is very large (gigabytes), and not every target board has that much space. This is where makedumpfile comes in:
makedumpfile -c -d 30 /proc/vmcore /root/vmcore
This shrinks the vmcore to the order of megabytes. With the vmcore file in hand, crash-utility can be used to analyze the crash.
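The `-d 30` argument is a bit mask that selects which kinds of pages makedumpfile leaves out, while `-c` turns on compression. The Python sketch below decodes a dump level according to the bit values in the makedumpfile man page as I understand them; the table labels and function name are my own wording.

```python
# Bit values of the makedumpfile dump level (-d), per the man page
DUMP_LEVEL_BITS = [
    (1, "zero pages"),
    (2, "non-private cache pages"),
    (4, "private cache pages"),
    (8, "user process data pages"),
    (16, "free pages"),
]


def excluded_pages(level):
    """Return the page types that a given dump level excludes from the dump."""
    if not 0 <= level <= 31:
        raise ValueError("dump level must be between 0 and 31")
    return [name for bit, name in DUMP_LEVEL_BITS if level & bit]
```

So `excluded_pages(30)` lists everything except zero pages, which is what the command above uses. Level 31 would exclude zero pages as well.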
Building and using crash-utility
Because the target is arm64, the crash-utility shipped with the host system cannot be used. Build it as follows:
wget https://github.com/crash-utility/crash/archive/refs/tags/8.0.5.tar.gz
tar -zxvf 8.0.5.tar.gz
cd crash-8.0.5/
make target=ARM64
The crash command can then be used to analyze the vmcore:
$ src/crash-8.0.5/crash src/linux/vmlinux vmcore
For help, type "help".
Type "apropos word" to search for commands related to "word"...
please wait... (determining panic task)
WARNING: cannot determine starting stack frame for task ffff800082533f00
WARNING: cannot determine starting stack frame for task ffff00080033b300
WARNING: cannot determine starting stack frame for task ffff00080033c400
WARNING: cannot determine starting stack frame for task ffff00080033d500
WARNING: cannot determine starting stack frame for task ffff00080033e600
WARNING: cannot determine starting stack frame for task ffff000800348000
WARNING: cannot determine starting stack frame for task ffff00080034a200
WARNING: cannot determine starting stack frame for task ffff000801c3e600
KERNEL: src/linux/vmlinux
DUMPFILE: vmcore
CPUS: 8
DATE: Thu Aug 1 09:24:51 CST 2024
UPTIME: 00:04:56
LOAD AVERAGE: 0.01, 0.02, 0.00
TASKS: 113
NODENAME: ArmBaseFVP
RELEASE: 6.8.0
VERSION: #2 SMP PREEMPT Wed Jul 31 17:54:17 CST 2024
MACHINE: aarch64 (unknown Mhz)
MEMORY: 4 GB
PANIC: "Kernel panic - not syncing: sysrq triggered crash"
PID: 145
COMMAND: "sh"
TASK: ffff000801c3e600 [THREAD_INFO: ffff000801c3e600]
CPU: 6
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 145 TASK: ffff000801c3e600 CPU: 6 COMMAND: "sh"
#0 [ffff80008347bb40] __crash_kexec at ffff80008014f0bc
#1 [ffff80008347bbc0] panic at ffff80008008dbf8
#2 [ffff80008347bc50] sysrq_handle_crash at ffff80008082f548
#3 [ffff80008347bc60] __handle_sysrq at ffff80008082fc50
#4 [ffff80008347bcb0] write_sysrq_trigger at ffff8000808303d8
#5 [ffff80008347bd00] proc_reg_write at ffff80008035737c
#6 [ffff80008347bd80] vfs_write at ffff8000802cebc4
#7 [ffff80008347bdd0] ksys_write at ffff8000802ceff8
#8 [ffff80008347be00] __arm64_sys_write at ffff8000802cf0ac
#9 [ffff80008347be10] invoke_syscall at ffff800080026abc
#10 [ffff80008347be40] el0_svc_common.constprop.0 at ffff800080026bc4
#11 [ffff80008347be70] do_el0_svc at ffff800080026c80
#12 [ffff80008347be80] el0_svc at ffff800081050fd8
#13 [ffff80008347bea0] el0t_64_sync_handler at ffff800081051408
#14 [ffff80008347bfe0] el0t_64_sync at ffff800080011d48
PC: 0000ffffa896e128 LR: 0000aaaae35485f0 SP: 0000fffff8909b50
X29: 0000fffff8909b50 X28: 0000000000000000 X27: 0000000000000000
X26: 0000aaab22fed6e0 X25: 0000000000000020 X24: 0000fffff8909be0
X23: 0000aaab22ff1a30 X22: 0000aaaae3602000 X21: 0000000000000002
X20: 0000aaab22ff1a30 X19: 0000000000000001 X18: 0000000000000004
X17: 0000ffffa896e100 X16: 0000aaaae3601988 X15: 0000000000000001
X14: 0000000000000001 X13: 0000aaab22fed6cb X12: 0000aaaae35fc0b4
X11: 0000aaaae35e246f X10: 0000000000000000 X9: 0000000000000040
X8: 0000000000000040 X7: 0000000000000ef1 X6: 0000000000000063
X5: fffffffffffffffe X4: 0000000000000001 X3: 0000000000000001
X2: 0000000000000002 X1: 0000aaab22ff1a30 X0: 0000000000000001
ORIG_X0: 0000000000000001 SYSCALLNO: 40 PSTATE: 80000000
crash> log
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd0f0]
[ 0.000000] Linux version 6.8.0 (qixxu01@a015921) (aarch64-none-linux-gnu-gcc (Arm GNU Toolchain 13.2.rel1 (Build arm-13.7)) 13.2.1 20231009, GNU ld (Arm GNU Toolchain 13.2.rel1 (Build arm-13.7)) 2.41.0.20231009) #2 SMP PREEMPT Wed Jul 31 17:54:17 CST 2024
[ 0.000000] KASLR disabled due to lack of seed
[ 0.000000] Machine model: FVP Base RevC
[ 0.000000] earlycon: pl11 at MMIO 0x000000001c090000 (options '')
[ 0.000000] printk: legacy bootconsole [pl11] enabled
[ 0.000000] efi: UEFI not found.
[ 0.000000] Reserved memory: created DMA memory pool at 0x0000000018000000, size 8 MiB
[ 0.000000] OF: reserved mem: initialized node vram@18000000, compatible id shared-dma-pool
[ 0.000000] OF: reserved mem: 0x0000000018000000..0x00000000187fffff (8192 KiB) nomap non-reusable vram@18000000
[ 0.000000] NUMA: No NUMA configuration found
[ 0.000000] NUMA: Faking a node at [mem 0x0000000080000000-0x00000008ffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x8ff7f29c0-0x8ff7f4fff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000080000000-0x00000000ffffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal [mem 0x0000000100000000-0x00000008ffffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000080000000-0x00000000feffffff]
[ 0.000000] node 0: [mem 0x0000000880000000-0x00000008ffffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x00000008ffffffff]
crash-utility has many more tricks; see the crash whitepaper.
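When the output of `bt` is saved to a file, for instance to compare several crashes, the frames are easy to pull apart with a short script. The Python sketch below assumes the frame layout shown in the session above (`#N [stack address] function at address`); the regular expression and function name are my own and may need adjusting for other crash versions or architectures.

```python
import re

# Matches frame lines such as: #1 [ffff80008347bbc0] panic at ffff80008008dbf8
FRAME = re.compile(r"^\s*#(\d+)\s+\[([0-9a-f]+)\]\s+(\S+)\s+at\s+([0-9a-f]+)")


def parse_bt(text):
    """Return a list of (frame_number, function_name, address) tuples.

    Lines that are not frames (the PID header, register dump) are ignored.
    """
    frames = []
    for line in text.splitlines():
        m = FRAME.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(3), int(m.group(4), 16)))
    return frames
```

Applied to the backtrace above, the first three function names come out as `__crash_kexec`, `panic` and `sysrq_handle_crash`.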
References:
- Linux kernel crash-dump mechanism
- Using Kdump for examining Linux Kernel crashes