- Linux Kernel Live Debugging with VMware Workstation
- The Goal
- Step-by-Step
- Virtual Serial Port
- ttyS0 or ttyS1?
- Guest: KGDB Setup
- SysRq Config Host/Guest
- Kernel Break/Resume
- Kernel Module Debug
- Miscellaneous
- Reference
- KGDB
- SysRq
- Related Posts
- Debugging the Linux kernel with VMware
- VM config
- Preparing the guest OS
- Using GDB
Linux Kernel Live Debugging with VMware Workstation
Recently, I was assigned the task of fixing our vmhgfs kernel module for Ubuntu 15.10, which ships with a cutting-edge Linux 4.2 kernel. The module does not crash, but the file system simply does not work. As a sustaining engineer, I find live debugging invaluable for finding my way through a code base I am not very familiar with…
The Goal
Debug Guest OS Linux kernel over virtual serial port with VMware Workstation.
Step-by-Step
Virtual Serial Port
You can also add a serial port from the VMware Workstation UI; the effect is the same. Either way, the .vmx configuration file ends up with a few more lines for the serial port.
serial1.present = "TRUE"
serial1.yieldOnMsrRead = "TRUE"
serial1.fileType = "pipe"
serial1.fileName = "/tmp/com_ubuntu1510"
ttyS0 or ttyS1?
Another virtual device might already occupy ttyS0, so first check which port the new serial device maps to. On the host:
- In one terminal, run socat -d -d /tmp/com_ubuntu1510 tcp-listen:9999 to redirect the named pipe to a TCP socket.
- In another terminal, run telnet 127.0.0.1 9999
Inside the guest, as root, write a test string such as “whatever” to /dev/ttyS0.
Then check whether “whatever” shows up in the telnet terminal on the host. If it does, the port is ttyS0. If it does not, try ttyS1.
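The guest-side probe can be sketched as a small shell function. The function name send_probe and the exact message text are illustrative assumptions, not part of the original setup; the only requirement is that something recognizable is written to the candidate device.

```shell
# Write a recognizable test line to a candidate serial device.
# $1: device path, e.g. /dev/ttyS0 or /dev/ttyS1 (run as root in the guest).
# The message names the device so the host side can tell which one arrived.
send_probe() {
    dev=$1
    printf 'whatever via %s\n' "$dev" > "$dev"
}
```

Calling send_probe /dev/ttyS0 and then send_probe /dev/ttyS1 while the telnet session is open on the host should show which device is wired to the pipe.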
Guest: KGDB Setup
The first question is whether your guest's Linux kernel supports KGDB…
hfu@ubuntu:~$ grep KGDB /boot/config-$(uname -r)
CONFIG_SERIAL_KGDB_NMI=y
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
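The same check can be scripted. The sketch below is an assumption-laden helper, not the author's method: the function name kgdb_config_ok is made up, and CONFIG_KGDB_SERIAL_CONSOLE does not appear in the grep output above. As far as I know it is the Kconfig option behind the kgdboc driver, and it needs to be built in (=y) for a kgdboc= boot parameter to take effect early.

```shell
# Return 0 if the given kernel config file has KGDB and the kgdboc
# serial-console driver built in, non-zero otherwise.
# $1: path to a kernel config, e.g. /boot/config-$(uname -r)
# Assumption: CONFIG_KGDB_SERIAL_CONSOLE is the option behind kgdboc.
kgdb_config_ok() {
    cfg=$1
    grep -q '^CONFIG_KGDB=y$' "$cfg" &&
        grep -q '^CONFIG_KGDB_SERIAL_CONSOLE=y$' "$cfg"
}
```

A typical call would be kgdb_config_ok "/boot/config-$(uname -r)" && echo "kgdb over serial looks usable".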
Now configure the kernel to start the kgdb server. Simply put, add kgdboc=ttyS0,115200 kgdbwait to the kernel command line.
- https://www.kernel.org/doc/Documentation/kernel-parameters.txt
- kgdboc=ttyS0,115200 - kgdb over console, serial port ttyS0 , and baud rate is 115200
- kgdbwait - “Stop kernel execution and enter the kernel debugger at the earliest opportunity.”
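After rebooting, it is worth confirming that the parameters actually reached the kernel. The helper below is a minimal sketch; the name cmdline_has is hypothetical, and it only does a whole-word match on a command-line string such as the contents of /proc/cmdline.

```shell
# Return 0 if a kernel command line contains the given parameter,
# either as a bare flag (kgdbwait) or as key=value (kgdboc=...).
# $1: command-line string, $2: parameter name
cmdline_has() {
    case " $1 " in
        *" $2 "*|*" $2="*) return 0 ;;
        *) return 1 ;;
    esac
}
```

In the guest this could be used as cmdline_has "$(cat /proc/cmdline)" kgdboc. Padding the string with spaces keeps a parameter like nokgdbwait from matching kgdbwait.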
Here we create a GRUB menu entry in /etc/grub.d/40_custom. (I have to complain about the GRUB 2 design: the configuration is so arcane compared to GRUB 1 that you don't even want to touch it…)
# /etc/grub.d/40_custom, copied from /boot/grub/grub.cfg
menuentry 'UbuntuKGDB' --class ubuntu --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-4492a93c-91e2-4979-b5b9-71e32901511c' {
    insmod gzio
    insmod part_msdos
    insmod ext2
    set root='hd0,msdos1'
    search --no-floppy --fs-uuid --set=root 4492a93c-91e2-4979-b5b9-71e32901511c
    linux /boot/vmlinuz-4.2.0-14-generic root=UUID=4492a93c-91e2-4979-b5b9-71e32901511c ro find_preseed=/preseed.cfg auto noprompt priority=critical locale=en_US kgdboc=ttyS0,115200 kgdbwait
    initrd /boot/initrd.img-4.2.0-14-generic
}
If you cannot see the GRUB menu, comment out these two lines in /etc/default/grub:
#GRUB_HIDDEN_TIMEOUT=10
#GRUB_HIDDEN_TIMEOUT_QUIET=false
After all these modifications, don't forget to run update-grub.
Reboot the guest with the newly added kernel entry and check that the kgdb server is waiting for a connection.
SysRq Config Host/Guest
Since we are going to use SysRq to break into the debugger, we need to disable SysRq on the host. Otherwise the key combination will be captured by the host.
echo 0 > /proc/sys/kernel/sysrq
# Or with the sysctl command:
# sysctl -w kernel.sysrq=0
# And to survive a reboot:
# echo 'kernel.sysrq = 0' >> /etc/sysctl.conf
On the guest, do the opposite and enable all SysRq functions.
echo 1 > /proc/sys/kernel/sysrq
Kernel Break/Resume
For how to get the source code and debug symbols, see my previous post, Linux Kernel DebugInfo. Attach gdb to the kgdb server:
gdb debuginfo/usr/lib/debug/boot/vmlinux-4.2.0-14-generic
(gdb) set substitute-path /build/linux-xyuzCP/linux-4.2.0 /home/hfu/debuginfo/linux-source-4.2.0/linux-source-4.2.0
(gdb) target remote localhost:9999
(gdb) c
Once the kernel is running again, either of the following on the guest will break into the debugger:
# press Alt+SysRq+g
# OR
echo g > /proc/sysrq-trigger
Kernel Module Debug
Build your module with the gcc debugging option -g and load it, e.g. modprobe vmhgfs. Then, in the guest, generate a gdb script that loads the module's symbols at the addresses where its sections ended up:
_dir=/sys/module/vmhgfs/sections
cmd="add-symbol-file ~/vmhgfs.ko $(cat $_dir/.text) -s .bss $(cat $_dir/.bss) -s .data $(cat $_dir/.data)"
echo "$cmd" > add_vmhgfs_symbol.gdb
Copy the generated gdb script to the host, break into the debugger, and source it:
(gdb) source add_vmhgfs_symbol.gdb
(gdb) break HgfsSendRequest
Miscellaneous
Reference
KGDB
The kernel debugger kgdb, hypervisors like QEMU or JTAG-based hardware interfaces allow to debug the Linux kernel and its modules during runtime using gdb. Gdb comes with a powerful scripting interface for python. The kernel provides a collection of helper scripts that can simplify typical kernel debugging steps. This is a short tutorial about how to enable and use them. It focuses on QEMU/KVM virtual machines as target, but the examples can be transferred to the other gdb stubs as well.
I actually use VMware Workstation as the virtualization solution…
SysRq
It is a 'magical' key combo you can hit which the kernel will respond to regardless of whatever else it is doing, unless it is completely locked up. On x86 - You press the key combo 'ALT-SysRq-<command key>'. Note - Some keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is also known as the 'Print Screen' key. Also some keyboards cannot handle so many keys being pressed at the same time, so you might have better luck with "press Alt", "press SysRq", "release SysRq", "press <command key>", release everything.
Unlike with a user-space application, we use SysRq to break a running kernel into the debugger.
© Dyno Fu, 2015 — built with Jekyll using Lagom theme with modification.
Debugging the Linux kernel with VMware
I am playing with emulated HID devices in Linux and found a kernel bug when using the usb_f_hid and dummy_hcd kernel modules. I won't go into the details of what I am trying to achieve (I am saving that for a future post) and will focus instead on how I troubleshot this particular bug. As of this writing, it is 100% reproducible with the 4.15.0-45 kernel and these steps:
- Load libcomposite and dummy_hcd into the kernel
- Create an emulated HID device with configfs
- Write something to /dev/hidg0
After executing step 3, the machine hangs in a way which makes it clear that it’s not a userspace problem but a kernel one.
I know that VMware Workstation has support for kernel debugging, so I thought this was a great chance to give it a try. I loaded Ubuntu 18.04 into a VM and executed the steps above. The VM hung and the corresponding vmware-vmx process on the host OS went to 100% CPU usage. This means that when the problem occurs, the kernel goes into some kind of busy loop. Let's see how we can debug this further.
VM config
VMware Workstation has a nice feature that lets you debug the Linux kernel running inside the VM with gdb on the host. It is enabled by adding a single line to the VM's configuration file:
debugStub.listen.guest64 = "TRUE"
Now when the VM is started, port 8864 is opened on the host and we can connect to it with gdb for remote debugging.
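If gdb cannot connect, a first thing to rule out is a typo in the .vmx file. The sketch below is only a convenience check; the function name vmx_debug_stub_enabled is hypothetical, and it assumes the option is written exactly as in the line above, with the value "TRUE" in upper case.

```shell
# Return 0 if the .vmx file enables the 64-bit guest debug stub.
# $1: path to the VM's .vmx file
# Assumption: the key and value are spelled exactly as in this post.
vmx_debug_stub_enabled() {
    grep -Eq '^[[:space:]]*debugStub\.listen\.guest64[[:space:]]*=[[:space:]]*"TRUE"' "$1"
}
```

On the host this could be called as vmx_debug_stub_enabled /path/to/guest.vmx, where the path is a placeholder for wherever the VM lives.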
Preparing the guest OS
First, we want to disable KASLR as it will make the debugging harder. This is done by booting the kernel with the nokaslr option.
Open /etc/default/grub, find the line starting with GRUB_CMDLINE_LINUX_DEFAULT and append nokaslr at the end. Then update GRUB by running sudo update-grub, and reboot the guest so the new option takes effect.
Next, we want to obtain debug symbols for the running kernel and its modules. For Ubuntu there is a dedicated repository:
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | \
sudo tee -a /etc/apt/sources.list.d/ddebs.list
sudo apt install ubuntu-dbgsym-keyring
sudo apt-get update
Install kernel with debug symbols:
sudo apt-get install linux-image-`uname -r`-dbgsym
Copy the following files to the host OS as we need to load them into gdb :
/usr/lib/debug/boot/vmlinux-4.15.0-45-generic
/usr/lib/debug/lib/modules/4.15.0-45-generic/kernel/drivers/usb/gadget/udc/udc-core.ko
/usr/lib/debug/lib/modules/4.15.0-45-generic/kernel/drivers/usb/gadget/function/usb_f_hid.ko
dummy_hcd.ko
Note: dummy_hcd is missing in Ubuntu (see this bug), so I built it myself.
Using GDB
Now that we have the debug symbols, we are ready to fire off gdb in the host OS:
$ gdb vmlinux-4.15.0-45-generic
(gdb) set architecture i386:x86-64
(gdb) target remote localhost:8864
(gdb) c
Hitting Ctrl+C in gdb pauses the VM so we can inspect its state in the debugger.
We also need to load the symbols for the kernel modules that we want to debug. However, modules can be dynamically loaded at any address and we need to feed this information into gdb . Let’s take for example usb_f_hid . We can get the corresponding addresses in the guest OS like this:
$ cd /sys/module/usb_f_hid/sections
$ sudo cat .text .data .bss
0xffffffffc06d7000
0xffffffffc06da000
0xffffffffc06da740
Now we can add the symbol file in gdb using the addresses from above:
(gdb) add-symbol-file usb_f_hid.ko 0xffffffffc06d7000 -s .data 0xffffffffc06da000 -s .bss 0xffffffffc06da740
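Since the same lookup has to be repeated for every module, it can be wrapped in a small helper that prints the gdb command. This is a sketch under assumptions: the function name gen_add_symbol_file and its argument order are made up, and it expects the sections directory to contain .text, .data and .bss files as shown above.

```shell
# Print a gdb add-symbol-file command for a loaded kernel module.
# $1: path to the module's .ko with debug info, as gdb on the host sees it
# $2: sections directory, e.g. /sys/module/usb_f_hid/sections
# Reading the section addresses usually requires root in the guest.
gen_add_symbol_file() {
    ko=$1
    dir=$2
    printf 'add-symbol-file %s %s -s .data %s -s .bss %s\n' \
        "$ko" "$(cat "$dir/.text")" "$(cat "$dir/.data")" "$(cat "$dir/.bss")"
}
```

Run as root in the guest, gen_add_symbol_file usb_f_hid.ko /sys/module/usb_f_hid/sections prints a line that can be pasted into gdb or saved to a file and loaded with source.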
We do the same for udc-core and dummy_hcd . Now that we have all symbols loaded, we can trigger the bug in the guest OS and then hit Ctrl+C in gdb to inspect the backtrace:
^C
Program received signal SIGINT, Interrupt.
0xffffffff810de405 in rep_nop () at /build/linux-uQJ2um/linux-4.15.0/arch/x86/include/asm/processor.h:647
647     /build/linux-uQJ2um/linux-4.15.0/arch/x86/include/asm/processor.h: No such file or directory.
(gdb) bt
#0  0xffffffff810de405 in rep_nop () at /build/linux-uQJ2um/linux-4.15.0/arch/x86/include/asm/processor.h:647
#1  cpu_relax () at /build/linux-uQJ2um/linux-4.15.0/arch/x86/include/asm/processor.h:652
#2  virt_spin_lock (lock=<optimized out>) at /build/linux-uQJ2um/linux-4.15.0/arch/x86/include/asm/qspinlock.h:69
#3  native_queued_spin_lock_slowpath (lock=0xffff88007196984c, val=1) at /build/linux-uQJ2um/linux-4.15.0/kernel/locking/qspinlock.c:305
#4  0xffffffff8199f407 in pv_queued_spin_lock_slowpath (val=<optimized out>, lock=<optimized out>) at /build/linux-uQJ2um/linux-4.15.0/arch/x86/include/asm/paravirt.h:669
#5  queued_spin_lock_slowpath (val=<optimized out>, lock=<optimized out>) at /build/linux-uQJ2um/linux-4.15.0/arch/x86/include/asm/qspinlock.h:30
#6  queued_spin_lock (lock=<optimized out>) at /build/linux-uQJ2um/linux-4.15.0/include/asm-generic/qspinlock.h:90
#7  do_raw_spin_lock_flags (flags=<optimized out>, lock=<optimized out>) at /build/linux-uQJ2um/linux-4.15.0/include/linux/spinlock.h:172
#8  __raw_spin_lock_irqsave (lock=<optimized out>) at /build/linux-uQJ2um/linux-4.15.0/include/linux/spinlock_api_smp.h:119
#9  _raw_spin_lock_irqsave (lock=0xffff88007196984c) at /build/linux-uQJ2um/linux-4.15.0/kernel/locking/spinlock.c:152
#10 0xffffffffc06d7410 in f_hidg_req_complete (ep=<optimized out>, req=<optimized out>) at /build/linux-uQJ2um/linux-4.15.0/drivers/usb/gadget/function/f_hid.c:328
#11 0xffffffffc06a390a in usb_gadget_giveback_request ()
#12 0xffffffffc06cdff2 in dummy_queue ()
#13 0xffffffffc06a2b96 in usb_ep_queue ()
#14 0xffffffffc06d7eb6 in f_hidg_write (file=<optimized out>, buffer=<optimized out>, count=5, offp=<optimized out>) at /build/linux-uQJ2um/linux-4.15.0/drivers/usb/gadget/function/f_hid.c:394
#15 0xffffffff8127730b in __vfs_write (file=<optimized out>, p=<optimized out>, count=<optimized out>, pos=<optimized out>) at /build/linux-uQJ2um/linux-4.15.0/fs/read_write.c:481
#16 0xffffffff812774d1 in vfs_write (file=0xffff880077b77c00, buf=0x55841858c7c0 "fira\n", count=<optimized out>, pos=0xffffc90000bbbef8) at /build/linux-uQJ2um/linux-4.15.0/fs/read_write.c:569
#17 0xffffffff81277725 in SYSC_write (count=<optimized out>, buf=<optimized out>, fd=<optimized out>) at /build/linux-uQJ2um/linux-4.15.0/fs/read_write.c:615
#18 SyS_write (fd=<optimized out>, buf=94025832515520, count=5) at /build/linux-uQJ2um/linux-4.15.0/fs/read_write.c:607
#19 0xffffffff81003ae3 in do_syscall_64 ()
#20 0xffffffff81a00081 in entry_SYSCALL_64 () at /build/linux-uQJ2um/linux-4.15.0/arch/x86/entry/entry_64.S:237
#21 0x00007f8e8d85d760 in ?? ()
#22 0x00007f8e8d85e2a0 in ?? ()
#23 0x0000000000000005 in irq_stack_union ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
This reveals a deadlock involving hidg->write_spinlock in f_hid.c, which explains why the CPU goes to 100% when the bug occurs: the thread spins forever on a lock it already holds. The spinlock is acquired in f_hidg_write() before it calls usb_ep_queue(), which calls back into f_hidg_req_complete() (frames #14 down to #10 in the backtrace), which then tries to acquire the same spinlock again.
My attempt to fix this bug is this patch, which I submitted to the maintainers of the USB subsystem. Let's see how that goes 🙂
UPDATE: My patch has been merged in Linux 5.1-rc3