How to debug a running Go process

jxf315 2025-01-05 18:14:38

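Background

  • The Go process has no debug tooling such as pprof built in.
  • The Go process is running in production and has deadlocked.
  • Killing the process is acceptable, but only if doing so is guaranteed to reveal the root cause; otherwise the evidence is lost and the problem cannot be debugged again.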


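About the problem

Usually you can debug by changing the program's code.

This can be called instrumentation: you add debugging instrumentation to help understand the bug, then re-run the problematic operation.

Instrumentation can be "print statements", or something more elegant such as adding debugger breakpoints, or even building your code unchanged but asking the compiler to include debug symbols.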

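But sometimes the problem occurs so rarely that you cannot afford to rebuild (and therefore re-run) the binary, and your only option is to debug the running process. This post is about that situation, using Go.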


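Option 1: Attach a debugger to the running program

You can attach a debugger such as Delve to an existing process. No recompilation or added instrumentation is needed.

Assume our process's PID is 4040133: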

$ sudo ./dlv attach 4040133
Type 'help' for list of commands.
(dlv) goroutines
... goroutines' state is dumped here ...

That was easy! Delve is of course much more powerful: you can set breakpoints, inspect variables, step through code, and so on.

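Option 2: Quit with a stack trace when you can see the process's stderr

Go provides this nice feature out of the box: when you send a Go process the SIGQUIT signal, by default it exits with a stack dump. The dump shows the stack of every goroutine, so you can tell what each "thread" was doing when SIGQUIT arrived.

In practice this stack trace is really valuable. Let's learn how to dig into it.

You can have the dump written to the process's stderr by running (still assuming our PID is 4040133):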

$ kill -QUIT 4040133

In the other terminal where your program is running (or wherever it writes its stderr, perhaps $ journalctl if your application runs under systemd) you will see:

SIGQUIT: quit
PC=0x464ce1 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex()
	/usr/local/go/src/runtime/sys_linux_amd64.s:552 +0x21
runtime.futexsleep(0x7fff3a356560?, 0x441df3?, 0xc000032000?)
	/usr/local/go/src/runtime/os_linux.go:56 +0x36
runtime.notesleep(0xbf91d0)
	/usr/local/go/src/runtime/lock_futex.go:159 +0x87
runtime.mPark()
	/usr/local/go/src/runtime/proc.go:1447 +0x2a
runtime.stoplockedm()
	/usr/local/go/src/runtime/proc.go:2611 +0x65
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:3308 +0x3d
runtime.park_m(0xc0001251e0?)
	/usr/local/go/src/runtime/proc.go:3525 +0x14d
runtime.mcall()
	/usr/local/go/src/runtime/asm_amd64.s:425 +0x43

goroutine 1 [chan receive, 21508 minutes]:
github.com/function61/gokit/sync/taskrunner.(*Runner).Wait(...)
	/go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:79
github.com/joonas-fi/joonas-sys/pkg/statusbar.logic({0x96f6f8, 0xc000030cc0})
	/workspace/pkg/statusbar/bar.go:150 +0x1bf
github.com/joonas-fi/joonas-sys/pkg/statusbar.Entrypoint.func1(0xc0001bc780?, {0xc28ab8?, 0x0?, 0x0?})
	/workspace/pkg/statusbar/bar.go:34 +0x25
github.com/spf13/cobra.(*Command).execute(0xc0001bc780, {0xc28ab8, 0x0, 0x0})
	/go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0xc000187b80)
	/go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
	/go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:902
main.main()
	/workspace/cmd/jsys/main.go:42 +0x434

goroutine 17 [syscall, 21508 minutes]:
os/signal.signal_recv()
	/usr/local/go/src/runtime/sigqueue.go:168 +0x98
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:23 +0x19
created by os/signal.Notify.func1.1
	/usr/local/go/src/os/signal/signal.go:151 +0x2a

goroutine 18 [chan receive, 21508 minutes]:
github.com/function61/gokit/os/osutil.CancelOnInterruptOrTerminate.func1()
	/go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/os/osutil/canceloninterruptorterminate.go:32 +0x4d
created by github.com/function61/gokit/os/osutil.CancelOnInterruptOrTerminate
	/go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/os/osutil/canceloninterruptorterminate.go:31 +0x10a

goroutine 19 [syscall, 4080 minutes]:
syscall.Syscall(0x0, 0x0, 0xc0000ea3e4, 0xc1c)
	/usr/local/go/src/syscall/asm_linux_amd64.s:20 +0x5
syscall.read(0xc000072060?, {0xc0000ea3e4?, 0x9?, 0xc0002e2ea0?})
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:696 +0x4d
syscall.Read(...)
	/usr/local/go/src/syscall/syscall_unix.go:188
internal/poll.ignoringEINTRIO(...)
	/usr/local/go/src/internal/poll/fd_unix.go:794
internal/poll.(*FD).Read(0xc000072060?, {0xc0000ea3e4?, 0xc1c?, 0xc1c?})
	/usr/local/go/src/internal/poll/fd_unix.go:163 +0x285
os.(*File).read(...)
	/usr/local/go/src/os/file_posix.go:31
os.(*File).Read(0xc00000e010, {0xc0000ea3e4?, 0x1?, 0x120?})
	/usr/local/go/src/os/file.go:119 +0x5e
bufio.(*Scanner).Scan(0xc0000e3ef8)
	/usr/local/go/src/bufio/scan.go:215 +0x865
github.com/joonas-fi/joonas-sys/pkg/statusbar.logic.func1({0x0?, 0x0?})
	/workspace/pkg/statusbar/bar.go:61 +0x89
github.com/function61/gokit/sync/taskrunner.(*Runner).Start.func1()
	/go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:51 +0x45
created by github.com/function61/gokit/sync/taskrunner.(*Runner).Start
	/go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:50 +0x105

goroutine 23 [chan receive, 1390 minutes]:
github.com/function61/gokit/sync/taskrunner.(*Runner).waitInternal.func2(...)
	/go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:101
github.com/function61/gokit/sync/taskrunner.(*Runner).waitInternal(0xc0000b2140)
	/go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:134 +0x30a
github.com/function61/gokit/sync/taskrunner.(*Runner).Done.func1.1()
	/go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:63 +0x25
created by github.com/function61/gokit/sync/taskrunner.(*Runner).Done.func1
	/go/pkg/mod/github.com/function61/gokit@v0.0.0-20211228101508-315ec8b830c9/sync/taskrunner/taskrunner.go:62 +0x5a

goroutine 34 [chan send, 1390 minutes]:
github.com/vishvananda/netlink.routeSubscribeAt.func2()
	/go/pkg/mod/github.com/vishvananda/netlink@v1.1.0/route_linux.go:1075 +0x453
created by github.com/vishvananda/netlink.routeSubscribeAt
	/go/pkg/mod/github.com/vishvananda/netlink@v1.1.0/route_linux.go:1037 +0x2f2

rax    0xca
rbx    0x0
rcx    0x464ce3
rdx    0x0
rdi    0xbf91d0
rsi    0x80
rbp    0x7fff3a356530
rsp    0x7fff3a3564e8
r8     0x0
r9     0x0
r10    0x0
r11    0x286
r12    0x43c400
r13    0x0
r14    0xbf8940
r15    0x7fb0d47ba96c
rip    0x464ce1
rflags 0x286
cs     0x33
fs     0x0
gs     0x0

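One underappreciated thing about the dump is that it shows how long each goroutine has been blocked, whether in a syscall or waiting on an event! I knew my process's problem had started about 23 hours 15 minutes earlier, and 1390 minutes matches that almost exactly!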

With the stack dump above, I was able to figure out the bug.

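Option 3: Quit with a stack trace when you cannot see the process's stderr

If you are not sure where the process's stderr goes, I suggest first checking whether there is a simple solution. Assume your process ID is 4040133. Look up file descriptor #2 (it is always stderr) to learn where stderr is connected: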

$ ls -al /proc/4040133/fd/2
l-wx------ 1 joonas joonas 64 Feb 20 19:37 /proc/4040133/fd/2 -> /home/joonas/.xsession-errors

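In my case, my program was running under the X.org server, and its stderr was simply being written to my .xsession-errors file. Had I realized this sooner, I could have saved myself the trouble.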
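The same lookup can be done from Go by reading the symlink under /proc. This is a minimal sketch that assumes a Linux system with procfs mounted and sufficient permissions to inspect the target process; the helper names fdPath and stderrTarget are made up for this example.

```go
package main

import (
	"fmt"
	"os"
)

// fdPath builds the procfs path for a file descriptor of a process.
func fdPath(pid, fd int) string {
	return fmt.Sprintf("/proc/%d/fd/%d", pid, fd)
}

// stderrTarget reports where file descriptor 2 (stderr) of the given
// process points, by resolving the procfs symlink. Linux only.
func stderrTarget(pid int) (string, error) {
	return os.Readlink(fdPath(pid, 2))
}

func main() {
	// Inspect this process itself for demonstration; in practice you
	// would pass the PID of the process you are debugging.
	pid := os.Getpid()
	target, err := stderrTarget(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not resolve stderr:", err)
		return
	}
	fmt.Printf("stderr of %d -> %s\n", pid, target)
}
```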

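Since at the time I was not sure where stderr was being written, I went with the nuclear option.

(This works even if you decided stderr was worthless and redirected it to /dev/null!!!)

The trick is to run $ strace against your existing process and capture the write(2, ...) syscalls. The first argument of the write() syscall is the file descriptor number, and 2 again means stderr.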

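So attach strace to the process:

$ sudo strace -p 4040133 -s 512 -ewrite 2> /tmp/strace.log

Then, in another terminal, ask your process to quit (this triggers the Go runtime to write the stack trace, which would normally end up discarded by being written to /dev/null):

$ kill -QUIT 4040133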

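The process now dumps the stack trace to /dev/null, but it has to do so by issuing syscalls, and strace logs those syscalls for you.

When you look at the log file /tmp/strace.log, it looks like this: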

strace: Process 4040133 attached
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=2140770, si_uid=1000} ---
write(2, "SIGQUIT: quit", 13)           = 13
write(2, "\n", 1)                       = 1
write(2, "PC=", 3)                      = 3
write(2, "0x464ce1", 8)                 = 8
write(2, " m=", 3)                      = 3
write(2, "0", 1)                        = 1
write(2, " sigcode=", 9)                = 9
write(2, "0", 1)                        = 1
write(2, "\n", 1)                       = 1
write(2, "\n", 1)                       = 1
write(2, "goroutine ", 10)              = 10
write(2, "0", 1)                        = 1
write(2, " [", 2)                       = 2
write(2, "idle", 4)                     = 4
write(2, "]:\n", 3)                     = 3
write(2, "runtime.futex", 13)           = 13
write(2, "(", 1)                        = 1
write(2, ")\n", 2)                      = 2
write(2, "\t", 1)                       = 1
write(2, "/usr/local/go/src/runtime/sys_linux_amd64.s", 43) = 43
write(2, ":", 1)                        = 1
write(2, "552", 3)                      = 3
write(2, " +", 2)                       = 2
write(2, "0x21", 4)                     = 4
write(2, "\n", 1)                       = 1
write(2, "runtime.futexsleep", 18)      = 18
write(2, "(", 1)                        = 1
write(2, "0x7ffc12443b70", 14)          = 14
write(2, "?", 1)                        = 1
write(2, ", ", 2)                       = 2
write(2, "0x441df3", 8)                 = 8
write(2, "?", 1)                        = 1
write(2, ", ", 2)                       = 2
write(2, "0xc000036500", 12)            = 12
write(2, "?", 1)                        = 1
write(2, ")\n", 2)                      = 2
write(2, "\t", 1)                       = 1
write(2, "/usr/local/go/src/runtime/os_linux.go", 37) = 37
write(2, ":", 1)                        = 1
write(2, "56", 2)                       = 2
write(2, " +", 2)                       = 2
write(2, "0x36", 4)                     = 4
write(2, "\n", 1)                       = 1
write(2, "runtime.notesleep", 17)       = 17
write(2, "(", 1)                        = 1
write(2, "0xbfd370", 8)                 = 8
write(2, ")\n", 2)                      = 2
write(2, "\t", 1)                       = 1
... output snipped ...
write(2, "0xbfcae0", 8)                 = 8
write(2, "\n", 1)                       = 1
write(2, "r15    ", 7)                  = 7
write(2, "0x7ff9ba37ce03", 14)          = 14
write(2, "\n", 1)                       = 1
write(2, "rip    ", 7)                  = 7
write(2, "0x464ce1", 8)                 = 8
write(2, "\n", 1)                       = 1
write(2, "rflags ", 7)                  = 7
write(2, "0x286", 5)                    = 5
write(2, "\n", 1)                       = 1
write(2, "cs     ", 7)                  = 7
write(2, "0x33", 4)                     = 4
write(2, "\n", 1)                       = 1
write(2, "fs     ", 7)                  = 7
write(2, "0x0", 3)                      = 3
write(2, "\n", 1)                       = 1
write(2, "gs     ", 7)                  = 7
write(2, "0x0", 3)                      = 3
write(2, "\n", 1)                       = 1
+++ exited with 2 +++

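These are raw syscalls, so you need some text processing to turn them back into something human-readable.

A small script can help with that.

But the basic idea is as follows. Let's look at the first few lines: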

write(2, "SIGQUIT: quit", 13)           = 13
write(2, "\n", 1)                       = 1
write(2, "PC=", 3)                      = 3
write(2, "0x464ce1", 8)                 = 8
write(2, " m=", 3)                      = 3
write(2, "0", 1)                        = 1
write(2, " sigcode=", 9)                = 9
write(2, "0", 1)                        = 1
write(2, "\n", 1)                       = 1
write(2, "\n", 1)                       = 1
write(2, "goroutine ", 10)              = 10
write(2, "0", 1)                        = 1
write(2, " [", 2)                       = 2
write(2, "idle", 4)                     = 4
write(2, "]:\n", 3)                     = 3
write(2, "runtime.futex", 13)           = 13
write(2, "(", 1)                        = 1
write(2, ")\n", 2)                      = 2
write(2, "\t", 1)                       = 1
write(2, "/usr/local/go/src/runtime/sys_linux_amd64.s", 43) = 43
write(2, ":", 1)                        = 1
write(2, "552", 3)                      = 3
write(2, " +", 2)                       = 2
write(2, "0x21", 4)                     = 4
write(2, "\n", 1)                       = 1

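Just take the raw strings and put them back together. You can even evaluate them as JavaScript in your browser's JS console, using the + operator to join them, for example: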

"SIGQUIT: quit" +
"\n" +
"PC=" +
"0x464ce1" +
" m=" +
"0" +
" sigcode=" +
"0" +
"\n" +
"\n" +
"goroutine " +
"0" +
" [" +
"idle" +
"]:\n" +
"runtime.futex" +
"(" +
")\n" +
"\t" +
"/usr/local/go/src/runtime/sys_linux_amd64.s" +
":" +
"552" +
" +" +
"0x21" +
"\n";

->

SIGQUIT: quit\nPC=0x464ce1 m=0 sigcode=0\n\ngoroutine 0 [idle]:\nruntime.futex()\n\t/usr/local/go/src/runtime/sys_linux_amd64.s:552 +0x21\n

Then replace \n with newline characters and \t with tab characters:

SIGQUIT: quit
PC=0x464ce1 m=0 sigcode=0

goroutine 0 [idle]:
runtime.futex()
	/usr/local/go/src/runtime/sys_linux_amd64.s:552 +0x21

So even though the data was being sent to the trash, we recovered the important data!
