Debugging a process actively running on a JVM, without restarting the application, setting debug flags, and attaching a debugger, is a bit of a pain. However, there's quite a bit you can do to debug a running process without needing a JVM restart.
Here's a quick walkthrough, with the associated commands. I'm simply compiling the steps to allow a single lookup for everyone else who has to walk down this road.
This. But with less enthusiasm. Much less enthusiasm.
- Ensure that it is the JVM which is consuming too many resources. A combination of free and sorted ps output should do the trick. free tells us current memory consumption, while ps gives us process-level statistics.
$ free -m
             total       used       free     shared    buffers     cached
Mem:         15040      13956       1083          0         84        890
-/+ buffers/cache:      12980       2059
Swap:            0          0          0
$ ps -eo pmem,pcpu,vsize,pid,cmd | sort -k 1 -nr | head -5
69.8 67.5 8180544 24204 java -Duser.dir=[...]
7.5 5.6 1550956 4866 python /mnt/manage.py run_gunicorn -c /mnt/conf.py
7.4 6.3 1542356 4848 python /mnt/manage.py run_gunicorn -c /mnt/conf.py
7.4 6.0 1548132 4863 python /mnt/manage.py run_gunicorn -c /mnt/conf.py
7.4 5.8 1537104 4869 python /mnt/manage.py run_gunicorn -c /mnt/conf.py
Wow, so we seem to have barely a GB of memory free, one resource-hog java process, and a bunch of gunicorn processes. Let's take a look at the Java process, shall we? Its PID is 24204 (from the ps output above).
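As an aside, if you already know the culprit is the JVM and just want its PID, pgrep can hand it to you directly. A minimal sketch, assuming there's only one java process on the box:
$ pgrep -f java # print the PIDs of processes whose command line contains "java"
24204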
- Let's find per-thread resource utilization. ps to the rescue again!
$ ps -mo 'pid lwp stime time pcpu' -p 24204
  PID   LWP STIME     TIME %CPU
24204     - 16:12 00:32:41 17.4
    - 24204 16:12 00:00:00  0.0
    - 24255 16:12 00:00:03  0.0
    - 24256 16:12 00:01:11  0.6
    - 24257 16:12 00:01:11  0.6
    - 24258 16:12 00:01:11  0.6
    - 24259 16:12 00:01:11  0.6
    - 24260 16:12 00:06:34 33.5
    - 24261 16:12 00:00:11  0.0
    - 24262 16:12 00:00:00  0.0
    - 24263 16:12 00:00:00  0.0
    - 24264 16:12 00:00:00  0.0
    - 24265 16:12 00:00:00  0.0
    - 24266 16:12 00:00:43  0.3
Interesting. It looks like a single thread (24260) is the issue here. How do we trace it? Why, we take a couple of thread dumps of course! We use jstack for this. jstack simply produces a thread dump of the given process.
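If you'd rather watch this live, top's thread mode should give you roughly the same per-thread breakdown; a quick sketch, reusing our PID:
$ top -H -p 24204 # -H shows individual threads, -p restricts the view to our java process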
$ jstack 24204 > 1.dump
$ jstack 24204 > 2.dump
$ jstack 24204 > 3.dump
I usually take multiple jstacks just for further analysis. You may also need to run jstack as the user owning the process: sudo -u user jstack <pid>.
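If you want the dumps spaced out, so you can see whether the hot thread stays hot, a small loop does the trick; a minimal sketch with an arbitrary ten-second gap:
$ for i in 1 2 3; do jstack 24204 > "$i.dump"; sleep 10; done # three dumps, ten seconds apart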
- Now we have the thread dumps and the misbehaving thread, so we can check what that thread is doing. Since the thread ID from the previous command (24260) is in decimal, we first convert it to hex, because thread dumps use hex IDs. To do this we simply use bc to convert the number, then tr to lowercase the result.
$ echo "obase=16; 24260" | bc | tr '[:upper:]' '[:lower:]'So it turns out in this specific case that the GC is taking up a massive number of CPU cycles. The solutions for this are numerous, and may vary depending on what your constraints are. The other issues may be related to database access, I/O access, or, quite frankly, a million other reasons relating to badly written code.
5ec4 # this is the thread id in hex
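As an aside, printf can do the same conversion in one step, and its hex output is lowercase to begin with:
$ printf '%x\n' 24260 # %x prints the decimal argument as lowercase hex
5ec4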
$ vim 1.dump # Now search for the thread hex in the dump file
[...]
"Concurrent Mark-Sweep GC Thread" os_prio=0 tid=0x00007fea2c083000 nid=0x5ec4 runnable
[...]
So it turns out that, in this specific case, the GC is taking up a massive number of CPU cycles. The solutions for this are numerous and may vary depending on your constraints. In other cases the culprit may be database access, I/O, or, quite frankly, a million other things rooted in badly written code.
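If you want to confirm the GC pressure before deciding on a fix, jstat should show heap occupancy and cumulative GC time; a quick sketch sampling once a second, five times (like jstack, you may need to run it as the process owner):
$ jstat -gcutil 24204 1000 5 # heap utilization and GC time, 5 samples at 1000 ms intervals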
To be honest, this method of debugging does have its constraints, but it can be a godsend when you need to quickly validate your thinking on resource-hog bugs. And of course, you can celebrate by posting memes on Slack!
At Bloomreach, we prefer :patre: #insidejoke
Hope this is helpful! Until the next b̶u̶g̶ feature... :D
PS: Yes, I know the first couple of steps can be done with htop in thread view as well. I just find ps so much more convenient.