This blog records the process of diagnosing a deadlock in a test environment, so that the methodology and commands can be reused in similar situations in the future.
State
Suddenly, requests to some paths of a microservice return 504 (Gateway Timeout). The strange thing is that other microservices work fine, and even stranger, requests to some other paths of the same microservice still work.
We soon find that the failed requests are all database write requests, so we guess it might be a transaction lock in MySQL.
MySQL Lock
To find out whether there are any deadlocks in MySQL, we use the following commands:
# identify locked tables
show open tables where In_use > 0;
# display the [InnoDB Monitor](https://mariadb.com/kb/en/xtradbinnodb-monitors/) output: extensive InnoDB information (including transaction and lock info) that is useful for diagnosing problems
show engine innodb status;
# show the client connections/threads currently connected to MySQL
SHOW PROCESSLIST;
We resend one of the timed-out requests to the microservice and query the MySQL status again, but do not find any useful information, so we dive into the application itself.
Stack Trace
We take a thread dump using jps and jstack:
jps        # list running JVMs and their pids
jstack 1   # dump all thread stacks of the JVM with pid 1
The following is a crucial part of the stack trace:
"http-nio-16888-exec-146" #348 daemon prio=5 os_prio=0 tid=0x00007fa67800b000 nid=0x6c5 waiting on condition [0x00007fa6da4eb000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000c70aa188> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at cn.xxx.yyy.IdClient.nextId(IdClient.java:24)
By studying the stack trace, we soon find that a number of threads are waiting for the same lock: <0x00000000c70aa188> (a java.util.concurrent.locks.ReentrantLock$NonfairSync). From there we locate the code that uses the lock.
DeadLock
The following is the simplified code that uses the lock:
lock.lock();
IdRange range = requestIdRange();
map.put("xxx", range);
// xxxx
lock.unlock();
We can see that unlock is not wrapped in a finally block, so any RuntimeException thrown before unlock leaves the lock held by the current thread and never released. Any other thread that then tries to acquire this lock will block forever.
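The fix is to release the lock in a finally block so that it is released even when an exception is thrown. A minimal sketch of the corrected pattern, using the requestIdRange() and map placeholders from the simplified snippet above:
lock.lock();
try {
    IdRange range = requestIdRange();   // may throw a RuntimeException
    map.put("xxx", range);
    // xxxx
} finally {
    lock.unlock();                      // always runs, even if an exception is thrown
}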
A runnable example that reproduces a deadlock like this is shown in the following code (cited from here):
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class LockLeakDemo {

    public static void main(String[] args) throws IOException {
        Lock lock = new ReentrantLock();
        ExecutorService service = Executors.newCachedThreadPool();
        service.submit(() -> {
            Thread.currentThread().setName("HaveLock");
            lock.lock();
            aFatalMethod();   // throws, so the unlock() below is never reached
            lock.unlock();
        });
        service.submit(() -> {
            Thread.currentThread().setName("WaitingLock");
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            lock.lock();      // blocks forever: the lock is still held by "HaveLock"
            // do something
            lock.unlock();
        });
        System.in.read();     // keep the JVM alive so the hang can be observed
    }

    private static void aFatalMethod() {
        throw new RuntimeException("Dead");
    }
}
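If you run this and then take a jstack of the process, the "WaitingLock" thread should show up parked in WAITING (parking) on a ReentrantLock$NonfairSync, just like the production stack trace above, because the lock is still owned by the pool thread that ran the "HaveLock" task and threw before reaching unlock().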
Other Helpful Tools
When debugging this kind of problem, a request tracing system is also very useful, especially for locating where a long invocation chain is blocked. However, most tracing systems in production environments (like Pinpoint) are configured with a fixed sampling rate, so a particular request may not have been recorded.
A logging system, covering logging itself, log collection, and log search, is also very useful for tracing the exception.