How to Troubleshoot Slow Backends Using tcpdump and Wireshark
Imagine a scenario with a simple server exposing an endpoint at localhost:8191/save, implemented as a small Node server.
Now let's start the Node server, call it with a simple JSON body, and check the Wireshark capture on the loopback interface.
Notice the initial TCP three-way handshake in Wireshark. Adding two more columns (Delta time, and Calculated window size, i.e. the receive buffer size) lets us see that the first two packets ([SYN] from client to server and [SYN, ACK] from server to client) advertise a window size of 65535 bytes (about 64 kB), while the POST request (packet No. 9) advertises a window size of 10230 (about 10 kB). Both are tiny compared to the amount of data we deal with nowadays: the window field in the TCP header is only 16 bits, because TCP was designed at a time when today's data volumes were not anticipated.
However, to overcome the tedious iterative cycle of sending one small window of data and waiting for an ACK, modern TCP stacks support an option called Window Scale (RFC 7323). It carries a shift count from 0 to 14: a scale of 6, for example, multiplies the advertised window of 65535 by 2⁶ = 64. Window scaling is negotiated only during the handshake stage and cannot be changed afterwards.
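The arithmetic is easy to sketch: the effective window is the 16-bit window field from the header shifted left by the negotiated scale value.

```javascript
// Effective receive window = raw 16-bit window field << window-scale shift
// (RFC 7323). The example values come from the capture discussed in the text.
function effectiveWindow(rawWindow, scaleShift) {
  return rawWindow << scaleShift; // multiply by 2^scaleShift
}

console.log(effectiveWindow(65535, 0)); // no scaling: 65535 (~64 kB)
console.log(effectiveWindow(10230, 8)); // scale 8 → ×256 = 2618880
```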
Let's test this theory by uploading a bigger file (much larger than the initial window size), such as a JPEG image.
Checking the window scale in the handshake packets shows that a scale of 8 (a multiplier of 2⁸ = 256) was negotiated for this transfer.
Packet No. 24 is a [TCP Window Update], which means the receiver advertised a new window size (the scale itself cannot change after the handshake). Also note how the window size of packets No. 19, 20, and 21 keeps dropping in cycles before climbing back up to 2618880 (10230 × 256) every time.
Reasons behind a slow backend
1. Due to the application's data consumption rate
For instance, suppose the server advertises a calculated window size of 500 000 bytes and the client sends 500 000 bytes of data. The application sitting above the TCP layer (on the application layer) reads data out of the receive buffer (the window) at its own pace, independent of the window size. The server receives 500 000 bytes per ACK cycle, but if the application consumes only 400 000 of them, 100 000 bytes remain in the buffer, leaving only 400 000 bytes of space for new data from the client.
Next advertised window = full buffer capacity − data still sitting unread in the buffer
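The scenario above can be modelled in a few lines (buffer sizes are taken from the example; the helper names are made up):

```javascript
// Toy model of the receive window shrinking when the application reads
// more slowly than the network delivers.
const capacity = 500000;   // receive buffer size
let buffered = 0;          // bytes delivered but not yet read by the app

function deliver(bytes) { buffered += bytes; }
function appRead(bytes) { buffered -= Math.min(bytes, buffered); }
function advertisedWindow() { return capacity - buffered; }

deliver(500000);  // client sends a full window of data
appRead(400000);  // the application only consumes 400 000 of it
console.log(advertisedWindow()); // 400000 — the window advertised in the next ACK
```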
Problem — One culprit behind backend lag is therefore the receive window running out of space, which is observable in Wireshark as the calculated window size dropping towards zero. The window actually hitting zero is called a zero-window situation.
Solution — The client has to wait until there is enough buffer space on the server side to resume transmission, so the real fix is to optimize the application so that it drains the buffer faster. You may also need to take a tcp dump (in the form of a pcap file) from any upstream servers the main server fetches data from in order to analyze the problem further. If that is not possible, the capture can be taken from an intermediate networking device such as a switch or a firewall instead.
2. Due to considerable time taken to send back ACKs
Always check the time between a packet and its acknowledgement to make sure it is within the norm (every ACK should take roughly the same time).
Problem — Check (№ 20) the TCP segment length of the packet being acknowledged, which in my case was 65475 (as shown in the image below). This is the data held in the server's send buffer. The next sequence number (Seq) is 196829, which is the previous sequence number (131354) plus the segment length (65475); the sender's next Seq and the peer's Ack number should match in order to pair a request with its response.
However, until the ACK is received from the other side, that data keeps taking up space in the send buffer. If ACKs are slow to arrive (an ACK is what allows the send buffer to be cleared), the buffer accumulates more and more unacknowledged data.
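The sequence-number arithmetic from the capture can be checked directly:

```javascript
// Next sequence number = previous Seq + segment length; the peer's Ack
// should equal this value. Numbers are taken from the capture above.
function nextSeq(seq, segmentLength) {
  return seq + segmentLength;
}

console.log(nextSeq(131354, 65475)); // 196829 — must match the Ack of the response
```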
Solution — If the ACK does not arrive within the retransmission timeout, the segment is sent again with the same sequence number and the same Ack number: a TCP retransmission, which Wireshark flags explicitly.
3. Problem with Time to Live
Time to Live (TTL) is the number of hops (nodes) a packet may pass through on its way to the other end; since it is decremented at every hop, it gives a rough measure of the distance between the two endpoints. It can be checked under the details of a packet.
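As a rough illustration, the hop count can be estimated from an observed TTL by assuming the sender started from one of the common OS defaults (this heuristic and the default values are assumptions, not something Wireshark reports directly):

```javascript
// Common initial TTL values: 64 (Linux/macOS), 128 (Windows), 255 (some routers)
const COMMON_INITIAL_TTLS = [64, 128, 255];

// Estimate hops as (smallest plausible initial TTL) - (observed TTL)
function estimateHops(observedTtl) {
  const initial = COMMON_INITIAL_TTLS.find((t) => t >= observedTtl);
  return initial - observedTtl;
}

console.log(estimateHops(52)); // 64 - 52 = 12 hops, assuming a Linux-like sender
```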
Problem — If the Time to Live keeps dropping, it is a good indication that the packets are taking an inefficient routing path. For instance, this is common when an Asian client has to deal with servers in the North America region.
Solution — Change the DNS and other intermediate nodes, and debug why the packets take such a long route to reach the server. The port of the traffic can be checked in Wireshark under the Info column. Pay more attention to whether the client-side port keeps changing (the client side is the one using ephemeral ports, and a new port means a new connection rather than a reused one), rather than focusing on the server's fixed listening port.
In summary, capturing the tcp dump from the handshake onwards is important when analyzing any lag in the system or communication. Paying attention to the delta time also helps identify lag, by comparing it with the average round-trip time. Analyzing the calculated window size helps you figure out whether the application's data consumption rate has dropped (watch for the window size shrinking). Checking the keep-alive headers sent by the server is also useful for spotting lag on the server side.
These are a few of the most common ways of troubleshooting a slow backend using a tcp dump.
Thanks for reading. Until next time! 👋🏽