



Curious case of debugging failing webhook API requests


A short debugging story to start off the new year.

Developer (D): Hey, can you join a call? I need some help in debugging a webhook API connectivity issue. The customer team is also on the call.

Team lead (You): Okay sure, add me to the call.

You: So, what are you trying to do and what's the problem?

D: Our backend has a webhook API that a third party will invoke over the internet. The customer team from that third party is on the call. They are getting a 500 error when making this API call.

You: Okay, what's the error? And before that, can you tell me the HTTP request flow?

D: Yeah, so the error is related to TLS and I am not sure how to debug this. The request flow is as follows:

Third-party --> Firewall --> Nginx --> Our backend

You: Okay, have you tried exposing the API on staging on http instead of https?

D: Yes, this is on a staging environment, and the http API works. When I switch to https, the third party gets an error.

You: Okay, where does the TLS termination happen in the above request flow?

D: (long pause) I think it happens at Nginx.

You: Sure?

D: (thinking...) Yes, I have configured the certs in Nginx.

You: So the Firewall is Layer 4 and Nginx acts as Layer 7 and terminates TLS?

D: Yes

You: Can you try making a call to the webhook API from your local machine via curl or Postman?

D: (tries to demo this, but the call times out)

You: Is there any IP whitelisting at the Firewall level to make sure we allow the webhook API call only from that third party and no one else?

D: Yes.

You: Can you try removing the whitelisting for testing purposes right now and try the request from the local machine again?

D: (demos this use case, curl works okay, returns 4xx due to missing auth headers)
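A quick aside on why the first attempt timed out: a firewall that drops traffic from non-allowlisted sources typically gives the client no response at all, so the call hangs instead of failing fast. Here is a minimal sketch of the allowlist decision, assuming made-up documentation IP ranges (not the real third-party addresses) and a hypothetical helper named `is_allowed`:

```python
import ipaddress

# Hypothetical allowlist using documentation-only ranges (RFC 5737).
# These are NOT the real third-party addresses from the story.
ALLOWED_SOURCES = ["203.0.113.0/24", "198.51.100.7/32"]


def is_allowed(source_ip, allowed=ALLOWED_SOURCES):
    """Return True if source_ip falls inside any allowlisted network."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed)


if __name__ == "__main__":
    print(is_allowed("203.0.113.42"))  # an address inside the allowlisted range
    print(is_allowed("192.0.2.10"))    # an address outside it, like a laptop
```

A request from the developer's laptop falls in the second case, which is why the allowlist had to be relaxed before the test from the local machine could reach Nginx at all.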

You: Can you open the URL (used only as an example here) in your browser?

D: (demos this use case, on Chrome, it shows "Not secure" even though the protocol is https)

You: Interesting, show me more details about the certificate on Chrome.

D: (shows the details) Where are we going with this?

You: See, the certificate you're using is not valid for the subdomain. It's only valid for the main domain. The third party strictly verifies the certificate, so the hostname mismatch shows up on their side as a TLS error.

(you continue): I wonder why it works via curl, though. It should also perform a strict check, as we didn't pass the -k or --insecure flag.

D: Oh, I have set up a default config in curl to always use the --insecure flag to make testing easier.
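A default like this is easy to forget, so it can be worth checking for it before trusting a "works on my machine" curl result. Below is a rough sketch that scans a curl config file for the option that turns off certificate verification. The function name `find_insecure_lines` is made up for illustration, and the sketch only handles the simple one-option-per-line case:

```python
from pathlib import Path

# Spellings of the "skip certificate verification" option that a curl config
# file accepts: the short flag, and the long name with or without dashes.
INSECURE_SPELLINGS = {"-k", "--insecure", "insecure"}


def find_insecure_lines(config_text):
    """Return 1-based line numbers that turn off certificate verification."""
    hits = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        option = line.split()[0]
        if option in INSECURE_SPELLINGS:
            hits.append(number)
    return hits


if __name__ == "__main__":
    # Default per-user config location on Unix-like systems.
    curlrc = Path.home() / ".curlrc"
    if curlrc.exists():
        print(find_insecure_lines(curlrc.read_text()))
```

Another way to rule this out during a debugging session is to run curl with `-q` as the first argument, which tells it not to read the config file.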

You: (smiling) Ah, that explains why! So, the TLS error is due to the bad certificate. You'll need a cert that is valid for the subdomain, or a wildcard cert that covers the subdomains, and then update your Nginx config accordingly.
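To make the main-domain vs. subdomain vs. wildcard distinction concrete, here is a simplified sketch of the name matching a strict TLS client performs against the DNS names listed in a certificate. The hostnames (`example.com`, `hooks.example.com`) and the function names are placeholders, and real clients handle more cases than this (IP addresses, internationalized names, and so on):

```python
def matches(pattern, hostname):
    """Match one certificate DNS name against a hostname.

    A leading "*." stands for exactly one label, so "*.example.com" covers
    "hooks.example.com" but not "example.com" or "a.b.example.com".
    """
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == pattern[2:]
    return pattern == hostname


def cert_covers(cert_dns_names, hostname):
    """Return True if any DNS name in the certificate matches the hostname."""
    return any(matches(name, hostname) for name in cert_dns_names)


if __name__ == "__main__":
    # Placeholder names: a cert issued only for the main domain...
    print(cert_covers(["example.com", "www.example.com"], "hooks.example.com"))
    # ...versus a wildcard cert for its subdomains.
    print(cert_covers(["*.example.com"], "hooks.example.com"))
```

The first case is the situation from the call: the cert is perfectly valid, just not for the name the third party was connecting to.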

D: (hours later) Thanks, this issue was fixed. How do I learn to debug like this?

You: Here are the lessons.


  • Learn the end-to-end request flow and the whole stack (from Layer 4 to Layer 7, at least)

  • Understand how proxies, TLS, DNS, and certificates work

  • Get familiar with basic networking utilities - curl, nslookup, netstat, telnet, tcpdump, and more

  • Try to form a mental model about how things work and a hypothesis about where the problem could be

  • Only change one variable at a time when debugging

  • Only change what's relevant to your hypothesis and revisit your hypothesis and mental model

  • Practice and learn from past incidents and war stories from seniors
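The "walk the stack, one layer at a time" idea from the lessons above can be sketched as a small script that checks DNS, then TCP, then TLS, and reports the first layer that fails. This is only an illustration using Python's standard library, with a made-up function name (`probe`). It is not the tooling used on the call:

```python
import socket
import ssl


def probe(host, port=443, timeout=5.0):
    """Walk DNS -> TCP -> TLS and report the first layer that fails."""
    # Layer: name resolution.
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror as exc:
        return f"DNS failed: {exc}"

    # Layer: TCP connectivity (a silently dropping firewall shows up as a timeout).
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return f"TCP failed: {exc}"

    # Layer: TLS handshake with strict verification of the chain and hostname.
    context = ssl.create_default_context()
    try:
        with sock, context.wrap_socket(sock, server_hostname=host):
            return "OK: DNS, TCP and TLS all succeeded"
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        return f"TLS failed: {exc}"


if __name__ == "__main__":
    print(probe("example.com"))  # placeholder host
```

In the story above, a check like this would have stopped at the TCP step while the allowlist was in place, and at the TLS step once it was lifted, which maps neatly onto the two hypotheses tested on the call.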

I write such stories on software engineering.

There's no specific frequency, as I don't make these up.

If you liked this one, you might love - ⚡Migrating Terabytes of metrics data with zero downtime.

Follow me on LinkedIn and Twitter for more such stuff, straight from the production oven!

Subscribe for more such content

Stay updated with the latest insights and best practices in software engineering and site reliability engineering by subscribing to our content.
