The site was down. Not partially slow: unreachable for most visitors. What turned out to be a DNS SSL failure had me chasing the wrong problems for about an hour. The browser was showing a certificate error, and most of the time the page wouldn’t load at all.
My first instinct was the same as anyone’s: something broke on the server. So I logged into AWS Lightsail and checked everything I could think of. CPU usage was normal. Disk had plenty of space. The web server process was running. The database was alive. When I ran a curl request directly to the server’s IP address, it returned a response.
By every measurable signal, the server was healthy. But the site was down.
That gap — healthy server, broken website — is what makes this kind of failure so disorienting. Every monitoring tool says green. Every status check passes. And yet users are getting certificate errors and connection timeouts. It took me longer than I’d like to admit to figure out what was actually happening, and I’m writing this DNS SSL failure guide so the next person who hits this problem doesn’t spend an hour restarting services that don’t need restarting.
What the failure looked like from the outside
Before getting into the diagnosis, it’s worth describing exactly what I was seeing — because the symptoms are what make this type of failure so confusing.
In the browser: Chrome was showing “Your connection is not private” with the error code NET::ERR_CERT_AUTHORITY_INVALID. Occasionally the page would load, but most of the time it wouldn’t. There was no consistent pattern.
For different users: A friend in a different city could open the site fine. From my laptop it failed. On my phone using mobile data it worked. On the same phone connected to my home Wi-Fi it didn’t. Same site, same time, completely different results depending on where the request was coming from.
On the server side: The Let’s Encrypt SSL renewal was failing with a curl error during the domain verification step. The error message indicated it couldn’t connect to the domain during the ACME challenge. But when I tested the same connection manually, it worked sometimes.
What I ruled out early:
- It wasn’t a server resource problem (CPU, memory, disk all normal)
- It wasn’t a firewall block (ports 80 and 443 were open)
- It wasn’t a WordPress configuration issue (the wp-config.php settings hadn’t changed)
- It wasn’t a plugin or theme conflict (same behavior on a clean URL)
Everything pointed to something between the domain and the server — but not anything on the server itself.
The Real Cause: A DNS SSL Failure from an Old Server
The site had been migrated to AWS Lightsail a few months earlier. Before that, it had been on a different hosting environment with a different IP address. The DNS records had been updated at the time of migration, and the site had worked fine for weeks afterward. The migration seemed complete.
What I didn’t account for was that some DNS resolvers — including, critically, the server’s own internal resolver — were still caching the old IP address.
Here’s what was happening behind the scenes every time Let’s Encrypt tried to renew the certificate:
- As part of renewal, an HTTP request is made to the domain to verify that it actually points to the server requesting the certificate
- In this case the failing request was made from the server itself, so it went through the server’s DNS resolver to find the IP address for the domain
- The resolver the server was using still had the old IP address cached
- The verification request was sent to the old, now-decommissioned server
- The old server didn’t respond (or responded incorrectly)
- The verification was marked as failed and the renewal stopped
- The certificate expired without being renewed
- The site started showing SSL errors
Nobody changed anything. The failure just quietly accumulated until the certificate hit its expiry date — which happened to be several weeks after the migration. Everything in the migration had worked. The certificate had been valid. Everything ran fine. And then renewal day came, and the stale DNS cache made the renewal impossible.
That’s what makes this type of outage particularly nasty: there’s no obvious trigger. If something breaks immediately after a change, you know where to look. When a site that’s been running fine since a migration suddenly fails, the connection to that migration isn’t obvious.
Why the symptoms seemed random
The inconsistency — works for some people, fails for others — is actually the clearest sign that the problem is DNS-related. Here’s why.
DNS caches are local. Your ISP’s resolver maintains its own cache, and so can your router, your operating system, and your browser. Each cache refreshes on its own schedule based on the record’s TTL (time-to-live). So at any given moment, different devices and networks around the world may have different “answers” for what IP address a domain points to.
When the old IP was still cached on some resolvers but not others, requests from those resolvers would go to the wrong server. Requests from resolvers that had already refreshed their cache went to the correct server. That’s why my phone on mobile data worked (the mobile carrier’s DNS had already updated) while my laptop on home Wi-Fi didn’t (the router’s DNS cache still had the old record).
The Let’s Encrypt verification failure was caused by the same thing — the server’s internal resolver was one of the ones that still had the stale record, so the verification requests went to the wrong place.
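To see how long a particular resolver will keep serving its cached answer, you can look at the TTL column in dig's answer section. The following is a minimal sketch, assuming dig is installed; the function names parse_answer and cached_ttl are made up for illustration, and the domain and resolver address are whatever you pass in.

```shell
#!/usr/bin/env bash

# Print "TTL address" for each A record in dig's answer section.
# Reads lines like "example.com. 300 IN A 203.0.113.10" on stdin.
parse_answer() {
  awk '$3 == "IN" && $4 == "A" { print $2, $5 }'
}

# Ask one resolver for the A record and show the TTL it reports.
# Usage: cached_ttl example.com 8.8.8.8
cached_ttl() {
  dig +noall +answer "$1" A "@$2" | parse_answer
}

# Only run a live query when a domain and a resolver are given as arguments.
if [ "$#" -eq 2 ]; then
  cached_ttl "$1" "$2"
fi
```

A caching resolver normally reports the time remaining on its cached copy, so running this a few times against the same resolver often shows the number counting down. Large public resolvers have many independent caches behind one address, so the value can jump around between runs.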
How I diagnosed it
Once I understood that the symptoms pointed to DNS inconsistency rather than a server problem, the diagnosis became straightforward. These are the steps I went through:
Step 1 — Check DNS from multiple external resolvers
I used whatsmydns.net to check what IP address the domain was resolving to across different locations and DNS providers. Most locations returned the correct IP. A few — including some in the region where my server was hosted — were still returning the old one.
This confirmed the problem: DNS propagation was inconsistent. The authoritative record was correct, but some resolvers hadn’t picked it up yet (or had cached the old record with a long TTL).
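The same comparison can be done from a terminal by asking a few public resolvers directly. This is a rough sketch, assuming dig is available; the resolver list (Google, Cloudflare, Quad9) and the function names are my own choices for illustration, not part of the original diagnosis.

```shell
#!/usr/bin/env bash

# Public resolvers to compare (Google, Cloudflare, Quad9).
RESOLVERS="8.8.8.8 1.1.1.1 9.9.9.9"

# Print "resolver ip" for every IPv4 address each resolver returns.
query_resolvers() {
  local domain="$1" r ip
  for r in $RESOLVERS; do
    # +short may also print CNAME targets, so keep only IPv4-looking lines.
    for ip in $(dig +short "$domain" A "@$r" | grep -E '^[0-9.]+$'); do
      echo "$r $ip"
    done
  done
}

# Count distinct IPs in "resolver ip" lines read from stdin.
count_distinct_ips() {
  awk '{ print $2 }' | sort -u | grep -c .
}

# Only run live queries when a domain is given as an argument.
if [ "$#" -eq 1 ]; then
  results="$(query_resolvers "$1")"
  echo "$results"
  if [ "$(echo "$results" | count_distinct_ips)" -gt 1 ]; then
    echo "Resolvers disagree: a cache may be stale, or the domain has several A records."
  fi
fi
```

More than one distinct IP is not automatically a problem, since a domain can legitimately publish several A records. For a single-server site like this one, though, a second address is a strong hint that an old record is still cached somewhere.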
Step 2 — Check what the server itself resolves
I ran a DNS lookup directly on the server using:
dig toolflowlab.com
nslookup toolflowlab.com
The server returned the old IP address. That’s why the renewal was failing: the verification step was running from the server, which was using a resolver that had the wrong address cached. The server was trying to verify itself, but the outdated record sent that check to the old machine instead.
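dig reads the nameservers from /etc/resolv.conf and queries them directly, while most programs go through the system's name lookup, which typically consults /etc/hosts first. A small sketch that checks the system lookup against the IP you expect might look like this. It assumes a Linux host with getent; the function names and the expected-IP argument are illustrative.

```shell
#!/usr/bin/env bash

# First IPv4 address the system lookup returns for a name.
# getent goes through NSS, so a leftover /etc/hosts entry would show up here too.
system_lookup() {
  getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }'
}

# Compare the expected IP with the resolved one; prints "match" or "stale".
resolution_status() {
  local expected="$1" actual="$2"
  if [ "$expected" = "$actual" ]; then
    echo "match"
  else
    echo "stale"
  fi
}

# Usage: ./check.sh <domain> <expected-ip>
if [ "$#" -eq 2 ]; then
  actual="$(system_lookup "$1")"
  echo "$1 resolves to ${actual:-nothing} on this machine: $(resolution_status "$2" "$actual")"
fi
```

Run it on the server with the domain and the server's current public IP. A "stale" result means local tools on that machine are being sent somewhere else.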
Step 3 — Confirm the SSL error was downstream of DNS
I tried running the Let’s Encrypt renewal manually:
sudo /opt/bitnami/bncert-tool
It failed with a connection error during the ACME challenge. The error message confirmed it was trying to reach the domain and failing — which made sense given that the server’s DNS resolver was pointing to the wrong IP.
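To separate the certificate question from the DNS question, it can help to ask a specific IP which certificate it presents for the domain, without involving any resolver. The sketch below assumes the openssl command-line tool is installed; fetch_cert and cert_summary are illustrative names, and the IP and domain are supplied by you.

```shell
#!/usr/bin/env bash

# Fetch the certificate a given IP presents for a hostname (via SNI), bypassing DNS.
fetch_cert() {
  local ip="$1" domain="$2"
  openssl s_client -connect "$ip:443" -servername "$domain" </dev/null 2>/dev/null
}

# Summarize a PEM certificate read from stdin: subject, issuer, and expiry date.
cert_summary() {
  openssl x509 -noout -subject -issuer -enddate
}

# Usage: ./cert.sh <ip> <domain>
if [ "$#" -eq 2 ]; then
  fetch_cert "$1" "$2" | cert_summary
fi
```

Pointing this at the new server's IP shows whether the certificate there has expired. Pointing it at the old IP shows what visitors with a stale cache are being served.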
The fix, step by step
Once I had a clear picture of what was wrong, the fix was straightforward — though it had to be done in the right order.
Step 1: Verify the authoritative DNS records
The first thing I verified was that the authoritative DNS records (the ones in Namecheap’s DNS management) were pointing to the correct Lightsail Static IP. They were. This wasn’t the root cause — it was the downstream caching that was the problem — but it’s always the right place to start.
If your authoritative records had been wrong, fixing them here would be step one, and you’d need to wait for propagation before anything else would work.
Step 2: Update the server’s DNS resolver
The server’s internal resolver was the one that mattered most for the Let’s Encrypt verification. I updated the DNS resolver configuration to use public resolvers instead of the cloud provider’s default internal one.
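On a Linux system, this means editing /etc/resolv.conf. On many distributions this file is generated by the DHCP client or by systemd-resolved, so manual edits can be overwritten on reboot or lease renewal; treat this as the quick fix and make it permanent through whatever network configuration your image uses: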
sudo nano /etc/resolv.conf
And replacing the existing nameserver entries with:
nameserver 8.8.8.8
nameserver 1.1.1.1
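Then flushing the local DNS cache (this applies when systemd-resolved is running; on newer systemd releases the equivalent command is sudo resolvectl flush-caches):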
sudo systemd-resolve --flush-caches
After this, running the DNS lookup again returned the correct IP address. The server was now using resolvers that had current, accurate records.
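A quick way to confirm which resolvers the server is actually configured to use, before and after the edit, is to list the nameserver lines. This is a small sketch; list_nameservers is an illustrative name, and the default path is the standard /etc/resolv.conf.

```shell
#!/usr/bin/env bash

# Print the nameserver addresses listed in a resolv.conf-style file.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "$1"
}

# Default to the system file; pass another path as the first argument to inspect a copy.
file="${1:-/etc/resolv.conf}"
if [ -r "$file" ]; then
  list_nameservers "$file"
fi
```

If the only entry is 127.0.0.53, the machine is using systemd-resolved's local stub and the real upstream servers are listed by resolvectl status instead.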
Step 3: Re-run the SSL certificate setup
With DNS resolving correctly from the server, I ran the Let’s Encrypt renewal again:
sudo /opt/bitnami/bncert-tool
This time it completed without errors. The ACME domain verification went through on the first try. After that, the certificate was issued and installed. The browser warning disappeared and the site loaded correctly over HTTPS.
The entire fix took about 15 minutes once I understood the actual problem. The previous hour had been spent ruling out the wrong things.
What I should have checked first
Looking back, there’s a faster diagnostic path for this type of failure. If a site goes down and the server appears healthy, check these things in this order:
- Run a DNS check from an external tool first, not from the server itself. Whatsmydns.net or the Dig tool in Google Admin Toolbox will show you what the domain resolves to from outside your network. If there’s inconsistency, you’re looking at a DNS problem.
- Check what the server’s own resolver returns. Run dig or nslookup directly on the server. If it returns a different IP than the external check, the server’s resolver is stale and that’s likely the cause of any SSL renewal failures.
- Check SSL certificate expiry date. If the cert is expired or about to expire, and DNS is inconsistent, the renewal failure is almost certainly the direct cause of the site being down.
- Don’t restart services before ruling out DNS. Restarting nginx, apache, or php-fpm does nothing if the problem is that requests are going to the wrong server entirely. I made this mistake and lost time.
How to Prevent a DNS SSL Failure from Happening Again
- Monitor SSL certificate expiry. Tools like UptimeRobot (free tier) can alert you when a certificate is approaching expiry. A 30-day warning gives you plenty of time to catch any renewal issues before they become outages.
- Verify DNS from multiple resolvers after any migration. Don’t just check if the site loads — check what IP the domain resolves to from at least three different locations or DNS providers. Whatsmydns.net makes this easy and takes two minutes.
- Use a static IP and don’t let it go. The original migration had been done correctly, but the old IP had been from a dynamic allocation that was eventually reassigned. Static IPs on Lightsail are free when attached to a running instance. There’s no reason not to use one.
- Consider using Cloudflare for DNS management. Cloudflare’s DNS propagation is faster than most registrar-managed DNS, and its interface makes it easier to see the current state of your records. It also provides an additional layer of caching control.
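Alongside an external monitor, a small check on the server itself can warn before a certificate runs out. Here is a minimal sketch using openssl's -checkend option. The function name is illustrative, and the certificate path is something you supply, since it depends on how your stack is installed.

```shell
#!/usr/bin/env bash

# Return 0 if the PEM certificate in file $1 expires within $2 days, 1 otherwise.
# Note: an unreadable or invalid file also returns 0, so the caller still gets a warning.
expires_within_days() {
  local cert_file="$1" days="$2"
  # -checkend succeeds when the cert is still valid N seconds from now,
  # so the exit status is inverted here.
  ! openssl x509 -checkend "$(( days * 86400 ))" -noout -in "$cert_file" >/dev/null 2>&1
}

# Usage: ./expiry.sh <path-to-cert.pem> <days>
if [ "$#" -eq 2 ]; then
  if expires_within_days "$1" "$2"; then
    echo "WARNING: $1 expires within $2 days (or could not be read)"
  else
    echo "OK: $1 is valid for more than $2 days"
  fi
fi
```

Run from cron once a day with a 30-day threshold, this would have flagged the failed renewals well before the certificate actually expired.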
Common mistakes during this type of outage
- Assuming “if the site loads for me, DNS is fine.” Your local DNS cache might have the correct record while the server’s resolver still has the old one. Always check from multiple locations and from the server itself.
- Rebuilding the server. This is the expensive version of the wrong diagnosis. If DNS is the problem, rebuilding the server doesn’t fix it — you’d just have a new server with the same DNS issue. Several forum posts I found during diagnosis described people who had done exactly this.
- Waiting for DNS to “just fix itself.” Stale DNS caches do eventually expire, but if the TTL is set to 24 hours, you could be looking at a full day of downtime. Updating the server’s resolver to a public DNS service fixes the server-side problem immediately without waiting for cache expiry.
- Running SSL renewal before fixing DNS. Renewal only succeeds if the domain resolves to the right server, both for Let’s Encrypt’s validation and, in this setup, for the checks the renewal tool runs from the server itself. If DNS is wrong on the server side, the renewal will fail regardless of how many times you try it. Fix DNS first, then renew.
FAQ
Why did my site work for some people but not others at the same time?
DNS caches are local and update at different times. Your ISP, your router, your operating system, and your browser all maintain separate caches with different expiry times. If some resolvers had already picked up the updated DNS records while others still had the old IP, requests from those resolvers would go to different servers — which is exactly why the behavior appeared inconsistent.
Why did SSL renewal suddenly fail when the site had been working fine for weeks?
The certificate had been issued before the DNS inconsistency was relevant. SSL certificates don’t need to re-verify the domain while they’re valid — only when they’re being renewed. The stale DNS cache had been sitting there the whole time, but it only became a problem when renewal day arrived and Let’s Encrypt tried to verify the domain from the server’s perspective.
Can this happen even if I didn’t change anything recently?
Yes, and this is what makes it confusing. DNS records have TTL values that determine how long they’re cached. In some cases, resolvers keep records well beyond the stated TTL, which is what the server’s internal resolver appeared to be doing here. A migration you completed months ago can have residual effects that only surface when something triggers a re-verification, such as SSL renewal.
Practical DNS Troubleshooting Tips
How do I check DNS propagation across multiple locations?
The easiest tool is whatsmydns.net. Enter your domain, select “A record,” and it shows you what IP address the domain resolves to from dozens of locations worldwide. If you see different IPs in different locations, propagation is incomplete and you likely have a stale cache issue somewhere.
How do I fix the DNS resolver on my Lightsail server?
Edit /etc/resolv.conf and replace the existing nameserver lines with nameserver 8.8.8.8 and nameserver 1.1.1.1. Then flush the cache with sudo systemd-resolve --flush-caches (or sudo resolvectl flush-caches on newer systemd releases). Run dig yourdomain.com afterward to confirm the server is now returning the correct IP.
What’s the fastest way to tell if this is a DNS problem vs. a server problem?
If the server is responding to direct IP requests (curl to the IP works) but domain-based requests are failing, it’s most likely DNS. If curl to the IP also fails, it’s a server problem. Run dig on the server to see what IP it resolves the domain to. If it’s different from the actual server IP, the resolver is stale.
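That decision can be written down as a tiny script. The sketch below uses curl's --resolve option to pin the domain to a chosen IP, which also sends the right Host header, and then repeats the request through normal DNS. It uses plain HTTP on port 80 so that an expired certificate doesn't muddy the result; the function names are illustrative.

```shell
#!/usr/bin/env bash

# Return 0 if an HTTP request for the domain gets any response.
# With a second argument, the domain is pinned to that IP instead of using DNS.
probe() {
  local domain="$1" ip="${2:-}"
  if [ -n "$ip" ]; then
    curl -sS -o /dev/null --max-time 10 --resolve "$domain:80:$ip" "http://$domain/"
  else
    curl -sS -o /dev/null --max-time 10 "http://$domain/"
  fi
}

# Map the two probe exit codes (0 = success) to a likely culprit.
classify() {
  local via_ip="$1" via_dns="$2"
  if [ "$via_ip" -ne 0 ]; then
    echo "server"
  elif [ "$via_dns" -ne 0 ]; then
    echo "dns"
  else
    echo "ok"
  fi
}

# Usage: ./triage.sh <domain> <server-ip>
if [ "$#" -eq 2 ]; then
  probe "$1" "$2"; via_ip=$?
  probe "$1";      via_dns=$?
  echo "likely culprit: $(classify "$via_ip" "$via_dns")"
fi
```

An "ok" here only means something answered through DNS. If the old server is still online, it may be the one responding, so the dig comparison is still worth doing.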
How do I prevent SSL renewal failures in the future?
Set up SSL expiry monitoring (UptimeRobot’s free tier includes this). Also verify DNS from multiple external locations after any hosting changes, not just immediately after but a few days later to confirm consistency. And consider pointing your server’s DNS resolver at 8.8.8.8 and 1.1.1.1 instead of the cloud provider’s default internal resolver, which in this case was the one serving the stale record.
The lesson
A website can fail completely without anything being wrong with the server. DNS and SSL sit above the infrastructure layer, and when they break, everything below them looks guilty. Monitoring tools check whether servers are running — they don’t check whether the domain actually resolves to that server from all the places that matter.
Once you’ve seen this kind of failure, you never start a diagnosis by restarting services again. You start with DNS. Check what the domain resolves to from the outside. Check what it resolves to from the server. If those two answers are different, you’ve found your problem.
The fix is usually quick. Finding it is the hard part — and it’s only hard the first time.