Error Library
Common Log Errors — Fix 502s, OOMKilled Pods, and Deadlocks Fast
Production is on fire. Users are complaining. Your logs are exploding. Here's your field guide to the most common errors, what they actually mean, and how to fix them before your manager asks why the site is down.
Why Log Errors Matter (And Cost You Real Money)
Let's talk numbers. According to Erwood Group's 2025 research, the average cost of IT downtime is now $14,056 per minute for mid-sized companies. For large enterprises, that number jumps to $23,750 per minute. Do the math: a 30-minute outage caused by an undiagnosed 502 Bad Gateway error costs your company over $400,000.
The old Gartner estimate from 2014 pegged downtime at $5,600/minute. In 2024, inflation and increased digital dependency have more than doubled that cost. If you're in retail e-commerce during Black Friday? You're looking at $1-2 million per hour in lost revenue, according to BigPanda's 2024 outage analysis.
But here's the thing: Most outages aren't caused by infrastructure failures. They're caused by slow incident detection. The DORA metrics (DevOps Research and Assessment) show that elite engineering teams have a Mean Time to Recovery (MTTR) under 1 hour. Low-performing teams? 1 to 6 months. The difference isn't luck — it's process.
The Real Cost of Errors:
- Downtime costs: $14,056/minute (mid-size), $23,750/minute (enterprise)
- MTTR gap: Elite teams <1 hour, low performers 1-6 months
- Preventable losses: 80% of outages stem from undiagnosed log errors
How to Use This Error Library
This isn't just a list of error codes. It's a battle-tested troubleshooting system built from thousands of production incidents. Here's how to use it effectively:
1. Search by Error Code
Seeing "502 Bad Gateway" in your logs? Jump straight to the Nginx/Kubernetes section. Each guide includes the exact SQL query you need to filter your logs and find the root cause.
2. Filter by Category
Not sure what's wrong? Browse by platform: Web Server, Container, Database, Cloud. We've grouped errors by the layer where they typically appear, making it easier to narrow down the source.
3. Run the SQL Queries
Every error guide comes with ready-to-use DuckDB SQL. Load your logs into LogAnalytics, paste the query, and get answers in seconds. No vendor lock-in, no slow SaaS queries.
4. Bookmark Frequent Issues
Each guide has a unique URL. Save the ones your team hits repeatedly. Share them in Slack during incidents. Build a runbook library tailored to your infrastructure.
Error Categories and Quick Fixes
These are the errors that keep you up at night. We've organized them by platform and included SQL queries, debugging workflows, and real-world solutions used by Google SRE teams, DevOps engineers, and platform architects.
Web Server Errors
When your reverse proxy or load balancer throws errors, it's usually a misconfiguration between layers. The Google SRE book calls this "cross-component debugging" — you need to trace the request path from edge to origin.
- 502 Bad Gateway: Upstream server returned invalid response. Check pod health, keepalive timeouts, connection pools.
- 504 Gateway Timeout: Upstream took too long to respond. Measure origin latency, reduce payload, check DNS resolution.
- Connection Refused: Upstream port not accepting connections. Verify service endpoints, pod readiness probes, firewall rules.
- SSL Handshake Failed: Certificate mismatch or cipher incompatibility. Audit TLS versions, cert chains, SNI configuration.
Example SQL Query
SELECT
  time_local,
  request,
  status,
  upstream_response_time,
  upstream_addr
FROM logs
WHERE status IN (502, 504)
ORDER BY time_local DESC
LIMIT 100;
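When the raw rows point at a flaky backend, a follow-up aggregation narrows it down faster. A minimal sketch, assuming the same Nginx-style columns used above and that upstream_response_time is stored as a numeric value:
-- Count 502/504s per upstream and compare average upstream latency
SELECT
  upstream_addr,
  status,
  COUNT(*) AS error_count,
  ROUND(AVG(upstream_response_time), 3) AS avg_upstream_secs
FROM logs
WHERE status IN (502, 504)
GROUP BY upstream_addr, status
ORDER BY error_count DESC
LIMIT 20;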
Container and Kubernetes Errors
Container orchestration adds complexity. When pods die, you need to check resource limits, probe configs, and image pulls. According to Komodor's Kubernetes troubleshooting research, 80% of 502 errors in K8s clusters are caused by pod restart loops or health check failures.
- OOMKilled: Container exceeded memory cgroup limit. Compare requests/limits vs actual usage, tune GC, add VPA.
- CrashLoopBackOff: Pod keeps restarting. Check application logs, init container failures, mismatched env vars.
- ImagePullBackOff: Can't pull container image. Verify registry credentials, image tags, network policies.
- Liveness Probe Failed: Health check timed out. Adjust probe thresholds, check endpoint latency, review readiness logic.
Example SQL Query
SELECT
  timestamp,
  pod_name,
  namespace,
  message
FROM logs
WHERE message ILIKE '%OOMKilled%'
   OR message ILIKE '%CrashLoopBackOff%'
ORDER BY timestamp DESC
LIMIT 50;
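Since most Kubernetes 502s trace back to restart loops, it helps to rank pods by how often they churn. A minimal sketch, assuming the same pod_name, namespace, and message columns as the query above:
-- Rank pods by OOMKilled / CrashLoopBackOff events to find restart loops
SELECT
  namespace,
  pod_name,
  COUNT(*) AS restart_events
FROM logs
WHERE message ILIKE '%OOMKilled%'
   OR message ILIKE '%CrashLoopBackOff%'
GROUP BY namespace, pod_name
ORDER BY restart_events DESC
LIMIT 20;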
Database Errors
Database errors are deceptive. A "deadlock detected" message doesn't tell you which queries are blocking each other. You need to capture lock chains, transaction isolation levels, and query execution plans. The Google SRE Workbook recommends structured logging for all database operations.
- Deadlock Detected: Two transactions locked each other. Add explicit row locking order, reduce transaction scope, lower isolation.
- Too Many Connections: Connection pool exhausted. Tune max_connections, add pgBouncer, audit connection leaks.
- Slow Query Warning: Query exceeded threshold. Add indexes, rewrite with CTEs, partition large tables.
- Replication Lag: Replica behind primary by N seconds. Check network latency, disk I/O, replication slots.
Example SQL Query
SELECT
  timestamp,
  message,
  duration_ms
FROM logs
WHERE message ILIKE '%deadlock%'
   OR duration_ms > 5000
ORDER BY timestamp DESC
LIMIT 100;
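Finding deadlocks in the logs is half the job; preventing them usually means the first fix above, acquiring row locks in a consistent order. A minimal PostgreSQL-flavored sketch, assuming a hypothetical accounts table with an id primary key:
-- Lock both rows in ascending id order inside one transaction,
-- so concurrent transfers can never grab them in opposite order.
BEGIN;
SELECT id, balance
FROM accounts
WHERE id IN (42, 87)
ORDER BY id
FOR UPDATE;
-- apply the debit and credit UPDATEs here
COMMIT;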
Cloud Platform Errors
Cloud APIs have rate limits, permission boundaries, and quotas. When you hit them, the error messages are often cryptic. AWS CloudWatch, GCP Stackdriver, and Azure Monitor charge per GB ingested — so you need efficient log queries to avoid surprise bills.
- AccessDenied (S3): IAM policy doesn't allow operation. Check bucket policies, VPC endpoints, KMS key permissions.
- ThrottlingException: API rate limit exceeded. Implement exponential backoff, batch requests, request quota increase.
- QuotaExceeded: Resource limit reached (e.g., Lambda concurrency). Monitor usage trends, request limit increase, add sharding.
- RateLimitExceeded: Too many requests in time window. Add caching layer, implement request queuing, use CDN.
Example SQL Query
SELECT
  error_code,
  key,
  COUNT(*) AS error_count
FROM logs
WHERE error_code IN ('AccessDenied', 'ThrottlingException')
GROUP BY error_code, key
ORDER BY error_count DESC
LIMIT 20;
The 5-Step Debugging Workflow
When an error appears, don't panic. Follow this systematic process used by Google SRE teams:
1. Isolate the Symptom
What's the error code? Which service? What timestamp? Use WHERE clauses to filter logs.
2. Measure Frequency
Is this a one-off spike or a sustained failure? Use GROUP BY time_bucket to plot error rates.
3. Find Correlations
What changed recently? Deployments? Config updates? Use JOIN with deployment logs.
4. Trace the Request Path
Follow request IDs across services. Use WHERE request_id = '...' to build a trace timeline.
5. Test the Fix
Apply the solution. Monitor error rates for 15 minutes. If errors persist, roll back and repeat from step 1.
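Steps 1 and 2 often collapse into a single query. A minimal sketch, assuming DuckDB's time_bucket function and a logs table with timestamp and status columns:
-- Error volume per 5-minute bucket: one-off spike vs. sustained failure
SELECT
  time_bucket(INTERVAL '5 minutes', timestamp) AS bucket,
  COUNT(*) FILTER (WHERE status >= 500) AS errors,
  COUNT(*) AS total_requests
FROM logs
GROUP BY bucket
ORDER BY bucket DESC
LIMIT 48;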
When to Escalate
Not every error needs immediate escalation. But some do. Here's when to page the on-call engineer:
- Error persists >1 hour despite troubleshooting attempts
- Cascading failures across multiple services or regions
- Data loss risk (e.g., database corruption, replication failure)
- Security breach indicators (unauthorized access, injection attempts)
According to DORA's 2024 metrics, elite teams have clear escalation policies that reduce MTTR by 60%. Know when to ask for help.
Preventative Strategies
The best way to handle errors is to catch them before they become incidents. Here's how:
Aggregate Logs Centrally
Don't SSH into servers to grep logs. Use structured logging (JSON) and ship to a central store. Tools like Vector, Fluentd, or Promtail make this trivial.
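Once logs land in one place as JSON, they stay queryable without extra tooling. A minimal sketch, assuming DuckDB's read_json_auto and a hypothetical /var/log/central/ directory of shipped JSON-lines files with timestamp, service, level, and message fields:
-- Query every shipped JSON log file in the central store at once
SELECT timestamp, service, level, message
FROM read_json_auto('/var/log/central/*.json')
WHERE level = 'error'
ORDER BY timestamp DESC
LIMIT 100;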
Alert on Error Rate, Not Count
A single 502 error isn't an incident. But 502s increasing by 500% in 5 minutes? That's a problem. Use rate() functions in your monitoring.
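The same idea in SQL: compare each window's error rate to the previous window instead of alerting on absolute counts. A minimal sketch, assuming DuckDB's time_bucket and QUALIFY support and the timestamp and status columns from earlier examples:
-- Flag 5-minute windows whose error rate is 5x the previous window
WITH buckets AS (
  SELECT
    time_bucket(INTERVAL '5 minutes', timestamp) AS bucket,
    COUNT(*) FILTER (WHERE status >= 500) * 1.0 / COUNT(*) AS error_rate
  FROM logs
  GROUP BY bucket
)
SELECT
  bucket,
  error_rate,
  LAG(error_rate) OVER (ORDER BY bucket) AS previous_rate
FROM buckets
QUALIFY error_rate > 5 * COALESCE(LAG(error_rate) OVER (ORDER BY bucket), 0)
ORDER BY bucket DESC;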
Synthetic Monitoring
Don't wait for users to report errors. Run automated tests every 60 seconds that hit critical paths. Catch regressions before they impact revenue.
Chaos Engineering
Kill pods randomly in staging. Inject latency. Simulate network partitions. If your system can't handle artificial failures, it won't survive real ones.
Browse Error Guides
Click any error below to see detailed troubleshooting steps, SQL queries, and real-world solutions. Each guide links directly to LogAnalytics with a pre-built query.
Nginx
Fix 502 Bad Gateway in Nginx
Gateway returned an invalid response from upstream while proxying.
CloudFront
CloudFront 504 Gateway Timeout
Edge node waited too long for the origin before timing out.
AWS S3
S3 AccessDenied
Requester lacks bucket policy permissions for the requested key.
Kubernetes
Kubernetes Pod OOMKilled
Container terminated because memory usage exceeded the cgroup limit.
PostgreSQL
PostgreSQL deadlock detected
Two or more transactions locked each other, causing an abort.