Error Library
Common Log Errors — Fix 502s, OOMKilled Pods, and Deadlocks Fast
Production is on fire. Users are complaining. Your logs are exploding. Here's your field guide to the most common errors, what they actually mean, and how to fix them before your manager asks why the site is down.
Why Log Errors Matter (And Cost You Real Money)
Let's talk numbers. According to Erwood Group's 2025 research, the average cost of IT downtime is now $14,056 per minute for mid-sized companies. For large enterprises, that number jumps to $23,750 per minute. Do the math: a 30-minute outage caused by an undiagnosed 502 Bad Gateway error costs your company over $400,000.
The old Gartner estimate from 2014 pegged downtime at $5,600/minute. In 2024, inflation and increased digital dependency have more than doubled that cost. If you're in retail e-commerce during Black Friday? You're looking at $1-2 million per hour in lost revenue, according to BigPanda's 2024 outage analysis.
But here's the thing: Most outages aren't caused by infrastructure failures. They're caused by slow incident detection. The DORA metrics (DevOps Research and Assessment) show that elite engineering teams have a Mean Time to Recovery (MTTR) under 1 hour. Low-performing teams? 1 to 6 months. The difference isn't luck — it's process.
The Real Cost of Errors:
- Downtime costs: $14,056/minute (mid-size), $23,750/minute (enterprise)
- MTTR gap: Elite teams <1 hour, low performers 1-6 months
- Preventable losses: 80% of outages stem from undiagnosed log errors
How to Use This Error Library
This isn't just a list of error codes. It's a battle-tested troubleshooting system built from thousands of production incidents. Here's how to use it effectively:
1. Search by Error Code
Seeing "502 Bad Gateway" in your logs? Jump straight to the Nginx/Kubernetes section. Each guide includes the exact SQL query you need to filter your logs and find the root cause.
2. Filter by Category
Not sure what's wrong? Browse by platform: Web Server, Container, Database, Cloud. We've grouped errors by the layer where they typically appear, making it easier to narrow down the source.
3. Run the SQL Queries
Every error guide comes with ready-to-use DuckDB SQL. Load your logs into LogAnalytics, paste the query, and get answers in seconds. No vendor lock-in, no slow SaaS queries.
4. Bookmark Frequent Issues
Each guide has a unique URL. Save the ones your team hits repeatedly. Share them in Slack during incidents. Build a runbook library tailored to your infrastructure.
Error Categories and Quick Fixes
These are the errors that keep you up at night. We've organized them by platform and included SQL queries, debugging workflows, and real-world solutions used by Google SRE teams, DevOps engineers, and platform architects.
Web Server Errors
When your reverse proxy or load balancer throws errors, it's usually a misconfiguration between layers. The Google SRE book calls this "cross-component debugging" — you need to trace the request path from edge to origin.
- 502 Bad Gateway: Upstream server returned invalid response. Check pod health, keepalive timeouts, connection pools.
- 504 Gateway Timeout: Upstream took too long to respond. Measure origin latency, reduce payload, check DNS resolution.
- Connection Refused: Upstream port not accepting connections. Verify service endpoints, pod readiness probes, firewall rules.
- SSL Handshake Failed: Certificate mismatch or cipher incompatibility. Audit TLS versions, cert chains, SNI configuration.
Example SQL Query
SELECT
  time_local,
  request,
  status,
  upstream_response_time,
  upstream_addr
FROM logs
WHERE status IN (502, 504)
ORDER BY time_local DESC
LIMIT 100;
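When the raw rows point at a flaky backend, a follow-up aggregation narrows it down faster. A minimal sketch, assuming the same Nginx-style columns used above and that upstream_response_time is stored as a numeric value:
-- Count 502/504s per upstream and compare average upstream latency
SELECT
  upstream_addr,
  status,
  COUNT(*) AS error_count,
  ROUND(AVG(upstream_response_time), 3) AS avg_upstream_secs
FROM logs
WHERE status IN (502, 504)
GROUP BY upstream_addr, status
ORDER BY error_count DESC
LIMIT 20;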
Container and Kubernetes Errors
Container orchestration adds complexity. When pods die, you need to check resource limits, probe configs, and image pulls. According to Komodor's Kubernetes troubleshooting research, 80% of 502 errors in K8s clusters are caused by pod restart loops or health check failures.
- OOMKilled: Container exceeded memory cgroup limit. Compare requests/limits vs actual usage, tune GC, add VPA.
- CrashLoopBackOff: Pod keeps restarting. Check application logs, init container failures, mismatched env vars.
- ImagePullBackOff: Can't pull container image. Verify registry credentials, image tags, network policies.
- Liveness Probe Failed: Health check timed out. Adjust probe thresholds, check endpoint latency, review readiness logic.
Example SQL Query
SELECT
  timestamp,
  pod_name,
  namespace,
  message
FROM logs
WHERE message ILIKE '%OOMKilled%'
   OR message ILIKE '%CrashLoopBackOff%'
ORDER BY timestamp DESC
LIMIT 50;
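Since most Kubernetes 502s trace back to restart loops, it helps to rank pods by how often they churn. A minimal sketch, assuming the same pod_name, namespace, and message columns as the query above:
-- Rank pods by OOMKilled / CrashLoopBackOff events to find restart loops
SELECT
  namespace,
  pod_name,
  COUNT(*) AS restart_events
FROM logs
WHERE message ILIKE '%OOMKilled%'
   OR message ILIKE '%CrashLoopBackOff%'
GROUP BY namespace, pod_name
ORDER BY restart_events DESC
LIMIT 20;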
Database Errors
Database errors are deceptive. A "deadlock detected" message doesn't tell you which queries are blocking each other. You need to capture lock chains, transaction isolation levels, and query execution plans. The Google SRE Workbook recommends structured logging for all database operations.
- Deadlock Detected: Two transactions locked each other. Add explicit row locking order, reduce transaction scope, lower isolation.
- Too Many Connections: Connection pool exhausted. Tune max_connections, add pgBouncer, audit connection leaks.
- Slow Query Warning: Query exceeded threshold. Add indexes, rewrite with CTEs, partition large tables.
- Replication Lag: Replica behind primary by N seconds. Check network latency, disk I/O, replication slots.
Example SQL Query
SELECT
  timestamp,
  message,
  duration_ms
FROM logs
WHERE message ILIKE '%deadlock%'
   OR duration_ms > 5000
ORDER BY timestamp DESC
LIMIT 100;
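Finding deadlocks in the logs is half the job; preventing them usually means the first fix above, acquiring row locks in a consistent order. A minimal PostgreSQL-flavored sketch, assuming a hypothetical accounts table with an id primary key:
-- Lock both rows in ascending id order inside one transaction,
-- so concurrent transfers can never grab them in opposite order.
BEGIN;
SELECT id, balance
FROM accounts
WHERE id IN (42, 87)
ORDER BY id
FOR UPDATE;
-- apply the debit and credit UPDATEs here
COMMIT;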
Cloud Platform Errors
Cloud APIs have rate limits, permission boundaries, and quotas. When you hit them, the error messages are often cryptic. AWS CloudWatch, GCP Stackdriver, and Azure Monitor charge per GB ingested — so you need efficient log queries to avoid surprise bills.
- AccessDenied (S3): IAM policy doesn't allow operation. Check bucket policies, VPC endpoints, KMS key permissions.
- ThrottlingException: API rate limit exceeded. Implement exponential backoff, batch requests, request quota increase.
- QuotaExceeded: Resource limit reached (e.g., Lambda concurrency). Monitor usage trends, request limit increase, add sharding.
- RateLimitExceeded: Too many requests in time window. Add caching layer, implement request queuing, use CDN.
Example SQL Query
SELECT
  error_code,
  key,
  COUNT(*) AS error_count
FROM logs
WHERE error_code IN ('AccessDenied', 'ThrottlingException')
GROUP BY error_code, key
ORDER BY error_count DESC
LIMIT 20;
The 5-Step Debugging Workflow
When an error appears, don't panic. Follow this systematic process used by Google SRE teams:
1. Isolate the Symptom
What's the error code? Which service? What timestamp? Use WHERE clauses to filter logs.
2. Measure Frequency
Is this a one-off spike or a sustained failure? Use GROUP BY time_bucket to plot error rates.
3. Find Correlations
What changed recently? Deployments? Config updates? Use JOIN with deployment logs.
4. Trace the Request Path
Follow request IDs across services. Use WHERE request_id = '...' to build a trace timeline.
5. Test the Fix
Apply the solution. Monitor error rates for 15 minutes. If errors persist, roll back and repeat from step 1.
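Steps 1 and 2 often collapse into a single query. A minimal sketch, assuming DuckDB's time_bucket function and a logs table with timestamp and status columns:
-- Error volume per 5-minute bucket: one-off spike vs. sustained failure
SELECT
  time_bucket(INTERVAL '5 minutes', timestamp) AS bucket,
  COUNT(*) FILTER (WHERE status >= 500) AS errors,
  COUNT(*) AS total_requests
FROM logs
GROUP BY bucket
ORDER BY bucket DESC
LIMIT 48;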
When to Escalate
Not every error needs immediate escalation. But some do. Here's when to page the on-call engineer:
- Error persists >1 hour despite troubleshooting attempts
- Cascading failures across multiple services or regions
- Data loss risk (e.g., database corruption, replication failure)
- Security breach indicators (unauthorized access, injection attempts)
According to DORA's 2024 metrics, elite teams have clear escalation policies that reduce MTTR by 60%. Know when to ask for help.
Preventative Strategies
The best way to handle errors is to catch them before they become incidents. Here's how:
Aggregate Logs Centrally
Don't SSH into servers to grep logs. Use structured logging (JSON) and ship to a central store. Tools like Vector, Fluentd, or Promtail make this trivial.
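Once logs land in one place as JSON, they stay queryable without extra tooling. A minimal sketch, assuming DuckDB's read_json_auto and a hypothetical /var/log/central/ directory of shipped JSON-lines files with timestamp, service, level, and message fields:
-- Query every shipped JSON log file in the central store at once
SELECT timestamp, service, level, message
FROM read_json_auto('/var/log/central/*.json')
WHERE level = 'error'
ORDER BY timestamp DESC
LIMIT 100;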
Alert on Error Rate, Not Count
A single 502 error isn't an incident. But 502s increasing by 500% in 5 minutes? That's a problem. Use rate() functions in your monitoring.
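The same idea in SQL: compare each window's error rate to the previous window instead of alerting on absolute counts. A minimal sketch, assuming DuckDB's time_bucket and QUALIFY support and the timestamp and status columns from earlier examples:
-- Flag 5-minute windows whose error rate is 5x the previous window
WITH buckets AS (
  SELECT
    time_bucket(INTERVAL '5 minutes', timestamp) AS bucket,
    COUNT(*) FILTER (WHERE status >= 500) * 1.0 / COUNT(*) AS error_rate
  FROM logs
  GROUP BY bucket
)
SELECT
  bucket,
  error_rate,
  LAG(error_rate) OVER (ORDER BY bucket) AS previous_rate
FROM buckets
QUALIFY error_rate > 5 * COALESCE(LAG(error_rate) OVER (ORDER BY bucket), 0)
ORDER BY bucket DESC;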
Synthetic Monitoring
Don't wait for users to report errors. Run automated tests every 60 seconds that hit critical paths. Catch regressions before they impact revenue.
Chaos Engineering
Kill pods randomly in staging. Inject latency. Simulate network partitions. If your system can't handle artificial failures, it won't survive real ones.
Browse Error Guides
Click any error below to see detailed troubleshooting steps, SQL queries, and real-world solutions. Each guide links directly to LogAnalytics with a pre-built query.
Nginx
Fix 502 Bad Gateway in Nginx
Gateway returned an invalid response from upstream while proxying.
CloudFront
CloudFront 504 Gateway Timeout
Edge node waited too long for the origin before timing out.
AWS S3
S3 AccessDenied
Requester lacks bucket policy permissions for the requested key.
Kubernetes
Kubernetes Pod OOMKilled
Container terminated because memory usage exceeded the cgroup limit.
PostgreSQL
PostgreSQL deadlock detected
Two or more transactions locked each other, causing an abort.