Supported Log Formats — Parse Nginx, Apache, AWS, Kubernetes Logs with SQL
Look, parsing logs shouldn't feel like decoding ancient hieroglyphics. Our auto-detection engine supports 15+ common formats out of the box—plus custom regex for the weird stuff your team invented.
Why Log Format Matters
Every log file is basically a stream of unstructured text. Parsing means translating that chaos into a SQL table with proper columns, data types, and timestamps. Get the format wrong? Your status column contains garbage. Your time-series queries return nonsense. You waste hours debugging regex.
Auto-detection saves you from this hell. We scan the first 64 KB of your file, pattern-match against 15+ format signatures, and return a confidence score. If confidence clears our 80% threshold, we auto-apply the parser. If not, you get a dropdown to choose manually.
Standard formats (Nginx, Apache, AWS S3) have been battle-tested across millions of log files. Custom formats? That's a week of your life you'll never get back—trial and error with regex until 3am. We'd rather you spend that time actually analyzing your logs.
How Auto-Detection Works
When you drop a log file into LogAnalytics, here's what happens behind the scenes:
- Sample the file: We read the first 64 KB (~200-500 lines for typical logs). This is fast—even on 5GB files, sampling takes ~50ms.
- Pattern matching: Each format has a regex signature. We test all 15+ signatures against your sample and count matches. Example: If 198 out of 200 lines match Nginx combined format, confidence = 99%.
- Confidence scoring: We use a threshold of 80%. Below that, we show you the top 3 candidates and let you pick. Above 80%? We auto-apply.
- Manual override: Always available. Click "Override Format" if auto-detection guesses wrong (happens ~6% of the time on edge cases like custom application logs).
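The detection pass above can be sketched in a few lines. This is a simplified illustration, not the production code: the signature patterns, format names, and helper shape are our own, and the real signatures are more involved.

```javascript
// Hypothetical, heavily simplified signatures for two common formats.
const SIGNATURES = {
  nginx_combined: /^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \d+ "[^"]*" "[^"]*"$/,
  docker_json: /^\{.*"log"\s*:.*\}$/,
};

function detectFormat(sampleText) {
  const lines = sampleText.split("\n").filter((l) => l.trim().length > 0);
  // Score every signature against the sample and rank by match ratio.
  const scores = Object.entries(SIGNATURES).map(([name, re]) => {
    const matches = lines.filter((l) => re.test(l)).length;
    return { name, confidence: matches / lines.length };
  });
  scores.sort((a, b) => b.confidence - a.confidence);
  const best = scores[0];
  // Below the 80% threshold, return top candidates for manual selection.
  return best.confidence >= 0.8
    ? { mode: "auto", format: best.name, confidence: best.confidence }
    : { mode: "manual", candidates: scores.slice(0, 3) };
}
```

Note how a mixed-format file naturally drags every signature's match ratio down, which is exactly why it tends to land in the manual-selection path.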
Pro tip: If you're mixing log formats in one file (don't do this), auto-detection will pick whichever format has the most lines. Split your files first or use manual override.
Regex compilation cost: Once a format is selected, we compile the regex pattern once and reuse it for all rows. This is also why JSON logs parse 2-3× faster than regex-based formats: JSON parsing is native in JavaScript, while regex goes through DuckDB's RE2 engine, which adds overhead.
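The compile-once point is easy to demonstrate with a toy scan. The pattern and helper below are our own illustration, not the product's code:

```javascript
// Hoisting the pattern out of the loop means V8 parses it exactly once.
// Constructing `new RegExp(...)` inside the loop would re-parse it per line.
const STATUS_BYTES_RE = / (\d{3}) \d+ /; // illustrative: status + bytes fields

function countServerErrors(lines) {
  let errors = 0;
  for (const line of lines) {
    const m = STATUS_BYTES_RE.exec(line); // reused, no per-line compilation
    if (m && Number(m[1]) >= 500) errors++;
  }
  return errors;
}
```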
Format Categories
Web Server Logs
The bread and butter of log analytics. Web server logs tell you who's hitting your site, which endpoints are slow, and where your 502 errors are coming from.
Nginx (Combined & Access)
The default format for the world's most popular reverse proxy. Includes request method, path, status, response time, referer, and user-agent.
Example: 172.16.0.1 - - [01/Jan/2024:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 1234 "-" "curl/7.68"
✓ Auto-detected 98% of the time
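A regex along these lines extracts the fields from the example above. The group names are our own naming, and the production parser may differ:

```javascript
// Illustrative pattern for the Nginx combined format shown above.
const NGINX_COMBINED =
  /^(?<remote_addr>\S+) \S+ (?<remote_user>\S+) \[(?<time_local>[^\]]+)\] "(?<method>\S+) (?<path>\S+) (?<protocol>[^"]+)" (?<status>\d{3}) (?<bytes>\d+) "(?<referer>[^"]*)" "(?<user_agent>[^"]*)"$/;

function parseNginxLine(line) {
  const m = NGINX_COMBINED.exec(line);
  if (!m) return null;
  // Cast numeric columns so downstream SQL gets proper types.
  return {
    ...m.groups,
    status: Number(m.groups.status),
    bytes: Number(m.groups.bytes),
  };
}
```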
Apache (Combined & Common)
The OG web server format. Combined log includes referer and user-agent. Common log format (CLF) is the minimal version.
Example: 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
✓ Auto-detected 96% of the time
IIS (W3C Extended)
Windows web server format. Tab-separated values with customizable fields. Header row defines columns.
Fields: date, time, s-ip, cs-method, cs-uri-stem, cs-uri-query, s-port, cs-username, c-ip, cs(User-Agent), sc-status, sc-substatus, sc-win32-status, time-taken
✓ Auto-detected 92% of the time
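Because W3C extended logs declare their columns in a `#Fields:` directive, a reader can build the schema from the header itself. A minimal sketch (helper name is ours; we split on any whitespace so both tab- and space-separated variants work):

```javascript
function parseW3C(text) {
  let fields = [];
  const rows = [];
  for (const line of text.split("\n")) {
    if (line.startsWith("#Fields:")) {
      // The directive defines the column names for subsequent rows.
      fields = line.slice("#Fields:".length).trim().split(/\s+/);
    } else if (line.startsWith("#") || line.trim() === "") {
      continue; // other directives (#Version, #Date) and blank lines
    } else {
      const values = line.trim().split(/\s+/);
      rows.push(Object.fromEntries(fields.map((f, i) => [f, values[i]])));
    }
  }
  return rows;
}
```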
Use Cases:
- Flash sale traffic analysis: SELECT DATE_TRUNC('minute', timestamp), COUNT(*) FROM logs GROUP BY 1
- Bot detection: WHERE user_agent LIKE '%bot%' OR user_agent LIKE '%crawler%'
- CDN cache hit ratios: SELECT cache_status, COUNT(*) FROM logs GROUP BY 1
Cloud Platform Logs
Cloud providers love proprietary formats. AWS S3 access logs look nothing like GCP HTTP Load Balancer logs. We support the major ones so you don't have to write custom parsers.
AWS S3 Access Logs
Space-delimited format tracking every GET, PUT, DELETE on your S3 buckets. Great for cost analysis and security auditing.
Fields include: bucket owner, bucket, time, remote IP, requester, request ID, operation, key, HTTP status, error code, bytes sent, total time, turn-around time
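The tricky part of this format is that it is space-delimited, but the timestamp is wrapped in [brackets] and some fields (request URI, user agent) in "quotes", so a naive split breaks. A tokenizer sketch that keeps those groupings intact (our own illustration):

```javascript
function tokenizeS3(line) {
  const tokens = [];
  // Match a [bracketed] group, a "quoted" group, or a bare token.
  const re = /\[[^\]]*\]|"[^"]*"|\S+/g;
  let m;
  while ((m = re.exec(line)) !== null) {
    // Strip the wrapping brackets/quotes before emitting the value.
    tokens.push(m[0].replace(/^[\["]|[\]"]$/g, ""));
  }
  return tokens;
}
```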
CloudFront Standard Logs
Tab-separated CloudFront access logs. Includes edge location, viewer location, cache behavior, and origin response time.
Use for debugging cache misses and analyzing global traffic patterns.
GCP HTTP Load Balancer
JSON-structured text export from Google Cloud. Backend latency broken down by zone. Ideal for multi-region performance tuning.
Use Cases:
- S3 cost optimization: SELECT key, SUM(bytes_sent) FROM s3_logs GROUP BY key ORDER BY 2 DESC LIMIT 100
- Security (AccessDenied events): SELECT remote_ip, key FROM s3_logs WHERE error_code = 'AccessDenied'
- CloudFront cache behavior: SELECT x_edge_result_type, COUNT(*) FROM cf_logs GROUP BY 1
Container & Orchestration Logs
If you're running containers, you're drowning in logs. Docker json-file driver, Kubernetes pod logs, Ingress Nginx—each has its own format. We parse them all so you can correlate OOMKilled events with actual resource usage.
Docker JSON Logs
Default json-file logging driver output. One JSON object per line with log, stream (stdout/stderr), and time fields.
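This is the format where parsing is essentially free: one JSON.parse per line plus a timestamp conversion. A sketch (output field names are our choice):

```javascript
function parseDockerLine(line) {
  const { log, stream, time } = JSON.parse(line);
  return {
    message: log.trimEnd(), // Docker appends the trailing newline to `log`
    stream,                 // "stdout" or "stderr"
    timestamp: new Date(time),
  };
}
```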
Kubernetes Ingress Nginx
Nginx running as Kubernetes Ingress controller. Adds request_id for distributed tracing and upstream_addr showing which pod handled the request.
Kubernetes JSON Pod Logs
Runtime container logs with pod name, namespace, and container name metadata. Essential for multi-tenant cluster debugging.
Use Cases:
- OOMKilled debugging: SELECT pod, COUNT(*) FROM k8s_logs WHERE message LIKE '%OOMKilled%' GROUP BY pod
- Request tracing across pods: SELECT * FROM ingress WHERE request_id = 'abc-123'
- Deployment correlation: SELECT timestamp, status FROM logs WHERE timestamp BETWEEN deploy_start AND deploy_end
Database Logs
Database logs are where you go when queries are slow, connections are maxed out, or you've got a deadlock. PostgreSQL and MySQL both have configurable log formats, so we support the most common patterns.
PostgreSQL (Server & Error Logs)
Configurable log_line_prefix means your format might vary. We support the most common: %t %u %d %p (timestamp, user, database, PID).
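Given that prefix, a parsing sketch looks like the following. The group names and the trailing "LEVEL:  message" split are our assumptions; if your log_line_prefix differs, the pattern has to change with it:

```javascript
// Matches '%t %u %d %p' (timestamp, user, database, PID) followed by
// the conventional severity and message, e.g. "ERROR:  deadlock detected".
const PG_LINE =
  /^(?<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)? \S+) (?<user>\S+) (?<db>\S+) (?<pid>\d+) (?<level>[A-Z]+):\s+(?<message>.*)$/;

function parsePgLine(line) {
  const m = PG_LINE.exec(line);
  return m ? { ...m.groups, pid: Number(m.groups.pid) } : null;
}
```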
MySQL General Query Log
Captures every statement sent to the server. Use for audit trails or debugging slow queries that don't hit the slow query log threshold.
Use Cases:
- Deadlock detection: SELECT * FROM pg_logs WHERE message LIKE '%deadlock detected%'
- Slow query analysis: SELECT query, duration FROM pg_logs WHERE duration > 1000 ORDER BY duration DESC
- Connection pool tuning: SELECT COUNT(*) FROM pg_logs WHERE message LIKE '%too many connections%'
Mobile & PaaS Logs
Mobile app logs and Platform-as-a-Service logs tend to be free-form, but there are a few standards.
Android Logcat
Android's logging system. Format: date time PID-TID/package priority/tag: message
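A sketch matching the layout shown above (date time PID-TID/package priority/tag: message). The group names are ours, and real logcat output varies by the `-v` format flag, so treat this as illustrative:

```javascript
// Priority is one of V/D/I/W/E/F (verbose through fatal).
const LOGCAT =
  /^(?<date>\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}\.\d{3}) (?<pid>\d+)-(?<tid>\d+)\/(?<pkg>\S+) (?<priority>[VDIWEF])\/(?<tag>[^:]+): (?<message>.*)$/;

function parseLogcat(line) {
  const m = LOGCAT.exec(line);
  return m
    ? { ...m.groups, pid: Number(m.groups.pid), tid: Number(m.groups.tid) }
    : null;
}
```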
Heroku Router Logs
Key-value pairs for routing metadata. Includes dyno name, request latency, status, and Heroku-specific error codes (H12, H10, etc.).
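Key-value formats like this one need only a small tokenizer that respects quoted values. A sketch (field names in the sample follow Heroku's router conventions; the helper is ours):

```javascript
function parseRouterLine(line) {
  const out = {};
  // key=value, where value is either "quoted" or a bare token.
  const re = /(\w+)=("([^"]*)"|\S+)/g;
  let m;
  while ((m = re.exec(line)) !== null) {
    // Prefer the unquoted inner capture when the value was quoted.
    out[m[1]] = m[3] !== undefined ? m[3] : m[2];
  }
  return out;
}
```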
Choosing the Right Format
Here's a decision tree for when auto-detection isn't confident:
- Know your log source: Nginx? Docker? S3? Start by identifying where the log came from.
- Check the sample snippet: Look at the first 10 lines. Do you see JSON objects? Space-delimited values? Key-value pairs?
- Verify column mappings: After selecting a format, check the preview table. Do the column names make sense? Is status actually HTTP status codes, or is it parsing garbage?
- Test with auto-detect first: Even if you know the format, let auto-detection run. It might catch edge cases (e.g., Nginx with a custom log_format).
- Manual override if needed: Confidence score below 80%? Pick from the dropdown.
- Custom regex for proprietary formats: If your team invented a custom log format (why?!), you'll need custom regex. See the Docs page for syntax.
Pro tip: If you're consistently getting wrong auto-detections for a specific format, open a GitHub issue with sample lines. We'll tune the signature regex and ship an update within 2 weeks.
Performance by Format
Not all formats parse at the same speed. JSON is native in JavaScript, so it's blazing fast. Regex-based parsing requires DuckDB's RE2 engine, which adds overhead. Here's what you can expect:
| Format Type | Parse Speed (100MB) | Why |
|---|---|---|
| JSON (Docker, K8s) | 1-2 seconds | Native JSON.parse() in V8 engine |
| Fixed fields (Nginx, Apache) | 2-3 seconds | Regex compiled once, reused for all rows |
| CSV/TSV | 2.5-3.5 seconds | DuckDB CSV reader optimized for bulk parsing |
| Variable fields (AWS S3) | 4-5 seconds | More complex regex with optional fields |
| Custom regex | 7-9 seconds | Regex complexity + lack of optimization |
Benchmarks from DuckDB WASM:
- DuckDB v1.1.3 (latest): Ranked #1 on ClickBench "hot runs" with a score of 9.599/10 (October 2024)
- CSV parsing throughput: 1.96 GB/s on the Pollock benchmark
- Performance improvement 2021-2024: DuckDB queries got 14× faster over three years
Bottom line: If you have a choice, use JSON logs. If you're stuck with Apache CLF, don't worry—DuckDB handles it fine. Custom regex should be a last resort.
Format Request Process
Need a format we don't support? Here's how to request it (or contribute it yourself):
- Open a GitHub issue: Go to github.com/7and1/loganalytics/issues and create a new issue titled "Format Request: [Your Format Name]"
- Provide 10-20 sample log lines: Paste real logs (sanitize sensitive data). We need to see variation—don't just copy the same line 10 times.
- Describe expected schema: What columns should we extract? What data types? (e.g., timestamp TIMESTAMP, user_id VARCHAR, action VARCHAR)
- We'll review within 2 weeks: If it's a common format (e.g., Datadog Agent logs), we'll prioritize. Obscure internal formats take longer.
- Community PRs welcome: LogAnalytics is MIT licensed. Fork the repo, add your format to data/formats.json, and submit a PR. We'll merge if tests pass.
Average turnaround: 2-4 weeks for standard formats. Custom enterprise formats may take longer if regex is complex. We're a small team—please be patient!
Compressed Log Support
Production logs are usually gzipped to save disk space. Here's what we support:
Gzip (.gz)
Supported via browser DecompressionStream API. Just drop your .gz file—we'll decompress on the fly. Adds ~10-20% overhead to parse time.
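The on-the-fly decompression described above boils down to piping the file's stream through DecompressionStream. A sketch under the assumption that you have the file as a Blob (error handling omitted; requires a browser or Node 18+):

```javascript
async function gunzipToText(blob) {
  // Stream the gzipped bytes through the platform's native inflater.
  const stream = blob.stream().pipeThrough(new DecompressionStream("gzip"));
  return await new Response(stream).text();
}
```

Because this is streaming, memory stays bounded by the chunk size rather than the decompressed file size.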
Bzip2 (.bz2)
Not yet supported. Browsers don't have native bzip2 decompression. We'd need to ship a WASM decompressor, adding ~500KB to bundle size. If you need this, upvote the GitHub issue.
Zip (.zip)
Partial support. If your zip contains a single log file, we'll extract and parse it. Multiple files in one zip? Extract locally first with unzip file.zip, then upload each file individually.
Raw (.log, .txt)
Fastest option. Uncompressed logs parse 10-20% faster than gzipped. If you're analyzing the same log repeatedly, decompress once and keep the raw file.
Pro tip: Keep logs uncompressed for analysis, compress for archival. A 1GB gzipped log becomes 3-4GB uncompressed—fine for temporary analysis, but you wouldn't want to store 10TB of uncompressed logs.
Browse All Formats
Click any format below to see its regex pattern, DuckDB schema, sample queries, and related error guides.
Nginx Access Log
Default combined log format shipped with Nginx, ideal for traffic and latency triage.
Apache Combined Log
The de facto Apache HTTP Server log format, with referer and user-agent fields.
AWS S3 Access Log
Server access logs for S3 buckets. Analyze costs, traffic sources, and error rates.
Amazon CloudFront Standard Log
Edge delivery log containing cache behaviors, edge response time, and viewer IPs.
Docker JSON Log
Default container log driver output (json-file) used by Docker Engine.
Kubernetes Ingress Nginx
Ingress-Nginx controller log with request identifiers and upstream timings.
MySQL General Query Log
Full statement log capturing every query hitting a MySQL instance, useful for auditing.
PostgreSQL Server Log
Configurable log_line_prefix layout with severity, pid, and connection metadata.
Windows IIS W3C Log
W3C extended format produced by Internet Information Services web servers.
GCP HTTP Load Balancer Log
Structured text export for Google Cloud HTTP(S) Load Balancers (classic logging).
PostgreSQL Error Log
Classic PostgreSQL server log with PID, user, database, severity, and free-form message fields.
CloudFront Access Log
Amazon CloudFront standard log with header comments and dozens of edge metrics.
Kubernetes JSON Log
Raw JSON entries emitted by container runtimes with embedded pod metadata.
Android Logcat
Classic logcat lines with PID/TID, priority, tag, and free-form message for mobile debugging.
Heroku Router Log
Router lines showing dyno routing, latency, bytes, and request metadata for Heroku apps.
Sources & Further Reading
- DuckDB CSV Parsing Documentation - Official DuckDB guide to CSV/TSV ingestion
- DuckDB Regex Functions (RE2 Syntax) - Pattern matching reference for custom formats
- ClickBench OLAP Benchmarks - Where DuckDB ranks #1 on "hot runs" (Oct 2024)
- DuckDB Pollock CSV Benchmark - 1.96 GB/s parsing throughput
- DuckDB Performance Improvements 2021-2024 - 14× speed increase over three years
- Apache HTTP Server Log Files Documentation - Official Apache Common Log Format (CLF) spec