Debugging Production Without APM: Logging Strategies Before New Relic Existed (2011)
New Relic launched in 2008. Datadog in 2010. Sentry in 2010. But in 2011, "add an APM agent to your PHP app" was either expensive, immature, or something your team simply hadn't done yet.
When production broke, you had three things: SSH access to the server, tail -f /var/log/apache2/error.log, and whatever you had decided to log beforehand. If you'd logged nothing useful, you were reconstructing the incident from access.log timestamps and guesswork.
We built a logging discipline out of necessity. Here's what it looked like.
The structured log format we landed on
Raw PHP error logs ([Thu May 19 14:32:01 2011] [error] [client 84.204.10.5] PHP Fatal error: ...) gave you the error but no context: which user, what they were doing, which state led there.
We moved to application-level structured logging before "structured logging" was a common term:
class AppLogger {
    private static $context = array();

    // Called at request start: attach session context to every log line
    public static function setContext(array $ctx) {
        self::$context = $ctx;
    }

    public static function log($level, $message, array $data = array()) {
        $entry = array_merge(array(
            'ts'      => date('c'), // ISO 8601
            'level'   => $level,
            'msg'     => $message,
            'req_id'  => isset(self::$context['req_id']) ? self::$context['req_id'] : null,
            'user_id' => isset(self::$context['user_id']) ? self::$context['user_id'] : null,
            'url'     => isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : null,
            'ip'      => isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : null,
        ), $data);
        // One JSON line per log entry — grep-friendly
        error_log(json_encode($entry));
    }

    public static function info($msg, array $data = array())  { self::log('INFO', $msg, $data); }
    public static function warn($msg, array $data = array())  { self::log('WARN', $msg, $data); }
    public static function error($msg, array $data = array()) { self::log('ERROR', $msg, $data); }
}
// Bootstrap: attach request context
// Keep application logs in their own file, separate from Apache's error log
ini_set('error_log', '/var/log/app.log');

AppLogger::setContext(array(
    'req_id'  => substr(md5(uniqid()), 0, 8), // Short ID to correlate log lines per request
    'user_id' => isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null,
));

// Usage
AppLogger::info('Order created', array('order_id' => $order->id, 'amount' => $order->total));
AppLogger::error('Payment failed', array('order_id' => $id, 'error' => $e->getMessage(), 'gateway' => 'paypal'));
The req_id was the key idea. Every log line from the same HTTP request shared an ID. Running grep '"req_id":"a3f91c' /var/log/app.log showed the complete timeline for that single request — all SQL queries, all external calls, all decisions.
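One JSON object per line means standard shell tools can reconstruct a request's timeline. A small sketch with illustrative entries and a temporary log path:

```shell
# Two interleaved requests; pull the full timeline for just one of them
cat > /tmp/app.log <<'EOF'
{"ts":"2011-05-19T14:32:01+02:00","level":"INFO","msg":"Cart loaded","req_id":"77be02d1"}
{"ts":"2011-05-19T14:32:01+02:00","level":"INFO","msg":"Order created","req_id":"a3f91c2e"}
{"ts":"2011-05-19T14:32:02+02:00","level":"ERROR","msg":"Payment failed","req_id":"a3f91c2e"}
EOF

# Prefix match on the request ID isolates that one request's lines
grep '"req_id":"a3f91c' /tmp/app.log
```

In production the same grep ran against the real log file; the request IDs and messages here are made up for the example.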
Slow query logging: the most valuable 10 lines
Most production incidents in 2011 were slow MySQL queries under load, not PHP errors. The error log showed nothing. The slow query log showed everything:
# /etc/mysql/my.cnf
[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.5 # Log queries slower than 500ms
log_queries_not_using_indexes = 1 # Critical: catches full table scans
min_examined_row_limit = 100 # Only log queries that examined 100+ rows (cuts noise from tiny tables)
Our deployment checklist included mysqldumpslow -s t -t 10 /var/log/mysql/slow.log (the top 10 queries by total time), run before every production deploy and again after every incident.
The second tool was EXPLAIN, run on every query that appeared in the slow log:
EXPLAIN SELECT p.*, u.name
FROM posts p
JOIN users u ON p.user_id = u.id
WHERE p.category_id = 5
ORDER BY p.created_at DESC
LIMIT 20;
-- If "type" column shows "ALL" → full table scan → missing index
-- If "rows" shows 50000+ → problem even with an index → query needs restructuring
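The usual fix for that pattern is a composite index covering both the filter and the sort. For the query above, a hypothetical example (the index name and column types are assumptions based on the query):

```sql
-- Hypothetical index: lets MySQL resolve the WHERE on category_id and the
-- ORDER BY created_at from the same index, avoiding a full scan and a filesort
ALTER TABLE posts ADD INDEX idx_category_created (category_id, created_at);
```

Re-running EXPLAIN afterwards should show type: ref and a far smaller rows estimate; MySQL can read the index in reverse to satisfy ORDER BY created_at DESC.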
The "heartbeat" endpoint
We added a /health endpoint to every application. Not for external monitoring services (we didn't have those yet) — for a cron job that hit it every 60 seconds and wrote to a local log.
// health.php - no authentication, read-only checks only
$checks = array();

// Database connectivity
try {
    $db->query("SELECT 1");
    $checks['db'] = 'ok';
} catch (Exception $e) {
    $checks['db'] = 'error: ' . $e->getMessage();
}

// Memcached ("@" suppresses the connection warning; we only want the boolean)
$mc = new Memcache();
$checks['cache'] = @$mc->connect('127.0.0.1', 11211) ? 'ok' : 'error';

// Disk space
$free  = disk_free_space('/');
$total = disk_total_space('/');
$checks['disk_pct'] = round(($free / $total) * 100); // percent FREE, not used
$checks['disk'] = $checks['disk_pct'] > 10 ? 'ok' : 'error: low disk'; // alert if < 10% free

// A check failed if its value starts with "error" (covers "error: <details>" too)
$allOk = true;
foreach ($checks as $value) {
    if (strpos((string) $value, 'error') === 0) {
        $allOk = false;
    }
}

header('Content-Type: application/json');
header($allOk ? 'HTTP/1.1 200 OK' : 'HTTP/1.1 503 Service Unavailable');
echo json_encode(array('status' => $allOk ? 'ok' : 'degraded', 'checks' => $checks));
# crontab -e
# Prefix each entry with a timestamp so the log shows when a check started failing
* * * * * echo "$(date) $(curl -s http://localhost/health)" >> /var/log/healthcheck.log 2>&1
When something broke we could run grep '"db":"error' /var/log/healthcheck.log and see when the database checks started failing (the pattern deliberately omits the closing quote, because the value is "error: <details>"). Primitive by modern standards. Exactly what we needed at the time.
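Since the cron appends exactly one line per minute, line offsets double as a coarse clock: onset can be estimated even from an untimestamped log. A sketch with fabricated sample data:

```shell
# Sample healthcheck.log: one entry per minute, the DB check fails from line 4 on
cat > /tmp/healthcheck.log <<'EOF'
{"status":"ok","checks":{"db":"ok","cache":"ok"}}
{"status":"ok","checks":{"db":"ok","cache":"ok"}}
{"status":"ok","checks":{"db":"ok","cache":"ok"}}
{"status":"degraded","checks":{"db":"error: connection refused","cache":"ok"}}
EOF

# Line number of the first failure ~ minutes after the log began
grep -n '"db":"error' /tmp/healthcheck.log | head -n 1 | cut -d: -f1
```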
Exception capture before Sentry
Sentry's PHP SDK existed in 2011 but wasn't widely used yet. We built a minimal version: uncaught exceptions wrote to a database table, and a daily cron emailed us the previous day's errors grouped by message.
// Global exception handler (PHP 5.3 era: type-hint Exception; Throwable arrived with PHP 7)
set_exception_handler(function (Exception $e) use ($db) {
    $db->insert('error_log', array(
        'message'    => $e->getMessage(),
        'file'       => $e->getFile(),
        'line'       => $e->getLine(),
        'trace'      => $e->getTraceAsString(),
        'url'        => isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '',
        'user_id'    => isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null,
        'created_at' => date('Y-m-d H:i:s'),
    ));
    // Show the user a friendly error page, not a stack trace
    header('HTTP/1.1 500 Internal Server Error');
    include 'views/500.php';
    exit;
});
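For reference, a plausible schema for the error_log table. The column names follow the handler's insert and the digest query; the types, sizes, and index are assumptions:

```sql
-- Hypothetical DDL; columns match the handler's insert and the digest's GROUP BY
CREATE TABLE error_log (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    message    TEXT,
    file       VARCHAR(255),
    line       INT,
    trace      TEXT,
    url        VARCHAR(255),
    user_id    INT UNSIGNED NULL,
    created_at DATETIME NOT NULL,
    KEY idx_created_at (created_at)  -- the digest filters on created_at
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```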
// Daily digest cron
$errors = $db->query(
    "SELECT message, file, line, COUNT(*) AS count
     FROM error_log
     WHERE created_at >= DATE_SUB(NOW(), INTERVAL 24 HOUR)
     GROUP BY message, file, line
     ORDER BY count DESC
     LIMIT 20"
)->fetchAll();

if ($errors) {
    mail('team@company.com', 'Daily Error Report', renderErrorDigest($errors));
}
This caught real bugs: a missing null check that threw on 0.1% of requests was invisible in normal testing but showed up as 40 occurrences in the daily digest.
The discipline that transferred
When modern APM tools came — New Relic, then Sentry, then Datadog — we adopted them immediately. But the habits built without them translated directly: think about what you'll need to know when something breaks in production, and log it before the incident, not after.
The specific tools have changed completely. The question hasn't: when this fails at 3am, what will I need in the logs to understand why? Answer that before you deploy, and debugging production becomes tractable instead of an archaeology expedition.