Abstract

Live event streams serve millions of viewers. When quality drops during a match, operations teams need the root cause in seconds. Current tools raise alerts but do not isolate the faulty component. For example, a football goal triggers a 3x viewer surge that looks identical to a CDN failure in standard metrics. We present a fault localization system that models the streaming pipeline as a directed acyclic graph. A Temporal Graph Attention Network processes 30-second sliding windows of per-component metrics and traces. A content-event signal extractor flags expected load from spikes in audio energy and chat velocity. A content-correlated anomaly filter then separates expected stress from genuine faults, cutting false positives by 64.9%. On a multi-tier testbed with 8 fault types, the system achieves 87.3% top-1 root cause accuracy (AC@1), 96.1% top-3 accuracy, and an 11.2-second mean time to root cause, outperforming OCEAN by 16.1 percentage points on AC@1 and reducing diagnosis time by a factor of 3.0.
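
As a rough illustration of the content-correlated filtering idea, the sketch below suppresses anomalies in windows where both content signals spike at once. It is not the system's implementation: the `Window` fields, the z-score test, the thresholds, and the warm-up length are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Window:
    """One 30-second window of signals (all field names are hypothetical)."""
    anomaly_score: float  # assumed output of an upstream anomaly detector
    audio_energy: float   # content signal: crowd or commentary loudness
    chat_velocity: float  # content signal: chat messages per second


def zscore(value, history):
    """Z-score of value against a history; 0.0 if the history is flat."""
    sigma = pstdev(history)
    return 0.0 if sigma == 0 else (value - mean(history)) / sigma


def is_content_event(window, history, z_threshold=3.0):
    """Flag expected load when both content signals spike together."""
    audio_z = zscore(window.audio_energy, [w.audio_energy for w in history])
    chat_z = zscore(window.chat_velocity, [w.chat_velocity for w in history])
    return audio_z >= z_threshold and chat_z >= z_threshold


def filter_anomalies(windows, warmup=5, anomaly_threshold=0.8):
    """Return indices of anomalous windows not explained by a content event."""
    faults = []
    for i in range(warmup, len(windows)):
        w = windows[i]
        if w.anomaly_score < anomaly_threshold:
            continue  # nothing unusual in this window
        if is_content_event(w, windows[i - warmup:i]):
            continue  # stress coincides with a content spike: expected load
        faults.append(i)
    return faults
```

In this sketch, a window with a high anomaly score is kept as a candidate fault only if audio energy and chat velocity look normal relative to the preceding windows; a goal-like spike in both signals causes the anomaly to be treated as expected stress.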

Authors: Suresh Kumar Palus; Partha Samal
