Two weeks ago, we experienced yet another subway train control system failure caused by aging equipment. A failure like this is certain to impact everyone working on or riding Muni. What’s not widely known is that the ingenuity and skill of Muni’s technical staff make the difference between these failures crippling the system for weeks and disrupting it for just a few hours.
On March 3, a control computer failed. It governs part of the underground network of tracks and switches between Embarcadero Station and the surface, where most Muni Metro trains turn around. When our Signal Maintenance team is called to address a problem like this, all they know at the start is that there are a bunch of “disturbed” switches and track segments.
The Automatic Train Control System, or ATCS, constantly watches over the system’s track and switches, and reports them as “disturbed” when it gets a peculiar reading, or when a system error prevents it from knowing whether the area is safe or dangerous. When this happens, the technicians methodically go through troubleshooting procedures, step by step, ruling out different components and subsystems as the cause.
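To make the idea of a “disturbed” indication concrete, here is a minimal sketch of fail-safe status reporting. It is not the ATCS logic: the `Reading` type, its axle-count fields and the `classify` function are hypothetical names invented for illustration, and the sketch assumes a simple detector that counts axles entering and leaving a track segment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    CLEAR = "clear"
    OCCUPIED = "occupied"
    DISTURBED = "disturbed"


@dataclass
class Reading:
    """One hypothetical report for a track segment (illustrative only)."""
    axles_in: Optional[int]   # axles counted entering; None if nothing was reported
    axles_out: Optional[int]  # axles counted leaving; None if nothing was reported


def classify(reading: Reading) -> Status:
    """Return a status, falling back to DISTURBED whenever the data is in doubt."""
    # A missing value means the system cannot tell safe from dangerous.
    if reading.axles_in is None or reading.axles_out is None:
        return Status.DISTURBED
    # A peculiar reading: negative counts, or more axles leaving than entered.
    if (reading.axles_in < 0 or reading.axles_out < 0
            or reading.axles_out > reading.axles_in):
        return Status.DISTURBED
    # Every axle that entered has left, so the segment is empty.
    if reading.axles_in == reading.axles_out:
        return Status.CLEAR
    return Status.OCCUPIED
```

The design choice worth noticing is the default: anything the sketch cannot positively confirm is treated as disturbed rather than clear, which is why a single failed computer can flag a whole area at once.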
View of the failure that occurred March 3, 2020, from the Transportation Management Center. Disturbed track switches are circled and disturbed track segments shown in red.
To do this successfully, Muni’s technicians need a solid familiarity with which behaviors and indications are “normal.” That is not an easy task in a system where some of the original equipment dates back to the 1990s and is mixed with other parts that have been swapped and re-swapped over the years. Last week, it was a night-shift technician’s sharp eye that caught a split-second oddity on the Axle Counter Evaluator, or ACE, a computer that monitors the train detectors in the trackway.
The Signal Maintenance crew found that the ACE was in an unusual low-power mode. Even after the crew swapped out the power supply and brought the computer to full power, it still wouldn’t boot. After they changed some components, it started up, but now one of the two redundant control computers, called Intersigs, failed whenever both were switched on together, even though each worked fine on its own.
On Thursday morning, they thought they had found the culprit: a faulty connector, in service since the ’90s, that was allowing only one of the two Intersig computers to run at a time. But just as the crew was packing up their tools after replacing the connector, both of the Intersigs failed again.
The local control center rack at the Muni Metro Turnback, containing the Intersig computers
They had restarted troubleshooting when a member of the crew, watching the flashing lights of the equipment, noticed something unusual for a split second. When the two Intersigs failed, the ACE, the piece of equipment that had originally been having problems, also failed very briefly but recovered by itself without declaring an error. Because it recovered so quickly and left no indications or logs of the failure, it had gone unnoticed.
To address the new ACE failure, the team increased the power supply, and there were no more failures. The night-shift team had finally found the root cause of the problem: the faulty power supply had damaged multiple pieces of equipment in the area, causing them to fail in different ways.
Without so many things going right (the sharp eye of the night crew, the dedicated systems knowledge of the technicians, the collaboration and turnover of information between work shifts, and the willingness to stick to the methodology), it’s likely that this problem wouldn’t have been discovered so quickly.
The culprit of the March 3 subway train control system failure, an old power supply
Our train control system is a challenge to manage because it is both a technology system and a piece of critical infrastructure. In the United States, this sort of infrastructure is updated once or twice a century, but technology systems become obsolete at a much faster pace.
Like every other transit system in the country, Muni has been managing the train control system on the same timescale as infrastructure. That has left us with situations like this one, in which components become outdated and ultimately fail.
Today, with a subway train control system approaching 30 years old, our success depends entirely on the prowess and dedication of our maintenance team, who are holding the system together. While we celebrate their ability to get us through events like this, we must rely on more than just the heroics of our staff to provide more reliable train service for San Francisco.
We must change the paradigm of how we procure, manage and maintain our train control systems. Muni’s rail network demands a modern train control system that is always kept up to date with the latest service-proven technology, and our customers deserve it.