
Because the code classified it as a "this should never happen!" error, and then it happened. The code didn't classify it as a "flight plan has bad data" error or a "flight plan data is OK but we don't support it yet" error.

If a "this should never happen!" error occurs, then you don't know what's wrong with the system or how bad or far-reaching the effects are. Maybe it's like what happened here and you could have continued. Or maybe you're getting the error because the software has a catastrophic new bug that will silently corrupt all the other flight plans and get people killed. You don't know whether it is or isn't safe to continue, so you stop.



That reasoning is fine, but it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough in considering all scenarios. As TA expounds, it seems that neither formal methods nor fuzzing were used, either of which would have gone a long way toward flushing out such errors.


> it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough considering all scenarios

Yes. But also, it's an ATC system. Its primary purpose "is to prevent collisions..." [1].

If the system encounters a "this should never happen!" error, the correct move is to shut it down and ground air traffic. (The error shouldn't have happened in the first place. But the shutdown should have been more graceful.)

[1] https://en.wikipedia.org/wiki/Air_traffic_control


Neither formal methods nor fuzzing would've helped if the programmer didn't know that input can repeat. Maybe they just didn't read the paragraph in whatever document describes how this should work and didn't know about it.

I've never had to implement flight control software, but I have had to write some stuff described by MiFID. It's a job from hell, if you take it seriously. It's a series of normative documents explaining how banks have to interact with each other, which were published quicker than they could've been implemented (and therefore the date they had to take effect was rescheduled several times).

These documents aren't structured to answer every question a programmer might have. Sometimes the "interesting" information is all close together. Sometimes you need to guess the keyword to search for to discover all the "interesting" parts... and the document could be thousands of pages long.


The point of fuzzing is precisely to discover cases that the programmers didn't think of, and formal methods are useful for discovering invariants and assumptions that programmers didn't know they were relying on.

Furthermore, identifiers from external systems always deserve scepticism. Even UUIDs can be suspect. Magic strings from hell even more so.


Sorry, you missed the point.

If the programmer didn't know that repetitions are allowed, they wouldn't appear in the input to the fuzzer either.

The mistake is too trivial to attribute to programmer incompetence or lack of attention. I'd bet my lunch it was because the spec is written in incomprehensible language, is scattered all over a thousand-page PDF, and the particular aspect of repetition isn't covered in what looks like the main description of how paths are defined.

I've dealt with specs like that. The error was most likely created by a lack of understanding of the details of the requirements rather than by anything else. No automatic testing technique would help here. A more rigorous and systematic approach to requirement specification would probably help, but we have no tools and no processes to address that.


> If the programmer didn't know that repetitions are allowed, they wouldn't appear in the input to the fuzzer either.

They totally would. The point of a fuzzer is to test the system with every technically possible input, to avoid bias and blind spots in the programmer's thinking.

Furthermore, that no duplicates exist is a rather strong assumption that should always be questioned. Unless you know all about the business rules of an external system, you can't trust its data and can't assume much about its behavior.
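
For what it's worth, a property-based test (a lightweight cousin of fuzzing) will generate duplicates whether or not the author thought of them. A minimal sketch using Hypothesis, with made-up waypoint names and a stand-in for the real routine:

    from hypothesis import given, strategies as st

    WAYPOINTS = ["REDFA", "DVR", "LAM", "BNN"]  # illustrative names only

    def exit_segment(route):
        ...  # stand-in for the routine that finds where the route leaves UK airspace

    # st.lists has no uniqueness constraint by default, so repeated waypoints
    # show up in the generated routes even if nobody thought duplicates were possible.
    @given(st.lists(st.sampled_from(WAYPOINTS), min_size=2, max_size=8))
    def test_exit_segment_never_crashes(route):
        exit_segment(route)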

Anyway, we are discussing the wrong issue. Bugs happen, and even halting the whole system can be justified, but the operators should have had an easier time figuring out what was actually going on, without the vendor having to pore through low-level logs.


No... that's not the point of fuzzing... You cannot write individual functions in such a way that they keep revalidating input handed to them. Because then, invariably, the validations will differ from function to function, and once you have an error in your validation logic, you will have to track down every function that does this validation. So, functions have to make assumptions about their input if it doesn't come from an external source.

I.e. this function wasn't the one that did all the work -- it already knew that the input was valid because the function that provided the input had already ensured validation happened.

It's pointless to deliberately send invalid input to a function that expects (for a good reason) that the input is valid -- you will create a ton of worthless noise instead of looking for actual problems.
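
That division of labour is roughly the "parse at the boundary, trust internally" pattern. A minimal sketch, with invented names:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Route:
        waypoints: tuple  # constructed only by parse_route, so already checked

    def parse_route(raw: str) -> Route:
        # the one place that distrusts external data
        points = tuple(p for p in raw.split() if p)
        if not points:
            raise ValueError("empty route")
        return Route(points)

    def exit_point(route: Route) -> str:
        # internal code: assumes the Route invariants hold instead of re-checking
        # them in every function that touches a Route
        return route.waypoints[-1]

    print(exit_point(parse_route("REDFA DVR LAM")))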

> Furthermore, that no duplicates exist is a rather strong assumption that should always be questioned.

How do you even come up with this? Do you write your code in such a way that any time it pulls a value from a dictionary, it iterates over the dictionary keys to make sure they are unique?... There are plenty of things that are meant to be unique by design. The function in question wasn't meant to check whether the points were unique. For all we know, the function might have been designed to take a map, and the data might have been lost even before this function started processing it...

You really need to try doing what you suggest before suggesting it.


I am not going to comment on the first paragraph since you turned my words around.

> How do you even come up with this? Do you write your code in such a way that any time it pulls a value from a dictionary, you iterate over the dictionary keys to make sure that they are unique?

A dictionary in my program is under my control and I can be sure that the key is unique since... well, I know it's a dictionary. I have no such knowledge about data coming from external systems.

> There are plenty of things that are meant to be unique by design. The function in question wasn't meant to check if the points were unique. For all we know, the function might have been designed to take a map and the data was lost even before this function started processing it...

"Meant to be" and "actually are" can be very different things, and it's the responsibility of a programmer to establish the difference, or to at least ask pointed questions. Actually, the programmers did the correct thing by not sweeping this unexpected problem under the rug. The reaction was just a big drastic, and the system did not make it easy for the operators to find out what went wrong.

Edit: as we have seen, input can be valid but still not be processable by our code. That's not fine, but it's a fact of life, since specs are often unclear or incomplete. Also, the rules can actually change without us noticing. In these cases, we should make it as easy as possible to figure out what went wrong.


I've only heard from people engineering systems for the aerospace industry, and we're talking hundreds of pages of API documentation. It is very complex, so the chances of human error are correspondingly higher.


I agree with the general sentiment "if you see an unexpected error, STOP", but I don't really think that applies here.

That is, when processing a sequential queue, which is what this job does, it seems to me from reading the article that each job in the queue is essentially totally independent. In that case, the code most definitely should isolate "unexpected error in job" from a larger "something unknown happened processing the higher-level queue".

I've actually seen this bug in different contexts before, and the lesson is always the same: one bad job shouldn't crash the whole system. Error-handling boundaries should be drawn so that a bad job is taken out of the queue and handled separately. If you don't do this (which really just entails being thoughtful, when processing jobs, about the types of errors that are specific to an individual job), I guarantee you'll have a bad time, just like these maintainers did.
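
A minimal sketch of that kind of boundary (how the real system structures this I don't know; all names here are made up): errors known to be scoped to a single job get parked for a human, while anything else still stops the queue because its blast radius is unknown.

    import logging

    class JobError(Exception):
        """An error known to be confined to a single job (e.g. bad data in that job)."""

    def park_for_manual_review(job):
        logging.warning("job %r set aside for a human", job)  # placeholder

    def drain(queue, process_job):
        for job in queue:
            try:
                process_job(job)
            except JobError:
                # per-job error boundary: record it, set the job aside, keep going
                logging.exception("job %r failed", job)
                park_for_manual_review(job)
            # anything that isn't a JobError propagates and stops the queue

    def demo(job):
        if job == "bad-plan":
            raise JobError("unexpected duplicate waypoint")

    drain(["plan-1", "bad-plan", "plan-3"], demo)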


If the code takes a valid series of ICAO waypoints and routes, generates the corresponding ADEXP waypoint list, but then, when it uses that to identify the ICAO segment that leaves UK airspace, it's capable of producing a segment from before the route enters UK airspace, then that code is wrong, and who knows what other failure modes it has?

Maybe it can also produce the wrong segment within British airspace, meaning another flight plan might be processed successfully, but with the system believing it terminates somewhere it doesn't?

Maybe it's already been processing all the preceding flight plans wrongly, and this is just the first time the error has occurred in a way that makes the algorithm fail?

Maybe someone's introduced an error in the code or the underlying waypoint mapping database and every flight plan that is coming into the system is being misinterpreted?


An "unexpected error" is always a logic bug. The cause of the logic error is not known, because it is unexpected. Therefore, the software cannot determine if it is an isolated problem or a systemic problem. For a systemic problem, shutting down the system and engaging the backup is the correct solution.


I'm pretty inexperienced, but I'm starting to learn the hard way that it takes more discipline to add more complex error recovery. (Just recently, my implementation of what you're suggesting - limiting the blast radius of server-side errors - meant all my tests were passing with a logged error I missed when I made a typo.)

Considering their level 1 and 2 support techs couldn't access the so-called "low level" logs with the actual error message, it's not clear to me they'd be able to keep up with a system with more complicated failure states. For example, they'd need to make sure that every plan rejected by the computer is routed to and handled by a human.


> is essentially totally independent

They physically cannot be independent. The system works on the assumption that the flight was accepted and is valid, but it cannot place it. What if it accidentally schedules another flight at the same time and place?


> What if it accidentally schedules another flight at the same time and place?

Flight plans are not responsible for flight separation. It is not their job and nobody uses them for that.

As a first approximation, they are used so ATC doesn’t need to ask every airplane every five minutes “so flight ABC123, where do you want to go today?”

I’m starting to think that there is a need for a “falsehoods programmers believe about aviation” article.


Except that you can't be sure this bad flight plan doesn't contain information that will lead to a collision. The system needs to maintain the integrity of all plans it sees. If it can't process one, and there's the risk of a plane entering airspace with a bad flight plan, you need to stop operations.


>> Except that you can't be sure this bad flight plan doesn't contain information that will lead to a collision.

Flight plans don't contain any information relevant for collision avoidance. They only say when and where the plane is expected to be. There is not enough specificity to ensure no collisions. Things change all the time, from late departures, to diverting around bad weather. On 9/11 they didn't have every plane in the sky file a new flight plan carefully checked against every other...


But they have 4 hours to reach out to the one plane whose flight plan didn't get processed and tell them to land somewhere else.


Assuming they can identify that plane.

Aviation is incredibly risk-averse, which is part of why it's one of the safest modes of travel that exists. I can't imagine any aviation administration in a developed country being OK with a "yeah just keep going" approach in this situation.


That's true, but then why did the engineers try to restart the system several times if they had no clue what was happening and restarting it could have been dangerous?


And that's why I never (or very rarely) put "this should never happen" exceptions anymore in my code

Because you eventually figure out that, yes, it does happen


A customer of mine is adamant in their resolve to log errors, retry a few times, give up and go on with the next item to process.

That would have grounded only the plane with the flight plan that the UK system could not process.

Still a bug, but with fewer effects across the continent, because planes that could not get into or out of the UK could not fly, and that affected all of Europe and possibly beyond.
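
Roughly that policy, as a sketch (invented names; the retry count is arbitrary):

    import logging

    def process_all(items, handle, attempts=3):
        """Log, retry a few times, then give up on the item and move on."""
        failed = []
        for item in items:
            for attempt in range(1, attempts + 1):
                try:
                    handle(item)
                    break
                except Exception:
                    logging.exception("attempt %d/%d failed for %r", attempt, attempts, item)
            else:
                failed.append(item)  # give up on this item only
        return failed  # e.g. the one flight plan that needs human attention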


> That would have grounded only the plane with the flight plan that the UK system could not process.

By the looks of it, the plane was already a few hours into its flight by the time the system had a breakdown. Considering the system didn't know what the problem was, it seems appropriate that it shut down. No planes collided, so the worst didn't happen.


Couldn't the outcome be "access to the UK airspace denied" only for that flight? It would have checked with an ATC and possibly landed somewhere before approaching the UK.

In the case of a problem with all flights, the outcome would have been the same as the one they eventually had.

Of course I have no idea if that would be a reasonable failure mode.


This here is the true takeaway. The bar for writing "this should never happen" code must be set so impossibly high that it might as well be translated into "'this should never happen' should never happen"


The problem with that is that most programming languages aren't sufficiently expressive to recognise that, say, only a subset of switch cases are actually valid, the others having already been ruled out. It's sometimes possible to re-architect to avoid many issues of this kind, but not always.

What you're often led to is "if this happens, there's a bug in the code elsewhere" code. It's really hard to know what to do in that situation, other than terminate whatever unit of work you were trying to complete: the only thing you know for sure is that the software doesn't accurately model reality.
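
For example (Python here purely for illustration; the same shape appears in any language with enums and a switch): by the time a value reaches this function, one of the cases has supposedly been ruled out upstream, but nothing in the type system says so.

    from enum import Enum

    class PlanState(Enum):
        RAW = 1        # supposedly filtered out before this point
        VALIDATED = 2
        ROUTED = 3

    def describe(state: PlanState) -> str:
        if state is PlanState.VALIDATED:
            return "awaiting routing"
        if state is PlanState.ROUTED:
            return "ready"
        # "if this happens, there's a bug in the code elsewhere" code:
        # the only honest option is to abandon this unit of work
        raise AssertionError(f"unreachable state {state!r}: the upstream filter is broken")

    print(describe(PlanState.ROUTED))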

In this story, there obviously was a bug in the code. And the broken algorithm shouldn't have passed review. But even so, the safety-critical aspect of the complete system wasn't compromised, and that part worked as specified -- I suspect the system behaviour under error conditions was mandated, and I dread to think what might have happened if the developers (the company, not individuals) had been allowed to actually assume errors wouldn't happen and let the system continue unchecked.


So what does your code do when you did not handle the "this should never happen" exception? Exit and print a stack trace to stdout?



