
Traceroute isn't real, or: Whoops! Everyone Was Wrong Forever

Updated - see 1 addendum at bottom.
Source presentation by Richard Steenbergen
Video version of presentation.

There is no such thing as traceroute.

I used to deliver network training at work. It was freeform, and I was given wide latitude to design it as I saw fit, so I focused on things that I had seen people struggling with - clearly explaining VLANs in a less abstract manner than most literature, for instance, as well as actually explaining how QoS queuing works, which very few people understand properly.

One of the "chapters" in my presentation was about traceroute, and it more or less said "Don't use it, because you don't know how, and almost nobody you'll talk to does either, so try your best to ignore them." This is not just my opinion, it's backed up by people much more experienced than me. For a good summary I highly recommend this presentation.

But as good as that deck is, I always felt it left out a crucial piece of information: Traceroute, as far as the industry is concerned, does not exist.

Look it up. There is no RFC. There are no ports for traceroute, no rules in firewalls to accommodate it, no best practices for network operators. Why is that?

Traceroute has no history

First off: Yes, there is a traceroute RFC. It's RFC 1393, it's 31 years old, and to my knowledge nothing supports it. The RFCs are jam-packed with brilliant ideas nobody implemented. This is one of them. The traceroute we have is completely unrelated to this.

Unsurprisingly however, it's a good description of how a traceroute protocol should work. You send a packet to a given destination, with a special flag set, and any machine it passes through observes the flag and says "oh, this packet is meant to be traced," so it generates an ICMP Traceroute response and sends it back to the originating host.

The host, then, sends a single packet and receives a flood of responses describing the path that packet took, definitively. Great! Or, I mean, it would be, if anything supported it. And if it was 1993.

As the linked presentation explains, traceroute simply no longer works in the modern world, at least not "as designed" - and it no longer can work that way, for several reasons, not the least of which is that networks have been abstracted in ways it did not anticipate.

There are now things like MPLS, which operate by encapsulating IP - in other words, putting a bag over a packet's head, throwing it in the back of a van, driving it across town and letting it loose so it has no idea how far it's traveled. Without getting much further into how that works: It is completely impossible for it to satisfy the expectations of traceroute.

This "tool" works purely at layer 3, so it's impossible for it to adapt to the sort of "layer 12-dimensional-chess" type shenanigan that MPLS does - and there are other problems, but they're all getting ahead of reality, since traceroute never even worked correctly as intended, and there's no reason it would.

Traceroute, you see, is "clever," which is an engineering term that means "fragile." When programmers discover something "clever," any ability they may have had to assess its sustainability or purpose-fit often goes out the window, because it's far more important to embrace the "cleverness" than to solve a problem reliably.

The RFC process is likely not perfect - it's basically an enormous committee system, so, that's troubling - but it does at least constitute a review and consensus process. Had someone written out a spec for traceroute, and then vendors had agreed to implement it, that would be one thing. But that is not what happened.

Traceroute is a filthy hack

From the traceroute man page (1987):

Implemented by Van Jacobson from a suggestion by Steve Deering.
Debugged by a cast of thousands with particularly cogent
suggestions or fixes from C. Philip Wood, Tim Seaver and Ken Adelman.

I can't find any proper history of the tool, but my impression and my assumption is that it is simply a behavior that someone noticed was possible. Engineers did not get together and design a system for this; some people just realized that it was a side effect of other network behavior not intended to accomplish this goal.

In other words, it's an exploit, and that is really the best way to describe both how it works, and why it's a bad idea.

Here's how traceroute works:

When you send a packet to a destination, it often has to go through multiple routers, or "hops."

To prevent packets from cycling indefinitely in a network due to routing loops (router A points to router B which points to router A...) they include a Time-To-Live field, which is set to a reasonably high value when a packet is created, and each machine that the packet passes through decrements that field by one.

When the field hits zero, the packet gets thrown away. As a courtesy, the router that's dropping the packet has the option to generate a new packet, using the ICMP protocol, with the subtype "TTL Exceeded," and send it back to the originating machine, to let it know there's something wrong with the network path.
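To make the TTL mechanism concrete, here is a toy model in Python. It is a sketch under stated assumptions, not how any real router is written: the router names, the `next_hop` table, and the `TimeExceeded` class are all hypothetical, and the model pretends every router bothers to send the courtesy message.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TimeExceeded:
    """Stand-in for an ICMP Time Exceeded message (hypothetical class)."""
    reporter: str  # the router that threw the packet away


def forward(next_hop: Dict[str, str], start: str, dest: str,
            ttl: int) -> Optional[TimeExceeded]:
    """Walk one packet through the network, one router at a time.

    Returns None if the packet reaches dest, or a TimeExceeded naming
    the router that decremented the TTL to zero and dropped it.
    """
    router = start
    while True:
        if router == dest:
            return None              # delivered
        ttl -= 1                     # every router on the way takes one
        if ttl <= 0:
            return TimeExceeded(reporter=router)
        router = next_hop[router]    # hand the packet to the next router


# A routing loop: A points at B, B points back at A. "C" is never reached,
# so the packet bounces until its TTL runs out.
loop = {"A": "B", "B": "A"}
print(forward(loop, "A", "C", ttl=64))     # TimeExceeded(reporter='B')

# A healthy path delivers the packet long before the TTL matters.
healthy = {"A": "B", "B": "C"}
print(forward(healthy, "A", "C", ttl=64))  # None
```

Without the TTL, the packet in the looped case would circulate forever. With it, the packet dies after 64 hops and one of the two looping routers gets the chance to report it.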

These clever fellows in 1987 realized that by manipulating the TTL value, you can choose which router will send that ICMP message.

Send a packet with the TTL set to 1. The first router you hit will decrement it to zero. The packet is now "dead", so it drops the packet, and sends back TTL Exceeded. That response will originate from the router's own IP address - congratulations, you now have the IP of the first hop.

Now send another with the TTL set to 2. The first router will decrement it to 1 and pass it, and the second one will decrement to zero and drop it. Now you have its IP address.

Repeat, increasing TTL each time, until the final hop responds. You now have your complete path.
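The trick above can be sketched as a small simulation. Nothing here touches a real network: the path, the addresses (taken from documentation ranges), and the `silent` set are invented for the example, and `probe` is a hypothetical stand-in for "send a packet with this TTL and see who complains."

```python
from typing import List, Optional, Set


def probe(path: List[str], silent: Set[str], ttl: int) -> Optional[str]:
    """Return the address that answers a probe sent with the given TTL.

    path lists every hop in order, ending with the destination. A probe
    with TTL n dies at hop n, or reaches the destination if n is large
    enough. Hops listed in silent never answer, so they return None.
    """
    hop = path[min(ttl, len(path)) - 1]
    return None if hop in silent else hop


def traceroute(path: List[str], silent: Set[str],
               max_ttl: int = 30) -> List[str]:
    """Raise the TTL one step at a time and record who answers."""
    dest = path[-1]
    hops = []
    for ttl in range(1, max_ttl + 1):
        reply = probe(path, silent, ttl)
        hops.append(reply if reply is not None else "*")
        if reply == dest:
            break                    # the final hop responded, so stop
    return hops


path = ["10.0.0.1", "192.0.2.1", "198.51.100.7", "203.0.113.9"]
print(traceroute(path, silent={"192.0.2.1"}))
# ['10.0.0.1', '*', '198.51.100.7', '203.0.113.9']
```

Note that the silent second hop shows up as `*` while everything behind it still answers. The packet got through that router just fine; the router simply chose not to say anything.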

This is indeed quite clever, but don't lose sight of what is going on here. TTL Exceeded is simply not meant for this. It is a message meant to diagnose a specific, unrelated kind of network malfunction. It's not intended for tracing paths, and for reasons I'll explain, it's also exactly the kind of feature that may exist in a lab, and in the first few experimental networks, but gets abandoned as soon as money enters the picture.

DJ Shadow - Why Hip Hop Sucks In 96 (It's The Money)

TTL Exceeded is not a "feature."

Features are things that enable functionality. It doesn't do that.

Features are things that affect end-user experience. It doesn't do that.

TTL Exceeded is purely informational. It's useful to exactly one person: a network engineer. It would be absolutely untenable to report this sort of error to an end user, since there's positively nothing they can do about it, so no application will ever do this.

Not only were these messages not intended for end users, they weren't even intended for network operators as we know them now.

In 1987, virtually every network admin could get an email address for the admin of pretty much any other network, worldwide, with a couple phone calls or a whois lookup. That meant it was practical to troubleshoot other people's networks, which are often where these errors are seen. Nowadays? Forget about it. Hah. Wow. No way.

If you get a TTL Exceeded while trying to reach another host through the internet, there is a zero percent probability that you can get traction on your problem unless you are a Fortune 500 - and even then it will be tough. At least half the companies that are likely to be involved simply do not provide any form of support for problems involving less than millions of hosts.

It is, generally speaking, not possible to call AT&T and say "Hey, when I try to ping one of your subscribers in California from a Level3 circuit in New York, I'm hitting a routing loop." I have worked for an ISP with direct peering with those networks and that simply never worked. We got incompetent, consumer-grade support techs and the issue went nowhere, if we even had a contact at all.

It's even harder to call the exchange partners, the network providers that may sit in between AT&T and Level3 in this equation. Nobody will even tell you who they are, and if they did, there is simply nobody to call. Those phone numbers don't exist unless you're a network engineer at one of their direct partners who is calling to report that a fiber port is down.

No, AT&T is not going to push your complaint up the line to XO. Haha. No.

Problems like this are fairly rare these days, which makes it even less likely that anyone will be on hand to work on them. Most of the time, IME, they get resolved through Brownian Troubleshooting: large scale network maintenance happens for unrelated reasons and incidentally fixes the problem.

So traceroute, on an internet scale, has been useless for ages. You think you see a routing problem? So what? There's absolutely nothing you can do about it.

With that information, go ahead and ask yourself if you think anyone, at any network hardware company, has given a shit about implementing TTL Exceeded since the 90s. The answer is obvious: No. Without a doubt, this is not on anyone's priority list.

If you're at Juniper, nobody is clamoring for this. You do not have ISPs threatening to switch to Cisco (lmao!) just because you didn't implement TTL Exceeded correctly, because they aren't using it. The kind of problems ISPs care about are "we lost the entire US northeast" or "we can't reach Comcast, at all." The NOCs involved at that point may use traceroute, but they will get by without it. Nobody is going to make a C-level escalation with Juniper over it.

So, as a network hardware vendor, with a certain budget and a whole galaxy of internet standards to implement, are you going to put time into this? Absolutely not.

Academics, perhaps. People doing research and experiments at universities 35 years ago might have stuck to the specs religiously, but there is no financial reason whatsoever to implement this correctly.

But then, we're ahead of ourselves again. Because what is a "correct" implementation?

Nothing involving a router is "correct", but

RFC 792, "INTERNET CONTROL MESSAGE PROTOCOL", explains how to implement TTL Exceeded (which I believe is technically called "Time Exceeded"):


      If the gateway processing a datagram finds the time to live field
      is zero it must discard the datagram.  The gateway may also notify
      the source host via the time exceeded message.

      If a host reassembling a fragmented datagram cannot complete the
      reassembly due to missing fragments within its time limit it
      discards the datagram, and it may send a time exceeded message.

Anyone who knows how to read an RFC understands the crushing solemnity of MAY.

Something that you MAY do is something that WILL NOT be done when it counts. MAY, in RFC terminology, means exactly what the dictionary says it should: The implementer can do it if they want.

That means that it is "standards-compliant" to create a router that has absolutely no implementation of ICMP TTL Exceeded. "May" can mean "never." It is completely up to the vendor.

And wouldn't you know it: vendors do, in fact, choose to never send these messages, and for good reason: It's hard!

The linked presentation (seriously! read it! please! you will benefit!) addresses why this is, and in short it comes down to the fact that routers are basically supercomputers. Consider that a core router at an ISP is potentially handling billions of packets per second. Running all that through a conventional CPU is absurd, and nobody has done this in decades.

Instead, routers contain custom, purpose-built hardware - called the data plane - consisting of dedicated silicon with the sole ability to look at the parts of a packet that matter for routing purposes and ask a couple very simple questions, e.g. "Do I have a way to get this to its next hop?" and "does it still have time to live?"

There are probably other details, but you get my point - it's highly optimized.

99.99% of the packets that pass through such a device simply come in one port, get glanced at, and are then hurled out of another port so the silicon can get on to the next packet. During all of this, the actual computer part of the router, the thing that can make complex decisions, is idle.

Yes, routers do contain general-purpose computers; they're pitiful little things. From the linked presentation (read it!!):

A 320-640+ Gbps router may only have a 600MHz CPU
ICMP generation is NOT a priority for the router.

Yeah. The CPU is... Not Fast.

As I implied, this is how supercomputers often work: you have a massive array of extremely fast processors, that can only solve certain, very well defined kinds of problems, and then off to the side you have some horrible little Core i3 Ideapad Yoga whose sole job is to feed program and data into the thing and then pull the string on its back.

With a supercomputer, feeding it invalid data will simply crash the process and you'll have to start all over. That's not an option with networks, since you can't control the incoming data, so routers need a way to handle exception scenarios. That's where the computer - known as the control plane - comes in.

In addition to feeding configs to the data plane, the control plane CPU is responsible for handling unexpected situations. If an interface goes down, the data plane simply starts dropping packets (if it has no other paths.) It takes no other actions; it's the control plane's job to notice this event and do something about it, e.g. sending SNMP traps so someone in a NOC can investigate.

I don't know how many TTLs get Exceeded these days, but even if a router sees tons of them every day, there's nothing it can do to fix the problem, and sending TTL Exceeded is a MAY, not a MUST, so no vendor is going to spend an extra $100,000 to design circuitry to generate and return those responses. That means that any packet that runs out of TTL will have to get forwarded to the control plane, which will decide if and when to send a response.
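Here is a toy model of that split, in Python. It is only a sketch: the `Router` class, the queue size, and the three outcome strings are all made up for illustration, and real hardware makes this decision in silicon rather than in an `if` statement.

```python
from collections import deque


class Router:
    """Toy router: a fast path that forwards, plus a small queue for
    everything that needs the slow general-purpose CPU (hypothetical)."""

    def __init__(self, routes, punt_queue_size):
        self.routes = routes                  # destination -> egress port
        self.punt_queue = deque()             # work waiting for the slow CPU
        self.punt_queue_size = punt_queue_size

    def handle(self, dest, ttl):
        # Fast path: there is a route and the packet survives the decrement.
        if ttl > 1 and dest in self.routes:
            return "forwarded"
        # Anything else is an exception. If the slow CPU's queue is already
        # full, the packet vanishes and nobody is ever told about it.
        if len(self.punt_queue) >= self.punt_queue_size:
            return "dropped"
        reason = "ttl expired" if ttl <= 1 else "no route"
        self.punt_queue.append((reason, dest))
        return "punted"


router = Router({"203.0.113.9": "port2"}, punt_queue_size=2)
print(router.handle("203.0.113.9", ttl=64))   # forwarded
print(router.handle("203.0.113.9", ttl=1))    # punted
print(router.handle("203.0.113.9", ttl=1))    # punted
print(router.handle("203.0.113.9", ttl=1))    # dropped
```

The third expiring packet is treated no differently from the first two by the network. It just arrived while the slow CPU was busy, so it produces no response at all.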

It goes without saying that the control plane is a very busy little bee. It's bad enough that it has to handle all the "exceptions", which are going to be plentiful in a carrier network with millions of hosts passing through it, but it also has to handle any actual self-destined traffic.

In addition to the millions of hosts that an internet router has to arbitrate between, it also has its own IP addresses, which people rudely try to interact with all day long. When you ping a router, you're making that poor little 600MHz ARM chip find time to deal with your traffic, not the terabit-per-second monster that it's married to. Same goes for SNMP queries, regular config backups, and other forms of management access.

Other than traceroute, TTL Exceeded serves very little purpose in the modern world, and with traceroutes being such a tiny percentage of traffic, it is perfectly reasonable for network admins to not care if it works or not. When you put all this together, it becomes apparent that most network providers are never going to spend a second thinking about this.

You can easily confirm this is true. Run a traceroute... anywhere. Yahoo dot com. You will see nodes that never respond, 9 times out of 10.

The Worst Diagnostics In The World

I cannot even guess how many times I have seen network techs see one hop not respond and say "well it looks like hop 5 is down, so that's your problem," even though hop 6 is responding.

It is impossible for me to imagine how they think the internet works, but they're playing against a stacked deck, because traceroute is just the worst diagnostic tool imaginable.

A good tool gives you a go, a no-go, or information. That is, it tells you something is working, or broken, or provides data you can interpret.

Traceroute does provide a single "go" outcome: If you see a trace get all the way through to the last node, well, okay, that's a success. The path is probably fine.

However, it also only provides a single "no-go" outcome, and it's not the one people think. Lack of response from hosts is not a failure. The sole failure you can identify reliably from traceroute is a network loop. If you see the same pair of nodes respond over and over, then you have a loop.

...and that information is almost completely useless, because this is the exact problem that TTL Exceeded is meant to diagnose, so you can just use it as intended. Just ping the target, and you'll see a TTL Exceeded response from one of the two routers that is looping the packet, identifying the failure point. Admittedly, traceroute does tell you both of those names, which is convenient.

Inadvertent routing loops are incredibly rare, however, and 99% of the ones that I have seen were actually caused by network interfaces being down, and would have been discovered and resolved through ordinary, thorough network review.
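For what it's worth, that one reliable "no-go" is simple enough to check mechanically. The sketch below assumes a trace has already been reduced to a list of responding addresses, with `*` for silence; `find_loop` is a hypothetical helper name, and the addresses are invented.

```python
from typing import List, Optional, Tuple


def find_loop(hops: List[str]) -> Optional[Tuple[str, str]]:
    """Return the pair of addresses that keep alternating, if any.

    A loop shows up as the same two routers answering in turn
    (a, b, a, b). Silent hops ("*") are never treated as evidence.
    """
    for i in range(len(hops) - 3):
        a, b, c, d = hops[i:i + 4]
        if a == c and b == d and a != b and "*" not in (a, b):
            return (a, b)
    return None


looping = ["10.0.0.1", "192.0.2.1", "192.0.2.5",
           "192.0.2.1", "192.0.2.5", "192.0.2.1"]
print(find_loop(looping))                          # ('192.0.2.1', '192.0.2.5')

# Unanswered hops are not a loop, and not a failure either.
print(find_loop(["10.0.0.1", "*", "*", "*", "*"]))  # None
```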

As far as information? Well, read the presentation. The information provided by traceroute is limited, objectively incorrect and misleading in many cases, and fiendishly hard to interpret.

Lack of response from a node means nothing.

Even if all the nodes past a certain point aren't responding, that also means nothing.

If the nodes have high latency, that also means nothing.

If they respond on some probes and not others, that also means nothing.

Nothing you see in a traceroute means anything, because it is all accidental.

You are sending a packet through a network that did not plan for it. Nobody has taken steps to ensure your traceroute should succeed. There are no "Traceroute" checkboxes or statements in router configs. There's a really spicy reason for that, too: Traceroute does not even meet the most minimal definitions of a network protocol.

It's not a special kind of ICMP message, or a UDP or TCP packet that uses a defined port. You cannot "permit traceroute" in a firewall, because it has no standard characteristics. A lot of people think traceroute sends pings - this is an option on Unix-style traceroute (and it is what Windows tracert does by default), but it is not the default behavior of the classic tool.

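By default, classic traceroute simply sends a gibberish UDP packet from an ephemeral source port to a high-numbered destination port that nothing is expected to be listening on (the original implementation starts at 33434 and counts up with each probe, but that is a habit of the tool, not something networks are built around). The entire point is to be thrown away before a host even gets a chance to consume it, so the contents are irrelevant.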
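To show how little there is to a probe, here is roughly what building one looks like with Python's standard `socket` module. This is a minimal sketch, not a traceroute implementation: `make_probe_socket` and `send_probe` are names invented for the example, the 32-byte payload size is arbitrary, and the destination port 33434 is just the conventional starting point used by classic implementations. Reading the Time Exceeded replies is deliberately left out, since that typically needs a raw ICMP socket and elevated privileges.

```python
import os
import socket


def make_probe_socket(ttl: int) -> socket.socket:
    """Create an ordinary UDP socket whose outgoing packets carry this TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
    return sock


def send_probe(dest_ip: str, ttl: int, dest_port: int = 33434) -> int:
    """Send one junk UDP datagram toward dest_ip; return bytes sent.

    The payload is random on purpose. Nothing is ever supposed to read it.
    """
    with make_probe_socket(ttl) as sock:
        payload = os.urandom(32)
        return sock.sendto(payload, (dest_ip, dest_port))
```

That is the whole "protocol": a normal UDP socket with one option changed. There is no header field, flag, or message type that marks the packet as a traceroute probe.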

That means that if you were trying to prepare a network to handle traceroute, you wouldn't be able to. From a network perspective, traceroute does not exist.

It's simply an exploit, a trick someone discovered, so it's to be expected that it has no defined qualities. It's just random junk being thrown at a host, hoping that everything along the paths responds in a way that they are explicitly not required to. Is it any surprise that the resulting signal to noise ratio is awful?

So What Does All This Mean

It means that you can't run a traceroute unless you know what you expect to see.

When you're tracing inside a network that you control - such as a large enterprise WAN, multiple sites connected with VPNs, or an ISP that you work for - you can guess what each hop will look like, or at least look at the results and suss out whether they look like they "should."

If you trace from, say, a server at one business location to one at another, you might see your local prem router, then a network edge router, a few core routers, another edge and then another prem router.

If the trace shows all of those hops and then stops, you can guess, pretty reliably, that you made it all the way to the destination site, and that either there was trouble reaching the specific host (investigate the local router/firewall) or the host is ACLed or doesn't send ICMP responses (do packet captures on the host.)
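"Knowing what you expect to see" can be written down literally. The sketch below compares an observed trace against the path you believe your own network should take. Everything in it is an assumption for illustration: the hop names are invented, `first_divergence` is a hypothetical helper, and treating `*` as "no information" rather than as a mismatch is a choice that follows the argument of this post.

```python
from typing import List, Optional


def first_divergence(expected: List[str],
                     observed: List[str]) -> Optional[int]:
    """Return the 1-based hop number where the trace leaves the expected
    path, or None if nothing observed contradicts it.

    A silent hop ("*") tells you nothing, so it never counts as a mismatch.
    """
    for n, (want, got) in enumerate(zip(expected, observed), start=1):
        if got != "*" and got != want:
            return n
    return None


expected = ["prem-a", "edge-a", "core-1", "core-2", "edge-b", "prem-b"]

# One quiet core router: nothing here contradicts the expected path.
print(first_divergence(
    expected, ["prem-a", "edge-a", "*", "core-2", "edge-b", "prem-b"]))  # None

# Traffic taking a path you did not plan for is worth a look.
print(first_divergence(
    expected, ["prem-a", "edge-a", "core-1", "core-9", "edge-b", "prem-b"]))  # 4
```

This only works because you own the `expected` list. On someone else's network there is nothing to compare against.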

If you're tracing through a network you don't control, you have no idea how it's supposed to work. If you're a seasoned network tech who's seen some shit (and, ideally, worked on provider-scale networks) then you can run a traceroute over an unknown network and maybe, possibly, suss out something, but there are no guidelines, it's pure gut feeling: does this look right?

If you aren't that experienced however, you should avoid it, because you are not immune to propaganda. When you see high latency, hops not responding or whatever, that information will stick in your head. Despite your best efforts, it will affect the course of your troubleshooting even though you would not be able to say, if asked, what those results meant and what should be done with them.

As a diagnostician, you should ask yourself one question before performing any test: "What would I do if the outcome was x? And what if it was y?"

Can you fill in x and y? Can you answer either question? If not - why run the test?

And if you do run the traceroute anyway, god forbid you mention it to someone else. Do not write down the results unless you think you actually know what they mean, because no matter how offhandedly you do it, whoever comes across it is guaranteed to see it as a lifeline.

Network techs are mostly incompetent. It is a sad truth, and it's not their fault. People get pressed into jobs that they are told are far less complicated than they actually are. It has been my experience that easily 75% of people working networking jobs are operating in a state of absolute terror, trying to keep their head above water with problems they don't really understand at all.

If you say "hop five isn't responding," congratulations - you just identified "the cause of the problem" as far as all those folks are concerned, and there's no way to get that piss out of the pool.

Whoever you said it in front of is going to refuse, aggressively, to do a lick of additional troubleshooting until "hop five" starts responding. If that's clearly a node that nobody on the conf call or email thread has access to, then everyone's going to throw up their hands and say "Well I Guess We Just Have To Hope It Starts Working." I have seen this countless times.

It happens because, fundamentally, troubleshooting networks sucks.

If you don't have total control of the entire path, end to end, with admin access and expertise on every node along the way, there is no way to get a complete picture of what's going on. That kind of access is extremely rare; you're probably a high-ranking network architect if you have it; and even with all that access there are still plenty of cases where you simply cannot see what's wrong, because it's happening either too fast, or in a place that's impossible to inspect.

As a result, networking is full of superstition. People casting spells, executing words of power, trying to read tea leaves and declaring that the end times are coming, not because the hard info isn't available, but because it's incredibly difficult to obtain and interpret.

The Thanksgiving Uncle Problem

Read the presentation. It does a better job than this messy post at illustrating the problem. Even if you don't understand networking, by the time you're done, you will be convinced that this is too complicated for most people, full stop. There are just too many unknowns.

You will hate me for making you read this. You will regret it, because you will now be the only person in every conversation who understands these things, and the knowledge is damning. You will have to sit, silently, as everyone around you makes egregious errors in diagnostics that lead them down completely incorrect paths. This is the Thanksgiving Uncle Problem.

That's the situation where you, a gay leftist, go to Thanksgiving dinner with the family, and a shitty uncle sits across from you and begins telling lies about society, about people of color, about gay marriage, and so on. If you're self-destructive, you engage him. It will not go well.

The reason for this is that, in order for him to accept anything you say, he needs to accept that many of his foundational beliefs about the world are wrong. Ideas like "the police protect us" and "children need a mom and a dad" have been part of his worldview for so long that he has, without question, made millions of decisions based on these assumptions.

In order for him to discard them, he has to admit that he has been making a fool of himself, doing incredibly wrong and often harmful things, for his entire life. That is too much guilt to handle, and he - and most people - will do anything possible to avoid accepting it. Certainly, this is not a door he's willing to open when he's on his fourth mimosa and doing his best not to think about the goddamn insurance adjuster job he has to go back to on Monday.

So you will read this slideshow, and then you will sit on conference calls thinking, "My god. They are all wrong. And they've always been wrong. And I can't help them, because they will fight me tooth and claw to continue being wrong."

I have no advice on how to deal with this, but it's better to be correct than to be comfortable.

Footnote: Ironically, it seems very possible to me that the systems that most consistently enable this are the cheapest routers on the market. Every single home "gateway" ever sold runs Linux, where the ICMP implementation is a core kernel feature, not a user provided daemon.

I would not be surprised at all if the Linux kernel devs actually have made sure that TTL Exceeded is implemented and enabled by default - and since most Linux-based routers do everything in pure software, there is no data/control plane split to worry about, so sending an Exceeded is more or less "free."

This would only make a difference for traceroute if Linux was used for anything other than the cheapest endpoint routers, but it's still very funny.

Addendum #1

I reviewed that slide deck again and learned that I conflated a couple concepts.

Yes, the control plane may be responsible for handling exceptions, including ICMP generation, but it is apparently more likely (at least, at the scale of equipment that I am discussing) that the data plane has a fast path and a slow path, both located in the data plane, and the slow path is responsible for handling this work. The control plane, in such a device, only handles data destined to the router's own IP.

However, the slow path is (per Richard Steenbergen, the author of that presentation; we will trust his research is valid) still a general-purpose CPU instead of custom silicon, so functionally, the point I was making is still valid: There is a very slow computer handling these packets.

Steenbergen uses this fact to make the point that, because these slow-path CPUs are so slow, they are usually rate-limited. Yes, this means that some number of TTL Exceeded messages will simply be thrown away, even if they are enabled.

The example given is that a handful of users running MTR (do not get me started on this bastard program) can actually hit this rate limit. This is an outstanding example because I have seen something similar in practice.

Consider what that would look like, and how common it would be: If you have a NOC full of people who think they know what they're doing, but don't, that only enhances the probability that everyone is trying to troubleshoot on their own instead of doing a screenshare and coordinating their efforts - thus, you have six guys running MTR to the same IP.

If they hit that rate limit, what do they see? Nodes suddenly not responding! Randomly, in fact - sometimes responding, sometimes not! That means it's not just a hop that doesn't respond to traceroutes, but packet loss! Wow! We found the problem!

Of course, if four of them hit Ctrl+C, the PL would mysteriously vanish. Huh! Weird! Well, it must be an intermittent issue in (squints at resolved hostname) Hurricane Electric's network. I'm sure they have a flapping port they haven't noticed (lol.) Just send them a trouble ticket!
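The arithmetic behind that phantom packet loss is easy to sketch. The numbers below are invented: a limit of two Time Exceeded responses per second is a hypothetical figure, not a vendor default, and the model assumes each user's tool sends one probe per second that expires at this particular hop.

```python
def apparent_loss(users: int, seconds: int, limit_per_sec: int) -> float:
    """Fraction of probes that get no answer from a rate-limited hop.

    Each user sends one expiring probe per second. The hop answers at
    most limit_per_sec of them and silently ignores the rest.
    """
    answered = lost = 0
    for _ in range(seconds):
        budget = limit_per_sec       # the limiter refills every second
        for _ in range(users):
            if budget > 0:
                budget -= 1
                answered += 1
            else:
                lost += 1
    return lost / (answered + lost)


# Six people tracing through the same hop: two thirds of probes go unanswered.
print(round(apparent_loss(users=6, seconds=60, limit_per_sec=2), 2))  # 0.67

# Four of them hit Ctrl+C and the "packet loss" is gone.
print(apparent_loss(users=2, seconds=60, limit_per_sec=2))            # 0.0
```

In both runs the router forwards real traffic exactly the same way. The only thing that changed is how many people were asking it to talk about itself.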

By the time this useless waste of effort has resolved (e.g. HE has received, acked, investigated, and declared the ticket No Trouble Found and rejected it) the problem has probably gone away due to unrelated network weather effects. The NOC guys all tell each other that HE was lying about their broken network, slap each other on the back for being smarter than the other bastards, and go out for beers.

How do I know this? Because I've been part of it!

My employer used to have an unholy number of customer sites terminated with little Linux shitboxes - you know the sort, they used to be common as dirt. Tiny Soekris-esque SBCs in folded sheet metal boxes with 12V power supplies, running horrible little SoCs and a copy of Busybox from before the fall of Rome. We had reasons for it that I won't go into.

These things were underpowered, to put it mildly. They could route maybe 30 Mbps, and if you turned on any firewall functionality that dropped to 10. This was at a time when a tremendous number of customers were on connections no faster than 5 Mbps, so this wasn't a huge problem. We got rid of them all when bandwidths skyrocketed.

But what used to happen is that you'd have three or four people looking at one of these things at once, and you'd start seeing packet loss. And there you go, the customer has a bad connection. Kick it to the ISP and close the ticket, right?

I can't count how many times this happened, but I do remember after about four years of doing this, I had come up with a method for getting more accurate latency stats: just ping -i .1. Absolutely hammer the thing with pings while you have the customer test their usual business processes, and it'll be easier to see latency spikes if something is eating up too much bandwidth.

What I discovered is that running two of these in parallel would produce exactly 50% packet loss, with total reliability. I then tested and found that if I just fired up three or four normal pings, at the default interval, it would do the same thing. 30% or 40% packet loss.

There is no telling how many issues we prolonged because everyone was running their own pings simultaneously and the kernel was getting overloaded and throwing some of them out. This is a snapshot of every network support center, everywhere. It is a bad scene.
