The Internet of Things

A disaster for no good reason

Originally published: 7/7/2017

This page is written in 100% conventional HTML. If your device is offering a "mobile-friendly view," it should format this article very nicely and I recommend you try it.

Update (5:24 PM PST 14/7/2017):

This article has been linked on Hacker News, far more visibility than I ever expected, and unsurprisingly someone in the comments says they worked on the Nest Protect, that my accusations are unfounded, and that it doesn't run Linux.

I'm not going to double down on factual inaccuracies; if it doesn't run Linux, it doesn't run Linux. I did do some research, I just didn't pour hours into it - I didn't expect anyone besides a few Twitter followers to read this, in all frankness, and I admit that I looked up the MCU but didn't realize it wasn't sophisticated enough to run a full-stack PC OS.

I don't think it invalidates my suspicions about the failure mode in a high level sense, and more importantly I disagree with the adequacy of their precautions. Simply put, if they'd done their due diligence and done their job competently, the reported problems would not have happened. Decades of perfectly functional, reliable alarms prove that. They took a solved problem and de-solved it, made a product worse than it was, and it was a product that couldn't tolerate that. Excuses excuses, I'm not interested in them and neither is a family without parents or possessions or a home.

I've written a lot of tweets and had a lot of conversations with a lot of people about IoT. It's bullshit, no question, let's get past that part. It's not hard to find evidence of this; some light googling, or reviewing any website under the sun from Reddit to Slashdot, and you'll find yourself ankle-deep in stories about failures of slick, flashy, aspirational miracle devices.

To wit:

https://www.forbes.com/sites/ianmorris/2016/04/06/google-is-sitting-on-a-timebomb-with-its-nest-disaster/#7183e47d7181

This article is what got me going on this today. It's about Google's acquisition of Nest and what a poor job they've done. It isn't a terribly in-depth article, but it reminded me of the fiasco of the Nest Protect and the video of it going off indefinitely. I remember watching that and being shocked at the incompetence of the device, despite long since having accepted that IoT was a dead end.

The reason this particular video stunned me the way it did was because I simply cannot figure out how they fucked up a smoke alarm. This is a technology that has been established for decades. I remember reading about the Radioactive Boy who built a breeder reactor from smoke alarm components and that was in '94. And they work, I mean, I don't have statistics but I've never seen one set off by not-smoke, personally. So they seem pretty robust, the tech seems pretty close to "finished."

So how, how, how did Nest screw up this finished product?

Well, I haven't had one open, so I can't tell you. These folks have:

https://www.ifixit.com/Teardown/Nest+Protect+Teardown/20057

Needless to say, this being a tech gadget, it looks just like you'd expect inside: a cellphone-esque ultra-micro PCB with bleeding-edge SMD components, the best refinements in production process tech that the industry has to offer, natch. LEDs and BGA chips and off the shelf radio ICs of all sorts and so on.

I considered buying one of these off eBay (I can probably get four for $10 it looks like) but a glance through this teardown tells me I wouldn't learn anything. Jellybean parts plugged together with a nest (ha) of invisible traces, all speaking digital soup - never going to pry any truth out of it. I can only speculate on how it works, so speculate I will.

What would cause a smoke alarm to go off interminably? Let's consider some theories. Please note that this will be a very technical article since this is a very technical subject; bear with me.

Technical Analysis - Conventional Smoke Alarm

In a conventional smoke alarm (teardown) you can expect to find a sensor (one of several designs), a handful of passive components, a transistor or two, and maybe a dedicated smoke detector IC. This is what I found in several designs I just looked up, including the linked teardown and several pictures on Google image search.

The reliability and behavior of resistors and transistors are extremely well known. The IC in the linked teardown is an MC145017. I looked up a datasheet and found Freescale Semiconductor makes these. They are probably second-sourced from a number of other extremely well known semiconductor manufacturers, and therefore have a considerable pedigree. Looking at the datasheet, I find the IC has a sensor input pin, some analog calibration inputs, and an oscillator source. That's it.

Note that I didn't say a clock source. An oscillator, because this is not a digital device. At all. Despite being "a chip", there are no digital electronics inside. The datasheet contains a block diagram which shows that the internals are all analog components - transistor logic gates, comparators, latches. This is not a computer.

Interestingly, the datasheet provides no information on how to silence the alarm. My assumption is that this is not included in the design; if you want to interrupt it, you - the equipment manufacturer - solve that on your own. I imagine a transistor and capacitor arrangement are used to temporarily disconnect power to the detector when the user presses silence.

Conclusion

The cases where this type of detector is likely to go off interminably include:

The sensor trips from something that is not smoke
The sensor fails completely and outputs nonsense values
The monitoring circuitry falls out of calibration
The monitoring circuitry fails in a way that shorts rather than opens a connection
The silence button breaks

That's very very rough, there's more to list, but the point is, these are all very general, "analog" circuit failures. In other words, they could happen to anything, and are basically unpreventable within reasonable constraints. Also, I've never seen this happen. I've lived with probably... twenty smoke detectors or more in my life, and I've never had one fail in either direction (false positive or false negative.)

There are a number of ways to design a "smart" smoke detector. Let me just jump straight to telling you how I think Nest did it.

Technical Analysis - Smart Smoke Alarm

I need to preface this by saying that I have very little faith in the worldliness or general sense of Silicon Valley hardware engineers. I have seen a long history of extremely poor decisions from that part of industry, so I will assume they made all the worst decisions.

The Smart device engineer does not begin by disassembling ten smoke alarms to see how they work. They do not begin by reading papers written by fire chiefs and scientists. They do not look at the statistics on fire-related deaths with and without smoke alarms of different eras (although the marketing department director does). They begin by contacting a company that sells smoke sensors and ordering a batch of sample parts. And when they arrive, the engineer connects them to the holy, unifying hub of all technology: The computer.

I don't know what modern hardware engineers start with. Probably some Atmel demo board. Doesn't matter. They plug it into an I/O line and start writing code, blowing smoke over the sensor and watching the 0 in their debug console go to 1. Maybe it's a really advanced device, it spits out an analog value, so they plug it into an ADC channel and watch the float value drift up and down and they take notes and figure out what candle smoke, steam, vape smoke, fog machine fog, and so on produce.

They try all the sensors they ordered and then, in the case of Nest apparently, they decide none of them are any good because they suffer from false positives. (Nest blog) So they contact a hardware engineering company and they start a six month process to develop a completely new sensor, one that doesn't suffer from those problems. They come up with a novel design that nobody in the smoke alarm industry has ever seen before, and it has a tenth the false positive rate. Progress! Innovation! The Free Market! It's all working!

They rewrite the basic code to read the new sensor output and sound the alarm. It reads from the ADC and turns a transistor on or off to drive the piezo buzzer, depending on whether the reading is over the smoke detection threshold of 0.63. They demonstrate it to their bosses and get approval to continue and everyone has a few beers and the next day they move on. With the first 90% complete, now they just have to do the other 90%.
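As a rough illustration of the prototype described here, this is about all that code amounts to. It is a sketch only: the names `poll_once`, `read_adc` and `set_buzzer` are hypothetical, and a normalized 0-to-1 reading is assumed. Only the 0.63 figure comes from the text above.

```python
from typing import Callable

SMOKE_THRESHOLD = 0.63  # figure from the text; a normalized 0..1 ADC reading is assumed


def poll_once(read_adc: Callable[[], float],
              set_buzzer: Callable[[bool], None],
              threshold: float = SMOKE_THRESHOLD) -> bool:
    """Take one sensor reading and switch the buzzer transistor accordingly.

    read_adc and set_buzzer are hypothetical stand-ins for the hardware access.
    """
    reading = read_adc()
    alarm = reading > threshold  # strictly over the threshold sounds the alarm
    set_buzzer(alarm)
    return alarm


if __name__ == "__main__":
    # Fake hardware for demonstration: three canned readings and a list as the "buzzer".
    fake_samples = iter([0.10, 0.70, 0.40])
    buzzer_log = []
    for _ in range(3):
        poll_once(lambda: next(fake_samples), buzzer_log.append)
    print(buzzer_log)
```

Note that the buzzer only ever changes state when something actually calls this function. Nothing else in the design is watching the sensor.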

A whole firmware dev team is involved now. They start with a Linux base - busybox - and add drivers for WLAN, BT, custom software to talk to the LEDs and the motion sensors and so on and so forth. Another team is working on the cloud backend. They're adding features at a breakneck pace, everything's going great.

Problem is, the last five paragraphs were all mistakes. Every single step of the way they made critical errors and then compounded them.

Conclusion

The smart smoke alarm can start sounding interminably if any of these things happen:

The sensor fails electrically
The sensor is out of calibration
The LED / photocell in the sensor fail
The sensor corrodes due to solvents used on the ceiling nearby
The microprocessor crashes or fails
The battery dies
The OS kernel crashes
The OS kernel spinlocks
The OS process scheduler gives too much time to the WLAN driver
The smoke alarm process crashes
Memory becomes corrupted
The raw I/O driver reports incorrect values

I could probably list forty or fifty more items if I took the time but it'd be better to focus on why these are categorically different types of problem.

There are two absolutely critical errors made here, unquestionably, and then a third that is more philosophical. The two critical errors were:

Monitoring a smoke alarm sensor solely with a computer
Putting a completely novel design of sensor in an expensive mass market smoke alarm without years of testing

And the philosophical error:

Reinventing every part of the wheel down to the molecules

The Critical Errors

The reason I call these errors "critical" is that, simply put, this is irresponsible and dangerous behavior. This is a safety-critical device. If it fails, human beings die. That means it is unethical to sell one that is not known to be safe, and that means that any ethical, knowledgeable person in this industry would act conservatively when designing such a product.

The sensors in $10 mass-market smoke alarms have been in use for decades. Their behavior is known - it's known well enough that the ICs that drive them are purpose-built. That doesn't happen until something is established as a standard, because Freescale is not a company that spins up to produce 5,000 of an IC. If they made these, they made millions, and that's because they had a market waiting for them.

On that topic, the IC (at least the example I looked at) is designed using very conservative design principles. It contains no computer, no microprocessor to make decisions using complex logic. The design uses analog circuitry which behaves consistently - it doesn't start acting strange when the voltage drops below a perfect 3.33V (it has a comparator circuit that makes it shut down completely if that happens), it doesn't have memory to get corrupted and cause invalid instructions to execute. It cannot reboot. As long as power is applied, it is continuously operating.

Because all of this is purpose-built, these devices can run on a single 9V battery (which have awful energy density) for many months. They are designed "down" to this purpose.

All this means that if you use these components and this basic design, there's almost no chance that the device will fail unless you screw up the design or manufacturing fundamentally - and it takes a LOT to mess up a design this basic. You could dead-bug wire up a smoke alarm with $6 in parts and a handful of random wires and the likelihood it would fail is infinitesimal. In other words, this is a solved problem. But look at the decisions made by Nest here.

The Sensor

They chose to use a completely novel component - one that had never been out in the world, had only ever experienced their labs and their chosen test environments. Whatever tests they put it through, they couldn't hold a candle to the acid test of being installed in random customers' homes. Until that's done, there's no telling how the device could fail. Maybe it stops working if it gets a drop of Formula 409 on it, and they just didn't happen to test that. Someone installs one in a bathroom and it turns out continuous high humidity fouls the sensor. Who knows?

"But," you might say, "the sensors in normal smoke detectors had to go through all that to gain a track record too!" You aren't wrong, but: they already did. The work had to be done, and it was scary, dangerous work that could have failed and killed people, but it was done, and now those sensors have been proven. You have to have a damn good reason to put consumers through that dangerous process again, you don't just do it for kicks. This isn't a game, this isn't a market that needs Disruption. You don't Disrupt someone's heart-lung machine. You leave it the hell alone and let it do its job.

And those sensors were designed by companies that make smoke detectors. Not a computer software company doing this for the first time. I'm sure they went to a hardware designer that has some experience in this, but the question is, as acceptors, is Nest / Google qualified to determine whether their vendor provided an adequate product? The traditional sensors were almost certainly tested more thoroughly, by more experienced people, and their manufacturer had more stake in the outcome, whereas to Nest / Google - this is just another product to succeed or fail. It won't hurt their name.

The Computer

All that wouldn't be so bad except that they went on to compound the problem by putting it all through a computer. Computers are bad, they're broken, and relying on one is a death sentence.

Maybe someone reading this is mad at me now. "What about realtime computing, what about the computers in jets and the ones that monitor reactors," so on and so forth. Sure, those exist, they're trustworthy. But that's not what they used. That's not what Silicon Valley ever uses because they only make one product: software. Not hardware, not firmware.

If I may opine freely? Thanks: nothing with Linux in it has "firmware." A 1998 LCD-display programmable thermostat has firmware: purpose-built code that does one thing. Something that is not a general-purpose computer. Something that does not have a process and thread hierarchy that is scheduled on the fly by a kiiiiiind-of deterministic system. There are no instructions in it that aren't pertinent to the job at hand. When the device is idle, the CPU is sitting in a tight loop checking a clock periodically or even waiting for an interrupt to continue execution, at which point it'll check the temperature, and if there's nothing to do according to its 12 bytes of programming, it goes back to waiting, sucking up almost no power.

Linux-based integrated devices do not have "firmware," they're just PCs running on really basic hardware. There is no significant difference between a PC with Linux, a Raspberry Pi or a Netgear router. And Linux - and any other "fat" PC OS - is disastrously unpredictable. The most basic Linux system invariably has a plethora of software that is always doing something. Linux systems can't sit still. There's always a fantastic array of network services running, and the more things the system can potentially do, the more things it's always doing.

Precision timing doesn't exist; if you want to check the status of an I/O line every 10ms, will that really happen? Or will the userland process that does this get delayed for 40ms because something was doing a DNS lookup and the wifi was down for a moment? Nothing can be truly interrupt triggered because you can't just have an interrupt abruptly redirect processing right in the middle of, say, a TCP packet being transmitted. So everything on a PC OS is fungible, unpredictable, shiftable. It's all fluid. And that's just not how hardware works, and that's not how the real world works.

To get overly technical: the job of any digital device that interacts with the physical world is to attempt to reconstruct a reasonable facsimile of its environment through sampling, a basic concept of information theory, and a device that can't sample at a consistent speed above the "Nyquist rate" is subject to aliasing; in other words, its view of its environment looks correct but isn't. When you're on a multitasking system, especially one that has external timing dependencies, especially networking, operations are effectively nondeterministic, meaning you can't predict how long anything will take.

The Nyquist rate of a smoke sensor is probably very very low, on the order of hertz (e.g. it only needs to be sampled a few times a second) and well within the capabilities of a modern computer... in theory. In practice, well, it all depends how things are architected. In theory, even if other processes on the machine are completely hung, preemptive multitasking should context-switch them out so that the task that reads the sensor status can be serviced. Emphasis on the "should," however - I've seen networking tasks hang entire Linux machines, causing basically all system calls to block until e.g. a DHCP request completed, taking thirty seconds or more. That's long enough to seriously delay detecting actual smoke. And reading an alarm silence button or motion sensor definitely requires a sample rate in at least the hundreds of hertz, which I can easily see a PC failing to achieve.
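To make the timing argument concrete, here is a small simulation of a polling loop that samples every 10ms and then gets stalled for 40ms, the figures used a few paragraphs up. It is a sketch under stated assumptions: the function names are hypothetical, the 30ms button press is an invented example, and the stall is modeled as a single blocked iteration rather than any real scheduler behavior.

```python
from typing import Iterable, List, Optional


def sample_times(period_ms: int, total_ms: int,
                 stall_at_ms: Optional[int] = None, stall_ms: int = 0) -> List[int]:
    """Times (in ms) at which a polling loop actually samples its input.

    The loop tries to sample every period_ms. If stall_at_ms is given, the
    sample due at that moment is delayed by stall_ms (think: a blocked
    system call), and polling carries on from the delayed time.
    """
    times = []
    t = 0
    while t < total_ms:
        if stall_at_ms is not None and t == stall_at_ms:
            t += stall_ms  # the loop is blocked; nothing is sampled meanwhile
        times.append(t)
        t += period_ms
    return times


def press_seen(times: Iterable[int], press_start_ms: int, press_len_ms: int) -> bool:
    """True if at least one sample lands while the button is held down."""
    press_end_ms = press_start_ms + press_len_ms
    return any(press_start_ms <= t < press_end_ms for t in times)


if __name__ == "__main__":
    steady = sample_times(period_ms=10, total_ms=200)
    stalled = sample_times(period_ms=10, total_ms=200, stall_at_ms=100, stall_ms=40)
    # A quick 30ms tap on the silence button starting at t=105ms (invented example).
    print(press_seen(steady, 105, 30))
    print(press_seen(stalled, 105, 30))
```

With steady sampling, several samples fall inside the press. With the stall, the loop samples at 90ms and then not again until 140ms, so the whole press happens while nobody is looking.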

Further issues: the number of ways that a computer can fail is unbelievable. I simply won't get deeply into it here, the topic is too incomprehensibly big to discuss. The problem is that computers are just too complex, they have too many parts, too many dependencies. The simplest computer you can buy now has millions of components and hundreds of discrete functional blocks. The smoke detector sensor chip I looked at probably contains less than a hundred components total.

The PC As A Safe Space

The PC is all that Silicon Valley understands. The people working there are either elitist or naive depending on how you read it. You could say that they want every car they get into to be a Cadillac - "I want all the features! All the bells and whistles! Minibar in the back!" - or you could say that without training wheels, they can't even ride - "I don't understand, how do i turn on and off an LED without an external driver that accepts i2c?" Either way, they get very uncomfortable leaving the PC-esque environment, working outside the bounds of Processes and Threads and a Kernel and a Filesystem and so on.

There is a tremendous chasm of disconnect between software people and hardware people. A lot of programmers know literally nothing about how computers or electronics work. This is fine, their job usually doesn't require it. But when you take those people and tell them to do hardware development, what are they going to do? The natural thing, they're going to coerce all the concepts they don't understand into something they're comfortable with. That means that job one is "get everything into the computer." Get the ADCs, get the I/O chips, get the external digital sensors and plug them all in to the computer.

Oh, that breath of fresh air, the relief as everything becomes comprehensible. Manageable. Comfortable. Safe. Like that feeling when you get home from a day out running errands and sit down at your "battlestation", you settle into your worn-in chair and your hands fall on your keyboard and mouse right where you always keep them and suddenly all the worries and anxieties fade away because this is where you thrive, this is where you're king. Once you have all those externalities - sensor readings, button presses - transformed into a set of API calls, everything becomes an elaboration on a set of standard phrases. It's quantized, normalized, you've converted the unknowable uncertainties of life into numbers and synchronous calls and now you can treat them the same way you do everything; just data.

But, that's not how life works. Life doesn't wait for you to be ready to process it. A fire will not wait for you to get an IP address (the radio signal dropped out, see) and authenticate with a cloud service so that you can be totally certain your settings are up to date and that you aren't going to set off an alarm when the user asked you not to alarm for a while. And a real, actual human being, on the phone with the hospital, is not interested in your excuses about synchronous API calls or security-critical firmware updates when they can't hear the nurse telling them the condition of their husband that just got brought in from a car crash. The world will not wait for computers.

For the record, I read several stories about the Nest Protect going into permanent alarm, and you know what my hunch is? The same thing I always assume: "Dumb Linux crap." The culprit was probably some shell script that opened /opt/smoked/detect and output 1 to it and then left the file locked so nothing else could touch it or forgot to delete a pid file or whatever. This is what I always assume when I read about Linux integrated devices screwing up, and on the occasions I've actually heard what the cause was I usually end up right.

Disruption Is Stupid Bullshit

OK but listen

The initial release of the Nest Protect should have looked like a smoke alarm. It should have been in a round box with a red blinking battery LED, run off a 9V or two, and had a silence button. Instead, Nest decided to throw literally everything out the window and reinvent the entire product from the ground up. They didn't use the same sensor, the same design, the same interface, the same behavior or the same circuitry.

That means that they took a risk that every one of those things could be a poor decision, and indeed, they ended up recalling half a million of these devices because they reached too far. On their VERY FIRST maiden voyage into this market they refused to even consider any of the ideas that came before them. This is really common for Tech, NIH syndrome applied to everything at all scales.

This keeps happening. IoT stuff - and other tech products - keep coming out that throw away everything that came before them, so if they fall short of their stratospheric goals for revolutionizing the way we shell peanuts or make shopping lists or whatever, they end up being completely worthless. If the Nest Protect had been an ordinary smoke alarm that could also send you a notice on your phone, then if that latter part didn't work you could still use it as a smoke alarm. If the Protect had been an ordinary smoke alarm with an optional voice module, if the voice module malfunctioned, you could turn it off. Instead these companies bet the whole farm on a brand-new design and often close up shop when it fails.

A Better Way

I didn't write all of this just to complain, I wrote it because it makes me furious seeing this crap when none of this needs to be this way. Here's how you design a smart smoke alarm that doesn't suck:

First, begin with a smoke alarm. A tried and true design that you can buy for $10. Buy the exact parts that are already in use, and put them in your final product. The smoke sensor, the transistors, the through-hole resistors, the Freescale IC, the 9V, the LED. Buy all of that and use it in your final product.

Next, build your gadget. Get your SoC, roll your Linux image, write your software. But when you add the computer to the design, don't put it between the sensor, the silence button and the buzzer. Put it on top. Connect the I/O lines on your little PC to the output of the smoke detector chip. Let it do the heavy lifting, the stuff that you aren't sure you can do safely and which it has proven it can do for decades.

The Protect has a feature where it plays a voice alert for ten seconds before the alarm goes off. That's fine, you can do that. On the analog side of the system, where you have the original smoke alarm circuitry, add circuitry to produce a one-time delay, so that when the alarm goes off it takes ten seconds before a transistor begins conducting and lets the buzzer go off. Then your SoC reads the signal from the detector chip, and if the user decides to remotely silence it, an I/O line triggers the alarm silence circuitry from the original design.

In the above implementation, if the SoC catches fire and is reduced to carbon, the smoke alarm will still work. There will be a ten second delay now, to give the computer right of first refusal on alarms, but if the computer is nonresponsive or hung or dead that won't stop the alarm from marching on and doing its job. You can add circuitry - again, analog - to notice if the "silence" output line from the computer has been stuck high for a long time and suppress it, so that a hardware failure or software malfunction that causes the machine to try to constantly silence the alarm can't do it for more than a few minutes before it returns to normal operation.
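The behavior described above can be modeled in a few lines. To be clear, this only simulates the logic; the whole point of the proposal is that the real thing would be analog circuitry, not code. The one-second time step, the 180-second cap standing in for "a few minutes", and all the names are assumptions made for illustration.

```python
from typing import List, Sequence

ALARM_DELAY_S = 10    # from the text: window for the voice alert before the buzzer
MAX_SILENCE_S = 180   # assumed value for "a few minutes" of honored silence


def buzzer_timeline(smoke: Sequence[bool], silence: Sequence[bool]) -> List[bool]:
    """Simulate the buzzer output second by second.

    smoke[i]   - output of the conventional detector chip at second i
    silence[i] - state of the computer's "silence" line at second i
    """
    buzzer = []
    smoke_for = 0    # consecutive seconds the detector has been asserting
    silence_for = 0  # consecutive seconds the silence line has been high
    for smoke_now, silence_now in zip(smoke, silence):
        smoke_for = smoke_for + 1 if smoke_now else 0
        silence_for = silence_for + 1 if silence_now else 0
        delay_elapsed = smoke_for > ALARM_DELAY_S
        # A silence line stuck high for too long is ignored.
        silence_honored = silence_now and silence_for <= MAX_SILENCE_S
        buzzer.append(delay_elapsed and not silence_honored)
    return buzzer


if __name__ == "__main__":
    # Computer is dead (silence line never asserted): buzzer starts after the delay.
    dead_computer = buzzer_timeline([True] * 15, [False] * 15)
    print(dead_computer.index(True))
    # Computer is stuck trying to silence forever: the cap eventually overrides it.
    stuck_silence = buzzer_timeline([True] * 200, [True] * 200)
    print(stuck_silence.index(True))
```

In this model the computer can delay or temporarily silence the buzzer, but no state of the silence line can keep it quiet indefinitely, and a silence line that never asserts changes nothing.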

This is a failsafe design. You think it out. You consider the worst cases and how to prevent them within reason, and then once you have that nailed down you look for ways to add sophistication. You figure out how to make the critical functionality of the device minimally complex. That is where tech falls on its face, complexity. Silicon Valley wants to cram as much functionality and as much flash into everything as they possibly can, and as a result everything they make is fragile, has single points of failure, and in the case of safety-critical things like smoke alarms and locks and cars, is outright dangerous.

This is the kind of design that should be used throughout IoT, especially at this early stage. IoT door locks should be ordinary locks with just enough modification to allow a computer to control them by rotating the lock control just like a human would. IoT washing machines should be ordinary washing machines that just have an extra computer glommed on that monitors the signals in the existing design, or has additional sensors separate from the ones that allow the core machinery to function. IoT lightbulbs should turn on bright white if the computer doesn't respond within a few seconds after startup, and they should have a switch to bypass the computer if it goes on the fritz. In other words, the computer should always be optional.
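The lightbulb case is simple enough to write down as a decision rule. This is a sketch, not a real product's logic: the 3-second timeout is an assumed stand-in for "a few seconds", and the function and parameter names are hypothetical.

```python
from typing import Optional, Tuple

Color = Tuple[int, int, int]

FALLBACK_TIMEOUT_S = 3.0           # assumed value for "a few seconds"
BRIGHT_WHITE: Color = (255, 255, 255)
OFF: Color = (0, 0, 0)


def bulb_color(seconds_since_power_on: float,
               controller_color: Optional[Color],
               bypass_switch: bool) -> Color:
    """Pick the bulb's RGB output.

    controller_color is None for as long as the computer has not responded.
    """
    if bypass_switch:
        return BRIGHT_WHITE        # the computer is out of the loop entirely
    if controller_color is not None:
        return controller_color    # the computer is alive and has made a choice
    if seconds_since_power_on >= FALLBACK_TIMEOUT_S:
        return BRIGHT_WHITE        # the computer never answered: act like a plain bulb
    return OFF                     # still inside the startup grace period


if __name__ == "__main__":
    print(bulb_color(1.0, None, False))         # waiting on the computer
    print(bulb_color(5.0, None, False))         # computer is unresponsive
    print(bulb_color(5.0, (255, 0, 0), False))  # computer asked for red
    print(bulb_color(0.0, (255, 0, 0), True))   # bypass switch wins
```

Every branch that does not involve the computer ends in an ordinary working lightbulb, which is the sense in which the computer is optional.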

The reason I'm frustrated is because if these things were designed this way, I would WANT them. I really wish my washing machine would tell me when the wash is done because I am EXTREMELY bad at remembering to go check on it. But I can't buy that, I can't buy something that just has a $5 microprocessor with just enough intelligence to connect to the internet and send me an email or a push notification if the buzzer on the washer goes off. The only thing I can buy is a washing machine that's had a horrible, unreliable PC full of quarter-baked software crammed into it which will stop working when some godforsaken cloud service is "sunset", and which is so dependent on the reliability and trustworthiness of the software on the computer that if someone hacks it or the software has a bug, the washer can start spraying water at me when I have the loading door open.

IoT is desirable, but it is too aspirational, it promises too much for such an early technology, and it is too dependent on technology that we know, from incredible amounts of experience in every field, is not an acceptable substitute for "old fashioned" process control circuitry. Even if we assume the players in the market are acting in good faith - which I seriously doubt in most cases - the products they are trying to sell just don't make sense.

Disclaimer & Contact

This is my opinion. I do not work in this industry or anywhere in hardware or software development. I base this on my autodidactic knowledge of these topics. I do not personally own a Nest Protect, and I don't know how much of the information I've read is relevant to the first version vs. what might be unique to the second version. I did some research, but I wouldn't say it was extensive because this article is about the industry as a whole, and even if these complaints aren't 100% accurate for this specific product, IoT is categorically guilty of all of these mistakes.

If you have factual contradictions of anything I've written here, email them to articles@gekk.info and I will consider revising my article or including your commentary.
