Troubleshooting (bis)

#0190

As I mentioned in an earlier entry called Troubleshooting, one of my strengths as an engineer is that I'm an excellent troubleshooter. This entry describes how I located an intermittent problem in a complex product. ^Note 1

In 1976, at Codex Corporation, I worked on the design of a data communications product called the "6030 Statistical Multiplexer" reporting to the principal designer of this product, Jim Vandermey. It was a highly innovative, state-of-the-art product. ^Note 2

A multiplexer combines multiple low-speed data communications signals onto a single high-speed link, for transmission over a network. At the other end, an identical device reconstitutes the original separate signals. For example, eight 1200 bits-per-second signals might be combined on a single 9600 bits-per-second "network link". A "statistical multiplexer" takes advantage of redundancies in the data streams to allow for increased capacity. For example, perhaps 12 data streams of 1200 bits per second could be combined on the same 9600 bits per second network link. By modern standards, all the speeds I'm giving you are extremely slow. But they were state-of-the-art speeds at the time.
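The capacity arithmetic above can be made concrete with a small Python sketch. The `active_fraction` parameter, and the 60% figure used in the usage note, are assumptions for illustration only: they stand in for whatever slack the real data streams happened to have, and are not numbers from the 6030's design.

```python
LINK_BPS = 9600      # speed of the shared network link
STREAM_BPS = 1200    # speed of each low-speed input

def fits_on_link(n_streams, active_fraction=1.0):
    """True if n_streams can share the link.

    active_fraction is a hypothetical parameter: the share of each
    stream's nominal capacity that actually has to be transmitted.
    """
    offered_bps = n_streams * STREAM_BPS * active_fraction
    return offered_bps <= LINK_BPS

if __name__ == "__main__":
    for n, frac in [(8, 1.0), (12, 1.0), (12, 0.6)]:
        print(n, "streams, active fraction", frac, "->", fits_on_link(n, frac))
```

With every stream fully busy, eight streams exactly fill the link and twelve do not fit. If each stream only needs about 60% of its nominal rate (the assumed figure), twelve streams fit with room to spare.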

Multiplexers typically take care of other tasks, such as end-to-end error detection and correction. They will thus guarantee, to an extremely high degree of probability, that all the data are delivered to the other end without any errors, even in the presence of noise or dropouts on the network link. This is usually done by breaking transmissions on the network link into blocks of bits, each of which carries some sort of "checksum". If an error is detected in one of the blocks, the receiving device requests retransmission of that block. This is necessary because there are bound to be errors due to noise on any real-world network link.
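The block-and-retransmit idea can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: the 16-bit additive checksum, the single-bit noise model, and the names `make_block`, `check_block`, `noisy_link`, and `send_reliably` are all invented for illustration, and are not the 6030's actual link protocol.

```python
import random

def checksum(payload: bytes) -> int:
    """Toy 16-bit additive checksum (an assumption, not the 6030's real code)."""
    return sum(payload) & 0xFFFF

def make_block(payload: bytes) -> bytes:
    """Append the checksum to the payload to form a block."""
    return payload + checksum(payload).to_bytes(2, "big")

def check_block(block: bytes):
    """Return the payload if the checksum matches, else None."""
    payload, received = block[:-2], int.from_bytes(block[-2:], "big")
    return payload if checksum(payload) == received else None

def noisy_link(block: bytes, rng: random.Random, error_rate: float) -> bytes:
    """Hypothetical noise model: flip one random bit with probability error_rate."""
    if rng.random() < error_rate:
        corrupted = bytearray(block)
        corrupted[rng.randrange(len(block))] ^= 1 << rng.randrange(8)
        return bytes(corrupted)
    return block

def send_reliably(payload: bytes, rng: random.Random,
                  error_rate: float = 0.3, max_tries: int = 50):
    """Resend the same block until it arrives with a good checksum."""
    block = make_block(payload)
    for attempt in range(1, max_tries + 1):
        received = check_block(noisy_link(block, rng, error_rate))
        if received is not None:
            return received, attempt
    raise RuntimeError("link too noisy")

if __name__ == "__main__":
    data, tries = send_reliably(b"1200 bps stream", random.Random(7))
    print(data, "delivered after", tries, "attempt(s)")
```

The receiver never has to know what the error was, only that the block failed its check. That is why a handful of errors per day costs the end user nothing but a few retransmitted blocks.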

After a long and complex development, the 6030 went into production. With the pressure of releasing the device to the market, its initial production testing was a bit ad hoc. But eventually, the Codex Test Department formalized the tests that each device had to pass before it could be shipped to a customer.

In one of these tests, two multiplexers were connected with a short cable in place of the network link. That is, instead of sending its data over telephone lines, microwave links, or whatever digital connections might be used in the great wide world, these two devices were simply connected by a few wires about a foot long (about a third of a meter). Data streams were inserted by test equipment into each of the ports, multiplexed together by the 6030, and transmitted over a foot of wire. They were then demultiplexed by the other 6030, and the separated data streams were checked to be sure that no errors had been introduced. The test protocol required the units to run for 24 hours error-free.

Once this test was put into place, it turned out that virtually none of the devices we were producing could pass the test. There seemed to be an MTBF ("Mean Time Between Failure") of about eight hours, so that in a 24-hour test, about three errors would be detected that required a re-transmission on the network link. That was an average figure - sometimes there was only one error, sometimes four or five, but very few pairs of multiplexers were able to run for 24 hours error-free. Since there's obviously no noise to speak of on one foot of cable, it was clear that there was something wrong with the design.
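A rough calculation shows why so few pairs passed. The sketch below assumes that failures arrive independently at a constant average rate (a Poisson process). That is a modeling assumption made here for illustration, not something that was measured.

```python
import math

def expected_errors(test_hours: float, mtbf_hours: float) -> float:
    """Average number of errors in a test of the given length."""
    return test_hours / mtbf_hours

def prob_exactly(k: int, test_hours: float, mtbf_hours: float) -> float:
    """Poisson probability of seeing exactly k errors during the test."""
    mean = expected_errors(test_hours, mtbf_hours)
    return math.exp(-mean) * mean ** k / math.factorial(k)

def prob_error_free(test_hours: float, mtbf_hours: float) -> float:
    """Probability that the whole test runs with no errors at all."""
    return prob_exactly(0, test_hours, mtbf_hours)

if __name__ == "__main__":
    print("expected errors in 24 h:", expected_errors(24.0, 8.0))
    print("chance of a clean 24 h run: %.1f%%" % (100 * prob_error_free(24.0, 8.0)))
    for k in range(6):
        print(k, "errors: %.1f%%" % (100 * prob_exactly(k, 24.0, 8.0)))
```

Under that assumption, a pair with an eight-hour MTBF gets through 24 hours error-free only about one time in twenty (e to the power of minus 3), while counts anywhere from one to five errors are all quite likely.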

That being the case, I was surprised to not hear about this problem until it had been known for a few weeks. By the time I was brought in and put in charge of finding a solution, there were already 50 or so production systems, connected in pairs, running 24 hours a day. A group of test technicians had been trying to figure out the source of the errors.

The problem was considered to be quite serious, because we were unable to ship any of these expensive units. In fact, my first recommendation was that we relax the test criterion, and ship units which had only one or two errors in a 24-hour period. We would still keep a smaller number of units with a higher error rate, and use those for debugging the problem. I didn't want to ignore it - I did want to figure out what was going on. But the fact is, three or four errors in a 24-hour period is not at all important out in the real world.

Since one of the jobs of the multiplexer, as I noted above, is to correct errors, three errors or so in a 24-hour period represents an error rate that wouldn't even be noticed by our end users. Out in the real world, errors due to noise and other network transmission problems would exceed that error rate by easily one or two orders of magnitude. In other words, calling this a problem was an artifact of our test, in a way, and not a meaningful problem in actual operation. But the Test Department was adamant that since the system really ought to be able to run with no errors at all over a network link comprising one foot of cable, these systems should not be shipped.

The 6030 comprised a number of circuit boards, including a set of identical terminal port cards for each of the low speed inputs, a network port card that sent and received the signal on the network link, multiple processor cards each with its own Motorola 6800 microprocessor chip, and a rather complex bit-slice microcontroller card that allocated work among the processor cards (the system capacity could be expanded just by plugging in more processor cards).

The obvious first step to troubleshooting the problem was to try to determine which of these cards was causing the problem. To this end, the test technicians had been monitoring each pair of multiplexers under test. When a problem occurred, they would typically immediately swap one of the cards with an adjacent system which had not failed (yet). They would then continue monitoring to see in which system the next failure occurred. If the failure moved to the other system, they'd assume that the fault was with the card that had been moved.

As anyone who's ever done troubleshooting can attest, highly intermittent problems are the hardest type of problem to track down. Because they occur so seldom, it's hard to catch them in the act. As I looked over the notes on the work that had been done to that point, it became clear to me that not enough care had been taken in designing the test protocols, controlling the test conditions, and, in particular, allowing enough time for each test. This last shortcoming mattered because of the long Mean Time Between Failure that was being observed.

Quite a few tests had been run in which the total test length was only a few hours. But with a mean time between failure of around eight hours, these results were not particularly significant. In fact, they were confusing - many of the "patterns" that were being seen were just the result of chance. Indeed, pretty much each technician had his own theory as to the source of the problem.
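A small simulation illustrates how a short card-swap test can produce "patterns" out of pure chance. The model is hypothetical: two test stations that each fail at random with the same eight-hour MTBF, where the swapped card makes no difference at all. The three-hour observation window and the names `swap_trial` and `tally` are likewise assumptions for illustration.

```python
import random
from collections import Counter

OUTCOMES = ("no failure seen", "failure stayed", "failure moved")

def swap_trial(rng: random.Random, window_hours: float,
               mtbf_hours: float = 8.0) -> str:
    """One swap experiment in which the swapped card is NOT the culprit.

    Both stations fail at random with the same MTBF, so whichever one
    fails first after the swap does so purely by chance.
    """
    t_original = rng.expovariate(1.0 / mtbf_hours)  # station that failed before
    t_neighbor = rng.expovariate(1.0 / mtbf_hours)  # station that received the card
    if min(t_original, t_neighbor) > window_hours:
        return "no failure seen"
    return "failure moved" if t_neighbor < t_original else "failure stayed"

def tally(window_hours: float, trials: int = 20000, seed: int = 1) -> dict:
    """Fraction of trials ending in each outcome."""
    rng = random.Random(seed)
    counts = Counter(swap_trial(rng, window_hours) for _ in range(trials))
    return {name: counts[name] / trials for name in OUTCOMES}

if __name__ == "__main__":
    for name, fraction in tally(window_hours=3.0).items():
        print("%-16s %.1f%%" % (name, 100 * fraction))
```

In this model, nearly half of the three-hour trials show no failure at all, and the rest split evenly between "moved" and "stayed". A technician watching a single such trial could easily come away convinced of a cause that isn't there.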

Furthermore, I found many cases in which the systems had been run with their covers on, but when circuit cards were swapped, one of the covers had been left off on one of the systems. This meant that in the first part of the test, that system was running internally warm, but in the second part of the test it was running cooler. Since many electronic problems are affected by temperature, this error invalidated that test. It's important to change only one variable at a time. But it wasn't always clear when a cover had been left off, because whether or not the cover had been put on had not been recorded in any of the paperwork.

Disregarding most of the work that had been done so far, I created checklists of test conditions. I put a spiral-bound notebook next to each pair of systems, and required the techs to sign off on the state of the system whenever anything was changed. Perhaps most importantly, all tests were run for at least 24 hours. I would no longer allow cards to be swapped after a shorter period of time. Great care was taken to change only one variable at a time, and to monitor seemingly insignificant variables, such as whether the cover was on or not.

Once we started running longer test runs of a uniform length, it became clear that some of the systems had a shorter Mean Time Between Failure than others, although it took several days of testing to be statistically confident of this. Other systems failed less frequently, so we concentrated our work on those systems with the shortest Mean Time Between Failure, allowing us to get more testing done in less time. But the MTBF was still around six hours or so in the most failure-prone systems, so we still needed to do long test runs.
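A back-of-the-envelope sketch suggests why telling a six-hour MTBF from an eight-hour one takes days. It relies on a standard rule of thumb for random (Poisson) counts: after observing n failures, the estimate is uncertain by roughly 1/sqrt(n) in relative terms. The 25% target below is an assumption, chosen because it is about the gap between six and eight hours.

```python
import math

def mtbf_estimate(total_hours: float, failures: int) -> float:
    """Point estimate of MTBF from a test log."""
    return total_hours / failures

def relative_uncertainty(failures: int) -> float:
    """Approximate relative standard error of a Poisson count of n failures."""
    return 1.0 / math.sqrt(failures)

def hours_needed(mtbf_hours: float, target_relative_uncertainty: float) -> float:
    """Rough test time needed to pin the MTBF down to the target uncertainty."""
    failures = math.ceil(1.0 / target_relative_uncertainty ** 2)
    return failures * mtbf_hours

if __name__ == "__main__":
    hours = hours_needed(mtbf_hours=6.0, target_relative_uncertainty=0.25)
    print("about", hours, "hours, or", hours / 24, "days of testing")
```

Under these assumptions, resolving a 25% difference takes on the order of 16 observed failures, which at one failure every six hours is about four days of continuous running.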

Nevertheless, given enough time and care, we were eventually able to isolate the circuit board that was causing the problem. And it was, indeed, the network port card. In a way, this was good news, because this was not a card with an enormous amount of circuitry on it. The design had actually been done by the Vice President - Engineering, Dave Forney, who had requested the task, I think, as a pleasant respite from the managerial work with which he had become mostly engaged. He wanted to get back to doing a little bit of design himself. Actually, I thought it to be rather a masterpiece of design, making effective use of MSI ("Medium Scale Integration") circuitry to keep the chip count down, and enable us to keep the card compact.

If the problem had been with one of the larger cards, such as the bit-slice microcontroller card, we might have still had a good deal more work ahead of us. But knowing that the problem was on the network port card, I was able to simply give the circuitry an intensive review. I also knew that the problem was highly intermittent, which cued me to look for certain types of timing problems which often produce the sort of highly infrequent errors we were seeing.

And indeed, that's where the problem proved to lie. Although the logic on the network port card was what is called "clocked" logic, there was an external signal that arrived asynchronously. It was clocked into two separate flip-flops, and it turned out to be important that either both of them were set on a given cycle, or neither was set. If the asynchronous signal transitioned just as the clock occurred, it turned out to be possible, rarely, for one of these flip-flops to be set, but not the other. In half those cases (it mattered which one was set), an error would occur requiring retransmission on the network port channel. The addition to the circuitry of a single "synchronizing" flip-flop solved the problem.
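The failure mechanism can be illustrated with a simplified behavioral model in Python. This is a sketch under loud assumptions, not the actual circuit: the two flip-flops are given slightly different effective sampling instants (`SKEW_A` and `SKEW_B`, invented values), the asynchronous signal changes at a uniformly random moment in each clock period, and the synchronizing flip-flop is treated as ideal, ignoring any metastability of its own.

```python
import random

CLOCK_PERIOD = 1.0
SKEW_A = 0.000   # hypothetical effective sampling instant of flip-flop A
SKEW_B = 0.002   # hypothetical: flip-flop B decides very slightly later

def sample(transition_time: float, sampling_instant: float) -> int:
    """A flip-flop captures the new level only if the input changed before it sampled."""
    return 1 if transition_time < sampling_instant else 0

def cycle_without_synchronizer(transition_time: float):
    """Both flip-flops look at the raw asynchronous signal independently."""
    return sample(transition_time, SKEW_A), sample(transition_time, SKEW_B)

def cycle_with_synchronizer(transition_time: float):
    """One flip-flop makes the only decision about the asynchronous signal.

    Its output then holds steady for a full clock period, so A and B
    both capture the same value on the following clock edge.
    """
    synced = sample(transition_time, 0.0)
    return synced, synced

def count_disagreements(cycle, trials: int = 100_000, seed: int = 42) -> int:
    """Count clock cycles in which A and B end up in different states."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        t = rng.uniform(-CLOCK_PERIOD / 2, CLOCK_PERIOD / 2)
        a, b = cycle(t)
        bad += a != b
    return bad

if __name__ == "__main__":
    print("without synchronizer:", count_disagreements(cycle_without_synchronizer))
    print("with synchronizer:   ", count_disagreements(cycle_with_synchronizer))
```

In this model the two flip-flops disagree only when the signal changes inside the tiny window between their sampling instants, which is why the fault is so rare. Routing the signal through a single synchronizing flip-flop removes that window, because only one device ever has to decide what the asynchronous signal was.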

Although Dave Forney was a bit put out to find that the problem was due to his design, the circuit had undergone a design review, as usual, and we had all missed it.

I'm sure I'll have still more troubleshooting stories in future blog entries.

#0190 *CAREER *TECHNOLOGY

Footnotes

Note 1: In addition to the entry specifically called Troubleshooting, some of the troubleshooting events I've discussed in other blog entries are:

Fixing a problem in the initial Kronos timeclock: This story was told in the entries Coming to Kronos, and in History of Kronos, part 1.

Fixing another early timeclock problem: See Personalities.

Debugging a communications problem: Occurred at The house of turkey death.

And of course, my troubleshooting prowess was probably why I was made Engineering representative to the Kronos "Y2K Steering Committee", to proactively eliminate problems that might be caused by the advent of the Year 2000. This is described in the entries Y2K, Y2K (bis), Y2K speech, part 1, and Y2K speech, part 2.

Note 2: For you techies: the 6030 design was based on a tightly-coupled multiple-microprocessor architecture, with a high-speed bit-slice master microcontroller. I did the detailed hardware design, wrote firmware for the microcontroller, debugged the product, and released it to production. The product included multiple 6800 microprocessors. It was rather more difficult to develop than it would be now, because at the time, logic analyzers and similar digital development tools were not available.

A paper on the 6030's design can be found in "The Architecture of a Multiple Microprocessor Communications Processor", J. E. Vandermey and L. J. Krakauer, Proceedings of the 1976 International Zürich Seminar on Digital Communications, pg. C6 (March 1976) (IEEE Catalogue # 76CH1054-6ZURICH). My trip to Zürich to deliver this paper is described in my blog entry Zürich. Although I was involved in the design of the product described in the paper, Jim was very much its principal architect.

You can also learn more about this product in A History of Computer Communications: 1968-1988, by James L. Pelkey. In particular, some notes on the 6030 statistical multiplexer can be found on this page.