One of my strengths as an engineer is that I'm an excellent troubleshooter. This ability was very important to me during my career.
Quite a few of my blog entries describe episodes of troubleshooting. [Note 1] I'm sure there will be more such entries to come. In this entry, I'll describe a couple of troubleshooting incidents, from before I became an engineer, that were learning experiences.
The first of these occurred at Camp Robinson Crusoe, as did so many formative episodes of my life. It was probably when I was in one of the two "Primitive" units, but possibly when I was a counselor in training ("Trainee" or "Forester").
The camp owned several army surplus "field telephones", like the one shown to the left (that's not one of the actual phones). These were used to communicate with and among the somewhat far-flung "Primitive" units of the camp. They were battery powered, except that the ringers were driven by turning a hand crank on the side.
The phones were in what was commonly called a "party line" configuration, all connected on the same line, so the number of rings told which of the phones you were trying to reach. They were physically connected by long stretches of insulated twisted-pair steel wire running through the woods from tree to tree. Upon our arrival each summer, damage to the wires that had occurred over the winter caused the system to malfunction. As one of the more "techie" campers in the group, I was given the task of repairing these wires to get the system working again.
Step one was simply to walk along the wires, looking for breaks, generally caused by falling tree limbs. Inspecting the wires was not always easy, because the woods they traversed were quite dense, so it could be difficult to walk alongside them. Nevertheless, if a break was spotted, I had to bushwhack my way up to it, splice the wires together, and tack them up again high in the tree.
But that didn't do the job. Measurement with a simple ohmmeter generally showed that the two wires were short-circuited. That is, somewhere along their path (and usually in multiple locations), the insulation had been worn off, and the steel conductors of the two wires were touching each other. Since the wires were twisted together, and were elevated off the ground, this was hard to spot by a visual inspection. Twisted together, the insulation on the wires was touching all along their length; but where was the insulation abraded in between the two conductors?
So it was at Camp Robinson Crusoe that I first learned the technique that I eventually came to know as a "binary search". The trick was to cut the wires roughly in the center of their length, and make a measurement on each side to check for a short circuit. If one side did not show a short, then I was good to go on that side. If a side of the cut was short-circuited, then I had to go to approximately its center, and repeat the operation. By this means, I could quickly home in on the locations where the conductors were touching. Once I had isolated the problem to short lengths of wire, I could complete the operation with a close visual inspection.
It's called a "binary search" because it divides the wire's length first in half, then in quarters, then in eighths, then in sixteenths, and so on. This sort of division by two is the essence of the binary number system. Only lengths of the wire that proved by measurement to have short circuits needed to be further subdivided, so that long sections with no problems were quickly eliminated from the search.
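For readers who program, the wire hunt can be sketched in a few lines of Python. This is only a simplified model, not a record of what happened in the woods: it assumes the wire is divided into numbered unit lengths, and that one ohmmeter reading on a stretch reveals whether a short lies anywhere within it. The names `find_shorted_spans`, `ohmmeter`, and `faults`, and the 64-unit wire, are made up for the illustration.

```python
from typing import Callable, List, Tuple


def find_shorted_spans(
    lo: int,
    hi: int,
    is_shorted: Callable[[int, int], bool],
    inspect_len: int = 1,
) -> List[Tuple[int, int]]:
    """Return the short stretches [lo, hi) that still measure as shorted."""
    if not is_shorted(lo, hi):
        return []              # this stretch is clean; never subdivide it
    if hi - lo <= inspect_len:
        return [(lo, hi)]      # short enough for a close visual inspection
    mid = (lo + hi) // 2       # "cut" roughly at the center
    return (find_shorted_spans(lo, mid, is_shorted, inspect_len)
            + find_shorted_spans(mid, hi, is_shorted, inspect_len))


# Hypothetical wire: 64 unit lengths, with shorts at positions 5 and 41.
faults = {5, 41}
measurements = 0


def ohmmeter(lo: int, hi: int) -> bool:
    """Simulated reading: True if any short lies in the stretch [lo, hi)."""
    global measurements
    measurements += 1
    return any(lo <= f < hi for f in faults)


spans = find_shorted_spans(0, 64, ohmmeter)
print(spans)          # [(5, 6), (41, 42)]
print(measurements)   # fewer readings than checking all 64 unit lengths
```

Because there can be several shorts, both halves have to be measured at every cut, which makes this a recursive bisection rather than the textbook single-target binary search. The saving comes from the clean halves, which are dropped after one reading.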
In a separate episode decades later, at the company Codex, I was working on the prototype of a statistical multiplexer. You don't need to know what that is, or understand any of the further technical terminology in this and the following paragraphs, to follow my story. Suffice it to say that this was an extremely sophisticated piece of equipment, which contained multiple Motorola 6800 microprocessors and a high-speed bit-slice controller. Since it was a prototype, there was only one such unit in existence (I had just designed it). And then one morning it failed.
I brought over an oscilloscope and a logic analyzer, and the thick listings of the program that controlled the microprocessors and the separate program that controlled the bit-slice controller. I couldn't depend on anyone else to figure out what had gone wrong, since I was the only person on the face of the earth at that moment who actually understood how all of it worked, both hardware and software. I traced the execution of the programs and debugged them for a while, making very little progress, and then took a break for lunch.
When I returned, one of the technicians in the lab told me that he had fixed the unit. I was stunned, since I knew that he didn't understand its operation at all. I asked him how he'd done it.
"Well," he said, "you told me it had been working before it failed abruptly. That meant that the problem was not a design problem - it had to be a failure of one of the integrated circuits. So I removed half the integrated circuits, and replaced them with new ones. When that didn't fix it, I knew the problem was in the other half. I removed half of them, and replaced them with new ones. That fixed it, so I actually could have stopped there. But out of curiosity, and to avoid throwing out a whole bunch of good IC chips, I kept halving until I isolated the single integrated circuit that was the source of the problem."
He handed me a single (defective) integrated circuit chip, like the one shown to the right, the source of the problem.
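The technician's method translates almost directly into code. The sketch below is an illustration under his own stated assumption, that exactly one chip had failed. The socket labels, the 40-chip board, and the function names `isolate_bad_chip` and `works_with_new` are hypothetical; `works_with_new` stands in for swapping fresh chips into the given sockets and powering the unit up.

```python
from typing import Callable, List, Sequence


def isolate_bad_chip(
    sockets: Sequence[str],
    works_with_new: Callable[[List[str]], bool],
) -> str:
    """Find a single failed chip by swapping half of the suspects at a time."""
    suspects = list(sockets)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if works_with_new(half):
            suspects = half                   # fault was among the swapped chips
        else:
            suspects = suspects[len(half):]   # fault is in the untouched half
    return suspects[0]


# Hypothetical board with 40 socketed chips, one of which has failed.
sockets = [f"U{n}" for n in range(1, 41)]
bad = "U27"
trials = 0


def works_with_new(replaced: List[str]) -> bool:
    """Simulated power-up: the unit works only if the bad chip was replaced."""
    global trials
    trials += 1
    return bad in replaced


found = isolate_bad_chip(sockets, works_with_new)
print(found, trials)  # U27 5
```

Notice that nothing in the procedure requires knowing what any chip does, which is exactly why it worked for someone who didn't understand the design. If two chips had failed at once, this single-fault version could be misled, and something like the branching search used on the wires would be needed.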
I've just described problems with some very low-tech wires, and many years later, a problem with the most sophisticated piece of electronic equipment I ever worked on. Both problems were solved by binary search. The latter was not solved by me, though, so maybe I didn't adequately learn my lesson the first time.
I probably was started on the path to being a good troubleshooter by my father, Daniel Krakauer, who was an excellent troubleshooter himself. He was also very patient when he worked with me when I was a child. He allowed me to do things myself that he probably could have done more quickly without me.
I also got lessons in troubleshooting during a summer job he provided. The headquarters of his company, Kay Manufacturing Corporation, was at the foot of Warren Street, in Brooklyn, New York. Although the company's offices were still in the building that summer, the large six-story factory was vacant. However, a fire inspection had shown that many of the heavy steel fire doors in the building were not functioning properly. Even in a vacant building, it's important for these fire doors to work. In fact, it's particularly important in a vacant building.
The picture to the left shows a fire door in an office building, but the ones in the factory were more or less the same type. They slid along a track in order to close a large opening in the walls connecting various parts of the building.
In fact, the doors were quite large and heavy, and the tracks were slanted to cause the doors to slide closed under their own weight. They were normally held open by heavy counterweights, connected to the door by ropes running over pulleys. By counterbalancing the weight of the door, these allowed it to be opened and closed by hand. But there was a metal link in the counterweight rope, made out of a low melting point alloy. In the event of a fire, the heat would melt the link, allowing the door to roll closed along its track.
Except when the fire department had tested the doors by removing the metal link (simulating a fire), most of them didn't close. My job was to fix them so as to make them operate properly. The first thing I discovered was that most of them couldn't be moved at all, even with the counterweight still connected. They were permanently stuck in the open position. This was caused by a number of different things. The wheels from which they hung, and which rolled along the track, were often rusted. The tracks themselves were rusted, and covered with dirt. The whole thing needed to be thoroughly cleaned.
When the door closed, the counterweight lifted. Its path was normally surrounded by a wooden structure, to stop it from being blocked if someone leaned something against it. But some of these wooden cases had broken, with pieces of wood sticking into the path of the counterweight. In other places, assorted debris had fallen off the wall or the ceiling and had blocked the counterweight's path.
One of the things I learned from this experience is that when there's a problem, it doesn't always have a single cause. I might lubricate the roller wheels, then find the door still didn't work. Then I would clean the track, but it still wouldn't work. Then I would open the casing and unblock the counterweight, but it still wouldn't work. And so on.
This lesson applies in many fields. Having located a problem causing a particular symptom, a doctor who is a good diagnostician won't necessarily assume that the problem found is the sole cause of the difficulty. Problems frequently have multiple causes.
Allow me to add one more entirely unrelated note. I have to admit that while working that summer, I was fascinated by the assorted pinups left behind by the workers. The factory walls were covered with calendars and advertising posters largely depicting topless women, although the images would be considered rather tame by today's jaded internet standards. [Note 2]
In those days it was quite commonplace for companies to distribute advertising material which now would be thought of as unacceptably sexist. There certainly were women working in the company in those days, including on the production floor, although they were probably a minority. But I think they were expected to put up with these sorts of materials without complaining. As for me, working alone fixing the doors on the empty production floor, I confess I didn't complain about them either.
I'll have a few more troubleshooting stories in future blog entries. I've got a little list.
Note 1: Some of the troubleshooting events I've discussed in other blog entries are:
Fixing another early timeclock problem: See Personalities.
Debugging a communications problem: Occurred at The house of turkey death.
And of course, my troubleshooting prowess was probably why I was made Engineering representative to the Kronos "Y2K Steering Committee", to proactively eliminate problems that might be caused by the advent of the Year 2000. This is described in the entries Y2K, Y2K (bis), Y2K speech, part 1, and Y2K speech, part 2.