Tuesday, April 14, 2009

The TI Explorer was a clone of the LMI Lambda. At some point when LMI needed financing, TI and LMI did a technology swap where TI ended up with the Lambda architecture and LMI ended up with some cash and the NuBus. The Explorer was a close enough clone of the Lambda that it could run the same microcode.

Ever run into one of those bugs where you begin to doubt your basic assumptions? I hit one of those on the Explorer. The symptom was simply this: about 15 minutes after booting, the Explorer would crash. The proximate cause of the crash was a bad entry in the page tables. The Lisp machine really had an operating system hiding within it, and there's a virtual memory system with demand paging. There was a table that kept track of what the physical memory was being used for. If physical memory was needed and wasn't available, an existing physical page would be written back to the hard disk. This was when the bug was detected. A check on the page status showed it to be in an undefined state, so the machine halted.

In theory, there was no way for a page to transition to an undefined state, so something really bad must have happened. Fortunately, it was always the same page, so it was clearly a programming logic bug. It was also fortunate that it was in the page fault handler. The Lambda and Explorer microcode is completely single threaded, and interrupts are polled for in known locations. Since the page fault handler would play with the memory mapping hardware, it only polled for interrupts at the very few places that the hardware was in a known, good state. I could pretty much ignore the interrupts for debugging this one.

I put in a couple of quick checks to halt the machine when it wrote the page table entry and rebooted the machine. The machine ran for a while and then halted at my debug break. I single stepped the microcode (from the other processor), and watched it write the page table entry. It read the bits, modified the correct subfield and wrote the bits back correctly. Hmmm. Must be a different write. I let the machine run, but it didn't touch that entry again. The machine halted at the sanity check.

This was weird. The machine correctly modified only the field it was supposed to, yet it crashed. Perhaps the word was smashed after the modification. I added some code to check the value of the word at crash time. When I ran it again, I found that the word contained exactly what was written at my breakpoint.

I had to re-evaluate my assumption that the field being modified had something to do with the crash. Since that page had not previously been in use at all, I assumed it was zeroed out. I added a sanity check to ensure this. My sanity check failed. The pristine page table entry was not zeroed out. This made sense. The code had taken an entry that was not correctly initialized, correctly filled in the right field, but the other bits were wrong and that caused the crash. Clearly there was a bug in the initialization code.

This was sounding like a fencepost error. A fencepost error when initializing a table might leave one entry at the end uninitialized, or it might overwrite an entry in the next table. I didn't see anything obvious in the page table initialization code, though. I put a debug halt in just after the page table was initialized. When the machine halted, I looked at the table. It was fine. Everything was zeroed out correctly, and the next table in memory had no problems, either.

Obviously someone had done something with the entry between the time it was initialized and the time it was first used. I put in some more debug code to halt whenever that particular page table entry was accessed. This was possible because the page table manipulation code was in a single subroutine. I restarted the machine. It first halted when it initialized the page. As expected. It halted a second time when it read the page entry. That wasn't expected. Whoever modified the page table entry didn't go through the normal channels. That would cause a bug, but it was wasn't going to be easy to find.

I wrote a debugging subroutine that would simply validate the contents of that one special address and put a call to that subroutine in the middle of the boot sequence. The machine halted at my new test. I put it one quarter of the way into the boot sequence. It didn't halt. I put it three eights of the way through. Over a period of a couple of hours, I had determined where in the boot sequence the memory location was being overwritten. It was some subroutine that really had nothing to do with the memory system. In fact, it barely used the memory system. I pored over the code, but there seemed to be nothing at all out of the ordinary.

I was baffled. I was beginning to wonder if the memory maps had been corrupted. It is almost inconceivable that the machine would be able to run for more than a few milliseconds with corrupt memory maps, but it was the only thing I could think of that would allow you to make an undetected stray memory write. I wrote some more code to validate the memory maps. They were fine.

Ok, if that memory location started as zero, then later became non-zero, then a memory write cycle must have occurred. There are no two ways about it. I wrote some more microcode that checked the contents of that location and verified that it contained the correct value. I modified the microassembler to insert a microcode subroutine call to my validation code after every memory cycle. I rebooted the machine. It was incredibly slow. Every single memory reference involved an out-of-line microcode call to validate the page map. Nonetheless, the validation code finally caused the machine to halt.

It was in a weird location. Well, weird only in the fact that absolutely nothing interesting was going on. By using the backtrace buffer and looking at the microcode I could see which write instruction must have failed. But there was no reason it should have failed because the nearby writes to the same page worked just fine. I looked at the contents of the virtual address register and the memory data register. Surprisingly, they were the expected values. In fact, the memory data register did not contain the bogus value that was written to the page table. This is, in a word, impossible.

So what were the implicit assumptions I was making? One of them was wrong. Bad memory board? The memory test passed. I swapped the memory cards on the machine, but it still crashed in the same place. It was too deterministic and predictable to be flaky hardware. But the Lisp processor wasn't writing to that location.

On the other hand, the Lisp processor wasn't the only device on the bus. The corrupted memory location contained a byte full of 1s. I asked our hardware people: Is there any reason a device would write a byte full of 1s to another card on the bus?

As a matter of fact, there was. The NuBus doesn't have separate interrupt or I/O channels. It signals an interrupt by doing a memory cycle and plopping a byte full of 1s into another card's address space. For some reason, a device in the Explorer was trying to post an interrupt and ended up dropping a turd right where the Explorer put its physical page tables. This happened fairly early on, but after the page table were initialized. Then, some 15 minutes later, when the Explorer finally got around to using that page, it bombed on the corrupted entry.

I found the resident Explorer experts and mentioned that something was trying to post an interrupt. They looked at the initialization sequence and discovered that the code to initialize one of the devices had a bug and didn't correctly turn off the interrupt. The device would post an interrupt every few seconds. It was a couple of lines to fix it, but it took me several days of intense debugging to find it.

2 comments:

Unknown said...

very interesting read, as always, joe.
Thanks!

TJIC said...

Fascinating.

It's been a long time since I worked that close to the metal (writing drivers for automated test equipment at Teradyne), and I don't really miss in, but it is nice to hear a tale from the trenches!