Thursday, January 12, 2017

Whose bug is it?

My favorite quote from chapter 1, page 1 of http://dl.acm.org/citation.cfm?id=2742705 includes "'If you can fix a hardware bug in firmware, it’s not a bug but a documentation issue.'  —An anonymous hardware manager." I still recall being aghast upon hearing that utterance early in my career, but over time I grew to understand the wisdom. In fact, this inaugural chapter from 2015 book further elaborates on work-around's of hardware concerns that can be implemented in firmware.

This same sentiment was echoed in the position paper https://github.com/vincentjzimmer/Documents/blob/master/Neither-Seen-Nor-Heard.pdf and presentation https://github.com/vincentjzimmer/Documents/blob/master/Neither_Seen_Nor_Heard_Presentation-HLDVT-2010.pdf for the 2010 IEEE International High-Level Design Validation and Test Workshop. Specifically, the firmware can be modified at the 23rd hour to fix a work-around in hardware, leading to a potentially long list of 'firmware changes' during a product's postmortem. This sentiment is expressed in 'Incentives to fix issues in firmware cause root cause to be commonly incorrectly assigned to firmware' of the presentation and the following section of the paper:

      "There could be confusion between the root cause and the fix
      for issues. In particular, there is great pressure to resolve
      hardware issues in firmware due to the order-of-magnitudes
      difference in the cost of resolution. A “spin” of a chip may
      take many weeks and cost millions of dollars whereas a
      firmware fix may cost a few thousand dollars and a day or two
      of total work. In fact, many modern chips are designed so the
      firmware can configure the chips to work around issues in the
      field rather than having the hardware recalled. The public is
      simply told that there is a firmware issue when, in fact, the
      firmware resolved what was a hardware issue. "

I also recall being given a harried call on a Friday night about a the number of 'firmware bugs' on a product. I replied to the caller with some of the sentiments above, including the caveat 'I think that we can get the firmware update out by Monday, but I don't believe we can get a new stepping of the SOC by then.' Regrettably programs don't roll up the reason for the firmware changes and people are just shocked by the cardinality.

And often these bugs are not easy to find Bohrbugs, but Heisenbugs or Hindenbugs where there is a subtle interaction between host firmware and opaque hardware state machines. A week of investigation may yield a solution wherein one line of code changing one bit in a register access yields a solution. A few months ago a firmware manager at a conference related to me that the promotion process in his company entailed upper management reviewing the Git commits of the engineers. The firmware manager had to defend the smaller number of code changes versus the OS kernel guys as each delivering similar business value. But I still recall the final quote from the manager when he smiled and said 'The coolest part is that they are reviewing code changes, right?'



Sometimes these work-around's mitigate errata in the hardware (board, silicon, etc), and sometimes the changes are for cost savings. An example of the latter I recall includes a board design where the integrated Super I/O could be decoded as I/O addresses 0x2E/0x2F or 0x4E/0x4F by application of a pull-down or pull-up resistor, respectively. The hardware design engineer omitted the resistor in order to save costs on the Bill of Materials (BOM), so the boot firmware had to probe for what port value being decoded on each machine restart. And this was a cost savings of a single surface mount resistor.

The above discourse isn't meant to be an apologist view about firmware bugs, though. In general there are many bugs based upon programming flaws, not just hardware work-around's. If one is interested in the latter, take a look at
https://github.com/tianocore/tianocore.github.io/wiki/Reporting-Issues. But hopefully this posting will provide an alternate view into firmware and firmware changes.


No comments: