How I Learned to Stop Worrying and Love LEGO Robot Programming

Although today many consider Artificial Intelligence the greatest threat to humanity, when I was FLL age, nuclear war was the greatest threat to humanity. In retrospect, the greatest threat wasn't an intentional war between the US and USSR, but accidental or rogue use of nuclear weapons. For example, Wikipedia lists a large number of US military nuclear accidents from the 1940s onward.

This list is probably an underestimate, given that it includes only declassified mishaps, largely from the West. Scientists and engineers play a critical role in mitigating the risks posed by the technologies we have created. The same risk-assessment thinking about quantifying, minimizing and dealing with failure in mission-critical systems is what we need to take our FLL robot to the next level.

One notable military nuclear accident, the 1961 Goldsboro incident, occurred over North Carolina when a B-52 bomber carrying two Mark 39 nuclear bombs broke up in mid-air and dropped them near Goldsboro, NC. Each Mark 39 had a yield of 3.8 megatons, about 250 times as powerful as the 16-kiloton Little Boy detonated over Hiroshima, Japan in 1945.

In 2013, declassified documents suggested that three of the four arming mechanisms on one of the bombs had activated, and only a single switch prevented detonation. If detonated, that one Mark 39 would've produced an explosion greater than all the conventional weapons used in all wars in history plus the two atomic bombs dropped over Japan. Thank goodness for the engineers who designed that last safety switch!

( The deployed parachute was one of the three systems activated in the Mark 39 )

So it is no wonder that engineers have to design around failure inherent in physical systems like atomic bombs. Permissive Action Links were developed to prevent the unauthorized or accidental arming and detonation of nuclear weapons. These fail-safe systems incorporate components and/or procedures to safeguard the devices from misuse. Interlocks are devices incorporated to prevent a faulty system from harming itself. Redundancy is another engineering design concept, one that allows a mission-critical system to continue operating in the face of component failure.
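In software terms, an interlock is just a guard that refuses to act when a precondition fails. Here is a minimal sketch in Python, with a hypothetical arm motor and seating check (neither is from the original missions; they're made up for illustration):

```python
# A software interlock: refuse to act unless the safety condition holds.
def raise_arm(arm_motor, attachment_seated):
    """Lift the arm only if the attachment is confirmed seated."""
    if not attachment_seated():
        # Interlock tripped: do nothing rather than damage the mechanism.
        print("Interlock: attachment not seated; arm stays down.")
        return False
    arm_motor.run_angle(200, 90)  # hypothetical motor API: deg/s, degrees
    return True
```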

In addition, other critical systems like medical life support, banking transactions, and aerospace and civil engineering structures anticipate the possibility of multiple failures. These systems are designed to deal with potential failure through fault tolerance strategies and over-engineering. Over-engineering creates a robust system designed to perform well above expectation, drastically reducing the chance of failure. Fault tolerance allows a system to successfully handle failures that arise, ideally completing the task, failing over safely to another working system, or shutting down safely in a manner called elegant or graceful degradation.
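To make the fail-over idea concrete (a sketch, not anything from the original post): try the accurate strategy first, fall back to a cruder one, and shut down safely when both fail. The strategies and `stop_motors` are hypothetical placeholders:

```python
def stop_motors():
    """Hypothetical safe stop; on a real robot this would halt all motors."""
    print("Motors stopped.")

def navigate(primary, backup):
    """Try each strategy in turn; degrade gracefully if all of them fail."""
    for strategy in (primary, backup):
        try:
            strategy()          # attempt the maneuver
            return "completed"
        except RuntimeError:    # strategy reported a fault
            continue            # fail over to the next strategy
    stop_motors()               # graceful degradation: stop in a safe state
    return "aborted safely"
```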

( Good example of over-engineered safety devices for a runaway elevator )

( Multiple points of fail-over for Oracle Network Directory Service )

Here are two major engineering considerations for building a fault-tolerant system, taken from the Wikipedia entry on fault tolerance:

Criteria

Providing fault-tolerant design for every component is normally not an option. Associated redundancy brings a number of penalties: increase in weight, size, power consumption, cost, as well as time to design, verify, and test. Therefore, a number of choices have to be examined to determine which components should be fault tolerant:

  • How critical is the component? In a car, the radio is not critical, so this component has less need for fault tolerance.
  • How likely is the component to fail? Some components, like the drive shaft in a car, are not likely to fail, so no fault tolerance is needed.
  • How expensive is it to make the component fault tolerant? Requiring a redundant car engine, for example, would likely be too expensive, both economically and in terms of weight and space, to be considered.

An example of a component that passes all the tests is a car’s occupant restraint system. While we do not normally think about it, the primary occupant restraint system is gravity. If the vehicle rolls over or undergoes severe g-forces, this primary method of occupant restraint may fail. Restraining the occupants during such an accident is absolutely critical to safety, so we pass the first test. Accidents causing occupant ejection were quite common before seat belts, so we pass the second test. The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so we pass the third test. Therefore, adding seat belts to all vehicles is an excellent idea. Other “supplemental restraint systems”, such as airbags, are more expensive and so pass that test by a smaller margin.

Requirements

The basic characteristics of fault tolerance require:

  1. No single point of failure – If a system experiences a failure, it must continue to operate without interruption during the repair process.
  2. Fault isolation to the failing component – When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying the fault or failing component. The National Institute of Standards and Technology (NIST) categorizes faults based on locality, cause, duration, and effect.
  3. Fault containment to prevent propagation of the failure – Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the “rogue transmitter” which can swamp legitimate communication in a system and cause overall system failure. Firewalls or other mechanisms that isolate a rogue transmitter or failing component to protect the system are required.
  4. Availability of reversion modes

In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would statistically provide 99.999% availability.
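As a quick sanity check on the five-nines figure (not part of the Wikipedia excerpt), 99.999% availability leaves only about five minutes of unplanned downtime per year:

```python
# Downtime allowed by "five nines" availability.
availability = 0.99999
minutes_per_year = 365.25 * 24 * 60                      # ~525,960 minutes
downtime = (1 - availability) * minutes_per_year
print("Allowed downtime: %.2f minutes/year" % downtime)  # ~5.26
```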

Fault-tolerant systems are typically based on the concept of redundancy.
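Scaled down to our robot, redundancy can be as simple as measuring the same quantity two independent ways and distrusting the answer when they disagree. A minimal sketch, with hypothetical reader functions standing in for real sensors:

```python
# Redundant measurement with a cross-check. `read_a` and `read_b` are
# hypothetical functions returning the same distance in millimeters.
def redundant_distance(read_a, read_b, tolerance_mm=20):
    """Return the averaged reading, or raise if the sensors disagree."""
    a, b = read_a(), read_b()
    if abs(a - b) <= tolerance_mm:
        return (a + b) / 2  # readings agree: average them
    raise RuntimeError("sensors disagree by %s mm" % abs(a - b))
```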



When we began the FLL season, we were all intrigued by the missions and the myriad of amazing mechanical attachments we could devise to solve them. What seemed straightforward and relatively boring was programming the robot to navigate the field: go straight 720 degrees, turn left 30 degrees, lift the arm while going straight another 1080 degrees, etc. It was as intellectually exciting as doing the hokey pokey.
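Roughly, that blind dead-reckoning sequence looks like the sketch below. (The original programs were likely built in LEGO's graphical environment; Pybricks MicroPython is an assumption here, and the ports, speeds and angles are made up.)

```python
# Blind dead reckoning, step by step, as described above.
from pybricks.ev3devices import Motor
from pybricks.parameters import Port, Stop

left = Motor(Port.B)
right = Motor(Port.C)
arm = Motor(Port.A)

# Go straight: spin both drive wheels 720 degrees.
left.run_angle(300, 720, then=Stop.HOLD, wait=False)
right.run_angle(300, 720, then=Stop.HOLD, wait=True)

# "Turn left 30 degrees": the wheel angle that produces a 30-degree heading
# change depends on wheel size and track width; +/-64 fits one guess.
left.run_angle(150, -64, wait=False)
right.run_angle(150, 64, wait=True)

# Lift the arm while going straight another 1080 degrees.
arm.run_angle(200, 90, wait=False)
left.run_angle(300, 1080, wait=False)
right.run_angle(300, 1080, wait=True)
```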

Only by our second tournament, at Districts, did we begin to realize that the real challenge of the robot was navigation and programming. The key was not to come up with the most clever mechanism to solve missions, but to program FlopBotZilla to navigate the field while incorporating safeguards against as many things that can go wrong as possible, much like the Mark 39.

Our Robot Challenge is far more complex than ensuring an oxygen pump or an electronic bank transfer works correctly. Both our motors and sensors are notoriously imperfect and subject to outright failure, which could easily leave us stranded on the mat or blindly wandering around tearing up missions. There may be multiple ways to monitor and measure our movements on the mat, but the nominally more accurate methods (e.g., the gyro sensor) generally have a higher error/failure rate.
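One concrete hedge against a flaky gyro is to cross-check it against a heading estimate derived from the wheel encoders, and degrade to the cruder estimate when they disagree. A rough sketch, again assuming Pybricks MicroPython with made-up robot geometry:

```python
# Cross-check the gyro against dead reckoning from the wheel encoders.
from pybricks.ev3devices import Motor, GyroSensor
from pybricks.parameters import Port

left, right = Motor(Port.B), Motor(Port.C)
gyro = GyroSensor(Port.S2)

WHEEL_MM, TRACK_MM = 56, 120  # wheel diameter and axle track (assumed)

def encoder_heading():
    """Estimate heading (degrees) from the difference in wheel rotations."""
    diff = right.angle() - left.angle()      # degrees of wheel rotation
    return diff * WHEEL_MM / (2 * TRACK_MM)  # tank-drive geometry

def trusted_heading(max_disagreement=15):
    """Prefer the gyro, but fall back to encoders if it looks faulty."""
    g, e = gyro.angle(), encoder_heading()
    if abs(g - e) <= max_disagreement:
        return g  # gyro agrees with dead reckoning: trust it
    return e      # gyro suspect: degrade to the cruder estimate
```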

In the end, we can trust that nothing works exactly as programmed, have to assume an inherently large degree of inaccuracy, and must program continuous monitoring of our systems while preparing a number of graceful fail-over strategies. This is quite an interesting intellectual exercise in anticipating, prioritizing and programming a lot of what-if scenarios. This is a good case where you can learn much more from (methodically avoiding) failure than from (seeking easy) success, and it applies to everything from engineering to financial modeling.
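The simplest what-if safeguard is a time budget: if a maneuver runs long, stop and let the caller choose a fail-over plan rather than wandering blindly. A minimal sketch (Pybricks `StopWatch` is the timer; the three callbacks are hypothetical):

```python
# A watchdog for what-if scenarios: give each maneuver a time budget and
# stop if it is exceeded, instead of wandering the mat tearing up missions.
from pybricks.tools import StopWatch, wait

def drive_until(done, start, stop, budget_ms=4000):
    """Run a maneuver, but abort if it overruns its time budget."""
    watch = StopWatch()
    start()                           # begin moving
    while not done():
        if watch.time() > budget_ms:  # the what-if case: we never arrived
            stop()
            return False              # let the caller pick a fail-over plan
        wait(10)                      # poll every 10 ms
    stop()
    return True
```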

( Einstein didn’t really make that comment about pie )



* To answer Trey’s question: yes, there is a known mishap over Ohio. On July 13, 1950, a B-50 bomber crashed near Lebanon, OH, resulting in the detonation of the conventional high explosives found in a nuclear bomb. Fortunately, a coordinated nuclear chain reaction was not initiated, and such accidents appear far less frequent today.

** As dangerous as nuclear weapons are, they are inherently limited in scope. Many nuclear weapons could be detonated on Earth and the human species would likely survive, albeit scarred. The prospect of Artificial Intelligent Life arising and eventually, rapidly surpassing human intelligence seems inevitable, less containable and profoundly universal in its effects on the future of humanity. One good way to manage such future risks is to have a society with as many individuals as possible who have a deeper understanding of such technology, through programs like FLL.
