Tuesday, August 26, 2014

Coding a Car (A Work-Related Post)

There are a million (I'm guessing) things that can go wrong with a car. Tires can go flat, lights can stop working, batteries can go dead. Hoses can rupture. It can run out of fuel. The driver can drive it off a cliff.

As the car manufacturer, you can't control for all of these things. But you can design the car to work well and be intuitive, things that will make it easy for someone who's never driven your car before to get behind the wheel and not drive it off a cliff whilst trying to turn on the turn signal. You can source quality parts and stress test to make sure hose ruptures and timing belt failures are rare. And you can test it under extreme conditions to make sure that your car can handle most of the crazy things your customers might throw at it, even if you hadn't intended for the car to be able to do that in the first place. Because let's face it, your customers are humans - inquisitive, irrational, unpredictable.

But there are two more really important things to do.

Documentation - you create guides for mechanics who will do much of the repair for you.

Awareness - you anticipate some of the most common problems the car can have and you build in systems to monitor, record, track and alert. They may tell the driver (a chime and light to tell you "low fuel"). They may call ahead to the mechanic ("a part has failed" or "a part is about to fail, please order it and call my owner to schedule service"). Or they may simply record for later review by the manufacturer or law enforcement ("the car was going 45 mph in reverse with the windshield wipers on when it went off the cliff").

Good software goes through all of this thinking as well. A lot has been said about these first few:

Good design:

  • Looks good.
  • Placement of controls is logical and uses existing conventions or best practices.
  • Workflows make sense, tasks can be completed.
  • The product is true, authentic, real, accurate. (That is, a salesperson can sell it without lying or guessing.)
  • Holds up reasonably well under wear and tear.

Compliance:

  • Does it meet policy/law regarding the safe provision and transit of personally identifiable information, payment processing and financial data?
  • Is it secure?

Testing:

  • Meets intended scenarios and business requirements.
  • Withstands innocent incorrect usage. (Better yet, recognizes these lost users and guides them back to the path.)
  • Withstands hacking attempts - doesn't let you hide SQL commands inside form fields, protects itself against DDoS attacks, etc. Hire people to try to break your system.
  • You've pushed it to its limits to see how it responds.
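To make the "SQL commands inside form fields" point concrete, here is a minimal sketch, assuming Python and its built-in sqlite3 module. The `users` table and the `find_user` helper are made up for illustration. The idea is that form input is passed as a bound parameter and never pasted into the SQL text.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str) -> list:
    # The "?" placeholder hands the value to the database separately from
    # the SQL text, so form input is treated as data, never as a command.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()


# Hypothetical in-memory table, just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

# A classic injection attempt typed into a form field.
hostile_input = "alice' OR '1'='1"

print(find_user(conn, "alice"))        # [(1, 'alice')]
print(find_user(conn, hostile_input))  # []
```

The hostile string is simply looked up as a (nonexistent) user name. Had it been concatenated into the query text, the `OR '1'='1'` clause would have matched every row.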

But less thought and attention may be given to these - and that's dangerous.

Documentation:

While the goal may be to produce a product so perfect that everyone just instinctively knows how to use it, that's just wishful/prideful thinking.  You need to see how real people interact with your product and then tweak and adjust as you go. But you also need to know that you probably can't make it foolproof. You will need labels, guides, self-service documentation that allows the user to figure out some tasks without coming back to you.

It also may be that you're going to hand the product off to a different team to support once you've written it. Perhaps an On-Call team or a Network Operations Center who will be watching your system. They need to know how to support the product, how it works, what levers and pulleys are at their disposal to right the ship if things start to go sideways. (And who to wake up if it completely falls over.)

System Health - Monitoring, Alerting, Notification, Logging:

You want your system to be resilient, to never fail, to just work perfectly and flawlessly all the time. If you think that's possible, you're incredibly vain or prideful. Or you have an unlimited budget and your product will never ever actually ship because you'll be coding forever.

The developers who build the system must spend time imagining how it can break down. Those scenarios must be turned into things that can generate logs, monitors, alerts and notifications. There is a tendency to overlook these items during planning - they aren't sexy, they aren't requirements handed down by the business, they are extra work for the developers - but they must be considered, must be added to the product backlog, must be delivered before the product goes live. These are your insurance policy. Skip them only if you're willing to keep making payments on a car that someone has stolen.

Logs - call it the black box of your software program - you need to know what was happening when things went south. In the midst of the crisis, you may be too busy trying to keep the thing flying to stop and ask what went wrong, but you need a way later to go back and better understand the original causes. This is only possible if you take the time to build in the code that captures snapshots of what was happening.
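Here is one way the "black box" idea might look in code. This is a sketch, assuming Python's standard logging module. The `charge_card` function, the `snapshot` helper, the event names and the failing gateway are all hypothetical. Each step writes one line that carries enough context to reconstruct the sequence later.

```python
import json
import logging

logger = logging.getLogger("orders")


def snapshot(event: str, **context) -> str:
    """Serialize an event and its context as one JSON line."""
    return json.dumps({"event": event, **context}, sort_keys=True)


def charge_card(order_id: str, amount_cents: int, gateway) -> bool:
    """Call a (hypothetical) payment gateway and record what happened."""
    logger.info(snapshot("charge.start", order_id=order_id,
                         amount_cents=amount_cents))
    try:
        gateway(order_id, amount_cents)
    except Exception as exc:
        # logger.exception also records the stack trace of the failure.
        logger.exception(snapshot("charge.failed", order_id=order_id,
                                  error=str(exc)))
        return False
    logger.info(snapshot("charge.ok", order_id=order_id))
    return True


def broken_gateway(order_id: str, amount_cents: int) -> None:
    # Stand-in for a downstream system that is having a bad day.
    raise RuntimeError("gateway timeout")


logging.basicConfig(level=logging.INFO)
charge_card("order-1001", 2500, broken_gateway)
```

Note what is not in the log: card numbers and other sensitive fields. The snapshot should identify the order, not expose the customer.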

Monitoring - computer software is easy to set and forget. There's often no tangible product. Monitoring is at-a-glance statistics and measurements elevated to a pretty screen. It may tell you about network connectivity or disk space, about how many successful logins have occurred in the past 15 minutes or how many orders have been placed. It will be real-time and where appropriate should show trending/trailing stats so you know if something has changed.
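A number like "successful logins in the past 15 minutes" can come from something as small as a trailing-window counter. The sketch below assumes Python. The `TrailingCounter` class is invented for illustration, and it assumes events are recorded in time order with timestamps in seconds.

```python
from collections import deque


class TrailingCounter:
    """Counts events that happened within the last `window_seconds`."""

    def __init__(self, window_seconds: float = 15 * 60):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now: float) -> None:
        # Assumes callers record events in non-decreasing time order.
        self._timestamps.append(now)

    def count(self, now: float) -> int:
        # Drop everything that has aged out of the window, then count.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


logins = TrailingCounter()
for timestamp in (0, 600, 1000):
    logins.record(timestamp)

print(logins.count(now=1000))  # 2 (the login at t=0 has aged out)
```

Sampling `count` at regular intervals and keeping the results is what gives you the trending view: the same number, compared with itself an hour or a week ago.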

Alerting - something has gone yellow. It's not yet time to wake someone up, but it's an early warning sign that something might go bad. It could be something innocent (a commercial drove a lot of traffic to your site) or something not so innocent (a distributed denial-of-service attack has just started against your system). These appear within your monitoring tool (on the screen if you're watching it) and your logs.

Notifications - ok, now there's a fire. Something is "hard down" or something's been in the yellow state for more than 15 minutes or some metric is so off (shopping cart creation vs completed order, signin attempts vs successful signins) that it's time to wake someone up. People are getting text messages or automated phone calls at this point.
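The difference between an alert and a notification can be sketched as a small escalation rule. This assumes Python. The `HealthCheck` class, the threshold values and the three state names are made up, and the metric is the sign-in example from above.

```python
from dataclasses import dataclass
from typing import Optional

YELLOW_LIMIT_SECONDS = 15 * 60  # how long "yellow" is tolerated


@dataclass
class HealthCheck:
    warn_ratio: float = 0.8   # hypothetical: below this, go yellow
    page_ratio: float = 0.5   # hypothetical: below this, wake someone up
    yellow_since: Optional[float] = None

    def evaluate(self, attempts: int, successes: int, now: float) -> str:
        """Return "ok", "alert" (yellow) or "notify" (page someone)."""
        ratio = successes / attempts if attempts else 1.0
        if ratio >= self.warn_ratio:
            self.yellow_since = None  # healthy again, reset the clock
            return "ok"
        if self.yellow_since is None:
            self.yellow_since = now   # remember when yellow started
        too_long = now - self.yellow_since > YELLOW_LIMIT_SECONDS
        if ratio < self.page_ratio or too_long:
            return "notify"
        return "alert"


check = HealthCheck()
print(check.evaluate(attempts=100, successes=95, now=0))    # ok
print(check.evaluate(attempts=100, successes=70, now=60))   # alert
print(check.evaluate(attempts=100, successes=70, now=961))  # notify
```

The third call pages someone even though the ratio has not gotten worse. It has simply been yellow for more than 15 minutes, which is the second trigger described above.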

And there's actually one more piece beyond that, a place where the car analogy no longer works so well.

Your software product must be resilient. As products become more complex, you eventually find that you don't control the entire stack. You may depend on someone else's API to tokenize your credit card, to process your credit card, to place the order. The product catalog may be on a different system, curated by non-technical people. There may be physical components - perhaps your software controls robots that assemble and ship orders or even cars.

It's time to stop making assumptions and start honoring chain of custody. If your system has a particular piece of information that it must pass to another system, it cannot simply assume that a hand-off was successful. It now needs a way to safely and securely store that piece of information (an order, complete with credit card details, for instance) until it is assured that the next system has successfully received the data (and has confirmed receipt). This is a somewhat newer mindset, but if you want to make sure nothing falls through the cracks, you must be willing to be responsible for your data until you have proof positive that it's been accepted by the next layer in the stack.
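One possible shape for this chain of custody is an "outbox": write the record to durable storage first, and mark it delivered only after the next system confirms receipt. The sketch below assumes Python with the built-in sqlite3 module. The `outbox` table, the function names and the `flaky_send` stand-in are hypothetical. A real system would store the record somewhere durable and encrypted, and would more likely hold a payment token than raw card details.

```python
import sqlite3


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " delivered INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    # Take custody: the record is committed before any hand-off is tried.
    with conn:
        cur = conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def deliver_pending(conn: sqlite3.Connection, send) -> int:
    """Try every held record; release it only on a confirmed receipt."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    confirmed = 0
    for record_id, payload in rows:
        try:
            acknowledged = send(payload)  # True means "receipt confirmed"
        except Exception:
            acknowledged = False          # no proof of receipt, keep holding
        if acknowledged:
            with conn:
                conn.execute(
                    "UPDATE outbox SET delivered = 1 WHERE id = ?", (record_id,)
                )
            confirmed += 1
    return confirmed


def pending_count(conn: sqlite3.Connection) -> int:
    return conn.execute(
        "SELECT COUNT(*) FROM outbox WHERE delivered = 0"
    ).fetchone()[0]


def flaky_send(payload: str) -> bool:
    # Hypothetical downstream call that fails for one order.
    if payload == "order-1002":
        raise ConnectionError("downstream timed out")
    return True


conn = open_outbox()
enqueue(conn, "order-1001")
enqueue(conn, "order-1002")

first_pass = deliver_pending(conn, flaky_send)             # one confirmed
still_held = pending_count(conn)                           # one still in custody
second_pass = deliver_pending(conn, lambda payload: True)  # retry succeeds
```

The trade-off with this design is that a record can be sent more than once (for example, if the receiver got it but the confirmation was lost), so the receiving system needs to tolerate duplicates.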

So when you hear "I just want to code" be very afraid. You may need some code written, but far more importantly, you need really solid end-to-end thinking and you need a culture that says building software isn't "making babies" but "raising a child". If you're churning something out and tossing it over the fence as soon as the commit is made, you're solving for today. And you won't need to worry about tomorrow, because your company will go off the cliff and you won't know why.