Sunday, January 19, 2020

Code Freeze 2020 - Observability

Sorry for the loose notes rather than a proper writeup, but I wanted to get things collated so I can work with them.  Good conference.  I ate at Al's for breakfast and Hong Kong Noodles for lunch.  And I realized I can take the Green Line over to the U of MN for lunch anytime I'm downtown at work, which didn't occur to me these last six months.  With my bicycle, it's even a short ride that way across the campus bridge.  I need to get my urban on.  There was not as much practical knowledge at this one (for me) as at some of the past events, but that means I can focus on the few things I think have practical value rather than being all over the place.

Observability and the Glorious Future - Charity Majors

  • O'Reilly Database Reliability Engineering (November 2017)
  • How often do you deploy?  How long does it take, how often do you fail, and what is your recovery time?  The basics.
  • Hires for communication skills (the initial tech interview is to get them talking at the in-person).  "Empowered to do their jobs".
  • "How do I know if it breaks?" - ask it of all changes, all features.
  • "Serverless was a harbinger.  Deployless is coming."
  • Developers (senior+) should amplify the hidden costs.
  • Team happiness = customer happiness (Steve says this too)
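
The "basics" questions above (deploy frequency, deploy duration, failure rate, recovery time) are easy to compute once deploys are logged.  A minimal sketch, assuming a hypothetical Deploy record of my own invention; none of these names come from the talk or from any particular tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Deploy:
    """Hypothetical deploy log entry; the field names are assumptions."""
    started: datetime
    finished: datetime
    failed: bool = False
    recovered: Optional[datetime] = None  # when service was restored after a failure


def mean_minutes(deltas: List[timedelta]) -> Optional[float]:
    """Average a list of timedeltas in minutes; None when there is nothing to average."""
    if not deltas:
        return None
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def deploy_stats(deploys: List[Deploy], window_days: float) -> dict:
    """Summarize the four basics over a reporting window of window_days."""
    failures = [d for d in deploys if d.failed]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "mean_deploy_minutes": mean_minutes([d.finished - d.started for d in deploys]),
        "failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mean_recovery_minutes": mean_minutes(
            [d.recovered - d.finished for d in failures if d.recovered]
        ),
    }
```

Feed it whatever the deploy pipeline already records.  The point is only that all four numbers fall out of start, finish, failed, and recovered timestamps.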

Observability in Big Analytics - Bonnie Holub, Teradata

50 Years of Observability - Mary Poppendieck

  • What is the equivalent of metal fatigue in software?  Operator fatigue. >> e.g. what Steve pushes that a focus on PIs is important.
  • Talked planes, bridges, Three Mile Island.
  • She likes the Control series by Brian out on YouTube... they're deep.
  • Observable - all critical states known from system outputs
  • Observable is at war with complexity.
  • Controllable - actuator/sensor can get the system back to a set state in a set time.
  • If it's not observable, can it be totally controlled? (no)
  • Fault Tolerance: replication and isolation.
  • Responsibility (and understanding the big picture) leads to desire for observability (and isolation/duplication). >> PLEX team at VP is a form of big picture.
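
For the control-theory sense of "observable" above, the textbook check for a linear time-invariant system (x' = Ax, y = Cx) is the Kalman rank test: stack C, CA, ..., CA^(n-1) and see whether the rank equals the number of states n.  A minimal sketch, assuming that is the definition the talk had in mind; the example matrices are mine, not hers:

```python
from fractions import Fraction


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(rows):
    """Matrix rank by Gauss-Jordan elimination, using exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def is_observable(A, C):
    """Kalman rank test: rank of [C; CA; ...; CA^(n-1)] must equal n."""
    n = len(A)
    blocks, current = [], C
    for _ in range(n):
        blocks.extend(current)
        current = mat_mul(current, A)
    return rank(blocks) == n


# State is (position, velocity); position changes with velocity.
A = [[0, 1], [0, 0]]
print(is_observable(A, [[1, 0]]))  # sensor reads position
print(is_observable(A, [[0, 1]]))  # sensor reads velocity only
```

Measuring position lets you infer velocity over time, so the first system is observable.  Measuring only velocity never tells you where you started, so the second is not.  That is the "all critical states known from system outputs" idea in miniature.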

What's Happening in Your Production Data and ML Systems  - Don Sawyer, PhData

  • Most practical of the lectures.
  • Focus on decoupled systems: Data warehouse, ML Models.
  • Talked Provenance as both origin and change over time.
  • Timestamp everything UTC (use Google Time API as an example to change it during compute).
  • Focus on: audit trails, data quality, repeatability, added info (pipeline).
  • Metadata payload.  PROCESS: id/version, start/end, transformations, inputs, configurations.  DATA VERSIONS: traces of issues, data change history, defect data.  LINEAGE: sources, frequency of read.
  • Last point was a little messy (from me) but you want to trace right down to the node data touched in transit so you can hydrate anything from the last known good state.
  • NOT ALL DATA RECORDS require granular provenance.  Can be expensive (so much data).  Use a flexible or generic schema.  Don't use S3 (slow).  Storage considerations.
  • Storage: 1.) attach info to the record (can get big, note that Avro and Parquet are meant to do this), 2.) send a separate event message - separate provenance API, 3.) only track some.  Note that for API approaches you may end up going down a rabbit hole of tracking the tracking API.
  • Alternatives: Amundsen (Lyft), Marquez (WeWork), DataBook (Uber), DataHub (LinkedIn)
  • Look at Apache NiFi (there's a Pluralsight class)
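
To make the metadata payload and the three storage options above concrete, here is a minimal sketch.  Everything in it is an assumption on my part: the ProvenanceEvent fields are just my reading of the PROCESS and LINEAGE bullets, and the helper names are hypothetical, not from the talk, NiFi, or any of the tools listed.

```python
import json
import uuid
import zlib
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone


def utc_now() -> str:
    """Timestamp everything in UTC, stored as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def to_display(ts: str, offset_hours: int) -> datetime:
    """Convert a stored UTC timestamp to a local offset only when displaying it."""
    return datetime.fromisoformat(ts).astimezone(timezone(timedelta(hours=offset_hours)))


@dataclass
class ProvenanceEvent:
    """Hypothetical generic schema covering the PROCESS and LINEAGE fields."""
    record_id: str
    process_id: str
    process_version: str
    started_at: str
    ended_at: str
    transformations: list
    inputs: list
    configuration: dict
    sources: list
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def attach(record: dict, event: ProvenanceEvent) -> dict:
    """Option 1: carry provenance inside the record (it grows with every hop)."""
    enriched = dict(record)
    enriched["_provenance"] = list(record.get("_provenance", [])) + [asdict(event)]
    return enriched


def as_message(event: ProvenanceEvent) -> str:
    """Option 2: serialize the event to send separately to a provenance API or queue."""
    return json.dumps(asdict(event), sort_keys=True)


def should_track(record_id: str, sample_every: int = 10) -> bool:
    """Option 3: only track some records, chosen deterministically by id."""
    return zlib.crc32(record_id.encode("utf-8")) % sample_every == 0
```

The CRC-based sampling is my own choice so the same record is tracked (or not) at every stage of the pipeline; a real system might instead track by record type or by business importance.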

Evolving Chaos Engineering - Casey Rosenthal, Verica

  • Ships, shoes, fruit (apricots), helium mining.  He's a very funny guy.
  • LOOK FOR A VIDEO to watch with the team.
  • Reversibility: blue/green, feature flags, CI/CD, agile to waterfall.
  • Moved responsibility away from the people who do the work (hierarchy)
  • Myths:
  • 1. remove the people causing the accidents.
  • 2. document best practices and use runbooks.  (most interesting problems are unique)
  • 3. defend against prior root causes, aka defense in depth.  Root cause analysis: "at best, you are wasting your time."  Was our sponsor audience issue an example?  The answer was in part to restrict audience size.  But the dig highlighted that the system no longer supports system-wide features after growth, the high processing cost of the feature, the inability to test with all users, etc.
  • 4. enforce procedures
  • 5. avoid risk
  • 6. simplify
  • 7. add redundancy
  • Do NOT eliminate complexity.  Navigate it.  CI, CD, CV - continuous verification.  That's New Relic for us.
  • Has two books: Chaos Engineering and Learning Chaos Engineering.  First book comes out June 2020.
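
Reversibility and continuous verification fit together: ship the change behind a flag, keep checking the system's output, and flip the flag back if the check fails.  A minimal sketch of that loop, with hypothetical names throughout; this is my own illustration, not how New Relic or any chaos tool actually works:

```python
from typing import Callable, Dict


class FeatureFlags:
    """Hypothetical in-memory flag store; a real one would be shared and persistent."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def enable(self, name: str) -> None:
        self._flags[name] = True

    def disable(self, name: str) -> None:
        self._flags[name] = False

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)


def verify_or_revert(
    flags: FeatureFlags,
    name: str,
    probe: Callable[[], float],
    max_error_rate: float,
) -> bool:
    """One verification pass: measure the error rate, reverse the change if it is too high.

    probe is any callable returning the current error rate (0.0 to 1.0),
    for example a query against the monitoring system.
    """
    if not flags.is_enabled(name):
        return True  # nothing to verify
    if probe() > max_error_rate:
        flags.disable(name)  # the cheap, reversible undo
        return False
    return True
```

Run it on a schedule after every rollout.  The flag makes the undo cheap, and the probe is the "how do I know if it breaks?" check.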
