Velocity Session on a Day in the Life of Facebook

A Day in the Life of Facebook Operations

Tom Cook (Facebook)
1:45pm Wednesday, 06/23/2010
Operations, Velocity Culture Ballroom AB
Facebook is now the #2 global website, responsible for billions of photos, conversations, and interactions between people all around the world, running on top of tens of thousands of servers spread across multiple geographically separated datacenters. When problems arise in the infrastructure behind the scenes, they directly impact the ability of people to connect and share with those they care about around the world.

Facebook’s Technical Operations team has to balance this need for constant availability with a fast-moving and experimental engineering culture. We release code every day. Additionally, we are supporting exponential user growth while still managing an exceptionally high ratio of users per employee within engineering and operations.

This talk will go into how Facebook is “run” day-to-day with particular focus on actual tools in use (configuration management systems, monitoring, automation, etc), how we detect anomalies and respond to them, and the processes we use internally for rapidly pushing out changes while still keeping a handle on site stability.

Tom Cook

Facebook
Tom is a Systems Engineer on the Technical Operations team at Facebook, where he is responsible for a variety of low-level services and systems within the production environment. During his time at Facebook, the systems footprint has expanded over 10x. Prior to joining the company, Tom worked for a number of smaller tech companies in Texas.

Notes on the Session

Amazing…the room is standing room only. There are dozens of folks sitting on the floor ready to watch this session.

Some quick stats

  • 16 billion minutes spent on Facebook per day
  • 6 billion pieces of content shared per week
  • 3 billion photos uploaded per month
  • 1 million Facebook Connect integrations
  • 400+ million active users (users who return monthly…50% return daily)

Rapid Growth
Facebook launched on a single server at Harvard, and more servers were added each week. Facebook is building its own data center in Prineville, Oregon; right now the servers are split between the Bay Area and Virginia.

The Stack

  • Load balancers
  • Web Servers
    • HipHop for PHP (used to be a pure LAMP stack)
      • They really like PHP, which is why they don’t abandon it.
  • Memcached
    • 300+ TB in live RAM (see the cache-aside sketch after this list)
  • MySQL
    • MySQLatFacebook
  • Services
    • News Feed, Search, Chat, Ads, Media, etc…
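
With memcached sitting between the web tier and MySQL, the usual read pattern is cache-aside: check the cache, fall back to the database, then populate the cache. A minimal sketch in Python, illustrative only and not Facebook’s code; the host names, credentials, and users table are made up, and it assumes the pymemcache and PyMySQL client libraries:

  # Illustrative cache-aside read: check memcached first, fall back to MySQL,
  # then demand-fill the cache.  Hosts, credentials, and schema are hypothetical.
  import json

  import pymysql
  from pymemcache.client.base import Client

  cache = Client(("127.0.0.1", 11211))
  db = pymysql.connect(host="127.0.0.1", user="app", password="secret",
                       database="app", cursorclass=pymysql.cursors.DictCursor)

  def get_user(user_id):
      key = "user:%d" % user_id
      cached = cache.get(key)
      if cached is not None:
          return json.loads(cached)          # cache hit: no database round trip

      with db.cursor() as cur:               # cache miss: read from MySQL
          cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
          row = cur.fetchone()

      if row is not None:
          cache.set(key, json.dumps(row), expire=300)  # demand-fill for 5 minutes
      return row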

Core Operating System

  • Pure Linux architecture (CentOS 5)

Systems Management

  • Configuration Management: Using tools that are about 4 years old.
    • Use CFEngine (version 2)
      • Servers update every 15 minutes, and a run of the rules engine takes about 30s.
  • On-Demand Tools: Need quick, targeted tools
    • Used to use dsh
    • Wrote their own tool: ran uname -a across 10,000 geographically distributed hosts in 18s (see the fan-out sketch after this list)
      • Sounds like Fusion, right Patrick?
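
Their internal on-demand tool isn’t public, so here is a rough stand-in for the idea: a dsh-style fan-out that runs a command over SSH on many hosts in parallel. The hosts.txt file, worker count, and passwordless-SSH setup are assumptions, not anything from the talk:

  # Illustrative parallel fan-out of a command across many hosts.
  # Assumes passwordless SSH to every host listed in hosts.txt.
  import subprocess
  from concurrent.futures import ThreadPoolExecutor

  def run_remote(host, command):
      result = subprocess.run(
          ["ssh", "-o", "BatchMode=yes", host, command],
          capture_output=True, text=True, timeout=30,
      )
      return host, result.stdout.strip()

  def fan_out(hosts, command, workers=200):
      with ThreadPoolExecutor(max_workers=workers) as pool:
          for host, output in pool.map(lambda h: run_remote(h, command), hosts):
              print("%s: %s" % (host, output))

  if __name__ == "__main__":
      with open("hosts.txt") as f:
          fan_out([line.strip() for line in f if line.strip()], "uname -a")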

Deployment

  • Code is pushed on Frontend or Backend
    • Web Push
      • Pushed 1 time a day or multiple times a day
      • Push new features 1x a week
      • Highly coordinated…the software push is built on top of the on-demand tools
      • Code distributed via internal BitTorrent swarm (fast…fast…fast)
        • Able to do a daily push in 1 minute across every system worldwide
            • That’s just the file distribution (doesn’t include server restarts)
    • Backend deployments
      • Facebook got rid of formal QA
        • Run with Engineering + Operations model
        • Engineers write, test and deploy their own code
          • Quickly make performance decisions
          • Expose changes to a subset of real traffic (see the rollout sketch after this list)
        • No “commit and quit” (the culture is pretty intense)
          • Engineering is deeply involved in moving services to production
          • Ops embedded in engineering
            • Help make architecture decisions
            • Better understand needs of product
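
The “expose changes to a subset of real traffic” step is worth a sketch. A generic way to do this is a deterministic percentage rollout keyed on user ID; the version below is illustrative only (not Facebook’s tooling), with hypothetical feature names and percentages:

  # Illustrative percentage rollout: expose a feature to a fixed slice of users.
  # Feature names and percentages are made up for the example.
  import hashlib

  ROLLOUTS = {
      "new_chat_ui": 5,        # percent of users who see the change
      "photo_pipeline_v2": 50,
  }

  def bucket(user_id, feature):
      # Hash user+feature so each feature gets an independent 0-99 bucket.
      digest = hashlib.md5(("%s:%d" % (feature, user_id)).encode()).hexdigest()
      return int(digest, 16) % 100

  def feature_enabled(user_id, feature):
      return bucket(user_id, feature) < ROLLOUTS.get(feature, 0)

  # Ramping photo_pipeline_v2 from 50% to 100% is then just a config change.
  print(feature_enabled(12345, "new_chat_ui"))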

Change Logging

  • Aggressively log and audit every change (see the logging sketch below)
    • Log start and end time
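
A minimal sketch of what logging the start and end time of a change could look like; the log file path and fields here are made up, not Facebook’s internal format:

  # Illustrative change-log wrapper: record who changed what, with start and
  # end times, so graph anomalies can be correlated with changes.
  import getpass
  import json
  import time
  from contextlib import contextmanager

  @contextmanager
  def logged_change(description, logfile="/var/log/changes.log"):
      entry = {"who": getpass.getuser(), "what": description,
               "start": time.time()}
      try:
          yield
      finally:
          entry["end"] = time.time()
          with open(logfile, "a") as f:
              f.write(json.dumps(entry) + "\n")

  # Example: everything inside the block is bracketed by start/end timestamps.
  with logged_change("push web tier build 2010-06-23a"):
      pass  # run the actual push here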

Monitoring

  • Use Ganglia for quick drill-ins
  • Use ODS, a home-grown tool, for application-level metrics
  • Use Nagios for basic host monitoring (ping, SSH); see the check sketch after this list
    • Very distributed and integrates with internal tools
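
Nagios checks follow a simple plugin convention: print a one-line status and exit 0 for OK, 1 for WARNING, 2 for CRITICAL. A small example check along those lines (the disk-usage thresholds are arbitrary, not from the talk):

  # Illustrative Nagios-style plugin: status is reported via the exit code
  # (0 = OK, 1 = WARNING, 2 = CRITICAL) plus a one-line message.
  import shutil
  import sys

  WARN_PCT, CRIT_PCT = 80, 90

  def check_disk(path="/"):
      usage = shutil.disk_usage(path)
      used_pct = 100.0 * usage.used / usage.total
      if used_pct >= CRIT_PCT:
          print("DISK CRITICAL - %.0f%% used on %s" % (used_pct, path))
          return 2
      if used_pct >= WARN_PCT:
          print("DISK WARNING - %.0f%% used on %s" % (used_pct, path))
          return 1
      print("DISK OK - %.0f%% used on %s" % (used_pct, path))
      return 0

  if __name__ == "__main__":
      sys.exit(check_disk())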

Constant Growth

  • Deal with non-stop failure at all levels of the stack
    • Automated tools deal with this failure (see the remediation sketch below)
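
A rough sketch of the automated-remediation idea: health-check hosts and pull unresponsive ones out of rotation automatically. The health URL and in-memory pool are stand-ins, not Facebook’s tooling:

  # Illustrative remediation loop: detect unresponsive web servers and stop
  # sending them traffic.  The pool and health-check URL are hypothetical.
  import time
  import urllib.request

  POOL = {"web001.example.com", "web002.example.com", "web003.example.com"}

  def healthy(host):
      try:
          with urllib.request.urlopen("http://%s/health" % host, timeout=2) as resp:
              return resp.status == 200
      except OSError:
          return False

  def remediate(pool):
      for host in sorted(pool):
          if not healthy(host):
              pool.discard(host)   # take the bad host out of rotation
              print("removed %s from rotation; filing repair ticket" % host)

  if __name__ == "__main__":
      while True:
          remediate(POOL)
          time.sleep(30)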