Oct 2, 2007

Notes From Geeksessions: Beyond the Database

Below are raw notes from tonight's Geeksessions event in San Francisco. All told, while the speakers did fine, I wasn't wowed by the content. See for yourself.

Josh Fergus from Sun

  • “the pro-RDBMS guy”

  • relational dbs good for some things (security), not others (scalability)

  • data is “durable”: survives changes to the application

  • not really much to say, use dbs for what they’re good for

Chad Walters from Powerset on Giant Scale Systems

  • ACID means constraints

  • SQL is hard to grok at scale

  • talking about very large computational domains delivered at very high volumes

  • use a ton of commodity hardware to overcome failures

  • “modern RDBMS don’t deliver on reliability”

  • replication starts to be a headache with RDBMS

  • use something like GFS or Hadoop that was designed for the task

  • why the application/db divide?

  • “move the computation to the data” – MapReduce, specialized data structures

  • sounds in general like a digestion of the Google approach
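
To make “move the computation to the data” a bit more concrete, here is a minimal MapReduce-style word count. It is only a sketch: the `shards` list stands in for data living on separate nodes, and every name in it is made up for illustration, not something shown in the talk.

```python
from collections import defaultdict

# Hypothetical shards: each inner list stands in for data stored on one node.
shards = [
    ["the cat sat", "the dog"],
    ["the cat ran"],
]


def map_phase(lines):
    """Runs next to the shard it reads and emits (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs_per_shard):
    """Groups the emitted values by key across all shards."""
    grouped = defaultdict(list)
    for pairs in pairs_per_shard:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapses each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(all_shards):
    # In a real cluster each map_phase call would run on the node that
    # holds the shard; here they simply run one after another.
    mapped = [map_phase(shard) for shard in all_shards]
    return reduce_phase(shuffle(mapped))
```

Only the small (word, count) pairs would need to cross the network, while the raw lines stay where they are stored.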

Paul Querna from Bloglines

  • BloglinesFS: “store every blog post ever”

  • currently storing several billion posts

  • inspired by MogileFS and GFS, but different

  • two main components: PodServer (serves metadata) and ItemDBs (storage nodes)

  • PodServer: stateless, finds nodes on startup, provides cross-data center replication, “spigots” for dumping data

  • ItemDB: serves web sites as “chunks”, local indexes for attributes like Date, Post URL, etc.

  • request goes from app server to PodServer to ItemDB

  • crawler writes to PodServer which then writes to ItemDBs

  • 75 data machines of 2×300GB disks, one ItemDB per disk, no RAID

  • crawl all blogs every 30 minutes, thousands of concurrent reads + writes per second

  • “we’d do it all over again because there aren’t any choices out there”

  • Hadoop too specialized for web search, not good for write-heavy data

  • goal isn’t to build an open source project, but to launch a product
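
The request path above (app server or crawler to PodServer to ItemDB) can be sketched roughly as follows. The hardware numbers come from the notes; everything else is an assumption on my part, including the hash-based placement, the replica count of 2, and all class and method names. The talk did not say how BloglinesFS actually picks nodes.

```python
import hashlib

# Figures from the notes: 75 machines, 2 x 300GB disks, one ItemDB per disk.
MACHINES = 75
DISKS_PER_MACHINE = 2
DISK_GB = 300

ITEMDB_COUNT = MACHINES * DISKS_PER_MACHINE        # 150 ItemDBs
RAW_CAPACITY_TB = ITEMDB_COUNT * DISK_GB / 1000    # 45.0 TB raw, no RAID


class ItemDB:
    """Hypothetical storage node: keeps posts keyed by URL."""

    def __init__(self, name):
        self.name = name
        self.posts = {}

    def put(self, url, body):
        self.posts[url] = body

    def get(self, url):
        return self.posts.get(url)


class PodServer:
    """Hypothetical stateless router: holds only the node list it was given."""

    def __init__(self, nodes, replicas=2):  # replica count is an assumption
        self.nodes = nodes
        self.replicas = replicas

    def _nodes_for(self, url):
        # Assumed placement: hash the URL, then take consecutive nodes.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def write(self, url, body):
        # Crawler path: PodServer fans the write out to the chosen ItemDBs.
        for node in self._nodes_for(url):
            node.put(url, body)

    def read(self, url):
        # App server path: try each replica until one has the post.
        for node in self._nodes_for(url):
            body = node.get(url)
            if body is not None:
                return body
        return None
```

Because the PodServer in this sketch keeps no state beyond the node list, any number of them could sit in front of the same ItemDBs, which fits the “stateless” description in the notes.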

Arnold Goldberg from eBay

  • “eBay by the numbers” (not talking about PayPal, Skype, just eBay ecommerce platform)

  • tons of users, transactions, listings, API developers

  • 600 database instances, 20 billion SQL statements per day (95% reads), 5600 active tables, 1.8 petabytes

  • original pattern: separating writes and reads: write once and replicate to many

  • “replication will kill you if you try to do it at scale [...] it’s a mess”

  • new pattern: lookup host finds where your data lives (by primary key)

  • use persistent database connections, but know the costs: each connection takes a process/thread

  • figure out where your “scalability cliff” is and test

  • connection concentration tier was a pain

  • to scale, vector based on what you’re looking for and distribute
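
The “lookup host” pattern is easier to picture with a small example. This is a sketch under my own assumptions: I assume the directory maps ranges of primary keys to database hosts, and the `LookupHost` class and the host names are invented. The talk only said that a lookup host finds where your data lives by primary key.

```python
import bisect


class LookupHost:
    """Hypothetical directory mapping primary-key ranges to database hosts."""

    def __init__(self, ranges):
        # ranges: (first_key, host) pairs; each host owns the keys from
        # first_key up to the next entry's first_key.
        ordered = sorted(ranges)
        self._starts = [start for start, _ in ordered]
        self._hosts = [host for _, host in ordered]

    def host_for(self, primary_key):
        # Find the last range whose first key is <= primary_key.
        index = bisect.bisect_right(self._starts, primary_key) - 1
        if index < 0:
            raise KeyError(primary_key)
        return self._hosts[index]


# Invented layout: three hosts splitting the key space.
lookup = LookupHost([(0, "db-a"), (1000, "db-b"), (2000, "db-c")])
```

The application asks the lookup host first and then talks only to the database that owns the key, so no single database has to hold or replicate everything.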
