Below are raw notes from tonight's Geeksessions event in San Francisco. All told, while the speakers did fine, I wasn't wowed by the content. See for yourself.
Josh Fergus from Sun
- “the pro-RDBMS guy”
- relational dbs good for some things (security), not others (scalability)
- data is “durable”: survives changes to the application
- not really much to say, use dbs for what they’re good for
Chad Walters from Powerset on Giant Scale Systems
- ACID means constraints
- SQL is hard to grok at scale
- talking about very large computational domains delivered at very high volumes
- use a ton of commodity hardware to overcome failures
- “modern rdbms don’t deliver on reliability”
- replication starts to be a headache with rdbms
- use something like GFS or Hadoop that was designed for the task
- why the application/db divide?
- “move the computation to the data” – MapReduce, specialized data structures
- sounds in general like a digestion of the Google approach
Paul Querna from Bloglines
- BloglinesFS: “store every blog post ever”
- currently storing several billion posts
- inspired by MogileFS and GFS, but different
- two main components: PodServer (serves metadata) and ItemDBs (storage nodes)
- PodServer: stateless, finds nodes on startup, provides cross-data center replication, “spigots” for dumping data
- ItemDB: serves web sites as “chunks”, local indexes for attributes like Date, Post URL, etc.
- request goes from app server to PodServer to ItemDB
- crawler writes to PodServer which then writes to ItemDBs
- 75 data machines of 2×300GB disks, one ItemDB per disk, no RAID
- crawl all blogs every 30 minutes, thousands of concurrent reads + writes per second
- “we’d do it all over again because there aren’t any choices out there”
- Hadoop too specialized for web search, not good for write-heavy data
- goal isn’t to build an open source project, but to launch a product
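
A rough sketch of the request path described above (app server or crawler to PodServer to ItemDB). Only the names PodServer and ItemDB come from the talk; the methods, the URL-hash placement rule and the replication factor of 2 are my assumptions, since the talk did not say how a PodServer picks storage nodes. The node count follows from the numbers given: 75 machines × 2 disks = 150 ItemDBs, or 150 × 300 GB = 45 TB of raw disk.

```python
import hashlib

class ItemDB:
    """Stand-in for one storage node (one per disk, per the notes)."""
    def __init__(self, name):
        self.name = name
        self.posts = {}

    def put(self, url, body):
        self.posts[url] = body

    def get(self, url):
        return self.posts.get(url)

class PodServer:
    """Stateless router: keeps nothing but the node list it found at startup."""
    def __init__(self, nodes, replicas=2):
        self.nodes = nodes
        self.replicas = replicas  # assumed value; no replication factor was given

    def _targets(self, url):
        # Assumed placement rule: hash the post URL, then take consecutive nodes.
        start = int(hashlib.md5(url.encode()).hexdigest(), 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def write(self, url, body):
        # Crawler path: PodServer fans the write out to the chosen ItemDBs.
        for node in self._targets(url):
            node.put(url, body)

    def read(self, url):
        # App-server path: try each replica until one has the post.
        for node in self._targets(url):
            body = node.get(url)
            if body is not None:
                return body
        return None

nodes = [ItemDB(f"itemdb-{i}") for i in range(150)]  # 75 machines x 2 disks
pod = PodServer(nodes)
pod.write("http://example.com/post/1", "hello")
```

Because the PodServer holds no data of its own, any number of them can sit in front of the same ItemDBs, which fits the “stateless” description.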
Arnold Goldberg from eBay
- “eBay by the numbers” (not talking about PayPal, Skype, just eBay ecommerce platform)
- tons of users, transactions, listings, API developers
- 600 database instances, 20 billion SQL statements per day (95% reads), 5600 active tables, 1.8 petabytes
- original pattern: separating writes and reads: write once and replicate to many
- “replication will kill you if you try to do it at scale [...] it’s a mess”
- new pattern: lookup host finds where your data lives (by primary key)
- use persistent database connections, but know the costs: each connection takes a process/thread
- figure out where your “scalability cliff” is and test
- connection concentration tier was a pain
- to scale, vector based on what you’re looking for and distribute
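
A small sketch of the “lookup host” pattern as I understood it: ask a directory where a primary key lives, then talk to that database. This is not eBay's implementation; the `LookupHost` class, the host names and the modulo rule for placing new keys are all hypothetical.

```python
class LookupHost:
    """Hypothetical directory that maps a primary key to the database holding it."""
    def __init__(self, hosts):
        self.hosts = hosts
        self.directory = {}  # primary key -> host name

    def assign(self, key):
        # Assumed rule for new keys: spread them across hosts by modulo.
        if key not in self.directory:
            self.directory[key] = self.hosts[key % len(self.hosts)]
        return self.directory[key]

    def locate(self, key):
        # The application asks here first, then connects to the returned host.
        return self.directory[key]

    def move(self, key, host):
        # An explicit directory lets data be rebalanced without rehashing every key.
        self.directory[key] = host

lookup = LookupHost(["db-0", "db-1", "db-2", "db-3"])
lookup.assign(42)
first_home = lookup.locate(42)
lookup.move(42, "db-0")
new_home = lookup.locate(42)
```

Compared with write-once-and-replicate, each key has a single home, so adding hosts spreads both reads and writes instead of multiplying replication traffic.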
