On February 28th, CloudCamp Seattle was held at an Amazon facility in Seattle. CloudCamp is described by its organizers as an unconference where early adopters of cloud computing technologies exchange ideas: "With the rapid change occurring in the industry, we need a place we can meet to share our experiences, challenges and solutions. At CloudCamp, you are encouraged to share your thoughts in several open discussions, as we strive for the advancement of Cloud Computing. End users, IT professionals and vendors are all encouraged to participate."
The CloudCamp schedule is at http://www.cloudcamp.com/.
Jeanine Johnson attended the event and took excellent notes. Jeanine’s notes follow.
It began with a series of "lightning presentations" – 5-minute presentations on cloud topics that are now online (http://www.seattle20.com/blog/Can-t-Make-it-to-Cloud-Camp-Watch-LIVE.aspx). Afterwards, there was a Q&A session with participants who volunteered to share their expertise. Then, 12 topics were chosen by popular vote to be discussed in an "open space" format, in which the volunteer who suggested the topic facilitated its one-hour discussion.
Highlights from the lightning presentations:
· AWS has launched several large data sets (10-220GB) in the cloud and made them publicly available (http://aws.amazon.com/publicdatasets/). Example data sets are the human genome and US census data; large data sets that would take hours, days, or even weeks to download locally, even with a fast Internet connection. (A brief usage sketch follows this list.)
· A pyramid was drawn, with SaaS (e.g. Hotmail, SalesForce) on top, followed by PaaS (e.g. Google App Engine, SalesForce API), IaaS (e.g. Amazon, Azure; which leverages virtualization), and "Traditional hosting" as the pyramid's foundation – a nice, simple rendition of the cloud stack (http://en.wikipedia.org/wiki/Cloud_computing). SaaS applications were shown to have the most functionality, and moving down the stack trades functionality for flexibility.
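Although the talk didn't cover it, the public data sets are published as EBS snapshots, so using one amounts to creating a volume from the snapshot and attaching it to an EC2 instance. A minimal sketch with the boto library follows; the snapshot ID, size, zone, and instance ID are placeholders, not real values:

```python
# Hypothetical sketch: mount an AWS public data set (published as an EBS snapshot).
# Snapshot and instance IDs are placeholders; look up the real snapshot ID on the
# public data sets page above.
import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")
vol = conn.create_volume(size=220, zone="us-east-1a",
                         snapshot="snap-00000000")      # placeholder snapshot ID
conn.attach_volume(vol.id, instance_id="i-00000000",    # your EC2 instance
                   device="/dev/sdf")
# then, on the instance itself: mount /dev/sdf and read the data set locally
```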
Beyond that, the lightning presentations were too brief, with no opportunity for Q&A, to take much else away. After the lightning presentations, open space discussions were held. I attended three: 1) scaling web apps, 2) scaling MySql, and 3) launching MMOGs (massively multiplayer online games) in the cloud – notes for each session follow.
1. SCALING WEB APPS
One company volunteered itself as a case study for the group of 20ish people. They run 30 physical servers, with 8 front-end Apache web servers on top of 1 scaled-up MySql database, and they use PHP channels to access their Drupal http://drupal.org content. Their MySql machine has 16 processors and 32GB RAM, but it is maxed out and they're having trouble scaling it: they currently hover around 30k concurrent connections, and up to 8x that during peak usage. They're also bottlenecked by their NFS server, and use basic round-robin load balancing.
Using CloudFront to serve their images was suggested, instead of Drupal (where they currently store lots of them). Unfortunately, CloudFront takes up to 24 hours to notice content changes, which wouldn't work for them. So the discussion began around how to scale Drupal, but quickly morphed into key-value-pair storage systems (e.g. SimpleDB http://aws.amazon.com/simpledb/) versus relational databases (e.g. MySql) for storing backend data.
After some discussion around where business logic should reside, in stored procedures and triggers or in application code via an MVC http://en.wikipedia.org/wiki/Model-view-controller paradigm, the group agreed that "you have to know your data: Do you need real-time consistency? Or eventual consistency?"
Hadoop http://hadoop.apache.org/core/ was briefly discussed, but once someone said that popular web-development frameworks Rails http://rubyonrails.org/ and Django http://www.djangoproject.com/ steer folks towards relational databases, the discussion turned to scaling MySql. Best practice tips given to scale MySql were:
· When scaling up, memory becomes a bottleneck, so use memcached http://www.danga.com/memcached/ to extend your system's lifespan.
· Use MySql cluster http://www.mysql.com/products/database/cluster/.
· Use MySql proxy http://forge.mysql.com/wiki/MySQL_Proxy and shard your database, such that each user is associated with a specific cluster (devs turn to sharding because horizontal scaling for WRITES isn't as effective as it is for READS; replication processing becomes untenable). A minimal sketch combining memcached caching with user-based sharding follows this list.
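Neither technique was shown in code during the session, but a rough sketch of the two ideas together (cache-aside reads through memcached, plus pinning each user to a shard) might look like the following; the hosts, schema, and credentials are invented for illustration:

```python
# Illustrative only: cache-aside via memcached plus simple user-id sharding,
# so each user's rows live on exactly one MySql shard.
import memcache
import MySQLdb

mc = memcache.Client(["10.0.0.10:11211", "10.0.0.11:11211"])   # memcached pool
SHARDS = ["db-shard-0.internal", "db-shard-1.internal"]        # one MySql master per shard

def shard_for(user_id):
    """Pin a user to a shard so all of that user's reads and writes hit one cluster."""
    host = SHARDS[user_id % len(SHARDS)]
    return MySQLdb.connect(host=host, user="app", passwd="secret", db="site")

def get_profile(user_id):
    key = "profile:%d" % user_id
    row = mc.get(key)                       # 1) try the cache first
    if row is None:
        db = shard_for(user_id)             # 2) miss: read from the user's shard
        cur = db.cursor()
        cur.execute("SELECT name, email FROM profiles WHERE user_id = %s", (user_id,))
        row = cur.fetchone()
        db.close()
        mc.set(key, row, time=300)          # 3) repopulate the cache (5-minute TTL)
    return row
```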
Other open source technologies mentioned included:
· Gallery2 http://www.gallery2.org/, an open source photo album.
· Jingle http://www.slideshare.net/stpeter/jingle, Jabber-based VoIP technology.
2. SCALING MYSQL
Someone volunteered from the group of 10ish people to white-board the “ways to scale MySql,” which were:
· Master / Slave, which can use Dolphin/Sakila http://forge.mysql.com/wiki/SakilaSampleDB, but becomes inefficient around 8+ machines.
· MySql proxy, and then replicate each machine behind the proxy.
· Master/Master topology using synchronous replication.
· Master ring topology using MySql proxy. It works well, and the replication overhead can be helped by adding more machines, but several thought it would be hard to implement this setup in the cloud.
· Mesh topology (if you have the right hardware). This is how a lot of high-performance systems work, but recovery and management are hard.
· Scale-up and run as few slaves as possible – some felt that this “simple” solution is what generally works best.
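None of these topologies was worked through in code, but the pattern most of them rely on (all writes go to a master, reads fan out across slaves) can be illustrated with a small application-level split. Hostnames and credentials below are placeholders, and a real setup would also need failover handling and awareness of replication lag:

```python
# Illustrative read/write splitting over a master/slave MySql topology.
import itertools
import MySQLdb

MASTER = "db-master.internal"
SLAVES = itertools.cycle(["db-slave-1.internal", "db-slave-2.internal"])  # round-robin reads

def connect(host):
    return MySQLdb.connect(host=host, user="app", passwd="secret", db="site")

def execute_write(sql, args=()):
    db = connect(MASTER)              # every write goes to the master
    db.cursor().execute(sql, args)
    db.commit()
    db.close()

def execute_read(sql, args=()):
    db = connect(next(SLAVES))        # reads rotate across the replicas
    cur = db.cursor()
    cur.execute(sql, args)
    rows = cur.fetchall()
    db.close()
    return rows
```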
Someone then drew an "HA Drupal stack in the cloud," which consisted of 3 front-end load balancers with hot-swap failover to the 2nd or 3rd machine, followed by 2 web servers and 2 databases in a master/slave pair on the backend. If using Drupal, 2 additional NFS servers should be set up for static content storage with hot swap (i.e., fast MAC failover). However, it was recommended that static content be moved from Drupal to a CDN when the system begins to need scaling up. This configuration in the Amazon cloud costs around $700 monthly to run (plus network traffic).
memcachefs (http://memcachefs.sourceforge.net/), a memcached-backed filesystem, was mentioned as a possibility as well.
3. LAUNCHING MMOGs IN THE CLOUD
This topic was suggested by a game developer lead. He explained to the crowd of 10ish people that MMOs require persistent connections to servers, and their concurrent-connection counts vary widely over the course of a day, with a weekly trend that peaks around Saturday and Sunday. MMO producers must plan their capacity a couple of months in advance of publishing their game. And since up to 50% of an MMO's subscriber base is active on the first day, they usually end up with left-over capacity after launch, when active subscribers drop to 20% of the base and continue to dwindle until the end of the game's lifecycle. As a result, it would be ideal to get MMOGs into the cloud, but no one in the room knew how to get around the latency induced by virtualization, which is too much for flashy MMOGs (although the "5%-ish" perf hit is fine for asynchronous or low-graphics games). On a side note, iGames http://www.igames.org/ was mentioned as a good way to market games.
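With made-up numbers (nothing from the session), the over-provisioning problem looks roughly like this:

```python
# Hypothetical capacity math for an MMO launch; all figures are invented.
subscribers      = 1000000
launch_active    = 0.50 * subscribers   # up to ~50% of the base shows up on day one
steady_active    = 0.20 * subscribers   # settles near 20% of the base after launch
players_per_host = 2500                 # assumed server density

launch_hosts = launch_active / players_per_host   # 200 servers provisioned for launch
steady_hosts = steady_active / players_per_host   # ~80 actually needed afterwards
print("%d servers of left-over capacity after launch" % (launch_hosts - steady_hosts))
```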
Afterwards, those people that were left went to the Elysian on 1st for drinks, and continued their cloud discussions.
James Hamilton, Amazon Web Services
1200, 12th Ave. S., Seattle, WA, 98144
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 | james@amazon.com
H:mvdirona.com | W:mvdirona.com/jrh/work | blog:http://perspectives.mvdirona.com
Regarding the Drupal site #1 item from the notes: fairly recently I did a deployment of a Drupal site for a TV media company that was capable of well over 2 billion page views a month. I did this "in the cloud." I load tested it to that rate, sustained. The peak it saw in production was reported to me as just under a 2-billion-per-month run rate. Regular steady-state traffic was much lower.
It required the following key actions:
– Add "cacherouter" to Drupal
– Add an in-memory caching cluster (memcached in this case) and use it via cacherouter for most things
– Modify Drupal to split reads and writes. All writes go to a master-master pair (for HA) and all reads go to a load-balanced set of read slaves (replicating directly from one of the masters). This bit is tricky.
– Split off static content (requires modifying Drupal again) to a pair of nginx static asset servers. Not as tricky, but still much more difficult than it should be.
– Use a CDN to offload the static content to "the edge"
– Use PHP opcode caching (via the APC module) with a reasonable configuration for my case
– I also split off admin tasks to a separate box, so they could run on a machine that didn't have live traffic hitting it.
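Those changes live in Drupal's PHP layer and aren't reproduced here; purely to illustrate the static-content split, the idea is to rewrite asset URLs so browsers fetch them from the static servers (or the CDN fronting them) rather than the web heads. A rough sketch, with invented hostnames and paths:

```python
# Rough illustration of splitting static assets off the web heads: rewrite asset
# URLs so they point at the static/nginx hosts (or the CDN in front of them).
# Hostnames and paths are placeholders.
import itertools

STATIC_HOSTS = itertools.cycle(["static1.example-cdn.com", "static2.example-cdn.com"])

def asset_url(path):
    """Map /sites/default/files/logo.png -> http://staticN.example-cdn.com/sites/default/files/logo.png"""
    return "http://%s%s" % (next(STATIC_HOSTS), path)

# templates then emit asset_url("/sites/default/files/logo.png") in their <img> tags
```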
As you will notice, caching is crucial for this sort of brute-force scaling. This isn't the whole enchilada: there are lots of details to deal with, and there were still other things on the to-do list. But in this case it was good enough, and it shows that Drupal can certainly be scaled significantly.
It’s a shame it’s this much trouble though.
Anyway, I published a scaled-down, sanitized version of the basic diagram on my company site if you want to take a look:
http://www.nscaled.com/solutions.php
I've done similar deployments with Django (Python), Rails, and Symfony. The patterns are all similar, and the problems are often similar. What I'd really like to see is CMSs become more elegant and get away from RDBMS models altogether for their primary data store. It's just not necessary any more. But it's also not always easy to plug in a new data tier like that, especially going from an RDBMS to a document store (CouchDB, MemcacheDB, etc.; many choices now).
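For a sense of what a document-oriented primary store looks like at the API level, CouchDB exposes a plain HTTP/JSON interface; a minimal sketch (the database name, document id, and fields are invented for illustration):

```python
# Minimal CouchDB usage over its HTTP interface; names and fields are invented.
import json
import requests

BASE = "http://localhost:5984"
requests.put("%s/articles" % BASE)                  # create the database (fails harmlessly if it exists)

doc = {"title": "Scaling Drupal", "tags": ["cloud", "mysql"], "body": "..."}
requests.put("%s/articles/scaling-drupal" % BASE,   # store the document under an explicit id
             data=json.dumps(doc),
             headers={"Content-Type": "application/json"})

print(requests.get("%s/articles/scaling-drupal" % BASE).json()["title"])
```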
Cheers,
Kent Langley
twitter.com/kentlangley
I was at CloudCamp that Saturday. Those conference rooms have an amazing view.
I led the session referred to as "Scaling Web Apps," at least I think I did. I remember covering a lot of the ground in the notes, but not all of it. It was really well attended, and there were a few side conversations. I remember Jeanine questioning someone about using triggers and more at the side of the room while I was focused on the main discussion.
The issues I was trying to get at might be posed as: "SimpleDB, MegaStore, or run something else on a VPS like EC2? If the latter, what should it be and how should we go about preparing for it?" Unfortunately, no one in the room had gone too far down either fork in that road, or if they had, they didn't speak up. Some people had been playing with AppEngine and SimpleDB with small datasets and very low concurrency, but didn't seem to be making big plans based on that experience.
We discussed ways to push apps built around a relational DB as far as possible with a bag of tricks that included MemcacheD, replicated read slaves, extensive partitioning and denormalization.
I didn't feel like we came up with any rules of thumb as to when it was time to come to terms with the fact that your relational DB wasn't really a relational DB anymore. There was also no real answer on what to do at that point if being tied to SimpleDB, MegaStore, or SQL Server Data Services was unattractive. Someone threw out MongoDB (http://www.mongodb.org/display/DOCS/Home). I think I mentioned lighter-weight key/value stores like MemcacheDB (the well-supported memcache network protocol on top of BerkeleyDB's storage and replication engine), and someone else mentioned CouchDB. The best we could do was "know your app/data," a good starting point, but not really even a rough map for scaling into the clouds.
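To make the MemcacheDB point concrete (nothing like this was demoed in the session): because it speaks the standard memcache protocol, the same client library used for memcached talks to it unchanged; only the endpoint differs. The host and port below are assumptions (21201 is, as far as I recall, MemcacheDB's default):

```python
# MemcacheDB via a standard memcached client; the endpoint is an assumption.
import memcache

kv = memcache.Client(["127.0.0.1:21201"])    # MemcacheDB (BerkeleyDB-backed, persistent)
kv.set("user:42:prefs", {"theme": "dark", "widgets": ["feed", "mail"]})
print(kv.get("user:42:prefs"))
```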
One guy from an unnamed stealth-mode startup in Vancouver, BC, had been experimenting with Hadoop and maybe Hypertable. Someone expressed concerns about latency, but he said it wasn't an issue for their application. They were crunching a bunch of data offline and pushing the digested data up to their webservers, where it was served from the filesystem.
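Their actual pipeline wasn't shared; as a generic illustration of the pattern (crunch data offline with Hadoop, then publish the digested output to the web tier), a word-count-style Hadoop Streaming mapper in Python looks like this, with a matching reducer summing the counts:

```python
#!/usr/bin/env python
# mapper.py for a Hadoop Streaming job: emits "token<TAB>1" per token. Hadoop
# sorts the output by key before handing it to the reducer, which sums counts.
# Launched with something like:
#   hadoop jar hadoop-streaming.jar -input /logs -output /digest \
#       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
import sys

for line in sys.stdin:
    for token in line.split():
        sys.stdout.write("%s\t1\n" % token)
```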
The Drupal site (which shall remain nameless) came up because I asked if anyone was struggling with scaling a traditional LAMP architecture, where the DB was bearing the entire brunt of dealing with concurrency issues. The person running it was the only one to step forward. They'd partitioned their DB by table to push one class of query to a separate server, but it sounded like doing more would require major surgery on Drupal (an extensible open-source PHP community CMS package). Doing a fan-out to multiple replicated read-only slaves sounded like only a minor win, because it sounds like Drupal writes to the DB on every authenticated user's page view to track user activity. This discussion ended up being continued in the next session, "scaling MySql." I cut out of that one early to check out a session that was supposed to be about cloud computing success stories, but ended up being more about cloud computing aspirations.
Two needs seemed clear after attending CloudCamp:
1) There must be a lot of people that need something between a DIY SQL RDBMS failover cluster on a VPS and a super-scalable, pay-as-you-go, non-relational DB in the cloud like SimpleDB or MegaStore. Easy scale-up and scale-down over a range of 5-10x (say from 2 EC2 compute units with 4GB RAM up to the equivalent of 2 XL instances) with their existing app architectures in the cloud would carry a lot of apps over their entire lifespan, and give plenty of others the opportunity to take advantage of cloud economics while revamping their apps for larger scale. It would probably buy the guy with the Drupal site a year or so. Is Microsoft's SQL DB Services going to meet this need? It wasn't clear to me; it sounded like existing apps will need some amount of rework.
2) Better software for scaling SQL DBs in the cloud, better education in the patterns for taking advantage of it, and guidance on when and where to look for non-relational options.
Thanks to Jeanine for the report from the MMOG talk. I missed it because I went to a session on how to encourage more use of cloud resources for scientific computing. Among other things, it sounds like it will take work to overcome institutional barriers that get reinforced by the policies of granting agencies, and departmental politics. Some people also worried about technical constraints, not being assured of low-enough latency to get reliable performance for classes of problems that need tighter coupling between nodes.