James Hamilton's Blog
 Friday, June 27, 2008

Alex Mallet and Viraj Mody of the Windows Live Mesh team took great notes at the Structure ’08 (Put Cloud Computing to Work) conference (appended below).

 

Some pre-reading information was made available to all attendees as well: Refresh the Net: Why the Internet needs a Makeover?

 

Overall

-          Interesting mix of attendees from companies in all areas of “cloud computing”

-          The quality of the presentations and panels was somewhat uneven

-          Talks were not very technical

-          Amazon is the clear leader in mindshare; MS isn’t even on the board

-          Lots of speculation about how software-as-a-service, platform-as-a-service, everything-as-a-service is going to play out: who the users will be, how to make money, whether there will be cloud computing standards etc

 

5 min Nick Carr video [author of “The Big Switch”]

-          Drew symbolic link between BillG retiring and Structure ’08, the first “cloud computing” conference, being the same week, marking the shift of computing from the desktop to the datacenter

-          Generic pontificating on the coming “age of cloud computing”

 

“The Platform Revolution: a look into disruptive technologies”, Jonathan Yarmis, research analyst

-          Enterprises always lag behind consumers in adoption of new technology, and IT is powerless to stop users from adopting new technology

-          4 big tech trends: social networks, mobility, cloud computing, alternative business models [eg ad-supported]

-          Tech trends mutually reinforcing: mobility leads to more social networking applications, being able to access data/apps in the cloud leads to more mobility

-          Mobile is platform for next-gen and emerging markets: 1.4 billion devices per year, 20% device growth per year, average device lifetime 21 months; opens up market to new users and new uses

-          Claim: “single converged device will never exist”; cloud computing enables independence of device and location

-          Stream computing: rate of data creation is growing at 50-500% per year, and it’s becoming increasingly important to be able to quickly process the data, determine what’s interesting and discard the rest

-          “Economic value of peer relationships hasn’t been realized yet” – Facebook Beacon was a good idea, but poorly realized

 

“Virtualization and Cloud Computing”, with Mendel Rosenblum, VMware founder

-          Virtualization can/should be used to decouple software from hardware even in the datacenter

-          Virtualization is cloud-computing enabler: can decide whether to run your own machines, or use somebody else’s, without having to rewrite everything

-          Coming “age of multicore” makes virtualization even more important/useful

-          Smart software that figures out how to distribute VMs over physical hardware isn’t a commodity yet

-          VMware is working on merging the various virtualization layers: machine, storage, networks [eg VLANs]

-          HW support for virtualization is mostly being done for server-class machines [?]

-          Rosenblum doesn’t think moving workloads from the datacenter to edge machines, to take advantage of spare cycles, will ever take off – it’s just too much of a pain to try to harness those spare cycles

-          Single-machine hypervisor is becoming commodity, so VMware is moving to managing the whole datacenter, to stay ahead of the competition

 

Keynote, Werner Vogels, Amazon CTO:

-          Mostly a pitch for Amazon’s web services: EC2, S3, SQS, SimpleDB

-          Gave example of Animoto, company that merges music + photos to create a video, which has no servers whatsoever: had 25K users, launched a Facebook app, and went from 25K users total to adding 25K users/hour; were able to handle it by moving from 50 EC2 instances to 3000 EC2 instances in 2 days

-          Currently 370K registered AWS developers

-          Bandwidth consumed by AWS is bigger than bandwidth consumed by Amazon e-commerce services

-          Shift to service-oriented architecture occurred as result of being approached by Target in 2001/2002, asking whether Amazon could run their e-commerce for them. Realized that their current architecture wouldn’t scale/work, so they re-engineered it

-          Single Amazon page can depend on hundreds of other services

-          Big barrier between developing web app and operating it at scale: loadbalancing, hardware mgmt, routing, storage management etc. Called this the “undifferentiated heavy lifting” that needs to be done to even get in the game

-          Claim: typical company spends 70% effort/money on “undifferentiated heavy lifting” and 30% on differentiated value creation; AWS is intended to allow companies to focus much more on differentiated value creation

-          SmugMug has been at forefront of companies relying on AWS; currently store 600TB of photos in S3, and have launched an entirely new product, SmugVault, based purely on the existence of S3 => AWS not just replacement for existing infrastructure, but enabling new businesses

-          In 2 years, cloud computing will be evaluated along 5 axes: security, availability, scalability, performance, cost-effectiveness

-          Really plugged the pay-as-you-go model

 

“Working the Cloud: next-gen infrastructure for new entrepreneurs” panel

-          Q: is lock-in going to be a problem ie how easy will it be to move an app from one cloud computing platform to another ?

o   A: Strong desire for standards that will make it easy to port apps, but not there yet

o   A: To really use the cloud, you need to embed assumptions about it in your code; even bare-metal clouds require intelligence, like scripts to spin up new EC2 instances, so lock-in is a real concern

o   Side thread: Google person on panel claimed that using Google App Engine didn’t lock in developers, because the GAE APIs are well-documented, and he was promptly verbally mugged by just about everyone else on the panel, pointing out things like the use of BigTable in GAE making it difficult to extract data, or replace the underlying storage layer etc.

o   Prediction: there will be convergence to standards, and choice will come down to whether to use a generic cloud, or a more specialized/efficient cloud, eg one targeted at the medical information sector, with features for HIPAA compliance

-          Need new licensing models for cloud computing, to deal with the dynamic increase/decrease in number of application instances/virtual machines as load changes

-          Tidbit: Google has geographically distributed data centers, and geo-replicates [some] data

-          Q: will we be able to use our old “toys” [APIs, programming models etc] in the cloud ?

o   A: Yes, have to be able to, otherwise people won’t adopt it

o   A:  Yes, just have to be smart about replacing the plumbing underneath the various APIs

o   A: Yes, but current API frameworks are lacking some semantics that become important in cloud computing, like ways to specify how many times an object should be replicated, whether it’s ok to lazily replicate some data etc

 

Mini-note, “Optical networking”, Drew Perkins

-          Video is, by far, the largest consumer of bandwidth on the internet

-          Cost of content is disproportionate to size: 4MB song costs $1, 200MB TV show episode costs $2, 1.5GB movie costs $3-4.

-          Photonic integrated circuits that can be used to build 100Gb/s links are needed to meet future bandwidth requirements: they use less power and require fewer network devices
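
The mismatch between price and size in the figures above is easier to see as a price per megabyte. Here is a quick sketch that uses only the numbers quoted in the talk; the one assumption is that 1.5GB is treated as 1,500MB, and the movie price is kept as the quoted $3-4 range.

```python
# Price per megabyte for the content examples quoted in the talk.
# Each entry: size in MB, then the (low, high) price in US dollars.
# Assumption: 1.5GB is treated as 1,500MB.
content = {
    "song": (4, (1.0, 1.0)),
    "TV episode": (200, (2.0, 2.0)),
    "movie": (1500, (3.0, 4.0)),
}


def price_per_mb(size_mb, price_range):
    """Return the (low, high) price per MB for one item."""
    low, high = price_range
    return low / size_mb, high / size_mb


per_mb = {
    name: price_per_mb(size_mb, prices)
    for name, (size_mb, prices) in content.items()
}

for name, (low, high) in per_mb.items():
    print(f"{name}: ${low:.4f} to ${high:.4f} per MB")
```

On these numbers a song costs 25 cents per MB while a movie costs well under a cent per MB, a gap of roughly 100x.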

 

“Race to the next database” panel

-          Quite poorly organized: panelists each got to give an [uninformative] infomercial for their company, and there was very little time for actual questions and discussion

-          Aster Data Systems is back-end data warehouse and analytics system for MySpace: 1 billion impressions/day, 1 TB of new data per day, new data is loaded into 100-node Aster cluster every hour and needs to be available for ad analytics engine to decide which ads to show

-          SQLStream is company that has built a data stream processing product that collapses the usual processing stages [data staging, cleaning, loading etc] into a pipeline that continuously produces results for “standing” queries; useful for real-time analytics

-          Web causes disruption to traditional DB model because of [10x] larger data volumes, need for high interactivity/turn-around, need to scale out instead of up. For example, Greenplum is building a 20PB data warehouse for one customer.

-          Can’t rely on all the data being in a single store, so need to be able to do federated/distributed queries

 

“MS Datacenters”

-          Presentation centered on MS plans for datacenters-in-a-box

-          Datacenters-in-a-container are long-term strategy for MS, not just transient response to high demand and lack of space

-          Container blocks have lower reliability than traditional datacenters, so applications need to be geo-distributed and redundant to handle downtime

 

“End of boxed software”, Parker Harris, co-founder of Salesforce.com

-          Origins of salesforce.com: Modeled on consumer internet sites - amazon.com; ebay.com

-          Transition from client->server site to a platform (force.com): first instinct is to build a platform, but then you lose touch with why you're building it. As they started building their experience, they abstracted away components and started realizing it could become a platform. Revenue comes from site, platform is a bonus.

-          Initially scaled by buying bigger [Sun] boxes ie scaled up, not out, and ran into lots of complexity. Unclear whether that’s still the case or whether they’ve re-architected.

 

“Scaling to satiate demand” panel

-          Q: “When did you first realize your architecture was broken, and couldn’t scale ?”

o   A: When site started to get slow; eBay: after massive site outages

-          Q: “How do you handle running code supplied by other people on your servers ?”

o   A: Compartmentalize ie isolate apps; have mgmt infrastructure and tooling to be able to monitor and control uploaded apps; provide developers with fixed APIs and tools so you can control what they do

-          Q: “How do Facebook and Slide [builds Facebook apps] figure out where the problems are if Slide starts failing ?”

o   A: Lots of real-time metrics; ops folks from both companies are in IM contact and do co-operative troubleshooting

-          Q: “How should you handle PR around outages ?”

o   A: Be transparent; communicate; set realistic timelines for when site will be back up; set expectations wrt “bakedness” of features

-          Beware of retrying failed operations too soon, since retries may cause an overloaded system to never be able to come up

-          eBay: each app is instrumented with the same set of logging infrastructure and there’s a real-time OLAP engine that analyzes the logs and does correlation to try to find trouble spots

-          Facebook and Meebo both utilized their user base to translate their sites into multiple languages

-          Need to know which bits of the system you can turn off if you run into trouble

-          Biggest challenge is scaling features that involve many-many links between users; it’s easy to scale a single user + data

-          Keep monitoring: there are always scale issues to find, even without problems/outages

-          Slide: “Firefox 3 broke all of our apps”

-          Facebook has > 10K servers
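
The warning above about retrying failed operations too soon is commonly handled with exponential backoff plus random jitter. The panel did not name a specific technique, so the following is only a sketch of that common approach; `backoff_delays`, `retry_with_backoff`, and the default delays are hypothetical names and values chosen for illustration.

```python
import random
import time


def backoff_delays(retries, base_delay=0.5, max_delay=30.0, rng=random.random):
    """Yield one randomized delay per retry, doubling the ceiling each time.

    Each delay is drawn uniformly from [0, ceiling] so that many clients
    that failed together do not all retry at the same instant.
    """
    for attempt in range(retries):
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        yield rng() * ceiling


def retry_with_backoff(operation, retries=4, sleep=time.sleep, **backoff_options):
    """Call operation() once, then retry up to `retries` times with backoff."""
    delays = backoff_delays(retries, **backoff_options)
    while True:
        try:
            return operation()
        except Exception:  # illustrative; real code should catch narrower errors
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last failure
            sleep(delay)
```

The growing, randomized gaps give an overloaded service room to recover instead of being hit by a synchronized wave of retries.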

 

Mini-note, “Creating fair bandwidth usage on the Internet”, Dr. Lawrence Roberts, leader of the original ARPANET team

-          P2P leads to unfair usage: people not using P2P get less; 5% of users (P2P users) receive 80% of capacity

-          Deep packet inspection catches 75% of P2P traffic, but isn’t effective in creating fairness

-          Anagran has flow behavior mgmt: observe per user flow behavior & utilization and then equalize. Equalization is done in memory, on networking infrastructure [routers etc] and at the user level instead of the flow level
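
The distinction between equalizing at the flow level and at the user level can be shown with a toy calculation. This is not Anagran's algorithm, just a sketch of the idea; the link capacity, user names, and flow counts are made up, and every flow is assumed to want as much bandwidth as it can get.

```python
def share_per_flow(capacity, flows_by_user):
    """Split capacity equally across flows, then total it up per user."""
    total_flows = sum(flows_by_user.values())
    per_flow = capacity / total_flows
    return {user: per_flow * flows for user, flows in flows_by_user.items()}


def share_per_user(capacity, flows_by_user):
    """Split capacity equally across active users, ignoring flow counts."""
    active = [user for user, flows in flows_by_user.items() if flows > 0]
    per_user = capacity / len(active)
    return {
        user: (per_user if flows > 0 else 0.0)
        for user, flows in flows_by_user.items()
    }


# Hypothetical 100 Mbit/s link: one P2P user with 40 open flows
# and three users with a single flow each.
flows = {"p2p_user": 40, "web_user_a": 1, "web_user_b": 1, "web_user_c": 1}

by_flow = share_per_flow(100.0, flows)
by_user = share_per_user(100.0, flows)

for user in flows:
    print(f"{user}: per-flow {by_flow[user]:.1f} Mbit/s, per-user {by_user[user]:.1f} Mbit/s")
```

With per-flow sharing the P2P user ends up with 40/43 of the link, about 93%; with per-user sharing each of the four users gets 25 Mbit/s.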

 

Mini-note, “Cloud computing and the mid-market”, Zach Nelson, CEO of NetSuite

-          Mid-market is the last great business applications opportunity

-          Cloud computing makes it economical to reach the “Fortune 5 million”

-          Cloud computing still doesn’t solve problem of application integration

-          Consulting services industry is next to be transformed by cloud computing

 

Mini-note, “Electricity use in datacenters”, Dr. Jonathan Koomey

-          Site Uptime Network, an organization of data center operators and designers, did a study of 19 datacenters from 1999-2006:

o   Floor area remained relatively constant

o   Power density went from 23 W/sq ft to 35 W/sq ft

-          In 2000, datacenters used 0.5% of world’s electricity; in 2005, used 1%.

-          Cooling and power distribution are largest electricity consumers; servers are second-largest; storage and networking equipment accounts for a small fraction

-          Asia-Pacific region’s use of power is increasing the fastest, over 25% growth per year

-          Lots of inefficiencies in facility design: wrong cost metrics [sq feet versus kW], different budgets and costs borne by different orgs [facilities vs. IT], multiple safety factors piled on top of each other

-          Designed Eco-Rack, which, with only a few months of work, reduces power consumption on normalized workload by 16-18%

-          Forecast: datacenter electricity consumption will grow by 76% by 2010, maybe a bit less with virtualization

 

“VC investment in cloud computing infrastructure” panel

-          Overall thesis of panel was that VCs are not investing in infrastructure

-          VCs disagreed with panel theme, and said that it depended on the definition of infrastructure; said they are investing in infrastructure, but it’s moving higher in the stack, like Heroku [?]

-          HW infrastructure requires serious investment, large teams, and long time-frame – not a good fit with VC investment model

-          Any companies that want to build datacenters or commodity storage and compute services are not  a good investment – there are established, large competitors, and it’s very expensive to compete in that space

-          Infrastructure needed for really large scale [like a 400 Gbit/sec switch] has a pretty small market, which makes it hard to justify the investment. If there’s a small market, the buyers all know they’re the only buyers and exert large downward pressure on price, which makes it hard for company to stay in business

-          Quote: “any company that’s doing something worthwhile, and building something defensible, will take at least 24 months to develop”

 

James Hamilton, Data Center Futures
Bldg 99/2428, One Microsoft Way, Redmond, Washington, 98052
W:+1(425)703-9972 | C:+1(206)910-4692 | H:+1(206)201-1859 |
JamesRH@microsoft.com

H:mvdirona.com | W:research.microsoft.com/~jamesrh  | blog:http://perspectives.mvdirona.com

 

Friday, June 27, 2008 5:45:07 AM (Pacific Standard Time, UTC-08:00)
Services
 Wednesday, June 25, 2008

John Breslin did an excellent job of writing up Kai-Fu Lee’s Keynote at WWW2008.  John’s post: Dr. Kai-Fu Lee (Google) – “Cloud Computing”. 

 

There are 235m internet users in China and Kai-Fu believes they want:

1.       Accessibility

2.       Support for sharing

3.       Access data from wherever they are

4.       Simplicity

5.       Security

 

He argues that Cloud Computing is the best answer for these requirements.  He defined the key components of what he is referring to as the cloud to be: 1) data stored centrally without need for the user to understand where it actually is, 2) software and services also delivered from the central location and delivered via browser, 3) built on open standards and protocols (Linux, AJAX, LAMP, etc.) to avoid control by one company, and 4) accessible from any device especially cell phones.  I don’t follow how the use of Linux in the cloud will improve or in any way change the degree of openness and the ease with which a user could move to a different provider.  The technology base used in the cloud is mostly irrelevant. I agree that open and standard protocols are both helpful and a good thing.

 

Kai-Fu then argues that what he has defined as the cloud has been technically possible for decades but three main factors make it practical today:

1.       Falling cost of storage

2.       Ubiquitous broadband

3.       Good development tools available cost effectively to all

 

He enumerated six properties that make this area exciting: 1) user centric, 2) task centric, 3) powerful, 4) accessible, 5) intelligent, and 6) programmable.  He went through each in detail (see John’s posting).  In my read I just spent time on the data provided on GFS and Bigtable scaling, hardware selection, and failure rates that were sprinkled throughout the remainder of the talk:

·         Scale-out: he argues that when comparing a $42,000 high-end server to the same amount spent on $2,500 servers, the commodity scale-out solution is 33x more efficient.  That seems like a reasonable number but I would be amazed if Google spent anywhere near $2,500 for a server.  I’m betting on $700 to perhaps as low as $500. See Jeff Dean on Google Hardware Infrastructure for a picture of what Jeff Dean reported to be the current internally designed Google server.

·         Failure management. Kai-Fu stated that a farm of 20,000 servers will have 110 failures per day.  This is a super interesting data point from Google in that failure rates are almost never published by major players. However, 110 per day on a population of 20k servers is ½% a day, which seems impossibly high.  That says, on average, the entire farm is turned over in 181 days.  No servers are anywhere close to that unreliable so this failure data must be of all types of failures whether software or hardware. When including all types of issues, the ½% number is perfectly credible.  Assuming their current server population is roughly one million, they are managing 5,500 failures per day requiring some form of intervention.  It’s pretty clear why auto-management systems are needed at anything even hinting at this scale. It would be super interesting to understand how many of these are recoverable software errors, recoverable hardware errors (memory faults etc.), and unrecoverable hardware errors requiring service or replacement. 

·         He reports there are “around 200 Google File System (GFS) clusters in operation. Some have over 5 PB of disk space over 200 machines.”   That ratio is about 10TB per machine.  Assuming they are buying 750GB disks, that’s just over 13 disks.  I’ve argued in the past that a good service design point is to build everything on two hardware SKUs: 1) data light, and 2) data heavy.  Web servers and mid-tier boxes run the former and data stores run the latter.  One design I like uses the same server board for both SKUs with 12 SATA disks in SAS attached disk modules.  Data light is just the server board.  Data heavy is the server board coupled with 1 or optionally more disk modules to get 12, 24, or even 36 disks for each server. Cheap cold storage needs high disk to server ratios.

·         “The largest Big table cells are 700TBs over 2,000 servers.”  I’m surprised to see two thousand reported by Kai-Fu as the largest BigTable cell – in the past I’ve seen references to over 5k. Let’s look at the storage to server ratio since he offered both.  700TB storage spread over 2k servers is only 350 GB per node. Given that they are using SATA disks, that would be only a single disk and a fairly small one at that.  That seems VERY light on storage. BigTable is a semi-structured storage layer over GFS.  I can’t imagine a GFS cluster with only 1 disk/server so I suspect the 2,000 node BigTable cluster that Kai-Fu described didn’t include the GFS cluster that it’s running over.  That helps but the numbers are still somewhat hard to make work.  These data don’t line up well with what’s been published in the past nor do they appear to be the most economic configurations.
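
For anyone who wants to check the arithmetic in the bullets above, here is a small sketch. The inputs are the figures quoted from the talk plus two assumptions that come from this post rather than from Google: the one million server fleet size and the 750GB disk size. It also assumes decimal units (1TB = 1,000GB).

```python
# Failure figures quoted from the talk.
servers = 20_000
failures_per_day = 110

daily_failure_rate = failures_per_day / servers   # fraction of the farm failing per day
days_to_turn_over = servers / failures_per_day    # days until failures equal the farm size

# Assumption from the post, not from Google: a fleet of about one million servers.
assumed_fleet = 1_000_000
fleet_failures_per_day = assumed_fleet * daily_failure_rate

# Storage ratios, using 1TB = 1,000GB.
gfs_gb_per_machine = 10_000   # the "about 10TB per machine" figure from the post
assumed_disk_gb = 750         # assumption from the post
disks_per_gfs_machine = gfs_gb_per_machine / assumed_disk_gb
bigtable_gb_per_server = 700_000 / 2_000          # 700TB over 2,000 servers

print(f"daily failure rate: {daily_failure_rate:.2%}")
print(f"farm turned over in about {int(days_to_turn_over)} days")
print(f"failures per day at assumed fleet size: {fleet_failures_per_day:,.0f}")
print(f"disks per GFS machine: {disks_per_gfs_machine:.1f}")
print(f"BigTable storage per server: {bigtable_gb_per_server:.0f} GB")
```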

 

Thanks to Zac Leow for sending this pointer my way.

 

                                --jrh

 


 

Wednesday, June 25, 2008 8:06:53 AM (Pacific Standard Time, UTC-08:00)
Services
 Tuesday, June 24, 2008

Earlier today Nokia announced it will acquire the remaining 52% share of Symbian Limited to take over controlling interest of the mobile operating system provider with 91% of the outstanding shares.  This alone is interesting but what is fascinating is they also announced their intention to open source Symbian to create “the most attractive platform for mobile innovation and drive the development of new and compelling web-enabled applications”.  The press release reports the acquisition will be completed at 3.647 EUR/share at a total cost of 264m EUR. The new open source project responsible for the Symbian operating systems will be managed by the newly set up Symbian Foundation with support announced by Nokia, AT&T, Broadcom, Digia, NTT docomo, EA Mobile, Freescale, Fujitsu, LG, Motorola, Orange, Plusmo, Samsung, Sony Ericsson, ST, Symbian, Teleca, Texas Instruments, T-Mobile, Vodafone, and Wipro.

 

Other articles on the acquisition:

·         http://www.techcraver.com/2008/06/23/huge-news-nokia-acquires-symbian/

·         http://www.readwriteweb.com/archives/nokia_acquires_symbian.php

 

This substantially changes the mobile operating system world with all major offerings other than Windows Mobile, iPhone, and RIM now available in open source form.  The timing of this acquisition strongly suggests that it’s a response to a perceived threat from Google Android ensuring that, even if Android never gets substantial market traction, it’s already had a lasting impact on the market.

 

                                                                --jrh

 

Sent my way by Eric Schoonover

Update: Added iPhone to proprietary mobile O/S list (thanks Dare Obasanjo).

 


 

Tuesday, June 24, 2008 4:01:40 AM (Pacific Standard Time, UTC-08:00)
Ramblings
 Wednesday, June 18, 2008

Lars Bak leads the Google Aarhus Denmark lab. He’s one of the original developers of the Sun HotSpot Java VM, the Self programming language, and the Sun Connected Limited Device Configuration VM for mobile phones.  He’s scheduled to give a talk at JAOO in Aarhus, Denmark (Sept. 30, 2008).  Unconfirmed rumors report he will be announcing a “Google Secret Project” during his JAOO keynote.

 

It’s hard to know for sure what is coming but the popular speculation is that Google will be announcing a dynamic language runtime with support for Python, JavaScript, and Java. A language runtime running on both server-side and client-side with support for a broad range of client devices including mobile phones would be pretty interesting.

 

                                --jrh

 

John Lam pointed me to: https://secure.trifork.com/speaker/Lars+Bak.

 


Wednesday, June 18, 2008 6:54:03 AM (Pacific Standard Time, UTC-08:00)
Services
 Saturday, June 14, 2008

Earlier today Google hosted the second Seattle Conference on Scalability. The talk on Chapel was a good description of a parallel language for high performance computing being implemented at Cray.  The GIGA+ talk described a highly scalable filesystem metadata system implemented in Garth Gibson’s lab at CMU. The Google presentation described how they implemented maps.google.com on various mobile devices. It was full of gems on managing device proliferation and scaling the user interface down to small screen sizes. 

 

The Wikipedia talk showed an interesting transactional DHT implementation using Erlang.  And, the last talk of the day was a well-presented talk by Vijay Menon on transactional memory. My rough notes from the one-day conference follow.

 

                                    --jrh

 

 

Kick-off by Brian Bershad, Director of Google Seattle Lab

 

Communicating like Nemo: Scalability from a fish-eye View

·         Speaker: Jennifer Wong (Masters Student from University of Victoria)

o   Research area: fault tolerance in mobile collaborative systems

·         Research aims to bring easy, low-cost communication among more than two people to SCUBA divers.

·         Fairly large market with requirement to communicate

o   Note that PADI has 130k members (there are many other recreational diving groups and, of course, there are commercial groups as well)

·         Proposal: use acoustic links from underwater units to surface stations, with wireless links between surface stations acting as relays.

·         Acoustic unit to be integrated into dive computer.

 

Maidsafe: Exploring Infinity

·         Speaker: David Irvine, Maidsafe.net Ltd.

o   http://www.maidsafe.net/ 

·         Problems with existing client systems

o   Data gets lost

o   Encryption is difficult

o   Remote access is difficult

o   Syncing between devices is hard

·         Proposal: chunk files, hash the chunks, xor and then encrypt the chunks and distribute over a DHT over many systems

o   It wasn’t sufficiently clear where the “random data” that was xored in came from or where it was stored.

o   It wasn’t sufficiently clear where the encryption key that was used in the AES encryption came from or was stored.
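
To make the proposal above concrete, here is a rough sketch of the chunk, xor, hash, and store steps. It is not Maidsafe's implementation. Since the talk left the source of the xor data unclear, the sketch simply assumes a random pad from `os.urandom` that is kept in a local manifest; a plain dict stands in for the DHT; and the AES step is only marked with a comment because Python's standard library has no AES. The names `store`, `load`, and `CHUNK_SIZE` are hypothetical.

```python
import hashlib
import os

CHUNK_SIZE = 4  # unrealistically small, for illustration only


def split_chunks(data, size=CHUNK_SIZE):
    """Cut the data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def xor_bytes(data, pad):
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, pad))


def store(data, dht):
    """Chunk, obfuscate, and store the data; return the manifest needed to read it back."""
    manifest = []
    for chunk in split_chunks(data):
        pad = os.urandom(len(chunk))   # assumption: the talk did not say where this comes from
        obfuscated = xor_bytes(chunk, pad)
        # The AES encryption step described in the talk would go here (omitted).
        key = hashlib.sha256(obfuscated).hexdigest()  # content hash doubles as the lookup key
        dht[key] = obfuscated
        manifest.append((key, pad))
    return manifest


def load(manifest, dht):
    """Fetch each chunk by its hash, undo the xor, and reassemble the data."""
    return b"".join(xor_bytes(dht[key], pad) for key, pad in manifest)
```

Note that in this sketch whoever holds the manifest holds everything needed to recover the data, which is exactly the key-storage question the talk left open.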