Sunday, January 29, 2012

Don't be a show-off. Never be too proud to turn back. There are old pilots and bold pilots, but no old, bold pilots.

 

I first heard the latter part of this famous quote made by US Airmail Pilot E. Hamilton Lee back when I raced cars. At that time, one of the better drivers in town, Gordon Monroe, used a variant of that quote (with pilots replaced by racers) when giving me driving advice. Gord’s basic message was that it is impossible to win a race if you crash out of it.

 

Nearly all of us have taken the odd chance and made some decisions that, in retrospect, just didn’t make sense from a risk vs reward perspective. Age and experience clearly helps but mistakes still get made and none of us are exempt. Most people’s mistakes at work don’t have life safety consequences and their mistakes are not typically picked up widely by the world news services as was the case in the recent grounding of the Costa Concordia cruise ship. But, we all make mistakes.

 

I often study engineering disasters and accidents in the belief that understanding mistakes, failures, and accidents deeply is a much lower cost way of learning.  My last note on this topic was What Went Wrong at Fukushima Dai-1 where we looked at the nuclear release following the 2011 Tohuku Earthquake and Tsunami

 

Living on a boat and cruising extensively (our boat blog is at http://blog.mvdirona.com/) makes me particularly interested in the Costa Concordia incident of January 13th 2012. The Concordia is a 114,137 gross ton floating city that cost $570m when it was delivered in 2006. It is 952’ long, has 17 decks, and is power by 6 Wartsila diesel engines with a combined output of 101,400 horse power. The ship is capable of 23 kts (26.5 mph) and has a service speed of 21 kts. At capacity, it carries 3,780 passengers with a crew of 1,100.

 

From: http://en.wikipedia.org/wiki/Costa_Concordia_disaster:

 

The Italian cruise ship Costa Concordia partially sank on Friday the 13th of January 2012 after hitting a reef off the Italian coast and running aground at Isola del Giglio, Tuscany, requiring the evacuation of 4,197 people on board. At least 16 people died, including 15 passengers and one crewman; 64 others were injured (three seriously) and 17 are missing. Two passengers and a crewmember trapped below deck were rescued.

 

The captain, Francesco Schettino, had deviated from the ship's computer-programmed route in order to treat people on Giglio Island to the spectacle of a close sail-past. He was later arrested on preliminary charges of multiple manslaughter, failure to assist passengers in need and abandonment of ship. First Officer Ciro Ambrosio was also arrested.

 

It is far too early to know exactly what happened on the Costa Concordia and, because there was loss of life and considerable property damage, the legal proceedings will almost certainly run for years. Unfortunately, rather than illuminating the mistakes and failures and helping us avoid them in the future, these proceedings typically focus on culpability and distributing blame. That’s not our interest here. I’m mostly focused on what happened and getting all the data I could find on the table to see what lessons the situation yields.

 

A fellow boater, Milt Baker pointed me towards an excellent video that offers considerable data into exactly what happened in the final 1 hour and 30 min. You can find the video at: Grounding of Costa Concordia. Another interesting data source is the video commentary available at: John Konrad Narrates the Final Maneuvers of the Costa Concordia. In what follows, I’ve combined snapshots of the first video intermixed with data available from other sources including the second video.

 

The source data for the two videos above is a wonderful safety system called Automatic Identification System. AIS is a safety system required on larger commercial craft and also used on many recreational boats as well. AIS works by frequently transmitting (up to every 2 seconds for fast moving ships) via VHF radio the ships GPS position, course, speed, name, and other pertinent navigational data. Receiving stations on other ships automatically plot transmitting AIS targets on electronic charts. Some receiving systems are also able to plot an expected target course and compute the time and location of the estimated closest point of approach. AIS an excellent tool to help reduce the frequency of ship-to-ship collisions.

 

Since AIS data is broadcast over VHF radio, it is widely available to both ships and land stations and this data can be used in many ways. For example, if you are interested in the boats in Seattle’s Elliott Bay, have a look at MarineTraffic.com and enter “Seattle” as the port in the data entry box near the top left corner of the screen (you might see our boat Dirona there as well).

 

AIS data is often archived and, because of that, we have a very precise record of the Costa Concordia’s course as well as core navigational data as it proceeded towards the rocks. In the pictures that follow, the red images of the ship are at the ship’s position as transmitted by the Costa Concordia’s AIS system. The black line between these images is the interpolated course between these known locations. The video itself (Costa Concordia Interpolated.wmv) uses a roughly 5:1 time compression.

In this screen shot, you can see the Concordia already very close to the Italian Isol del Giglio. From the BBC report the Captain has said he turned too late (Costa Concordia: Captain Schettino ‘Turned Too Late’). From that article:

 

According to the leaked transcript quoted by Italian media, Capt Schettino said the route of the Costa Concordia on the first day of its Mediterranean cruise had been decided as it left the port of Civitavecchia, near Rome, on Friday.

 

The captain reportedly told the investigating judge in the city of Grosseto that he had decided to sail close to Giglio to salute a former captain who had a home on the Tuscan island. "I was navigating by sight because I knew the depths well and I had done this maneuver three or four times," he reportedly said.

 

"But this time I ordered the turn too late and I ended up in water that was too shallow. I don't know why it happened."

 

In this screen shot of the boat at 20:44:47 just prior to the grounding, you can see the boat turned to 348.8 degrees but the massive 114,137 gross ton vessel is essentially plowing sideways through the water on a course of 332.7 degrees. The Captain can and has turned the ship with the rudder but, at 15.6 kts, it does not follow the exact course steered with inertia tending to widen and straiten the intended turn. 

 

Given the speed of the boat and nearness of shore at this point, the die is cast and the ship is going to hit ground.

 

This screen shot was taken is just past the point of impact. You will note that it has slowed to 14.0 kts. You might also notice the Captain is turning aggressively to the starboard. He has the ship turned to a 8.9 degrees heading whereas the actual ships course lags behind at 356.2 degrees.

 

This screen shot is only 44 seconds after the previous one but the boat has already slowed from 14.0 kts to 8.1 and is still slowing quickly.  Some of the slowing will have come from the grounding itself but passengers report that they heard the boat hard astern after the grounding.

 

You can also see the captain has swung the helm over from the starboard course he was steering trying to avoid the rocks over to port course now that he has struck them. This is almost certainly in an effort to minimize damage. What makes this (possibly counter-intuitive) decision a good one is the ships pivot point is approximately 1/3 of the way back from the bow so turning to port (towards the shore) will actually cause the stern to rotate away from the rocks they just struck.

 

The ship decelerated quickly to just under 6.0 knots but, in the two minutes prior to this screen shot, it has only slowed a further 0.9 kts down to 5.1. There were reports of a loss of power on the Concordia. Likely what happened is ship was hard astern taking off speed until a couple of minutes prior to this screen shot when water intrusion caused a power failure. The ship is a diesel electric and likely lost power to its main prop due to rapid water ingress.

 

At 5 kts and very likely without main engine power, the Concordia is still going much too quickly to risk running into the mud and sand shore so the Captain now turns hard away from shore and he is heading back out into the open channel.

 

With the helm hard over the starboard with the likely assistance of the bow thrusters the ship is turning hard which is pulling speed off fairly quickly. It is now down to 3.0 kts and it continues to slow.

 

The Concordia is now down to 1.6 kts and the Captain is clearly using the bow thrusters heavily as the bow continues to rotate quickly. He has now turned to a 41 degree heading.

 

It now has been just over 29 min since the ship first struck the rocks. It has essentially stopped and the bow is being brought all the way back round using bow thrusters in an effort to drive the ship back in towards shore presumably because the Captain believes it is at risk of sinking so he is seeking shallow water.

 

The Captain continues to force the Concordia to shore under bow thruster power. In this video narrative (John Konrad Narrates the Final Maneuvers of the Costa Concordia), the commentator reported that the combination of bow thrusters and the prevailing currents where being used in combination by the Captain to drive the boat into shore.

 

A further 11 min and 22 seconds have past and the ship has now accelerated back up to 0.9 kts now heading towards shore.

 

It has been more than an hour and 11 minutes since the original contact with the rocks and the Costa Concordia is now at rest in its final grounding point.

 

The Coast Guard transcript of the radio communications with the Captain are at Costa Concordia Transcript: Coastguard Orders Captain to return to Stricken Ship. In the following text De Falco is the Coast Guard Commander and Schettino is the Captain of the Costa Concordia:

 

De Falco: "This is De Falco speaking from Livorno. Am I speaking with the commander?"

Schettino: "Yes. Good evening, Cmdr De Falco."

De Falco: "Please tell me your name."

Schettino: "I'm Cmdr Schettino, commander."

De Falco: "Schettino? Listen Schettino. There are people trapped on board. Now you go with your boat under the prow on the starboard side. There is a pilot ladder. You will climb that ladder and go on board. You go on board and then you will tell me how many people there are. Is that clear? I'm recording this conversation, Cmdr Schettino …"

Schettino: "Commander, let me tell you one thing …"

De Falco: "Speak up! Put your hand in front of the microphone and speak more loudly, is that clear?"

Schettino: "In this moment, the boat is tipping …"

De Falco: "I understand that, listen, there are people that are coming down the pilot ladder of the prow. You go up that pilot ladder, get on that ship and tell me how many people are still on board. And what they need. Is that clear? You need to tell me if there are children, women or people in need of assistance. And tell me the exact number of each of these categories. Is that clear? Listen Schettino, that you saved yourself from the sea, but I am going to … really do something bad to you … I am going to make you pay for this. Go on board, (expletive)!"

Schettino: "Commander, please …"

De Falco: "No, please. You now get up and go on board. They are telling me that on board there are still …"

Schettino: "I am here with the rescue boats, I am here, I am not going anywhere, I am here …"

De Falco: "What are you doing, commander?"

Schettino: "I am here to co-ordinate the rescue …"

De Falco: "What are you co-ordinating there? Go on board! Co-ordinate the rescue from aboard the ship. Are you refusing?"

Schettino: "No, I am not refusing."

De Falco: "Are you refusing to go aboard, commander? Can you tell me the reason why you are not going?"

Schettino: "I am not going because the other lifeboat is stopped."

De Falco: "You go aboard. It is an order. Don't make any more excuses. You have declared 'abandon ship'. Now I am in charge. You go on board! Is that clear? Do you hear me? Go, and call me when you are aboard. My air rescue crew is there."

Schettino: "Where are your rescuers?"

De Falco: "My air rescue is on the prow. Go. There are already bodies, Schettino."

Schettino: "How many bodies are there?"

De Falco: "I don't know. I have heard of one. You are the one who has to tell me how many there are. Christ!"

Schettino: "But do you realize it is dark and here we can't see anything …"

De Falco: "And so what? You want to go home, Schettino? It is dark and you want to go home? Get on that prow of the boat using the pilot ladder and tell me what can be done, how many people there are and what their needs are. Now!"

Schettino: "… I am with my second in command."

De Falco: "So both of you go up then … You and your second go on board now. Is that clear?"

Schettino: "Commander, I want to go on board, but it is simply that the other boat here … there are other rescuers. It has stopped and is waiting …"

De Falco: "It has been an hour that you have been telling me the same thing. Now, go on board. Go on board! And then tell me immediately how many people there are there."

Schettino: "OK, commander."

De Falco: "Go, immediately!"

 

At least 16 died in the accident and 17 were still missing when this was written (Costa Concordia Disaster).The Captain of the Costa Concordia, Francesco Schettino, has been charged with manslaughter and abandoning ship.

 

At the time of the grounding, the ship was carrying 2,200 metric tons of heavy fuel oil and 185 metric tons of diesel and remains environmental risk remains (Costa Concordia Salvage Experts Ready to Begin Pumping Fuel from Capsized Cruise Ship Off Coast of Italy). The 170 year old salvage firm Smit Salvage will be leading the operation.

 

All situations are complex and few disasters have only a single cause. However, the facts as presented to this point pretty strongly towards pilot error as the primary contributor in this event.  The Captain is clearly very experienced and his ship handling after the original grounding appear excellent. But, it’s hard to explain why the ship was that close to the rocks, the captain has reported that he turned too late, and public reports have him on the phone at or near the time of the original grounding.

 

What I take away from the data points presented here is that experience, ironically,  can be our biggest enemy. As we get increasingly proficient at a task, we often stop paying as much attention. And, with less dedicated focus on a task, over time, we run the risk of a crucial mistake that we probably wouldn’t have made when we were effectively less experienced and perhaps less skilled. There is danger in becoming comfortable.

 

The videos referenced in the above can be found at:

·         Grounding of Costa Concordia Interpolated

·         gCaptain’s John Konrad Narrates the Final Maneuvers of the Costa Concordia

 

If you are interested in reading more:

·         http://www.masslive.com/news/index.ssf/2012/01/costa_concordia_salvage_expert.html

·         http://www.bbc.co.uk/news/world-europe-16620807

·         http://www.washingtonpost.com/world/divers-in-grounded-costa-concordia-112/2012/01/25/gIQAOkD2PQ_video.html

·         http://www.bbc.co.uk/news/world-europe-16620807

·         http://www.foxnews.com/slideshow/world/2012/01/14/luxury-ship-runs-aground-off-italy-bodies-found/#slide=22

·         http://www.telegraph.co.uk/news/worldnews/europe/italy/9042826/Wife-of-Costa-Concordia-captain-says-it-is-not-for-those-on-land-to-judge-her-husband.html

·         http://www.telegraph.co.uk/news/interactive-graphics/9018076/Concordia-How-the-disaster-unfolded.html

·         http://news.qps.nl.s3.amazonaws.com/Grounding+Costa+Concordia.pdf

·         http://www.bellenews.com/2012/01/14/world/europe-news/italian-captain-of-costa-concordia-cruise-ship-has-been-arrested/

·         http://en.wikipedia.org/wiki/Costa_Concordia

·         http://en.wikipedia.org/wiki/Costa_Concordia_disaster

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, January 29, 2012 11:24:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [8] - Trackback
Ramblings
 Thursday, January 26, 2012

Ordinarily I focus this blog on areas of computing where I spend most of my time from high performance computing to database internals and cloud computing. An area that interests me greatly  but I’ve seldom written about is entrepreneurship and startups.

 

One of the Seattle areas startups with which I stay in touch is Socrata. They are focused on enabling federal, state, and local governments to improve the reach, usability and social utility of their public information assets.  Essentially making public information available and useful to their constituents. They are used by: the World Bank, the United Nations, the World Economic Forum, the US Data.Gov, Health & Human Services, Centers for Disease Control, several most major cities including NYC, Seattle, Chicago, San Francisco and Austin and many county and state governments. Even foreign governments like the Country of Kenya have adopted Socrata.

 

I first met Kevin Merritt, the founder and CEO of Socrata, back in 2005 when I was doing technical diligence for the Microsoft acquisition of the LA-based Frontbridge Technologies. I love doing diligence on startups because it’s an opportunity to dive in and spend a day or more digging deeply and understanding what smart people have produced, where things worked really well, and areas where things didn’t pan out as well as they could have. I’ve learned a lot in these roles and I’m  lucky to have been able to do many of them first at IBM, later at Microsoft, and now at Amazon.

 

What made this one a bit different is I got a call shortly after the deal closed asking if I wanted to be the General Manager of the Microsoft subsidiary that was formed in the acquisition. An opportunity to run mid-sized business in its entirety. Development, test, operations, and customer support. Absolutely! I’ve never learned so much as I did in the first year or so at what would become Microsoft Exchange Hosted Services.

 

It was a great experience and I’ve been 100% focused on cloud services since that time. And, as a consequence of leading Frontbridge, I got to know Kevin Merritt well. He is an excellent strategic thinker and an even better operator. Whenever Kevin was involved, customers were happy and the service was rapidly improving and expanding.  Kevin eventually left to form Socrata and he and I have stayed in touch since then. He knows I’m a sucker for a beer and some wings :-).

 

Based in Seattle, Socrata is venture-backed with a small and talented engineering team.  They are enjoying strong customer demand and their market success is fueling growth in the engineering team. They are currently looking for a CTO and, if I didn’t already have one of the best job out there, I would seriously considering joining Kevin and the team.  If you are a technology leader interested in big data, cloud computing, architecture of distributed systems, ops automation, and the user experience of making data easy to find and use, you should send Kevin, their founder and CEO, a note at kevin.merritt@socrata.com.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Thursday, January 26, 2012 5:13:24 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Wednesday, January 18, 2012

Finally! I’ve been dying to talk about DynamoDB since work began on this scalable, low-latency, high-performance NoSQL service at AWS. This morning, AWS announced availability of DynamoDB: Amazon Web Services Launches Amazon DynamoDB – A New NoSQL Database Service Designed for the Scale of the Internet.

 

In a past blog entry, One Size Does Not Fit All, I offered a taxonomy of 4 different types of structured storage system, argued that Relational Database Management Systems are not sufficient, and walked through some of the reasons why NoSQL databases have emerged and continue to grow market share quickly. The four database categories I introduced were: 1) features-first, 2) scale-first, 3) simple structure storage, and 4) purpose-optimized stores. RDBMS own the first category.

 

DynamoDB targets workloads fitting into the Scale-First and Simple Structured storage categories where NoSQL database systems have been so popular over the last few years.  Looking at these two categories in more detail, Scale-First is:

 

Scale-first applications are those that absolutely must scale without bound and being able to do this without restriction is much more important than more features. These applications are exemplified by very high scale web sites such as Facebook, MySpace, Gmail, Yahoo, and Amazon.com. Some of these sites actually do make use of relational databases but many do not. The common theme across all of these services is that scale is more important than features and none of them could possibly run on a single RDBMS. As soon as a single RDBMS instance won’t handle the workload, there are two broad possibilities: 1) shard the application data over a large number of RDBMS systems, or 2) use a highly scalable key-value store.

 

And, Simple Structured Storage:

 

There are many applications that have a structured storage requirement but they really don’t need the features, cost, or complexity of an RDBMS. Nor are they focused on the scale required by the scale-first structured storage segment. They just need a simple key value store. A file system or BLOB-store is not sufficiently rich in that simple query and index access is needed but nothing even close to the full set of RDBMS features is needed. Simple, cheap, fast, and low operational burden are the most important requirements of this segment of the market.

 

More detail at: One Size Does Not Fit All.

 

The DynamoDB service is a unified purpose-built hardware platform and software offering. The hardware is based upon a custom server design using Flash Storage spread over a scalable high speed network joining multiple data centers.

 

DynamoDB supports a provisioned throughput model. A DynamoDB application programmer decides the number of database requests per second their application should be capable of supporting and DynamoDB automatically spreads the table over an appropriate number of servers. At the same time, it also reserves the required network, server, and flash memory capacity to ensure that request rate can be reliably delivered day and  night, week after week, and year after year.  There is no need to worry about a neighboring application getting busy or running wild and taking all the needed resources. They are reserved and there whenever needed.

 

The sharding techniques needed to achieve high requests rates are well understood industry-wide but implementing them does take some work. Reliably reserving capacity so it is always there when you need it, takes yet more work.  Supporting the ability to allocate more resources, or even less, while online and without disturbing the current request rate takes still more work. DynamoDB makes all this easy. It supports online scaling between very low transaction rates to applications requiring millions of requests per second. No downtime and no disturbance to the currently configured application request rate while resharding. These changes are done online only by changing the DynamoDB provisioned request rate up and down through an API call.

 

In addition to supporting transparent, on-line scaling of provisioned request rates up and down over 6+ orders of magnitude with resource reservation, DynamoDB is also both consistent and multi-datacenter redundant. Eventual consistency is a fine programming model for some applications but it can yield confusing results under some circumstances. For example, if you set a value to 3 and then later set it to 4, then read it back, 3 can be returned. Worse, the value could be set to 4, verified to be 4 by reading it, and yet 3 could be returned later. It’s a tough programming model for some applications and it tends to be overused in an effort to achieve low-latency and high throughput.  DynamoDB avoids forcing this by supporting low-latency and high throughout while offering full consistency. It also offers eventual consistency at lower request cost for those applications that run well with that model. Both consistency models are supported.

 

It is not unusual for a NoSQL store to be able to support high transaction rates. What is somewhat unusual is to be able to scale the provisioned rate up and down while on-line. Achieving that while, at the same time, maintaining synchronous, multi-datacenter redundancy is where I start to get excited.

 

Clearly nobody wants to run the risk of losing data but NoSQL systems are scale-first by definition. If the only way to high throughput and scale, is to run risk and not commit the data to persistent storage at commit time, that is exactly what is often done. This is where  DynamoDB really shines. When data is sent to DynamoDB, it is committed to persistent and reliable storage before the request is acknowledged. Again this is easy to do but doing it with average low single digit millisecond latencies is both harder and requires better hardware. Hard disk drives can’t do it and in-memory systems are not persistent so flash memory is the most cost effective solution.

 

But what if the server to which the data was committed fails, or the storage fails, or the datacenter is destroyed? On most NoSQL systems you would lose your most recent changes.  On the better implementations, the data might be saved but could be offline and unavailable. With dynamoDB, if data is committed just as the entire datacenter burns to the ground, the data is safe, and the application can continue to run without negative impact at exactly the same provisioned throughput rate. The loss of an entire datacenter isn’t even inconvenient (unless you work at Amazon :-)) and has no impact on your running application performance.

 

Combining rock solid synchronous, multi-datacenter redundancy with average latency in the single digits, and throughput scaling to the millions of requests per second is both an excellent engineering challenge and one often not achieved.

 

More information on DynamoDB:

·         Press Release: http://phx.corporate-ir.net/phoenix.zhtml?c=176060&p=irol-newsArticle&ID=1649209&highlight=

·         DynamoDB detail Page: http://aws.amazon.com/dynamodb/

·         DynamoDB Developer Guide: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/

·         Blog entries:

o     Werner: http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html

o    Jeff Barr: http://aws.typepad.com/aws/2012/01/amazon-dynamodb-internet-scale-data-storage-the-nosql-way.html

·         DynamoDB Frequently Asked Questions: http://aws.amazon.com/dynamodb/faqs/

·         DynamoDB Pricing: http://aws.amazon.com/dynamodb/pricing/

·         GigaOM: http://gigaom.com/cloud/amazons-dynamodb-shows-hardware-as-mean-to-an-end/

·         eWeek: http://www.eweek.com/c/a/Database/Amazon-Web-Services-Launches-DynamoDB-a-New-NoSQL-Database-Service-874019/

·         Seattle Times: http://seattletimes.nwsource.com/html/technologybrierdudleysblog/2017268136_amazon_unveils_dynamodb_databa.html

 

Relational systems remain an excellent solution for applications requiring Feature-First structured storage. AWS Relational Database Service supports both the MySQL and Oracle and relational database management systems: http://aws.amazon.com/rds/.

 

Just as I was blown away when I saw it possible to create the world’s 42nd most powerful super computer with a few API calls to AWS (42: the Answer to the Ultimate Question of Life, the Universe and Everything), it is truly cool to see a couple of API calls to DynamoDB be all that it takes to get a scalable, consistent, low-latency, multi-datacenter redundant, NoSQL service configured, operational and online.

 

                                                --jrh

  

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, January 18, 2012 1:00:06 PM (Pacific Standard Time, UTC-08:00)  #    Comments [6] - Trackback
Services
 Monday, January 16, 2012

Occasionally I come across a noteworthy datacenter design that is worth covering. Late last year a very interesting Japanese facility was brought to my attention by Mikio Uzawa an IT consultant who authors the Agile Cat blog. I know Mikio because he occasionally translates Perspectives articles for publication in Japan.

 

Mikio pointed me to the Ishikari Datacenter in Ishikari City, Hokkaido Japan. Phase I of this facility was just completed in November 2011. This facility is interesting for a variety of reasons but the design features I found most interesting are: 1) High voltage direct current power distribution, 2) whole building ductless cooling, and 3) aggressive free air cooling.

 

High Voltage Direct Current Power Distribution

I first came across the use of direct current when Annabel Pratt took me through the joint work Intel was doing with Lawrence Berkeley National Lab on datacenter HVDC distribution (Evaluation of Direct Current Distribution in Data Centers to Improve Energy Efficiency). In this approach they distribute 400V direct current rather than the more conventional 208V to 240V alternating current used in most facilities today.

 

High voltage direct current work in datacenters has been around for around a decade and it is in extensive test at many facilities world-wide.  Many companies are 100% focused on HVDC design consulting with Validus being one of the better known. 

 

The savings potential of HVDC are often shown to be very exciting with numbers beyond 30% frequently quoted. But the marketing material I’ve gone through in detail compare excellent HVDC designs with very poor AC designs. Predictably the savings are around 30%. Unfortunately, the difference between good AC and bad AC designs are also around 30% :-).

 

When I look closely at HVDC distribution, I see slight improvements in efficiency at around 3 to 5%, somewhat higher costs of equipment since it is less broadly used, less equipment availability and longer delivery times, and somewhat more complex jurisdictional issues with permitting and other approvals taking longer in some regions. Nonetheless, the picture continues to improve, the industry as a whole continues to learn, and I think there is a good chance that high voltage DC distribution will end up becoming a more common choice in modern datacenters.

 

The Ishikari facility is a high voltage DC distribution design. I’m looking forward to learning more about this aspect of the facility and watching how the system performs.

 

Whole Building Ductless Cooling

Air handling ducts costs money and restrict flow so why not recognize that the entire purpose of a datacenter shell is to keep the equipment dry and secure and to transport heat. Instead of installing extensive duct work, just treat the entire building as a very large air duct.

 

Perhaps the nicest mechanical design I’ve come across based upon ductless cooling is the Facebook Prineville facility. In this design, they use the entire second floor of the building for air handling and the lower floor for the server rooms.

The Ishikari design shares many design aspects with the Intel Jones Farms facility where the IT equipment is on the second floor and the electrical equipment is on the first.

 


Aggressive Free-Air Cooling

Looking at the air flow diagram above, you can see that the Ishikari Datacenter is making good use of the datacenter friendly climate of Japan and aggressively using free-air cooling. Free-air cooling, often called air side economization, is one of the most effective ways of driving down datacenter costs and substantially increasing overall efficiency. It’s good to see this design point spreading rapidly.

 

More information is available at: http://ishikari.sakura.ad.jp/index_eng.html

 

Some datacenter designs I’ve covered in the past:

·         Facebook Prineville Mechanical Design

·         Facebook Prineville UPS & Power Supply

·         Example of Efficient Mechanical Design

·         46MW with Water Cooling at a PUE of 1.10

·         Yahoo! Compute Coop Design

·         Microsoft Gen 4 Modular Data Centers

 

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, January 16, 2012 10:00:38 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Monday, January 02, 2012

Years ago, Dave Patterson remarked that most server innovations were coming from the mobile device world. He’s right. Commodity system innovation is driven by volume and nowhere is there more volume than in the mobile device world.  The power management techniques applied fairly successfully over the last 5 years had their genesis in the mobile world.  And, as processor power efficiency improves, memory is on track to become the biggest power consumer in the data center. I expect the ideas to rein in memory power consumption will again come from the mobile device world. Just as Eskimo’s are reported (apparently incorrectly) to have 7 words for snow, mobile memory systems have a large array of low power states with subtly different power dissipations and recovery times. I expect the same techniques will arrive fairly quickly to the server world.

 

ARM processors are used extensively in cell phones and embedded devices. I’ve written frequently of the possible impact of ARM on the server-side computing world.

·         Linux/Apache on ARM Processors

·         ARM Cortex-A9 SMP Design Announced

·         Very Low-Cost, Low-Power Servers

·         NVIDIA Project Denver: ARM Powered Servers

 

ARM remain power efficient while at the same time they are rapidly gaining the performance and features needed to run demanding server-side workloads. A key next step was made late last year when ARM announced the ARM V8 architecture. Key attributes of the new ARM architecture are:

·         64 bit virtual addressing

·         40 bit physical addresses

·         HW virtualization support

The first implementation of the ARM V8 architecture was announced the same day by Applied Micro Devices. The APM design is available in an FPGA implementation for development work this month and is expected to be in final system-on-a-chip form in 2H2012. The APM X-Gene offers:

 

·         64bit addressing

·         3 Ghz

·         Up to 128 cores

·         Super-scalar, quad issue processor

·         CPU and I/O virtualization support

·         Out of order processing

·         80 GB/sec memory throughput

·         Integrated Ethernet and PCIe

·         Full LAMP software stack port

 

APM X-Gene announcement:

·         Press Release: AppliedMicro Showcases World’s First 64-bit ARM v8 Core

·         Slides: Applied Micro Announces X-Gene

 

More ARM and low power servers reading:

·         ARM V8 Press Release: http://www.arm.com/about/newsroom/arm-discloses-technical-details-of-the-next-version-of-the-arm-architecture.php

·         AnandTech: http://www.anandtech.com/show/5027/appliedmicro-announces-xgene-arm-based-socs-for-cloud-computing

·         Ars technica: http://arstechnica.com/business/news/2011/10/arm-aims-for-the-server-room-with-its-new-64-bit-armv8-architecture.ars

·         CIDR Paper on low power computing: http://mvdirona.com/jrh/talksandpapers/jameshamilton_cems.pdf

·         The Case for Energy Proportional Computing: http://research.google.com/pubs/pub33387.html

·         ARM V8 Architecture: http://www.arm.com/files/downloads/ARMv8_Architecture.pdf

 

In the 2nd half of 2012 we will have a very capable, 64bit, server-targeted ARM processor implementation available to systems builders.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, January 02, 2012 9:21:02 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Hardware
 Wednesday, December 14, 2011

If you work in the database world, you already know Phil Bernstein. He’s the author of Principles of Transaction Processing and has a long track record as a successful and prolific database researcher.  Past readers of this blog may remember Phil’s guest blog posting on Google Megastore. Over the past few years, Phil has been working on an innovative NoSQL system based upon flash storage. I like the work because it pushes the limit of what can be done on a single server with transaction rates approaching 400,000, leverages the characteristics of flash storage in a thought provoking way, and employs interesting techniques such as log-only storage.

 

Phil presented Hyder at the Amazon ECS series a couple of weeks back (a past ECS presentation at: High Availability for Cloud Computing Database Systems.

 

In the Hyder system, all cores operate on a single shared transaction log. Each core (or thread) processes Optimistic Concurrency Control (OCC) database transactions one at a time. Each transaction posts its after-image to the shared log. One core does OCC and rolls forward the log. The database is a binary search tree serialized into the log (A B-tree would work equally well in this application). Because the log is effectively a no-overwrite, log-only datastore, a changed node require that the parent must now point to this new node which forces the parent to be updated as well. Now its parent needs updating and this cascading set of changes proceeds to the root on each update.

 

The tree is maintained via copy-on-write semantics where updates are written to the front of the log with references to unchanged tree nodes pointing back to the appropriate locations in the log. Whenever a node changes, the changed node is written to the front of the log. Consequently all database changes result in changes to all nodes to the top of the search tree.

 

This has the downside of requiring many tree nodes to be updated on each database update but has the upside of the writes all being sequential at the front of the log. Since it is a no-overwrite store, when an update is made, the old nodes remain so transactional time travel is easy. The old search tree root still point to a complete tree that was current as of the point in time when that root was the current root of the search tree.  As new nodes are written, some old nodes are no longer part of the current search tree and can be garbage collected over time.

Transactions are implemented by writing an intention log record to the front of the log with all changes required by this transaction and these tree nodes point either to other nodes within the intention record or to unchanged nodes further back in the log. This can be done quickly and all updates can proceed  in parallel without need for locking or synchronization.

 

Before the transaction can be completed, it must now be checked for conflict using Optimistic Concurrency Control. If there are no conflicts, the root of the search tree is atomically moved to point to the new root and the transaction is acknowledged as successful. If the transaction is in conflict, it is failed and the tree root is not advanced and the intention record becomes garbage.

 

Most of the transactional update work can be done concurrently without locks but two issues come to mind quickly:

 

1)      Garbage collection: because the systems is constantly rewriting large portions of the search tree, old versions of the tree a spread throughout the log and need to be recovered.

2)      Transaction Rate: The transaction rate is limited by the rate at which conflicts can be checked and the tree root advanced.

 

The latter is the biggest concern and the rest of the presentation focuses on the rate with which this bottleneck can be processed.  The presenter showed that rates in 400,000 transaction per second where obtained in performance testing so this is a hard limit but it is a fairly high hard limit. This design can go a long way before partitioning is required.

 

If you want to dig deeper, the Hyder presentation is at:

http://mvdirona.com/jrh/TalksAndPapers/Hyder4Amazon5Dec2011.pdf

 

More detailed papers can be found at:

 

Philip A. Bernstein, Colin W. Reid, Sudipto Das: Hyder - A Transactional Record Manager for Shared Flash. CIDR 2011: 9-20

http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper2.pdf

 

Philip A. Bernstein, Colin W. Reid, Ming Wu, Xinhao Yuan: Optimistic Concurrency Control by Melding Trees. PVLDB 4(11): 944-955 (2011)

http://www.vldb.org/pvldb/vol4/p944-bernstein.pdf

 

Colin W. Reid, Philip A. Bernstein: Implementing an Append-Only Interface for Semiconductor Storage. IEEE Data Eng. Bull. 33(4): 14-20 (2010)

http://sites.computer.org/debull/A10dec/hyder.pdf

 

Mahesh Balakrishnan, Philip A. Bernstein, Dahlia Malkhi, Vijayan Prabhakaran, Colin W. Reid: Brief Announcement: Flash-Log - A High Throughput Log. DISC 2010: 401-403

http://www.springerlink.com/content/c732l27h3mrn3170/

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Wednesday, December 14, 2011 9:43:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Software
 Sunday, November 27, 2011

While at Microsoft I hosted a weekly talk series called the Enterprise Computing Series (ECS) where I mostly scheduled technical talks on server and high-scale service topics. I said “mostly” because the series occasionally roamed as far afield as having an ex-member of the Ferrari Formula 1 team present. Client-side topics are also occasionally on the list either because I particularly liked the work or technology behind it or thought it was a broadly relevant topic.

 

The Enterprise Computing Series has an interesting history. It was started by Jim Gray at Tandem.  Pat Helland picked up the mantle from Jim and ran it for years before Pat moved to Andy Heller’s Hal Computer Systems. He continued the ECS at HAL and then brought it with him when he joined Microsoft where he continued to run it for years. Pat eventually passed it to me and I hosted the ECS series for 8 or 9 years myself before moving to Amazon Web Services. Ironically when I arrived at Amazon, I found that Pat Helland had again created a series in the same vein as the ECS called the Principals of Amazon (PoA) series.

 

The PoA series is excellent but it doesn’t include external speakers and is hosted on a fixed day of the week so I occasionally come across a talk that I would like to host at Amazon that doesn’t fit the PoA. For those occasions, the Enterprise Computing Series lives on!

 

In this ECS talk Ashraf Aboulnaga of the University of Waterloo presented High Availability for Database Systems in Cloud Computing Environments. Ashraf presented two topics, 1) RemusDB: Database high availability using virtualization, and 2) DBECS: Database high availability and availability using eventually consistent cloud storage. The first topic was based upon the VLDB 2011 Best Paper Award “RemusDB: Transparent HighAvailability for Database Systems” by Umar Farooq Minha, Shriram Rajagopalan, Brendan Cully, Ashraf Aboulnaga, Ken Salem, and Andrew Warfield. The second topic is work that is not yet published nor as fully developed.

 

Focusing on the first paper, they built an active/standby database system using Remus. Remus implements transparent high availability for Xen VMs. It does this by reflecting all writes to memory in the active virtual machine to the non-active, backup VM.  Remus keeps the backup VM ready to take over with exactly the same memory state as the primary server. On failover, it can take over with the same memory contents including an already warm cache.

Remus is a simple and easy to understand approach to getting very fast takeover from a primary VM. The challenge is that memory write latencies are a fraction of network latencies so any solution that turns memory write latencies into network write latencies simply will not perform adequately for most workloads. Remus tackles this problem using the expected solution: batching many requests in a single network transfer. By default, every 25msec Remus suspends the primary VM, copies all changed pages to a Dom0 (hypervisor) buffer and the allows the VM to continue. The Dom0 buffer is used to minimized the length of time that the guest VM needs to be suspended but comes at the expense of requiring sufficient Dom0 memory for the largest group of changed pages in 25msec.

 

Once the guest machine changed pages are copied to Dom0, the primary VM is released from suspend state and the changes just copied to dom0 are then transferred to the secondary system and applied to the ready to run backup VM.

 

The downsides to the Remus approach are 1) a potentially large dom0 buffer is required and 2) up to 25msec of forward progress can be lost on failover, 3) the checkpoint work consumes considerable resources including time. The time to copy the changed pages may be acceptable but the other overheads are sufficiently high that it is very difficult to host demanding workloads like database workloads on Remus.

 

The authors tackle this problem but noting that Remus actually does more than is needed for database workloads. Or, worded differently, a Remus optimized for database workloads can dramatically reduce the implementation overhead. They introduced the following optimizations:

·         Asynchronous checkpoint compression: Maintain an LRU buffer of recent pages and only ship a delta of these pages. This optimization is based upon the assumption that DB systems modify some pages frequently and typically only change a small part of these pages between checkpoints.

·         Disk read tracking: don’t mark pages read from disk as dirty since they are already available to the backup server via an I/O

·         Memory deprotection: allows DB to declare regions of memory that don’t need to be replicated. This turned out not to be as powerful an optimization as the others and had the further downside of requiring database engine changes

·         Network optimization/Commit protection: Remus buffers every outgoing network packet to ensure clients never see the results of unsafe execution but this increases latency by not allowing any response back to the client until the next Remus checkpoint. Because DBs can fail and transactions can be aborted, they DB optimization is to send all packets back to client in real time except for commit, abort, or other database transaction state changing operations. On failover, any client in an unprotected network state (changes have been sent since the last checkpoint) has the transaction failed. A correct client will re-run the transaction and proceed without issue.

 

What was achieved is Remus, fast-failover protection for database workloads and far lower replication overhead. The authors used the database transaction benchmark TPC-C to show that Remus with DB optimizations has all the protection of Remus but with roughly 1/10th the overhead.

 

                Slides: http://mvdirona.com/jrh/TalksAndPapers/AshrafAboulnaga20111114.pdf

                VLDB Paper: http://www.cs.uwaterloo.ca/~ashraf/pubs/pvldb11remusdb.pdf

 

I'm not 100% convinced Remus is the best solution to the database high availability problem but I like the solution, learned from the proposed optimizations, and enjoyed the talk. Thanks to Pradeep Madhavarapu, who leads part of the Amazon database kernel engineering team (and is hiring :-)), for organizing this talk and to  Ashraf Aboulnaga for doing it.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, November 27, 2011 12:50:18 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Software
 Tuesday, November 22, 2011

Netflix is super interesting in that they are running at extraordinary scale, are a leader in the move to the cloud, and Adrian Cockcroft, the Netflix Director of Cloud Architecture, is always interesting in presentations. In this presentation Adrian covers similar material to his HPTS 2011 talk I saw last month.

 

His slides are up at: http://www.slideshare.net/adrianco/global-netflix-platform and my rough notes follow:

·         Netflix has 20 milion streaming members

o    Currently in US, Canada, and Latin America

o    Soon to be in UK and Ireleand

·         Netflix is 100% public cloud hosted

·         Why did Netflix move from their own high-scale facility to a public cloud?

o    Better business agility

o    Netflix was unable build datacenters fast enough

o    Capacity growth was both accelerating and unpredictable

o    Product launch spikes require massive new capacity (iPhone, Wii, PS3, & Xbox)

Netflix grew 37x from Jan 2010 through Jan 2011

 

·         Why did Netflix choose AWS as their cloud solution?

o    Chose AWS using Netflix own platform and tools

o    Netflix has unique platform requirements and extreme scale needing both agility & flexibility

o    Chose AWS partly because it was the biggest public cloud

§  Wanted to leverage AWS investment in features and automation

§  Wanted to use AWS availability zones and regions for availability, scalability, and global deployment

§  Didn’t want to be the biggest customer on a small cloud

o    But isn’t Amazon a competitor?

§  Many products that compete with Amazon run on AWS

§  Netflix is the “poster child” for the AWS Architecture

§  One of the biggest AWS customers

§  Netflix strategy: turn competitors into partners

o    Could Netflix use a different cloud from AWS

§  Would be nice and Netflix already uses 3 interchangeable CDN vendors

§  But no one else has the scale and features of AWS

·         “you have to be tall to ride this ride”

·         Perhaps in 2 to 3 years?

o    “We want to use cloud, we don’t want to build them”

§  Public clouds for agility and scale

§  We use electricity too but we don’t want to build a power station

§  AWS because they are big enough to allocated thousands of instances per hour when needed

 

 

o    Netflix Global PaaS

§  Supports all AWS Availability Zones and Regions

§  Supports multiple AWS accounts (test, prod, etc.)

§  Supports cross Regions and cross account data replication & archiving

§  Supports fine grained security with dynamic AWS keys

§  Autoscales to thousands of instances

§  Monitoring for millions of metrics

o    Portals and explorers:

§  Netflix Application Console (NAC): Primary AWS provisioning & config interface

§  AWS Usage Analyzer: cost breakdown by application and resource

§  SimpleDB Explorer: browse domains, items, attributes, values,…

§  Cassandra Explorer: browse clusters, keyspaces, column families, …

§  Base Service Explorer: browse endpoints, configs, perf metrics, …

o    Netflix Platform Services:

§  Discovery: Service Register for applications

§  Introspections: Endpoints

§  Cryptex: Dynamic security key management

§  Geo: Geographic IP lookup engine

§  Platform Serivce: Dynamic property configuration

§  Localization: manage and lookup local translations

§  EVcache: Eccetric Volatile (mem)Cached

§  Cassadra: Persistence

§  Zookeeper: Coordination

o    Netflix Persistence Services:

§  SimpleDB: Netflix moving to Cassandra

·         Latencies typically over 10msec

§  S3: using the JetS3t based interface with Netflix changes and updates

§  Eccentric Volatile Cache (evcache)

·         Discovery aware memcached based backend

·         Client abstractions for zone aware replication

·         Supports option to write to all zones, fast read from local

·         On average, latencies of under 1 msec

§  Cassandra

·         Chose because they value availability over consistency

·         On average, latency of “few microseconds”

§  MongoDB

§  MySQti: supports hard to scale, legacy, and small relational models

o    Implemented a Multi-Regional Data Replication system:

§  Oracle to SimpleDB and queued reverse path usingj SQS

o    High Availability:

§  Cassandra stores 3 local copies, 1 per availability zone

§  Each AWS availability zone is a separate building with separate power etc. but still fairly close together so synchronous access is practical

§  Synchronous access, durable, and highly available

 

Adrian’s slide deck is posted at: http://www.slideshare.net/adrianco/global-netflix-platform.

 

                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

 

Tuesday, November 22, 2011 1:09:55 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Friday, November 18, 2011

I seldom write consumer product reviews and this blog is about the furthest thing from a consumer focused site but, every so often, I come across a notable tidbit that is worthy of mention. A few weeks ago, it was Sprint unilaterally changing the terms of their wireless contracts (Sprint is Giving Free Customer Service Lessons). It just seemed a sufficiently confused decision that it was worthy of mention.

 

Here’s one that just nails it on the other side of the equation by obsessing over the customer experience: Roku. I’ve long known about Roku but I’m not a huge TV watcher so I’ve only been peripherally interested in the product. But we are both Netflix and Amazon Prime Instant Video customers and Roku supports both. And the entry level Roku streaming appliance is only $49 so we figured let’s give it a try. It actually ended up a bit more than $49 in that we first managed to upsell ourselves to a $59 Roku 2 to get HD, and then to a $79 device to get 1080P and then to a $99 device to het 1080P HD with a hardwired Ethernet connection. So we ended up with a $100 device. I think $50 is close to where this class of devices needs to end up but $100 is reasonable as well.

The device is amazing and shows what can be done with a focus on clean industrial design. It is incredibly small at only 3” square. I plugged it in, it booted up, updated its software, found its remote, upgraded the software on the remote and went live without any user interaction. I setup a Roku account, linked my Amazon account for access to Prime Instant Video, linked our Netflix account and it was ready to go.

 

The device is tiny, produces close to no heat, you don’t have to read the manual, the user interface is clean and notable for its snappiness. I expected a sluggish UI as many companies scrimp on processing power to get costs down but it is very snappy. In fact Netflix on a Roku is faster than the same support on an Xbox. The UI is clean, simple, snappy, and very elegant.

 

I love where consumer appliances are heading: simple, cheap, dedicated, purpose-build devices with clean user interfaces, and the hybrid delivery model where the user interface is delivered by the appliance but most of the functionality is hosted in the cloud.  The combination of cheap microelectronics, open source operating systems, and cloud hosting allows incredibly high function devices to be delivered at low cost. 

 

The Kindle Fire takes the hybrid cloud connected model a long way where the Fire’s Silk browser UI runs directly on the device close to the user where it can be highly interactive and responsive. But the power and network-bandwidth hungry browser backend is hosted on Amazon EC2 where connectivity is awesome and compute power is not battery constrained. I love the hybrid model and we are going to see more and more devices delivering a hybrid user experience where the compute intensive components are cloud hosted and user interface is in the device. My belief is that this is the future of consumer electronics and, as prices drop to the $30 to $50 range, everyone will have 10s of these special-purpose, cloud-connected devices.

 

For the first time in my life, I’m super interested in consumer devices and the possibilities of what can be done in the hybrid cloud-connected appliance model.

 

                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Friday, November 18, 2011 7:44:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Services
 Tuesday, November 15, 2011

Yesterday the Top 500 Supercomputer Sites was announced. The Top500 list shows the most powerful commercially available supercomputer systems in the world. This list represents the very outside of what supercomputer performance is possible when cost is no object. The top placement on the list is always owned by a sovereign funded laboratory. These are the systems that only government funded agencies can purchase. But they have great interest for me because, as the cost of computing continues to fall, these performance levels will become commercially available to companies wanting to run high scale models and data intensive computing. In effect, the Top500 predicts the future so I’m always interested in the systems on the list.

 


What makes this list of the fastest supercomputers in the world released yesterday particularly unusual can be found at position #42.  42 is an anomaly of the first order. In fact, #42 is an anomaly across enough dimensions that its worth digging much deeper.

 

Virtualization Tax is Now Affordable:

I remember reading through the detailed specifications when the Cray 1 supercomputer was announced and marveling that it didn’t even use virtual memory. It was believed at the time that only real-mode memory access could deliver the performance needed.

We have come a long way in the nearly 40 years since the Cray 1 was announced. This #42 result was run not just using virtual memory but with virtual memory in a guest operating system running under a hypervisor. This is the only fully virtualized, multi-tenant super computer on the Top500 and it shows what is possible as the virtualization tax continues to fall. This is an awesome result and many more virtualization improvements are coming over the next 2 to 3 years.

 

Commodity Networks can Compete at the Top of the Performance Spectrum:

This is the only Top500 entrant below number 128 on the list that is not running either Infiniband or a proprietary, purpose-built network. This result at #42 is an all Ethernet network showing that a commodity network, if done right, can produce industry leading performance numbers.

 

What’s the secret?  10Gbps directly the host is the first part. The second is full non-blocking networking fabric where all systems can communicate at full line rate at the same time.  Worded differently, the network is not oversubscribed. See Datacenter Networks are in my Way for more on the problems with existing datacenter networks.

 

Commodity Ethernet networks continue to borrow more and more implementation approaches and good network architecture ideas from Infiniband, scale economics continues to drive down costs so non-blocking networks are now practical and affordable, and scale economics are pushing rapid innovation. Commodity equipment in a well-engineered overall service is where I see the future of networking continuing to head.

 

Anyone can own a Supercomputer for an hour:

You can’t rent supercomputing time by the hour from Lawrence Livermore National Laboratory. Sandia is not doing it either. But you can have a top50 supercomputer for under $2,600/hour. That is one of the world’s most powerful high performance computing systems  with 1,064 nodes and 8,512 cores for under $3k/hour. For those of you not needing quite this much power at one time, that’s $0.05/core hour which is ½ of the previous Amazon Web Services HPC system cost.

 

Single node speeds and feeds:

·         Processors: 8-core, 2 socket Intel Xeon @ 2.6 Ghz with hyperthreading

·         Memory: 60.5GB

·         Storage: 3.37TB direct attached and Elastic Block Store for remote storage

·         Networking: 10Gbps Ethernet with full bisection bandwidth within the placement group

·         Virtualized: Hardware Assisted Virtualization

·         API: cc2.8xlarge

 

Overall Top500 Result:

·         1064 nodes of cc2.8xlarge

·         240.09 TFlops at an excellent 67.8% efficiency

·         $2.40/node hour on demand

·         10Gbps non-blocking Ethernet networking fabric

 

Database Intensive Computing:

This is a database machine masquerading as a supercomputer. You don’t have to use the floating point units to get full value from renting time on this cluster. It’s absolutely a screamer as an HPC system. But it also has the potential to be the world’s highest performing MapReduce system (Elastic Map Reduce) with a full bisection bandwidth 10Gbps network directly to each node.  Any database or general data intensive workload with high per-node computational costs and/or high inter-node traffic will run well on this new instance type. 

 

If you are network bound, compute bound, or both, the EC2 cc2.8xlarge instance type could be the right answer. And, the amazing thing is that the cc2 instance type is ½ the cost per core of the cc1 instance.

 

Supercomputing is now available to anyone for $0.05/core hour. Go to http://aws.amazon.com/hpc-applications/ and give it a try. You no longer need to be a national lab or a government agency to be able run one of the biggest supercomputers in the world.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, November 15, 2011 10:21:38 AM (Pacific Standard Time, UTC-08:00)  #    Comments [5] - Trackback
Services
 Monday, November 14, 2011

Last week I got to participate in one of my favorite days each year, serving on the judging panel for the AWS Startup Challenge. The event is a fairly intense day where our first meeting starts at 6:45am and the event closes at 9pm that evening. But it is an easy day to love in that the entire day is spent with innovative startups who have built their companies on cloud computing.

 

I’m a huge believer in the way cloud computing is changing the computing landscape and that’s all I’ve worked on for many years now. But I have still not tired of hearing “Without AWS, we wouldn’t have even been able to think about launching this business.”

 

Cloud computing is allowing significant businesses to be conceived and delivered at scale with only tiny amounts of seed funding or completely bootstrapped. Many of the finalist we looked at last week’s event had taken less than $200k of seed funding and yet had already had thousands of users. That simply wouldn’t have been possible 10 years ago and I just love to see it.

 

The finalist for this year’s AWS Startup Challenge were:

Booshaka - United States (Sunnyvale, California)

Booshaka simplifies advocacy marketing for brands and businesses by making sense of large amounts of social data and providing an easy to use software-as-a-service solution. In an era where people are bombarded by media, advertisers face significant challenges in reaching and engaging their customers. Booshaka combines the social graph and big data technology to help advertisers turn customers into their best marketers.

Deputy.com - Australia (Sydney)

Deputy.com is an online business management solution specifically addressing the HR department. The powerful online and mobile platform engages all staff across an enterprise, builds positive culture and drives business growth.

Fantasy Shopper - UK (Exeter)

Fantasy Shopper is a social shopping game. The shopping platform centralizes, socializes and “gamifies” online shopping to provide a real-world experience.

Flixlab - United States (Palo Alto, California)

With Flixlab, people can instantly and automatically transform raw videos and photos from their smartphone or their friends’ smartphones, into fun, compelling stories with just a few taps and immediately share them online. After creation, viewers can then interact with these movies by remixing them and creating personally relevant movies from the shared pictures and videos.

Getaround - United States (San Francisco, California)

Getaround is a peer-to-peer car sharing marketplace that enables car owners to rent their cars to qualified drivers by the hour, day, or week. Getaround facilitates payment, provide 24/7 roadside assistance, and provide complete insurance backed by Berkshire Hathaway with each rental.

Intervention Insights - United States (Grand Rapids, Michigan)

Intervention Insights provides a medical information service that combines cutting edge bioinformatics tools with disease information to deliver molecular insights to oncologists describing an individual’s unique tumor at a genomic level. The company then provides a report with an evidenced-based list of therapies that target the unique molecular basis of the cancer.

Localytics - United States (Cambridge, Massachusetts)

Localytics is a real-time mobile application analytics service that provides app developers with tools to measure usage, engagement and custom events in their applications. All data is stored at a per-event level instead of as aggregated counts. This allows app publishers, for example, to create more accurately targeted advertising and promotional campaigns from detailed segmentation of their dedicated customers.

Judging this year’s competition was even more difficult than last year because of the high quality of the field. Rather than a clear winner just jumping out, nearly all the finalist were viable winners and each clearly led in some dimensions.

 

As I write this and reflect on the field of finalist, some notable aspects of the list: 1) it is truly international in that there are several very strong entrants from outside the US and more than ½ of the finalists come from outside of Silicon Valley – the combination of two trends is powerful: first the economics of cloud computing supports successful startups without venture funding and, second, the spread of venture and angel funding throughout the world. Both trends make for a very strong field. Continuing on the notable attributes list, 2) very early stage startups are getting traction incredibly quickly – cloud computing allows companies to go to beta without having to grow a huge company. And, 3) Diversity. There were consumer offerings, developer offerings, and services aimed at highly skilled professionals.

The winner of the AWS Startup Challenge this year was Fantasy Shopper from Exeter, United Kingdom. Fantasy Shopper is a small, mostly bootstrapped startup led by CEO Chris Prescott and CTO Dan Noz with two other engineers. Fantasy Shopper is a social shopping game. They just went into beta on October 18th and already have thousands of incredibly engaged users. My favorite example of which is this video blog posted to YouTube November 6th: http://www.youtube.com/watch?v=h_sKDgdEexk. Watch the first 60 to 90 seconds and you’ll see what I mean.

 

Congratulations to Chris, Dan, Brendan, and Findlay at Fantasy Shopper and keep up the great work.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Monday, November 14, 2011 7:24:52 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Ramblings
 Thursday, November 03, 2011

As rescue and relief operations continue in response to the serious flooding in Thailand the focus has correctly been on human health and safety. Early reports estimated 317 fatalities, 700,000 homes and 14,000 factories impacted with over 660,000 not able to work. Good coverage mostly from the Bangkok Post is available at Newley.com authored by a reporter in the regoin. For example: http://newley.com/2011/11/02/thailand-flooding-update-november-2-2011-front-page-of-todays-bangkok-post/.

 

The floods are far from over and, as we look beyond the immediate problem in country, the impact on the technology world is expected to continue for just over a year even if the floods do recede in 3 to 4 weeks as expected. Disk drives are particularly hard hit with Digitimes Research reporting that the flood will create a 12% HDD supply gap in the 4th quarter of 2011 and the gap may increase into 2012.  Digitimes estimates the 4Q11 hard disk drive shortage to reach 19 million units.

Western Digital was hit the hardest by the floods with Tim Leyden, WD COO describing the situation in the last investor quarterly report as:

 

The flooded buildings in Thailand include our HDD assembly, test and slider facilities where a substantial majority of our slider fabrication capacity resides. In parallel with the internal slider shortages resulting from the above disruption, we are also experiencing other shortages on component parts from vendors located in several Thai industrial parks that have already been inundated by the floods, or have been affected by protective plant shutdowns.  We are evaluating the situation on a continuous basis, but  in order to get these facilities back up and running, we need the water level to stabilize, after which point it will take some period of time for the floods to recede. We are assessing our options so that we can safely begin working to accelerate the water removal and either extract and transfer the equipment to clean-rooms in other locations or prepare it for operation on-site. As a result of these activities, at this point in time, we estimate that our regular capacity and possibly our suppliers capacity will be significantly constrained for several quarters.

 

Toshiba reports Impact of the Floods in Thailand they were seriously impacted as well:

 

Location: Navanakorn Industrial Estate Zone, Pathumtani, Thailand
Main Product: Hard Disk Drive

·         Damage status: The water is 2 meters high on the site and the surrounding area and more than 1 meter deep in the buildings. Facilities are damaged but no employees have been injured in the factory.

·         Alternative sites: We have started alternative production at other factories, but the production volume will be limited by available capacity.

·         Operation: All the employees have been evacuated from the industrial zone, at the order of the Thai government. With the water at its current level, we anticipate a long-term shutdown. The date of resumption of operation is unpredictable.

 

Because the hard disk supply chain is heavily represented in this region, many hard disk manufacturers with unaffected plants will still lose capacity. Noble Financial Equity Research made the following 4th quarter shipped volume estimates:

 

Continuing with data from Noble Financial Equity Research:

·         Due to the effects of flooding, we do not expect the industry to return to normalcy for 3 to 4 quarters

·         We see only 120M drives shipped this quarter versus the TAM (total addressable market) of 175M to 180M units

·         Due to lack of channel and finished goods inventory, the supply shortfall in the March quarter is also expected to be sever despite higher expected drive shipments and component availability

·         By shifting production out of Asian plants, critical component supplier Nidec believes it can ramp to an output of 170 drive motors by the March quarter

·         We see significantly higher drive and component prices persisting into the summer months of 2012

·         Seagate will be the principal beneficiary of the supply shortage and higher pricing

·         We believe Hutchinson (drive suspension manufacturer) will be able to rapidly ramp its US assembly operations and higher suspension prices will offset the reduced business from Western Digital

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, November 03, 2011 7:41:22 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3] - Trackback
Hardware
 Monday, October 31, 2011

I’m not sure why it all happens at once but it often does.  Last Monday I kicked off HPTS 2011 in Asilomar California and then flew to New York City to present at the Open Compute Summit. 

 

I love HPTS. It’s a once every 2 year invitational workshop that I’ve been attending since 1989. The workshop attracts a great set of presenters and attendees: HPTS 2011 agenda. I blogged a couple of the sessions if you are interested:

·         Microsoft COSMOS at HPTS

·         Storage Infrastructure Behind Facebook Message

 

The Open Compute Summit was kicked off by Frank Frankovsky of Facebook followed by the legendary Andy Bechtolsheim of Arista Networks.  I did a talk after Andy which was a subset of the talk I had done earlier in the week at HPTS.

·         HPTS Slides: Internet-Scale Data Center Economics: Costs and Opportunities

·         OCP Summit: Internet Scale Infrastructure Innovation

 

Tomorrow I’ll be at University of Washington presenting Internet Scale Storage at the University of Washington Computer Science and Engineering Distinguished Lecturer Series (agenda). Its open to the public so, if you are in the Seattle area and interested, feel free to drop by EEB-105 at 3:30pm (map and directions).

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Monday, October 31, 2011 6:07:27 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Ramblings
 Saturday, October 29, 2011

Sometimes the most educational lessons are on what not to do rather than what to do. Failure and disaster can be extraordinarily educational as long as the reason behind the failure is well understood.  I study large system outages, infrastructure failures, love reading post mortems (when they actually have content), and always watch carefully how companies communicate with their customers during and right after large scale customer impacting events. I don’t do it because I enjoy failure – these things all scare me. But, in each there are lessons to be learned.

 

 

Sprint advertising from: http://unlimited.sprint.com/?pid=10 (2011/10/29).

 

I typically point out the best example rather than the worst but every once in a while you see a blunder so big it just can’t be ignored.   Sprint is the 3rd place wireless company in an industry where backbreaking infrastructure costs strongly point towards there only being a small number of surviving companies unless services are well differentiated.  All the big wireless players work hard on differentiation but it’s a challenge and, over time, the biggest revenue, supports the biggest infrastructure investment, and its gets harder and harder to be successful as a #3 player.

 

Sprint markets that they are better than the #1 and #2 carrier because they really have unlimited data rather than merely using the word “unlimited” in the marketing literature. They say “at Sprint you get unlimited data, no overage charges, and no slowing you down” (http://unlimited.sprint.com/?pid=10).

 

We live on a boat and so 4G cellular is about as close as we can get to broadband. I like to do all I can to encourage broad competition because it is good for the industry and good for customers.  That is one of the reasons we are Sprint customers today.  We use Sprint because they offer unlimited 4G and I really would like there to be more than 2 surviving North American wireless providers.

 

Unfortunately, Sprint seems less committed to my goal of keeping the #3 wireless player healthy and solvent.  Looking at the Sprint primary differentiating feature, unlimited data, they plan to shut it off this month. That’s a tough decision but presumably it was made with care and there exists a plan to grow the company with some other feature or features making them a better choice than Verizon or AT&T.  Just being a 3rd choice with a less well developed network and with less capital to invest into that network doesn’t feel like a good long term strategy for Sprint.

 

What makes Sprint’s decision notable is the way the plan was rolled out. Sprint has many customers under 2 year, unlimited data contracts. Rather than risk the negative repercussions and customer churn from communicating the change, they went the stealth route.  The only notification was buried in the fine print of the October bill:

 

Mobile Broadband Data Allowance Change
Eff. on your next bill, Mobile Broadband Data Plan 4G usage will be combined with your current 3G monthly data allowance and no longer be unlimited. On-network data overage rate for 3G/4G is $.05/MB. Monitor combined data usage at sprint.com. Please visit sprint.com/servicechange for details.

In November, many of us are going to get charged an overage fee of $0.05/MB on what has been advertised heavily as the only “real” unlimited plan.  For many customers, the only reason they have a Sprint contract is that the data plan was uncapped. Both my phone and Jennifer’s are with AT&T. The only reason we are using Sprint for connectivity from the boat WiFi system is Sprint offered unbounded data. Attempting a stealth change of the primary advertised characteristic of a service shows very little respect for customers even when compared with other telcos, an industry not generally known for customer empathy.

 

I agree that almost nobody is going to read the bill and I suppose some won’t notice when subsequent bills are higher. But many eventually will. And, even for those that don’t notice and are silently are getting charged more, when they do notice, they are going to be unhappy. No matter how you cut it, the experience is going to be hard on customer trust. And, at the same time they showing little respect for customers, they are releasing them all from contract at the same time. Any Sprint customer is now welcome to leave without termination charge.

 

Some analysts have speculated that Sprint doesn’t have the bandwidth to support their launch of iPhone.  This billing structure change strongly suggests that Sprint really does have a bandwidth problem. I’ve still not yet figured out why an iPhone is more desirable at Sprint than it is at Verizon or AT&T. And I still can’t figure out why the #3 provider with the same data caps is more desirable than the big 2 but it’s not important that I understand. That’s a Sprint leadership decision.

 

Let’s assume that the Sprint network is in capacity trouble and they have no choice but to cap the data plans even though they are changing the very terms they advertised as their primary advantage. Even if that is necessary, I’m 100% convince the right way to do it is to support the existing contact terms for the duration of those contracts. If the company really is teetering on failure and is unable to honor the commitments they agreed to, then they need to be upfront with customers. You can’t slip in new contract terms quietly into the statement and hope nobody notices. Showing that little respect for customers is usually rewarded by high churn rates and a continuing to shrink market share.  Poor approach.

 

I called Sprint and pointed out they were kind of missing the original contact terms. They said “there was nothing they could do” however, they would be willing to offer a $100 credit if we would agree to another 2 year contract term. Paying only $100 to get a customer signed up for another 2 years would be an incredible bargain for Sprint. Most North American carriers spend at least that on device subsidies when getting customers committed to an additional 2 year term. This would be cheap for Sprint and would get customers back under contract after this term change effectively released them. The Sprint customer service representative did correctly offer to waive early cancellation fees since they were changing the contract terms of the original contract. 

 

Sprint customers are now all able to walk away today from the remaining months in their wireless contacts without any cost. They are all free to leave. From my perspective, it is just plain nutty for Sprint to give their entire subscription base the freedom to walk away from contracts without charge while, at the same time, treating them poorly. It’s a recipe for industry leading churn.

 

                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Saturday, October 29, 2011 6:40:18 AM (Pacific Standard Time, UTC-08:00)  #    Comments [9] - Trackback
Ramblings
 Tuesday, October 25, 2011

One of the talks that I particularly enjoyed yesterday at HPTS 2011 was Storage Infrastructure Behind Facebook Messages by Kannan Muthukkaruppan. In this talk, Kannan talked about the Facebook store for chats, email, SMS, & messages.

 

This high scale storage system is based upon HBase and Haystack. HBase is a non-relational, distributed database very similar to Google’s Big Table. Haystack is simple file system designed by Facebook for efficient photo storage and delivery. More on Haystack at: Facebook Needle in a Haystack.

 

In this Facebook Message store, Haystack is used to store attachments and large messages.  HBase is used for message metadata, search indexes, and small messages (avoiding the second I/O to Haystack for small messages like most SMS).

 

Facebook Messages takes 6B+ messages a day. Summarizing HBase traffic:

·         75B+ R+W ops/day with 1.5M ops/sec at peak

·         The average write operation inserts 16 records across multiple column families

·         2PB+ of cooked online data in HBase. Over 6PB including replication but not backups

·         All data is LZO compressed

·         Growing at 250TB/month

 

The Facebook Messages project timeline:

·         2009/12: Project started

·         2010/11: Initial rollout began

·         2011/07: Rollout completed with 1B+ accounts migrated to new store

·         Production changes:

o   2 schema changes

o   Upgraded to Hfile 2.0

 

They implemented a very nice approach to testing where, prior to release, they shadowed the production workload to the test servers.

After going into production the continued the practice of shadowing the real production workload into the test cluster to test before going into production:

 

The list of scares and scars from Kannan:

·         Not without our share of scares and incidents:

o   s/w bugs. (e.g., deadlocks, incompatible LZO used for bulk imported data, etc.)

§  found a edge case bug in log recovery as recently as last week!

·         performance spikes every 6 hours (even off-peak)!

o   cleanup of HDFS’s Recycle bin was sub-optimal! Needed code and config fix.

·         transient rack switch failures

·         Zookeeper leader election took than 10 minutes when one member of the quorum died. Fixed in more recent version of ZK.

·         HDFS Namenode – SPOF

·         flapping servers (repeated failures)

·         Sometimes, tried things which hadn’t been tested in dark launch!

o   Added a rack of servers to help with performance issue

§  Pegged top of the rack network bandwidth!

§  Had to add the servers at much slower pace. Very manual .

§  Intelligent load balancing needed to make this more automated.

·         A high % of issues caught in shadow/stress testing

·         Lots of alerting mechanisms in place to detect failures cases

o   Automate recovery for a lots of common ones

o   Treat alerts on shadow cluster as hi-pri too!

·         Sharding service across multiple HBase cells also paid off

 

Kannan’s slides are posted at: http://mvdirona.com/jrh/TalksAndPapers/KannanMuthukkaruppan_StorageInfraBehindMessages.pdf

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, October 25, 2011 1:03:10 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services

Rough notes from a talk on COSMOS, Microsoft’s internal Map reduce systems from HPTS 2011. This is the service Microsoft uses internally to run MapReduce jobs. Interesting, Microsoft plans to use Hadoop in the external Azure service even though COSMOS looks quite good: Microsoft Announces Open Source Based Cloud Service. Rough notes below:

 

Talk: COSMOS: Big Data and Big Challenges

Speaker: Ed Harris

·         Petabyte storage and computation systems

·         Used primarily by search and advertising inside Microsoft

·         Operated as a service with just over 4 9s of availability

·         Massively parallel processing based upon Dryad

o   Dryad is very similar to MapReduce

·         Use SCOPE (structured Computation Optimized for Parallel Execution) over Dryad

o   A SQL-like language with an optimizers implemented over Dryad

·         They run hundreds of virtual clusters. In this model, internal Microsoft teams buy servers and given them to COSMOS and are subsequently assured at least these resources

o   Average 85% CPU over the cluster

·         Ingest 1 to 2 PB/day

·         Roughly 30% of the Search fleet is running COSMOS

·         Architecture:

o   Store Layer

§  Many extent nodes store and compress streams

§  Streams are sequences of extents

§  CSM: Cosmos Store Layer handles names, streams, and replication

·         First level compression is light. Data that is kept more than a week is more aggressively compressed after a week on the assumption that data that lives a week will likely live longer

o   Execution Layer:

§  Jobs queue up on virtual clusters and then executed

o   SCOPE Layer

§  Compiler and optimizer for SCOPE

§  Ed said that the optimizer is a branch of the SQL Server optimizer

·         They have 60+ Phd internships each year and hire ~30 a year

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Tuesday, October 25, 2011 8:37:20 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0] - Trackback
Services
 Sunday, October 23, 2011

From the Last Bastion of Mainframe Computing Perspectives post:

 

The networking equipment world looks just like mainframe computing ecosystem did 40 years ago. A small number of players produce vertically integrated solutions where the ASICs (the central processing unit responsible for high speed data packet switching), the hardware design, the hardware manufacture, and the entire software stack are stack are single sourced and vertically integrated.  Just as you couldn’t run IBM MVS on a Burrows computer, you can’t run Cisco IOS on Juniper equipment. When networking gear is purchased, it’s packaged as a single sourced, vertically integrated stack. In contrast, in the commodity server world, starting at the most basic component, CPUs are multi-sourced. We can get CPUs from AMD and Intel. Compatible servers built from either Intel or AMD CPUs are available from HP, Dell, IBM, SGI, ZT Systems, Silicon Mechanics, and many others.  Any of these servers can support both proprietary and open source operating systems. The commodity server world is open and multi-sourced at every layer in the stack.

 

Last week the Open Network Summit was hosted at Stanford University.  This conference focused on Software Defined Networks in general and Openflow specifically. Software defined networking separates out the router control plane responsible for what is in the routing table from the data plane that makes network packet routing decisions on the basis of what is actually in the routing table.  Historically, both operations have been implemented monolithically in each router. SDN, separates these functions allowing networking equipment to compete in how efficiently they route packets on the basis of instructions from a separate SDN control plane.

 

In the words of OpenFlow founder Nick Mckeown, Software Defined Networks (SDN), will: 1) empower network owners/operators, 2) increase the pace of network innovation, 3) diversify the supply chain, and 4) build a robust foundation for future networking innovation.

 

This conference was a bit of a coming of age for software defined networking for a couple of reasons. First, an excellent measure of relevance is who showed up to speak at the conference. From academia, attendees included Scott Shenker (Berkeley), Nick McKeown (Stanford), and Jennifer Rexford (Princeton).  From industry most major networking companies were represented by senior attendees including Dave Ward (Juniper), Dave Meyer (Cisco), Ken Duda (Arista), Mallik Tatipamula (Ericsson), Geng Lin (Dell), Samrat Ganguly (NEC),  and Charles Clark (HP). And some of the speakers from major networking user companies included: Stephen Stuart (Google), Albert Greenberg (Microsoft), Stuart Elby (Verizon), Rainer Weidmann (Deutsche Telekom), and Igor Gashinsky (Yahoo!). The full speaker list is up at: http://opennetsummit.org/speakers.html.

 

The second data point in support of SDN really coming of age was Dave Meyer, Cisco Distinguished Engineer, saying during his talk that Cisco was “doing Openflow”. I’ve always joked that Cisco would rather go bankrupt than support Openflow so this one definitely caught my interest. Since I wasn’t in attendance myself during Dave’s talk I checked in with him personally. He corrected that it wasn’t a product announcement. They have Openflow running on Cisco gear but “no product plans have been announced at this time”. Still exciting progress and hat’s off for Cisco for taking the first step. Good to see.

 

If you want a good summary of what is Software Defined Networking, perhaps the best description were the slides that Nick presented at the conference: http://mvdirona.com/jrh/TalksAndPapers/NickMckeown_ON%20Summit%20NickM%2010%202011.pdf.

 

If you are interested in what Cisco’s Dave Meyer presented at the summit, I’ve posted his slides here: http://mvdirona.com/jrh/TalksAndPapers/DavidMeyer_openflow_and_sdn_for_enterprises.pdf.

 

Other related postings I’ve made:

·         Datacenter Networks are in my Way

·         Stanford Clean Slate CTO Summit

·         Changes in Networking Systems

·         Software Load Balancing Using Software Defined Networking

 

Congratulations to the Stanford team for hosting a great conference and in helping to drive software defined networking from a great academic idea to what is rapidly becoming a supported option industry-wide.

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Sunday, October 23, 2011 7:57:07 AM (Pacific Standard Time, UTC-08:00)  #    Comments [4] - Trackback
Hardware | Software
 Thursday, October 20, 2011

Last night EMC Chief Executive Joe Tucci laid out his view of where the information processing world is going over the next decade and where EMC will focus.  His primary point was cloud computing is the future and big data is the killer app for the cloud. He laid out the history of big transitions in our industry and argued the big discontinuities were always driven by a killer application. He sees the cloud as the next big and important transition for our industry.

 

This talk was presented as part of the University of Washington Distinguished Lecturer Series. With six TV cameras covering the action, there were nearly as many as some University of Washington Huskies games and it was well attended. The next talk in the series will be Bill Gates on October 27 presenting The Opportunity Ahead: A Conversation with Bill Gates. I’ll be presenting Internet Scale Storage on November 1st.

 

If you are interested in any of the talks in the series, all are open to the public and the upcoming schedule is posted at: http://www.cs.washington.edu/news/newdlshome.html.

 

The most notable statistic from the Joe Tucci talk was the massive investment that EMC is making mergers and acquisitions. He said over the next 5 years, EMC will spend $10.5B in R&D – this number alone is amazingly large -- but what I found really startling was they expect to spend even more purchasing companies. They expect to spend $14.0B on M&A during this same period. That’s nearly $3B/year from just a single company. Amazing.

 

With many large companies increasingly looking to the startup community for new ideas and innovation, there is incredible opportunity for startups.  Joe emphasized the opportunity, saying that Washington in general and especially the University of Washington will likely be the source of many of these new companies. As large companies lean more on the startup community for new ideas, products, and services, it’s a good time to be starting a company.

 

My rough notes from the talk:

 

·         IDC reports:

o   This decade WW information content will grow 44x (0.9 zettabytes to 35.2)

o   90% unstructured

·         Big data has arrived

o   Mobile sensors

o   Social media

o   Video surveillance

o   Smart grids

o   Gene sequencing

o   Medical imaging

o   Geophysical exploration

·         73% maintaining existing infrastructure (true for 10 years)

o   JRH: I’ve heard this statistic before but it seems like nearly has to be the case the most companies are spending at least 3/4s of their investment continuing to running the business and around a ¼ on new applications. The statistic is usually presented as a problem but it feels like it might be close to the right ratio.

·         3D movie is about a petabyte with all camera angles and footage included

·         The average company is attacked 300 times per week

o   All CIO say this is way light – my home router gets nailed that many times in a good hour

·         IT staffing will increase less than 50% in next 10 years but the data under management will grow faster.

o   JRH: Again this seems like the desirable outcome where the data under management should be able to grow far faster than administrative team

·         EMCs Mission: To lead customers towards a hybrid cloud

o   Leading customers to x86 based private clouds and hybrid clouds

o   Burst, test & development, etc. into the public cloud

o   Hybrid cloud between private and public is the “big winner”

·         VM is basically a cloud operating systems

o   EMC still owns 80% of VMWare

o   There are now more than virtual machines shipped than physical machines

o   62% virtualized out of the gate

·         Applications like SAP, Oracle, and Microsoft are now available in the cloud

·         Killer app for the cloud is big data

o   Real time data analytics

·         New end user computing

o   IOS devices, android, windows, …

·         Tenets of cloud computing

o   Efficiency, control, choice => Agility

o   Control through policy, service levels, and cost

·         Big  competitors

o   IBM, HP, Cisco, Microsoft, …

o   EMC is big at $20B but not close to as big as these incumbents

o   JRH: I’ve never thought of EMC as the small, nimble competitor but I guess it’s all relative

·         Recent acquisitions in drive to cloud & big data

o   Isilon

o   Greenplum

o   Datadomain

o   RSA

·         Mammoth 5 year M&A plan: roughly ½ of investments in R&D and ½ in M&A

o   14.0B M&A

o   10.5B: R&D

·         EMC has 14,000 sales people so there is huge potential synergy in any acquisition

o   Adding a 14,000 person sales team to any reasonable product is going to produce considerable new revenue quickly

·         EMC is now 152 in fortune 500

o   Revenue is $17B

o   Free cash flow: $3.4b

 

Thanks to Ed Lazowska for hosting this talk and many in the University of Washington Distinguished Lecturer Series.

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, October 20, 2011 6:24:29 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1] - Trackback
Services
 Thursday, October 13, 2011

We see press releases go by all the time and most of them deserve the yawn they get. But, one caught my interest yesterday. At the PASS Summit conference Microsoft Vice President Ted Kummert announced that Microsoft will be offering a big data solution based upon Hadoop as part of SQL Azure. From the Microsoft press release, “Kummert also announced new investments to help customers manage big data, including an Apache Hadoop-based distribution for Windows Server and Windows Azure and a strategic partnership with Hortonworks Inc.”

 

Clearly this is a major win for the early startup Hortonworks. Hortonworks is a spin out of Yahoo! and includes many of the core contributors to the Apache Hadoop distribution: Hortonwoks Taking Hadoop to Next Level.

 

This announcement is also a big win for the MapReduce processing model. First invented at Google and published in MapReduce: Simplified Data Processing on Large Clusters. The Apache Hadoop distribution is an open source implementation of MapReduce. Hadoop is incredibly widely used with Yahoo! running more than 40,000 nodes of Hadoop with their biggest single cluster now at 4,500 servers. Facebook runs a 1,100 node cluster and a second 300 node cluster. Linked in runs many clusters including deployments of 1,200, 580, and 120 nodes. See the Hadoop Powered By Page for many more examples.

 

In the cloud, AWS began offering Elastic MapReduce back in early 2009 and has been expanding the features supported by this offering steadily over the last couple of years adding support for Reserved Instances, Spot Instances, and Cluster Compute instances (on a 10Gb non-oversubscribed network – MapReduces just loves high bandwidth inter-node connectivity)and support for more regions with EMR available in Northern Virginia, Northern California, Ireland, Singapore, and Tokyo.

 

Microsoft expects to have a pre-production (what they refer to as a "community technology Preview") version of a Hadoop service available by the “end of 2011”.  This is interesting for a variety of reasons. First, its more evidence of the broad acceptance and applicability of the MapReduce model.  What is even more surprising is that Microsoft has decided in this case to base their MapReduce offering upon open source Hadoop rather than the Microsoft internally developed MapReduce service called Cosmos which is used heavily by the Bing search and advertising teams. The What is Dryad blog entry provides a good description of Cosmos and some of the infrastructure build upon the Cosmos core including Dryad, DryadLINQ, and SCOPE.

 

As surprising as it is to see Microsoft planning to offer MapReduce based upon open source rather than upon the internally developed and heavily used Cosmos platform, it’s even more surprising that they hope to contribute changes back to the open source community saying “Microsoft will work closely with the Hadoop community and propose contributions back to the Apache Software Foundation and the Hadoop project.”  

 

·         Microsoft Press Release: Microsoft Expands Data Platform

·         Hortonsworks Press Release: Hortonworks to Extend Apache Hadoop to Windows Users

·         Hortonworks Blog Entry: Bringing Apache Hadoop to Windows

 

Past MapReduce postings on Perspectives:

·         MapReduce in CACM

·         MapReduce: A Minor Step Forward

·         Hadoop Summit 2010

·         Hadoop Summit 2008

·         Hadoop Wins TeraSort

·         Google MapReduce Wins TeraSort

·         HadoopDB: MapReduce over Relational Data

·         Hortonworks Taking Hadoop to Next Level

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 

Thursday, October 13, 2011 7:08:10 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback
Services
 Wednesday, October 05, 2011

Earlier today we lost one of the giants of technology. Steve Jobs was one of most creative, demanding, brilliant, hard-driving, and innovative leaders in the entire industry. He has created new business areas, introduced new business models, brought companies back from the dead, and fundamentally changed how the world as a whole interacts with computers. He was a visionary of staggering proportions with an unusual gift in his ability to communicate a vision and also the drive to seek perfection in the execution of his ideas. We lost a giant today.

 

From Apple: http://www.apple.com/pr/library/2011/10/05Statement-by-Apples-Board-of-Directors.html

 

                                                                --jrh

 

James Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

Wednesday, October 05, 2011 5:39:29 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2] - Trackback

Disclaimer: The opinions expressed here are my own and do not necessarily represent those of current or past employers.

Archive
<January 2012>
SunMonTueWedThuFriSat
25262728293031
1234567
891011121314
15161718192021
22232425262728
2930311234

Categories
This Blog
Member Login
All Content © 2014, James Hamilton
Theme created by Christoph De Baene / Modified 2007.10.28 by James Hamilton