I’ve worked in or near the database engine world for more than 25 years. And, ironically, every company I’ve ever worked at has been working on a massive-scale, parallel, clustered RDBMS system. The earliest variant was IBM DB2 Parallel Edition released in the mid-90s. It’s now called the Database Partitioning Feature.
Massive, multi-node parallelism is the only way to scale a relational database system so these systems can be incredibly important. Very high-scale MapReduce systems are an excellent alternative for many workloads. But some customers and workloads want the flexibility and power of being able to run ad hoc SQL queries against petabyte sized databases. These are the workloads targeted by massive, multi-node relational database clusters and there are now many solutions out there with Oracle RAC being perhaps the most well-known but there are many others including Vertica, GreenPlum, Aster Data, ParAccel, Netezza, and Teradata.
What’s common across all these products is that big databases are very expensive. Today, that is changing with the release of Amazon Redshift. It’s a relational, column-oriented, compressed, shared nothing, fully managed, cloud hosted, data warehouse. Each node can store up to 16TB of compressed data and up to 100 nodes are supported in a single cluster.
Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse cluster, from provisioning capacity to monitoring and backing up the cluster, to applying patches and upgrades. Scaling a cluster to improve performance or increase capacity is simple and incurs no downtime. The service continuously monitors the health of the cluster and automatically replaces any component, if needed.
The core node on which the Redshift clusters are build, includes 24 disk drives with an aggregate capacity of 16TB of local storage. Each node has 16 virtual cores and 120 Gig of memory and is connected via a high speed 10Gbps, non-blocking network. This a meaty core node and Redshift supports up to 100 of these in a single cluster.
There are many pricing options available (see http://aws.amazon.com/redshift for more detail) but the most favorable comes in at only $999 per TB per year. I find it amazing to think of having the services of an enterprise scale data warehouse for under a thousand dollars by terabyte per year. And, this is a fully managed system so much of the administrative load is take care of by Amazon Web Services.
Service highlights from: http://aws.amazon.com/redshift
Fast and Powerful – Amazon Redshift uses a variety to innovations to obtain very high query performance on datasets ranging in size from hundreds of gigabytes to a petabyte or more. First, it uses columnar storage and data compression to reduce the amount of IO needed to perform queries. Second, it runs on hardware that is optimized for data warehousing, with local attached storage and 10GigE network connections between nodes. Finally, it has a massively parallel processing (MPP) architecture, which enables you to scale up or down, without downtime, as your performance and storage needs change.
You have a choice of two node types when provisioning your own cluster, an extra large node (XL) with 2TB of compressed storage or an eight extra large node (8XL) with 16TB of compressed storage. You can start with a single XL node and scale up to a 100 node eight extra large cluster. XL clusters can contain 1 to 32 nodes while 8XL clusters can contain 2 to 100 nodes.
Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse to improve performance or increase capacity, without incurring downtime. Amazon Redshift enables you to start with a single 2TB XL node and scale up to a hundred 16TB 8XL nodes for 1.6PB of compressed user data. Resize functionality is not available during the limited preview but will be available when the service launches.
Inexpensive – You pay very low rates and only for the resources you actually provision. You benefit from the option of On-Demand pricing with no up-front or long-term commitments, or even lower rates via our reserved pricing option. On-demand pricing starts at just $0.85 per hour for a two terabyte data warehouse, scaling linearly up to a petabyte and more. Reserved Instance pricing lowers the effective price to $0.228 per hour, under $1,000 per terabyte per year.
Fully Managed – Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse, from provisioning capacity to monitoring and backing up the cluster, and to applying patches and upgrades. By handling all these time consuming, labor-intensive tasks, Amazon Redshift frees you up to focus on your data and business insights.
Secure – Amazon Redshift provides a number of mechanisms to secure your data warehouse cluster. It currently supports SSL to encrypt data in transit, includes web service interfaces to configure firewall settings that control network access to your data warehouse, and enables you to create users within your data warehouse cluster. When the service launches, we plan to support encrypting data at rest and Amazon Virtual Private Cloud (Amazon VPC).
Reliable – Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster and all data is continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster and automatically replaces any component, as necessary.
Compatible – Amazon Redshift is certified by Jaspersoft and Microstrategy, with additional business intelligence tools coming soon. You can connect your SQL client or business intelligence tool to your Amazon Redshift data warehouse cluster using standard PostgreSQL JBDBC or ODBC drivers.
Designed for use with other AWS Services – Amazon Redshift is integrated with other AWS services and has built in commands to load data in parallel to each node from Amazon Simple Storage Service (S3) and Amazon DynamoDB, with support for Amazon Relational Database Service and Amazon Elastic MapReduce coming soon.
Petabyte-scale data warehouses no longer need command retail prices of upwards $80,000 per core. You don’t have to negotiate an enterprise deal and work hard to get the 60 to 80% discount that always seems magically possible in the enterprise software world. You don’t even have to hire a team of administrators. Just load the data and get going. Nice to see.
James Hamilton e: firstname.lastname@example.org w: http://www.mvdirona.com b: http://blog.mvdirona.com / http://perspectives.mvdirona.com
Disclaimer: The opinions expressed here are my own and do not
necessarily represent those of current or past employers.