The interface between the worlds of Cloud Computing & the Semantic Web

Paul Miller

Eucalyptus: Blog Post

Crunching Data in the Cloud

Amazon tethers balloons for now; attention turns to Elastic MapReduce web service

[Image: warning for laserbeam, symbol D-W010. Image via Wikipedia]

Amid mounting international concern that the guidance lasers aboard Jeff Bezos’ new Floating Amazon Cloud Environment would interfere with Rudolph’s sense of direction, sources close to the Amazon Web Services team tell me that they’ve been forced to alter priorities and switch attention to an early release of the next product on their roadmap.

Today sees the release of Amazon’s latest web service, the Hadoop-powered Elastic MapReduce:

“Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.”

The company’s press release quotes VP for Product Management & Developer Relations, Adam Selipsky, who notes:

“Some researchers and developers already run Hadoop on Amazon EC2, and many of them have asked for even simpler tools for large-scale data analysis. Amazon Elastic MapReduce makes crunching in the cloud much easier as it dramatically reduces the time, effort, complexity and cost of performing data-intensive tasks.”

MapReduce was brought to prominence by Google, and is one of the principal techniques the company uses to break massive data sets into manageable chunks suitable for cost-effective processing on the commodity hardware for which it is known. The abstract for a Google research paper on the topic outlines the value proposition reasonably succinctly:

“MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.”
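To make the map and reduce roles described in that abstract a little more concrete, here is a minimal single-machine sketch in Python. It is not Google’s implementation, nor Hadoop’s API; the names `map_words`, `reduce_counts` and `run_mapreduce` are hypothetical, invented purely for illustration, and the grouping step that a real system spreads across many machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(_key: str, line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy driver (an assumption for illustration): group intermediate
    values by key in memory, then call the reducer once per key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in mapper(key, value):
            groups[inter_key].append(inter_value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


# Hypothetical input: (document name, line of text) pairs.
records = [
    ("doc1", "the cloud is elastic"),
    ("doc2", "the cloud is cheap"),
]
word_counts = run_mapreduce(records, map_words, reduce_counts)
print(word_counts)
```

The point of the model is that only the two small functions are specific to the problem; partitioning, scheduling and failure handling are left to whatever runs them, which is the part a framework such as Hadoop takes on.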

Hadoop is a Yahoo!-nurtured Open Source equivalent to Google’s MapReduce, managed as a project of the Apache Software Foundation, and reputedly scalable to handle many petabytes of data distributed across thousands of CPUs.

As Adam noted in the press release, customers (such as the New York Times and Netflix) are already using Hadoop on Amazon’s Web Services. Today’s announcement makes it easier to commission (and decommission) the required compute resources cost-effectively and transparently. This is the ‘elasticity’ referred to in the new service’s name, and is an increasingly important aspect of the current generation of Cloud-based compute services; much of the economic value proposition lies in only using (and therefore paying for) the resources you actually need to complete a task. If demand increases, the number of (virtual) machines available should rapidly increase to cope, and they should shut back down just as rapidly when the demand passes:

“Amazon Elastic MapReduce enables you to use as many or as few compute instances running Hadoop as you want. You can commission one, hundreds, or even thousands of instances to process gigabytes, terabytes, or even petabytes of data. And, you can run as many job flows concurrently as you wish. You can instantly spin up large Hadoop job flows which will start processing within minutes, not hours or days. When your job flow completes, unless you specify otherwise, the service automatically tears down your instances.”

Elastic MapReduce is currently available only in data centres in Amazon’s US region. Non-US customers can still use the service; they just have to be able and willing to transfer their data beyond their borders. It is priced as a surcharge on top of the existing EC2 instance charges: running Elastic MapReduce on a $US0.10 per hour ‘small’ instance costs a further $US0.015 per hour (yes, one and a half cents per hour), and on a $US0.80 per hour ‘extra large’ instance a further $US0.12 per hour.
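As a rough worked example of how those two charges combine, the sketch below adds the EC2 rate and the Elastic MapReduce surcharge quoted above and multiplies by instances and hours. The function name `job_flow_cost`, the job sizes, and the rounding of part-hours up to whole hours are all assumptions made for illustration; storage and data transfer charges are ignored.

```python
import math
from decimal import Decimal

# Hourly rates in US dollars quoted in this post:
# (EC2 instance charge, additional Elastic MapReduce charge).
PRICES = {
    "small": (Decimal("0.10"), Decimal("0.015")),
    "extra_large": (Decimal("0.80"), Decimal("0.12")),
}


def job_flow_cost(instance_type: str, instances: int, hours: float) -> Decimal:
    """Estimate the compute cost of one job flow (hypothetical helper)."""
    ec2_rate, emr_rate = PRICES[instance_type]
    # Assumption: part-hours are billed as whole instance-hours.
    billed_hours = math.ceil(hours)
    return (ec2_rate + emr_rate) * instances * billed_hours


# Hypothetical jobs: 20 small instances for 3 hours,
# and 4 extra large instances for 2.5 hours (billed here as 3).
small_cost = job_flow_cost("small", 20, 3)
xl_cost = job_flow_cost("extra_large", 4, 2.5)
print("small job flow: $US", small_cost)
print("extra large job flow: $US", xl_cost)
```

Under those assumptions the surcharge is 15% of the instance price in both cases, so the Hadoop layer adds relatively little to a bill that is dominated by the underlying EC2 hours.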

Elastic MapReduce is another nice example of slow, incremental improvement to Amazon’s core Web Services offer.

It remains to be seen, as developers get down to using it for real, whether it’s pitched as a low-end disruptor that simply rounds out another piece of the emerging AWS whole, or whether it’s a viable competitor in its own right to the recently announced Cloudera, which sees taking Hadoop to mainstream enterprise customers as its raison d’être:

“Cloudera can help you install, configure and run Hadoop for large-scale data processing and analysis. Get Cloudera’s Distribution for Hadoop and start working with Big Data today.”

Update: Amazon’s Jeff Barr provides a lot more detail in a post to the AWS Blog.

More Stories By Paul Miller

Paul Miller works at the interface between the worlds of Cloud Computing and the Semantic Web, providing the insights that enable you to exploit the next wave as we approach the World Wide Database. He blogs at www.cloudofdata.com.
