Code-Named “Dryad” Ships in Beta!
Wednesday, May 4, 2011 – 2:47 PM
Well, it finally happened… Earlier this week the HPC team shipped their beta, and it included Dryad.
We’re happy to announce that as part of the beta for Microsoft HPC Pack 2008 R2 SP2 we’re shipping a beta of the project code-named “Dryad.” Dryad is Microsoft’s solution for “Big Data.” What’s Big Data? Today’s environment is full of ever-growing mountains of data, from web logs and social networking feeds to fraud detection and recommendation engines, or large science and engineering problems like genomic analysis and high energy physics. Big data is becoming an increasingly common scenario.
We got mentioned on InsideHPC and Mary-Jo Foley’s blog.
There are a lot of cool things to say about Dryad, but the thing nearest and dearest to my heart is the Dryad programming model. It is based on LINQ, so developers can express their queries using a familiar syntax. Behind the scenes Dryad does all the heavy lifting of creating a query plan, deploying your assemblies and executing the query in a scalable and robust way.
Suppose I had a whole load of text files and I wanted to find out the most common words in the entire set of files. Here’s how you go about doing that.
If you’re familiar with LINQ then writing DryadLINQ queries is very straightforward. Simply create an HpcLinqContext instance and execute a query.
    HpcLinqConfiguration config = new HpcLinqConfiguration(SampleConfiguration.HeadNode);
    config.JobFriendlyName = "Histogram sample";
    HpcLinqContext context = new HpcLinqContext(config);
Next create a query:
    IQueryable<Pair> results = context.FromDsc<LineRecord>(inputFileSetName)
        .SelectMany(line => line.Line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
        .GroupBy(word => word)
        .Select(word => new Pair(word.Key, word.Count()))
        .OrderByDescending(pair => pair.Count)
        .Take(200);
Here the query reads data from the file set named by inputFileSetName as LineRecord objects, each representing a line in a text file. It then splits each line into words and groups identical words together. From each group it creates a Pair containing a word and the number of times it occurs. Finally, the pairs are sorted by count and the 200 most common words are returned.
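If you want to get a feel for the operator chain before pointing it at a cluster, the same steps can be tried with plain LINQ to Objects on an in-memory list of lines. This is only a rough local sketch: LocalWordCount and TopWords are hypothetical names made up for illustration, KeyValuePair stands in for the sample’s Pair type, and nothing here touches Dryad or HPC Server.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local stand-in for the word count query; not part of the Dryad beta.
public static class LocalWordCount
{
    // Returns the `count` most frequent words across all lines, most frequent first.
    public static List<KeyValuePair<string, int>> TopWords(IEnumerable<string> lines, int count)
    {
        return lines
            // Split each line into words on spaces and tabs, dropping empty entries.
            .SelectMany(line => line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
            // Group identical words together.
            .GroupBy(word => word)
            // Turn each group into a (word, occurrences) pair.
            .Select(group => new KeyValuePair<string, int>(group.Key, group.Count()))
            // Most frequent words first.
            .OrderByDescending(pair => pair.Value)
            .Take(count)
            .ToList();
    }

    public static void Main()
    {
        string[] lines = { "the quick brown fox", "the lazy dog", "the fox" };

        // "the" occurs three times and "fox" twice in the sample lines.
        foreach (KeyValuePair<string, int> pair in TopWords(lines, 2))
            Console.WriteLine(" {0,-20} : {1}", pair.Key, pair.Value);
    }
}
```

The DryadLINQ query above uses the same operators in the same order; the difference is that its source is a file set on the cluster rather than an array in memory.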
Two really cool things are going on here.
One… The individual LINQ operations are distributed across the nodes of an HPC Cluster and the results are aggregated.
Two… Dryad leverages the LINQ programming model rather than forcing you to rethink your query in terms of a particular pattern like MapReduce.
Finally, print the results. Queries are lazy, so enumerating the results is what actually causes the query to be executed on the cluster.
    foreach (Pair result in results)
        Console.WriteLine(" {0,-20} : {1}", result.Word, result.Count);

As your job runs you can use the HPC Job Manager to view its progress, just like any other job running in HPC Server.
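The lazy behavior mentioned above is standard LINQ deferred execution, and it is easy to see locally. The sketch below uses plain LINQ to Objects rather than Dryad, and DeferredExecutionDemo, BuildQuery and Evaluations are hypothetical names invented for the illustration. It counts how often the projection runs, to show that defining a query does no work until something enumerates it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo of LINQ deferred execution; runs locally, no cluster involved.
public static class DeferredExecutionDemo
{
    // Number of times the projection lambda has actually run.
    public static int Evaluations;

    // Builds a query that doubles each element. Nothing is evaluated here.
    public static IEnumerable<int> BuildQuery(IEnumerable<int> source)
    {
        return source.Select(x => { Evaluations++; return x * 2; });
    }

    public static void Main()
    {
        Evaluations = 0;
        IEnumerable<int> query = BuildQuery(new[] { 1, 2, 3 });

        // The query is only a description at this point, so the counter is still 0.
        Console.WriteLine("After defining the query: {0} evaluations", Evaluations);

        // Enumerating the query is what triggers the work.
        foreach (int value in query)
            Console.WriteLine(value);

        // The lambda has now run once per element, so the counter is 3.
        Console.WriteLine("After enumerating: {0} evaluations", Evaluations);
    }
}
```

With a DryadLINQ query the principle is the same, except that the work triggered by the foreach is a job submitted to the cluster rather than a loop over a local array.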
That’s it! Distributed word count using Dryad in just a few lines of code. The word count example is just one of several samples we’re shipping with the beta to get you started. Others include sort, k-means clustering, table joins and SQL connectivity, to mention just a few.
You can download the beta from https://connect.microsoft.com/hpc and ask questions about it on our MSDN forum.