

Where The Cloud Meets The Grid

Guest blog post by Peter Higdon

Companies build or rent grid machines when data volumes don't fit into HDFS, or when the latency of the cloud's interconnects is too high for tightly coupled parallel work. This review explores the overlap of the two paradigms at opposite ends of the parallel-processing latency spectrum. The comparison is almost poetic and invites many others, in languages, interfaces, formats, and hardware, yet there is surprisingly little overlap.

Your Laptop Is A Supercomputer

To put things in perspective: 60 years ago, "computer" was a job title. When the Wu-Tang Clan dropped 36 Chambers, the bottom-ranked machine in the TOP500 was a quad-core Cray. Armed with your current machine, you should be able to dip your toes into any project before diving in head first. Take a small slice of the data first to get a glimpse of the obstacles ahead. Start with 1/10, then 1/8, 1/4, and so on, until your machine can't handle it anymore. Usually by that time, your project will have hit problems that can't be fixed simply by getting a bigger computer.
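A minimal sketch of that slicing workflow in Python (the `subsample` helper and the fraction schedule are illustrative, not from the original post):

```python
import random

def subsample(records, fraction, seed=42):
    """Return a reproducible random slice of the dataset."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)

data = list(range(100_000))  # stand-in for your real records
for fraction in (1/10, 1/8, 1/4):
    sample = subsample(data, fraction)
    print(fraction, len(sample))  # grow the slice until the machine struggles
```

Fixing the seed keeps each run comparable, so you can tell whether a new obstacle came from the bigger slice or from a code change.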


Depending on the kind of problem you are solving, building your own Beowulf cluster out of old commodity hardware might be the way to go. If you need a steady stream of physics simulations or BLAST alignments, a load of wholesale off-lease laptops should get the job done for under $2,000.

Some Raspberry Pi enthusiasts have built a 32-node cluster, but it has limited use cases, given the limitations of the RPi's ARM processor.

Password cracking and Bitcoin mining farms use ASICs and FPGAs. In these cases, the latency of interconnects matters much less than single-thread performance.

Move To The Cloud

You don't need to go through the hassle of wiring and configuring a cluster for a short-term project. The hourly cost savings of running your own servers quickly diminish as you struggle through the details of MIS: DevOps, provisioning, deployment, hardware failure, etc. Small development shops and big enterprises like Netflix are happy to pay premiums for a managed solution. We have a staggering variety of SLAs available today as service providers compete to capture new markets.

Cloud Bursting

When your cluster can't quite handle the demand of your workload, rent a few servers from the cloud to handle the overflow.

Cloud Bridging

Use your cluster to handle sensitive private data, and shift non-critical data to a public cloud.

GPU Hosting

Companies like EMC use graphics cards in cloud clusters to handle vector arithmetic. It works great for a specific subset of business problems that rely on SVMs and other kernel methods.


Vectorization is at the heart of optimizing parallel processes. Understanding how your code uses low-level libraries will help you write faster code. De-vectorized R code is a well-known performance killer.
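The same point translates directly to Python with NumPy (a sketch; the function names are mine): the de-vectorized loop runs element by element in the interpreter, while the vectorized call hands the whole operation to compiled library code.

```python
import numpy as np

def devectorized_sumsq(arr):
    """Element-by-element loop in the interpreter -- the pattern
    that kills performance in R and Python alike."""
    total = 0.0
    for v in arr:
        total += v * v
    return total

def vectorized_sumsq(arr):
    """One call that dispatches the whole loop to compiled code."""
    return float(np.dot(arr, arr))

x = np.arange(1_000_000, dtype=np.float64)
# Same answer, but the vectorized form is typically orders of
# magnitude faster on arrays this size.
assert devectorized_sumsq(x[:1_000]) == vectorized_sumsq(x[:1_000])
```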

Julia: The convergence of Big Data and HPC

"Julia makes it easy to connect to a bunch of machines—collocated or not, physical or virtual—and start doing distributed computing without any hassle. You can add and remove machines in the middle of jobs, and Julia knows how to serialize and deserialize your data without you having to tell it. References to data that lives on another machine are a first-class citizen in Julia like functions are first-class in functional languages. This is not the traditional HPC model for parallel computing but it isn’t Hadoop either. It’s somewhere in between. We believe that the traditional HPC and “Big Data” worlds are converging, and we’re aiming Julia right at that convergence point." -Julia development team.

Julia is designed to handle vectorization for you: its compiler makes straightforward de-vectorized (loop-based) code run as fast as, and sometimes faster than, its vectorized equivalent.
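Julia itself is the right tool for the model described above, but the flavor of "remote references as first-class values" can be sketched with Python's standard library, where a Future is a handle to a result living in another process (the `heavy` workload is just a placeholder):

```python
from concurrent.futures import ProcessPoolExecutor

def heavy(n):
    """Placeholder for an expensive computation."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        # Each Future is a reference to data that lives (or will live)
        # in another worker process -- fetch it only when you need it.
        futures = [pool.submit(heavy, n) for n in (10, 100, 1_000)]
        print([f.result() for f in futures])  # [285, 328350, 332833500]
```

Unlike Julia's model, this pool is confined to one machine; the point is only that work is submitted eagerly and results are pulled back lazily through references.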

New Compile Targets via LLVM

Scripting languages are built on top of low-level libraries like BLAS, so that under the hood you are often actually running Fortran.

Python can be efficient because libraries like NumPy have optimized how they use underlying libraries.

LLVM acts as the middle-man between scripting languages and machine code.

Asm.js runs C/C++ code in the browser by converting LLVM-generated bytecode into a subset of JavaScript with surprising efficiency.

OpenCL and Heterogeneous Computing

AMD has bet their future on the convergence of the CPU and GPU with their heterogeneous system architecture (HSA) and OpenCL. Most Data Scientists will never write such low-level code, but it is worth noting in this review.


Guest blog post by Michael Walker

High Performance Computing (HPC) plus data science allows public and private organizations to extract actionable, valuable intelligence from massive volumes of data and to use predictive and prescriptive analytics to make better decisions and create game-changing strategies. The integration of computing resources, software, networking, data storage, information management, and data scientists applying machine learning algorithms is the secret sauce for achieving the fundamental goal: durable competitive advantage.

HPC has evolved over the past decade to provide "supercomputing" capabilities at significantly lower cost. Modern HPC applies parallel processing techniques to complex computational problems, combining systems administration with the design of parallel algorithms.

HPC enables data scientists to address challenges that were unmanageable in the past. HPC expands modeling and simulation capabilities, including advanced data science techniques like random forests, Monte Carlo simulations, Bayesian probability, regression, naive Bayes, k-nearest neighbors, neural networks, decision trees, and others.
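As a toy illustration of the Monte Carlo technique mentioned above (this example is mine, not from the post), here is the classic estimate of pi by random sampling. Each sample is independent of the others, which is exactly why such simulations parallelize so well across HPC nodes:

```python
import random

def estimate_pi(n_samples, seed=0):
    """Fraction of random points in the unit square that land
    inside the quarter circle, scaled up to estimate pi."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples

print(estimate_pi(100_000))  # close to 3.14159
```

On a cluster, each node would run its own independent batch of samples and the per-node counts would simply be summed at the end.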

Additionally, HPC allows an organization to conduct controlled experiments in a timely manner, and to research questions that would be too costly or time-consuming to study experimentally. With HPC you can build mathematical models and run numerical simulations to gain understanding where direct observation is impractical.

HPC technology today is implemented in multidisciplinary areas including:

• Finance and trading

• Oil and gas industry

• Electronic design automation

• Media and entertainment

• Biosciences

• Astrophysics

• Geographical data

• Climate research

In the near future both public and private organizations in many domains will use HPC plus data science to boost strategic thinking, improve operations and innovate to create better services and products.

