

Guest blog post by Vincent Granville

This could be a new startup idea: creating Excel-compatible software that works just like Excel but can handle much bigger datasets, much faster.

Like most data scientists, I've been using Excel a lot in my career, and it definitely has some very powerful features. Probably the greatest one is its ability to help design, test, and update models in real time: just change the value of a few core parameters, and all your cells, thousands of them, and all your charts get updated at once. If you are not familiar with this, download our most recent spreadsheet to see how it works.

Other nice features include Excel's ability to connect to databases or the Internet (e.g. to the Bing API), extract useful information, and summarize it in cubes and pivot tables. Cubes and pivot tables have a strong old-fashioned, SQL relational-database feel, but they are still useful in many contexts. I'm not sure whether Excel can easily retrieve data over the Internet from non-Microsoft APIs, e.g. Google APIs. It should.

Yet the main drawback is Excel's slowness. It is slow in ways that are unexpected. If you sort 500,000 observations (one column), it's actually quite fast. But say you simultaneously sort two columns, A and B, where B is the log of the values in A, so that B contains a formula in each row. This dramatically slows down the sort. It is much faster to sort A alone and leave B as is: the formulas in B will correctly update all the values, very fast, and automatically.
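For readers who think in code rather than in spreadsheets, here is a rough analogy in Python/pandas (the data and column names are invented for illustration): sort only the base column, then recompute the derived one.

import numpy as np
import pandas as pd

# Illustrative stand-ins for the two Excel columns
n = 500_000
df = pd.DataFrame({"A": np.random.rand(n) * 1000 + 1})
df["B"] = np.log(df["A"])  # B plays the role of the formula column

# Equivalent of "sort A alone, then let the formulas refresh B":
df = df.sort_values("A", ignore_index=True)
df["B"] = np.log(df["A"])  # cheap, vectorized recomputation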

As a general rule, any operation that involves re-computing (say) 200,000+ rows across multiple cells linked via dependence relationships is done very slowly if:

  • one of the columns includes functions such as VLOOKUP in each cell 
  • SORT or other super-linear processes are required (by super-linear, I mean processes that are O(n log n) or worse)

There's a very easy, efficient (but ugly) way around this, and I'm wondering why it's not built into Excel and transparent to the user. For instance:

  • Replace the VLOOKUP formulas with hard-coded values, perform the update on the hard-coded values, then put the formulas back 
  • Perform the SORT only on the minimum number of columns necessary, then update the cells in the columns that contain formulas

Also, I don't know why VLOOKUP is so slow in the first place. I routinely do the equivalent operation, joining a (say) 3,000,000-row dataset with a 200,000-row lookup table, in Perl, and it's very fast. In Excel, the dataset and the lookup table would be stored in two separate worksheets within the same spreadsheet, and it would take days to perform this "join" if it were possible at all (it isn't). Maybe Excel indirectly performs a full join, dramatically slowing down the operation? I almost never do a full join on large datasets, and neither do most people. This is an area where significant improvements could be made.
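For comparison, here is a minimal sketch of the same kind of lookup done outside Excel, in Python/pandas rather than Perl (the sizes match the example above, but the column names and values are made up); a hash-based left join like this runs in seconds on millions of rows:

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the big dataset and the lookup table
data = pd.DataFrame({"key": np.random.randint(0, 200_000, size=3_000_000)})
lookup = pd.DataFrame({"key": np.arange(200_000),
                       "value": np.random.rand(200_000)})

# Equivalent of a VLOOKUP: a left join on the key column
joined = data.merge(lookup, on="key", how="left")
print(joined.head())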

Finally, I'm not sure whether Excel leverages the cloud, but this would be a great way to further speed up computations and to process data sets far bigger than 1 million rows. Microsoft should allow the user to export her data to some (Microsoft?) cloud transparently, in one click (e.g. by clicking "export to cloud"), then let her set up the time-consuming operations in the local Excel version on her desktop (this amounts to using the local spreadsheet as a virtual spreadsheet), and, when done, click "retrieve from cloud" to get the spreadsheet updated. The "retrieve from cloud" step would (a rough sketch in code follows this list):

  1. Send your Excel formulas to the cloud via the Internet
  2. Apply your formulas to the cloud copy of your data, leveraging MapReduce as needed
  3. Send the processed data back to your local spreadsheet on your desktop
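No such one-click feature exists today, but here is a rough, purely illustrative Python sketch of the idea, using local worker processes as a stand-in for the cloud back end:

import numpy as np
from multiprocessing import Pool

def apply_formula(chunk):
    # The "formula" shipped to the back end: here, a toy transformation
    return np.log1p(chunk) * 2.5

if __name__ == "__main__":
    data = np.random.rand(5_000_000)      # the exported spreadsheet column
    chunks = np.array_split(data, 16)     # "map" step: split across workers
    with Pool() as pool:
        results = pool.map(apply_formula, chunks)
    updated = np.concatenate(results)     # "reduce" step: reassemble the column
    print(updated[:5])                    # what would flow back into the local sheet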

Another painfully slow process is applying a formula to a whole column with 500,000 cells. Fortunately, there is a trick (a Python analogy follows the steps below). Let's say you want to store the product of A and B in column C:

  • First, select the whole column C.
  • Enter the formula =A1*B1.
  • Press the Ctrl and Enter keys together.
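For comparison, in a columnar tool such as Python/pandas (used here only as an analogy, with invented column names), the whole-column operation is the default behavior rather than a trick:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.rand(500_000),
                   "B": np.random.rand(500_000)})

# Applies the "formula" to every row of column C at once
df["C"] = df["A"] * df["B"]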

I wish it were easier than that, something like ==A1*B1 (a formula with a double equal sign to indicate that it applies to the entire column, not just one cell). This is another example of how Excel is not user-friendly: many times, there is only some obscure way to do something efficiently. We'll see another example in my next spreadsheet, which will teach you how to write a formula that returns multiple cells, in particular with LINEST, which returns the regression coefficients associated with a linear regression. Yes, you can do it in basic Excel, without an add-in!
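To give a sense of what the LINEST computation amounts to outside Excel, here is a small Python/NumPy sketch on synthetic, purely illustrative data; a single call returns all the regression coefficients at once:

import numpy as np

# Synthetic data: y depends linearly on two predictors, plus noise
rng = np.random.default_rng(0)
X = rng.random((1000, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=1000)

# Add an intercept column, then solve the least-squares problem
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # slopes and intercept, analogous to what LINEST returns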

For those interested in creating a column with 1,000,000 values (e.g. to test the speed of some Excel computations), here's how to proceed - this would indeed be a good job interview question:

  • Start with 200 cells.
  • Duplicate these cells; now you have 400.
  • Duplicate these cells; now you have 800.

Another 10 iterations of this process and you'll be at roughly 800,000 cells (819,200, to be exact); one more doubling takes you past a million. It's much faster than the naive approach. And if your initial 200 numbers consist of the formula =RAND(), you'll end up with a million pseudo-random numbers, though of poor quality.
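The same doubling idea translates directly into code. Here is a short Python sketch of it (array sizes chosen to mirror the example above):

import numpy as np

values = np.random.rand(200)                     # start with 200 cells of =RAND()
while len(values) < 1_000_000:
    # "duplicate" the block; like Excel's =RAND(), the copies evaluate to fresh randoms
    values = np.concatenate([values, np.random.rand(len(values))])

values = values[:1_000_000]                      # trim the overshoot to exactly 1,000,000
print(len(values))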

Finally, I use Perl, Python, R, C, sort, grep, and other tools to summarize data before feeding it to Excel. But if Excel came with the features I just discussed, much more could be done within Excel, and users would spend more time in it.

And one odd question to finish: why does Microsoft not show ads in Excel? It would bring in advertising dollars that could be used to improve Excel. Microsoft could then ship a version of Excel that transparently uses the cloud, a version that even non-technical users could work with, with all the benefits I mentioned.


Where The Cloud Meets The Grid

Guest blog post by Peter Higdon

Companies build or rent grid machines when their data doesn't fit into HDFS, or when the latency of parallel interconnects in the cloud is too high. This review explores the overlap of the two paradigms at the opposite ends of the parallel-processing latency spectrum. The comparison is almost poetic and leads to many other comparisons in languages, interfaces, formats, and hardware, yet there is amazingly little overlap.

Your Laptop Is A Supercomputer

To put things in perspective, 60 years ago "computer" was a job title. When the Wu-Tang Clan dropped 36 Chambers, the bottom-ranking machine in the TOP500 was a quad-core Cray. Armed with your current machine, you should be able to dip your toes into any project before diving in head first. Take a small slice of the data first to get a glimpse of the obstacles ahead. Start with 1/10th, then 1/8, then 1/4, and so on, until your machine can't handle it anymore. Usually by that time, your project will have run into problems that can't be fixed simply by getting a bigger computer.
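One simple way to do this in practice is sketched below in Python/pandas (the file name is hypothetical; replace it with your own data):

import pandas as pd

# Read progressively larger slices of a hypothetical dataset
for nrows in (100_000, 125_000, 250_000, 500_000, 1_000_000):
    sample = pd.read_csv("big_dataset.csv", nrows=nrows)
    # run your analysis on the slice and watch memory and runtime grow
    print(nrows, sample.memory_usage(deep=True).sum() / 1e6, "MB")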

DIY HPC

Depending on the kind of problem you are solving, building your own Beowulf cluster out of old commodity hardware might be the way to go. If you need to run physics simulations or BLAST alignments around the clock, a load of wholesale off-lease laptops should get the job done for under $2,000.

http://www.comsol.com/blogs/building-beowulf-cluster-faster-multiphysics-simulations/

Some Raspberry Pi enthusiasts have built a 32-node cluster, but it has limited use cases, given the limitations of the RPi's ARM processor.

http://www.zdnet.com/build-your-own-supercomputer-out-of-raspberry-pi-boards-7000015831/


Password hashing and Bitcoin farms use ASICs and FPGAs. In these cases, interconnect latency is much less important than single-thread processing speed.

Move To The Cloud

You don't need to go through the hassle of wiring and configuring a cluster for a short-term project. The hourly cost savings of running your own servers quickly diminish as you struggle through the details of MIS: DevOps, provisioning, deployment, hardware failure, etc. Small development shops and big enterprises like Netflix are happy to pay premiums for a managed solution. We have a staggering variety of SLAs available today as service providers compete to capture new markets.

Cloud Bursting

When your cluster can't quite handle the demand of your process, rent a few servers from the cloud to handle the overflow.
http://archives.opennebula.org/documentation:rel4.4:introh

Cloud Bridging

Use your cluster to handle sensitive private data, and shift non-critical data to a public cloud.
http://www.citrix.com/products/cloudbridge/tech-info.html

GPU Hosting

Companies like EMC use graphics cards in cloud clusters to handle vector arithmetic. It works great for a specific subset of business solutions that use SVMs and other kernel methods.

Vectorization

Vectorization is at the heart of optimizing parallel processes. Understanding how your code uses low-level libraries will help you write faster code. De-vectorized R code is a well-known performance killer.
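The point is easy to demonstrate in any scripting language; here is an illustrative comparison in Python/NumPy (the R story is analogous: a for loop versus a single vectorized expression):

import time
import numpy as np

x = np.random.rand(1_000_000)

# De-vectorized: an interpreted loop over every element
t0 = time.time()
slow = np.empty_like(x)
for i in range(len(x)):
    slow[i] = x[i] * 2.0 + 1.0
print("loop:      ", time.time() - t0)

# Vectorized: one call into optimized, compiled array code
t0 = time.time()
fast = x * 2.0 + 1.0
print("vectorized:", time.time() - t0)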

Julia: The convergence of Big Data and HPC

"Julia makes it easy to connect to a bunch of machines—collocated or not, physical or virtual—and start doing distributed computing without any hassle. You can add and remove machines in the middle of jobs, and Julia knows how to serialize and deserialize your data without you having to tell it. References to data that lives on another machine are a first-class citizen in Julia like functions are first-class in functional languages. This is not the traditional HPC model for parallel computing but it isn’t Hadoop either. It’s somewhere in between. We believe that the traditional HPC and “Big Data” worlds are converging, and we’re aiming Julia right at that convergence point." -Julia development team.

Julia is designed to handle the vectorization for you, making de-vectorized code run faster than vectorized code.

New Compile Targets via LLVM

Scripting languages are built on top of low-level libraries like BLAS, so that under the hood, you are actually running FORTRAN.

Python can be efficient because libraries like NumPy have optimized how they call these underlying routines.

http://www.slideshare.net/teoliphant/numba-siam-2013
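A quick, illustrative way to see this from Python:

import numpy as np

# The matrix product below is dispatched to the BLAS/LAPACK build that
# NumPy was compiled against, not executed in interpreted Python.
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = a @ b

np.show_config()  # prints which BLAS/LAPACK libraries are linked in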

LLVM is acting as the middle-man between the scripting languages and machine code.

Asm.js runs C/C++ code in the browser by converting LLVM-generated bitcode into a subset of JavaScript with surprising efficiency.

OpenCL and Heterogeneous Computing

AMD has bet its future on the convergence of the CPU and GPU with its heterogeneous system architecture (HSA) and OpenCL. Most data scientists will never write such low-level code, but it is worth noting in this review.

