
This article was written by Michael Walker.

High Performance Computing (HPC) plus data science allows public and private organizations to extract actionable, valuable intelligence from massive volumes of data and to use predictive and prescriptive analytics to make better decisions and create game-changing strategies. The integration of computing resources, software, networking, data storage, information management, and data scientists using machine learning and algorithms is the secret sauce for achieving the fundamental goal of creating durable competitive advantage.

HPC has evolved in the past decade to provide "supercomputing" capabilities at significantly lower costs. Modern HPC uses parallel processing techniques to solve complex computational problems. HPC technology focuses on developing parallel processing algorithms and systems, incorporating both administrative and parallel computational techniques.

HPC enables data scientists to address challenges that were unmanageable in the past. HPC expands modeling and simulation capabilities, including the use of advanced data science techniques such as random forests, Monte Carlo simulations, Bayesian probability, regression, naive Bayes, k-nearest neighbors, neural networks, decision trees, and others.

Additionally, HPC allows an organization to conduct controlled experiments in a timely manner, as well as to conduct research into questions that are too costly or time-consuming to study experimentally. With HPC you can build mathematical models and run numerical simulations when direct observation is impractical or impossible.

HPC technology today is implemented in multidisciplinary areas including:

• Finance and trading

• Oil and gas industry

• Electronic design automation

• Media and entertainment

• Biosciences

• Astrophysics

• Geographical data

• Climate research

In the near future both public and private organizations in many domains will use HPC plus data science to boost strategic thinking, improve operations and innovate to create better services and products.

Originally posted here.


New Directions in Cryptography

Here I propose an alternative to traditional cryptography. Traditional cryptography relies on hash functions (Bitcoin) or large prime numbers (RSA). The foundations of this new technology are based on my book on numeration systems, available for free to DSC members. The key idea is to use numeration systems with non-integer bases to represent numbers.

1. Representation of numbers in a non-integer base

The following also applies to integer bases such as base 2 (binary), base 10 (decimal) or base 16 (hexadecimal). However, our interest is in bases that are real numbers between 1.5 and 1.9. In these bases, the digits -- just like in the binary base-2 system -- are always 0 or 1. Unlike the binary system though, the proportions of 0's and 1's are not equal to 50%, and there is some auto-correlation in the sequence of digits. I explain in the next section how to handle this problem. Another issue is that in these small bases, a digit carries less than one bit of information, thus making the message to be transmitted longer than the binary (base-2) code that represents it. More precisely, the amount of information stored in one digit is equal to log(b) / log(2) bits, where b is the base. This is why we want to avoid bases that are smaller than 1.5.

Algorithm to compute the digits

To compute the digits of a number x between 0 and 1, in base b, one proceeds as follows:

  • Start with x(1) = x and a(1) = INT(bx). 
  • Iteratively compute x(n) = b * x(n-1) - INT(b * x(n-1)), and a(n) = INT(b * x(n)).

Here INT represents the integer part, also called the floor function. The above algorithm is just a version of the greedy algorithm. In all cases, x(n) is a real number between 0 and 1, and a(n) is the n-th digit of x in base b.

Once the digits in base b are known, it is easy to retrieve the number x, using the formula x = a(1)/b + a(2)/b^2 + a(3)/b^3 + ...

Typically, you need to use high performance computing if you want to compute more than 45 digits or so, due to limitations in machine precision. How to do it is described in chapter 8 of my book. It is not a challenging problem if you only need a few hundred, maybe 2,000 digits, which is the case in practice. Note that RSA, which relies on large prime numbers with hundreds of digits, also requires high precision computing.
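To make this concrete, here is a minimal Python sketch of the greedy digit computation and of the reconstruction formula above. It uses standard double-precision floats, so (as noted) it is only reliable for the first 45 digits or so; beyond that you would switch to high-precision arithmetic (for example Python's decimal module).

```python
def digits_in_base(x, b, n_digits):
    """Greedy algorithm: first n_digits digits of x (0 < x < 1) in base b.
    For 1 < b <= 2 every digit a(n) is 0 or 1."""
    digits = []
    for _ in range(n_digits):
        a = int(b * x)        # a(n) = INT(b * x(n))
        digits.append(a)
        x = b * x - a         # x(n+1) = b * x(n) - INT(b * x(n))
    return digits

def number_from_digits(digits, b):
    """Reconstruction: x = a(1)/b + a(2)/b^2 + a(3)/b^3 + ..."""
    return sum(a / b ** n for n, a in enumerate(digits, start=1))
```

For example, digits_in_base(0.7, 1.7, 20) returns the first 20 digits of 0.7 in base 1.7, and number_from_digits applied to that list gives back approximately 0.7.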

Finally, note that two different numbers have two different representations in the same base, an advantage over hash functions (which are subject to collisions).

2. Application to cryptography

You have a message x to be transmitted, for example the binary (base-2) representation of your original text message. You encode x in a base b, with b chosen between 1.5 and 1.9. The base b can even be a transcendental number.

Your encoded message simply consists of the digits of x in base b, and these digits can be shared publicly. What is kept secret is the base b, so that if an attacker knows the digits, he still can't retrieve the message as he does not know the base. The base is the equivalent of a key in standard cryptographic systems. 

If you know both the digits and the base, you can easily reconstruct the original message, though it will require a bit of high precision computing. As always, you split the original message (if it is too long) into blocks of (say) 512 bytes, and encode each block separately.
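As an illustration of this encode/decode round trip, here is a toy sketch reusing the hypothetical digits_in_base and number_from_digits functions from the sketch above. The block size, the oversampling factor and the final rounding step are my own choices for the example, not part of the scheme as specified; a real implementation would work block by block with high-precision arithmetic.

```python
# Toy example: the secret base plays the role of the key,
# and the base-b digits of the packed block are the "ciphertext".
secret_base = 1.7320508        # any base between 1.5 and 1.9 (kept secret)

bits = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]        # one small plaintext block (base-2 digits)
x = number_from_digits(bits, 2.0)             # pack the block into a number in (0, 1)

# Oversample: emit roughly 4x as many base-b digits as base-2 digits,
# comfortably more than the ~2x suggested by the entropy argument below.
ciphertext = digits_in_base(x, secret_base, 4 * len(bits))

# Receiver side: knowing secret_base, recover x and round away the
# truncation error before reading the bits back.
x_recovered = number_from_digits(ciphertext, secret_base)
m = round(x_recovered * 2 ** len(bits))
recovered_bits = [int(c) for c in format(m, "0{}b".format(len(bits)))]
assert recovered_bits == bits
```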

Challenges

As in all cryptographic systems, there are some challenges to overcome, to make it more robust against attacks.

One of the challenges here is not related to security, but to how many digits (in base b) you need to use to be able to reconstruct the full message. Based on entropy theory, using twice as many digits as in the original base-2 version of the message is likely to be enough, if b is larger than 1.5. However, this has to be tested. Keep in mind that we encode small blocks one at a time, each block having up to 2,000 base-2 digits, though smaller blocks offer some advantages.

The other challenges are about the security of the system. Can an attacker try a large number of test bases to guess which base is used? Bases that are very close to each other will produce identical digits, at least for the first few digits. It might then be possible to use a dichotomic search to identify the secret base. Another issue is the distribution of 0's and 1's, as well as the auto-correlation structure of the digits, which could allow an attacker to identify the secret base by performing some statistical analysis. In the next subsection, we address this issue.

Improving security

As discussed earlier, using a single base b produces a weak cryptographic system. Yet it still requires advanced statistical knowledge -- not tools available from the dark net -- to break it. And being a new system, it would probably take years before someone mounts a successful attack. Indeed, agencies such as the NSA might like it, because it appears at first glance to be safe enough to use in commercial applications, yet it gives the government a natural back door to decode messages.

The first idea that comes to mind to make this system stronger is to use a mapping of x(n) to scramble the distribution of 0's and 1's: for instance, using x'(n) = SQRT(1 - x(n)) instead of x(n). You must also scramble the auto-correlation structure in the digits. The legit recipient of the message would have to know the base used for encoding, as well as the scrambling mechanism, to retrieve the original message. However, it might be next to impossible to retrieve the original message after scrambling, even if you know all the scrambling parameters.

A much easier solution consists of using multiple bases, say two bases b and b'. Digits in even position come from base b, while digits in odd position come from base b'. The legit recipient must know both b and b' to decode the message. More sophisticated versions of this trick can be implemented to increase security. For instance, if a digit is equal to 0, the next digit is in the alternate base, otherwise it is in the same base as the current digit: the digits from both bases are then interlaced in a more complex way.
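For concreteness, here is one possible reading of the two-base trick in Python, alternating the base used at each step of the greedy algorithm. This is my own formalization for illustration, not necessarily the exact scheme the author has in mind; the reconstruction generalizes the single-base formula to x = a(1)/b(1) + a(2)/(b(1)b(2)) + a(3)/(b(1)b(2)b(3)) + ...

```python
def digits_alternating_bases(x, b, b_prime, n_digits):
    """Greedy algorithm with an alternating base: odd positions use b_prime,
    even positions use b (one possible interlacing of the two bases)."""
    digits, bases = [], []
    for n in range(1, n_digits + 1):
        base = b_prime if n % 2 == 1 else b
        a = int(base * x)
        digits.append(a)
        bases.append(base)
        x = base * x - a
    return digits, bases

def number_from_alternating_bases(digits, bases):
    """Reconstruction: x = sum over n of a(n) / (b(1) * b(2) * ... * b(n))."""
    x, prod = 0.0, 1.0
    for a, base in zip(digits, bases):
        prod *= base
        x += a / prod
    return x
```

A legitimate recipient who knows both bases rebuilds the same bases sequence and decodes; an attacker who guesses only one of the two bases cannot.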

Final note

Readers interested in this article can write and submit a patent about this technology. This content is offered as open intellectual property, and can be used in any application, commercial or not. Statisticians could be interested in running simulations to see how easy or difficult it is to break this system, especially by analyzing the digit distribution to identify secret bases.

Originally posted here. For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn.


What are the differences between data science, data mining, machine learning, statistics, operations research, and so on?

Here I compare several analytic disciplines that overlap, to explain the differences and common denominators. Sometimes differences exist for no reason other than historical ones. Sometimes the differences are real and subtle. I also provide typical job titles, types of analyses, and industries traditionally attached to each discipline. Underlined domains are main sub-domains. It would be great if someone could add a historical perspective to my article.


Data Science

First, let's start by describing data science, the new discipline. 

Job titles include data scientist, chief scientist, senior analyst, director of analytics and many more. It covers all industries and fields, but especially digital analytics, search technology, marketing, fraud detection, astronomy, energy, healthcare, social networks, finance, forensics, security (NSA), mobile, telecommunications, and weather forecasting.

Projects include taxonomy creation (text mining, big data), clustering applied to big data sets, recommendation engines, simulations, rule systems for statistical scoring engines, root cause analysis, automated bidding, forensics, exo-planet detection, and early detection of terrorist activity or pandemics. An important component of data science is automation, machine-to-machine communication, and algorithms running non-stop in production mode (sometimes in real time), for instance to detect fraud, predict weather, or predict home prices for each home (Zillow).

An example of data science project is the creation of the fastest growing data science Twitter profile, for computational marketing. It leverages big data, and is part of a viral marketing / growth hacking strategy that also includes automated high quality, relevant, syndicated content generation (in short, digital publishing version 3.0).

Unlike most other analytic professions, data scientists are assumed to have great business acumen and domain expertise -- one of the reasons why they tend to succeed as entrepreneurs. There are many types of data scientists, as data science is a broad discipline. Many senior data scientists master their art/craftsmanship and possess the whole spectrum of skills and knowledge; they really are the unicorns that recruiters can't find. Hiring managers and uninformed executives favor narrow technical skills over combined deep, broad and specialized business domain expertise - a byproduct of the current education system that favors discipline silos, while true data science is a silo destructor. Unicorn data scientists (a misnomer, because they are not rare - some are famous VC's) usually work as consultants, or as executives. Junior data scientists tend to be more specialized in one aspect of data science, possess more hot technical skills (Hadoop, Pig, Cassandra) and will have no problem finding a job if they received appropriate training and/or have work experience with companies such as Facebook, Google, eBay, Apple, Intel, Twitter, Amazon, Zillow etc. Data science projects for potential candidates can be found here.

Data science overlaps with

  • Computer science: computational complexity, Internet topology and graph theory, distributed architectures such as Hadoop, data plumbing (optimization of data flows and in-memory analytics), data compression, computer programming (Python, Perl, R) and processing sensor and streaming data (to design cars that drive automatically)
  • Statistics: design of experiments including multivariate testing, cross-validation, stochastic processes, sampling, model-free confidence intervals, but not p-values nor obscure tests of hypotheses that are subject to the curse of big data 
  • Machine learning and data mining: data science indeed fully encompasses these two domains.
  • Operations research: data science encompasses most of operations research as well as any techniques aimed at optimizing decisions based on analysing data. 
  • Business intelligence: every BI aspect of designing/creating/identifying great metrics and KPI's, creating database schemas (be it NoSQL or not), dashboard design and visuals, and data-driven strategies to optimize decisions and ROI, is data science.

Comparison with other analytic disciplines

  • Machine learning: Very popular computer science discipline, data-intensive, part of data science and closely related to data mining. Machine learning is about designing algorithms (like data mining), but emphasis is on prototyping algorithms for production mode, and designing automated systems (bidding algorithms, ad targeting algorithms) that automatically update themselves, constantly train/retrain/update training sets/cross-validate, and refine or discover new rules (fraud detection) on a daily basis. Python is now a popular language for ML development. Core algorithms include clustering and supervised classification, rule systems, and scoring techniques. A sub-domain, close to Artificial Intelligence (see entry below) is deep learning.

  • Data mining: This discipline is about designing algorithms to extract insights from rather large and potentially unstructured data (text mining), sometimes called nugget discovery, for instance unearthing a massive botnet after looking at 50 million rows of data. Techniques include pattern recognition, feature selection, clustering, and supervised classification, and it encompasses a few statistical techniques (though without the p-values or confidence intervals attached to most statistical methods being used). Instead, emphasis is on robust, data-driven, scalable techniques, without much interest in discovering causes or interpretability. Data mining thus has some intersection with statistics, and it is a subset of data science. Data mining is applied computer engineering, rather than a mathematical science. Data miners use open source tools and software such as Rapid Miner.

  • Predictive modeling: Not a discipline per se. Predictive modeling projects occur in all industries across all disciplines. Predictive modeling applications aim at predicting the future based on past data, usually but not always using statistical modeling. Predictions often come with confidence intervals. The roots of predictive modeling are in statistical science.

  • Statistics. Currently, statistics is mostly about surveys (typically performed with SPSS software), theoretical academic research, bank and insurance analytics (marketing mix optimization, cross-selling, fraud detection, usually with SAS and R), statistical programming, social sciences, global warming research (and space weather modeling), economic research, clinical trials (pharmaceutical industry), medical statistics, epidemiology, biostatistics, and government statistics. Agencies hiring statisticians include the Census Bureau, IRS, CDC, BLS, SEC, and EPA (environmental/spatial statistics). Jobs requiring a security clearance are well paid and relatively secure, but the well-paid jobs in the pharmaceutical industry (the golden goose for statisticians) are threatened by a number of factors - outsourcing, company mergers, and pressure to make healthcare affordable. Because of the big influence of the conservative, risk-averse pharmaceutical industry, statistics has become a narrow field, not adapting to new data and not innovating, losing ground to data science, industrial statistics, operations research, data mining, and machine learning -- where the same clustering, cross-validation and statistical training techniques are used, albeit in a more automated way and on bigger data. Many professionals who were called statisticians 10 years ago have seen their job title changed to data scientist or analyst in the last few years. Modern sub-domains include statistical computing, statistical learning (closer to machine learning), computational statistics (closer to data science), data-driven (model-free) inference, sport statistics, and Bayesian statistics (MCMC, Bayesian networks and hierarchical Bayesian models being popular, modern techniques). Other new techniques include SVM, structural equation modeling, predicting election results, and ensemble models.

  • Industrial statistics. Statistics frequently performed by non-statisticians (engineers with good statistical training), working on engineering projects such as yield optimization or load balancing (system analysts). They use very applied statistics, and their framework is closer to six sigma, quality control and operations research, than to traditional statistics. Also found in oil and manufacturing industries. Techniques used include time series, ANOVA, experimental design, survival analysis, signal processing (filtering, noise removal, deconvolution), spatial models, simulation, Markov chains, risk and reliability models.

  • Mathematical optimization. Solves business optimization problems with techniques such as the simplex algorithm, Fourier transforms (signal processing), differential equations, and software such as Matlab. These applied mathematicians are found in big companies such as IBM, research labs, the NSA (cryptography) and in the finance industry (sometimes recruiting physics or engineering graduates). These professionals sometimes solve the exact same problems as statisticians do, using the exact same techniques, though they use different names. Mathematicians use least squares optimization for interpolation or extrapolation; statisticians use linear regression for predictions and model fitting, but both concepts are identical, and rely on the exact same mathematical machinery: it's just two names describing the same thing. Mathematical optimization is, however, closer to operations research than to statistics; the choice of hiring a mathematician rather than another practitioner (data scientist) is often dictated by historical reasons, especially for organizations such as NSA or IBM.

  • Actuarial sciences. Just a subset of statistics focusing on insurance (car, health, etc.) using survival models: predicting when you will die, and what your health expenditures will be based on your health status (smoker, gender, previous diseases), to determine your insurance premiums. Also predicts extreme floods and weather events to determine premiums. These latter models have recently been notoriously erroneous and have resulted in far bigger payouts than expected. For some reason, this is a very vibrant, secretive community of statisticians who no longer call themselves statisticians (the job title is actuary). They have seen their average salary increase nicely over time: access to the profession is restricted and regulated just like for lawyers, for no other reason than protectionism to boost salaries and reduce the number of qualified applicants to job openings. Actuarial sciences is indeed data science (a sub-domain).

  • HPC. High performance computing is not a discipline per se, but it should be of concern to data scientists, big data practitioners, computer scientists and mathematicians, as it can redefine the computing paradigms in these fields. If quantum computing ever becomes successful, it will totally change the way algorithms are designed and implemented. HPC should not be confused with Hadoop and Map-Reduce: HPC is hardware-related, while Hadoop is software-related (though heavily reliant on Internet bandwidth and on server configuration and proximity).

  • Operations research. Abbreviated as OR. It separated from statistics a while back (around 20 years ago), but the two are like twin brothers, and their respective organizations (INFORMS and ASA) partner together. OR is about decision science and optimizing traditional business projects: inventory management, supply chain, pricing. Practitioners heavily use Markov chain models, Monte Carlo simulations, queuing and graph theory, and software such as AIMS, Matlab or Informatica. Big, traditional old companies use OR; new and small ones (start-ups) use data science to handle pricing, inventory management or supply chain problems. Many operations research analysts are becoming data scientists, as there is far more innovation and thus growth prospect in data science, compared to OR. Also, OR problems can be solved by data science. OR has a significant overlap with six sigma (see below), also solves econometric problems, and has many practitioners/applications in the army and defense sectors. Car traffic optimization is a modern example of an OR problem, solved with simulations, commuter surveys, sensor data and statistical modeling.

  • Six sigma. It's more a way of thinking (a business philosophy, if not a cult) than a discipline, and was heavily promoted by Motorola and GE a few decades ago. It is used for quality control and to optimize engineering processes (see the entry on industrial statistics in this article), by large, traditional companies. They have a LinkedIn group with 270,000 members, twice as large as any other analytic LinkedIn group, including our data science group. Their motto is simple: focus your efforts on the 20% of your time that yields 80% of the value. Applied, simple statistics are used (simple stuff works most of the time, I agree), and the idea is to eliminate sources of variance in business processes, to make them more predictable and improve quality. Many people consider six sigma to be old stuff that will disappear. Perhaps, but the fundamental concepts are solid and will remain: these are also fundamental concepts for all data scientists. You could say that six sigma is a much simpler, if not simplistic, version of operations research (see the entry above), where statistical modeling is kept to a minimum. The risk: when unqualified people use non-robust black-box statistical tools to solve problems, it can result in disasters. In some ways, six sigma is a discipline more suited for business analysts (see the business intelligence entry below) than for serious statisticians.

  • Quant. Quant people are just data scientists working for Wall Street on problems such as high frequency trading or stock market arbitrage. They use C++ and Matlab, come from prestigious universities, and earn big bucks but lose their jobs right away when ROI goes south too quickly. They can also be employed in energy trading. Many who were fired during the great recession now work on problems such as click arbitrage, ad optimization and keyword bidding. Quants have backgrounds in statistics (a few of them), mathematical optimization, and industrial statistics.

  • Artificial intelligence. It's coming back. The intersection with data science is pattern recognition (image analysis) and the design of automated (some would say intelligent) systems to perform various tasks, in machine-to-machine communication mode, such as identifying the right keywords (and the right bid) on Google AdWords (pay-per-click campaigns involving millions of keywords per day). I also consider smart search (creating a search engine that returns the results you expect and is much broader than Google) one of the greatest problems in data science, arguably also an AI and machine learning problem. An old AI technique is neural networks, but they are now losing popularity. By contrast, neuroscience is gaining popularity.

  • Econometrics. Why it became separated from statistics is unclear. So many branches disconnected themselves from statistics as they became less generic and started developing their own ad-hoc tools. But in short, econometrics is heavily statistical in nature, using time series models such as auto-regressive processes. It also overlaps with operations research (itself overlapping with statistics!) and mathematical optimization (simplex algorithm). Econometricians like ROC and efficiency curves (so do six sigma practitioners, see the corresponding entry in this article). Many do not have a strong statistical background, and Excel is their main or only tool.

  • Data engineering. Performed by software engineers (developers) or architects (designers) in large organizations (sometimes by data scientists in tiny companies), this is the applied part of computer science (see the entry in this article): powering systems that allow all sorts of data to be easily processed in-memory or near-memory, and to flow nicely to (and between) end-users, including heavy data consumers such as data scientists. A sub-domain currently under attack is data warehousing, as this term is associated with static, siloed, conventional databases, data architectures, and data flows, threatened by the rise of NoSQL, NewSQL and graph databases. Transforming these old architectures into new ones (only when needed), or making them compatible with new ones, is a lucrative business.

  • Business intelligence. Abbreviated as BI. Focuses on dashboard creation, metric selection, producing and scheduling data reports (statistical summaries) sent by email or delivered/presented to executives, competitive intelligence (analyzing third-party data), as well as involvement in database schema design (working with data architects) to collect useful, actionable business data efficiently. The typical job title is business analyst, but some are more involved with marketing, product or finance (forecasting sales and revenue). They typically have an MBA degree. Some have learned advanced statistics such as time series, but most only use (and need) basic stats and light analytics, relying on IT to maintain databases and harvest data. They use tools such as Excel (including cubes and pivot tables, but not advanced analytics), Brio (Oracle browser client), Birt, MicroStrategy or Business Objects (as end-users to run queries), though some of these tools are increasingly equipped with better analytic capabilities. Unless they learn how to code, they are competing with polyvalent data scientists who excel in decision science, insights extraction and presentation (visualization), KPI design, business consulting, and ROI/yield/business/process optimization. BI and market research (but not competitive intelligence) are currently experiencing a decline, while AI is experiencing a come-back. This could be cyclical. Part of the decline is due to not adapting to new types of data (e.g. unstructured text) that require engineering or data science techniques to process and extract value.

  • Data analysis. This has been the term for business statistics since at least 1995, and it covers a large spectrum of applications including fraud detection, advertising mix modeling, attribution modeling, sales forecasts, cross-selling optimization (retail), user segmentation, churn analysis, computing the lifetime value of a customer and the cost of acquisition, and so on. Except in big companies, data analyst is a junior role; these practitioners have much narrower knowledge and experience than data scientists, and they lack (and don't need) business vision. They are detail-oriented and report to managers such as data scientists or directors of analytics. In big companies, someone with a job title such as data analyst III might be very senior, yet they are usually specialized and lack the broad knowledge gained by data scientists working in a variety of companies large and small.

  • Business analytics. Same as data analysis, but restricted to business problems only. It tends to have a bit more of a financial, marketing or ROI flavor. Popular job titles include data analyst and data scientist, but not business analyst (see the business intelligence entry above; that is a different domain).

Finally, there are more specialized analytic disciplines that recently emerged: health analytics, computational chemistry and bioinformatics  (genome research), for instance.

This blog post was originally posted here.


Guest blog post by Vincent Granville

This could be a new startup idea: creating Excel-compatible software that works just like Excel but can handle bigger datasets, much faster.

Like most data scientists, I've been using Excel a lot in my career, and it definitely has some very powerful features. Probably the greatest one is its ability to help design, test, and update models in real time: just change the value of a few core parameters, and all your cells (thousands of them) and all your charts get updated at once. If you are not familiar with this, download our most recent spreadsheet to see how it works.

Other nice features include Excel's ability to connect to databases or to the Internet (e.g. to the Bing API), extract useful information, and summarize it in cubes and pivot tables. Cubes and pivot tables have a very strong feel of an old-fashioned, SQL relational database environment, but they are still useful in many contexts. I am not sure whether Excel can easily retrieve data via the Internet from non-Microsoft APIs, e.g. Google APIs. It should.

Yet the main drawback is Excel's slowness. It is slow in ways that are unexpected. If you sort 500,000 observations (one column), it's actually quite fast. But let's say that you simultaneously sort two columns, A and B, where B is the log of the A values. So B contains a formula in each row. This dramatically slows down the sort. It is much faster to sort A alone and leave B as is. The formulas in B will correctly update all the values, very fast, and automatically.

As a general rule, any operation that involves re-computing (say) 200,000+ rows across multiple cells linked via dependence relationships, is done very slowly if:

  • one of the columns includes functions such as VLOOKUP at each cell 
  • SORT or other super-linear processes are required (by super-linear, I mean processes that are O(n log n) or worse)

There's a very easy, efficient (but ugly) way around this, and I'm wondering why it's not built into Excel and made transparent to the user. For instance:

  • Replace VLOOKUP formulas by hard-coded values, perform the update on hard-coded values, then put back the formulas 
  • Perform SORT only on the minimum number of columns where necessary, then update cells in columns involving formulas

Also, I don't know why VLOOKUP is so slow in the first place. I do this kind of lookup all the time in Perl, joining a (say) 3,000,000-row dataset with a 200,000-row lookup table, and it's very fast. In Excel, the dataset and the lookup table would be stored in two separate worksheets within the same spreadsheet, and it would take days if it were possible to perform this "join" (it isn't). Maybe Excel indirectly performs a full join, thus dramatically slowing down the operation? I almost never do a full join, and people almost never need one, on large datasets. This is an area where significant improvements could be made.
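To illustrate why such a lookup join can be fast outside Excel, here is a minimal sketch in plain Python of a hash-based lookup join; the file names and the one-key-one-value layout are hypothetical, and a real script (like the Perl one mentioned above) would handle quoting, multiple columns, and missing keys more carefully.

```python
# Build a dictionary on the small lookup table once (one pass over 200,000 rows),
# then stream through the big dataset; each lookup is O(1) on average,
# so the whole join is roughly one pass over each file.
lookup = {}
with open("lookup_table.csv") as f:               # hypothetical file name
    for line in f:
        key, value = line.rstrip("\n").split(",", 1)
        lookup[key] = value

with open("big_dataset.csv") as f_in, open("joined.csv", "w") as f_out:
    for line in f_in:
        key, rest = line.rstrip("\n").split(",", 1)
        f_out.write("{},{},{}\n".format(key, rest, lookup.get(key, "")))
```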

Finally, I am not sure whether Excel leverages the cloud, but this would be a great way to further speed up computations and process data sets far bigger than 1 million rows. Microsoft should allow the user to export the data to some (Microsoft?) cloud in a transparent way, in one click (e.g. just click on "export to cloud"), then let the user simulate the time-consuming operations on the local Excel version on her desktop (this amounts to using your local Excel spreadsheet as a virtual spreadsheet), and when done, click on "retrieve from cloud" to get your spreadsheet updated. The "retrieve from cloud" step would:

  1. Send your Excel formulas to the cloud via the Internet
  2. Apply your formula to your cloud version of your data, leveraging Map Reduce as needed
  3. Get the processed data back to your local spreadsheet on your desktop

Another painfully slow process is when you need to apply a formula to a whole column with 500,000 cells. Fortunately, there is a trick. Let's say you want to store the product of A and B in column C.

  • First, select the whole column C.
  • Enter the formula =A1*B1.
  • Press the Ctrl and Enter keys together.

I wish it were easier than that, something like ==A1*B1 (a formula with a double equal sign to indicate that it applies to the entire column, not just one cell). This is another example of how Excel is not user-friendly. Many times, there is some obscure way to do something efficiently. We'll see another example in my next spreadsheet, which will teach you how to write a formula that returns multiple cells - in particular with LINEST, which returns the regression coefficients associated with a linear regression. Yes, you can do it in basic Excel, without an add-in!

For those interested in creating a column with 1,000,000 values (e.g. to test the speed of some Excel computations), here's how to proceed - this would indeed be a good job interview question:

  • Start with 200 cells.
  • Duplicate these cells, now you have 400.
  • Duplicate these cells, now you have 800.

Another 10 iterations of this process, and you'll be at 800,000 cells. It's much faster than the naive approach. And if your initial 200 numbers consist of the formula =RAND(), at the end you'll end up with one million pseudo-random numbers, though of poor quality.
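For what it's worth, the same doubling trick is easy to check in a few lines of Python (a sketch, with RAND() stood in for by random.random()): 12 doublings of 200 starting cells give 819,200 values, roughly the 800,000 mentioned above.

```python
import random

cells = [random.random() for _ in range(200)]     # the initial 200 =RAND()-style values
for _ in range(12):                               # duplicate: 200 -> 400 -> 800 -> ...
    cells = cells + cells                         # each duplication doubles the column
print(len(cells))                                 # 200 * 2**12 = 819,200
```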

Finally, I use Perl, Python, R, C, Sort, Grep and other languages to do data summarization, before feeding stuff to Excel. But if Excel came with the features I just discussed, much more could be done within Excel. Users would spend more time in Excel.

And one weird question to finish: why does Microsoft not show ads in Excel? They would get advertising dollars that could be used to improve Excel. They could then ship a version of Excel that transparently uses the cloud, a version that even non-technical users could use, with all the benefits that I mentioned.



In-place Computing Model: for Big and Complex Data

Guest blog post by Yuanjen Chen

As we've seen how in-place and in-memory computing work differently, today we are sharing more of the fundamentals of the in-place computing model. This model was designed to handle "Big and Complex Data" - not just size, but above all complexity. Many analytic cases today incorporate multiple relations in the data; for instance, to solve a data mining case for an online retailer, we may need to analyze both product attributes (categories, colors, materials, ...) and customer attributes (age, gender, region, ...), and even more. Such in-depth analysis depends heavily on the relational model, which a relational database can handle; the data sizes we encounter today, however, may be too heavy for traditional database technology. A NoSQL database scales, but does not fit the relational model.

The in-place computing model aims to fill the gap between these two systems: it supports an extended relational model while maintaining performance as well as scalability. It is unconventional in two ways. First, it moves away from data retrieval to a data-centric model, in the sense that computations take place where the data resides. Second, it organizes data objects into macro data structures that work the way macromolecules do in living cells. Together, these two principles help organize big data complexity and contribute to a substantial performance improvement: 2 to 3 orders of magnitude compared to existing in-memory databases.

For more details and technical insights, please visit our document here. For a free trial or more information, go here.


Guest blog post by Gil Allouche

Like “Big Data”, the term “Cloud Computing” has given rise to a number of misconceptions concerning what it is and what it does. While many of these delusions have been sufficiently debunked, a number of mistaken beliefs about the cloud have managed to keep hanging around---attaining mythical status despite being laid bare on more than a few blogs designed to dismiss them once and for all. And so, culled from the blogosphere for your consideration comes this non-biased compilation of the top 5 cloud computing myths…that just won’t die.

Myth #1: There’s only one true cloud

This myth---the notion that there is only one cloud: the public cloud---has made the rounds on a large number of cloud myth busting blogs. Of the various IT experts who weighed in on the subject, the general consensus was that, although there is greater awareness of the public cloud, the private cloud is not only a reality, it is quickly gaining headway in the corporate landscape. Offered as evidence to refute the “one true cloud” mythology, reference was made to a 2012 Gartner Data Center Conference poll, wherein 9 out of 10 respondents stated that they were either in the planning stage, the implementation stage, or that their organizations already had a private cloud up and running.

Myth #2: Cloud use is all or none

Another myth that seems to be widespread among enterprise decision-makers is that cloud use is an all-or-nothing proposition. You’re either in or you’re out, with no options in-between. In dispelling this major misconception, a number of articles discussed the reality and practicality of a “hybrid” cloud infrastructure. Through the integration of public and private cloud applications, the hybrid cloud allows enterprises to customize and scale cloud use according to their specific needs.

Myth #3: Moving to the cloud is complicated

The idea that transitioning to a cloud database is a long and laborious process is a common misconception that is holding many companies back. While the thought of shifting from a traditional network to a cloud-based infrastructure can be daunting for any enterprise, the transition is typically easier and faster than expected. A strong case in point is found within the Federal Government. Recently the Department of the Interior (DOI) contracted cloud services with IQ Business Group to utilize its SaaS platform to capture, classify and store a whopping 75 million emails per month. The time it took IQ Business Group to get the DOI’s Enterprise Records and Document Management System up and running? A mere 45 days. Contracting with the right cloud services provider is the key to a quick and smooth transition.

Myth #4: Cloud computing is too expensive---no wait---it’s cheap

Both of these misperceptions regarding the costs of cloud computing are prevalent on the web, along with seemingly sound arguments supporting each viewpoint. In weighing both arguments, the truth lies somewhere in the middle. Whether cloud computing is costly or inexpensive depends upon how enterprises go about it. For example, public clouds with pay-per-use features seem to be very economical for applications that are short-lived or those that have highly variable capacity requirements. However, for applications with long lifespans that have fairly constant capacity needs, fixed monthly or yearly costs appear more economical. Hybrid clouds could also be more affordable, depending on the needs of the enterprise. Although a switch to the cloud will necessitate up front costs, savings down the line should offset those costs. A nearly unanimous opinion on the subject of cloud costs was that cost-savings should never be the primary motivator for going to the cloud.

Myth #5: The cloud is not secure

Even staunch supporters of a cloud-based infrastructure admitted that the idea of storing and processing sensitive data off-site warrants a discussion of cloud safety. The private cloud is thought to be more secure than the public cloud, as the former is sequestered behind the firewall of the enterprise. However, the actual level of private cloud security is dependent upon the security resources and practices of the corporate data center. Enterprise-grade public clouds have seriously upped security by employing cloud security experts, staying fully compliant with regulatory and industry standards, conducting regular third-party security audits and conducting automatic hardware and software updates. Industry experts caution that enterprises need to understand their cloud provider's security practices in order to assess potential threats to security.

Myths, by definition, seem to have a life of their own, and the workings of the web can keep myths circulating indefinitely. Therefore, it’s essential for any enterprise considering cloud technology to practice due diligence, rather than being swayed one way or another by the myths and misconceptions perpetuated in the blogosphere.


Where The Cloud Meets The Grid

Guest blog post by Peter Higdon

Companies build or rent grid machines when their data doesn't fit into HDFS, or when the latency of parallel interconnects in the cloud is too high. This review explores the overlap of the two paradigms at the ends of the parallel processing latency spectrum. The comparison is almost poetic and leads to many other comparisons in languages, interfaces, formats, and hardware, but there is amazingly little overlap.

Your Laptop Is A Supercomputer

To put things in perspective, 60 years ago, "computer" was a job title. When the Wu-Tang Clan dropped 36 Chambers, the bottom-ranking machine in the TOP500 was a quad-core Cray. Armed with your current machine, you should be able to dip your toes into any project before diving in head first. Take a small slice of the data to get a glimpse of the obstacles ahead. Start with 1/10th, 1/8th, 1/4th... until your machine can't handle it anymore. Usually by that time, your project will have encountered problems that can't be fixed simply by getting a bigger computer.

DIY HPC

Depending on the kind of problem you are solving, building your own Beowulf cluster out of old commodity hardware might be the way to go. If you need a constant run of physics simulations or BLAST alignments, a load of wholesale off-lease laptops should get the job done for under $2,000.

http://www.comsol.com/blogs/building-beowulf-cluster-faster-multiphysics-simulations/

Some Raspberry Pi enthusiasts have built a 32-node cluster, but it has limited use cases, given the limitations of the RPi's ARM processor.

http://www.zdnet.com/build-your-own-supercomputer-out-of-raspberry-pi-boards-7000015831/


Password hashing and Bitcoin farms use ASICs and FPGAs. In these cases, the latency of interconnects is much less important than single-thread processing.

Move To The Cloud

You don't need to go through the hassle of wiring and configuring a cluster for a short-term project. The hourly cost savings of running your own servers quickly diminish as you struggle through the details of MIS: DevOps, provisioning, deployment, hardware failure, etc. Small development shops and big enterprises like Netflix are happy to pay premiums for a managed solution. We have a staggering variety of SLAs available today as service providers compete to capture new markets.

Cloud Bursting

When your cluster can't quite handle the demand of your process, rent a few servers from the cloud to handle the over-flow.
http://archives.opennebula.org/documentation:rel4.4:introh

Cloud Bridging

Use your cluster to handle sensitive private data, and shift non-critical data to a public cloud.
http://www.citrix.com/products/cloudbridge/tech-info.html

GPU Hosting

Companies like EMC use graphics cards in cloud clusters to handle vector arithmetic. It works great for a specific sub-set of business solutions that use SVMs and other kernel methods.

Vectorization

Vectorization is at the heart of optimizing parallel processes. Understanding how your code uses low-level libraries will help you write faster code. De-vectorized R code is a well-known performance killer.

Julia: The convergence of Big Data and HPC

"Julia makes it easy to connect to a bunch of machines—collocated or not, physical or virtual—and start doing distributed computing without any hassle. You can add and remove machines in the middle of jobs, and Julia knows how to serialize and deserialize your data without you having to tell it. References to data that lives on another machine are a first-class citizen in Julia like functions are first-class in functional languages. This is not the traditional HPC model for parallel computing but it isn’t Hadoop either. It’s somewhere in between. We believe that the traditional HPC and “Big Data” worlds are converging, and we’re aiming Julia right at that convergence point." -Julia development team.

Julia is designed to handle the vectorization for you, making de-vectorized code run faster than vectorized code.

New Compile Targets via LLVM

Scripting languages are built on top of low-level libraries like BLAS, so that under the hood, you are actually running FORTRAN.

Python can be efficient because libraries like NumPy have optimized how they use underlying libraries.

http://www.slideshare.net/teoliphant/numba-siam-2013
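As a small illustration of the vectorization point (shown here in Python/NumPy rather than R), the two versions below compute the same sum of squares; the explicit Python loop runs one scalar operation at a time, while the vectorized call pushes the whole loop down into NumPy's compiled routines and is typically orders of magnitude faster.

```python
import numpy as np

x = np.random.rand(10_000_000)

def devectorized_sum_of_squares(values):
    # Explicit Python-level loop: interpreted, one element at a time.
    total = 0.0
    for v in values:
        total += v * v
    return total

def vectorized_sum_of_squares(values):
    # The multiply and the reduction happen inside NumPy's compiled code.
    return float(np.dot(values, values))

# Both return (essentially) the same number; time them to see the gap.
```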

LLVM is acting as the middle-man between the scripting languages and machine code.

Asm.js runs C/C++ code in the browser by converting LLVM-generated bytecode into a subset of JavaScript with surprising efficiency.

OpenCL and Heterogeneous Computing

AMD has bet their future on the convergence of the CPU and GPU with their heterogeneous system architecture (HSA) and OpenCL. Most Data Scientists will never write such low-level code, but it is worth noting in this review.


Guest blog post by Khosrow Hassibi

Technical Title: High-Performance Data Mining and Big Data Analytics

Business Title: The Story of Insight From Big Data

Book Site: http://bigdataminingbook.info

Table of Contents: Link | PDF

Orders (not yet available in digital edition): Amazon | CreateSpace

Target Audience: This book is intended for a variety of audiences:

(1) There are many people in the technology, science, and business disciplines who are curious to learn about big data analytics in a broad sense, combined with some historical perspective. They may intend to enter the big data market and play a role. For this group, the book provides an overview of many relevant topics. College and high school students who have an interest in science and math, and are contemplating what to pursue as a career, will also find the book helpful.

(2) For the executives, business managers, and sales staff who also have an interest in technology, believe in the importance of analytics, and want to understand big data analytics beyond the buzzwords, this book provides a good overview and a deeper introduction of the relevant topics.
(3) Those in classic organizations—at any vertical and level—who either manage or consume data find this book helpful in grasping the important topics in big data analytics and its potential impact in their organizations.
(4) Those in IT benefit from this book by learning about the challenges of the data consumers: data miners/scientists, data analysts, and other business users. Often the perspectives of IT and analytics users are different on how data is to be managed and consumed. 
(5) Business analysts can learn about the different big data technologies and how it may impact what they do today.
(6) Statisticians typically use a narrow set of statistical tools and usually work on a narrow set of business problems depending on their industry. This book points to many other frontiers in which statisticians can continue to play important roles.
(7) Since the main focus of the book is high-performance data mining and contrasting it with big data analytics in terms of commonalities and differences, data miners and machine learning practitioners gain a holistic view of how the two relate.
(8) Those interested in data science gain from the historical viewpoint of the book, since the practice of data science—as opposed to the name itself—has existed for a long time. The big data revolution has significantly helped create awareness about analytics and increased the need for data science professionals.

Intro: The use of machine learning and data mining to create value from corporate or public data is nothing new. It is not the first time that these technologies are in the spotlight. Many remember the late '80s and the early '90s, when machine learning techniques - in particular neural networks - had become very popular. Data mining was on the rise. There were talks everywhere about advanced analysis of data for decision making. Even the popular android character in "Star Trek: The Next Generation" had been appropriately named "Data." Data mining science has been the cornerstone of many data products and applications for more than two decades, e.g., in finance and retail. Credit scores have been used for decades to assess the creditworthiness of people applying for credit or a loan. Sophisticated real-time fraud scores based on an individual's transaction spending patterns have been used since the early '90s to protect credit cardholders from a variety of fraud schemes. However, the popularity of web products from the likes of Google, LinkedIn, Amazon, and Facebook has helped analytics become a household name. While a decade ago the masses did not know how their detailed data were being used by corporations for decision making, today they are fully aware of that fact. Many people, especially the millennial generation, voluntarily provide detailed information about themselves. Today people know that any mouse click they generate, any comment they write, any transaction they perform, and any location they go to may be captured and analyzed for some business purpose.

Every new technology comes with lots of hype and many new buzzwords. Often, fact and fiction get mixed up, making it impossible for outsiders to assess the technology's true relevance. I wrote this book to provide an objective view of analytics trends today. I have written it in complete independence, and solely as a personal passion. As a result, the views expressed in this book are those of the author and do not necessarily represent the views of, and should not be attributed to, any vendor or employer.

Due to the exponential growth of data, today there is an ever increasing need to process and analyze big data. High-performance computing architectures have been devised to address the need for handling big data, not only from a transaction processing standpoint but also from a tactical and strategic analytics viewpoint. The success of big data analytics in large web companies has created a rush toward understanding the impact of new big data technologies in classic analytics environments that already employ a multitude of legacy analytics technologies. There is a wide variety of readings about big data, high-performance computing for analytics, massively parallel processing (MPP) databases, Hadoop and its ecosystem, algorithms for big data, in-memory databases, implementation of machine learning algorithms for big data platforms, and big data analytics. However, none of these readings provides an overview of these topics in a single document. The objective of this book is to provide a historical and comprehensive view of the recent trend toward high-performance computing technologies, especially as it relates to big data analytics and high-performance data mining. The book also emphasizes the impact of big data on requiring a rethinking of every aspect of the analytics life cycle, from data management, to data mining and analysis, to deployment.

As a result of interactions with different stakeholders in classic organizations, I realized there was a need for a more holistic view of big data analytics' impact across classic organizations, and also the impact of high-performance computing techniques on legacy data mining. Whether you are an executive, manager, data scientist, analyst, sales or IT staff, the holistic and broad overview provided in the book will help in grasping the important topics in big data analytics and its potential impact in your organizations.


Guest blog post by Jessica May

The computing power of SQL for mass structured data is complete; that is to say, it is impossible to find anything that SQL cannot compute. But SQL operates at too low a level, which leads to over-elaborate code in practical applications.

This over-elaboration shows up in the following four aspects:

  • Computation without sub-steps: SQL requires a computation to be written out in one statement; implementing it step by step requires stored procedures. The lack of sub-steps not only makes the computation harder to think through, but also makes it difficult to reuse intermediate results.
  • Sets are unordered: SQL does not directly provide a mechanism for referring to set members by position, so conversions are needed to implement computations that involve order and position.
  • Set orientation is incomplete: SQL's notion of a set is rudimentary; it is only used to represent a query result set and cannot be explicitly used as a basic data type.
  • It lacks object references: SQL does not support references to records; associations between tables rely on equivalent foreign keys, and multi-table computations require join operations. This is not only hard to understand but also inefficient.

 

Implementing a data computation on a given computing system is really the process of translating a business problem into that system's formal syntax (much as a primary school student solves a word problem by translating it into formal arithmetic operations). Because of the four problems above, SQL's model is inconsistent with people's natural way of thinking when the computation is complex. This creates a great barrier in translating problems: formalizing the problem-solving method into SQL is often much harder than finding the solution itself.

The following examples illustrate each of the four problems in turn.

To keep the statements in the examples as simple as possible, a large number of SQL2003-standard window functions are used. We therefore adopt ORACLE syntax, which supports the SQL2003 standard relatively well; writing these queries in other databases' syntax would generally be more complex.

  • Computation without sub-steps

Carrying out a complex computation step by step can greatly reduce the difficulty of the problem; conversely, collapsing a multi-step computation into a single statement increases its complexity.

Task 1: Count the employees of the sales department; among them, count those whose native place is NY; and among those, count the female employees.


Conventional thinking: select the employees of the sales department and count them; from that result, find those whose native place is NY and count them; then find the female employees among those and count them. Each query builds on the previous result, so it is not only simple to write but also more efficient.

But SQL cannot carry out the computation in steps; it is impossible to reuse the preceding result to answer the next question, and the only option is to copy the query condition once more.
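As a point of comparison, here is the step-by-step approach sketched in plain Python (the records and field names are hypothetical); each step simply reuses the previous result, which is exactly what a single SQL statement cannot do without repeating the conditions.

```python
employees = [  # hypothetical records
    {"name": "Ann",   "dept": "sales", "birthplace": "NY", "gender": "F"},
    {"name": "Bob",   "dept": "sales", "birthplace": "CA", "gender": "M"},
    {"name": "Carol", "dept": "sales", "birthplace": "NY", "gender": "F"},
    {"name": "Dave",  "dept": "IT",    "birthplace": "NY", "gender": "M"},
]

sales = [e for e in employees if e["dept"] == "sales"]            # step 1: sales department
sales_ny = [e for e in sales if e["birthplace"] == "NY"]          # step 2: reuses step 1
sales_ny_female = [e for e in sales_ny if e["gender"] == "F"]     # step 3: reuses step 2

print(len(sales), len(sales_ny), len(sales_ny_female))            # 3 2 2
```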

 

Task 2: Each department selects one male and one female employee to form a game team.

The lack of sub-steps sometimes not only makes the code troublesome to write and inefficient to run, but can even seriously distort the train of thought.

The intuitive approach to this task: loop over each department; if the department has both male and female employees, select one male and one female employee and add them to the result set. But SQL does not support building up a result set step by step like this (to implement such a scheme, a stored procedure is needed). The train of thought must instead become: select a male employee from each department, select a female employee from each department, keep from each of the two result sets the members whose department also appears in the other result set, and finally take the union of the two sets.

Fortunately, the WITH clause and window functions with OVER (supported since the SQL:2003 standard) are available; otherwise this SQL statement would be simply ugly. A sketch of the reshaped query follows.
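As a rough illustration, here is one way the reshaped query might look in Oracle-style SQL; the employee table with name, department, and sex columns is a hypothetical stand-in, and row_number() is used to pick an arbitrary member of each sex per department:

-- hypothetical schema: employee(name, department, sex)
with ranked as (
  select name, department, sex,
         row_number() over (partition by department, sex order by name) rn
  from employee
),
males as (select name, department from ranked where sex = 'male' and rn = 1),
females as (select name, department from ranked where sex = 'female' and rn = 1)
-- keep only departments that have both a male and a female candidate
select name, department from males
 where department in (select department from females)
union all
select name, department from females
 where department in (select department from males);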

  •  It is based on unordered sets

Ordered computation is very common in mass data processing (take the top three, take the third place, compare with the previous period, and so on). But SQL adopts the mathematical concept of an unordered set, so ordered computations cannot be carried out directly; the line of reasoning has to be adjusted and worked around.

Task 3: Find the employees whose ages are in the middle of the company (the median by age).

The median is a very common computation; in principle it only requires taking, from an ordered set, the members in the middle positions. But SQL's unordered-set model offers no way to access members directly by position. An artificial sequence-number field has to be generated and then filtered on, which forces the query into a sub-query, as sketched below.
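A minimal sketch of the workaround, assuming a hypothetical employee table with name and birthday columns; row_number() manufactures the artificial sequence number, and the outer query picks out the middle position(s):

-- hypothetical schema: employee(name, birthday)
select name, birthday
from (
  select name, birthday,
         row_number() over (order by birthday) rn,   -- artificial sequence number
         count(*) over () cnt                        -- total number of employees
  from employee
)
where rn in (floor((cnt + 1) / 2), ceil((cnt + 1) / 2));  -- one or two middle rows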

 

Task 4: For how many consecutive trading days has this stock risen, at the longest?

Unordered sets can also distort the line of reasoning.

The conventional way to compute the number of consecutive rising days: keep a temporary counter initialized to 0; walk through the trading days in order, comparing each day with the previous one; if the stock did not rise, reset the counter to 0, and if it rose, add 1; the answer is the maximum value the counter reaches by the end of the loop.

This procedure cannot be described in SQL, so the reasoning has to change: for each date, compute the cumulative number of non-rising days from the initial date; dates sharing the same cumulative count belong to the same consecutive run of rises; grouping by that count identifies the rising intervals, and the answer is the maximum group size. The resulting statement is not easy to read or understand, and it is harder still to write, as the sketch below illustrates.
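A sketch of the reshaped reasoning, assuming a hypothetical stock table with one row per trading day (tradeDate, closePrice); days are grouped by the cumulative count of non-rising days, and the longest run of rises is the largest group minus its leading non-rising day:

-- hypothetical schema: stock(tradeDate, closePrice)
with flagged as (
  select tradeDate,
         case when closePrice > lag(closePrice) over (order by tradeDate)
              then 0 else 1 end notRise          -- 1 on days the stock did not rise
  from stock
),
grouped as (
  select sum(notRise) over (order by tradeDate) grp   -- cumulative count of non-rising days
  from flagged
)
select max(cnt - 1) longest_up_run                    -- each group starts with one non-rising day
from (select grp, count(*) cnt from grouped group by grp);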

  •  Set orientation is incomplete

There is no doubt that the set is the basis of mass data computation. Although SQL has the concept of a set, it is limited to describing simple result sets; the set is not treated as a basic data type whose use could be extended.

 

Task 5: Find the employees who share a birthday with at least one other employee.

The original purpose of grouping is to split the source set into sub-sets, and its return value should be those sub-sets. But SQL cannot describe this kind of "set consisting of sets", so it forces the next step, an aggregation over each sub-set, and returns an ordinary result set.

Sometimes, however, what we want is not a summary value over the sub-sets but the sub-sets themselves. In that case the condition obtained from the grouping has to be applied to the source set in a second query, so a sub-query once again appears unavoidably, as sketched below.
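A minimal sketch, again with a hypothetical employee(name, birthday) table: the grouping that identifies shared birthdays has to be pushed into a sub-query, and the source table is then queried a second time:

-- hypothetical schema: employee(name, birthday)
select name, birthday
from employee
where birthday in (
  select birthday
  from employee
  group by birthday
  having count(*) > 1      -- birthdays shared by more than one employee
);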

 

Task 6: Find the students whose scores rank in the top 10 in every subject.

With set-oriented thinking, one would group by subject, sort and filter each sub-set to keep its top 10, and then take the intersection of those sub-sets. But SQL can describe neither the "set of sets" nor an intersection over an indefinite number of sets. The reasoning has to change: use a window function to find the top 10 of every subject, then group by student and keep the students whose number of appearances equals the number of subjects, which is considerably harder to follow (see the sketch below).
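A sketch of that reshaped query, assuming a hypothetical score table with one row per student per subject (student, subject, score):

-- hypothetical schema: score(student, subject, score)
with ranked as (
  select student, subject,
         rank() over (partition by subject order by score desc) rk
  from score
)
select student
from ranked
where rk <= 10
group by student
having count(*) = (select count(distinct subject) from score);  -- appears in every subject's top 10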

  • It lacks object references

In SQL, the reference relation between tables is maintained only through equality of foreign-key values; the record a foreign key points at cannot be used directly as a field of the referencing record. Queries must resort to a multi-table join or a sub-query, which is both troublesome to write and inefficient to run.

 

Task 7: Find the male employees whose department manager is female.

Using a multi-table join:
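A minimal sketch of the join form, with hypothetical employee(id, name, sex, department) and department(id, manager) tables in which employee.department stores a department id and department.manager stores the manager's employee id:

-- hypothetical schema: employee(id, name, sex, department), department(id, manager)
select e.*
from employee e
  join department d on e.department = d.id
  join employee m on d.manager = m.id    -- second join just to reach the manager's sex
where m.sex = 'female'
  and e.sex = 'male';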

If the department field in the employee table pointed directly at a record in the department table, and the manager field in the department table pointed at a record in the employee table, the query condition could simply be written in the following intuitive, efficient form:

where department.manager.sex = 'female' and sex = 'male'

But in SQL the only options are a multi-table join or a sub-query, and either way the statement comes out noticeably obscure.

 

Task 8: For each employee, find the company where they had their first job.

Using a multi-table join:
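One possible join-and-group form, with hypothetical employee(id, name) and resume(employeeId, company, startDate) tables; Oracle's KEEP (DENSE_RANK FIRST ...) is used to pull out the company of each employee's earliest job after the join has multiplied the rows:

-- hypothetical schema: employee(id, name), resume(employeeId, company, startDate)
select e.id, e.name,
       min(r.company) keep (dense_rank first order by r.startDate) first_company
from employee e
  join resume r on r.employeeId = e.id
group by e.id, e.name;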

Without an object-reference mechanism and with SQL's incomplete set orientation, a sub-table cannot be handled as an attribute (field value) of the primary table. There are two ways to query a sub-table. The first is a multi-table join, which complicates the statement and requires filtering or grouping to bring the result set (whose records correspond one-to-one with the sub-table) back into one-to-one correspondence with the primary-table records. The second is a sub-query, which recomputes, for every primary-table record, the subset of related sub-table records; this adds overall computation (a WITH clause cannot be used inside a sub-query) as well as writing effort.


 

Read more…

Guest blog post by Jake Drew

BAB - The Ultimate Gaming Workstation Server

What makes a computer blistering fast?  The answer really depends on what you want to do with it and can even be quite complex depending on your requirements.  Take for instance bitcoin mining.  Custom bitcoin mining rigs can appear very unusual since many prefer to use graphics cards for the bulk of their bitcoin processing power.

Bitcoin Mining Rig (http://exabay.com/en_us/mining_hardware.php)

Since the majority of bitcoin mining activity depends on hashing, the GPU becomes a performance-optimal, cost-effective, and power-efficient solution for bitcoin mining programs.  However, you can quickly see from the rig above that the "Ultimate" workstation can look very different depending on what it is built for.

Meet BAB

That all being said, I would like to introduce you to our very own "Ultimate" gaming workstation server named "BAB", which affectionately stands for the "Bad A$$ Box".  Our requirements for the BAB build were somewhat unusual: we wanted to create a massively powerful parallel-processing memory monster with the stability and coolness of a high-end server or workstation, mixed with a cutting-edge gaming rig.

BAB Build Requirements

  1. Process very large volumes of gene sequence data for my Ph.D. related machine learning research.  For example, some of my input files are over 88GB in size.
  2. Create a machine which runs very cool at high processor loads for long periods of time.  For instance, run all available cores at > 90% utilization for days or even weeks at a time.
  3. Leave plenty of room for upgrading both memory and processing power as needed.  Will the motherboard hold 256GB, if I happen to need it?  Could I throw 12 more cores on my box, if my program is still running too slow?
  4. Act as a stand-in server should the need arise.
  5. Produce high quality graphics.  Could my son play any of the latest video games on this machine?  Could I plug a 4K monitor into this machine, if I were lucky enough to get one for my birthday?


Measuring 16 x 13 x 18 inches, BAB looks more like a box than a tower.

The BAB build meets all of the above requirements and more.  Honestly, I did not set out to build a computer; after much research, however, I found the marketplace highly lacking for my unusual requirements, and server-grade equipment to be highly overpriced.

BAB Hardware Overview

BAB Hardware at a Glance

  • Motherboard - Supermicro X9DRL-EF
  • Power - EVGA Supernova 750 G2
  • Processor - (2) Intel Xeon E5-2630 v2 Hexacore 2.6-3.1 GHz
  • Hard Drive - (3) Samsung 840EVO 250GB mSATA SSD in RAID 0
  • Storage - (2) 1TB Western Digital 7200 RPM
  • Memory - 128GB ECC 1333MHz DDR3L PC3L-10600
  • Graphics - ASUS GTX 760 Striker Platinum 4GB GDDR5
  • Sound -  Creative Sound Blaster Audigy Fx

The Case

The Corsair AIR540 High Airflow Mid-Tower Case was selected for this project.  The case was primarily chosen to meet requirement #2 providing ample airflow to all of the selected hardware components.  BAB sports a total of eleven fans.  Six of these fans pull air into the case, and three push air out.  In addition, 4 fans are used to pull air over two Corsair H75 High Performance Liquid CPU Coolers.  This is one "cool" box!  The graphics card also includes two dedicated fans (for those of you who are counting).

Corsair AIR540 High Airflow Mid-Tower Case (before modification)

Motherboard and Processors

The large memory and processing requirements for this project ruled out some of the more artistic looking motherboards which are available for modern gaming rigs.  Most of the high-end gaming motherboards I considered, unfortunately, only supported up to 64GB of memory.

Supermicro's X9DRL-EF Motherboard

The Supermicro X9DRL-EF was selected as a cost-effective alternative.  It supports up to 512GB of ECC DDR3 RAM and provides two sockets for the Intel Xeon E5-2600 v2 family of processors, which is available with up to 12 cores per chip.  With both sockets populated, that amounts to up to 48 virtual cores, or threads, in Windows.

For BAB, we selected  two Intel Xeon E5-2630 v2 hexacore processors with 6 cores per chip / socket.  This provides a total of 24 virtual cores or threads on Windows which have performed very well during my own gene sequence classification benchmarks!

BAB running all 24 threads at 100% during benchmarks!

The two Intel E5-2630 v2 processors installed in BAB perform at 2.6 (base) - 3.1 GHz (max turbo frequency).  In addition, using two of these processors provides 30MB of L3 cache.  This can also be seen in the image directly above.

Memory

128GB Timetec ECC 1333MHz DDR3L PC3L-10600

When it comes to memory, this machine beats many servers!  Using 128GB of DDR3 RAM, BAB can process the heaviest of memory intensive workloads.  This machine also leaves ample opportunity to use any excess memory for the latest in gaming techniques including loading highly utilized files such as game maps or databases to a RAMDISK partition.  While some hardware retailers may tell you that no one could ever use such a large amount of memory, I would argue that a simple in-memory database can eat up 128GB in no time!

Furthermore, solid state drive manufacturers such as Samsung are now producing software-optimized drives which can utilize memory to dramatically improve SSD performance.  Samsung's Rapid Mode operates at the block and file-system level to analyze application and data usage, and it eliminates system performance bottlenecks by dynamically leveraging system DRAM as a read/write cache.  Improved reads and writes on an SSD can dramatically increase your system's overall performance. [1]  I imagine many software companies will eventually follow suit with such memory optimizations, including memory-mapped files, if they have not begun already.

Solid State Drives and Storage

One of the fastest ways to improve overall system performance is to increase the speed at which your operating system can store and retrieve data from the hard drive.  While Solid State Drives offer one of the best ways to achieve this goal, even greater performance gains are available.

For BAB, we used three 250GB Samsung 840EVO mSATA solid state drives.  To further improve read/write performance, all three drives were configured in hardware RAID 0.  This means the three drives act as a single drive, providing up to three times the throughput of a single drive alone.  Such spectacular results are rarely seen in a real-world setting, but you can definitely increase your read/write speeds with a RAID 0 configuration in most cases.  It is important to note that if a single drive fails within a RAID 0 array, you lose all of your data, so it is always good to keep those backups up to date!

Samsung 840EVO 250GB Solid State Drive

3D Printing To The Rescue


One of the unique modifications we made during this build was to 3D print custom drive trays for our SATA-to-mSATA converters, allowing them to slide easily into three of the 2.5" SSD trays that came with the Corsair AIR540 High Airflow Mid-Tower Case.

3D Printed 2.5" mSATA Trays

The trays were designed using Google Sketchup [2], a freely available 3D modeling program.  We originally printed the trays in red, but switched to green to match our fan lights when the first print did not work out so well.

3D Model of the Drive Tray in Sketchup

It took about 10 hours for the Makerbot Mini to print all three of the trays.  The printing process is very time consuming and takes much patience.  There are numerous reasons why a print job can fail, and we experienced several of them.  My son Nathan Drew deserves much credit for paying careful attention to the Makerbot during our printing runs.

3D Printed Trays with mSATA Cards

The final trays work and look great in our drive bays.  The three mSATA drives are combined in RAID 0 for around 750GB of operating-system and speed-sensitive storage.  We also have two 1TB 7200 RPM disk drives for media and archival storage.

mSATA Drive Trays in the 2.5" Bay

 Graphics Card and Install

The BAB build includes an ASUS GTX 760 Striker Platinum graphics card with 4GB of dedicated GDDR5 memory.  As previously mentioned, this graphics card includes two dedicated fans and a very nice backlit "Republic of Gamers" logo which turns from green to orange to red depending on the graphics card's workload.  In addition, ASUS provides the GPU Tweak software interface for both overclocking and monitoring the Striker Platinum card.  The GTX 760 is not the fastest card available on the market to date, but it is a very cost-effective top performer which supports up to 4096 x 2160 digital resolution.

ASUS GTX 760 Striker Platinum

One of the challenges with using a server-grade motherboard for such a project is the lack of PCIe x16 slots, which almost every good graphics card requires for installation.  In particular, the Supermicro X9DRL-EF motherboard came with no x16 slots, only x1 and x8.  We originally purchased the ASRock BTC Pro Kit, which included one PCIe x1 to x16 riser card.

BTC Pro Kit

Unfortunately, the bulky converter made for a very ugly install, as it would not let our ASUS GTX 760 Striker Platinum graphics card slide nicely into the case's rear PCIe slots.  To make for a clean build, we simply used a Dremel to modify one of the existing PCIe x8 slots on the motherboard.  As a bonus, the x8 slot should provide more bandwidth than the x1 slot.

PCIe 8x Modified to Hold a 16x Graphics Card

Modifying the x8 slot allowed the graphics card to slide perfectly into the case's rear PCIe slot for a flawless install.  It also let us place the graphics card in the ideal aesthetic and functional spot on the motherboard: the card now sits directly in line with a fan pulling cool air in from outside the case and flowing nicely across the Striker Platinum's two on-board fans.  Since the modified x8 slot does not provide the same physical support as a "locked in" x16 slot, we added a single strategically placed zip tie at the very back of the card for extra support.  The tie is nearly invisible inside the case, and it makes us feel better about the install.

Zip Tie Support for Modified 8x Slot

The primary difference between the PCIe x8 and x16 slots is bandwidth.  In actuality, however, the x8 slot should provide ample bandwidth for this particular graphics card.  It would take a whole article to explain why, so I will simply refer anyone interested to the following video. [3]

Since the installation, we have benchmarked the graphics card heavily and are very pleased with its performance.  In fact, we were able to achieve a very stable overclock using the following GPU Tweak values and a little help from the article here [4].

ASUS Striker Platinum Overclock Values

Benchmarks


Benchmarks were performed with the UserBenchmark and AS SSD Benchmark applications.  The following results were achieved:

UserBenchmark Ranks

CPU Rank - 34th / 767

GPU Rank - 21st / 479

SSD RAID 0 Performance Using AS SSD Benchmark

Who Uses Xeon Processors in Workstations?


It is interesting to note that Apple uses 3.5GHz Xeon E5 processors in its flagship Mac Pro workstation, which includes 16GB of DDR3 ECC RAM in the 6-core version. [5]  While the Mac Pro typically includes the superior Samsung XP941 PCIe flash-based storage, the triple RAID 0 benchmarks above demonstrate that comparable speeds can be achieved.  Against these Apple PCIe SSD benchmarks, the BAB triple RAID 0 configuration falls around the middle of Apple's performance stack while offering over 750GB of fast storage for much less than a comparable amount of PCIe-based storage would cost.  However, I still plan on using a PCIe-based SSD solution in my next build.  As of today the 512GB version of the Samsung XP941 sells for around $510 on Amazon; hopefully, these prices will come down to a more reasonable level in the near future.

Pictures of the BAB Build

BAB Side View with Clear Cover

BAB from the Rear Side

BAB from the Front and Top

 Please feel free to learn more about me at:


www.jakemdrew.com

References

  1. Samsung, Solid State Drive Rapid Mode,  http://www.samsung.com/global/business/semiconductor/minisite/SSD/downloads/document/Samsung_SSD_Rapid_Mode_Whitepaper_EN.pdf, Accessed on 01/16/2014.
  2. Google, Sketchup, http://www.sketchup.com/ , Accessed on 01/16/2014.
  3. LinusTechTips, PCIe Lanes - PCIe 8x vs 16x in SLI, https://www.youtube.com/watch?v=rctaLgK5stA , Accessed on 01/16/2014.
  4. HARDOCP, Overclocking the ASUS ROG Striker GTX 760, http://www.hardocp.com/article/2014/07/07/asus_rog_striker_platinum_gtx_760_4gb_video_card_review/3#.VLm8fivF9dM, Accessed on 01/16/2014.
  5. Apple, Mac Pro Specs, http://www.apple.com/mac-pro/specs/, Accessed on 01/16/2014.
Read more…
