Scale your vision! (about scalability)

Guest forum post by Fatih Hamurcu

            I am sure you have heard, from one source or another, that “the more data, the more valuable the insights”. But have you ever wondered how to work with Big Data in practice, beyond the algorithmic aspect?

            I have done what I consider fairly comprehensive research on techniques for working with Big Data. While doing it, I kept hoping that it would help someone build their own technology stack for their Data Science work. Let us start the article with Technology Selection:

            Interestingly, each of the hardware technologies listed below has been used to implement or run Data Mining algorithms. Research on them is still continuing, so tomorrow scientists may amaze us with incredible speedups in building and executing data models. Even so, recent experiments are already enough to justify using them in daily Data Science work. For instance, GPU primitives such as map, reduce, scatter, and gather can be used for high-performance, streamed arithmetic and matrix-based operations in the spirit of the map-reduce approach. Feed your vision with the following:

  • CPU: Suitable for data-independent parallelization at the CPU instruction level, using the vector capabilities of SSE (Streaming SIMD Extensions) and MMX instructions.
  • Multicore: Suitable for exploiting multicore technologies with Pthreads or OpenMP.
  • GPU: Suitable for exploiting the parallel cores of GPUs, which enable high-performance stream programming with arithmetic kernels via CUDA or OpenCL.
  • FPGA: Suitable for accelerating computationally intensive or case-specific problems with the VHDL hardware description language.
  • Accelerator
    • RIGEL: An architecture for a broad class of data-parallel and task-parallel computations.
    • MAPLE: An accelerator with hundreds of simple vector processing elements (PEs).
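
            The map and reduce primitives mentioned above can be sketched on a multicore machine as follows. This is only a minimal illustration: the function names, the sum-of-squares workload, and the chunking scheme are my own assumptions, not taken from any of the listed technologies.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """Map step: each worker processes its own chunk independently."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, n_workers=4):
    """Map the chunks onto a worker pool, then reduce the partial results."""
    chunks = chunked(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    # Reduce step: combine the per-chunk results into one value.
    return reduce(lambda a, b: a + b, partials, 0)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), n_workers=3))
```

            In standard CPython, threads show the structure rather than deliver a real speedup for CPU-bound work. The same map-then-reduce shape carries over to a process pool, a SIMD loop, or a GPU kernel.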

            Finding the best combination of technology and platform is the only way to collect more ‘ibilities’, such as the availability of our deployable product. To find it, we should not only be aware of the available platforms but also experiment with different combinations of platform and technology. To enrich your Data Science toolbox, study these platforms:

  • Microsoft DryadLINQ: Microsoft's solution for large-scale data-parallel computing, programmed with LINQ (Language Integrated Query)
  • IBM PML: IBM's Parallel Execution toolbox for Data Mining Algorithms
  • SDAM: GUI-based Machine Learning System built on top of MATLAB and C/C++
  • HDFS Based Programming Models
    • Hadoop: Popular tools used with it include Hive, Pig, and MapReduce
    • Delite: A technology stack built for heterogeneous hardware systems. One of its key components is a set of Scala-based Domain Specific Languages, e.g. OptiWrangle for data preparation, OptiGraph for graph algorithms, and OptiML for machine learning
    • Spark: Like Hadoop, it is an engine for large-scale data processing

            Let us continue the article with Micro & Macro Optimizations:

            Not all of the structured or unstructured data coming from video files, human-generated interactions, diagnostic text, radio-frequency ID, tweets, or other sources is useful for long-term or even short-term decision making; well over 50% of it is actually noise. Even if 100% of it were signal, declining to store or use all of it would be a logical choice, given time, memory, and computation constraints and the need to gain a competitive advantage in the market. As a result, we should consider Sampling, as early as possible, as one of our Big Data practices. Let us take a look at some concerns:

  • How to partition data
    • Random Sampling
    • Instance-based Sampling, e.g. for horizontally partitionable data structured as a matrix
    • Feature-based Sampling, e.g. Stratified Sampling or sampling on dynamically or statically weighted features, for vertically partitionable data structured as a matrix
  • How to process samples
    • Sequentially
      • Data Model Guided Sample Selection: Examples that the data model classifies incorrectly guide the selection of the next data sample
      • Progressive (Boosted) Sampling Method
    • Concurrently: Distribute samples to the processing units randomly or according to an algorithm
  • How much to sample
    • Quota Sampling: Get only as many as you need
    • Proportional Quota Sampling: Get only as many as you need from each subgroup, according to its proportion of the population
    • Non-Proportional Quota Sampling: Get the same number of instances from each subgroup
  • How to avoid sampling traps
    • Value distribution: While sampling, preserve the characteristics of the original dataset in the sample as far as possible
    • Accuracy consideration: Use error metrics to estimate how accurately a sample represents the original dataset
    • Bias: To avoid extreme tail values in the distribution graph, carefully increase the sample size with signal instead of noise
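
            The two quota schemes in the list can be sketched as below. This is a minimal sketch under my own assumptions: the function names, the (label, value) record layout, and the fixed seed are hypothetical and only serve the illustration.

```python
import random
from collections import defaultdict


def group_by(records, key):
    """Collect records into subgroups according to key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return groups


def proportional_quota_sample(records, key, total, seed=0):
    """Draw about `total` records, each subgroup in proportion to its size.

    Per-group quotas are rounded, so the final size may differ slightly
    from `total`.
    """
    rng = random.Random(seed)
    groups = group_by(records, key)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        quota = round(total * len(members) / len(records))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


def equal_quota_sample(records, key, per_group, seed=0):
    """Non-proportional quota: draw the same number from every subgroup."""
    rng = random.Random(seed)
    groups = group_by(records, key)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


if __name__ == "__main__":
    # Hypothetical population: 80 records labelled "a", 20 labelled "b".
    data = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
    by_label = lambda record: record[0]
    print(len(proportional_quota_sample(data, by_label, total=10)))
    print(len(equal_quota_sample(data, by_label, per_group=5)))
```

            With the 80/20 population above, the proportional variant keeps the 80/20 ratio in the sample, while the equal-quota variant deliberately over-represents the smaller subgroup.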

            On a sequential computer, the fastest algorithm is the best algorithm, but for this new area of science, I believe we need more creative approaches to algorithm design in order to extract more valuable insights in real time. Since I have no work experience, I am not the right person to discover such creative ways. Instead, I have collected powerful approaches used in areas related to Data Science. During the algorithm design phase, keep the list below in view:

  • How to learn from distributed data
    • by using sufficient statistics, e.g. parallel or serial distributed statistics gathering
    • by using the divide & conquer method or model ensembles, e.g. boosting, bagging
    • by visiting each node with the data model, i.e. treating the data model as an object that visits each node to update its internal state
  • How to optimize the learning algorithm
    • by using efficient data structures, e.g. mapping data to an X-bit binary representation
    • by optimizing the algorithm with domain-specific data processing instructions, e.g. CPU instruction-level optimization
    • by using a hypothesis space that experiment results show to be appropriate, i.e. a simple model may perform well enough compared with a complex and time-consuming hypothesis
    • by executing (complex) database queries directly on the corresponding node(s) instead of retrieving all the data and then computing on it, especially for distributed DBMSs
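
            As an illustration of learning from distributed data via sufficient statistics, the sketch below lets each node summarize its own partition, so that only the small summaries travel over the network. The `Stats` class and `local_stats` helper are hypothetical names of mine, and the mean and variance are just one simple choice of statistic.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics for mean and variance: count, sum, sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def merge(self, other):
        """Combine the summaries of two partitions without touching raw data."""
        return Stats(self.n + other.n,
                     self.total + other.total,
                     self.total_sq + other.total_sq)

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        """Population variance, computed as E[x^2] - E[x]^2."""
        return self.total_sq / self.n - self.mean ** 2


def local_stats(partition):
    """Runs on each node: summarize the local partition only."""
    return Stats(len(partition), sum(partition), sum(x * x for x in partition))


if __name__ == "__main__":
    # Hypothetical data spread over three nodes.
    partitions = [[1, 2, 3], [4, 5], [6]]
    merged = Stats()
    for part in partitions:
        merged = merged.merge(local_stats(part))
    print(merged.mean, merged.variance)
```

            Because `merge` is associative, the summaries can be combined serially, as here, or pairwise in parallel. Note that the E[x^2] - E[x]^2 form can lose precision on data with a large mean and a small spread.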

            Now we reach the Overall System Analysis with respect to scalability concerns:

            To talk about scalability, let us start with the Wikipedia definition: Scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth. To clarify, we use scalability analysis to predict the capacity of a system to effectively utilize an increasing amount of data or processing resources. Instead of going through its other benefits, I would like to emphasize the lack of effective scalability analysis toolkits, because the most recent research on this topic was done nearly two decades ago. Therefore, be careful when using the metrics and tools below.

  • What elements does scalability depend on?
    • Number of Nodes, Problem Size, and Algorithm
    • Node Characteristics e.g. processor speed, size of memory, number of processors in it
    • Network Characteristics e.g. type of interconnection, routing techniques
  • How to analyze the scalability of a system and algorithm
    • Scalability Metrics e.g. Iso-speed scalability, iso-efficiency function, PEF & DEF, Strategy based scalability metric
    • Performance Evaluation Toolkit and Systems e.g. Paradyn, SCALEA, STAS
    • Baseline Scalability Evaluation: Fix the number of nodes and the problem size with the untuned algorithm, then use the result as a baseline.
    • Popular Calculations
      • Running Time = Computation + Communication + Synchronization
      • Speedup = Ratio of sequential execution time to parallel execution time
      • Efficiency = Ratio of speedup to the number of processors/nodes
  • Where can bottlenecks arise?
    • Algorithm: On the central node, a non-parallelized part of the code, even if its computation time is negligible, e.g. smaller than Θ(log n)
    • Communication: Cost driven by heavy communication, e.g. I/O, between the host and the computation units.
    • Synchronization: Cost driven by waiting for task completion on slave machines
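
            The popular calculations listed above translate directly into code. The sketch below is only an illustration: the function names are mine, and the timings in the example are made-up numbers, not measurements.

```python
def running_time(computation, communication, synchronization):
    """Running time = computation + communication + synchronization."""
    return computation + communication + synchronization


def speedup(sequential_time, parallel_time):
    """Speedup = sequential execution time / parallel execution time."""
    return sequential_time / parallel_time


def efficiency(sequential_time, parallel_time, n_nodes):
    """Efficiency = speedup / number of processors or nodes."""
    return speedup(sequential_time, parallel_time) / n_nodes


if __name__ == "__main__":
    # Hypothetical job: 120 s sequentially; on 8 nodes it spends 15 s
    # computing, 3 s communicating, and 2 s waiting on synchronization.
    t_seq = 120.0
    t_par = running_time(computation=15.0, communication=3.0, synchronization=2.0)
    print(t_par)                        # 20.0
    print(speedup(t_seq, t_par))        # 6.0
    print(efficiency(t_seq, t_par, 8))  # 0.75
```

            In this made-up case the 8 nodes deliver a speedup of 6 rather than 8, and the missing quarter of the efficiency is exactly the communication and synchronization cost named in the bottleneck list.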

            As a last word, I was not accustomed to the input sizes of Data Science jobs, so I decided to research the topics above systematically. Now I can imagine how to build my own laboratory without an expensive warehouse, million-dollar storage devices, or mainframe systems. If you work for an organization, I am not sure how this article will help you, since I have graduated but not yet started my own career, but I hope it will.

Folks, feel free to share your opinion via tweet (@hamurcu_fatih), comments, or email ([email protected]).
