CS BIGDATA

BIG DATA ANALYTICS

Big data analytics is the often-complex process of examining large and varied data sets, or big data, to uncover information -- such as hidden patterns, unknown correlations, market trends and customer preferences -- that can help organizations make informed business decisions.

On a broad scale, data analytics technologies and techniques give organizations a way to analyze data sets and draw conclusions from them. Business intelligence (BI) queries, by comparison, answer basic questions about business operations and performance.

Big data analytics is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms and what-if analysis powered by high-performance analytics systems.
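
To make the "predictive model" element a little more concrete, the hypothetical Python/scikit-learn sketch below fits a simple classifier to a handful of made-up customer records and then scores a new one; the field names and numbers are invented for illustration only, not part of any particular analytics product.

    from sklearn.linear_model import LogisticRegression

    # Hypothetical toy data: [monthly_spend, support_calls] per customer,
    # with 1 = the customer later churned, 0 = the customer stayed.
    X = [[20, 1], [35, 0], [90, 5], [15, 0], [80, 4], [60, 3]]
    y = [0, 0, 1, 0, 1, 1]

    model = LogisticRegression().fit(X, y)

    # Score a new customer; the probability output supports simple
    # "what-if" comparisons (e.g., vary support_calls and re-score).
    print(model.predict([[70, 2]]))
    print(model.predict_proba([[70, 2]]))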

IMPORTANCE OF BIG DATA ANALYTICS

Driven by specialized analytics systems and software, as well as high-powered computing systems, big data analytics offers various business benefits, including:

• New revenue opportunities

• More effective marketing

• Better customer service

• Improved operational efficiency

• Competitive advantages over rivals

Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data that are often left untapped by conventional BI and analytics programs. This encompasses a mix of semi-structured and unstructured data -- for example, internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone records, and machine data captured by sensors connected to the internet of things (IoT).

BIG DATA ANALYTICS TECHNOLOGIES AND TOOLS

Unstructured and semi-structured data types typically don't fit well in traditional data warehouses that are based on relational databases oriented to structured data sets. Further, data warehouses may not be able to handle the processing demands posed by sets of big data that need to be updated frequently or even continually, as in the case of real-time data on stock trading, the online activities of website visitors or the performance of mobile applications.
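
To make the schema problem concrete, here is a small, hypothetical pair of clickstream events expressed in Python; the field names are invented, but they show how semi-structured records vary in shape from one event to the next.

    # Two hypothetical clickstream events with different shapes.
    events = [
        {"user_id": 42, "page": "/home", "ts": "2024-01-01T12:00:00Z"},
        {"user_id": 42, "page": "/cart", "ts": "2024-01-01T12:01:30Z",
         "referrer": "/home", "items": [{"sku": "A1", "qty": 2}]},
    ]

    # A fixed relational schema would need a column for every field that might
    # ever appear (mostly NULLs) or repeated migrations; document-oriented
    # NoSQL stores and Hadoop-based systems can keep each event as-is.
    for event in events:
        print(event.get("items", "no items recorded"))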

As a result, many of the organizations that collect, process and analyze big data turn to NoSQL databases, as well as Hadoop and its companion data analytics tools, including:

• YARN: a cluster management technology (short for Yet Another Resource Negotiator) and one of the key features in second-generation Hadoop.

• MapReduce: a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers (a simplified word-count simulation of this model appears after this list).

• Spark: an open source, parallel processing framework that enables users to run large-scale data analytics applications across clustered systems (see the PySpark word-count sketch after this list).

• HBase: a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS).

• Hive: an open source data warehouse system for querying and analyzing large data sets stored in Hadoop files.

• Kafka: a distributed publish/subscribe messaging system designed to replace traditional message brokers (see the producer/consumer sketch after this list).

• Pig: an open source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs executed on Hadoop clusters.
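
To illustrate the MapReduce programming model referenced above, the following is a minimal, single-process Python simulation of the map, shuffle and reduce phases counting words in a few sample lines. It is a teaching sketch only, not real Hadoop code; on a Hadoop cluster the same logic would be distributed across many nodes.

    from collections import defaultdict

    lines = ["big data analytics", "big data tools", "data everywhere"]

    # Map: emit a (word, 1) pair for every word.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: sum the counts for each word.
    reduced = {word: sum(counts) for word, counts in grouped.items()}
    print(reduced)  # {'big': 2, 'data': 3, 'analytics': 1, ...}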
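
For Spark, the PySpark sketch below shows the kind of large-scale job the framework is typically used for: a word count over log files stored in HDFS. The input and output paths are placeholder assumptions; on a real cluster the same script would be launched with spark-submit.

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session; on a cluster this work is spread across executors.
    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # Read text files from HDFS (placeholder path) into an RDD of lines.
    lines = spark.read.text("hdfs:///data/weblogs/*.log").rdd.map(lambda row: row[0])

    # The classic word count expressed as parallel transformations.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.saveAsTextFile("hdfs:///data/word_counts")  # placeholder output path
    spark.stop()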
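
For Kafka's publish/subscribe model, a minimal producer and consumer pair might look like the following. This sketch assumes the third-party kafka-python client library (the official clients are Java/Scala), and the broker address and topic name are placeholders.

    from kafka import KafkaProducer, KafkaConsumer

    # Publish one event to a topic (placeholder broker address and topic name).
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("clickstream", b'{"user_id": 42, "page": "/home"}')
    producer.flush()

    # Subscribe to the same topic and print incoming messages.
    consumer = KafkaConsumer("clickstream",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)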