INTRODUCTION TO ANALYTIC AND BIG DATA – HADOOP

In this era of technology, where everything depends on IT, every industry needs to manage its data sources. Our traditional database management systems are not efficient at handling data at huge scale. Moreover, users cannot all access web services simultaneously, because the traditional database approach is not efficient enough to handle such load.

WHY BIG DATA:

Data challenges are increasing day by day. In 2010 the Digital Universe was estimated at 1.2 ZB, and by 2011 there were on the order of 300 quadrillion files to store and access. From such surveys it is very clear that there is a need to manage the data of every field for future reference. That is why researchers searched for new database technologies that could support storing and accessing huge amounts of data, i.e. data in TBs and beyond, and this approach is named BIG DATA.

Big Data analytics and the Apache Hadoop open source project are rapidly emerging as the preferred solution to the business and technology trends that are disrupting traditional data management and processing.

Enterprises can gain a competitive advantage by being early adopters of big data analytics.

HOW BIG DATA AND BUSINESS INTELLIGENCE DIFFER:

BIG DATA differs from Business Intelligence (BI) in that it goes beyond what the traditional BI system offers. Big Data works mostly with semi-structured and unstructured models, while BI is based on fully structured data. Big Data platforms scale to hundreds of PBs of storage, while a typical BI system's storage limit is only in the tens of TBs. Moreover, Big Data provides the following features over the traditional BI system:

1. Supports structured, semi-structured and unstructured data
2. Supports batch, near-real-time, real-time and streaming processing
3. Less workload
4. Huge storage
5. Easy to operate
6. Not repetitive

FUTURE:

Enterprise + Big Data = Big Opportunity

BIG DATA is making its place in the business world very fast. It has been adopted by leading firms such as Facebook, Google, Amazon.com, Twitter, Yahoo and LinkedIn, and will soon be in use all around the world. IBM is investing heavily in BIG DATA. It has good scope in the future. Earlier it was the era of real-time analytics; now, with BIG DATA, it is the time of predictive analytics. Scientists are contributing a lot of their knowledge and time to this field, as it is the future of every industry.

It has opportunities in every field:
1. Financial Services
2. Healthcare
3. Retail
4. Web/Social/Mobile
5. Government
6. Manufacturing
7. Energy
8. Media & Telecommunication

Industries are embracing Big Data these days.

Limitations:

Big Data is still in an experimental stage and is growing every day. For now it faces hard issues and problem areas such as:

1. Threat analysis
2. Search quality
3. PoS transaction analysis
4. Trade surveillance
5. Customer churn analysis
6. Modeling true risk
7. Analyzing network data to predict failure

Companies are working on solving these issues, so there are also good chances for database engineers to get jobs in this field.

HADOOP:

Hadoop is a scalable, fault-tolerant distributed system for data storage and processing. Core Hadoop has two main components:

1. HDFS (Hadoop Distributed File System): a reliable, redundant, distributed file system optimized for large files.
2. MapReduce: a programming model for processing sets of data, mapping inputs to outputs and reducing the outputs of multiple Mappers to one answer.

Hadoop is a large and active ecosystem which operates on unstructured and structured data. It is open source under the friendly Apache License.

1. HDFS:

Hadoop Distributed File System performs best with a ‘modest’ number of large files. It is optimized for large, streaming reads of files and sits on top of a native (ext3, xfs, etc.) file system.

In HDFS, data is organized into files and directories. Files are divided into blocks, which are distributed across cluster nodes and replicated to handle failure. Checksums are used to ensure data integrity.
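
To make this concrete, here is a minimal sketch in Java using Hadoop's standard FileSystem API. The file path and contents are made up for the example; the cluster configuration is assumed to come from core-site.xml/hdfs-site.xml on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster's HDFS settings from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; /user/demo/hello.txt is a hypothetical path.
            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }

            // Inspect how HDFS stored it: replication factor and block placement.
            FileStatus status = fs.getFileStatus(file);
            System.out.println("replication = " + status.getReplication());
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("block hosts: " + String.join(",", loc.getHosts()));
            }
        }
    }

On a real cluster each block of a large file would typically be replicated three times across different nodes, which is what the block-location loop makes visible.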

2. MapReduce:

It is a method for distributing a task across multiple nodes. Each node processes the data stored on that node. It consists of two developer-written phases: (i) Map and (ii) Reduce.

There is a shuffle-and-sort step between Map and Reduce. MapReduce provides:

(i) Automatic parallelization and distribution
(ii) Fault Tolerance
(iii) Status and Monitoring Tools
(iv) A clean abstraction for programmers.
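
The canonical example is word count. Below is a minimal sketch of the Map and Reduce phases written in Java against the standard org.apache.hadoop.mapreduce API:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map phase: emit (word, 1) for every word in the input line.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: the shuffle delivers all counts for one word together;
        // summing them gives the word's total.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }

Notice that neither class says anything about which node runs it or how data moves between them; that is the clean abstraction mentioned above.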

MapReduce: Basic Concepts

1. Each Mapper processes a single input split from HDFS.
2. Hadoop passes the developer's Map code one record at a time.
3. Each record has a key and a value.
4. Intermediate data is written by the Mapper to local disk.
5. During the shuffle and sort phase, all values associated with the same intermediate key are transferred to the same Reducer.
6. The Reducer is passed each key along with a list of all its values.
7. Output from the Reducers is written to HDFS.
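
A short driver ties these steps together. The sketch below assumes the WordCount Mapper and Reducer shown earlier; the input and output paths are taken from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // One Mapper per HDFS input split; Reducers receive the shuffled keys.
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input is read from HDFS; Reducer output is written back to HDFS.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

It would typically be launched with something like: hadoop jar wordcount.jar WordCountDriver /input /output (paths hypothetical).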

SENTIMENT ANALYSIS:

1. Hadoop is frequently used to monitor what customers think of a company's products or services.
2. Data is loaded from social media sources (Twitter, blogs, Facebook, emails, chats, etc.) into the Hadoop cluster.
3. Map/Reduce jobs run continuously to identify sentiment (e.g., Acme Company's rates are "outrageous" or a "rip off").
4. Negative/positive comments can then be acted upon (special offer, coupon, etc.).
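
A simplified sketch of such a job's Map phase follows. The keyword lists here are hypothetical placeholders; a real system would use a proper sentiment lexicon or a trained model:

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Tags each social-media message as positive or negative by keyword match
    // and emits (sentiment, 1). Word lists are illustrative, not a real lexicon.
    public class SentimentMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Set<String> NEGATIVE =
                new HashSet<>(Arrays.asList("outrageous", "rip off", "terrible"));
        private static final Set<String> POSITIVE =
                new HashSet<>(Arrays.asList("great", "love", "excellent"));
        private static final IntWritable ONE = new IntWritable(1);
        private final Text sentiment = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String message = value.toString().toLowerCase();
            for (String w : NEGATIVE) {
                if (message.contains(w)) {
                    sentiment.set("negative");
                    context.write(sentiment, ONE);
                    return;
                }
            }
            for (String w : POSITIVE) {
                if (message.contains(w)) {
                    sentiment.set("positive");
                    context.write(sentiment, ONE);
                    return;
                }
            }
        }
    }

The same IntSumReducer from the word-count sketch could then total the positive and negative counts, and those totals drive the follow-up actions described above.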

Why Hadoop:

1. Social media/web data is unstructured.
2. The amount of data is immense.
3. New data sources arise weekly.

In this way, Big Data and Hadoop are newly emerging data management technologies that are going to produce new database management strategies.
