Big Data and Analytics
This article was written to support the setup of a Center of Excellence at our Institution.
BIG DATA
Big Data is a term that has risen to prominence to describe data that exceeds the processing capability of conventional database systems. It is not the amount of data that is important; it is what organizations do with the data that matters. Big Data can be analyzed for insights that lead to better decisions and strategic business moves.
While defining Big Data, we should not confuse ourselves into thinking that only large volumes of data, in petabytes or zettabytes, qualify as Big Data; a high rate of data inflow also qualifies. Even with moderate volume and velocity, if there is a lot of variation in the type of data, that too is considered Big Data. In the early 2000s, industry analyst Doug Laney defined Big Data using the three V's: Volume, Velocity, and Variety.
Volume – Facebook, for example, stores photographs. That statement does not begin to boggle the mind until you realize that Facebook has more users than China has people, and each of those users has stored a large number of photographs. Facebook is storing roughly 250 billion images. As we move forward, these numbers will only grow, leading to ever more voluminous data.
Velocity – Velocity is the measure of how fast data is arriving. Continuing the Facebook example, around 900 million photos are uploaded each day, so a company of that scale has to ingest all the incoming data, process it, file it, and somehow be able to retrieve it later. Volume and velocity are interrelated: as the amount of incoming data increases, so does the rate at which it must be handled.
Variety – We have talked about photographs, sensor data, tweets, encrypted packets and so on. Each of these is very different from the others. This data is not the old rows and columns; it differs from application to application, and much of it is unstructured. Handling these kinds of data is a Big Data problem in its own right, and that is the third V.
Today, a fourth component is often added to Big Data: Veracity.
Veracity – Veracity is the quality and trustworthiness of the data. Just how accurate is all that data? Consider, for example, Twitter posts with hashtags, abbreviations and typos, and the reliability and accuracy of all that content. Gathering loads and loads of data is of no use if its quality or trustworthiness cannot be relied upon. Another good example relates to the use of GPS data: the GPS signal will often "drift" off course while passing through urban areas. Such issues may not immediately kill the business, but they cannot be ignored for long.
BIG DATA ANALYTICS
Big Data Analytics is the process of examining large and varied data sets (i.e., Big Data) to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful information that can help organizations make more informed business decisions.
Benefits – Driven by specialized analytics systems and software, Big Data Analytics points the way toward various business benefits, including new revenue opportunities, more effective marketing, better customer service, improved operational efficiency and competitive advantages. Big Data Analytics applications enable Data Scientists, Predictive Modelers, Statisticians and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data that are often left untapped by conventional Business Intelligence. On a broad scale, Data Analytics technologies and techniques provide a means of analyzing data sets and drawing conclusions from them to help organizations make informed decisions. BI queries answer basic questions about business operations and performance. Big Data Analytics is a form of advanced analytics, involving complex applications with elements such as predictive models, statistical algorithms and what-if analysis powered by high-performance analytics systems.
In the near future, I plan to use the MNIST dataset to create my classifier using a CNN, which would make it more efficient.
The code for the entire application is available at the following GitHub link. Click here for the complete code.
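Since the classifier itself is still planned, the following is only a minimal sketch of how a CNN might be trained on MNIST, assuming TensorFlow/Keras is available; the layer sizes and number of epochs are illustrative placeholders, not the final design.

```python
# A minimal sketch of the planned MNIST CNN classifier (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers, models

# Load and normalize the MNIST digit images (28x28 grayscale).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# A small convolutional network: two conv/pool stages followed by a dense head.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))
```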
CENTER OF EXCELLENCE
A Center of Excellence (CoE, also known as a Competency Center or a Capability Center) is a corporate group or team that leads the rest of the organization in a particular area of focus such as a technology, skill or discipline. CoEs are an organizational solution aimed at ensuring the business-to-IT link, which allows business needs to be translated into concrete IT requirements and projects. A Center of Excellence is frequently used when an organization needs to take on a new technology or skill and manage its adoption. Major trends like BYOD (bring your own device) and Big Data Analytics often drive the adoption of CoEs as enterprises attempt to deal effectively with a rapidly evolving business environment.
FACTORS IN A CoE
- Support: Supporting various business lines with necessary services and subject-matter experts.
- Governance: Allocating limited resources (money, staff) to ensure a high level of service through the CoE model. Coordination of a number of other organizational interests is required for the CoE to provide added value and innovation.
- Guidance: Standards, methodologies, tools and knowledge management are approaches which provide guidance.
- Measurement results and charts: Measurements demonstrate the added value of the CoE.
- Collaborative Learning: Training and certification, proficiency testing, team building and the allocation of roles and responsibilities aimed at promoting collaborative learning.
BIG DATA & ANALYTICS: STAGES OF DEVELOPMENT
Stage 1 –
Analysis, visualization and prediction on Big Data can all be achieved, but the key thing to realize is that the data, which is the foundation of all the above-mentioned processes, has to be understood and mastered first. Before becoming a data analyst or a data scientist, it is essential to become a data engineer and learn the processes and methods for integrating and identifying the valuable data required for the use cases. Knowing the methods, tools and data required for each use case is the work of the Data Engineer.
Data Engineers are responsible for the creation and maintenance of the data infrastructure that enables almost every other function in the data world. They are responsible for the development, construction, maintenance and testing of architectures such as databases and large-scale processing systems. As part of this, Data Engineers are also responsible for creating the data set processes used in modeling, mining, acquisition and verification. Engineers are expected to have a solid command of common scripting languages and tools for this purpose, and to use this skill set to constantly improve data quality and quantity by leveraging and improving Data Analytics systems.
From a broad view, a Big Data Engineer should be well versed in computing and technical skills and in working with tools like Hadoop and Spark. Hadoop is a powerful Big Data platform that can turn out to be a fussy beast if it is not handled by skilled professionals. Knowledge of the best ways to tune the core Big Data components within the Hadoop stack (HDFS, MapReduce, Pig, Hive and HBase) is essential.
DATA WAREHOUSING
A Data Warehouse is a federated repository for all the data that an enterprise's various business systems collect. The repository may be physical or logical. Data Warehousing emphasizes the capture of data from diverse sources for useful analysis and access, but does not generally start from the point of view of the end user, who may need access to specialized, sometimes local databases. Apache Spark is a relative newcomer to the Big Data market and is gaining traction at a very quick rate.
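As a rough illustration of where Spark fits in such a workflow, here is a minimal PySpark sketch that loads a CSV extract and aggregates it; the file path and the column names (`sales.csv`, `region`, `amount`) are hypothetical placeholders.

```python
# Minimal PySpark sketch: load a CSV extract and aggregate it.
# The path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("warehouse-aggregation").getOrCreate()

# Read raw data with a header row and let Spark infer column types.
sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# Aggregate total and average amount per region.
summary = (sales.groupBy("region")
                .agg(F.sum("amount").alias("total_amount"),
                     F.avg("amount").alias("avg_amount")))

summary.show()
spark.stop()
```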
DATABASE ARCHITECTURE
Big Data architecture is premised on a skill set for developing reliable, scalable, completely automated data pipelines. That skill set requires profound knowledge of every layer in the stack. A Big Data Engineer has to make decisions about what happens to the data, how it is stored in the cluster, how access is granted internally, which tools are used to process the data, and eventually how access is provided to the outside world.
Stage 2 -
Once all the required data is in place and can be easily accessed and manipulated, it has to be processed in a way that meets the client's use cases. Analysis is the process of getting the required information from the available data to enhance decision making, make better predictions and improve business outcomes. Analysts focus on validating, sorting, relating and evaluating the data, and the most critical skill of a data analyst is good knowledge of the domain he or she is working in. To stand alone as an analyst, not much computing or technical knowledge is required; analysts need to be good at business and statistics. A good hold on machine learning is highly beneficial, as it helps in managing complex data structures and learning patterns that are too difficult to handle using traditional Data Analytics.
PROGRAMMING SKILLS
Learning how to code is an essential skill in the Big Data Analyst's arsenal. You need to code to conduct numerical and statistical analysis on massive data sets. A good amount of time should be invested in learning to implement business and statistical logic through languages like R, Python and Java. Another important aspect of programming is interacting with databases through queries and statements.
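As a small, hedged example of that "query the database, then analyze" loop, the sketch below uses only Python's built-in sqlite3 and statistics modules; the database file, table and column names (`analytics.db`, `orders`, `order_value`) are hypothetical.

```python
# Minimal sketch: pull data from a database with SQL, then run basic statistics.
# Uses only the Python standard library; table/column names are hypothetical.
import sqlite3
import statistics

conn = sqlite3.connect("analytics.db")
cur = conn.cursor()

# Retrieve one numeric column with an ordinary SQL query.
cur.execute("SELECT order_value FROM orders WHERE order_value IS NOT NULL")
values = [row[0] for row in cur.fetchall()]
conn.close()

if values:
    print("count :", len(values))
    print("mean  :", statistics.mean(values))
    print("median:", statistics.median(values))
    print("stdev :", statistics.stdev(values) if len(values) > 1 else 0.0)
```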
QUANTITATIVE SKILLS
As a Big Data Analyst, programming helps you to do what you need to do. But what exactly are you supposed to do?
Quantitative skills are what make a good Big Data Analyst. For starters, you need to know multivariable calculus and linear and matrix algebra. Probability and statistics are also important. With all these skills, you will have a strong foundation in numerical analysis.
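As one small illustration of how matrix algebra shows up in numerical analysis, the sketch below fits a least-squares line with NumPy; the data is synthetic and purely illustrative.

```python
# Illustrative only: fitting a least-squares line using matrix algebra (NumPy).
import numpy as np

# Synthetic data: y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.shape)

# Build the design matrix [x, 1] and solve the least-squares problem.
A = np.column_stack([x, np.ones_like(x)])
coeffs, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)

print("estimated slope and intercept:", coeffs)
```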
MULTIPLE TECHNOLOGIES
Programming is an essential Big Data Analysis skill. What makes it extra special, though, is its versatility. The technologies involved are not limited to programming alone: the range of technologies a Big Data Analyst must be familiar with is huge, spanning myriad tools, platforms, hardware and software.
UNDERSTANDING BUSINESS AND OUTCOMES
Analysis of data and insights would be useless if it cannot be applied in a business setting. All Big Data Analysts need a strong understanding of the business and domain they operate in. Based on that business expertise, Big Data Analysts can identify relevant opportunities and threats, and domain expertise enables them to communicate effectively with different stakeholders.
INTERPRETATION OF DATA
Of all the skills we have outlined, interpretation of data is the outlier. It is the one skill that combines both art and science: it requires the precision and rigor of hard science and mathematics, but also calls for creativity, ingenuity and curiosity. Without the knowledge to interpret data properly, the approach can become dangerous, because it fails to provide a holistic view of the data procurement and analysis process.
Stage 3 -
After investing a good amount of time and gaining experience with the skills mentioned above, the final step towards becoming an expert in Big Data is to achieve the title of Big Data Scientist. Data Scientists combine statistics, mathematics, programming, problem solving, capturing data in ingenious ways and the ability to look at things differently to find patterns, along with the activities of cleaning, preparing and aligning the data, dealing with both unstructured and structured data.
Data Scientists are the ones who interact with clients, fully understand the use cases and know exactly what is to be delivered. The Data Scientist is the person who knows how to extract meaning from the given data, create models and algorithmic structures, and present the analysis or predictions to clients through data visualization tools.
MACHINE LEARNING
Machine Learning is a subfield of Artificial Intelligence (AI). The goal of Machine Learning is generally to understand the structure of data and fit the data into models that can be understood and utilized by people. Though Machine Learning and Data Science may seem to be two different curves, they belong to the same family. Machine Learning aids Data Science by providing a suite of algorithms for data modeling and analysis (through the training of machine learning algorithms), decision making (through streaming, online learning and real-time testing) and even data preparation (machine learning algorithms can automatically detect anomalies in the data).
Data Science stitches together a set of ideas and algorithms drawn from Machine Learning to create a solution, and in doing so borrows heavily from traditional statistics, domain expertise and basic mathematics. In this way, Data Science is the process of solving a use case and providing a solution, whereas Machine Learning is an important cog in that solution.
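As a concrete, hedged example of Machine Learning serving data preparation, the sketch below uses scikit-learn's IsolationForest to flag anomalous records; the synthetic data and the contamination value are illustrative assumptions, not recommended settings.

```python
# Minimal sketch: flagging anomalies in tabular data with scikit-learn.
# The synthetic data and the contamination parameter are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))     # typical records
outliers = rng.uniform(low=-6.0, high=6.0, size=(10, 2))   # a few odd records
data = np.vstack([normal, outliers])

# Fit an isolation forest; fit_predict() returns -1 for anomalies, 1 for inliers.
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(data)

print("records flagged as anomalous:", int((labels == -1).sum()))
```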
DISTRIBUTED SYSTEM
A Distributed System is a model in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are concurrency of components, lack of a global clock and independent failure of components. Principles of distributed computing are key to Big Data technologies and analytics. The mechanisms for data storage, data access, data transfer, visualization and predictive modeling using distributed processing on multiple low-cost machines are the key considerations that make Big Data Analytics practical, in cost and time, for consumption by humans and machines.
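A minimal sketch of the map-reduce idea behind such distributed processing is shown below, simulated on one machine with Python's multiprocessing pool; a real deployment would spread the map tasks across many low-cost nodes instead of local worker processes.

```python
# Minimal map-reduce sketch: word counting split across worker processes.
# This simulates on one machine what a cluster would spread over many nodes.
from collections import Counter
from multiprocessing import Pool

def map_count(chunk):
    """Map step: count words in one chunk of text."""
    return Counter(chunk.split())

def reduce_counts(counters):
    """Reduce step: merge the per-chunk counts into one total."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == "__main__":
    chunks = [
        "big data needs distributed processing",
        "distributed systems pass messages between components",
        "big data analytics relies on distributed computing",
    ]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(map_count, chunks)
    print(reduce_counts(partial_counts).most_common(5))
```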
DATA VISUALIZATION
Data Visualization is the presentation of data in a pictorial or graphical format that enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify patterns. With interactive Data Visualization, the concept can be taken a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it is processed.
Data Visualization has become the de facto standard for modern Business Intelligence (BI); the success of two leading vendors in the BI space, Tableau and Qlik, both of which heavily emphasize visualization, attests to this. When a Data Scientist is writing predictive analysis or machine learning algorithms, it becomes important to visualize the outputs in order to monitor results and ensure the models are performing as intended.
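As a small illustration of monitoring model output visually, the sketch below plots predicted versus actual values with Matplotlib; the arrays are synthetic stand-ins for real model results.

```python
# Minimal sketch: visually checking model predictions against actual values.
# The arrays are synthetic stand-ins for real model output.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
actual = np.linspace(0, 100, 60)
predicted = actual + rng.normal(0, 8, size=actual.shape)  # imperfect model

plt.scatter(actual, predicted, alpha=0.7, label="predictions")
plt.plot([0, 100], [0, 100], color="red", label="perfect fit")
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
plt.title("Predicted vs. actual")
plt.legend()
plt.show()
```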
ESSENTIAL SKILLS
DISTRIBUTED COMPUTING BIG DATA ENVIRONMENTS
Hands-on skills in at least one of the many Hadoop distributions (e.g. Hortonworks, Cloudera, MapR, IBM InfoSphere BigInsights) are expected. At this point in time, the Cloudera distribution is the most widely deployed.
CLOUD DATA WAREHOUSES
Since there is an increasing move from on-premise data warehousing solutions to cloud-based warehousing solutions, skills in technologies like Amazon Redshift or Snowflake add value in the current market. Redshift is a fully managed, cloud-based, petabyte-scale data warehousing solution.
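Since Redshift speaks the PostgreSQL wire protocol, it can be queried from Python with psycopg2, as in the hedged sketch below; the cluster host, credentials and table name are placeholders, not real values.

```python
# Hedged sketch: querying Amazon Redshift from Python via psycopg2.
# Host, credentials and table name are placeholders to be replaced.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",
    user="analyst",
    password="change-me",
)

with conn.cursor() as cur:
    # A simple aggregate query; 'daily_sales' is a hypothetical table.
    cur.execute(
        "SELECT sale_date, SUM(amount) FROM daily_sales GROUP BY sale_date LIMIT 10"
    )
    for sale_date, total in cur.fetchall():
        print(sale_date, total)

conn.close()
```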
NoSQL AND NewSQL
Skills in some of the newly emerging technologies are valuable, for example MongoDB (a document-oriented database) or Couchbase (a key-value store). Others like Cassandra and HBase are also popular. On the cloud, Amazon offers its own databases such as DynamoDB and SimpleDB.
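A minimal sketch of document-style storage with MongoDB's Python driver (pymongo) follows; it assumes a MongoDB instance running locally on the default port, and the database and collection names are made up.

```python
# Minimal sketch: storing and querying JSON-like documents with pymongo.
# Assumes MongoDB is running locally on the default port; names are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["coe_demo"]["sensor_readings"]

# Insert a document with a flexible, schema-less structure.
collection.insert_one({"device": "sensor-01", "temperature": 23.5, "unit": "C"})

# Query it back by a field value.
doc = collection.find_one({"device": "sensor-01"})
print(doc)

client.close()
```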
DATA INTEGRATION AND VISUALISATION
When working on large-scale analytics projects, data will be ingested from multiple sources, so knowledge of Big Data-compliant integration technologies like Flume, Sqoop, Storm and Kafka is valuable. Data integration products like Informatica and Talend have also upgraded their capabilities for Big Data processing. In the world of visualization, Tableau and QlikView are popular, and they also integrate with other BI reporting data stores.
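To make the ingestion idea concrete, here is a hedged sketch of producing and consuming messages with the kafka-python client; it assumes a Kafka broker at localhost:9092 and a topic named "events", both of which are placeholders.

```python
# Hedged sketch: sending and reading messages with kafka-python.
# Assumes a broker at localhost:9092 and a topic named 'events' (placeholders).
from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish a few small messages to the topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("events", value=f"event-{i}".encode("utf-8"))
producer.flush()
producer.close()

# Consumer side: read messages from the beginning of the topic.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s of inactivity
)
for message in consumer:
    print(message.value.decode("utf-8"))
consumer.close()
```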
BUSINESS INTELLIGENCE
Hands-on knowledge of Business Intelligence technologies is also helpful. There are several technology stacks in BI; for example, IBM, Oracle and SAP have acquired BI suites, while Microsoft's BI stack is largely organically developed. Others like MicroStrategy and SAS are independent BI providers.
BIG DATA TESTING
Big Data testing is fundamentally different from traditional ETL and application testing because of the volume of data involved. Differences in test scenarios also arise from the velocity and variety of the data, and in certain cases the execution of test cases requires scripting and programming skills (Pig scripts, Hive Query Language, etc.).
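One common Big Data test is a simple reconciliation between a source extract and the loaded target; the PySpark sketch below compares row counts, with both paths being hypothetical placeholders rather than a prescribed test suite.

```python
# Hedged sketch: a basic reconciliation test comparing source and target row counts.
# Both paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigdata-count-check").getOrCreate()

source = spark.read.csv("hdfs:///staging/transactions.csv", header=True)
target = spark.read.parquet("hdfs:///warehouse/transactions")

source_count = source.count()
target_count = target.count()

# Fail loudly if the load dropped or duplicated records.
assert source_count == target_count, (
    f"Row count mismatch: source={source_count}, target={target_count}"
)
print("Reconciliation passed:", source_count, "rows")
spark.stop()
```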
AREAS OF RESEARCH IN BIG DATA
Big Data is a voluminous subject of research. As technology advances, there is demand for more efficient methods to overcome the shortcomings in developing fields. The research domains of Big Data are vast. Here are some areas of research:
- Improving Data Analytics techniques: Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names across business, science and social science domains. Data from the required fields is extracted, then organized and filtered based on certain constraints, and the organized data is used to address the requirements with confidence.
- Natural Language Processing methods: Natural Language Processing (NLP) is a field of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, and in particular with programming computers to fruitfully process large amounts of natural language data. Data extracted from different sources can be organized according to current sentiment trends and used, with the help of Big Data, in fields like politics, finance and development (a minimal sentiment-scoring sketch appears after this list).
- Algorithms for Data Visualization: Algorithm visualization is the use of images to convey useful information about algorithms. That information can be a visual illustration of an algorithm's operation, of its performance on different kinds of inputs, or of its execution speed versus that of other algorithms for the same problem. To accomplish this goal, an algorithm visualization uses graphic elements. Big Data can be used to visualize crucial algorithms and to obtain accurate results.
- Big Data tools and development platforms: The conventional tools available to handle Big Data are inefficient, and much research remains to be done in this area in the near future.
- Healthcare sentiment analysis: Big Data can be applied in the following areas to overcome current shortcomings:
- Live drug response Analysis.
- Heterogeneous information integration at large volumes, along with data security.
- Privacy issues related to healthcare information exchange.
- Metadata management.
- Information retrieval tools for efficient data searching.
- Fraud detection.
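As a minimal illustration of the sentiment-scoring step mentioned under the NLP and healthcare items above, the sketch below uses NLTK's VADER analyzer on a few example sentences; the sample texts are invented and the lexicon is downloaded on first run.

```python
# Minimal sketch: scoring sentiment of short texts with NLTK's VADER analyzer.
# The sample sentences are invented; the lexicon is downloaded on first run.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

samples = [
    "The new treatment worked wonderfully and recovery was fast.",
    "Side effects were severe and the response time was terrible.",
]
for text in samples:
    scores = analyzer.polarity_scores(text)
    print(f"{scores['compound']:+.2f}  {text}")
```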