My Own Definition of Big Data
Big data refers to groups of data that reach the limits of the processing capabilities of commodity data manipulation and analysis tools. Robust solutions are therefore needed to understand, relate and profit from the information found in the processed data. The amount of data that can be processed, or that is too costly to process, changes over time with technological advances and the availability of the technology.
Big data represents a growing business opportunity, since analyzing the data can reveal new market segments or preferences within a market. Furthermore, entirely new products or industries can bloom from the analysis of large data sets.
Other definitions of Big Data
McKinsey Global Institute
“Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data—i.e., we don’t define big data in terms of being larger than a certain number of terabytes (thousands of gigabytes). We assume that, as technology advances over time, the size of datasets that qualify as big data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).
Source: Big data: The next frontier for innovation, competition, and productivity, May 2011. Authors: James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers.
Gartner
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. Big data – information of extreme size, diversity and complexity – is everywhere. This disruptive phenomenon is destined to help organizations drive innovation by gaining new and faster insight into their customers. So, what are the business opportunities? And what will they cost?
WIKIPEDIA & THE ECONOMIST
Big data is an all-encompassing term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data processing applications.
The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, prevent diseases, combat crime and so on.”
Big Data Summary
Chapter 1 – Now
The author Viktor Mayer-Schoenberger introduces the reader to the world of big data by highlighting major world problems that were tackled, and valuable companies and acquisitions that arose, through this kind of data analysis combined with people’s curiosity.
The first example concerns the spread of diseases around the world and how slowly governments track them and contain them with appropriate measures. In the H1N1 outbreak discussed, the only way to restrain the disease was to identify which regions it had reached. Unfortunately, the spread could only be determined with a delay of two weeks.
Subsequently, Google released a white paper describing how the company could predict the presence of the disease by monitoring keyword searches in each region. To improve the accuracy of the algorithm, it tested millions of mathematical models against previous flu outbreaks and their spread through the country. Others had tried this, but only Google had enough data and the technological muscle to achieve it.
Another example from the first chapter tells how Farecast helped its users save billions of USD on flight tickets just by showing them price trends and the likelihood that a price would increase or decrease over time. Again, all of this came from analyzing a huge set of data about the airlines and their historical behavior.
Based on these examples, one definition proposed in the book can be better understood: “Big data refers to things one can do at a large scale that cannot be done at a smaller one”. To understand what drives big data, it should be noted that technology is not the only factor; human creativity in seeking interconnections across data sets to find new information plays a role as well.
Improvement in data processing tools alone doesn’t offer a radical enhancement to business processes. The new data manipulation capabilities should therefore be used to combine data in ways that create information at levels that weren’t foreseen. An example of the growing capacity to gather data comes from astronomy, where new telescopes capture more information in weeks than had been learned in the history of humankind.
Although big data is often considered a branch of artificial intelligence, it is better understood as a prediction tool: given a known context with enough information to simulate scenarios, it focuses on finding out which events are likely to happen next. Furthermore, big data doesn’t seek to explain why the data behaves in a specific manner; it seeks to find what is happening and, possibly, how it will continue.
Big data handles great amounts of information, some of which may be of low quality and lead to inexact results. The better the quality of the data fed into the database, the better the resulting outputs.
Chapter 2 – MORE
This chapter mostly contrasts the traditional approach of sampling data to generalize conclusions about a group of events or entities with the holistic view gained by gathering information from the whole group (n = all) rather than a defined sample.
A good example is a poll from Barack Obama’s campaign that asked about voters’ preferences for the election. The idea was to survey people on their cellphones, which sounds like a reliable approach: call a random sample of people and ask for their opinions. Nevertheless, the sample was biased towards people who have cellphones. Although the majority of US citizens might be assumed to have one, the survey reached only the segment that fits the target market for cellphones, mostly young people who embrace change, and left out the rest of society.
The chapter summarizes three major shifts in mindset that come with employing big data.
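The bias described above can be illustrated with a small, entirely hypothetical population. The group sizes and support rates below are invented for illustration and are not taken from the book; the sketch only shows how surveying a single reachable subgroup moves the estimate away from the value for the whole group (n = all).

```python
# Hypothetical population: (group name, number of people, share in favor).
# All numbers are made up to illustrate coverage bias.
POPULATION = [
    ("cellphone owners", 600, 0.70),
    ("no cellphone", 400, 0.40),
]


def support_rate(groups):
    """Share of people in favor across the given groups."""
    total_people = sum(size for _, size, _ in groups)
    in_favor = sum(size * share for _, size, share in groups)
    return in_favor / total_people


# Surveying only people reachable by cellphone ignores the second group.
cellphone_only = support_rate(POPULATION[:1])
# Using the whole population (n = all) includes everyone.
whole_population = support_rate(POPULATION)

print(f"cellphone-only estimate: {cellphone_only:.2f}")
print(f"whole population:        {whole_population:.2f}")
```

With these invented numbers the cellphone-only survey overstates support, because the group it cannot reach holds a different opinion.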
- “The ability to analyze vast amounts of data about a topic rather than be forced to settle for smaller set”.
- “Willingness to embrace data’s real-world messiness rather than privilege exactitude”.
- “Growing respect for correlations rather than a continuing quest for elusive causality”.
Reviewing a smaller sample of a whole data set is a shortcut that was used in the past because reviewing the complete information was not cost-efficient. However, shortcuts trade perspective for resources, and within each sample some trends (subgroups) can bias the results and go unnoticed in what was thought to be a randomized sample.
As in fraud detection, the goal of the analysis is often to find anomalies rather than normal behavior, in order to discover new information rather than only the background where the highest concentration of data lies.
Samples normally answer the question “how many”, whereas big data solves the puzzle for “how many”, “where”, “how”, “what” and “who”.
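As a minimal sketch of looking for anomalies rather than the typical case, the snippet below flags values that lie far from the mean of a data set. The function name `find_anomalies`, the threshold of two standard deviations and the transaction amounts are all assumptions made for illustration; real fraud detection relies on far richer models.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]


# Hypothetical transaction amounts; one is far outside the usual range.
transactions = [20, 25, 22, 19, 24, 21, 23, 950]
print(find_anomalies(transactions))
```

The bulk of the values forms the background; the interesting information is the single value that does not fit it.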
Chapter 3 – MESSY
The third chapter explains why the imperfect data that comes with gathering large amounts of records is not a problem compared with the benefits it brings to the overall analysis. In principle, curated data works perfectly well for traditional algorithms or data processors that aim for precision and manage small sets of records. However, curating the data properly for accurate results is time-consuming, and sometimes by the time all the data has been collected, curated and presented, the individual values in the data sets have become obsolete.
In contrast, feeding larger sets of data into an analysis opens the door to inexactitude, but at the same time makes it possible to find inter-dependencies that were not visible with just a sample of the data. These inter-dependencies will come with inconsistencies or missing values, as is the nature of handling “raw” data. Nevertheless, the sheer size of the data set counteracts the messiness of the information and provides “good enough” values. As stated in one example, 2 + 2 might not always be 4 but 3.9, and that is acceptable for many scenarios.
Naturally, this reasoning offers a different way to address time-consuming analyses where an approximate result is sufficient. Nonetheless, for certain tasks, such as checking bank transactions or calculating vectors to launch a spaceship, an approximation is not enough: the result must be exact and up-to-date.
A significant example of radical improvement through shifting from reduced data sets to large ones involves two products: Microsoft Word spelling correction and Google Translate. There were two paths to solve the problem: refine the algorithm, or feed it more data so that it can learn from more cases. Microsoft tried the second approach and obtained incremental improvements as they increased the database from 10K to 100K to 1M and eventually to 1B word combinations.
Google tried the same approach but with one trillion records. However, instead of improving typo correction, they addressed a bigger and more complex problem: translation. Instead of trying to teach a computer to translate one word into another while considering grammar rules, they fed the system with a myriad of translations found in official government documents, translated webpages and books offered in different languages, all with the purpose of finding the possible combinations and how commonly each translation is used for any language, and even for many different languages at the same time.
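The idea that volume can counteract messiness can be sketched by averaging many imprecise readings of the same quantity. The noise level, the fixed seed and the function name `noisy_average` are assumptions for illustration, and the sketch only holds when the errors are random and unbiased; a systematic error would not average out.

```python
import random

TRUE_VALUE = 4.0  # the "exact" answer, e.g. 2 + 2


def noisy_average(n, noise=0.5, seed=0):
    """Average n imprecise readings of TRUE_VALUE (Gaussian noise, fixed seed)."""
    rng = random.Random(seed)
    readings = [TRUE_VALUE + rng.gauss(0, noise) for _ in range(n)]
    return sum(readings) / n


# More readings tend to bring the average closer to the true value.
for n in (10, 1_000, 100_000):
    print(n, round(noisy_average(n), 3))
```

Each individual reading may be off by a noticeable amount, yet the average over a large number of readings is typically "good enough" for many purposes.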
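A heavily simplified sketch of the data-driven idea: rather than encoding grammar rules, count how often each translation was observed for a phrase and pick the most common one. The phrase pairs and the helper `most_common_translation` are hypothetical, and this is not a description of how Google Translate is actually implemented.

```python
from collections import Counter, defaultdict

# Hypothetical aligned phrase pairs (source phrase, observed translation),
# standing in for pairs mined from already-translated documents.
observed_pairs = [
    ("guten morgen", "good morning"),
    ("guten morgen", "good morning"),
    ("guten morgen", "good day"),
    ("danke", "thank you"),
    ("danke", "thanks"),
    ("danke", "thank you"),
]

# Count how often each translation was seen for each source phrase.
translation_counts = defaultdict(Counter)
for source, target in observed_pairs:
    translation_counts[source][target] += 1


def most_common_translation(phrase):
    """Return the most frequently observed translation, or None if unseen."""
    counts = translation_counts.get(phrase)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(most_common_translation("guten morgen"))
print(most_common_translation("danke"))
```

With more observed pairs the counts become more reliable, which is the sense in which more data can improve the result without changing the algorithm.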
Chapter 4 – CORRELATION
Big data offers knowledge about what is happening based on the interactions of at least two variables. These interactions are measured by their correlation: if a change in variable A is strongly associated with a change in variable B, the correlation between them is high.
As stated in the prior chapters, correlations explain the WHAT instead of the WHY. A correlation won’t explain why certain records are prone to fulfill a given condition, but it will offer a degree of probability that the given condition is likely to happen.
Correlations are identified more reliably when the amount of data is greater. One example is Walmart, which started digging into its purchase history to identify which articles were purchased in the same cart, when they were purchased and even what the weather was like at the time. This way, Walmart struck gold: it found that before storms or hurricanes customers tended to buy Pop-Tarts (sweet toaster pastries) together with flashlights while preparing for the storm. The company placed the Pop-Tarts closer to the flashlights and increased its sales.
Previously, and still in some areas, validating a theory required proposing a hypothesis, selecting the variables involved, and then collecting samples of information to test it. If the evidence didn’t support the hypothesis, it had to be reformulated or the theory discarded completely. Nowadays, with enough data, pattern recognition algorithms can identify the correlations without the need to propose a hypothesis first. In this view, theories are no longer needed because facts are found instead.
Furthermore, a hypothesis tested on a small sample could find linear relationships, for example that increasing factor A also increases factor B, but the small amount of data meant that a curved relationship couldn’t be found. An example of a curved relationship is the one between income and happiness: it was held that an increase in income directly increases happiness, but once a certain level of income has been reached, happiness doesn’t increase any further.
The most significant example of this chapter describes how the vital signs of premature babies are monitored: as the babies’ bodies react and transmit information to the sensors, some diseases can be foreseen and doctors can prepare for them.
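A common way to quantify such a relationship is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand; the weekly figures are invented for illustration, and the function assumes both sequences have the same length and are not constant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly figures, invented for illustration.
searches = [120, 150, 200, 260, 310]   # flu-related searches
cases = [11, 16, 19, 27, 30]           # reported flu cases
print(round(pearson(searches, cases), 3))
```

A value close to 1 says that the two series move together; it says nothing about why they do.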
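Finding items that tend to share a cart can be sketched by counting item pairs across purchases. The carts below are hypothetical and much smaller than any real purchase history; the sketch only shows the counting step, not Walmart’s actual analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping carts, invented for illustration.
carts = [
    {"flashlight", "pop-tarts", "batteries"},
    {"pop-tarts", "flashlight"},
    {"milk", "bread"},
    {"flashlight", "batteries", "pop-tarts"},
    {"bread", "pop-tarts"},
]

# Count every unordered pair of items that shares a cart.
pair_counts = Counter()
for cart in carts:
    # Sorting makes ("a", "b") and ("b", "a") count as the same pair.
    for pair in combinations(sorted(cart), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Pairs with unusually high counts are candidates for being placed near each other in the store.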
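The difference between a linear and a curved relationship can be made visible by comparing the trend in different ranges of the data. The income and happiness values below are invented for illustration; the sketch fits a least-squares slope separately to the low and high income ranges and to the full range.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical data: happiness rises with income, then levels off.
incomes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]   # arbitrary units
happiness = [2, 4, 6, 8, 8, 8, 8, 8, 8, 8]            # arbitrary score

low = slope(incomes[:4], happiness[:4])    # incomes 10 to 40
high = slope(incomes[4:], happiness[4:])   # incomes 50 to 100
overall = slope(incomes, happiness)

print(f"low incomes: {low:.2f}, high incomes: {high:.2f}, overall: {overall:.3f}")
```

A single straight line over the whole range gives a moderate positive slope and hides the plateau; a sample drawn only from low incomes would suggest that happiness keeps rising.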