
Definition for Big Data

“In short, the term Big Data applies to information that can't be processed or analyzed using traditional processes or tools.” [1]

The core problem that Big Data solutions try to address is the sheer amount of unstructured data, which in itself is not very useful. Only by applying different kinds of methods can we make something out of it. Only now that computing power has reached a certain level is it possible to even think about processing all the information that passes through our information systems daily. The following may be one of the most important factors in this: “Here’s the big truth about big data in traditional databases: it’s easier to get the data in than out. Most DBMSs are designed for efficient transaction processing: adding, updating, searching for, and retrieving small amounts of information in a large database. [...] The trouble comes when we want to take that accumulated data, collected over months or years, and learn something from it—and naturally we want the answer in seconds or minutes!” [2]
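
The contrast in the quote above, between retrieving a small amount of information and learning something from all the accumulated data, can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module. The `orders` table, its columns, and the generated rows are hypothetical and exist only for this illustration, and a small in-memory database is of course not Big Data. The sketch only shows that the two kinds of query are executed very differently.

```python
import sqlite3

# Hypothetical schema and data, used only to illustrate the two access patterns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [(f"customer{i % 100}", float(i % 50)) for i in range(10_000)],
)
conn.commit()


def plan(sql):
    """Return SQLite's query plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row holds the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


# Transactional access: fetch one row by its primary key.
point_lookup = "SELECT amount FROM orders WHERE id = 4242"
# Analytical access: summarize every row in the table.
aggregate = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"

print(plan(point_lookup))
print(plan(aggregate))
```

The plan for the point lookup should report a search using the primary key, while the plan for the aggregate should report a scan of the whole table. The exact wording varies between SQLite versions. The cost of the first query stays roughly constant as the table grows, whereas the cost of the second grows with the amount of accumulated data.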

Data warehousing might be easy, but making use of that data can be extremely hard. We also cannot store all the data that we generate, which poses yet another problem for Big Data applications. “As the amount of data available to the enterprise is on the rise, the percent of data it can process, understand, and analyze is on the decline, thereby creating the blind zone” [1]. How do we cherry-pick the pieces of data that really matter? This leads us to the three notable characteristics of Big Data, as defined by IBM: volume, variety, and velocity. Condensed into one sentence, “the three V's” mean that data arrives in batches or streams (velocity) as a varying mix of structured and unstructured information (variety), adding up to zettabytes of data (volume).

Ultimately, Big Data solutions can provide us with accurate data-driven recommender systems, very fast bidding systems with high throughput, and much more. In some sense we can think of Big Data as a continuation of earlier business adoption of internet technologies. Only now the amount of data has outgrown what the current technologies can handle. [3]

References:

  1. Dirk deRoos, Chris Eaton, George Lapis, Paul Zikopoulos, Tom Deutsch, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill, 2012, [Available at: http://public.dhe.ibm.com/common/ssi/ecm/im/en/iml14296usen/IML14296USEN.PDF]
  2. Adam Jacobs, The Pathologies of Big Data, Queue, v.7 n.6, July 2009, [Available at: http://dl.acm.org/citation.cfm?id=1563874]
  3. Provost F, Fawcett T, Data science and its relationship to big data and data-driven decision making, Big Data, March 2013, 1(1): 51–59, [Available at: http://online.liebertpub.com/doi/pdf/10.1089/big.2013.1508]

Ethics in Big Data

Ethics in Big Data is quite a difficult subject, and it differs from the case of Open Data. Open Data is in itself a kind of driving force for ethical conduct in a corporation or government: when the data is freely available, it is easy and almost mandatory to work ethically, in a fair and legal way. With Big Data that is not open, there is no such incentive. Only the fear of legal consequences or of information leaking can change the mindset when it comes to handling sensitive data, such as Big Data containing personal information about individuals.

However, ethics is not really that black and white. For example, consider what Google does with its Android OS. Android has a feature that saves information to the cloud as a backup, which also makes switching to a new device simple. Here a customer provides much of their personal information in exchange for a great service that makes life much easier. Taken out of context, anyone could conclude that it is plain wrong to collect such information, but when both parties benefit, it suddenly becomes acceptable.

Exam Qs

  1. Compare Big Data and Open Data. Give examples where they co-exist and where it is reasonable to categorize something under only one of these terms.
    • Emphasises practical knowledge and applying it through examples, and forces the answerer to consider the similarities and differences between the two topics.
  2. You are the newly appointed head of the IT department in a publicly traded company in the public transportation business. The corporation has access to large amounts of varying data. There is momentum behind releasing passenger statistics, cafeteria sales data, and operational activities as open data. In a monthly executive meeting you are asked to present the risks and benefits of opening this data to the public. Explain why opening this data is beneficial to the company and what risks it poses.
    • This question tries to reflect the importance of transparency in business activities as dealt with in the book, while still forcing the respondent to be critical about the possible risks, for example to privacy. An answer addressing potential indirect benefits to the company is also sought.
  3. Explain the meaning of causality and hidden correlations in the context of Big Data analysis.
    • Big Data is not just technology but a way of thinking. The book also gave good examples of this.
  4. Big Data can be defined with the three V's: Velocity, Variety, and Volume. Explain what they mean, why each of them is problematic, and how each is connected to Big Data.
    • A good starter. This definition was brought up multiple times, and it is reasonable to assume that a student participating in this course knows it.

More

Chapter 12 presentation ch12.pdf
Statement of accomplishment vaananen_coursera_criticalthinking_2015.pdf