Homework 1: Defining Big Data

Articles related to Big Data found:

Casado, R., Younas, M., 2014. Emerging trends and technologies in big data processing. Concurrency Computat.: Pract. Exper.

Chen, M., Mao, S., Liu, Y., 2014. Big Data: A Survey. Mobile Netw Appl 19, 171–209.

Crawford, K., 2012. Critical questions for big data. Information, Communication & Society.

Definition

There seems to be a rough consensus on what is meant by Big data (note the capitalized Big, which emphasizes that the term means more than just a “big amount of any data”). Compared with ordinary data held in traditional databases, e.g. relational databases, Big data has characteristics that make it difficult to process with traditional database tools. These include unstructured format, missing data types, rapid change and sheer volume, which can reach thousands of terabytes. Traditional tools cannot link rows and columns, because these do not necessarily exist, and even when they do exist, such tools are too slow to process thousands of terabytes in a few seconds. The emergence of Big data is tied to the rapidly growing amount of data collected from various sources, e.g. social networks, modern cars, weather data and website user logs. This is one definition, based on comparing Big data with “traditional” data.

However, I'd add another point of view, in which the definition of Big data is based on marketing visions: Big data is a kind of silver bullet, where automatic processing of data reveals how to develop and target marketing, and nowadays using Big data (plus data mining) is essential for competing in the market. Big data has the potential to give us insights that are hard for the human eye to see in unstructured data sets. Big data is a new tool for discovery and innovation.

In response to this hype, Crawford (2012) raises some criticisms of Big data. Although Big data can help in targeting products and finding customers, the growing amount of collected data can make it possible to identify users even without their names. Governments can use collected data to locate dissidents, to mention just one example. The algorithms behind the analysis can also lead us to wrong conclusions or miss an important piece of information: the results of algorithmic analysis can only be as good as the algorithm.

Summarizing ideas from the Big Data book

Big data has the potential to change how we think, how we measure and what we measure. The shift in thinking and in the measurement mindset is described in the picture.

Big data challenges our current way of thinking:

  • Shift from causation to correlation. The correlation within the data is more important than knowing the reason. E.g. data analysis can show that used orange cars are in better condition than cars of other colors. Knowing why can be interesting, but it is not necessary. Big data answers the question what rather than why.
  • When the data is “all” there is, we can accept measurement errors. In massive data sets the errors will average out.
  • In science, big data has the potential to change research methods. The traditional way of doing research is to take small random samples and to analyze and deduce from this limited amount of information, so the exactness of measurements and results is important. With “all” the data as the data set, there is no longer a need for that. In addition, as more and more data is in datafied and digital form, there is less need to go into the field to collect data.
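
The claim above that errors average out in massive data sets can be checked with a small simulation. This is only a sketch under stated assumptions: the errors are random with zero mean (Gaussian noise here), and the true value 20.0, the noise level 2.0 and the function name mean_abs_error are made up for illustration. A systematic bias, such as a sensor that always reads too high, would not average out this way.

```python
import random
import statistics


def mean_abs_error(n_samples, true_value=20.0, noise_sd=2.0, seed=0):
    """Return how far the average of n noisy readings is from the true value.

    All parameters are illustrative assumptions, not values from the book.
    """
    rng = random.Random(seed)
    # Each reading is the true value plus a random, zero-mean error.
    readings = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n_samples)]
    return abs(statistics.fmean(readings) - true_value)


if __name__ == "__main__":
    # The error of the average tends to shrink as the data set grows.
    for n in (10, 1_000, 100_000):
        print(n, round(mean_abs_error(n), 4))
```

With 100,000 readings the average typically lands within a few hundredths of the true value, even though each single reading is off by about 2 on average.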

How can big data change our world? Take the invention of movies as an example. A single picture can be considered small data, or a unit of data. When more and more pictures are taken and combined, something remarkable happens: the pictures come alive. The more data units (pictures), the more realistic the impression. When the machine for recording moving pictures (the camera) was invented by Lumière, few people saw its value. Yet now it's hard to imagine spending a day without the wonders of moving pictures. Big data can bring similarly groundbreaking inventions and findings as large amounts of data are combined and analyzed - things that we can't understand and appreciate yet. The book contains many such examples, so it's reasonable to believe that more and more will appear.

Homework: Big Data Ethics

General perspective

There aren’t yet clear rules on how many big-data-related issues, such as privacy, will be monitored and handled by law. This in itself is alarming. Do we just have to trust that big data analysts, programmers and companies with access to big data have “the right mindset” and ethics, and will not disclose our private data to anyone with enough dollars to buy it from them? Yes.

What is ethically correct is one of the oldest questions humankind has pondered. A few hundred years ago the enslavement of black people was socially acceptable. In some cultures eating meat was, and still is, considered bad, whereas modern Western society usually enjoys meat without any hard feelings. Looking back, the enslavement of black people seems morally wrong. In the future it’s possible that our meat-eating habits will be considered wrong as well. In summary, ethics cannot say exactly what is right and wrong. Ethics depends on the environment, and also on education and religion.

As for morals, they are more an individual's view of what is right and wrong.

However, big data as such has the attribute that the data is widespread and knows no boundaries between countries or cultures. It spreads over a large landscape of different ethics, morals and even laws. This adds new complexity to big data privacy issues, even if we trust that companies within our own countries won't abuse our personal data. So when South Korean personal data gets into the hands of North Korean data miners, they likely consider it morally and ethically acceptable to exploit the South Korean data as much as possible. All in all, data about other countries’ people is considered less sacred and falls into a gray zone. What is more, the privacy legislation of the country you live in (where such legislation exists) does not apply to foreign nationals, so you can more easily satisfy your naughty curiosity with foreign people’s data.

Categorizing ethics by risks in the Big data book

The risks mentioned in the Big Data book include some aspects that can be considered from an ethical point of view.

  • Dictatorship of data?
    • I’ve noticed that Farecast and the other service created by Etzioni don’t exist anymore. Farecast was bought by Microsoft and the other by eBay. Farecast was for a while part of Bing search, but there is no such functionality anymore. This raises questions. Microsoft has bought the license and patent of Farecast, so why doesn’t it use its abilities anymore? Though it’s possible that Farecast is embedded somewhere we can’t readily see, a possible scenario comes to mind: is it possible for governments to buy patents and then stop using them, denying people useful information?
  • Privacy issues
    • The secondary use of big data will cause problems. Even when a user cannot be identified through the primary use of the data, combining this primary data with other data sets may pinpoint an individual precisely. It seems unavoidable that big data miners will discover individuals in big data sets. There should be some oversight, but as long as there is none, data mining companies are simply expected to act in good faith and not be too greedy in selling data for money.
    • One quite recent blockbuster in the news was Edward Snowden’s revelation that a US intelligence agency monitors almost everything, even private emails in foreign countries. US intelligence already knows everything about you, and likely so do other countries' intelligence agencies. On the other hand, Snowden’s case reveals that some people have access to a library of people’s personal lives from all around the world. Snowden might have sold this data on rather than revealing it to the world. Good morals won in this case, but had the choice gone the other way, maybe we wouldn't have heard anything about Snowden and the related privacy issues. Think about it. When the world seems silent, that might mean that your private data is being abused and sold in every possible way. Whistleblowers are rare, after all.
  • Penalties through propensities:
    • If big data analysis indicates that someone is likely to commit a terrorist attack that would kill many people, then action should be taken. However, to put people in jail there should be evidence. I don’t think it’s right to jail people just because data mining has said so; he or she might still not have committed the crime. However, I think there should be some new law or police unit that handles these “likely” criminals. There has been talk in the Finnish media about drug use and how the police have invited drug users to discussion sessions. This seems to be a quite efficient way to help people get out of the spiral of drug use, at least in the short term. A similar approach could be considered in future cases of “penalties” through propensities. The term should be “discussions” through propensities.
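
The secondary-use risk described under privacy issues above, where combining a primary data set with other data pinpoints an individual, can be sketched in a few lines. Everything here is hypothetical: the records, the field names (zip, birth_year, gender) and the function reidentify are invented for illustration and do not come from the book or from any real data set.

```python
# Hypothetical "anonymized" primary data: no names, only general attributes.
purchases = [
    {"zip": "00100", "birth_year": 1980, "gender": "F", "item": "medication A"},
    {"zip": "00100", "birth_year": 1975, "gender": "M", "item": "book"},
    {"zip": "33100", "birth_year": 1980, "gender": "F", "item": "shoes"},
]

# Hypothetical second data set that does contain names.
directory = [
    {"name": "Person X", "zip": "00100", "birth_year": 1980, "gender": "F"},
    {"name": "Person Y", "zip": "33100", "birth_year": 1980, "gender": "F"},
    {"name": "Person Z", "zip": "33100", "birth_year": 1980, "gender": "F"},
]

KEYS = ("zip", "birth_year", "gender")


def reidentify(anonymous_rows, named_rows, keys=KEYS):
    """Return (name, item) pairs where the shared attributes match exactly one person."""
    # Group the names by their combination of shared attributes.
    index = {}
    for person in named_rows:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = []
    for row in anonymous_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique combination pinpoints one individual
            matches.append((names[0], row["item"]))
    return matches


if __name__ == "__main__":
    print(reidentify(purchases, directory))
```

In this toy data only the first purchase is linked to a name, because its attribute combination is unique in the directory. The third purchase matches two people and the second matches nobody, so neither is identified.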


A book “Big data ethics” (Mateosian, 2013) defines elements that can be considered a framework for big data ethics:

  • “Identity: What is the relationship between our offline identity and online identity?
  • Privacy: Who should control access to data?
  • Ownership: Who owns data, can rights be transferred, and what are the obligations of people who generate and use that data?
  • Reputation: How can we determine what data is trustworthy? Whether about ourselves, others, or anything else, big data exponentially increases the amount of information and ways we interact with it. This phenomenon increases the complexity of managing how we are perceived and judged.”

Mateosian, R. (2013). Ethics of big data. IEEE Micro, 33(2), 60-61.

Coursera Critical Thinking Statement of Accomplishment

Assessment of the presentations

Exam Questions