====== Group's questionnaire answered ======

{{:courses:ct60a7000:spring2015:1:killerappsquestionnaire.pdf|}}

====== Course questions ======

**1. Why is Big Data a tool that is here to stay rather than one that will be replaced by others?**

//This question is important because it highlights the trends and technologies behind Big Data; a person who can relate the trends, the technology, and the future of Big Data shows a good understanding of the topic.//

**2. As an internet user, under what terms would you agree to have your data re-used by companies?**

//This question is relevant because it encourages independent thinking about a gap that, in my eyes, the author of the book did not address properly: he takes users' acceptance of data re-use for granted.//

**3. Knowing the downsides of entrepreneurship and the odds against it, what reasons would convince you to become an entrepreneur?**

//If students can justify their answer with sound facts rather than emotional reasons, they show a deep understanding of the upsides and downsides of entrepreneurship.//

**4. Are there benefits to working for an SME compared with a large enterprise?**

//The book discusses two possible paths, large corporations or startups, but there is still a wide range of companies that fit neither description: small and medium enterprises, which in a developing economy normally account for between 40% and 60% of GDP.//


====== Coursera Critical Thinking Statement of Accomplishment ======

{{:courses:ct60a7000:spring2015:1:batani_certificate_-_coursera_criticalthinking_2015.pdf|}}


====== Homework ======


**Ethics in/of Big Data**

Big Data offers a wide variety of possibilities, but in most cases it is based on customers' data, which can be of a personal or public nature. The risk that Big Data represents is the disclosure of this private data to unauthorized public or private parties, since the use these third-party institutions could make of the data compromises the privacy of the customers.

One of the biggest threats with Big Data is that the barrier of anonymity can be broken, as discussed with several examples in the book. I think the role of ethics in Big Data consists in using customers' information only for purposes that cannot harm them. However, this is a complicated parameter, since the perceived risk of a given use may vary from company to company, or even from data analyst to data analyst. After the discussion raised in class, I consider that the best practice for ethics in Big Data would be to establish guidelines for the use and re-use of data depending on its nature: the more personal or revealing the data is, the more restricted the possibilities for using it should be. Nevertheless, relativity is the problem again, since these guidelines would have to be enforced by governments, and each government may give this matter more or less priority.

In conclusion, I think the level of ethics that companies or governmental institutions apply to data will depend on how people respond to disclosures of their data as they happen.


**My Own Definition of Big Data**

**Chapter 2 – MORE**


This chapter mostly reviews the difference between the traditional approach, sampling data in order to generalize conclusions about the behavior of a group of events or entities, and the holistic view that gathering information from the whole group (n = all) rather than from a defined sample can provide.

A good example of the above was polling for a Barack Obama election campaign, which asked people about their preferences for the elections. The approach was to survey people on their cellphones, and while this sounds reliable (call a random sample of people and ask for their opinions), there was a bias towards people who own cellphones. Although it can be assumed that the majority of US citizens have them, the sample only covers the segment that fits the target market of cellphones, mostly young people who embrace change, leaving out the rest of society.

The chapter summarizes three major mindset shifts required to employ Big Data:

  - “The ability to analyze vast amounts of data about a topic rather than be forced to settle for smaller sets.”
  - “Willingness to embrace data’s real-world messiness rather than privilege exactitude.”
  - “Growing respect for correlations rather than a continuing quest for elusive causality.”

Reviewing smaller samples of a whole data set is a shortcut that used to be necessary because there was no capacity to review the complete information cost-efficiently. However, this shortcut trades perspective for resources, and within each sample some trends (subgroups) can bias the results in ways that go unseen in what was thought to be a randomized sample, as the sketch below illustrates.
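
As a hedged illustration (my own toy simulation, not from the book), the following sketch shows how a biased sampling frame skews an estimate even when the draw within the frame is perfectly random; the population split and support rates are invented numbers.

<code python>
import random

random.seed(42)

# Hypothetical population: 60% cellphone owners, 40% without.
# Support rates per segment are invented for illustration.
population = (
    [{"cell": True,  "supports": random.random() < 0.55} for _ in range(60_000)]
    + [{"cell": False, "supports": random.random() < 0.35} for _ in range(40_000)]
)

true_rate = sum(p["supports"] for p in population) / len(population)

# A "random" poll that can only reach cellphone owners: the frame is biased.
frame = [p for p in population if p["cell"]]
poll = random.sample(frame, 1_000)
polled_rate = sum(p["supports"] for p in poll) / len(poll)

print(f"true support:   {true_rate:.3f}")    # around 0.47
print(f"polled support: {polled_rate:.3f}")  # around 0.55, biased upward
</code>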

As in fraud detection, the goal of the analysis is to find anomalies rather than normal behavior, in order to discover new information instead of only the background where the highest concentration of data lies.

Samples normally answer the question “how many”, whereas Big Data also solves the puzzle of “where”, “how”, “what”, and “who”.


**Chapter 3 – MESSY**

The third chapter explains why the imperfect data that comes from gathering large numbers of records is not a problem compared with the benefits it brings to the overall analysis. In principle, curated data works perfectly well for traditional algorithms or data processors that aim for precision and manage reduced sets of records. However, curating the data properly enough for accurate results is time-consuming, and sometimes by the time all the data has been collected, curated, and presented, the individual values in the data sets have become obsolete.

On the contrary, by managing larger sets of data to feed an analysis we open the door to inexactitude, but at the same time it becomes possible to find interdependencies that were not visible with just a sample of the data. These interdependencies will come with inconsistencies or missing values, as is the nature of handling “raw” data. Nevertheless, the sheer volume of a large data set counteracts the messiness of the information and provides “good enough” values; as one example in the book puts it, 2 + 2 might not always be 4 but 3.9, and that is acceptable in many scenarios.
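
A hedged sketch of why volume counteracts messiness (my own invented numbers, not the book's): individually noisy readings average out as the sample grows, so a “good enough” 3.9 from a handful of messy records becomes nearly exact with enough of them.

<code python>
import random

random.seed(7)
TRUE_VALUE = 4.0  # the quantity we are trying to measure

def noisy_reading():
    """One messy measurement: unbiased on average, but individually off."""
    return TRUE_VALUE + random.gauss(0, 0.5)

# A small curated-size sample vs. a large messy one.
small = [noisy_reading() for _ in range(5)]
large = [noisy_reading() for _ in range(100_000)]

print(f"small sample mean: {sum(small) / len(small):.2f}")   # e.g. 3.9-ish
print(f"large sample mean: {sum(large) / len(large):.3f}")   # very close to 4.000
</code>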

Naturally, this reasoning offers a different way to address time-consuming analyses where an approximate result is sufficient. Nonetheless, for certain queries, such as checking bank transactions or calculating the vectors to launch a spacecraft, an approximation is not enough: the result must be exact and up-to-date.

A significant example of radical improvement achieved by shifting from reduced data sets to large ones is the improvement of both Microsoft Word's spelling correction and Google Translate. The first case presented a fork in the road: refine the algorithm, or feed in more data so the algorithm can learn from more cases. Microsoft tried the second approach and obtained incremental gains as it grew the corpus from 10K to 100K to 1M and eventually to 1B word combinations.
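
As a hedged sketch of the “more data beats a cleverer algorithm” idea (a classic corpus-frequency approach, not Microsoft's actual system; the corpus here is an invented toy), a spelling corrector can simply pick the known word one edit away from the typo, ranked by how often it appears in the corpus. Accuracy then improves by enlarging the corpus, not the algorithm.

<code python>
from collections import Counter

# Toy corpus; in the real experiments the corpus grew from 10K to 1B items
# while the algorithms stayed fixed, and accuracy rose with corpus size.
corpus = "the cat sat on the mat the cat ate the rat".split()
counts = Counter(corpus)
letters = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away: deletions, swaps, substitutions, insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    subs = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + subs + inserts)

def correct(word):
    """Pick the most frequent corpus word among the candidate edits."""
    candidates = [w for w in edits1(word) | {word} if w in counts] or [word]
    return max(candidates, key=counts.get)

print(correct("teh"))  # -> "the"
</code>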

Google tried the same approach but with one trillion records. However, instead of improving typo correction, it addressed a bigger and more complex problem: translation. Instead of trying to teach a computer to translate one word into another while observing grammar rules, Google fed the system myriad translations found in official governmental documents, translated web pages, and books offered in several languages, all with the purpose of finding the possible combinations and how commonly each translation is used, for any language, and even for many different languages at the same time.

**Chapter 4 – CORRELATION**

Big Data offers knowledge about what is happening through the interaction of at least two variables. These interactions are measured by their correlation: if variable A consistently moves together with variable B, the correlation between them is high.

By identifying the events that are happening, correlation explains the WHAT instead of the WHY, as stated in the prior chapters. A correlation will not explain why certain records are prone to fulfill a given condition, but it will offer a degree of probability that the condition is likely to happen, as the sketch below shows.
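
As a minimal sketch (invented numbers, plain Python, not a method from the book), the Pearson coefficient is the usual way to quantify how strongly two variables move together, on a scale from -1 to 1.

<code python>
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient: covariance normalized to [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example: storm-warning intensity vs. Pop-Tart sales.
warnings = [0, 1, 2, 3, 4, 5]
sales = [10, 14, 19, 25, 33, 40]

print(round(pearson(warnings, sales), 3))  # close to 1: a strong correlation,
# which says nothing about WHY sales rise, only THAT they rise together.
</code>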

The identification of correlations improves as the amount of data grows, as in the example of Walmart: the company started digging into its purchase history to identify which articles were bought in the same cart, when they were bought, and even what the weather was when they were bought. This way Walmart struck gold, finding that before storms or hurricanes customers tended to buy Pop-Tarts (a sugared pastry) together with flashlights while preparing for the storm, so it placed the Pop-Tarts closer to the flashlights and increased sales.

Previously, and still in some fields, validating a theory required proposing a hypothesis and selecting the variables it affects, then collecting samples of information to test it; if the evidence did not support the hypothesis, it had to be reformulated or the theory discarded completely. Nowadays, given enough data, running pattern-recognition algorithms will identify the correlations without the need to propose hypotheses; theories are no longer needed because facts are found instead.

Furthermore, a hypothesis may capture a linear relationship, for example that increasing factor A also increases factor B, while a curved relationship stays hidden because of the small amount of data in the sample. An example of a curved relationship is that between income and happiness: an increase in income directly increases happiness, but once a certain income level has been reached, happiness does not increase any further.

The most significant example in this chapter tells how premature babies are monitored through their vital signs, so that some diseases can be foreseen and doctors can prepare for them as the babies' bodies react and transmit information to the sensors.

**Chapter 5 – DATAFICATION**

The world is full of information, but hardly any of it is available to all internet users in the most common languages. Most of the information in the world is still in printed books that cannot be reached by an online query. Therefore, companies like Google took on the task of scanning books page by page to create virtual images of them (digitization). Based on this, users could view whole books online, but it was still hard to search for specific content without reading the whole book. Therefore Google, with the help of software that converted the images into text, datafied the books and turned them into indexed knowledge.
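
A minimal sketch of what “indexed knowledge” means here (invented titles and snippets, not Google's actual pipeline): once page images are converted to text, an inverted index maps each word to the books containing it, so specific content can be queried without reading whole books.

<code python>
# Tiny inverted index: word -> set of books where it appears.
books = {
    "Moby-Dick": "call me ishmael",
    "Walden": "i went to the woods because i wished to live deliberately",
}

index = {}
for title, text in books.items():
    for word in text.split():
        index.setdefault(word, set()).add(title)

print(index["woods"])  # {'Walden'} -- content found without opening a single book
</code>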

This chapter contains several examples of how nautical charts, Arabic numerals, and even GPS coordinates datafied the world's perception of itself, making it easier to conduct business, refer to particular places in the world, and perform transactions. In other words, the world has been rendered into data. Similarly, companies like Facebook, Twitter, Foursquare, and LinkedIn have been rendering people's behavior and preferences into data, along with their location.

The purpose of gathering so much information and translating it into the computer's language is to analyze it and find patterns. New products and services are arising from the information generated, and more will follow this trend. As the author describes, in the IT world we should stop focusing entirely on the Technology and pay more attention to the Information, which the examples above show to be the relevant part.

When designing a business or any entity that interacts with people, the ability to gather information about people's lives should be considered, converting the world into a massive stream of data that can be analyzed and therefore understood.

Based on the examples provided in this chapter, humankind has been able to profit and to save lives through datafication. Nonetheless, it is still an open question what the best practices are for managing all the information that every cellphone and other devices tell manufacturers about our preferences and habits, since in the wrong hands this information can also harm our security.

**Chapter 6 – VALUE**

This chapter mainly discusses how we can benefit from data as an asset, regardless of the business model built on top of it. A recent discussion started with Facebook's IPO valuation of 104B USD, while the company reported only about 6B USD in physical assets; the book value did not include all the strategic information gathered over the years, which accounts for the roughly 100B USD difference.

Data can be gathered by one entity but used by many different ones, so there should be profitable paths to monetizing the information collected. One way to achieve this is open data, which takes information possessed mainly by governments and opens it to the public for use, since private companies and individuals can presumably put it to more innovative use than the government itself, whose task is to guard it.

Another way to profit from data is to re-use it: combining different sets of data for different purposes can create new insights and new knowledge. Therefore, the information that big companies have stored could be given a second purpose and mixed with other sources of information to create new insights.

Nowadays, data is cheaper to store and easier to gather than ever before, so the opportunity to use it and profit from it is closer than ever. The value of data can increase, and governments have even considered reforming the legislation on how companies are valued by their assets in order to include their data as well.

A clear example provided in the chapter is Google gathering information from many sources simultaneously with its Street View cars: recording GPS locations, taking pictures, improving its maps, mapping locations against local WiFi networks, and possibly even capturing data from open WiFi networks. Although collecting all this was a great expense, adding new streams of data to collect is relatively cheap once the major expense has been made.

Lastly, the opportunity to use data has a price too, since data loses value as it loses recency. Data can expire and become less valuable unless it can be re-used by other entities.

====== Presentation about Big Data Book ======

The chapter that I presented about Big Data was #9, titled "Control".

The presentation can be found here:
{{:courses:ct60a7000:spring2015:1:big_data_control_chapter_9.pptx|}}

Additional information about information security can be found in the following links:

https://www.youtube.com/watch?v=fZqjSBw1JT0
-> Suggestions for data privacy.

https://www.youtube.com/watch?v=RUBzvatQwL8
-> Company that protects data.

https://aboutthedata.com/
-> Platform to verify which personal information is already public (US only).
  
**Questions about the chapter**

How would you behave knowing you are under surveillance 24/7? Would you accept it merely because you know who holds your information?