Everybody Lies - Book Review
1 November 2019
Dr. Moria Levy
"Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are" is a 2017 book by Seth Stephens-Davidowitz, a data scientist specializing in Big Data who worked at Google and has actively advocated for innovative data extraction methods. Through an array of anecdotes and extensive research, Stephens-Davidowitz illuminates the core of Big Data and the methodologies for extracting valuable insights from it.
The book delves into the following topics:
1. Grasping the Essence of Big Data
2. Sources of Information
3. Unleashing the Potential of Big Data
4. Applications and Utilizations
5. Methodologies for Research
6. Limitations and Restraints
The book delivers an engaging and easily understandable reading experience. The research is comprehensive and varied, encompassing diverse facets of life. The knowledge derived from these studies and the techniques employed for managing Big Data proves captivating and enlightening. In summary, this book is highly recommended.
Grasping the Essence of Big Data
Big Data is characterized by its immense content volume, demanding inventive and fresh approaches for its analysis and comprehension. It encompasses a diverse range of information types, including:
1. Quantitative data/numerical values, derived from an extensive array of distributed sensors.
2. Textual content, both online and offline, an arena where Machine Learning (ML) is pivotal and which is central to knowledge management.
3. User clicks
4. Audio files
5. Typographical errors (typos)
This listing is incomplete, as Big Data transcends conventional boundaries. While we commonly associate Big Data with the Internet, which houses an inexhaustible repository of data and textual information, it also encompasses substantial caches of documents and data within organizations. The revolutionary potential of the Internet as an information source is indisputable, yet the domain of Big Data is not restricted solely to online realms.
The ubiquity of data is palpable in our environment. On an average day, an astounding 2.5 quintillion bytes of content are generated by individuals. The ascendancy of Big Data is propelled not only by the ceaseless influx of new information but also by the digitization and computerization of historical texts and other data. This is witnessed in extensive book scanning initiatives, which serve as invaluable knowledge repositories. This treasure trove of information can provide insights not solely into events but also into human emotions and perspectives.
The discipline of Data Science is dedicated to deciphering such content. It involves discerning patterns and elucidating relationships between variables. The significance of engaging with Big Data is multifaceted: it bridges gaps left by other analytical tools due to limited sample sizes; it mitigates biases that may lead to an undue emphasis on preexisting knowledge; and it compensates for situations where alternative data samples lack reliability, accuracy, or profound insights.
Sources of Information
This is the juncture at which we revisit the book's title. The title imparts a lesson about distinguishing reliable sources of information from those of lesser authority. A notable differentiation emerges between individuals' self-representation on platforms like Facebook and other social networks and data from alternative sources such as Google searches or purchase statistics on relevant subjects. These disparities underscore the propensity to portray an idealized self-image as opposed to an accurate reflection of genuine behavior. Similar to other contexts, judiciously selecting and deploying suitable sources within Big Data is imperative. Not everything that carries significance is inherently trustworthy.
Surveys, once a primary data source of historical significance, are revealed to have limited applicability. As with our evaluation of claims on social networks, a fundamental predicament surfaces here as well: survey participants frequently lack complete candor. This phenomenon persists even in anonymous surveys, possibly driven by the desire to establish a favorable impression in the eyes of others and, conceivably, within one's own self-perception.
Google searches stand as a substantial and generally reliable fount of information. Nonetheless, precision in the approach to analysis is, at times, essential. For example, scrutinizing mentions of presidential candidates may not inherently divulge a distinct preference, as many individuals likely conduct searches to compare multiple contenders. However, the sequence in which options are presented establishes a direct link to selection. When both candidates' names are entered, it becomes evident that people list their favored choice first. Thus, by gauging search volumes, one can approximate projected voter support for each candidate.
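The name-ordering heuristic described above can be sketched in a few lines of Python. The search counts below are invented for illustration; they are not figures from the book.

```python
# Hypothetical counts of combined-name searches, keyed by the order in
# which the two candidates' names were typed. The book's observation:
# people tend to type their preferred candidate first.
searches = {
    ("Trump", "Clinton"): 6200,
    ("Clinton", "Trump"): 5800,
}

def estimated_support(searches):
    """Approximate each candidate's support share from name-order counts."""
    totals = {}
    for (first, _second), count in searches.items():
        totals[first] = totals.get(first, 0) + count
    grand = sum(totals.values())
    return {name: count / grand for name, count in totals.items()}

shares = estimated_support(searches)
```

With these invented counts the method would project roughly 52% support for the candidate whose name is typed first more often.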
Databases of scanned books also furnish a substantial and captivating dataset, offering insights into the evolution of historical trends, both literary and beyond. Social media searches provide a mechanism to amass demographic data. The vastness and diversity of available databases surpass imagination, and the inventiveness of researchers is pivotal to its continual expansion.
Guideline: When embarking on Big Data research, it is essential to maintain an open and adaptable approach when discerning sources of information and identifying pertinent, influential content. As an illustrative example, the discovery that a horse's heart size has implications for racing success reveals an unforeseen insight within the broader context.
Unleashing the Potential of Big Data
Big Data holds advantages over other data sources:
1. It presents novel research content that was once beyond reach, shedding light on concealed information and unspoken sentiments. Notably, individuals tend to divulge their innermost thoughts and confessions within search queries, even when those queries may not directly yield a pertinent response (such as searching "Is my daughter beautiful?"). Data analysis can, for instance, cast a revealing light on these trends.
2. It furnishes dependable information, standing in contrast to the reliance on traditional sources like surveys or assumptions. Google search data is an illustrative example, especially concerning sensitive topics like racism, where understanding people's actions often yields more precise insights than merely depending on their assertions. Online platforms like Netflix understand the significance of observing behaviors and transcending verbal expressions.
3. It enables targeted scrutiny of distinct subpopulations, including smaller groups, previously an unwieldy endeavor due to limited sample sizes. This focused approach unlocks new dimensions of learning that were previously inaccessible. For instance, scrutinizing populations across diverse cities and neighborhoods unearths finely tuned insights, forestalling overgeneralization and hurried conclusions. Focused learning delves into the complexities of real-world dynamics and human conduct. An example is the investigation of differences between individuals who dream of tomatoes and those who dream of cucumbers, unearthing insights beyond the surface.
4. It empowers the conduct of boundless randomized trials, diminishing reliance on controlled experiments. These experiments are pivotal for studying causality and influence (as expounded in the methodologies below)—a case in point: delving into the impact of television advertisements on consumption patterns for advertised products.
Applications and Utilizations
Big Data holds the potential to enrich learning across a multitude of life's dimensions. It can be harnessed in the following ways:
1. To either corroborate or question established theories (hint: a Freudian theory is discredited while another is affirmed).
2. To predict trends or occurrences, including the forecasting of election outcomes.
3. To identify subpopulations, nurture a sense of belonging, and detect individuals in distress, enabling timely assistance.
4. To facilitate impartial decision-making and to rely on tried-and-true solutions.
5. For education and acquiring more profound insights into the world—like comprehending the concerns of expectant mothers and tracing shifts in these concerns across diverse regions.
It's crucial to acknowledge that when pursuing knowledge, any subject is a viable pursuit. The author strongly recommends, and this counsel bears substantial weight, venturing into domains where conventional methodologies lack firm grounding. This approach harbors potential not solely for unveiling novel perspectives but also for generating innovative insights that challenge preexisting knowledge.
Below are several research methodologies that delineate the analysis and learning from Big Data, along with pertinent recommendations:
1. Focused Information: When navigating Big Data, it is important to recognize that employing the entire dataset is not always necessary, or even optimal, for drawing conclusions and gaining novel insights. Frequently, working with representative samples proves more advantageous. Paring down the data and focusing on a subset, provided the quantity suffices, establishes a more precise and fitting foundation of information.
2. Sentiment Text Analysis: Following a foundational study wherein researchers categorized tens of thousands of English words as positive or negative, many studies can now delve into individuals' emotions through text analysis. This burgeoning field emerged as an uncharted territory before the advent of Big Data.
3. Doppelganger Search: In predictive searches centered on individuals, a group of "duplicates" possessing the same attributes as the subject is sought to comprehend their outcomes. This approach is utilized, for instance, to foresee the success of older baseball players attempting comebacks or to personalize drug testing. It also underpins personalized product recommendations on platforms like Amazon. Doppelganger searches facilitate tailored and precise diagnoses and treatments.
4. Causal Analyses: Contemporary deep learning (AI) predominantly involves identifying patterns that unveil behavioral insights. While substantial, this approach lacks context—it doesn't elucidate why one factor influences another; it merely highlights a correlation between the two (even though both might be affected by a third factor). The expansive information reservoirs within Big Data databases allow for the execution of randomized field experiments. Unlike controlled laboratory scenarios, these real-world experiments are more reliable, encompassing authentic situations instead of artificial constructs. Moreover, these experiments are orders of magnitude larger than planned lab experiments, and the control groups inherently derive from the preexisting database content. Commonly termed A/B testing, these experiments facilitate effect analysis and contribute to understanding causation. Remarkably, even industry giants like Google employ these analytics daily to optimize button placements, sizes, hues, and more—all aimed at maximizing consumption and, naturally, their profits.
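The first methodology above, working with a representative sample instead of the full dataset, can be illustrated with synthetic data: a modest random sample usually estimates a population statistic almost as well as the complete data.

```python
import random

random.seed(42)  # deterministic, for the illustration only

# A synthetic "full dataset": one numeric attribute per record.
population = [random.gauss(50, 10) for _ in range(100_000)]

# A representative random sample of 1% of the records.
sample = random.sample(population, 1_000)

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
# The two means differ by well under one unit of the attribute's scale.
```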
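The second methodology, sentiment text analysis, reduces in its simplest form to counting pre-labelled words. The tiny lexicon below is illustrative only; the study the book describes labelled tens of thousands of English words.

```python
# Minimal word-list sentiment scoring with an illustrative toy lexicon.
POSITIVE = {"good", "great", "happy", "love", "wonderful"}
NEGATIVE = {"bad", "sad", "terrible", "hate", "awful"}

def sentiment_score(text):
    """Return (#positive - #negative) / #labelled words; 0.0 if none match."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    labelled = pos + neg
    return (pos - neg) / labelled if labelled else 0.0
```

A text scores +1.0 when all its labelled words are positive, -1.0 when all are negative, and 0.0 when the lexicon matches nothing.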
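The third methodology, the doppelganger search, is at heart a nearest-neighbour lookup: find the records most similar to the subject and read off their outcomes. The player statistics below are invented, and plain Euclidean distance stands in for the richer similarity measures real applications use (which would also rescale the features).

```python
import math

# Invented records: (age, batting_average, home_runs_per_season).
players = {
    "A": (38, 0.275, 22),  # the subject attempting a comeback
    "B": (37, 0.280, 25),
    "C": (24, 0.310, 10),
    "D": (39, 0.270, 20),
}

def doppelgangers(subject, candidates, k=2):
    """Return the k candidate names most similar to the subject."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sorted(candidates, key=lambda name: dist(candidates[name], subject))[:k]

others = {name: stats for name, stats in players.items() if name != "A"}
nearest = doppelgangers(players["A"], others)  # the subject's "duplicates"
```

The subject's outcome is then forecast from how the nearest doppelgangers actually fared.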
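The randomized field experiments described in the fourth methodology are usually read out as a comparison of two proportions. The sketch below applies a pooled two-proportion z-test to invented click counts for two button variants.

```python
import math

# Invented A/B-test counts: clicks and views for two button variants.
clicks_a, views_a = 120, 2000   # variant A: 6.0% click-through rate
clicks_b, views_b = 165, 2000   # variant B: 8.25% click-through rate

def two_proportion_z(c1, n1, c2, n2):
    """z-statistic for the difference between two proportions (pooled)."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

z = two_proportion_z(clicks_a, views_a, clicks_b, views_b)
# |z| > 1.96 marks significance at the 5% level (two-sided), so with these
# counts variant B's higher click-through rate would not be dismissed as noise.
```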
Limitations and Restraints
Recognizing that Big Data is not without its limitations is crucial, encompassing what it cannot achieve and what should be avoided. Here are notable constraints:
1. The Curse of Dimensionality: This pertains to the risk of identifying erroneous phenomena stemming from either inadequate data or, in the context of Big Data, an overwhelming number of variables being tested simultaneously. When multiple variables are examined randomly, one may emerge as statistically significant compared to others. To navigate this, strategies include humility, iterative testing, narrowing the variable scope, and corroborating previous findings.
2. Over-Reliance on Measurable Aspects: What is worth measuring does not always align with what is readily measurable. Our measurements are often driven by availability, possibly leading to an undue emphasis on specific indicators relative to their true importance. Managing this entails supplementing research with interviews, surveys, or other Small Data tools that depend on alternative metrics. It is essential to rely on sound judgment and refrain from settling for conclusions derived solely from Big Data.
3. Ethical Considerations: Ethical dilemmas accompany the application of Big Data. For instance, if individuals with a specific profile are less likely to repay loans, is it ethical to deny them loans? Many contemporary companies base decisions of this nature on insights from Big Data, yet these approaches may overlook nuanced realities. Another example is price discrimination: adjusting prices for sub-populations based on an analysis of their willingness to pay for a product or service.
4. Government Empowerment: Big Data empowers governments with extensive information about each citizen, facilitated by omnipresent surveillance cameras. This heightened power allows governments to anticipate future offenses and proactively intervene, which spawns many concerns—beginning with privacy infringement and stretching far beyond that domain. The ramifications are multifaceted and extend into broader societal considerations.
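The multiple-comparisons trap behind the curse of dimensionality (the first limitation above) is easy to demonstrate with simulated data: when enough pure-noise variables are tested against a random outcome, at least one usually looks "significant".

```python
import math
import random

random.seed(0)  # deterministic, for the illustration only

# 200 variables of pure noise, tested against a random "outcome".
n_obs, n_vars = 30, 200
outcome = [random.gauss(0, 1) for _ in range(n_obs)]
variables = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# The strongest of 200 spurious correlations typically exceeds the
# |r| ≈ 0.36 threshold for "significance" at n = 30, despite being noise.
best = max(abs(correlation(v, outcome)) for v in variables)
```

Iterative testing on fresh data, as the book recommends, exposes such a variable as the fluke it is.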
Social science, fortified by the foundation of Big Data, is maturing into a distinctive scientific field. The book has illuminated myriad research cases, underscoring that this digital metamorphosis is in its nascent phase. A multitude of unexplored concepts, surpassing those already delineated, lie untapped within diverse spheres. This applies universally, touching every facet of existence. As articulated by the author and reiterated by Levitt, the amalgamation of curiosity, ingenuity, and data can elevate our grasp of the world profoundly.