{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 3. Data Analysis Workflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regardless of what you may have heard, **there is no such thing as a well-established sequence of steps for performing data analysis**. Anyone who has done data analysis for research or industrial applications knows that getting insights from data is a **nonlinear process**. True, you have to import data before plotting it, but you may find yourself reimporting more or better data after presumably completing the visualization part. The same goes for each part of the data analysis process: going back and forth between different stages as you discover great things and issues with your data is a fact of life. \n", "\n", "What we do have in data analysis are **idealized models** of how the process should go. A prominent example is this scheme from [Grolemund and Wickham](https://r4ds.had.co.nz/explore-intro.html): " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Under Grolemund and Wickham's scheme, you have a first linear process of importing and tidying (i.e., putting the data in tabular form), followed by a nonlinear part of transforming, visualizing, and modeling, and a final stage of communicating your results. This model is a perfectly good scheme to follow, and you can find numerous similar proposals online or in data analysis textbooks. This diversity of proposals reinforces my initial point: there is no such thing as a universally accepted \"right\" way to do data analysis. **Trial and error** and **flexibility** are fundamental aspects of an effective data analysis process. \n", "\n", "Proposing data analysis workflows seems to be a lot of fun, so I want to make my own too. 
In my experience, this is how an idealized process may look:\n", "\n", "- Questions\n", "- Data collection\n", "- Data loading\n", "- Data cleaning\n", "- Data formatting\n", "- Data transformation\n", "- Data exploration and visualization\n", "- Data modeling\n", "- Insight and communication\n", "\n", "The graphic below illustrates my view of the data analysis process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'll admit right away that my scheme is similar to Grolemund and Wickham's, but in all fairness, all schemes are similar to each other. However, there are a couple of differences that are important to highlight. First, I want to emphasize **the nonlinearity of the data analysis process**, illustrated by the arrows pointing backward in the scheme. For instance, the data cleaning process often leaves you without enough quality data to do any meaningful analysis, which leads you to collect more data. The transformation, exploration, and modeling cycle can also indicate that more cleaning and/or more data is required. And the insights and communication process often works as a source of new questions. Second, the **creative process of formulating questions and coming up with data collection procedures** is also often a part of the data analysis process. \n", "\n", "Next, I will address each of these stages with a **practical example analyzing real data**. Before jumping into coding, I want to briefly address the first two stages, **Questions** and **Data collection**, at a conceptual level, as they are processes usually carried out in teams before any actual coding happens. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Questions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Broadly speaking, questions in data analysis have three goals: (1) to **describe** the data in a **non-causal fashion**, (2) to **establish causality**, and (3) to generate more targeted questions by **freely exploring** the data. These goals match the methods of descriptive and/or exploratory research, correlational and/or causal research, and purely exploratory research. I distinguish descriptive and/or exploratory research from purely exploratory research to emphasize that the former usually has well-defined questions guiding the analysis, whereas the latter is more open and inductive in terms of questions and methods, so its questions may eventually turn to causal relationships. \n", "\n", "\n", "**Examples of descriptive analysis questions**: \n", "\n", "- What is the demographic profile of the company's customers?\n", "- What percentage of the vote did liberal candidates for Congress obtain in the last election?\n", "- What are the main characteristics of the geomorphology of the south of Africa? \n", "- Which are the preferred technologies among web developers in Europe?\n", "- What are the main sources of income in the Japanese economy?\n", "\n", "**Examples of inferential analysis questions**: \n", "\n", "- Do older customers have different preferences for coffee in my coffee shop?\n", "- Is infant mortality significantly higher in countries with access to universal public health? \n", "- Does gun control significantly reduce the rate of violent crime?\n", "- Do participants in the treatment group of a COVID-19 vaccine clinical trial have a lower infection rate than participants in the control group?\n", "- Is weather significantly associated with economic growth? \n", "\n", "\n", "**Examples of purely exploratory analysis questions**: \n", "\n", "- Does my data contain any meaningful clustering pattern? 
\n", "- Are some variables in my data associated? \n", "- What do the distributions in my dataset look like? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data collection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a wide variety of data collection methods, and a data analyst may or may not be involved in the process. The possibilities are so vast that it makes no sense to try to catalog all of them. Data can come from telescopes, surveys, ethnographies, clinical trials, censuses, logs from video games, financial transactions, etc. Instead of attempting the impossible task of enumerating all methods, I want to focus my attention on the following two questions: (1) **can the analyst collect more data?**, (2) **does the data come from a study/experiment, or was it collected \"in the wild\"?** \n", "\n", "Having the capacity to collect more data can be a good or a bad thing depending on the kind of problem you are facing. In descriptive data analysis and machine learning projects, having such capacity is a good thing. From a descriptive analysis perspective, more data enables us to expand the dataset to understand things better. From a machine learning perspective, more data usually means higher accuracy and less overfitting. Now, if your goal is hypothesis testing or causal inference, having the capacity to collect more data *after the modeling phase* can be a bad thing, as it enables p-hacking and other bad practices in statistical inference. True, it is ultimately up to the analyst whether to engage in bad practices, but **the temptation is there**, and other team members who may not be aware of the risk may put pressure on the analyst. \n", "\n", "When data are collected in a controlled situation like an experiment or a survey, the data analysis process can be carefully planned, and the question-answering process becomes easier. 
When data are gathered from secondary sources or \"in the wild\" through methods like web scraping or using someone else's data, the data analysis becomes harder. In such a context, asking exactly the question you wanted to ask may be hard or impossible, and a more exploratory and flexible mindset is necessary. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.6.8 64-bit ('venv': venv)", "language": "python", "name": "python36864bitvenvvenve269670d94154a47aa0ba5a4660e02de" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 4 }