The Six Stages of the Data Analysis Process
Many authors, such as Jeannette Wing in “The Data Life Cycle,” organize the data analysis process into five steps: data collection, data cleaning, data analysis, data visualization, and presentation. In this article, however, we will consider an additional step that we can call step 0 or 1. We will base ourselves on the GeeksforGeeks article of January 10, 2004.
Let’s compare the data analysis process to a trip to better understand the subject.
What is data analysis?
The expression “scientia potentia est” (“knowledge is power”) is commonly attributed to Francis Bacon, whose writings from around 1597 contain a version of it, but it is believed that the idea itself dates back to many years before Christ. Even today, this phrase proves to be true: whether managing a country, a business, or even a home, effort without knowledge is practically useless. Unlike in those times, information and knowledge are no longer whispered in the ear but hang freely over all of us, mixed with all kinds of debris that make this knowledge unreadable. Data analysis arises precisely to facilitate the understanding of this knowledge, making it readable and useful.
Now that we know what it is, let’s go step by step through a good data analysis. Even though we present the steps linearly, data analysis is more circular than linear, and the need to go back to a previous step may arise as we better understand the problem.
The six steps of the data analysis process are:
1. Definition of the object of analysis:
Objective: Establish the problem we want to solve or the question we want to answer with data analysis.
Key questions: What do we want to discover? What information is needed? What are the real problems that stakeholders need to solve? What are our expectations for the solutions?
Success metrics: How will success be measured?
Imagine that your friend Euclides needs your help. He wants to go on a trip and asks you to help him organize everything. What is the first piece of advice you will give him? “Pack your bags soon”? Of course not; you first have to know where he wants to travel, why, and when, so that you have the information you need to help him.
Similarly, in the first step of the process, the data analyst is given a problem or business task. The analyst must understand the task and the stakeholders’ expectations regarding the solution. A stakeholder is usually a person who has invested money and resources in a project; in our example, the stakeholder is our friend Euclides. Even without having invested money, he is the interested party and the decision-maker. The analyst must be able to ask several questions to find the right solution to the stakeholder’s problem and must find the root cause of the problem to fully understand it; therefore, effective communication with stakeholders and other colleagues is essential.
2. Collect data:
Data sources: Where can the data be found?
Data types: What type of data is needed (numeric, categorical, etc.)?
Manual or automated collection: How will the data be collected?
Since we already know where Euclides wants to travel (to Mozambique) and why (to visit his mother), it is time to collect as much information as possible about how we can do this. We need to collect all the data that could be useful for Euclides to decide whether he will actually travel and how he will do so. We also define how we will do this collection: we will interview airlines, we will analyze what we already have on the subject (our database), and we can collect data from several sources at the same time.
Similarly, in data analysis, the second step is to prepare or collect the data. This step includes collecting data and storing it for later analysis. We collect data based on the task and can draw it from different sources; the most common are interviews, surveys, feedback, and questionnaires. The collected data can be stored in a spreadsheet or an SQL database. Spreadsheets can be used to store a few thousand or tens of thousands of rows of data, while databases are used when there are many more rows to store. As examples of electronic spreadsheets, we have MS Excel and Google Sheets; for databases, we have SQL Server, MySQL, and Oracle.
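As a minimal sketch of storing collected data in an SQL database, the example below uses Python's built-in sqlite3 module. The table name, the column names, and every value are hypothetical and were invented for Euclides' trip; they do not come from any real source.

```python
import sqlite3

# Hypothetical travel options collected for Euclides (invented values).
# Columns: where the record came from, travel mode, price, duration.
collected_options = [
    ("airline_interview", "flight", 850.0, 14.0),
    ("own_database", "flight", 920.0, 11.5),
    ("survey", "cruise", 1400.0, 240.0),
]

# An in-memory database keeps the sketch self-contained; a real project
# would connect to a file or to a server such as MySQL or SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE travel_options (
        source TEXT,
        mode TEXT,
        price_usd REAL,
        duration_hours REAL
    )
    """
)
conn.executemany(
    "INSERT INTO travel_options VALUES (?, ?, ?, ?)", collected_options
)
conn.commit()

# Confirm what was stored before moving on to the cleaning step.
row_count = conn.execute("SELECT COUNT(*) FROM travel_options").fetchone()[0]
modes = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT mode FROM travel_options ORDER BY mode"
    )
]
print(row_count, modes)
```

Keeping the source of each record in its own column makes it easier to trace a suspicious value back to where it was collected.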
3. Data cleaning:
Identify and remove errors: Fix inconsistent, missing, or duplicate values.
Normalize data: Standardize measurement units and formats to facilitate analysis.
Transform data: Apply mathematical or statistical functions to generate new variables.
Now, with all the information collected, we need to clean our data, that is, remove the fake news. Suppose that in your research you found data saying you can travel by swimming across the ocean, by riding a dragon, or through a portal in one second. Just consider the portal: this data hinders our research because, in addition to being false, it distorts the result, since it would probably be the most efficient option if it were true. Therefore, in this step, we eliminate all the useless options that would make our result unfeasible.
There can be more subtle forms of bad data, so it is good to know what we consider clean data. Clean data means data free from spelling errors, redundancies, and irrelevancies, and it largely depends on data integrity. There may be duplicate data, or the data may not be in a consistent format, so unnecessary data is removed and the rest is cleaned. SQL and Excel provide different functions to clean the data. This is one of the most important steps in data analysis, as clean, formatted data helps find trends and solutions. The most important part of this phase is checking whether your data is biased. Evaluating the entire population is ideal; in cases where that is not possible, the sample must be representative of it.
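The cleaning rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the VALID_MODES set are hypothetical, and a real project would more likely rely on SQL, Excel, or a data library.

```python
# Hypothetical raw records; all values are invented for illustration.
raw_records = [
    {"mode": "Flight ", "price": "850", "duration_hours": 14.0},
    {"mode": "flight", "price": "850", "duration_hours": 14.0},    # duplicate
    {"mode": "cruise", "price": "1400", "duration_hours": 240.0},
    {"mode": "portal", "price": "0", "duration_hours": 0.0003},    # false data
    {"mode": "flight", "price": None, "duration_hours": 11.5},     # missing price
]

# Assumption: these are the only feasible ways for Euclides to travel.
VALID_MODES = {"flight", "cruise"}


def clean(records):
    """Normalize text, then drop false, incomplete, and duplicate records."""
    seen = set()
    cleaned = []
    for rec in records:
        mode = rec["mode"].strip().lower()   # standardize the format
        if mode not in VALID_MODES:          # remove false or irrelevant options
            continue
        if rec["price"] is None:             # remove rows with missing values
            continue
        row = (mode, float(rec["price"]), rec["duration_hours"])
        if row in seen:                      # remove exact duplicates
            continue
        seen.add(row)
        cleaned.append(
            {"mode": row[0], "price_usd": row[1], "duration_hours": row[2]}
        )
    return cleaned


clean_records = clean(raw_records)
print(clean_records)
```

Dropping rows with missing values is the simplest choice, not the only one; depending on the task, filling them in or going back to the source may be more appropriate.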
4. Analyzing the data:
Statistical analysis: Describe and summarize data and identify patterns and trends.
Machine learning techniques: Create models to predict results, classify data, or identify anomalies.
Hypothesis testing: evaluate the significance of results and confirm conclusions.
In this stage, we do the data analysis itself: we create the solution for Euclides by analyzing our data to find the best way for him to make the trip. We will have the prices of cruises and flights, the stopovers, and the time spent on each; Euclides’ preference for contemplating landscapes; and his level of urgency and availability. By cross-referencing this information and running calculations, we can find the best combination and make the decision.
Likewise, now that the data is clean, we can analyze it to identify trends. We also perform calculations and combine data to obtain better results. Here, we use tools such as Excel and SQL, which provide built-in functions for calculations; in SQL, we can also write our own queries. Using Excel, we can create pivot tables, whereas in SQL we can create temporary tables to perform calculations. Programming languages are another way to solve problems, and their packages make the work much easier. The most commonly used programming languages for data analysis are R and Python.
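To make the idea of "finding the best combination through calculations" concrete, here is a small Python sketch. It first builds a pivot-table style summary (average price per travel mode) and then ranks the options with a weighted score. The options, the numbers, and the WEIGHTS are all assumptions made for illustration; the weighted score is one simple technique among many, not a method prescribed by this article.

```python
from statistics import mean

# Hypothetical cleaned options for Euclides (invented numbers).
options = [
    {"name": "direct flight", "mode": "flight",
     "price_usd": 920.0, "duration_hours": 11.5},
    {"name": "flight with stopover", "mode": "flight",
     "price_usd": 850.0, "duration_hours": 14.0},
    {"name": "cruise", "mode": "cruise",
     "price_usd": 1400.0, "duration_hours": 240.0},
]

# Pivot-table style summary: average price for each travel mode.
prices_by_mode = {}
for opt in options:
    prices_by_mode.setdefault(opt["mode"], []).append(opt["price_usd"])
avg_price_by_mode = {m: mean(p) for m, p in prices_by_mode.items()}

# Assumption: Euclides is in a hurry, so duration weighs more than price.
WEIGHTS = {"price_usd": 0.4, "duration_hours": 0.6}


def normalize(value, values):
    """Min-max scale a value to the range 0..1 (0 is the best, lowest value)."""
    lowest, highest = min(values), max(values)
    return 0.0 if highest == lowest else (value - lowest) / (highest - lowest)


def score(opt):
    """Weighted sum of normalized price and duration; lower is better."""
    return sum(
        weight * normalize(opt[key], [o[key] for o in options])
        for key, weight in WEIGHTS.items()
    )


best = min(options, key=score)
print(avg_price_by_mode)
print(best["name"])
```

Changing the weights changes the ranking, which is why the stakeholder's preferences from the first step need to be known before this calculation is run.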
5. Data visualization and presentation:
Graphs and tables: Create clear, concise visualizations to communicate results.
Interactive dashboards: Enable data exploration and the identification of insights.
Compelling narrative: Tell a story with data to communicate your findings.
Although the time has come to make the decision, it is not we who will make it but Euclides. We must therefore find a simple and straightforward way of presenting this data to help him make the best decision: which airline to choose, how much to spend on the flight, and why this choice and not another. For that, we need to prepare a presentation for him.
This presentation will be based on data findings and involves transforming raw information into a format that is easily understandable and meaningful to stakeholders. This process encompasses creating visual representations, such as tables and graphs, to effectively communicate patterns, trends, and insights gained from data analysis. The objective is to facilitate a clear understanding of complex information, making it accessible to both technical and non-technical audiences. Effective data presentation involves the careful selection of visualization techniques based on the nature of the data and the specific message intended. It goes beyond mere display to storytelling, where the presenter interprets the findings, emphasizes key points, and guides the audience through the narrative that the data unfolds. Whether through reports, presentations, or interactive dashboards, the art of presenting data involves balancing simplicity with depth, ensuring that the audience can easily understand the meaning of the information presented and use it to make informed decisions. This technique is known as storytelling.
We have several tools for data visualization, such as Tableau, Looker, and my favorite, Power BI. The programming languages mentioned in the previous step, R and Python, also have packages for data visualization. R, for example, has a package called ggplot2 that offers a variety of visualizations.
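In practice, a chart would be built with one of the tools above or with a plotting package. Purely to show the idea of turning numbers into a picture without installing anything, the sketch below draws a horizontal bar chart as plain text in Python. The prices and the text_bar_chart helper are hypothetical.

```python
# Hypothetical prices in USD (invented for illustration).
prices = {"flight with stopover": 850, "direct flight": 920, "cruise": 1400}


def text_bar_chart(data, width=40):
    """Render a horizontal bar chart as text, scaled to the largest value."""
    largest = max(data.values())
    label_width = max(len(label) for label in data)
    lines = []
    # Sort from cheapest to most expensive so the reading order tells the story.
    for label, value in sorted(data.items(), key=lambda item: item[1]):
        bar = "#" * round(value / largest * width)
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return "\n".join(lines)


chart = text_bar_chart(prices)
print(chart)
```

Sorting the bars and printing the value next to each one are small choices that help a non-technical reader such as Euclides compare the options at a glance.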
6. Action and effectiveness measurement:
Implement insights: Apply analytics findings to make decisions and improve processes.
Monitor results: Measure the impact of actions taken and adjust the strategy as necessary.
Effective communication: Share results and insights with stakeholders to generate value.
Although action is normally taken by stakeholders and not by us as data analysts, we need to continue assisting. As we saw at the beginning of this article, the data analysis process is more circular than linear, so we need to measure the efficiency of the solution obtained.
To measure the effectiveness of the analysis, data is moved to a live environment and monitored to observe whether the results match the expected business goal. If the findings are in line with the objective, the reports and results are finalized. However, suppose the result deviates from the intention established in the first phase: imagine that Euclides’ mother is in Ethiopia on a trip. The objective defined in the first phase was not simply to travel to Mozambique but to see his mother. In this case, we will have to go back some phases and redo our work. That’s why the first phase is so important: if we don’t really understand the issue, we will be running fervently towards perdition.
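Checking results against the goal set in the first phase can be sketched as a simple comparison in Python. The Objective class, the evaluate function, the budget, and the tolerance are all hypothetical; real monitoring would depend on the success metrics agreed with the stakeholders.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    """A success metric defined in the first phase (hypothetical structure)."""
    description: str
    target_value: float
    tolerance: float  # acceptable relative deviation, e.g. 0.10 for 10%


def evaluate(objective, observed_value):
    """Compare an observed result with the target and suggest the next action."""
    deviation = abs(observed_value - objective.target_value) / objective.target_value
    if deviation <= objective.tolerance:
        return "finalize"
    return "revisit earlier phases"


# Assumption: Euclides set a budget of 900 USD and accepts a 10% deviation.
goal = Objective("total trip cost in USD", target_value=900.0, tolerance=0.10)

print(evaluate(goal, 870.0))
print(evaluate(goal, 1400.0))
```

A numeric check like this only covers what was written down as a metric; a case like the mother being in Ethiopia is caught by revisiting the objective itself, not by the comparison.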
Although this is the last phase, reviewing the objective of our analysis at each stage will be very useful, always keeping us aligned with the objective and making the necessary adjustments without spending a lot of time and resources.