What is the Data Science Process?

When talking about Data Science, we can’t skim over the fact that it has layers to it, and we mean many layers! There are all kinds of Data Scientists doing all sorts of work, and that’s exactly what we aim to explain in this article. We’ll guide you through the different phases of the Process and clear up some of the mystery that surrounds this curious career path. Now, let’s take a closer look at the data science process.
The Different Types of Data
Collecting data comes with different levels of difficulty: some methods are easy, while others produce datasets that are hard to read and analyze. This is what Data Science deals with at its core. You take the data you want to learn from, change it, fit it, analyze it, build models on it, and in the end gain something in return. It’s like chipping away at a marble block until you get a Greek statue.
So, what are the different types of data you can retrieve? Well, they’re almost anything you can imagine. The data can be:
- Structured
- Unstructured
- Natural language
- Machine-generated
- Graph-based
- Audio
- Video
- Images
- Streaming, and everything in between
Most of these terms are self-explanatory. Structured vs. unstructured data is simple: the former comes ready for you to query for information, and the latter requires a bit of fiddling before you can do something useful with it. You can organize structured data into tables, like the ones we’re used to seeing in a Microsoft Excel spreadsheet or an SQL database. It is often quantitative and follows a fixed layout of rows and columns that can be learned once and used everywhere. Unstructured data, unsurprisingly, is a bit more complicated. It includes everything from text, video, and audio to blog posts, surveillance footage, and even extraterrestrial footage; you get it.
So, with that being said, what does a typical Data Science process look like? We’ll go over the main points every data scientist and data analyst needs to cover, but it is by no means the only right way to do it. Everyone has their own method, and depending on what you’re doing, your work may differ drastically from that of your colleagues or someone on the other side of the world.
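To make the difference concrete, here is a minimal Python sketch. The sample rows and the free-text note are made up for illustration, and the regular expression is just one possible way to pull structure out of text, not a general recipe.

```python
import re

# Structured data: every record has the same named fields, so it can be queried directly.
structured_rows = [
    {"customer": "Ana", "amount": 120.0},
    {"customer": "Ben", "amount": 80.0},
]
total = sum(row["amount"] for row in structured_rows)

# Unstructured data: free text with no fixed fields (hypothetical example note).
note = "Ana called on Monday and said she paid 120 dollars; Ben still owes 80 dollars."

# Some fiddling is needed before analysis: here a regex extracts the dollar amounts.
amounts = [float(m) for m in re.findall(r"(\d+) dollars", note)]

print(total)
print(amounts)
```

The structured rows can be summed right away, while the note has to be parsed first, and that parsing step only works as long as the text happens to follow the pattern we guessed.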
The Data Science Process
The Process is iterative, which means that several phases can happen at once, and one or more phases will most likely be repeated during the project’s lifecycle. That’s how Data Science works; you try, fail, try again, and succeed in the end. And that’s not to discourage you from pursuing a career in Data Science; it’s just the beauty and nature of it!
So, the Data Science Process has multiple Phases and looks something like this:
- Setting the Goal
- Retrieving Data
- Data Preparation
- Data Exploration
- Data Modeling
- Presentation and Automation
Let’s go through an overview of each Phase and get a glimpse of the broader picture.
Phase One: Setting the Goal
This step is crucial, yet many Data Scientists, no matter their mathematical wit, fail to define the project’s business goals before starting. This, in turn, blows up in their face later on when they discover a crucial mistake that could’ve been avoided with a detailed plan. Math isn’t going to help you here, and don’t think you can wiggle yourself out of a problem just by knowing Python and Statistics.
To get a clear plan, you need to define the overall goal you wish to achieve and then create a project charter. A project charter sets out detailed goals, the mission, the project context, and how you’re going to perform the analysis to reach those goals. Next, you need to clearly define the resources you’ll need for the project to work. Of course, you can’t do it without a dataset and a computer with software on it, so that’s a good place to start.
Furthermore, you’ll need some sort of proof that the project you’ll be working on is achievable. Nobody wants to invest resources in something that won’t work in the end, of course. So next on the list is defining what you expect to get out of the project and how well you expect the whole thing to work. Finally, in the last part of Phase One, draw out a neat little timeline of all activities; it will help you when you get deeper into the project.
Phase Two: Retrieving Data
Retrieving the data is an important part of Data Science because… Well, what would you do without data to work on? All jokes aside, this Phase has two parts, depending on where your data comes from: internal and external.
Internal data is readily available for you to analyze within your company. It can come from all sorts of places in the company, but the most common source is a database or data lake that the company owns.
On the opposite side, there’s External data, which isn’t yours or the company’s property. You need to seek it outside, from other organizations that collect it and share or sell it, so it can be free or paid depending on the provider and the data. A common example is how social media platforms operate. They make data available to others in exchange for enriching their own services (such as how Instagram uses user data to let advertisers target ads).
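Here is a small sketch of what retrieving both kinds of data can look like in Python. It rests on assumptions: an in-memory SQLite database stands in for a company database, a CSV string stands in for a file from an outside provider, and the table and column names are invented for the example.

```python
import csv
import io
import sqlite3

# Internal data: an in-memory SQLite database stands in for a company database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 1200.0), ("south", 950.0)],
)
internal_rows = conn.execute(
    "SELECT region, revenue FROM sales ORDER BY region"
).fetchall()
conn.close()

# External data: a CSV string stands in for a file obtained from an outside provider.
external_csv = "region,population\nnorth,50000\nsouth,42000\n"
external_rows = list(csv.DictReader(io.StringIO(external_csv)))

print(internal_rows)
print(external_rows)
```

Note that the CSV reader returns every value as text, so the population figures would still need converting to numbers before analysis.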
Phase Three: Data Preparation
Data Preparation can be divided into three sub-processes–cleaning, transforming, and combining.
Cleaning the data means removing or correcting any values in your dataset that hinder the project’s results and flow. For example, it can mean dealing with user errors, computer errors, physically impossible values, missing values, outliers, unneeded spaces and typos, and any other kind of error that goes against the rulebook.
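As a rough illustration, the sketch below cleans a handful of made-up records in plain Python. The records, the `clean` function, and the age limits are all hypothetical, and dropping bad rows is only one strategy; depending on the project you might fill in or correct values instead.

```python
# Hypothetical raw records with typical problems.
raw = [
    {"name": " Ana ", "age": "34"},
    {"name": "Ben", "age": ""},      # missing value
    {"name": "Cy", "age": "-5"},     # physically impossible value
    {"name": "Dee", "age": "41"},
]

def clean(records, min_age=0, max_age=120):
    """Return records with trimmed names and valid integer ages only."""
    cleaned = []
    for rec in records:
        name = rec["name"].strip()           # remove unneeded spaces
        age_text = rec["age"].strip()
        if not age_text:                     # drop rows with a missing age
            continue
        age = int(age_text)
        if not (min_age <= age <= max_age):  # drop impossible ages
            continue
        cleaned.append({"name": name, "age": age})
    return cleaned

cleaned = clean(raw)
print(cleaned)
```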
Data transformation is where you aggregate and extrapolate the data, derive measures, create dummy variables, and reduce the number of variables.
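A few of these transformations can be sketched in plain Python. The order records below are invented for the example, and the naming of the dummy columns (`city_Oslo`, `city_Rome`) is just one possible convention.

```python
from collections import defaultdict

orders = [
    {"city": "Oslo", "price": 10.0, "quantity": 3},
    {"city": "Rome", "price": 4.0, "quantity": 5},
    {"city": "Oslo", "price": 2.5, "quantity": 4},
]

# Derive a measure: revenue per order.
for order in orders:
    order["revenue"] = order["price"] * order["quantity"]

# Aggregate: total revenue per city.
revenue_by_city = defaultdict(float)
for order in orders:
    revenue_by_city[order["city"]] += order["revenue"]

# Create dummies: one 0/1 column per distinct city.
cities = sorted({order["city"] for order in orders})
dummies = [
    {f"city_{c}": int(order["city"] == c) for c in cities}
    for order in orders
]

print(dict(revenue_by_city))
print(dummies)
```

Dummy variables turn a category such as a city name into numbers, which many models need before they can use it.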
Then, at the end of Phase Three, comes combining the data, where you’d merge datasets, apply set operators, and create specific views.
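The sketch below shows, under simple assumptions, what merging and set operators can mean in practice. The two small datasets and their field names are made up, and the merge is written as an inner join on a shared customer id.

```python
customers = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Ben"},
    {"id": 3, "name": "Cy"},
]
orders = [
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 3, "total": 12.5},
    {"customer_id": 1, "total": 7.5},
]

# Merge (inner join): attach the customer name to each order via the shared key.
names_by_id = {c["id"]: c["name"] for c in customers}
merged = [
    {"name": names_by_id[o["customer_id"]], "total": o["total"]}
    for o in orders
    if o["customer_id"] in names_by_id
]

# Set operators: compare the ids that appear in each dataset.
customer_ids = {c["id"] for c in customers}
ordering_ids = {o["customer_id"] for o in orders}
without_orders = customer_ids - ordering_ids  # difference
in_both = customer_ids & ordering_ids         # intersection

print(merged)
print(without_orders, in_both)
```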
Phase Four: Data Exploration
Data Exploration is a mid-point in the Process, where you sit down, take a breath from the previous work, and assess everything that’s been done up until this point. It’s where you’d visualize your transformed data with the help of bar charts, plots, and graphs to take a deeper look at every aspect of your dataset.
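Before reaching for a plotting library, a quick first look can be done with Python’s standard library alone. The ages below are made-up sample values, and the text histogram is only a stand-in for a real chart.

```python
import statistics
from collections import Counter

ages = [23, 25, 31, 35, 35, 42, 58]

# Summary statistics give a first impression of the distribution.
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "min": min(ages),
    "max": max(ages),
}
print(summary)

# A rough text histogram: count how many values fall in each ten-year bucket.
buckets = Counter((age // 10) * 10 for age in ages)
for start in sorted(buckets):
    print(f"{start}-{start + 9}: {'#' * buckets[start]}")
```

Even a crude picture like this shows where most values sit and whether anything looks out of place.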
Phase Five: Data Modeling
Building a model is an iterative process, much like the overall Data Science Process. It means choosing what kind of model you’re going to approach your data with (classic statistical models or machine learning models) and the type of techniques you’ll be using.
Data Modeling includes selecting the modeling techniques and variables for the model, executing the model, and then diagnosing and comparing different models. You can choose a model for a variety of reasons, and this is where the planning from Phase One comes in handy because, without it, you wouldn’t know what to implement in this part of the project. Once you’ve chosen the model you want to use, you’re ready to code! Coding is fairly straightforward (we’re not saying easy) because programming languages like Python have so many libraries and tools to help you out that much of the heavy lifting is already done for you.
Then, at the end of Phase Five, it’s all about comparing the results you’ve ended up with. You’ll be using different statistical measures here, like the mean squared error, which checks how far your predictions were from the truth, to discover the errors within your model.
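To show what executing a model can mean at its simplest, here is a sketch that fits a straight line with ordinary least squares in plain Python. The `fit_line` function and the data points are hypothetical; in a real project you would more likely rely on an existing library than write this by hand.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict y for a new x

print(slope, intercept, prediction)
```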
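The mean squared error is the average of the squared differences between the true values and the predictions. The sketch below computes it for two hypothetical models; the numbers are invented purely to show the comparison.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between true and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3.0, 5.0, 7.0]
model_a = [2.0, 5.0, 8.0]  # predictions from hypothetical model A
model_b = [3.0, 7.0, 7.0]  # predictions from hypothetical model B

mse_a = mean_squared_error(actual, model_a)
mse_b = mean_squared_error(actual, model_b)

# The model with the lower error fits these points better.
better = "A" if mse_a < mse_b else "B"
print(mse_a, mse_b, better)
```

Because the differences are squared, one large miss (model B is off by 2 on a single point) costs more than several small ones.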
Phase Six: Presentation and Automation
And, at the far end of the project comes the presentation. It’s the part where you pat yourself on the back, create those beautiful graphs you know how to do all too well, and present this to whoever asked you to do it in the first place.
Another important point is that rerunning your models is inevitable, especially if you created a model that genuinely helps people and businesses: they’ll keep coming back for more! That’s why you’ll need to automate your model, so you don’t have to fiddle around with it every time someone comes to you with a big new chunk of data to analyze. And that’s the end of the data science process.
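One simple way to think about automation is to wrap the steps in a single function that can be run on every new batch of data. The sketch below assumes a model that has already been fitted; the function names, the model parameters, and the batches are all hypothetical, and scheduling the job is left out.

```python
def clean(values):
    """Keep only non-missing, non-negative readings."""
    return [v for v in values if v is not None and v >= 0]

def predict(values, slope=2.0, intercept=1.0):
    """Apply an already-fitted linear model (hypothetical parameters)."""
    return [slope * v + intercept for v in values]

def run_pipeline(batch):
    """Run every step on a new batch so nothing is repeated by hand."""
    return predict(clean(batch))

# Two made-up batches arriving at different times.
batches = [[1.0, None, 3.0], [-4.0, 2.0]]
results = [run_pipeline(batch) for batch in batches]

print(results)
```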