
Typical ML pipeline


You might wonder what machine learning specialists do as part of their job. Of course, different projects mean different tasks, but there are some very common steps. In this topic, we will try to highlight them so that you can get a better idea of typical machine learning tasks. This will also help you understand what it takes to become a data scientist.

Data collection

Machine learning is impossible without data. When developing a new machine learning algorithm, experts often use publicly available datasets to benchmark their methods and compare them to the ones that have already been developed. You can find them on the famous UCI repository, as well as on Kaggle, the largest ML competition platform.

If you are working on a specific problem, let's say, in consultancy, your client (for instance, some company) may already provide you with data they have that is relevant to the problem they want you to solve. The data can come, for instance, as Excel spreadsheets. Alternatively, you may be given access to a database from which you can load all the necessary data using SQL queries.
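As a rough sketch of what loading such data can look like in Python, here is an example using pandas; the file name, table, and connection string below are made up for illustration, not taken from any particular project:

```python
import pandas as pd
from sqlalchemy import create_engine

# Load a sheet from an Excel file the client sent (hypothetical file and sheet names)
customers = pd.read_excel("client_data.xlsx", sheet_name="customers")

# Or pull the relevant rows straight from a database with an SQL query
# (the connection string and table name are placeholders)
engine = create_engine("postgresql://user:password@localhost:5432/shop")
orders = pd.read_sql("SELECT * FROM orders WHERE order_date >= '2023-01-01'", engine)

print(customers.shape, orders.shape)
```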

However, for some tasks, you might need to collect the data yourself. This can be the case, for example, if you are working on a problem or a particular application no one has worked on before. Data collection can include web scraping, that is, automatically extracting and parsing the content of certain web pages.
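To give you an idea of what scraping can look like, here is a minimal sketch using the requests and Beautiful Soup libraries; the URL and the CSS class are hypothetical and depend entirely on the page you are scraping:

```python
import requests
from bs4 import BeautifulSoup

# Download the page (the URL is a placeholder)
response = requests.get("https://example.com/products")
response.raise_for_status()

# Parse the HTML and extract the pieces we care about
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.select(".product-title")]

print(titles[:5])
```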

Keep in mind that in most cases you'll have to handle unlabeled data. So, if you are going to use the data for supervised learning, you'll need to label it manually first.

Data preprocessing

Whether you use available data or collect it yourself, the datasets you end up with can be very messy. Sometimes, there can be a lot of missing values that you might need to fill in somehow. Some values can be simply wrong (imagine someone made a typo when filling in a spreadsheet and inserted 100 instead of 10.0). It is also quite common for the data to come from different sources. In this case, it is likely that the formats are different (for instance, different measurement units, date formats, currencies, and so on, used in different files).

So typically, the first step in any machine learning project is data preprocessing. It includes joining data from different sources, dealing with missing values, and so on. The exact sequence of operations depends on the data type. For example, when working with text data, you may need additional steps such as tokenization and lemmatization to prepare your data for processing.
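Here is a small sketch of what such preprocessing can look like with pandas; the file names, column names, and the currency conversion rate are assumptions invented for this example:

```python
import pandas as pd

# Two hypothetical sources with slightly different formats
eu_sales = pd.read_csv("sales_eu.csv")   # prices in EUR, dates as DD.MM.YYYY
us_sales = pd.read_csv("sales_us.csv")   # prices in USD, dates as MM/DD/YYYY

# Bring the formats in line before joining
eu_sales["date"] = pd.to_datetime(eu_sales["date"], format="%d.%m.%Y")
us_sales["date"] = pd.to_datetime(us_sales["date"], format="%m/%d/%Y")
us_sales["price"] = us_sales["price"] * 0.92  # convert USD to EUR (made-up rate)

# Join the sources, fill in missing values, drop duplicates
sales = pd.concat([eu_sales, us_sales], ignore_index=True)
sales["price"] = sales["price"].fillna(sales["price"].median())
sales = sales.drop_duplicates()
```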

Exploratory data analysis

Once the data is ready to use, it would be a good idea to take a closer look at it before starting the actual modeling part. This step is generally called exploratory data analysis (EDA). It usually involves making some plots and calculating some basic statistics on your data.

Data visualization is usually included in the EDA stage and is particularly important for tackling the following aspects:

  • Making hypotheses about a suitable algorithm. Suppose you have a dataset about the customers of some store, and you would like to cluster them into groups. There is a clustering algorithm called K-means that assumes clusters are spherical and equally sized; it works well when the features within clusters are independent, identically distributed, and follow a Gaussian distribution. So, you should look at the features in your dataset and decide whether they follow these assumptions. In case they don't, you have two choices: select another algorithm that works well for your specific distribution (or that makes no such assumptions about the features), or preprocess your data so that it meets the algorithm's constraints (see the sketch after this list).
  • Identifying patterns in the data. "Patterns" might refer to feature correlation (whether two features are dependent), detecting anomalies (e.g., the majority of bank transactions are not fraudulent, but there are a few that are), looking at trends in time series (such as chocolate sales skyrocketing around holidays), etc.
  • Detecting outliers or mistakes in the data. To give an example, there might be a scenario where the majority of data points lie in a very specific range, but a few of them have values that are suspiciously high or low (e.g., an apartment might typically have 2 to 5 bedrooms, but you notice a few with 30 bedrooms). These outliers might skew the algorithm (reconsidering the apartment example, some models will assign higher importance to the samples with a higher bedroom count simply because the value is higher, and not because these samples are more representative of the data). At that point, you have to decide what to do with those samples or features (you might decide to drop these samples or rescale the features so that their magnitude does not mislead the algorithm, etc).
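To make the first bullet more concrete, here is a minimal sketch of inspecting the data and adjusting it before running K-means; the toy data, the number of clusters, and the standardization step are assumptions for illustration, not a recipe for every dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy customer-like data just to make the sketch runnable
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Look at the features first: K-means prefers roughly spherical,
# similarly sized clusters, so heavily skewed or very differently
# scaled features are a warning sign
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.title("Raw features")
plt.show()

# If the scales differ, standardizing the features is one way to meet
# the algorithm's constraints before clustering
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
```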

EDA is a crucial step, as it helps you get to know your data better and identify possible problems with it that might have gone unnoticed at the preprocessing step. Besides, at this step, you gain more insights about the data and the events you will be trying to model. This is the time to test some assumptions about the data that you might have and to get some ideas on which approaches could be the best to tackle the problem.
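A minimal EDA sketch with pandas and matplotlib might look like the following; the file and column names are hypothetical and stand in for whatever dataset you prepared at the preprocessing step:

```python
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("sales_clean.csv")  # hypothetical preprocessed dataset

# Basic statistics: ranges, means, quartiles, and missing counts
print(sales.describe())
print(sales.isna().sum())

# Feature correlation to spot dependent features
print(sales.select_dtypes("number").corr())

# Plots to spot outliers and trends
sales["price"].hist(bins=50)
plt.title("Price distribution")
plt.show()

sales.groupby("date")["price"].sum().plot()
plt.title("Sales over time")
plt.show()
```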

Model selection

Now that you know your data well, you can finally start the modeling part! This is typically an iterative process: you start by training an ML model that you believe will do well on the task you are trying to solve. Then, you evaluate the model's performance and investigate it carefully: does the model perform as expected, is there a pattern in the mistakes it makes, and so on. If there is, it is a good idea to devise a way to fix it. This stage also typically includes visualizing the results (for example, building a confusion matrix or plotting the ROC curve).

After that, you may want to make the necessary adjustments, train a new model, analyze its performance, and repeat until you get a well-tuned model. Different models require different constraints, weights, and learning rates to capture different patterns in the data, so another part of the model selection step is hyperparameter tuning: searching for the parameter values that let your model reach the best solution to your problem.
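Here is a hedged sketch of this train-evaluate-tune loop with scikit-learn; the synthetic data, the choice of a random forest, and the parameter grid are placeholders for illustration, not a recommendation for any particular problem:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic data just to make the sketch runnable; in practice this is
# the dataset you prepared in the previous steps
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a first model and look closely at the mistakes it makes
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Hyperparameter tuning: try several settings and keep the best one
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```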

Deploying your model

Even if you build the greatest ML model in the world, it is of little use if no one else can use it, or if the results you are getting cannot be reproduced. To put it simply, at this final stage, you need to ensure that your code can run on any machine, that your implementation is robust (that is, it does not produce unexpected errors) and efficient, and that it scales well.

The process of making your models available in production environments is called deployment. For example, the company you are working for may be interested in integrating your ML solution into the software they are already using, so you will need to deploy your model so that it can provide predictions to other software systems.
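As one minimal sketch of what "providing predictions to other software" can mean: save the trained model and wrap it in a small web service. The choice of Flask, the file name, and the endpoint are assumptions made for illustration; real deployments vary a lot:

```python
import joblib
from flask import Flask, request, jsonify

# The trained model from the previous step, saved beforehand with
# joblib.dump(model, "model.joblib")
model = joblib.load("model.joblib")

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[0.1, 0.2, ...]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Other systems can then send an HTTP request to the /predict endpoint and receive the model's output, without needing to know anything about how the model was built.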

Typically, machine learning engineers put the built model into production, but in a smaller company, you may be responsible for both developing and deploying ML solutions.

Do I have to know all of this?

Above, we have described the most typical steps in a data scientist's job. It seems like you need to know a lot of things, right?

In some companies (typically, smaller ones), you might be expected to perform all of these tasks by yourself. In others, the roles are spread across different people: data engineers are responsible for data preprocessing, some people do the modeling part, and others implement the solution efficiently and deploy it.

You might have noticed that not all of the steps mentioned above are immediately related to machine learning itself but rather to data or software engineering. This is true. Also, ironically, many data scientists report that most of their time is spent exactly on such engineering tasks rather than on pure machine learning (see, for instance, this 2020 Datanami survey).

Conclusion

  • Machine learning experts typically deal with very diverse tasks at their job.
  • Every ML project is different, but the most common steps are data loading and preprocessing, exploratory data analysis, modeling, and deployment.
  • Depending on your job, you can be expected to perform all of these tasks, or the task can be divided among the team.