How to clean your SEM-Data for Machine Learning

Published by Patrick Mebus on

Highly underrated but hyper relevant is the second step in the Machine Learning process, called ‚Data Cleaning’. Let’s stick with the goldrush analogy from chapter number one. Even if you search in the right location, what you’ll hold in your hands first, is a lot of mud you’ve to clean first, to find the valuable stuff you want.

Same counts for Machine Learning. Even if your data collection works smoothly, the csv file you pull from Google or Bing might be useless due to unrelevant columns, formatting errors or statistical errors. Especially Google Analytics is notorious famous for providing damaged columns in csv-files.

Possible reasons why you might have to clean your SEM-dataset:
1. Missing Data

Imagine you’re facing tracking issues and now you have to deal with gaps in your performance data. What now? Quo vadis Search Marketer? 
First of all: never ignore missing data and values. Otherwise you’ll be in trouble later on and most likely your algorithm won’t work properly. The very last we want to do, is to build our model based on incomplete or wrong data. Sounds reasonable, right?

How to handle gaps and missing values in your dataset?

How to solve: The best way to avoid missing data is to ensure a smooth tracking. Nevertheless, when things have gone wrong already it’s highly recommended not just to drop out the particular columns. Better: Tell your algorithm that data is missing by labeling and creating new data-classes.

When it comes to missing numeric data you should further fill the  affected cells with a zero-value to make it readable for the algorithm and meet the technical requirements.

Fortunately libraries like Pandas come along with several functions to check and replace missing values (numerical and non-numerical) in your dataset, like isnull(), fillna() or replace().


2. Outliers

Outliers are datapoints that are distant to other values in your dataset. Anomalies in your usually common performance numbers.

Huge peaks and drops in SEM performance-data always require some deeper analysis before you can decide to drop out certain values, which is not always recommended. Peaks in Search Engine Advertising are quite common. They can be caused by weekend-searches, seasonality and even good or bad weather.

How to solve: Remove or fix? That’s the crucial question. The following steps are mostly common and recommended in terms of outlier-handling.

A. Figure out the outlier-reasons. Is it just about seasonality-reasons? Is it just a formatting error? Or do you have a serious issue in your data-measurement and tracking?

Outlier due to weekend-traffic or flawed tracking?

B. If it’s acutually crappy data caused by tracking issues, remove it and drop out the particular values

C. If you have to deal with issue-outliners regularly, think about working with algorithms that’re less sensitiv for outliers. Random Forst algorithms are a great choice for a more robust modeling

D. If your data is correct and outliers caused by seasonality or increasing bids, you should proceed by choosing the data within a standard deviation and ignore all data that is outside of this range. In Scikit-Learn ( which is the Machine Learning framework we use for all examples on this page) you can rely on the ‘preprocessing-data-package’ for standardization. This package provides a bunch of functions to prepare raw data before pushing it into a model. Roboust-scale-methods are awesome tools and functions in terms of outliers.

3. Irrelevant Columns

Depending on your subject and goal there might be a couple of columns that don’t contribute to the objective. Columns like ‚Keyword Status’ or ‚Currency’ are two typical examples.

How to solve: Key-takeaway here: Focus on the most relevant columns that support your goal to make data-handling easier and computational processes faster later on. 

Consider wich columns are actually relevant for your model

4. Wrong format in Cells

Who has not yet been annoyed by csv-formatting issues rooting in keyword-operators, points instead of commas or the other way around, and my favorite: The ‚date-instead of number issue’.

How to solve: A lifehack that helps in most of the times: Push your data from Google Ads dashboard to a spreadsheet first. Then download data as csv-file from here to have a clean file available.

exited about this post? feel free to share

Patrick Mebus

I’m a Digital Marketer with deep passion for Search Engines, Automation and AI. I’m here to make Machine Learning more feasible for Search Engine Marketers.

1 Comment

How to set up a Machine Learning environment and upload SEM-Data - SEM SMARTATION · October 27, 2019 at 6:20 pm

[…] we’ve collected and cleaned our data, the next step is to build a Machine Learning environment for our purpose and to push the […]

Leave a Reply

Your email address will not be published. Required fields are marked *