# Calculation of probabilities: How Machine Learning rules ad-tech

Never tell me the odds

– Han Solo, Empire Strikes Back

As a search engine marketing manager, you work with Google’s ad algorithm every day. But do you actually know how it works? It’s worth taking a look behind the machine learning curtain of ad tech. Google, Bing, Facebook & Co. make billions of dollars with so-called sponsored ad placements, that is, advertising on the respective platform. Billing is per click and based on the ‘Generalized Second Price’ (GSP) auction model. This model determines the order in which the ads are displayed based on bid amount and click probability.

#### Conditions, Conditions, Conditions…
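To see why so much hinges on the click probability, here is a minimal sketch of a GSP-style auction. It assumes the common textbook simplification in which ads are ranked by bid times predicted click probability and each advertiser pays just enough per click to keep their position. The `Ad` class, the `run_gsp` function and all numbers are hypothetical, and real platforms add further quality signals and reserve prices.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    bid: float      # maximum cost-per-click bid (made-up value)
    p_click: float  # predicted click probability for this user and query


def run_gsp(ads):
    """Rank ads by bid * click probability and price each slot GSP-style."""
    ranked = sorted(ads, key=lambda ad: ad.bid * ad.p_click, reverse=True)
    results = []
    for pos, ad in enumerate(ranked):
        if pos + 1 < len(ranked):
            nxt = ranked[pos + 1]
            # Smallest price per click that still matches the next ad's score.
            price = nxt.bid * nxt.p_click / ad.p_click
        else:
            # No competitor below; a real system would apply a reserve price.
            price = 0.0
        results.append((ad.name, round(price, 2)))
    return results


ads = [Ad("A", 2.00, 0.05), Ad("B", 3.00, 0.02), Ad("C", 1.00, 0.08)]
print(run_gsp(ads))
```

In this toy example, ad B places the highest bid but ends up last because its predicted click probability is the lowest. A wrong probability estimate would therefore change both the ranking and the prices.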

It is therefore essential for this business model to reliably predict the probability that a certain user will click on a certain ad. This is where conditional probability theory comes into play. The key question here is: How likely is it that a certain event will actually occur under certain given conditions? Expressed as a formula, this is written P(A | B): P stands for probability, and A and B are our mutually dependent variables, with A the event and B the given conditions.
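For reference, the textbook definition of conditional probability and Bayes' theorem, which follows from it, are given here in standard notation. Reading A as "the user clicks" and B as "the observed conditions" is an interpretation added for illustration.

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0
$$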

The hypothesis of this statistical approach, which can be traced back to the English theologian Thomas Bayes (1701 – 1761), assumes that the probability of an event can be predicted if we know

a) how often it has already occurred under the same conditions and

b) how often it has not occurred under the same conditions.
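The two counts above translate directly into a frequency estimate. The following is a minimal sketch that assumes a hypothetical impression log in which each entry pairs a tuple of conditions with a click flag. The function name and the data are made up for illustration.

```python
from collections import defaultdict


def estimate_click_probability(log):
    """Estimate P(click | conditions) as clicks / impressions per condition."""
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for conditions, clicked in log:
        impressions[conditions] += 1
        clicks[conditions] += int(clicked)
    return {c: clicks[c] / impressions[c] for c in impressions}


# Hypothetical impression log: ((device, weekday), clicked?)
log = [
    (("mobile", "Mon"), True),
    (("mobile", "Mon"), False),
    (("mobile", "Mon"), False),
    (("mobile", "Mon"), True),
    (("desktop", "Sat"), False),
]
probs = estimate_click_probability(log)
print(probs)
```

The estimate for `("desktop", "Sat")` rests on a single impression, so it says very little. The more finely the conditions are split, the fewer observations remain per condition.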

This requires a lot of positive and negative data under the same conditions. The Bayesian problem in Search Engine Marketing, however, is that every user is largely unique. And even this unique user does not always interact with the Internet under the same conditions. He can change the device. The location. The day of the week for his interactions varies, and so on.

Huge amounts of data are required to reliably and accurately predict click probability. The remarkable thing is: Google has these amounts of data, thanks to many billions of searches (approximately 70,000 per second) and feed interactions per day.

What remains is the problem of efficient data processing. How do you manage to calculate these complex, non-linear hypotheses? This is made possible by an algorithm architecture that has existed since the middle of the 20th century, but only experienced its breakthrough in the 2010s.

#### Neural networks to calculate complex functions – This is what happens when you search for something on Google

Inspired by the structure of the human brain, neural networks consist of so-called “artificial neurons” arranged in different types of layers. Between an input layer containing the existing user information (such as demographic data, affinities and the search term) and an output layer, which ultimately outputs the click probability, there are so-called hidden layers in which the actual calculations are performed.

Source: Andrew Ng, Coursera.org

The connections between the neurons carry parameter values, called weights or weightings, which are needed for these calculations. The calculations are performed in several steps:

#### 1. Forward-Propagation

As soon as the system receives a new input, i.e. a query from a searching user, the signal is fired through the network, first from the input layer towards the output layer (left to right in the usual diagrams).

Via a so-called “activation function”, each neuron in the second layer calculates a value between 0 and 1 from its weighted inputs and passes it on to the next layer. This is loosely comparable to a process found in every modern computer, where transistors pass on binary voltage values through logic gates (AND, NAND, OR, NOR, XNOR, XOR, NOT). The difference is that a neuron passes on a graded value rather than a strictly binary one, although with suitable weights it can imitate such a gate.

An activation function that is often used for problems of this kind is the logistic (sigmoid) function, the same function that underlies regularized logistic regression. That classification algorithm models how a binary variable depends on its inputs and in this way predicts membership in a certain class, taking into account the statistical frequency. Thus, neural networks are basically large, complicated algorithmic structures that are composed of many simpler algorithms.
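As a minimal sketch, a single artificial neuron with a logistic activation can be written as follows. The weights `20, 20` and bias `-30` are a hand-picked illustration of how such a neuron can imitate an AND gate. They are an assumption for this example, not values from any real ad system.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hand-picked (hypothetical) parameters that make the neuron act like AND.
and_weights, and_bias = [20.0, 20.0], -30.0
for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(neuron([a, b], and_weights, and_bias), 4))
```

The output is close to 1 only when both inputs are 1 and close to 0 otherwise, yet it is never exactly 0 or 1. That graded output is what lets the final layer be read as a probability.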

Each neuron makes its own calculations. This is usually done in sequence, one layer after another.

The signal is forwarded until the output layer is reached and an output value, the click probability, is available. At this point the network has a first set of weightings and a first hypothesis.
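The forward pass described above can be sketched in a few lines. This is a toy network under stated assumptions: three input features, two hidden neurons, one output, sigmoid activations throughout, and made-up weights. The names `layer_forward`, `forward` and `user_features` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute one layer; each row of `weights` belongs to one neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Push the signal through all layers, from input to output."""
    activation = inputs
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation


# Hypothetical network: 3 input features -> 2 hidden neurons -> 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, -0.1]),
    ([[1.2, -0.7]], [-0.5]),
]
user_features = [1.0, 0.0, 1.0]  # made-up encoded user and query signals
p_click = forward(user_features, layers)[0]
print(p_click)
```

With arbitrary starting weights like these, the resulting probability is only a first guess. Its quality depends entirely on the learning step that follows.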

But that’s just the beginning of the fun. If you remember the four elementary steps in the machine learning process already mentioned, you know that the hypothesis still contains errors, i.e. distances from the optimal value, which must be minimized in the learning process.

#### 2. Backpropagation & Stochastic Gradient Descent

Backpropagation is basically a backward calculation of the errors just computed. The individual layers, this time starting with the output layer, pass on their results to the preceding layer. The goal of this step is to calculate the rate of change of the error with respect to each weighting. This makes the learning algorithm’s job easier and the calculations more efficient.
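A minimal sketch of this backward calculation for a tiny network with one hidden layer is shown below. It assumes sigmoid activations and a cross-entropy error, a common pairing for click prediction but an assumption here. All weights and the single training example are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    """Return the hidden activations and the predicted click probability."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    p = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, p


def loss(p, y):
    """Cross-entropy error between prediction p and label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def backward(x, y, W1, b1, w2, b2):
    """Rate of change of the error with respect to every weight and bias."""
    hidden, p = forward(x, W1, b1, w2, b2)
    delta_out = p - y  # error signal at the output layer
    grad_w2 = [delta_out * h for h in hidden]
    grad_b2 = delta_out
    # Pass the error signal back to the hidden layer (chain rule).
    delta_hidden = [delta_out * w * h * (1 - h) for w, h in zip(w2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_w2, grad_b2


# Hypothetical weights and one training example (y = 1: the user clicked).
W1 = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
b1 = [0.0, -0.1]
w2 = [1.2, -0.7]
b2 = -0.5
x, y = [1.0, 0.0, 1.0], 1
grad_W1, grad_b1, grad_w2, grad_b2 = backward(x, y, W1, b1, w2, b2)
print(grad_w2)
```

Each gradient says how much the error would change if the corresponding weight were nudged slightly, which is exactly the information an optimizer needs to adjust the weights.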

In deep neural networks with many layers and millions of data inputs to be processed, the role of the learning or optimization algorithm is given to ‘Stochastic Gradient Descent’ (SGD). In contrast to classical gradient descent (GD), which processes the entire data set before each weight update, SGD updates the weights after every single example or small batch and therefore moves through the data faster and more efficiently towards the optimal values. This is what happens in the first days after the launch of a campaign, when the campaign status in the interface is “Learning”. However, the campaign manager cannot look behind this construction site sign and into the “engine room”.
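To give at least a rough idea of what happens in that engine room, here is a minimal sketch of SGD training a single logistic neuron. The data set, the feature meaning, the learning rate and the function names are all hypothetical. How any real ad platform trains its models is not public, so this only illustrates the general technique.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sgd_logistic(data, lr=0.5, epochs=200, seed=0):
    """Train with one weight update per example, in shuffled order."""
    rng = random.Random(seed)
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # the "stochastic" part: random example order
        for x, y in data:
            error = predict(weights, bias, x) - y
            # Step against the gradient of the error for this one example.
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Made-up examples: features = [is_mobile, query_matches_ad], label = clicked
data = [([1, 1], 1), ([0, 1], 1), ([1, 0], 0), ([0, 0], 0)]
weights, bias = sgd_logistic(data)
print(predict(weights, bias, [0, 1]), predict(weights, bias, [1, 0]))
```

Classical gradient descent would instead sum the error over all examples before changing any weight. With millions of examples, that means waiting for a full pass over the data for every single step, which is why the per-example variant scales better.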