Treating experimental data#

A prelab assignment for analytical chemistry#

Author: Ayla S. Coder ✉️ Editor: Audun Skau Hansen ✉️

The Hylleraas Centre for Quantum Molecular Sciences and The Centre for Computing in Science Education, 2022


Things you might need to know before tackling this notebook#

Before starting this part of the prelab for analytical chemistry II, the following notebooks should be completed:

The goal of this notebook is to introduce you to Gaussian Process Regression (GPR) in the context of a real-life example from an analytical lab. We’ll use response measurements, based on a Box-Behnken experimental design, for determining the sample composition that achieves the optimal response for an electrospray ionization (ESI) mass spectrometry (MS) setup. Let’s begin with a preliminary discussion just to gather our thoughts:

Preliminary discussion

Based on what you know of Gaussian Process Regression and the experimental setup:

  • How does GPR differ from conventional methods like polynomial regression?

  • Do you expect to encounter an optimal response? Explain your answer.

  • What are the experimental constraints for this setup?

{hint} A constraint describes the physical limits of the experiment. This can be the amount of a chemical, a budget limit, a time limit, etc. Which levels of the dimensions (variables) are actually feasible?

"""
Answer: 

"""

Installation#

We’ll use the btjenesten module for this session. You can install it using:

!pip install btjenesten evince bubblebox

Then import the following modules:

import btjenesten as bt
import numpy as np
import matplotlib.pyplot as plt
import bubblebox as bb

%matplotlib notebook

Data preparation#

When dealing with large datasets it is convenient to work on sets of data directly, rather than on the individual data points. For this reason, we’ll make extensive use of the numpy array data type and its associated methods. For more details on array handling, see the official Numpy documentation.
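As a quick refresher, the following minimal example (independent of the lab data) shows how stacking one-dimensional arrays and transposing turns per-variable rows into per-measurement rows, which is exactly what we do with the lab data below:

import numpy as np

a = np.array([1, 2, 3])      # first variable, three measurements
b = np.array([10, 20, 30])   # second variable, three measurements

stacked = np.vstack([a, b])  # shape (2, 3): one row per variable
samples = stacked.T          # shape (3, 2): one row per measurement

print(stacked.shape, samples.shape)  # prints: (2, 3) (3, 2)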

Data from the last Box-Behnken analysis is loaded below. Pay attention to the shape:

"""
All the data
"""
MgSO4 = np.array([30,   60,   90,   60,   30,   90,   30,   90,   60,   30,   90,   60,   60,   60,   60])
ACN   = np.array([0.75, 1.00, 1.00, 1.25, 0.75, 0.75, 1.25, 1.25, 0.75, 1.00, 1.00, 1.25, 1.00, 1.00, 1.00])
HCL   = np.array([6,    6,    6,    6,    12,   12,   12,   12,   18,   18,   18,   18,   12,   12,   12])
all_x = np.vstack([MgSO4, ACN, HCL]).T # The T transposes the stacked data into the correct shape.
# Try running this cell without the T to see what that means.

all_y = np.array([65626517, 48406988, 42143802, 26975625, 70941284, 66452677, \
26688204, 18157839, 73184566, 54919876, 9915204, 13288195, 17918550, 13380019, 19043787])


# The arrays containing all the data are printed below
print(all_x)
print("The shape of the all_x array is:", all_x.shape)
print("---")
print(all_y)
print ("The shape of the all_y array is: ", all_y.shape)

It may also be instructive to visualize how the data is distributed along each of the three axes of the measurement domain:

bt.analysis.show_3d_sample_box(all_x).view() # Show the measurements in a three-dimensional box

As can be seen in the all_x array above, the last three measurements were performed on the exact same x-values. This is problematic for the btjenesten.gpr module, as the regressor cannot handle duplicates.

In order to remove the redundancy, use the btjenesten.analysis.remove_redundancy function:

# remove redundance
all_x_, all_y_ = bt.analysis.remove_redundancy(all_x, all_y) # note that these all_x_ and all_y_ arrays are different from the first ones.

print("The length of the x-array before removing the duplicates: ", len(all_x))
print("The length of the x-array after removing the duplicates : ", len(all_x_))

This function computes the average of the responses from the duplicate measurements. This ensures that all responses are taken equally into account.
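The exact implementation lives inside btjenesten, but the underlying idea fits in a few lines of numpy. The sketch below illustrates the principle and is not the actual remove_redundancy source; note also that np.unique returns the unique rows in sorted order, which may differ from the original ordering:

import numpy as np

def average_duplicates(x, y):
    # Average the responses y over duplicate rows in x (a sketch of the idea)
    unique_x, inverse = np.unique(x, axis=0, return_inverse=True)
    # Sum the responses belonging to each unique row, then divide by the count
    sums = np.bincount(inverse, weights=y)
    counts = np.bincount(inverse)
    return unique_x, sums / counts

# average_duplicates(all_x, all_y) should likewise yield 13 unique rows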

Discussion 2

What kind of statistical analysis can be performed in order to remove apparently erroneous measurements such as outliers?

Setting up the regressor#

Next, we’ll set up the Gaussian Process regressor by initializing it with the measured data. For more detailed information on how to do this, check out this tutorial.

We will, however, not start out by giving the regressor the full data set. In order to choose the regressor parameters in a reasonable range, we’ll instead give it a few (5) dispersed measurements, and choose our regressor parameters in such a way that the regressor gives reasonable predictions at the remaining points.

The function btjenesten.analysis.choose_n_most_distant can be used to quickly pick these measurements, or you can pick them manually if you have reason to do so.

n = bt.analysis.choose_n_most_distant(all_x_, 5) # choose 5 training data points positioned far apart

# Here's an n that you can easily modify yourself: 
# n = np.array([index, index, index, index, index]) # Remember to not choose more than 13 indices. 

print("The indices of the picked out training data = ", n)

Discussion 3

Run the whole notebook first, then return to this Discussion.

What happens when random training points are chosen? Try running the code block below to test this out. Run it several times to see how the regressor changes depending on the training data chosen. Based on what you see, discuss what initial measurements should be taken when setting up the actual analysis for lab 4.

Hint

Look at the description in the paragraph Setting up the regressor. Which training data did we pick out for the regressor here?

#n = np.random.choice(len(all_y_), 5, replace = False)
#print(" Random n; ", n)

The btjenesten.gpr.Regressor class has one parameter per dimension. These parameters determine the decay of the correlation in each dimension. Simply put, the higher the parameter for a given dimension, the more rapidly the regression function is assumed to vary in that dimension. A parameter set to zero signifies perfect correlation in that direction, meaning that changing the variable associated with that dimension has no effect on the predicted value.
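To make this concrete, here is a small sketch assuming an RBF-type kernel of the form \(k(x, x') = \exp(-\sum_d \ell_d (x_d - x_d')^2)\), a common choice in GPR; the kernel convention used internally by btjenesten may differ:

import numpy as np

def rbf_correlation(x1, x2, params):
    # Correlation between two points; params sets the decay per dimension
    return np.exp(-np.sum(params * (x1 - x2) ** 2))

x1 = np.array([30.0, 0.75, 6.0])
x2 = np.array([60.0, 0.75, 6.0])  # differs only in the first dimension

print(rbf_correlation(x1, x2, np.array([0.001, 0.001, 0.001])))  # ~0.41
print(rbf_correlation(x1, x2, np.array([0.0,   0.001, 0.001])))  # 1.0: perfect correlation along dimension 1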

Picking the right parameters is complicated, so let’s just try out a set and discuss the results:

regressor = bt.gpr.Regressor(all_x_[n], all_y_[n]) # Sets up the regressor

regressor.params = np.array([.001, .001, .001]) # setting the parameters for each dimension to 0.001 

#bt.analysis.regressor_summary(all_x_, all_y_, regressor, n, error_margin = 2e7) # Creates the plot below

table = bt.analysis.regressor_summary(all_x_, all_y_, regressor, n, error_margin = 2e7) # Sets up a table summarizing the compositions of the samples in the plot  
table

Discussion 4

  • What does the figure above show? Explain it to each other.

  • Is the order of the samples of any real significance (i.e., does rearranging them change the information conveyed in the graph)?

  • Does it reveal any information on the optimum of the function with regards to the measured points?

  • How do you expect it to change if you vary the parameters? (Feel free to try it out below).

Ideally, we would like the parameters to be chosen so that the difference between the predicted values and the real values is minimized. You can play around a bit by adjusting the parameters with the sliders to get a feel for it:

%matplotlib notebook
bt.analysis.parameter_tuner_3d(all_x_, all_y_, n) #, training_subset = n)

The btjenesten module additionally allows for some automated optimization of these parameters, but please pay careful attention to the result, as the optimization is prone to numerical instabilities.

params_optimized = bt.analysis.parameter_optimization(all_x_, all_y_, training_subset = n) 

regressor.params = params_optimized
print("Optimized parameters:", params_optimized)

When you are satisfied with the regressor parameters, you can finally set up a regressor using the full data set. Insert your choice of parameters below:

regressor = bt.gpr.Regressor(all_x_, all_y_)

#params = bt.analysis.parameter_optimization(all_x_, all_y_, training_fraction = .99)
p1 = 10**-1
p2 = 10**-1
p3 = 10**-1
regressor.params = np.array([p1,p2,p3])

With these parameters set, the regressor should reproduce the training data flawlessly:

# Uses all the x data to predict the corresponding y-values with the regressor.
y_pred = regressor.predict(all_x_) # This will be compared to the measured values.


print("Predictions       | True values ")
print(np.vstack([y_pred, all_y_]).T)
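To condense this comparison into a single number, you can look at the largest relative deviation (plain numpy, using the arrays defined above):

rel_dev = np.abs(y_pred - all_y_) / np.abs(all_y_)
print("Largest relative deviation:", rel_dev.max())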

Discussion 5

  • Make an enumerated list of the steps performed in this section. What is the purpose of each step?

  • How does changing the training data alter the predicted response and the choice of parameters? (Try it out.)

  • How would you go about systematically improving the prediction?

"""
Answer: 

"""

Visualizing the response surface#

An important feature of regressors is their ability to predict values outside the training data set. Generally speaking, we can distinguish interpolating from extrapolating predictions. Common to all regressors is the fact that they are much better at making predictions close to the training data, in effect interpolating between the measured values. As we move further away from the measurements, we leave the domain of interpolation and enter the domain of extrapolation.
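A simple proxy for which regime a query point falls in is its distance to the nearest training point. This is a rough heuristic, not a btjenesten feature; with unequal scales per dimension, consider normalizing the columns first:

def nearest_training_distance(query, training_x):
    # Distance from a query point to the closest training point
    return np.min(np.linalg.norm(training_x - query, axis = 1))

print(nearest_training_distance(np.array([75, 1.1, 9]), all_x_))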

The btjenesten.gpr module seeks predicted optima in both these domains, and can be controlled to stay close to our training set or to be more eager to explore. At any rate, a proposed optimum may be sought with the regressor’s acquisition method as follows:

new_x = regressor.aquisition(method = "COBYLA", minimize_prediction=False, l = 0) # acquiring the x-coordinates of the regressor's maximum.

print("The dimensions for the next suggested measurement: ", new_x[0])
print("The new measurement's predicted output (the maxima of the regressor): ", regressor.predict(new_x)[0])

Warning

Depending on the training data picked out, the newly proposed measurement might contain negative values. Is this physically possible for any of the dimensions in this analysis? Or does it violate any constraints of the experiment? Is there any situation (theoretically) where a dimension can be negative?

"""
Answer:
"""

In order to prevent negative values from occurring in the proposed optimum, the following line of code may be added:

# constrain proposed values to larger than 0
new_x_ = np.max(np.stack([new_x[0],np.zeros(3, dtype = float)]), axis = 0)

print("The realistic dimensions for the next suggested measurement: ", new_x_)

Finally, let’s inspect our results visually by calling the btjenesten.analysis.one_dimensional_plots function:

params_optimized = bt.analysis.parameter_optimization(all_x_, all_y_, training_subset = n) 
#params_optimized[1] *= 100
params_optimized[2] *= 1e+6
regressor.params = params_optimized
print("Optimized parameters:", params_optimized)
plots = bt.analysis.one_dimensional_plots(all_x_, new_x_, regressor, ["Response", "", ""],  ("MgSO4 content", "Acetonitrile Ratio", "HCl volume", "Real Response", "Predicted Response")) # setting up the plots with our data.
plots # Calling the plotting function. 

Discussion 6

  • What do the plots above show us?

  • How do the figures change when the training data in part 3 changes? What does this show us?

  • How reliable are these figures when it comes to determining what the optimal dimensions are?

  • How do these two figures compare to each other?

"""
Answer
"""

Visualizing interdependencies#

The btjenesten.gpr prediction is always single-valued, meaning that it takes an \(N\)-dimensional vector and returns a scalar number. Such multivariate functions are difficult to visualize, since they are essentially densities in \(N\)-dimensional space. We can, however, let only a few variables vary, keep the rest fixed, and in this way display cross-sections of the densities in lower dimensions.
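To see the idea spelled out before using the built-in helper, here is a sketch in plain numpy and matplotlib: fix HCl at its value in the proposed optimum new_x_, vary the other two variables on a grid, and evaluate the regressor at every grid point. This assumes regressor.predict accepts a two-dimensional array of points, as it does elsewhere in this notebook:

# Vary MgSO4 (axis 0) and ACN (axis 1); keep HCl (axis 2) fixed at the optimum
g0, g1 = np.meshgrid(np.linspace(30, 90, 40), np.linspace(0.75, 1.25, 40))
grid = np.stack([g0.ravel(), g1.ravel(), np.full(g0.size, new_x_[2])], axis = 1)

z = regressor.predict(grid).reshape(g0.shape)  # predicted response on the slice

plt.figure(figsize = (6,5))
plt.contourf(g0, g1, z)
plt.xlabel("MgSO4 content")
plt.ylabel("Acetonitrile Ratio")
plt.colorbar()
plt.show()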

Below, we extract two-dimensional cross-sections (surfaces) from the regressor predictions for all variable pairs, in order to get more information about the area around the measurements.

S = ("MgSO4 content", "Acetonitril Ratio", "HCl volume")

for i in range(3):
    for j in range(i+1, 3):

        x,y = bt.analysis.data_projection(regressor, axes = [i,j], center = new_x_)
        
        plt.figure(figsize =(6,5))
        plt.title(f"{S[j]} plotted against {S[i]}, where i= {i} & j= {j}")
        plt.xlabel(S[i])
        plt.ylabel(S[j])
        plt.contourf(x[0], x[1], y)
        plt.colorbar()
        plt.show()

Discussion 7

  • How do we determine where to start the measurements?

  • Why can’t the computer tell us where to start out?

  • What happens when a non-optimal starting point is chosen?

  • What determines the minimal number of starting measurements? Think: what would the regressor look like if one starting measurement was used, versus two, versus three?

"""
Answers: 

"""

Final preparations for the lab#

Okay, now that we have completed this module, propose where our initial measurements should be for lab 4.

Discussion 8

  • Which levels should each dimension have, in your opinion?

  • Can the information from last year’s Box-Behnken exercise be used as a starting point (training data) for the new regressor? Why/why not? Hint: remember why taking measurements over several days is often not recommended.

  • How many measurements do you expect to be necessary to find the maximum?

  • Extra: How would you set up this experiment? Think about time constraints, how errors can propagate (and how to fix that), the load on the user, and how to store your analyte solutions.