Predicting ion concentrations in soil using a cheap method. Part 2: Why are P and Zn models so terrible?

In the first post in my series about predicting ion concentrations using a cheap method, we looked at the AfSIS database and how it can be used to create models that predict several ion concentrations in soil from cheap-to-measure spectroscopic data. However, it was clear from those results that not all ions are equally easy to predict, with Zn and P giving us the worst prediction results overall. In today's post I want to look at why this is the case and what we may be able to do to improve our results.

Distribution plots of concentration for the different ions across all the experiments
Spectra for max (blue) and min (orange) concentrations of the different ions in the AfSIS database

To understand why a variable is hard to predict from a given set of data, it is first important to understand how that data relates to the target variable, and potentially to other variables that are easier to predict. Since we are trying to relate spectra to concentrations, it makes sense to look at the spectra we find at the extremes of the concentration distribution. Spectroscopic responses are generally expected to scale roughly in proportion to concentration, so the difference between those extremes gives us an idea of our potential signal-to-noise ratio. Besides this, we can also compare the concentrations of the different ions, since more concentrated ions can reasonably be expected to be easier to predict: their signal-to-noise ratio should also be higher (provided the response for all ions is fairly similar).

The first plot in this post shows the distribution of concentrations for the different ions. From here we can already see that predicting Al, Mg, Ca and Mn should be easier, given that we have higher absolute concentrations and a better spread of examples from the lowest to the highest concentrations in the database. The cases for P and Zn don't look too strong, as most of the samples have very low concentrations relative to the above elements.

The second plot above shows a min/max analysis, with the spectra corresponding to the highest and lowest concentrations in the database for each ion. Here you can see that the variables we're good at predicting, such as B and Ca, show large variations between the spectra at minimum and maximum concentration. For Zn and P the variations are poor. In the case of Zn, the spectra of the minimum and maximum concentrations are almost identical, which means our chances of predicting Zn from these spectra are far worse than for any of the other ions.
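The min/max comparison can be reduced to a single number per ion. The sketch below is only an illustration, not the code used for the plots: the function names and the toy values are made up, and it assumes the spectra are stored as one list of absorbances per sample, with a matching list of concentrations.

```python
def min_max_spectra(spectra, concentrations):
    """Return the spectra of the lowest- and highest-concentration samples."""
    indices = range(len(concentrations))
    lowest = min(indices, key=concentrations.__getitem__)
    highest = max(indices, key=concentrations.__getitem__)
    return spectra[lowest], spectra[highest]


def mean_abs_difference(spectrum_a, spectrum_b):
    """Average absolute gap between two spectra sampled on the same axis."""
    total = sum(abs(a - b) for a, b in zip(spectrum_a, spectrum_b))
    return total / len(spectrum_a)


# Toy data (made up): three samples, three spectral points each.
spectra = [
    [0.10, 0.20, 0.30],
    [0.12, 0.25, 0.50],
    [0.11, 0.40, 0.90],
]
concentrations = [1.0, 5.0, 20.0]

s_min, s_max = min_max_spectra(spectra, concentrations)
gap = mean_abs_difference(s_min, s_max)
print(f"mean |max - min| spectral gap: {gap:.3f}")
```

A gap close to zero, as we see for Zn, means the two extreme samples are nearly indistinguishable in the spectra. Keep in mind that a single pair of samples is a noisy estimate, so this is a quick sanity check rather than a proper signal-to-noise measurement.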

Square of the Pearson correlation (R^2) of the ions vs each wavelength of the spectra provided across the entire database.

Another useful analysis is to look at how linearly the concentration of each ion relates to each wavelength across the entire database. We can do this by calculating the R^2 of every concentration vs. individual wavelength plot, which leads to the graph above. In this case we can see that ions that are easier to predict have relatively strong linear relationships with some wavelengths, while for the ions that are hard to predict the R^2 basically never goes above 0.15, meaning that there are no simple linear relationships within the database between any wavelength and their concentrations. However, there are ions that are easier to predict than P or Zn, Fe for example, whose maximum linear correlation also stays in that region. This suggests that we can predict Fe mainly thanks to non-linear relationships within the data, and that these relationships seem to be far weaker for P and Zn.
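For anyone who wants to reproduce this kind of plot, the per-wavelength R^2 can be computed with a few lines of plain Python. This is a minimal sketch on made-up numbers, assuming one row per sample and one column per wavelength; the function names are mine and the real analysis may well have used a numerical library instead.

```python
def pearson_r2(x, y):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return (sxy * sxy) / (sxx * syy)


def r2_per_wavelength(spectra, concentrations):
    """R^2 of concentration against each spectral column."""
    columns = zip(*spectra)  # transpose: one tuple per wavelength
    return [pearson_r2(list(column), concentrations) for column in columns]


# Toy data (made up): 4 samples x 3 wavelengths.
spectra = [
    [1.0, 0.5, 2.0],
    [2.0, 0.5, 1.0],
    [3.0, 0.5, 1.0],
    [4.0, 0.5, 2.0],
]
concentrations = [10.0, 20.0, 30.0, 40.0]

r2 = r2_per_wavelength(spectra, concentrations)
print(r2)
```

In this toy example the first column tracks concentration perfectly, the second is constant, and the third has a symmetric pattern with no linear trend, so only the first wavelength would stand out in a plot like the one above.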

A plot like the above also hints at how we might be able to simplify models for the different ions. For example, if we create a model to predict Mg concentration that only uses wavelengths between 4500 and 6000, we can obtain basically the same level of accuracy as a model using the entire spectrum, because this region carries most of the information. An LGBM model created using data from this region for Mg achieves an R^2 of 0.68, just as high as the model shown in the previous post.
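Restricting a model to a spectral window is just a column selection done before training. The sketch below shows only that selection step on made-up data; the `select_window` helper and the toy axis values are hypothetical, and the model fitting itself is left out, so the reduced matrix would still need to be passed to whatever regressor you use.

```python
def select_window(axis, spectra, low, high):
    """Keep only the spectral columns whose axis value lies in [low, high]."""
    keep = [i for i, value in enumerate(axis) if low <= value <= high]
    window_axis = [axis[i] for i in keep]
    window_spectra = [[row[i] for i in keep] for row in spectra]
    return window_axis, window_spectra


# Toy data (made up): 5 spectral points, 2 samples.
axis = [4000, 4500, 5000, 6000, 6500]
spectra = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.1, 1.2, 1.3, 1.4, 1.5],
]

window_axis, window_spectra = select_window(axis, spectra, 4500, 6000)
print(window_axis)     # axis values kept for the model
print(window_spectra)  # reduced feature matrix, one row per sample
```

Fewer input columns usually mean faster training and a model that is easier to inspect, which is worth having if the accuracy really is unchanged.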

With the above information it is clear why P and Zn are hard to predict: a low signal-to-noise ratio, no strong linear relationships with any wavelength, and weak non-linear relationships overall compared with other ions, like Fe. Is there any hope then? I want to try a few other things, so stay tuned for the next post in this series!
