Predicting ion concentrations in soil using a cheap method. Part 1: The AfSIS database

For growers, it is generally very important to know the composition of their soil. A soil that lacks certain nutrients will commonly need some form of amendment, while a soil that contains very high concentrations of certain elements, aluminium for example, might not be suitable for growing certain crops. However, measuring the concentration of metal ions in soil is not trivial: direct lab measurements are expensive, so a cheap, “field friendly” way to estimate these concentrations would be invaluable. Can we create accurate models to predict metal concentrations based on a cheap measurement? We’ll see!

The output of an LGBM regression algorithm across 5-fold cross-validation for testing sets for 12 different elemental concentrations (log(M+1)).

This is where the AfSIS database comes into play. This data set, created by the Africa Soil Information Service project, pairs the output of a cheap spectroscopic technique with the actual lab analysis results for hundreds of different samples distributed across the African continent. With this spectral and analysis data we can construct statistical models that attempt to predict actual ion concentrations from these simple and cheap IR spectroscopy measurements. Better yet, the AfSIS database is freely available and hosted on Amazon Web Services (AWS). The database files even contain some simple examples of how to carry out basic machine learning and statistical analysis using the project files.

Several years ago, a Kaggle competition was actually held for this problem, and the winner, not surprisingly, used a complex ensemble of support vector machines and neural networks with different levels of complexity (view the winning solution here). The scores of the final solution were also not very high and can now be surpassed with boosting algorithms developed in recent years (like LGBM), plus the additional data that the project has collected since that time. The Kaggle challenge also addressed only 5 measurements (Ca, P, pH, SOC and Sand), while I find it far more interesting, from a practical standpoint, to look at all 12 elemental concentrations that have been measured.

Average square of the Pearson correlation of the real vs predicted log(1+M) plots for test sets in 5-fold cross-validation

Using a simple boosting regression approach (LGBM) with no optimization, with the gradient of the spectral information as the input and the log(M+1) of the ion concentration as the target, I was able to obtain the results shown in this post. The first image shows the results for the 5 test sets obtained with random shuffling of the data for each ion, while the second image shows the mean squared Pearson correlation coefficient of these graphs. We can see that pretty good predictions can be achieved for Ca, Mg, Al and B, while the problem becomes much harder for elements like P and Zn.
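To make the procedure above more concrete, here is a minimal sketch of how such an evaluation could be set up. It is not the exact code behind the figures. It assumes that “gradient of the spectral information” means the first derivative of each spectrum along the wavenumber axis, and the names `spectral_gradient`, `squared_pearson` and `cross_validated_r2` are hypothetical helpers introduced only for illustration. The regressor is passed in as a factory, so anything with `fit` and `predict` methods will work.

```python
import numpy as np


def spectral_gradient(spectra):
    """First derivative of each spectrum (rows = samples, columns = wavenumbers).

    Assumption: this is what "gradient of the spectral information" refers to.
    """
    return np.gradient(np.asarray(spectra, dtype=float), axis=1)


def squared_pearson(y_true, y_pred):
    """Square of the Pearson correlation between real and predicted values."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return float(r ** 2)


def cross_validated_r2(spectra, concentrations, make_model, n_splits=5, seed=0):
    """Mean squared Pearson correlation over shuffled k-fold test sets.

    make_model is a zero-argument callable returning a fresh regressor
    that exposes fit(X, y) and predict(X).
    """
    X = spectral_gradient(spectra)
    y = np.log1p(np.asarray(concentrations, dtype=float))  # log(M + 1) target

    # Shuffle the sample indices once, then split them into k test folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)

    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        scores.append(squared_pearson(y[test_idx], model.predict(X[test_idx])))

    return float(np.mean(scores)), scores
```

With LightGBM installed, an untuned model can be plugged in by passing its scikit-learn style class as the factory, for example `cross_validated_r2(spectra, ca_concentration, LGBMRegressor)` after `from lightgbm import LGBMRegressor`, where `spectra` and `ca_concentration` are placeholder arrays for the spectral matrix and the lab values of one element. Repeating the call for each of the 12 elements gives one mean score per element, comparable to the second image.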

Notice that I have made no extensive effort to improve the model, since at this point I’m just interested in understanding the problem: what’s hard to predict and what’s easier to predict, so that I can better understand where to focus. I want to know where the spectral measurements hold the most value and where they are weak.

The questions we need to ask now are: What makes some elements easier to predict than others? Why are Zn and P so hard to predict from this data? Can we use the relatively accurate predictions we have for Mg and Ca to enhance our predictions for other ions? Is there a more intelligent way to preprocess this data to extract information? Stay tuned for part two in this series, where I will try to answer some of these questions.
