Explaining Machine Learning Classifiers with LIME

Machine learning algorithms can produce impressive results in classification, prediction, anomaly detection, and many other hard problems. Understanding what those results are based on is often complicated, since many algorithms are black boxes with little visibility into their inner workings. Explainable AI refers to techniques for providing human-understandable explanations of ML algorithm outputs.

Explainable AI is interesting for many reasons: it helps us reason about the algorithms we use, about the data we train them on, and to better understand how to test systems built on such algorithms.

LIME, or Local Interpretable Model-Agnostic Explanations, is one technique that seems to have gotten attention lately in this area. The idea of LIME is that you give it a single datapoint and the ML model to use, and it tries to build an understandable explanation for the model's output for that specific datapoint. Something like: "because this person was found to be sneezing and coughing (datapoint features), there is a high probability they have the flu (ML output)".

There are plenty of introductory articles around on LIME, but I felt I needed something more concrete. So I tried it out on a few classifiers and datasets/datapoints to see how it works in practice.

For the impatient, a summary: LIME seems interesting and a step in the right direction, but I still found the details confusing to interpret, and it did not really make me very confident in the explanations. There still seems to be some way to go before we get easy-to-understand, high-confidence explanations.

Experiment Setups

Overview

There are three sections to my experiments in the following. First, I try explaining the output of three different ML algorithms specifically designed for tabular data. Second, I try explaining the output of a generic neural network architecture. Third, I try a regression problem, as opposed to the first two, which examine classification problems. Each of the three sections uses LIME to explain a few datapoints, each from a different dataset for variety.

Inverted Values

As a little experiment, I took a single feature that LIME ranked as having a high contribution to the explanation of a datapoint, for each ML algorithm in my experiments, and inverted its value. I then re-ran the ML algorithm and LIME on the same datapoint, with only that single value changed, and compared the explanations.

The inverted feature was in each case a binary categorical feature, making the inversion obvious (e.g., change gender from male to female or the other way around). The point was simply to see whether changing the value of a feature that LIME weights highly results in large changes in the ML algorithm's output and in the associated LIME weights themselves.
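In code, the inversion experiment is roughly the following. This is only a sketch: the column name "Sex", the 0/1 encoding, and the helper name are my own illustration, not lifted from my notebook.

```python
import numpy as np

def invert_and_explain(clf, explainer, row, feature_names, feature="Sex"):
    """Explain a datapoint, flip one binary 0/1 feature, and explain again.

    `row` is a 1D numpy array of encoded feature values, `explainer` is a
    lime.lime_tabular.LimeTabularExplainer, and `clf` a fitted classifier.
    """
    idx = feature_names.index(feature)

    original = explainer.explain_instance(row, clf.predict_proba, num_features=10)

    flipped = row.copy()
    flipped[idx] = 1 - flipped[idx]   # invert the binary category
    inverted = explainer.explain_instance(flipped, clf.predict_proba, num_features=10)

    # Compare predicted probabilities and LIME weight lists side by side.
    print("original prediction:", clf.predict_proba(row.reshape(1, -1)))
    print("inverted prediction:", clf.predict_proba(flipped.reshape(1, -1)))
    return original.as_list(), inverted.as_list()
```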

Datasets and Features

The datasets used in different sections:

  • Titanic: What features contribute to a specific person classified as survivor or not?
  • Heart disease UCI: What features contribute to a specific person being classified at risk of heart disease?
  • Ames housing dataset: What features contribute positively to predicted house price, and what negatively?

Algorithms applied:

  • Titanic: classifiers from LGBM, CatBoost, XGBoost
  • Heart disease UCI: Keras multi-layer perceptron NN architecture
  • Ames housing dataset: regressor from XGBoost

Tree Boosting Classifiers

Some of the most popular classifiers I see used with tabular data are the gradient boosted decision tree based ones: LGBM, Catboost, and XGBoost. Many others exist that I also use at times, such as Naive Bayes, Random Forest, and Logistic Regression. However, LGBM, Catboost, and XGBoost are the ones I often try first these days for tabular data, so I use LIME to explain a few datapoints for each of them in this section. I expect evaluating other ML algorithms would follow a very similar process.

For this section, I use the Titanic dataset. The goal with this dataset is to predict who would survive the shipwreck and who would not. Its features:

  1. survival: 0 = No, 1 = Yes
  2. pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  3. sex: Sex
  4. Age: Age in years
  5. sibsp: number of siblings / spouses aboard the Titanic
  6. parch: number of parents / children aboard the Titanic
  7. ticket: Ticket number
  8. fare: Passenger fare
  9. cabin: Cabin number
  10. embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
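As a rough sketch of the kind of preprocessing used before training the boosters (the file path, the encoding choices, and the title extraction below are my assumptions for illustration, not necessarily what the linked notebook does):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle Titanic training data (the path is just an assumption).
df = pd.read_csv("titanic/train.csv")

# The explanations later also show a "title" feature parsed from the name;
# one common way to derive it (again my assumption of how it was done):
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Integer-encode the categorical features; drop high-cardinality identifiers.
for col in ["Sex", "Embarked", "Title"]:
    df[col] = df[col].astype("category").cat.codes

X = df.drop(columns=["Survived", "Name", "Ticket", "Cabin", "PassengerId"])
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
```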

The actual notebook code is available on my Github as well as in a Kaggle notebook.

Each of the three boosting models (LGBM, Catboost, XGBoost) provides access to its internal statistics as a form of feature weights; for details, check the relevant articles and documentation. These model feature weights give a more holistic view of the model's workings, over all datapoints, as opposed to the single datapoint that LIME tries to explain. So in the following, I will show these feature weights for comparison where available.
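All three expose this in roughly the same way through their sklearn-style wrapper classes; a minimal sketch, assuming the X_train and y_train from the preprocessing sketch above:

```python
import pandas as pd
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

models = {
    "lgbm": LGBMClassifier(),
    "catboost": CatBoostClassifier(verbose=0),
    "xgboost": XGBClassifier(),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    # All three sklearn-style wrappers expose their internal importance
    # statistics through the same attribute.
    weights = pd.Series(clf.feature_importances_, index=X_train.columns)
    print(name)
    print(weights.sort_values(ascending=False))
```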

However, there is also some very good criticism of using these types of classifier-internal statistics for feature importance, noting that it can be meaningful to compare with other techniques such as permutation importance and drop-column importance. As such, I also calculate permutation importance for each of the three boosters here, as well as later for the Keras NN classifier.
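The permutation importances here come from scikit-learn. A minimal sketch for one of the classifiers, again assuming the train/test split from above:

```python
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.inspection import permutation_importance

clf = LGBMClassifier().fit(X_train, y_train)

# Shuffle each feature column in turn and measure how much the test score drops.
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)

perm_weights = pd.Series(result.importances_mean, index=X_test.columns)
print(perm_weights.sort_values(ascending=False))
```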

LGBM

Feature Weights from Classifier / Permutations

The following figure illustrates the weights given by the model itself, when I trained it on the Titanic dataset, via the classifier feature_importances_ attribute.

LGBM Feature Weights

And the following figure illustrates the weights given by SKLearn's permutation importance function for the same classifier.

LGBM Permutation Weights

Comparing the two above, the model-statistics based weights and the permutation based weights, there is quite a difference in what they rank higher. Something interesting to keep in mind for LIME next.
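For reference, this is roughly how the per-datapoint LIME explanations in the rest of this section are produced. The explainer parameters, the categorical column list, and the reuse of the fitted `clf` from the sketches above are my assumptions; the actual notebook may configure things differently.

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=list(X_train.columns),
    class_names=["not survived", "survived"],
    # Which columns to treat as categorical is my assumption here.
    categorical_features=[X_train.columns.get_loc(c)
                          for c in ["Pclass", "Sex", "Embarked", "Title"]],
    mode="classification",
)

# Explain the first datapoint of the test set, as in the figures below.
exp = explainer.explain_instance(
    X_test.values[0], clf.predict_proba, num_features=10)
exp.show_in_notebook()   # the style of figure shown in this post
print(exp.as_list())     # the same weights as (feature, weight) pairs
```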

Datapoint 1

The following figure illustrates the LIME explanations (figures are from LIME itself) for the first item in the test set for Titanic data:

LIME LGBM 1

The figure shows two versions of the same datapoint. The one on the left is the original data from the dataset. The one on the right has the sex attribute changed to the opposite gender. This is the inversion of the highly ranked LIME feature I mentioned before.

Now, compare these LIME visualizations/explanations for the two datapoint variants to the global feature importances above (from model internal statistics and permutation score). The top features presented by LIME closely match the top features given by the global permutation importance; in fact, it is almost an exact match.

Beyond that, the left side of the figure illustrates one of my main confusions about LIME in general. The prediction of the classifier for this datapoint is:

  • Not survived: 71% probability
  • Survived: 29% probability

I would then expect the LIME feature weights to show the highest contributions for the not survived classification. But it shows much higher weights for survived. By far, "Sex=male" seems to be the heaviest weight LIME gives to any variable here, and it is shown as pointing towards survived. Summing up, the overall LIME feature weights in the left-hand figure are:

  • Not survived: 0.17+0.09+0.03+0.00=0.29
  • Survived: 0.31+0.15+0.07+0.03+0.02+0.01=0.59

Funny how the not survived weights sum up to exactly the prediction value for survived. I might think I am looking at it the wrong way, but further explanations I tried with other datapoints seem to indicate otherwise, starting with the right part of the above figure.
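To make that comparison explicit, the check I am doing is roughly this, using the hypothetical `exp` explanation object from the earlier sketch, and assuming positive LIME weights point towards the survived class:

```python
# Sum the LIME weights pointing to each class and compare against the
# classifier probabilities.
weights = exp.as_list()
survived_sum = sum(w for _, w in weights if w > 0)
not_survived_sum = sum(-w for _, w in weights if w < 0)

print("LIME weight sums (not survived / survived):", not_survived_sum, survived_sum)
print("classifier probabilities:", clf.predict_proba(X_test.values[:1]))
```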

The right side of the above figure, with the gender inverted, also shows the sex attribute as the highest contributor. But now, the title has risen much higher. So perhaps it is telling me that a female master has a higher chance of survival? I don't know, but certainly the predictions of the classifier changed to:

  • Not survived: 43%
  • Survived: 57%

Similarly, the passenger class (Pclass) value has jumped from weighting towards survival to weighting towards non-survival. The sums of the LIME feature weights in the inverted case do not seem too different overall, but the prediction has changed by quite a bit. It seems complicated.

Datapoint 2

LIME explanation for the second datapoint in the test set:

LIME LGBM 2

For this one, the ML prediction for the left-side datapoint variant indicates even more strongly that the predicted survival chance is low, but the LIME feature weights point even more strongly in the opposite direction (survived).

The right-side figure here illustrates a bit how silly my changes are (inverting only gender): the combination of female with the title mr should never happen in real data. But regardless of the sanity of some value combinations, I would expect the explanation to reflect the prediction equally well. After all, LIME is designed to explain a given prediction from given features, however crazy those features might be. On the right-hand side, the feature weights at least seem to match the prediction a bit better than on the left side, but then why is it not always matching in the same way?

An interesting point is also how the gender seems to always weight heavily towards survival in both cases here. Perhaps it is due to the combinations with the other feature values, but given how the LIME weights vs predictions seem to vary across datapoints, I wouldn't be so sure.

Catboost

Feature Weights from Classifier / Permutations

Model feature weights based on model internals:

Catboost Feature Weights

Based on permutations:

Catboost Permutation Weights

Interestingly, parch shows a negative contribution.

Datapoint 1

First datapoint using Catboost:

LIME Catboost 1

In this case, the LIME weights on both the left (original datapoint) and the right (inverted gender) side seem to be more in line with the predictions. Which sort of shows that I cannot only blame myself for interpreting the figures wrong, since they sometimes seem to match the intuition, and other times not.

As opposed to the LGBM case above, in this case (Catboost) the top LIME features actually seem to follow almost exactly the feature weights from the model's internal statistics. For LGBM it was the other way around: the LIME features did not follow the internal weights but rather the permutation weights. As confusing as everything else about these weights, yes.

Datapoint 2

The second datapoint using Catboost:

LIME Catboost 2

In this case, LIME is giving very high weights to variables on the survived side, while the actual classifier is almost fully predicting non-survival. Uh oh..

XGBoost

Feature Weights from Classifier / Permutations

Model feature weights based on model internal statistics:

XGB Feature Weights

Based on permutations:

XGB Permutation Weights

Datapoint 1

First datapoint explained for XGBoost:

LIME XGB 1

In this case, the left one seems to indicate not-survived quite heavily in the weights, but the actual predictions are fairly even between survived and not survived. On the right side, the LIME feature weights seem to be more in line with the prediction.

As for the LIME weights vs the global weights from model internals and permutations, in this case they seem mixed: some of LIME's top features are shared with the top features from the model internals, some with the permutation weights. Compared to the previous sections, the LIME weights vs the model and permutation weights seem to be all over the place. This might be some attribute of the algorithms in the case of the internal feature weights, but I would expect LIME to be more consistent with regards to the permutation weights, since that algorithm never changes.

Datapoint 2

Second datapoint:

LIME XGB 2

Here, the left one seems to indicate survival much more in the weights, and non-survival in the actual prediction. On the right side, the weights and prediction seem more in line again.

Explaining a Keras NN Classifier

This section uses a different dataset, on Cleveland Heart Disease risk. The inverted variable in this case is not gender but the cp variable, since it seemed to be the highest-scoring categorical variable for LIME on the datapoints I looked at. It also has 4 values rather than 2, but in any case, I expect changing a high-scoring variable to show some impact.

Features:

  1. age: age in years
  2. sex: (1 = male; 0 = female)
  3. cp: chest pain type (4 values)
  4. trestbps: resting blood pressure in mm Hg on admission to the hospital
  5. chol: serum cholesterol in mg/dl
  6. fbs: fasting blood sugar > 120 mg/dl
  7. restecg: resting electrocardiographic results (values 0,1,2)
  8. thalach: maximum heart rate achieved
  9. exang: exercise induced angina (1 = yes; 0 = no)
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment
  12. ca: number of major vessels (0-3) colored by fluoroscopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect

Feature Weights from Permutations

Keras, being a generic neural network framework, does not provide feature weights based on model-internal statistics, as opposed to specific algorithms such as the boosters above. But permutation-based feature weighting is always an option:

Keras Permutation Weights
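The sklearn permutation_importance function used earlier expects an sklearn-style estimator, so for a plain Keras model one option is a small manual loop along these lines. This is a sketch under my own assumptions, not necessarily how the linked notebook does it.

```python
import numpy as np

def permutation_importance_manual(predict_fn, X, y, n_repeats=10, seed=42):
    """Permutation importance for any model exposing a predict function.

    `predict_fn` should return positive-class probabilities, `X` is a numpy
    array of features, and `y` holds the 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict_fn(X).ravel() > 0.5) == y)
    scores = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, col])      # permute one feature column in place
            acc = np.mean((predict_fn(X_perm).ravel() > 0.5) == y)
            drops.append(baseline - acc)     # drop in accuracy = importance
        scores[col] = np.mean(drops)
    return scores

# Hypothetical usage with a trained Keras model and numpy test data:
# importances = permutation_importance_manual(model.predict, X_test, y_test)
```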

Training Curves

Training curves are always nice, so here you go:

Keras Training
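Before the per-datapoint explanations, one practical detail when feeding a Keras model to LIME: the tabular explainer expects a predict function returning probabilities for all classes, while a single sigmoid output gives only the positive class. A minimal wrapper sketch, assuming a binary model named `model` with one sigmoid output:

```python
import numpy as np

def keras_predict_proba(x):
    """Turn a single-sigmoid-output Keras model into a 2-class probability array."""
    p = model.predict(x)              # assumed shape (n_samples, 1)
    return np.hstack([1.0 - p, p])    # columns: [no risk, risk of heart disease]

# Hypothetical usage with a LIME tabular explainer built on the heart disease data:
# exp = explainer.explain_instance(X_test[0], keras_predict_proba, num_features=10)
```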

Datapoint 1

First datapoint explained by LIME for Keras:

LIME Keras 1

This one is predicted almost fully as no risk for both datapoint variants. Yet the weights seem to point almost fully towards the heart disease risk side.

The LIME weights compared to the global permutation weights share the same top 1-2 features, with some changes after.

Datapoint 2

Second datapoint explained by LIME for Keras:

LIME Keras 2

In this case, the predictions and weights are more mixed on both sides. The right side seems to have the weights much more on the no-risk side than the left side, yet the change between the two is that the prediction has shifted more towards the heart disease risk side.

In this case, the features are quite different from the first datapoint, and also from the global weights given by permutation importance. Since LIME aims to explain single datapoints and not the global model, I don't see an issue with this. However, I do see an issue in not being able to map the LIME weights to the predictions in any reasonable way; not consistently, at least.

Explaining an XGBoost Regressor

Features in the Ames Housing Dataset used in this section:

  • SalePrice – the property’s sale price in dollars. This is the target variable that you’re trying to predict.
  • Utilities: Type of utilities available
  • OverallQual: Overall material and finish quality
  • GrLivArea: Above grade (ground) living area square feet
  • ExterQual: Exterior material quality
  • Functional: Home functionality rating
  • KitchenQual: Kitchen quality
  • FireplaceQu: Fireplace quality
  • GarageCars: Size of garage in car capacity
  • YearRemodAdd: Remodel date
  • GarageArea: Size of garage in square feet

Datapoint 1

LIME XGBReg 1

As discussed here, LIME results seem more intuitive to reason about for classification than for regression. For regression, it should show some relative value of how the feature values contribute to the predicted regression value; in this case, how the specific feature values are predicted to impact the house price.

But as mentioned, the meaning of this is a bit unclear. For example, what does it mean for something to be positively weighted? Or negatively? With regard to what? This would require more investigation, but I will stick to the more detailed look at classification in this post.
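For reference, the regression case differs from the earlier classification sketches mainly in the mode and the predict function passed to LIME; roughly like this, with the variable names again my own assumptions (X_train/X_test holding the housing features and y_train the SalePrice):

```python
from lime.lime_tabular import LimeTabularExplainer
from xgboost import XGBRegressor

reg = XGBRegressor().fit(X_train, y_train)

explainer = LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=list(X_train.columns),
    mode="regression",
)

# For regression, LIME takes the raw predict output instead of predict_proba.
exp = explainer.explain_instance(X_test.values[0], reg.predict, num_features=10)
print(exp.as_list())   # weights as relative contributions to the predicted price
```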

Data Distribution

Just out of interest, here is a description of the data distribution for the features shown above.

XGBReg Distribution

One could perhaps analyze how the feature value distributions relate to the LIME weights for those variables, and use that as a means to analyze the LIME results further in relation to the predicted price. Maybe someday someone will.. 🙂

Conclusions

Compared to the global feature weights given by the model internal statistics and the permutations, the LIME results often share some of the top features. And comparing explanations for different datapoints using the same algorithm, there appear to be some changes in which features LIME ranks highest per datapoint. Overall, this all makes sense considering what LIME is supposed to be: explaining individual datapoints, where the globally important features likely often (and on average should) rank high, but where single points can vary.

LIME in general seems like a good way to visualize feature importances for a datapoint. I like how the features are presented as weighting in one direction vs the other. The idea of trying values close to a point to come up with an explanation also seems to make sense. However, many of the results I saw in the above experiments do not quite seem to make sense: the weights presented often seem to be opposed to the actual predictions.

This book chapter hosts some good discussion on the limitations of LIME, and maybe it explains some of this. The chapter ends by advising great care in applying LIME, and notes how the LIME parameters impact the results and explanations given. Which seems in line with what I see above.

Also, many of the articles I linked in the beginning simply gloss over the interpretation of the results and whether they make sense, or make seemingly strange assumptions. Such as this one, which gives me the impression that the explanation weights should change depending on which class the classifier predicts with higher probability. To me, this does not seem to be what the visualizations show.

It would maybe be more useful to understand the limitations and not expect too much, even if I feel I don't necessarily get all the details. I expect either it is poorly explained, or I did get the details and it is just very limited. This is perhaps rooted in the background of LIME itself, where academics must sell their results as the greatest and best in every case, and put the limitations aside. That is how you get your papers accepted and cited, leading to more grants and better tenure positions..

I would not really use LIME myself, mostly because I cannot see myself trusting the results very much, no matter the sales arguments. But overall, it seems like interesting work, and perhaps something simpler (to use) will be available someday, where I feel I can have more trust in the results. Or maybe the problem is just complicated. But as I said, these all seem like useful steps in the direction of improving the approaches and making them more usable. Along these lines, it is also nice to see these techniques being integrated as part of ML platform offerings and services.

There are other interesting methods taking similar approaches as well. SHAP is one that seems very popular, and Eli5 is another. Some even say LIME is a subset of SHAP, which should be more complete than the sampling approach taken by LIME. Perhaps it would be worth the effort to make a comparison some day..
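As a pointer, the tree-model case in SHAP looks roughly like this (a sketch only; `clf` is assumed to be one of the fitted tree boosters from above, and I have not compared the outputs here):

```python
import shap

explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)   # per-datapoint, per-feature contributions
shap.summary_plot(shap_values, X_test)        # global summary over the test set
```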

That’s all for this time. Cheers.
