My name is Dmitry. I’ll tell you
about a recent contest on Kaggle. Many people took part.
It was held by an auto insurance
company from South America, called Porto Seguro.
I placed third.
We dealt with the following problem:
As participants, we had to
use insurance data
to figure out
whether an insurance claim would
be initiated within a year. We were given some 57 semi-anonymized
features to work with. By semi-anonymized I mean
they were grouped into car features, client features,
and so on.
We also knew which
features were categorical. So, the problem was set up quite nicely.
The data were rather imbalanced.
It makes sense, as insurance
claims are relatively rare. And we had about 3-4 percent of …
The quality metric was the normalized Gini coefficient.
For binary classification,
this comes out as two times AUC minus one.
Pretty standard stuff.
You can validate with AUCs. Everything works out nicely.
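The relation between the two metrics can be checked with a small sketch. This is an illustration only, not the competition's scoring code: the function names are made up here, and AUC is computed with the rank-sum formulation, with ties given average ranks.

```python
def auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the group of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def normalized_gini(labels, scores):
    """Normalized Gini for binary labels: two times AUC minus one."""
    return 2 * auc(labels, scores) - 1


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores), normalized_gini(labels, scores))  # 0.75 0.5
```

Since the mapping is linear and increasing, ranking models by AUC on validation gives the same order as ranking them by normalized Gini.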
The data were shuffled and
prepared exceedingly well. I mean, the distributions between
train and test were virtually identical. Among categorical features,
no unexpected values popped up.
No outliers. So, the data
were thoroughly prepared. Basically, we didn’t have to
spend too much time on … preprocessing the data.
We had some 600,000 rows in train
and 900,000 rows in test at our disposal. The public/private split was random.
As for the solution,
it was more or less standard.
Actually, it was close to the baseline:
Just use the raw data to train trees
or something, with a bit of tuning, and submit.
As a result,
public script wars ensued —
a fierce last-digit competition. Notably,
many of the 57 features were noisy.
They were of no use.
Furthermore, many of them appeared
to be automatically generated, perfectly distributed: Gaussian
random noise, steady random noise … They cried out for attention.
So, one of the crucial things …
that you had to do to gain an edge …
was removing the bad features,
as well as one-hot encoding
the categorical features. I mean, it normally doesn’t make much
difference: You can one-hot encode them, or you might leave them as they are.
But here, it removed some noise.
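As a reminder of what one-hot encoding does to a categorical column, here is a minimal sketch in plain Python. The function name and the column names ("car_model", "age") are hypothetical, chosen only for illustration; in practice a library encoder would normally be used.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per category."""
    # Collect the categories seen in each categorical column.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep the non-categorical columns unchanged.
        out = {k: v for k, v in row.items() if k not in categorical_columns}
        for col in categorical_columns:
            for value in categories[col]:
                out[f"{col}={value}"] = 1 if row[col] == value else 0
        encoded.append(out)
    return encoded


rows = [{"car_model": "a", "age": 30}, {"car_model": "b", "age": 41}]
print(one_hot_encode(rows, ["car_model"]))
```

Each category becomes its own binary feature, so a model can treat every category independently instead of relying on an arbitrary integer code.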
It turned out that
by dropping the bad features,
with proper one-hot encoding and regularized
models, you could use one LightGBM and one neural network,
and just by averaging them with equal weights, 0.5 and 0.5,
you could land the third prize. Ultimately,
validation proved to be critical:
Which features to drop?
And how not to over-regularize?
To avoid overfitting to particular splits
when the training set is reused, it is important to
tune parameters and select features
on distinct splits or holdout sets. Because if you instead fix one parameter
and start going over the rest of them, you get hundreds or thousands of trials.
Eventually it leads to hill-climbing. And that’s what actually
happened on the leaderboard, when many public scripts were mixed.
If you look at the screenshot
of the final results, it becomes pretty obvious:
Thousands of participants
submitted roughly the same thing, climbing the leaderboard.
Those who overtook them ended up a
thousand places up the ranking table. To avoid similar problems locally,
where you can hill-climb on particular
splits and seeds just the same, you need to change
seeds during validation. Also, don’t fix any seeds
when you train the model. Instead, do a couple of runs with different initializations
and average them. This makes the predictions much more stable.
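The seed-averaging idea can be sketched as follows. This is an assumption-laden illustration: `fit_predict` stands for any user-supplied function that trains a model with the given seed and returns predictions for the same samples, and `toy_fit_predict` is a made-up stand-in for a real model.

```python
import random
from statistics import mean


def seed_averaged_predictions(fit_predict, seeds):
    """Train once per seed and average the predictions sample by sample."""
    runs = [fit_predict(seed) for seed in seeds]
    return [mean(values) for values in zip(*runs)]


def toy_fit_predict(seed):
    # Hypothetical stand-in for a real model: a fixed signal plus
    # seed-dependent noise, mimicking run-to-run variation.
    rng = random.Random(seed)
    signal = [0.1, 0.5, 0.9]
    return [p + rng.gauss(0, 0.05) for p in signal]


averaged = seed_averaged_predictions(toy_fit_predict, seeds=range(20))
print(averaged)
```

Averaging over seeds reduces the variance that comes from initialization and sampling, so a change in validation score is more likely to reflect a real change in the model.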
Let’s talk about feature elimination …
and how it was implemented.
First, a number of features had a “calc” prefix.
They turned out to be randomly generated.
Well, perhaps …
Perhaps, the organizers wanted to test us.
Anyhow, I eliminated them, since
they were utterly useless. Next,
pretty much …
the only real option …
was recursive feature elimination.
If you simply wanted to train the
model once and drop the features it ranks as the least important,
well, it wouldn’t work, because that way you can’t tell for sure
that a feature is bad and won’t help.
So, the way to do it was by
starting from the whole feature set, running cross-validation, and then
trying to eliminate features one by one.
You check how much the score improves
when you eliminate each of them. You spot the feature
whose removal improves the score
the most and eliminate it. One less feature to worry about.
You then repeat the process until you get rid of a decent
number of features. In my case, I dropped about 10 out of 57.
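The elimination loop described above might look roughly like this. It is a sketch under assumptions, not the speaker's actual code: `cv_score` is a user-supplied function returning a cross-validation score (higher is better) for a list of features, and the rule of stopping when no removal helps is an assumption added here.

```python
def backward_elimination(features, cv_score, max_to_drop):
    """Greedily drop the feature whose removal improves the CV score most."""
    selected = list(features)
    best_score = cv_score(selected)
    dropped = []
    while len(dropped) < max_to_drop and len(selected) > 1:
        # Score every candidate set obtained by removing one feature.
        candidates = []
        for feature in selected:
            remaining = [f for f in selected if f != feature]
            candidates.append((cv_score(remaining), feature))
        score, feature = max(candidates)
        if score <= best_score:
            break  # assumed stopping rule: no single removal improves the score
        selected.remove(feature)
        dropped.append(feature)
        best_score = score
    return selected, dropped


def toy_cv_score(features):
    # Hypothetical score: "calc" features hurt slightly, the others help.
    n_noisy = sum(1 for f in features if f.startswith("calc"))
    n_useful = len(features) - n_noisy
    return 0.6 + 0.02 * n_useful - 0.005 * n_noisy


selected, dropped = backward_elimination(
    ["car_a", "ind_b", "calc_x", "calc_y"], toy_cv_score, max_to_drop=10
)
print(selected, dropped)
```

Each round costs one cross-validation run per remaining feature, which is why the procedure is slow but more reliable than reading importances off a single trained model.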
Moving on to models.
For one thing,
with trees, it usually doesn’t matter
what you do with categorical features or how you encode them.
However, in this particular case,
one-hot encoding proved to be critical. Especially since there weren’t any
features with many categories. At most, there were about a
hundred categories, for car models. So there’s really no
reason not to one-hot encode. The feature space could only grow
so much, by a factor of two or three. Parameters: The
model has to be highly regularized.
That means as few as 16 leaves
and a high L1 regularization.
All of this gave a model that showed
almost no overfitting on the training set. The AUC scores were around 0.67-0.68
on training and 0.64 on validation, which is nice, especially for LightGBM.
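A heavily regularized LightGBM configuration in this spirit could be written as below. Only the 16 leaves and the use of strong L1 regularization come from the talk; every other value is a placeholder assumption, not the speaker's actual setting. The keys are standard LightGBM parameter names.

```python
# Sketch of a regularized LightGBM parameter set.
lgb_params = {
    "objective": "binary",
    "metric": "auc",
    "num_leaves": 16,          # from the talk: as few as 16 leaves
    "lambda_l1": 10.0,         # "high L1 regularization"; the value is assumed
    "learning_rate": 0.02,     # assumed
    "bagging_fraction": 0.8,   # assumed; row subsampling
    "bagging_freq": 1,         # assumed; resample rows every iteration
    "feature_fraction": 0.7,   # assumed; column subsampling
    "min_data_in_leaf": 1500,  # assumed; minimum number of examples in a leaf
}
```

A dictionary like this would typically be passed to `lightgbm.train` together with a `lightgbm.Dataset`.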
As for the neural networks,
I used the PyTorch framework. Once again, the categorical
features required one-hot encoding. Interestingly, this was also true for
some numerical features,
the ones that take on few values. These could be counts,
things like driver’s age, maybe.
Well, that could reasonably be expected,
because some age groups
correlate with careless driving. Architecture … I guess,
it wasn’t that important. I tested several options.
And even with a one-layer network,
the results were not much worse. Anyway, I ended up using
a network with three layers, 4096-1024-256, with a
0.5 dropout in the middle. Layer one had mostly zero weights:
Only 2 percent of the weights
were updated, the rest were zero. A sparse layer.
By combining the two models,
I was able to finish third, not by a large margin,
but still I was far ahead
of the majority of the … contestants.
What … didn’t work out …
was the generation of new features.
And that’s despite their
being largely deanonymized. The forum explained which features …
stood for the price, the car model,
the year of manufacture. Yet …
When I tried to leverage
this to improve the score, it actually made things worse.
In my case, huge ensembles
were no good, either. Of course, these two
models are not all there is: You could have random forests, FFMs,
what have you —
on different features.
And I did try out all of that. But in the end, it didn’t do much. So I
gave up the idea and kept things simple. The guys that took second place …
actually built an ensemble model.
Well … Good for them!
This contest featured
lots of interesting stuff … that is not typical for Kaggle problems.
For one, this problem was basically
about anomaly detection. I mean,
rare and unique examples had
a higher probability of being class 1. So you could take a certain
combination of categories, concatenate them, and just
count up the frequencies. As it turned out,
examples with very low frequencies …
were more likely to be class 1.
Features of that kind
gave an AUC of about 0.57, whereas the winner got an AUC of 0.64.
And that’s unsupervised!
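The frequency feature described here can be sketched in a few lines. The function and column names are hypothetical, and a tuple is used as the combined key instead of string concatenation so that different combinations cannot collide.

```python
from collections import Counter


def combination_frequency(rows, columns):
    """For each row, count how often its combination of categories occurs."""
    # A tuple of the category values plays the role of the concatenated key.
    keys = [tuple(row[col] for col in columns) for row in rows]
    counts = Counter(keys)
    return [counts[key] for key in keys]


rows = [
    {"car_model": 1, "region": "x"},
    {"car_model": 1, "region": "x"},
    {"car_model": 2, "region": "x"},
]
print(combination_frequency(rows, ["car_model", "region"]))  # [2, 2, 1]
```

Low counts mark rare combinations, and since no target labels are involved, the counts can be computed over train and test together.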
A further interesting point
comes from the solution by the first-place winner: auto-encoders.
Unfortunately, I failed to make them work.
They could have been quite handy …
for using the training and the
testing set at the same time … and gaining a clearer representation in
the hidden layers of the neural network. What I tried to do was train
a regular auto-encoder … on the train + test data …
and then see what’s the AUC
given by its reconstruction error. I calculated the reconstruction error
as the squared error between the sample and what you get if you
put the sample through the auto-encoder.
If we calculate it for every sample,
we get an AUC of 0.6. This is a further indication that
examples that are more unique —
those the auto-encoder
could not represent very well — are more likely to
trigger an insurance claim. The first place winner, Michael
Jahrer, made full use of this. This enabled him to gain a massive
lead right from the start and … secure the first place.
Here is what he did
with the auto-encoder:
First, there was no
bottleneck in the middle. He had very broad layers —
like a thousand neurons each. Second, the auto-encoders used were
“denoising”: they learn to recover the original input from a noised copy. He implemented this
by replacing the value of a feature
with another value from the distribution
of that same feature
15 percent of the time. This is obviously better
than just adding random noise,
since the distribution of each feature is preserved.
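This kind of corruption is sometimes called swap noise. A minimal sketch follows, assuming the replacement value is drawn from the same column of a randomly chosen row; the function name and signature are made up for illustration and are not the winner's code.

```python
import random


def swap_noise(rows, swap_probability=0.15, seed=0):
    """Corrupt a table by replacing cells with values from the same column."""
    rng = random.Random(seed)
    n_rows = len(rows)
    noisy = [list(row) for row in rows]  # copy, so the input stays untouched
    for i in range(n_rows):
        for j in range(len(rows[i])):
            if rng.random() < swap_probability:
                # Take the value of the same feature from a random row.
                donor = rng.randrange(n_rows)
                noisy[i][j] = rows[donor][j]
    return noisy


rows = [[1, 10], [2, 20], [3, 30], [4, 40]]
print(swap_noise(rows, swap_probability=0.15, seed=1))
```

The auto-encoder is then trained to reproduce the original rows from the corrupted ones, and because every injected value is a real value of that feature, the corrupted inputs stay within the observed distribution.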
Plus you get a more robust
representation in the hidden layers. If you trained the auto-encoder in that
way, you then had the activations of all the hidden layers to train a new neural network on,
and thus gained a great score.
This was, I think, made possible by a very
good representation based on both sets. Probably,
one-hot encoding some of the
features played a part in it as well. I want to conclude with a bit
of a promotion of this course. I’m one of the authors.
It’s actually about
contests, mainly on Kaggle. You’ll find it can teach you a lot about
how to win and work with data in general.
Your questions are welcome.
Dmitry, you started by talking
about the features you dropped. How did you spot them?
They were quite obvious.
Just by drawing a graph, you could see it: You can’t just get
a perfect curve, it doesn’t happen. One feature had a perfectly
normal distribution, then another one … After a while,
it became obvious that all the features … that had a “calc” prefix were bad.
Next, you try eliminating them,
and wow, it makes the score a bit higher. That’s all there is to it. (Thank you.)
Thanks. I wonder if your features matched
those available in the public kernels. Is there a match, or are they different?
In public kernels, people also
removed the “calc” features. But no one bothered too
much about elimination … (They did. There were examples of that.
I wonder if they correspond to yours.) I’m afraid I didn’t get through to them.
(OK, thank you.)
I mean, in a competition like that, it’s a bad idea to examine public
scripts, because they, you know … In your algorithm for
recursive feature elimination, the random seeds aren’t fixed
between iterations? Is that correct? The cross-validation seed is fixed.
(Fixed for the whole algorithm?) Yes. What was the point of that talk
about changing random seeds? The point was that you use one seed
for recursive feature elimination, but then it’s better to
use a different seed … for final quality evaluation or
hyperparameter adjustment. (OK, I get it. Here’s one more:)
Have you tried any other algorithm
for feature selection, aside from RFE? I started by doing things
that were less involved, such as checking feature
importance from trees and the like. But it …
Didn’t work well:
Sometimes the score didn’t even improve
when the last two features were removed. So I decided to stick with …
the surefire option.
Did you try adding the features back,
the ones you dropped … ? (Yes. I got it.) You mean going back and forth:
eliminate, then add back again to see
if you dropped something useful. (Right.) Well …
It’s not available in scikit-learn.
I mean, RFE is available,
but not Add-Del, I think. Anyway, I wrote my own loop, and it did
have this procedure, but it didn’t help. So I omitted it.
Did you use one and the same model
for feature selection and training? (Well …)
I used a baseline model,
a pretty good one.
These over here are the final parameters.
The ones I used were a bit different.
Bagging was different, leaves were
slightly different. But it’s pretty close. Having a decent model is fairly important.
Because if it’s bad,
you will eliminate the good stuff.
You mean, why LightGBM and not
XGBoost? (Right, CatBoost, XGBoost …) As for CatBoost,
it’s a solid choice for when you have a
strong dependence on categorical features. However,
not in this case.
You can launch CatBoost
instead of doing target encoding to see
whether categorical features are any good. If not, then you shouldn’t
bother using CatBoost. In my case, it proved to be slower and …
produced no results, so I
abandoned it right away. As for XGBoost versus LightGBM,
I’ve long since switched to LightGBM, for the simple reason that …
The results are basically the same.
Not much difference.
Oh, what I personally like
about it is the parameter … “number of examples in leaf,”
which is not available in XGBoost, despite its being very helpful
for many problems. (What did you mean by
“model regularization”?) I mean … on this slide.
What it means is … parameter selection.
In the case of trees,
it means fewer leaves
and this lambda. I see this
as regularized boosting. Because often they aren’t
constrained that much: You can see 64 leaves. And lambda
sometimes just gets ignored. Bagging, too, gets neglected.
But they do help.
(You do it manually?) Yes. (Thank you.)