
Thursday, May 5, 2016

The Seer's Accuracy

AnalyticsVidhya organized a weekend hackathon, The Seer's Accuracy, from 29th April to 1st May, 2016.

In the midst of a new job and a new city, I wasn't sure I'd get enough time to participate in this hackathon. But fortunately, it was a relatively light weekend.

The challenge was to predict which customers would return to a chain of stores. Looked at another way, it was predicting which customers would churn.

The train data consisted of customers (a.k.a. clients) and their transaction history for the years 2003 - 2006. The evaluation was on which clients would return in 2007.

There was no test data per se, and this turned out to be the most crucial part of the challenge.

Overall, very clean data and very interesting problem. Kudos to AV!

Right from the beginning, I felt setting up a validation framework was going to be key. And after a few LB submissions, I realized it was extremely important to have a good validation set too.

I started off like most other participants, using 2003-2005 as the build set and 2006 as the validation set, and running a CV on it.
What finally catapulted me up the LB was adding 2003-2004 as build and 2005 as validation to my CV framework as well.

I think this resulted in a much more stable validation setup, and my CV and LB improvements were much more in sync.
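The two temporal splits described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical record structure (each transaction tagged with a client and a year), not the actual competition data:

```python
def temporal_splits(transactions):
    """Yield (build, validation) pairs mirroring the setup:
    build on 2003-2005, validate on 2006; and
    build on 2003-2004, validate on 2005."""
    splits = [
        ((2003, 2004, 2005), 2006),
        ((2003, 2004), 2005),
    ]
    for build_years, valid_year in splits:
        build = [t for t in transactions if t["year"] in build_years]
        valid = [t for t in transactions if t["year"] == valid_year]
        yield build, valid

# Toy usage: 3 clients, one transaction per year each
txns = [{"client": c, "year": y}
        for c in range(3)
        for y in (2003, 2004, 2005, 2006)]
for build, valid in temporal_splits(txns):
    print(len(build), len(valid))
```

Running CV across both splits (rather than only the 2006 hold-out) is what gives the extra stability: a feature or parameter change has to help in two different years to look good.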

Since the variables were limited, I treated and tested each of them individually, ending up with a model with 335 features.

My final model was a blend of 3 XGBs on varying subsets of data and features.
It was a very minor improvement over my single best model.

View My GitHub Repository

I stood 1st on the public LB scoring 0.8856 and 1st on the private LB too, scoring 0.8800 using the AUC metric with the username 'vopani'.

Congrats to orenov/DataGeek for 2nd place and Bishwarup for 3rd place.

View Final Results

Feels good. Really good.
Not just for winning, but for building a solid architecture which enabled a strong and stable model resulting in a considerable lead over the rest.

And this is also my first win on AV! :-)

Thanks to the AV organizers for this hackathon; it was top quality and totally worth spending a weekend on.

View AV article on winners.

1 comment:

  1. Hi Rohan, great approach by the way. I have just one small doubt about your model. The data set that goes into the XGB algorithm will also have customer IDs, right? Is it OK to consider them as numeric type? Would the prediction change if we change the ID and keep every other variable constant? IDs generally don't have any mathematical significance, right?