Wednesday, May 6, 2015

TFI Restaurant Revenue Prediction


The TFI Restaurant Revenue Prediction competition was held on Kaggle from March to May 2015.

Objective
New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

The objective was to predict the annual revenue of each restaurant site.

Data
The training data consisted of 137 rows of restaurant data: opening date, city, type and anonymized variables covering demographic, real estate and commercial information, along with the corresponding annual revenue.

The test data consisted of 100,000 rows of restaurant data for which we had to predict the annual revenue. Most of the test rows were junk, a popular technique used to prevent hand-labelling in such competitions; the actual size of the test set is rumoured to be around 320.

Approach
What can you do when you have just 137 data points, some of which look like outliers? You have to make a choice and bank on some luck :-)

I chose to build a model that gives a relatively stable CV score and a decent LB score. I tried out a few models and found RandomForest giving decent, stable results; after reading the forums, it was clear many other participants had found the same.
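To check stability on such a tiny dataset, repeated random holdouts work well. Here is a minimal sketch of that kind of check in R, assuming a 'train' data frame with the 'revenue' column and a 'features' vector of predictor names (the split size and repeat count are illustrative, not my exact settings):

    library(randomForest)

    # Repeat random ~80/20 holdouts and look at the spread of the error;
    # a small sd across repeats is what I mean by a "stable" CV.
    rmses <- replicate(25, {
      idx  <- sample(nrow(train), 110)   # ~80% of the 137 rows
      rf   <- randomForest(train[idx, features], train$revenue[idx], ntree = 500)
      pred <- predict(rf, train[-idx, features])
      sqrt(mean((pred - train$revenue[-idx])^2))
    })
    c(mean = mean(rmses), sd = sd(rmses))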

I tried some simple features, nothing too complex, since over-fitting is highly likely on such a dataset.
I also shuffled the train data and built RF models on different subsets to reduce noise and the effect of outliers.

Also, training the model on log-transformed 'revenue' worked much better than training on the raw 'revenue'.
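Putting the two ideas together, here is a minimal sketch of this subset-averaging scheme, with the same assumed 'train', 'test' and 'features' as above (the subset fraction and number of models are placeholders, not my exact settings):

    library(randomForest)

    set.seed(1)
    n_models <- 20
    preds    <- matrix(0, nrow(test), n_models)

    for (i in 1:n_models) {
      # drop a different random slice of rows each time, which
      # dampens the influence of any single outlier
      idx <- sample(nrow(train), round(0.9 * nrow(train)))
      rf  <- randomForest(train[idx, features],
                          log(train$revenue[idx]),   # log-transformed target
                          ntree = 500)
      preds[, i] <- predict(rf, test[, features])
    }

    # average in log space, then transform back to revenue
    submission <- exp(rowMeans(preds))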

Model
My final model was an average of many RFs built on different subsets of the data.

The 'days' variable (the number of days since the restaurant opened) was the most important one. I also converted some of the anonymized variables into dummy variables, treating them as categorical. These two ideas gave the biggest improvements.
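As a rough sketch of those two steps (assuming 'panel' is the combined train/test data frame with the opening date already parsed into a 'Date' column; which P-columns to encode is illustrative here):

    library(dummies)

    # 'days': how long the restaurant had been open, measured back
    # from a fixed reference date
    panel$days <- as.numeric(as.Date("2014-02-02") - panel$Date)

    # expand some of the anonymized P-variables into 0/1 dummy
    # columns, treating them as categorical
    panel <- dummy.data.frame(panel, names = c("P1", "P8"))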

GitHub
View the GitHub repository

Results
My model scored 16.2L (lakh) on the public LB, which ranked 66th, and 17.6L on the private LB, which ranked 14th! There were 2,256 teams in total.

This is my best individual performance on Kaggle! :-)
It wasn't the best competition on Kaggle, but it was certainly the biggest in terms of teams, and the first to cross the 2,000-team mark. Of course, the Otto competition is going to beat that soon.

View Final Results
View Public LB

Views
Working with a small dataset is always challenging in many respects: choosing the model, training it appropriately, preventing over-fitting, and so on.

I'm glad I stuck with RF and made it as stable as possible.

The BAYZ team built a 'perfect submission' which scored 0 on the public LB, putting them in 1st place. How? You can keep track of their forum post and learn how to become a master at over-fitting! Of course, their final private LB rank was way below, but I still think they came up with a winning model (to overfit, not to predict revenue!) and I'm looking forward to learning how they cracked it.

So, this gets me to 105th in the overall Kaggle rankings and among the Top-3 Indians.
Next target is Top-50 and then Top-Indian.

Check out My Best Kaggle Performances

5 comments:

  1. Can you please let me know how you treated the P variables?

    Replies
    1. I kept some of them as they are, while I one-hot encoded others, which were categorical in nature.
      You can find this in the dummy.data.frame call on line 38 of the code.

  2. If you have any documentation, kindly provide it to me.

  3. panel$days <- as.numeric(as.Date("2014-02-02")-panel$Date)

    Can you please explain how you decided on the date 2014-02-02?
    Was it just the current date when you wrote the code?
