Wednesday, December 16, 2015

Rossmann Store Sales


The Rossmann Store Sales competition was held on Kaggle in Nov-Dec, 2015.

Objective
Rossmann operates over 3000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

The objective was to predict the sales of various Rossmann stores in Germany.

Data
The training data consisted of sales from over 1000 Rossmann stores, along with information on promotions, competition, holidays, etc., up to July 2015.

The test data consisted of dates in August and September 2015, for which we had to predict the sales.

Approach
This is a classic time series sales forecasting problem, so I explored two approaches: the standard route of building tree-based and linear models, and time series models like ARIMA.

It quickly became evident from the cross-validation and validation results that ARIMA wasn't working. XGBoost was giving much better results.
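The post does not describe how the validation set was built. As a rough sketch, assuming the last six weeks of the training period are held out to mirror the length of the test window, a time-based split could look like the following. The function name `time_based_split`, the row layout and the 42-day window are illustrative assumptions, not the setup actually used.

```python
from datetime import date, timedelta


def time_based_split(rows, holdout_days=42):
    """Split rows into (train, valid), holding out the most recent days.

    `rows` is a list of dicts with a `date` key (hypothetical layout).
    Rows dated within the last `holdout_days` days go to validation.
    """
    last_day = max(r["date"] for r in rows)
    cutoff = last_day - timedelta(days=holdout_days)
    train = [r for r in rows if r["date"] <= cutoff]
    valid = [r for r in rows if r["date"] > cutoff]
    return train, valid


# Toy data: one store, 100 consecutive days ending on 31 July 2015.
rows = [
    {"store": 1, "date": date(2015, 7, 31) - timedelta(days=i), "sales": 5000 + i}
    for i in range(100)
]
train, valid = time_based_split(rows)
```

Splitting by date rather than at random keeps future sales out of the training fold, which is the usual reason for validating a forecasting model this way.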

A lot of external data was shared and available, but none of it made a big improvement to the model. My final model didn't use any external data.

Building a separate model for each store did not give results as good as a single model built on all the data together, but the store-level models helped when blending.

Model
I built multiple XGBoost models on different subsets of the entire data and averaged them. I merged these with store-level XGBoost, Random Forest and GBM models. The blending of models gave a huge improvement and ultimately led to the stability of the predictions.

Finally, I tweaked the predictions by multiplying them by a factor of 0.98 to get the best fit to the leaderboard (LB).
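A minimal sketch of the averaging and scaling steps, in plain Python. The helper names `blend` and `scale`, the equal default weights and the toy numbers are assumptions for illustration; only the 0.98 factor comes from the post, and the actual blend weights are not stated.

```python
def blend(predictions, weights=None):
    """Weighted average of several models' predictions, row by row.

    `predictions` is a list of equal-length lists, one per model.
    With no weights given, every model counts equally (an assumption).
    """
    n_models = len(predictions)
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must predict the same number of rows")
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


def scale(preds, factor=0.98):
    """Multiply every prediction by a constant factor."""
    return [factor * p for p in preds]


# Toy predictions from two hypothetical models for two store-days.
model_a = [100.0, 200.0]
model_b = [200.0, 400.0]
blended = blend([model_a, model_b])
final = scale(blended)
```

A factor below 1 shrinks every forecast slightly. Since it was tuned against the leaderboard, it is best read as a correction specific to this dataset rather than a general rule.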

I usually share my code on GitHub, but this time I decided against it, since I haven't done anything extraordinary or special.

Results
My model gave an RMSPE of just below 0.10 on the public LB, ranking 66th, and an RMSPE of just below 0.11 on the private LB (in fact, I scored 0.10999!), which ranked me 14th out of 3303 teams.
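For reference, RMSPE is the root mean square percentage error: the square root of the mean of ((actual - predicted) / actual)^2. The sketch below is a plain-Python version that skips days with zero actual sales, which as far as I recall matches the competition's scoring rule; treat that detail, the function name `rmspe` and the toy numbers as assumptions.

```python
import math


def rmspe(actual, predicted):
    """Root mean square percentage error over rows with non-zero actuals."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no rows with non-zero actual sales")
    squared_pct_errors = [((a - p) / a) ** 2 for a, p in pairs]
    return math.sqrt(sum(squared_pct_errors) / len(squared_pct_errors))


# Toy example: two scored days, each off by 10%, plus one closed day.
actual = [100.0, 200.0, 0.0]
predicted = [110.0, 180.0, 50.0]
score = rmspe(actual, predicted)
```

Because each error is divided by the actual value, a miss of the same absolute size costs more on a low-sales day than on a high-sales day.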

It was a lucky jump that came from having chosen a stable model, and it is my best individual performance on Kaggle to date, improving on my 14th rank out of 2256 teams in the TFI competition.

View Public LB
View Final Results

Views
It was a tricky contest, mainly due to the nature of the public and private LB split. It was overwhelming to see so much external data being shared and used. Under other circumstances, it might have played a much more important role.

Congratulations to the winner, Gert, who performed fantastically, staying way ahead of the lot on the public LB with very few submissions and then being stable enough to win on the private LB, again with a big lead.

So, I gained some good points from this contest and moved to 111th in the overall Kaggle rankings. My year-end goal was to be in the Top 100. I'm close, and with the Walmart contest left, I might just make it.

Check out My Best Kaggle Performances

1 comment:

  1. Hello Rohan,

    Hope you are doing well. I am one of your followers.

    Regarding this competition: very nice explanation of the approach.

    Is it possible for you to share the code?

    I am currently working on a similar kind of problem and am not getting the desired results from time series forecasting techniques like SARIMA, so I want to try supervised learning. As of now I have only location and sales, but I can dig into the database for more factors.

    It would be great if you could share the code or a path to it, so that I can get some insights from it.

    mail - anishpurohit.ds@gmail.com

    Hoping for a positive response.

    Thanks in advance, buddy!
    Anish Purohit