The Puzzling World of Logic: Exacerbation Prediction

Saturday, January 10, 2015

Exacerbation Prediction

The Exacerbation Prediction competition was held on CrowdAnalytix in Nov-Dec, 2014.

Objective
Respiratory diseases (asthma, cystic fibrosis, smoking-related diseases, etc.) are among the leading causes of death globally. As a patient's condition deteriorates, they experience 'exacerbations', which are sudden worsenings of symptoms that require immediate emergency medical attention.

The objective was to build a predictive model, using medical and genetic data, that predicts in advance which patients will experience an exacerbation, so that they can be given appropriate medical treatment to prevent or control it.

Data
The training data consisted of ~4000 patients and 1300 variables, along with the true labels of whether each patient experienced an exacerbation or not.
The test data consisted of ~2000 patients, for whom we had to predict the probability of an exacerbation.

Approach
My main idea was to build 2-3 strong classifiers and then build an ensemble with them. With 1300 variables, variable selection / dimension reduction became a must.

I tried tree-based models like Random Forest, GBM, Extra Trees, XGBoost, etc., regression-based models like Logistic Regression, Ridge Regression, etc., and some others like k-Nearest Neighbours, SVM, Naive Bayes, etc.

RF and XGB gave the best results while LR and k-NN were decent. I explored and optimized these. After some tuning, XGB and LR gave much better scores, and k-NN didn't add any improvement.

Model
My final model was a weighted average of XGBoost and Logistic Regression.
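
For readers who haven't built such a blend before, here is a minimal sketch of a weighted average of two models' predicted probabilities. The function name `blend`, the weight `w_xgb=0.7` and the sample probabilities are all made up for illustration; the actual weights used in the competition are not stated in this post.

```python
import numpy as np


def blend(p_xgb, p_lr, w_xgb=0.7):
    """Weighted average of two probability vectors.

    w_xgb is a hypothetical weight for the first model; the second
    model receives 1 - w_xgb so that the weights sum to one.
    """
    p_xgb = np.asarray(p_xgb, dtype=float)
    p_lr = np.asarray(p_lr, dtype=float)
    if p_xgb.shape != p_lr.shape:
        raise ValueError("both prediction vectors must have the same shape")
    if not 0.0 <= w_xgb <= 1.0:
        raise ValueError("w_xgb must lie in [0, 1]")
    return w_xgb * p_xgb + (1.0 - w_xgb) * p_lr


# Made-up probabilities for three patients from each model.
blended = blend([0.9, 0.2, 0.6], [0.7, 0.4, 0.5], w_xgb=0.7)
print(blended)
```

In practice the weight would typically be chosen on held-out data (for example via cross-validation) rather than fixed by hand.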

The XGBoost model was built on 150 variables, which were selected based on the variable importance of some sample tree models.
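
The selection step can be sketched roughly as follows. This is only an illustration under several assumptions: it uses scikit-learn's `ExtraTreesClassifier` as a stand-in for the "sample tree models", synthetic data from `make_classification` instead of the competition data, and a smaller `k` than the 150 variables mentioned above. The helper `top_k_by_importance` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption): 400 rows, 60 numeric features.
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=8, random_state=0
)


def top_k_by_importance(X, y, k, n_models=3):
    """Average feature importances over a few tree ensembles, keep the top k."""
    importances = np.zeros(X.shape[1])
    for seed in range(n_models):
        model = ExtraTreesClassifier(n_estimators=100, random_state=seed)
        model.fit(X, y)
        importances += model.feature_importances_
    importances /= n_models
    # Column indices sorted from most to least important, truncated to k.
    return np.argsort(importances)[::-1][:k]


selected = top_k_by_importance(X, y, k=15)
X_selected = X[:, selected]
print(X_selected.shape)
```

Averaging the importances over several differently seeded models makes the ranking a little less sensitive to the randomness of any single fit.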

The Logistic Regression was built on the top-50 Principal Components.
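
A logistic regression on principal components can be sketched with a scikit-learn pipeline like the one below. The data is synthetic, the number of components is reduced to 10 to suit the small example (the post uses the top 50), and standardizing the variables before PCA is an assumption of this sketch, not something stated above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 400 rows, 60 numeric features.
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=8, random_state=0
)

# Scale -> project onto the leading principal components -> logistic regression.
pca_lr = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression(max_iter=1000),
)
pca_lr.fit(X, y)

# Predicted probability of the positive class for each row.
proba = pca_lr.predict_proba(X)[:, 1]
print(proba[:5])
```

Wrapping the steps in a pipeline means the scaling and PCA are fitted only on the training data and then reapplied unchanged to the test data.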

Results
I stood 4th on the Public LB out of 101 teams, 1st on the Private LB, and finally 2nd on the Private Evaluation. I'm not sharing the exact scores (since they are not public), but my models achieved AUC scores of ~0.845.
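
For anyone unfamiliar with the metric, AUC is the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one, with ties counting as half. A small pure-Python sketch of that definition, using made-up labels and scores (the helper name `auc` is just for illustration):

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
example_auc = auc([1, 0, 1, 0], [0.9, 0.3, 0.4, 0.5])
print(example_auc)  # 0.75
```

This pairwise version is quadratic in the number of rows, so it is meant for understanding only; library implementations such as scikit-learn's `roc_auc_score` are the practical choice.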

So, I stood 2nd! This is the first time I've got a ranking with some prize money! Yay!

Views
When I started this competition, I was looking at a dataset made up entirely of anonymized numeric variables. I wasn't sure how much I could squeeze out of the data, but I put in a lot of time and effort and found some wonderful ideas in the process.

I think my model was a very robust and competitive one, and I was surprised it scored so consistently across multiple test sets.

Overall, it was fun. The Public LB evaluation on CrowdAnalytix is not absolutely ideal, since you can tune your model to overfit the LB. I still love Kaggle's method of evaluating winners.

Thanks to my family, friends and colleagues (especially my flat-mate and colleague Shashwat) for their help and support. This is a big achievement for me, and I'm hoping to perform better in the years to come and, hopefully, call myself one of the best Data Scientists of India :-)

Check out My Best CrowdAnalytix Performances