Friday, July 31, 2015

Exacerbation Prediction of COPD

The Exacerbation Prediction of COPD patients competition was held on CrowdAnalytix from May to July 2015.
It appears to have been a sequel to the first Exacerbation Prediction competition, in which I placed 2nd.

Smoking-related diseases like chronic obstructive pulmonary disease (COPD) are a severe global medical problem, affecting over 50 million people worldwide. As their condition worsens, a fraction of patients experience “exacerbations”. An exacerbation is a sudden worsening of symptoms, such as shortness of breath and increased airway inflammation, that often requires immediate medical treatment and emergency room visits.

The objective was to build a predictive model using medical data which predicts beforehand which patients will experience 'exacerbation' so that they can be provided appropriate medical treatment to prevent/control it.

The train data consisted of 1935 patients and 62 variables related to medical and smoking history, demographics, lung functions, etc. along with the true labels of whether they experienced Exacerbation or not.
The test data consisted of 1324 patients for which we had to predict the probability of Exacerbation.

Having been one of the top finishers in the previous Exacerbation Prediction competition, I followed a similar approach: build 3-4 models and ensemble them.

Unfortunately, this was very hard because my cross-validation (CV) scores and leaderboard (LB) scores did not move together. In the end I tried various subsets and combinations of XGBoost, RandomForest, Logistic Regression and k-NearestNeighbours.

My best model on the public LB was a simple average of XGBoost and Logistic Regression, which is the exact same ensemble I used in the previous Exacerbation contest.
My best model on the private LB was Logistic Regression on the PCA-transformed variables (using the top-7 components).
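
For readers who want to see the shape of that second model, here is a minimal sketch of Logistic Regression on the top-7 principal components. It assumes scikit-learn and NumPy, uses randomly generated stand-in data with the same dimensions as the competition files, and adds a standardization step that the post does not mention. It is not my actual competition code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): same shapes as the competition
# train/test sets, but the values and labels are made up.
X_train = rng.normal(size=(1935, 62))
y_train = (X_train[:, 0] + rng.normal(size=1935) > 0).astype(int)
X_test = rng.normal(size=(1324, 62))

# Standardize (assumption), keep the top-7 principal components,
# then fit a plain logistic regression on them.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=7),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Predicted probability of the positive class for each test patient.
test_proba = model.predict_proba(X_test)[:, 1]
```

Putting the PCA inside the pipeline means the components are learned from the training data only and then applied unchanged to the test data.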

On the public LB my best AUC was 0.767 (XGB + LR), putting me in 11th place, whereas on the private LB my best AUC was 0.769 (LR), putting me in 4th place.
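
As a small illustration of what "simple average" and "AUC" mean here, the sketch below blends two sets of predicted probabilities and scores them with a pairwise AUC. The labels and probabilities are invented toy numbers, chosen so that the blend happens to beat both inputs; they say nothing about the real competition data, and the helper names are hypothetical.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_ensemble(*prediction_lists):
    """Element-wise mean of several lists of predicted probabilities."""
    return [sum(ps) / len(ps) for ps in zip(*prediction_lists)]


# Toy data (made up for illustration only).
labels = [0, 0, 1, 1, 0, 1]
xgb_proba = [0.2, 0.6, 0.7, 0.4, 0.1, 0.9]   # stand-in for XGBoost output
lr_proba = [0.3, 0.2, 0.35, 0.7, 0.4, 0.8]   # stand-in for Logistic Regression output

blend = average_ensemble(xgb_proba, lr_proba)

print("XGB AUC:  ", auc(labels, xgb_proba))
print("LR AUC:   ", auc(labels, lr_proba))
print("Blend AUC:", auc(labels, blend))
```

In this toy case each model misranks one patient pair, but they misrank different pairs, so averaging cancels the errors. That is the usual motivation for blending models that disagree.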

So, I stood 4th and won some more prize money! (Who wants a party?)
This also means I've been in the Top-5 in 3 of the 4 CrowdAnalytix competitions I've participated in.

I think the evaluation system is absolutely useless. The winners were decided solely based on the best private LB score. Kaggle does the same, but forces players to choose two submissions for evaluation. Here, ALL private submissions were evaluated and the best one was chosen.

I see a lot of cons here:

1. Players can try out all sorts of models and submit them all, and the more submissions a player makes, the likelier it is that one of them lands near the top.
2. Players don't know which model will be the final best model. So, if they made 100 submissions, are they supposed to track all 100 of them and submit the one that CA chooses as best? Are you kidding me? I had a tough time identifying which model of mine finally gave the best private LB score.
3. What sense does it make when one model fits best to public LB and another fits best to private LB?
4. Winning depends more on luck. The top models are likely to be simply the luckiest fits to the private test set. I'm not sure how useful this would be to the client.
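
The first and last points can be illustrated with a toy simulation. In the sketch below every "submission" is pure random noise, so its true AUC is 0.5, yet the best private score keeps creeping upward as more submissions are scored. The patient count, class balance and number of submissions are arbitrary assumptions, not the competition's figures.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(42)

# Hypothetical private test set: 200 patients, 60 of them positive.
labels = [1] * 60 + [0] * 140
n_submissions = 50

# Each submission is random noise, i.e. a model with no real skill.
private_aucs = []
for _ in range(n_submissions):
    scores = [rng.random() for _ in labels]
    private_aucs.append(auc(labels, scores))

# Best private AUC seen after the first k submissions, for k = 1..50.
running_best = []
best = 0.0
for a in private_aucs:
    best = max(best, a)
    running_best.append(best)

print("Best after  1 submission: ", round(running_best[0], 3))
print("Best after 10 submissions:", round(running_best[9], 3))
print("Best after 50 submissions:", round(running_best[-1], 3))
```

If the organizers pick each player's best private score after the fact, this upward drift is rewarded even though none of the simulated models is better than a coin flip.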

Kaggle has a much better, more robust and more stable evaluation system, and I really hope CrowdAnalytix figures something out soon, or else it's just going to be a series of lottery competitions.

Nonetheless, I'm happy with my performance. Another win under my belt, and I'm looking forward to adding more in the future!

Read a blog post about the 7th place solution by Triskelion on ML Wave.

Check out My Best CrowdAnalytix Performances
