The Carcinogenicity Prediction competition was held on CrowdAnalytix from July to September 2015.
Objective
Carcinogenicity (the ability of an agent or exposure to increase the incidence of cancer) is one of the most crucial aspects of evaluating drug safety.
The objective was to predict the degree of carcinogenicity of compounds, which is measured through TD50 (Tumorigenic Dose rate).
Data
The training data consisted of compounds described by over 500 variables covering physical, chemical and medical features, along with their corresponding TD50 values. About 60% of the TD50 values were 0; the rest were non-zero, with a few outliers.
The test data consisted of compounds with the same features, for which we had to predict the TD50 value.
Approach
This was a weird contest. Within 3-4 days of exploring the data, I found a key insight, and it proved to be a game changer.
So, what was this golden insight? It was the evaluation metric: RMSE.
The target variable (TD50) had many zeros, and the rest were positive continuous values. RMSE as a metric can very easily be skewed by outliers.
The train data had two values above 20,000. Predicting them accurately (i.e. predicting something greater than 20,000 for those rows) would reduce the RMSE by more than 50%. So, assuming such outliers existed in the test data too, I knew this was where the maximum boost in score would come from.
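To make that concrete, here's a toy numeric sketch (made-up numbers, not the actual competition data) of how badly a couple of ~20,000-scale targets can dominate RMSE:

```python
# A toy illustration (made-up numbers, not the competition data) of how a
# couple of ~20,000-scale targets dominate RMSE.
import numpy as np

rng = np.random.default_rng(0)

n = 1000
y = np.zeros(n)
y[:400] = rng.uniform(1, 500, size=400)   # roughly 60% zeros, the rest small positives
y[0], y[1] = 22000, 21000                 # two extreme outliers, like the train data

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

all_zeros = np.zeros(n)                   # the 'all zeros' benchmark
nail_outliers = np.zeros(n)
nail_outliers[:2] = 25000                 # predict ~25k for just the two outlier rows

print(round(rmse(y, all_zeros)))          # ~980: dominated by the two huge errors
print(round(rmse(y, nail_outliers)))      # ~240: a massive drop from fixing 2 of 1000 rows
```

Getting just those two rows roughly right cuts the RMSE dramatically, even though every other prediction is unchanged.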
All the participants were hovering around scores in the 1700s... and most of the usual models were not performing better than the benchmark 'all zeros' submission! That was proxy validation that there had to be outliers in the test set too.
I built a model to classify outliers. In the train data, only the two rows with TD50 > 20,000 got a target value of '1'; the rest were '0'. I scored the classifier on the test set, took the top-3 predicted rows and used 25,000 as their prediction. And BINGO! The 2nd one dropped my RMSE from the 1700s to ~900. Almost a 50% drop!
That's what you call a game-changer :-)
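For the curious, here's a rough sketch of what that outlier-flagging step could look like with scikit-learn; the variable names (train, test, y_train) are placeholders for the competition data, and this is the idea rather than the exact code I used.

```python
# A rough sketch of the outlier-flagging trick using scikit-learn.
# 'train', 'test' and 'y_train' are placeholders for the competition data,
# assumed to be loaded already; this shows the idea, not the exact code used.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Binary target: only the two rows with TD50 > 20,000 are labeled 1
is_outlier = (y_train > 20000).astype(int)

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(train, is_outlier)

# Probability that each test row is one of those extreme outliers
proba = clf.predict_proba(test)[:, 1]

# Take the top-3 most outlier-like test rows and set their prediction to 25,000;
# everything else stays at the 'all zeros' baseline
predictions = np.zeros(len(test))
top3 = np.argsort(proba)[::-1][:3]
predictions[top3] = 25000
```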
There are pros and cons.
Pros are that it was definitely a 'smart trick' rather than a 'sophisticated model', which I accepted and mentioned on the forum too. It was a neat hack applied to a poor evaluation criterion.
Cons are that, of course, it doesn't lead to the best model. And worse, the result was effectively determined by just one or a few rows, making the rest of the test set worthless.
Model
For the remaining observations, I used a two-step modeling approach.
I first built a binary classifier to predict zeros vs non-zeros. Used RandomForest for this.
I then built a regressor to predict the TD50 value, using it only for the observations the binary classifier labeled as non-zero. Used RandomForest for this too.
For both the binary classifier and the regressor, I subsetted the train data by removing all rows where the TD50 value was > 1,000 (treating them as outliers).
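Here's a minimal sketch of that two-step setup, again with scikit-learn, placeholder variable names and illustrative hyperparameters rather than my exact code:

```python
# A minimal sketch of the two-step model with scikit-learn; 'train', 'test'
# and 'y_train' are placeholders, and the hyperparameters are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Drop the extreme rows (TD50 > 1,000) before fitting either model
keep = y_train <= 1000
X, y = train[keep], y_train[keep]

# Step 1: zero vs non-zero classifier
clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X, (y > 0).astype(int))

# Step 2: regressor trained only on the non-zero rows
reg = RandomForestRegressor(n_estimators=500, random_state=42)
reg.fit(X[y > 0], y[y > 0])

# Combine: predict 0 unless the classifier flags the row as non-zero
pred = np.where(clf.predict(test) == 1, reg.predict(test), 0.0)
```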
Results
I was 1st on the Public LB and 1st on the Private LB too.
This was my first Data Science contest where I finished 1st. Yay!
Not a really good one, but I'll take it :-)
Congrats to Sanket Janewoo and Prarthana Bhatt for 2nd and 3rd. Nice to see all Indians on the podium!
Views
The evaluation metric became the decider for this contest. A lesson for me: sometimes a simple approach can make a BIG DIFFERENCE.
That makes it VERY IMPORTANT to explore the data, understand the objective and the evaluation metric, and always do some sanity checks before diving deep into models and analysis. I've learnt a lot of these things from top Kagglers, and I'm sharing one of them here today, hoping it helps someone else learn and contribute to the development, improvement and future of Data Science.
Data can do magical things sometimes :-)
Check out My Best CrowdAnalytix Performances