Saturday, September 12, 2015

Carcinogenicity Prediction of Compounds

The Carcinogenicity Prediction competition was held on CrowdAnalytix in Jul-Sep, 2015.

Carcinogenicity (the tendency of an agent or exposure to increase the incidence of cancer) is one of the most crucial aspects of evaluating drug safety.

The objective was to predict the degree of carcinogenicity of compounds, which is measured through TD50 (Tumorigenic Dose rate).

The train data consisted of compounds described by over 500 variables covering physical, chemical and medical features, along with their corresponding TD50 values. About 60% of the TD50 values were 0; the rest were non-zero, with a few outliers.

The test data consisted of compounds with the same features, for which we had to predict the TD50 value.

This was a weird contest. Within 3-4 days of exploring the data, I found a key insight that proved to be a game changer.

So, what was this golden insight? It was the evaluation metric: RMSE.

The target variable (TD50) had many zeros and the rest were positive continuous values. RMSE squares every error before averaging, so a handful of outliers can very easily dominate the metric.

The train data had two values above 20,000. Predicting them as large values (greater than 20,000) would reduce the RMSE by more than 50%. So, assuming the test data contained similar outliers, I knew this would give the maximum boost in score.
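To make the effect concrete, here is a minimal sketch of how one extreme row can dominate RMSE. The numbers are a made-up toy target, not the contest data, and the `rmse` helper is written out only for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy target (assumption, NOT the contest data):
# 60 zeros, 39 moderate values and a single extreme value.
y_true = [0.0] * 60 + [100.0] * 39 + [25000.0]

# Benchmark-style submission: predict zero everywhere.
all_zeros = [0.0] * len(y_true)

# Same submission, except the extreme row is predicted as 20,000.
outlier_caught = all_zeros[:-1] + [20000.0]

baseline = rmse(y_true, all_zeros)
improved = rmse(y_true, outlier_caught)

print(f"all zeros:      {baseline:.1f}")
print(f"outlier caught: {improved:.1f}")
print(f"reduction:      {1 - improved / baseline:.0%}")
```

On this toy data, getting one row roughly right cuts the RMSE by about 80%, while the other 99 predictions stay at zero. The prediction does not even need to be exact; it only needs to be in the right range.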

All the participants were stuck at scores in the 1700s... and most of the usual models were not performing better than the benchmark 'all zeros' submission! That was indirect evidence that there had to be outliers in the test set too.

I built a model to classify outliers. In the train data, only two rows (the ones with TD50 > 20,000) had target value '1'; the rest were '0'. I scored the classifier on the test set, took the top-3 predicted rows of the test set and used 25,000 as the prediction. And BINGO! The 2nd one dropped my RMSE from the 1700s to ~900. Almost a 50% drop!
That's what you call a game-changer :-)
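The override step can be sketched roughly as below. This is not the original contest code: `override_top_k` is a hypothetical helper, and the scores and base predictions are invented stand-ins for the outlier classifier's output on the test set.

```python
def override_top_k(predictions, outlier_scores, k=3, outlier_value=25000.0):
    """Return a copy of predictions with the k highest-scoring rows
    replaced by outlier_value."""
    # Row indices sorted from most to least outlier-like.
    ranked = sorted(
        range(len(outlier_scores)),
        key=lambda i: outlier_scores[i],
        reverse=True,
    )
    result = list(predictions)
    for i in ranked[:k]:
        result[i] = outlier_value
    return result


# Hypothetical outlier-classifier scores for six test rows.
scores = [0.01, 0.40, 0.02, 0.35, 0.00, 0.90]
# Hypothetical base predictions for the same rows.
base = [0.0, 12.0, 0.0, 30.0, 0.0, 5.0]

submission = override_top_k(base, scores, k=3)
print(submission)
```

With only two positive rows in the train data, the classifier's scores are best treated as a ranking rather than as probabilities, which is why the sketch takes the top few rows instead of applying a threshold.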

There are pros and cons.
Pros are that it was definitely a 'smart trick', and not really a 'sophisticated model', which I accepted and mentioned on the forum too. It was a neat hack applied on a poor evaluation criterion.
Cons are, of course, it doesn't lead to the best model. And worse, the result was technically determined by just one or a few rows, making the rest of the test set worthless.

For the remaining observations, I used a two-step model approach.

I first built a binary classifier to predict zeros vs non-zeros. Used RandomForest for this.
I then built a regressor to predict the TD50 value, applying it only to the observations that the binary classifier had classified as non-zero. Used RandomForest for this too.

For the binary classifier and regressor, I subsetted the train data by removing all rows where the TD50 values were > 1000 (considering them as outliers).
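The two-step approach above could look something like the following scikit-learn sketch. It is not the original contest code: the data is synthetic, the function names and hyperparameters are placeholders, and fitting the regressor only on the non-zero rows is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def fit_two_stage(X, y, cap=1000.0, seed=0):
    """Fit a zero/non-zero classifier plus a regressor for the non-zero rows."""
    keep = y <= cap  # drop rows treated as outliers (TD50 > 1000)
    X, y = X[keep], y[keep]

    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X, (y > 0).astype(int))

    # Assumption: the regressor is trained on non-zero rows only.
    nonzero = y > 0
    reg = RandomForestRegressor(n_estimators=50, random_state=seed)
    reg.fit(X[nonzero], y[nonzero])
    return clf, reg


def predict_two_stage(clf, reg, X):
    """Predict 0 where the classifier says zero, else the regressor's estimate."""
    is_nonzero = clf.predict(X) == 1
    preds = np.zeros(X.shape[0])
    if is_nonzero.any():
        preds[is_nonzero] = reg.predict(X[is_nonzero])
    return preds


# Synthetic stand-in for the contest data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.where(X[:, 0] > 0.3, 50.0 + 20.0 * np.abs(X[:, 1]), 0.0)
y[:2] = 25000.0  # two extreme rows, removed by the cap before fitting

clf, reg = fit_two_stage(X, y)
preds = predict_two_stage(clf, reg, X)
print(preds[:10])
```

Because the rows above the cap are excluded from training, this pair of models never predicts extreme values on its own; the outlier rows are handled by the separate override described earlier.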

I was 1st on the Public LB and 1st on the Private LB too.

This is my first Data Science contest where I stood 1st. Yay!
Not a really good one, but I'll take it :-)

Congrats to Sanket Janewoo and Prarthana Bhatt for 2nd and 3rd. Nice to see all Indians on the podium!

The evaluation metric became the decider for this contest. A lesson for me: sometimes a simple approach can make a BIG DIFFERENCE.

That makes it VERY IMPORTANT to explore the data, understand the objective and the evaluation, and always do some sanity checks before diving deep into models and analysis. I've learnt a lot of these things from top Kagglers, and I'm sharing one of them here today, hoping someone else learns from it and helps in the development, improvement and future of Data Science.

Data can do magical things sometimes :-)

Check out My Best CrowdAnalytix Performances


  1. Congratulations on first place! Great post. I find this part of data science very interesting and under-appreciated.

    The worker's compensation CAX competition had similar characteristics to a large extent. I dropped my error 50% when my deep learning model correctly predicted a high-ranking claim and I multiplied its prediction by 10x.

    People like MSE because it's common and has convenient mathematical properties, but I agree that other metrics should be considered more often. Either that, or some other way of framing the problem, including removing outliers beforehand. Hopefully the business host understands their goals well enough to know what they want optimized and data scientists can help structure the competition correctly to fit those goals.

    You probably remember this competition: where two-stage classification/regression combinations were very popular. In that case it was MAE, which did reduce the impact of the high values, but the distribution still suggested an initial classifier would be useful. It's a fun technique when you can spot a problem that benefits, and your description here summarizes that well.

    Again, congratulations!

    1. Thanks Mark!
      Nice to read your comment.

      I struggled with the Worker's Compensation on CAX :-(
      Yes, it's similar to the Loan Prediction one on Kaggle, and I'm glad it worked.

      Hope to see more such insights and competitions with wonderful models that change the landscape of Data Science in the future :-)

  2. Thanks Rohan. This is really an eye-opener for me. I tried some basic models at first and then gave up looking at your score and my score on the LB :D

    This is a very clever trick to outsmart the evaluation metric. Very good learning for me. Congratulations. :)