“Most of the smartest people work for somebody else…” Kaggle in the new world of data science
Kaggle’s online data science platform is every software company’s critical new tool
“No matter who you are, most of the smartest people work for somebody else.” -Bill Joy, co-founder of Sun Microsystems
Today, software companies stay competitive by applying a layer of machine intelligence to their product offerings. Google’s generalized AI platform, DeepMind, is seen as its primary pathway to future dominance. However, the data scientists and engineers who create, maintain, and improve these machine intelligence layers are notoriously hard to recruit and retain. Good talent is constantly being poached away. Moreover, the best breakthroughs in algorithmic approaches, the ones that yield significant jumps in accuracy, are often developed by outsiders who think orthogonally to the existing teams and their approaches. As we saw in the IBM TOM case last year, the Watson team made its most significant progress when external developers and engineers were invited to create independent and even competing algorithms.
In 2010, Kaggle capitalized on this trend by launching a platform that hosted predictive modeling and analytics competitions. On one side of the platform are teams of data scientists. On the other side are corporations offering data modeling competitions with prizes worth millions of dollars.
Kaggle creates value through:
- Providing data scientists a single destination to search, filter, and find competitions
- Providing corporations a platform on which to run competition leaderboards, host competition datasets, accept code submissions, and automatically score submissions
- Hosting and curating public datasets
- Rating and curating data science teams
- Distributing analytics content, best practices, and analytics innovations through its blog and discussion forum
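To make the "automatically score submissions" and leaderboard pieces above concrete, here is a minimal sketch of how such a mechanism could work. It is not Kaggle's actual implementation, which this post does not describe. The accuracy metric, the `Leaderboard` class, and the team names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


def accuracy(predictions: dict[str, int], truth: dict[str, int]) -> float:
    """Fraction of held-out rows whose predicted label matches the true label.

    Rows missing from a submission count as wrong (an assumption of this sketch).
    """
    correct = sum(1 for row_id, label in truth.items() if predictions.get(row_id) == label)
    return correct / len(truth)


@dataclass
class Leaderboard:
    """Hypothetical leaderboard: the host keeps the labels hidden and ranks teams."""

    truth: dict[str, int]                                  # hidden labels held by the host
    best: dict[str, float] = field(default_factory=dict)   # best score so far per team

    def submit(self, team: str, predictions: dict[str, int]) -> float:
        """Score one submission and keep the team's best score."""
        score = accuracy(predictions, self.truth)
        self.best[team] = max(score, self.best.get(team, 0.0))
        return score

    def standings(self) -> list[tuple[str, float]]:
        """Teams ordered by best score (highest first), ties broken by name."""
        return sorted(self.best.items(), key=lambda item: (-item[1], item[0]))


# Usage: two made-up teams submit predictions for four held-out rows.
board = Leaderboard(truth={"a": 1, "b": 0, "c": 1, "d": 0})
board.submit("team_x", {"a": 1, "b": 0, "c": 0, "d": 0})  # 3 of 4 correct
board.submit("team_y", {"a": 1, "b": 0, "c": 1, "d": 0})  # 4 of 4 correct
print(board.standings())  # [('team_y', 1.0), ('team_x', 0.75)]
```

Because teams never see the hidden labels, they can only improve their rank by improving their model. That property is what makes a public leaderboard usable as a neutral scoreboard between corporations and outside teams.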
Kaggle captures value through:
- Job postings
- Hosting private competitions where only select groups of data science teams can participate
- Directly matching data scientist teams to corporations with analytics needs
Traditionally, the data science market was one of low multihoming. Data scientists worked for individual corporations, solving a narrow set of problems using a relatively static set of data points. Work was rarely shared outside of the academic community. Network effects were also weak: adding more data scientists to a team brought human capital management issues and salary costs, so bloated teams tended to perform sub-optimally.
Kaggle creates a situation of high multihoming, where independent teams of data scientists can work on many different problems spanning different sectors, analytics techniques, and datasets. The network effects are also immense. The live leaderboard and Scripts, Kaggle’s proprietary tool for sharing code, encourage all participants to learn and innovate against the current best practice. Kaggle also maintains a blog that publishes the prize-winning techniques so they can be applied elsewhere.
The work here is all the more valuable as much of the innovation has moved from the open academic community to walled-off private corporations. Uber poached much of Carnegie Mellon’s robotics research group; Stanford’s most celebrated artificial intelligence professor, Andrew Ng, left to be Chief Scientist at Baidu; and Facebook hired away NYU’s Yann LeCun. Without Kaggle’s platform, data science could revert to a world of low multihoming and low network effects. As Bill Joy accurately identified, in that world companies are constantly left with the second-best talent working on their most important technical problems.
It’s pretty cool how companies can essentially outsource data science development work to the Kaggle community. I remember they launched a specialized consulting service for the oil & gas industry a while back, but had to shut it down due to market cyclicality. Wonder why they haven’t pursued it across a broader set of verticals.