Kaggle: How a Platform Democratizes AI
How a platform initially coded in a Bondi bedroom became the world’s center for data science
Kaggle is a community data science platform that connects amateur and professional data scientists with companies whose problems can be solved with data science. While many organizations, including NASA, which we discussed in class, use a crowdsourced data science model, Kaggle benefits from strong network effects, having built a reputation as the go-to place for data science. It attracts some of the world’s best data scientists, who come for the diversity of projects, the active community, and the reputation and prize money on offer. It also attracts companies that are willing to pay for access to that talent pool (recruitment competitions) and for the opportunity to solve some of their hardest problems (featured competitions). According to ComputerWorld, since its inception in 2010, the Kaggle community has submitted more than four million machine learning models to competitions and shared 170,000 forum posts, more than 250,000 kernels, and 1,000 datasets [https://global-factiva-com.prd1.ezproxy-prod.hbs.edu/redir/default.aspx?P=sa&an=IDGCWA0020170609ed6900002&cat=a&ep=ASE]. Below we identify the core elements and processes that help make Kaggle work, based on both personal experience competing on the platform and firsthand conversations with the Kaggle team.
Screenshot of Kaggle’s Competitions
Unique problems: At any given time, Kaggle hosts 10-20 competitions, each representing a different problem that a company wants to solve. The problems are diverse, spanning many industries and changing over time. During its screening process, Kaggle requires a company to explain why its problem is unique and solvable via data science. In addition to featured and recruitment competitions, Kaggle offers research competitions for companies that have a unique problem but are unsure whether a solution exists.
Quality control: Once the problem has been identified, a Kaggle engineer works with a dedicated resource at the company to review the underlying dataset and the target variable (what the company is looking to predict), and helps the company come up with an evaluation metric if it does not have one already. In addition to design and scope, Kaggle works with the company to define the rules and logistics and to configure the launch of the competition. This process can take up to three months, after which the competition is typically open for a further two to three months.
Prize money: The company puts up a cash prize for the winners (usually a first, second, and third prize) totaling anywhere from $15,000 to $125,000. Featured competitions typically command the largest prizes and are displayed at the top of the competitions webpage.
Screenshot of Kaggle’s infrastructure for data analysis
Ease of use: Competition participants conduct their analysis on the Kaggle platform. Kaggle provides a uniform infrastructure for analysis that lets participants easily import the data and run a solution at scale. Community members often share an exploratory data analysis for each competition, making it easier for others to get started. They are incentivized to do this through a rating system that rewards community members for sharing their work based on how well it is received across the community.
Screenshot of Live Leaderboard
Live leaderboard: Once the data scientist is satisfied with her solution, she submits it, and her code is run against a test dataset to be evaluated. The participant does not have access to the test dataset. This is referred to as “out of sample” testing, and it ensures that the contestant’s model does not overfit the data she has been working with and that the algorithm is robust enough to work on real-life data. The contestant is automatically graded on her accuracy. The definition of accuracy varies by competition but is roughly the total absolute difference between the actual and the model-predicted numbers. Those with the smallest out-of-sample difference tend to receive the highest scores. The scores are generated in real time so that contestants can see their progress on a public leaderboard, also maintained by Kaggle.
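To make the scoring idea concrete, here is a minimal Python sketch of how a leaderboard could rank submissions by total absolute difference on a hidden test set. It is an illustration under stated assumptions, not Kaggle’s actual scoring code: the function names, contestant names, and numbers are all hypothetical, and real competitions each define their own metric.

```python
from typing import Dict, List, Tuple


def total_absolute_error(actual: List[float], predicted: List[float]) -> float:
    """Sum of |actual - predicted| over every row of the hidden test set."""
    if len(actual) != len(predicted):
        raise ValueError("A submission must contain one prediction per test row.")
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def rank_submissions(
    hidden_actual: List[float], submissions: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score each submission out of sample and sort with the smallest error first."""
    scores = {
        name: total_absolute_error(hidden_actual, preds)
        for name, preds in submissions.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical hidden target values that contestants never see.
hidden_actual = [10.0, 20.0, 30.0]

# Hypothetical predictions submitted by three contestants.
submissions = {
    "alice": [11.0, 19.0, 33.0],
    "bob": [10.0, 26.0, 30.0],
    "carol": [10.5, 20.5, 30.5],
}

leaderboard = rank_submissions(hidden_actual, submissions)
for position, (name, error) in enumerate(leaderboard, start=1):
    print(position, name, error)
```

Because the true values stay with the organizer, a contestant cannot tune her model to the test rows directly; she only sees the resulting score and her position.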
Connecting parties: At the end of the contest, the winners are awarded the prize money. Since the winners’ names and solutions are not initially made public, companies pay Kaggle for the winning solution, for access to the contact information of the top contestants, or for both. Companies have typically used Kaggle to recruit top talent worldwide because, even with a winning solution in hand, they still need to integrate it into a production environment, where constraints may limit the use of complex solutions (complexity is something that Kaggle does not penalize).
Kaggle was acquired by Google in 2017. Google has continued to market Kaggle independently but has since integrated it into its cloud platform, so contestants can now analyze larger datasets in a more real-world environment. Google currently provides this service free of charge to contestants under its mission of “democratizing AI for all.”
Student comments on Kaggle: How a Platform Democratizes AI
Linking this directly to our NASA case, Kaggle lets a large volume of inputs gradually converge toward the best solution to a problem. While I think this approach could help researchers find excellent inputs, I’m concerned about the consistency of the output quality, as well as the fit to certain business questions. While this platform might be equipped to optimize problems in the AI space, would it also be useful for finding solutions to reduce child poverty, or is it a hermetic system that only solves problems in the tech space?
Love this writeup!
As a Kaggle user myself, I wonder how Kaggle deals with multihoming. Since several other companies are competing in the online collaborative coding space (like SPOJ, Topcoder), and many more are coming up with interesting differentiating factors (like Numerai), how can Kaggle prevent its user base from multihoming?
Also, I’d be curious to understand the network effects in play better. I can imagine there must be strong cross-side network effects between users and companies, but the same-side effects seem to be weak at best. What is their scaling strategy, in the absence of organic scaling drivers like network effects?
You talk about the community aspect which incentivizes participants. I am always wondering if these kinds of platforms are not just smart outsourcing tools for organizations. Instead of paying in-house personnel high salaries and social security for solving complex problems, organizations use external platforms where highly talented people work, on average, for less than what they would be paid as full-time employees because these people value the community aspect as partial compensation. I guess this is where the job market is moving, and many platforms have the same tension, but a data science platform makes this problem even more obvious due to the need for highly-skilled labor. Therefore, I am not sure this platform really “democratizes AI for all” or rather “outsources AI problems to all”.
Thanks for the good read. I have found Kaggle to be quite beginner-friendly in the sense that it is easy to get started. The community is very welcoming and, as you pointed out, its members often publish their initial findings to help others get started. Nonetheless, Kaggle competitions have enough complexity to attract top data scientists. In this regard, I think Kaggle has found the sweet spot in creating a “game” that is easy to learn but hard to master.
Great read, thanks for sharing! Always cool to learn more about developer culture and the persistence of creating “fun” competitions to gain access to skilled programmers and creative solutions. There seem to be common elements with hackathons and Topcoder in this setting. Interesting to think about how Google plans to grow the platform as the owner. It seems like a great way to introduce people to data science, specifically on the Google cloud platform.
Love Kaggle and look forward to seeing how the platform evolves. Google doesn’t always do things purely altruistically so I wonder what their plan is for integrating the models that the platform produces in Alphabet offerings.