Kaggle: the data scientists hub
If you want to become a data scientist, you need to be a part of the Kaggle community!
Founded in 2010, Kaggle is a crowdsourcing platform that helps companies find machine learning solutions by hosting predictive modeling and analytics competitions. Competitions focus on predictive analytics problems that range from predicting cancer to predicting rental prices. The participants or teams that build the best models, i.e. achieve the highest scores, win the competition and receive a prize ranging from a few thousand dollars to a million dollars, or get recruited.
Since then, it has attracted 600,000 data crunchers from 194 countries with diverse backgrounds, including data scientists, computer scientists, and biologists. The community is active not only in building models, making over 35,000 submissions per day, but also in exchanging ideas and sharing the latest machine learning algorithms. For example, Geoff Hinton and George Dahl introduced the community to deep neural networks, and Tianqi Chen implemented a well-known algorithm with strong predictive power and shared the implementation with the community. As the community grows, its members sharpen their analytics skills and build credentials on Kaggle. Many Kagglers have joined companies ranging from DeepMind to Walmart, and Kaggle itself was recently acquired by Google.
How does it work?
When a company wants to use machine learning to tackle business problems, it can choose an analytics solution provider or turn to the crowd. Netflix, for example, has benefited from crowdsourcing machine learning algorithms. It hosted a competition with a grand prize of 1 million USD, attracted brilliant minds, such as researchers from AT&T Labs and Yahoo!, and improved the accuracy of its recommendation algorithm by 10%.
Kaggle gives companies that want to crowdsource solutions access to high-quality data scientists by hosting machine learning competitions. To help companies set up competitions, Kaggle offers consulting services, such as defining a valuable problem given the data the company provides and preparing the data for the competition. For example, to evaluate the performance of teams, Kaggle needs to set aside some data as a test dataset and define metrics to score the accuracy of the predictions submitted by participants.
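The setup step described above, holding back part of the data and fixing a scoring metric, can be sketched in a few lines of Python. This is only an illustration, not Kaggle's actual tooling: the helper names `split_train_test` and `rmse`, the 20% hold-out share, and the choice of root-mean-squared error as the metric are all assumptions made for the example.

```python
import math
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and hold back a fraction as the hidden test set (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test set; the rest is released for training.
    return shuffled[n_test:], shuffled[:n_test]


def rmse(actual, predicted):
    """Root-mean-squared error, one possible competition metric; lower is better."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy data: (feature, target) pairs invented for this example.
rows = [(x, 2.0 * x + 1.0) for x in range(10)]
train, test = split_train_test(rows)
```

In this sketch participants would see `train` in full but only the features of `test`; the host keeps the test targets private and applies `rmse` to each submitted set of predictions.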
After a competition is launched, Kaggle monitors it and provides tools that help participants experiment with various algorithms. Participants build models locally or online using Kernel, submit their predictions for the test set, and receive scores for their submissions. Leaderboards post the scores of all the teams and hence give teams an incentive to keep improving their models. Participants can choose to share their scripts, i.e. the code they use to analyze the data, publicly and to discuss their problems in the forums.
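The submit-and-rank loop described above can be sketched as follows. This is a simplified illustration rather than Kaggle's implementation: the function name `build_leaderboard`, the team names, and the scores are hypothetical, and the sketch assumes each team is ranked by its single best score.

```python
def build_leaderboard(submissions, lower_is_better=True):
    """Rank teams by their best score.

    submissions: iterable of (team, score) pairs, one per scored submission.
    Returns a list of (team, best_score) pairs, best team first.
    """
    pick = min if lower_is_better else max
    best = {}
    for team, score in submissions:
        # Keep only each team's best score so far.
        best[team] = score if team not in best else pick(best[team], score)
    return sorted(best.items(), key=lambda item: item[1], reverse=not lower_is_better)


# Hypothetical scored submissions for an error metric (lower is better).
submissions = [
    ("team_a", 0.42),
    ("team_b", 0.35),
    ("team_a", 0.31),
    ("team_c", 0.50),
]
leaderboard = build_leaderboard(submissions)
```

Here `team_a` moves to the top after its second submission, which is the kind of visible progress that encourages other teams to keep iterating.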
Once the competition has closed, the competition host awards the prize to the winner. Kaggle assists the company in obtaining an IP license and integrating the winning model into its business. If the competition is for recruiting purposes, the company can screen participants based on their performance, i.e. their score and place on the leaderboard, and on their scripts.
Kaggle creates value by crowdsourcing brilliant minds to help businesses solve problems with predictive analytics while fostering the growth of the machine learning community. Crowdsourcing can create substantial value in this setting because a predictive modeling task has no single correct solution or algorithm, so more participants make it more likely that a strong model will be found. In addition, given a set of metrics, it is easy to evaluate models and screen for the best one.
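The intuition that a larger crowd raises the chance of finding a strong model can be made concrete with a back-of-the-envelope calculation. Assume, purely for illustration, that each participant independently has the same probability p of reaching a target score; the chance that at least one of n participants succeeds is then 1 - (1 - p)^n. The independence assumption, the function name `prob_at_least_one`, and the 1% figure are hypothetical, since real participants share ideas and differ in skill.

```python
def prob_at_least_one(p, n):
    """Probability that at least one of n independent participants succeeds.

    p: assumed per-participant probability of reaching the target score.
    n: number of participants.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# Show how the chance of success grows with the size of the crowd.
for n in (1, 10, 100, 1000):
    print(n, round(prob_at_least_one(0.01, n), 3))
```

Under these assumptions the probability rises from 1% with a single participant to roughly 63% with 100 participants, which is the effect the paragraph above describes.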
Kaggle also helps connect companies with talented data scientists. Since data science is a very new field, many data scientists are self-taught. They can build up their profiles on Kaggle and attract recruiters.
Kaggle captures value by charging a platform licensing fee, a problem-setup consulting fee, and fees for additional services, such as a custom evaluation metric. For recruiting competitions, Kaggle also charges a recruiting fee per hire.
In addition to providing a full service for setting up competitions, Kaggle has developed tools for participants, such as Kernel, and is dedicated to building and maintaining a healthy community. Kernel is a product where data scientists can build and test models online easily. It also allows them to share their code publicly and ask for feedback from the community, which further promotes the openness of the platform and encourages members to exchange ideas and learn from each other. As a result, participants are more likely to stay with the community because they can always learn new things and strengthen their analytical skills. Furthermore, when choosing projects, Kaggle makes sure that each challenge is both interesting and approachable, which encourages more data scientists to participate and contributes to the stickiness of the community.
Now that Kaggle has joined Google, it has access to Google Cloud, which allows it to hold competitions that require huge computational resources. This access broadens the variety of analytics tasks that can be crowdsourced. The Kaggle community will then play an even bigger role in the advancement of AI.
Student comments on Kaggle: the data scientists hub
Thanks for sharing, Jing! I often hear data science described as the intersection of statistics, computer science, and “domain knowledge”—a substantive understanding of the field that one is applying the other two tools to. That understanding influences everything from problem formulation to the ultimate applicability of a given machine learning solution (beyond just its accuracy). Both seem to lie outside of the scope of a typical Kaggle competition. I worry that Kaggle might thus inappropriately reduce data science problems to algorithm challenges and prevent companies from fully realizing the field’s potential.
Have you encountered this concern in your research, and do you think there are ways that Kaggle could better promote involvement of its community in the entire data science process, rather than just predictions?
Thank you for the comment! Yes, I agree with you that domain knowledge is very important and that predictive modeling is only part of the data science process. This is also why we need in-house data scientists. For exactly these reasons, I think Kaggle should stay in the predictive modeling space and promote advanced algorithms.
If we take a closer look at the competitions, most of them involve image recognition and NLP, where a good understanding of algorithms is more important than domain knowledge.
In general, we can split data science tasks into two parts: extracting insight from data, which requires domain knowledge and a lot of effort, and building predictive models. Kaggle adds value to the second part if companies have already done the first and have confined the problem to a predictive modeling task.
Yeah, the biggest news of the month in the crowdsourcing industry is probably Google’s acquisition of Kaggle. I’m curious about the implications of the deal for crowdsourcing. Google used to run competitions with Kaggle, but now they are crowdsourcing themselves. Do you think the acquisition is more about Kaggle’s community than its technology?
Based on the announcement, Kaggle will run as a separate entity, and hence the business model will remain the same. If Google wants to crowdsource, it still needs to run competitions on Kaggle. I do not think acquiring Kaggle means acquiring its community. The deal is beneficial to both sides because Kaggle now has better infrastructure support, and Google can promote its cloud service/platform and compete with IBM and Microsoft.
While crowdsourcing companies like Kaggle can help to invite minds to solve complex problems related to machine learning, the challenge in using crowdsourcing in knowledge domains is accurately communicating the issue in the form of a challenge so that it can be resolved. It requires a great amount of interaction with the client to understand their needs. Once that is taken care of, it can certainly help reduce research costs.
Thank you for your comment, Charu. I agree, and I think predictive analytics is a good fit because everything can be well defined and measured.