import.io & Web Scraping for the Average Joe or Jill
Why should just the big-shots have access to an unlimited flow of data from the unstructured web?
Data, data everywhere, but not a byte to crunch. This expertly-crafted opening sentence captures the sentiment of many an aspiring analyst as he or she crawls through the tubes of the internet, desperate for tabular, structured data to feed into a project. While companies might have more structured data than they know how to handle, the opposite is true for most people outside of a firm’s org structure.
This is understandable: data is a competitive advantage, so sharing ought to be done, if at all, judiciously. But in so doing, data owners take the limitless power of the crowd down a peg.
Until the storied data singularity occurs, then, crawler & scraper technology fills the gap. Borrowing a page* from search engines, these automatons wander the web, downloading & storing whatever data they’re instructed to. With a little Python & a lot of moxie, users can build datasets in a fraction of the time that manual harvesting would take.
Some companies have fed their cash cows on the grass of web scraping tech. In a well-known lawsuit, internet artifact craigslist.org set their legal team on 3Taps, a company that scraped all Craigslist postings & sold them to third parties (most famously, padmapper.com built its reputation on Craigslist data scraped by 3Taps).
To be clear: scraping is not in-and-of-itself illegal. In fact, the technology does little more than replicate the same requests of servers that a user’s browser might. So long as a scraper doesn’t overload the server with requests (a popular weapon in “Denial of Service” attacks, often used by ne’er-do-wells to bring down webservers), the entire process is on the up-and-up.
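One way to keep a scraper from overloading a server is to space out its requests. The sketch below is a minimal illustration, not anything import.io or 3Taps is known to use: the `Throttle` class, its `min_interval` parameter, and the injectable `clock` and `sleep` arguments are all hypothetical names chosen for this example.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests to one server."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would call `wait()` immediately before each request, so that however fast the parsing loop runs, the target server sees at most one request per interval.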
Where 3Taps went wrong, then, was monetizing the asset on the back-end: data to which Craigslist held exclusive rights**. But the fruit of the data tree proved too sweet for other companies to resist a taste in the wake of the Craigslist fiasco.
Enter “import.io.” This young company has two notable components to their business model: the first is a user-friendly, powerful, and graphical utility for crawling the web. Using import.io solutions for the first time, this author was able to harvest thousands of posts from a forum with just 15 minutes of setup–no lines of code, and only a few trips to the documentation. To put that in perspective, using some of the more popular code-based scraper solutions, only an expert user could have built that sort of scraper in so little time.
https://youtube.com/watch?v=cdmsTxu45-c
The second component of import.io is where the company got especially clever. Much in the same way that Duolingo provides free language lessons in exchange for crowd-sourced translation services, import.io puts its army of users to work. Their enterprise package boasts that “every day our powerful infrastructure collects millions of data records from the web,” access to which they sell at a premium.
For a single company to offer just this second component would be a tremendous burden. Every site’s structure is different, and that typically calls for a custom scraper program. No company could write & maintain scrapers for every site on the web. So, import.io doesn’t. With every user building scrapers, tabularizing & organizing the data, and (most importantly) updating their scraper when a site changes, import.io has an unprecedented army of data harvesters at their fingertips.
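To see why each site tends to need its own scraper, consider a hand-written extractor. The sketch below uses only Python’s standard library and assumes, purely for illustration, a forum that wraps each post in `<div class="post">`; the `PostExtractor` name and that markup are hypothetical. If the site renamed the class, the scraper would silently return nothing until someone updated it.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every <div class="post"> (assumed markup) as a row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one cleaned-up string per post
        self._depth = 0    # > 0 while inside a post div (tracks nested divs)
        self._buf = []     # text fragments of the post being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1              # nested div inside a post
        elif ("class", "post") in attrs:
            self._depth = 1               # entering a new post
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:          # post closed: normalize whitespace
                self.rows.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)
```

Feeding a page to `PostExtractor().feed(html)` leaves the extracted posts in `rows`, ready to be written out as a table.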
For import.io and the few competitors in this space, the future is ostensibly bright. Because their critical mass of data likely doesn’t depend on any one site (like 3Taps & padmapper did), they’re able to adjust to companies’ responses to scraping. This is clearly on their radar, as their documentation pretty explicitly states that they respect target sites’ wishes.
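A common way for a site to express its wishes to crawlers is a robots.txt file. Whether import.io consults it is an assumption here, not something the post confirms. The sketch below uses Python’s standard `urllib.robotparser` to check a URL against a made-up robots.txt; the `is_allowed` helper, the sample rules, and the `example-bot` user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules without any network call
    return parser.can_fetch(user_agent, url)
```

With the sample rules, `is_allowed(SAMPLE_ROBOTS_TXT, "example-bot", "https://example.com/private/data")` is refused, while a URL under `/forum/` is permitted.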
The most obvious room for growth for import.io is the flexibility of their scraper solution. Programmatic scraping might be difficult, but it’s almost infinitely extensible—something a graphical tool can rarely claim. Until big-dog open-source solutions like Scrapy & Selenium are no longer necessary, import.io will not be able to address the entire market.
*Pun.
**There’s more to it, including 3Taps’ circumvention of Craigslist’s efforts to block them, but this is sufficient for our purposes.
So does import.io openly acknowledge breaking terms of service of all the websites it scrapes from by selling the data? I’m curious about their long term strategy and how they’re thinking about the legal side of things.
I think they’re taking the classic platform approach: all they do is build tools, it’s up to users to apply them ethically.
To wit, their (T&C)[https://import.io/terms-and-conditions] reads:
8.2. If you wish to use the Service to convert any Web Data into a table or data or a structured API (or any other functionality offered by the Service) that you do not own, you must obtain the consent of or an appropriate licence from the licensors or owners of such Web Data before you process all of or any portion of such Web Data through the Service. You must comply with requests from third party rights holders to cease to deal in any way with any Web Data that they own when you do not possess appropriate licences to deal with such Web Data.
oops, inverted my markdown syntax; here’s the link: [T&C](https://import.io/terms-and-conditions)