Sorting and securing at scale: machine learning at Dropbox

How Dropbox leverages machine learning in its quest to help users maintain focus.

As the steward of hundreds of billions of documents belonging to over 500 million users [1], Dropbox is starting to rely on machine learning more heavily than ever. Dropbox researchers have invested years of study into modern work and how people spend their time doing that work. Having found that people waste a significant amount of productive time on three specific activities — organization, contextualization, and prioritization [2] — Dropbox views machine learning as a critical tool to help users avoid productivity potholes and maintain their focus.
 
This is objective is difficult to achieve because Dropbox needs to sift through and make sense of vast amounts of content while providing an experience tailored to each and every user.  For example, while most web search engines take into account a user’s search habits (e.g. Google search history), Dropbox must go a step further to distinguish which documents should be available to each user [1]. Further complicating this task is the constantly changing nature of these documents. As a hub for creative collaboration [3], documents in Dropbox are continuously being updated by multiple collaborators [4]. This dynamic process means that when Dropbox indexes a certain file with certain search criteria, within a few seconds those indexes may be rendered irrelevant due to edits by various users and new indexes must be assigned. 
 
Given these conditions, machine learning is going to become ever more important for Dropbox. 
 

How machine learning helps address these issues:

Dropbox’s first notable feature that harnessed machine learning was their document scanner. “More than 20 billion image and PDF files have been stored in Dropbox, and of those, 10–20% are photos of documents” [5]. Images of documents pose an issue because the text within them cannot be searched. As far as the computer is concerned this “text” is just a group of pixels, not text [6]. Machine learning offered a solution. Dropbox built an in-house Optical Character Recognition tool [7] that leveraged machine learning to recognize, extract, and index the text in these images so that users can search for it.
 
In the short term, Dropbox is continuing to find ways to utilize machine learning in the features they build. For instance, they recently released a redesigned search engine called “Nautilus” [7] which uses machine learning to solve the problems of search described previously. Farther down the line, Dropbox appears committed to investing in machine learning expertise and to making it a foundational component of the company. Dropbox’s job page clearly illustrates this emphasis, with open listings for machine learning engineers, product managers, PhDs, as well as college interns [8].

 

Searching for text in an image. Source: https://blogs.dropbox.com/dropbox/2018/10/search-images-text-ocr/

Recommendations:

A natural extension of their current use of machine learning is in securing user data. As the guardian of immense amounts of private user data, Dropbox is an attractive target for hackers. Machine learning, with its ability to rapidly process information at scale, could be expected to more quickly recognize and block suspicious entities attempting to accounts and, accordingly, should be pursued in the immediate and short term. According to The Times of Israel, Dropbox is hoping to shift the focus of their Tel Aviv team to security and machine learning, and also potentially acquire a startup to further this endeavor [9].  In addition, I would advise them to investigate the possibilities of machine learning to help enhance individual document and data security processes, including, for example, in connection with password and verification best practices.
 
Machine learning relies on large datasets and constant learning opportunities in order to evolve and improve. While Dropbox’s user base offers great scale, Dropbox’s approach to product releases may delay its ability to provide the machine learning protocols with lots of learning opportunities. While some companies in the technology industry have been known for moving fast and iterating once a product is live (for example Facebook noted this strategy in their IPO filings [10]), Dropbox is known for holding every product to a very high bar, only releasing it when majority of the kinks have been worked out. (This mindset is demonstrated in their values such as “Sweat the details” and “Be worthy of trust” [11]). This approach of only releasing products when they’re highly evolved is in tension with machine learning’s need to be exposed to lots of use before it can evolve. One way I recommend bridging this gap would be to look into using simulations of historic user activity in order to test and iterate with new features that rely on machine learning. 
 
However, changing the company’s philosophy on product launches is a major strategic decision and far from a sure success. While this careful approach to product releases has served Dropbox well so far, will the benefits of machine learning push them to release products earlier in the development process? Moreover, with machine learning becoming a focus for many technology companies, how can Dropbox expect to outcompete the competition for machine learning talent? 
 
 
 (800 words)
 
Sources:
  1.  https://blogs.dropbox.com/tech/2018/09/architecture-of-nautilus-the-new-dropbox-search-engine/
  2. https://blogs.dropbox.com/tech/2018/09/machine-intelligence-at-dropbox-an-update-from-our-dbxi-team/
  3. www.dropbox.com 
  4.  https://www.cbronline.com/news/dropbox-search-engine
  5.  https://blogs.dropbox.com/dropbox/2018/10/search-images-text-ocr/ 
  6. https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/ 
  7.  https://venturebeat.com/2018/10/09/dropboxs-autoocr-can-index-text-from-pdfs-and-images/ 
  8.  www.dropbox.com/jobs 
  9.  https://www.timesofisrael.com/dropbox-seeks-to-expand-operations-in-israel-possibly-acquire-startups/
  10. https://www.sec.gov/Archives/edgar/data/1326801/000119312512034517/d287954ds1.htm#toc287954_10
  11. https://www.sec.gov/Archives/edgar/data/1467623/000119312518055809/d451946ds1.htm 

Previous:

Streaming as an engine for innovation, content creation and improved monetization in the gaming industry.

Next:

Can AI Help Tyson Foods Lead The Protein Revolution?

Leave a comment