On the Shoulders of Giants: How Amazon Uses Machine Learning and Consumer Data to Disrupt its Partners

What if I told you…that the flow of aggregated consumer shopping data could be multi-directional?

Machine learning (ML) must be applied to proprietary, relevant, and large datasets to generate differentiated value [1]. Some of the most common ML models in use today, including neural networks, are not recent inventions [2]. The size and quality of the data that has only recently become available for training has been the driver of 10X gains in ML applications. The increased availability of compute resources has also contributed, though that is not a focus of this essay. Amazon, the world’s largest eCommerce retailer, has an unprecedented and growing trove of consumer data [3]. This essay will explore Amazon’s use of machine learning to prioritize its investments in new product development, specifically with regard to private label consumer products and durables.

Over the past few years, Amazon has expanded from being an online retailer of third-party goods to also selling its own private label goods [4]. Traditionally, companies in the consumer products and durables space (CPG companies) have made investments in new product development based on a combination of executive “gut” and pseudo-analytic methods such as consumer focus groups or surveys with small sample sizes. Amazon uses a different approach. It gathers significant consumer shopping data and decides which private label brands to launch based on analysis of that data [5]. Amazon can use machine learning to its advantage more than other consumer product manufacturers can because it has access to consumer shopping data from Amazon.com. This data, and the machine learning techniques Amazon can apply to it, have become its edge. Amazon has used that edge to launch dozens of private label consumer product lines over the past few years.

CPG companies that sell their products on Amazon.com have several financial and strategic considerations to evaluate when deciding whether to use Amazon’s platform. Originally, the major consideration was margin pressure, as Amazon takes a meaningful percentage of third-party sales as profit. CPG companies also weigh the strategic cost of losing control over the brand experience, or the way in which their end-customers interact with their products online. The loss of consumer shopping data is also a consideration. These considerations pale in comparison to the new reality in which Amazon markets and sells its own private label goods. In today’s landscape, CPG companies that sell on Amazon must also consider the fact that the consumer data they are giving up to Amazon is being used against them. Their newest competitor is using it to identify and refine the most successful products in order to steal market share.

Amazon’s management (read: Jeff) might consider a world in which value accrues to and leaves Amazon via:

  1. increased margin from sales shifting to Amazon private label goods, and
  2. loss of consumer shopping data from the flight of CPG companies off of Amazon’s platform.

If CPG companies leave Amazon’s platform, they take with them the consumer shopping data which Amazon relies on for their private label new product development. However, Amazon would still be collecting data from consumer shopping of their own private label products. Because the scale of data available is absolutely crucial to successful use of machine learning, Amazon’s new product development may be at risk if the net flux of consumer shopping data on their platform is negative.
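The "net flux" argument above can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions: the `DataFlow` fields, the `net_data_flux` function, and every figure are hypothetical and are not drawn from Amazon or from the cited sources. It assumes that some share of third-party shopping activity leaves with departing brands and that some fraction of that demand shifts to private label goods, where Amazon can still observe it.

```python
from dataclasses import dataclass


@dataclass
class DataFlow:
    """Hypothetical monthly shopping-data volumes (all values are illustrative)."""
    third_party_events: float  # shopping events observed on CPG brands' listings
    departure_share: float     # fraction of that activity leaving with departing brands
    recapture_rate: float      # fraction of departed demand that shifts to private label


def net_data_flux(flow: DataFlow) -> float:
    """Return the net change in observed shopping events (negative means data is lost)."""
    lost = flow.third_party_events * flow.departure_share
    recaptured = lost * flow.recapture_rate
    return recaptured - lost


# Illustrative scenario: 100 (million) events, a quarter of which leave,
# with half of that departed demand recaptured by private label products.
scenario = DataFlow(third_party_events=100.0, departure_share=0.25, recapture_rate=0.5)
print(net_data_flux(scenario))
```

Under these assumptions the flux stays negative unless private label recaptures all of the departed demand, which is one way to read the essay's concern about the scale of data available for new product development.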

Retention and use of data are at the heart of both Amazon’s future product development and the decision by CPG companies on whether to sell on Amazon’s platform. In order to retain consumer shopping data, Amazon could fundamentally change how it structures CPG companies’ interaction with its platform. Access to consumer shopping data should be a part of negotiations with CPG companies. Just as Amazon’s percentage take on sales is negotiated, so should the amount of data shared be. This would allow Amazon to retain the massive ecosystem of customer shopping data, while allowing CPG companies to access data related to their own branded sales.

Two open questions for the community:

  1. If Amazon were to adopt the proposed suggestion, how much access to other CPG companies’ data (if any) should each CPG company be afforded, at a maximum?
  2. Does Amazon face a risk of the largest CPG companies “teaming up” with each other, against Amazon, to match and thus mitigate Amazon’s trove of consumer data?

(734 words)


[1] “An Introduction to Statistical Learning: with Applications in R”, James et al., Springer (2017)

[2] “Statistical Learning (Self-Paced)”, Stanford Online Learning, Stanford University (Accessed online MOOC 2017, 2018)

[3] “More Product Searches Start on Amazon”, Garcia, eMarketer, (Accessed online https://retail.emarketer.com/article/more-product-searches-start-on-amazon/5b92c0e0ebd40005bc4dc7ae October 2018)

[4] “How Amazon Plans to Dominate the Private Label Market”, Danziger, Forbes (Accessed online https://www.forbes.com/sites/pamdanziger/2018/05/06/how-amazon-plans-to-dominate-the-private-label-market/#5be9def372d9 November 2018)

[5] “Secret Amazon brands are quietly taking over Amazon.com”, Griswold, Quartz (Accessed online https://qz.com/1414238/secret-amazon-brands-are-quietly-taking-over-amazon-com/ November 2018)


Student comments on On the Shoulders of Giants: How Amazon Uses Machine Learning and Consumer Data to Disrupt its Partners

  1. Thank you for raising an exciting topic that is both on the tip of the tongue and in some part continues the class discussion in our section. I really like the solutions proposed for the potential CPG flight issue; both of them are realistic and easily executed. Thinking further on this problem, I’d argue that even a “let them go” scenario actually won’t hurt Amazon. Right now they are exposed to a huge amount of data which is generated internally, but it’s not the only data they are using. The huge data-gathering companies are still on the market (Nielsen, for instance), and data from them can still be used in case CPG companies quit Amazon. This will definitely drive costs up, but the process of creating private label substitutes for top brands is already unstoppable.

  2. The practice of using third-party sellers’ data for its own private label business is a contentious issue in Europe, where the EU’s regulatory body is probing possible antitrust violations. It does give Amazon a massive advantage in deciding which products to pursue to increase its own sales. I personally do not think it is an antitrust issue, but it would be interesting to see how Amazon handles this regulatory issue.
