Machine learning as a tool to predict future earning power

This post discusses how a machine learning algorithm can be applied to predict students’ post-graduation earnings.

In recent years, machine learning and artificial intelligence have gained mass popularity, largely because of the promise they hold for numerous industries and sectors. For organizations that collect substantial amounts of data from their users, machine learning offers a unique opportunity not only to analyze this data but also to recommend actions based on it with minimal or no human input, which makes it an extremely attractive technology. AlmaPact, an HBS-founded student financing software startup, is a specific example of how machine learning can be applied to data to make better recommendations than a human expert could.

The AlmaPact platform connects students to investors and creates income share agreements (ISAs) between the two parties: the student receives an upfront lump sum from the investor and, in return, pays the investor a specific percentage of their income for a predetermined number of years after finishing school. The most difficult aspect of this value proposition for both parties is the current inability to accurately and reliably predict the future earnings of the students in whom investors are investing. From the student’s perspective, the higher their predicted income, the better their income share terms: they will be asked to pay a lower percentage of their income over a shorter time period. However, if the income prediction algorithm overestimates the student’s post-graduation income, investors are penalized, receiving a much smaller payback than was predicted when they invested in the student.
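To make the mechanics concrete, here is a minimal sketch of how an investor's payback depends on the income that was predicted versus the income the student actually earns. The class, field names, and dollar figures are invented for illustration and are not AlmaPact's actual terms.

```python
from dataclasses import dataclass


@dataclass
class IncomeShareAgreement:
    """Hypothetical ISA terms; the field names are illustrative only."""
    upfront_amount: float  # lump sum paid to the student
    income_share: float    # fraction of income owed each year, e.g. 0.10
    term_years: int        # number of post-graduation years payments run


def total_payback(isa, yearly_incomes):
    """Sum the investor's payments over the agreement term only."""
    return sum(isa.income_share * income
               for income in yearly_incomes[:isa.term_years])


# Made-up example: the model predicted $80k per year, the student earned $60k.
isa = IncomeShareAgreement(upfront_amount=30_000, income_share=0.10, term_years=5)
predicted_payback = total_payback(isa, [80_000] * 5)
realized_payback = total_payback(isa, [60_000] * 5)

# The shortfall is what the investor loses when the prediction is too high.
shortfall = predicted_payback - realized_payback
print(f"predicted {predicted_payback:,.0f}, realized {realized_payback:,.0f}, "
      f"shortfall {shortfall:,.0f}")
```

In this made-up case the investor expected roughly $40,000 back on a $30,000 investment and received roughly $30,000, which is why prediction accuracy matters so much to the investor side of the platform.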

Machine learning presents an enormous opportunity to solve this problem. As the population of students and investors executing income share agreements grows, the model can compare the inputs to the income prediction with the real-world outcomes of student job placement. From this it can rapidly learn which characteristics are strong predictors of future income (e.g. education program, educational institution, pre-matriculation career) and which are weak or negative predictors (e.g. parents’ income, hometown). The more data the platform ingests, the more accurately the model can predict income, a self-reinforcing value creation loop that is a hallmark of machine learning. An additional network effect makes this loop even more attractive: as the model continues to refine itself and predicts income more accurately, the returns of the ISA become more predictable and less volatile, which attracts more investors to the platform. More investors result in more students who are able to get ISAs. This two-sided network effect is only possible with a robust machine learning algorithm.
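As a rough illustration of separating strong predictors from weak ones, the sketch below ranks candidate features by their absolute correlation with realized income. This is a deliberately simple stand-in for a real machine learning model, and everything in it is an assumption: the feature names (`program_score`, `hometown_code`) are hypothetical, and the data is synthetic, generated so that income depends on one feature and not on the other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def rank_predictors(features, incomes):
    """Order feature names from strongest to weakest |correlation| with income."""
    scores = {name: abs(pearson(values, incomes))
              for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic data, invented for illustration: income is driven by
# program_score by construction and is unrelated to hometown_code.
rng = random.Random(0)
n_students = 500
features = {
    "program_score": [rng.uniform(0, 1) for _ in range(n_students)],
    "hometown_code": [rng.uniform(0, 1) for _ in range(n_students)],
}
incomes = [50_000 + 40_000 * p + rng.gauss(0, 5_000)
           for p in features["program_score"]]

ranking = rank_predictors(features, incomes)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

On real ISA data the interesting question is which features end up near the top of such a ranking. A production model would also need to capture interactions between features, which simple correlation cannot.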

To realize the promise of machine learning, we (the founding team) are trying to get as much data onto the platform as we can right away, focusing on a few key data sources, so we can build a rough prototype of the model. In the near term, we need to make sure we can continue to feed data to the model so that it can keep training and improve its ability to predict income. In the long term, this will create an enormous barrier to entry for competitors, as our model will be based on years of data and will therefore be very difficult to replicate without years of investment. This barrier to entry is one of the most promising aspects of the business, which is why we are putting much of our focus right now on how to start acquiring and analyzing data. However, this is not a sure bet, as we don’t yet know whether there will actually be a combination of data points that allows our model to closely predict income. There will almost certainly be some amount of volatility in our predictions, and a core assumption of ours is that the market will tolerate this volatility.

In the context of AlmaPact, I would love to hear from my classmates what data sources they intuitively think may be strong predictors of income, or what types of data we should use first to train our machine learning model. I would also like to hear whether my classmates think any other aspects of the business could be strong candidates for a machine learning application, in addition to the income prediction application that this essay discusses.

(738 words)



Student comments on Machine learning as a tool to predict future earning power

  1. David – Very interesting idea. My question revolves more around the business plan than any machine learning-related aspects per se. I understand the incredible value that ML can unlock in making credit risk assessments, and financial firms are already starting to use it in that respect. But I’m curious why someone would choose a loan requiring payment as a % of income vs. a traditional interest rate, because if I end up doing better than expected on my income, then I’m paying more for the money I received upfront. Isn’t that sort of like taking an equity stake in a person, because you receive a share of their future profits in exchange for capital today? If so, what do you make of the ethical considerations of owning a stake in a person’s future (vs. a corporation’s)?

  2. Such an interesting concept! I think ML can be hugely beneficial for your company, since like you said, the algorithms will become more accurate at predicting income over time. I want to build upon JP’s point about your business plan. There are currently companies out there (Earnest is one that comes to mind) that will give different interest rates to students depending on potential. If I’m understanding your business correctly, however, your point of differentiation is that you’re actually taking a percentage of future income. To me, as a potential client, I’d almost be hedging my bets if I finance with you. So if for some reason I earn less than I thought I would, at least I won’t be paying as much in student loans as I could be. I wonder, though, if by constructing your business this way, you might inadvertently be cast into the market for lemons, with people who know they’ll earn less than what a model predicts being the majority of your customers. I also see a potential problem with future moms who choose to work reduced hours. Can / will your ML algorithms account for that?

  3. Wow, what a great founding team! I’d push the team to think about what they are trying to solve for — is the point of income-sharing agreements not to improve access to education (in a way that is mutually beneficial to the recipient and an investor)? If so, by looking at predictors of which individuals have characteristics that may be negatively correlated with income, are you not likely to start excluding the majority of the individuals who would actually demand this type of product the most?

  4. Great article! Like MM points out above me, the ML technology will have to be a good screen of good intent and have an ability to detect moral hazard. Knowing that this program only appeals to those earners who do not have much faith in their future earnings potential, the company will have to be super careful in its data collection to predict those who are using the lending scheme for good rather than bad.

  5. Because the majority of graduate students undertake huge upfront costs in the hopes of increased future earning potential, income sharing with a sponsor (such as the investor) makes a lot of sense! It reminds me of indentured servitude (in a good way(?), if that makes any sense). I’m assuming you have done segmentation to see if the high-earner is Corporate_Type (short run high earners, with a high starting salary in a corporate setting) or Non_Corporate_Type (long run eventual high earners, in successful startups). Both types would require different algorithms, after being categorized into a segment or persona.

    Then, to make the algorithm more accurate for Corporate_Type, I would feed it indicators that help us see whether the candidate is hunting for high salary jobs. If we can get data to see if the candidate has registered for PE, investment banking, managerial consulting recruitment events, then this may serve as a good predictor. Obviously, a better predictor would be summer internships at such firms and satisfaction with said summer internship, but I’m assuming this is too late for your team.

    So an early indicator might be if the good friends of this candidate are recruiting for high salary jobs. Perhaps you can mine LinkedIn or Facebook data to see whether the candidate’s social circle is composed of high earners. My hypothesis is that this is likely to impact the income of the candidate, by making it more likely for this person to want to keep up with the Joneses.

  6. Hey David! I really like this idea of a market for “personal equity”!

    I have two questions/comments. The first is on the algorithm and the second on the business model itself.

    1) I understood that you are trying to look at historical data to find correlations between that data and realized income. Nevertheless, I would guess that the historical level of debt may actually influence someone’s career path (a person who needs to repay will actually search for a high income job vs. his/her “life aspiration”). On the other hand, people who have “outstanding personal equity” might go into more “risky” careers, since they do not have to repay school and will have to share their income anyway. How are you guys thinking about that potential bias in the data? Do you see any way to “correct it”?

    2) On the business model, I imagine that a decisive moment for the company comes when the first successful “personal equity” agreements start to bear results and show nice returns/low default rates for investors. Given the long term nature of the “loans”, how do you see AlmaPact iterating its models and being able to scale its “personal equity” agreements? As importantly, how will you follow and evaluate the probabilities of repayment over time in the first few years of AlmaPact, before actual results start to show? I feel that would be crucial to prevent a “big surprise” when the repayment period begins.
