
Streamlining Data Processing for Smarter Business Decisions

Effective data processing and analysis are essential for businesses that want to make informed decisions. The working paper “Empirical Guidance: Data Processing and Analysis with Applications in Stata, R, and Python” by Michael W. Toffel, Senator John Heinz Professor of Environmental Management at Harvard Business School and Principal Investigator of D^3’s Climate & Sustainability Impact Lab, and Melissa Ouellet, research associate at Harvard Business School, offers a comprehensive guide to best practices for handling data. Designed for professionals who work in Stata, R, or Python, the paper is a valuable resource for those new to econometrics and data analysis. It covers a range of topics and, for each one, the authors provide how-to examples in Stata, R, and Python. This insights article highlights only a few of them.

Key Insight: Consistent Coding Practices

“A well-defined coding style guide is essential for maintaining consistency, readability, and efficiency in code.” [1]

  • Global commands, such as memory settings and file paths, ensure the software operates uniformly.
  • Profile files in Stata, R, and Python allow for automated configuration, streamlining each session.
  • Organizing code with proper commenting and clear structure enables teams to troubleshoot and improve the workflow collaboratively.

Consistency in coding makes it easier to maintain and share code across different teams, reducing errors and increasing productivity. Establishing coding guidelines helps keep everything structured, which is essential when multiple team members handle the same code.
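As one way to picture the global settings and file paths described above, here is a minimal Python sketch of a shared configuration module. The folder layout, the seed value, and the helper names `configure_session` and `raw_path` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from pathlib import Path
import random

# Hypothetical project layout; adjust the folder names to your own project.
PROJECT_ROOT = Path("my_project")
RAW_DIR = PROJECT_ROOT / "data" / "raw"
CLEAN_DIR = PROJECT_ROOT / "data" / "clean"
OUTPUT_DIR = PROJECT_ROOT / "output"

# One fixed seed (arbitrary value) so random steps repeat across runs.
SEED = 12345


def configure_session(seed: int = SEED) -> None:
    """Apply the shared settings; call this at the top of every script."""
    random.seed(seed)


def raw_path(filename: str) -> Path:
    """Build the path to a raw data file from the shared root."""
    return RAW_DIR / filename


configure_session()
print(raw_path("sales.csv"))
```

Keeping paths and settings in one place means a teammate only has to change `PROJECT_ROOT` to run the same scripts on another machine.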

Key Insight: Reproducibility in Analysis

“Detailed documentation of each step in the analysis workflow ensures transparency and reproducibility. This includes documenting data sources, processing steps, analytical methods, and any assumptions or decisions made during the analysis.” [2]

  • Clear documentation and file management make it easy to track the workflow, from raw data to final results.
  • Tools like Jupyter Notebooks and R Markdown gather your code, data, and narrative all in one place, supporting transparency and reproducibility.
  • Document each stage of data processing to help ensure the quality of the final dataset, and record these details so the team can follow the work now and reproduce it later.

Reproducibility is vital for ensuring that your data-driven decisions can be verified and replicated. By adopting systematic approaches, businesses can confidently share and build upon analysis results.
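The documentation habit described in this section could look something like the sketch below, which keeps a small processing log alongside the data. This is an illustrative approach rather than the paper's method: the file name, the step names, and the helpers `fingerprint` and `log_step` are all hypothetical, and the byte strings stand in for real file contents.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hash so a later run can confirm the data is unchanged."""
    return hashlib.sha256(data).hexdigest()


def log_step(log: list, step: str, source: str, note: str, data: bytes) -> None:
    """Append one documented processing step to the log."""
    log.append({
        "step": step,
        "source": source,
        "note": note,  # assumptions and decisions go here
        "sha256": fingerprint(data),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    })


log = []

# Stand-in for the bytes of a raw file (hypothetical content).
raw = b"id,revenue\n1,100\n2,\n"
log_step(log, "load", "sales_raw.csv", "raw export, untouched", raw)

# Stand-in for the data after one cleaning decision.
cleaned = b"id,revenue\n1,100\n"
log_step(log, "drop_missing", "sales_raw.csv",
         "dropped rows with no revenue (assumption: not recoverable)", cleaned)

print(json.dumps(log, indent=2))
```

Saving the log next to the final dataset gives reviewers a record of each source, step, and decision from raw data to results.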

Key Insight: The Importance of Data Cleaning

Cleaning data ensures accuracy and prevents misleading results. Dealing with duplicates, missing values, and outliers is essential to make sure that the data you analyze is reliable. Ouellet and Toffel provide strategies with Stata, R, and Python for cleaning your data and avoiding the pitfalls that can create data disarray.

  • “Identifying the unique combination of variables that uniquely define each record ensures there are no duplicates and each entry is distinct.” [3] For example, in Python, use the `unique` method from the `pandas` library to get the unique values of a given column; the `duplicated` method can flag repeated rows across a combination of columns.
  • “Identifying patterns of missingness and understanding their implications helps in choosing appropriate methods to address them.” [4] For example, in Stata, use the `npresent` and `nmissing` commands to count present and missing values.
  • Outliers are “data points that deviate significantly from other observations. Identifying outliers involves using domain knowledge to recognize values that are unusually high or low and assessing their validity.” [5] For example, in R, use the `boxplot.stats` function or the `outliers` package.
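The three cleaning checks above can be sketched in Python with `pandas`. The data are made up, the choice of `firm_id` and `year` as the identifying key is an assumption for illustration, and the 1.5 × IQR rule is one common convention for flagging outliers, not necessarily the one the paper uses.

```python
import pandas as pd

# Made-up records; "firm_id" and "year" are assumed to identify each row.
df = pd.DataFrame({
    "firm_id": [1, 1, 2, 2, 3, 3, 3],
    "year":    [2020, 2021, 2020, 2021, 2020, 2021, 2021],
    "revenue": [10.0, 11.0, 9.0, None, 12.0, 250.0, 250.0],
})

# 1. Duplicates: show every row whose key appears more than once,
#    then keep only the first occurrence of each key.
key = ["firm_id", "year"]
dup_rows = df[df.duplicated(subset=key, keep=False)]
deduped = df.drop_duplicates(subset=key, keep="first")

# 2. Missing values: count them per column before deciding how to handle them.
missing_counts = deduped.isna().sum()

# 3. Outliers: flag values more than 1.5 * IQR beyond the quartiles.
#    Missing values are never flagged because comparisons with NaN are False.
q1 = deduped["revenue"].quantile(0.25)
q3 = deduped["revenue"].quantile(0.75)
iqr = q3 - q1
is_outlier = (deduped["revenue"] < q1 - 1.5 * iqr) | (deduped["revenue"] > q3 + 1.5 * iqr)
outliers = deduped[is_outlier]

print(dup_rows)
print(missing_counts)
print(outliers)
```

A flagged value is a prompt to investigate, not an instruction to delete: as the quoted passage notes, domain knowledge decides whether an unusual value is valid.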

Key Insight: Advanced Regression Techniques

When considering your regression analyses, keep in mind the following steps:

  • Regression diagnostics and hypothesis tests help you detect violations of regression assumptions.
  • Adjusting standard errors helps you account for heteroscedasticity and clustering, which makes your inferences more reliable.
  • Ordinary Least Squares (OLS) regression minimizes the sum of squared differences between predicted and observed values. Check the assumptions your OLS regression rests on to enhance the reliability of your results.

Ouellet and Toffel point out that, in addition to the techniques above, logistic regression, Poisson regression, and post-regression analysis should all be considered in regression analysis. The paper provides useful examples of how to perform all of these techniques in Stata, R, and Python.
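To make the OLS and standard-error points in this section concrete, the sketch below fits an OLS regression with `numpy` on simulated data and compares classical standard errors with heteroscedasticity-robust (HC1) ones. The data, the coefficient values, and the variable names are assumptions made up for illustration; clustering is left out to keep the example short, and a statistics library would normally handle these calculations for you.

```python
import numpy as np

# Simulated data (assumption): noise grows with x, so errors are heteroscedastic.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])
k = X.shape[1]

# OLS: coefficients that minimize the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

XtX_inv = np.linalg.inv(X.T @ X)

# Classical standard errors assume constant error variance.
sigma2 = resid @ resid / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC1 robust standard errors: sandwich estimator with a small-sample correction.
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc1 = (n / (n - k)) * XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(cov_hc1))

print("coefficients:", beta)
print("classical SE:", se_classical)
print("robust SE:   ", se_robust)
```

The coefficient estimates are the same under both approaches; only the standard errors, and therefore the hypothesis tests and confidence intervals, change.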

Why This Matters

For business professionals, applying best practices like those laid out by Ouellet and Toffel can improve data accuracy and consistency and lead to deeper insights. Whether you’re streamlining internal processes or assessing market conditions, these tools will help ensure that your analyses are both reliable and reproducible. Mastering data processing and analysis enables companies to make informed, data-driven decisions that foster growth, innovation, and competitive advantage.

References

[1] Melissa Ouellet and Michael W. Toffel, “Empirical Guidance: Data Processing and Analysis with Applications in Stata, R, and Python,” Harvard Business School Technology & Operations Mgt. Unit Working Paper No. 25-010 (August 30, 2024): 1-36, 3.
[2] Ouellet and Toffel, “Empirical Guidance: Data Processing and Analysis with Applications in Stata, R, and Python,” 11. 
[3] Ouellet and Toffel, “Empirical Guidance: Data Processing and Analysis with Applications in Stata, R, and Python,” 17. 
[4] Ouellet and Toffel, “Empirical Guidance: Data Processing and Analysis with Applications in Stata, R, and Python,” 17.
[5] Ouellet and Toffel, “Empirical Guidance: Data Processing and Analysis with Applications in Stata, R, and Python,” 18.

Meet the Authors

Melissa Ouellet is a Research Associate at Harvard Business School. She holds an MSc in International Health Policy and Economics from The London School of Economics and Political Science (2007) and an MSc in Applied Statistics from the University of Oxford (2008). She has 15 years of experience in statistical modeling where she has significantly contributed to research on healthcare systems, valuation instruments, and innovation in the medical device and pharmaceutical industries.

Michael Toffel is the Senator John Heinz Professor of Environmental Management at Harvard Business School and is a faculty member and principal investigator of D^3’s Climate and Sustainability Impact Lab. His research examines how companies are addressing climate change (especially decarbonization) and other environmental and working condition issues in their operations and supply chains.

