How Machine Learning Can Bridge the Data Gap in Sustainable Finance

Oliwia Leonczak

In collaboration with a leading international bank, Zanders explored how machine learning can support more accurate, scalable, and decision-useful estimates of greenhouse gas (GHG) emissions intensity when disclosures fall short.

In the pursuit of climate-aligned finance, financial institutions face a critical challenge: incomplete emissions data. While disclosure frameworks such as the EBA’s Pillar 3 ESG requirements, the ECB’s climate risk guidance, and the EU Corporate Sustainability Reporting Directive (CSRD) continue to expand, their coverage remains fragmented. As a result, financial institutions must often assess climate-related financial risks and align portfolios without full visibility into counterparties’ environmental footprints.

The Challenge: Incomplete GHG Emissions Disclosure

Current climate risk assessments rely heavily on firm-disclosed emissions. Yet many companies, particularly small, private, or non-European firms, still do not report their GHG emissions. This gap not only limits the accuracy of portfolio-level financed emissions metrics, but also hinders net-zero alignment tracking and regulatory reporting.

To fill this gap, many financial institutions resort to sector-average proxies, such as those recommended by the Partnership for Carbon Accounting Financials (PCAF). These proxies assign emissions to non-reporting firms based on average industry and regional emission intensities. While widely adopted, this approach introduces substantial bias, as it overlooks firm-specific drivers such as energy use, capital intensity, or geographic differences. The result is a blind spot: portfolio assessment loses the very granularity needed to distinguish leaders from laggards in the low-carbon transition.
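The blind spot is easy to see in a toy version of the proxy approach. The sketch below is not the PCAF methodology itself, only a minimal illustration of averaging by sector and region; the firms, the numbers, and the unit (assumed here to be tCO2e per EUR million of revenue) are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reporting firms: (sector, region, disclosed emissions intensity).
# Unit assumed for illustration: tCO2e per EUR million of revenue.
reporting_firms = [
    ("Utilities", "EU", 900.0),
    ("Utilities", "EU", 300.0),
    ("Software", "EU", 12.0),
    ("Software", "EU", 8.0),
]


def build_proxy_table(firms):
    """Average the disclosed intensity within each (sector, region) bucket."""
    buckets = defaultdict(list)
    for sector, region, intensity in firms:
        buckets[(sector, region)].append(intensity)
    return {key: mean(values) for key, values in buckets.items()}


proxy_table = build_proxy_table(reporting_firms)


def proxy_estimate(sector, region):
    """Every non-reporting firm in a bucket receives the same bucket average."""
    return proxy_table[(sector, region)]


# Two non-reporting utilities, one efficient and one not, get identical estimates.
efficient_utility = proxy_estimate("Utilities", "EU")
inefficient_utility = proxy_estimate("Utilities", "EU")
print(efficient_utility, inefficient_utility)
```

Because the lookup depends only on sector and region, any difference in energy use or capital intensity between the two utilities is invisible to the estimate.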

Predicting Emissions Intensity with Machine Learning

The study's main objective was to test several supervised ML models for estimating Scope 1 and 2 GHG emissions intensity from financial firm-level characteristics. Using an unbalanced panel dataset covering public and private companies worldwide from 2021 to 2025, the models were trained on disclosed emissions and used to predict missing values with greater granularity. The dataset was split into approximately 80% training and 20% testing subsets, with all observations from the same company (across different years) kept in the same subset to prevent information leakage.
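A company-level split of this kind can be sketched with scikit-learn's GroupShuffleSplit, which assigns whole groups to either the training or the testing side. This is one common way to do it, not necessarily the procedure used in the study; the panel below is synthetic and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic unbalanced panel: 50 hypothetical companies with 1 to 5 yearly rows each.
rows_per_company = rng.integers(1, 6, size=50)
company_ids = np.repeat(np.arange(50), rows_per_company)
X = rng.normal(size=(len(company_ids), 3))  # placeholder firm-level features
y = rng.normal(size=len(company_ids))       # placeholder emissions intensity

# test_size applies to groups (companies), so the row-level split is only
# approximately 80/20 when companies have different numbers of observations.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=company_ids))

train_companies = set(company_ids[train_idx])
test_companies = set(company_ids[test_idx])
print("companies in both sets:", len(train_companies & test_companies))
print("share of rows in test:", round(len(test_idx) / len(company_ids), 2))
```

A plain row-level shuffle would let the same company's 2022 observation inform the prediction of its 2023 observation, which would flatter the test metrics.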

Two models were introduced:

Model 1, a baseline that includes financial and sectoral indicators widely available to banks, such as asset turnover; property, plant and equipment (PPE); earnings before interest and taxes (EBIT); and industry classification.

Model 2, an extended model that incorporates more advanced and less commonly available variables, such as Refinitiv ESG score, energy consumption, and earnings quality rankings.

These predictors were selected based on both academic relevance and practical availability in financial databases such as LSEG Workspace (previously Refinitiv Eikon) and S&P.

For both Model 1 and Model 2, three algorithms were compared: k-Nearest Neighbours (k-NN), Decision Trees, and Random Forests, chosen for their interpretability and practicality in low-data environments. To assess whether machine learning provides a meaningful improvement over traditional sector-average proxies, the ML models and the PCAF sector-average proxy estimates were evaluated on a common test set. Using the same test set for every approach made it possible to quantify the overall predictive gains and evaluate the implications for climate-aligned decision-making in finance.
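The comparison design can be sketched as follows: a sector-average baseline and the three algorithms are all scored on the same held-out rows. The data is synthetic, the feature names are stand-ins for Model 1-style inputs, and the hyperparameters are arbitrary assumptions, so the numbers it prints say nothing about the study's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 600

# Synthetic stand-ins for firm-level data (not the study's dataset).
sector = rng.integers(0, 4, size=n)             # encoded industry classification
ppe_ratio = rng.uniform(0.0, 1.0, size=n)       # PPE / total assets
asset_turnover = rng.uniform(0.2, 2.0, size=n)
sector_base = np.array([50.0, 200.0, 400.0, 800.0])[sector]
y = sector_base * (0.5 + ppe_ratio) / asset_turnover + rng.normal(0, 10, size=n)
# Sector is passed as a raw integer code for brevity; in practice it would be
# one-hot encoded and the features scaled, which matters most for k-NN.
X = np.column_stack([sector, ppe_ratio, asset_turnover])

# Simple row split for brevity; a real panel needs a company-level split.
train = np.arange(n) < 480
test = ~train


def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Proxy baseline: each test firm gets the mean training intensity of its sector.
sector_means = np.array([y[train & (sector == s)].mean() for s in range(4)])
results = {"sector_average_proxy": rmse(y[test], sector_means[sector[test]])}

models = {
    "knn": KNeighborsRegressor(n_neighbors=5),
    "decision_tree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    model.fit(X[train], y[train])
    results[name] = rmse(y[test], model.predict(X[test]))

for name, value in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:>22}: RMSE = {value:.1f}")
```

Scoring the baseline and the models on identical rows is what makes the error reductions directly comparable.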

Model performance was evaluated using standard regression metrics, namely Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), ensuring consistency across models and comparability with the baseline. Beyond these error metrics, performance was also assessed through variance recovery (reported as Median Variance Relative Gain). This measure captures how effectively each model restores the firm-level differentiation in GHG emissions intensity that is lost under sector-average proxies.
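RMSE and MAE have standard definitions; the variance recovery measure does not, and its exact formula is not given here. The sketch below therefore assumes one plausible reading: for each sector, the variance of the predictions divided by the variance of the actual values, with the median taken across sectors. The function name and the toy numbers are hypothetical.

```python
import numpy as np


def rmse(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mae(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.mean(np.abs(actual - predicted)))


def median_within_sector_variance_ratio(actual, predicted, sectors):
    """ASSUMED definition: per sector Var(predicted) / Var(actual), then the median."""
    actual, predicted, sectors = map(np.asarray, (actual, predicted, sectors))
    ratios = []
    for s in np.unique(sectors):
        mask = sectors == s
        actual_var = actual[mask].var()
        if actual_var > 0:  # skip sectors with no spread to recover
            ratios.append(predicted[mask].var() / actual_var)
    return float(np.median(ratios))


# Toy example: two sectors with two firms each.
actual = [10.0, 30.0, 100.0, 300.0]
sectors = ["A", "A", "B", "B"]
proxy = [20.0, 20.0, 200.0, 200.0]   # sector averages: no within-sector spread
model = [14.0, 26.0, 140.0, 260.0]   # hypothetical firm-level predictions

for label, predicted in [("proxy", proxy), ("model", model)]:
    print(
        label,
        round(rmse(actual, predicted), 2),
        round(mae(actual, predicted), 2),
        round(median_within_sector_variance_ratio(actual, predicted, sectors), 2),
    )
```

Under this reading a sector-average proxy scores zero by construction, since it assigns the same value to every firm in a sector. A variance ratio alone does not show that the spread is placed on the right firms, so it is best read alongside RMSE and MAE.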

The entire framework was designed to balance predictive accuracy with implementation realism, aiming to improve GHG coverage for financial institutions without relying on black-box techniques or data-heavy infrastructure.

What the Models Revealed

Across both models and all three algorithms, machine learning substantially outperformed the traditional PCAF proxy approach.

The Random Forest version of Model 2 emerged as the strongest performer, reducing RMSE by roughly one-third and MAE by more than half, and recovering nearly 65% of the intra-sectoral variance lost under sector-average proxies. Model 1, designed for use in the banking sector, placed second under the same Random Forest algorithm, reducing RMSE by 21% and MAE by 41%. This means that the algorithm can effectively differentiate firms within the same industry, a critical step towards realistic transition-risk modeling and portfolio construction.

Feature importance analysis showed that Energy Use Total, PPE / Total Assets, Asset Turnover and Sector were consistently dominant predictors, confirming that emissions intensity depends jointly on operational efficiency and capital structure. The study also tested a transfer learning approach, in which models trained only on high-disclosure sectors with sufficient reporting coverage were applied to low-disclosure sectors unseen during training. The results showed a substantial decline in accuracy, suggesting that emission patterns are highly sector-specific. In practice, this means that for ML models to outperform sector-average proxies in GHG emissions estimation, they should be trained on datasets that include all sectors, rather than on samples limited to a few well-disclosing industries.
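Both checks can be sketched on synthetic data: reading impurity-based importances from a fitted Random Forest, and comparing a model trained on all sectors with one that never saw the target sector. The data-generating process, the choice of sector 3 as the "low-disclosure" sector, and all names are assumptions made for illustration, so the output demonstrates the mechanics rather than the study's findings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 800
feature_names = ["sector", "ppe_ratio", "asset_turnover"]

# Synthetic stand-in data in which the emissions level differs strongly by sector.
sector = rng.integers(0, 4, size=n)
ppe_ratio = rng.uniform(0.0, 1.0, size=n)
asset_turnover = rng.uniform(0.2, 2.0, size=n)
sector_base = np.array([50.0, 200.0, 400.0, 800.0])[sector]
y = sector_base * (0.5 + ppe_ratio) / asset_turnover + rng.normal(0, 10, size=n)
X = np.column_stack([sector, ppe_ratio, asset_turnover])

# Row-level split for brevity; a real panel needs a company-level split.
is_test = np.arange(n) >= 640
low_disclosure = sector == 3  # hypothetical sector with little reporting


def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def fit_forest(mask):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    return model.fit(X[mask], y[mask])


all_sector_model = fit_forest(~is_test)                  # sees every sector
transfer_model = fit_forest(~is_test & ~low_disclosure)  # never sees sector 3

# Evaluate both models on test firms from the low-disclosure sector only.
target = is_test & low_disclosure
rmse_all = rmse(y[target], all_sector_model.predict(X[target]))
rmse_transfer = rmse(y[target], transfer_model.predict(X[target]))
print(f"trained on all sectors:       RMSE = {rmse_all:.1f}")
print(f"trained without target sector: RMSE = {rmse_transfer:.1f}")

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(feature_names, all_sector_model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{name:>15}: {value:.2f}")
```

Tree-based models cannot extrapolate beyond the target values seen in training, which is one reason a model that has never observed a high-emitting sector tends to underestimate it.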

Why This Matters

More accurate emissions estimation directly supports key pillars of sustainable finance. It enhances portfolio alignment assessments, scenario analysis, and climate risk disclosure under frameworks such as the Task Force on Climate-related Financial Disclosures (TCFD) and the EU Corporate Sustainability Reporting Directive (CSRD). Moreover, improved firm-level granularity enables financial institutions to better understand which clients are leading or lagging in the transition to a low-carbon economy.

By replacing rigid proxies with data-driven predictions, financial institutions can move one step closer to climate data maturity, where decisions are no longer held back by disclosure gaps but empowered by intelligent estimation.

What Zanders Can Do

As regulatory expectations tighten and data coverage remains incomplete, financial institutions need solutions that are both technically rigorous and operationally feasible. Whether addressing climate-related credit exposures, integrating ESG into portfolio construction, or navigating disclosure obligations, institutions must adopt frameworks that are adaptive, data-driven, and aligned with supervisory standards.

By combining quantitative modeling expertise, climate risk analytics, and regulatory knowledge, Zanders helps institutions move beyond generic estimates and static proxies.

Want to find out more about how we can support you in building practical ESG risk management solutions? Our ESG experts will be happy to assist you. Visit the Zanders ESG page to learn more.