Public policy, regulatory agendas and the growing enthusiasm for big data have sparked an interest in the creation of comprehensive data sets to better monitor and understand financial markets. In this chapter, we use the experiences gained in constructing the National Mortgage Database (NMDB) to offer insights into the process and challenges of dataset creation. When complete, the NMDB, a joint effort of the Federal Housing Finance Agency (FHFA), the Consumer Financial Protection Bureau (CFPB), and Freddie Mac, will be a comprehensive database of loan-level information combined with information on associated properties and borrowers, starting with mortgages outstanding in 1998.

The NMDB will, among other things, help us to better understand the recent financial crisis and, as the market evolves, how and, perhaps, why mortgages perform as they do. With this knowledge, we can develop better mortgage products that meet the needs of a changing population, and far more effectively supervise and monitor the players in the housing finance market—perhaps avoiding or mitigating any future crises. The creation of the NMDB has been and continues to be an enormously challenging task. It requires bringing together multiple data sets, each developed for a specific purpose, covering different time periods and universes, and owned or operated by several public and private parties. This essay describes how the dataset is being built.

The Intent of the Data

The first stage in any data creation exercise is to consider how the dataset will be used. From its inception, the NMDB was intended to address a wide variety of economic and policy-related questions. For example, the NMDB is designed to provide more accurate data in order to:

  • Satisfy statutory reporting mandates such as those required under the Housing and Economic Recovery Act of 2008 (HERA) for the FHFA, or under the Dodd-Frank Act for the CFPB and the Department of Housing and Urban Development (HUD).
  • Measure trends in delinquencies for first and associated second lien mortgages overall and for many subpopulations (such as by previous mortgage status, credit score, race/ethnicity, geography and others).
  • Analyze the effectiveness of actions to reduce delinquencies and examine changes in indebtedness and credit scores over time.
  • Benchmark performance for regulatory oversight and enable regulators to monitor and set targets for the affordable housing performance of Fannie Mae and Freddie Mac (the GSEs) and institutions subject to the Community Reinvestment Act (CRA). In particular, performance of mortgages in targeted programs, such as the Federal Housing Administration (FHA), and those meeting the standards of the GSEs, can be compared to performance of a market-wide portfolio of comparable mortgages matched by date of origination, geography, loan size, borrower credit score and other factors.
  • Evaluate the efficacy and potential impact of counseling programs on mortgage performance, as well as counseling’s potential impact on “distressed” borrowers.
  • Using the NMDB’s survey and performance components, analyze the suitability of borrowers’ mortgage choices and their ability to sustain the mortgage, which will allow for the assessment of proposals to limit unfair or abusive lending activities.
  • Determine the contributions to and causes of the recent subprime crisis (both the boom and bust phases), as well as assess methods that may reduce the likelihood of its recurrence, using either the mortgage or borrower as the unit of analysis.
  • Allow limited conceptual analysis of broad fair lending issues on a national and market (although not lender) basis, using key information on mortgage terms and conditions, as well as borrowers’ credit worthiness and wealth.
  • Enable policymakers, researchers and regulators to improve their prepayment and default modeling and to implement “stress-test” scenarios for the entire U.S. mortgage market or for a subset of mortgages, incorporating assumptions about house prices, default and prepayment.

These potential uses determined the following key design requirements for the NMDB:

Representativeness: The purpose of the NMDB is to make inferences about the market. It is absolutely critical, therefore, that the NMDB be representative of the entire mortgage market.

Comprehensiveness: The NMDB is designed to address a wide variety of issues. This requires an equally wide variety of data, including detailed information about the mortgage, associated borrowers, and the underlying property. It also requires loan level data to conduct detailed analyses.

Timeliness: The intended regulatory and policy demands of the NMDB require that data on mortgage originations and mortgage performance be made available in relatively short order.

Usefulness: The intended public policy focus of the NMDB means that the data must be accessible to a wide variety of researchers, analysts, and housing and mortgage market advocates and practitioners in a form they can easily use but that balances privacy and competitive concerns.

Comparison of Design Criteria with Existing Data Sources

Based on these design criteria, existing databases were examined to determine the extent to which they already met the requirements and to assess their utility in constructing the NMDB. The primary sources explored were the Home Mortgage Disclosure Act (HMDA), the Federal Reserve Bank of New York’s Equifax Consumer Credit Panel, and the servicing databases owned by CoreLogic and LPS McDash. We also looked at public survey databases, including the American Housing Survey (AHS), the Survey of Consumer Finances (SCF), the Consumer Expenditure Survey (CES) and the Panel Study of Income Dynamics (PSID). We found that no existing data sets fully met the design criteria. In general, this is because a tradeoff exists between representativeness and comprehensiveness—data that are representative are rarely comprehensive, and data that are comprehensive are rarely representative.

The HMDA data include loan applications and outcomes for most mortgages with selected information about the loan, property, and borrower. The data are arguably the most representative publicly available existing data source about the mortgage market. However, they contain no information on loan performance, have little information on borrower credit-worthiness, and are released with a delay of up to 21 months.

The Federal Reserve Bank of New York/Equifax Consumer Credit Panel provides a nationally representative 1-in-20 sample of individuals with credit records, observed quarterly from 1999 onward. However, little attempt has been made to clean the data of duplicates, and no additional fields have been merged to the original data. Thus, important information is missing about mortgages in the files, such as loan purpose, owner-occupancy, pricing, loan-to-value ratio, income and borrower demographics. Finally, access to these data is limited to FRB staff.

The biennial American Housing Survey (AHS) contains comprehensive information on a nationally representative 1-in-2000 sample of mortgages of owner-occupied properties with very good property data and good borrower demographics. However, it contains no information on mortgage performance and limited information on the mortgage itself. Moreover, its public release is significantly delayed from the time the data are originally collected. The other nationally representative data sources (SCF, CES, PSID) contain no information on mortgage performance, provide only a small number of observations, and are released with significant lags.

CoreLogic and LPS McDash produce loan-level databases with performance information collected from the firms that service the mortgages. The servicing fields available from CoreLogic and LPS McDash are relatively comprehensive in both variables and size—the CoreLogic database claims about 32 million active mortgage loans, while the LPS McDash database claims about 40 million active mortgage loans. However, they offer no assurance of being representative, as they are composed of data collected from only about 25 servicers each. Moreover, mortgages cannot be tracked if servicing is transferred. Other drawbacks include limited and very costly access, minimal borrower demographics, and no information on other borrower obligations.

The credit repository data from Equifax, Experian, and TransUnion are rich in credit information: by construction they incorporate data on credit card debt, installment loans, credit inquiries and public records for the consumers they cover. Their marketing data add borrower characteristics including age, gender, and marital status. These data also include information on the borrowers’ moves and summary measures of their addresses such as census tract. However, there are important areas that are not covered. They lack some information on borrowers (e.g., race/ethnicity and income), mortgages (e.g., loan product and contract rate), and the underlying property (e.g., location and value).

Given these diverse and incomplete existing data sources, it was clear that a new database—the NMDB—would be required to meet the design requirements. The NMDB is designed as a 1-in-20 sample of all first lien, single-family mortgages rather than a universal registry. A sample can be large enough to support many different types of analyses but small enough to manage logistically, thus dramatically reducing both dollar and administrative costs. In addition, the use of a sampling frame permits the potential creation of a public-use version of the NMDB under federal privacy guidelines.
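The text does not say how the 1-in-20 sample is drawn, so the following is only a minimal sketch of one common way to obtain a stable sample of that size. It hashes a record identifier and keeps every record whose hash falls in one of twenty buckets. The identifier format and the function names in_sample and draw_sample are assumptions made for illustration and are not part of the NMDB.

```python
import hashlib

SAMPLING_RATE = 20  # 1-in-20, the rate given in the text


def in_sample(record_id: str, rate: int = SAMPLING_RATE) -> bool:
    """Return True if a record falls in the 1-in-`rate` sample.

    Hashing the identifier gives a stable pseudo-random assignment, so the
    same record is selected (or not) each time the frame is refreshed.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % rate == 0


def draw_sample(record_ids, rate: int = SAMPLING_RATE):
    """Filter a sampling frame down to the records selected by in_sample."""
    return [rid for rid in record_ids if in_sample(rid, rate)]


# Hypothetical frame of 100,000 record identifiers.
frame = [f"loan-{i:07d}" for i in range(100_000)]
sample = draw_sample(frame)
print(f"sampled share: {len(sample) / len(frame):.3f}")
```

Because selection depends only on the identifier, a record that enters the sample stays in it across updates, which is what allows a sampled mortgage to be followed over time.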

Credit repository data offered the best source from which to draw a nationally representative sample of mortgages. The three credit repositories all actively pursue loan servicers as data providers. As a result, they obtain information on almost the entire population of non-private mortgage loans made in the United States.

Developing the Pilot

The NMDB is unusual in using credit repository data as a sampling frame and then merging these data with other available sources to create a fully comprehensive data set. Given the novelty of its approach, it was critical to pilot its development prior to embarking on the creation of the complete database. Funded by Freddie Mac, the pilot enabled us to explore and resolve several critical issues. These included transforming consumer-level credit repository data to the mortgage-level NMDB and using the credit repository’s archives to construct the NMDB retrospectively, as if data had been collected on newly originated mortgages since 1998.

This required extensive collaboration with credit repository staff and involved much “learning by doing.” These efforts ensured that the complete version of the NMDB could commence relatively shortly after being funded.

Credit repository data provide the basic terms of the sampled mortgages, monthly updates on their performance, information on any second liens on the sampled properties, and data on the other debt obligations of all the sampled mortgage borrowers and co-signers (including credit cards and car loans). The use of credit history provides information about borrowers’ experiences with earlier mortgages, and the continued tracking of sampled borrowers until one year after termination allows for the characterization of events following the termination of sampled mortgages. In addition, repository data allow for combining credit information on all co-borrowers to provide household measures of credit worthiness.
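As a rough illustration of moving from consumer-level records to a mortgage-level view with household measures, the sketch below groups borrower records by mortgage and summarizes them. The record layout, the field names, and the choice of minimum and maximum score as household measures are all assumptions; the text does not specify how co-borrower information is actually combined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BorrowerRecord:
    mortgage_id: str   # hypothetical key linking a borrower to a sampled mortgage
    borrower_id: str   # hypothetical de-identified borrower key
    credit_score: int
    other_debt: float  # e.g. credit card and car loan balances, in dollars


def to_mortgage_level(records):
    """Collapse consumer-level rows into one summary row per mortgage."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.mortgage_id].append(rec)

    result = {}
    for mortgage_id, borrowers in grouped.items():
        scores = [b.credit_score for b in borrowers]
        result[mortgage_id] = {
            "n_borrowers": len(borrowers),
            "min_score": min(scores),  # one possible household measure
            "max_score": max(scores),
            "total_other_debt": sum(b.other_debt for b in borrowers),
        }
    return result


records = [
    BorrowerRecord("M1", "B1", 720, 5000.0),
    BorrowerRecord("M1", "B2", 650, 1500.0),
    BorrowerRecord("M2", "B3", 700, 0.0),
]
print(to_mortgage_level(records)["M1"])
```

Whether the lowest score, the highest score, or some other combination best represents a household is an analytic choice, so a real implementation might reasonably keep several such measures side by side.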

However, as noted earlier, the credit repository data lack key information required by the NMDB. The pilot, therefore, explored techniques for merging additionally required data to the core obtained through the repository.

The credit repository data are most effectively merged with other data using personal identifying information (PII). This presents privacy and Fair Credit Reporting Act (FCRA) “permitted purpose” challenges. Under the pilot, legally acceptable, but equally effective, procedures were developed to merge credit bureau data with data from third-party providers of property information (such as purchase price) and with HMDA data to provide borrower race, ethnicity, sex and income.

Finally, there is a survey component built into the NMDB designed to collect information on borrowers’ experiences and attitudes that is not otherwise obtainable. It also provides an opportunity to collect critical information on contemporaneous issues of policy or regulatory interest.

An advisory group including participants from government agencies, non-profit organizations, consumer advocacy groups, trade and industry groups and academia provided guidance as we developed the survey instrument. Three overlapping mail surveys were developed and tested: one aimed at borrowers with newly originated mortgages, a second sent to those with active mortgages, and a third for those with terminated mortgages. The survey solicits information on financial literacy and homeownership counseling; mortgage shopping; the mortgage closing process; expectations about house price appreciation and critical household financial events; the existence of “trigger” events such as unemployment spells, large medical expenses and divorce; and detailed demographic information.

The process of administering the surveys was also examined. The FCRA restricts the NMDB survey to a mail format, which reduces costs but can substantially lower accuracy and response rates. To determine the best way to encourage a high response rate, we tested several incentive strategies and cover letters.

Data Production

The pilot served as a proof of concept for the NMDB. Production of a complete database, however, requires permanent funding, as the database requires an extensive investment in data preparation, data cleaning, documentation and presentation. The first step, then, was securing funding. Formalizing the NMDB as a government resource was believed to be the approach most likely to minimize duplicative effort while providing the best opportunity to make the resulting data publicly available. Additionally, the federal government is likely to be the major user of the NMDB. As a result, the FHFA and CFPB have taken the lead in the development of the database, with support from Freddie Mac staff.

Next, we needed to select a credit repository and other partners. The production staff issued a Request for Proposal and chose Experian as its credit repository partner. The staff is also working with other federal agencies, such as the Federal Housing Administration, the Department of Veterans Affairs, and the Rural Housing Service, for assistance in administrative file matching, and is exploring relationships with third-party data providers.

Ultimately, the production team must build on the efforts undertaken in the pilot to produce a working dataset. The primary data challenges in creating the NMDB are as follows:

Repository data are designed to provide once-a-month snapshots; they are not designed for tracking over time.

Repository files contain many duplicative records of a single mortgage. Duplication appears to be a particular problem for mortgages originated prior to 2007, where duplicates account for roughly 25 percent of these loans.

The repository files contain no direct measures on the purpose of a mortgage loan (home purchase or refinance) or whether it is for an owner-occupied property, vacation home or investor property and only imperfectly classify first and second liens.

The address of the property associated with the mortgage proves to be both important and elusive; mortgage servicers report the billing address of the mortgage borrowers, but this is not necessarily the property address, particularly for mortgages on non-owner occupied properties.

Matching to external data sources is critical for ensuring the comprehensiveness of the NMDB. While HMDA matching has proved feasible, match procedures for other data sources are under development.

The production team must develop tools and procedures to address these issues. This includes developing and documenting mechanized data cleaning and scrubbing protocols, resolving data duplications and inconsistencies, developing tools for tracking individuals and mortgages over time, determining the “best” value of variables that are available from multiple sources, merging with external data sets, and imputing values for missing key variables. In addition, to achieve scale and consistency, the production team must develop computer algorithms and protocols for processing the data as part of regular production maintenance.
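Two of these tasks, resolving duplicate records and choosing the “best” value of a variable reported by several sources, can be sketched in a few lines. The example below is illustrative only: the fields used to identify a duplicate, the rule of keeping the most recently reported record, and the source ranking in SOURCE_PRIORITY are assumptions, not the production team’s actual protocols.

```python
# Hypothetical source ranking: a lower number means the source is trusted more.
SOURCE_PRIORITY = {"administrative": 0, "hmda": 1, "repository": 2}


def dedup_key(rec):
    """Fields assumed (for illustration) to identify one mortgage."""
    return (rec["borrower_id"], rec["open_date"], rec["original_amount"])


def deduplicate(records):
    """Keep one record per key, preferring the most recently reported one.

    Dates are ISO strings (YYYY-MM-DD), so string comparison orders them.
    """
    latest = {}
    for rec in records:
        key = dedup_key(rec)
        if key not in latest or rec["report_date"] > latest[key]["report_date"]:
            latest[key] = rec
    return list(latest.values())


def best_value(candidates):
    """Choose from (source, value) pairs by source priority, skipping missing values."""
    usable = [
        (SOURCE_PRIORITY.get(source, len(SOURCE_PRIORITY)), value)
        for source, value in candidates
        if value is not None
    ]
    if not usable:
        return None
    return min(usable, key=lambda pair: pair[0])[1]


records = [
    {"borrower_id": "B1", "open_date": "2005-03-01", "original_amount": 200000,
     "report_date": "2010-01-31", "balance": 180000},
    {"borrower_id": "B1", "open_date": "2005-03-01", "original_amount": 200000,
     "report_date": "2010-02-28", "balance": 179500},
    {"borrower_id": "B2", "open_date": "2006-07-15", "original_amount": 150000,
     "report_date": "2010-02-28", "balance": 140000},
]
print(len(deduplicate(records)))
print(best_value([("repository", 250000), ("administrative", None), ("hmda", 245000)]))
```

Exact-match keys like these would miss duplicates whose reported amounts or dates differ slightly, so a production protocol would likely need tolerance rules and manual review of edge cases.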

Access and Availability of Data

In order to be useful, the NMDB data must be both accessible and relatively straightforward to use. These are significant challenges. The size and comprehensiveness of the database are its strengths, but they also create difficulties for users. The detailed information it contains raises privacy concerns, which pose a threat to its accessibility. The NMDB production team is exploring techniques for addressing these concerns.

Access to the NMDB likely will be provided initially only to Freddie Mac and federal government staff while privacy concerns and data complexities are resolved. The long-term vision includes the development of alternative versions of the NMDB with various levels of access depending on the type of information included. The complete dataset, including PII, will be maintained by Experian alone. PII will be used only for data matching or survey operations, which will be conducted using techniques to ensure that PII as well as other proprietary information is fully protected.

The fully “cleaned” version of the NMDB will be the primary one used for supervision, analysis and research and to create regular public reports on the condition of the mortgage market. One variant, which will be updated quarterly, may be used to track sample mortgages, with the mortgage as the unit of analysis. A second variant could be the historic database used to study the mortgage crisis, where either mortgages or borrowers can be used as the unit of analysis.

Federal employees and employees of the Federal Reserve Banks, Federal Home Loan Banks, Freddie Mac and Fannie Mae will be granted access to this full, cleaned version of the dataset, including geographic information to the census tract level.[1] These users will be held to strict security standards to ensure that potentially sensitive information is not released. NMDB project staff is also exploring ways to grant access to the full dataset to researchers outside of the federal government. One idea under consideration is the use of access processes similar to those employed by the Census Bureau.

NMDB project staff will also develop a data interface for the full dataset that facilitates a broad range of queries addressable with aggregated data. There are two goals here. First, a majority of potential users of the NMDB are expected to be interested in relatively simple questions. An interface will serve these requests well and provide access to the NMDB for a wide variety of people and for many purposes. Second, by returning only aggregated results, the interface will address privacy concerns. This expands the use and usefulness of the NMDB. A potential further version of the database could include only information on borrowers who have participated in the NMDB origination survey, and would be made available to the public once the release meets federal privacy guidelines.
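One way an aggregate-only interface can limit disclosure risk is to suppress any result based on too few loans. The sketch below computes a delinquency rate by group and withholds small cells. The record layout, the function name, and the threshold of 10 in MIN_CELL_SIZE are assumptions chosen for illustration; the text does not describe the rules the NMDB interface will apply.

```python
from collections import defaultdict

MIN_CELL_SIZE = 10  # hypothetical disclosure threshold, not an NMDB rule


def delinquency_rate_by(records, group_field, min_cell=MIN_CELL_SIZE):
    """Return the delinquency rate per group, or None where a cell is too small."""
    counts = defaultdict(lambda: [0, 0])  # group -> [loans, delinquent loans]
    for rec in records:
        cell = counts[rec[group_field]]
        cell[0] += 1
        if rec["delinquent"]:
            cell[1] += 1

    table = {}
    for group, (n_loans, n_delinquent) in counts.items():
        # Suppress cells with fewer than `min_cell` loans.
        table[group] = None if n_loans < min_cell else n_delinquent / n_loans
    return table


# Hypothetical loan-level rows: 20 loans in state A (5 delinquent), 3 in state B.
loans = (
    [{"state": "A", "delinquent": i < 5} for i in range(20)]
    + [{"state": "B", "delinquent": False} for _ in range(3)]
)
print(delinquency_rate_by(loans, "state"))
```

Simple cell suppression does not by itself prevent a user from differencing overlapping queries, so an interface of this kind would probably need additional safeguards.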

Finally, if feasible, a public version of the full NMDB dataset will be made available. This requires that standards be developed to ensure that data released fully meet federal privacy and FCRA guidelines. It is not yet clear whether this database can be released at a mortgage or borrower level. Possibly, only aggregated data will be accessible, which can nevertheless be used to respond to a wide variety of queries.

Learning from Development of the NMDB

The creation of the NMDB has not been without challenges. These can be categorized into three primary areas: accessing and merging commercially available data with less public data; providing clean rather than raw data; and granting access to the database, given restrictions from the FCRA and privacy concerns.

The goal was to create the ideal database. All existing databases were faulty due to lack of representativeness or due to a lack of critical data fields. The tradeoff made was to sample from the most inclusive database currently available (the credit repository sampling frame) and to supplement with everything else needed. Hence, the NMDB focused on representativeness and inclusiveness. Even so, compromises were necessary, and no researcher will have everything they might like. For example, contract restrictions prevent including information on the lender and servicer and on more detailed geographic areas. Even with its limitations, the NMDB offers researchers and policymakers an invaluable resource that can contribute to better informed public policy and practice in the mortgage arena. It sets a new precedent by brokering access for public purposes to the rich information previously locked in commercial databases. Linking the credit data with critical elements from other public and private data sets can bring us closer to understanding both the complex dynamics of the mortgage market and financial implications for households.

The process of creating any database should follow the same general steps the NMDB team followed. First, start with an understanding of which questions are being raised and which answers can be sought. Second, find out whether existing databases can meet the needs. Next, assuming none exist, find the best one to start with and choose a sampling frame. Test the data collection process. Validate the data. Perfect the data to the extent possible, including processes for cleaning and removing duplication. Finally, be sure to plan for data release to allow access to the insights from the data while protecting privacy.

Our experience suggests that by following these steps, it is possible to create a database blending data from multiple sources that can be generally and widely used for a variety of purposes with substantial accuracy.

[1]   Because of contract restrictions, information on the lender and servicer will not be included.