Evolving technology has simplified and reduced the cost of creating and using linked data sets in ways that would have been unimaginable only two decades ago. Linked data sets are an increasingly important tool in marketing, in business decision making, and most relevant here, in shaping and evaluating public policy initiatives in health care, housing, and social services, among other domains. However, because these data sets often contain identifiable personal information, their creation and use can ignite broad and legitimate public concerns regarding the protection of personal privacy.

This essay discusses the tension between using linked data sets to inform policy and the privacy and other concerns that emerge from the use of such data. The United States Bureau of the Census defines a “data set” as “any permanently stored collection of information usually containing either case level data, aggregation of case level data, or statistical manipulations of either the case level or aggregated survey data, for multiple survey instances.”[1] For purposes of this essay, “data sets” include but are not limited to data in electronic format such as health records, housing records, educational records, and child welfare records. “Linked data sets” refers to data sets in which information from one can be integrated and analyzed with information from another, for example, “linking” school discipline records with juvenile justice records to determine whether individuals who appear in the juvenile justice system were more likely than those who did not to have a disciplinary record in school. The Early Childhood Data Collaborative defines “secure linking” of data sets as “the ability for state data systems to share unduplicated data about program participation, the services a child receives and developmental assessment data across programs and over time, while data are protected from inappropriate access or use.”[2]

It seems inevitable that the use of such linked data sets will expand rapidly during the next few years, and that shared and linked data will become an essential tool of policymakers in every sphere. Linked data is growing in relevance because policy initiatives in one area, such as housing, typically can affect individual and community outcomes in other areas such as health or education. As a result, analyzing data from only one system frequently results in a one-dimensional perspective that misses myriad outcomes in other systems, and thus makes it more difficult to accurately diagnose a problem and develop a solution. Furthermore, linking data is necessary for understanding how interwoven systems affect individuals and communities over time. But in linking data, privacy concerns must be acknowledged and addressed. This essay provides examples of the use (and in some cases misuse) of linked data sets in developing and evaluating social policy, discusses political, legal, and technical challenges to using such data, and offers potential solutions to those challenges.

Opportunities and Challenges

Linked data can help policymakers shed light on broad social issues in myriad ways. For example, Massachusetts created the Massachusetts Environmental Public Health Tracking Program in response to the lack of information on the impact of environmental factors on health.[3] The project provides prevalence and other information on the relationship between environmental factors and health issues such as birth defects, cancer, and heat stress. By making this information available through its web portal, the project gives environmental officials and state and local public health officials common data for finding solutions to problems caused by the interaction between environment and health.

Although linked data is no guarantee of coordination among policymakers, it creates a tool and opportunities for such coordination, in part because it permits questions to be posed and answered empirically. In a paper urging states to more readily share data across state agencies, Rebecca Carson and Elizabeth Laird[4] assert that important questions about school progress can be addressed over time, for example:

  • To what degree does participation in early childhood programs increase kindergarten readiness and do children sustain those gains through third grade?
  • What indicators suggest that students may be at risk to drop out of school, or conversely may go on to college or careers?
  • How many and what kind of high school graduates need assistance in their first year of postsecondary education?

The use of linked data for these purposes is not confined to the United States. In England, researchers linked health and social care (that is, social work, social support, personal care and related non-health services) data from disparate sources to create models that could predict which individuals aged 75 and older would require intensive social care in the subsequent 12 months. Although the models were less successful than hoped, the work points to further efforts to use linked administrative data to better target services.[5]

The type of data relevant to policy varies depending on context, situation, and source. For example, in health care, sources of data may be as disparate as social media and biometric data. “Big data” linking these various sources is enthusiastically discussed as a tool to control costs, improve the quality and efficiency of care, address fraud, and detect disease earlier through advanced technology such as electronic sensors.[6] In other fields, such as community development, there is increasing interest on the part of international bodies such as the United Nations in using data “to gain insight into human well-being and development.”[7] As promising as all these efforts might be, policymakers can anticipate political, legal, and technological challenges to using integrated data sets for policy purposes. Each is discussed briefly below with potential solutions.

Political Challenges

Public concern over personal privacy may create a barrier to data integration. It is unclear how deeply or broadly those concerns run. One poll regarding activities by the National Security Agency (NSA) to mine phone and other electronic data showed that a majority of Americans value privacy over security, while an earlier poll showed that a majority of Americans thought the NSA program was acceptable as a tool in combatting terrorism.[8] More influential in shaping public opinion are breaches of security that raise fears of identity theft on a mass scale, such as the Target data breach in late 2013. In addition, potential privacy issues emerging in geotagging (the process of adding geographical identification to a photograph or website) may add to the concern. The use of large data sets that might yield significant information about individuals without their express knowledge or consent may become more politically charged.[9]

There have also been multiple breaches involving health data, which may exacerbate fears over intrusions into privacy. For example, a health system in Texas revealed that records of up to 405,000 patients may have been compromised in December 2013 when one of its servers was hacked, potentially exposing names, dates of birth and Social Security numbers.[10] Data breaches involving health care records increased by 138 percent between 2009 and 2012, with nearly 30 million records compromised in that period.[11]

Political Solutions

Although there is no standard solution for addressing the politics of data sharing, there is little doubt that the issue has political salience and that privacy concerns must be balanced against the benefits of data use.[12] As the number and variety of examples of using integrated data in policy grow, the benefits and payoffs will emerge more clearly. Leaders of public agencies reluctant to share data to avoid the possibility of inappropriate disclosure or negative public perceptions may ultimately conclude that the benefits outweigh the risks. In addition, toolkits now exist for communicating the benefits of data integration and, in the process, easing doubts. Some excellent examples have been developed by the Data Quality Campaign and the National Neighborhood Indicators Partnership.[13] Ultimately, however, given that nearly all significant breaches of privacy have occurred because of insufficient security, the political issues regarding privacy can in part be addressed by improving data security.

Legal Challenges

Often those who do not want to share data believe the law does not permit it. Occasionally, this is true, but in many circumstances the claim that it is unlawful is a convenient reason to halt the conversation before it gets started. Confidentiality law in the United States is a patchwork of state and federal law. Some confidentiality laws (for example, many state health and mental health confidentiality statutes) were written long before the emergence of electronic data sets and therefore are increasingly antiquated. In other situations, such as confidentiality protection for those who are HIV positive, states wrote stringent special laws because of potential discrimination. Other laws, such as the federal Health Insurance Portability and Accountability Act (HIPAA) and the Family Educational Rights and Privacy Act (FERPA), are designed to create national standards. Courts have created other confidentiality rules. For example, the U.S. Supreme Court in 1996 ruled that clinical information created in psychotherapy sessions was privileged (that is, could not be accessed in legal proceedings).[14] But federal law does not always take precedence. For example, if a state law provides greater privacy protection to protected health information than HIPAA, then the state law applies. This complex web of overlapping and sometimes conflicting law can make negotiations over integration and use of data for policy purposes frustrating even for those fully committed to its use.

Before turning to potential solutions, it is worth noting why this complexity exists. First, each confidentiality law focuses primarily on a specific type of information created in the context addressed by the law. For example, HIPAA addresses “protected health information.” FERPA primarily addresses educational records. As a result, standards for waiving confidentiality or accessing the information in question may vary by law, for information that identifies or may identify an individual and for such records in more aggregated form.

Second, although confidentiality is a core value, it is not absolute. Every confidentiality law provides for situations in which information subject to the law may or must be released. Sometimes information specific to an individual may be sought in a legal proceeding in which the court orders release of an individual’s medical records. In other contexts, oversight agencies receive aggregated data on specific outcomes. For example, states must report child welfare data in seven categories to the U.S. Children’s Bureau for an annual report to Congress. There are similar requirements for the reporting of homeless data to the U.S. Department of Housing and Urban Development.[15] While these data do not typically identify individuals, they rest on the collection of information from numerous individual cases.

Third, the real controversy that often arises in discussions about data sharing is whether the law permits access to individually identifiable information for the purpose of data integration and use. This can make access more complicated because of reluctance to release individually identifiable information. Yet information that identifies individuals may be essential for analyses most useful to policymakers.[16] For example, New York City staff reported the benefits of specific programs for the homeless. The reports were based on five years of “mortality surveillance” data of the city’s homeless population.[17] The authors of the study noted the benefits of using real-time, individually identifiable data compared to aggregate data and it is worth quoting them at length:

“Retrospective analyses of aggregate morbidity and mortality data from a specific study period can identify health problems such as multiple comorbid conditions, substance abuse, or mental illness that result in premature death in a homeless population. However, homeless mortality surveillance offers the advantage of ongoing, systematic, and timely data collection and dissemination that reflects the current health status of the homeless population. Ongoing surveillance can identify changing trends in illness and death…in close to real time, allowing faster implementation of preventive interventions.”

Legal Solutions

Despite these problems, policymakers are using integrated data, and there are good resources available for helping those who wish to take advantage of these data and techniques navigate the complexity. For example, the University of Pennsylvania leads the Actionable Intelligence for Social Policy initiative, which is developing and using large integrated data sets, many with individually identifiable information, for policy purposes.[18] They have commissioned a series of papers, including an overview I wrote of the “state of the law” on confidentiality and access.[19] Another example of a university-based initiative is the Information Sharing Certificate Program at Georgetown University, which teaches leaders in youth-serving agencies how to overcome information-sharing challenges while protecting the privacy of youth and their families.[20]

Other resources describe agreements that enable access and use of protected data. For example, HIPAA may require the use of a “business associate agreement” between a state agency and a party accessing protected health information for purposes of analysis. The U.S. Department of Health and Human Services offers a description of the purpose and requirements of such a business associate agreement and also provides sample agreements that can be adopted.[21] The National Neighborhood Indicators Partnership devotes a web page to the “key elements of data sharing agreements.”[22] The Data Resource Center for Child and Adolescent Health, which offers data sets based on interview data provided by the National Center for Health Statistics, provides a data use agreement with every request for data.[23] The State Data Resource Center website of the Centers for Medicare and Medicaid Services provides information on the types of data available to state Medicaid agencies on individuals enrolled in both Medicare and Medicaid, including a Data Use Agreement.[24] In short, and in contrast to a few years ago, there is a wealth of information on using individually identifiable data for policy purposes. These resources make it easier to tend to the needs of all parties while overcoming barriers to the use of large data sets, including those that contain identifiable information.

Technical Challenges

Technical advances have made the development, integration and use of large data sets possible and have created a sense of promise about integrated data’s potential. These advances include both vastly improved statistical and computational methods and the exponential growth in storage and computational capacity.

However, technical issues can also thwart the promise of the revolution in method and capacity. For example, a data set generated for one purpose (such as arrest data) may contain a different personal identifier than that contained in another (such as Medicaid data). This makes accurately linking the data sets difficult, and thus compromises the ability of analysts to perform the analyses that policymakers would like by complicating efforts to track individuals across data sets.

In 2011, 25 Semantic Web and Database researchers convened in Riga, Latvia to discuss opportunities and challenges of using “big data,” including linked data. In a summary of the proceedings[25] one of the participants suggested that there were two “challenge classes” that must be met in order to use the data widely: the first, an engineering challenge of “efficiently managing data at unimaginable scale” and the requirement for advanced computing power and software that government agencies or nonprofits likely do not have. The other class of challenges is “semantics,” that is, “finding and meaningfully combining information that is relevant to your concern.”[26]

There are also resource and skills issues. The period of rapid advancements in data integration happened to coincide with cuts to the government workforce, limiting the number of staff available to work on data development. Therefore, whether a governmental agency has the intellectual capacity to engage in this work or develop the capacity to do so is an open question in some jurisdictions. In addition, this issue is not restricted to government. A 2012 survey of Fortune 500 executives revealed significant reservations about whether they had enough skilled workers to adequately use data in business planning, an issue exacerbated by staff and analytic capacity cuts during the recession.[27]

Technical Solutions

Solutions to some technical issues may be methodological. For example, one group of researchers interested in exploring clinical issues arising in pediatric cardiac care created a method that relied on “indirect identifiers” (date of birth, date of admission, date of discharge, and sex) to link administrative data (e.g. Medicare) to clinical registry data. The linkage allows analysis of where various procedures were performed for patients over time, which in turn can support better care.[28] Linked clinical and administrative data will become increasingly important in evaluating health policy questions, particularly around use and cost of services, so insight into methods that create this linkage is relevant to policymakers as well as clinicians.
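To make the idea of linking on indirect identifiers concrete, the following is a minimal sketch in Python. It is an illustration only, not the cited researchers' actual procedure: the field names, the sample records, and the rule of accepting a match only when the combination of identifiers is unique in both data sets are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical field names for the four indirect identifiers named in the text.
KEY_FIELDS = ("date_of_birth", "date_of_admission", "date_of_discharge", "sex")


def indirect_key(record):
    """Build a normalized linkage key from the indirect identifiers."""
    return tuple(str(record[field]).strip().upper() for field in KEY_FIELDS)


def index_by_key(records):
    """Group records by their linkage key."""
    index = defaultdict(list)
    for record in records:
        index[indirect_key(record)].append(record)
    return index


def link_on_indirect_identifiers(registry_records, admin_records):
    """Pair records whose key appears exactly once in each data set.

    Keys that match more than one record on either side are reported as
    ambiguous instead of being linked (an assumption of this sketch).
    """
    registry_index = index_by_key(registry_records)
    admin_index = index_by_key(admin_records)
    linked, ambiguous = [], []
    for key, registry_rows in registry_index.items():
        admin_rows = admin_index.get(key, [])
        if not admin_rows:
            continue  # no candidate match in the administrative data
        if len(registry_rows) == 1 and len(admin_rows) == 1:
            linked.append((registry_rows[0], admin_rows[0]))
        else:
            ambiguous.append(key)
    return linked, ambiguous


# Invented sample records; neither data set shares a common person identifier.
registry = [
    {"registry_id": "R1", "date_of_birth": "2009-03-14",
     "date_of_admission": "2009-04-01", "date_of_discharge": "2009-04-10",
     "sex": "F"},
    {"registry_id": "R2", "date_of_birth": "2008-11-02",
     "date_of_admission": "2009-01-15", "date_of_discharge": "2009-01-20",
     "sex": "M"},
]
admin = [
    {"claim_id": "C7", "date_of_birth": "2009-03-14",
     "date_of_admission": "2009-04-01", "date_of_discharge": "2009-04-10",
     "sex": "f"},
    {"claim_id": "C8", "date_of_birth": "2010-06-30",
     "date_of_admission": "2011-02-03", "date_of_discharge": "2011-02-05",
     "sex": "M"},
]

linked, ambiguous = link_on_indirect_identifiers(registry, admin)
for registry_row, admin_row in linked:
    print(registry_row["registry_id"], "<->", admin_row["claim_id"])
```

Setting ambiguous keys aside rather than guessing trades some coverage for fewer false links. In practice, any such linkage would still need validation against a sample with known matches.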

Others have developed techniques based on probability theory that create unduplicated counts of individuals in data sets that do not contain unique person identifiers.[29] This permits policy analyses using individual data without having to find a common identifier for linking, thus lowering the barrier to linking individuals across data sets and providing privacy protection as well.
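The general idea behind such estimates can be sketched in a few lines of Python. The sketch below is not necessarily the cited authors' exact formula. It uses the classical occupancy (coupon-collector) expectation: the expected number of people needed to produce a given number of distinct birthdays within one birth-year and sex group. It assumes birthdays are uniformly distributed over 365 days, and it folds February 29 into February 28. The function names and sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date

DAYS_IN_YEAR = 365  # assumption: birthdays are uniform over 365 days


def estimate_people(distinct_birthdays, days=DAYS_IN_YEAR):
    """Expected number of people needed to observe this many distinct birthdays."""
    if not 0 <= distinct_birthdays <= days:
        raise ValueError("distinct_birthdays must be between 0 and days")
    return sum(days / (days - j) for j in range(distinct_birthdays))


def estimate_unduplicated(records):
    """Estimate how many distinct people lie behind (date_of_birth, sex) records.

    Records may repeat the same person many times; no person identifier is used.
    """
    birthdays_seen = defaultdict(set)
    for dob, sex in records:
        # Fold Feb 29 into Feb 28 so each group has at most 365 cells.
        day = 28 if (dob.month, dob.day) == (2, 29) else dob.day
        birthdays_seen[(dob.year, sex)].add((dob.month, day))
    return sum(estimate_people(len(seen)) for seen in birthdays_seen.values())


def estimate_overlap(records_a, records_b):
    """Estimated number of people appearing in both data sets."""
    combined = list(records_a) + list(records_b)
    return (estimate_unduplicated(records_a)
            + estimate_unduplicated(records_b)
            - estimate_unduplicated(combined))


# Invented sample: one shared birthday and sex combination across the two sets.
data_set_a = [(date(1980, 1, 1), "F"), (date(1980, 5, 5), "F")]
data_set_b = [(date(1980, 5, 5), "F"), (date(1981, 7, 7), "M")]

overlap = estimate_overlap(data_set_a, data_set_b)
print(round(overlap, 3))
```

Because only dates of birth and sex are compared, no individual is ever identified. The result is a statistical estimate of counts and overlap, not a list of matched people, and it becomes unreliable as a group approaches all 365 possible birthdays.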

The capacity and resource issues might also resolve with better training of students. Business and government need employees who are more familiar with integrated data sets. Whether enough colleges and universities develop curricula to meet the needs of government and private business remains to be seen, but clearly private industry is interested in stimulating the movement. IBM, for example, has announced the creation of a “big data and analytics curriculum” in partnership with a number of academic institutions. The curriculum will prepare students for what the company estimates to be the 4.4 million jobs worldwide that will be supporting “big data” by 2015.[30]

Finally, as government agencies, health care and social services providers, and educational institutions among others become more sophisticated about the issues involved in linking and using large data sets, they presumably will become more sophisticated about using their authority (often derived from their status as a contractor for services) to require the collection and transfer of relevant data from different vendors. As noted earlier, federal agencies already do this because they have reporting obligations to Congress or others, and one can anticipate that state and local governments will begin to do so more frequently to generate data more suited to later analyses. Therefore, if a county social welfare agency contracts for services with providers, it can contractually require the providers (consistent with various legal norms) to provide information, including individually identifiable information necessary to monitor outcomes that the county agency is purchasing.

Outlook for the Future

Notwithstanding various challenges, the outlook for using integrated data for policy purposes is bright. The use of such data for policy is comparatively new, so it is not surprising that various political, legal, and technical challenges have arisen. Despite these challenges, it is difficult to imagine policymakers retreating from the use of linked data as these challenges are met. This is not to suggest that the development, implementation, and evaluation of all social policies will soon be informed by data. However, we do appear on the verge of an era when the use of such data for policy purposes will rapidly accelerate and expand.

On the political side, one promising development is a directive from the White House Office of Management and Budget (OMB) urging all federal agencies to set aside at least some program evaluation funding for evaluations that use integrated data. In a May 2012 memorandum, the acting director of OMB asked executive department and agency heads to “demonstrate the use of evidence” in their 2014 budget submissions. In addition, agencies proposing new evaluations were advised that “agencies can often use administrative data (such as data on wages, employment, emergency room visits or school attendance) to conduct rigorous evaluations, including evaluations that rely on random assignment, at low cost.”[31] With time, this type of support should translate into more evaluations that rely on integrated data.

Technical and methodological advances will continue to open up exciting opportunities to supplement and enhance the power of administrative data. One of the most important is Geographic Information Systems (GIS). GIS permits users to collect, store, and analyze geographic data. GIS can be used to visually display the results of data analysis, but it can also be a complementary form of analysis itself, enabling analysts to build geographic data, such as the distribution of health centers or schools, into an analytic plan. Use of GIS is expanding very quickly. For example, the National Resource Center for Child Welfare Data and Technology describes how GIS can be used in the administration and planning of child welfare services.[32] A page on the U.S. Department of Housing and Urban Development website is devoted to the use of “geospatial data resources” in examining housing issues, including a large number of data sets.[33] NASA is increasingly using data it develops using GIS to examine health-related issues alone and in partnership with agencies such as the Centers for Disease Control and Prevention.[34]

Partnerships among the public, academic, and private sectors are another approach for driving innovation in the use of linked data. More colleges and universities are recognizing the potential for generating knowledge (and potential funding) through such partnerships. In addition to the IBM example and the Actionable Intelligence for Social Policy initiative at the University of Pennsylvania, mentioned above, Harvard University has established the Institute for Quantitative Social Science as its home for social science research, with an emphasis on the use of quantitative data as a tool.[35]

Funders are stepping into this area as well. For example, the Annie E. Casey Foundation has funded a six-site project through the National Neighborhood Indicators Partnership titled “Connecting People and Place: Improving Communities through Integrated Data Systems.”[36] The project spurs collaboration among universities, nonprofits, and public agencies to expand the use of integrated data systems (IDS) to generate neighborhood indicators and inform local policy issues. Seed money like this is important in enabling communities to begin using data to improve decision making and outcomes for individuals in particular neighborhoods.

The outlook for the use of integrated data for policy purposes in virtually any sphere seems boundless. Although there are challenges to meet, the use of integrated data can, with time, dramatically improve the public policy process. It can also help better ensure that initiatives in housing, child welfare, health, social supports, and education are grounded in evidence and, equally important, are evaluated using empirical data rather than anecdotal information.

[1]   United States Bureau of the Census, Software and Standards Management Branch, Systems Support Division, “Survey Design and Statistical Methodology Metadata,” (Washington D.C.: August 1998), Section 3.3.7, page 14.

[2]   The Early Childhood Data Collaborative. “2013 State of States’ Early Childhood Data Systems” (2014). Available at http://www.ecedata.org/2013-national-results/.

[3]   For more information see, Massachusetts Department of Public Health, Bureau of Environmental Health, “The Massachusetts EPHT Program,” available at https://matracking.ehs.state.ma.us/EPHT_Program/.

[4]   R. Carson and E. Laird, “Linking Data across Agencies: States That Are Making It Work” (Data Quality Campaign, March 2010), available at http://forumfyi.org/files/States.That.Are.Making.It.Work.pdf.

[5]   M. Bardsley et al., “Predicting Who Will Use Intensive Social Care: Case Finding Tools Based on Linked Health and Social Care Data,” Age and Ageing, (Oxford University Press, Jan. 20, 2011): 1–5.

[6]   See, e.g., Institute for Health Technology Information, “Transforming Health Care through Big Data: Strategies for Leveraging Big Data in the Health Care Industry” (New York: IHTI, 2013), available at http://ihealthtran.com/big-data-in-healthcare.

[7]   United Nations Global Pulse (2013). Big Data for Development: A Primer, available at http://www.unglobalpulse.org/bigdataprimer.

[8]   Associated Press, “Poll: Americans Value Privacy over Security,” Politico, January 27, 2014, available at www.politico.com/story/2014/01/poll-americans-privacy-security-102663.html; and Pew Research Center for People and the Press, “Majority Views NSA Phone Tracking as Acceptable Anti-terror Tactic” (Washington, DC: Pew, June 10, 2013), available at www.people-press.org/2013/06/10/majority-views-nsa-phone-tracking-as-acceptable-anti-terror-tactic/.

[9]   For a discussion, see A. Chawdhry, K. Paullet, and D. M. Douglas, “Raising Awareness: Are We Sharing Too Much Private Information?” Issues in Information Systems, 14(2) (2013): 375-381.

[10]  D. Carr, “Texas Hospital Discloses Huge Breach,” Information Week, Feb. 5, 2014, available at www.informationweek.com/healthcare/security-and-privacy/texas-hospital-discloses-huge-breach-/d/d-id/1113724.

[11]  Erin McCann, “HIPAA Data Breaches Climb 138 Percent,” Healthcare IT News, Feb. 6, 2014, available at www.healthcareitnews.com/news/hipaa-data-breaches-climb-138-percent. The U.S. Department of Health and Human Services, which now tracks breaches of health information affecting 500 or more individuals, reports scores of breaches. See www.hhs.gov/ocr/privacy/hipaa/administrative/breachnotificationrule/breachtool.html.

[12]  For example, in 2012, the Obama administration attempted to draw that balance in its release of “A Framework for Protecting Privacy and Promoting Innovation in a Networked World,” available at http://www.whitehouse.gov/sites/default/files/privacy-final.pdf.

[13]  See Data Quality Campaign, “Let’s Give Them Something to Talk About: Tool for Communicating the Data Message” (Washington, DC: DQC, Jan. 29, 2013), available at http://dataqualitycampaign.org/find-resources/tools-for-communicating-the-data-message; DQC, “Cheat Sheet: Data Privacy, Security, and Confidentiality” (Washington, DC: DQC, n.d.), available at http://dataqualitycampaign.org/files/Cheat%20Sheet%20Privacy.pdf; National Neighborhood Indicators Partnership, “Why Data Providers Say No…And Why They Should Say Yes,” (Washington, DC: NNIP, Feb. 28, 2013), available at www.neighborhoodindicators.org/library/guides/why-data-providers-say-noand-why-they-should-say-yes.

[14]  Jaffee v. Redmond, 518 U.S. 1 (1996).

[15]  See http://www.hudhdx.info/.

[16]  Linking identifiable data to track cohorts is not only useful to policymakers. Linking cancer registry data to Medicare and Medicaid claims files enabled researchers to identify and track cancer patients over time to determine the effectiveness of care. See D. Schrag, B.A. Virnig, and J.L. Warren, “Linking Tumor Registry And Medicaid Claims To Evaluate Cancer Care Delivery,” Health Care Financing Review, 30(4) (2009): 61–73.

[17]  M. Gambatese et al., “Programmatic Impact of 5 Years of Mortality Surveillance of New York City Homeless Populations,” American Journal of Public Health 103 (2013):S193-198.

[18]  See http://www.aisp.upenn.edu/.

[19]  John Petrila, “Legal Issues in the Use of Electronic Data Systems for Social Science Research,” (Philadelphia: University of Pennsylvania, n.d.), available at: http://www.sp2.upenn.edu/aisp_test/wp-content/uploads/2012/12/0033_12_SP2_Legal_Issues_Data_Systems_000.pdf.

[20]  For more information, see http://cjjr.georgetown.edu/certprogs/informationsharing/certificateinformationsharing.html.

[21]  HHS, “Business Associate Contracts: Sample Business Associate Agreement Provisions” (Washington, DC: HHS, 2013), available at www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/contractprov.html.

[22]  See “Key Elements of Data Sharing Agreements,” available at www.neighborhoodindicators.org/library/guides/key-elements-data-sharing-agreements.

[23]  See data request form at http://childhealthdata.org/help/dataset.

[24]  See State Data Resource Center site at http://www.statedataresourcecenter.com/.

[25]  C. Bizer, P. Boncz, M.L. Brodie, and O. Erling, “The Meaningful Use of Big Data: Four Perspectives—Four Challenges,” SIGMOD Record 40 (4) (2011): 56–60.

[26]  Bizer, et al.

[27]  P. Barth and R. Bean, “There’s No Panacea for the Big Data Talent Gap,” Harvard Business Review blog, Nov. 29, 2012, available at http://blogs.hbr.org/2012/11/the-big-data-talent-gap-no-pan/.

[28]  S.K. Pasquali, et al., “Linking Clinical Registry Data with Administrative Data Using Indirect Identifiers: Implementation and Validation in the Congenital Heart Surgery Population,” American Heart Journal 160(6) (2010): 1099–1104.

[29]  A description of the method can be found in S. Banks and J.A. Pandiani, “Probabilistic Population Estimation of the Size and Overlap of Data Sets Based on Date of Birth,” Statistics in Medicine 20(2001): 1421–1430.

[30]  IBM Press release, “IBM Narrows Big Data Skills Gap By Partnering With More Than 1,000 Global Universities,” (Armonk, NY: IBM, August 14, 2013), available at https://www-03.ibm.com/press/us/en/pressrelease/41733.wss.

[31]  Office of the President, “Memorandum to the Heads of Executive Departments and Agencies: Use of Evidence and Evaluation in the 2014 Budget” (Washington, DC: The White House, May 18, 2012), available at http://www.whitehouse.gov/sites/default/files/omb/memoranda/2012/m-12-14.pdf.

[32]  National Resource Center for Child Welfare Data and Technology, “Using GIS for Policy and Planning: New York City Example” (Washington, DC: NRCCWDT, n.d.), available at www.nrccwdt.org/2011/10/using-gis-for-policy-and-planning/.

[33]  See http://www.huduser.org/portal/datasets/gis.html.

[34]  Urban and Regional Information Systems Association, “Overview of NASA’s GIS Leadership Role” (Des Plaines, IL, n.d.), available at www.urisa.org/awards/national-aeronautics-and-space-administration-nasa/.

[35]  For more information see IQSS website at www.iq.harvard.edu/.

[36]  See National Neighborhood Indicators Partnership, “Connecting People and Place: Improving Communities through Integrated Data Systems,” (Washington, DC: NNIP, June 2013), available at www.neighborhoodindicators.org/activities/projects/connecting-people-and-place-improving-communities-through-integrated-d.