Welcome to Secure Transfer, Restricted-Use Data Lake
This page introduces the U.S. Department of Labor's (DOL) Secure Transfer, Restricted-Use Data Lake (STRUDL), explaining the need for such a program and how the program is structured.
Each section has an accompanying video, and the text is provided as a visual aid. Documents referenced in each video are always formatted in bold, and descriptions of each document can be found on STRUDL: Forms. We encourage potential applicants and program participants to reference the STRUDL Handbook for more detail on the full program process.
Video 1.1 introduces data privacy and confidentiality, the value of releasing research results and data publicly, and the tension between privacy and the usefulness of these results. It then explains how these ideas inform the motivation and structure of the STRUDL program, and the legal and moral obligation to protect privacy that STRUDL-approved researchers undertake. Video 1.2 reviews the STRUDL program structure and lifecycle and introduces the process for obtaining system access.
Video 1.1: Motivating Data Privacy and Secure Transfer, Restricted-Use Data Lake
Overview
In this section, potential applicants and program participants will learn about:
- Motivation for data privacy, the value of releasing research results, and the tension between privacy and the usefulness of these results.
- How these ideas inform the motivation and structure of STRUDL, and the legal and moral obligation to protect privacy undertaken by users of restricted-use data.
What is data privacy?
Data privacy refers to the right of individuals to control the disclosure or release of sensitive information about themselves. It is a broad topic which includes data security and encryption, controlling access to sensitive data, and more.
STRUDL is designed to balance secure data access with the benefits of publicly releasing research and program evaluation results. Data must be protected while being accessed and used only for approved purposes, and confidential information must be protected—including once results are published.
Why release data, statistics, and research results?
DOL research and program evaluation activities are a crucial component of the department’s mission “to foster, promote, and develop the welfare of the wage earners, job seekers, and retirees of the United States; improve working conditions; advance opportunities for profitable employment; and assure work-related benefits and rights.”
The importance of the DOL’s mission demands continual innovation, learning, and improvement. The data collected by DOL and the research and program evaluation results generated from it advance this mission by ensuring that program activities are completed responsibly and effectively.
For example, in 2017, the DOL Chief Evaluation Office and the DOL Veterans’ Employment and Training Service Agency commissioned a program impact evaluation on the Homeless Veterans’ Reintegration Program. This program is the only federal program that focuses exclusively on providing employment services to veterans experiencing homelessness.
As depicted in Figure 1 below, DOL benefits internally from this program evaluation since the results
- contribute to the annual process to determine DOL research priorities for the upcoming year,
- build the labor evidence base informing employment and training programs and policies, and
- address DOL strategic goals and priorities.
Figure 1 also shows that these benefits are enhanced when the program evaluation results are released publicly. Publicly releasing these results provides the following benefits:
- Informs public decisionmaking and provides useful insights to organizations beyond DOL;
- Promotes awareness of DOL programs and their effectiveness; and
- Improves public accountability.
How does releasing data, statistics, and research results threaten privacy and confidentiality?
Privacy and confidentiality are related but different concepts. Privacy refers to an individual's ability to determine how much personal information is shared with others. Confidentiality entails an implicit or explicit agreement between the data subjects and the data collectors regarding whether and how additional parties may access the data subjects’ information.
Although releasing data, statistics, and research results has many benefits for DOL, it can also create threats to both privacy and confidentiality. In the example of the program evaluation for the Homeless Veterans’ Reintegration Program described above, the veterans and grantees who participated in the study and interviews have a legal and moral right to privacy and confidentiality.
Releasing microdata can threaten privacy
Program participants and grantees were surveyed as part of the impact evaluation for the Homeless Veterans’ Reintegration Program, generating microdata in the form of survey results. Releasing this data publicly would allow anyone to work with the data, confirm the DOL results, or even find new insights; at first glance, this is a net benefit. However, even if the data were anonymized (by removing all directly identifying information), it could still be possible to use the microdata to re-identify individuals.
Consider the following illustrative example that highlights the potential data privacy challenges. Please note that this example is inspired by a real project—the impact evaluation described—but is not necessarily representative of the true disclosure risks.
Suppose the surveys in this project generated the following record:
Interviewee Type | Race | Gender | Current Employment Status | HVRP Site | Length of Time in Program | Program Satisfaction Rating (1–5) |
---|---|---|---|---|---|---|
Site client | White | Male | Unemployed | Springfield | 6 months | 2 |
Even though this data does not contain identifying information, a malicious actor could identify the individual based on the characteristics in this record.
For example, suppose the malicious actor knows that only five individuals fit the recorded description of an unemployed, white male at the Springfield site. The bad actor could then use other information to narrow this small group down to one person, for example by cross-referencing a list of individuals who were at the site when the survey was conducted.
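The re-identification risk described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made up, the attribute names and the second site ("Shelbyville") are hypothetical, and the threshold `k=10` is an arbitrary example value rather than a STRUDL or DOL rule. The sketch counts how many records share each combination of indirectly identifying attributes and flags the combinations shared by only a few people.

```python
from collections import Counter

# Attributes that are not direct identifiers but can single someone out
# when combined (hypothetical column names for this illustration).
QUASI_IDENTIFIERS = ("race", "gender", "employment", "site")


def make_records(race, gender, employment, site, count):
    """Build `count` made-up records sharing the same attribute values."""
    return [
        {"race": race, "gender": gender, "employment": employment, "site": site}
        for _ in range(count)
    ]


# Made-up data: NOT real survey results.
records = (
    make_records("White", "Male", "Unemployed", "Springfield", 5)
    + make_records("Black", "Female", "Employed", "Springfield", 12)
    + make_records("White", "Female", "Unemployed", "Shelbyville", 1)
)


def group_sizes(rows, keys):
    """Count how many records share each combination of values for `keys`."""
    return Counter(tuple(row[key] for key in keys) for row in rows)


def risky_groups(rows, keys, k):
    """Return the combinations shared by fewer than `k` records."""
    return {combo: n for combo, n in group_sizes(rows, keys).items() if n < k}


# k=10 is an arbitrary example threshold, not an official rule.
flagged = risky_groups(records, QUASI_IDENTIFIERS, k=10)
for combo, n in flagged.items():
    print(combo, "is shared by only", n, "record(s)")
```

In this made-up data, the group of five matches the scenario in the text: the smaller the group, the less outside information a bad actor needs to single out one person.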
Because veterans as a population have unique concerns, and unemployment and homelessness are often stigmatized, releasing this data could cause harm to program participants and the broader DOL mission.
Releasing statistics and research results can threaten privacy
Releasing statistics derived from confidential data is not risk-free either. The program evaluation was careful to release only summary statistics that protected privacy; however, results that are too specific could still cause privacy issues.
For example, consider a survey of program participants that collected information about veterans’ salaries after they found employment through the program. If DOL had released the maximum salary obtained at each site, each of these statistics would likely have revealed the exact income of one person. As in our earlier example, a malicious actor could then identify this person using other context, such as the location of the site, other supplemental information released about the site and program participants, or even external datasets.
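The salary example can be made concrete with a short Python sketch. Everything here is an assumption for illustration: the salaries and site names are made up, and the minimum group size of five and the rounding to the nearest $1,000 are arbitrary example choices, not STRUDL disclosure review rules. The sketch shows that a per-site maximum is always one respondent's exact value, and contrasts it with one common alternative: releasing a rounded mean only for sites with enough respondents.

```python
# Made-up salaries by site: NOT real program data.
salaries_by_site = {
    "Springfield": [31000, 34500, 29000, 52000, 33000],
    "Shelbyville": [41000, 38000],
}

# Hypothetical minimum number of respondents required to release a statistic.
MIN_CELL_SIZE = 5


def max_salary_by_site(data):
    """Per-site maximum; each value is exactly one respondent's salary."""
    return {site: max(values) for site, values in data.items()}


def safer_mean_by_site(data, min_cell_size):
    """Rounded mean per site, or None (suppressed) for small sites."""
    released = {}
    for site, values in data.items():
        if len(values) >= min_cell_size:
            # Round to the nearest 1,000 so no exact value is published.
            released[site] = round(sum(values) / len(values), -3)
        else:
            released[site] = None
    return released


maxima = max_salary_by_site(salaries_by_site)
safer = safer_mean_by_site(salaries_by_site, MIN_CELL_SIZE)
print("Maximum per site:", maxima)
print("Rounded mean per site (None = suppressed):", safer)
```

The suppressed and rounded output is less informative than the exact maximum, which is the cost paid for lowering the risk that any single person's income is revealed.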
What is the privacy–utility tradeoff?
The examples we just walked through demonstrate the tradeoff between privacy and the public good that arises when results derived from confidential data are released publicly. Privacy practitioners refer to this tension as the privacy–utility tradeoff, where utility refers to the overall usefulness of the released data.
It is impossible to perfectly protect privacy or perfectly preserve utility; instead, organizations like DOL must ensure they minimize the disclosure risk for individuals in their data while still allowing research and evaluation results to benefit the general public.
What is STRUDL?
STRUDL attempts to navigate the privacy–utility tradeoff by ensuring that researchers and data practitioners can still access useful data, the research conducted will benefit the DOL mission and the broader public, and research results can still be made available, while carefully screening these results for disclosure risk (or risk of sensitive information being released).
The structure of STRUDL assumes that users of DOL restricted-use data have both a moral and legal obligation to preserve the privacy of individuals and groups present in the data:
- The moral obligation of restricted-use data users is to minimize the harm of their research and maximize the benefits.
- The legal obligation is tied to a nondisclosure agreement (NDA) that all restricted-use data users are required to sign before gaining access to the data.
Approved researchers are encouraged to frequently reference the STRUDL Handbook, which contains detailed information about the program structure, policies, and processes covered in these videos.
Video 1.2: Secure Transfer, Restricted-Use Data Lake Structure and System Access
Overview
In this section, potential applicants and program participants will learn about the following:
- The STRUDL program structure and lifecycle.
- How to apply to the program and obtain system access.
Who can apply?
To comply with DOL policies and maximize the effectiveness of access to restricted data, applicants to the program must meet the following criteria:
- Applicants must be individuals or groups of individuals, not organizations.
- Applicants must publish or otherwise make publicly available results generated using restricted DOL data.
- Approved applicants will be solicited for required feedback about program milestones, such as the application, onboarding, disclosure review, and offboarding processes.
- Applicants must sign and abide by a nondisclosure agreement (NDA) to access and use restricted data.
- Applicants must be based in and access the data from the United States. Access to some datasets may require US citizenship.
- All applicants should incorporate a potentially lengthy clearance process into their project plans.
- The research team should have the skills and experience needed to conduct statistical research on large administrative datasets in a secure environment.
What are the features of successful project proposals?
Applicants may use restricted DOL data only for research purposes that benefit the DOL mission and the broader public. To ensure the project proposal is successful, keep the following in mind:
- Projects should be manageable in scope and offer a clear benefit to DOL and its customers.
- Projects should provide a clear justification for requiring access to restricted-use data, and should demonstrate that alternative sources of data, such as publicly accessible data, are insufficient to answer the research question.
- Projects should meet DOL guiding principles and should not be for fiduciary purposes, such as conducting market research.
- Projects should allow for the publication or release of research results.
- Project plans should incorporate a potentially lengthy clearance process into proposed timelines.
What does the project lifecycle look like at a high level?
As depicted in Figure 2, the program begins with the application process and proceeds through several additional phases:
- Once the application is approved, applicants become approved researchers and must go through the separate process of receiving security clearances and system access from DOL;
- Once access is obtained, approved researchers can conduct the research proposed in their applications. This may include iterative modifications to the project, such as adding staff to the project, changing the research output or goals specified in the initial proposal, or changing the project timeline;
- Finally, once research has been conducted, approved researchers must formally exit the program, a process which includes reviewing all outputs for disclosure risk (discussed more in Video 2.1) and ensuring all project obligations are met (discussed more in Video 4.1). Approved researchers may re-obtain access after the formal program exit only if this access is needed to address Revise & Resubmit (R&R) requests. After the work for these requests is completed, approved researchers must redo the program exit process, including the review of all outputs for disclosure risk.
How do researchers start an application?
To ensure that the application process is as smooth as possible and that the STRUDL team is prepared to process the application, potential applicants should thoroughly review all website materials and the STRUDL Handbook.
Once potential applicants have reviewed all materials, they should contact the Restricted Use officer at STRUDL@dol.gov to express interest in applying and to receive application materials. To allow the STRUDL team to best advise them, applicants should indicate in this email whether or not they plan to use external data. The team will respond within 30 days and provide updated materials if applicable.
Once the materials are in place, applicants must complete and submit several forms to STRUDL@dol.gov. Descriptions of these forms can be found on STRUDL: Forms and are discussed in detail in the STRUDL Handbook. Applicants must provide the following information:
- An explanation of the proposed project and why access to DOL data is required;
- A description of the proposed research team and their qualifications;
- A checklist to help the STRUDL team anticipate the level of effort required to review the research output for privacy risk; and
- A mandatory nondisclosure agreement (NDA).
Within 30 days, applicants will receive notification of whether the proposal was “approved,” “rejected,” or “requires resubmission.”
Once approved, how do researchers obtain access to DOL systems and data?
After applicants have been approved (i.e., become approved researchers), they will participate in IT security training. Upon successful completion, they must also go through a separate application process to receive clearance for accessing DOL equipment and systems.
To start this process, DOL staff will contact approved researchers with the details of the clearance process. This process can be lengthy.
Once clearance has been granted, DOL staff will provide further instructions for requesting, obtaining, and setting up the following user tools:
- Personal Identity Verification (PIV) credentials and associated Personal Identification Number (PIN), which together will allow approved researchers to log into DOL systems; and
- Computing equipment, such as laptops and other hardware.
Further details on accessing systems and restricted data will be provided once approved researchers have completed this setup.