Using Similarity Scores to Identify Organizations of Interest by Website Research Brief

Release Date: December 16, 2024


About the Brief


Multiple agencies and programs within DOL may need to identify particular categories of organizations they work with, such as employment service providers, benefits providers, local unions, or specific types of employers. Such identification can support data collection, outreach, compliance, and enforcement activities. However, the organizational characteristics relevant to a given activity are not always available in datasets, which makes it difficult to identify which organizations to contact. An automated approach can make identifying the websites of potentially relevant organizations more efficient, while still allowing a human reviewer to make the final decision about whether an organization is relevant for contact.

This brief describes an automated approach using web scraping and natural language processing to identify websites of interest, provides a hypothetical example using the approach, and summarizes the lessons learned in applying this process.


Key Takeaways

  • The developed approach includes the following steps:
    1. Identify search terms that will be used by a web crawler.
    2. Automate the search process that crawls Google search results.
    3. Process website text to standardize the text for comparison.
    4. Calculate similarity scores by automatically comparing the text of each identified website to the text of websites of organizations already known to have the characteristics of interest.
    5. Conduct a manual review of the sites sorted by similarity scores (i.e., those most likely to be of interest).
  • While this approach is more efficient than a human performing Google searches to identify sites, it could be improved by scraping subpages in addition to the root pages used in this approach. Additionally, the process will capture irrelevant websites that use similar terminology, so human review remains a required part of the process.
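
Steps 3 through 5 above can be illustrated with a minimal sketch. This summary does not state which similarity measure the brief uses, so the sketch assumes TF-IDF term weighting with cosine similarity, a common choice for comparing documents. The function names, the short stop-word list, and the sample site names and texts are all hypothetical, and the sketch starts from text that has already been scraped.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "we", "our", "is", "are"}


def tokenize(text):
    """Step 3: standardize text by lowercasing, keeping alphabetic tokens, and dropping stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(token_lists):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed inverse document frequency, so terms found in every document keep some weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(reference_texts, candidate_texts):
    """Steps 4 and 5: score each candidate site by its mean similarity to the
    known sites, then sort highest first to prioritize manual review."""
    if not reference_texts:
        raise ValueError("at least one reference text is required")
    names = list(candidate_texts)
    docs = [tokenize(t) for t in reference_texts]
    docs += [tokenize(candidate_texts[name]) for name in names]
    vectors = tfidf_vectors(docs)
    ref_vecs = vectors[: len(reference_texts)]
    cand_vecs = vectors[len(reference_texts):]
    scores = {
        name: sum(cosine(vec, ref) for ref in ref_vecs) / len(ref_vecs)
        for name, vec in zip(names, cand_vecs)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical root-page text from organizations already known to be of interest.
known_sites = [
    "We provide job training and employment services for job seekers.",
    "Career counseling, resume help, and job placement services.",
]
# Hypothetical root-page text from sites returned by the automated search.
found_sites = {
    "example-jobs.org": "Job training and career services for local job seekers.",
    "example-bakery.com": "Fresh bread and pastries baked daily.",
}

for site, score in rank_candidates(known_sites, found_sites):
    print(f"{score:.3f}  {site}")
```

A reviewer would then work down the sorted list. Because a score reflects only shared vocabulary, a high-scoring site can still be irrelevant, which is why the final decision stays with a human.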

Citation

Cody, S., Ring, M., Roubal, A. (2024). Westat. Using Similarity Scores to Identify Organizations of Interest by Website. Chief Evaluation Office, U.S. Department of Labor.


The Department of Labor’s (DOL) Chief Evaluation Office (CEO) sponsors independent evaluations and research, primarily conducted by external, third-party contractors in accordance with the Department of Labor Evaluation Policy and CEO’s research development process.