This is a metadata assessment tool to query spreadsheet-based digital collection metadata against lexicons of offensive and outdated terminology.

marriott-library/MaRMAT

Marriott Reparative Metadata Assessment Tool (MaRMAT) - Beta

The Marriott Reparative Metadata Assessment Tool (MaRMAT) is a Python application designed for auditing collections metadata files against a lexicon of potentially problematic terms. The tool's design facilitates an easy-to-follow process for assessing metadata using a lexicon of terms. For PC users, we provide a graphical interface for file loading, column selection, and term matching, making the tool accessible to those with limited programming experience. The tool can also be run from the command line.

We value your feedback! Please take this survey to tell us about your experience using MaRMAT.

Table of Contents

  1. Project Background

    1.1 About the Tool

    1.2 The Lexicons

    1.3 Features

    1.4 Example Outputs and Tutorial

  2. The Command-Line Tool

    2.1 Usage

    2.2 Dependencies

    2.3 Notes

  3. The GUI for Windows Users

    3.1 Usage

    3.2 Dependencies

    3.3 Installation

    3.4 Troubleshooting

  4. Credits and Acknowledgments

  5. User Feedback Survey

1. Project Background

The Marriott Reparative Metadata Assessment Tool (MaRMAT) is based on Duke University’s Description Audit Tool. It is intended to assist digital collections metadata practitioners in bulk analysis of metadata collections to identify potentially harmful language in description and facilitate repairing metadata to reflect current and preferred terminology. While Duke University's Description Audit Tool was created to analyze MARC XML and EAD finding aid metadata, MaRMAT was developed to analyze metadata in a spreadsheet format. Because it only requires key column-header names, it can assess Dublin Core metadata as well as other schemas. In addition, the script has been altered to provide more custom querying capabilities.

MaRMAT is designed to query spreadsheet-based (CSV) metadata against a lexicon of potentially harmful terms in uncontrolled metadata elements, such as Title, Description, and Collection Title. Controlled metadata, such as Subject, can be queried against a database of outdated or problematic Library of Congress Subject Headings. Querying multiple columns of metadata at once against the provided lexicon, or a user-supplied custom lexicon, is intended to make bulk analysis more efficient than searching for individual keywords.

Identifying potentially harmful language, including problematic and outdated Library of Congress Subject Headings, is one step towards reparative metadata practices. Deciding what and how to change this metadata, however, is up to metadata practitioners and involves awareness, education, and sensitivity for the communities and history reflected in digital collections. The Digital Library Federation’s Inclusive Metadata Toolkit, created by the Digital Library Federation’s Cultural Assessment Working Group, provides resources to educate and assist in reparative metadata decision-making.

1.1 About the Tool

At the most basic level, MaRMAT is designed to match terms from a lexicon with textual data and produce a CSV file containing the matched results. It utilizes the Pandas library for data manipulation and regular expressions for text processing. It was designed primarily with librarians in mind, specifically those engaged in reparative metadata practices, to assist in identifying terms in their metadata that may be outdated, biased, or otherwise problematic. The underlying code (including preliminary iterations) and sample lexicons for using the tool can be accessed via the Code folder of this repository. For additional information about the GUI, see GUI-Documentation.
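The matching described above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in the repository: the function name `find_matches`, the sample rows, the output column names, and the choice of whole-word, case-insensitive matching are all assumptions made for the example.

```python
import re
import pandas as pd

# Hypothetical in-memory stand-ins for the lexicon and metadata CSV files.
lexicon = pd.DataFrame({
    "Term": ["pioneer", "wife"],
    "Category": ["Aggrandizement", "Gender Terms"],
})
metadata = pd.DataFrame({
    "Identifier": ["rec-001", "rec-002"],
    "Title": ["Portrait of a pioneer family", "Salt Lake City street scene"],
    "Description": ["Mr. Smith and wife at home", "View looking north"],
})

OUTPUT_COLUMNS = ["Identifier", "Term", "Category", "Column", "Original Text"]

def find_matches(lexicon, metadata, columns, categories, id_column="Identifier"):
    """Return one row per (term, column, record) match as a DataFrame."""
    selected = lexicon[lexicon["Category"].isin(categories)]
    rows = []
    for term, category in zip(selected["Term"], selected["Category"]):
        # Assumed matching rule: whole word, case-insensitive.
        # re.escape keeps punctuation in a term from being read as regex syntax.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for column in columns:
            for identifier, text in zip(metadata[id_column], metadata[column]):
                # Skip empty cells, which pandas represents as non-string values.
                if isinstance(text, str) and pattern.search(text):
                    rows.append({
                        "Identifier": identifier,
                        "Term": term,
                        "Category": category,
                        "Column": column,
                        "Original Text": text,
                    })
    return pd.DataFrame(rows, columns=OUTPUT_COLUMNS)

results = find_matches(
    lexicon, metadata,
    columns=["Title", "Description"],
    categories=["Aggrandizement", "Gender Terms"],
)
print(results.to_csv(index=False))
```

Keeping the identifier in every output row is what allows a reviewer to trace each match back to the record it came from.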

An initial test case developed a tool for parsing, extracting, tokenizing, and preprocessing XML files containing Open Archives Initiative (OAI) feed metadata for library special collections. The tool would then crosscheck tokens against Duke University's lexicons and append the corresponding lexicon categories (Aggrandizement, Race Euphemisms, Race Terms, Slavery Terms, Gender Terms, LGBTQ, Mental Illness, and Disability) to each row in the CSV output. This tool is accessible via the XML Test Code folder of this repository. Please note that it may not work with all OAI feed formats and may not take resumption tokens into account.
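To give a sense of what parsing and tokenizing an OAI feed involves, here is a small sketch using only the Python standard library. It is not the code in the XML Test Code folder: the sample XML, the `tokenize` helper, and the choice of Dublin Core title and description fields are assumptions for illustration. It also only reads the resumption token without following it, which is the limitation mentioned above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed OAI-PMH ListRecords response.
OAI_XML = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Portrait of a Pioneer Family</dc:title>
          <dc:description>Mr. Smith and wife, at home.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

root = ET.fromstring(OAI_XML)
records = []
for record in root.iterfind(".//oai:record", NS):
    identifier = record.findtext("oai:header/oai:identifier", default="", namespaces=NS)
    texts = [el.text or "" for el in record.iterfind(".//dc:title", NS)]
    texts += [el.text or "" for el in record.iterfind(".//dc:description", NS)]
    records.append({"identifier": identifier, "tokens": tokenize(" ".join(texts))})

# A non-empty token means the feed has more pages that this sketch does not fetch.
resumption_token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)

print(records)
print("More pages available:", bool(resumption_token))
```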

1.2 The Lexicons

There are two lexicons provided to help begin your reparative metadata assessment. Not all of the terms in these lexicons may need remediation; rather, they may signal areas of your collections that should be reviewed carefully. Users may download the provided lexicons to use in MaRMAT as is, remove terms that may not be problematic in their metadata, or add additional terms and categories based on specific project needs. The only requirement for a lexicon to work against another file is that its CSV file contain two columns: "Term" and "Category" (case sensitive). Therefore, the tool's use is not limited to assessing metadata for problematic terms; it may also be loaded with a custom lexicon to perform matching against a variety of content types.
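Before loading a custom lexicon, it can help to confirm that its headers meet the requirement above. The following is a minimal sketch using the standard library; `check_lexicon` is a hypothetical helper, not part of MaRMAT, and the sample terms are invented to show a lexicon for a different kind of content.

```python
import csv
import io

REQUIRED_COLUMNS = ("Term", "Category")  # case sensitive, per the requirement above

def check_lexicon(csv_file):
    """Validate the headers and return (term, category) pairs, skipping blank terms."""
    reader = csv.DictReader(csv_file)
    headers = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in headers]
    if missing:
        raise ValueError(f"Lexicon is missing column(s) {missing}; found {headers}")
    entries = []
    for row in reader:
        term = (row["Term"] or "").strip()
        category = (row["Category"] or "").strip()
        if term:
            entries.append((term, category))
    return entries

# Hypothetical custom lexicon, held in memory instead of a file on disk.
custom_lexicon = io.StringIO(
    "Term,Category\n"
    "steamship,Transportation\n"
    "railroad,Transportation\n"
)
entries = check_lexicon(custom_lexicon)
print(entries)

# Lowercase headers fail the case-sensitive check.
try:
    check_lexicon(io.StringIO("term,category\nsteamship,Transportation\n"))
    lowercase_rejected = False
except ValueError as error:
    lowercase_rejected = True
    print(error)
```

With a file on disk, the same function could be called as `check_lexicon(open(path, newline="", encoding="utf-8"))`.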

Lexicon | Description
Reparative Metadata Lexicon | The Reparative Metadata Lexicon includes potentially harmful terminology organized by category and is best suited for uncontrolled metadata fields (e.g., Title, Description). This lexicon has been adapted from Duke University's lexicons, which were created for similar use cases. For the Marriott Reparative Metadata Assessment Tool (MaRMAT), Duke's lexicons were modified by transposing across their category columns to create a single lexicon (term, category) that better accommodates users adding terms and categories without having to adjust the underlying code structure.
Library of Congress Subject Heading (LCSH) Lexicon | The LCSH Lexicon includes selected changed and canceled LCSH (mostly from 2023) and headings that have been identified as problematic. The LCSH Lexicon is best suited to run against the Subject metadata field, or other fields that contain LCSH terms.
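The transposition into a single (term, category) lexicon can be illustrated with pandas. This sketch assumes a source layout with one column per category, which may not match Duke's files exactly; the sample terms and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical wide layout: one column per category, terms listed beneath.
wide_lexicon = pd.DataFrame({
    "Aggrandizement": ["pioneer", "prominent", None],
    "Gender Terms": ["wife", "mrs", "housewife"],
})

# melt() stacks the category columns into (Category, Term) pairs;
# dropna() removes the padding left by columns of unequal length.
single_lexicon = (
    wide_lexicon.melt(var_name="Category", value_name="Term")
    .dropna(subset=["Term"])
    [["Term", "Category"]]
    .reset_index(drop=True)
)

print(single_lexicon.to_csv(index=False))
```

In the long layout, adding a term or a whole new category only means appending rows, so nothing that reads the file needs to know the category names in advance.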

1.3 Features

  • Load lexicon and metadata files in CSV format.
  • Select columns from the metadata file for analysis.
  • Choose the column in the metadata file to be rewritten as the "Identifier" column so that the output can be reconciled with the original metadata file.
  • Select categories of terms from the lexicon for analysis.
  • Perform matching to find matches between selected columns and categories.
  • Export results to a CSV file.

1.4 Example Outputs and Tutorial

To provide users with a sense of what to expect from running MaRMAT against their own metadata collection, here are two example outputs using the provided lexicons:

  1. Example Output: Reparative Metadata Lexicon
  2. Example Output: LCSH Lexicon

Please keep in mind these reports are just snippets of larger reports. Users should be aware that there may be false positives or results that may not need remediation. For example, the LCSH term "Race" is considered a problem heading but MaRMAT may flag other headings with "race," as in "Bonneville Salt Flats Race, Utah." Likewise, the gender term "wife" may not always signal an unnamed woman, and terms that may be harmful in some contexts may not be in others. Therefore, we stress the importance of human review and intervention prior to making broad conclusions or global changes based on MaRMAT outputs.
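The "Race" example can be reproduced in a few lines. This sketch assumes whole-word, case-insensitive matching, which may differ from what MaRMAT does internally. The exact-heading comparison is offered only as one possible aid for prioritizing human review, not as a feature of the tool, and the heading list is invented.

```python
import re

# Hypothetical Subject values from a metadata file.
headings = [
    "Race",
    "Bonneville Salt Flats Race, Utah",
    "Race relations",
    "Terraces",
]
term = "Race"

# Whole-word match: flags any heading that contains the word "race".
word_pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
word_hits = [h for h in headings if word_pattern.search(h)]

# Exact match: flags only a heading that is the term itself.
exact_hits = [h for h in headings if h.strip().casefold() == term.casefold()]

# Headings flagged by the word match alone are candidates for false positives.
needs_review = [h for h in word_hits if h not in exact_hits]

print("Word matches:", word_hits)
print("Exact matches:", exact_hits)
print("Review by hand:", needs_review)
```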

To assist in getting started with MaRMAT, there is also a video tutorial that demonstrates the first steps in using the GUI for Windows (subtitles can be enabled in settings).

2. The Command-Line Tool

MaRMAT can be run by any user from their command line. Where indicated in the script, provide the paths to each file, specify the columns you wish to analyze, designate your "Identifier" column, and input the categories of terms you want to match. Then, run the Python file from your command line. Additional instructions for MacOS users can be found here.

2.1 Usage

  1. Install Python if not already installed (Python 3.x recommended).

  2. Clone or download the MaRMAT repository.

  3. Use the command-line interface to navigate to the directory where you saved the files (e.g., Downloads, Desktop).

  4. Open the MaRMAT-2.5.py script in a text editor and provide the information for your files and what you want to analyze under "Example usage" at the very end of the script.

  5. Save the script.

  6. Run the tool in your command line using the following command: python3 MaRMAT-2.5.py

  7. Review the matching results displayed on the console or in the generated CSV file.

2.2 Dependencies

  • Python 3.x: Python is a widely used high-level programming language for general-purpose programming.

  • pandas: Pandas is a Python library that provides easy-to-use data structures and data analysis tools for manipulating and analyzing structured data, particularly tabular data. Pandas can be installed using pip in Terminal: pip install pandas

  • re: This module provides regular expression matching operations. It's a built-in module in Python and doesn't require separate installation.

Note: These dependencies are necessary to run the provided code successfully. Ensure that you have them installed before running the code.

2.3 Notes

  • Ensure that both the lexicon and metadata files are in CSV format.
  • The lexicon file should contain columns for terms and their corresponding categories ("Term","Category").
  • The metadata file should contain the text data to be analyzed, with each row representing a separate entry.
  • The metadata file should contain a column, such as a Record ID, that you can use as an "Identifier" to reconcile the tool's output with your original metadata.
  • The tool outputs matching results to a CSV file named "matching_results.csv" in the tool's directory.
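Reconciling the output with the original metadata by way of the "Identifier" column might look like the sketch below. The column names in the results file are assumptions (check the headers of your own matching_results.csv), and the in-memory CSV text stands in for files on disk.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the original metadata and matching_results.csv.
metadata_csv = io.StringIO(
    "Identifier,Title,Subject\n"
    "rec-001,Portrait of a pioneer family,Pioneers\n"
    "rec-002,Salt Lake City street scene,Streets\n"
)
results_csv = io.StringIO(
    "Identifier,Term,Category,Column\n"
    "rec-001,pioneer,Aggrandizement,Title\n"
)

metadata = pd.read_csv(metadata_csv, dtype=str)
results = pd.read_csv(results_csv, dtype=str)

# Collapse the matches to one row per record, listing each distinct term once.
flags = (
    results.groupby("Identifier")["Term"]
    .agg(lambda terms: "; ".join(sorted(set(terms))))
    .rename("Flagged Terms")
    .reset_index()
)

# A left join keeps every original record, including those with no matches.
review = metadata.merge(flags, on="Identifier", how="left")
review["Flagged Terms"] = review["Flagged Terms"].fillna("")

print(review.to_csv(index=False))
```

Reading identifiers as strings (`dtype=str`) avoids mismatches when record IDs look like numbers with leading zeros.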

3. The GUI for Windows Users

To facilitate wider use, the MaRMAT GUI allows users to easily load a lexicon and a metadata file, select a key column (i.e., Identifier) to use in reconciling matches, and choose the columns and categories they'd like to perform matching on.

Note: The GUI is not compatible with MacOS. Additional information on the MaRMAT GUI is available here.

3.1 Usage

  1. Loading Files:

    • Click on the "Load Lexicon" button to load the lexicon file.
    • Click on the "Load Metadata" button to load the metadata file.
  2. Selecting Columns:

    • After loading files, click "Next" to proceed to column selection.
    • Select the columns from the metadata file that you want to analyze.
  3. Selecting Identifier Column:

    • After selecting columns, choose the column in the metadata file that will serve as the key column or "Identifier" column, such as a record ID.
  4. Selecting Categories:

    • Next, choose the categories of terms from the lexicon that you want to search for.
  5. Performing Matching:

    • Click "Perform Matching" to find matches between selected columns and categories.
    • The results will be exported to a CSV file.

3.2 Dependencies

  • Python 3.x: Python is a widely used high-level programming language for general-purpose programming.

  • Tkinter: Tkinter is Python's standard GUI (Graphical User Interface) package. It is used to create desktop applications with a graphical interface. It is usually included with Python distributions, so no separate installation is required.

  • re: This module provides regular expression matching operations. It's a built-in module in Python and doesn't require separate installation.

  • pandas: Pandas is a Python library that provides easy-to-use data structures and data analysis tools for manipulating and analyzing structured data, particularly tabular data. Pandas can be installed using pip in your command line interface: py -m pip install pandas

Note: These dependencies are essential for running MaRMAT. If you don't have Python installed, you can download it from the official Python website.

3.3 Installation

No installation is required. Simply follow the steps below to download and run the Python script to start the application on your PC.

  1. Download the Python Script:

    • Download the MaRMAT-GUI-2.5.2.py script to a location on your PC where you can easily find it, such as your Desktop or Downloads.
  2. Ensure Python is Installed:

    • To make sure that Python is installed on your PC, search for "Python" in your Start Menu or look for the Python folder in your Program Files.
    • If Python is not installed, you can download and install it from the official Python website.
  3. Double-Click the Python Script:

    • Navigate to the location where you downloaded the script.
    • Double-click on the script file (i.e., MaRMAT-GUI-2.5.2.py).
  4. Application Starts:

    • The application should start running automatically; the GUI will appear on your screen.

3.4 Troubleshooting

The GUI should automatically open when you open the Python code file. If you are having issues with the GUI opening, try opening the file in Python IDLE and running it. IDLE should give you an error message with insights as to why it is not loading correctly. If you are receiving error messages related to pandas, such as No module named 'pandas', follow these steps to install pandas.

  1. Open your command line interface

  2. Type the following into command line: py -m pip install pandas

  3. Press enter to run the command

If this process does not resolve your issue, follow these Getting Started tips to make sure Python and the pip installer are running correctly on your PC: https://pip.pypa.io/en/stable/getting-started
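If it is unclear whether the problem is a missing module or a second Python installation, a short diagnostic script can help. The sketch below is not part of MaRMAT and `describe_environment` is a hypothetical helper. It reports which interpreter is running and whether each dependency listed above can be found by that interpreter.

```python
import importlib.util
import sys

def describe_environment(modules=("pandas", "tkinter", "re")):
    """Report the running interpreter and whether each module can be located."""
    report = {
        "python": sys.version.split()[0],
        "executable": sys.executable,
    }
    for name in modules:
        # find_spec returns None when the module is not installed for this interpreter.
        report[name] = importlib.util.find_spec(name) is not None
    return report

for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If pandas is reported as False here but pip says it is already installed, pip may be installing into a different Python than the one running the script; running pip through the interpreter itself (py -m pip install pandas) targets the same installation.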

4. Credits and Acknowledgments

Code developed by Kaylee Alexander in collaboration with ChatGPT 3.5, Rachel Wittmann, and Anna Neatrour at the University of Utah's J. Willard Marriott Library. MaRMAT Beta was released in July 2024.

This tool was inspired by the Duke University Libraries Description Audit Tool, developed by Noah Huffman at the Rubenstein Library, and expanded by Miriam Shams-Rainey (see Description-Audit).

5. User Feedback Survey

After using MaRMAT, please take this survey and tell us about your experience using MaRMAT. We appreciate your feedback!
