Introduction

The University of Nebraska-Lincoln's (UNL) Aida digital libraries research team and the Library of Congress (LC) collaborated on a "summer of machine learning" in 2019 to explore machine learning techniques for extending the accessibility of digital collections. Over multiple iterations, the UNL team developed a number of prototype explorations to investigate questions and issues related to the digital materials, to the LC's collections, and to machine learning practices in cultural heritage organizations. The team employed a variety of machine learning approaches, including back-propagation neural network-based classifiers and deep learning approaches such as convolutional neural networks. Specifically, the projects involve VGG16, ResNeXt, dhSegment, and a fusion network combining ResNeXt and U-Net.

This repository includes the code developed and used across the team's explorations.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

For Exploration - Document Segmentation, the required software systems and libraries are:

Anaconda >= 4.3
Python >= 3.6
TensorFlow 1.13
CUDA 10.0 [if training on GPU]
imageio >= 2.5
pandas >= 0.24.2
shapely >= 1.6.4
scikit-learn >= 0.20.3
scikit-image >= 0.15.0
opencv-python >= 4.0.1
tqdm >= 4.31.1
sacred 0.7.4
requests >= 2.21.0
click >= 7.0

For Exploration - Graphic Element Classification and Text Extraction and Exploration - Digitization Type Differentiation, the required software systems and libraries are:

Python 3.7
MXNet 1.5
CUDA 10.0 [if training on GPU]
Matplotlib 3.1.1
opencv-python 4.1
numpy 1.17

For Exploration - Document Type Classification, the required software systems and libraries are:

Anaconda >= 4.3
Python >= 3.6
TensorFlow 1.13
CUDA 10.0 [if training on GPU]
opencv-python >= 4.0.1
numpy >= 1.16.2
scikit-learn >= 0.20.3
scikit-image >= 0.15.0
matplotlib >= 1.4.3
pandas >= 0.24.2
seaborn 0.9.0

For Exploration - Document Image Quality Assessment, the required software systems and libraries are:

Python 3.7
scipy 1.3.1
opencv-python 4.1
scikit-image 0.15
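
The version constraints listed above can be checked mechanically. The following is a minimal sketch and not part of the repository: `parse_version` and `satisfies` are hypothetical helper names, and reading a bare version such as `1.13` as "any 1.13.x release" is an assumption about how the pins are meant to be interpreted.

```python
import re


def parse_version(text):
    """Turn '0.24.2' into (0, 24, 2); stop at the first non-numeric piece."""
    parts = []
    for piece in text.strip().split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed, requirement):
    """Check an installed version against '>= X.Y.Z' or a bare 'X.Y' pin."""
    have = parse_version(installed)
    requirement = requirement.strip()
    if requirement.startswith(">="):
        wanted = parse_version(requirement[2:])
        # Pad with zeros so that '0.24' and '0.24.0' compare as equal.
        width = max(len(have), len(wanted))
        have_padded = have + (0,) * (width - len(have))
        wanted_padded = wanted + (0,) * (width - len(wanted))
        return have_padded >= wanted_padded
    # Assumption: a bare pin is a prefix match, so '1.13' accepts '1.13.1'.
    wanted = parse_version(requirement)
    return have[:len(wanted)] == wanted
```

For example, `satisfies("0.23.4", ">= 0.24.2")` is false, so a pandas 0.23.4 installation would need upgrading before running the Document Segmentation exploration.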

Installing

Step-by-step instructions for installing the required software and libraries for each project.

For Exploration - Document Segmentation

Download Python 3.6 from https://www.python.org/downloads/
Download CUDA 10.0 from https://developer.nvidia.com/cuda-toolkit-archive
Install Anaconda or Miniconda (installation procedure)
Open Terminal (macOS) or Command Prompt (Windows)
Go to the codebase/Exploration - Document Segmentation folder
Create a virtual environment and activate it

conda create -n segmentation python=3.6
source activate segmentation

Install packages

python setup.py install

For Exploration - Graphic Element Classification and Text Extraction and Exploration - Digitization Type Differentiation

Download Python 3.7 from https://www.python.org/downloads/
Download CUDA 10.0 from https://developer.nvidia.com/cuda-toolkit-archive
Run the downloaded installers
Open Terminal (macOS) or Command Prompt (Windows)
Install MXNet

pip install 'mxnet-cu100==1.5.1'

Install Matplotlib

python -m pip install -U 'matplotlib==3.1.1'

Install opencv-python

pip install 'opencv-python==4.1.*'

Install numpy

pip install 'numpy==1.17'

For Exploration - Document Type Classification

Download Python 3.6 from https://www.python.org/downloads/
Download CUDA 10.0 from https://developer.nvidia.com/cuda-toolkit-archive
Install Anaconda or Miniconda (installation procedure)
Open Terminal (macOS) or Command Prompt (Windows)
Go to the codebase/Exploration - Document Type Classification folder
Create a virtual environment and activate it

conda create -n classification python=3.6
source activate classification

Install packages

python setup.py install

For Exploration - Document Image Quality Assessment

Download Python 3.7 from https://www.python.org/downloads/
Run the downloaded installer
Open Terminal (macOS) or Command Prompt (Windows)
Install scipy

pip install 'scipy==1.3.1'

Install opencv-python

pip install 'opencv-python==4.1.*'

Install scikit-image

pip install 'scikit-image==0.15'
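
After installation, it can help to confirm that the packages are importable, since several of them are imported under a different name than the one passed to pip. The following is a minimal sketch, not part of the repository: `IMPORT_NAMES` and `missing_packages` are hypothetical names, and the check only confirms that a module can be found, not that its version is correct.

```python
import importlib.util

# pip distribution name -> module name used in `import` statements
IMPORT_NAMES = {
    "opencv-python": "cv2",
    "scikit-image": "skimage",
    "scikit-learn": "sklearn",
    "mxnet-cu100": "mxnet",
}


def missing_packages(packages):
    """Return the packages whose top-level module cannot be found."""
    missing = []
    for package in packages:
        # Fall back to the pip name when the import name is the same.
        module = IMPORT_NAMES.get(package, package)
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing
```

Calling `missing_packages(["scipy", "opencv-python", "scikit-image"])` in the environment prepared for Exploration - Document Image Quality Assessment should return an empty list if the installation steps succeeded.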

Running the demonstrations

Exploration - Document Segmentation:

Download all files in demo/Exploration - Document Segmentation https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Document%20Segmentation
Install the required software and libraries
Download one of the following datasets: (1) https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/ENP_500 for the segmentation task, or (2) https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/difficulty_collection for the clustering task
Copy the downloaded folder to the downloaded 'Exploration - Document Segmentation' folder
Run one of the following commands, depending on the task

# Activate virtual environment
source activate segmentation
# For segmentation task
python demo_segmentation.py
# For clustering task
python demo_clustering.py

Exploration - Graphic Element Classification and Text Extraction:

Download all files in demo/Exploration - Graphic Element Classification and Text Extraction https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Graphic%20Element%20Classification%20and%20Text%20Extraction
Install the required software and libraries
Download dataset from https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/BeyondWord_orginal_resolution
Copy the downloaded folder to the downloaded 'Exploration - Graphic Element Classification and Text Extraction' folder
Run the evaluation script

python eval.py

Exploration - Document Type Classification:

Download all files in demo/Exploration - Document Type Classification https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Document%20Type%20Classification
Install the required software and libraries
Download dataset from https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/suffrage_1002
Copy the downloaded folder to the downloaded 'Exploration - Document Type Classification' folder
Run the demonstration script

# Activate virtual environment
source activate classification
# Run demonstration
python demo_classification.py

Exploration - Digitization Type Differentiation:

Download all files in demo/Exploration - Digitization Type Differentiation https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Digitization%20Type%20Differentiation
Install the required software and libraries
Download dataset from https://git.unl.edu/unl_loc_summer_collab/labeled_data/tree/master/micrpfilm_scanning
Copy the downloaded folder to the downloaded 'Exploration - Digitization Type Differentiation' folder
Run the evaluation script

python eval.py
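
Each demonstration above expects the dataset folder to sit inside the downloaded demo folder, next to the script. The following is a minimal pre-flight sketch, not part of the repository: the task keys and the `preflight` function are hypothetical, while the folder and script names are taken from the steps above (dataset folder names follow the download URLs as written).

```python
from pathlib import Path

# Hypothetical task key -> (demo folder, dataset folder, script to run)
DEMOS = {
    "segmentation": ("Exploration - Document Segmentation",
                     "ENP_500", "demo_segmentation.py"),
    "clustering": ("Exploration - Document Segmentation",
                   "difficulty_collection", "demo_clustering.py"),
    "graphic_elements": ("Exploration - Graphic Element Classification and Text Extraction",
                         "BeyondWord_orginal_resolution", "eval.py"),
    "classification": ("Exploration - Document Type Classification",
                       "suffrage_1002", "demo_classification.py"),
    "digitization": ("Exploration - Digitization Type Differentiation",
                     "micrpfilm_scanning", "eval.py"),
}


def preflight(root, task):
    """Return a list of problems that would stop the demo from running."""
    demo_name, dataset, script = DEMOS[task]
    demo_dir = Path(root) / demo_name
    if not demo_dir.is_dir():
        return ["missing demo folder: " + str(demo_dir)]
    problems = []
    if not (demo_dir / dataset).is_dir():
        problems.append("missing dataset folder: " + dataset)
    if not (demo_dir / script).is_file():
        problems.append("missing script: " + script)
    return problems
```

An empty list from `preflight(".", "segmentation")` suggests the folder layout matches the steps; it does not verify the dataset contents or the Python environment.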

Breaking down into end-to-end tests

Please read the README file inside each project folder for a description of each end-to-end test.

Exploration - Document Segmentation: https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/demo/Exploration%20-%20Document%20Segmentation
Exploration - Graphic Element Classification and Text Extraction: https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/Exploration%20-%20Graphic%20Element%20Classification%20and%20Text%20Extraction
Exploration - Document Type Classification: https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/Exploration%20-%20Document%20Type%20Classification
Exploration - Document Image Quality Assessment: https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/Exploration%20-%20Document%20Image%20Quality%20Assessment
Exploration - Digitization Type Differentiation: https://git.unl.edu/unl_loc_summer_collab/codebase/tree/master/Exploration%20-%20Digitization%20Type%20Differentiation

Built With

Python - The programming language
CUDA Toolkit - Enables GPU acceleration for model training
MXNet - Deep learning framework
TensorFlow - Deep learning framework
MATLAB - Math, graphics, programming platform

Contributing

Exploration	Inputs	Technique	Output	Reports
Exploration - Document Segmentation (segmentation)	ENP_500 (European historical newspaper); Beyond Words	U-Net	5-class pixel-level segmented image	Progress report - Chulwoo Pack - 07312019.pdf; Progress report - Chulwoo Pack - 08052019.pdf
Exploration - Document Clustering (clustering)	ENP_500 (European historical newspaper)	t-SNE	Clustered manifold	Progress report - Chulwoo Pack - 09232019.pdf
Exploration - Graphic Element Classification and Text Extraction	ENP_500 (European historical newspaper); Beyond Words	U-NeXt	Predicted region segmentation	Progress report - Yi Liu - 07302019.pdf
Exploration - Document Type Classification	suffrage_1002 (LoC Suffrage campaign)	U-Net	Type of document image: handwritten, typed, or mixed	Progress report - Chulwoo Pack - 08132919; Progress report - Chulwoo Pack - 08202019.pdf
Exploration - Document Image Quality Assessment	Civil War Campaign	DIQA	Four quality scores	Progress report - Yi Liu - 08122019.pdf; Progress report - Yi Liu - 09052019.pdf
Exploration - Document Image Quality Assessment	difficulty_collection (LoC Manuscript/Mixed material)	U-Net, DIQA	Visual difficulty correlation	Progress report - Chulwoo Pack - 10312019.pdf
Exploration - Digitization Type Differentiation	Civil War Campaign	ResNeXt	Classification as microfilm or scan	Progress report - Yi Liu - 09052019.pdf; Progress report - Yi Liu - 10292019.pdf

Authors

Yi Liu - research associate and developer
Chulwoo (Mike) Pack - research associate and developer
Elizabeth Lorang - senior adviser
Leen-Kiat Soh - senior adviser
Ashlyn Stewart - research assistant

License

This project is licensed under the GPL License; see the LICENSE file for details.
