Code for the Collage Tool, a part of the HT-MAX project.
Collage is a tool designed for rapid prototyping, visualization, and evaluation of different information extraction models on scientific PDFs. Further, we enable both non-technical users and NLP practitioners to inspect, debug, and better understand modeling pipelines by providing granular views of intermediate states of processing.
This demo should be available and running at this URL. This server can sometimes be unstable. If it is having issues when you try to access it, please follow the Docker Compose instructions below.
For convenience, we've Dockerized all of the components of this system. To get started with the demo, simply run:
docker compose up
in the root directory of the repo. On our machines, this takes ~20 minutes to complete, largely because ChemDataExtractor has to download a number of models. If you do not need ChemDataExtractor, or want to speed up the build process significantly, comment out the `chemdataextractor` service in `compose.yaml`. This sets up a Docker Compose network with three containers: the interface, an instance of GROBID (used to get reading-order sections), and the ChemDataExtractor service.
Alternatively, you can run the interface and Grobid separately. To build the interface docker image, run from the repo root:
docker build -t collage_interface .
And run the Grobid image with the command from their documentation:
docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.0
(Note that this command requires nvidia-docker. Grobid runs fine without it; you can just remove `--gpus all` from the command.)
To configure the application, modify the config in `app_config.py`; this allows you to specify the Grobid and ChemDataExtractor URLs, and API keys for either LLM services or MathPix.
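For reference, the relevant config entries might look roughly like the following; the key names here are hypothetical, so check `app_config.py` itself for the actual structure.

```python
# Hypothetical sketch of config entries in app_config.py -- the real key names
# and structure may differ.
config = {
    # URLs of the Dockerized helper services
    "grobid_url": "http://localhost:8070",
    "chemdataextractor_url": "http://localhost:8000",
    # Credentials for optional external services
    "openai_api_key": "",      # used for LLM-based predictors
    "anthropic_api_key": "",
    "mathpix_app_id": "",      # needed only for the MathPix table predictor
    "mathpix_app_key": "",
}
```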
Collage has three primary components:
- A PaperMage backbone that underlies our PDF processing, defined in `papermage_components/Materials_Recipe.py`.
- Three software interfaces to accelerate the rapid prototyping of different kinds of models. These interfaces are designed around token classification (i.e. classic information extraction models), text-to-text models such as LLMs, and multimodal models to process things like tables. These interfaces are defined in `papermage_components/interfaces`.
- A frontend, built in Streamlit, that automatically visualizes modeling results produced by those interfaces. The landing interface, where users can upload papers and customize the processing they run on them, is in `Upload_Paper.py`. The three other interface views are defined in the `pages/` package.
This repo contains the interfaces discussed above, along with several implementations of those interfaces. These implementations provide the blueprint for how to implement the interfaces in a number of different ways: in-memory implementations that run right in the pipeline; small, Dockerized services for components with complicated environment requirements that may not be compatible with Collage; and a few that use external APIs. We outline these components, and how they implement their interfaces, below. Note that because Collage is a prototyping tool, it does not aim for efficiency: all models are run on CPU. At the level of one paper, which is what the interface allows, the pipeline takes around a minute to annotate a paper.
Each interface requires users to specify the following:
- `predictor_identifier`: a property method that returns a readable identifier for the predictor, to be displayed in the visual interface. This is typically the name of the underlying model used to do the tagging.
- `entity_to_process`: a parameter in the superclass constructor that specifies the PaperMage `Entity` to annotate. This corresponds to whatever semantic segmentation comes from the pipeline; it can be sections in reading order, tables, paragraphs, etc.
Finally, for a new implementation to be visualized in the frontend, it must be registered in `local_model_config.py` by adding a new `LocalModelInfo` object to the `MODEL_LIST` object. Each `LocalModelInfo` contains a model name, a description, and a function that takes no parameters and returns an instance of the predictor. New parameters to be passed to predictor constructors should be declared in `app_config.py`.
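For example, registering a new predictor might look roughly like this (the `LocalModelInfo` field names and the `MODEL_LIST` structure are assumptions here; check `local_model_config.py` for the actual definitions):

```python
# Hypothetical registration sketch -- field names and MODEL_LIST structure are
# assumptions; see local_model_config.py for the real definitions.
from local_model_config import LocalModelInfo, MODEL_LIST
from papermage_components.my_predictor import MyCustomPredictor  # hypothetical predictor

MODEL_LIST.append(
    LocalModelInfo(
        model_name="my-custom-tagger",
        model_desc="Tags material mentions in reading-order sections.",
        model_loader=lambda: MyCustomPredictor(),  # zero-argument factory returning the predictor
    )
)
```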
This interface is intended for any model that produces annotations of spans in text, i.e. most "classical" NER or event extraction models. Users are required to override the following methods:
- `tag_entities_in_batch`: this method takes a list of sentences, and for each produces a list of tagged entities, wrapped in the `EntityCharSpan` dataclass. In the default implementation, this batch is composed of the sentences in each paragraph. Implementors can also optionally override the `generate_batches` method for more efficient batching. A minimal sketch follows below.
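The base class name, `EntityCharSpan` fields, and import path in this sketch are assumptions based on the description above; the real definitions live in `papermage_components/interfaces`.

```python
# Sketch of a custom token-classification predictor. The base class name and the
# EntityCharSpan fields shown here are assumptions; see papermage_components/interfaces.
from typing import List

from papermage_components.interfaces import EntityCharSpan, TokenClassificationPredictorABC


class KeywordMaterialTagger(TokenClassificationPredictorABC):  # hypothetical subclass
    @property
    def predictor_identifier(self) -> str:
        return "keyword-material-tagger"

    def tag_entities_in_batch(self, batch: List[str]) -> List[List[EntityCharSpan]]:
        # For each sentence in the batch, return the character spans we tag.
        results = []
        for sentence in batch:
            spans = []
            start = sentence.find("alloy")
            if start != -1:
                spans.append(
                    EntityCharSpan(
                        entity_type="MATERIAL", start_char=start, end_char=start + len("alloy")
                    )
                )
            results.append(spans)
        return results
```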
Current implementations:
- `HfTokenClassificationPredictor`: this wraps any HuggingFace model that follows the `TokenClassification` interface and allows it to be used in Collage. This model is run inside the same container as Collage, and therefore needs to be compatible with the version of `transformers` in the environment.
- `ChemDataExtractorPredictor`: this predictor wraps around ChemDataExtractor v2. Because ChemDataExtractor v2 requires Python < 3.8, we spin the annotation part off into a small FastAPI wrapper, which we then Dockerize with Python 3.7. The predictor calls the API to annotate documents; this service-wrapping pattern is sketched below.
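The wrapping pattern is roughly the following: a tiny FastAPI app exposes the annotation call over HTTP, and the in-pipeline predictor POSTs text to it. This is a simplified illustration, not the actual service code; the endpoint and field names are hypothetical.

```python
# Simplified sketch of wrapping a component with incompatible environment requirements
# (here, ChemDataExtractor v2 on Python 3.7) as a small HTTP service. Endpoint and
# field names are hypothetical; see the actual service code in this repo.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class AnnotationRequest(BaseModel):
    text: str


def run_chemdataextractor(text: str) -> List[dict]:
    # Placeholder: the real service calls ChemDataExtractor v2 here.
    return []


@app.post("/annotate")
def annotate(request: AnnotationRequest):
    # The in-pipeline ChemDataExtractorPredictor calls this endpoint to annotate text.
    return {"entities": run_chemdataextractor(request.text)}
```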
Given the prominence of large language model-based approaches, this interface is designed to allow for text-to-text prediction. This interface can be extended by:
- overriding the `generate_from_entity_text` method (required): this method allows the user to specify a text-to-text function that applies the method of their choice (a sketch follows below).
- overriding the `postprocess_to_dict` method (optional): this method allows users to postprocess the results of the above method into a dict that can be displayed as a table. This allows LLM results in structured format to be aggregated and displayed in the Summary view.
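A minimal extension might look like the sketch below; the base class name and exact method signatures are assumptions, with the real definitions in `papermage_components/interfaces`.

```python
# Sketch of a custom text-to-text predictor. The base class name and signatures are
# assumptions; see papermage_components/interfaces for the real interface.
import json

from papermage_components.interfaces import TextGenerationPredictorABC  # assumed name


class ToyRecipePredictor(TextGenerationPredictorABC):  # hypothetical subclass
    def generate_from_entity_text(self, entity_text: str) -> str:
        # Apply any text-to-text method here: an LLM call, a seq2seq model, etc.
        # This toy version just wraps some trivial output in JSON.
        return json.dumps({"num_characters": len(entity_text)})

    def postprocess_to_dict(self, generation: str) -> dict:
        # Optional: turn the raw generation into a dict so the Summary view can
        # aggregate and display it as a table.
        return json.loads(generation)
```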
Current implementation:
- `LiteLLMCompletionPredictor`: this predictor allows for prompting and receiving results from multiple LLMs via API. We configure it to allow users to query OpenAI and Anthropic LLMs, and additionally allow users to bring their own API key to try the demo. Entered API keys are only stored in the Streamlit session state, and are lost when the user disconnects.
Given the focus on tables and charts that many of our interview participants discussed, and the fact that table parsing is an active research area, we additionally provide an interface for models that parse images, the `ImagePredictorABC`, in order to handle multimodal processing, including tables. Predictors that implement this interface return an output in the form of an `ImagePredictionResult`, a union type that allows users to return any combination of a raw prediction string, a dict that represents a table, a list of bounding boxes, or a predicted string. All of these representations, if present, are rendered in the frontend view.
This interface gives users two options for which method to override:
- `process_image`: for predictors that only need access to the image, e.g. for captioning, or predictors that do their own OCR, users can implement just the `process_image` method (a minimal sketch is shown below).
- `process_entity`: for predictors that need access to more features of PaperMage's multimodal representation, users can implement the `process_entity` method, which passes in each entity of the layer. Users can then access that entity's image, text, etc.
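For instance, a caption-only predictor could override just `process_image`, roughly as below; the `ImagePredictionResult` field name and the method signature are assumptions, so check the interface definition for the actual ones.

```python
# Sketch of an image predictor that only needs the raw image. The result field name
# and method signature are assumptions; see papermage_components/interfaces.
from PIL import Image

from papermage_components.interfaces import ImagePredictionResult, ImagePredictorABC


class ImageSizeCaptionPredictor(ImagePredictorABC):  # hypothetical subclass
    def process_image(self, image: Image.Image) -> ImagePredictionResult:
        # Predictors that need token or text access would override process_entity instead.
        caption = f"Figure region of size {image.width}x{image.height} pixels"
        return ImagePredictionResult(predicted_string=caption)  # field name assumed
```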
Current Implementations:
- `TableTransformerStructurePredictor`: this predictor uses the Microsoft Table Transformer to predict bounding boxes, and then intersects those bounding boxes with PaperMage `Token`s to get a parsed table representation. It overrides `process_entity`, and returns a set of predicted cell bounding boxes, as well as a parsed table representation.
- `MathPixTableStructurePredictor`: this predictor calls the commercial MathPix API to parse the content of tables. It overrides the `process_image` method, and returns only a parsed table representation. NOTE: because it requires an API key, the MathPix predictor is disabled in the running web interface. To test it out, get your own MathPix API key, and enter the credentials in `app_config.py`.
- `GrobidReadingOrderParser`: this parser uses GROBID to parse documents into a semantic representation based on document structure. This allows us to, e.g., filter predictions by their source section, or segment batches based on section and paragraph.
- `SciSpacySentencePredictor`: a reimplementation of the original PaperMage sentence parser (based on PySBD) to use SciSpacy instead. This results in better sentence boundary prediction in the case of e.g. compounds, which contain periods in their names and cause general-domain sentence segmenters to fail.
This repo contains the following scripts:
- `parse_papers_to_json.py`: this script parses the content of PDFs into structured JSON representations. Currently, it runs the `MaterialsRecipe` on a specified folder of papers, and dumps the JSON representations to the specified output folder.
To aid development, this repo contains two notebooks that facilitate quicker development of PaperMage predictors. `dev_run_recipe_and_serialize.ipynb` takes a new PDF, runs the `MaterialsRecipe` on it, and serializes the result; the second notebook opens a paper from the parsed JSON, and allows further manipulation.
The testing data for this project is managed and versioned by DVC, and it is stored in
this Google Drive folder.
Data and checkpoints should be stored in the `data/` folder. For this project, we symlink in the PDF data that we store in the NLP Collaboration Box Folder, e.g.:
ln -s $BOX_SYNC_FOLDER/NLP-collaboration-folder/AM_Creep_Papers data/AM_Creep_Papers
Data derived from those PDFs, model checkpoints, etc. will be stored in the `data/` folder and managed with DVC.
You can find instructions for installing DVC here. Once you have DVC installed, run `dvc pull` from the root of the repo. This will pull down all the files that have been checked into DVC thus far. DVC will ask for permission to access the files in your Google Drive; you should proceed with your CMU account.
DVC works in a similar fashion to git-lfs: it stores pointers and metadata for your data in the git repository, while the files themselves live elsewhere (in this case, on Google Drive). As you work with data, such as in the DVC tutorial, DVC will automatically add the files you have tracked with it to the `.gitignore` file, and add new `.dvc` files that track the metadata associated with those files.
- Pull data down: run `dvc pull` to pull down the data files into the repository folder.
- Modify your data: as you would without DVC, use, modify, and work with your data.
- Add new/modified data to DVC: using `dvc add ...`, in a similar fashion to a `git add`, add your new or modified data files to DVC.
- Add the corresponding metadata to git: once the data file has been added to DVC, a corresponding `.dvc` file will have been created. Add or update this in git, then push.
- Sync the locally updated DVC data with the remote: finally, push the data itself up to Google Drive with the `dvc push` command.
tl;dr:
- `dvc pull`
- `dvc add <data_file>`
- `git add/commit <data_file.dvc>`
- `git push`
- `dvc push`