Repository of sample code and documentation

The repo is split into four subdirectories:

  • nlp-samples features sample Python code and Jupyter notebooks for NLP tasks. There are three files:

    • BERT_EEBO_experiments.ipynb is a Jupyter notebook containing original code for experiments I am currently conducting. The notebook produces word embeddings from a directory of .txt files (sourced from the VEP project) using the BERT tokenizer and model from the HuggingFace transformers library. Subsequent cells query the resulting embedding file for a specific word and return every instance of that word in the embeddings, and additional scripts analyze the spatial relationships among selected words using PCA and t-SNE. A minimal sketch of the embedding step appears after this list.
    • EMEMT_normalize_script.py is a script designed to run on the regularized variant of the Early Modern English Medical Texts corpus (EMEMT) created by a team at the University of Helsinki. I wrote this script to prepare the corpus's .txt files for processing by a word2vec model for digital textual analysis. The EMEMT .txt files contain XML-style tags that regularize spelling in early modern texts while retaining a record of the original spelling (e.g. <reg orig="anathomie">anatomy</reg>). My Python script removes these tags, along with other corpus-specific features of the transcription process (e.g. irregular whitespace and square brackets), and saves the modified files for use in NLP tasks. A sketch of this kind of cleanup also appears after the list.
    • topic-models-tutorial-IBM.ipynb is the final draft of a tutorial I wrote for IBM's developer website on creating LDA topic models in Python. The notebook uses an open-access dataset of Charles Dickens novels sourced from Project Gutenberg and available on Kaggle. It surveys data pre-processing, model training, fine-tuning, and evaluation for LDA topic models, and is written for a general audience that may not have advanced experience in Python or in probability and statistics. The published version may be viewed here. I have also written technical explainers on topic models and LDA published on IBM's website. A toy LDA example appears after this list.
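
In the spirit of BERT_EEBO_experiments.ipynb, here is a minimal sketch (not the notebook's actual code) of extracting contextual BERT embeddings from a directory of .txt files with HuggingFace transformers. The directory path, the bert-base-uncased checkpoint, and the query word are illustrative assumptions:

```python
# Sketch: collect contextual BERT embeddings per token across a corpus.
from pathlib import Path

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

embeddings = {}  # token -> list of contextual vectors
for txt_file in Path("texts/").glob("*.txt"):  # illustrative corpus path
    text = txt_file.read_text(encoding="utf-8")
    # Truncate to BERT's 512-token limit; real code would chunk long texts.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    for token, vector in zip(tokens, outputs.last_hidden_state[0]):
        embeddings.setdefault(token, []).append(vector)

# Query every contextual instance of a word, as the notebook does.
print(len(embeddings.get("anatomy", [])))  # illustrative query word
```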
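Likewise, a minimal sketch, under my own assumptions about tag and path names, of the kind of cleanup EMEMT_normalize_script.py performs: keep the regularized spelling inside <reg> tags, drop corpus-specific markup, and normalize whitespace.

```python
# Sketch: strip EMEMT-style regularization tags and transcription artifacts.
import re
from pathlib import Path

def normalize(text: str) -> str:
    # <reg orig="anathomie">anatomy</reg>  ->  anatomy
    text = re.sub(r'<reg\s+orig="[^"]*">([^<]*)</reg>', r"\1", text)
    text = re.sub(r"\[[^\]]*\]", "", text)  # drop square-bracketed insertions
    text = re.sub(r"\s+", " ", text)        # collapse irregular whitespace
    return text.strip()

Path("ememt_clean").mkdir(exist_ok=True)    # illustrative output directory
for src in Path("ememt/").glob("*.txt"):    # illustrative input directory
    cleaned = normalize(src.read_text(encoding="utf-8"))
    Path("ememt_clean", src.name).write_text(cleaned, encoding="utf-8")
```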
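Finally, a toy gensim sketch of the LDA pipeline the tutorial walks through: tokenize, build a dictionary and bag-of-words corpus, train a model, and inspect its topics. The two toy documents stand in for the Dickens dataset; the hyperparameters are arbitrary.

```python
# Sketch: train and inspect a small LDA topic model with gensim.
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

docs = [
    "It was the best of times, it was the worst of times.",
    "Please, sir, I want some more.",
]
tokens = [simple_preprocess(d) for d in docs]          # pre-processing
dictionary = corpora.Dictionary(tokens)                # vocabulary
bow = [dictionary.doc2bow(t) for t in tokens]          # bag-of-words corpus

lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)             # training
for topic_id, words in lda.print_topics():             # evaluation by eye
    print(topic_id, words)
```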
  • sql-samples features SQL code for a model database I created to record metadata for items in a personal or professional comics collection.

    • Comics_Collection_Tables.sql creates the tables for this relational database. There are tables to record creators (artists, writers, etc.), publication dates, series (e.g. The Amazing Spider-Man vs. The Spectacular Spider-Man), special events (e.g. The Death of Superman event that spans multiple series), and individual issues (including whether each issue is an annual or part of an ongoing run). An illustrative schema sketch appears after these items.
    • Comics_BaseData.sql contains a set of sample data for artists, writers, and comics series to insert into the tables.
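
To keep the examples in this README in one language, here is a Python/sqlite3 sketch of the kind of schema Comics_Collection_Tables.sql defines; the table and column names are my own illustrative guesses, not the repo's actual DDL.

```python
# Sketch: a simplified comics-collection schema, expressed via sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE creators (
    creator_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    role       TEXT NOT NULL               -- e.g. 'artist', 'writer'
);
CREATE TABLE series (
    series_id  INTEGER PRIMARY KEY,
    title      TEXT NOT NULL               -- e.g. 'The Amazing Spider-Man'
);
CREATE TABLE issues (
    issue_id   INTEGER PRIMARY KEY,
    series_id  INTEGER NOT NULL REFERENCES series(series_id),
    number     INTEGER NOT NULL,
    pub_date   TEXT,
    is_annual  INTEGER NOT NULL DEFAULT 0  -- annual vs. ongoing issue
);
""")
# Base data would be inserted much as Comics_BaseData.sql does:
conn.execute("INSERT INTO series (title) VALUES (?)", ("The Amazing Spider-Man",))
conn.commit()
```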
  • tech-conf-publications contains a selection of past academic publications, focusing on machine-learning papers published in refereed technical conference proceedings.

  • xslt-samples features XSLT code for a special project on which I worked as the XML textbase editor for Northeastern University's Women Writers Online. Specifically, I enhanced the workflow for the experimental peer-reviewed journal Women Writers in Context (WWiC). Authors submit articles to the journal in .docx (and occasionally .rtf) format. In the past, WWiC copyeditors manually converted these submissions into an in-house XML format for online publication, copying and pasting paragraph by paragraph from the Word document into an XML template. I streamlined this process by creating a workflow in which copyeditors convert the Word document to generic XML via the OxGarage conversion tool and then run the resulting XML file through an original XSLT file I wrote, which transforms OxGarage's messy XML into WWiC's in-house XML format for online publication; a short sketch of applying such a transformation appears below. Unfortunately, at the time my XSLT knowledge was insufficient to transform files directly from .docx to WWiC's XML. I had been working to extend the workflow to do so, but left the project before I could finish. The subdirectory contains both the XSLT transformation file and documentation I wrote so that future copyeditors without technical training can apply the transformation to new submissions.
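
The sketch below shows, via Python's lxml, how such a stylesheet can be applied to the XML that OxGarage produces. The file names are illustrative; the repo's actual stylesheet targets WWiC's in-house format.

```python
# Sketch: run OxGarage output through an XSLT transformation.
from lxml import etree

xslt = etree.XSLT(etree.parse("oxgarage_to_wwic.xsl"))   # illustrative name
doc = etree.parse("article_from_oxgarage.xml")           # illustrative name
result = xslt(doc)
result.write("article_wwic.xml", pretty_print=True,
             xml_declaration=True, encoding="utf-8")
```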
