
Stylised Text Generation

Using neural networks to imitate internet forum posts

Repository to store notebooks and data used to generate realistic text using a model trained on reddit data.

Example data sets and models are included in this repository. To explore the other, more complex models used in this project, please see /text_gen/ on this page. NOTE: this repo is a cleaned example version of the project; if you wish to view the full repo, expect less documentation.

Feel free to download your own copy, or submit a pull request.

Summary

This project uses recurrent neural networks and text downloaded from reddit to produce language models. This project was intended as a fun exercise in implementing some theory, and as such the models are trained on relatively small data sets and for short periods of time.

There are four main components to this project, as follows:

  • Data gathering
  • Data pre-processing
  • Model training and text generation
  • Feature-Target analysis

To run the code in this repo you will need common Python data analysis modules (numpy, matplotlib, etc.). In addition, you will need keras, tensorflow and scikit-learn. \TODO add requirements.txt

Data Gathering

A collection of raw data is provided within /raw_data/, with all text posts in one file, separated by new lines.

To collect new data from different subreddits or of different sizes / date ranges you will need to run /data_gathering/reddit_download.ipynb.

I have excluded the authentication tokens and logins used to do this; a user can generate their own.

The download modules interface with praw and psaw. However, your device needs to be authenticated using the method described in reddit's API documentation. How to do this can be found here

The downloaded text data contains metadata in the file name. This is described within the notebook itself.
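As a rough sketch of the kind of download the notebook performs, here is a minimal praw example. The credentials are placeholders, and the subreddit, post limit and output path are illustrative rather than the ones used in this project:

```python
# Minimal sketch of pulling text posts with praw (not the exact code in
# reddit_download.ipynb). Credentials are placeholders; generate your own
# as described in reddit's API documentation.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="text_generation_cc example script",
)

posts = []
for submission in reddit.subreddit("AskReddit").top(limit=1000):  # example subreddit
    if submission.selftext:              # keep text posts only
        posts.append(submission.selftext.replace("\n", " "))

# One post per line, matching the raw_data file layout described above
with open("raw_data/example_posts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(posts))
```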

Data pre-processing

The raw data is, unsurprisingly, not suitable to be passed to a neural network. There are two main reasons for processing the data:

  • convert it into a structure suitable for training
  • clean and transform the data in accordance with the task in mind

Which raw file to process is selected at the top of the /src/pre-processing.ipynb notebook.

NOTE: some cells will take a significant amount of memory and time to run. The largest amount of data I was able to process on a laptop was 50,000 posts; there may be unknown bottlenecks above this, so go above 20,000 posts at your own risk.

Parameters such as the sequence length and the minimum frequency of word occurrence can be specified.

Processed data is output as a .pickle file. This preserves all structure and passes parameters (such as sequence length) and objects (such as the tokeniser used) on to the model.
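A minimal sketch of this step, assuming the Keras Tokenizer is used; the variable names and paths are illustrative, not necessarily those in /src/pre-processing.ipynb:

```python
# Sketch of the pre-processing idea: tokenise the corpus, build fixed-length
# feature sequences with a one-word target, and pickle everything the model
# needs. Names and paths are illustrative.
import pickle
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

seq_length = 10  # example value; set to whatever the notebook uses

with open("raw_data/example_posts.txt", encoding="utf-8") as f:
    posts = f.read().splitlines()

tokenizer = Tokenizer()
tokenizer.fit_on_texts(posts)
encoded = tokenizer.texts_to_sequences(posts)

# Slide a window over each post: the first `seq_length` words are the
# feature, the word that follows is the target.
features, targets = [], []
for post in encoded:
    for i in range(seq_length, len(post)):
        features.append(post[i - seq_length:i])
        targets.append(post[i])

data = {
    "X": np.array(features),
    "y": np.array(targets),
    "seq_length": seq_length,
    "tokenizer": tokenizer,
}
with open("processed_data/example_5000_seq10.pickle", "wb") as f:
    pickle.dump(data, f)
```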

Model Training

To train the model, the processed data is required. The values in the name of each .pickle file give information about the data contained within (size, sequence length, etc.).

The Keras model is built with the Sequential API. To better model the complexity of the data, feel free to increase the embedding dimension and the number of GRU units.

For the 5,000 post data set, training will take around 10 minutes with early stopping on my machine.
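For orientation, a minimal sketch of such a Sequential set-up, assuming the pickled data produced by the pre-processing sketch above; the layer sizes and training settings are illustrative rather than the exact ones used here:

```python
# Sketch of the Sequential set-up described above: an embedding layer, a GRU,
# and a softmax over the vocabulary, trained with early stopping.
import pickle
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.callbacks import EarlyStopping

with open("processed_data/example_5000_seq10.pickle", "rb") as f:
    data = pickle.load(f)

vocab_size = len(data["tokenizer"].word_index) + 1

model = Sequential([
    Embedding(vocab_size, 64, input_length=data["seq_length"]),
    GRU(128),
    Dense(vocab_size, activation="softmax"),
])
model.compile(
    loss="sparse_categorical_crossentropy",  # integer targets, no one-hot needed
    optimizer="adam",
    metrics=["accuracy"],
)

model.fit(
    data["X"], data["y"],
    validation_split=0.1,
    epochs=50,
    callbacks=[EarlyStopping(monitor="val_loss", patience=2)],
)
```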

Pair-Analysis

I was interested in the limit of prediction ability with this data set. Using the processed data, some analysis of unique feature-target pairs is conducted. For 5,000 posts, approximately 25% of features map to multiple possible targets. This proportion increases with corpus size.
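A small sketch of one way to measure this ambiguity, assuming the pickled X/y arrays from the pre-processing sketch above; it counts the fraction of unique feature sequences that are followed by more than one distinct target:

```python
# Sketch of the feature-target ambiguity check: for each unique feature
# sequence, count how many distinct targets follow it in the corpus.
import pickle
from collections import defaultdict

with open("processed_data/example_5000_seq10.pickle", "rb") as f:
    data = pickle.load(f)

targets_per_feature = defaultdict(set)
for feature, target in zip(data["X"], data["y"]):
    targets_per_feature[tuple(feature)].add(int(target))

ambiguous = sum(1 for t in targets_per_feature.values() if len(t) > 1)
print(f"{ambiguous / len(targets_per_feature):.1%} of unique features have multiple targets")
```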
