Skip to content
This repository has been archived by the owner on Apr 16, 2018. It is now read-only.
/ normalizr Public archive
forked from davidmogar/cucco

Python library for text normalization

License

Notifications You must be signed in to change notification settings

benfei/normalizr

 
 

Repository files navigation

normalizr

Normalizr is a Python library for text normalization that offers a bunch of actions to manipulate your text as much as you want. With normalizr you can replace symbols, punctuation, remove stop words and much more.

Installation

The easiest way to install the latest version is by using pip to pull it from PyPI:

$ pip install normalizr

You may also use Git to clone the repository from Github and install it manually:

$ git clone https://github.com/davidmogar/normalizr.git
$ cd normalizr
$ python setup.py install

Usage

The next code shows a sample usage of this library:

from normalizr import Normalizr

normalizr = Normalizr(language='en')
print(normalizr.normalize('Who let the dog out?'))

This would apply all normalizations to the text Who let the dog out?. The output for this normaliations would be the next one:

let dog

It's also possible to send a list of normalizations to apply, which will be executed in order.

from normalizr import Normalizr

normalizr = Normalizr(language='en')

normalizations = [
    'remove_extra_whitespaces',
    ('replace_punctuation', {'replacement': ' '})
]

print(normalizr.normalize('Who    let   the dog out?', normalizations))

Output:

Who let the dog out

If you only need to apply one normalization, use one of these methods:

  • remove_accent_marks
  • remove_extra_whitespaces
  • remove_stop_words
  • replace_charachters
  • replace_emails
  • replace_emojis
  • replace_hyphens
  • replace_punctuation
  • replace_symbols
  • replace_urls

Supported languages

At the moment, normalizr can remove stop words in these languages:

  • Danish (da)
  • Dutch (nl)
  • English (en, default)
  • Finnish (fi)
  • French (fr)
  • German (de)
  • Hungarian (hu)
  • Italian (it)
  • Norwegian (no)
  • Portuguese (pt)
  • Russian (ru)
  • Spanish (es)
  • Swedish (sv)

Contributing

So you have a new feature you think would be great for normalizr? Go for it! Pull request are warmly welcome. Not in the mood of implement it yourself? You can still create an issue and tell about it there. Feedback is always great!

About

Python library for text normalization

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 100.0%