23 February 2021
Working on NLP projects? Tired of always looking for the same silly preprocessing functions on the web, such as removing accents from French posts? Tired of spending hours on regexes to efficiently extract email addresses from a corpus? Amale El Hamri will show you how NLPretext has got you covered!
NLPretext overview
NLPretext is composed of 4 modules: basic, social, token and augmentation.
Each of them includes different functions to handle the most important text preprocessing tasks.
Basic preprocessing
The basic module is a catalogue of general-purpose functions that can be used in any use case. For example, they let you replace email addresses with a placeholder:
from nlpretext.basic.preprocess import replace_emails

example = "I have forwarded this email to obama@whitehouse.gov"
example = replace_emails(example, replace_with="*EMAIL*")
print(example)
# "I have forwarded this email to *EMAIL*"
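For intuition, here is a rough sketch of the kind of regex substitution such a function can rely on. The pattern and the name replace_emails_sketch are illustrative assumptions, not NLPretext's own implementation, and the regex is deliberately simplistic compared with what real-world email matching needs.

```python
import re

# Simplified pattern (assumption): local part, "@", then a dotted domain.
# It is only meant to illustrate the idea, not to cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def replace_emails_sketch(text: str, replace_with: str = "*EMAIL*") -> str:
    """Replace every substring that looks like an email address."""
    return EMAIL_PATTERN.sub(replace_with, text)


print(replace_emails_sketch("I have forwarded this email to obama@whitehouse.gov"))
# I have forwarded this email to *EMAIL*
```

Writing and maintaining patterns like this one is exactly the chore a ready-made function spares you.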
Social preprocessing
The social module is a catalogue of handy functions for processing social data, for example extracting emojis:
from nlpretext.social.preprocess import extract_emojis

example = "I take care of my skin 😀"
example = extract_emojis(example)
print(example)
# [':grinning_face:']
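To give a feel for what emoji extraction involves, here is a minimal standard-library sketch. It is an approximation written for this post, not how NLPretext does it: the function name extract_emojis_sketch is hypothetical, and the Unicode category test used here also catches non-emoji symbols.

```python
import unicodedata


def extract_emojis_sketch(text: str) -> list:
    """Return ':name:' labels for characters in the 'Symbol, other' category.

    Caveat: category 'So' also covers non-emoji symbols (e.g. the copyright
    sign), so this over-matches. It only illustrates the idea.
    """
    labels = []
    for char in text:
        if unicodedata.category(char) == "So":
            name = unicodedata.name(char, "unknown")
            labels.append(":" + name.lower().replace(" ", "_") + ":")
    return labels


print(extract_emojis_sketch("I take care of my skin 😀"))
# [':grinning_face:']
```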
Text augmentation
The augmentation module helps you generate new texts from your examples by modifying some words in the originals while keeping any associated entities unchanged, which matters for NER tasks. If you want words other than entities to remain unchanged, you can list them in the stopwords argument. The modifications depend on the chosen method; the module currently supports substitution with synonyms using WordNet or BERT from the nlpaug library.
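The idea of entity-preserving augmentation can be sketched in a few lines. In this toy version a hard-coded synonym table stands in for WordNet or BERT, entities are given as character offsets, and the name augment_sketch and its parameters are assumptions made for illustration rather than the module's actual interface.

```python
import re

# Hypothetical toy synonym table standing in for WordNet or BERT.
SYNONYMS = {"small": "little", "buy": "purchase", "handbag": "purse", "black": "dark"}


def augment_sketch(text, entities=(), stopwords=()):
    """Swap words for synonyms, leaving entity spans and stopwords untouched.

    `entities` is a sequence of (start, end) character offsets, end exclusive.
    """
    protected = {word.lower() for word in stopwords}

    def substitute(match):
        word = match.group(0)
        # A word is protected if it overlaps any entity span.
        inside_entity = any(
            start < match.end() and match.start() < end for start, end in entities
        )
        if inside_entity or word.lower() in protected:
            return word
        return SYNONYMS.get(word.lower(), word)

    return re.sub(r"\w+", substitute, text)


text = "I want to buy a small black handbag please."
start = text.index("black")  # "black" is the Color entity we want to keep
print(augment_sketch(text, entities=[(start, start + len("black"))], stopwords=["buy"]))
# I want to buy a little black purse please.
```

Because the entity span is left untouched, the NER annotation for "black" remains valid as a label in the generated sentence, although its character offsets can shift when earlier words are replaced by synonyms of a different length.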
Create your end to end pipeline
Default pipeline
Our library provides a Preprocessor object to efficiently pipe all preprocessing operations.
If you need to keep all elements of your text and perform only minimal cleaning, use the default pipeline. It normalizes whitespace, removes newline characters, fixes Unicode problems, and removes recurrent artifacts from social data such as mentions, hashtags and HTML tags.
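As a rough picture of what such minimal cleaning looks like, the steps listed above can be approximated with the standard library alone. This is a sketch under assumptions, not the library's default pipeline: default_clean_sketch is a hypothetical name, Unicode normalization stands in for "fixing Unicode problems", and the regexes are simplified.

```python
import re
import unicodedata


def default_clean_sketch(text: str) -> str:
    """Approximate a minimal cleaning pass over social-media text."""
    # Stand-in for Unicode fixing: compose characters into canonical form.
    text = unicodedata.normalize("NFC", text)
    # Drop HTML tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Drop mentions and hashtags, but not the "@" inside an email address.
    text = re.sub(r"(?<!\w)[@#]\w+", " ", text)
    # Fold newlines and repeated whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


raw = "<p>Best dinner ever @latourdargent !!!\nI  recommend #food #paris</p>"
print(default_clean_sketch(raw))
# Best dinner ever !!! I recommend
```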
Custom pipeline
If you have a clear idea of which preprocessing functions you want in your pipeline, you can add them to your own Preprocessor.
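The piping pattern itself is simple to picture. The sketch below uses a hypothetical MiniPreprocessor class with pipe and run methods and three toy cleaning functions; all of these names are assumptions chosen for illustration, so check the library documentation for the real Preprocessor interface.

```python
class MiniPreprocessor:
    """Toy pipeline: register functions, then apply them in order."""

    def __init__(self):
        self._steps = []

    def pipe(self, func, args=None):
        # Store the function with its optional keyword arguments.
        self._steps.append((func, args or {}))

    def run(self, text):
        # Feed the output of each step into the next one.
        for func, kwargs in self._steps:
            text = func(text, **kwargs)
        return text


def lower_text(text):
    return text.lower()


def collapse_spaces(text):
    return " ".join(text.split())


def drop_words(text, words):
    return " ".join(token for token in text.split() if token not in words)


preprocessor = MiniPreprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(collapse_spaces)
preprocessor.pipe(drop_words, args={"words": {"the", "was"}})
print(preprocessor.run("The  Dinner was   GREAT"))
# dinner great
```

Order matters here: lowercasing runs first so that the word filter sees "the" rather than "The".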
NLPretext installation
To install the library, run: pip install nlpretext