23 February 2021
Working on NLP projects? Tired of always looking for the same silly preprocessing functions on the web, such as removing accents from French posts? Tired of spending hours on regexes to efficiently extract email addresses from a corpus? Amale El Hamri will show you how NLPretext has you covered!

Medium Tech Blog by Artefact

NLPretext overview

NLPretext is composed of 4 modules: basic, social, token and augmentation.

Each of them includes different functions to handle the most important text preprocessing tasks.

Basic preprocessing

The basic module is a catalogue of transversal functions that can be used in any use case. They allow you to handle:

  • Bad whitespaces in a text, end of line characters
  • Encoding issues
  • Special characters such as currency symbols, numbers, punctuation marks, Latin and non-Latin characters
  • Emails and phone numbers

from nlpretext.basic.preprocess import replace_emails
example = "I have forwarded this email to obama@whitehouse.gov"
example = replace_emails(example, replace_with="*EMAIL*")
print(example)
# “I have forwarded this email to *EMAIL*”
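Under the hood, this kind of replacement boils down to a regular-expression substitution. Here is a minimal, stdlib-only sketch of the idea — not NLPretext's actual implementation, and with a deliberately simplified email pattern:

```python
import re

# Simplified email pattern: local part, "@", then a domain containing a dot.
# Production-grade patterns handle many more edge cases.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def replace_emails_sketch(text: str, replace_with: str = "*EMAIL*") -> str:
    """Replace every email-like substring with a placeholder token."""
    return EMAIL_RE.sub(replace_with, text)

print(replace_emails_sketch("I have forwarded this email to obama@whitehouse.gov"))
# "I have forwarded this email to *EMAIL*"
```

The library version saves you from rediscovering (and re-debugging) such a pattern in every project.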

Social preprocessing

The social module is a catalogue of handy functions that can be useful when processing social data, such as:

  • hashtags extraction/removal
  • emojis extraction/removal
  • mentions extraction/removal
  • HTML tags cleaning

from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin 😀"
example = extract_emojis(example)
print(example)
# [':grinning_face:']
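Emoji extraction itself can be pictured as matching characters from the Unicode emoji blocks. A rough, stdlib-only sketch of the idea follows — note that NLPretext returns demojized names such as ':grinning_face:', while this toy version simply returns the raw characters, and the regex below only covers the main emoji ranges:

```python
import re

# Rough matcher for the main Unicode emoji blocks (not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF"  # symbols, pictographs, emoticons, extended-A
    "\u2600-\u27BF]"          # miscellaneous symbols and dingbats
)

def extract_emojis_sketch(text: str) -> list:
    """Return every emoji character found in the text."""
    return EMOJI_RE.findall(text)

print(extract_emojis_sketch("I take care of my skin 😀"))
# ['😀']
```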

Text augmentation

The augmentation module helps you generate new texts from your existing examples by modifying some words in the originals, while keeping any associated entities unchanged in the case of NER tasks. If you want words other than entities to remain unchanged, you can list them in the stopwords argument. The modifications depend on the chosen method; the methods currently supported by the module are substitutions with synonyms using WordNet or BERT, via the nlpaug library.
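The substitution idea can be illustrated with a toy sketch that swaps words for synonyms from a hand-made table while leaving anything in the stopwords set untouched. Everything here (the synonym table, the function name) is illustrative, not NLPretext's API, which delegates synonym lookup to WordNet or BERT through nlpaug:

```python
import random

# Toy synonym table; the real module pulls candidates from WordNet or BERT.
SYNONYMS = {
    "dinner": ["meal", "supper"],
    "best": ["finest", "greatest"],
}

def augment_sketch(text, stopwords=None, seed=0):
    """Replace words that have synonyms, skipping any word in `stopwords`."""
    rng = random.Random(seed)
    stopwords = stopwords or set()
    words = []
    for word in text.split():
        if word not in stopwords and word in SYNONYMS:
            words.append(rng.choice(SYNONYMS[word]))
        else:
            words.append(word)
    return " ".join(words)

print(augment_sketch("the best dinner ever", stopwords={"dinner"}))
```

Here "dinner" is protected by the stopwords set (as an entity would be), while "best" gets swapped for one of its synonyms.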

Create your end to end pipeline

Default pipeline

Our library provides a Preprocessor object to efficiently pipe all preprocessing operations.
If you need to keep all elements of your text and perform minimum cleaning, use the default pipeline. It normalizes whitespaces and removes newlines characters, fixes unicode problems and removes recurrent artifacts from social data such as mentions, hashtags and HTML tags.

from nlpretext import Preprocessor
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
text = preprocessor.run(text)
print(text)
# “I just got the best dinner in my life !!! I recommend”

Custom pipeline

If you have a clear idea of what preprocessing functions you want to pipe in your preprocessing pipeline, you can add them in your own Preprocessor.

from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (normalize_whitespace, remove_punct, remove_eol_characters, remove_stopwords, lower_text)
from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
preprocessor.pipe(remove_eol_characters)
preprocessor.pipe(remove_stopwords, args={'lang': 'en'})
preprocessor.pipe(remove_punct)
preprocessor.pipe(normalize_whitespace)
text = preprocessor.run(text)
print(text)
# "dinner life recommend"
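The pipe/run pattern itself is simple to picture: the preprocessor keeps an ordered list of (function, kwargs) pairs and folds the text through them. A minimal sketch of that pattern — a toy class, not NLPretext's actual internals:

```python
class MiniPreprocessor:
    """Toy pipe/run pattern: register functions, then thread text through
    them in registration order."""

    def __init__(self):
        self._operations = []

    def pipe(self, func, args=None):
        self._operations.append((func, args or {}))

    def run(self, text):
        for func, kwargs in self._operations:
            text = func(text, **kwargs)
        return text

def remove_chars(text, chars="!"):
    """Drop every character found in `chars`."""
    return "".join(c for c in text if c not in chars)

pre = MiniPreprocessor()
pre.pipe(str.lower)
pre.pipe(remove_chars, args={"chars": "!#"})
print(pre.run("Hello World!!!"))
# "hello world"
```

Because each step is just a `text -> text` function, any of the basic or social functions above slots straight into such a pipeline.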

NLPretext installation

To install the library, run:

pip install nlpretext

You can find the GitHub repository here and the library documentation here.

This article was first published on the Artefact Tech Blog on Medium.