Natural Language Processing#

NLP#

  • Spell checking

  • Speech recognition

  • Translators

  • Analyse sentiment (positive/negative) of text

  • Extract topics from text (e.g. news articles)

  • Generate text (e.g. chatbots)

  • Search engines (e.g. Google)

Working with text data#

  • Algorithms work well with numbers

  • Working with text = meaningfully transforming your text into numbers

  • What counts as "meaningful" depends on your application

Converting text into numbers#

  • this is also called text preprocessing

Text processing → text to numbers#

Local Representations

  • Encoding with a unique number

  • Statistical Encodings

Distributed Representations

  • Word Embeddings

Text processing → text to numbers#

Encoding with a unique number

Easy to create, but the numbers carry no relational information (a minimal sketch follows below)

  • the relationship between words is not captured

  • models cannot interpret these representations well
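A minimal sketch of unique-number encoding (the sentence and vocabulary here are made up for illustration):

# assign every word in the vocabulary an arbitrary unique integer
sentence = "nlp is amazing and nlp is fun"
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence.split())))}
encoded = [vocab[word] for word in sentence.split()]
print(vocab)    # {'amazing': 0, 'and': 1, 'fun': 2, 'is': 3, 'nlp': 4}
print(encoded)  # [4, 3, 0, 1, 4, 3, 2]
# the integers carry no meaning: 'amazing' (0) is not "closer" to 'and' (1) than to 'nlp' (4)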

Text processing → text to numbers#

Statistical Encodings

Creating vectors of the size of the vocabulary

  • leads to a large, sparse feature space

  • not very efficient

Text processing → text to numbers#

Word Embeddings

embedding = new latent space

  • properties and relationships between items are preserved

  • fewer dimensions

  • less sparsity

Statistical Encodings#

Text Preprocessing#

  • Tokenization

  • CountVectorizer

  • TF-IDF

  • N-grams

  • Normalization

  • Stemming

  • Lemmatization

Tokenization#

import nltk
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("punkt_tab")

from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
text = "Let us learn some NLP. NLP is amazing!"
word_tokenize(text)
['Let', 'us', 'learn', 'some', 'NLP', '.', 'NLP', 'is', 'amazing', '!']
sent_tokenize(text)
['Let us learn some NLP.', 'NLP is amazing!']

CountVectorizer#

Converting a collection of text documents to a matrix of token counts

sklearn’s CountVectorizer

CountVectorizer#


Note:

Gives a lot of weight to frequent (and maybe not so informative) words… → TF-IDF fixes this

corpus = [
    'This is the first Document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
 ]
cv = CountVectorizer()

X = cv.fit_transform(corpus)
features = cv.get_feature_names_out()
print(f"Features - {features}")
 
output = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
print("\n",output)
Features - ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']

    and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
from sklearn.linear_model import LogisticRegression

y = ['document 1', 'document 2', 'document 3', 'document 4']
model = LogisticRegression().fit(X, y)
query = ['What is about second document?']

query_transformed = cv.transform(query)

print('prediction:',model.predict(query_transformed)[0])
print('probability:',model.predict_proba(query_transformed)[0])
prediction: document 2
probability: [0.2178996  0.39701782 0.16718298 0.2178996 ]

TF-IDF#

TF-IDF: Term Frequency * Inverse Document Frequency

→ measure how important a word is to a document in a corpus

Note:

A frequent word in a document that is also frequent in the corpus is less important to a document than a frequent word in a document that is not frequent in the corpus.

TF-IDF#

TF:

\[\text{tf}(t, d)=\frac{\textrm{Number of times term }t\textrm{ appears in document }d}{\textrm{Total number of terms in document }d} \]

IDF:

\[\text{idf}(t, D)= \log\frac{\textrm{Total number of documents in corpus }D}{\textrm{Number of documents containing term }t}\]

TF-IDF:

\[\text{tfidf}(t, d, D)=\text{tf}(t, d) \cdot \text{idf}(t, D)\]
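As a quick worked example with the 4-document corpus used below: "document" appears in 3 of the 4 documents, while "and" appears in only 1, so the rarer word gets the larger weight:

\[\text{idf}(\textrm{"document"}) = \log\frac{4}{3} \approx 0.29, \qquad \text{idf}(\textrm{"and"}) = \log\frac{4}{1} \approx 1.39\]

Note that sklearn's TfidfVectorizer (used below) applies a smoothed variant by default,

\[\text{idf}(t) = \ln\frac{1 + N}{1 + \text{df}(t)} + 1\]

and L2-normalizes each document vector, so its numbers differ slightly from the plain formula above.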

TF-IDF#


sklearn’s TF-IDF

A detailed article on how TF-IDF works.

corpus = [
    'This is the first Document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer()

X = tfidf.fit_transform(corpus)

X.toarray()
array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
df = pd.DataFrame((X.toarray().round(2)), columns=tfidf.get_feature_names_out())
df
    and  document  first    is   one  second   the  third  this
0  0.00      0.47   0.58  0.38  0.00    0.00  0.38   0.00  0.38
1  0.00      0.69   0.00  0.28  0.00    0.54  0.28   0.00  0.28
2  0.51      0.00   0.00  0.27  0.51    0.00  0.27   0.51  0.27
3  0.00      0.47   0.58  0.38  0.00    0.00  0.38   0.00  0.38

N-grams#

To model sequences of words… for example, "ice" and "cream" make more sense as a 2-gram when they appear together

N-grams can be built at the word level or at the character level (a CountVectorizer sketch follows the example below)

n-grams

from nltk import ngrams
n = 4

for i in range(1, n):
    print(f"{i} gram\n")
    ngram = ngrams(text.split(), i)
    for gram in ngram:
        print(gram)
    print("-"*10)
1 gram

('Let',)
('us',)
('learn',)
('some',)
('NLP.',)
('NLP',)
('is',)
('amazing!',)
----------
2 gram

('Let', 'us')
('us', 'learn')
('learn', 'some')
('some', 'NLP.')
('NLP.', 'NLP')
('NLP', 'is')
('is', 'amazing!')
----------
3 gram

('Let', 'us', 'learn')
('us', 'learn', 'some')
('learn', 'some', 'NLP.')
('some', 'NLP.', 'NLP')
('NLP.', 'NLP', 'is')
('NLP', 'is', 'amazing!')
----------
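CountVectorizer can build n-gram features directly via its ngram_range parameter; a minimal sketch reusing the corpus and imports from the CountVectorizer example above:

# word-level unigrams and bigrams (ngram_range is inclusive)
cv_ngrams = CountVectorizer(ngram_range=(1, 2))
cv_ngrams.fit(corpus)
print(cv_ngrams.get_feature_names_out())

# character-level n-grams: analyzer="char" (or "char_wb" to stay within word boundaries)
cv_chars = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
cv_chars.fit(corpus)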

Normalization#

[‘List’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

→ [‘list’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

Do we want to distinguish between “List” and “list”?

Sometimes we do: “White House” vs. “white house”

Note: Normalization is the process of converting text data into a standardized form to reduce complexity and improve the efficiency of machine learning models. This can include lowercasing, stemming/lemmatization, …

Stemming#

[‘list’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

→ [‘list’, ‘list’, ‘list’, ‘list’, ‘list’, ‘.’]

Note:

Stemming reduces words to a shorter form, a form that might have no meaning.

Lemmatization#

[‘list’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

→ [‘list’, ‘listed’, ‘list’, ‘listing’, ‘listing’, ‘.’]

Note:

Lemmatization uses a language dictionary to reduce a word to its base form (its lemma).

stemmer = nltk.PorterStemmer()

text = "We are learning how a stemmer works"
text1 = "People are running so fast." 
tokenized_text = word_tokenize(text1)
stem = [stemmer.stem(word) for word in tokenized_text]
stem
['peopl', 'are', 'run', 'so', 'fast', '.']
lemmatizer = nltk.WordNetLemmatizer()
tokenized_text = word_tokenize(text1)
lemm = [lemmatizer.lemmatize(word) for word in tokenized_text]
lemm
['People', 'are', 'running', 'so', 'fast', '.']
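"running" stays unchanged above because WordNetLemmatizer treats every token as a noun unless told otherwise; passing the part of speech changes the result:

lemmatizer.lemmatize("running")           # 'running' (treated as a noun)
lemmatizer.lemmatize("running", pos="v")  # 'run'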

Stemming or Lemmatization?#

It depends…

  • Stemming is faster

  • Lemmatization preserves more information

Stopwords#

  • some words do not provide meaningful information … they are not “content words”

  • the list of non-content words is language specific and corpus specific

What would you say are stop words in this text?

“Apple is looking at buying U.K. startup for $1 billion”


nltk.download("stopwords")
from nltk.corpus import stopwords

print(stopwords.words('english'))
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she'd", "she'll", "she's", 'should', 'shouldn', "shouldn't", "should've", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', "we'd", "we'll", "we're", 'were', 'weren', "weren't", "we've", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've"]
[nltk_data] Downloading package stopwords to /home/runner/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
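A minimal sketch of filtering stop words out of a tokenized text, reusing text1 and word_tokenize from above:

stop_words = set(stopwords.words("english"))
tokens = word_tokenize(text1.lower())
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)  # ['people', 'running', 'fast', '.']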

POS Tagging#

  • Part Of Speech tagging - assigning grammatical annotations

    • ADJ - adjective

    • NOUN

    • VERB

Which are verbs and nouns here?

“Apple is looking at buying U.K. startup for $1 billion”

universaldependencies


from nltk import pos_tag
nltk.download('averaged_perceptron_tagger_eng')
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/runner/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
True
tokenized_text = word_tokenize(text1)
tag = pos_tag(tokenized_text)
tag
[('People', 'NNS'),
 ('are', 'VBP'),
 ('running', 'VBG'),
 ('so', 'RB'),
 ('fast', 'RB'),
 ('.', '.')]
  • NNS - Noun, plural

  • VBP - Verb, non-3rd person singular present

  • VBG - Verb, gerund or present participle

  • RB - Adverb

  • PRP - Personal pronoun

  • VBZ - Verb, 3rd person singular present

  • WRB - Wh-adverb

  • NN - Noun, singular or mass

POS Tagging using Spacy#

!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
  
nlp = spacy.load("en_core_web_sm")
# new_text = "The car is blue"
doc = nlp(text1)
  
# Token and Tag
for token in doc:
    print(token, token.pos_)
People NOUN
are AUX
running VERB
so ADV
fast ADV
. PUNCT
  • PRON - Pronoun

  • NOUN - Noun

  • VERB - Verb

  • AUX - Auxiliary

  • DET - Determiner

  • SCONJ - Subordinating conjunction
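If a tag abbreviation is unfamiliar, spaCy can expand it for you:

spacy.explain("AUX")    # 'auxiliary'
spacy.explain("SCONJ")  # 'subordinating conjunction'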

Named Entities#

  • Named Entities are real-world objects that are assigned a name: a person, country, book, product, …

  • The recognition of entities is based on training data so it’s not perfect.

What entities do you think are in this text?

“Apple is looking at buying U.K. startup for $1 billion”


doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text," - ", ent.label_)
Apple  -  ORG
U.K.  -  GPE
$1 billion  -  MONEY
  • GPE - Geographical Entity

  • ORG - Organization

  • MONEY - Monetary value

from spacy import displacy

displacy.render(doc, style="ent")
[displacy entity visualization: "Apple" (ORG), "U.K." (GPE) and "$1 billion" (MONEY) highlighted in the sentence]
for token in doc:
    print(token, token.pos_)
Apple PROPN
is AUX
looking VERB
at ADP
buying VERB
U.K. PROPN
startup NOUN
for ADP
$ SYM
1 NUM
billion NUM
displacy.render(doc, style="dep")
[displacy dependency-parse visualization of the sentence, showing POS tags and arcs such as nsubj, aux, prep, pobj]

So.. what do we do with all that?#

  • document similarity

  • text classification

Text similarity or Document Similarity#

Each document is a vector of features.

Similarity between documents is then the similarity between their vectors (a minimal sketch follows the list below).

Usage:

  • search engines: query to document

  • clustering of documents: document to document

  • Question & Answering platforms: query to query
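A minimal sketch using cosine similarity on the TF-IDF matrix X from above (the query string is just an example):

from sklearn.metrics.pairwise import cosine_similarity

# pairwise similarity between all documents in the corpus
sim = cosine_similarity(X)
print(sim.round(2))

# query-to-document similarity, e.g. for a search engine
query_vec = tfidf.transform(["which is the second document?"])
print(cosine_similarity(query_vec, X).round(2))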

Text classification#

You can use your favourite classifier with text

  • Logistic Regression provides a nice baseline

  • AUC score as a performance metric (a minimal sketch follows the applications below)

Some applications:

  • spam detection

  • sentiment analysis

  • hate speech analysis
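A minimal baseline sketch with a TF-IDF + Logistic Regression pipeline and AUC as the metric (the texts and labels here are toy placeholders, not real data):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

texts = ["free money, click now", "meeting at 10am", "win a prize today", "see you tomorrow"] * 10
labels = [1, 0, 1, 0] * 10  # 1 = spam, 0 = not spam (toy labels)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logreg", LogisticRegression()),
])
scores = cross_val_score(clf, texts, labels, cv=5, scoring="roc_auc")
print(scores.mean())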

Word Embeddings#

Word Embeddings#

  • Represent the feature space in fewer dimensions

  • Similar words are near each other in the embedding space

  • Trained using neural networks
     → use those trained weights as the first layer in your NLP neural network (a sketch follows below)
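A common pattern, sketched here with a hypothetical vocabulary (the gensim KeyedVectors object wv is loaded further below):

import numpy as np

# hypothetical vocabulary built from your own tokenized corpus
vocab = {"coffee": 0, "tea": 1, "espresso": 2}
embedding_dim = 300  # must match the pre-trained vectors, e.g. word2vec-google-news-300

# copy each known word's pre-trained vector into a weight matrix;
# words missing from the pre-trained vocabulary stay as zero vectors
embedding_matrix = np.zeros((len(vocab), embedding_dim))
for word, idx in vocab.items():
    if word in wv:
        embedding_matrix[idx] = wv[word]

# embedding_matrix can then initialize the (frozen or trainable) embedding layer of your
# network, e.g. via a constant initializer in Keras or Embedding.from_pretrained in PyTorch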

Word similarity#

Is “St Pauli” more similar to:

  • De Wallen → Similar type

or

  • HSV → Similar topic?

The result depends on the context … i.e. on the feature space / embedding you choose

Using Embeddings#

Relevant items for your task should be similar in the embedding space, i.e. close to each other.

How do we get Word Embeddings#

CBOW - Continuous Bag of Words

  • Predict the current word based on the context words

  • Input (X): context words , Output (y): current word

  • For example, “One word was missing in all that.”

How do we get Word Embeddings#

Skip-Gram

  • Predict the context words based on the current word

  • Input (X): current word, Output (y): context words

  • For example, “One word was missing in all that.”

How do we get Word Embeddings#

Creating word embeddings
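Both objectives can be trained with gensim's Word2Vec; a minimal sketch (the toy corpus below is far too small to learn anything useful):

from gensim.models import Word2Vec

# toy tokenized corpus; in practice you need a large amount of text
sentences = [
    ["one", "word", "was", "missing", "in", "all", "that"],
    ["let", "us", "learn", "some", "nlp"],
]

# sg=0 -> CBOW: predict the current word from its context words
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-Gram: predict the context words from the current word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

cbow.wv["nlp"].shape  # (50,)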

Using pre-trained embeddings#

Most of the time you do not have enough data to train good word embeddings for your task; instead, you can use pre-trained word embeddings.

There are different kinds of word embeddings:

  • static word embeddings: Word2vec (Google), GloVe (Stanford University), fastText (Facebook), …

  • contextual word embeddings: ELMo, BERT (Google), GPT-2/3/4 (OpenAI), …

example: pretrained word embeddings

Word Embeddings#

!pip install gensim
!pip install scipy==1.12
import gensim.downloader as api

## List available embeddings
info = api.info()

for model_name, model_data in sorted(info['models'].items()):
    print(model_name)
__testing_word2vec-matrix-synopsis
conceptnet-numberbatch-17-06-300
fasttext-wiki-news-subwords-300
glove-twitter-100
glove-twitter-200
glove-twitter-25
glove-twitter-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-wiki-gigaword-50
word2vec-google-news-300
word2vec-ruscorpora-300
# caveat: If you don't have enough RAM, this cell can crash your kernel

wv = api.load("word2vec-google-news-300")
glove = api.load("glove-twitter-100")
fasttext = api.load("fasttext-wiki-news-subwords-300")
wv.most_similar("coffee", topn=10)
[('coffees', 0.721267819404602),
 ('gourmet_coffee', 0.7057086825370789),
 ('Coffee', 0.6900454759597778),
 ('o_joe', 0.6891065835952759),
 ('Starbucks_coffee', 0.6874972581863403),
 ('coffee_beans', 0.6749704480171204),
 ('latté', 0.664122462272644),
 ('cappuccino', 0.662549614906311),
 ('brewed_coffee', 0.6621608138084412),
 ('espresso', 0.6616826057434082)]
wv.get_vector("coffee").shape
(300,)
glove.most_similar("coffee", topn=10)
[('tea', 0.8275877237319946),
 ('beer', 0.7744594216346741),
 ('breakfast', 0.7694926261901855),
 ('coffe', 0.762207567691803),
 ('starbucks', 0.7606451511383057),
 ('food', 0.75710529088974),
 ('wine', 0.7540071606636047),
 ('drink', 0.7533924579620361),
 ('milk', 0.7433452010154724),
 ('cream', 0.7419354915618896)]
wv.distance("coffee", "tea")
# wv.distance("coffee","coffees")
0.43647074699401855
wv.distance("coffee", "onion")
0.8041959255933762

Semantic relationships#

wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5)
[('queen', 0.7118192911148071),
 ('monarch', 0.6189674735069275),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133)]

Captures semantic relationships:

  • gender (man ↔ woman)

  • royalty (king ↔ queen)

Visualize Semantics with Graphs#

TensorFlow projector

Hugging Face & Transformers#

Transformers#

  • Neural network architecture

  • Stacks of encoder and decoder blocks

  • Use self-attention mechanisms to process input data in parallel

  • Handle long-range dependencies in text using many context vectors.

Transformers#

Encoder:

  • Self-attention layer:

    • looks at other words in the input sentence as it encodes a specific word

  • Feed-forward neural network:

    • applied to each position of the input sentence.

Decoder:

  • Encoder-Decoder-Attention layer:

    • helps the decoder focus on relevant parts of the input sentence

Transformers#

In a sentence like "The animal didn't cross the street because it was too tired", is the term it connected to animal or to street?

Self-attention allows the model to associate it with animal.

Hugging Face#

> 7k pre-trained NLP models on huggingface.co

Zero-Shot Learning#

  • A pretrained model performs a downstream task directly from a natural language description

  • input: “Classify the sentiment of: Today is a great day!!”

(see notebook 2,3 in workbooks)
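A minimal sketch with the transformers zero-shot-classification pipeline (the model is downloaded on first use, and the default model may change between library versions):

!pip install transformers
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "Today is a great day!!",
    candidate_labels=["positive", "negative"],
)
# returns a dict with the candidate labels sorted by score ('positive' should rank first)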

Resources#