Natural Language Processing#

NLP#

  • Spell checking

  • Speech recognition

  • Translators

  • Analyse sentiment (positive/negative) of text

  • Extract topics from text (e.g. news articles)

  • Generate text (e.g. chatbots)

  • Search engines (e.g. Google)

Working with text data#

  • Algorithms work well with numbers

  • Working with text = meaningfully transforming your text into numbers

  • What counts as "meaningful" depends on your application

Converting text into numbers#

  • this is also called text preprocessing

Text processing → text to numbers#

Local Representations

  • Encoding with a unique number

  • Statistical Encodings

Distributed Representations

  • Word Embeddings

Text processing → text to numbers#

Encoding with a unique number

Easy to create, but the numbers carry no relational information (a minimal sketch follows below)

  • the relationship between words is not captured

  • models cannot interpret these representations well
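A minimal sketch of unique-number encoding (the sentence and vocabulary here are made up for illustration):

# assign every word in the vocabulary an arbitrary unique integer
sentence = "nlp is amazing and nlp is fun"
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence.split())))}
encoded = [vocab[word] for word in sentence.split()]
print(vocab)    # {'amazing': 0, 'and': 1, 'fun': 2, 'is': 3, 'nlp': 4}
print(encoded)  # [4, 3, 0, 1, 4, 3, 2]
# the integers carry no meaning: 'amazing' (0) is not "closer" to 'and' (1) than to 'nlp' (4)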

Text processing → text to numbers#

Statistical Encodings

Creating vectors of the size of the vocabulary

  • leads to a large, sparse feature space

  • not very efficient

Text processing → text to numbers#

Word Embeddings

embedding = new latent space

  • properties and relationships between items are preserved

  • fewer dimensions

  • less sparsity

Statistical Encodings#

Text Preprocessing#

  • Tokenization

  • CountVectorizer

  • TF-IDF

  • N-grams

  • Normalization

  • Stemming

  • Lemmatization

Tokenization#

import nltk
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("punkt_tab")

from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
text = "Let us learn some NLP. NLP is amazing!"
word_tokenize(text)
['Let', 'us', 'learn', 'some', 'NLP', '.', 'NLP', 'is', 'amazing', '!']
sent_tokenize(text)
['Let us learn some NLP.', 'NLP is amazing!']

CountVectorizer#

Converting a collection of text documents to a matrix of token counts

sklearn’s CountVectorizer

CountVectorizer#


Note:

Gives a lot of weight to frequent (and maybe not so informative) words… → TF-IDF fixes this

corpus = [
    'This is the first Document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
 ]
cv = CountVectorizer()

X = cv.fit_transform(corpus)
features = cv.get_feature_names_out()
print(f"Features - {features}")
 
output = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
print("\n",output)
Features - ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']

    and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
from sklearn.linear_model import LogisticRegression

y = ['document 1', 'document 2', 'document 3', 'document 4']
model = LogisticRegression().fit(X, y)
query = ['What is about second document?']

query_transformed = cv.transform(query)

print('prediction:',model.predict(query_transformed)[0])
print('probability:',model.predict_proba(query_transformed)[0])
prediction: document 2
probability: [0.2178996  0.39701782 0.16718298 0.2178996 ]

TF-IDF#

TF-IDF: Term Frequency * Inverse Document Frequency

→ measure how important a word is to a document in a corpus

Note:

A frequent word in a document that is also frequent in the corpus is less important to a document than a frequent word in a document that is not frequent in the corpus.

TF-IDF#

TF:

\[\text{tf}(t, d)=\frac{\textrm{Number of times term }t\textrm{ appears in document }d}{\textrm{Total number of terms in document }d} \]

IDF:

\[\text{idf}(t, D)= \log\frac{\textrm{Total number of documents in corpus }D}{\textrm{Number of documents containing term }t}\]

TF-IDF:

\[\text{tfidf}(t, d, D)=\text{tf}(t, d) \cdot \text{idf}(t, D)\]
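As a quick worked example with the 4-document corpus used below: "document" appears in 3 of the 4 documents, while "and" appears in only 1, so the rarer word gets the larger weight:

\[\text{idf}(\textrm{"document"}) = \log\frac{4}{3} \approx 0.29, \qquad \text{idf}(\textrm{"and"}) = \log\frac{4}{1} \approx 1.39\]

Note that sklearn's TfidfVectorizer (used below) applies a smoothed variant by default,

\[\text{idf}(t) = \ln\frac{1 + N}{1 + \text{df}(t)} + 1\]

and L2-normalizes each document vector, so its numbers differ slightly from the plain formula above.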

TF-IDF#


sklearn’s TF-IDF

A detailed article on how TF-IDF works.

corpus = [
    'This is the first Document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer()

X = tfidf.fit_transform(corpus)

X.toarray()
array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
df = pd.DataFrame((X.toarray().round(2)), columns=tfidf.get_feature_names_out())
df
    and  document  first    is   one  second   the  third  this
0  0.00      0.47   0.58  0.38  0.00    0.00  0.38   0.00  0.38
1  0.00      0.69   0.00  0.28  0.00    0.54  0.28   0.00  0.28
2  0.51      0.00   0.00  0.27  0.51    0.00  0.27   0.51  0.27
3  0.00      0.47   0.58  0.38  0.00    0.00  0.38   0.00  0.38

N-grams#

To model sequences of words… for example, "ice" and "cream" make more sense as a 2-gram when they appear together

N-grams can be built at the word level or at the character level (a CountVectorizer sketch follows the example below)

n-grams

from nltk import ngrams
n = 4

for i in range(1, n):
    print(f"{i} gram\n")
    ngram = ngrams(text.split(), i)
    for gram in ngram:
        print(gram)
    print("-"*10)
1 gram

('Let',)
('us',)
('learn',)
('some',)
('NLP.',)
('NLP',)
('is',)
('amazing!',)
----------
2 gram

('Let', 'us')
('us', 'learn')
('learn', 'some')
('some', 'NLP.')
('NLP.', 'NLP')
('NLP', 'is')
('is', 'amazing!')
----------
3 gram

('Let', 'us', 'learn')
('us', 'learn', 'some')
('learn', 'some', 'NLP.')
('some', 'NLP.', 'NLP')
('NLP.', 'NLP', 'is')
('NLP', 'is', 'amazing!')
----------
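CountVectorizer can build n-gram features directly via its ngram_range parameter; a minimal sketch reusing the corpus and imports from the CountVectorizer example above:

# word-level unigrams and bigrams (ngram_range is inclusive)
cv_ngrams = CountVectorizer(ngram_range=(1, 2))
cv_ngrams.fit(corpus)
print(cv_ngrams.get_feature_names_out())

# character-level n-grams: analyzer="char" (or "char_wb" to stay within word boundaries)
cv_chars = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
cv_chars.fit(corpus)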

Normalization#

[‘List’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

→ [‘list’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

Do we want to distinguish between “List” and “list”?

Sometimes we do: “White House” vs. “white house”

Note: Normalization is the process of converting text data into a standardized form to reduce complexity and improve the efficiency of machine learning models. This can include lowercasing, stemming/lemmatization, …

Stemming#

[‘list’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

→ [‘list’, ‘list’, ‘list’, ‘list’, ‘list’, ‘.’]

Note:

Stemming reduces words to a shorter form, a form that might have no meaning.

Lemmatization#

[‘list’, ‘listed’, ‘lists’, ‘listing’, ‘listings’, ‘.’]

→ [‘list’, ‘listed’, ‘list’, ‘listing’, ‘listing’, ‘.’]

Note:

Lemmatization uses a language dictionary to reduce a word to its base form (its lemma).

stemmer = nltk.PorterStemmer()

text = "We are learning how a stemmer works"
text1 = "People are running so fast." 
tokenized_text = word_tokenize(text1)
stem = [stemmer.stem(word) for word in tokenized_text]
stem
['peopl', 'are', 'run', 'so', 'fast', '.']
lemmatizer = nltk.WordNetLemmatizer()
tokenized_text = word_tokenize(text1)
lemm = [lemmatizer.lemmatize(word) for word in tokenized_text]
lemm
['People', 'are', 'running', 'so', 'fast', '.']
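"running" stays unchanged above because WordNetLemmatizer treats every token as a noun unless told otherwise; passing the part of speech changes the result:

lemmatizer.lemmatize("running")           # 'running' (treated as a noun)
lemmatizer.lemmatize("running", pos="v")  # 'run'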

Stemming or Lemmatization?#

It depends…

  • Stemming is faster

  • Lemmatization preserves more information

Stopwords#

  • some words do not provide meaningful information … they are not “content words”

  • the list of non-content words is language specific and corpus specific

What would you say are stop words in this text?

“Apple is looking at buying U.K. startup for $1 billion”


nltk.download("stopwords")
from nltk.corpus import stopwords

print(stopwords.words('english'))
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she'd", "she'll", "she's", 'should', 'shouldn', "shouldn't", "should've", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was', 'wasn', "wasn't", 'we', "we'd", "we'll", "we're", 'were', 'weren', "weren't", "we've", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't", 'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've"]
[nltk_data] Downloading package stopwords to /home/runner/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
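A minimal sketch of filtering stop words out of a tokenized text, reusing text1 and word_tokenize from above:

stop_words = set(stopwords.words("english"))
tokens = word_tokenize(text1.lower())
content_tokens = [t for t in tokens if t not in stop_words]
print(content_tokens)  # ['people', 'running', 'fast', '.']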

POS Tagging#

  • Part Of Speech tagging - assigning grammatical annotations

    • ADJ - adjective

    • NOUN

    • VERB

Which are verbs and nouns here?

“Apple is looking at buying U.K. startup for $1 billion”

universaldependencies


from nltk import pos_tag
nltk.download('averaged_perceptron_tagger_eng')
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/runner/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
True
tokenized_text = word_tokenize(text1)
tag = pos_tag(tokenized_text)
tag
[('People', 'NNS'),
 ('are', 'VBP'),
 ('running', 'VBG'),
 ('so', 'RB'),
 ('fast', 'RB'),
 ('.', '.')]
  • NNS - Noun, plural

  • VBP - Verb, non-3rd person singular present

  • VBG - Verb, gerund or present participle

  • RB - Adverb

  • PRP - Personal pronoun

  • VBZ - Verb, 3rd person singular present

  • WRB - Wh-adverb

  • NN - Noun, singular or mass

POS Tagging using Spacy#

!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
  
nlp = spacy.load("en_core_web_sm")
# new_text = "The car is blue"
doc = nlp(text1)
  
# Token and Tag
for token in doc:
    print(token, token.pos_)
People NOUN
are AUX
running VERB
so ADV
fast ADV
. PUNCT
  • PRON - Pronoun

  • NOUN - Noun

  • VERB - Verb

  • AUX - Auxiliary

  • DET - Determiner

  • SCONJ - Subordinating conjunction
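If a tag abbreviation is unfamiliar, spaCy can expand it for you:

spacy.explain("AUX")    # 'auxiliary'
spacy.explain("SCONJ")  # 'subordinating conjunction'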

Named Entities#

  • Named Entities are real-world objects that are assigned a name: a person, country, book, product, …

  • The recognition of entities is based on training data so it’s not perfect.

What entities do you think are in this text?

“Apple is looking at buying U.K. startup for $1 billion”


doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text," - ", ent.label_)
Apple  -  ORG
U.K.  -  GPE
$1 billion  -  MONEY
  • GPE - Geographical Entity

  • ORG - Organization

  • MONEY - Monetary value

from spacy import displacy

displacy.render(doc, style="ent")
[displacy entity visualization: "Apple" (ORG), "U.K." (GPE) and "$1 billion" (MONEY) highlighted in the sentence]
for token in doc:
    print(token, token.pos_)
Apple PROPN
is AUX
looking VERB
at ADP
buying VERB
U.K. PROPN
startup NOUN
for ADP
$ SYM
1 NUM
billion NUM
displacy.render(doc, style="dep")
[displacy dependency-parse visualization of the sentence, showing POS tags and arcs such as nsubj, aux, prep, pobj]

So.. what do we do with all that?#

  • document similarity

  • text classification

Text similarity or Document Similarity#

Each document is a vector of features.

Similarity between documents is then the similarity between their vectors (a minimal sketch follows the list below).

Usage:

  • search engines: query to document

  • clustering of documents: document to document

  • Question & Answering platforms: query to query
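A minimal sketch using cosine similarity on the TF-IDF matrix X from above (the query string is just an example):

from sklearn.metrics.pairwise import cosine_similarity

# pairwise similarity between all documents in the corpus
sim = cosine_similarity(X)
print(sim.round(2))

# query-to-document similarity, e.g. for a search engine
query_vec = tfidf.transform(["which is the second document?"])
print(cosine_similarity(query_vec, X).round(2))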

Text classification#

You can use your favourite classifier with text

  • Logistic Regression provides a nice baseline

  • AUC score as a performance metric (a minimal sketch follows the applications below)

Some applications:

  • spam detection

  • sentiment analysis

  • hate speech analysis
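A minimal baseline sketch with a TF-IDF + Logistic Regression pipeline and AUC as the metric (the texts and labels here are toy placeholders, not real data):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

texts = ["free money, click now", "meeting at 10am", "win a prize today", "see you tomorrow"] * 10
labels = [1, 0, 1, 0] * 10  # 1 = spam, 0 = not spam (toy labels)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("logreg", LogisticRegression()),
])
scores = cross_val_score(clf, texts, labels, cv=5, scoring="roc_auc")
print(scores.mean())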

Word Embeddings#

Word Embeddings#

  • Represent the feature space in fewer dimensions

  • Similar words are near each other in the embedding space

  • Trained using neural networks
     → use those trained weights as the first layer in your NLP neural network (a sketch follows below)
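A common pattern, sketched here with a hypothetical vocabulary (the gensim KeyedVectors object wv is loaded further below):

import numpy as np

# hypothetical vocabulary built from your own tokenized corpus
vocab = {"coffee": 0, "tea": 1, "espresso": 2}
embedding_dim = 300  # must match the pre-trained vectors, e.g. word2vec-google-news-300

# copy each known word's pre-trained vector into a weight matrix;
# words missing from the pre-trained vocabulary stay as zero vectors
embedding_matrix = np.zeros((len(vocab), embedding_dim))
for word, idx in vocab.items():
    if word in wv:
        embedding_matrix[idx] = wv[word]

# embedding_matrix can then initialize the (frozen or trainable) embedding layer of your
# network, e.g. via a constant initializer in Keras or Embedding.from_pretrained in PyTorch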

Word similarity#

Is “St Pauli” more similar to:

  • De Wallen → Similar type

or

  • HSV → Similar topic?

The result depends on the context … i.e. on the feature space / embedding you choose

Using Embeddings#

Relevant items for your task should be similar in the embedding space, i.e. close to each other.

How do we get Word Embeddings#

CBOW - Continuous Bag of Words

  • Predict the current word based on the context words

  • Input (X): context words , Output (y): current word

  • For example, “One word was missing in all that.”

How do we get Word Embeddings#

Skip-Gram

  • Predict the context words based on the current word

  • Input (X): current word, Output (y): context words

  • For example, “One word was missing in all that.”

How do we get Word Embeddings#

Creating word embeddings
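Both objectives can be trained with gensim's Word2Vec; a minimal sketch (the toy corpus below is far too small to learn anything useful):

from gensim.models import Word2Vec

# toy tokenized corpus; in practice you need a large amount of text
sentences = [
    ["one", "word", "was", "missing", "in", "all", "that"],
    ["let", "us", "learn", "some", "nlp"],
]

# sg=0 -> CBOW: predict the current word from its context words
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-Gram: predict the context words from the current word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

cbow.wv["nlp"].shape  # (50,)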

Using pre-trained embeddings#

Most of the time you do not have enough data to train good word embeddings for your task; instead, you can use pre-trained word embeddings.

There are different kinds of word embeddings:

  • static word embeddings: Word2vec (Google), GloVe (Stanford University), fastText (Facebook), …

  • contextual word embeddings: ELMo, BERT (Google), GPT-2/3/4 (OpenAI), …

example: pretrained word embeddings

Word Embeddings#

!pip install gensim
!pip install scipy==1.12
import gensim.downloader as api

## List available embeddings
info = api.info()

for model_name, model_data in sorted(info['models'].items()):
    print(model_name)
__testing_word2vec-matrix-synopsis
conceptnet-numberbatch-17-06-300
fasttext-wiki-news-subwords-300
glove-twitter-100
glove-twitter-200
glove-twitter-25
glove-twitter-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-wiki-gigaword-50
word2vec-google-news-300
word2vec-ruscorpora-300
# caveat: If you don't have enough RAM, this cell can crash your kernel

wv = api.load("word2vec-google-news-300")
glove = api.load("glove-twitter-100")
fasttext = api.load("fasttext-wiki-news-subwords-300")
wv.most_similar("coffee", topn=10)
[('coffees', 0.721267819404602),
 ('gourmet_coffee', 0.7057086825370789),
 ('Coffee', 0.6900454759597778),
 ('o_joe', 0.6891065835952759),
 ('Starbucks_coffee', 0.6874972581863403),
 ('coffee_beans', 0.6749704480171204),
 ('latté', 0.664122462272644),
 ('cappuccino', 0.662549614906311),
 ('brewed_coffee', 0.6621608138084412),
 ('espresso', 0.6616826057434082)]
wv.get_vector("coffee").shape
(300,)
glove.most_similar("coffee", topn=10)
[('tea', 0.8275877237319946),
 ('beer', 0.7744594216346741),
 ('breakfast', 0.7694926261901855),
 ('coffe', 0.762207567691803),
 ('starbucks', 0.7606451511383057),
 ('food', 0.75710529088974),
 ('wine', 0.7540071606636047),
 ('drink', 0.7533924579620361),
 ('milk', 0.7433452010154724),
 ('cream', 0.7419354915618896)]
wv.distance("coffee", "tea")
# wv.distance("coffee","coffees")
0.43647074699401855
wv.distance("coffee", "onion")
0.8041959255933762

Semantic relationships#

wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5)
[('queen', 0.7118192911148071),
 ('monarch', 0.6189674735069275),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133)]

Captures semantic relationships:

  • gender (man ↔ woman)

  • royalty (king ↔ queen)

Visualize Semantics with Graphs#

TensorFlow projector

Hugging Face & Transformers#

Transformers#

  • Neural network architecture

  • Stacks of encoder and decoder blocks

  • Use self-attention mechanisms to process input data in parallel

  • Handle long-range dependencies in text using many context vectors.

Transformers#

Encoder:

  • Self-attention layer:

    • looks at other words in the input sentence as it encodes a specific word

  • Feed-forward neural network:

    • applied to each position of the input sentence.

Decoder:

  • Encoder-Decoder-Attention layer:

    • helps the decoder focus on relevant parts of the input sentence

Transformers#

In a sentence like "The animal didn't cross the street because it was too tired", is the term it connected to animal or to street?

Self-attention allows the model to associate it with animal.

Hugging Face#

> 7k pre-trained NLP models on huggingface.co

Zero-Shot Learning#

  • A pretrained model performs a downstream task directly from a natural language description

  • input: “Classify the sentiment of: Today is a great day!!”

(see notebook 2,3 in workbooks)
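A minimal sketch with the transformers zero-shot-classification pipeline (the model is downloaded on first use, and the default model may change between library versions):

!pip install transformers
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "Today is a great day!!",
    candidate_labels=["positive", "negative"],
)
# returns a dict with the candidate labels sorted by score ('positive' should rank first)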

Resources#