Natural Language Processing (NLP) sits at the intersection of Artificial Intelligence and linguistics: it teaches machines to decipher the way humans communicate so they can respond to us in a similar, natural manner. Although NLP is best known for voice assistants, it has applications elsewhere, from email filtering to text generation, from search engines to language translation, text summarization and text analysis.
Another example is chatbots, the simple messaging widgets usually seen on marketplace websites. Their way of responding is often so natural that it can be difficult to tell whether we are texting with a human or a machine.
There are several tools for working with NLP. The Python programming language provides the Natural Language Toolkit (NLTK) along with other open-source libraries and educational resources for NLP programming. Statistical analysis combines Machine Learning and Deep Learning models with computer algorithms to extract and differentiate text and voice data and to assign meaning to their elements.
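As a small, hypothetical illustration of what such a toolkit offers (assuming NLTK is installed and the 'punkt' tokenizer data has been downloaded), tokenizing a sentence takes only a few lines:
# Minimal NLTK sketch (assumes nltk is installed; word_tokenize needs the 'punkt' data)
import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize
print(word_tokenize("Machines can learn to decipher the way humans communicate."))
# ['Machines', 'can', 'learn', 'to', 'decipher', 'the', 'way', 'humans', 'communicate', '.']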
Currently, neural NLP is progressing quickly, as Representation Learning and Deep Neural Network-style machine learning have become widespread in the field. NLP powers a vast range of services, from business analytics to speech recognition and social media.
In the following lines we’ll focus on how text generation works and how to implement a text generator using Python and TensorFlow, with code and comments to guide us.
What are the steps of text generation?
As stated, text is an unstructured and sequential form of data. Unstructured means that the data is difficult for computers to comprehend; sequential means it forms a series in which one element follows another. For machines to generate text for humans, they must first be trained to convert unstructured data into structured data and learn how to produce text. Only then can they generate new text for us. These are the steps to follow to generate text with Python:
Import dependencies
When starting a Python project, the first thing to do is to import all the dependencies required to run the program. In our case, the most helpful ones are:
- tensorflow to build and execute dataflow graphs
- requests to download the data
- os to manage directories
- pickle to save Python objects (here, our character mappings)
- tqdm to display a progress bar
import numpy as np
import pandas as pd
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout, Activation
from keras.layers import LSTM
import requests
import os
import pickle
import tqdm
Load the Data
The next step is to load the data into Python to make it easy to read and access.
Here, we will use the combined collection of Shakespeare's sonnets as the dataset for this tutorial, but you can choose any book or corpus you want. The collection can be downloaded from Project Gutenberg (the URL is in the code below). Ideally, you would also clean up the file to remove the Project Gutenberg start and end credits, although the examples below use the raw file as-is.
The text file is opened and its content saved in text. The content is then converted to lowercase, which reduces the size of our vocabulary (the list of unique characters) and therefore the number of possibilities the model has to learn. Further cleanup, such as removing punctuation or collapsing consecutive blank lines, could also be applied (see the optional sketch below).
Download and Save the dataset
# Create a data folder if it doesn't already exist
if not os.path.exists("data"):
    os.mkdir('data')
# Download
content = requests.get("https://www.gutenberg.org/cache/epub/1041/pg1041.txt").text
# Save
open("data/sonnets.txt", "w", encoding="utf-8").write(content)
122408
Load the data
# Data file path
file_path = "data/sonnets.txt"
basename = os.path.basename(file_path)
text = open(file_path, encoding="utf-8").read()
text = text.lower()
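If you want to go beyond lowercasing, here is an optional cleanup sketch (my own addition, not applied in the rest of this tutorial, so the statistics printed further below are computed on the lowercased text only):
# Optional extra cleanup (not used later in this tutorial)
import re
import string
cleaned = re.sub(r"\n{2,}", "\n", text)                                 # collapse consecutive newlines
cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))  # strip punctuation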
Creating Character/Word Mappings
Now that we have loaded the dataset, we need a way to convert its characters into integers. There are plenty of Keras and Scikit-Learn utilities for that, but we are going to do it manually in Python.
Once we have characters, the sorted list of all unique characters in our dataset, we can build two dictionaries that map each character to an integer and vice versa.
This step is called mapping: we assign an arbitrary number to each character/word in the text, so that every unique character/word corresponds to a number. This matters because machines understand numbers far better than text, which in turn makes training easier.
characters = sorted(list(set(text)))
n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}
We have created two dictionaries that assign a number to each unique character present in the text: the unique characters are first stored in characters and then enumerated.
Note that I have used character-level mappings rather than word-level mappings. In general, a word-based model reaches higher accuracy than a character-based one, because the character-based model needs a much larger network to learn long-term dependencies: it not only has to remember sequences of words, it also has to learn to spell grammatically correct words character by character. In a word-based model, that last part is already taken care of.
But since this is a small dataset (119,405 characters) and the number of unique words (4,605) makes up roughly a quarter of all the words in the text, it would not be wise to train on a word-level mapping. Even if every unique word occurred equally often (which is not the case), each word would appear only about four times in the entire training set, which is simply not enough to build a text generator.
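To see where these numbers come from, a rough word-level count can be done with a naive whitespace split (the exact figures depend on how words are tokenized, so treat this as an approximation):
# Approximate word-level statistics (naive split; punctuation stays attached to words)
words = text.split()
unique_words = set(words)
print("Total words:", len(words))
print("Unique words:", len(unique_words))
print("Average occurrences per unique word:", len(words) / len(unique_words))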
Print some statistics about the data
n_characters = len(text)
n_unique_characters = len(characters)
print("Unique characters: ", characters)
print("Numbers of unique characters: ", n_unique_characters)
print("Number of characteres", n_characters)
Unique characters: ['\n', ' ', '!', '"', '#', '$', '%', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '—', '‘', '’', '\ufeff']
Number of unique characters: 60
Number of characters: 119405
Save the dictionaries
Let's save these two dictionaries to files using pickle so we can retrieve them later, when we test our generator.
# save these dictionaries for later use
basename = "venom"
pickle.dump(char_to_n, open(f"{basename}-char_to_n.pickle", "wb"))
pickle.dump(n_to_char, open(f"{basename}-n_to_char.pickle", "wb"))
Data Preprocessing
Since text is unstructured data, it contains a lot of noise. For text analysis, the data must be cleaned, processed, and converted into structured data.
This is the trickiest part of building LSTM models: transforming the data at hand into a suitable format.
I'll break the process down into small pieces to make it easier to understand. First, let's define a few hyperparameters:
sequence_length = 100
batch_size = 128
epochs = 100
Encode the data
We are going to convert each character of the text into its corresponding integer value.
# Convert all text into integers
encoded_text = np.array([char_to_n[c] for c in text])
Create a custom dataset object
Since we want our code to scale to larger datasets, we use the tf.data API for efficient dataset handling. Let's create a tf.data.Dataset object from this encoded_text array.
# Create a custom dataset object
char_ds_object = tf.data.Dataset.from_tensor_slices(encoded_text)
Our new object now holds every character of our text; let's print the first few.
# Printing some data
for characters in char_ds_object.take(20):
    print(characters.numpy(), n_to_char[characters.numpy()])
59
49 t
37 h
34 e
1
45 p
47 r
44 o
39 j
34 e
32 c
49 t
1
36 g
50 u
49 t
34 e
43 n
31 b
34 e
Construct input sample
At this point, we need to build our sequences, as mentioned earlier. We want each input to be a sequence of characters of a specific length, stored in the sequence_length variable, and the output to be a single character: the one that comes next. Luckily, we can use tf.data.Dataset's batch() method to gather characters together.
# Build Sequences by batching
sequences = char_ds_object.batch( 2*sequence_length + 1, drop_remainder=True)
Each batch contains 2*sequence_length + 1 characters; the splitter function defined below will turn each such batch into multiple overlapping (input, target) pairs. Let's print a few batches. Notice that I've converted the integer sequences back into normal text using the n_to_char dictionary built earlier.
# Print some sequences
for sequence in sequences.take(3):
    print(''.join([n_to_char[i] for i in sequence.numpy()]))
the project gutenberg ebook of the sonnets, by william shakespeare
this ebook is for the use of anyone anywhere in the united states and
most other parts of the world at no cost and with almost no re
strictions
whatsoever. you may copy it, give it away or re-use it under the terms
of the project gutenberg license included with this ebook or online at
www.gutenberg.org. if you are not located in the
united states, you
will have to check the laws of the country where you are located before
using this ebook.
title: the sonnets
author: william shakespeare
release date: september, 1997 [ebook #104
Sample_splitter function
Now that each sample is represented, let's prepare our inputs and targets. We have to convert each single sample (a sequence of characters) into multiple (input, target) pairs. For this task we can use the flat_map() method, which takes a callback function that is applied to every sample in the dataset.
def sample_splitter(sample):
    length = len(sample)
    ds = tf.data.Dataset.from_tensors((sample[:sequence_length], sample[sequence_length]))
    for i in range(0, length-sequence_length, 1):
        sequence = sample[i: i + sequence_length]
        label = sample[i + sequence_length]
        # extend the dataset with these samples with the concatenate() method
        other_ds = tf.data.Dataset.from_tensors((sequence, label))
        ds = ds.concatenate(other_ds)
    return ds
# Prepare sequences and labels
dataset = sequences.flat_map(sample_splitter)
Here, sequence is our input array and label is our target.
sequence_length is the number of characters we want to consider before predicting a particular character.
The for loop slides over each sample, creating such sequences (stored in sequence) and their true values (stored in label). Note that the initial from_tensors() dataset and the first loop iteration (i = 0) produce the same pair, so the first (input, target) pair ends up in the dataset twice; that is why the first two samples printed further below are identical. The concept of "true values" can be hard to visualize at first, so let's get a better understanding of the code above with an example:
Let's say we have a sequence length of 4 (too small in practice, but good for an explanation) and the text "hello cameroon". We would get the following sequences and labels (shown as characters rather than numbers for readability):
| X | Y |
|---|---|
| [h, e, l, l] | [o] |
| [e, l, l, o] | [ ] |
| [l, l, o, ] | [c] |
| [l, o,  , c] | [a] |
| … | … |
We do this for every sample, and in the end we greatly increase the number of training examples. The concatenate() method is what joins these (input, target) pairs into a single dataset.
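Here is a tiny plain-Python illustration of that sliding window on the toy string above (for intuition only, it is not part of the pipeline):
# Sliding-window split on a toy string, mirroring what sample_splitter does
toy_text = "hello cameroon"
toy_length = 4
for i in range(len(toy_text) - toy_length):
    x = list(toy_text[i: i + toy_length])
    y = toy_text[i + toy_length]
    print(x, "->", repr(y))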
One-hot Encoding
For categorical data where no ordinal relationship exists, the integer encoding is not enough. So, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).
In our case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.
In the example below, the color variable has 3 categories, so 3 binary variables are needed. A "1" is placed in the binary variable for the given color and "0" in the variables for the other colors.
| red | green | blue |
|---|---|---|
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
As a second example, if 'v' is encoded as 3 and n_unique_characters = 7, the result is the vector [0, 0, 0, 1, 0, 0, 0]: a single 1 at index 3 and zeros everywhere else.
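We can verify this directly with tf.one_hot (a quick sanity check, not part of the pipeline):
# One-hot encode the integer 3 with a depth (vocabulary size) of 7
print(tf.one_hot(3, 7).numpy())   # [0. 0. 0. 1. 0. 0. 0.]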
# One-hot encode the sequences and the labels
def one_hot_encoding(sequence, label):
    return tf.one_hot(sequence, n_unique_characters), tf.one_hot(label, n_unique_characters)
dataset = dataset.map(one_hot_encoding)
We've used the map() method to one-hot encode every sample in our dataset; the tf.one_hot() function does the actual work. Let's show a few samples and their corresponding shapes.
for e in dataset.take(5):
print("Input:", ''.join([n_to_char[np.argmax(char_vector)] for char_vector in e[0].numpy()]))
print("Target:", n_to_char[np.argmax(e[1].numpy())])
print("Input shape:", e[0].shape)
print("Target shape:", e[1].shape)
print("="*50, "\n")
Input: the project gutenberg ebook of the sonnets, by william shakespeare
this ebook is for the use of an
Target: y
Input shape: (100, 60)
Target shape: (60,)
==================================================
Input: the project gutenberg ebook of the sonnets, by william shakespeare
this ebook is for the use of an
Target: y
Input shape: (100, 60)
Target shape: (60,)
==================================================
Input: the project gutenberg ebook of the sonnets, by william shakespeare
this ebook is for the use of any
Target: o
Input shape: (100, 60)
Target shape: (60,)
==================================================
Input: he project gutenberg ebook of the sonnets, by william shakespeare
this ebook is for the use of anyo
Target: n
Input shape: (100, 60)
Target shape: (60,)
==================================================
Input: e project gutenberg ebook of the sonnets, by william shakespeare
this ebook is for the use of anyon
Target: e
Input shape: (100, 60)
Target shape: (60,)
==================================================
Each input element has the shape (sequence_length, vocabulary size): here the sequence length is 100 and we have 60 unique characters. The target is a one-hot-encoded, one-dimensional vector of length 60.
Repeat, Shuffle and Batch the dataset
We repeat the dataset so training can run for several epochs, shuffle it so batches are not drawn in order, and batch it into groups of batch_size samples:
ds = dataset.repeat().shuffle(1024).batch(batch_size, drop_remainder=True)
Building Model
Modeling is the most crucial part of text generation. First, the computer is trained to produce text by being fed both the sequences and their labels; in doing so, it learns to identify patterns in natural language. Later on, it can generate output of its own when fed only a sequence.
Now let's build our model. It has two LSTM (Long Short-Term Memory) layers, a type of recurrent layer well suited to predicting sequential data, each with an arbitrarily chosen 700 units.
Feel free to experiment with different model architectures!
The output layer is a fully-connected layer with 60 units, where each neuron corresponds to a character (the probability of that character occurring next).
We're using the Adam optimizer here, but you can try other optimizers and compare their performance.
model = Sequential()
model.add(LSTM(700, input_shape=(sequence_length, n_unique_characters), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(700))
model.add(Dropout(0.2))
model.add(Dense(n_unique_characters, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 100, 700) 2130800
_________________________________________________________________
dropout (Dropout) (None, 100, 700) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 700) 3922800
_________________________________________________________________
dropout_1 (Dropout) (None, 700) 0
_________________________________________________________________
dense (Dense) (None, 60) 42060
=================================================================
Total params: 6,095,660
Trainable params: 6,095,660
Non-trainable params: 0
_________________________________________________________________
We are building a Sequential model with two LSTM layers of 700 units each. The first layer needs to be given the input shape. For the second LSTM layer to receive full sequences rather than only the last output, we set the return_sequences parameter of the first layer to True.
Dropout layers with a 20% rate have also been added to limit over-fitting. The last layer outputs a probability distribution over the characters, from which the next character is picked.
Training the Model
# define the model path
model_weights_path = f"results/{basename}-{sequence_length}.h5"
# Make a result folder if it doesn't exist or select an existing one
if not os.path.isdir('results'):
    os.mkdir('results')
# Train the model
model.fit(ds, steps_per_epoch=(len(encoded_text) - sequence_length) // batch_size, epochs = epochs)
# Save the model
model.save(model_weights_path)
We feed the model the dataset object created earlier. Since the dataset repeats indefinitely, the model has no idea how many samples it contains, so we specify the steps_per_epoch parameter, set to roughly the number of training samples divided by the batch size: here (119405 - 100) // 128 = 932 steps per epoch.
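Note that the script above only saves the weights once training has finished. If you would rather keep the best weights seen during training, one option is Keras's ModelCheckpoint callback; this is a sketch under the assumption that monitoring the training loss is good enough here:
# Optional: save the best weights seen during training instead of only the final ones
from keras.callbacks import ModelCheckpoint
checkpoint = ModelCheckpoint(model_weights_path, monitor="loss",
                             save_best_only=True, save_weights_only=True, verbose=1)
model.fit(ds,
          steps_per_epoch=(len(encoded_text) - sequence_length) // batch_size,
          epochs=epochs,
          callbacks=[checkpoint])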
Generate new Text
Finally, here is the fun part: now that the model is built and trained, all we have to do is generate our poetry.
Create a seed
We need some sample text, a seed, to start generating from. The choice depends on your problem: sentences taken from the training data will generally work better, but here we will try to produce a new "chapter" of this book:
seed = 'chapter xviii'
If you are running everything in a single notebook, you can skip the next three sections.
Load the dictionaries
Let's load the dictionaries that map each integer to a character and vice versa, which we saved earlier during the character-mapping phase.
# load characters dictionaries
char_to_n = pickle.load(open(f"{basename}-char_to_n.pickle", "rb"))
n_to_char = pickle.load(open(f"{basename}-n_to_char.pickle", "rb"))
dict_size = len(char_to_n)
Rebuild the model
# Rebuild the model
model = Sequential([
LSTM(700, input_shape=(sequence_length, dict_size), return_sequences=True),
Dropout(0.2),
LSTM(700),
Dropout(0.2),
Dense(dict_size, activation='softmax'),
])
Load saved weights
Similarly, we need to load the saved model weights so that we don't have to retrain the model.
# load the optimal weights
model.load_weights(f"results/{basename}-{sequence_length}.h5")
Generate our Poetry
n_chars = 500
# Generating characters
generated = ""
for i in tqdm.tqdm(range(n_chars), "Generating text\n"):
    # Make an input sequence
    X = np.zeros((1, sequence_length, dict_size))
    for t, char in enumerate(seed):
        X[0, (sequence_length - len(seed)) + t, char_to_n[char]] = 1
    # predict the next character
    prediction = model.predict(X, verbose=0)[0]
    # converting the vector to an integer
    next_index = np.argmax(prediction)
    # converting the integer to a character
    next_char = n_to_char[next_index]
    # add the character to results
    generated += next_char
    # shift the seed: drop the first character and append the predicted one
    seed = seed[1:] + next_char
print("Seed:", seed)
print("Generated text:")
print(generated)
Generating text
Seed: crouse
b
Generated text:
,
but when in thee dear
as heart as my chase,
cally that beauty mish,
when i am not crouse
but when in thee dear
as heart as my chase,
cally that beauty mish,
when i am not crouse
but when in thee dear
as heart as my chase,
cally that beauty mish,
when i am not crouse
but when in thee dear
as heart as my chase,
cally that beauty mish,
when i am not crouse
but when in thee dear
as heart as my chase,
cally that beauty mish,
when i am not crouse
b
All we've done here is start with a seed text, build the input sequence, and predict the next character. We then shift the input sequence by removing its first character and appending the predicted one, which gives us a slightly changed sequence that still has the expected length.
We feed this updated input sequence to the model to predict another character, and repeating this process n_chars times generates a text of n_chars characters.
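One common tweak, not used in the script above, is to sample the next character from the predicted distribution with a "temperature" instead of always taking the argmax; this tends to break the repetitive loops visible in the output. Below is a minimal, hypothetical sketch of such a sampling step, assuming prediction is the softmax output for one step:
# Hypothetical alternative to np.argmax(prediction): temperature sampling
import numpy as np

def sample_with_temperature(prediction, temperature=0.8):
    # rescale the log-probabilities, renormalize, then draw a character index at random
    logits = np.log(prediction.astype("float64") + 1e-8) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return np.random.choice(len(probs), p=probs)

# inside the generation loop, you would replace:
#   next_index = np.argmax(prediction)
# with:
#   next_index = sample_with_temperature(prediction, temperature=0.8)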
Conclusion
What we read is clearly English, but the sentences don't make much sense. This result has several causes, in particular the small size of our dataset, which does not provide enough samples. Also, because the architecture of our model is not optimal, we easily fall into loops that repeat the same words ad infinitum. This can be mitigated by adding layers to our sequential model (or by sampling the next character instead of always taking the most likely one, as sketched above).
In our case, after several attempts, we found acceptable parameters for our model. It almost feels as if the model is really trying to understand and write poetry. It's fun to watch.
Note that this tutorial applies not only to English text but to any language. We could even generate code, given enough lines of code as training data.
Source Code
train.py
# import dependencies
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
import requests
import os
import pickle
# Create a data folder if it doesn't already exist
if not os.path.exists("data"):
    os.mkdir('data')
# Download
content = requests.get("https://www.gutenberg.org/cache/epub/1041/pg1041.txt").text # comment if already downloaded
# Save
open("data/sonnets.txt", "w", encoding="utf-8").write(content) # comment if already downloaded
# Data file path
file_path = "data/sonnets.txt"
basename = os.path.basename(file_path)
# Load the data
text = open(file_path, encoding="utf-8").read()
text = text.lower()
# Character Mappings
characters = sorted(list(set(text)))
n_to_char = {n:char for n, char in enumerate(characters)}
char_to_n = {char:n for n, char in enumerate(characters)}
# Print some statistics
n_characters = len(text)
n_unique_characters = len(characters)
print("Unique characters: ", characters)
print("Numbers of unique characters: ", n_unique_characters)
print("Number of characteres", n_characters)
# save these dictionnary for a later use
basename = "venom"
pickle.dump(char_to_n, open(f"{basename}-char_to_n.pickle", "wb"))
pickle.dump(n_to_char, open(f"{basename}-n_to_char.pickle", "wb"))
sequence_length = 100
batch_size = 128
epochs = 100
# Encode the data by converting all text into integers
encoded_text = np.array([char_to_n[c] for c in text])
# Create a custom dataset object
char_ds_object = tf.data.Dataset.from_tensor_slices(encoded_text)
# Printing some data
for characters in char_ds_object.take(6):
    print(characters.numpy(), n_to_char[characters.numpy()])
# Build Sequences by batching
sequences = char_ds_object.batch( 2*sequence_length + 1, drop_remainder=True)
# Print some sequences
for sequence in sequences.take(3):
    print(''.join([n_to_char[i] for i in sequence.numpy()]))
# Sample_splitter function
def sample_splitter(sample):
    length = len(sample)
    ds = tf.data.Dataset.from_tensors((sample[:sequence_length], sample[sequence_length]))
    for i in range(0, length-sequence_length, 1):
        sequence = sample[i: i + sequence_length]
        label = sample[i + sequence_length]
        # extend the dataset with these samples with the concatenate() method
        other_ds = tf.data.Dataset.from_tensors((sequence, label))
        ds = ds.concatenate(other_ds)
    return ds
# Prepare sequences and labels
dataset = sequences.flat_map(sample_splitter)
# One-hot encode the sequences and the labels
def one_hot_encoding(sequence, label):
    return tf.one_hot(sequence, n_unique_characters), tf.one_hot(label, n_unique_characters)
dataset = dataset.map(one_hot_encoding)
# print some samples
for e in dataset.take(3):
print("Input:", ''.join([n_to_char[np.argmax(char_vector)] for char_vector in e[0].numpy()]))
print("Target:", n_to_char[np.argmax(e[1].numpy())])
print("Input shape:", e[0].shape)
print("Target shape:", e[1].shape)
print("="*50, "\n")
# repeat, shuffle and batch the dataset
ds = dataset.repeat().shuffle(1024).batch(batch_size, drop_remainder=True)
# building the model
model = Sequential()
model.add(LSTM(700, input_shape=(sequence_length, n_unique_characters), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(700))
model.add(Dropout(0.2))
model.add(Dense(n_unique_characters, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
model.summary()
# define the model path
model_weights_path = f"results/{basename}-{sequence_length}.h5"
# Make a result folder if it doesn't exist or select an existing one
if not os.path.isdir('results'):
    os.mkdir('results')
# Train the model
model.fit(ds, steps_per_epoch=(len(encoded_text) - sequence_length) // batch_size, epochs = epochs)
# Save the model
model.save(model_weights_path)
text_generator.py
import numpy as np
import pickle
import tqdm
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Activation
import os
# dataset file path
file_path = "data/sonnets.txt"
# file_path = "data/python_code.py"
basename = "venom"
# load characters dictionaries
char_to_n = pickle.load(open(f"{basename}-char_to_n.pickle", "rb"))
n_to_char = pickle.load(open(f"{basename}-n_to_char.pickle", "rb"))
dict_size = len(char_to_n)
sequence_length = 100
# Build the model
model = Sequential([
LSTM(700, input_shape=(sequence_length, dict_size), return_sequences=True),
Dropout(0.2),
LSTM(700),
Dropout(0.2),
Dense(dict_size, activation='softmax'),
])
# load the optimal weights
model.load_weights(f"results/{basename}-{sequence_length}.h5")
# specify the feed to first characters to generate
seed = 'love is war'
n_chars = 500
# Generating characters
generated = ""
for i in tqdm.tqdm(range(n_chars), "Generating text\n"):
    X = np.zeros((1, sequence_length, dict_size))
    for t, char in enumerate(seed):
        X[0, (sequence_length - len(seed)) + t, char_to_n[char]] = 1
    prediction = model.predict(X, verbose=0)[0]
    next_index = np.argmax(prediction)
    next_char = n_to_char[next_index]
    generated += next_char
    seed = seed[1:] + next_char
print("Seed:", seed)
print("Generated text:")
print(generated)
Links
Getting The Code On Kaggle
The notebook containing all the code mentioned in this post can be found here.
Getting The Code On Github
The entire folder containing all the code mentioned in this post can be found via this link.
Just bear in mind that you will need to install all the dependencies. If you find any issues with the code, feel free to either comment down below or raise an issue on Github.
Add me on LinkedIn
Don't hesitate to follow me on LinkedIn or other social networks to encourage me to write more posts on IT.