Using Pytesseract To Convert Images Into A HTML Site
Convert images to a string with Google Tesseract and then into a static HTML site using Python
March 7th, 2020
Overview
Using Google's Tesseract OCR library, we will scan images from a dataset and create a HTML website out of it with navigation. We will be covering an array of topics including the Pytesseract library, Google's Tesseract library, Makefiles, regex, and more. This post is to serve as an introduction to the power of neural networks through basic OCR.
View a video of the project in action here.
Feel free to follow along by referring to the GitHub repository for this Python OCR project. The datasets and the styles.css
file are within this repository.
Creating the project structure
Create the project root directory:
$ mkdir python-ocr-tutorial
$ cd python-ocr-tutorial
mkdir python-ocr-tutorial
: creates our root directory
cd python-ocr-tutorial
: changes current directory to our project directory
Create our project folders:
$ mkdir html data utils
mkdir html
: our html folder where our html would output to
mkdir data
: our data folder where our images will be
mkdir utils
: our utils package where we will keep all of our utility functions
Create our project files:
$ touch utils/utils.py utils/__init__.py main.py requirements.txt Makefile
touch
: command that creates / updates modified date for a file
touch utils/utils.py
: creates our utils file where our util functions will live
touch utils/__init__.py
: tells Python that utils
should be treated as a package
touch main.py
: creates the file that will call our utils functions
touch requirements.txt
: creates the file that tells pip
/ pip3
what packages we need to be installed
touch Makefile
: creates the Makefile that will help us run important tasks like run
, test
, and clean
Setting Up Our Libraries
We will first need to download Tesseract
. Pytesseract
is a wrapper for Google's library, which means it serves as a bridge from Python to Tesseract
. In order for the Python library to work, you need to install the Tesseract library through Google's install guide.
In requirements.txt
add the following:
pytesseract==0.3.2
pytesseract
: A wrapper for Google's Tesseract OCR library that allows us to scan images and extract that data into a string
Update your Makefile:
init:
	pip3 install -r requirements.txt
init
: this is the name of the target that can be called via $ Make init
. The name could be anything. (On a case-sensitive system, type the command in lowercase: make init.)
pip3 install -r requirements.txt
: pip3
is Python 3's package installer. (you may need to use pip
if you do not have python 3
)
-r requirements.txt
: This is a requirement option for pip3
. The requirements.txt
file contains the list of dependencies that pip3
needs to install. You can run $ pip3 help install
for more info.
Run $ Make init
:
$ Make init
pip3 install -r requirements.txt
Processing /Users/.../Caches/pip/...
...
Collecting Pillow
...
Installing collected packages: Pillow, pytesseract
Successfully installed Pillow-7.0.0 pytesseract-0.3.2
Note: When installing Python dependencies, it would be ideal to use virtualenv; however, we will not cover that in this guide.
Make will throw an error if you use spaces instead of tabs. Some IDEs automatically convert tabs into spaces. You will have to turn that off or use nano or vim. You can verify that the file is valid by running $ cat -e -t -v Makefile. If you see ^I before each command line, that means it is valid and is using tabs. If you only see spaces, then you need to convert them into tabs.
If you noticed, $ Make init
installed Pillow
automatically. This is because pytesseract
requires it and it installs it for us. You can view more info about libraries installed through pip3
by running $ pip3 show PACKAGE_NAME
in terminal (not the Python console):
$ pip3 show pytesseract
Name: pytesseract
Version: 0.3.2
...
Location: /usr/local/lib/python3.7/site-packages
Requires: Pillow
From this, we can see that this library requires Pillow
which is a fork of PIL
(Python Imaging Library). This library allows us to pass a path to an image to pytesseract
and automatically process it for us.
The issue with our requirements.txt
is that it is prone to installing different Pillow
versions. If this project is set up on a different device, pip3 will install the latest compatible version of Pillow
there. While it is nice to use the newest release, updating dependencies should be done intentionally, not automatically, to prevent apps from breaking. We should uninstall Pillow
, update our requirements.txt
to specifically install Pillow 7.0.0
, and run $ Make init
.
Uninstall Pillow
using the $ pip3 uninstall PACKAGE_NAME
command:
$ pip3 uninstall Pillow
Uninstalling Pillow-7.0.0:
Would remove:
/usr/local/lib/python3.7/site-packages/PIL/*
/usr/local/lib/python3.7/site-packages/Pillow-7.0.0.dist-info/*
Proceed (y/n)? y
Successfully uninstalled Pillow-7.0.0
Update requirements.txt
:
pytesseract==0.3.2
Pillow==7.0.0
Install dependencies:
$ Make init
pip3 install -r requirements.txt
...
Collecting Pillow==7.0.0
...
Installing collected packages: Pillow
Successfully installed Pillow-7.0.0
Trying out Tesseract
Before we cover our program, we should take a close look at tesseract
and pytesseract
to understand the core of our project.
First, populate the data
folder. Download the data
folder from the image-to-html repo and place the contents in our data folder. That means all of the jpg
images go inside python-ocr-tutorial/data/
Start Python with $ python3
:
$ python3
Python 3.7.6 (default, Dec 30 2019, 19:38:28)
[Clang 11.0.0 (clang-1100.0.33.16)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Note: I will be using python3
for this guide, python
(older versions) should work as well.
We need to tell Python to import the pytesseract
library:
>>> import pytesseract
>>>
If we see no errors, it means that we have successfully imported pytesseract
.
If you do see an error, you may need to install tesseract
. For this guide, I am using 4.1.1
which allows us to use their newer Neural nets LSTM engine. pytesseract
will automatically use the OCR engine based on what's available. Visit Google's Tesseract install guide.
The pytesseract
library provides us with the image_to_string
method that allows us to pass in a path to an image or an image object. The library will scan an image and return the text it recognizes from the image. If no text is found, nothing would be returned. We can try scanning our first image from the dataset (./data/python_dataset_01.jpg
):
>>> pytesseract.image_to_string('./data/python_dataset_01.jpg')
'Chapter 1: Lorem\n\nLorem ipsum dolor sit amet, consectetur adipiscing elit. Donec nisi ...Praesent ut diam aliquet, dapibus felis in,'
>>>
After a few seconds, the text from the image should be returned. Keep in mind that OCR is not perfect. Fortunately, our dataset is close to ideal for OCR, as it is oriented correctly and has text that is very clear and consistent.
We won't worry about it much in this guide, but tesseract
allows us to configure which OCR engine mode and page segmentation mode to use. Run $ tesseract --help-extra
to see our options:
$ tesseract --help-extra
# ...
OCR options:
# ...
--psm NUM Specify page segmentation mode.
--oem NUM Specify OCR Engine mode.
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
OCR Engine modes:
0 Legacy engine only.
1 Neural nets LSTM engine only.
2 Legacy + LSTM engines.
3 Default, based on what is available.
# ...
From here, we can see that we have plenty of options to pass to the --psm
and --oem
config options. We can pass these options through pytesseract
by using the config
parameter in our image_to_string
method. Here is an example of intentionally using the wrong psm (I set it to expect vertically aligned text):
>>> pytesseract.image_to_string('./data/python_dataset_01.jpg', config="--psm 5")
"=\n5\nLEE:\nce\nae\not\nnH\nHite\noo. =\nnda ...a5\nicees 3 3\nor _ 8s\nbaka gs\nGas g\n528 Ze\n© © 5\nEe 52\n> 28\nzo\n2"
The text is unreadable due to Tesseract
reading the document vertically.
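To make the --psm and --oem options a little harder to mistype, the config string can be assembled by a small helper. This is only a sketch: build_tesseract_config is a hypothetical name that is not part of pytesseract, and the valid ranges are taken from the help output shown above.

```python
def build_tesseract_config(psm=3, oem=3):
    """Build a string for the `config` parameter of image_to_string.

    Hypothetical helper for this guide; not part of pytesseract.
    The defaults (3 and 3) mirror the defaults listed by `tesseract --help-extra`.
    """
    # Page segmentation modes run from 0 to 13 in the help output above
    if psm not in range(0, 14):
        raise ValueError('psm must be between 0 and 13')
    # OCR engine modes run from 0 to 3 in the help output above
    if oem not in range(0, 4):
        raise ValueError('oem must be between 0 and 3')
    return '--psm {} --oem {}'.format(psm, oem)


print(build_tesseract_config(psm=6, oem=1))  # --psm 6 --oem 1
```

The result can then be passed along as pytesseract.image_to_string(path, config=build_tesseract_config(psm=6)).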
Overview Of The Utils File
We will have the following methods in our utils.py
file:
extract
: finds all of the images in the data
folder and returns an array containing each line from all of the images
build_chapters
: builds a hash where the keys are the chapter titles and the value of each key is a string equal to the contents of the chapter
get_chapter_file
: converts a chapter string into the appropriate html file name (Chapter 1: Hello World
would equal to hello-world.html
)
build_html_files
: converts a key and value pair from the build_chapters
method into a html page
convert_chapter_to_spinal
: converts a chapter name into spinal-case ("Hello World" becomes "hello-world")
Creating The Extract Method:
The extract method's goal is to return an array of lines from all of the images in the data
folder. So the first step is to figure out which files we will need to run through tesseract
. We can do this through the built-in glob
package. We can pass a pattern to the glob
library and it will return a list of files that match to it, similar to the ls
command.
Globbing refers to the way Unix shells expand filename patterns. It is not a regular expression syntax, though it can look similar. glob
uses the unix style pathname pattern expansion which we won't use for anything complex. In fact, we just need to use a simple wildcard (*
). If we want to get a list of the jpg images in the data
folder, we just need to use this pattern: data/*.jpg
. This grabs any file path that is within the data
folder and ends with .jpg
. The *
could be any string.
We can test this out by using the glob
method from the glob
library. In the Python console, import the glob
library and run the pattern above using the glob
method:
>>> import glob
>>> glob.glob('data/*.jpg')
['data/python_dataset_16.jpg', ... 'data/python_dataset_33.jpg']
>>>
Note: be sure to run this in the root directory of the project, not within the data folder
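If you want to try the pattern without the dataset, here is a small self-contained sketch. It creates a few empty files in a temporary directory (the file names are made up for illustration) and shows that only the .jpg files match, and that sorting puts them in page order.

```python
import glob
import os
import tempfile

# Create a throwaway directory with two "pages" and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ['page_02.jpg', 'page_01.jpg', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()

    # Same idea as 'data/*.jpg': anything in the folder ending with .jpg
    matches = sorted(glob.glob(os.path.join(tmp, '*.jpg')))
    names = [os.path.basename(match) for match in matches]

print(names)  # ['page_01.jpg', 'page_02.jpg']
```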
We can loop through this array of filepaths, run it through tesseract
, and append it into one giant string which we can then break into an array, line by line. Here is the full extract
method:
import pytesseract
import glob


def extract(path='./data/*.jpg'):
    pages = glob.glob(path)
    pages.sort()
    text = ''
    for page in pages:
        print('extracting: {}'.format(page))
        image_string = pytesseract.image_to_string(page)
        text += image_string
    lines = text.split('\n')
    return lines
import ...
: imports the pytesseract and glob libraries
def extract
: creates the extract method
(path='./data/*.jpg')
: The extract
method takes in a path
parameter that defaults to the path we tested above. This will allow us to use a smaller and consistent dataset in our tests.
pages = glob.glob(path)
: this retrieves an array of filepaths for the pages/files in our dataset. It then assigns it to the pages
variable
pages.sort()
: glob does not guarantee that the files are returned in the right order. We want to sort this array in alphanumeric order so that an earlier page does not come after a later page. Keep in mind that sort()
mutates the array, it does not return a new array:
>>> a = [1, 2, 4, 3]
>>> a
[1, 2, 4, 3]
>>> a.sort()
>>> a
[1, 2, 3, 4]
text = ''
: we instantiate the text
string to be blank. We will append the strings that tesseract
extracts from all our images to the text
variable.
for page in pages:
: this loops through the pages
variable we have above. The page
variable refers to the current filepath the loop is currently on
print(...)
: This prints the current status to the console. Since we have a large dataset, it is good to know which page it is on.
'extracting: {}'.format(page)
: format
will replace {}
with the variable we pass in (page
).
image_string = pytesseract.image_to_string(page)
: this OCRs the current page and assigns the string to the image_string
variable
text += image_string
: this appends image_string
to text
. This means that text
will be an extremely large string containing all of the text from our images, like one large document.
lines = text.split('\n')
: the split
method is part of the string class where you can create an array of strings at breakpoints within a string. For example, \n
means new line which tesseract
returns for each line. We want to iterate through these separately so we should create a breakpoint for \n
.
Here is an example of the split
command:
>>> lines = 'line1\nline2\nline3'
>>> lines.split('\n')
['line1', 'line2', 'line3']
return lines
: this returns the array of lines from all of the documents. We will use this array to build our chapters hash.
Note: There are two blank lines above the method to follow pep8
standards
Testing Our extract Method:
Since the extract
method goes through all 40+ images, we should lower that amount temporarily for testing purposes. We can slice an array to only the first 3 items by using [:3]
. The colon ( :
) tells Python to extract everything before index 3
.
We can update the pages
variable to only loop through a few images:
# ...
pages.sort()
pages = pages[:3]
# ...
Now that our temporary fix is in place, we can test the library in the Python console:
>>> from utils.utils import extract
>>> extract()
extracting: data/python_dataset_01.jpg
extracting: data/python_dataset_02.jpg
extracting: data/python_dataset_03.jpg
['Chapter 1: Lorem', '', ... 'fermentum porta risus.']
>>>
from utils.utils import extract
: This tells Python that we want to import
the extract
method from utils/utils.py
. The first utils
refers to the package name (due to the __init__.py
file) and the second utils
refers to the utils.py
file.
If a very large array was returned, your extract
method works as intended!
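As an alternative to editing the function for testing, the page limit and the OCR call could be turned into parameters. The sketch below is not the method used in this guide: extract_lines, ocr, limit, and fake_ocr are hypothetical names, and the fake OCR function only exists so the example runs without Tesseract installed.

```python
import glob
import os
import tempfile


def extract_lines(path='./data/*.jpg', ocr=None, limit=None):
    """Variant of extract() with an optional page limit and a swappable OCR call."""
    if ocr is None:
        # Fall back to the real OCR call only when no stand-in is given
        import pytesseract
        ocr = pytesseract.image_to_string
    # A limit of None keeps every page, since a [:None] slice returns the whole list
    pages = sorted(glob.glob(path))[:limit]
    text = ''.join(ocr(page) for page in pages)
    return text.split('\n')


def fake_ocr(page):
    # Stand-in for Tesseract: returns one line of "text" per page
    return 'text from {}\n'.format(os.path.basename(page))


with tempfile.TemporaryDirectory() as tmp:
    for name in ['b.jpg', 'a.jpg', 'c.jpg']:
        open(os.path.join(tmp, name), 'w').close()
    demo_lines = extract_lines(os.path.join(tmp, '*.jpg'), ocr=fake_ocr, limit=2)

print(demo_lines)  # ['text from a.jpg', 'text from b.jpg', '']
```

The trailing empty string comes from the final newline, the same way the real output contains blank entries.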
Creating the build_chapters Method
Our goal of this method is to convert the array from the extract()
method into a hash of chapters.
For example, lines would become chapters here:
lines = ['Chapter 1: Lorem', 'line 1', 'line 2', 'Chapter 2: Ipsum', 'line 3']
build_chapters(lines) # {'Chapter 1: Lorem': 'line 1\nline 2\n', 'Chapter 2: Ipsum': 'line 3\n'}
In order to achieve this, we need to recognize which line is a chapter and which one is a normal line. We can tell if a string is a chapter if it starts with Chapter NUMBER:
. Fortunately, regular expressions can be used to see if a string matches a "chapter" pattern. We will do this through the built-in re
library:
>>> import re
>>> re.match(r"^(Chapter [0-9]+:)", 'Chapter 1: Lorem')
<re.Match object; span=(0, 10), match='Chapter 1:'>
>>> re.match(r"^(Chapter [0-9]+:)", 'Ipsum')
>>>
import re
: imports the built-in Python regex library
re.match
: the match method returns a match object if the start of the string matches the pattern provided, and None otherwise
r"^(Chapter [0-9]+:)"
: This is the regular expression pattern that checks for Chapter NUMBER:
, we will cover what this means soon
'Chapter 1: Lorem'
: this is the string we want to check against
<re.Match object..>
: The object that is returned when a match is found
re.match(..., 'Ipsum')
: Nothing is returned from this because it is not a chapter
Here is a breakdown of the r"^(Chapter [0-9]+:)"
pattern:
r
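: marks the string as a raw string literal, so backslashes are passed to the regex engine unchanged. It is a Python string prefix, not part of the regular expression itself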
^
: indicates that the string starts with the pattern so Lorem chapter 1: ipsum
won't match. For the sake of this project, we will assume that tesseract
will OCR the chapters correctly
(...)
: creates a capturing group around the pattern inside the parentheses, so the matched chapter prefix can be retrieved from the match object
Chapter
: tells the pattern to look for Chapter
in the string
[0-9]
: tells the pattern to look for a number between 0
to 9
in this position
+
: tells the pattern to look for one or more of the pattern before it. In this case it looks for one or more 0-9
:
: tells the pattern to look for :
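To see the pieces of the pattern working together, here is a short sketch that runs it against a few sample strings (the samples are made up for illustration) and reads the captured group when there is a match.

```python
import re

CHAPTER_PATTERN = r"^(Chapter [0-9]+:)"

samples = [
    'Chapter 1: Lorem',        # matches
    'Chapter 12: Ipsum',       # matches, + allows more than one digit
    'Lorem chapter 1: ipsum',  # no match, does not start with the pattern
    'Chapter: Dolor',          # no match, the number is missing
]

prefixes = []
for sample in samples:
    match = re.match(CHAPTER_PATTERN, sample)
    # group(1) is the text captured by the parentheses in the pattern
    prefixes.append(match.group(1) if match else None)

print(prefixes)  # ['Chapter 1:', 'Chapter 12:', None, None]
```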
Here is the general workflow of the build_chapters
method:
- create a blank chapters hash and a string for the current chapter key
- loop through each line
- if it is a chapter, update the current chapter_key
- if it is not a chapter, append it to the current chapter key's value
- return chapters hash
Here is the build_chapters
method:
import re
# ...


def build_chapters(lines):
    chapters = {}
    cur_chapter = 'Intro'
    for line in lines:
        is_chapter = re.match(r"^(Chapter [0-9]+:)", line)
        if is_chapter:
            cur_chapter = line
        elif cur_chapter in chapters.keys():
            content = '{}\n'.format(line)
            chapters[cur_chapter] += content
        else:
            content = '{}\n'.format(line)
            chapters[cur_chapter] = content
    return chapters
import re
: imports the regex library. This should be at the top of the file with the other imports
def build_chapters(lines)
: defines the build_chapters
method and has lines
as the parameter
chapters = {}
: instantiates a blank chapters hash
cur_chapter = 'Intro'
: assigns Intro
to cur_chapter
. This is only a fallback in the case that the document does not start with a chapter line. So any non-chapter line that occurs before the first chapter line would be assigned under the Intro
key
for line in lines:
: loops through the lines
is_chapter = re.match(...)
: checks if the current line is a chapter using the regex pattern we used before
if is_chapter
: we will handle chapter lines differently from non-chapter lines so we use an if statement
cur_chapter = line
: this line tells our method that it is done with the last chapter, and the content that follows will be for the next chapter. (ex. Chapter 1: Lorem
-> Chapter 2: Ipsum
)
elif cur_chapter in chapters.keys():
: This checks if cur_chapter
is a key in our hash. If a key/value pair has already been instantiated, we can just append a line to the current value. The shorter cur_chapter in chapters
performs the same key lookup on a dict. Here we spell it out with the chapters.keys()
method, which returns a view of the keys.
content = '{}\n'.format(line)
: builds our content string to add a new line (\n
).
chapters[cur_chapter] += content
: Since we know that cur_chapter
exists in chapters
, we can append it to its current value using the +=
operator.
else:
: if the key is not in the hash yet, then we will instantiate the key/value pair
content = '{}\n'.format(line)
: same as above. It is important not to repeat code (DRY); however, in this case it is only repeated twice, so it is not that bad. Otherwise we should extract it to a method or refactor our method. Since it is only one simple line, we won't need to worry about it.
chapters[cur_chapter] = content
: since the key/value pair doesn't exist we would need to create it using the =
operator. This line will only occur once per chapter.
return chapters
: return the hash of chapters we created
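For comparison, the two branches that build content can be folded into one by using dict.get with a default value. This is an alternative sketch rather than the version used in the rest of this guide, and build_chapters_sketch is a hypothetical name chosen to avoid clashing with the real method.

```python
import re


def build_chapters_sketch(lines):
    chapters = {}
    cur_chapter = 'Intro'
    for line in lines:
        if re.match(r"^(Chapter [0-9]+:)", line):
            # A chapter line only switches the current key
            cur_chapter = line
            continue
        # get() returns '' the first time a key is seen, so no else branch is needed
        chapters[cur_chapter] = chapters.get(cur_chapter, '') + line + '\n'
    return chapters


lines = ['Chapter 1: Lorem', 'line 1', 'line 2', 'Chapter 2: Ipsum', 'line 3']
result = build_chapters_sketch(lines)
print(result)
# {'Chapter 1: Lorem': 'line 1\nline 2\n', 'Chapter 2: Ipsum': 'line 3\n'}
```

Like the original, this only creates a key once a chapter has at least one line of content.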
Testing build_chapters
In the Python console, let's pass the result of our extract()
method into our build_chapters
method:
>>> from utils.utils import extract, build_chapters
>>> lines = extract()
extracting: data/python_dataset_01.jpg
extracting: data/python_dataset_02.jpg
extracting: data/python_dataset_03.jpg
>>> chapters = build_chapters(lines)
>>> chapters
{'Chapter 1: Lorem': '\nLorem ...\n', 'Chapter 2: lpsum': '\nFusce ...'}
>>> chapters.keys()
dict_keys(['Chapter 1: Lorem', 'Chapter 2: lpsum', 'Chapter 3: Dolor'])
Note: be sure to save the utils file and restart the Python console
If you can see a dict of keys for each chapter, then your function works correctly!
Creating The convert_chapter_to_spinal method And InvalidChapterException
Before we work on converting the chapters hash into html, we have some helper methods to create. First, we need to build the convert_chapter_to_spinal
method. In addition, we will create a custom exception called InvalidChapterException
.
Our convert_chapter_to_spinal
method converts a chapter key such as Chapter 1: Lorem Ipsum Dolor
into spinal case without the chapter portion: lorem-ipsum-dolor
Here is our convert_chapter_to_spinal
method:
def convert_chapter_to_spinal(chapter):
    name = re.sub(r"^(Chapter [0-9]+: )", '', chapter)
    if name == chapter:
        raise InvalidChapterException
    name = name.lower().replace(' ', '-')
    return name
def convert_chapter_to_spinal(chapter):
: defines our method and takes a string called chapter
as a parameter
re.sub(.., '', chapter)
: The regex library offers a sub
method to replace matches using regex. This uses the pattern we created above to find that match and remove it from chapter
and then assign it to name
. You don't have to do this, but I wanted to keep the html urls short and easier to read. The first parameter is the pattern, the 2nd parameter is what we replace it with, and the last one is what we are checking against.
if name == chapter:
: At this point, chapter should be Chapter 1: Lorem
, and the name
variable should be Lorem
. If the re.sub
method didn't work, then name
should be equal to chapter since nothing was replaced. If the string isn't a valid chapter, it should throw an error.
raise InvalidChapterException
: this raises a custom exception; we will cover how this works shortly. When an exception is raised, it propagates to the calling method, and keeps moving up the call chain unless a caller catches it (by wrapping the call in a try
except
)
name = name.lower().replace(' ', '-')
: This cleans up our name
string. It converts Lorem Ipsum
into lorem-ipsum
. The lower()
method converts our string to lowercase and .replace
replaces all spaces (' '
) with a dash (-
).
return name
: we then return the name we created.
Here is the custom exception we create below the method:
class InvalidChapterException(Exception):
    """Chapter name is invalid"""
    pass
class InvalidChapterException(Exception):
: creates a class named InvalidChapterException
that inherits the Exception
class.
"""Chapter name is invalid"""
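: This is the class docstring, which documents what the exception means. It is not the message shown in the traceback; a message would have to be passed when raising the exception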
pass
: we don't need anything in this exception since we inherit the Exception
class
Now, whenever we raise InvalidChapterException
, it will refer to this exception.
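Here is a small sketch of how a caller could catch the custom exception instead of letting it stop the program. The safe_slug function, its fallback value, and the message passed to the exception are assumptions added for illustration; they are not part of the project code.

```python
import re


class InvalidChapterException(Exception):
    """Chapter name is invalid"""


def convert_chapter_to_spinal(chapter):
    name = re.sub(r"^(Chapter [0-9]+: )", '', chapter)
    if name == chapter:
        # Passing a message is an addition for this sketch; it shows up in the traceback
        raise InvalidChapterException('not a chapter title: {!r}'.format(chapter))
    return name.lower().replace(' ', '-')


def safe_slug(chapter, fallback='untitled'):
    # Hypothetical caller that catches the exception and uses a fallback name
    try:
        return convert_chapter_to_spinal(chapter)
    except InvalidChapterException:
        return fallback


print(safe_slug('Chapter 2: Lorem Ipsum'))  # lorem-ipsum
print(safe_slug('Dolor Sit'))               # untitled
```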
Testing convert_chapter_to_spinal And InvalidChapterException
Restart the Python console and try passing strings to the new convert_chapter_to_spinal
method:
>>> from utils.utils import convert_chapter_to_spinal
>>> convert_chapter_to_spinal('Chapter 1: Lorem')
'lorem'
>>> convert_chapter_to_spinal('Chapter 2: Lorem Ipsum')
'lorem-ipsum'
>>> convert_chapter_to_spinal('Dolor Sit')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../utils/utils.py", line 42, in convert_chapter_to_spinal
raise InvalidChapterException
utils.utils.InvalidChapterException
>>>
We can see that the method converts our chapters appropriately and raises our custom error when an invalid chapter has been passed.
Creating The get_chapter_file Method
Our last helper method converts our chapter string into a file name:
def get_chapter_file(chapter):
    chapter_spinal_case = convert_chapter_to_spinal(chapter)
    return '{}.html'.format(chapter_spinal_case)
def get_chapter_file(chapter):
: Defines the get_chapter_file
method and requires a chapter
string as a parameter
convert_chapter_to_spinal(chapter)
: uses the helper method we created in the last section and assigns it to chapter_spinal_case
return '{}.html'.format(chapter_spinal_case)
: returns the converted chapter string as a html file name. Chapter 1: Lorem
becomes lorem.html
.
We can test this method by passing the same strings from the last helper method test:
>>> from utils.utils import get_chapter_file
>>> get_chapter_file('Chapter 1: Lorem')
'lorem.html'
>>> get_chapter_file('Chapter 2: Lorem Ipsum')
'lorem-ipsum.html'
>>> get_chapter_file('Dolor Sit')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../utils/utils.py", line 53, in get_chapter_file
chapter_spinal_case = convert_chapter_to_spinal(chapter)
File ".../utils/utils.py", line 43, in convert_chapter_to_spinal
raise InvalidChapterException
utils.utils.InvalidChapterException
>>>
We can see that the proper html file names are being returned. We can also see where our error is initially raised and how it travels up the chain to the get_chapter_file
method
Creating the build_html_files Method
We have one final method left, which is to create the html files. Let's walk through creating one manually in the Python console:
>>> from utils.utils import get_chapter_file
>>> chapters = {'Chapter 1: Lorem': 'content\nline2\nline3'}
>>> chapter_key = list(chapters)[0]
>>> chapter_file = get_chapter_file(chapter_key)
>>> content = chapters[chapter_key]
>>> file_name = '{0}{1}'.format('html/', chapter_file)
>>> html_file = open(file_name, 'w')
>>> html_file.write(content)
>>> html_file.close()
from utils.utils import get_chapter_file
: imports the get_chapter_file
method. We will just create mock data instead of using the other helper methods for now
{'Chapter 1: Lorem': 'content'}
: creates an example chapters dict
list(chapters)[0]
: we convert chapters into a list, which gives us a list of its keys. We do not use the chapters.keys()
method here. In Python 3, .keys()
returns dict_keys
, a view object that is iterable but not indexable, so it cannot be used with [0] directly. A view uses less memory than a list; however, we will take the less efficient route and convert it to a list. (Dicts keep their insertion order as a language guarantee from Python 3.7 onwards; in Python 3.5 and older the order of keys was arbitrary.) In this project, we won't need to look too much into being efficient; however, it is an important concept to keep in mind. We then take the list of keys and retrieve the first item by using [0]
get_chapter_file(chapter_key)
: this uses our helper method from before to retrieve our html file name which will be lorem.html
.
chapters[chapter_key]
: retrieves the value of our key and assigns it to content
'{0}{1}'.format('html/', chapter_file)
: creates our html file path which will be html/lorem.html
html_file = open(file_name, 'w')
: this opens our html file which will allow us access to it. w
means that we will just be writing to it.
html_file.write(content)
: this writes our content
to our html file. When we open our html file, this is what we will see. In our actual method, we will have a html template.
html_file.close()
: this closes our html_file for writing, we want to do this on every iteration / file
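Python also offers the with statement, which closes the file automatically even if an error happens while writing. The sketch below is an alternative to the open/write/close steps above, not what the rest of this guide uses. It writes to a temporary directory (an assumption made so the example does not touch the html folder) and reads the file back to confirm the content.

```python
import os
import tempfile

chapters = {'Chapter 1: Lorem': 'content\nline2\nline3'}
chapter_key = list(chapters)[0]

with tempfile.TemporaryDirectory() as dest:
    file_name = os.path.join(dest, 'lorem.html')

    # The file is closed as soon as the with block ends
    with open(file_name, 'w') as html_file:
        html_file.write(chapters[chapter_key])

    # Read the file back to check what was written
    with open(file_name) as html_file:
        written = html_file.read()

print(written == chapters[chapter_key])  # True
```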
We can now see that our html file is created under html/lorem.html
:
This was a very simplified approach to our method, however, it is a good step towards understanding the overall idea of it. We are essentially retrieving a hash of chapters and looping through each key. Remember that each key is a chapter, where its value is the content for that chapter.
So in every iteration, we have a chapter that we need to create a html file for. We get the name of the file, the chapter name, and the chapter content through our helper methods. We open up our html file and write html code to our file. Once that is complete we close the file and move on to the next chapter until we are done. Let's take a look at our build_html_files
method:
def build_html_files(chapters, dest='html/'):
    for chapter in chapters.keys():
        chapter_file = get_chapter_file(chapter)
        file_name = '{0}{1}'.format(dest, chapter_file)
        html_file = open(file_name, 'w')
        paragraph = chapters[chapter].replace('\n\n', '<br/><br/>')
        content = """
        <html>
            <head>
                <link rel="stylesheet" href="styles.css">
            </head>
            <body>
                <div>
                    <h1>{0}</h1>
                    <p>{1}</p>
                </div>
            </body>
        </html>
        """.format(chapter, paragraph)
        html_file.write(content)
        html_file.close()
def build_html_files(chapters, dest='html/'):
: creates our build_html_files
method with chapters
as a parameter and dest
as a parameter (destination of the html folder, defaulting to html/
)
for chapter in chapters.keys():
: remember that .keys()
returns dict_keys
which are iterable. So we can loop through each chapter key in our chapters
variable
chapter_file = get_chapter_file(chapter)
: uses our helper method from before to get the chapter file name (ex. lorem.html
)
file_name = '{0}{1}'.format(dest, chapter_file)
: builds the entire file path (ex. html/lorem.html
)
html_file = open(file_name, 'w')
: opens the html file in write mode
chapters[chapter].replace('\n\n', '<br/><br/>')
: this grabs the value of our chapters key and replaces each blank line (two consecutive newlines) with <br/><br/>
. In html, the <br/>
element refers to a linebreak, where two line breaks results in a blank line as spacing. This allows us to properly see the new lines (\n\n
) in our html file. This is then assigned to paragraph
.
content = """
<html>
<head>
<link rel="stylesheet" href="styles.css">
</head>
<body>
<div>
<h1>{0}</h1>
<p>{1}</p>
</div>
</body>
</html>
""".format(chapter, paragraph)
This is our entire html file as a string. It is very basic which is all we need. We use triple quotes ("""
) to signify that it is a multiline string. The link
element imports a styles.css
file, which we will cover soon. <h1>{0}</h1>
is where our chapter title will be and <p>{1}</p>
is where our chapter content will be. {0}
and {1}
refer to the first and second param in the format
method (chapter
, paragraph
). The entire html content is then assigned to content
.
html_file.write(content)
: this writes our html content to the current html file.
html_file.close()
: this closes the html file.
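One thing the template does not handle is OCR text that contains characters such as < or &, which have a special meaning in html. A possible refinement, sketched below under the assumption that you want those characters shown literally, is to escape the text with the built-in html module before inserting the line breaks. The to_paragraph_html name is hypothetical and this step is not part of the method above.

```python
import html


def to_paragraph_html(text):
    # Escape &, < and > first so OCR output cannot be read as markup
    escaped = html.escape(text)
    # Then turn blank lines into visible spacing, as build_html_files does
    return escaped.replace('\n\n', '<br/><br/>')


print(to_paragraph_html('a < b\n\nTom & Jerry'))
# a &lt; b<br/><br/>Tom &amp; Jerry
```

The order matters: escaping after the replace would also escape the br tags we just added.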
Testing build_html_files
We can test the build_html_files
function out now (we will create styles.css later).
But, we still need to clear out our html folder. Add the clean
command to your Makefile:
init:
	pip3 install -r requirements.txt

clean:
	rm html/*.html
rm
: this is the remove file command in the unix shell
html/*.html
: this is a pattern passed to rm
that tells it to delete every html file in the html folder.
Run Make clean
to clean up our html folder.
Start the Python terminal again and test our new method:
>>> from utils.utils import build_html_files
>>> chapters = {'Chapter 1: Lorem': 'content\nline2\nline3'}
>>> build_html_files(chapters)
>>>
Open up the html folder, and view the lorem.html
file:
At this point your utils.py
file should look like:
import pytesseract
import glob
import re


def extract(path='data/*.jpg'):
    pages = glob.glob(path)
    pages.sort()
    pages = pages[:3]
    text = ''
    for page in pages:
        print('extracting: {}'.format(page))
        image_string = pytesseract.image_to_string(page)
        text += image_string
    lines = text.split('\n')
    return lines


def build_chapters(lines):
    chapters = {}
    cur_chapter = 'Intro'
    for line in lines:
        is_chapter = re.match(r"^(Chapter [0-9]+:)", line)
        if is_chapter:
            cur_chapter = line
        elif cur_chapter in chapters.keys():
            content = '{}\n'.format(line)
            chapters[cur_chapter] += content
        else:
            content = '{}\n'.format(line)
            chapters[cur_chapter] = content
    return chapters


def convert_chapter_to_spinal(chapter):
    name = re.sub(r"^(Chapter [0-9]+: )", '', chapter)
    if name == chapter:
        raise InvalidChapterException
    name = name.lower().replace(' ', '-')
    return name


class InvalidChapterException(Exception):
    """Chapter name is invalid"""
    pass


def get_chapter_file(chapter):
    chapter_spinal_case = convert_chapter_to_spinal(chapter)
    return '{}.html'.format(chapter_spinal_case)


def build_html_files(chapters, dest='html/'):
    for chapter in chapters.keys():
        chapter_file = get_chapter_file(chapter)
        file_name = '{0}{1}'.format(dest, chapter_file)
        html_file = open(file_name, 'w')
        paragraph = chapters[chapter].replace('\n\n', '<br/><br/>')
        content = """
        <html>
            <head>
                <link rel="stylesheet" href="styles.css">
            </head>
            <body>
                <div>
                    <h1>{0}</h1>
                    <p>{1}</p>
                </div>
            </body>
        </html>
        """.format(chapter, paragraph)
        html_file.write(content)
        html_file.close()
Creating The main.py File
We can now connect everything together in our main.py
file. Update your main.py
to include the following:
from utils.utils import extract, build_chapters, build_html_files
lines = extract()
chapters = build_chapters(lines)
build_html_files(chapters)
This file simply wires the pieces together. We call extract to pull the lines of text out of our images, pass those lines to build_chapters to create our chapters dictionary, and then pass that dictionary to build_html_files to create our html files.
Save this file, run $ make clean, and then run $ python3 main.py (or $ python main.py):
$ python3 main.py
extracting: data/python_dataset_01.jpg
extracting: data/python_dataset_02.jpg
extracting: data/python_dataset_03.jpg
Now we can check that the files exist in our html folder using ls:
$ ls html
dolor.html lorem.html lpsum.html
We now have our 3 chapters created! Open one of them to check it out:
This page could use some styling. Download the styles.css file from the html folder in the GitHub repo and put it in our html folder. Once complete, refresh the page:
This now looks a lot more readable. There is still one more feature we can add to it though: navigation.
Adding Navigation
We can improve our existing code to allow us to navigate from page to page with a next
and previous
button. Here is the basic overview of how we can achieve that:
- Update our for loop to include the index we are on
- If we are on the 2nd iteration or later, add a previous button
- If we are before the last page, add a next button
Starting with the first step, update your loop from:
for chapter in chapters.keys():
to:
chapter_keys = list(chapters)
for index, chapter in enumerate(chapter_keys):
list(chapters)
: this builds a list of the dictionary's keys (the same keys chapters.keys() returns) that we can index by position, and assigns it to chapter_keys
for index, chapter
: adds index
as a variable that increments every loop.
enumerate(chapter_keys)
: Python provides us with a built-in function called enumerate
which will keep track of the iterations for us, allowing us to retrieve the index
value each loop.
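As a small illustration of the two pieces described above, the sketch below runs list and enumerate on a made-up dictionary. The keys and values are placeholders, not data from the project.

```python
# Hypothetical chapters dictionary; real values would come from build_chapters.
chapters = {
    'Chapter 1: Lorem': 'text one',
    'Chapter 2: Ipsum': 'text two',
    'Chapter 3: Dolor': 'text three',
}

# list() over a dictionary yields its keys in insertion order (Python 3.7+).
chapter_keys = list(chapters)

# enumerate pairs each key with its position, starting at 0.
pairs = []
for index, chapter in enumerate(chapter_keys):
    pairs.append((index, chapter))
    print(index, chapter)

# A list can be indexed, which is what lets us reach neighbouring chapters.
second = chapter_keys[1]
```

Indexing is the reason for the extra list call: chapter_keys[index - 1] and chapter_keys[index + 1] give us the neighbouring chapters.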
We can now work on adding our previous link:
# ...
html_file = open(file_name, 'w')
prev_link = ''
if index > 0:
    prev_chapter = chapter_keys[index - 1]
    prev_chapter_file = get_chapter_file(prev_chapter)
    prev_link = '<p><a href="{}">Previous</a></p>'.format(
        prev_chapter_file)
# ...
content = """
<html>
...
<h1>{0}</h1>
<p>{1}</p>
{2}
...
</html>
""".format(chapter, paragraph, prev_link)
prev_link = ''
: This is our previous link, which is defaulted to be blank
if index > 0:
: if it is the 2nd page or later
prev_chapter = chapter_keys[index - 1]
: gets the previous chapter
get_chapter_file(prev_chapter)
: gets the previous chapter file name
'<p><a href="{}">Previous</a></p>'
: our html for the previous link. The link is relative to the folder the page is in (html), so we can pass just the file name (ex. lorem.html). We add the file name to the link through the format method.
content = """...{2}..."""
: this placeholder is filled with the previous link string. If the string is blank, nothing is added to the page; if it contains the html tags, it appears in the page as a link. This means we don't need a separate condition to hide the previous link when there is no previous chapter.
.format(chapter, paragraph, prev_link)
: adds the previous link as a parameter
The next
button is basically the same thing, except our conditional would be if (index < len(chapters) - 1):
so that it checks if the current page isn't the last page. It will also pull the next chapter using chapter_keys[index + 1]
instead of pulling the previous chapter.
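The previous and next logic can also be looked at in isolation. Below is a minimal sketch, assuming a hypothetical helper named build_nav_links that is not part of the original utils.py. The get_chapter_file here is a simplified stand-in (no exception handling) so the snippet runs on its own.

```python
import re


def get_chapter_file(chapter):
    # Simplified stand-in for the utils.py version (no exception handling).
    name = re.sub(r"^(Chapter [0-9]+: )", '', chapter)
    return '{}.html'.format(name.lower().replace(' ', '-'))


def build_nav_links(chapter_keys, index):
    """Hypothetical helper: return (prev_link, next_link) for one page."""
    prev_link = ''
    next_link = ''
    # Every page except the first gets a link to the chapter before it.
    if index > 0:
        prev_file = get_chapter_file(chapter_keys[index - 1])
        prev_link = '<p><a href="{}">Previous</a></p>'.format(prev_file)
    # Every page except the last gets a link to the chapter after it.
    if index < len(chapter_keys) - 1:
        next_file = get_chapter_file(chapter_keys[index + 1])
        next_link = '<p><a href="{}">Next</a></p>'.format(next_file)
    return prev_link, next_link


# Made-up chapter names for illustration only.
keys = ['Chapter 1: Lorem', 'Chapter 2: Ipsum', 'Chapter 3: Dolor']
first = build_nav_links(keys, 0)
middle = build_nav_links(keys, 1)
last = build_nav_links(keys, 2)
```

The first page ends up with only a next link, the last page with only a previous link, and every page in between with both.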
Here is the full method with the navigation buttons:
def build_html_files(chapters, dest='html/'):
    chapter_keys = list(chapters)
    for index, chapter in enumerate(chapter_keys):
        chapter_file = get_chapter_file(chapter)
        file_name = '{0}{1}'.format(dest, chapter_file)
        html_file = open(file_name, 'w')
        prev_link = ''
        next_link = ''
        if index > 0:
            prev_chapter = chapter_keys[index - 1]
            prev_chapter_file = get_chapter_file(prev_chapter)
            prev_link = '<p><a href="{}">Previous</a></p>'.format(
                prev_chapter_file)
        if (index < len(chapters) - 1):
            next_chapter = chapter_keys[index + 1]
            next_chapter_file = get_chapter_file(next_chapter)
            next_link = '<p><a href="{}">Next</a></p>'.format(
                next_chapter_file)
        paragraph = chapters[chapter].replace('\n\n', '<br/><br/>')
        content = """
        <html>
            <head>
                <link rel="stylesheet" href="styles.css">
            </head>
            <body>
                <div>
                    <h1>{0}</h1>
                    <p>{1}</p>
                    {2}{3}
                </div>
            </body>
        </html>
        """.format(chapter, paragraph, prev_link, next_link)
        html_file.write(content)
        html_file.close()
Once you have updated your code, run $ make clean and then $ python3 main.py. After the script runs, view your html files to see the next and previous buttons:
Testing The Entire Dataset
Now that we verified it worked for the first 3 images, we can test it for the entire dataset by removing pages = pages[:3]
. Update the extract
method from:
def extract(path='data/*.jpg'):
    pages = glob.glob(path)
    pages.sort()
    pages = pages[:3]
    text = ''
    # ...
to:
def extract(path='data/*.jpg'):
    pages = glob.glob(path)
    pages.sort()
    text = ''
    # ...
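If you expect to switch between a quick test run and the full dataset, one alternative to deleting the line is an optional limit. This is only a sketch of that idea, not the approach used in this post: list_pages and its limit parameter are hypothetical names, and the function only selects file paths so that it runs without Tesseract installed.

```python
import glob


def list_pages(path='data/*.jpg', limit=None):
    """Hypothetical helper: return sorted image paths, optionally truncated."""
    pages = sorted(glob.glob(path))
    # Slicing with None keeps the whole list, so limit=None means "all pages".
    return pages[:limit]


# Quick run on the first three images, then the full dataset.
preview_pages = list_pages(limit=3)
all_pages = list_pages()
```

The same slice could be applied inside extract if you prefer to keep everything in one function.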
Now run $ make clean and then $ python3 main.py:
$ python3 main.py
extracting: data/python_dataset_01.jpg
...
extracting: data/python_dataset_38.jpg
If everything went well, you should now have a static website in html/
!
What's Next?
At this point, the project is in a good position to build on. You can add more features such as a table of contents or a word count. I recommend viewing the GitHub repo to see how I tested the utils.py file using pytest and coverage. I am always available to help with any questions, so don't hesitate to contact me!