Harnessing the Power of Python: Converting Images to Text

In today’s digital era, images play a crucial role in communication and information sharing. However, extracting meaningful information from images can be a challenging task. That’s where Python and libraries such as pytesseract and OpenCV come into play. In this blog post, we’ll explore converting images to text using Python, along with the possibilities and applications of this technique.

Understanding Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is the technology that enables computers to extract text from images or scanned documents. By leveraging OCR, we can convert images into editable and searchable text, providing a wealth of opportunities for various applications, including data entry automation, document analysis, and content extraction.

Python Image Libraries

Python offers several powerful libraries that make it relatively easy to perform image to text conversion. The two most widely used libraries are:

  1. Tesseract OCR: Tesseract is an open-source OCR engine originally developed at Hewlett-Packard and later sponsored by Google. It supports over 100 languages and provides robust text recognition capabilities. The third-party pytesseract library provides a Python interface to Tesseract, so OCR functionality can be integrated into Python applications.
  2. OpenCV: OpenCV is a popular computer vision library that includes various image processing functions. While not primarily an OCR library, OpenCV provides a strong foundation for preprocessing images before passing them to an OCR engine. It can be used for tasks such as noise removal, image enhancement, and text localization, improving the accuracy of OCR results.
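
As a rough sketch of the preprocessing idea, the function below converts an image to grayscale, applies a light median blur and an Otsu threshold, and then hands the result to pytesseract. The function name preprocess_and_read and the 3-pixel blur are illustrative assumptions rather than settings from this post, and the right steps will depend on the image. It also assumes both Python packages and the Tesseract engine are installed.

```python
def preprocess_and_read(path: str, config: str = "--oem 1 --psm 4") -> str:
    """Clean up an image with OpenCV, then run OCR on it (hypothetical helper)."""
    # Imported inside the function so this module still loads on machines
    # where OpenCV or pytesseract is not installed.
    import cv2
    import pytesseract

    image = cv2.imread(path)
    if image is None:
        # cv2.imread returns None instead of raising when the file is unreadable
        raise FileNotFoundError(f"Could not read image: {path}")

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # drop colour information
    gray = cv2.medianBlur(gray, 3)                  # reduce salt-and-pepper noise
    # Otsu's method picks the black/white threshold automatically
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary, config=config)
```

Calling preprocess_and_read('image.jpg') would return the recognized text as a string. Whether thresholding helps or hurts depends on the input, so it is worth comparing against the unprocessed image.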

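Converting Images to Text with Python

To get started with image to text conversion in Python, you’ll need to install the necessary libraries. Note that pytesseract is only a wrapper, so the Tesseract engine itself must also be installed separately and be available on your PATH. Use the following commands in your terminal or command prompt: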

pip install pytesseract
pip install opencv-python

Once the libraries are installed, you can utilize the power of OCR in Python with the following steps:

  1. Import the required libraries:
import cv2
import pytesseract
  2. Load the image:
image = cv2.imread('image.jpg')
  3. Perform OCR using pytesseract:
text = pytesseract.image_to_string(image)
print(text)
  4. If the image isn’t clear or if the text is surrounded by pictures, add config options to image_to_string. This is especially useful if you see garbage in the text or if the text isn’t aligning correctly. You may need to adjust the --psm (page segmentation mode) setting, which is 4 in the example below. Sometimes 2, 4 or 8 will work best. This Stack Overflow conversation describes the psm option in detail: https://stackoverflow.com/questions/44619077/pytesseract-ocr-multiple-config-options
config_opts = "--oem 1 --psm 4"
text = pytesseract.image_to_string(image, config=config_opts)
print(text)
  5. Analyze and utilize the extracted text. At this stage, the text has been extracted, so you can operate on it as you would any other string in Python or insert it directly into a database.
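
To illustrate the last step, here is a minimal sketch of post-processing OCR output with the standard library only. The sample text is made up to stand in for what image_to_string() might return for a receipt, and the helper names clean_lines and find_total are hypothetical.

```python
import re

# Made-up sample of what image_to_string() might return for a receipt
ocr_text = """ACME HARDWARE

Hammer        12.50
Nails  (box)   3.99

TOTAL         16.49
"""

def clean_lines(text: str) -> list:
    """Split OCR output into lines, dropping blanks and collapsing whitespace."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]

def find_total(lines: list):
    """Return the amount on the first line that starts with TOTAL, or None."""
    for line in lines:
        match = re.match(r"TOTAL\s+(\d+\.\d{2})$", line, flags=re.IGNORECASE)
        if match:
            return float(match.group(1))
    return None

lines = clean_lines(ocr_text)
print(lines)              # ['ACME HARDWARE', 'Hammer 12.50', 'Nails (box) 3.99', 'TOTAL 16.49']
print(find_total(lines))  # 16.49
```

Real OCR output is usually noisier than this, so patterns often need to tolerate misread characters.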

Applications and Use Cases

The ability to convert images to text opens up numerous possibilities across various domains. Here are a few use cases where Python’s image to text conversion capabilities can be invaluable:

  1. Data Entry Automation: Automatically extracting data from forms, invoices, or receipts and converting them into machine-readable text can significantly streamline data entry processes.
  2. Document Analysis: Converting scanned documents or handwritten notes into editable text allows for efficient content analysis, searchability, and text mining.
  3. Accessibility: Converting text from images can improve accessibility for visually impaired individuals by enabling text-to-speech applications or screen readers to interpret the content.
  4. Content Extraction: Extracting text from images can aid in content curation, social media monitoring, and sentiment analysis, allowing businesses to gain valuable insights from visual content.

Python provides an extensive range of tools and libraries for converting images to text, thanks to its versatility and powerful third-party packages. With the help of OCR libraries like Tesseract and the image processing capabilities offered by OpenCV, developers can extract text from images and unlock a multitude of applications. Whether you are automating data entry, analyzing documents, or extracting content, Python makes image to text conversion fairly easy.

Be sure to check out the other Python articles here: https://sim10tech.com/category/python/

Regular Expressions in Python

I designed a tool a while back in Python that used sar and Solaris explorer data for capacity analysis. One of the issues I faced was needing to find data in between two regular expressions. Fortunately, Python has a powerful regular expression module called re. Working with regexes can be daunting if you haven’t worked with them before. If you’re unfamiliar with pattern matching, please read this: RegEx Primer

Using the Regular Expressions (RE) Module

Three main methods of the re module are compile(), match() and search(). The compile() method creates a regex object that can be reused, which avoids re-parsing the pattern every time it is used. match() will return a re.Match object only if the beginning of the string matches the pattern. search() will find the first occurrence of the pattern anywhere within the string. The code example below is fairly simple in that the pattern is just a plain string. Typically, the pattern will contain special characters instead of only literal text. As an example, something like ^fd.ss$ is more common in pattern matching. This statement says:

  1. ^fd – find “fd” at the beginning of the line. ^ means to match at the beginning of the line.
  2. .ss – after finding “fd”, match any character followed by “ss”. The . matches any one character.
  3. ss$ – “ss” is the last two characters at the end of the line. $ says end of line, but not including new line characters.
import re
data_str = 'this is my search string'
srch_recomp = re.compile('string')
# Match won't find anything since 'string' is not at the beginning of data_str
regex_found = re.match(srch_recomp, data_str)
type(regex_found)
# <class 'NoneType'>

regex_found = re.search(srch_recomp, data_str) # Search will find the pattern in data_str
regex_found
# <re.Match object; span=(18, 24), match='string'>

In this example, we change the variable srch_recomp so re.match() will find the pattern.

data_str = 'this is my search string'
srch_recomp = re.compile('this')
regex_found = re.match(srch_recomp, data_str)
regex_found
# <re.Match object; span=(0, 4), match='this'>
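
The ^fd.ss$ pattern described earlier can be tried out the same way. This is a small sketch with made-up test strings, showing which inputs the anchors accept and how match() and search() differ once the anchors are removed.

```python
import re

pattern = re.compile(r"^fd.ss$")

# "fd", any single character, then "ss" at the end of the line
print(bool(pattern.search("fdXss")))    # True
print(bool(pattern.search("fd1ss")))    # True
print(bool(pattern.search("fdss")))     # False: one character too short
print(bool(pattern.search("xfd1ss")))   # False: "fd" is not at the start
print(bool(pattern.search("fd1ssx")))   # False: "ss" is not at the end

# match() only looks at the start of the string; search() scans all of it
no_anchor = re.compile(r"fd.ss")
print(no_anchor.match("xfd1ss"))           # None
print(no_anchor.search("xfd1ss").span())   # (1, 6)
```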

Python Forward Search

The algorithm to search for data between two patterns is fairly simple. Using the regular expressions module, re, search for the begin string, then append every line to a list until the end string is found. This example class reads from a file, but the file object can easily be replaced with another object type. Comments in the code explain how to drop the matched lines from the final output if you don’t need them.

Regular expressions can be a complex subject at first, mostly because the pattern matching syntax is so different from ordinary code. Start by reading and trying simple expressions. For the most part, re follows standard matching syntax, so knowledge of grep in Linux/UNIX transfers easily to Python. Refer to the re documentation here: Python RE module or check Stack Overflow for examples.

import re

class LookForward():
    """
        begin_re: beginning search pattern
        end_re: end search pattern
        file_name: File name to search for begin_re and end_re strings.
        Return: a list of the lines found between the two patterns
    """
    def __init__(self, begin_re, end_re, file_name):
        self.begin_re = begin_re
        self.end_re = end_re
        self.file_name = file_name

    def look_forward(self):
        """
           Method that returns a list containing lines between
           begin and end regular expressions.
        """
        return_val = []
        # compile both patterns once instead of once per line
        begin_pattern = re.compile(self.begin_re)
        final_pattern = re.compile(self.end_re)
        try:
            with open(self.file_name) as file_ctx:
                f_data = file_ctx.readlines()
        except OSError as err:
            # PermissionError is a subclass of OSError, so it is covered here
            print(f"Encountered an error while opening {self.file_name}:"
                  f" {err}")
            raise
        # A single iterator is shared by both loops, so the inner loop
        # continues from the line after the begin_re match instead of
        # starting again at the top of the file.
        lines = iter(f_data)
        for line in lines:
            # if there is a match for the beginning search pattern, then
            # start parsing until end_re is found.
            if begin_pattern.search(line):
                # Comment out the line below if the begin_re line should
                # not be included in the results. strip() removes the
                # new line char.
                return_val.append(line.strip())
                for next_line in lines:
                    return_val.append(next_line.strip())
                    # check if next_line is a match for end_re
                    if final_pattern.search(next_line):
                        # Uncomment the line below if the end_re should
                        # not be included in the results:
                        # return_val = return_val[:-1]
                        # break the inner loop since end_re was found
                        break
        return return_val

To use this class, initialize LookForward by passing the begin and end regular expressions and a filename to search. In the example below, “this” is the begin search string, “that” is the end search string and “test.txt” is the file that is searched for these strings.

lf_data = LookForward("this", "that", "test.txt")
output = lf_data.look_forward()
for lines in output:
    print(lines)
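
Since the file object can be swapped for another object type, here is a sketch of the same idea written as a generator that accepts any iterable of lines (an open file, a list, or io.StringIO). The name lines_between and the sample data are made up for illustration.

```python
import re
from typing import Iterable, Iterator

def lines_between(lines: Iterable[str], begin_re: str, end_re: str) -> Iterator[str]:
    """Yield stripped lines from each begin_re match through the following end_re match."""
    begin_pattern = re.compile(begin_re)
    end_pattern = re.compile(end_re)
    inside = False  # True while we are between a begin match and an end match
    for line in lines:
        if not inside:
            if begin_pattern.search(line):
                inside = True
                yield line.strip()
        else:
            yield line.strip()
            if end_pattern.search(line):
                inside = False

sample = ["header\n", "this starts\n", "middle\n", "that ends\n", "footer\n"]
result = list(lines_between(sample, "this", "that"))
print(result)   # ['this starts', 'middle', 'that ends']
```

Because it yields lines one at a time, this version does not need to read the whole file into memory first.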

Passwords with non-standard Characters in JSON using Python

Python and JSON

I had a requirement to have passwords contain a backslash (\) in an API call with JSON. However, when attempting to load the credentials with json.loads, Python would throw this exception:

Expecting value: line 1 column 34 (char 33)

Not surprisingly, a solution wasn’t found on Stack Overflow or through any internet searches. I’m guessing the reason is that \ is a reserved character (the escape character) in JSON, similar to Python. Unfortunately, that didn’t matter as the requirements were already set and accepted, so I needed to find a fix. I attempted the following:

  1. Escape the \ with two backslashes like this: \\.
  2. Different quotes: ' and ''.
  3. Encapsulating the quotes like '" and "'.
  4. Using strict=False for json.loads.
    1. Example: json.loads(json_creds, strict=False)
    2. This was the most cited workaround I found, but it never worked with the slash. json.loads would throw the Expecting value exception every time.

However, none of that worked, mostly because, in all honesty, it shouldn’t. The reserved characters are there, just like in Python, for the language to function correctly. We wouldn’t add an @ in a method or function name for the same reason we shouldn’t add \ in passwords for JSON. I’m digressing a bit, so back to how to work around this.

I found that if the password is encoded using json.dumps first, which escapes the backslash and adds the surrounding quotes, and then placed into the JSON body, it worked perfectly.

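import json

username = "sample_user"  # placeholder value for this example
password = r"This.\Sample"  # raw string so Python keeps the backslash as-is
encoded_pw = json.dumps(password)  # escapes the backslash and adds the quotes
JSON_DATA = "{\"username\": \"" + username + "\", \"password\": " + encoded_pw + "}"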
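
A variation on the same idea, offered as a sketch rather than what the original project did, is to build a dictionary and let json.dumps encode the whole document, so that no JSON is assembled by hand. The credentials below are made-up sample values.

```python
import json

# Hypothetical credentials; the raw string keeps the backslash literal
credentials = {"username": "sample_user", "password": r"This.\Sample"}

json_data = json.dumps(credentials)
print(json_data)
# {"username": "sample_user", "password": "This.\\Sample"}

# Decoding gives back the original password with a single backslash
decoded = json.loads(json_data)
print(decoded["password"])
# This.\Sample
```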

Recursion with Python

I had a bug that was difficult to track down. I had a nested list from which I removed some elements using the remove() method; however, not all of the elements were removed. In fact, the code was removing only every other element. The bug turned out to be that each call to remove() shortened the list by one while the for loop kept advancing its position, which caused elements to be skipped. For example:

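import re

some_lst = [['a','b'],['c','d'],['e','f']]
regexToMatch = '[a-e]'  # example pattern matching the first item of every element

for i in some_lst:
  if re.match(regexToMatch, i[0]):
     some_lst.remove(i)
# some_lst is now [['c', 'd']]: the second element was skipped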

I needed to rewind the list if remove() was called. At first, I thought this would be perfect for recursion. As it turned out, with the amount of data I needed the algorithm to search through, recursion was not the best solution.

A recursive algorithm is one that calls itself. These functions also need a decrementing counter or some other condition that ends the recursion. The exit condition is required: without it, the function keeps calling itself until Python raises a RecursionError (by default after roughly 1000 nested calls). Recursion can be elegant, but it is not suited to every problem. On small datasets it performs fine, but as the data grows, performance drops quickly, because every call adds a stack frame that consumes memory and, in the version below, each removal restarts the scan from the beginning of the list. The code below worked great on a small subset of data, but did not scale to what was needed for the project.

import re

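def remove_processes(the_list: list):
    # exclude_processes is an iterable of regex strings defined elsewhere
    for (hostname, process) in the_list:
        for line in exclude_processes:
            try:
                if re.search(line.strip(), process):
                    the_list.remove([hostname,process])
                    # start over on the shortened list and stop this pass,
                    # since the loop position is no longer valid
                    return remove_processes(the_list)
            except ValueError:
                continue
    return the_list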

Eventually, I replaced the for loop with a while loop and a counter. When a match is found, count is reduced by 1 so every element in the_list is evaluated. This also removes any duplicates.

def remove_processes(the_list):
    # exclude_process is a regex string defined elsewhere
    count = 0
    while count <= len(the_list) - 1:
        hostname = the_list[count][0]
        process = the_list[count][1]
        count += 1
        try:
            if re.search(exclude_process, process):
                the_list.remove([hostname, process])
                # reduce the count by 1 which resolves skipping
                # every other element
                count -= 1
        except ValueError:
            continue

    return the_list
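
Another option, which is not the approach used in the project described here, is to leave the original list alone and build a new one with a list comprehension. Nothing is removed during iteration, so no element can be skipped. The function name without_excluded and the sample data are made up for illustration.

```python
import re

def without_excluded(process_list, exclude_pattern):
    """Return a new list without the entries whose process matches exclude_pattern."""
    pattern = re.compile(exclude_pattern)
    return [
        [hostname, process]
        for hostname, process in process_list
        if not pattern.search(process)
    ]

# Made-up sample data
processes = [["web01", "sshd"], ["web01", "nginx"], ["db01", "sshd"], ["db01", "postgres"]]
kept = without_excluded(processes, r"^sshd$")
print(kept)   # [['web01', 'nginx'], ['db01', 'postgres']]
```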

Using Recursion with JSON

Here is another example of recursion. recurs() searches for the key k in a JSON document. If k isn’t found at the current level, the function continues into the nested elements until all of them are searched. Wherever k is found, the function updates its value to “changed”. I used this function to remove passwords in a JSON document so it could be sent externally.

def recurs(k: str, json_doc):
    """
    Takes a string, k and searches through json_doc
    :param k: key to search for
    :param json_doc: JSON / dictionary (or nested list) to search through for k
    :return: json_doc with every value stored under k set to 'changed'
    """
    if isinstance(json_doc, dict):
        for key, value in json_doc.items():
            if key == k:
                json_doc[key] = 'changed'
            else:
                # descend into nested dictionaries and lists
                recurs(k, value)
    elif isinstance(json_doc, list):
        for item in json_doc:
            recurs(k, item)
    # strings, numbers, booleans and None end the recursion
    return json_doc

Python – Working with Lists

 

One of the most common data types in Python is the list. A list is similar to an array in other languages; however, a single list can mix different data types. You can test this in the interactive shell:

>>> a_list = ['dog', 'cat', 1, 3, 1000]
>>> print(a_list)
['dog', 'cat', 1, 3, 1000]
>>> type(a_list[3])
<class 'int'>
>>> type(a_list[1])
<class 'str'>

Working with Python Lists

After declaring a list, we will need to add data, delete data or assign data to another variable. To accomplish these tasks, we use the append() method to add data, pop() or remove() to delete data, and indexing or slicing to retrieve elements or assign them to another variable.

Lists are indexed using brackets ([]) and a positional number, and sliced using a colon (:). Using a colon selects a range of elements.

Note: list indexes start at 0, not 1.

>>> a_list = ['dog', 'cat', 'bat']
>>> b_str = a_list[0] # Take the first element and assign it to b_str
>>> b_str
'dog'
>>> type(b_str)
<class 'str'>
>>> a_list[-1] # Take the last element from the list 
'bat'
>>> a_list[-2] # Take the second to last element from the list
'cat'
>>> type(a_list[0]) # Lists can contain ints, strings, dictionaries or other lists
<class 'str'>
>>> type(a_list[-1])
<class 'str'>

Slicing a range of list elements is a common source of off-by-one defects. The number before the colon is the starting position, and the number after it is the end position, which is not included. For example, a_list[1:4] starts at element 1 and ends at element 3, not 4. If you’ve developed in other languages, it may take some time to get used to Python’s slicing.

>>> a_list = ['cat', 1, 3, 1000]
>>> a_list
['cat', 1, 3, 1000]
>>> a_list[0:3] # Take the first element through the third element
['cat', 1, 3]
>>> a_list[1:-1] # Take the second element through the second to last element
[1, 3]
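
A few more slices make the “start included, stop excluded” rule concrete. This short sketch uses the same four-element list.

```python
a_list = ['cat', 1, 3, 1000]

# start is included, stop is excluded
print(a_list[0:3])    # ['cat', 1, 3]
print(a_list[1:-1])   # [1, 3]

# omitting a bound means "from the start" or "to the end"
print(a_list[:2])     # ['cat', 1]
print(a_list[2:])     # [3, 1000]

# for in-range positive indexes, a slice holds stop - start elements
print(len(a_list[1:4]))   # 3

# a stop past the end is clipped rather than raising IndexError
print(a_list[1:100])  # [1, 3, 1000]
```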

Here is an example of a list containing another list and a dictionary. Lists containing other lists are common in JSON or RESTful development, so becoming familiar with the syntax is important as you develop more complex or web-enabled applications. We can still call dictionary methods like keys() or values() on a dictionary stored inside a list.

>>> b_list = ['this', 'new', 'list']
>>> a_list.append(b_list)
>>> a_list
['cat', 1, 3, 1000, ['this', 'new', 'list']]

>>> my_dct = {'language': 'Python'}
>>> a_list.append(my_dct)
>>> a_list
['cat', 1, 3, 1000, ['this', 'new', 'list'], {'language': 'Python'}]
>>> a_list[-1].keys() # We can call dictionary methods  
dict_keys(['language'])
>>> a_list[-1].values()
dict_values(['Python'])

Simple Merge

Here are some examples of merging two Python lists into one. This simple example appends the values from secondList onto firstList.

firstList = ['dog', 'cat', 'tiger', 'rhino']
secondList = ['2 x 18g', '250m swap']

for line in secondList:
   firstList.append(line)
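
The same merge can also be written without an explicit loop. As a brief sketch, extend() modifies the first list in place, while + creates a new list. The variable names here are illustrative.

```python
first_list = ['dog', 'cat', 'tiger', 'rhino']
second_list = ['2 x 18g', '250m swap']

# extend() appends every item of second_list to first_list in place
first_list.extend(second_list)
print(first_list)
# ['dog', 'cat', 'tiger', 'rhino', '2 x 18g', '250m swap']

# + builds a new list and leaves both operands unchanged
combined = ['dog', 'cat'] + second_list
print(combined)
# ['dog', 'cat', '2 x 18g', '250m swap']
```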

Merge Lists Based on A Condition

The following code merges two lists, but inserts the items from the second list after finding a specific string, in this case a hostname. outerList.pop(0) removes the first element (the hostname) from each inner list, and the remaining elements are then inserted into firstList right after that hostname.

firstList = ['dog', 'cat', '2', '500', 'daeo', 'DL580', '8', '128']
secondList = [['dog', '2 x 18g', '250m swap'], ['daeo', '4 x 146g', '16g swap']]

try:
    for outerList in secondList:
        systemName = outerList.pop(0)
        for innerList in outerList:
            indexPos = firstList.index(systemName)
            firstList.insert(indexPos + 1, innerList)
except ValueError as err:
    print('ValueError: {}'.format(err))
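
One thing to be aware of is that the loop above inserts every item at the same position, so the inserted items end up in reverse order ('250m swap' lands before '2 x 18g'). If the original order matters, a slice assignment is one possible alternative. This sketch also handles a missing hostname per entry instead of stopping the whole merge, which is an assumption about the desired behavior. The snake_case names are illustrative.

```python
first_list = ['dog', 'cat', '2', '500', 'daeo', 'DL580', '8', '128']
second_list = [['dog', '2 x 18g', '250m swap'], ['daeo', '4 x 146g', '16g swap']]

for system_name, *details in second_list:  # unpack without modifying second_list
    try:
        index_pos = first_list.index(system_name)
    except ValueError:
        # hostname not present in first_list: skip this entry and keep going
        continue
    # assigning to an empty slice inserts all items at once, in order
    first_list[index_pos + 1:index_pos + 1] = details

print(first_list)
# ['dog', '2 x 18g', '250m swap', 'cat', '2', '500',
#  'daeo', '4 x 146g', '16g swap', 'DL580', '8', '128']
```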