Regular Expressions in Python

I designed a tool a while back in Python that used sar and Solaris Explorer data for capacity analysis. One of the issues I faced was needing to find the data between two regular expressions. Fortunately, Python has a powerful regular expression module called re. Working with regexes can be daunting if you haven’t worked with them before. If you’re unfamiliar with regular expression pattern matching, please read this: RegEx Primer

Using the Regular Expressions (RE) Module

Three main functions of the re module are compile(), match() and search(). compile() creates a regex object that can be reused, which saves re-parsing the pattern when it is applied many times. match() returns a match object only if the beginning of the string matches the pattern. search() finds the first occurrence of the pattern anywhere in the string. The example further down is fairly simple in that the pattern is only a literal string. Typically, the pattern will contain special characters instead of plain text. As an example, something like ^fd.ss$ is more common in pattern matching. This pattern says:

^fd – find “fd” at the beginning of the line. ^ means to match at the beginning of the line.
.ss – after finding “fd”, match any one character followed by “ss”. The . matches any single character except a newline.
ss$ – “ss” must be the last two characters of the line. $ matches at the end of the line, just before a trailing newline character if there is one.
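
To make the breakdown above concrete, here is a small sketch that tries ^fd.ss$ against a few strings. The sample strings are made up for illustration and are not from the original tool.

```python
import re

pattern = re.compile(r'^fd.ss$')

# Hypothetical sample strings, each chosen to exercise one part of the pattern.
samples = ['fdxss', 'fdss', 'afdxss', 'fdxss\n', 'fdxssy']

# Map every sample to True/False depending on whether the pattern matches.
results = {s: bool(pattern.search(s)) for s in samples}
for text, matched in results.items():
    print(repr(text), matched)
```

Only 'fdxss' and 'fdxss\n' match: 'fdss' has no character between “fd” and “ss”, 'afdxss' does not start with “fd”, and 'fdxssy' does not end with “ss”.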

import re
data_str = 'this is my search string'
srch_recomp = re.compile('string')
# Match won't find anything since 'string' is not at the beginning of data_str
regex_found = re.match(srch_recomp, data_str)
type(regex_found)
<class 'NoneType'>

regex_found = re.search(srch_recomp, data_str) # Search will find the pattern in data_str
regex_found
<_sre.SRE_Match object; span=(18, 24), match='string'>
In this example, we change the variable srch_recomp so re.match() will find the pattern.

data_str = 'this is my search string' 
srch_recomp = re.compile('this')
regex_found = re.match(srch_recomp, data_str)
regex_found
<_sre.SRE_Match object; span=(0, 4), match='this'>

Python Forward Search

The algorithm for finding data between two patterns is fairly simple. Using the regular expressions module, re, search for the begin pattern, then append each line to a list until the end pattern is found. This example class reads from a file, but the file object can easily be replaced with another object type. The code includes a comment showing how to leave the end_re line out of the final output.

Regular expressions are a complex subject at first, mostly because the pattern matching syntax is so different from everyday code. Start by reading and trying simple expressions. For the most part, re follows standard matching syntax, so knowledge of grep on Linux/UNIX transfers to Python easily. Refer to the re documentation here: Python RE module or check Stack Overflow for examples.

import re

class LookForward():
    """
        begin_re: beginning search pattern
        end_re: end search pattern
        file_name: File name to search for begin_re and end_re strings.
        Return: a list of the lines found between the two patterns
    """
    def __init__(self, begin_re, end_re, file_name):
        self.begin_re = begin_re
        self.end_re = end_re
        self.file_name = file_name

    def look_forward(self):
        """
           Method that returns a list containing lines between
           begin and end regular expressions.
        """
        return_val = []
        # Compile both patterns once instead of on every line.
        begin_pattern = re.compile(self.begin_re)
        final_pattern = re.compile(self.end_re)
        try:
            with open(self.file_name) as file_ctx:
                f_data = file_ctx.readlines()
        except OSError as err:
            print(f"Encountered an error while opening {self.file_name}:"
                  f" {err}")
            # Re-raise the original exception so the caller sees the cause.
            raise
        collecting = False
        for line in f_data:
            # if there is a match for the beginning search pattern, then
            # start collecting lines until end_re is found.
            if not collecting and begin_pattern.search(line):
                collecting = True
            if collecting:
                # strip() removes the new line char.
                return_val.append(line.strip())
                # check if this line is a match for end_re
                if final_pattern.search(line):
                    # Uncomment the line below if the end_re should
                    # not be included in the results:
                    # return_val = return_val[:-1]
                    # stop collecting since end_re was found
                    collecting = False
        return return_val

To use this class, create a LookForward instance by passing the begin and end regular expressions and a file name to search. In the example below, “this” is the begin search string, “that” is the end search string and “test.txt” is the file that is searched for these strings.

lf_data = LookForward("this", "that", "test.txt")
output = lf_data.look_forward()
for lines in output:
    print(lines)
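
Since the file object can be swapped for another object type, one way to sketch the same idea is a generator that accepts any iterable of lines. The names lines_between and sample are hypothetical and not part of the class above; this is an illustration under those assumptions, not the original tool's code.

```python
import re
from typing import Iterable, Iterator


def lines_between(lines: Iterable[str], begin_re: str, end_re: str) -> Iterator[str]:
    """Yield stripped lines of each block that starts at a begin_re match
    and ends at the next end_re match (both boundary lines included)."""
    begin_pattern = re.compile(begin_re)
    end_pattern = re.compile(end_re)
    collecting = False
    for line in lines:
        if not collecting and begin_pattern.search(line):
            collecting = True
        if collecting:
            yield line.strip()
            if end_pattern.search(line):
                collecting = False


# Hypothetical in-memory data standing in for a file.
sample = ["header\n", "this starts\n", "middle\n", "that ends\n", "footer\n"]
found = list(lines_between(sample, "this", "that"))
print(found)
```

Because the function only iterates, it works the same with an open file handle, a list, or the output of another generator.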

Check out the other Python-related articles here.

Docker Containers Part 2 – Working with Images

If you haven’t installed Docker, please read Part 1 of this Docker series.

Managing container lifecycles involves more than starting and stopping. In this second part of Docker Containers, we show how to administer images locally and in remote repositories. Images are managed with the image subcommand. We will cover list/ls, inspect, search, pull and rm/prune in this article.

Working with Images

List Images

The first part of managing images is to know which images are being used, what their disk utilization is and the image version. Listing images is done in one of two ways: either long docker image list or more Linux/UNIX friendly docker image ls.

docker image ls
REPOSITORY   TAG       IMAGE ID       CREATED       SIZE
python       latest    e285995a3494   10 days ago   921MB
postgres     latest    75993dd36176   10 days ago   376MB
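
If the listing needs to be processed in a script, one option is to capture it as JSON and parse it in Python. The sketch below assumes the output was produced with docker image ls --format '{{json .}}', which prints one JSON object per image. The sample text is hypothetical and trimmed to a few fields, and the field names should be checked against your Docker version.

```python
import json


def parse_image_lines(output: str) -> list:
    """Parse one-JSON-object-per-line output into a list of dicts."""
    return [json.loads(line) for line in output.splitlines() if line.strip()]


# Hypothetical captured output, shaped like `docker image ls --format '{{json .}}'`.
sample_output = (
    '{"Repository": "python", "Tag": "latest", "ID": "e285995a3494", "Size": "921MB"}\n'
    '{"Repository": "postgres", "Tag": "latest", "ID": "75993dd36176", "Size": "376MB"}\n'
)

images = parse_image_lines(sample_output)
for image in images:
    print(image["Repository"], image["Tag"], image["Size"])
```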

Inspecting an image provides information such as environment variables, parent image, commands used during initialization, network information, volumes and much more. This data is vital when troubleshooting issues with container startup or creating new images. The following is only an excerpt – the actual output runs to about two pages.

docker inspect postgres
[
    {
        "RepoTags": [
            "postgres:latest"
        ],
        "Hostname": "81312c458473",
        "ExposedPorts": {
            "5432/tcp": {}
        },
        "Env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/postgresql/14/bin",
            "PG_MAJOR=14",
            "PG_VERSION=14.5-1.pgdg110+1",
            "PGDATA=/var/lib/postgresql/data"
        ],
        "Cmd": [
            "/bin/sh",
            "-c",
            "#(nop) ",
            "CMD [\"postgres\"]"
.......
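
Because the inspect output is JSON, it can be loaded in Python when only one piece is needed. This sketch assumes the environment list sits under a "Config" key of the first array element, which is where I would expect it for an image; the sample string is a hypothetical, heavily shortened stand-in for the real output.

```python
import json


def image_env(inspect_output: str) -> dict:
    """Return the environment variables of the first inspected image as a dict."""
    first = json.loads(inspect_output)[0]
    env_list = first.get("Config", {}).get("Env") or []
    # partition() splits on the first "=" only; [::2] keeps (name, value).
    return dict(item.partition("=")[::2] for item in env_list)


# Hypothetical, shortened inspect output.
sample = (
    '[{"RepoTags": ["postgres:latest"], '
    '"Config": {"Env": ["PG_MAJOR=14", "PGDATA=/var/lib/postgresql/data"]}}]'
)

env = image_env(sample)
print(env["PG_MAJOR"], env["PGDATA"])
```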

Search Repositories

Searching repositories can be accomplished by either going to Docker Hub, or searching by command line, docker search <string>, so you never have to leave the shell. Here’s an example:

docker search postgres
NAME                        DESCRIPTION                                     STARS   OFFICIAL   AUTOMATED
postgres                    The PostgreSQL object-relational database sy…   11486   [OK]
bitnami/postgresql          Bitnami PostgreSQL Docker Image                 154                [OK]
circleci/postgres           The PostgreSQL object-relational database sy…   30
ubuntu/postgres             PostgreSQL is an open source object-relation…   19
bitnami/postgresql-repmgr                                                   18
rapidfort/postgresql        RapidFort optimized, hardened image for Post…   15

Pulling Images

In order to run containers, you will need to pull the image from a repository. This can be accomplished either by docker pull <image name> or docker run <image name>, which will automatically pull the image if it doesn’t exist locally. By default, pull gets the image tagged latest. Alternatively, you can specify a version by adding a colon, :, and a tag after the image name like this: docker pull <image>:4.2.0

docker pull postgres
Using default tag: latest
latest: Pulling from library/postgres
31b3f1ad4ce1: Pull complete
1d3679a4a1a1: Pull complete
667bd4154fa2: Pull complete
87267fb600a9: Pull complete
Digest: sha256:b0ee049a2e347f5ec8c64ad225c7edbc88510a9e34450f23c4079a489ce16268
Status: Downloaded newer image for postgres:latest
docker.io/library/postgres:latest

Removing and Pruning Images

Unfortunately, Docker doesn’t automatically remove images, so disk utilization tends to grow fairly quickly if not managed. Docker has three commands that help with cleanup: prune, rm and rmi. As part of normal maintenance, prune should run in cron every few weeks or once a month depending on how active the system is. Because prune asks for confirmation, a cron job needs the -f (--force) flag.

docker image prune – Deletes dangling images (untagged and not used by any container). Add -a to delete all unused images.
docker rm <container IDs> – Removes the listed containers from the system.
docker rmi <image ID> – Removes the image.

docker image prune
WARNING! This will remove all dangling images.
Are you sure you want to continue? [y/N] y
<none>                               <none>      48e3a3f31a48   10 months ago   999MB
<none>                               <none>      89108dc97df7   10 months ago   1.37GB
<none>                               <none>      26e43fa5dd7c   11 months ago   998MB
<none>                               <none>      b98d351f790b   11 months ago   1.37GB
<none>                               <none>      334a4df3c05a   11 months ago   998MB
<none>                               <none>      17c5a57654e4   11 months ago   1.37GB
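
For a scheduled cleanup, the prune call has to run without the confirmation prompt. The following is a minimal sketch of a Python wrapper that only builds the argument list; build_prune_command and its parameters are hypothetical names, and the 720h value is just an example of the until filter.

```python
from typing import List, Optional


def build_prune_command(all_unused: bool = False,
                        older_than: Optional[str] = None) -> List[str]:
    """Build a non-interactive `docker image prune` argument list."""
    cmd = ["docker", "image", "prune", "--force"]  # --force skips the y/N prompt
    if all_unused:
        cmd.append("--all")  # also remove unused images, not just dangling ones
    if older_than:
        # Only prune images created more than this long ago, e.g. "720h".
        cmd += ["--filter", f"until={older_than}"]
    return cmd


print(" ".join(build_prune_command(older_than="720h")))
# To run it for real (requires Docker on the host), pass the list to
# subprocess.run(..., check=True) after importing subprocess.
```

Keeping the command as a list avoids shell quoting issues when it is handed to subprocess.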

Please check out the other container articles here.

Passwords with non-standard Characters in JSON using Python

Python and JSON

I had a requirement that passwords could contain a backslash, \, in an API call with JSON. However, when attempting to load the credentials with json.loads, Python would throw this exception:

Expecting value: line 1 column 34 (char 33)

Not surprisingly, a solution wasn’t found on Stack Overflow or through any internet searches. I’m guessing the reason is that \ is a reserved character in JSON, similar to Python. Unfortunately, that didn’t matter as the requirements were already set and accepted, so I needed to find a fix. I attempted the following:

Escape the \ with two backslashes like this: \\.
Different quotes: ' and ''.
Encapsulating the quotes like '" and "'.
Using strict=False for json.loads.
1. Example: json.loads(json_creds, strict=False)
2. This was the most cited workaround I found, but it never worked with the backslash. json.loads would throw the Expecting value exception every time.

However, none of that worked, mostly because, in all honesty, it shouldn’t. The reserved characters are there, just like in Python, for the language to function correctly. We wouldn’t add an @ in a method or function name for the same reason we shouldn’t add \ in passwords for JSON. I’m digressing a bit – back to how to work around this.

I found that if the password is encoded using json.dumps first, and then inserted into the JSON body of the request, it worked perfectly.

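import json

username = "sample_user"  # placeholder value for illustration
password = r"This.\Sample"  # raw string keeps the backslash literal
encoded_pw = json.dumps(password)  # produces "This.\\Sample", quotes included
JSON_DATA = "{\"username\": \"" + username + "\", \"password\":" + encoded_pw + "}"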
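
A related sketch, assuming the whole credentials object can be built in Python first: letting json.dumps serialize the full dictionary escapes the backslash in the same way and avoids assembling the JSON by hand. The username and password values here are placeholders.

```python
import json

username = "sample_user"     # hypothetical credentials for illustration
password = r"This.\Sample"   # raw string: one literal backslash

# json.dumps escapes the backslash and adds all quoting.
payload = json.dumps({"username": username, "password": password})
print(payload)

# Round trip: decoding gives back the original password, backslash intact.
decoded = json.loads(payload)
print(decoded["password"] == password)
```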

For other Python-related articles, please check out the other Python articles here.

Recursion with Python

I had a bug that was difficult to track down. I had a nested list from which I removed some of the elements using the remove() method; however, not all of the elements were removed. In fact, the code was removing only every other element. The bug turned out to be that each call to remove() shortened the list by one while the for loop kept advancing its index, which caused a skipping effect. For example:

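import re

some_lst = [['a','b'],['c','d'],['e','f']]
regexToMatch = '[a-z]'  # placeholder pattern for illustration

for i in some_lst:
  # Bug: removing while iterating shifts the remaining elements left,
  # so the loop skips the element that follows each removal.
  if re.match(regexToMatch, i[0]):
     some_lst.remove(i)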
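
To make the skipping visible, this small sketch records which elements the loop actually visits. The pattern is a hypothetical one that matches the first item of every pair, so in principle every element should be removed.

```python
import re

# Hypothetical pattern for illustration; it matches the first item of every pair.
pattern = re.compile(r'[a-z]')

some_lst = [['a', 'b'], ['c', 'd'], ['e', 'f']]
visited = []
for i in some_lst:
    visited.append(i)           # record what the loop really looked at
    if pattern.match(i[0]):
        some_lst.remove(i)      # shortens the list while the index moves on

print(visited)   # the pairs the loop visited
print(some_lst)  # what is left after the loop
```

The loop never visits ['c', 'd'], so it is the one element left in the list.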

I needed to rewind the list if remove() was called. At first, I thought this would be perfect for recursion. As it turned out, with the amount of data I needed the algorithm to search through, recursion was not the best solution.

A recursive algorithm is one that calls itself. These functions will additionally have a decrementing counter or some other condition that ends the recursive calls. Having an exit condition is required since the function calls itself; otherwise, the function keeps calling itself until Python raises a RecursionError or the system runs out of memory. Recursion can be concise and fast, but it is not suited for all conditions. For small datasets recursion performs well, but as the data grows, performance decreases quickly. This is because every function call adds a new stack frame, which consumes memory and adds call overhead. The code below worked great on a small subset of data, but did not scale to what was needed for the project.
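
As a minimal sketch of the exit condition and of what happens when the call chain gets too deep, consider the toy function below. countdown_sum is a hypothetical example and is unrelated to the project code.

```python
import sys


def countdown_sum(n: int) -> int:
    """Sum the integers from n down to 0 recursively."""
    if n <= 0:                        # exit condition: stops the chain of calls
        return 0
    return n + countdown_sum(n - 1)   # each call works on a smaller n


small = countdown_sum(10)
print(small)
print(sys.getrecursionlimit())        # CPython's cap on call depth

try:
    countdown_sum(10**6)              # far deeper than the default limit allows
    too_deep = False
except RecursionError as err:
    too_deep = True
    print(f"RecursionError: {err}")
```

In CPython the practical limit usually shows up as a RecursionError well before memory runs out, which is one reason deep recursion over large datasets does not scale.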

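import re

def remove_processes(the_list: list):
    # exclude_processes is assumed to be an iterable of regex patterns
    # defined elsewhere (for example, lines read from a file).
    for (hostname, process) in the_list:
        for line in exclude_processes:
            try:
                if re.search(line.strip(), process):
                    the_list.remove([hostname,process])
                    # Start over on the shortened list instead of
                    # continuing to iterate over a list that changed.
                    return remove_processes(the_list)
            except ValueError:
                continue
    return the_list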

Eventually, I replaced the for loop with a while loop and a counter. When a match is found, count is reduced by 1 so every element in the_list is evaluated. This also removes any duplicates.

def remove_processes(the_list):
    # exclude_process is assumed to be a regex pattern defined elsewhere.
    count = 0
    while count <= len(the_list) - 1:
        hostname = the_list[count][0]
        process = the_list[count][1]
        count += 1
        try:
            if re.search(exclude_process, process):
                the_list.remove([hostname, process])
                # reduce the count by 1 which resolves skipping
                # every other element
                count -= 1
        except ValueError:
            continue

    return the_list
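
Another option, sketched here as an alternative rather than what the project used, is to build a new list instead of removing items in place, which sidesteps the index bookkeeping entirely. filter_processes, exclude_patterns and the sample data are hypothetical names and values.

```python
import re
from typing import List


def filter_processes(the_list: List[list], exclude_patterns: List[str]) -> List[list]:
    """Return a new list without entries whose process matches any exclude pattern."""
    compiled = [re.compile(p.strip()) for p in exclude_patterns]
    return [
        [hostname, process]
        for hostname, process in the_list
        if not any(c.search(process) for c in compiled)
    ]


# Hypothetical sample data.
procs = [["web01", "sshd"], ["web01", "nginx"], ["db01", "sshd"], ["db01", "postgres"]]
kept = filter_processes(procs, ["sshd", "cron"])
print(kept)
```

The original list is left untouched, so the caller decides whether to replace it with the filtered result.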

Using Recursion with JSON

Here is another example of recursion. recurs() searches for the key k in a JSON document. If k isn’t found, it continues to the next element until all elements are searched. Wherever k is found, the function updates its value to “changed”. I used this function to remove passwords from a JSON document so it could be sent externally.

def recurs(k: str, json_doc: dict):
    """
    Takes a key, k, searches through json_doc recursively and
    replaces the value of k with 'changed' wherever it is found.
    :param k: key to search for
    :param json_doc: JSON / dictionary to search through for k
    :return: json_doc, modified in place
    """
    for key, value in json_doc.items():
        if key == k:
            # Found the key: overwrite its value.
            json_doc[key] = 'changed'
        elif isinstance(value, dict):
            # Descend into nested dictionaries.
            recurs(k, value)
        elif isinstance(value, list):
            # Descend into any dictionaries stored inside lists.
            for item in value:
                if isinstance(item, dict):
                    recurs(k, item)
    return json_doc
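
When the original document must stay intact, for example because it is still needed internally, a variation is to redact a deep copy. This is a sketch under that assumption; redact, walk and the sample document are hypothetical names and data.

```python
import copy


def redact(doc, key: str, replacement: str = "changed"):
    """Return a deep copy of doc with every value stored under `key` replaced."""
    result = copy.deepcopy(doc)

    def walk(node):
        if isinstance(node, dict):
            for name in node:
                if name == key:
                    node[name] = replacement
                else:
                    walk(node[name])
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(result)
    return result


# Hypothetical document for illustration.
original = {
    "user": "admin",
    "password": "secret",
    "hosts": [{"name": "db01", "password": "hunter2"}],
    "nested": {"inner": {"password": "p@ss"}},
}
cleaned = redact(original, "password")
print(cleaned)
```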