Section 15: Advanced Python Modules

  Python Bootcamp 0 to Hero

Collections Module

https://www.udemy.com/complete-python-bootcamp/learn/lecture/3512826?start=0#questions

The collections module is a built-in module that implements specialized container datatypes, providing alternatives to Python’s general-purpose built-in containers.  We’ve already discussed several of the basics: dict, list, set and tuple.

Counter

Counter counts the number of times each item appears in a list (or any other iterable).

Counter is a dict subclass that helps count hashable objects.  Elements are stored as dictionary keys and their counts are stored as the values.

from collections import Counter
mylist = [1,1,1,1,2,2,3,3,3,4,4,2,2,5,1,5,2,5,4,6,6,7]
Counter(mylist)

Counter({1: 5, 2: 5, 3: 3, 4: 3, 5: 3, 6: 2, 7: 1})

s = 'Mississippi'
Counter(s)

Counter({'M': 1, 'i': 4, 's': 4, 'p': 2})

Count the word repetitions in a sentence

s = "How many Up times does times show times uP in show this UP sentence"
Counter(s.lower().split())

Counter({'how': 1,
         'many': 1,
         'up': 3,
         'times': 3,
         'does': 1,
         'show': 2,
         'in': 1,
         'this': 1,
         'sentence': 1})

Common methods

s = "How many Up times does times show times uP in show this UP sentence"
words = s.lower().split()
c = Counter(words)

.most_common(n)

c.most_common(2)

[('up', 3), ('times', 3)]

To get the least common elements, slice the full list with a negative step:
.most_common()[:-n-1:-1]

c.most_common()[:-2-1:-1]

[('sentence', 1), ('this', 1)]

list()

Shows unique elements

list(c)

['how', 'many', 'up', 'times', 'does', 'show', 'in', 'this', 'sentence']

sum(c.values())

Total of all counts

sum(c.values())

14
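
The Counter methods above can be combined into one small helper. This is only a sketch: the function name word_report and its parameter n are made up for illustration and are not part of the collections module.

```python
from collections import Counter

def word_report(sentence, n=2):
    """Return (most common, least common, total words) for a sentence.

    word_report is a hypothetical helper, not a library function.
    """
    counts = Counter(sentence.lower().split())
    ranked = counts.most_common()      # every item, highest count first
    most = ranked[:n]                  # first n entries
    least = ranked[:-n - 1:-1]         # last n entries, walked backwards
    total = sum(counts.values())       # total of all counts
    return most, least, total

s = "How many Up times does times show times uP in show this UP sentence"
most, least, total = word_report(s)
print(most)    # [('up', 3), ('times', 3)]
print(least)   # [('sentence', 1), ('this', 1)]
print(total)   # 14

# A key that was never counted returns 0 instead of raising KeyError
print(Counter('Mississippi')['z'])   # 0
```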

defaultdict

https://www.udemy.com/complete-python-bootcamp/learn/lecture/3512824#questions

defaultdict is a dictionary-like object that provides all the methods of a regular dictionary, but takes a first argument (default_factory) that supplies the default value for missing keys.  Using defaultdict is faster than doing the same thing with the dict.setdefault method.

With a default factory set, looking up a missing key with d[key] never raises a KeyError.  Any key that does not exist gets the value returned by the default factory.

from collections import defaultdict
d = {}
d['one']

KeyError: 'one'

d = defaultdict(object)
d['one']
for item in d:
    print(item)

one

Set the default value to 0 using a lambda

d = defaultdict(lambda: 0)
d['two']

0

d['three']=3
d['three']

3
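
A common use of the default factory is grouping items. The sketch below uses list as the factory and compares it with the dict.setdefault approach mentioned earlier; the word list is made-up sample data.

```python
from collections import defaultdict

words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']  # sample data

groups = defaultdict(list)          # a missing key starts as an empty list
for word in words:
    groups[word[0]].append(word)    # no need to check whether the key exists

print(dict(groups))
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}

# The same grouping with a plain dict calls setdefault on every access
plain = {}
for word in words:
    plain.setdefault(word[0], []).append(word)

print(plain == groups)              # True

# Membership tests do not create keys; only d[key] lookups do
print('z' in groups)                # False
```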

Ordered Dictionaries OrderedDict

https://www.udemy.com/complete-python-bootcamp/learn/lecture/3779906#questions

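Before Python 3.7, standard dict objects did not guarantee any particular order, so the loop below could print the keys in an arbitrary order, as in the output shown.  Since Python 3.7, dict preserves insertion order as part of the language.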

d = {}
d['a']=1
d['b']=2
d['c']=3
d['d']=4
d['e']=5
for k,v in d.items():
    print( k,v)

a 1
b 2
e 5
d 4
c 3

from collections import OrderedDict
d = OrderedDict()
d['a']=1
d['b']=2
d['c']=3
d['d']=4
d['e']=5
for k,v in d.items():
    print( k,v)

a 1
b 2
c 3
d 4
e 5

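With OrderedDict, order matters when comparing.  Two regular dicts with the same items are equal regardless of insertion order:

d1={'a':1, 'b':2}
d2={'b':2, 'a':1}
d1 == d2

True

The same comparison returns False for two OrderedDict objects, because the order of the keys would not match.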
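
A quick check of the OrderedDict comparison, reusing the same two key orders as d1 and d2 (the names od1 and od2 are just for this sketch):

```python
from collections import OrderedDict

od1 = OrderedDict([('a', 1), ('b', 2)])
od2 = OrderedDict([('b', 2), ('a', 1)])

print(od1 == od2)                 # False: same items, different order
print(dict(od1) == dict(od2))     # True: plain dicts ignore order
print(od1 == {'b': 2, 'a': 1})    # True: OrderedDict vs. plain dict ignores order
```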

namedtuple

https://www.udemy.com/complete-python-bootcamp/learn/lecture/3512830#questions

A namedtuple is similar to a lightweight class whose values can be accessed both by name and by index.

Field names are passed as a single string, with a space between each name.

from collections import namedtuple
Dog = namedtuple('Dog','age breed name')
sam = Dog(age=2, breed='Lab', name='Sammy')
print(sam.breed)
print(sam[0])

Lab
2
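
Because a namedtuple is still a tuple, it unpacks like one and cannot be changed in place. The sketch below shows unpacking plus the _asdict and _replace methods that every namedtuple class provides; the variable older is just an illustrative name.

```python
from collections import namedtuple

Dog = namedtuple('Dog', 'age breed name')
sam = Dog(age=2, breed='Lab', name='Sammy')

age, breed, name = sam            # unpacks like a regular tuple
print(name)                       # Sammy

print(sam._asdict())              # field names mapped to their values

older = sam._replace(age=3)       # returns a new tuple; sam is unchanged
print(older)                      # Dog(age=3, breed='Lab', name='Sammy')
print(sam.age)                    # 2
```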

Datetime

https://www.udemy.com/complete-python-bootcamp/learn/lecture/3547908#questions

datetime.time

import datetime
# datetime.time(hour, minute, second, microsecond)
t = datetime.time(5, 25, 1)
print(t.hour)
print(t.minute)
print(t.second)
print(t.microsecond)
print(t.resolution)

5
25
1
0
0:00:00.000001

datetime.date

today = datetime.date.today()
print(today)

2019-06-17

today.timetuple()

time.struct_time(tm_year=2019, tm_mon=6, tm_mday=17, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=168, tm_isdst=-1)

d1 = datetime.date(2019, 6, 17)
d2 = d1.replace(year = 2018)
print(d2)

2018-06-17

Math on Dates

lastyear = today.replace(year = 2018)
print(today - lastyear)

365 days, 0:00:00
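
Subtracting two dates gives a datetime.timedelta, and a timedelta can also be added to or subtracted from a date. A short sketch using fixed dates so the results are reproducible (the variable names are illustrative):

```python
import datetime

start = datetime.date(2019, 6, 17)
one_week = datetime.timedelta(days=7)

print(start + one_week)           # 2019-06-24
print(start - one_week)           # 2019-06-10

gap = datetime.date(2019, 12, 25) - start
print(type(gap).__name__)         # timedelta
print(gap.days)                   # 191
```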

Python Debugger pdb

https://www.udemy.com/complete-python-bootcamp/learn/lecture/3547912#questions

The pdb debugger is a good tool for stepping through your code and seeing where errors occur.

import pdb
x = [1,3,5]
y = 4
z = 7
result = y +z
print(result)
result2 = x + y
print(result2)

11
TypeError: can only concatenate list (not "int") to list

Modify your code to stop just before the error and inspect the variables.  pdb.set_trace() pauses execution and opens an interactive (Pdb) prompt.
To quit the debugger, type q at the prompt.

import pdb
x = [1,3,5]
y = 4
z = 7
result = y +z
print(result)
pdb.set_trace()
result2 = x + y
print(result2)

11

(Pdb) x

[1, 3, 5]

Timing your Code

https://www.udemy.com/complete-python-bootcamp/learn/lecture/3547914#questions

The timeit module is useful for timing small sections of code when optimizing.  Running a snippet many times gives more reliable results than a single run.

The command is passed to the function as a string.

import timeit
# How long does it take to create the string: '0-1-2-3...99'
timeit.timeit('"-".join(str(n) for n in range(100))', number = 10000)

0.6442793710000387

# as a list comprehension
timeit.timeit('"-".join([str(n) for n in range(100)])', number=10000)

0.5642282049999494

# as a map function
timeit.timeit('"-".join(map(str, range(100)))', number=10000)

0.3922121389999802

To use Jupyter Notebook’s built-in magic function

%timeit "-".join(str(n) for n in range(100))

61.9 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit "-".join(map(str, range(100)))

39.4 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
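
Outside a notebook, timeit.repeat gives a similar "best of several runs" measurement. This is a rough sketch; the repeat and number values are arbitrary choices, and the timings will differ from machine to machine.

```python
import timeit

stmt = '"-".join(map(str, range(100)))'

# repeat() runs the whole timing loop several times; each entry in the
# returned list is the total seconds for `number` executions of stmt.
runs = timeit.repeat(stmt, repeat=5, number=1000)

best = min(runs)                  # the fastest run has the least interference
print(f"best of {len(runs)}: {best / 1000 * 1e6:.2f} µs per loop")
```

Taking the minimum rather than the mean is a common choice, since slower runs usually reflect other activity on the machine rather than the code itself.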

Regular Expressions – re

https://www.udemy.com/complete-python-bootcamp/learn/lecture/3547904#questions

These are a common interview question!

Regular expressions are text-matching patterns used for

  • finding repetition
  • matching patterns

Basic search

import re
re.search('hello','hello world')

<re.Match object; span=(0, 5), match='hello'>
Notice that the result is a Match object.

import re
patterns = ['term1', 'term2'] 
text = 'This is a string with term2, but not the other term'
for pattern in patterns:
    print(f"Searching for {pattern} in:\n{text}")
    if re.search(pattern, text):
        print("\nMatch found!\n")
    else:
        print("\nNo match found.\n")

Searching for term1 in:
This is a string with term2, but not the other term

No match found.

Searching for term2 in:
This is a string with term2, but not the other term

Match found!

Object type

No match = None (type NoneType)
Match = re.Match object

match = re.search(patterns[0],text)
type(match)

NoneType

match = re.search(patterns[1],text)
type(match)

re.Match

Methods on Match

match.start()

22

match.end()

27
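
Besides start() and end(), a Match object also offers group() for the matched text and span() for both positions at once. A small sketch reusing the text from above:

```python
import re

text = 'This is a string with term2, but not the other term'
match = re.search('term2', text)

if match:                         # re.search returns None when nothing matches
    print(match.group())          # term2
    print(match.span())           # (22, 27)
    print(text[match.start():match.end()])   # term2
```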

Splitting with Regular Expressions

splitterm = '@'
phrase = "Is your email address hello@gmail.com?"
re.split(splitterm, phrase)

['Is your email address hello', 'gmail.com?']

findall

re.findall('match', 'Here is one match and here is another match')

['match', 'match']

Using Meta Characters

def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: %r' %(pattern))
        print(re.findall(pattern,phrase))
        print('\n')

Repetition Syntax

There are five ways to express repetition in a pattern:

  • A pattern followed by the meta-character * is repeated zero or more times.
  • Replace the * with + and the pattern must appear at least once.
  • Using ? means the pattern appears zero or one time.
  • For a specific number of occurrences, use {m} after the pattern, where m is replaced with the number of times the pattern should repeat.
  • Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the pattern appears at least m times, with no maximum.

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = [ 'sd*',     # s followed by zero or more d's
                'sd+',          # s followed by one or more d's
                'sd?',          # s followed by zero or one d's
                'sd{3}',        # s followed by three d's
                'sd{2,3}',      # s followed by two to three d's
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']

Searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']

Searching the phrase using the re check: 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']

Searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']

Searching the phrase using the re check: 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd']
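
The open-ended {m,} form is listed above but not included in the test patterns. A short sketch on the same test phrase shows how it differs from the fixed {3} form:

```python
import re

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# s followed by two or more d's, with no upper limit
print(re.findall('sd{2,}', test_phrase))
# ['sddd', 'sddd', 'sddd', 'sdddd']

# s followed by exactly three d's: the last match stops after three
print(re.findall('sd{3}', test_phrase))
# ['sddd', 'sddd', 'sddd', 'sddd']
```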

Character Sets

(Think of a list without commas) [ab] = 'a' or 'b'
Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character sets. For example, the pattern [ab] searches for occurrences of either a or b. Let’s see some examples:

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = ['[sd]',    # either s or d
                's[sd]+']   # s followed by one or more s or d

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']

Searching the phrase using the re check: 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']

Exclusion

We can use ^ to exclude terms by incorporating it into the bracket syntax notation. For example: [^…] will match any single character not in the brackets. Let’s see some examples:

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'
re.findall('[^!.? ]+',test_phrase)

['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation', 'How', 'can', 'we', 'remove', 'it']
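
The exclusion pattern returns the words as a list. To get the sentence back as a single string without punctuation, one option is re.sub, which replaces every match with another string. This sketch swaps in re.sub as an alternative to the findall approach above:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every character in the set with an empty string
cleaned = re.sub('[!.?]', '', test_phrase)
print(cleaned)
# This is a string But it has punctuation How can we remove it

# Joining the findall results with spaces gives the same text
print(' '.join(re.findall('[^!.? ]+', test_phrase)) == cleaned)   # True
```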

Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is [start-end].

Common use cases are to search for a specific range of letters in the alphabet. For instance, [a-f] would return matches with any occurrence of letters between a and f.

Let’s walk through some examples:

test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

test_patterns=['[a-z]+',      # sequences of lower case letters
               '[A-Z]+',      # sequences of upper case letters
               '[a-zA-Z]+',   # sequences of lower or upper case letters
               '[A-Z][a-z]+'] # one upper case letter followed by lower case letters
                
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '[a-z]+'
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']

Searching the phrase using the re check: '[A-Z]+'
['T', 'L']

Searching the phrase using the re check: '[a-zA-Z]+'
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']

Searching the phrase using the re check: '[A-Z][a-z]+'
['This', 'Lets']
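
Ranges are not limited to letters. The [start-end] format works for digits too, as in this small sketch (the sample sentence is made up for illustration):

```python
import re

sample = 'Order 66 shipped to Bay 12b on 2019-06-17'   # made-up sample text

print(re.findall('[0-9]+', sample))        # sequences of digits
# ['66', '12', '2019', '06', '17']

print(re.findall('[A-Z][a-z]+', sample))   # capitalized words
# ['Order', 'Bay']
```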

Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

Code   Meaning
\d     a digit
\D     a non-digit
\s     whitespace (tab, space, newline, etc.)
\S     non-whitespace
\w     alphanumeric (letters, digits, and underscore)
\W     non-alphanumeric

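Escapes are indicated by prefixing the character with a backslash. A backslash must itself be escaped in a normal Python string, which makes patterns hard to read. Using raw strings, created by prefixing the string literal with r, eliminates this problem and maintains readability.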

Personally, I think this use of r to escape a backslash is probably one of the things that block someone who is not familiar with regex in Python from being able to read regex code at first. Hopefully after seeing these examples this syntax will become clear.

test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

test_patterns=[ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # alphanumeric characters
                r'\W+', # non-alphanumeric
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '\\d+'
['1233']

Searching the phrase using the re check: '\\D+'
['This is a string with some numbers ', ' and a symbol #hashtag']

Searching the phrase using the re check: '\\s+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Searching the phrase using the re check: '\\S+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']

Searching the phrase using the re check: '\\w+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']

Searching the phrase using the re check: '\\W+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
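
The doubled backslashes in the output are simply how %r displays a single backslash. The sketch below checks that a raw string and an escaped normal string are the same pattern. It also uses \b (word boundary), which is not in the table above, as an example where forgetting the r prefix changes the meaning.

```python
import re

print('\\d+' == r'\d+')           # True: both are backslash, d, plus
print(len(r'\d+'))                # 3

# In a normal string '\b' is a single backspace character;
# in a raw string r'\b' is the two-character regex word boundary.
print(len('\b'), len(r'\b'))      # 1 2

print(re.findall(r'\bterm\b', 'term1 term term2'))   # ['term']
print(re.findall('\bterm\b', 'term1 term term2'))    # []
```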

StringIO

https://www.udemy.com/complete-python-bootcamp/learn/lecture/3547906#questions

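The io.StringIO class implements an in-memory file-like object.  This object can then be used as input or output to most functions that expect a standard file object.  (In Python 2 it lived in a separate StringIO module.)

import io
message = "This is just a normal string."
f = io.StringIO(message)
print(f.read())                 # read the initial contents
f.write(" Second sentence.")    # the position is at the end, so this appends
f.seek(0)                       # rewind to the start
print(f.read())                 # read everything back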
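
As an example of passing the object to a function that expects a file, print() accepts any file-like object through its file parameter. This sketch uses io.StringIO, which is where the class lives in Python 3; the variable name buffer is arbitrary.

```python
import io

buffer = io.StringIO()

# print() writes to the in-memory buffer instead of the screen
print("first line", file=buffer)
print("second line", file=buffer)

buffer.seek(0)                    # rewind before reading
for line in buffer:               # iterate like an open text file
    print(line.rstrip())

print(repr(buffer.getvalue()))    # 'first line\nsecond line\n'
```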

 

 
