Counter and namedtuple

2015-09-01 20:22

Collections

Counter - For counting the number of things in a thing :)
namedtuple - For writing self-documenting code!

collections.Counter

A Counter is a dictionary subclass that maps each key to an integer count. It's very useful when you need to count how many times each element appears in a collection, for example how often each word occurs in a portion of text.

In [1]:

from collections import Counter, namedtuple
from pathlib import Path
from pprint import pprint
import requests
import csv

In [2]:

# Here we count the colors in a list
color_count = Counter(['red', 'blue', 'red', 'green', 'blue', 'blue', 'yellow'])
print('color_count:', color_count, '\n')

# Values can also be passed explicitly
color_count_2 = Counter(red=5, green=2, blue=7, orange=1)
print('color_count_2:', color_count_2, '\n')

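# Counter objects can also be combined with arithmetic and set-style operators

# Addition: sums the counts for each key
print('addition:', color_count + color_count_2, '\n')

# Subtraction: keeps only the keys whose resulting count is positive
print('subtraction:', color_count - color_count_2, '\n')

# Intersection: the minimum count for each key
print('intersection:', color_count & color_count_2, '\n')

# Union: the maximum count for each key
print('union:', color_count | color_count_2, '\n')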

color_count: Counter({'blue': 3, 'red': 2, 'yellow': 1, 'green': 1}) 

color_count_2: Counter({'blue': 7, 'red': 5, 'green': 2, 'orange': 1}) 

addition: Counter({'blue': 10, 'red': 7, 'green': 3, 'orange': 1, 'yellow': 1}) 

subtraction: Counter({'yellow': 1}) 

intersection: Counter({'blue': 3, 'red': 2, 'green': 1}) 

union: Counter({'blue': 7, 'red': 5, 'green': 2, 'orange': 1, 'yellow': 1})
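
A few other Counter behaviours are worth knowing alongside the operators above. The snippet below is a minimal sketch using a made-up `inventory` counter (not part of the original notebook): missing keys return 0, and the in-place `subtract()` method keeps negative counts, unlike the `-` operator.

```python
from collections import Counter

# 'inventory' is a hypothetical example, not data from the post
inventory = Counter(['red', 'blue', 'red'])

# A missing key returns 0 instead of raising KeyError
missing = inventory['purple']

# update() adds counts in place; subtract() removes them and may go negative
inventory.update(['blue', 'green'])
inventory.subtract({'red': 5})

# subtract() keeps the negative result, whereas the - operator would drop it
kept_negative = inventory['red']

# Unary plus returns a new Counter without zero or negative counts
positive_only = +inventory

print(missing)
print(kept_negative)
print(sorted(positive_only.items()))
```

This is why `color_count - color_count_2` above only shows yellow: every other colour ends up at zero or below and is dropped.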

In [3]:

# Let's do a word count on Dr. Seuss's Yertle the Turtle
r = requests.get('http://www.spunk.org/texts/prose/sp000212.txt')
# split() breaks the text on whitespace only, so case and punctuation are kept
yertle = Counter(r.text.split())

print('Most common 10 words:')
print('\n'.join(map(str, yertle.most_common(10))), '\n')

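print('Least common 10 words:')
# most_common() sorts from most to least frequent; the slice walks
# backwards from the end to pick the last 10 entries
print('\n'.join(map(str, yertle.most_common()[:-11:-1])))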

Most common 10 words:
('the', 70)
('of', 32)
('a', 30)
('And', 23)
('I', 21)
('king', 18)
('and', 17)
('he', 14)
('all', 14)
('that', 13) 

Least common 10 words:
('groan', 1)
('I!"', 1)
('burp', 1)
('Looked', 1)
('small.', 1)
('stay', 1)
('families', 1)
('back.', 1)
('seeing', 1)
('everything', 1)
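
In the output above, 'And' and 'and' are counted separately and punctuation stays attached to words like 'small.', because `str.split()` only splits on whitespace. One possible way to normalize the words first is sketched below. The `sample` string is a made-up stand-in for the downloaded text, and the regular expression is a simple assumption that treats runs of letters and apostrophes as words.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the downloaded poem
sample = 'And the king said, "I am the king!" and the turtles groaned.'

# Lowercase first, then keep runs of letters and apostrophes as words
words = re.findall(r"[a-z']+", sample.lower())
word_count = Counter(words)

print(word_count.most_common(3))
```

With this approach 'And' and 'and' land in the same bucket, at the cost of losing capitalization that might matter for proper nouns.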

collections.namedtuple

namedtuples allow you to write self-documenting code. They're most useful when you iterate over streams of data (like the rows in a CSV file) where you don't want to create plain tuples and refer to their values by numeric index, but where creating a dictionary for every row may be unnecessary. You can combine them with Counters to do cool stuff :)

In [4]:

Person = namedtuple('Person', ['name', 'age'])

amy = Person('Amy', 31)
print(amy, '\n')

bob = Person(name='Bob', age=17)
print(bob, '\n')

susan = Person(**{'name': 'Susan', 'age': 45})
print(susan, '\n')

people = [
    ('Aaron', 56),
    ('Wilfred', 89),
    ('Bertha', 2)
    ]

pprint(list(map(Person._make, people)))

Person(name='Amy', age=31) 

Person(name='Bob', age=17) 

Person(name='Susan', age=45) 

[Person(name='Aaron', age=56),
 Person(name='Wilfred', age=89),
 Person(name='Bertha', age=2)]
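
Since a namedtuple is still a tuple, it supports indexing and unpacking, and it can't be changed in place. Here is a small sketch of that, reusing the `Person` type from the cell above (the `older_amy` name is just for illustration):

```python
from collections import namedtuple

Person = namedtuple('Person', ['name', 'age'])
amy = Person('Amy', 31)

# Fields are readable by name, by index, or by unpacking
name, age = amy
same_value = amy.age == amy[1] == age

# Assigning to a field raises AttributeError because tuples are immutable
try:
    amy.age = 32
    mutated = True
except AttributeError:
    mutated = False

# _replace() returns a new Person and leaves the original untouched
older_amy = amy._replace(age=32)

print(same_value, mutated)
print(amy, older_amy)
```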

In [5]:

def clean_line(line):
    """
    Split a comma-separated header line and return field names
    that are more compatible with the creation of namedtuples.
    """
    def clean_word(string):
        # Trim whitespace and '?', then turn spaces and hyphens into underscores
        return string.strip('\n ?').replace(' ', '_').replace('-', '_')

    return [clean_word(word) for word in line.split(',')]

# This data is from the consumer complaints 
# dataset found on catalog.data.gov
complaints_file = Path('..', 'files', 'Consumer_Complaints.csv')
with complaints_file.open() as infile:
    header = clean_line(infile.readline())
    
    Complaint = namedtuple('Complaint', header)
    complaints = list(map(Complaint._make, csv.reader(infile)))
    
    issues = Counter([complaint.Issue for complaint in complaints])
    print('Most common issues:')
    pprint(issues.most_common(3))
    print()
    
    companies = Counter([complaint.Company for complaint in complaints])
    print('Companies with the most complaints:')
    pprint(companies.most_common(3))
    print()
    
    # namedtuples can be converted to dictionaries with their ._asdict() method
    first_complaint = complaints[0]
    print(first_complaint._asdict()['Issue'], '\n')
    
    # However, it's normally simpler to read the attribute directly
    print(first_complaint.Issue, '\n')
    
    # The ._fields attribute lists all the field names
    # of a namedtuple
    print('Complaint fields:')
    pprint(first_complaint._fields)

Most common issues:
[('Incorrect information on credit report', 881),
 ("Cont'd attempts collect debt not owed", 523),
 ('Loan modification,collection,foreclosure', 468)]

Companies with the most complaints:
[('Equifax', 526), ('Experian', 367), ('TransUnion', 327)]

Managing the loan or lease 

Managing the loan or lease 

Complaint fields:
('Complaint_ID',
 'Product',
 'Sub_product',
 'Issue',
 'Sub_issue',
 'State',
 'ZIP_code',
 'Submitted_via',
 'Date_received',
 'Date_sent_to_company',
 'Company',
 'Company_response',
 'Timely_response',
 'Consumer_disputed')
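
The header handling above splits the first line on commas by hand, which works for this file but would break on a quoted header that contains a comma or on a column name that isn't a valid Python identifier. A more defensive variation is sketched below. The in-memory CSV and the `Row` type are made up for illustration; `rename=True` is a real `namedtuple` option that replaces invalid field names with positional ones.

```python
import csv
import io
from collections import Counter, namedtuple

# Hypothetical in-memory data standing in for a real CSV file
raw = io.StringIO(
    'Complaint ID,Issue,"Company, Inc. name",class\n'
    '1,Billing,Acme,a\n'
    '2,Billing,Globex,b\n'
    '3,Fraud,Acme,a\n'
)

reader = csv.reader(raw)
# Reading the header through csv.reader handles quoted commas correctly
header = [name.strip().replace(' ', '_').replace('-', '_') for name in next(reader)]

# rename=True swaps invalid names (keywords, punctuation) for positional
# names such as _2 and _3
Row = namedtuple('Row', header, rename=True)
rows = [Row._make(fields) for fields in reader]

issues = Counter(row.Issue for row in rows)
print(Row._fields)
print(issues.most_common(1))
```

The trade-off is that renamed fields like `_2` are no longer self-documenting, so it may be better to clean up such column names explicitly when they matter.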

In [6]:

# This data is from the New York City Leading Causes of Death
# dataset found on catalog.data.gov
death_file = Path('..', 'files', 'New_York_City_Leading_Causes_of_Death.csv')
with death_file.open() as infile:
    header = clean_line(infile.readline())
    Death = namedtuple('Death', header)
    deaths = map(Death._make, csv.reader(infile))
    print('The most common cause of death is:')
    print(Counter([death.Cause_of_Death for death in deaths]).most_common(1))

The most common cause of death is:
[('CEREBROVASCULAR DISEASE', 120)]
