Project 1: Baby Names


In this project, we’ll get a peek at this kind of data manipulation/programming.

What to Hand In
You will hand in a single file with your Python code. Your file should be structured like the Word Ladder example we have been working on in class; it should be carefully commented, and the functions should contain appropriate docstrings. In short, think:

    # 22c16 Project 1: Baby Names
    # <your name goes here>
    # <your section goes here>

    any import statements...

    any function definitions...

    anything you want to happen when file is loaded...

This is a very challenging project. It will take a lot of thought and effort. But if you start it right away and work carefully through it using the structure provided in this handout, I am confident most of you will be able to successfully complete it.

The Data
We’ll be using a dataset of 133 years (1880-2013) of baby names available from the US Census bureau. You will find a version of this dataset for download on the ICON website (file names.csv). The data consist of 1,792,091 lines or records for 92,599 distinct baby names. The first ten records look like:

    1951,Theora,F,6
    1993,Dustin,M,6308
    2010,Marcos,M,1027
    1926,Faye,M,15
    1986,Ilona,F,16
    1993,Kayla,F,15448
    1994,Kemi,F,6
    1941,Arlyn,F,6
    1960,Lynette,F,1212
    1922,Zelia,F,18

where each record contains a birth year, a name, a gender and the number of new babies given that name in that year (any baby name having fewer than five birth records in a given year is excluded for privacy reasons). Values or fields within records are separated by commas, and the records themselves are not in any particular order.

Note that, for testing purposes, I’ve also prepared a much shorter file of 134,615 records consisting of only three years of data, 2010-2013 (file names_sample.csv). Using this shorter dataset will make your debugging go a little bit faster, since you won’t be repeatedly reading in the whole dataset.
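The record layout described above (year, name, gender, count, separated by commas) could be parsed along the following lines. This is only a sketch: parse_records is a made-up helper name rather than one of the required functions, and it reads from an in-memory string holding three of the sample records instead of opening names.csv.

```python
import csv
import io

# Three records copied from the sample above; a real run would open the
# data file instead of using an in-memory string.
SAMPLE = """1951,Theora,F,6
1993,Dustin,M,6308
2010,Marcos,M,1027
"""

def parse_records(lines):
    """Turn an iterable of CSV lines into [year, name, gender, count] lists.

    parse_records is a hypothetical helper name used for illustration only.
    """
    records = []
    for row in csv.reader(lines):
        if not row:                      # skip blank lines
            continue
        year, name, gender, count = (field.strip() for field in row)
        # Year and count become ints; name and gender stay strings.
        records.append([int(year), name, gender, int(count)])
    return records

records = parse_records(io.StringIO(SAMPLE))
print(records)
```

Because csv.reader accepts any iterable of lines, the same helper would work unchanged on an open file object.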

[Update: I’ve prepared a still smaller version of the file, names_tiny.csv, consisting of 536 records corresponding to all entries for Mary, Alice, Charles and Patrick in the original set.]

readNames()
The first function you will write is:

    def readNames(infile='names.csv'):

This function should read in all records and return a list of lists, where each sublist corresponds to a record. The first few records would look like:

    [[1951, 'Theora', 'F', 6], [1993, 'Dustin', 'M', 6308], ...]

Note that it is fine to assume that there will be no replicated or conflicting records in the input data; also, note the types of each record element. Although not strictly necessary for this assignment, you may wish to explore the Python csv library, which provides a nice set of methods for reading and writing comma-separated values files, which can otherwise be quite complex, especially if fields may themselves contain commas or spaces.

nameIndex() and yearIndex()
Once the data have been read from the file, we will construct two dictionaries to help make access to various views of the data more convenient. You will implement two functions, both of which take the product of readNames() as input. The first function:

    def nameIndex(names):

produces a dictionary that uses the baby name as a key, with each value consisting of a list of tuples (year, male, female). So, for example, part of this dictionary might look like:

    {..., 'Alberto': [..., (1950, 239, 6), (1951, 223, 0), ...], ...}

Note that the tuples within each name are sorted by birth year; this will be important later. The second function:

    def yearIndex(names):

produces another dictionary that uses the year as the key and has a second, embedded dictionary as the value (the embedded dictionary uses name as the key and has a tuple (male, female) as its value). That is:

    {...
        1950: {..., 'Alberto': (239, 6), 'Erica': (0, 117), ...},
        ...
    }

Access Functions
Now that we have the data structures out of the way, we’re going to write some functions to access these data in meaningful ways. The first function:

    def getBirthsByName(name, gender=None, start=None, end=None, interval=None):

returns a list of (year, value) tuples that depend on the combination of arguments provided.

(1) If only the name is given (gender and year are allowed to default), then the value returned for each year is the total number of births for that name, and the list covers all years.
(2) If gender is provided (where gender will be specified as 'm' or 'f') then the values returned will be restricted to that gender only.
(3) If start, end and interval are provided, the results will span only the specified years. Note that end requires start to be specified, and interval requires both start and end to be specified. Think of these arguments as inspired by the range() function.

The second access function:

    def getNamesByYear(N=None, pattern=None, gender=None, start=None, end=None, interval=None):

returns a list of tuples (name, count), sorted in descending order by count, of the N (if specified) most popular names (else the entire list). If pattern, a string, is specified, then the results are filtered to contain only names that contain the given pattern (independent of case). The other optional parameters also filter the results, and are treated as in getBirthsByName().

Digression: Plotting Results with matplotlib.pyplot
For this part of the assignment, we’ll need the matplotlib module for Python 3. Your miniconda installation does not include matplotlib by default, so you will need to use the conda command to install it.

Linux
Open a terminal window and type conda; if the system responds conda: command not found you will need to locate the miniconda3 directory (probably in your home directory, ~/miniconda3) and then cd to that directory. Either way, you’re now ready to type bin/conda install matplotlib, which installs the software we need for this assignment.

Windows
Open the command prompt (command.com) and type conda. If the system doesn’t respond with a long usage message, you will need to cd to the directory where miniconda is installed (probably C:\miniconda3 or perhaps in your own user directory). Either way, you now type "conda install matplotlib" (this will take a few minutes) to install the software.

Mac
Open a terminal window and type conda followed by enter. If you see usage: conda ..., then simply type conda install matplotlib to install the software (this will take a few minutes). If you don’t see the usage message, you did not modify your path variable. In the terminal, type cd miniconda3/bin followed by enter, then conda install matplotlib.

For this assignment, we’re mostly interested in name trends by year, so our plots will be pretty simple. Here, I will give some sample Python code for using matplotlib in this fashion, so that you have something to base your work on. Consider the number of male births for the name 'Alberto' for every year I have data. The data look like:

    alberto = [(1882, 6), (1886, 5), ..., (2011, 660), (2012, 610), (2013, 581)]

There are a total of 123 years of data; you might recognize these data as being in the format returned by getBirthsByName(). Now imagine that I would like to plot these values in a single window. Here’s simple code to do so:

    import matplotlib.pyplot as plt
    plt.title('Births by Year')
    plt.xlabel('Year')
    plt.ylabel('Births')
    plt.plot([y for (y, v) in alberto], [v for (y, v) in alberto])
    plt.show()

producing:

[Figure: line plot titled "Births by Year"; x-axis: Year, y-axis: Births]

Now let’s say I have a similar dataset for Patrick and want to plot both of these together; also, I’d like the plot to be a little more sophisticated in terms of color and legends, etc.:

    import matplotlib.pyplot as plt
    plt.title('Births by Year')
    plt.xlabel('Year')
    plt.ylabel('Births')
    plt.plot([y for (y, v) in alberto], [v for (y, v) in alberto], 'r--', label='Alberto (m)')
    plt.plot([y for (y, v) in patrick], [v for (y, v) in patrick], 'g-', label='Patrick (m)')
    plt.legend(loc=2)
    plt.show()

where 'r--' means "red dashed line" and 'g-' means "green solid line."

[Figure: line plot titled "Births by Year"; x-axis: Year, y-axis: Births; legend: Alberto (m), Patrick (m)]

There are lots more options, but the beauty of this system is that you needn’t really worry about them, as the system generally does the "right thing" on its own. Notice, for example, how the plotting system adjusts the scale of the axes automatically so that you don’t have to!

There are many other types of plots available besides the kind of line graphs we’ve seen up to now. For example, assume I have another dataset that corresponds to the most popular girls’ names in a specific year:

    girls = [('Sophia', 85720), ('Isabella', 79238), ..., ('Zoe', 24887)]

Again, you may recognize this as the kind of result that is returned by getNamesByYear(). To display these data in a horizontal bar chart format:

    plt.title('Births by Name')
    plt.xlabel('Births')
    plt.yticks(range(len(girls), 0, -1), [n for (n, t) in girls])
    plt.barh(range(len(girls), 0, -1), [t for (n, t) in girls])
    plt.show()

which yields:

[Figure: horizontal bar chart titled "Births by Name"; x-axis: Births; bars from top to bottom: Sophia, Isabella, Emma, Olivia, Ava, Emily, Mia, Chloe, Lily, Zoe]

Note that if I want to display more than one graph in a sequence, I may need to clear out the previous figure’s configuration; to do so, I would use plt.clf(), for "clear figure." For more info about pyplot, and plot() in particular, see the matplotlib documentation, where you will find color and format options to satisfy all your plotting needs.
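Both access functions return lists of tuples, while plot() and barh() want separate sequences. As a small sketch of one way to bridge the two (the helper names split_pairs and bar_positions are made up for illustration and are not part of the assignment), the list comprehensions used in the plotting examples could be wrapped up like this:

```python
def split_pairs(pairs):
    """Split [(a1, b1), (a2, b2), ...] into ([a1, a2, ...], [b1, b2, ...])."""
    firsts = [a for (a, b) in pairs]
    seconds = [b for (a, b) in pairs]
    return firsts, seconds

def bar_positions(count):
    """Return y positions count, count-1, ..., 1 so the first bar is drawn on top."""
    return list(range(count, 0, -1))

# A few sample values taken from the examples in this handout.
alberto = [(2011, 660), (2012, 610), (2013, 581)]
girls = [('Sophia', 85720), ('Isabella', 79238), ('Zoe', 24887)]

years, births = split_pairs(alberto)      # x and y values for a line plot
names, totals = split_pairs(girls)        # tick labels and bar lengths
positions = bar_positions(len(girls))     # where each bar goes on the y axis

print(years, births)
print(positions, names, totals)
```

With these in hand, the line plot call would become plt.plot(years, births), and the bar chart would use plt.yticks(positions, names) followed by plt.barh(positions, totals).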

User Interaction
Now that we have all the components, we’re ready to bring this all together. You’re first going to write a top-level function:

    def names(infile='names.csv'):

that supports interactive exploration of our baby name dataset. The function should start by reading in the names from a file and creating the appropriate data structures. It should then prompt the user for a command (a single letter, q, p, c, b, r, s or x), execute the command, and then repeat until instructed to stop. A simple session might look like:

    >>> names()
    names> q Alberto 1886
    [(1886, 5)]
    names> q Patrick m 1880 1885 2
    [(1880, 248), (1882, 249), (1884, 222)]
    names> x

where the first q query asks the system for the total number of births named Alberto of either gender in 1886, and the second query asks for the number of (male) Patricks born in even-numbered years between 1880 and 1885.

Each command typed to the interpreter consists of a single letter, q, p, c, b, r, s or x, where q stands for query, p stands for plot, c stands for count, b stands for bar chart, s stands for show, r stands for reset, and x stands for exit.

Each q or p command is followed by a name, an (optional) gender identifier ('m' or 'f'), an (optional) start year, an (optional) end year, and an (optional) increment. If the command is a q, the data corresponding to the query are simply printed; if the command is a p, the data corresponding to the query are queued for display.

Each c or b command is followed by a pattern, an (optional) gender identifier ('m' or 'f'), an (optional) start year, an (optional) end year, and an (optional) increment. If the command is a c, the data corresponding to the count are simply printed; if the command is a b, the data corresponding to the query are queued for display.

The r command clears any data that is queued for display, while the s command pops up a graphics window with the corresponding data in it.
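The command layout just described (a letter, a name or pattern, then an optional gender and up to three optional integers) could be pulled apart roughly as follows. This is a sketch under my own assumptions: parse_command is a hypothetical helper name, and returning None for every missing optional field is one possible convention, not a requirement of the assignment.

```python
def parse_command(line):
    """Split one command line into (command, target, gender, start, end, interval).

    Hypothetical helper for illustration; missing optional fields come back
    as None, and an empty line yields None.
    """
    tokens = line.split()
    if not tokens:
        return None
    command, rest = tokens[0].lower(), tokens[1:]
    # The first argument, if any, is the name (q, p) or the pattern (c, b).
    target = rest.pop(0) if rest else None
    gender = None
    if rest and rest[0].lower() in ('m', 'f'):
        gender = rest.pop(0).lower()
    # Whatever remains should be up to three integers: start, end, interval.
    years = [int(tok) for tok in rest[:3]]
    years += [None] * (3 - len(years))
    start, end, interval = years
    return (command, target, gender, start, end, interval)

print(parse_command('q Alberto 1886'))
print(parse_command('q Patrick m 1880 1885 2'))
```

Note that int() raises ValueError on a non-numeric year, so a complete solution would want to catch that and report the problem to the user rather than crash.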

Note that you will have to be careful, because data queued by p and b commands are not compatible; the former would produce a graph while the latter would produce a bar chart. Attempting to mix the two types of data together should result in an appropriate warning being issued to the user. The x command exits the interactive session.

A sample session with graphical output might look like this:

    >>> names()
    names> r
    names> p Alberto
    names> p Britney f 1968 2013
    names> s
    names> x

which might produce the following graph (depending on your solution’s defaults for displaying different line types; here, I’ve used the 'bo' directive to portray Britney’s data as blue dots):

[Figure: plot titled "Births by Year"; x-axis: Year, y-axis: Births; legend: Alberto (m), Britney (f)]
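The rule that p and b data cannot share a display might be tracked with a small queue object like the one below. This is only one possible design, sketched under my own assumptions; the class name DisplayQueue and its methods are hypothetical, and the handout does not prescribe any particular structure.

```python
class DisplayQueue:
    """Holds data queued by p (line plot) or b (bar chart) commands.

    Hypothetical helper for illustration; not a required part of the project.
    """

    def __init__(self):
        self.kind = None      # 'p', 'b', or None while the queue is empty
        self.items = []       # list of (label, data) pairs awaiting display

    def add(self, kind, label, data):
        """Queue data; refuse and warn if it would mix the two plot types."""
        if self.kind is not None and kind != self.kind:
            print("Warning: cannot mix p and b data; use r to reset first.")
            return False
        self.kind = kind
        self.items.append((label, data))
        return True

    def reset(self):
        """Clear everything, as the r command would."""
        self.kind = None
        self.items = []

queue = DisplayQueue()
queue.add('p', 'Alberto', [(1886, 5)])
mixed = queue.add('b', 'girls', [('Sophia', 85720)])   # prints the warning
print(mixed, len(queue.items))
```

The s command would then look at queue.kind to decide between a line plot and a bar chart before drawing each queued item.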