HW5: Tweet Wrangling (20 Points)
Due Friday 4/3/2020
Overview / Logistics
The purpose of this assignment is to give you practice with Python dictionaries on a very relevant example. You can start with the Twitter.py that we wrote last week and add methods to it. You will be loading in and examining the file trumpSinceElection.dat, which holds a list of Donald Trump's tweets since the 2016 election in dictionary form.
What to submit: When you are finished, you should submit a file Twitter.py to Canvas with the methods for each task, along with answers to the following as a comment on Canvas:
- Did you work with a buddy on this assignment? If so, who?
- Are you using up any grace points to buy lateness days? If so, how many?
- Approximately how many hours did it take you to finish this assignment? (I will not judge you for this at all... I am simply using it to gauge whether the assignments are too easy or too hard.)
- Your overall impression of the assignment. Did you love it, hate it, or were you neutral? One-word answers are fine, but if you have any suggestions for the future, let me know.
- Any other concerns that you have. For instance, if you have a bug that you were unable to solve but made progress on, write that here. The more clearly you articulate the problem, the more partial credit you will receive. (It is fine to leave this blank.)
JSON Alternative To Pickle
Some students have reported issues loading the list of dictionaries with pickle. Since it is just a list of dictionaries with text and numeric keys/values only, it is possible to use a simpler, more universal encoding known as JSON. Click here to download the JSON file. This link will likely open the JSON file in your browser, where you can explore the tweets. You will want to switch to "RAW" and save it to your hard drive as trumpSinceElection.json by right-clicking and choosing "Save file as". Then, you can load the file with this code:
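As a rough sketch (assuming you saved the file as trumpSinceElection.json in the same folder as Twitter.py), the loading step might look something like this:

    import json

    # Load the list of tweet dictionaries from the saved JSON file
    # (assumes trumpSinceElection.json sits next to Twitter.py)
    with open("trumpSinceElection.json", "r", encoding="utf-8") as fin:
        tweets = json.load(fin)

    print(len(tweets))        # how many tweets were loaded
    print(tweets[0].keys())   # peek at the fields of a single tweet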
The Problem
In class, we showed how to process Python dictionaries and saw that the Twitter API organizes tweets in dictionary form. In this assignment, you will be digging into Donald Trump's tweets since November 2016 to answer a few questions.
Part 1: The kth Most Popular Tweet (6 Pts)
In the video from last week, we showed how to find Trump's most popular tweet using numpy's argmax function (Click here to review that example). Numpy also has a function called argsort. Look at the documentation for this function, and use it to come up with Trump's kth most popular tweet, as measured by the number of retweets. Put your code in a method called find_kth_popular_tweet(tweets, k). This method should find and print out the dictionary for this tweet. For example, the code
should output
Tips
- You should play around with the argsort function using simple examples that you design by hand before you apply it to the more complicated scenario with tweets. By default, this method sorts things in ascending order; somehow, you will need to get them in descending order. (A sketch of one way to do this appears after these tips.)
- Be careful with zero-indexing. The 5th most popular tweet would really be at index 4 in a sorted list.
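For reference, here is one hedged sketch of how argsort could be wired up. It is not the official solution; the 'retweet_count' key is an assumption on my part (check the actual field names in your tweet dictionaries), and it treats k = 1 as the single most popular tweet.

    import numpy as np

    def find_kth_popular_tweet(tweets, k):
        # Gather the retweet counts into a numpy array
        # (assumes each tweet dictionary has a 'retweet_count' key)
        counts = np.array([tweet['retweet_count'] for tweet in tweets])
        # argsort is ascending, so reverse it to get most-to-least retweeted
        order = np.argsort(counts)[::-1]
        # k = 1 is the most popular tweet, hence the shift for zero-indexing
        print(tweets[order[k - 1]])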
Note for the curious
Since we only need the kth largest tweet, technically sorting everything is overkill. For those familiar, sorting N items can be accomplished in O(N log N) steps optimally. However, an operation known as a k-partition can be used to separate out the smallest k elements of a list in only O(N) time. One can use numpy's argpartition method to separate out the largest k in this fashion. Still, getting comfortable with argsort will help you in the next task.
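Purely as an illustration (not needed for the assignment), here is a small sketch comparing the two calls on a toy array:

    import numpy as np

    x = np.array([50, 10, 40, 20, 30])

    # argsort: a full sort of the indices, O(N log N)
    print(x[np.argsort(x)])       # [10 20 30 40 50]

    # argpartition: only guarantees the 2 largest values land in the last
    # 2 slots, which takes just O(N) time
    idx = np.argpartition(x, -2)
    print(x[idx[-2:]])            # the two largest values: 40 and 50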
Part 2: Top k Most Used Words (7 Pts)
Your next task is to loop through all of the tweets and print out the top k most commonly used words. Create a method get_k_most_popular_words(tweets, k) to do this. For instance,
should print out the following words in order:
1 the 2 to 3 and 4 of 5 a 6 in 7 is 8 rt 9 for 10 on 11 that 12 are 13 i 14 will 15 with 16 our 17 be 18 great 19 we 20 have
21 & 22 they 23 it 24 this 25 was 26 you 27 at 28 has 29 he 30 not 31 by 32 president 33 all 34 very 35 as 36 my 37 no 38 just 39 so 40 who
41 from 42 people 43 - 44 thank 45 their 46 democrats 47 but 48 his 49 trump 50 do 51 been 52 an 53 about 54 now 55 new 56 more 57 fake 58 big 59 or 60 what
61 get 62 would 63 many 64 news 65 if 66 than 67 never 68 out 69 there 70 american 71 should 72 up 73 your 74 u.s. 75 @realdonaldtrump 76 want 77 when 78 much 79 united 80 one
81 even 82 @realdonaldtrump: 83 time 84 america 85 being 86 me 87 make 88 were 89 like 90 going 91 good 92 can 93 only 94 which 95 must 96 house 97 impeachment 98 after 99 border 100 had
101 country 102 other 103 doing 104 don’t 105 because 106 media 107 back 108 nothing 109 over 110 into 111 vote 112 how 113 dems 114 state 115 am 116 republican 117 did 118 states 119 working 120 why
Tips
- Let's say, for the sake of argument, that I have a word_counts dictionary mapping each word to the number of times it occurs. If I pull the keys out into a list of all words and the values out into a corresponding numpy array of all of the counts, I can then argsort the counts and use that to pick out the top k words. (A rough sketch of this pattern appears below.)
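Here is a hedged sketch of that whole pattern. It is only a sketch: the 'text' key and the whitespace tokenization are assumptions, and its output may differ slightly from the list above depending on how you clean the text.

    import numpy as np

    def get_k_most_popular_words(tweets, k):
        # Count how often each (lowercased) word appears across all tweets;
        # splitting on whitespace is a simplification -- your tokenization
        # choices may change the results slightly
        word_counts = {}
        for tweet in tweets:
            for word in tweet['text'].lower().split():
                word_counts[word] = word_counts.get(word, 0) + 1
        # Parallel list of words and numpy array of counts, as in the tip above
        words = list(word_counts.keys())
        counts = np.array(list(word_counts.values()))
        # argsort is ascending, so take the last k indices and reverse them
        for rank, idx in enumerate(np.argsort(counts)[-k:][::-1]):
            print(rank + 1, words[idx])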
Part 3: COVID Tweets (7 Pts)
Make a function plot_coronavirus_timeline(tweets) that loops through all of the tweets in the database and picks out all of the tweets that mention "corona", "virus", or "covid" in the lowercase version of the 'text' key. Then, it should create a bar chart with one bar for each date on which these words were mentioned, where the height of the bar is the number of matching tweets on that particular day.
Since plotting labeled bar charts in matplotlib is not obvious, you may use the starter code below. You simply need to fill in the counts dictionary. You should use the provided get_tweet_date(tweet) function to create the keys for this dictionary. This function puts the dates into YYYY/MM/DD format, which ensures that alphabetical order matches chronological order.
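This is not the provided starter code, just a rough sketch of the counting portion. It assumes the provided get_tweet_date(tweet) helper is in scope, and the plotting lines at the end are only a bare-bones stand-in for the labeled bar chart code you are given.

    import matplotlib.pyplot as plt

    def plot_coronavirus_timeline(tweets):
        # Tally, for each date, how many tweets mention one of the keywords.
        # get_tweet_date is the provided helper (assumed in scope) that
        # returns a "YYYY/MM/DD" string for a tweet dictionary.
        counts = {}
        for tweet in tweets:
            text = tweet['text'].lower()
            if 'corona' in text or 'virus' in text or 'covid' in text:
                date = get_tweet_date(tweet)
                counts[date] = counts.get(date, 0) + 1
        # The provided starter code handles the labeled bar chart;
        # a bare-bones stand-in might look like this
        dates = sorted(counts.keys())
        plt.bar(range(len(dates)), [counts[d] for d in dates])
        plt.xticks(range(len(dates)), dates, rotation=90)
        plt.ylabel("Number of tweets")
        plt.tight_layout()
        plt.show()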
Tips
- To check if a string is contained in another string, simply use Python's in operator; for example, "corona" in text evaluates to True whenever "corona" appears anywhere in text.