What is this

A Python script that grabs content from The New York Times and Twitter based on the same keyword, extracted from a tweet. It aims to give a glimpse of the difference between traditional media and personal media.

Motivation and Inspiration

It is beyond doubt that social media has changed our world a great deal. It has changed not only the way people communicate and interact with each other, but also the way people explore and discover the world. It makes people more informed as well as more narrow-sighted. I have always found it fascinating to think about the different effects that personal media and traditional media have on our society and on ourselves. Moreover, what is happening while I am typing this post, this tweet, this status? I think every one of us should ask ourselves this question, given that being "connected" (to the internet) and being "disconnected" (from reality) can happen to us simultaneously. What kind of resonance might there be when we associate two different types of media? Does it make a lot of sense, or none at all? We'll see...

In the digital era, especially after the advent of the Internet, individualization has been highlighted ever more. That is one reason such a huge amount of digital information has been created on the internet, far surpassing the total amount of information recorded in human history before the digital era. However, every coin has two sides. A huge amount of "trash" information has been created, and is being created all the time, which was hardly the case earlier in human history. In my opinion, the "trash" is not trash at all; it is just, sometimes, too personalized.

Approach

I used the Twitter API and the New York Times API. The approach was to extract nouns from the tweets of a Twitter account as keywords, then search each keyword in the NYT Article Search API, which returns the headlines of articles that contain that keyword in their body.

I used TextBlob for Python for noun phrase extraction; however, like most common natural language processing (NLP) libraries, it does not work perfectly.
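
To illustrate the extraction step, here is a minimal sketch of how TextBlob's noun_phrases property is used (the sample sentence is made up):

from textblob import TextBlob

# noun_phrases returns the lowercased noun phrases detected in the text
blob = TextBlob("The Yankees won the World Series in New York last night.")
for phrase in blob.noun_phrases:
    print phrase   # typically something like: yankees, world series, new york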

Problems and Difficulties

The first problem I faced came from the limitations of the New York Times API. At first, I was hesitating between the Article Search API and the Most Popular API; both can return headlines. With the Article Search API, you can get articles from a rich archive covering any period from Sept. 18, 1851 to today. The catch is that the titles returned may barely ring a bell, because they are not necessarily significant news at all. Moreover, because it is an "article" search API, it searches the article body instead of the title, which means the headlines I got usually did not contain the keyword even though the body did.

The Most Popular API, on the other hand, has the advantage of returning the most-read headlines; the drawback is that it only covers, at most, the past 30 days. After weighing the two, I decided to use the Article Search API.
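
For reference, this is roughly what the two requests look like. The Most Popular path below is my recollection of the v2 docs at the time, so treat its exact shape as an assumption:

# Article Search: any date range back to 1851, searches article bodies
http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=KEYWORD&begin_date=19990101&end_date=20130101&api-key=****

# Most Popular: most-viewed articles, limited to the past 1, 7, or 30 days
http://api.nytimes.com/svc/mostpopular/v2/mostviewed/all-sections/30.json?api-key=****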

Last but not least, the New York Times API does not support limiting the keyword search to titles and headlines only (instead of article bodies). If I could get only the headlines that contain the keyword extracted from the tweets, the pairings would be better and make more sense literally.
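
Since the API cannot restrict the search itself, one workaround is to filter the returned documents on the client side and keep only the headlines that mention the keyword. A minimal sketch, using the same doc structure the script reads elsewhere (headline_matches is a hypothetical helper, not part of the API):

# keep only the headlines that actually contain the keyword
def headline_matches(docs, keyword):
    matches = []
    for doc in docs:
        headline = doc["headline"]["main"]
        if keyword.lower() in headline.lower():
            matches.append(headline)
    return matches

The obvious cost is that many keywords would come back with zero matching headlines, since the body-only matches get thrown away.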

The next problem was that I had to delete part of the string from each tweet I extracted. Every tweet in the Twitter account I was following and playing with in this project has the same format: [xxx joke] RT .xxx(author): "tweet's content". The "tweet's content" part is the only part I need, so what I had to do was delete the string before the ":". However, a new problem was waiting around the corner: Unicode. Each tweet_text from the Twitter API is unicode instead of a byte string, which means I had to convert it first. That was fine, I thought, because our instructor had shown us a way to do it. What I did not expect was that it did not work, so I googled around for alternatives. You can see the different methods I used for this problem in my code below.
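
Here is a condensed sketch of the two fixes together, normalizing the unicode text and then keeping only what follows the first colon (the sample tweet string is made up):

import unicodedata

tweet_text = u'[xxx joke] RT .someone: "the actual content"'
# normalize the unicode first so it behaves like plain text
detweet = unicodedata.normalize('NFKD', tweet_text)
# skip the colon and the space after it, keeping only the tweet content
if ':' in detweet:
    detweet = detweet[detweet.index(':') + 2:]
print detweet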

In addition, after testing PrfJocular, I found the script could not work perfectly with every Twitter account, because tweets from different accounts have different formats, such as tweets starting with odd symbols like # * ^, or uppercase/lowercase problems... so I just changed the logic in def newtweet.

In a nutshell, the experience I was trying to craft in this project was a chatting atmosphere; however, it turned out to be almost impossible. In my opinion, there are a couple of reasons. First of all, a conversation cannot exist without making sense, which is not what randomly generated poetry is for (it is more for fun, I think); secondly, the limitations of the API limit how accurate the content I get can be (I look forward to seeing whether the outcome could be more conversation-like if the NY Times API provided a function for searching keywords in headlines and titles).


Presentation & Performance

Since the program outputs multiple headlines for each keyword (noun phrase), I needed to narrow the gap between the tweet and the returned headline to a reasonable, understandable range (ideally, the gap is not too big to bridge with prior experience, while still leaving the audience room for imagination). So after running the program, I manually picked the headline that made the most sense to pair with the tweet. Then, as you see, I did some graphic design and output the result as a PDF.

Testing in the Terminal: the Python script prints the different pieces of information out separately.
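
The output format looks roughly like this (the script name, account, and bracketed contents are placeholders, not real output):

$ python grab.py someaccount 5

<tweet that contains the keyword>

<keyword>

<headline 1>
<headline 2>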

Insights and Reflections

This class is called Reading and Writing Electronic Text. Basically, it is a class focused on "computer-generated poetry": treating randomness as aesthetics and pursuing poetry without the limitations of grammar, language structure, or even consciousness.

At first I doubted this idea, since I could not get the point. After a couple of classes, I was even more confident in that doubt, because I could tell that the more sense the computer-generated content makes, the better the reaction it gets from us, which suggests that we can never truly jump out. After we jump out and let the computer make the poetry, we have to jump back (to our grammar and background) to interpret what the computer has generated and find the sense our brain is looking for. If it does make sense, well, that is good, but if it does not...

The fascinating fact is that the thought "I want to jump out and make a poem" is made by our brain, but the brain needs to "look back" to persuade itself of what it is receiving.

Python Script

Below is the Python script I wrote for this project. I divided it into four parts for demonstration.

1. All the settings to get tweets from a Twitter account

import sys
import twython
import urllib
import urllib2
import json
import unicodedata
from textblob import TextBlob

allNouns = list()
headlines = set()  # a set, so duplicate headlines are only kept once
dates = list()

adjectives = set()  # word list used by isAdjective below; left empty here
results = dict()
nounssss = list()
name = sys.argv[1]      # the Twitter account to grab tweets from
num = int(sys.argv[2])  # how many tweets to grab from the timeline

api_key = "****"
api_secret = "****"
access_token = "****"
token_secret = "****"

twitter = twython.Twython(api_key, api_secret, access_token, token_secret)
response = twitter.get_user_timeline(screen_name=name, count=num)

tweets_to_be_printed = []
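
Assuming the script is saved as grab.py (the filename is a placeholder), it is run with the account name and the tweet count as command-line arguments:

$ python grab.py someaccount 20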

2. Getting tweets from Twitter

class tweet(object):
    def __init__(self, result, nounss, insideResponse):
        self.results = result
        self.allMyNouns = nounss
        self.insideResponse = insideResponse

    # placeholder helpers from an earlier iteration; neither is called below
    def isNoun(self, word):
        pass

    def isAdjective(self, word):
        if word == "":
            return False
        return word.lower() in adjectives

    def newtweet(self, response):
        self.insideResponse = response
        self.results = list()

        for t in self.insideResponse:
            tweet_text = t['text']
            detweet = unicodedata.normalize('NFKD', tweet_text)  # converting unicode

            # deal with the format of tweets in this account:
            # keep only the part after the first colon
            if detweet.find(":") != -1:
                index_of_colon = detweet.index(':')
                real_tweet = detweet[index_of_colon + 2:]
            else:
                real_tweet = detweet
            tweets_to_be_printed.append(real_tweet)

            # skip retweets, then collect the noun phrases from the tweet text
            if t['retweeted'] != True and t['text'][0:2] != "RT":
                real_tweet = real_tweet.replace("\"", "\'")
                blob = TextBlob(real_tweet)
                for word in blob.noun_phrases:
                    self.allMyNouns.append(word)
        return self.allMyNouns

3. Searching the nouns we've got in the NY Times API and printing the outcome (headlines of articles that contain the noun in their body) as unicode

    def getTweet(self, listOfNouns):
        for wd in self.allMyNouns:
            searchterm = wd
            # search the article bodies for the keyword within a fixed date range
            request_string = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=' + searchterm + '&facet_field=source&begin_date=19990101&end_date=20130101&api-key=****'
            urlresponse = urllib2.urlopen(request_string)
            docs = json.load(urlresponse)["response"]["docs"]

            for doc in docs:
                headlines.add(doc["headline"]["main"])
                dates.append(doc["pub_date"])

        for headline in headlines:
            print headline
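
For reference, the part of the Article Search JSON response this code relies on has roughly this shape (the values below are made up, and only the fields the script reads are shown):

{
  "response": {
    "docs": [
      {
        "headline": { "main": "An Example Headline" },
        "pub_date": "2012-05-01T00:00:00Z"
      }
    ]
  }
}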

4. Printing the noun (keyword) and headlines (as strings) as the final outcome

a = tweet(results, nounssss, response)
results = a.newtweet(response)
# encode the unicode noun phrases as UTF-8 strings for printing and searching
results_unicode = [x.encode('UTF8') for x in results]

for keyword in results_unicode:
    # print the first collected tweet that contains this keyword
    for mytweet in tweets_to_be_printed:
        if keyword in mytweet.lower():
            print "\n\n" + mytweet + "\n"
            break

    print "\n\n" + keyword + "\n"

    # build the Article Search query; urlencode needs byte-string values
    params_dict = {"fq": keyword, "facet_field": "source", "begin_date": "19990101", "end_date": "20130101", "api-key": "****"}
    new_param_dict = dict()
    for key, value in params_dict.iteritems():
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        new_param_dict[key] = value

    params = urllib.urlencode(new_param_dict)

    request_string = 'http://api.nytimes.com/svc/search/v2/articlesearch.json?' + params
    urlresponse = urllib2.urlopen(request_string)
    tResult = json.load(urlresponse)
    docs = tResult["response"]["docs"]

    # collect the unique headlines (and their publication dates) for this keyword
    headlines = set()
    for doc in docs:
        headlines.add(doc["headline"]["main"])
        dates.append(doc["pub_date"])

    for headline in headlines:
        print headline