I am trying to gather a dataset of Twitter accounts that posted a status update in the form of a statement of diagnosis, such as "I was diagnosed with X today", where X represents depression.
I was able to use the TwitterSearch library, but it only searches for keywords, not a full sentence.
from TwitterSearch import *

try:
    tso = TwitterSearchOrder()  # create a TwitterSearchOrder object
    tso.set_keywords(['depression', 'diagnosed'])  # let's define all words we would like to have a look for
    tso.set_language('en')  # we want to see English tweets only
    tso.set_include_entities(False)  # and don't give us all that entity information

    ts = TwitterSearch(
        consumer_key = 'x',
        consumer_secret = 'y',
        access_token = 'z',
        access_token_secret = 't'
    )

    for tweet in ts.search_tweets_iterable(tso):
        print( tweet['user']['screen_name'], tweet['text'] )
except TwitterSearchException as e:
    print(e)
However, I would like to use a regular expression to get tweets that match the sentence.
You can search for full sentences, not only keywords, with set_keywords:
from TwitterSearch import *

try:
    tso = TwitterSearchOrder()  # create a TwitterSearchOrder object
    tso.set_keywords(['I was diagnosed with depression today'])
    tso.set_language('en')  # we want to see English tweets only
    tso.set_include_entities(False)

    ts = TwitterSearch(
        consumer_key = 'c',
        consumer_secret = 's',
        access_token = 'at',
        access_token_secret = 'ats'
    )

    # this is where the fun actually starts :)
    for tweet in ts.search_tweets_iterable(tso):
        print( '#%s tweeted: %s' % ( tweet['user']['screen_name'], tweet['text'] ) )
except TwitterSearchException as e:  # take care of all those ugly errors if there are some
    print(e)
So, no need to filter the result with regex.
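That said, if you prefer the original approach of broad keywords plus a sentence pattern, a regex post-filter over the returned tweets is easy to add. Here is a minimal sketch, reusing the ts and tso objects from the snippet above with the broader keywords ['diagnosed', 'depression']; the pattern itself is only an illustration:

import re

# Illustrative pattern: matches e.g. "I was diagnosed with depression today".
diagnosis_re = re.compile(r"\bI was diagnosed with depression today\b", re.IGNORECASE)

tso.set_keywords(['diagnosed', 'depression'])  # broad keyword search again
for tweet in ts.search_tweets_iterable(tso):
    if diagnosis_re.search(tweet['text']):
        print(tweet['user']['screen_name'], tweet['text'])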
Related
When I send a query request like ?page_no=5 from the browser:
http://127.0.0.1:8001/article/list/1?page_no=5
with this view:

def article_list(request, block_id):
    print(request.GET)

I get the following output in the debugging terminal:

<QueryDict: {'page_no': ['5']}>
Django encapsulates page_no=5 into a dict-like object, {'page_no': ['5']}.
How does Django accomplish such a task? Are regexes or str.split employed?
Short answer: we can inspect the source code to see how the QueryDict is constructed. It is done with a combination of regex splitting and unquoting.
We can reproduce this with the constructor directly. Indeed, if we call:
>>> QueryDict('page_no=5')
<QueryDict: {u'page_no': [u'5']}>
The constructor uses the limited_parse_qsl helper function. If we look at the source code [src]:
def __init__(self, query_string=None, mutable=False, encoding=None):
    super().__init__()
    self.encoding = encoding or settings.DEFAULT_CHARSET
    query_string = query_string or ''
    parse_qsl_kwargs = {
        'keep_blank_values': True,
        'fields_limit': settings.DATA_UPLOAD_MAX_NUMBER_FIELDS,
        'encoding': self.encoding,
    }
    if isinstance(query_string, bytes):
        # query_string normally contains URL-encoded data, a subset of ASCII.
        try:
            query_string = query_string.decode(self.encoding)
        except UnicodeDecodeError:
            # ... but some user agents are misbehaving :-(
            query_string = query_string.decode('iso-8859-1')
    for key, value in limited_parse_qsl(query_string, **parse_qsl_kwargs):
        self.appendlist(key, value)
    self._mutable = mutable
If we look at the source code of limited_parse_qsl [src], we see that this parser uses a combination of splitting and decoding:
FIELDS_MATCH = re.compile('[&;]')

# ...

def limited_parse_qsl(qs, keep_blank_values=False, encoding='utf-8',
                      errors='replace', fields_limit=None):
    """
    Return a list of key/value tuples parsed from query string.

    Copied from urlparse with an additional "fields_limit" argument.
    Copyright (C) 2013 Python Software Foundation (see LICENSE.python).

    Arguments:

    qs: percent-encoded query string to be parsed

    keep_blank_values: flag indicating whether blank values in
        percent-encoded queries should be treated as blank strings. A
        true value indicates that blanks should be retained as blank
        strings. The default false value indicates that blank values
        are to be ignored and treated as if they were not included.

    encoding and errors: specify how to decode percent-encoded sequences
        into Unicode characters, as accepted by the bytes.decode() method.

    fields_limit: maximum number of fields parsed or an exception
        is raised. None means no limit and is the default.
    """
    if fields_limit:
        pairs = FIELDS_MATCH.split(qs, fields_limit)
        if len(pairs) > fields_limit:
            raise TooManyFieldsSent(
                'The number of GET/POST parameters exceeded '
                'settings.DATA_UPLOAD_MAX_NUMBER_FIELDS.'
            )
    else:
        pairs = FIELDS_MATCH.split(qs)
    r = []
    for name_value in pairs:
        if not name_value:
            continue
        nv = name_value.split('=', 1)
        if len(nv) != 2:
            # Handle case of a control-name with no equal sign
            if keep_blank_values:
                nv.append('')
            else:
                continue
        if nv[1] or keep_blank_values:
            name = nv[0].replace('+', ' ')
            name = unquote(name, encoding=encoding, errors=errors)
            value = nv[1].replace('+', ' ')
            value = unquote(value, encoding=encoding, errors=errors)
            r.append((name, value))
    return r
So it splits on the regex [&;] and uses unquote to decode the percent-encoded keys and values.
The unquote(..) function comes from urllib.
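To make the mechanics concrete, here is a minimal sketch of the same split-then-unquote steps using only the standard library. It is purely illustrative and deliberately skips Django's blank-value and field-limit handling:

import re
from urllib.parse import unquote

FIELDS_MATCH = re.compile('[&;]')

def parse_qsl_sketch(qs):
    pairs = []
    for name_value in FIELDS_MATCH.split(qs):
        if not name_value:
            continue
        name, _, value = name_value.partition('=')
        # '+' means space in form encoding; unquote handles the %XX escapes
        pairs.append((unquote(name.replace('+', ' ')),
                      unquote(value.replace('+', ' '))))
    return pairs

print(parse_qsl_sketch('page_no=5&q=hello+world%21'))
# [('page_no', '5'), ('q', 'hello world!')]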
I wrote this code to get the full list of a Twitter account's followers using Tweepy:
# ... twitter connection and streaming

fulldf = pd.DataFrame()
line = {}
ids = []

try:
    for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
        df = pd.DataFrame()
        ids.extend(page)
        try:
            for i in ids:
                user = api.get_user(i)
                line = [{'id': user.id,
                         'Name': user.name,
                         'Statuses Count': user.statuses_count,
                         'Friends Count': user.friends_count,
                         'Screen Name': user.screen_name,
                         'Followers Count': user.followers_count,
                         'Location': user.location,
                         'Language': user.lang,
                         'Created at': user.created_at,
                         'Time zone': user.time_zone,
                         'Geo enable': user.geo_enabled,
                         'Description': user.description.encode(sys.stdout.encoding, errors='replace')}]
                df = pd.DataFrame(line)
                fulldf = fulldf.append(df)
                del df
                fulldf.to_csv('out.csv', sep=',', index=False)
                print i, len(ids)
        except tweepy.TweepError:
            time.sleep(60 * 15)
            continue
except tweepy.TweepError as e2:
    print "exception global block"
    print e2.message[0]['code']
    print e2.args[0][0]['code']
At the end I have only 1000 lines in the CSV file. It's not the best solution to keep everything in memory (a dataframe) and write it to the file inside the same loop, but at least I have something that works. Still, I'm not getting the full list, just 1000 out of 15000 followers.
Any help with this will be appreciated.
Consider the following part of your code:
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    df = pd.DataFrame()
    ids.extend(page)
    try:
        for i in ids:
            user = api.get_user(i)
As you use extend for each page, you simply add the new set of ids onto the end of your list of ids. The way you have nested your for statements means that with every new page you retrieve, you call get_user for all of the previous pages' ids first. As a result, by the time you reach the final page of ids you are still working through the first 1000 or so when you hit the rate limit, and you have no more pages to browse. You're also likely hitting the rate limit for your cursor, which would be why you're seeing the exception.
Let's start over a bit.
Firstly, tweepy can deal with rate limits (one of the main error sources) for you when you create your API if you use wait_on_rate_limit. This solves a whole bunch of problems, so we'll do that.
Secondly, if you use lookup_users, you can look up 100 user objects per request. I've written about this in another answer so I've taken the method from there.
Finally, we don't need to create a dataframe or export to a CSV until the very end. If we get a list of user information dictionaries, this can quickly change to a DataFrame with no real effort from us.
Here is the full code. You'll need to sub in your keys and the screen name of the user you actually want to look up, but other than that it hopefully will work!
import tweepy
import pandas as pd


def lookup_user_list(user_id_list, api):
    full_users = []
    users_count = len(user_id_list)
    try:
        for i in range((users_count / 100) + 1):
            print i
            full_users.extend(api.lookup_users(user_ids=user_id_list[i * 100:min((i + 1) * 100, users_count)]))
        return full_users
    except tweepy.TweepError:
        print 'Something went wrong, quitting...'


consumer_key = 'XXX'
consumer_secret = 'XXX'
access_token = 'XXX'
access_token_secret = 'XXX'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    ids.extend(page)

results = lookup_user_list(ids, api)

all_users = [{'id': user.id,
              'Name': user.name,
              'Statuses Count': user.statuses_count,
              'Friends Count': user.friends_count,
              'Screen Name': user.screen_name,
              'Followers Count': user.followers_count,
              'Location': user.location,
              'Language': user.lang,
              'Created at': user.created_at,
              'Time zone': user.time_zone,
              'Geo enable': user.geo_enabled,
              'Description': user.description}
             for user in results]

df = pd.DataFrame(all_users)
df.to_csv('All followers.csv', index=False, encoding='utf-8')
I don't know why the Python script below will not crawl the glassdoor.com website:
from bs4 import BeautifulSoup  # documentation available at: www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import NavigableString, Tag
import requests  # To send http requests and access the page: docs.python-requests.org/en/latest/
import csv  # To create the output csv file
import unicodedata  # To work with the string encoding of the data

entries = []
entry = []
urlnumber = 1  # Give the page number to start with

while urlnumber < 100:  # Give the page number to end with
    #print type(urlnumber), urlnumber
    url = 'http://www.glassdoor.com/p%d' % (urlnumber,)  # Give the url of the forum, excluding the page number in the hyperlink
    #print url
    try:
        r = requests.get(url, timeout=10)  # Sending a request to access the page
    except Exception, e:
        print e.message
        break
    if r.status_code == 200:
        data = r.text
    else:
        print str(r.status_code) + " " + url

    soup = BeautifulSoup(data)  # Getting the page source into the soup

    for div in soup.find_all('div'):
        entry = []
        if div.get('class') != None and div.get('class')[0] == 'Comment':  # A single post is referred to as a comment. Each comment is a block denoted in a div tag which has a class called Comment.
            ps = div.find_all('p')  # gets all the tags called p to a variable ps
            aas = div.find_all('a')  # gets all the tags called a to a variable aas
            spans = div.find_all('span')
            times = div.find_all('time')  # used to extract the time tag which gives the date of the post

            concat_str = ''
            for str in aas[1].contents:  # iterates over the contents between the tag start and end
                if str != "<br>" or str != "<br/>":  # This denotes breaks in the post which we need to work around.
                    concat_str = (concat_str + ' ' + str.encode('iso-8859-1')).strip()  # The encoding is because the extracted format is unicode. We need a uniform structure to work with the strings.
            entry.append(concat_str)

            concat_str = ''
            for str in times[0].contents:
                if str != "<br>" or str != "<br/>":
                    concat_str = (concat_str + ' ' + str.encode('iso-8859-1')).strip()
            entry.append(concat_str)
            #print "-------------------------"

            for div in div.find_all('div'):
                if div.get('class') != None and div.get('class')[0] == 'Message':  # Extracting the div tag with the class attribute Message.
                    blockqoutes = []
                    x = div.get_text()
                    for bl in div.find_all('blockquote'):
                        blockqoutes.append(bl.get_text())  # A blockquote holds the quote made by a person. get_text eliminates the hyperlinks and pulls out only the data.
                        bl.decompose()
                    entry.append(div.get_text().replace("\n", " ").replace("<br/>", "").encode('ascii', 'replace').encode('iso-8859-1'))
                    for bl in blockqoutes:
                        #print bl
                        entry.append(bl.replace("\n", " ").replace("<br/>", "").encode('ascii', 'replace').encode('iso-8859-1'))
        #print entry
        entries.append(entry)

    urlnumber = urlnumber + 1  # increment so that we can extract the next page

with open('gd1.csv', 'w') as output:
    writer = csv.writer(output, delimiter=',', lineterminator='\n')
    writer.writerows(entries)

print "Wrote to gd1.csv"
I fixed some errors in your script, but I guess that it doesn't print anything because you only get HTTP 405 responses from those URLs!
Also, your earlier try/except block didn't print the error message. Was that on purpose?
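If you want to confirm that yourself, a quick status check along these lines helps (the URL pattern is copied from your question; sending a browser-like User-Agent sometimes changes the response, but that is not guaranteed):

import requests

for urlnumber in range(1, 5):
    url = 'http://www.glassdoor.com/p%d' % (urlnumber,)
    r = requests.get(url, timeout=10,
                     headers={'User-Agent': 'Mozilla/5.0'})
    # anything other than 200 means there is no page worth parsing
    print r.status_code, url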
I'm trying to set a filter to get all groups that a specific user is a member of.
I'm using Python. Currently:
import traceback
import ldap

try:
    l = ldap.open("192.168.1.1")
    .
    .
    .
    l.simple_bind_s(username, password)
    #######################################################################
    f_filterStr = '(objectclass=group)'  # Would like to modify this, so I'll not have to make the next loop ...
    #######################################################################
    # the next command takes some seconds
    results = l.search_s(dn_recs, ldap.SCOPE_SUBTREE, f_filterStr)
    for i in results:
        if dict == type(i[1]):
            group_name = i[1].get('name')
            if list == type(group_name):
                group_name = group_name[0]
            search_str = "CN=%s," % username_bare
            if -1 != ("%s" % i[1].get('member')).find(search_str):
                print "User belongs to this group! %s" % group_name
except Exception, e:
    pass  # handle as you wish
I think you are making this much too hard.
I'm no Python expert, but you can easily query Microsoft Active Directory for all the groups a user is a member of using a filter like:
(member:1.2.840.113556.1.4.1941:=(CN=UserName,CN=Users,DC=YOURDOMAIN,DC=NET))
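For completeness, here is a minimal, untested sketch of how that matching-rule filter (1.2.840.113556.1.4.1941 is Active Directory's LDAP_MATCHING_RULE_IN_CHAIN rule, which also resolves nested membership) could be plugged into the python-ldap calls from the question. The bind handle l, the search base dn_recs and the user's full DN are assumptions carried over from the snippet above:

import ldap

# Hypothetical DN for the user; substitute the real distinguished name.
user_dn = "CN=%s,CN=Users,DC=YOURDOMAIN,DC=NET" % username_bare

# Ask the server directly for the user's groups instead of looping over all groups.
f_filterStr = "(&(objectClass=group)(member:1.2.840.113556.1.4.1941:=%s))" % user_dn

results = l.search_s(dn_recs, ldap.SCOPE_SUBTREE, f_filterStr, ['name'])
for dn, attrs in results:
    if isinstance(attrs, dict) and attrs.get('name'):
        print "User belongs to group: %s" % attrs['name'][0]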
-jim
I am a newbie to programming. Learning from Udacity. In unit 2, I studied the following code to fetch links from a particular url:
import urllib2

def get_page(url):
    return urllib2.urlopen(url).read()

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def print_all_links(page):
    while True:
        url, endpos = get_next_target(page)
        if url:
            print url
            page = page[endpos:]
        else:
            break

print_all_links(get_page('http://en.wikipedia.org'))
It worked perfectly. Today I wanted to modify this code so the script could crawl for a particular word in a webpage rather than URLs. Here is what I came up with:
import urllib2

def get_web(url):
    return urllib2.urlopen(url).read()

def get_links_from(page):
    start_at = page.find('america')
    if start_at == -1:
        return None, 0
    start_word = page.find('a', start_at)
    end_word = page.find('a', start_word + 1)
    word = page[start_word + 1:end_word]
    return word, end_word

def print_words_from(page):
    while True:
        word, endlet = get_links_from(page)
        if word:
            print word
            page = page[endlet:]
        else:
            break

print_words_from(get_web('http://en.wikipedia.org/wiki/America'))
When I run the above, I get no errors, but nothing prints out either. So I added the print keyword:

print print_words_from(get_web('http://en.wikipedia.org/wiki/America'))

When I run it, I get None as the result. I am unable to understand where I am going wrong. My code is probably messed up, but because no error comes up, I can't figure out where. Seeking help.
I understand this as: you are trying to get it to print the word "America" for every instance of the word on the Wikipedia page.
You are searching for "america" but the word is written "America". "a" is not equal to "A", which is why you find no results.
Also, start_word is searching for 'a', so I adjusted that to search for 'A' instead.
At this point, it was printing 'meric' over and over. I changed word to begin at start_word rather than start_word + 1, and adjusted end_word to end_word + 1 so that the last letter is printed too.
It is now working on my machine. Let me know if you need any clarification.
import urllib2

def get_web(url):
    return urllib2.urlopen(url).read()

def get_links_from(page):
    start_at = page.find('America')
    if start_at == -1:
        return None, 0
    start_word = page.find('A', start_at)
    end_word = page.find('a', start_word + 1)
    word = page[start_word:end_word + 1]
    return word, end_word

def print_words_from(page):
    while True:
        word, endlet = get_links_from(page)
        if word:
            print word
            page = page[endlet:]
        else:
            break

print_words_from(get_web('http://en.wikipedia.org/wiki/America'))
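As a side note, if the end goal is just to find or count every occurrence of the word regardless of case, a regular expression is a shorter route. A small sketch reusing get_web from above:

import re

page = get_web('http://en.wikipedia.org/wiki/America')
matches = re.findall(r'[Aa]merica', page)
print '%d occurrences found' % len(matches)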