How to parse through a list of links using a loop with Beautiful Soup?

I am a newbie in coding, so please go easy. I have this work thing that I want to automate a bit.
I have to collect data from this government website.
But I think it is designed this way to prevent bots and DDoS attacks. What I have to do as part of my work is click individually on these links and record the case
Filing date
Last listed date
Party name
Case status
If case status = 'disposed', then I read and check the PDFs for the reason of disposal (which can't be automated).
Now, I have to go through many pages of this and it's hard to even copy-paste this much info. So for the first 4 details, I tried to create two scripts: one retrieves the hyperlinks from the table page, and the second goes through the list of links and gets the details listed above. It's the second script where I am facing the problems.
The list of URL suffixes that change from case to case:
case_list = [
"case-details?bench=YW1yYXZhdGk=&filing_no=MjgxMjEyOTAwMjA4MjAyMA==",
"case-details?bench=YW1yYXZhdGk=&filing_no=MjgxMjEyOTAwMjA5MjAyMA==",
"case-details?bench=YW1yYXZhdGk=&filing_no=MjgxMjEyOTAwMjEwMjAyMA==",
"case-details?bench=YW1yYXZhdGk=&filing_no=MjgxMjEyOTAwMjExMjAyMQ==",
"case-details?bench=YW1yYXZhdGk=&filing_no=MjgxMjEyOTAwMjEyMjAyMQ==",
"case-details?bench=YW1yYXZhdGk=&filing_no=MjgxMjEyOTAwMjEzMjAyMA==",
"case-details?bench=YW1yYXZhdGk=&filing_no=MjgxMjEyOTAwMjE0MjAyMA==",
"case-details?bench=YW1yYXZhdGk=&filing_no=MjgxMjEyOTAwMjE2MjAyMQ==",
"case-details?bench=YW1yYXZhdGk=&filing_no=MjgxMjEyOTAwMjE3MjAyMA==",
"case-details?bench=YW1yYXZhdGk=&filing_no=MjgxMjEyOTAwMjIwMjAyMA=="]
Snippet for scraping a single URL:
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tr = soup.find_all('td')
status = tr[19].text
#gets Filing date
filing_date = tr[3].text
#gets title
case_title = tr[5].text
#gets case disposed date
disposal_date = tr[15].text
Function for grabbing details from the URL:
def get_case_components(url):
    response = requests.get(url).text
    soup = BeautifulSoup(response, 'html.parser')
    tr = soup.find_all('td')
    status = tr[19].text
    #gets Filing date
    filing_date = tr[3].text
    #gets title
    case_title = tr[5].text
    #gets disposed date
    disposal_date = tr[15].text
    return filing_date, case_title, status, disposal_date
Function for appending the DataFrame:
def get_case(df):
    # loop for going through the case_list
    for links in url_list:
        url = links
        #putting the URL in the function
        get_case_components(url)
        df = df.append({"filing_date": filing_date,
                        "case_title": case_title, "status": status,
                        "disposal_date": disposal_date}, ignore_index=True)
        time.sleep(1)
    return df
Calling the get_case() function with the DataFrame:
df= pd.DataFrame(columns = ["filing_date", "case_title", "status", "disposal_date"])
df = get_case(df)
df.head()
For some reason I just keep getting the same thing over and over again as output; only one case fills the entire DataFrame.
0 14-12-2020 Rajani Jagarlamudi VS Sharadakrupa Cold Storag... Pending 16-03-2022 \t
1 14-12-2020 Rajani Jagarlamudi VS Sharadakrupa Cold Storag... Pending 16-03-2022 \t
2 14-12-2020 Rajani Jagarlamudi VS Sharadakrupa Cold Storag... Pending 16-03-2022 \t
3 14-12-2020 Rajani Jagarlamudi VS Sharadakrupa Cold Storag... Pending 16-03-2022 \t
4 14-12-2020 Rajani Jagarlamudi VS Sharadakrupa Cold Storag... Pending 16-03-2022 \t
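(A likely culprit for the repeated rows: the tuple returned by get_case_components() is never assigned inside get_case, so the same module-level values get appended every time. A minimal sketch of the same loop with the return values captured, iterating the case_list defined above and changing nothing else:)
def get_case(df):
    for url in case_list:
        # unpack the tuple returned by get_case_components()
        filing_date, case_title, status, disposal_date = get_case_components(url)
        df = df.append({"filing_date": filing_date,
                        "case_title": case_title,
                        "status": status,
                        "disposal_date": disposal_date}, ignore_index=True)
        time.sleep(1)
    return df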

I made a script that gets the cases and stores them in an array.
The problem in your script was that you were selecting the table cells by hard-coded index when you were doing:
tr = soup.find_all('td')
status = tr[19].text
#gets Filing date
filing_date = tr[3].text
#gets title
case_title = tr[5].text
#gets disposed date
disposal_date = tr[15].text
My code dynamically scrapes them, so you don't have to worry about it.
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass

HOST = "https://nclt.gov.in/"
LINK = "https://nclt.gov.in/order-judgement-date-wise-search?bench=Y2hlbm5haQ%3D%3D&start_date=MDEvMDEvMjAyMQ%3D%3D&end_date=MDEvMDEvMjAyMg%3D%3D"

@dataclass
class Case:
    number: str
    filing_num: str
    case_no: str
    pvr: str
    listing_date: str
    status: str

def get_cases(url):
    res = requests.get(url)
    if res.status_code == 200:
        print("getting cases")
        return res.text

def extract_case(html):
    soup = BeautifulSoup(html, "html.parser")
    cases = [Case(*[td.text for td in tr.select("td")]) for tr in soup.select("table tbody tr")]
    next, *rest = [link["href"] for link in soup.select(".page-link") if link.text.strip() == "Next"]
    if len(next):
        next = HOST + next
    return cases, next

def main():
    total_cases = []
    next = ""
    amount = int(input("How many pages do you want to scrape?"))
    oamount = amount
    while amount:
        if len(next) and oamount != amount:
            html = get_cases(next)
        else:
            html = get_cases(LINK)
        cases, next = extract_case(html)
        total_cases.extend(cases)
        amount -= 1
        print(cases, next)
    print("Scraped", len(total_cases), "cases")

if __name__ == "__main__":
    main()
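If you still want the results in a DataFrame like in the original script, the Case dataclass instances convert easily. A small sketch, assuming main() is modified to return total_cases and pandas is installed:
import pandas as pd
from dataclasses import asdict

total_cases = main()  # assumes main() is changed to return total_cases
df = pd.DataFrame([asdict(case) for case in total_cases])
df.to_csv("cases.csv", index=False)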

Related

BigQuery: job is done but job.query_results().total_bytes_processed returns None

The following code:
import time
from google.cloud import bigquery

client = bigquery.Client()
query = """\
select 3 as x
"""
dataset = client.dataset('dataset_name')
table = dataset.table(name='table_name')
job = client.run_async_query('job_name_76', query)
job.write_disposition = 'WRITE_TRUNCATE'
job.destination = table
job.begin()

retry_count = 100
while retry_count > 0 and job.state != 'DONE':
    retry_count -= 1
    time.sleep(10)
    job.reload()

print job.state
print job.query_results().name
print job.query_results().total_bytes_processed
prints:
DONE
job_name_76
None
I do not understand why total_bytes_processed returns None, because the job is done and the documentation says:
total_bytes_processed:
Total number of bytes processed by the query.
See
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#totalBytesProcessed
Return type: int, or NoneType
Returns: Count generated on the server (None until set by the server).
Looks like you are right. As you can see in the code, the current version of the client library does not populate the bytes-processed statistic.
This has been reported in an issue, and as you can see in tseaver's PR the feature has already been implemented and is awaiting review/merging, so we'll probably have this in production quite soon.
In the meantime you could get the result from the _properties attribute of the job, like:
from google.cloud.bigquery import Client
import types
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/key.json'
bc = Client()
query = 'your query'
job = bc.run_async_query('name', query)
job.begin()
wait_job(job)
query_results = job._properties['statistics'].get('query')
query_results should have the totalBytesProcessed you are looking for.
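The wait_job helper used above isn't shown in the snippet; a minimal version, modeled on the polling loop from the question, could look like this:
import time

def wait_job(job, retry_count=100, interval=10):
    # poll until the job reports DONE, mirroring the loop in the question
    while retry_count > 0 and job.state != 'DONE':
        retry_count -= 1
        time.sleep(interval)
        job.reload()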

How to get and save to a file the full list of Twitter account followers with Tweepy

I wrote this code to get the full list of twitter account followers using Tweepy:
# ... twitter connection and streaming
fulldf = pd.DataFrame()
line = {}
ids = []
try:
    for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
        df = pd.DataFrame()
        ids.extend(page)
        try:
            for i in ids:
                user = api.get_user(i)
                line = [{'id': user.id,
                         'Name': user.name,
                         'Statuses Count': user.statuses_count,
                         'Friends Count': user.friends_count,
                         'Screen Name': user.screen_name,
                         'Followers Count': user.followers_count,
                         'Location': user.location,
                         'Language': user.lang,
                         'Created at': user.created_at,
                         'Time zone': user.time_zone,
                         'Geo enable': user.geo_enabled,
                         'Description': user.description.encode(sys.stdout.encoding, errors='replace')}]
                df = pd.DataFrame(line)
                fulldf = fulldf.append(df)
                del df
                fulldf.to_csv('out.csv', sep=',', index=False)
                print i, len(ids)
        except tweepy.TweepError:
            time.sleep(60 * 15)
            continue
except tweepy.TweepError as e2:
    print "exception global block"
    print e2.message[0]['code']
    print e2.args[0][0]['code']
At the end I have only 1000 lines in the CSV file. It's not the best solution to keep everything in memory (a dataframe) and write it to the file inside the same loop, but at least I have something that works. However, I'm not getting the full list, just 1000 out of 15000 followers.
Any help with this will be appreciated.
Consider the following part of your code:
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    df = pd.DataFrame()
    ids.extend(page)
    try:
        for i in ids:
            user = api.get_user(i)
As you use extend for each page, you simply add the new set of ids onto the end of your list of ids. The way you have nested your for statements means that with every new page you return, you call get_user for all of the previous pages first; as such, when you hit the final page of ids you'd still be looking at the first 1000 or so when you hit the rate limit, and you'd have no more pages to browse. You're also likely hitting the rate limit for your cursor, which would be why you're seeing the exception.
Let's start over a bit.
Firstly, tweepy can deal with rate limits (one of the main error sources) for you when you create your API if you use wait_on_rate_limit. This solves a whole bunch of problems, so we'll do that.
Secondly, if you use lookup_users, you can look up 100 user objects per request. I've written about this in another answer so I've taken the method from there.
Finally, we don't need to create a dataframe or export to a csv until the very end. If we get a list of user information dictionaries, this can quickly change to a DataFrame with no real effort from us.
Here is the full code - you'll need to sub in your keys and the username of the user you actually want to look up, but other than that it hopefully will work!
import tweepy
import pandas as pd

def lookup_user_list(user_id_list, api):
    full_users = []
    users_count = len(user_id_list)
    try:
        for i in range((users_count / 100) + 1):
            print i
            full_users.extend(api.lookup_users(user_ids=user_id_list[i * 100:min((i + 1) * 100, users_count)]))
        return full_users
    except tweepy.TweepError:
        print 'Something went wrong, quitting...'

consumer_key = 'XXX'
consumer_secret = 'XXX'
access_token = 'XXX'
access_token_secret = 'XXX'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    ids.extend(page)

results = lookup_user_list(ids, api)
all_users = [{'id': user.id,
              'Name': user.name,
              'Statuses Count': user.statuses_count,
              'Friends Count': user.friends_count,
              'Screen Name': user.screen_name,
              'Followers Count': user.followers_count,
              'Location': user.location,
              'Language': user.lang,
              'Created at': user.created_at,
              'Time zone': user.time_zone,
              'Geo enable': user.geo_enabled,
              'Description': user.description}
             for user in results]

df = pd.DataFrame(all_users)
df.to_csv('All followers.csv', index=False, encoding='utf-8')

Django - Update or create syntax assistance (error)

I've followed the guide in the queryset documentation (https://docs.djangoproject.com/en/1.10/ref/models/querysets/#update-or-create), but I think I'm getting something wrong.
My script checks an inbox for maintenance emails from our ISP, then sends us a calendar invite if you are subscribed and adds the maintenance to the database.
Sometimes we get updates on already planned maintenance, in which case I need to update the database with the new date and time, so I'm trying to use "update or create" for the queryset, and I need to use the ref no from the email to update or create the record.
#Maintenance
if sender.lower() == 'maintenance#isp.com':
    print 'Found maintenance in mail: {0}'.format(subject)
    content = Message.getBody(mail)
    postcodes = re.findall(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}", content)
    if postcodes:
        print 'Found Postcodes'
    else:
        error_body = """
        Email titled: {0}
        With content: {1}
        Failed processing, could not find any postcodes in the email
        """.format(subject, content)
        SendMail(authentication, site_admins, 'Unprocessed Email', error_body)
        Message.markAsRead(mail)
        continue
    times = re.findall("\d{2}/\d{2}/\d{4} \d{2}:\d{2}", content)
    if times:
        print 'Found event Times'
        e_start_time = datetime.strftime(datetime.strptime(times[0], "%d/%m/%Y %H:%M"), "%Y-%m-%dT%H:%M:%SZ")
        e_end_time = datetime.strftime(datetime.strptime(times[1], "%d/%m/%Y %H:%M"), "%Y-%m-%dT%H:%M:%SZ")
    subscribers = []
    clauses = (Q(site_data__address__icontains=p) for p in postcodes)
    query = reduce(operator.or_, clauses)
    sites = Circuits.objects.filter(query).filter(circuit_type='MPLS', provider='KCOM')
    subject_text = "Maintenance: "
    m_ref = re.search('\[(.*?)\]', subject).group(1)
    if not len(sites):
        #try use first part of postcode
        h_pcode = postcodes[0].split(' ')
        sites = Circuits.objects.filter(site_data__postcode__startswith=h_pcode[0]).filter(circuit_type='MPLS', provider='KCOM')
        if not len(sites):
            #still cant find a site, send error
            error_body = """
            Email titled: {0}
            With content: {1}
            I have found a postcode, but could not find any matching sites to assign this maintenance too, therefore no meeting has been sent
            """.format(subject, content)
            SendMail(authentication, site_admins, 'Unprocessed Email', error_body)
            Message.markAsRead(mail)
            continue
    else:
        #have site(s) send an invite and create record
        for s in sites:
            #create record in circuit maintenance
            maint = CircuitMaintenance(
                circuit = s,
                ref = m_ref,
                start_time = e_start_time,
                end_time = e_end_time,
                notes = content
            )
            maint, CircuitMaintenance.objects.update_or_create(ref=m_ref)
            #create subscribers for maintenance
m_ref is the unique field that should match the update, but every time I run this in tests I get
sites_circuitmaintenance.start_time may not be NULL
but I've set it?
If you want to update certain fields provided that a record with certain values exists, you need to explicitly provide the defaults as well as the field names.
Your code should look like this:
CircuitMaintenance.objects.update_or_create(
    defaults={'circuit': s, 'start_time': e_start_time, 'end_time': e_end_time, 'notes': content},
    ref=m_ref)
The particular error you are seeing is because update_or_create is creating an object, since one with ref=m_ref does not exist, but you are not passing in values for all the not-null fields. The above code will fix that.
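Note that update_or_create returns an (object, created) tuple, so if you need the instance afterwards you can capture it. A small sketch of the call inside your loop:
maint, created = CircuitMaintenance.objects.update_or_create(
    ref=m_ref,
    defaults={
        'circuit': s,
        'start_time': e_start_time,
        'end_time': e_end_time,
        'notes': content,
    },
)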

Don't know why the below script would not crawl glassdoor.com

Don't know why the below Python script would not crawl the glassdoor.com website:
from bs4 import BeautifulSoup # documentation available at: www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import NavigableString, Tag
import requests # To send http requests and access the page: docs.python-requests.org/en/latest/
import csv # To create the output csv file
import unicodedata # To work with the string encoding of the data

entries = []
entry = []
urlnumber = 1 # Give the page number to start with

while urlnumber < 100: # Give the page number to end with
    #print type(urlnumber), urlnumber
    url = 'http://www.glassdoor.com/p%d' % (urlnumber,) # Give the url of the forum, excluding the page number in the hyperlink
    #print url
    try:
        r = requests.get(url, timeout = 10) # Sending a request to access the page
    except Exception, e:
        print e.message
        break
    if r.status_code == 200:
        data = r.text
    else:
        print str(r.status_code) + " " + url
    soup = BeautifulSoup(data) # Getting the page source into the soup
    for div in soup.find_all('div'):
        entry = []
        if div.get('class') != None and div.get('class')[0] == 'Comment': # A single post is referred to as a comment. Each comment is a block denoted by a div tag with a class called Comment.
            ps = div.find_all('p') # gets all the tags called p into a variable ps
            aas = div.find_all('a') # gets all the tags called a into a variable aas
            spans = div.find_all('span')
            times = div.find_all('time') # used to extract the time tag which gives the date of the post
            concat_str = ''
            for str in aas[1].contents: # the contents between the tag start and end
                if str != "<br>" or str != "<br/>": # This denotes breaks in the post which we need to work around.
                    concat_str = (concat_str + ' ' + str.encode('iso-8859-1')).strip() # The encoding is because the extracted format is unicode. We need a uniform structure to work with the strings.
            entry.append(concat_str)
            concat_str = ''
            for str in times[0].contents:
                if str != "<br>" or str != "<br/>":
                    concat_str = (concat_str + ' ' + str.encode('iso-8859-1')).strip()
            entry.append(concat_str)
            #print "-------------------------"
            for div in div.find_all('div'):
                if div.get('class') != None and div.get('class')[0] == 'Message': # Extracting the div tag with the class attribute Message.
                    blockqoutes = []
                    x = div.get_text()
                    for bl in div.find_all('blockquote'):
                        blockqoutes.append(bl.get_text()) # Blockquote is used to get the quote made by a person. get_text eliminates the hyperlinks and pulls out only the data.
                        bl.decompose()
                    entry.append(div.get_text().replace("\n", " ").replace("<br/>", "").encode('ascii', 'replace').encode('iso-8859-1'))
                    for bl in blockqoutes:
                        #print bl
                        entry.append(bl.replace("\n", " ").replace("<br/>", "").encode('ascii', 'replace').encode('iso-8859-1'))
            #print entry
            entries.append(entry)
    urlnumber = urlnumber + 1 # increment so that we can extract the next page

with open('gd1.csv', 'w') as output:
    writer = csv.writer(output, delimiter=',', lineterminator='\n')
    writer.writerows(entries)
    print "Wrote to gd1.csv"
I fixed some errors in your script, but I guess that it doesn't print anything because you only get 405 responses from glassdoor.com.
Also, your previous try/except block didn't print the error message. Was that on purpose?
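For instance, a minimal adjustment inside the while loop (a sketch, not the exact fix applied above) that surfaces the real error and skips non-200 pages might look like this:
    try:
        r = requests.get(url, timeout=10)
    except Exception as e:
        print(repr(e))  # repr() shows the exception type and text; e.message is often empty
        break
    if r.status_code != 200:  # glassdoor.com answers these URLs with 405
        print(str(r.status_code) + " " + url)
        urlnumber += 1
        continue
    data = r.text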

Iterating tweets and saving them in a shapefile with tweepy and shapefile

I am searching for tweets and want to save them in a shapefile. Iterating through the tweets goes well, and when I use print statements I get exactly what I want. I am attempting to put these tweets in a point shapefile, but for some reason the iteration inside the if statement does not do what I expect. So how do I iterate through my tweets and save them one by one in my point shapefile with only the tweet.text and tweet.id attached?
I got inspired by looking at the following link: https://code.google.com/p/pyshp/
import tweepy
import shapefile

consumer_key = "..."
consumer_secret = "..."
access_token = "..."
access_token_secret = "..."

auth = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

tweetsaspoints = shapefile.Writer(shapefile.POINT)
page = 1
while True:
    statuses = api.search(q="*", count=1000, geocode="52.015106,5.394287,150km")
    if statuses:
        for status in statuses:
            print status.geo
            tweetsaspoints._shapes.extend([status.geo['coordinates']])
            tweetsaspoints.records.extend([("TEXT", "Test")])
    else:
        # All done
        break
    page += 1 # next page

tweetsaspoints.save('shapefiles/test/point')
I do not understand the page part. I seem to iterate through the same tweets over and over again. Also, I am not succeeding in writing my coordinates and data to a point shapefile.
As per the documentation, try to use:
for status in tweepy.Cursor(api.user_timeline).items():
    # process status here
    process_status(status)
Alternative:
page = 1
while True:
    statuses = api.user_timeline(page=page)
    if statuses:
        for status in statuses:
            # process status here
            process_status(status)
    else:
        # All done
        break
    page += 1 # next page
reference: http://docs.tweepy.org/en/latest/cursor_tutorial.html
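If you want to stay with the search endpoint from the question instead of user_timeline, the same Cursor pattern should apply. A sketch, untested, reusing the geocode from the question and the process_status placeholder from above:
for status in tweepy.Cursor(api.search, q="*", geocode="52.015106,5.394287,150km").items(1000):
    process_status(status)  # e.g. write status.id, status.text and status.geo to the shapefile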