I want to modify a PDF certificate: I need to change the name, date, course name and course description. I tried PyPDF2, but it doesn't work for me: the replaced text in change_name never shows up in the output file.
Here is what I have tried so far:
pdfFileObj = open('/home/nawaf/Documents/5_dec_truck_courses/29_Nov_Updated/truck_courses/LMS_Online_App/static/certificate.pdf', 'rb')
pdf_reader = PdfFileReader(io.BytesIO(pdfFileObj.read()))
pdf_page = pdf_reader.pages[0]
pdf_text = pdf_page.extractText()
change_name = (pdf_text.replace("John Doe", "Nawaf")
               .replace("NAME OF THE COURSETO", "Rent Everything")
               .replace("The course description will go her.", "My Own Description"))
pdfFileObj.close()
pdf_writer = PdfFileWriter(pdfFileObj)
with open('/home/nawaf/Documents/5_dec_truck_courses/29_Nov_Updated/truck_courses/test/test.pdf', 'wb') as f:
    pdf_writer.addPage(page=pdf_page)
    pdf_writer.add_annotation(page_number=0, annotation=pdf_reader.metadata)
    pdf_writer.write(f)
    f.close()
I want to automatically extract the section "1A. Risk Factors" from around 10,000 files and write it to txt files.
A sample URL with a file can be found here
The desired section lies between "Item 1a Risk Factors" and "Item 1b". The problem is that 'item', '1a' and '1b' may be written differently across these files and may appear in several places, not only around the longest, proper section that interests me. So regular expressions are needed, such that:
The longest part between "1a" and "1b" is extracted (otherwise the table of contents and other useless elements will be picked up)
Different variants of the expressions are taken into consideration
I tried to implement these two goals in the script, but as this is my first project in Python, I just combined expressions that I thought might work, and apparently they are in the wrong order. I'm fairly sure I should iterate over the "<a>" elements, add each extracted "section" to a list, then choose the longest one and write it to a file, but I don't know how to implement this idea.
EDIT: Currently my method returns very little data between 1a and 1b (I think it's a page number from the table of contents) and then it stops.
My code:
import requests
import re
import csv
from bs4 import BeautifulSoup as bs

with open('indexes.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        f = open(saveas + ".txt", "w+", encoding="utf-8")
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        print(url)
        response = requests.get(url)
        soup = bs(response.content, 'html.parser')
        risks = soup.find_all('a')
        regexTxt = 'item[^a-zA-Z\n]*1a.*item[^a-zA-Z\n]*1b'
        for risk in risks:
            for i in risk.findAllNext():
                i.get_text()
                sections = re.findall(regexTxt, str(i), re.IGNORECASE | re.DOTALL)
                for section in sections:
                    clean = re.compile('<.*?>')
                    # section = re.sub(r'table of contents', '', section, flags=re.IGNORECASE)
                    # section = section.strip()
                    # section = re.sub('\s+', '', section).strip()
                    print(re.sub(clean, '', section))
The goal is to find the longest part between "1a" and "1b" (regardless of how they exactly look) in the current URL and write it to a file.
In the end I used a CSV file that contains a column HTMURL, which is the link to the htm-format 10-K. I got it from Kai Chen, who created this website. I wrote a simple script that writes the plain text into files. Processing it will be a simple task now.
import csv
from pathlib import Path

import requests
from bs4 import BeautifulSoup

with open('index.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        url = line[9]
        print(url)
        html_doc = requests.get(url).text
        soup = BeautifulSoup(html_doc, 'html.parser')
        text = soup.get_text()
        print(text)
        # Remove suffixes such as "/PA/" first, then any remaining slashes
        # (the other order would leave "PA" and "DE" in the file name)
        name = line[1].replace("/PA/", "").replace("/DE/", "").replace('/', '')
        path = Path(name + line[4] + ".txt")
        if path.is_dir():
            break
        # Write the plain text; the with block closes the file afterwards
        with open(path, "w+", encoding="utf-8") as f:
            f.write(text)
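Once the plain text is available, the "longest part between 1a and 1b" idea from the question could be sketched roughly as follows. This is only an illustration, not the code that was actually used: the two heading patterns are taken from the regex in the question, and the function name `longest_risk_section` and the sample text are made up. Every "Item 1A" heading is paired with the next "Item 1B" heading after it, and the longest candidate wins, so that a short table-of-contents entry loses to the real section.

```python
import re

# Loose heading patterns borrowed from the question's regex (an assumption;
# real filings may need more variants).
START = re.compile(r'item[^a-zA-Z\n]*1a', re.IGNORECASE)
END = re.compile(r'item[^a-zA-Z\n]*1b', re.IGNORECASE)


def longest_risk_section(text):
    """Return the longest span from an 'Item 1A' heading to the next 'Item 1B'."""
    candidates = []
    for start in START.finditer(text):
        end = END.search(text, start.end())
        if end is not None:
            candidates.append(text[start.start():end.start()])
    # An empty string is returned when no pair of headings is found
    return max(candidates, key=len, default='')


# Made-up sample: a table-of-contents entry followed by the real section
sample = (
    "Item 1A. Risk Factors 12\nItem 1B. Unresolved Staff Comments 20\n"
    "ITEM 1A - RISK FACTORS\nOur business depends on many things.\n"
    "ITEM 1B - UNRESOLVED STAFF COMMENTS\n"
)
section = longest_risk_section(sample)
print(section)
```

One limitation of this sketch: if a file mentions "Item 1A" somewhere without a matching "Item 1B" close by, that mention is paired with a distant "Item 1B" and may win as the longest candidate, so the output is worth spot-checking.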
I wrote this code to get the full list of followers of a Twitter account using Tweepy:
# ... twitter connection and streaming
fulldf = pd.DataFrame()
line = {}
ids = []
try:
    for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
        df = pd.DataFrame()
        ids.extend(page)
        try:
            for i in ids:
                user = api.get_user(i)
                line = [{'id': user.id,
                         'Name': user.name,
                         'Statuses Count': user.statuses_count,
                         'Friends Count': user.friends_count,
                         'Screen Name': user.screen_name,
                         'Followers Count': user.followers_count,
                         'Location': user.location,
                         'Language': user.lang,
                         'Created at': user.created_at,
                         'Time zone': user.time_zone,
                         'Geo enable': user.geo_enabled,
                         'Description': user.description.encode(sys.stdout.encoding, errors='replace')}]
                df = pd.DataFrame(line)
                fulldf = fulldf.append(df)
                del df
                fulldf.to_csv('out.csv', sep=',', index=False)
                print i, len(ids)
        except tweepy.TweepError:
            time.sleep(60 * 15)
            continue
except tweepy.TweepError as e2:
    print "exception global block"
    print e2.message[0]['code']
    print e2.args[0][0]['code']
In the end I have only 1000 lines in the CSV file. I know it's not the best solution to keep everything in memory (in a DataFrame) and write it to the file in the same loop, but at least I have something that works. It just doesn't get the full list: only 1000 out of 15000 followers.
Any help with this will be appreciated.
Consider the following part of your code:
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    df = pd.DataFrame()
    ids.extend(page)
    try:
        for i in ids:
            user = api.get_user(i)
Because you call extend for each page, each new set of ids is simply added to the end of your list of ids. The way your for statements are nested means that with every new page you call get_user for all of the ids from the previous pages first. As a result, by the time you hit the rate limit you would still be looking at the first 1000 or so ids, even on the final page with no more pages to browse. You're also likely hitting the rate limit for your cursor, which would be why you're seeing the exception.
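To make the cost of that nesting concrete, here is a small stand-alone simulation that only counts how many user lookups each approach would attempt. It makes no Twitter calls; the helper name `count_lookups` is made up, and the page size of 5000 ids is an assumption, combined with the 15000 followers mentioned in the question.

```python
def count_lookups(pages):
    """Count lookups for the nested loop versus a single pass over all ids."""
    ids = []
    nested_calls = 0
    for page in pages:
        ids.extend(page)
        # The inner loop in the question revisits every id gathered so far
        nested_calls += len(ids)
    flat_calls = len(ids)  # one lookup per id once all pages are collected
    return nested_calls, flat_calls


# Three hypothetical pages of 5000 ids each (15000 followers in total)
pages = [list(range(5000)), list(range(5000, 10000)), list(range(10000, 15000))]
nested, flat = count_lookups(pages)
print(nested, flat)  # 30000 15000
```

Under these assumptions the nested version attempts twice as many lookups, and the ids from the first page are requested three times.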
Let's start over a bit.
Firstly, tweepy can deal with rate limits (one of the main error sources) for you when you create your API if you use wait_on_rate_limit. This solves a whole bunch of problems, so we'll do that.
Secondly, if you use lookup_users, you can look up 100 user objects per request. I've written about this in another answer so I've taken the method from there.
Finally, we don't need to create a dataframe or export to a csv until the very end. If we get a list of user information dictionaries, this can quickly change to a DataFrame with no real effort from us.
Here is the full code - you'll need to sub in your keys and the username of the user you actually want to look up, but other than that it hopefully will work!
import tweepy
import pandas as pd


def lookup_user_list(user_id_list, api):
    full_users = []
    users_count = len(user_id_list)
    try:
        for i in range((users_count / 100) + 1):
            print i
            full_users.extend(api.lookup_users(user_ids=user_id_list[i * 100:min((i + 1) * 100, users_count)]))
        return full_users
    except tweepy.TweepError:
        print 'Something went wrong, quitting...'


consumer_key = 'XXX'
consumer_secret = 'XXX'
access_token = 'XXX'
access_token_secret = 'XXX'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="twittername").pages():
    ids.extend(page)

results = lookup_user_list(ids, api)
all_users = [{'id': user.id,
              'Name': user.name,
              'Statuses Count': user.statuses_count,
              'Friends Count': user.friends_count,
              'Screen Name': user.screen_name,
              'Followers Count': user.followers_count,
              'Location': user.location,
              'Language': user.lang,
              'Created at': user.created_at,
              'Time zone': user.time_zone,
              'Geo enable': user.geo_enabled,
              'Description': user.description}
             for user in results]
df = pd.DataFrame(all_users)
df.to_csv('All followers.csv', index=False, encoding='utf-8')
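A note for Python 3 users: the code above is written for Python 2 (print statements, and `users_count / 100` relies on integer division). In Python 3, `/` returns a float, so `range(...)` would raise a TypeError. A possible way to do the batching there is a small chunking helper like the one below; the name `chunks` is made up for this sketch, and each batch would then be passed to `api.lookup_users`.

```python
def chunks(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 250 placeholder ids split into batches of up to 100
ids = list(range(250))
batch_sizes = [len(batch) for batch in chunks(ids)]
print(batch_sizes)  # [100, 100, 50]
```

Stepping through the list by `size` also avoids sending an empty final batch when the number of ids is an exact multiple of 100.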
I am trying to scrape the webpage at this link:
http://new-york.eat24hours.com/picasso-pizza/19053
I am trying to get all the possible details, such as address and phone.
So far I have extracted the name, phone, address, reviews and rating.
But I also want to extract the full menu of the restaurant (the name of each item with its price).
So far I have no idea how to fit this data into the CSV output.
The rest of the data appears once per URL, but the number of menu items differs from restaurant to restaurant.
Here is my code so far:
import scrapy
from urls import start_urls


class eat24Spider(scrapy.Spider):
    AUTOTHROTTLE_ENABLED = True
    name = 'eat24'

    def start_requests(self):
        for x in start_urls:
            yield scrapy.Request(x, self.parse)

    def parse(self, response):
        brickset = response
        NAME_SELECTOR = 'normalize-space(.//h1[@id="restaurant_name"]/a/text())'
        ADDRESS_SELECTION = 'normalize-space(.//span[@itemprop="streetAddress"]/text())'
        LOCALITY = 'normalize-space(.//span[@itemprop="addressLocality"]/text())'
        REGION = 'normalize-space(.//span[@itemprop="addressRegion"]/text())'
        ZIP = 'normalize-space(.//span[@itemprop="postalCode"]/text())'
        PHONE_SELECTOR = 'normalize-space(.//span[@itemprop="telephone"]/text())'
        RATING = './/meta[@itemprop="ratingValue"]/@content'
        NO_OF_REVIEWS = './/meta[@itemprop="reviewCount"]/@content'
        OPENING_HOURS = './/div[@class="hours_info"]//nobr/text()'
        EMAIL_SELECTOR = './/div[@class="company-info__block"]/div[@class="business-buttons"]/a[span]/@href[substring-after(.,"mailto:")]'
        yield {
            'name': brickset.xpath(NAME_SELECTOR).extract_first().encode('utf8'),
            'pagelink': response.url,
            'address' : str(brickset.xpath(ADDRESS_SELECTION).extract_first().encode('utf8')+', '+brickset.xpath(LOCALITY).extract_first().encode('utf8')+', '+brickset.xpath(REGION).extract_first().encode('utf8')+', '+brickset.xpath(ZIP).extract_first().encode('utf8')),
            'phone' : str(brickset.xpath(PHONE_SELECTOR).extract_first()),
            'reviews' : str(brickset.xpath(NO_OF_REVIEWS).extract_first()),
            'rating' : str(brickset.xpath(RATING).extract_first()),
            'opening_hours' : str(brickset.xpath(OPENING_HOURS).extract_first())
        }
I am sorry if I am making this confusing but any kind of help will be appreciated.
Thank you in advance!!
If you want to extract the full restaurant menu, you first need to locate the elements that contain both the name and the price:
menu_items = response.xpath('//tr[@itemscope]')
After that, you can loop over the menu items and append each name and price to a list:
menu = []
for item in menu_items:
    menu.append({
        'name': item.xpath('.//a[@class="cpa"]/text()').extract_first(),
        'price': item.xpath('.//span[@itemprop="price"]/text()').extract_first()
    })
Finally, you can add a new 'menu' key to your dict:
yield {'menu': menu}
I also suggest using Scrapy Items for storing the scraped data:
https://doc.scrapy.org/en/latest/topics/items.html
To output the data to a CSV file, use Scrapy feed exports. Type in the console:
scrapy crawl yourspidername -o restaurants.csv
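Since the number of menu items differs per restaurant, one possible way to keep the CSV rectangular is to write one row per menu item and repeat the restaurant fields on each row. The sketch below shows that idea with the standard csv module only. The helper name `menu_rows`, the column names `item_name` and `item_price`, and the sample data are all made up for illustration; in the spider, the equivalent would be to yield one dict per menu item instead of one dict holding a nested list.

```python
import csv
import io


def menu_rows(restaurant):
    """Yield one flat dict per menu item, repeating the restaurant fields."""
    base = {key: value for key, value in restaurant.items() if key != 'menu'}
    for item in restaurant['menu']:
        yield {**base, 'item_name': item['name'], 'item_price': item['price']}


# Made-up data in the shape the spider would produce
restaurant = {
    'name': 'Example Pizza',
    'phone': '555-0100',
    'menu': [
        {'name': 'Margherita', 'price': '9.50'},
        {'name': 'Garlic Knots', 'price': '4.00'},
    ],
}

rows = list(menu_rows(restaurant))

# Write to an in-memory buffer here; a real run would write to a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'phone', 'item_name', 'item_price'])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The trade-off is some repetition of the restaurant details, in exchange for a file that opens cleanly in a spreadsheet regardless of menu length.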
I'm developing a simple webapp in web2py and I want to create a link that lets the user download a file. Like this:
<a href="{{=URL('download',args = FILE)}}" download>
However, I want to do this without having to pass the FILE to the user in the page handler. I want to retrieve an ID from the server asynchronously that will correspond to the file I want to download and then pass it to a custom download function like this:
<a href="{{=URL('custom_download',args = FILEID)}}" download>
This way, I will be able to upload files to the server asynchronously, (I already figured out how to do that) and the download link on the page for that file will work right away without having to reload the page.
So, on the server side, I would do something like this:
def custom_download():
    download_row = db(db.computers.FILEID == request.args(0)).select()
    download_file = download_row.filefield
    return download_file
However, I'm not entirely sure what I need to write in order for this to work.
Assuming that your files are stored in the uploads folder, your custom download function will be:
import os

def custom_download():
    # select() returns a set of rows, so take the first match
    download_row = db(db.computers.FILEID == request.args(0)).select().first()
    download_file = download_row.filefield
    # Name of file is table_name.field.XXXXX.ext, so retrieve original file name
    org_file_name = db.computers.filefield.retrieve(download_file)[0]
    file_header = "attachment; filename=" + org_file_name
    response.headers['Content-Type'] = "application/octet-stream"
    response.headers['Content-Disposition'] = file_header
    file_full_path = os.path.join(request.folder, 'uploads', download_file)
    fh = open(file_full_path, 'rb')
    return response.stream(fh)
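The two response headers do most of the work here, so it may help to look at them in isolation. The sketch below is framework-independent: the helper name `download_headers` is made up, and guessing the content type with the standard mimetypes module is a variation on the fixed "application/octet-stream" used above. Note that the HTTP header is spelled `Content-Type`, with a hyphen.

```python
import mimetypes


def download_headers(original_name):
    """Build headers that ask the browser to save the file under its original name."""
    content_type, _ = mimetypes.guess_type(original_name)
    return {
        # Fall back to a generic binary type when the extension is unknown
        'Content-Type': content_type or 'application/octet-stream',
        'Content-Disposition': 'attachment; filename="%s"' % original_name,
    }


headers = download_headers('report.pdf')
print(headers)
```

In a web2py controller these values would be assigned to `response.headers` before streaming the file.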
I'm trying to write a script that will save a pdf created by xhtml2pdf directly to the server, without doing the usual route of prompting the user to download it to their computer. Documents() is the Model I am trying to save to, and the new_project and output_filename variables are set elsewhere.
html = render_to_string(template, RequestContext(request, context)).encode('utf8')
result = open(output_filename, "wb")
pdf = CreatePDF(src=html, dest=results, path = "", encoding = 'UTF-8', link_callback=link_callback) #link callback was originally set to link_callback, defined below
result.close()
if not pdf.err:
    new_doc=Documents()
    new_doc.project=new_project
    new_doc.posted_by=old_mess[0].from_user_fk.username
    new_doc.documents = result
    new_doc.save()
With this configuration when it reaches new_doc.save() I get the error: 'file' object has no attribute '_committed'
Does anyone know how I can fix this? Thanks!
After playing around with it I found a working solution. The issue was that I was not creating the new Document while result (the PDF) was still open.
A "+" also had to be added to the mode passed to open(), so that the PDF file was available for both reading and writing, not just writing.
Note that this saves the PDF in a different folder first (Files). If that is not the desired outcome for your application, you will need to delete it afterwards.
html = render_to_string(template, RequestContext(request, context)).encode('utf8')
results = StringIO()
result = open("Files/"+output_filename, "w+b")
pdf = CreatePDF(src=html, dest=results, path = "", encoding = 'UTF-8', link_callback=link_callback) #link callback was originally set to link_callback, defined below
if not pdf.err:
    result.write(results.getvalue())
    new_doc=Documents()
    new_doc.project=new_project
    new_doc.documents.save(output_filename, File(result))
    new_doc.posted_by=old_mess[0].from_user_fk.username
    new_doc.save()
result.close()
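The difference between the "wb" and "w+b" modes can be seen without Django or xhtml2pdf. The following stand-alone sketch uses only the standard library; the temporary file and the placeholder bytes are made up for the demonstration.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.pdf')

# "wb" is write-only: trying to read from the handle fails
with open(path, 'wb') as write_only:
    write_only.write(b'%PDF-demo')
    try:
        write_only.read()
        readable = True
    except io.UnsupportedOperation:
        readable = False

# "w+b" allows both; rewind before reading back what was written
with open(path, 'w+b') as read_write:
    read_write.write(b'%PDF-demo')
    read_write.seek(0)
    content = read_write.read()

print(readable, content)
```

This is presumably why the file handle passed to File(result) has to be opened in a readable mode and still be open when the document is saved.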