I am getting an IndexError (index out of range) when I try to use multiple findall calls, but if I just use one, the code works.
from re import findall
news = open('download7.html', 'r')
title = findall('<item>[^<]+<title>(.*)</title>', news.read())
link = findall('<item>[^<]+<title>[^<]+</title>[^<]+<link>(.*)</link>', news.read())
description = findall('<!\[CDATA\[[^<]+<p>(.*)</p>', news.read())
pubdate = findall('<pubDate>([^<]+)</pubDate>', news.read())
image_regex = findall('url="([^"]+627.jpg)', news.read())
print(image_regex[0])
Calling .read() on a file object reads all remaining data from the file, and leaves the file pointer at the end of the file (so subsequent calls to .read() return the empty string).
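You can see this behavior directly (a minimal sketch):

    f = open('download7.html', 'r')
    first = f.read()   # returns the whole file
    second = f.read()  # returns '' because the pointer is already at end-of-file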
Cache the file contents once, and reuse it:
from re import findall
with open('download7.html', 'r') as news:
    newsdata = news.read()

title = findall('<item>[^<]+<title>(.*)</title>', newsdata)
link = findall('<item>[^<]+<title>[^<]+</title>[^<]+<link>(.*)</link>', newsdata)
description = findall('<!\[CDATA\[[^<]+<p>(.*)</p>', newsdata)
pubdate = findall('<pubDate>([^<]+)</pubDate>', newsdata)
image_regex = findall('url="([^"]+627.jpg)', newsdata)
print(image_regex[0])
Note: You could re-read from the file object by seeking back to the beginning after each read (calling news.seek(0)), but that's far less efficient when you need the complete file data over and over.
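For illustration, the rewinding approach would look like this (a sketch; caching, as above, is still the better option):

    from re import findall

    news = open('download7.html', 'r')
    title = findall('<item>[^<]+<title>(.*)</title>', news.read())
    news.seek(0)  # rewind so the next read() sees the whole file again
    link = findall('<item>[^<]+<title>[^<]+</title>[^<]+<link>(.*)</link>', news.read())
    news.close()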
I want to automatically extract section "1A. Risk Factors" from around 10,000 files and write it into txt files.
A sample URL with a file can be found here
The desired section is between "Item 1a Risk Factors" and "Item 1b". The problem is that 'item', '1a' and '1b' may look different across these files and may appear in multiple places, not only in the longest, proper occurrence that interests me. Thus, some regular expressions should be used, so that:
The longest part between "1a" and "1b" is extracted (otherwise the table of contents and other useless elements will appear)
Different variants of the expressions are taken into consideration
I tried to implement these two goals in the script, but as it's my first project in Python, I more or less randomly ordered the expressions I thought might work, and apparently they are in the wrong order. (I'm fairly sure I should iterate over the <a> elements, add each extracted "section" to a list, then choose the longest one and write it to a file, though I don't know how to implement this idea.)
EDIT: Currently my method returns very little data between 1a and 1b from the table of contents (I think it's a page number), and then it stops...(?)
My code:
import requests
import re
import csv
from bs4 import BeautifulSoup as bs
with open('indexes.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        f = open(saveas + ".txt", "w+", encoding="utf-8")
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        print(url)
        response = requests.get(url)
        soup = bs(response.content, 'html.parser')
        risks = soup.find_all('a')
        regexTxt = 'item[^a-zA-Z\n]*1a.*item[^a-zA-Z\n]*1b'
        for risk in risks:
            for i in risk.findAllNext():
                i.get_text()
                sections = re.findall(regexTxt, str(i), re.IGNORECASE | re.DOTALL)
                for section in sections:
                    clean = re.compile('<.*?>')
                    # section = re.sub(r'table of contents', '', section, flags=re.IGNORECASE)
                    # section = section.strip()
                    # section = re.sub('\s+', '', section).strip()
                    print(re.sub(clean, '', section))
The goal is to find the longest part between "1a" and "1b" (regardless of their exact form) in the document at the current URL and write it to a file.
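For the "longest part" requirement specifically, one approach is to collect every candidate match and keep the longest; a sketch, where document_text is assumed to hold the full text of one filing:

    import re

    pattern = re.compile(r'item[^a-zA-Z\n]*1a.*?item[^a-zA-Z\n]*1b',
                         re.IGNORECASE | re.DOTALL)
    candidates = pattern.findall(document_text)  # short table-of-contents hits plus the real section
    if candidates:
        longest = max(candidates, key=len)  # the real section beats the table-of-contents entry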
In the end I used a CSV file that contains a column HTMURL, which is the link to the htm-format 10-K. I got it from Kai Chen, who created this website. I wrote a simple script that writes pure txt into files. Processing it will be a simple task now.
import requests
import csv
from pathlib import Path
from bs4 import BeautifulSoup
with open('index.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        print(line[9])
        url = line[9]
        html_doc = requests.get(url).text
        soup = BeautifulSoup(html_doc, 'html.parser')
        print(soup.get_text())
        name = line[1]
        name = name.replace('/', '')
        name = name.replace("/PA/", "")
        name = name.replace("/DE/", "")
        dir = Path(name + line[4] + ".txt")
        f = open(dir, "w+", encoding="utf-8")
        if dir.is_dir():
            break
        else:
            f.write(soup.get_text())
I'm trying to write a small crawler to crawl multiple wikipedia pages.
I want to make the crawl somewhat dynamic by concatenating the hyperlink for the exact wikipage from a file which contains a list of names.
For example, the first line of "deutsche_Schauspieler.txt" says "Alfred Abel", and the concatenated string would be "https://de.wikipedia.org/wiki/Alfred Abel". Using the txt file results in heading being None, yet when I complete the link with a string inside the script, it works.
This is for Python 2.x.
I already tried to switch from " to ',
tried + instead of %s,
tried to put the whole string into the txt file (so that the first line reads "http://..." instead of "Alfred Abel"),
tried to switch from "Alfred Abel" to "Alfred_Abel".
from bs4 import BeautifulSoup
import requests
file = open("test.txt","w")
f = open("deutsche_Schauspieler.txt","r")
content = f.readlines()
for line in content:
    link = "https://de.wikipedia.org/wiki/%s" % (str(line))
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html)
    heading = soup.find(id='Vorlage_Personendaten')
    uls = heading.find_all('td')
    for item in uls:
        file.write(item.text.encode('utf-8') + "\n")
f.close()
file.close()
I expect to get the content of the table "Vorlage_Personendaten", which actually works if I change the link assignment to
link = "https://de.wikipedia.org/wiki/Alfred Abel"
# link = "https://de.wikipedia.org/wiki/Alfred_Abel" also works
But I want it to work using the text file.
It looks like the problem is in your text file, where you have used "Alfred Abel" with quotes; that is why you are getting the following exception:
uls = heading.find_all('td')
AttributeError: 'NoneType' object has no attribute 'find_all'
Please remove the quotes around "Alfred Abel" and use Alfred Abel inside the text file deutsche_Schauspieler.txt. It will work as expected.
I found the solution myself.
Although there are no additional lines in the file, the content array displays as ['Alfred Abel\n']. Printing the first index of the array shows 'Alfred Abel', but the trailing newline is still part of the string, thus forming a broken link.
So you want to remove the last character from the current line.
A solution could look like this:
from bs4 import BeautifulSoup
import requests
file = open("test.txt","w")
f = open("deutsche_Schauspieler.txt","r")
content = f.readlines()
print (content)
for line in content:
    line = line[:-1]  # remove the trailing '\n' that readlines() keeps on each line
    link = "https://de.wikipedia.org/wiki/%s" % str(line)
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html, "html.parser")
    try:
        heading = soup.find(id='Vorlage_Personendaten')
        uls = heading.find_all('td')
        for item in uls:
            file.write(item.text.encode('utf-8') + "\n")
    except:
        print("That did not work")
        pass
f.close()
file.close()
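A slightly safer alternative is str.rstrip, which only strips a newline if one is present (the last line of a file often has none); a sketch of the changed lines:

    for line in content:
        name = line.rstrip('\n')  # removes the trailing newline only when it exists
        link = "https://de.wikipedia.org/wiki/%s" % name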
I want to edit an uploaded file on byte level (i.e. searching and removing a certain byte sequence) before saving it.
I have a pre_save signal set up in the following way:
class Snippet(models.Model):
    name = models.CharField(max_length=256, unique=True)
    audio_file = models.FileField(upload_to=generate_file_name, blank=True, null=True)

@receiver(models.signals.pre_save, sender=Snippet)
def prepare_save(sender, instance, **kwargs):
    if instance.audio_file:
        remove_headers(instance)
Now I have had problems implementing the remove_headers function in a way that lets me edit the file while it is still in memory and have it stored afterwards. I tried, among others, the following:
def remove_headers(instance):
    byte_sequence = b'bytestoremove'
    f = instance.audio_file.read()
    file_in_hex = f.hex()
    file_in_hex = re.sub(byte_sequence.hex(), '', file_in_hex)
    x = b''
    x = x.fromhex(file_in_hex)
    tmp_file = TemporaryFile()
    tmp_file.write(x)
    tmp_file.flush()
    tmp_file.seek(0)
    instance.audio_file.save(instance.audio_file.name, tmp_file, save=True)
First of all, this would result in an infinite loop, because saving with save=True saves the model and triggers the pre_save signal again. That can be mitigated by, e.g., only calling the remove_headers method on create. It did, however, not work: the file was unchanged. I also tried replacing the last line with:
instance.audio_file = File(tmp_file, name=instance.audio_file.name)
This, however, resulted in an empty file being written/saved.
Curiously, when writing a test, this method seems to work:
def test_header_removed(self):
    snippet = mommy.make(Snippet)
    snippet.audio_file.save('newname.mp3', ContentFile('contentbytestoremovecontent'))
    snippet.save()
    self.assertEqual(snippet.audio_file.read(), b'contentcontent')
This test does not fail, despite the file being zero bytes in the end.
What am I missing here?
The second solution was almost correct. The reason the files ended up empty (actually this only happened with bigger files) was that sometimes you have to seek to the beginning of the file after opening it. So the beginning of remove_headers needs to be changed:
def remove_headers(instance):
    byte_sequence = b'bytestoremove'
    instance.audio_file.seek(0)
    f = instance.audio_file.read()
    file_in_hex = f.hex()
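As an aside, the whole hex round-trip could be avoided with bytes.replace; a minimal sketch under the same assumptions:

    def remove_headers(instance):
        byte_sequence = b'bytestoremove'
        instance.audio_file.seek(0)
        cleaned = instance.audio_file.read().replace(byte_sequence, b'')
        # ... write `cleaned` back to a temporary file and save as before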
I'm reading a file outputfromextract and I want to split the contents of that file on the delimiter ',', which I have done.
When reading the contents into a list, there are two 'faff' entries at the beginning that I'm trying to remove, however I find myself unable to remove them by index.
import json

class device:
    ipaddress = None
    macaddress = None

    def __init__(self, ipaddress, macaddress):
        self.ipaddress = ipaddress
        self.macaddress = macaddress

listofItems = []
listofdevices = []

def format_the_data():
    file = open("outputfromextract")
    contentsofFile = file.read()
    individualItem = contentsofFile.split(',')
    listofItems.append(individualItem)
    print(listofItems[0][0:2])  # this here displays the entries I want to remove
    listofItems.remove[0[0:2]]  # fails here and raises a TypeError (int object not subscriptable)
In the file I have created, the first three entries are enclosed below for reference:
[u' #created by system\n', u'time at 12:05\n', u'192.168.1.1\n',...
I simply want to remove those two items from the list; the rest will be passed into a constructor.
The TypeError comes from 0[0:2], which tries to slice the integer 0 (and remove is a method, so it is called with parentheses, not indexed with square brackets). Slicing the list will be much easier, for example:
with open("outputfromextract") as f:
contentsofFile = f.read()
individualItem = contentsofFile.split(',')[2:]
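If the list has to be modified in place instead, deleting a slice also works (a sketch, assuming the same nested listofItems layout as in the question):

    del listofItems[0][0:2]  # drops the first two entries of the inner list in place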
I'm trying to scrape the image from an article using Beautiful Soup. It seems to work, but I can't open the image: I get a file-format error every time I try to open it from my desktop. Any insights?
timestamp = time.asctime()

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Create a new file to write content to
txt = open('%s.jpg' % timestamp, "wb")

# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    txt.write('\n' + "Image(s): " + download_img.read() + '\n' + '\n')
txt.close()
You are prepending a newline and text to the start of the data for every image, essentially corrupting it.
Also, you are writing every image into the same file, again corrupting them.
Put the logic for writing the files inside the loop, and don't add any extra data to the images, and it should work fine.
# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
    timestamp = time.asctime()
    txt = open('%s.jpg' % timestamp, "wb")
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    txt.write(download_img.read())
    txt.close()
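One caveat: time.asctime() has one-second resolution, so two images fetched within the same second would still overwrite each other. A sketch that makes the names unique with an index:

    for i, img in enumerate(links):
        out = open('%s-%d.jpg' % (time.asctime(), i), "wb")
        out.write(urllib2.urlopen(img["src"].split("src=")[-1]).read())
        out.close()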