I'm trying to write a small crawler to crawl multiple wikipedia pages.
I want to make the crawl somewhat dynamic by concatenating the hyperlink for the exact wikipage from a file which contains a list of names.
For example, the first line of "deutsche_Schauspieler.txt" says "Alfred Abel", so the concatenated string becomes "https://de.wikipedia.org/wiki/Alfred Abel". Building the link from the txt file results in heading being None, yet when I hard-code the full link as a string inside the script, it works.
This is for python 2.x.
I already tried to switch from " to ',
tried + instead of %s
tried to put the whole string into the txt file (so that the first line reads "http://..." instead of "Alfred Abel")
tried to switch from "Alfred Abel" to "Alfred_Abel"
from bs4 import BeautifulSoup
import requests
file = open("test.txt","w")
f = open("deutsche_Schauspieler.txt","r")
content = f.readlines()
for line in content:
    link = "https://de.wikipedia.org/wiki/%s" % (str(line))
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html)
    heading = soup.find(id='Vorlage_Personendaten')
    uls = heading.find_all('td')
    for item in uls:
        file.write(item.text.encode('utf-8') + "\n")
f.close()
file.close()
I expect to get the content of the table "Vorlage_Personendaten", which actually works if I change the link assignment to
link = "https://de.wikipedia.org/wiki/Alfred Abel"
# link = "https://de.wikipedia.org/wiki/Alfred_Abel" also works
But I want it to work using the textfile
It looks like the problem is in your text file, where you have written "Alfred Abel" with quotes; that is why you are getting the following exception:
uls = heading.find_all('td')
AttributeError: 'NoneType' object has no attribute 'find_all'
Please remove the quotes around "Alfred Abel" and use Alfred Abel (without quotes) inside the text file deutsche_Schauspieler.txt. It will then work as expected.
I found the solution myself.
Although there are no additional lines in the file, the content array displays as
['Alfred Abel\n']; printing the first index of the array shows 'Alfred Abel', but the value used for the link is still the string from the array, trailing newline included, which forms a false link.
So you want to remove the last(!) character from the current line.
A solution could look like so:
from bs4 import BeautifulSoup
import requests
file = open("test.txt","w")
f = open("deutsche_Schauspieler.txt","r")
content = f.readlines()
print (content)
for line in content:
    line = line[:-1]  # strip the trailing newline left by readlines()
    link = "https://de.wikipedia.org/wiki/%s" % str(line)
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html, "html.parser")
    try:
        heading = soup.find(id='Vorlage_Personendaten')
        uls = heading.find_all('td')
        for item in uls:
            file.write(item.text.encode('utf-8') + "\n")
    except:
        print("That did not work")
        pass
f.close()
file.close()
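For what it's worth, a slightly more defensive variant of the same idea (only a sketch, not tested against every page) iterates over the file object directly and uses rstrip("\n"), which also copes with a last line that has no trailing newline and with blank lines:

from bs4 import BeautifulSoup
import requests

out = open("test.txt", "w")
names = open("deutsche_Schauspieler.txt", "r")

for line in names:
    name = line.rstrip("\n")  # drop the newline only if it is there
    if not name:
        continue  # skip empty lines
    link = "https://de.wikipedia.org/wiki/%s" % name
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    heading = soup.find(id='Vorlage_Personendaten')
    if heading is None:
        continue  # page missing or laid out differently
    for item in heading.find_all('td'):
        out.write(item.text.encode('utf-8') + "\n")

names.close()
out.close()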
Related
I want to automatically extract section "1A. Risk Factors" from around 10000 files and write it into txt files.
A sample URL with a file can be found here
The desired section is between "Item 1a Risk Factors" and "Item 1b". The thing is that 'item', '1a' and '1b' might look different across these files and may appear in multiple places, not only in the longest, proper section that interests me. Thus, some regular expressions should be used, so that:
The longest part between "1a" and "1b" is extracted (otherwise the table of contents will appear and other useless elements)
Different variants of the expressions are taken into consideration
I tried to implement these two goals in the script, but as it's my first project in Python, I just randomly ordered expressions that I think might work, and apparently they are in the wrong order (I'm sure I should iterate over the "<a>" elements, add each extracted "section" to a list, then choose the longest one and write it to a file, though I don't know how to implement this idea).
EDIT: Currently my method returns very little data between 1a and 1b (I think it's a page number) from the table of contents and then it stops...(?)
My code:
import requests
import re
import csv
from bs4 import BeautifulSoup as bs
with open('indexes.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        f = open(saveas + ".txt", "w+", encoding="utf-8")
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        print(url)
        response = requests.get(url)
        soup = bs(response.content, 'html.parser')
        risks = soup.find_all('a')
        regexTxt = 'item[^a-zA-Z\n]*1a.*item[^a-zA-Z\n]*1b'
        for risk in risks:
            for i in risk.findAllNext():
                i.get_text()
                sections = re.findall(regexTxt, str(i), re.IGNORECASE | re.DOTALL)
                for section in sections:
                    clean = re.compile('<.*?>')
                    # section = re.sub(r'table of contents', '', section, flags=re.IGNORECASE)
                    # section = section.strip()
                    # section = re.sub('\s+', '', section).strip()
                    print(re.sub(clean, '', section))
The goal is to find the longest part between "1a" and "1b" (regardless of how they exactly look) in the current URL and write it to a file.
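As an aside, the idea mentioned in the question (collect every candidate span, then keep the longest) can be written fairly compactly by running the regex over the raw page text instead of over each "<a>" element. This is only a sketch: the url value is a placeholder for one filing URL from indexes.csv, and the .* quantifier is made non-greedy so that findall returns each "1a ... 1b" span separately instead of one huge match.

import re
import requests

url = 'https://www.sec.gov/Archives/...'  # placeholder: one filing URL built from indexes.csv
pattern = r'item[^a-zA-Z\n]*1a.*?item[^a-zA-Z\n]*1b'

text = requests.get(url).text
# Each match is one "1a ... 1b" span: the table-of-contents hit, the real section, etc.
candidates = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)
if candidates:
    # The real "Risk Factors" section should be by far the longest candidate.
    risk_section = max(candidates, key=len)
    risk_section = re.sub(r'<.*?>', '', risk_section)  # strip tags, as in the question
    print(risk_section[:500])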
In the end I used a CSV file that contains a column HTMURL, which is the link to the htm-format 10-K. I got it from Kai Chen, who created this website. I wrote a simple script that writes plain txt into files. Processing it will be a simple task now.
import requests
import csv
from pathlib import Path
from bs4 import BeautifulSoup
with open('index.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        print(line[9])
        url = line[9]
        html_doc = requests.get(url).text
        soup = BeautifulSoup(html_doc, 'html.parser')
        print(soup.get_text())
        name = line[1]
        name = name.replace('/', '')
        name = name.replace("/PA/", "")
        name = name.replace("/DE/", "")
        dir = Path(name + line[4] + ".txt")
        f = open(dir, "w+", encoding="utf-8")
        if dir.is_dir():
            break
        else:
            f.write(soup.get_text())
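One small note on the script above: Path.is_dir() on a path ending in ".txt" will practically never be true, so the break branch is unlikely to ever run. If the intent was to skip filings that have already been written, a sketch of the end of the loop body (same variable names as above, and assuming that intent) could be:

out_path = Path(name + line[4] + ".txt")
if out_path.exists():
    continue  # this filing was already downloaded, move on to the next row
with open(out_path, "w", encoding="utf-8") as f:
    f.write(soup.get_text())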
My code grabs information and stores it in a list. The sorted list is:
example/
example/text1.txt
example/text2.txt
example/text3.txt
I would like to refer to text1.txt and perform a function to it, then move on to the next entry in the list (in this case, text2.txt).
I was able to see a bit of what I can do with regex, but it outputs nothing.
Here's a portion of my code so far:
FileNames = name in sorted(zip_file.namelist())
regex = r"[1-9]+ \d"
matches = re.findall(regex, str(FileNames))
for match in matches:
    print("%s" % (match))
EDIT:
utilizing a different technique, here's what I got so far:
import zipfile
import re
zip_file = zipfile.ZipFile('/example.zip','r')
# zip_file = zipfile.ZipFile(input("What's the filepath?: "))
for name in sorted(zip_file.namelist()):
    # print(name)
    for file_path in name:
        file_name = file_path.split("/")[-1]
        if "1" in file_name:
            print(file_name)
        else:
            print("This line does not contain a valid path to a text file.")
zip_file.close()
It gives me a really gross output, something along the lines of
example/text1.txt
The Line does not contain a valid path to a text file.
^repeated a ton of times
I would not use regex but just a simple split, since you are dealing with paths. For each line, you can take the rightmost element after splitting on "/" if it exists. If it contains ".txt", it's your file name; otherwise ignore that line.
files_paths = ["example/",
"example/text1.txt",
"example/text2.txt",
"example/text3.txt"]
for file_path in files_paths :
file_name = file_path.split("/")[-1]
if ".txt" in file_name:
... # call a function with file_name as an argument.
else:
print("This line does not contain a valid path to a text file.")
The print is just there for testing of course, feel free to delete the else clause if you want your script to stay silent.
Assuming that every entry in the list is a path to a file on which you want to perform some work, you can use the os.path library to get the basename of the file path.
import os
path_str = "example/text1.txt"
file_name = os.path.basename(os.path.normpath(path_str))
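Putting that together with the zip archive from the question's edit, a minimal sketch (the archive path and the per-file work are placeholders/assumptions) might look like:

import os
import zipfile

zip_file = zipfile.ZipFile('example.zip', 'r')  # assumed archive path

for entry in sorted(zip_file.namelist()):
    file_name = os.path.basename(os.path.normpath(entry))
    if file_name.endswith('.txt'):
        print(file_name)  # placeholder for whatever work should be done per file
    else:
        print("This entry is not a text file.")

zip_file.close()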
I'm reading a file outputfromextract and I want to split the contents of that file on the delimiter ',', which I have done.
When reading the contents into a list there are two 'faff' entries at the beginning that I'm trying to remove, but I find myself unable to remove them by index.
import json
class device:
    ipaddress = None
    macaddress = None

    def __init__(self, ipaddress, macaddress):
        self.ipaddress = ipaddress
        self.macaddress = macaddress

listofItems = []
listofdevices = []

def format_the_data():
    file = open("outputfromextract")
    contentsofFile = file.read()
    individualItem = contentsofFile.split(',')
    listofItems.append(individualItem)
    print(listofItems[0][0:2])  # this here displays the entries I want to remove
    listofItems.remove[0[0:2]]  # fails here and raises a TypeError (int object not subscriptable)
In the file I have created the first three lines are enclosed below for reference:
[u' #created by system\n', u'time at 12:05\n', u'192.168.1.1\n',...
I'm wanting to simply remove those two items from the list; the rest will be passed to a constructor.
I believe listofItems.remove[0[0:2]] should be listofItems.remove[0][0:2].
But, slicing will be much easier, for example:
with open("outputfromextract") as f:
    contentsofFile = f.read()
    individualItem = contentsofFile.split(',')[2:]
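If the list has already been built the way the question does it (one inner list appended to listofItems), the first two entries can also be dropped in place with del; a small self-contained sketch:

listofItems = [[u' #created by system\n', u'time at 12:05\n', u'192.168.1.1\n']]

del listofItems[0][0:2]  # remove the two 'faff' entries from the inner list
print(listofItems[0])    # [u'192.168.1.1\n']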
This is my first attempt at coding, so please forgive my daftness. I'm trying to learn web scraping by practising with this link:
https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0
I've honestly spent hours trying to figure out what's wrong with my code here:
import csv
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('tbody')
list_of_rows = []
for row in table.find('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        list_of_cells.append()
    list_of_rows.append(list_of_cells)
outfile = open("./indarb.csv","wb")
writer = csv.writer(outfile)
My terminal then spits out this: 'NoneType' object has no attribute 'find', saying there's an error in line 13. Not sure if it helps, but this is a list of what I've tried:
Different permutations of 'find'/'findAll'
Instead of '.find', used '.findAll'
Instead of '.findAll', used '.find'
Different permutations for line 10
Tried soup.find('tbody')
Tried soup.find('table')
Opened source code, tried soup.find('table', attrs={'class':'table table-condensed'})
Different permutations for line 13
similarly tried with just 'tr' tag; or
tried adding 'attrs={}' stuff
I've really tried but can't figure out why I can't scrape that simple 10 row table. If anyone could post code that works, that'd be phenomenal. Thank you for your patience!
The URL you request in your code is not HTML but JSON.
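If the response really is JSON, there is no table for BeautifulSoup to find; a quick way to check (same URL as in the question) is to look at the Content-Type header and, if it is JSON, decode it with requests directly:

import requests

url = ('https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes'
       '?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0')
response = requests.get(url)
print(response.headers.get('Content-Type'))  # shows whether the server returned HTML or JSON
if 'json' in response.headers.get('Content-Type', ''):
    data = response.json()  # parse the body as JSON instead of feeding it to BeautifulSoup
    print(type(data))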
You have a few mistakes. The biggest is that you are using BeautifulSoup 3, which has not been developed for years; you should use bs4. You also need to use find_all when you want multiple tags. Also, you have not passed cell to list_of_cells.append() on line 13, so that is the cause of your other error:
import requests
from bs4 import BeautifulSoup

url = 'https://data.gov.sg/dataset/industrial-arbitration-court-awards-by-nature-of-trade-disputes?view_id=d3e444ef-54ed-4d0b-b715-1ee465f6d882&resource_id=c24d0d00-2d12-4f68-8fc9-4121433332e0'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all('td'):
        list_of_cells.append(cell)
    list_of_rows.append(list_of_cells)
I am not sure exactly what you want, but that appends the tds from the first table on the page. There is also an API you can use, and a downloadable CSV, if you do actually want the data.
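Since the question already opens indarb.csv with csv.writer, the scraped rows can be written out as a follow-up to the loop above; a sketch (assuming Python 3 here; on Python 2 keep the question's "wb" mode), storing each cell's text rather than the Tag objects so the CSV stays readable:

import csv

# list_of_rows as built above, but keep the visible text of each cell
rows_as_text = [[cell.get_text(strip=True) for cell in row] for row in list_of_rows]

with open('indarb.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(rows_as_text)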
I'm writing some code that fetches text from a site and then, with a for loop, takes the part of the text I'm interested in. I can print this text, but I would like to know how I can send it to a list for later use. So far the code I've written is this:
import urllib2
keyword = raw_input('keyword: ')
URL = "http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=%s&fil=&limit=10&force=no&preview=true&format=fasta" % keyword
filehandle = urllib2.urlopen(URL)
url_text = filehandle.readlines()
for line in url_text:
    if line.startswith('>'):
        print line[line.index(' ') : line.index('OS')]
Just use append:
lines = []
for line in url_text:
    if line.startswith('>'):
        lines.append(line)  # or whatever else you wanted to add to the list
        print line[line.index(' ') : line.index('OS')]
Edit: on a side note, Python can for-loop directly over a file, as in:
url_text = filehandle.readlines()
for line in url_text:
    pass

# can be shortened to:

for line in filehandle:
    pass
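Combining the two points (append plus iterating the response object directly), the whole script from the question could be condensed to something like the following sketch (Python 2, matching the original; the slice assumes every header line contains both a space and 'OS', exactly as the original code does):

import urllib2

keyword = raw_input('keyword: ')
URL = ("http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=%s"
       "&fil=&limit=10&force=no&preview=true&format=fasta" % keyword)

lines = []
for line in urllib2.urlopen(URL):  # iterate the response directly, no readlines() needed
    if line.startswith('>'):
        lines.append(line[line.index(' '):line.index('OS')])  # keep only the description part

print lines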