I have created a small wiki with Bottle (the Python web framework). Everything is working fine right now. You create an article by going to "Create a new article", giving it a title and writing some text. All of the created articles then show up in a list on the index page, and you can click on them to open and read.
The problem is when I open an article in order to edit it, giving it a new title and adding some new words to the text: the name of the original text file doesn't change. Instead I get a new text file with the new title, and the original text file still remains.
This is the code:
from bottle import route, run, template, request, static_file
from os import listdir
import sys

host = 'localhost'

@route('/static/<filename>')
def serve_static(filename):
    return static_file(filename, root="static")

@route("/")
def list_articles():
    '''
    This is the home page, which shows a list of links to all articles
    in the wiki.
    '''
    files = listdir("wiki")
    articles = []
    for i in files:
        lista = i.split('.')
        word = lista[0]
        lista1 = word.split('/')
        articles.append(lista1[0])
    return template("index", articles=articles)

@route('/wiki/<article>')
def show_article(article):
    '''
    Displays the user's text for the user
    '''
    wikifile = open('wiki/' + article + '.txt', 'r')
    text = wikifile.read()
    wikifile.close()
    return template('page', title=article, text=text)

@route('/edit/')
def edit_form():
    '''
    Shows a form which allows the user to input a title and content
    for an article. This form should be sent via POST to /update/.
    '''
    return template('edit')

@route('/update/', method='POST')
def update_article():
    '''
    Receives page title and contents from a form, and creates/updates a
    text file for that page.
    '''
    title = request.forms.title
    text = request.forms.text
    tx = open('wiki/' + title + '.txt', 'w')
    tx.write(text)
    tx.close()
    return template('thanks', title=title, text=text)

run(host='localhost', port=8080, debug=True, reloader=True)
The article object is too simple to list, edit or update.
The article should have an ID as its file name, and the article file should have two fields: title and text.
For example, 10022.txt:

title:
bottle
text:
Bottle is a fast, simple and lightweight WSGI micro web-framework for Python.

You should retrieve the article by its ID. You can then open the file by ID and change its title and text without creating a second file.
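A minimal sketch of what reading and saving such a file could look like, assuming the two-field layout above and the existing wiki/ directory (the helper names read_article and save_article are made up for illustration):

def read_article(article_id):
    # Assumed layout: "title:" on line 1, the title on line 2,
    # "text:" on line 3, and the article text on the remaining lines.
    with open('wiki/' + article_id + '.txt', 'r') as f:
        lines = f.read().splitlines()
    return {'id': article_id, 'title': lines[1], 'text': '\n'.join(lines[3:])}

def save_article(article_id, title, text):
    # The file name is the ID, so changing the title rewrites the same
    # file instead of creating a second one.
    with open('wiki/' + article_id + '.txt', 'w') as f:
        f.write('title:\n' + title + '\ntext:\n' + text)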
I am attempting to produce a CSV output of selected items contained in a particular class (title, link, price), with each item in its own column and each instance in its own row, using item loaders and the items module.
I can produce the output using a self-contained spider (without the items module); however, I'm trying to learn the proper way of defining the items in the items module, so that I can eventually scale up projects using the proper structure. (I will detail this code as 'Working Row Output Spider Code' below.)
I have also attempted to incorporate solutions determined or discussed in related posts; in particular:
Writing Itemloader By Item to XML or CSV Using Scrapy posted by Sam
Scrapy Return Multiple Items posted by Zana Daniel
by using a for loop as he notes at the bottom of the comments section. However, while I can get Scrapy to accept the for loop, it doesn't result in any change; the items are still grouped in single fields rather than being output into independent rows.
Below is a detail of the code contained in the two project attempts -- 'Working Row Output Spider Code', which does not incorporate the items module and item loader, and 'Non Working Row Output Spider Code' -- and the corresponding output of each.
Working Row Output Spider Code: btobasics.py
import scrapy
import urlparse

class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        titles = response.xpath('//*[@class="product_pod"]/h3//text()').extract()
        links = response.xpath('//*[@class="product_pod"]/h3/a/@href').extract()
        prices = response.xpath('//*[@class="product_pod"]/div[2]/p[1]/text()').extract()
        for item in zip(titles, links, prices):
            # create a dictionary to store the scraped info
            scraped_info = {
                'title': item[0],
                'link': item[1],
                'price': item[2],
            }
            # yield or give the scraped info to scrapy
            yield scraped_info
Run Command to produce CSV: $ scrapy crawl basic -o output.csv
Working Row Output WITHOUT STRUCTURED ITEM LOADERS
Non Working Row Output Spider Code: btobasictwo.py
import datetime
import urlparse

import scrapy
from btobasictwo.items import BtobasictwoItem
from scrapy.loader.processors import MapCompose
from scrapy.loader import ItemLoader

class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Create the loader using the response
        links = response.xpath('//*[@class="product_pod"]')
        for link in links:
            l = ItemLoader(item=BtobasictwoItem(), response=response)
            # Load fields using XPath expressions
            l.add_xpath('title', '//*[@class="product_pod"]/h3//text()',
                        MapCompose(unicode.strip))
            l.add_xpath('link', '//*[@class="product_pod"]/h3/a/@href',
                        MapCompose(lambda i: urlparse.urljoin(response.url, i)))
            l.add_xpath('price', '//*[@class="product_pod"]/div[2]/p[1]/text()',
                        MapCompose(unicode.strip))
            # Log fields
            l.add_value('url', response.url)
            l.add_value('date', datetime.datetime.now())
            return l.load_item()
Non Working Row Output Items Code: btobasictwo.items.py
from scrapy.item import Item, Field

class BtobasictwoItem(Item):
    # Primary fields
    title = Field()
    link = Field()
    price = Field()
    # Log fields
    url = Field()
    date = Field()
Run Command to produce CSV: $ scrapy crawl basic -o output.csv
Non Working Row Code Output WITH STRUCTURED ITEM LOADERS
As you can see, when attempting to incorporate the items module, item loaders and a for loop to structure the data, it does not separate the instances by row, but rather puts all instances of a particular item (title, link, price) into three fields.
I would greatly appreciate any help on this, and apologize for the lengthy post. I just wanted to document as much as possible so that anyone wanting to assist could run the code themselves and/or fully appreciate the problem from my documentation. (Please leave a comment on the length of the post if you feel it is not appropriate to be this lengthy.)
Thanks very much
You need to tell your ItemLoader to use another selector:
def parse(self, response):
    # Create the loader using the selector for each product element
    links = response.xpath('//*[@class="product_pod"]')
    for link in links:
        l = ItemLoader(item=BtobasictwoItem(), selector=link)
        # Load fields using XPath expressions relative to the selector
        l.add_xpath('title', './/h3//text()',
                    MapCompose(unicode.strip))
        l.add_xpath('link', './/h3/a/@href',
                    MapCompose(lambda i: urlparse.urljoin(response.url, i)))
        l.add_xpath('price', './/div[2]/p[1]/text()',
                    MapCompose(unicode.strip))
        # Log fields
        l.add_value('url', response.url)
        l.add_value('date', datetime.datetime.now())
        yield l.load_item()
So I've managed to write a spider that extracts the download links of "Videos" and "English Transcripts" from this site. Looking at the cmd window, I can see that all the correct information has been scraped.
The issue I am having is that the output CSV file only contains the "Video" links and not the "English Transcripts" links (even though you can see in the cmd window that they have been scraped).
I've tried a few suggestions from other posts but none of them seem to work.
The following picture shows how I'd like the output to look:
CSV Output Picture
This is my current spider code:
import scrapy

class SuhbaSpider(scrapy.Spider):
    name = "suhba2"
    start_urls = ["http://saltanat.org/videos.php?topic=SheikhBahauddin&gopage={numb}".format(numb=numb)
                  for numb in range(1, 3)]

    def parse(self, response):
        yield {
            "video": response.xpath("//span[@class='download make-cursor']/a/@href").extract(),
        }
        fullvideoid = response.xpath("//span[@class='media-info make-cursor']/@onclick").extract()
        for videoid in fullvideoid:
            url = ("http://saltanat.org/ajax_transcription.php?vid=" + videoid[21:-2])
            yield scrapy.Request(url, callback=self.parse_transcript)

    def parse_transcript(self, response):
        yield {
            "transcript": response.xpath("//a[contains(@href,'english')]/@href").extract(),
        }
You are yielding two different kinds of items - one containing just the video attribute and one containing just the transcript attribute. You have to yield one kind of item composed of both attributes. For that, you have to create the item in parse and pass it to the second-level request using meta. Then, in parse_transcript, you take it from meta, fill in the additional data and finally yield the item. The general pattern is described in the Scrapy documentation.
The second thing is that you extract all videos at once using the extract() method. This yields a list where it's hard to link each individual element with its corresponding transcript afterwards. A better approach is to loop over each individual video element in the HTML and yield an item for each video.
Applied to your example:
import scrapy

class SuhbaSpider(scrapy.Spider):
    name = "suhba2"
    start_urls = ["http://saltanat.org/videos.php?topic=SheikhBahauddin&gopage={numb}".format(numb=numb) for numb in range(1, 3)]

    def parse(self, response):
        for video in response.xpath("//tr[@class='video-doclet-row']"):
            item = dict()
            item["video"] = video.xpath(".//span[@class='download make-cursor']/a/@href").extract_first()
            videoid = video.xpath(".//span[@class='media-info make-cursor']/@onclick").extract_first()
            url = "http://saltanat.org/ajax_transcription.php?vid=" + videoid[21:-2]
            request = scrapy.Request(url, callback=self.parse_transcript)
            request.meta['item'] = item
            yield request

    def parse_transcript(self, response):
        item = response.meta['item']
        item["transcript"] = response.xpath("//a[contains(@href,'english')]/@href").extract_first()
        yield item
I'm a newbie to Scrapy & Python. I'm trying to get the comments from the following URL, but the result is always null: http://vnexpress.net/tin-tuc/oto-xe-may/toyota-camry-2016-dinh-loi-tui-khi-khong-bung-3386676.html
Here is my code:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from tutorial.items import TutorialItem
import logging

class TutorialSpider(Spider):
    name = "vnexpress"
    allowed_domains = ["vnexpress.net"]
    start_urls = [
        "http://vnexpress.net/tin-tuc/oto-xe-may/toyota-camry-2016-dinh-loi-tui-khi-khong-bung-3386676.html"
    ]

    def parse(self, response):
        sel = Selector(response)
        commentList = sel.xpath('//div[@class="comment_item"]')
        items = []
        id = 0
        logging.log(logging.INFO, "TOTAL COMMENT : " + str(len(commentList)))
        for comment in commentList:
            item = TutorialItem()
            id = id + 1
            item['id'] = id
            item['mainId'] = 0
            item['user'] = comment.xpath('//span[@class="left txt_666 txt_11"]/b').extract()
            item['time'] = 'N/A'
            item['content'] = comment.xpath('//p[@class="full_content"]').extract()
            item['like'] = comment.xpath('//span[@class="txt_666 txt_11 right block_like_web"]/a[@class="txt_666 txt_11 total_like"]').extract()
            items.append(item)
        return items
Thanks for reading
Looks like the comments are loaded into the page with some JavaScript code.
Scrapy does not execute JavaScript on a page, it only downloads HTML pages. Try opening the page with JavaScript disabled in your browser, and you should see the page as Scrapy sees it.
You have a handful of options:
reverse-engineer how the comments are loaded into the page, using your browser's developer tools panel, in the "network" tab (it could be some XHR call loading HTML or JSON data);
use a (headless) browser to render the page (Selenium, CasperJS, Splash...);
e.g. you may want to try this page with Splash (one of the JavaScript rendering options for web scraping). This is the HTML you get back from Splash (it contains the comments): http://pastebin.com/njgCsM9w
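For the Splash option, a minimal sketch using the scrapy-splash plugin might look like the following (it assumes a running Splash instance and the scrapy-splash middlewares enabled in settings.py; the spider name and the wait value are placeholders, and the XPaths are taken from your spider above):

import scrapy
from scrapy_splash import SplashRequest

class CommentSpider(scrapy.Spider):
    name = "vnexpress_js"
    start_urls = [
        "http://vnexpress.net/tin-tuc/oto-xe-may/toyota-camry-2016-dinh-loi-tui-khi-khong-bung-3386676.html"
    ]

    def start_requests(self):
        for url in self.start_urls:
            # Render the page in Splash so the JavaScript that loads the
            # comments runs before the response reaches parse().
            yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        for comment in response.xpath('//div[@class="comment_item"]'):
            yield {
                'user': comment.xpath('.//span[@class="left txt_666 txt_11"]/b/text()').extract_first(),
                'content': comment.xpath('.//p[@class="full_content"]//text()').extract(),
            }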
I'm trying to figure out how to rename an existing text file when I change the title of the text file. If I change the title now, it creates a new text file with the new title. The "old" text file that I wanted to save under a new name still exists, but with the original name, so I end up with two files with the same content.
I'm creating new articles (text files) through @route('/update/', method='POST') from my "edit" template, where title=title and text=text. Let's say I have created a new article with the name (title) "Key" and written a bit in that text file. If I then want to edit/change my "Key" article, I click on that article and it is displayed by @route('/wiki/<article>') and def show_article(article): with title=article and text=text.
In this template I can change my "Key" name (title) to "Lock". I'm still using the same form, @route('/update/', method='POST'), to post my changes.
Here is the problem: it creates a new text file instead of renaming the "Key" article to "Lock".
How can I change @route('/update/', method='POST') so that it realises I'm working with an already existing text file and only want to rename that file?
I have tried to use two different method='POST' routes, but I only get a "method not allowed" error all the time.
from bottle import route, run, template, request, static_file
from os import listdir
import sys

host = 'localhost'

@route('/static/<filename>')
def serve_static(filename):
    return static_file(filename, root="static")

@route("/")
def list_articles():
    files = listdir("wiki")
    articles = []
    for i in files:
        lista = i.split('.')
        word = lista[0]
        lista1 = word.split('/')
        articles.append(lista1[0])
    return template("index", articles=articles)

@route('/wiki/<article>')
def show_article(article):
    wikifile = open('wiki/' + article + '.txt', 'r')
    text = wikifile.read()
    wikifile.close()
    return template('page', title=article, text=text)

@route('/edit/')
def edit_form():
    return template('edit')

@route('/update/', method='POST')
def update_article():
    title = request.forms.title
    text = request.forms.text
    tx = open('wiki/' + title + '.txt', 'w')
    tx.write(text)
    tx.close()
    return template('thanks', title=title, text=text)

run(host='localhost', port=8080, debug=True, reloader=True)
You can use os.replace('old_name', 'new_name'):
import os
...
tx = open('wiki/' + title + '.txt', 'w')
tx.write(text)
tx.close()
os.replace(tx.name, 'name_you_want.txt')  # use os.replace() to rename the closed file
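Applied to the /update/ route, one way to use it is to have the edit form also post the article's original title (for example in a hidden input; that extra old_title field is an assumption, it is not in your existing templates) and rename the old file when the title has changed. A rough sketch:

import os

@route('/update/', method='POST')
def update_article():
    old_title = request.forms.old_title  # assumed hidden field in the edit form
    title = request.forms.title
    text = request.forms.text
    old_path = 'wiki/' + old_title + '.txt'
    new_path = 'wiki/' + title + '.txt'
    # If the title changed and the old file exists, rename it so the
    # original does not stick around under its old name.
    if old_title and old_title != title and os.path.exists(old_path):
        os.replace(old_path, new_path)
    with open(new_path, 'w') as tx:
        tx.write(text)
    return template('thanks', title=title, text=text)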
I have a web app that needs to do the following:
Present a form to request a client side file for CSV import.
Validate the data in the CSV file or ask for another filename.
At one point, I was doing the CSV data validation in the view, after the form.is_valid() call that got the filename (i.e. I had the imported CSV file in memory in a dictionary, using csv.DictReader). After running into problems trying to pass errors back to the original form, I'm now trying to validate the CONTENTS of the CSV file in the form's clean() method.
I'm currently stumped on how to access the in-memory file from clean(), as the request.FILES object isn't valid there. Note that I have no problems presenting the form to the client browser and then manipulating the resulting CSV file. The real issue is how to validate the contents of the CSV file; if I assume the data format is correct, I can import it into my models. I'll post my forms.py file to show where I currently am after moving the code from the view to the form:
forms.py
import csv
from django import forms
from io import TextIOWrapper

class CSVImportForm(forms.Form):
    filename = forms.FileField(label='Select a CSV file to import:',)

    def clean(self):
        cleaned_data = super(CSVImportForm, self).clean()
        f = TextIOWrapper(request.FILES['filename'].file, encoding='ASCII')
        result_csvlist = csv.DictReader(f)
        # first line (only) contains additional information about the event
        # let's validate that against its form definition
        event_info = next(result_csvlist)
        f_eventinfo = ResultsForm(event_info)
        if not f_eventinfo.is_valid():
            raise forms.ValidationError("Error validating 1st line of data (after header) in CSV")
        return cleaned_data

class ResultsForm(forms.Form):
    RESULT_CHOICES = (('Won', 'Won'),
                      ('Lost', 'Lost'),
                      ('Tie', 'Tie'),
                      ('WonByForfeit', 'WonByForfeit'),
                      ('LostByForfeit', 'LostByForfeit'))

    Team1 = forms.CharField(min_length=10, max_length=11)
    Team2 = forms.CharField(min_length=10, max_length=11)
    Result = forms.ChoiceField(choices=RESULT_CHOICES)
    Score = forms.CharField()
    Event = forms.CharField()
    Venue = forms.CharField()
    Date = forms.DateField()
    Div = forms.CharField()
    Website = forms.URLField(required=False)
    TD = forms.CharField(required=False)
I'd love input on what's the "best" method to validate the contents of an uploaded CSV file and present that information back to the client browser!
I assume that the place where you want to access that file is this line inside the clean method:

    f = TextIOWrapper(request.FILES['filename'].file, encoding='ASCII')

You can't use that line because request doesn't exist there, but you can access your form's fields, so you can try this instead:

    f = TextIOWrapper(self.cleaned_data.get('filename'), encoding='ASCII')

Since you call super().clean() in the first line of your method, that should work. Then, if you want to add a custom error message to your form, you can do it like this:

    from django.forms.util import ErrorList

    errors = form._errors.setdefault("filename", ErrorList())
    errors.append(u"CSV file incorrect")
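On Django versions that provide Form.add_error() (1.7 and later), a shorter alternative sketch is to attach the message from inside clean() itself, so you do not need to touch _errors directly (the CSV parsing step is elided and reuses the names from your forms.py above):

    def clean(self):
        cleaned_data = super(CSVImportForm, self).clean()
        # ... parse the CSV and build f_eventinfo as above ...
        if not f_eventinfo.is_valid():
            # Attach the message to the filename field instead of raising,
            # so it is shown next to that field when the form is re-rendered.
            self.add_error('filename', "Error validating 1st line of data (after header) in CSV")
        return cleaned_data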
Hope it helps.