Scrapy: issue with encoding when dumping to a JSON file - python-2.7

Here is the website I would like to parse: [website in Russian][1]
Here is the code that extracts the info I need:
# -*- coding: utf-8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from flats.items import FlatsItem

class DmozSpider(Spider):
    name = "dmoz"
    start_urls = ['http://rieltor.ua/flats-sale/?ncrnd=6510']

    def parse(self, response):
        sel = Selector(response)
        flats = sel.xpath('//*[@id="content"]')
        flats_stored_info = []
        flat_item = FlatsItem()
        for flat in flats:
            flat_item['square'] = [s.encode("utf-8") for s in sel.xpath('//div/strong[@class="param"][1]/text()').extract()]
            flat_item['rooms_floor_floors'] = [s.encode("utf-8") for s in sel.xpath('//div/strong[@class="param"][2]/text()').extract()]
            flat_item['address'] = [s.encode("utf-8") for s in flat.xpath('//*[@id="content"]//h2/a/text()').extract()]
            flat_item['price'] = [s.encode("utf-8") for s in flat.xpath('//div[@class="cost"]/strong/text()').extract()]
            flat_item['subway'] = [s.encode("utf-8") for s in flat.xpath('//span[@class="flag flag-location"]/a/text()').extract()]
            flats_stored_info.append(flat_item)
        return flats_stored_info
This is how I dump to the JSON file:
scrapy crawl dmoz -o items.json -t json
The problem is that when I replace the code above so that it prints the extracted info to the console, i.e. like this:
flat_item['square'] = sel.xpath('//div/strong[@class="param"][1]/text()').extract()
for bla in flat_item['square']:
    print bla
the script properly displays the information in Russian.
But when I dump the scraped information using the first version of the script (with encoding to utf-8), it writes something like this to the JSON file:
[{"square": ["2-\u043a\u043e\u043c\u043d., 16 \u044d\u0442\u0430\u0436 16-\u044d\u0442. \u0434\u043e\u043c", "1-\u043a\u043e\u043c\u043d.,
How can I dump the information into the JSON file in Russian? Thank you for your advice.
[1]: http://rieltor.ua/flats-sale/?ncrnd=6510

It is correctly encoded; it's just that the json library escapes non-ASCII characters by default.
You can load the data and use it (copying data from your example):
>>> import json
>>> print json.loads('"2-\u043a\u043e\u043c\u043d., 16 \u044d\u0442\u0430\u0436 16-\u044d\u0442. \u0434\u043e\u043c"')
2-комн., 16 этаж 16-эт. дом
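If you want the file on disk to contain readable Cyrillic rather than \u escapes, one option is to serialize the items yourself with ensure_ascii=False; this is only a minimal sketch outside Scrapy's own exporters, with a placeholder item list and file name (newer Scrapy versions also expose a FEED_EXPORT_ENCODING setting for the same purpose):
# -*- coding: utf-8 -*-
import io
import json

items = [{"square": u"2-комн., 16 этаж 16-эт. дом"}]  # placeholder data

# ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes;
# with non-ASCII content json.dumps returns a unicode object, which io.open encodes as UTF-8
with io.open("items_readable.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(items, ensure_ascii=False))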

Related

Reading multiple files in a directory with pyyaml

I'm trying to read all the YAML files in a directory, but I am having trouble, first because I am using Python 2.7 (and I cannot change to 3) and all of my files are UTF-8 (and I need them to stay that way).
import os
import yaml
import codecs

def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data

def yaml_dump(filepath, data):
    with open(filepath, 'w') as file_descriptor:
        yaml.dump(data, file_descriptor)

if __name__ == "__main__":
    filepath = os.listdir(os.getcwd())
    data = yaml_reader(filepath)
    print data
When I run this code, Python gives me the message:
TypeError: coercing to Unicode: need string or buffer, list found.
I want this program to show the content of the files. Can anyone help me?
I guess the issue is with filepath: os.listdir(os.getcwd()) returns the list of all the files in the directory, so you are passing that list to codecs.open() instead of a single filename.
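A minimal sketch of that fix, looping over the directory entries and opening each file individually (the extension filter is just an assumption about how the YAML files are named):
# -*- coding: utf-8 -*-
import os
import codecs
import yaml

for filename in os.listdir(os.getcwd()):
    # assumption: the YAML files end in .yaml or .yml
    if not filename.endswith(('.yaml', '.yml')):
        continue
    with codecs.open(filename, 'r', encoding='utf-8') as fd:
        # load_all is lazy, so consume it while the file is still open
        for document in yaml.load_all(fd):
            print document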
There are multiple problems with your code, apart from the fact that, in the way you have formatted it here, it is not even valid Python. Your reader is:
def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data
However, it is not necessary to do the decoding yourself; PyYAML is perfectly capable of processing UTF-8:
def yaml_reader(filepath):
    with open(filepath, "rb") as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data
I hope you realise that you are loading multiple documents, so you always get a list as a result in data, even if your file contains only one document.
Then the line:
filepath = os.listdir(os.getcwd())
gives you a list of files, so you need to do:
filepath = os.listdir(os.getcwd())[0]
or decide in some other way which of the files you want to open. If you want to combine all the files (assuming they are YAML) into one big YAML file, you need to do:
if __name__ == "__main__":
    data = []
    for filepath in os.listdir(os.getcwd()):
        data.extend(yaml_reader(filepath))
    print data
And your dump routine would need to change to:
def yaml_dump(filepath, data):
    with open(filepath, 'wb') as file_descriptor:
        yaml.dump(data, file_descriptor, allow_unicode=True, encoding='utf-8')
However, this all brings you to the biggest problem: you are using PyYAML, which will mangle your YAML, dropping flow style, comments, anchor names, special ints/floats, quotes around scalars, etc. Apart from that, PyYAML has not been updated to support YAML 1.2 documents (which has been the standard since 2009). I recommend you switch to ruamel.yaml (disclaimer: I am the author of that package), which supports YAML 1.2 and leaves comments etc. in place.
And even if you are bound to Python 2, you should use the Python 3-like syntax (e.g. for print) that you can get with from __future__ imports.
So I recommend you do:
pip install pathlib2 ruamel.yaml
and then use:
from __future__ import absolute_import, unicode_literals, print_function

try:
    from pathlib import Path       # Python 3 standard library
except ImportError:
    from pathlib2 import Path      # the pathlib2 backport installed above, for Python 2

from ruamel.yaml import YAML

if __name__ == "__main__":
    data = []
    yaml = YAML()
    yaml.preserve_quotes = True
    for filepath in Path('.').glob('*.yaml'):
        data.extend(yaml.load_all(filepath))
    print(data)
    yaml.dump(data, Path('your_output.yaml'))

Format string to XML file

I want to reformat a string into an XML structure, but my string is not in XML format (I am using Python 2.7).
I believe the correct way is to first create an XML version of the input on one line and then use XML pretty printing to turn it into an XML file with multiple rows and indentation (see Pretty printing XML in Python).
Below is an example of the input after a History Server REST API call to Hadoop server 1.
Input:
'{"jobAttempts":{"jobAttempt":[{"nodeHttpAddress":"slave2:8042","nodeId":"slave2:39637","id":1,"startTime":1544691730439,"containerId":"container_1544631848492_0013_01_000001","logsLink":"http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2"}]}}'
Output:
'<jobAttempts><jobAttempt><nodeHttpAddress>slave2:8042</nodeHttpAddress><nodeId>slave2:39637</nodeId><id>1</id><startTime>1544691730439</startTime><containerId>container_1544631848492_0013_01_000001</containerId><logsLink>http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2</logsLink></jobAttempt></jobAttempts>'
Final Output
<jobAttempts>
<jobAttempt>
<nodeHttpAddress>slave2:8042</nodeHttpAddress>
<nodeId>slave2:39637</nodeId>
<id>1</id>
<startTime>1544691730439</startTime>
<containerId>container_1544631848492_0013_01_000001</containerId>
<logsLink>http://23.22.43.90:19888/jobhistory/logs/slave2:39637/container_1544631848492_0013_01_000001/job_1544631848492_0013/hadoop2</logsLink>
</jobAttempt>
</jobAttempts>
*This string is actually an XML file which does not appear to have any style information associated with it.
I have found out that the source view of the History Server REST API is indeed an XML file on one line. Thus, I had to read the source view, and not the old problematic view, with Python.
Before, I used:
import urllib2
contents = urllib2.urlopen("http://23.22.43.90:19888/ws/v1/history/mapreduce/jobs/job_1544631848492_0013//jobattempts").read()
Now I download the source view of the HTML page with Selenium and BeautifulSoup and save it locally.
from bs4 import BeautifulSoup
from selenium import webdriver
import xml.dom.minidom

# Fetch the source view of the REST API response with a real browser
driver = webdriver.Firefox()
driver.get("http://23.22.43.90:19888/ws/v1/history/mapreduce/jobs/job_1544631848492_0013/jobattempts")
page_source = driver.page_source
driver.close()

# Parse the XML payload out of the page source
soup = BeautifulSoup(page_source, "html.parser")
print(soup)

# Pretty-print the XML and save it locally
dom = xml.dom.minidom.parseString(str(soup))
pretty_xml_as_string = dom.toprettyxml()
with open("./content_new_2.xml", 'w') as xml_file:
    xml_file.write(pretty_xml_as_string)
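For completeness, the conversion route described in the question (build the one-line XML from the parsed JSON, then pretty-print it) can also be done with just the standard library. A minimal sketch, assuming element names can be taken directly from the JSON keys (the truncated input is a placeholder):
import json
import xml.dom.minidom
import xml.etree.ElementTree as ET

raw = ('{"jobAttempts":{"jobAttempt":[{"nodeHttpAddress":"slave2:8042",'
       '"nodeId":"slave2:39637","id":1,"startTime":1544691730439}]}}')

def build(parent, key, value):
    """Recursively turn a parsed JSON value into child elements of `parent`."""
    if isinstance(value, list):
        # one element per list entry, all with the same tag
        for item in value:
            build(parent, key, item)
    elif isinstance(value, dict):
        element = ET.SubElement(parent, key)
        for child_key, child_value in value.items():
            build(element, child_key, child_value)
    else:
        ET.SubElement(parent, key).text = str(value)

data = json.loads(raw)
root_key = list(data.keys())[0]            # "jobAttempts"
root = ET.Element(root_key)
for key, value in data[root_key].items():
    build(root, key, value)

one_line_xml = ET.tostring(root)           # the single-line XML form
pretty = xml.dom.minidom.parseString(one_line_xml).toprettyxml(indent="    ")
print(pretty)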

Why does OleFileIO_PL only work with .doc file types and not .docx in Python?

Right, so I'm working on a Python script (Python 2.7) that will extract the metadata from OLE files. I am using OleFileIO_PL, and it works perfectly fine with Office 97-2003 OLE files, but for anything later than that it just says that it is not an OLE2 file type.
Is there any way I can modify my code to support both .doc and .docx? The same goes for .ppt and .pptx, etc.
Thank you in advance
Source Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import OleFileIO_PL
import StringIO
import optparse
import sys
import os

def printMetadata(fileName):
    data = open(fileName, 'rb').read()
    f = StringIO.StringIO(data)
    OLEFile = OleFileIO_PL.OleFileIO(f)
    meta = OLEFile.get_metadata()
    print('Author:', meta.author)
    print('Title:', meta.title)
    print('Creation date:', meta.create_time)
    meta.dump()
    OLEFile.close()

def main():
    parser = optparse.OptionParser('usage = -F + name of the OLE file with the extension. For example: python "Ms Office Metadata Extraction Script.py" -F myfile.docx')
    parser.add_option('-F', dest='fileName', type='string',
                      help='specify OLE (MS Office) file name')
    (options, args) = parser.parse_args()
    fileName = options.fileName
    if fileName == None:
        print parser.usage
        exit(0)
    else:
        printMetadata(fileName)

if __name__ == '__main__':
    main()
To answer your question, this is because the newer MS Office 2007+ files (docx, xlsx, xlsb, pptx, etc) have a completely different structure from the legacy MS Office 97-2003 formats.
They are mainly a collection of XML files within a Zip archive, so with a little bit of work you can extract everything you need using zipfile and ElementTree from the standard library.
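For example, here is a minimal sketch (not a drop-in replacement for the OLE code above) of reading the author, title and creation date from a .docx with only zipfile and ElementTree; the docProps/core.xml part name and the Dublin Core namespaces come from the OOXML packaging convention:
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used inside docProps/core.xml in OOXML packages
DC = '{http://purl.org/dc/elements/1.1/}'
DCTERMS = '{http://purl.org/dc/terms/}'

def print_docx_metadata(file_name):
    # A .docx/.xlsx/.pptx file is a zip archive; core metadata lives in docProps/core.xml
    with zipfile.ZipFile(file_name) as archive:
        core = ET.fromstring(archive.read('docProps/core.xml'))
    print 'Author:', core.findtext(DC + 'creator')
    print 'Title:', core.findtext(DC + 'title')
    print 'Creation date:', core.findtext(DCTERMS + 'created')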
If openxmllib does not work for you, you may try other solutions:
officedissector: https://www.officedissector.com/
python-opc: https://pypi.python.org/pypi/python-opc
openpack: https://pypi.python.org/pypi/openpack
paradocx: https://pypi.python.org/pypi/paradocx
BTW, OleFileIO_PL has been renamed to olefile, and the new project page is https://github.com/decalage2/olefile
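For the legacy formats, a minimal sketch against the renamed package, assuming only the rename (olefile keeps the OleFileIO class and get_metadata()):
import olefile

def print_ole_metadata(file_name):
    # isOleFile() lets you reject .docx/.pptx/.xlsx early instead of raising an error
    if not olefile.isOleFile(file_name):
        print file_name, 'is not an OLE2 file (probably an Office 2007+ zip package)'
        return
    ole = olefile.OleFileIO(file_name)
    meta = ole.get_metadata()
    print 'Author:', meta.author
    print 'Title:', meta.title
    print 'Creation date:', meta.create_time
    ole.close()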

Selecting nodes with non-ASCII characters in Scrapy

I have the following simple web scraper written in Scrapy:
#!/usr/bin/env python
# -*- coding: latin-1 -*-
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpiderTest(BaseSpider):
    name = 'MySpiderTest'
    allowed_domains = ["boliga.dk"]
    start_urls = ["http://www.boliga.dk/bbrinfo/3B71489C-AEA0-44CA-A0B2-7BD909B35618",]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = bbrItem()
        print hxs.select("id('unitControl')/div[2]/table/tbody/tr[td//text()[contains(.,'Antal Badeværelser')]]/td[2]/text()").extract()
but when I run the spider I get the following syntax error:
SyntaxError: Non-ASCII character '\xe6' in file... on line 32, but no encoding declared
because of the æ in the XPath. The XPath works in XPath Checker for Firefox. I tried URL-encoding the æ, but that didn't work. What am I missing?
thanks!
UPDATE: I have added the encoding declaration at the beginning of the code (Latin-1 should support Danish characters).
Use a unicode string for your XPath expression
hxs.select(u"id('unitControl')/div[2]/table/tbody/tr[td//text()[contains(.,'Antal Badeværelser')]]/td[2]/text()").extract()
or
hxs.select(u"id('unitControl')/div[2]/table/tbody/tr[td//text()[contains(.,'Antal Badev\u00e6relser')]]/td[2]/text()").extract()
See Unicode Literals in Python Source Code
SyntaxError: Non-ASCII character '\xe2' in file … on line 40, but no encoding declared …
This is caused by standard characters like the apostrophe (') being replaced by non-standard characters like the backtick (`) during copying.
Try to edit the text copied from the PDF.
response.xpath("//tr[contains(., '" + u'中文字符' + "')]").extract()

How do you convert the multi-line content scraped into a list?

I was trying to convert the scraped content into a list for data manipulation, but I got the following error: TypeError: 'NoneType' object is not callable
#! /usr/bin/python
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import os
import re

# Copy all of the content from the provided web page
webpage = urlopen("http://www.optionstrategist.com/calculators/free-volatility-data").read()

# Locate the section that lies between the pre tags
preBegin = webpage.find('<pre>')   # Locate the <pre> provided
preEnd = webpage.find('</pre>')    # Locate the </pre> provided

# Copy the content between the pre tags
voltable = webpage[preBegin:preEnd]

# Pass the content to the Beautiful Soup module
raw_data = BeautifulSoup(voltable).splitline()
The code is very simple. This is the code for BeautifulSoup 4:
# Find all <pre> tags in the HTML page
preTags = webpage.find_all('pre')
for tag in preTags:
    # Get the text inside the tag
    print(tag.get_text())
Reference:
find_all()
Kinds of filters to put into name field of find()/findall()
get_text()
To get the text from the first pre element:
#!/usr/bin/env python
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = "http://www.optionstrategist.com/calculators/free-volatility-data"
soup = BeautifulSoup(urlopen(url))
print soup.pre.string
To extract lines with data:
from itertools import dropwhile
lines = soup.pre.string.splitlines()
# drop lines before the data table header
lines = dropwhile(lambda line: not line.startswith("Symbol"), lines)
# extract lines with data
lines = (line for line in lines if '%ile' in line)
Now each line contains data in a fixed-column format. You could use slicing and/or regex to parse/validate individual fields in each row.
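For example, a sketch of that last step; the token positions and the regex here are purely illustrative, since the exact column layout of the table on that page needs to be checked against the live data:
import re

for line in lines:
    # purely illustrative: take the first whitespace-separated token as the symbol,
    # plus any decimal numbers found elsewhere on the line
    symbol = line.split(None, 1)[0]
    numbers = re.findall(r'\d+\.\d+', line)
    print symbol, numbers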