how to fix bad decoding? - python-2.7

I am using the procedure shown below to write a list of personal data (spidered from the web) to a file.
I think the code itself is OK, but I am confused by the results. Some characters are decoded properly, some are not. For example:
STANIS?AWurodzony/a 01.01.1888, ?Ԅ?
HALINAurodzony/a 05.07.1927, ŁÓDŹ
The last word in both strings is the same!
Also, the ? sign used to replace non-translatable characters is applied in some places but not in others:
STANISŁAWurodzony/a 24.03.1907, RAKSZANY
      ^
      | here it is written OK - not replaced
And here is the code:
def findPerson():
    file = codecs.open('Names.txt', 'a', 'ISO-8859-1', 'replace')
    try:
        with codecs.open('./listNames_links.txt', 'r', 'ISO-8859-1', 'replace') as f:
            line = f.readline()
            while line != '':
                #print line,
                line = f.readline()
                res = requests.get('http://real.address.gov.pl' + line)
                res.raise_for_status()
                soup = BeautifulSoup(res.text)
                linkElems = soup.find('a', 'css_class_name').text
                file.write(linkElems)
                file.write('\r\n')  # preserve end-of-line
    finally:
        file.close()
Question:
How can I fix this? Is my procedure wrong, or does the source page have a broken encoding? (I suppose it is fine, since I can read it in a browser without any errors.)

I'm not very familiar with Python, but I had similar problems with Java programs. In almost every case the cause was that the same encoding was not used in all steps, so the conversions produced ugly characters.
I'd suggest using UTF-8 during the whole process if possible.
An off-topic side remark: because I stumbled upon this so often, I bought myself this T-shirt (correctly spelled Scheiß, German for crap) and wear it at work ;-)

This looks like Polish, and it doesn't look like a multi-byte encoding such as UTF-8 is being used. You are writing to your Names.txt file using ISO-8859-1, which doesn't support all Polish characters.
You could use ISO-8859-2 for Polish character support, but even better would be UTF-8, which supports all languages and is the common standard on the web.
Try:
file = codecs.open('Names.txt','a','UTF-8','replace')
When you make a request using Requests, check the encoding it picked for each page (res.encoding) and set it explicitly if the guess is wrong. For example:
res = requests.get('http://real.address.gov.pl'+line)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text)
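To make the difference between the three encodings concrete, here is a minimal sketch using a sample string from the question. It only illustrates which characters each codec can represent; it is not a drop-in fix for the scraper, and the variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# Sample text taken from the question; the variable names are illustrative.
text = u"STANISŁAW, ŁÓDŹ"

# ISO-8859-1 has no Ł or Ź, so errors='replace' turns them into '?'.
# Ó does exist in ISO-8859-1, which is why only some letters get replaced.
latin1 = text.encode("iso-8859-1", "replace")

# ISO-8859-2 and UTF-8 can both represent every character in the sample,
# so encoding and decoding again gives back the original text.
latin2_roundtrip = text.encode("iso-8859-2").decode("iso-8859-2")
utf8_roundtrip = text.encode("utf-8").decode("utf-8")

print(repr(latin1))
print(latin2_roundtrip == text, utf8_roundtrip == text)
```

This is why the output file can look half right: any character that happens to exist in ISO-8859-1 is written correctly, and the rest become question marks.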

Related

Django 'ascii' codec can't encode characters despite encoding in UTF-8? What am I doing wrong?

I'm still in the process of learning Django. I have a problem with encoding Cyrillic strings. I have a text input. I append its value to the URL using JS and then get that value in my view (I know I should probably use a form for that, but that's not the issue).
So here's my code (it's not complete, but it shows the main idea I think).
JS/HTML
var notes = document.getElementById("notes").value;
...
window.location.href = 'http://my-site/example?notes='+notes
<input type="text" class="notes" name="notes" id="notes">
Django/Python
notes = request.GET.get('notes', 0)
try:
    notes = notes.encode('UTF-8')
except:
    pass
...
sql = 'INSERT INTO table(notes) VALUES(%s)' % str(notes)
The issue is that whenever I type a string in Cyrillic I get this error message: 'ascii' codec can't encode characters at position... I also know that I probably shouldn't pass strings into the query like that, but it's a personal project, so it will do for now. I've been stuck on this for a while. Any suggestions as to what's causing it would be appreciated.
request.GET.get("key") already returns a string, so why do you need to encode it?
Setting request.encoding = "utf-8" may work for you.
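One way to sidestep the manual encode and str() calls altogether is to percent-encode the value in the URL and pass it to the database as a query parameter instead of formatting it into the SQL string. The sketch below is only an illustration: it uses the standard-library sqlite3 module as a stand-in for the real database, and the table name notes_table is hypothetical. With Django's own cursor the placeholder would be %s and the call would look like cursor.execute(sql, [notes]).

```python
# -*- coding: utf-8 -*-
import sqlite3
from urllib.parse import quote, unquote

notes = u"Привет, мир"

# What the browser should send: the value percent-encoded as UTF-8
# (encodeURIComponent does this on the JavaScript side).
encoded = quote(notes)
decoded = unquote(encoded)

# In-memory SQLite database used purely as a stand-in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes_table (notes TEXT)")

# The driver handles quoting and encoding; no str() or % formatting needed.
conn.execute("INSERT INTO notes_table (notes) VALUES (?)", (decoded,))
stored = conn.execute("SELECT notes FROM notes_table").fetchone()[0]

print(stored == notes)
conn.close()
```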

How can I use regex to construct an API call in my Jekyll plugin?

I'm trying to write my own Jekyll plugin to construct an API query from a custom tag. I've gotten as far as creating the basic plugin and tag, but I've run into the limits of my programming skills, so I'm looking to you for help.
Here's my custom tag for reference:
{% card "Arbor Elf | M13" %}
Here's the progress on my plugin:
module Jekyll
  class Scryfall < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
      @text = text
    end
    def render(context)
      # Store the name of the card, ie "Arbor Elf"
      @card_name =
      # Store the name of the set, ie "M13"
      @card_set =
      # Build the query
      @query = "https://api.scryfall.com/cards/named?exact=#{@card_name}&set=#{@card_set}"
      # Store a specific JSON property
      @card_art =
      # Finally we render out the result
      "<img src='#{@card_art}' title='#{@card_name}' />"
    end
  end
end
Liquid::Template.register_tag('cards', Jekyll::Scryfall)
For reference, here's an example query using the above details (paste it into your browser to see the response you get back)
https://api.scryfall.com/cards/named?exact=arbor+elf&set=m13
My initial attempt after Googling around was to use regex to split @text at the |, like so:
@card_name = "#{@text}".split(/| */)
This didn't quite work, instead it output this:
[“A”, “r”, “b”, “o”, “r”, “ “, “E”, “l”, “f”, “ “, “|”, “ “, “M”, “1”, “3”, “ “]
I'm also then not sure how to access and store specific properties within the JSON response. Ideally, I can do something like this:
@card_art = JSONRESPONSE.image_uri.large
I'm well aware I'm asking a lot here, but I'd love to try and get this working and learn from it.
Thanks for reading.
Actually, your split should work – you just need to give it the correct regex (and you can call that on @text directly). You also need to escape the pipe character in the regex, because pipes can have special meaning. You can use rubular.com to experiment with regexes.
parts = @text.split(/\|/)
# => ["Arbor Elf ", " M13"]
Note that they also contain some extra whitespace, which you can remove with strip.
@card_name = parts.first.strip
@card_set = parts.last.strip
This might also be a good time to answer questions like: what happens if the user inserts multiple pipes? What if they insert none? Will your code give them a helpful error message for this?
You'll also need to escape these values in your URL. What if one of your users adds a card containing a & character? Your URL will break:
https://api.scryfall.com/cards/named?exact=Sword of Dungeons & Dragons&set=und
That looks like a URL with three parameters, exact, set and Dragons. You need to encode the user input to be included in a URL:
require 'cgi'
query = "https://api.scryfall.com/cards/named?exact=#{CGI.escape(@card_name)}&set=#{CGI.escape(@card_set)}"
# => "https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und"
What comes after that is a little less clear, because you haven't written the code yet. Try making the call with the Net::HTTP module and then parsing the response with the JSON module. If you have trouble, come back here and ask a new question.
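For comparison, the split, strip, validate and escape steps from this answer can be sketched in Python (the plugin itself would of course stay in Ruby). The function names parse_card_tag and build_query are made up for this example, and the choice to reject anything other than exactly one pipe is just one possible answer to the "multiple pipes or none" question.

```python
from urllib.parse import urlencode


def parse_card_tag(text):
    """Split 'Card Name | SET' into (name, set), rejecting malformed input."""
    parts = [part.strip() for part in text.split("|")]
    # Exactly one pipe and two non-empty parts are required in this sketch.
    if len(parts) != 2 or not all(parts):
        raise ValueError('expected "Card Name | SET", got %r' % text)
    return parts[0], parts[1]


def build_query(card_name, card_set):
    """Build the query URL with both values escaped for use in a URL."""
    params = urlencode({"exact": card_name, "set": card_set})
    return "https://api.scryfall.com/cards/named?" + params


name, card_set = parse_card_tag("Arbor Elf | M13 ")
print(build_query(name, card_set))
print(build_query("Sword of Dungeons & Dragons", "und"))
```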

scraping chinese characters python

I learnt how to scrape websites from https://automatetheboringstuff.com. I wanted to scrape http://www.piaotian.net/html/3/3028/1473227.html, whose contents are in Chinese, and write them into a .txt file. However, the .txt file contains random symbols, which I assume is an encoding/decoding problem.
I've read this thread "how to decode and encode web page with python?" and worked out that the encodings for my site are "gb2312" and "windows-1252". I tried decoding with those two encodings but failed.
Can someone kindly explain to me the problem with my code? I'm very new to programming so please let me know my misconceptions as well!
Also, when I remove the "html.parser" from the code, the .txt file turns out to be empty instead of having at least symbols. Why is this the case?
import bs4, requests, sys
reload(sys)
sys.setdefaultencoding("utf-8")
novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")
content = novelSoup.select("br")
novelFile = open("novel.txt", "w")
for i in range(len(content)):
    novelFile.write(str(content[i].getText()))
novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
novel.encoding = "GBK"
novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")
out:
<br>
一元宗,坐落在青峰山上,绵延极长,现在是盛夏时节,天空之中,太阳慢慢落了下去,夕阳将影子拉的很长。<br/>
<br/>
一片不是很大的小湖泊边上,一个约莫着十七八岁的青衣少年坐在湖边,抓起湖边的一块石头扔出,顿时在湖边打出几朵浪花。<br/>
<br/>
叶希文有些茫然,他没想到,他居然穿越了,原本叶希文只是二十一世纪的地球上一个普通的大学生罢了,一个月了,他才后知后觉的反应过来,这不是有人和他进行恶作剧,而是,他真的穿越了。<br/>
Requests will automatically decode content from the server. Most
unicode charsets are seamlessly decoded.
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers. The text encoding
guessed by Requests is used when you access r.text. You can find out
what encoding Requests is using, and change it, using the r.encoding
property:
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
If you change the encoding, Requests will use the new value of
r.encoding whenever you call r.text.
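The effect of a wrong encoding guess can be reproduced without any network access. The sketch below takes a short phrase from the page, encodes it as GBK to stand in for the raw response bytes, and decodes it both ways. It assumes the wrong guess was ISO-8859-1, which is what Requests falls back to for text responses whose headers name no charset; the actual guess for this site may differ. The output file is written with an explicit UTF-8 encoding rather than relying on sys.setdefaultencoding.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Stand-in for the raw bytes the server sends (the page is GBK-encoded).
raw = u"一元宗".encode("gbk")

# Decoding with the wrong codec yields the "random symbols" from the question.
wrong = raw.decode("iso-8859-1")

# Decoding with the right codec recovers the Chinese text.
right = raw.decode("gbk")

# Write the text out with an explicit encoding instead of the platform default.
path = os.path.join(tempfile.mkdtemp(), "novel.txt")
with io.open(path, "w", encoding="utf-8") as out:
    out.write(right)

with io.open(path, encoding="utf-8") as saved:
    read_back = saved.read()

print(repr(wrong))
print(read_back == right)
```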

how to read only URL from txt file in MATLAB

I have a text file containing multiple URLs along with other information about each URL. How can I read the file and save only the URLs in an array so that I can download them? I want to use
C = textscan(fileId, formatspec);
What should I specify in formatspec to match a URL?
This is not a job for textscan; you should use regular expressions for this. In MATLAB, regexes are described here.
For URLs, also refer here or here for examples in other languages.
Here's an example in MATLAB:
% This string is obtained through textscan or something
str = {...
'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};
% find URLs
C = regexpi(str, ...
    ['((http|https|ftp|file)://|www\.|ftp\.)',...
    '[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]'], 'match');
C{:}
Result:
ans =
'http://www.example.com/index.php?query=test&otherStuf=info'
ans =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
Note that this regex requires you to have the protocol included, or have a leading www. or ftp.. Something like example.com/universal_remote.cgi?redirect= is NOT matched.
You could go on and make the regex cover more and more cases. However, eventually you'll stumble upon the most important conclusion (as made here for example; where I got my regex from): given the full definition of what precisely constitutes a valid URL, there is no single regex able to always match every valid URL. That is, there are valid URLs you can dream up that are not captured by any of the regexes shown.
But please keep in mind that this last statement is more theoretical than practical -- those non-matchable URLs are valid but not often encountered in practice :) In other words, if your URLs have a pretty standard form, you're pretty much covered with the regex I gave you.
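The pattern is not MATLAB-specific. As a rough illustration, here is the same expression in Python's re module, assuming the character classes are meant to contain both @ and #. The third input line is the unmatched case mentioned above: it has no protocol and no leading www. or ftp., so nothing is extracted from it.

```python
import re

# Same pattern as the MATLAB example; IGNORECASE mirrors regexpi.
URL_PATTERN = re.compile(
    r"((?:http|https|ftp|file)://|www\.|ftp\.)"
    r"[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

lines = [
    "pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here",
    "other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://",
    "no protocol here: example.com/universal_remote.cgi?redirect=",
]

# Collect every match from every line, in order.
urls = [m.group(0) for line in lines for m in URL_PATTERN.finditer(line)]

for url in urls:
    print(url)
```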
Now, I fooled around a bit with the Java suggestion by pm89. As I suspected, it is an order of magnitude slower than just a regex, since you introduce another "layer of goo" to the code (in my timings, the difference was about 40x slower, excluding the imports). Here's my version:
import java.net.URL;
import java.net.MalformedURLException;
str = {...
'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};
% Attempt to convert each item into a URL.
for ii = 1:numel(str)
    cc = textscan(str{ii}, '%s');
    for jj = 1:numel(cc{1})
        try
            url = java.net.URL(cc{1}{jj})
        catch ME
            % rethrow any non-url related errors
            if isempty(regexpi(ME.message, 'MalformedURLException'))
                throw(ME);
            end
        end
    end
end
Results:
url =
'http://www.example.com/index.php?query=test&otherStuf=info'
url =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
I'm not too familiar with java.net.URL, but apparently, it is also unable to find URLs without leading protocol or standard domain (e.g., example.com/path/to/page).
This snippet can undoubtedly be improved upon, but I would urge you to consider why you'd want to go for this longer, inherently slower and far uglier solution :)
As I suspected you could use java.net.URL according to this answer.
To implement the same code in Matlab:
First read the file into a string, using fileread for example:
str = fileread('Sample.txt');
Then split the text with respect to spaces, using strsplit:
spl_str = strsplit(str);
Finally use java.net.URL to detect the URLs:
for k = 1:length(spl_str)
    try
        url = java.net.URL(spl_str{k})
        % Store or save the URL contents here
    catch e
        % it's not a URL.
    end
end
You can write the URL contents into a file using urlwrite. But first convert the URLs obtained from java.net.URL to char:
url = java.net.URL(spl_str{k});
urlwrite(char(url), 'test.html');
Hope it helps.

Django testing view input with different character sets

I'm having trouble writing a test for my views. I have a view that takes some characters as input from a form. Those characters are committed to the DB without problems.
All I was trying to do was write a test to ensure that characters from different languages are accepted.
I tested this one:
Český jazyk neboli čeština
This input is received correctly from the HTML form and stored in the DB. When I try to send it from a test, something weird happens and the view throws an error saying:
Warning: Incorrect string value: '\xC4\x8Cesk\xC3...' for column 'title' at row 1
My code is as simple as follows:
str1 = "Český jazyk neboli čeština"
self.client.post(url, {"title": str1})
And I tried all combinations:
str1 = u"..."
str1 = str1.encode('utf-8')
str1 = str1.decode('utf-8')
Without any success.
Can anyone tell me what I'm missing?
Thank you in advance
First of all: Make sure you have included this at the beginning of your script:
#-*- coding: utf-8 -*-
That tells the interpreter that this file's encoding is UTF-8 (make sure your text editor actually saves it as UTF-8).
Second: instead of
str1 = "Český jazyk neboli čeština"
declare str1 as unicode like this:
str1 = u"Český jazyk neboli čeština"
Finally, if you want to include non-ASCII characters, I suggest declaring them with their Unicode escape codes instead of the literal characters, to avoid weird encoding issues.
str1 = u'\u010cesk\xfd jazyk neboli \u010de\u0161tina'
This is a useful page for looking up a character's Unicode code point.
Hope this helps!
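As a quick sanity check, the sketch below compares the literal and the escaped form of the test string and shows that the bytes quoted in the warning ('\xC4\x8C...') are simply the UTF-8 encoding of Č. It is only an illustration of the relationship between the two spellings; the variable names are made up, and it says nothing about how the database column itself is configured.

```python
# -*- coding: utf-8 -*-
literal = u"Český jazyk neboli čeština"
escaped = u"\u010cesk\xfd jazyk neboli \u010de\u0161tina"

# Both spellings should describe the same text if the file is saved as UTF-8.
same_text = (literal == escaped)

# The first two bytes of the UTF-8 encoding are the ones from the warning.
utf8_bytes = escaped.encode("utf-8")
first_two = utf8_bytes[:2]

print(same_text)
print(repr(first_two))
```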