I am trying to parse a Russian website using lxml. However, I have an issue with displaying Russian characters that I am unable to overcome myself.
Let's take this html piece for example:
Квест в реальности «Карты, деньги, два стола»
I am using this piece to parse it:
title = root.xpath('//*[@id="event-id-41600"]/div[3]/div[2]/a/text()')[0].encode('utf-8').strip()
and this is what I get:
├É┬Ü├É┬▓├É┬Á├Ĺ┬ü├Ĺ┬é ├É┬▓ ├Ĺ┬Ç├É┬Á├É┬░├É┬╗├Ĺ┬î├É┬Ż├É┬ż├Ĺ┬ü├Ĺ┬é├É┬Ş ├é┬ź├É┬Ü├É┬░├Ĺ┬Ç├Ĺ┬é├Ĺ┬ő, ├É┬┤├É┬Á├É┬Ż├Ĺ┬î├É┬│├É┬Ş, ├É┬┤├É┬▓├É┬░ ├Ĺ┬ü├Ĺ┬é├É┬ż├É┬╗├É┬░├é┬╗
In the database, however, instead of Cyrillic I see this:
ÐвеÑÑ Ð² ÑеалÑноÑÑи «ÐаÑÑÑ, денÑги, два ÑÑола»
Oh and btw for reference:
this piece:
title = item.xpath('div[3]/div[2]/a')[0]
print etree.tostring(title)
returns this:
ÐвеÑÑ Ð² ÑеалÑноÑÑи «ÐаÑÑÑ, денÑги, два ÑÑола»
Not sure if it is database related or something to do with Python encoding. Any help appreciated :)
Thanks in advance.
EDIT: I am using MySQL and the Django ORM
Django settings:
DATABASE_OPTIONS = {
"charset": "utf8_general_ci",
"init_command": "SET storage_engine=INNODB"
}
Webpage :
<!DOCTYPE html>
<html lang="en" prefix="og: http://ogp.me/ns#" class="">
<head>
<title>Интересные события в Москве в январе - феврале 2016</title>
<meta charset="utf-8">
A Cyrillic code page does not exist or is not set up on your server, so you can't view Russian characters in the terminal, even in UTF-8. Python still works with Unicode properly, though.
With this command:
title = root.xpath('//*[@id="event-id-41600"]/div[3]/div[2]/a/text()')[0].encode('utf-8').strip()
you get a unicode string, encode it to bytes (str in Python 2), and save those bytes in the database.
When you load the string from the database, Python uses the default code page (probably Latin-1), and you get this:
ÐвеÑÑ Ð² ÑеалÑноÑÑи «ÐаÑÑÑ, денÑги, два ÑÑола»
So you should store the unicode string in the database (don't use encode):
title = root.xpath('//*[@id="event-id-41600"]/div[3]/div[2]/a/text()')[0].strip()
P.S. I don't understand how encode('Latin-1') helps (from comments), but problem is solved :)
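The round trip described above can be reproduced in a few lines. This is a minimal sketch, assuming the stored bytes are read back as Latin-1 (the code page actually used by the database connection is a guess, as noted above). It is written for Python 3 and the variable names are only illustrative. It may also explain why encode('Latin-1') appears to help: it undoes the wrong decoding step.

```python
# -*- coding: utf-8 -*-
# Assumption: the database connection decodes stored bytes as Latin-1.
title = "Квест в реальности"

# What .encode('utf-8') produces: raw UTF-8 bytes (two bytes per Cyrillic letter).
utf8_bytes = title.encode("utf-8")

# Reading those bytes back as Latin-1 turns every byte into one character,
# which is the kind of garbage that shows up in the database.
mojibake = utf8_bytes.decode("latin-1")

# Reversing the wrong step recovers the original text.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(repr(mojibake))
print(repaired)
```

Storing the unicode string directly avoids the need for any such repair.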
Related
I'm trying to familiarize myself with Python with a major concern for web publishing, so I looked around and found the following example. Running 2.7 in PyScripter on Windows 7 didn't bring up the browser as I expected. The code came up in Notepad++ instead, apparently since the html suffix was associated with Notepad. I tried about a dozen different permutations of the code, but the html file still opened in Notepad until I associated that file with Firefox. When I include the print webbrowser._browsers command, I get {'windows-default': [, None], 'c:\program files (x86)\internet explorer\iexplore.exe': [None, ]}
This would imply to me that IE should be the default browser being called, but obviously it is not. Can anyone enlighten me here as I am a Python newbie?
'''A simple program to create an html file from a given string,
and call the default web browser to display the file.'''
contents = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="content-type">
<title>Hello</title>
</head>
<body>
Hello, World!
</body>
</html>
'''
import webbrowser

def main():
    browseLocal(contents)

def strToFile(text, filename):
    """Write a file with the given name and the given text."""
    output = open(filename,"w")
    output.write(text)
    output.close()

def browseLocal(webpageText, filename='C:\\Python27\\Programs\\tempBrowseLocal.html'):
    '''Start your webbrowser on a local file containing the text
    with given filename.'''
    strToFile(webpageText, filename)
    print webbrowser._browsers
    webbrowser.open(filename)

main()
def browseLocal(webpageText):  # take filename out of here
    '''Start your webbrowser on a local file containing the text
    with given filename.'''
    # define filename here.
    filename='C:\\Python27\\Programs\\tempBrowseLocal.html'
    strToFile(webpageText, filename)
    print webbrowser._browsers
    webbrowser.open(filename)
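One more thing that may be worth trying: the webbrowser documentation notes that passing a bare filename to webbrowser.open() may start the operating system's associated program, and that this is neither supported nor portable. Below is a hedged sketch (Python 3) that passes a file:// URL instead. The helper names write_page, to_file_url and browse_local, and the use of the temp directory, are assumptions made for illustration.

```python
import pathlib
import tempfile
import webbrowser

def write_page(text, directory=None):
    """Write text to an .html file and return its path (hypothetical helper)."""
    directory = pathlib.Path(directory or tempfile.gettempdir())
    path = directory / "tempBrowseLocal.html"
    path.write_text(text, encoding="utf-8")
    return path

def to_file_url(path):
    """Turn a filesystem path into a file:// URL."""
    return pathlib.Path(path).resolve().as_uri()

def browse_local(text):
    """Write the page, then ask the default browser to open its URL."""
    url = to_file_url(write_page(text))
    return webbrowser.open(url)

# Usage (opens a browser window, so it is left commented out):
# browse_local("<html><body>Hello, World!</body></html>")
```

Which application actually opens the URL still depends on how the operating system is configured, so this is not guaranteed to change the behaviour described in the question.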
I am creating a ColdFusion page with some Japanese characters. I included the following in the top of the page.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If I explicitly include Japanese characters in the output, they look fine. However, if I output them using, say:
<cfoutput>#variables.TitleInJapanese#</cfoutput>
The output is garbled as though the encoding is not recognized. I have tried <cfcontent> and <cfprocessingdirective> tags to no avail.
If I open the .cfm source file, the Japanese characters that are assigned to the variables look as they should in my text editor. It's the content that is generated using <cfoutput> that is giving me trouble. Any suggestions would be welcome. Thanks!
Correction: The page I have created will not display any Japanese characters, explicit or referenced. However, other files using <cfinclude> within the page that have Japanese characters render just fine.
I'm coding a Rails app, and I have a problem with the page title:
In my config/locales/fr.yml I have this:
fr:
  product:
    edit: "Modification de l'objet"
And in my /app/views/products/edit.html.erb I have this: <title><%= t('product.edit') %></title>
And when I render the page, it gives me this: Modification de l&#39;objet.
Do you know what's wrong with it?
I tried adding <meta http-equiv="content-type" content="text/html; charset=UTF-8" /> to the head of my HTML, or this, but it didn't work for me...
You can use <%= raw(I18n.t('product.edit')) %> to avoid this. Be aware though, that the code won't be escaped. When using raw you have to be sure there's no way to inject malicious code in the string.
I think I can tell you where the l&#39; is coming from...
Hopefully you can then find a way to fix it.
Open Notepad, hold down the Alt key, and type 39 to see what character appears.
You will notice that you get the ' character when you type that number, so after rendering, your code seems to treat l'objet as l followed by &#39;.
So I think, as you are pointing out, there is some sort of language issue in how the characters are represented. You might be able to reverse this to solve your problem.
Sorry, this is all I have.
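The link between 39 and the apostrophe can be checked directly. The sketch below uses Python's standard html module purely as an illustration (it is not what Rails does internally). It shows that 39 is the character code of ', and that &#39; is the HTML character reference for that character, which is the kind of output an HTML-escaping step produces.

```python
import html

text = "Modification de l'objet"

# 39 is the code point of the apostrophe.
code = ord("'")

# Unescaping the numeric reference gives back the plain apostrophe.
decoded = html.unescape("Modification de l&#39;objet")

# Python's own escaper writes the same character in hexadecimal (0x27 == 39).
escaped = html.escape(text)

print(code, decoded, escaped)
```

So the translation string is probably being HTML-escaped before it is placed in the title, which fits the suggestion to use raw.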
Ascii decoding error
Text = "Hanuman (Sanskrit: हनुमान्, Hanumān), a Hindu deity who was an ardent devotee of Rama according to Hindus legends, and a central character in the Indian epic Ramayana."
I saved the text into a MySQL table, in an nvarchar column, and it inserted successfully.
When I retrieve this data in the console, it displays correctly. But when I retrieve it via Django and display it in a template, it shows up as some ASCII characters.
Displaying as "Hanuman (Sanskrit: हनà¥à¤®à¤¾à¤¨à¥, HanumÄn), is a Hindu deity who is an ardent devotee of Rama, a central character in the Indian epic Ramayana."
I guess you are missing the content type meta tag in your template:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
My Django code works in Chrome and Firefox, but in IE the web page displays unreadable characters. The following are my settings:
DEFAULT_CHARSET = 'utf8'
FILE_CHARSET = 'utf8'
and the template files are saved in UTF-8 format, but my template file contains another language besides English. That non-English part is not readable.
Should I change some Django setting? Most of the visitors to my website may use IE, so this is a big problem. Any suggestions?
Did you add this meta tag to your base HTML?
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>