Scraping a percent encoded URL with unicode in Scrapy - python-2.7

Consider I want to scrape a site which contains the following HTML:
<a id="mylink" href="http://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premi%C3%A8r-cru-brocard-75cl">
This href is the percent encoding of the UTF-8 byte-string representation of u'http://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl'
I get the href with Scrapy like this:
u = response.xpath('//a[@id="mylink"]/@href').extract_first()
Scrapy sets the variable u as
u'http://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premi%C3%A8r-cru-brocard-75cl'
Notice that it has interpreted the page's byte string (which represented a unicode string) as if it were the unicode string itself, so it is the wrong unicode object with different unicode chars:
In [67]: print urllib.unquote(u)
http://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premiÃ¨r-cru-brocard-75cl
What is actually desired is that Scrapy interprets the href as a byte string:
bs = 'http://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premi%C3%A8r-cru-brocard-75cl'
so that this represents the correct unicode object, i.e.
In [70]: print urllib.unquote(bs).decode('utf8')
http://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl
The only way I've managed to get around this is with a small cleaning function that corrects the "mistake" as follows:
import urllib

def _deal_with_encoding(url):
    # Should give no encoding errors, since the percent-encoded URL is pure ASCII
    pbs = url.encode('ascii')
    # Get a regular (not percent-encoded) UTF-8 byte string
    bs = urllib.unquote(pbs)
    # Finally decode the UTF-8 to get the correct unicode string
    return bs.decode('utf8')
It works but doesn't seem ideal. Is this really the only way?
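For reference, running the extracted href through the cleaner gives the expected result (interpreter-style, continuing the snippets above):

In [71]: print _deal_with_encoding(u)
http://www.sainsburys.co.uk/shop/gb/groceries/chablis/chablis-premièr-cru-brocard-75cl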

Related

Replace all emojis from a given unicode string

I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes a unicode string as input, e.g. some👩😌thing, and removes all emojis, leaving "something". Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'
I have been trying to do the above, and in the process I came across some strange behavior, which I demonstrate below. I believe that if the code below is fixed, I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
In narrow builds of Python 2.7, Unicode codepoints above 0xFFFF are stored as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469').
The best way to solve this is to move to a Python that treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 as a wide build for this (--enable-unicode=ucs4), and recent versions of Python 3 do it automatically.
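A quick way to check which build you are running (a small sketch; the printed values follow from the narrow/wide storage just described):

import sys

print sys.maxunicode        # 65535 on a narrow build, 1114111 on a wide build
print len(u'\U0001F469')    # 2 on a narrow build (a surrogate pair), 1 on a wide build
print list(u'\U0001F469')   # narrow build: [u'\ud83d', u'\udc69']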
To create a regular expression to use for the replace, simply join all the characters together with |. Since the characters in the list are already encoded as surrogate pairs, this creates the proper pattern:
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
Narrow builds of Python 2.7 use 16-bit word-based Unicode storage, in which codepoints above 0xFFFF are automatically represented by surrogate pairs.
Before the regex "sees" your Python string, Python has already split your large Unicode codepoints into two separate code units (each valid on its own, but only half of a complete character).
That means ur'[\U0001f469]+' is really a character class of two code units, which the engine matches individually: it consumes both halves of 👩 plus the leading surrogate of 😌, leaving a lone trailing surrogate behind. That is your badly formed output.
This fixes it:
print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text) # something
# Removing only "👩" with the character class didn't work:
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
# .. but with a group it now does:
print re.sub(ur'(\U0001f469)+', u'', text) # some😌thing
because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.
If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace them one by one:

exclude_list = UNICODE_EMOJI.keys()
for bad in exclude_list:  # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing ' + bad
        text = text.replace(bad, u'')
print text

Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
If you do not re.escape the emoji chars, you will get a nothing to repeat error, because some entries in the list contain literal regex metacharacters (the keycap sequences include * and #, for example) that break the alternation inside the group, so map(re.escape, exclude_list) is required.
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.

Issues seen during utf-7 conversion

Python : 2.7
I need to convert to UTF-7 before I go ahead, so I used the code below in the Python 2.7 interpreter:
>>> mbox = u'한국의'
>>> mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",")
'&1VytbcdY-'
When I use the same code in my Python script, as shown below, the output for mbox is b'&Ti1W,XaE' instead of b'&Ti1W,XaE-', i.e. the "-" at the end of the string is missing when running as a script instead of in the interpreter.
mbox = "b'" + mbox + "'"
print mbox
mbox = mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",")
print mbox
Please suggest.
Quoting from Wikipedia's description of UTF-7:
Other characters must be encoded in UTF-16 (hence U+10000 and higher would be encoded into surrogates), big-endian (hence higher-order bits appear first), and then in modified Base64. The start of these blocks of modified Base64 encoded UTF-16 is indicated by a + sign. The end is indicated by any character not in the modified Base64 set. If the character after the modified Base64 is a - (ASCII hyphen-minus) then it is consumed by the decoder and decoding resumes with the next character. Otherwise decoding resumes with the character after the base64.
Any block of encoded characters must end with a non-Base64 character. If the next character in the string is such a character, it will serve as the terminator; otherwise - is added to the end of the block. Your first example ends with a - for this reason. Your second example doesn't need one because ' is not part of the Base64 character set.
If your intent is to create a Python literal that creates a valid UTF-7 string, just do things in a different order.
mbox = b"b'" + mbox.encode('utf-7').replace(b"+", b"&").replace(b"/", b",") + b"'"

Replace utf8 characters

I want to replace one set of UTF-8 characters with another UTF-8 character set, but anything I try ends up with errors.
I am a noob at Python, so please be patient.
What I want to achieve is converting characters by unicode values or by HTML entities (more readable, for maintenance).
Attempts (with examples):
1. First:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

# Found this function
def multiple_replace(dic, text):
    pattern = "|".join(map(re.escape, dic.keys()))
    return re.sub(pattern, lambda m: dic[m.group()], text)

text = "Larry Wall is ùm© some text"
replace_table = {
    u'\x97' : u'\x82' # ù -> é
}
text2 = multiple_replace(replace_table, text)
print text # Expected: Larry Wall is ém© some text
# Got: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
2. HTML entities:

dic = {
    "ú" : "é" # ù -> é
}
some_text = "Larry Wall is ùm© some text"
some_text2 = some_text.encode('ascii', 'xmlcharrefreplace')
some_text2 = multiple_replace(dic, some_text2)
print some_text2
# Got: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
Any ideas are welcome
Your problem is due to the fact that your input strings are in non-unicode representation (<type 'str'> rather than <type 'unicode'>). You must define the input string using the u"..." syntax:
text=u"Larry Wall is ùm© some text"
# ^
(Besides, you will have to fix the last statement in your first example - currently it prints the input string (text), whereas I am pretty sure that you meant to see the result (text2).)
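For completeness, here is a minimal runnable sketch of the first attempt with every string literal made unicode, as suggested above (the ù -> é mapping is written with the actual characters rather than \x escapes, purely for readability):

# -*- coding: utf-8 -*-
import re

def multiple_replace(dic, text):
    pattern = u"|".join(map(re.escape, dic.keys()))
    return re.sub(pattern, lambda m: dic[m.group()], text)

replace_table = {u'ù': u'é'}                 # unicode keys and values
text = u"Larry Wall is ùm© some text"        # unicode input
print multiple_replace(replace_table, text)  # Larry Wall is ém© some text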

BeautifulSoup misses some characters in decoding from utf-8 to unicode

I'm trying to parse Cyrillic text from a site page, and I get the error below when I try to print soup.text for a string that includes angle quotation marks around a word («word»):
error 'charmap' codec can't encode character u'\xab' in position 6: character maps to <undefined>
The original page is UTF-8; the raw page from urllib2.urlopen contains
bbb = '\xab\x80\xd1\x8c\xc2\xbb'
\xab and \xbb are the opening and closing quotation marks.
I tried to convert to unicode by hand (BeautifulSoup does this too):
unicode(bbb, 'utf8', errors='ignore')
But in spite of the errors='ignore' key, the unknown elements still exist; I get
u'\xab\u0446\u0435\u0437\u0430\u0440\u044c\xbb'
I tried to delete every unknown element starting with \x using a regular expression, but it doesn't work:
bbb = re.sub(r'[\x00-\x7f]', r' ', bbb)
u'\xbb' is not an unknown element, there is no problem there. It represents the character U+00BB Right-Pointing Double Angle Quotation Mark. The Unicode string literals u'\xbb' and u'\u00bb' represent the same string.
\x has a different meaning depending on what kind of string literal it is used in. In a byte string, it introduces a hex-encoded byte from 0x00 to 0xFF. In a Unicode string, it introduces a hex-encoded character from U+0000 to U+00FF. When producing the repr() representation of a string, Python prefers to output characters in the range up to U+00FF using \x escapes rather than the arguably-clearer \u escapes, because they're shorter.
The \u and \x are merely alternative ways to refer to a character in the string literal representation; they are not literally part of the value of the string. There is no actual backslash in the value, so you can't use re to try to remove characters that might appear in the repr() form as backslash escapes.
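A quick interpreter check of that equivalence:

>>> u'\xbb' == u'\u00bb'
True
>>> print u'\xbb'  # assuming the console encoding can represent it
»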
The actual error:
error 'charmap' codec can't encode character u'\xab' in position 6: character maps to <undefined>
is just PrintFails again, as usual. Apparently your console is using an encoding that doesn't include the character U+00AB.
If you are using the Windows Command Prompt, you could try to use win-unicode-console as a workaround for the brokenness of that particular console.
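If switching consoles is not an option, a common workaround (a minimal sketch, not specific to BeautifulSoup) is to encode explicitly before printing, replacing anything the console's encoding cannot represent:

# -*- coding: utf-8 -*-
import sys

text = u'\xab\u0446\u0435\u0437\u0430\u0440\u044c\xbb'  # «цезарь»
encoding = sys.stdout.encoding or 'utf-8'  # stdout may have no encoding when piped
print text.encode(encoding, 'replace')     # 'replace' avoids UnicodeEncodeError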

Python hex variable assignment

I'm using a variable to store data that gets sent by a socket. When I assign it in my program it works but when I read it from a file it is treated as a string.
Example:
data = '\x31\x32\x33'
print data
Outputs
123 # <--- this is the result I want when I read from a file to assign data
f = open('datafile')  # <--- datafile contains \x31\x32\x33 on one line
data = f.readline()
print data
Outputs
\x31\x32\x33 # <--- wanted it to print 123, not \x31\x32\x33.
In Python the string '\x31\x32\x33' is actually only three characters '\x31' is the character with ordinal 0x31 (49), so '\x31' is equivalent to '1'. It sounds like your file actually contains the 12 characters \x31\x32\x33, which is equivalent to the Python string '\\x31\\x32\\x33', where the escaped backslashes represent a single backslash character (this can also be represented with the raw string literal r'\x31\x32\x33').
If you really are sure that this data should be '123', then you need to look at how that file is being written. If that is something you can control then you should address it there so that you don't end up with data consisting of several bytes representing hex escapes.
It is also possible that whatever is writing this data is already using some data-interchange format (similar to JSON), in which case you don't need to change how it is written, you just need to use a decoder for that data-interchange format (like json.loads(), but this isn't JSON).
If somehow neither of the above are really what you want, and you just want to figure out how to convert a string like r'\x31\x32\x33' to '123' in Python, here is how you can do that:
>>> r'\x31\x32\x33'.decode('string_escape')
'123'
Or in Python 3.x:
>>> br'\x31\x32\x33'.decode('unicode_escape')
'123'
edit: Based on comments it looks like you are actually getting hex strings like '313233', to convert a string like that to '123' you can decode using hex:
>>> '313233'.decode('hex')
'123'
Or on Python 3.x:
>>> bytes.fromhex('313233').decode('utf-8')
'123'
I might have violated many programming standards here, but the following code works for the given situation:

with open('datafile') as f:
    data = f.read()
data = data.lstrip('\\x')         # strip the leftmost '\x'; 'data' now contains hex numbers separated by '\x'
data = data.strip().split('\\x')  # now 'data' is a list of hex numbers
s = ''
for d in data:
    s += chr(int(d, 16))          # convert each hex ASCII value to its character and concatenate onto 's'
print s
As stated, you are doing post-processing; it would be easier to handle if the text were "313233".
You would then be able to use
data = "313233"
print data.decode("hex") # this will print '123'
As stated in the comments, this is for Python 2.7 and no longer works as a str method in Python 3. However, unless this question is mis-tagged, this will work.
Yes, when you do a conversion from a string to an int, you can specify the base of the numbers in the string:
>>> print int("31", 16)
49
>>> chr(49)
'1'
So you should be able to just parse the hex values out of your file and individually convert them to chars.
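For instance, a small sketch of that parsing approach (assuming, per the question, that the file literally contains the 12 characters \x31\x32\x33 on one line):

import re

line = open('datafile').readline()
hex_values = re.findall(r'\\x([0-9a-fA-F]{2})', line)   # ['31', '32', '33']
print ''.join(chr(int(h, 16)) for h in hex_values)      # 123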