BeautifulSoup miss some alphabets in decoding from utf-8 to unicode - python-2.7

I'm trying to parse cyrillyc text from the site page and i have that error if i try to print soup.text of the scring which includes closing quotation marks in the word "word"
error 'charmap' codec can't encode character u'\xab' in position 6: charater maps to undefined
The original string page (utf-8)
urllib2.urlopen raw page = bbb = '\xab\x80\xd1\x8c\xc2\xbb'
\xbb and \xab- it's closing quotation mark
I try to convert to unicode by hand (BeautifulSoup does this too)
unicode(bbb, 'utf8', errors='ignore')
But inspite of error key "ignore" unknown elements they still exists int
i get
\xab\u0446\u0435\u0437\u0430\u0440\u044c**\xbb**'
I try to delete all unknown element starting with ^\x with help regular exp, but it's doesn't work
bbb = re.sub(r'[\x00-\x7f]', r' ', bbb)

But inspite of error key "ignore" unknown elements they still exists
u'\xbb' is not an unknown element, there is no problem there. It represents the character U+00BB Right-Pointing Double Angle Quotation Mark. The Unicode string literals u'\xbb' and u'\u00bb' represent the same string.
\x has a different meaning depending on what kind of string literal it is used in. In a byte string, it introduces a hex-encoded byte from 0x00 to 0xFF. In a Unicode string, it introduces a hex-encoded character from U+0000 to U+00FF. When producing the repr() representation of a string, Python prefers to output characters in the range up to U+00FF using \x escapes rather than the arguably-clearer \u escapes, because they're shorter.
The \u and \x are merely alternative ways to refer to a character in the string literal representation; they are not literally part of the value of the string. There is no actual backslash in the value, so you can't use re to try to remove characters that might appear in the repr() form as backslash escapes.
The actual error:
error 'charmap' codec can't encode character u'\xab' in position 6: charater maps to undefined
Is just PrintFails again as usual. Apparently your console is using an encoding that doesn't include the character U+00AB.
If you are using the Windows Command Prompt, you could try to use win-unicode-console as a workaround for the brokenness of that particular console.

Related

Regex to exclude non-ASCII but keep Nordic characters

I have a macro in which I use Regex to strip a text of all non-ASCII characters (in order to create folder names).
I am relatively new to Regex and I was wondering how to strip all non-ASCII but still include Nordic characters, as the macro goes through Scandinavian data. Basically, I would need to include characters 128 to 165 from this table
Here is my code so far:
Public Function GetStrippedText(txt As String) As String
Dim regEx As Object
Set regEx = CreateObject("vbscript.regexp")
regEx.Pattern = "[^\u0000-\u007F]"
GetStrippedText = regEx.Replace(txt, "")
End Function
I understand that I need to include this range in there somehow "[^\u0000-\u007F]", I just don't know where to find the associated code or how to include it.
To the best of my knowledge I think there are a few points here to highlight:
Not all extended (or non-) ASCII tables follow the same character encoding. The table you linked seems to follow CP437, and Excel follows UTF-8 (Unicode), which you can test using the UNICODE function in Excel. Here is a link to see the difference it makes in Hex-codes. So you most likely need to pick a range of interest within the "Latin-1 Supplement" which can be found here. For this exercise I went with characters from À-ÿ which is range: u00C0-\u00FF
Next, your current character class covers normal ASCII characters, however I believe you might just be interested in 0020-007F as you probably don't want to include 0000-001F.
Thirdly, you did not set the Global parameter to True which means your current UDF will only replace the first character it finds outside your character class. So you'll need to set this parameter to replace all characters outside defined character class.
So to conclude, the below might work for you:
Public Function GetStrippedText(txt As String) As String
Dim regEx As Object
Set regEx = CreateObject("vbscript.regexp")
regEx.Global = True
regEx.Pattern = "[^\u0020-\u007F\u00C0-\u00FF]"
GetStrippedText = regEx.Replace(txt, "")
End Function
For your understanding; [^\u0020-\u007F\u00C0-\u00FF] means:
[....] - The brackets tell us this is a character class
^ - The caret means it's a negated character class
\u0020-\u007F - means the characters run from index 32 till index 127 and \u00C0-\u00FF runs from 192 till 255.
In this same fashion you can extend the amount of character ranges.
Note1: Instead of Unicode, you could also just use the Hex codes: "[^\x20-\x7F\xC0-\xFF]"
Note2: You could also create a character class without Unicode or Hex ranges. Simply concatenate the characters of interest instead.

Replace all emojis from a given unicode string

I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes as input a unicode a string, i.e. some👩😌thing, and then removes all emojis, i.e. "something". Below is a demonstration of what I want to achieve:
from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'
I have been trying to do the above, and in that process, I came across a strange behavior which I demonstrate below, as you can see. I believe if the code below is fixed, then I will be able to achieve my end goal.
import regex as re
print u'\U0001F469' # 👩
print u'\U0001F60C' # 😌
print u'\U0001F469\U0001F60C' # 👩😌
text = u'some\U0001F469\U0001F60Cthing'
print text # some👩😌thing
# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'[\U0001f469]+', u'', text) # some�thing
In most builds of Python 2.7, Unicode codepoints above 0x10000 are encoded as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469').
The best way to solve this is to move to a version of Python that properly treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 for this, and the recent versions of Python 3 will do it automatically.
To create a regular expression to use for the replace, simply join all the characters together with |. Since the list of characters already is encoded with surrogate pairs it will create the proper string.
subs = u'|'.join(exclude_list)
print re.sub(subs, u'', text)
The old 2.7 regex engine gets confused because:
Python 2.7 uses a forced word-based Unicode storage, in which certain Unicode codepoints are automatically substituted by surrogate pairs.
Before the regex "sees" your Python string, Python already helpfully parsed your large Unicode codepoints into two separate characters (each on its own a valid – but incomplete – single Unicode character).
That means that [\U0001f469]+' replaces something (a character class of 2 characters), but one of them is in your string and the other is not. That leads to your badly formed output.
This fixes it:
print re.sub(ur'(\U0001f469|U0001F60C)+', u'', text) # something
# Removing only "👩" doesn't work
print re.sub(ur'(\U0001f469)+', u'', text) # some�thing
# .. and now it does:
some😌thing
because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.
If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:
exclude_list = UNICODE_EMOJI.keys()
for bad in exclude_list: # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
if bad in text:
print 'Removing '+bad
text = text.replace(bad, '')
Removing 👩
Removing 😌
something
(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
To remove all emojis from the input string using the current approach, use
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
rx = ur"(?:{})+".format("|".join(map(re.escape,exclude_list)))
print re.sub(rx, u'', text)
# => u'something'
If you do not re.escape the emoji chars, you will get nothing to repeat error due to the literal chars messing up with the alternation operators inside the group, so map(re.escape,exclude_list) is required.
Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2.

python .replace not working properly

My code takes a list of strings from a static website.
It then traverses through each character in the list and uses the .replace method to replace any non utf-8 character:
foo.replace('\\u2019', "'")
It doesn't replace the character in the list correctly and ends up looking like the following:
before
u'What\u2019s with the adverts?'
after
u'What\u2019s with the adverts?'
Why is it
Python 2.7 interprets string literals as ASCII, not unicode, and so even though you've tried to include unicode characters in your argument to foo.replace, replace is just seeing ASCII {'\', 'u', '2', '0', '1', '9'}. This is because Python doesn't assign a special meaning to "\u" unless it is parsing a unicode literal.
To tell Python 2.7 that this is a unicode string, you have to prefix the string with a u, as in foo.replace(u'\u2017', "'").
Additionally, in order to indicate the start of a unicode code, you need \u, not \\u - the latter indicates that you want an actual '\' in the string followed by a 'u'.
Finally, note that foo will not change as a result of calling replace. Instead, replace will return a value which you must assign to a new variable, like this:
bar = foo.replace(u'\u2017', "'")
print bar
(see stackoverflow.com/q/26943256/4909087)
yeah. If your string is foo = r'What\u2019s with the adverts?' will ok with foo.replace('\\u2019', "'"). It is a raw string and begins with r''. And with u'' is Unicode.
Hope to help you.

ColdFusion : URLDecode("%ara%") is giving me as %ar%

In ColdFusion, URLDecode() decodes the URL-encoded string (URL encoding formats some characters with a percent sign and the two-character hexadecimal representation of the character).
Example: %3A is the hex equivalent for ":(colon)" When the URLDecode() is applied on different strings as below,
URLDecode("%3A%") = :% -- Valid
URLDecode("%EE") = � -- this is because EE has no equivalent character.
But I am trying to decode the string "%ara%" which is invalid, the result I am getting is "%ar%". I don't find the 2nd occurrence of character "a".Can anyone explain me why this is happening??

Strategy to replace spaces in string

I need to store a string replacing its spaces with some character. When I retrieve it back I need to replace the character with spaces again. I have thought of this strategy while storing I will replace (space with _a) and (_a with _aa) and while retrieving will replace (_a with space) and (_aa with _a). i.e even if the user enters _a in the string it will be handled. But I dont think this is a good strategy. Please let me know if anyone has a better one?
Replacing spaces with something is a problem when something is already in the string. Why don't you simply encode the string - there are many ways to do that, one is to convert all characters to hexadecimal.
For instance
Hello world!
is encoded as
48656c6c6f20776f726c6421
The space is 0x20. Then you simply decode back (hex to ascii) the string.
This way there are no space in the encoded string.
-- Edit - optimization --
You replace all % and all spaces in the string with %xx where xx is the hex code of the character.
For instance
Wine having 12% alcohol
becomes
Wine%20having%2012%25%20alcohol
%20 is space
%25 is the % character
This way, neither % nor (space) are a problem anymore - Decoding is easy.
Encoding algorithm
- replace all `%` with `%25`
- replace all ` ` with `%20`
Decoding algorithm
- replace all `%xx` with the character having `xx` as hex code
(You may even optimize more since you need to encode only two characters: use %1 for % and %2 for , but I recommend the %xx solution as it is more portable - and may be utilized later on if you need to code more characters)
I'm not sure your solution will work. When reading, how would you
distinguish between strings that were orginally " a" and strings that
were originally "_a": if I understand correctly, both will end up
"_aa".
In general, given a situation were a specific set of characters cannot
appear as such, but must be encoded, the solution is to choose one of
allowed characters as an "escape" character, remove it from the set of
allowed characters, and encode all of the forbidden characters
(including the escape character) as a two (or more) character sequence
starting with the escape character. In C++, for example, a new line is
not allowed in a string or character literal. The escape character is
\; because of that, it must be encoded as an escape sequence as well.
So we have "\n" for a new line (the choice of n is arbitrary), and
"\\" for a \. (The choice of \ for the second character is also
arbitrary, but it is fairly usual to use the escape character, escaped,
to represent itself.) In your case, if you want to use _ as the
escape character, and "_a" to represent a space, the logical choice
would be "__" to represent a _ (but I'd suggest something a little
more visually suggestive—maybe ^ as the escape, with "^_" for
a space and "^^" for a ^). When reading, anytime you see the escape
character, the following character must be mapped (and if it isn't one
of the predefined mappings, the input text is in error). This is simple
to implement, and very reliable; about the only disadvantage is that in
an extreme case, it can double the size of your string.
You want to implement this using C/C++? I think you should split your string into multiple part, separated by space.
If your string is like this : "a__b" (multiple space continuous), it will be splited into:
sub[0] = "a";
sub[1] = "";
sub[2] = "b";
Hope this will help!
With a normal string, using X characters, you cannot write or encode a string with x-1 using only 1 character/input character.
You can use a combination of 2 chars to replace a given character (this is exactly what you are trying in your example).
To do this, loop through your string to count the appearances of a space combined with its length, make a new character array and replace these spaces with "//" this is just an example though. The problem with this approach is that you cannot have "//" in your input string.
Another approach would be to use a rarely used char, for example "^" to replace the spaces.
The last approach, popular in a combination of these two approaches. It is used in unix, and php to have syntax character as a literal in a string. If you want to have a " " ", you simply write it as \" etc.
Why don't you use Replace function
String* stringWithoutSpace= stringWithSpace->Replace(S" ", S"replacementCharOrText");
So now stringWithoutSpace contains no spaces. When you want to put those spaces back,
String* stringWithSpacesBack= stringWithoutSpace ->Replace(S"replacementCharOrText", S" ");
I think just coding to ascii hexadecimal is a neat idea, but of course doubles the amount of storage needed.
If you want to do this using less memory, then you will need two-letter sequences, and have to be careful that you can go back easily.
You could e.g. replace blank by _a, but you also need to take care of your escape character _. To do this, replace every _ by __ (two underscores). You need to scan through the string once and do both replacements simultaneously.
This way, in the resulting text all original underscores will be doubled, and the only other occurence of an underscore will be in the combination _a. You can safely translate this back. Whenever you see an underscore, you need a lookahed of 1 and see what follows. If an a follows, then this was a blank before. If _ follows, then it was an underscore before.
Note that the point is to replace your escape character (_) in the original string, and not the character sequence to which you map the blank. Your idea with replacing _a breaks. as you do not know if _aa was originally _a or a (blank followed by a).
I'm guessing that there is more to this question than appears; for example, that you the strings you are storing must not only be free of spaces, but they must also look like words or some such. You should be clear about your requirements (and you might consider satisfying the curiosity of the spectators by explaining why you need to do such things.)
Edit: As JamesKanze points out in a comment, the following won't work in the case where you can have more than one consecutive space. But I'll leave it here anyway, for historical reference. (I modified it to compress consecutive spaces, so it at least produces unambiguous output.)
std::string out;
char prev = 0;
for (char ch : in) {
if (ch == ' ') {
if (prev != ' ') out.push_back('_');
} else {
if (prev == '_' && ch != '_') out.push_back('_');
out.push_back(ch);
}
prev = ch;
}
if (prev == '_') out.push_back('_');