c++: Decoding with mbstowcs() not working with special characters

After learning a lot from that post (to sum up: I need to decode special characters such as 'ñ' in Eclipse on Fedora), I now think I was doing it wrong: I was decoding the text as a normal string and then trying to convert it from char to wchar_t. Now I'm using mbstowcs() to decode it from the beginning, and I have another couple of doubts/problems.
What I have:
A server sends me the text "$ Euroñ $" (I don't know how it is encoded, only that the string is encoded as ANSI, 1 byte per character). I load that text into descr[0], so when I debug it, some of the values are:
descr[0][2]= 69 'E'
descr[0][3]= 117 'u'
descr[0][4]= 114 'r'
descr[0][5]= 111 'o'
descr[0][6]= -15 'ñ'
With that, I suppose you can tell whether it is UTF-8 or not; I assume it is.
My goal is to be able to do something like:
for (int i = 0; i < total; i++)
{
    if (descr[0][i] == 'a') // do things
    else if (descr[0][i] == 'ñ') // do other stuff
}
This is what I am not able to get working, because of course descr[0][6] never equals 'ñ'.
I've tried with:
if(descr[0][i]=='\x00D1')
if(descr[0][i]=='\x00F1')
if(descr[0][i]=='ñ') //this compiles but then ignores that if like it is an error
if(descr[0][i]==L'ñ')
But it never enters any of those ifs. So what should I compare descr[0][i] to in order to know whether it is an 'ñ'?
Thanks.
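The debugger values listed in the question already hint at the encoding. The following is a minimal sketch in Python (used here only for illustration; the helper names `signed_char_to_byte` and `is_valid_utf8` are hypothetical). It assumes the listing shows signed char values, so -15 corresponds to the byte 0xF1. That byte is 'ñ' in Latin-1 and Windows-1252, whereas UTF-8 would need the two bytes 0xC3 0xB1 for the same character.

```python
def signed_char_to_byte(value):
    """Map a signed char value (-128..127) to its unsigned byte value (0..255)."""
    return value & 0xFF


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# Values copied from the debugger listing: 'E', 'u', 'r', 'o', 'ñ'
raw = bytes(signed_char_to_byte(v) for v in (69, 117, 114, 111, -15))

as_latin1 = raw.decode('latin-1')    # one byte per character
utf8_n_tilde = 'ñ'.encode('utf-8')   # what UTF-8 would have sent instead

print(raw)                 # b'Euro\xf1'
print(as_latin1)           # Euroñ
print(utf8_n_tilde)        # b'\xc3\xb1'
print(is_valid_utf8(raw))  # False
```

If that assumption holds, the data is a single-byte encoding rather than UTF-8, and in C++ the test would compare the byte value, for example static_cast<unsigned char>(descr[0][i]) == 0xF1, instead of a 'ñ' literal whose value depends on the encoding of the source file.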

Related

Python Decoding and Encoding, List Element utf-8

Just another question about encoding in Python, I think. I have this program:
regex = re.compile(ur'\b[sw]\w+', flags= re.U | re.I)
ergebnisliste = []
for line in fileobject:
    print str(line)
    erg = regex.findall(line)
    ergebnisliste = ergebnisliste + erg
ergebnislistesortiert = sorted(ergebnisliste, key=lambda x: len(x))
print ergebnislistesortiert
fileobject.close()
I am searching a text file for words beginning with s or w. "ergebnislistesortiert" is the sorted result list.
When I print the result list, there appears to be a problem with the encoding:
['so', 'Wer', 'sp\xc3']
The 'sp\xc3' should be printed as spät. What is wrong here? Why is the list element UTF-8?
And how can I get the right decoding to print "spät"?
Thanks a lot, guys!
\xc3 is not UTF-8. It's a fragment of the full UTF-8 encoding of U+00E4 but you're probably reading it with something like a Latin-1 decoder (which is effectively what Python 2 does if you read bytes without specifying an encoding), in which case the second byte in the UTF-8 sequence isn't matched by \w.
The real fix is to decode the data when you are reading it into Python in the first place. If you are writing new code, switching to Python 3 is probably the best and easiest fix.
If you're stuck on Python 2.7, a somewhat Python 3-compatible approach is something like
import io
fileobject = io.open(filename, encoding='utf-8')
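As a fuller sketch of that approach, the question's loop could look roughly like this once the file is opened with an explicit encoding. The function names `find_s_w_words` and `demo`, and the sample sentence, are made up for illustration; the demo writes its own temporary UTF-8 file so that it is self-contained.

```python
import io
import os
import re
import tempfile


def find_s_w_words(path):
    """Return the words starting with s or w in a UTF-8 file, sorted by length."""
    regex = re.compile(r'\b[sw]\w+', flags=re.U | re.I)
    words = []
    # io.open decodes each line to text, so the regex sees characters, not bytes.
    with io.open(path, encoding='utf-8') as fileobject:
        for line in fileobject:
            words.extend(regex.findall(line))
    return sorted(words, key=len)


def demo():
    """Write a small UTF-8 sample file, scan it, and clean up afterwards."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        with io.open(path, 'w', encoding='utf-8') as f:
            f.write(u'Wer kommt so spät\n')
        return find_s_w_words(path)
    finally:
        os.remove(path)


print(demo())  # ['so', 'Wer', 'spät']
```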
If you have control over the input file and want to postpone the proper solution until you are older, (ask your parents for permission to) convert the UTF-8 input file to some legacy 8-bit encoding.

Autohotkey if statement not working, no error msg

I'm trying to use AutoHotkey to gather a chunk of data from a website and then click a certain spot on the website depending on what the text is. I'm able to get it to actually pick up the value, but when it comes to the if statement it doesn't seem to process and yields no error message. Here is a quick sample of my code. There are about 20 if statement values, so for brevity's sake I've only included a few of them.
GuessesLeft = 20
Errorcount = 0
;triple click and copy text making a variable out of the clipboard
;while (GuessesLeft!=0) part of future while loop
;{ part of future while loop
click 927,349
click 927,349
click 927,349
Send ^c
GetValue = %Clipboard%
if (GetValue = "Frontal boss")
{
    click 955,485
    GuessesLeft -= 1
}
else if (GetValue = "Supraorbital Ridge")
{
    click 955,571
    GuessesLeft -= 1
}
;....ETC
else
{
    Errorcount += 1
}
;} part of future while loop
Any tips on what I might be doing wrong? Ideally I'd use a case statement, but AHK doesn't seem to have them.
Wait a second -- you are triple clicking to highlight a full paragraph and copying that to the clipboard and checking to see if the entirety of the copied portion is the words in the if statement, right? And your words in the copied portion have quotes around them? Probably you will have to trim off any trailing spaces and/or returns:
GetValue := Trim(Clipboard, " `t`r`n") ; second argument adds CR/LF to the default space and tab
If that doesn't work, you may even have to shorten the length of the copied text by an arbitrary character or two:
GetValue = % SubStr(Clipboard, 1, (StrLen(Clipboard)-2))
Now, if I am wrong, and what you are really looking for is the words from the if statement wherever they may be in a longer paragraph -- and they are not surrounded by quotes, then you will want something like:
IfInString, Clipboard, Frontal boss
Or, if the quotes ARE there,
IfInString, Clipboard, "Frontal boss"
Hth,

Unicode string extraction and comparison

1.val Matcher = """.+/(.*)""".r
2.val Matcher(title) = """http://en.wikipedia.org/wiki/Château_La_Louvière"""
3.val lowerCase = title.toLower
4.if(lowerCase.equals("château_la_louvière")) //do something
The above comparison returns false because I guess line 2 results in Ch?teau_La_Louvi?re. Any ideas how I can accomplish this?
As 4e6 says, the problem lies in the default configuration of Java, which assumes all files are encoded in Latin-1.
1.val Matcher = """.+/(.*)""".r
2.val Matcher(title) = """http://en.wikipedia.org/wiki/Château_La_Louvière"""
This could be fixed by setting the following java-OPTS
export JAVA_OPTS='-Dfile.encoding=UTF-8'
Still, 1. and 2. will work even if you don't change the encoding. The problem lies in 3. and 4.
3.val lowerCase = title.toLower
4.if(lowerCase.equals("château_la_louvière")) //do something
''toLower'' will cause the test in 4. to fail, because "â" and "è" will be interpreted wrongly. Each of these characters is encoded as two to four bytes, and each byte is lowercased independently, yielding a result completely different from ''château_la_louvière''.
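The effect the answer describes can be simulated with a small sketch. It is written in Python rather than Scala, purely as an illustration, and it assumes the UTF-8 bytes of the title were read with a Latin-1 decoder; the helper `round_trips_to_utf8` is a hypothetical name.

```python
title = 'Château_La_Louvière'
expected = 'château_la_louvière'

# Simulate reading the UTF-8 bytes of the title with a Latin-1 decoder:
# every byte becomes its own character ('â' turns into 'Ã' followed by '¢').
misread = title.encode('utf-8').decode('latin-1')

lowered_misread = misread.lower()   # 'Ã' is lowercased to 'ã' on its own
lowered_correct = title.lower()


def round_trips_to_utf8(text):
    """Return True if the Latin-1 bytes of text still form valid UTF-8."""
    try:
        text.encode('latin-1').decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


print(lowered_correct == expected)           # True
print(lowered_misread == expected)           # False
print(round_trips_to_utf8(misread))          # True
print(round_trips_to_utf8(lowered_misread))  # False
```

Before lowercasing, the misread string can still be repaired by reversing the decoding. After lowercasing, the lead byte 0xC3 has become 0xE3 and the byte sequence is no longer valid UTF-8.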

URL-Encoding a Byte String?

I am writing a BitTorrent client. One of the steps requires that the program send an HTTP GET request to the tracker containing a SHA1 hash of part of the torrent file. I have used Fiddler2 to intercept the request sent by Azureus to the tracker.
The hash that Azureus sends is URL-Encoded and looks like this: %D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR
The hash should look like this before it's URL-Encoded: d90c3ce39418f0c5d98358e03349262b608cbf52
I notice that it is not as simple as placing a '%' symbol every two characters, so how would I go about encoding this byte string to get the same result as Azureus?
Thanks in advance.
Actually, you can just place a % symbol every two characters. Azureus doesn't do that because, for example, R is a safe character in a URL, and 52 is the hexadecimal representation of R, so it doesn't need to percent-encode it. Using %52 instead is equivalent.
Go through the string from left to right. If you encounter a %, output the next two characters, converting upper-case to lower-case. If you encounter anything else, output the ASCII code for that character in hex using lower-case letters.
%D9 %0C %3C %E3 %94 %18 %F0 %C5 %D9 %83 X %E0 3 I %26 %2B %60 %8C %BF R
The ASCII code for X is 0x58, so that becomes 58. The ASCII code for 3 is 0x33.
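The left-to-right procedure described above can be sketched as follows. The function names are hypothetical and Python is used only for illustration; the second function is the "place a % before every two characters" direction mentioned at the start of the answer.

```python
def percent_encoded_to_hex(escaped):
    """Convert a percent-encoded byte string to a lower-case hex string."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == '%':
            # '%XY' already carries the two hex digits of the byte.
            out.append(escaped[i + 1:i + 3].lower())
            i += 3
        else:
            # A literal character stands for its own ASCII code.
            out.append('%02x' % ord(escaped[i]))
            i += 1
    return ''.join(out)


def hex_to_percent_encoded(hex_digest):
    """Percent-encode every byte of a hex string, including the safe ones."""
    pairs = (hex_digest[i:i + 2] for i in range(0, len(hex_digest), 2))
    return ''.join('%' + pair.upper() for pair in pairs)


azureus = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
digest = 'd90c3ce39418f0c5d98358e03349262b608cbf52'

print(percent_encoded_to_hex(azureus) == digest)                         # True
print(percent_encoded_to_hex(hex_to_percent_encoded(digest)) == digest)  # True
```

The fully escaped form differs textually from what Azureus sends (for example %58 instead of X), but both decode to the same 20 bytes.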
(I'm kind of puzzled why you had to ask though. Your question clearly shows that you recognized this as URL-Encoded.)
Even though I know well that the original question was about C++, it can sometimes be useful to see alternative solutions. Therefore, for what it's worth (10 years later), here's
An alternative solution implemented in Python 3.6+
import binascii
import urllib.parse

def hex_str_to_esc_str(s: str, *, encoding: str = 'Windows-1252') -> str:
    # decode the bytes of the hex string s as a Windows-1252 string
    win1252_str = binascii.unhexlify(s).decode(encoding)
    # escape string and return
    return urllib.parse.quote(win1252_str, encoding=encoding)

def esc_str_to_hex_str(s: str, *, encoding: str = 'Windows-1252') -> str:
    # unescape the escaped string s as a Windows-1252 string
    win1252_str = urllib.parse.unquote(s, encoding=encoding)
    # encode string, hexlify, and return
    return win1252_str.encode(encoding).hex()
Two elementary tests:
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_str_to_esc_str(hex_str) == esc_str) # True
print(esc_str_to_hex_str(esc_str) == hex_str) # True
Note
Windows-1252 (aka cp1252) emerged as the default encoding as a result of the following test:
import binascii
import chardet

esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(
    chardet.detect(
        binascii.unhexlify(hex_str)
    )
)
...which gave a pretty strong clue:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
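A SHA1 digest is arbitrary binary data rather than text, and Windows-1252 leaves a few byte values (such as 0x81) undefined, so the text round trip may fail for other hashes. A possible variant, sketched here with hypothetical function names, skips the codec entirely: `urllib.parse.quote` accepts a bytes object directly, and `urllib.parse.unquote_to_bytes` returns bytes.

```python
import urllib.parse


def hex_to_escaped(hex_digest):
    """Percent-encode the raw bytes behind a hex string."""
    return urllib.parse.quote(bytes.fromhex(hex_digest))


def escaped_to_hex(escaped):
    """Recover the lower-case hex string from a percent-encoded byte string."""
    return urllib.parse.unquote_to_bytes(escaped).hex()


esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'

print(hex_to_escaped(hex_str) == esc_str)  # True
print(escaped_to_hex(esc_str) == hex_str)  # True
```

Note that `quote` leaves '/' unescaped by default; passing safe='' would escape it as well, which matters only for digests that contain the byte 0x2F.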

Bison does not appear to recognize C string literals appropriately

My problem is that I am trying to run a program that I coded using a flex-bison scanner-parser. My program is supposed to take user input (in my case, queries for a database system I'm designing), lex and parse it, and then execute the corresponding actions. What actually happens is that my parser code does not correctly interpret the string literals that I feed it.
Here's my code:
130 insertexpr : "INSERT" expr '(' expr ')'
131
132 {
133 $$ = new QLInsert( $2, $4 );
134 }
135 ;
And my input, following the "Query: " prompt:
Query: INSERT abc(5);
input:1.0-5: syntax error, unexpected string, expecting end of file or end of line or INSERT or ';'
Now, if I remove the "INSERT" string literal from my parser.yy code on line 130, the program runs just fine. In fact, after storing the input data (namely, "abc" and the integer 5), it's returned right back to me correctly.
At first, I thought this was an issue with character encodings. Bison code needs to be compiled and run using the same encodings, which should not be an issue seeing as I am compiling and running from the same terminal.
My system details:
Ubuntu 8.10 (Linux 2.6.24-16-generic)
flex 2.5.34
bison 2.3
gcc 4.2.4
If you need any more info or code from me, let me know!
This is a classic error: if you use flex to lex your input into tokens, you must not refer to the literal strings in the parser as literal strings, but rather use tokens for them.
For details, see this similar question
Thankee, thankee, thankee!
Just to clarify, here is how I implemented my solution, based on the comments from jpalecek. First, I declared an INSERT token in the bison code (parser.yy):
71 %token INSERT
Next, I defined that token in the flex code (scanner.ll):
79 "INSERT INTO" { return token::INSERT; }
Finally, I used the token INSERT in my grammar rule:
132 insertexpr : INSERT expr '(' expr ')'
133
134 {
135 $$ = new QLInsert( $2, $4 );
136 }
137 ;
And voila! my over-extended headache is finally over!
Thanks, jpalecek :).