URL-Encoding a Byte String? - c++

I am writing a BitTorrent client. One of the steps involved requires that the program sends an HTTP GET request to the tracker containing an SHA1 hash of part of the torrent file. I have used Fiddler2 to intercept the request sent by Azureus to the tracker.
The hash that Azureus sends is URL-Encoded and looks like this: %D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR
The hash should look like this before it's URL-Encoded: d90c3ce39418f0c5d98358e03349262b608cbf52
I notice that it is not as simple as placing a '%' symbol every two characters, so how would I go about encoding this byte string to get the same result as Azureus?
Thanks in advance.

Actually, you can just place a % symbol every two characters. Azureus doesn't do that because, for example, R is a safe character in a URL, and 52 is the hexadecimal representation of R, so it doesn't need to percent-encode it. Using %52 instead is equivalent.
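For illustration, here's a minimal Python 3 sketch of that "percent-encode every byte" form (equivalent to what Azureus sends, just without the safe-character shortcut):
import binascii
digest = binascii.unhexlify('d90c3ce39418f0c5d98358e03349262b608cbf52')  # the 20 raw SHA1 bytes
encoded = ''.join('%{:02X}'.format(b) for b in digest)  # every byte becomes %XX
print(encoded)  # %D9%0C%3C%E3%94%18%F0%C5%D9%83%58%E0%33%49%26%2B%60%8C%BF%52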

Go through the string from left to right. If you encounter a %, output the next two characters, converting upper-case to lower-case. If you encounter anything else, output the ASCII code for that character in hex using lower-case letters.
%D9 %0C %3C %E3 %94 %18 %F0 %C5 %D9 %83 X %E0 3 I %26 %2B %60 %8C %BF R
The ASCII code for X is 0x58, so that becomes 58. The ASCII code for 3 is 0x33.
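A minimal Python sketch of that walk (it reproduces the hex digest from the question):
def esc_to_hex(esc):
    out = []
    i = 0
    while i < len(esc):
        if esc[i] == '%':
            out.append(esc[i+1:i+3].lower())  # copy the two hex digits, lower-cased
            i += 3
        else:
            out.append('{:02x}'.format(ord(esc[i])))  # ASCII code of the character, in hex
            i += 1
    return ''.join(out)

print(esc_to_hex('%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'))
# d90c3ce39418f0c5d98358e03349262b608cbf52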
(I'm kind of puzzled why you had to ask though. Your question clearly shows that you recognized this as URL-Encoded.)

Even though the original question was about C++, it can be useful to see alternative solutions. So, for what it's worth (10 years later), here's an alternative solution implemented in Python 3.6+:
import binascii
import urllib.parse

def hex_str_to_esc_str(hex_str: str, *, encoding: str = 'Windows-1252') -> str:
    # unhexlify to raw bytes, then decode them (Windows-1252 by default)
    win1252_str = binascii.unhexlify(hex_str).decode(encoding)
    # escape the string and return it
    return urllib.parse.quote(win1252_str, encoding=encoding)

def esc_str_to_hex_str(esc_str: str, *, encoding: str = 'Windows-1252') -> str:
    # unescape the escaped string (Windows-1252 by default)
    win1252_str = urllib.parse.unquote(esc_str, encoding=encoding)
    # encode the string back to bytes, hexlify, and return
    return win1252_str.encode(encoding).hex()
Two elementary tests:
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_str_to_esc_str(hex_str) == esc_str) # True
print(esc_str_to_hex_str(esc_str) == hex_str) # True
Note
Windows-1252 (aka cp1252) emerged as the default encoding as a result of the following test:
import binascii
import chardet
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(
    chardet.detect(
        binascii.unhexlify(hex_str)
    )
)
...which gave a pretty strong clue:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}


How to read a certain number of characters (as opposed to bytes) in Crystal?

In Crystal, if I have a string (or a file), how do I read a certain number of characters at a time? Using functions like IO#read, IO#gets, IO#read_string, and IO#read_utf8, one can specify a certain number of bytes to read, but not a certain number of UTF-8 characters (or characters in another encoding).
In Python, for example, one might do this:
from io import StringIO

s = StringIO("abcdefgh")
while True:
    chunk = s.read(4)
    if not chunk: break
Or, in the case of a file, this:
with open("example.txt", 'r') as f:
    while True:
        chunk = f.read(4)
        if not chunk: break
Generally, I'd expect IO::Memory to be the class to use for the string case, but as far as I can tell, its methods don't allow for this. How would one do this in an efficient and idiomatic fashion (for both strings and files – perhaps the answer is different for each) in Crystal?
There is currently no shortcut implementation for this available in Crystal.
You can read individual chars with IO#read_char or consecutive ones with IO#each_char.
So a basic implementation would be:
io = IO::Memory.new("€abcdefgh")
string = String.build(4) do |builder|
  4.times do
    builder << io.read_char
  end
end
puts string
Whether you use a memory IO or a file or any other IO is irrelevant, the behaviour is all the same.
io = IO::Memory.new("€€€abc€€€") #UTF-8 string from memory
or
io = File.open("test.txt","r") #UTF-8 string from file
iter = io.each_char.each_slice(4) #read max 4 chars at once
iter.each { |slice| #into a slice
puts slice
puts slice.join #join to a string
}
output:
['€', '€', '€', 'a']
€€€a
['b', 'c', '€', '€']
bc€€
['€']
€
In addition to the answers already given, for strings in Crystal you can read a certain number of characters with a range:
my_string = "A foo, a bar."
my_string[0..4] # => "A foo"
This workaround seems to work for me:
io = IO::Memory.new("abcdefghz")
chars_to_read = 2 # Number of chars to read
while true
chunk = io.gets(chars_to_read) # Grab the chunk of type String?
break if chunk.nil? # Break if nothing else to read aka `nil`
end

How to remove prefixed u from a unicode string?

I am reading the lines from a CSV file and applying the LDA algorithm to find the most common topic. After data processing, I am getting a 'u' in front of every word in doc_processed, but why? Please suggest how to remove the 'u' from doc_processed. My code in Python 2.7 is:
import re
import string
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

data = [line.strip() for line in open("/home/dnandini/test/h.csv", 'r')]
stop = set(stopwords.words('english'))  # stop words
exclude = set(string.punctuation)       # to remove the punctuation
lemma = WordNetLemmatizer()             # to map with parts of speech

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    shortword = re.compile(r'\W*\b\w{1,2}\b')
    output = shortword.sub('', normalized)
    return output

doc_processed = [clean(doc) for doc in data]
Output as doc_processed -
[u'amount', u'ze69heneicosadien11one', u'trap', u'containing', u'little', u'microg', u'zz69ket', u'attracted', u'male', u'low', u'population', u'level', u'indicating', u'potent', u'sex', u'attractant', u'trap', u'baited', u'z6ket', u'attracted', u'male', u'windtunnel', u'bioassay', u'least', u'100fold', u'attractive', u'male', u'zz69ket', u'improvement', u'trap', u'catch', u'occurred', u'addition', u'z6ket', u'various', u'binary', u'mixture', u'zz69ket', u'including', u'female', u'ratio', u'ternary', u'mixture', u'zz69ket']
The u'some string' format means it is a unicode string. See this question for more details on unicode strings themselves, but the easiest way to fix this is likely to call str.encode on the result before returning it from clean:
def clean(doc):
    # ... all as before until
    output = shortword.sub('', normalized).encode()
    return output
Note that attempting to encode unicode code points that don't translate directly to the default encoding (which appears to be ASCII; check sys.getdefaultencoding() on your system) will throw an error here. You can handle the error in various ways by defining the errors kwarg to encode.
s.encode(errors="ignore") # omit the code point that fails to encode
s.encode(errors="replace") # replace the failing code point with '?'
s.encode(errors="xmlcharrefreplace") # replace the failing code point with ' '
# Note that the " " above is U+FFFD, the Unicode Replacement Character.
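For example, a quick Python 2.7 demonstration of the three handlers on a made-up string containing ä (U+00E4):
s = u'sp\xe4t'
print s.encode('ascii', errors='ignore')             # spt
print s.encode('ascii', errors='replace')            # sp?t
print s.encode('ascii', errors='xmlcharrefreplace')  # sp&#228;t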

Python Decoding and Encoding, List Element utf-8

just another question about encoding in python i think. I have this programm:
import re

regex = re.compile(ur'\b[sw]\w+', flags=re.U | re.I)
ergebnisliste = []
for line in fileobject:
    print str(line)
    erg = regex.findall(line)
    ergebnisliste = ergebnisliste + erg
ergebnislistesortiert = sorted(ergebnisliste, key=lambda x: len(x))
print ergebnislistesortiert
fileobject.close()
I am searching a text file for words beginning with s or w; "ergebnislistesortiert" is the sorted result list.
When I print the result list, there appears to be a problem with the encoding:
['so', 'Wer', 'sp\xc3']
The 'sp\xc3' should be printed as spät. What is wrong here? Why is the list element UTF-8?
And how can I get the right decoding to print "spät"?
Thanks a lot guys!
\xc3 is not UTF-8. It's a fragment of the full UTF-8 encoding of U+00E4 but you're probably reading it with something like a Latin-1 decoder (which is effectively what Python 2 does if you read bytes without specifying an encoding), in which case the second byte in the UTF-8 sequence isn't matched by \w.
The real fix is to decode the data when you are reading it into Python in the first place. If you are writing new code, switching to Python 3 is probably the best and easiest fix.
If you're stuck on Python 2.7, a somewhat Python 3-compatible approach is something like
import io
fileobject = io.open(filename, encoding='utf-8')
If you have control over the input file and want to postpone the proper solution until you are older, (ask your parents for permission to) convert the UTF-8 input file to some legacy 8-bit encoding.
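For reference, here's a minimal sketch of the corrected Python 2.7 program under the UTF-8 assumption (the filename is made up):
import io
import re

regex = re.compile(ur'\b[sw]\w+', flags=re.U | re.I)
ergebnisliste = []
with io.open("woerter.txt", encoding='utf-8') as fileobject:  # "woerter.txt" is hypothetical
    for line in fileobject:
        ergebnisliste += regex.findall(line)  # lines are now unicode, so \w matches ä
# print each unicode string individually so Python encodes it for the terminal
for wort in sorted(ergebnisliste, key=lambda x: len(x)):
    print wort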

How can I change this function to be compatible with Python 2 and Python 3? I'm running into string, unicode and other problems

I have a function that is designed to make some text safe for filenames or URLs. I'm trying to change it so that it works in both Python 2 and Python 3. In my attempt, I've confused myself with bytes and unicode handling and would welcome some guidance. I'm encountering errors like sequence item 1: expected a bytes-like object, str found.
def slugify(
    text=None,
    filename=True,
    URL=False,
    return_str=True
):
    if sys.version_info >= (3, 0):
        # insert magic here
        pass
    else:
        if type(text) is not unicode:
            text = unicode(text, "utf-8")
        if filename and not URL:
            text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
            text = unicode(re.sub("[^\w\s-]", "", text).strip())
            text = unicode(re.sub("[\s]+", "_", text))
        elif URL:
            text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")
            text = unicode(re.sub("[^\w\s-]", "", text).strip().lower())
            text = unicode(re.sub("[-\s]+", "-", text))
    if return_str:
        text = str(text)
    return text
It seems like your main problem is figuring out how to convert the text to unicode and back to bytes when you aren't sure what the original type was. In fact, you can do this without any conditional checks if you're careful.
if isinstance(s, bytes):
    s = s.decode('utf8')
Should be sufficient to convert something to unicode in either Python 2 or 3 (assuming 2.6+ and 3.2+, as is usual). This works because bytes is an alias for str in Python 2. The explicit utf8 argument is only required in Python 2, but there's no harm in providing it in Python 3 as well. Then to convert back to a bytestring, you just do the reverse:
if not isinstance(s, bytes):
    s = s.encode('utf8')
Of course, I would recommend that you think hard about why you are unsure what types your strings have in the first place. It is better to keep the two types separate rather than write "weak" APIs that accept either. Python 3 just encourages you to maintain the separation.
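Putting both snippets together, here's a minimal sketch of a version-agnostic slugify. This is my own reconstruction, not a drop-in for your function: it keeps the question's regexes but drops return_str and returns unicode on Python 2 instead of coercing with str().
import re
import unicodedata

def slugify(text, filename=True, URL=False):
    if isinstance(text, bytes):  # same pattern on Python 2 and 3
        text = text.decode('utf8')
    # strip accents, then drop anything outside word chars, spaces and hyphens
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    text = re.sub(r'[^\w\s-]', '', text).strip()
    if URL:
        text = re.sub(r'[-\s]+', '-', text.lower())
    elif filename:
        text = re.sub(r'[\s]+', '_', text)
    return text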

python insert to postgres over psycopg2 unicode characters

Hi guys, I am having a problem with inserting a UTF-8 unicode character into my database.
The unicode that I get from my form is u'AJDUK MARKO\u010d'. The next step is to encode it as UTF-8, value.encode('utf-8'), which gives me the string 'AJDUK MARKO\xc4\x8d'.
Then I try to update the database (it works the same for insert, btw):
cur.execute("UPDATE res_partner set %s = '%s' where id = %s;" % (columns, value, remote_partner_id))
The value gets inserted or updated in the database, but the problem is that it is stored exactly as AJDUK MARKO\xc4\x8d, when of course I want AJDUK MARKOČ. The database has UTF-8 encoding, so it is not that.
What am I doing wrong? Surprisingly, I couldn't really find anything useful on the forums.
\xc4\x8d is the UTF-8 encoding of č (U+010D). It looks like the insert has worked but you're not printing the result correctly, probably by printing the whole row as a list, i.e.
>>> print "č"
č
>>> print ["č"] # a list with one string
['\xc4\x8d']
We need to see more code to validate (It's always a good idea to give as much reproducible code as possible).
You could decode the result (result.decode("utf-8")), but you should avoid manually encoding or decoding. Psycopg2 already allows you to send unicode strings, so you can do the following without encoding first:
cur.execute( u"UPDATE res_partner set %s = '%s' where id = %s;" % (columns, value, remote_partner_id))
- note the leading u
Psycopg2 can return Unicodes too by having strings automatically decoded:
import psycopg2
import psycopg2.extensions
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
Edit:
SQL values should be passed as arguments to .execute(). See the big red box at: http://initd.org/psycopg/docs/usage.html#the-problem-with-the-query-parameters
Instead, e.g.:
# Replace the columns field first.
# Strictly we should use http://initd.org/psycopg/docs/sql.html#module-psycopg2.sql
sql = u"UPDATE res_partner set {} = %s where id = %s;".format(columns)
cur.execute(sql, (value, remote_partner_id))
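For completeness, a hedged end-to-end sketch combining both points; the DSN, column name and id below are made up, only the res_partner table and the value come from the question:
import psycopg2
import psycopg2.extensions

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
cur = conn.cursor()
value = u'AJDUK MARKO\u010d'            # send the unicode directly, no .encode()
sql = u"UPDATE res_partner set {} = %s where id = %s;".format("name")  # column name assumed
cur.execute(sql, (value, 42))           # 42 is a made-up remote_partner_id
conn.commit()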