How to decode unnecessarily url-encoded characters in a URL? [duplicate]

Now I'm working on Wikipedia. In many articles, I noticed some URLs, for example https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99, are very long. The example URL can be replaced with "https://www.google.com/search?q=%26ฉัน" (ฉัน is a Thai word), which is shorter and cleaner. However, when I use the urllib.unquote function to decode the URL, it decodes even %26, and I get "https://www.google.com/search?q=&ฉัน" as the result. As you might have noticed, this URL is useless; it doesn't make a valid link.
Therefore, I want to know how to decode the URL while keeping it valid. I think that decoding only the non-ASCII characters would produce a valid URL. Is that correct? And how can I do that?
Thanks :)

The easiest way: replace every percent-encoded sequence below %80 (%00-%7F) with some placeholder, do a URL decode, then substitute the original sequences back in place of the placeholders, as in the sketch below.
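A minimal Python 3 sketch of that approach (the sentinel character is my assumption; pick anything that can never occur in your URLs):

import re
from urllib.parse import unquote  # urllib.unquote in Python 2

SENTINEL = "\x00"  # assumed never to appear in a real URL

def unquote_non_ascii(url):
    # Park every percent-escape below %80 behind a sentinel,
    stashed = []
    def stash(match):
        stashed.append(match.group(0))
        return SENTINEL
    protected = re.sub(r"%[0-7][0-9A-Fa-f]", stash, url)
    # decode what is left (only the non-ASCII escapes remain),
    decoded = unquote(protected)
    # then put the original ASCII escapes back, in order.
    return re.sub(SENTINEL, lambda m: stashed.pop(0), decoded)

print(unquote_non_ascii("https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"))
# https://www.google.com/search?q=%26ฉัน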
Another way is to look for UTF-8 sequences. Your URL appears to be encoded in UTF-8, and Wikipedia uses UTF-8. See the Wikipedia entry for UTF-8 for how UTF-8 characters are encoded.
So, when encoded in URLs, each valid non-ASCII UTF-8 character follows one of these patterns:
(%C0-%DF)(%80-%BF)
(%E0-%EF)(%80-%BF)(%80-%BF)
(%F0-%F7)(%80-%BF)(%80-%BF)(%80-%BF)
(%F8-%FB)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)
(%FC-%FD)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)
So you can match these patterns in the URL and unquote each character separately.
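A minimal Python 3 sketch of that idea (the pattern covers the sequences listed above, including the legacy 5- and 6-byte forms):

import re
from urllib.parse import unquote

CONT = r"%[89ABab][0-9A-Fa-f]"  # one continuation byte, %80-%BF
UTF8_SEQ = re.compile(
    r"%[CDcd][0-9A-Fa-f]" + CONT             # 2-byte sequence
    + "|" + r"%[Ee][0-9A-Fa-f]" + CONT * 2   # 3-byte sequence
    + "|" + r"%[Ff][0-7]" + CONT * 3         # 4-byte sequence
    + "|" + r"%[Ff][89ABab]" + CONT * 4      # 5-byte sequence (legacy)
    + "|" + r"%[Ff][CDcd]" + CONT * 5        # 6-byte sequence (legacy)
)

def unquote_utf8_only(url):
    # Decode each matched multi-byte sequence; ASCII escapes like %26 never match.
    return UTF8_SEQ.sub(lambda m: unquote(m.group(0)), url)

print(unquote_utf8_only("https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99"))
# https://www.google.com/search?q=%26ฉัน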
However, remember that not all URLs are encoded in UTF-8.
Some old websites still use other character sets, such as Windows-874 for Thai.
In such cases, "ฉัน" on that particular website is encoded as "%A9%D1%B9" instead of "%E0%B8%89%E0%B8%B1%E0%B8%99". If you decode that with urllib.unquote, you get garbled text like "?ѹ" instead of "ฉัน", and that could break the link.
So you have to be careful and check whether URL decoding breaks the link. Make sure the URL you're decoding is in UTF-8.
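If all you need is that all-or-nothing check, Python 3's urllib.parse.unquote can do it directly via its errors parameter; a sketch:

from urllib.parse import unquote

def unquote_if_utf8(url):
    try:
        # errors="strict" raises if the escaped bytes are not valid UTF-8
        return unquote(url, errors="strict")
    except UnicodeDecodeError:
        return url  # e.g. the Windows-874 "%A9%D1%B9" above stays untouched

print(unquote_if_utf8("%E0%B8%89%E0%B8%B1%E0%B8%99"))  # ฉัน
print(unquote_if_utf8("%A9%D1%B9"))                    # %A9%D1%B9

Note this decodes ASCII escapes such as %26 as well, so combine it with the selective decoding above if those must stay encoded.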

Related

Regex Error - incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)

I'm trying to do something which seems like it should be very simple. I'm trying to see if a specific string, e.g. 'out of stock', is found within a page's source code. However, I don't care if the string is contained within an HTML comment or JavaScript. So prior to doing my search, I'd like to remove both of these elements using regular expressions. This is the code I'm using:
require "http"  # the http gem providing HTTP.get

urls.each do |url|
  response = HTTP.get(url)
  if response.status.success?
    source_code = response.to_s
    # Remove comments
    source_code = source_code.gsub(/<!--(.*?)-->/su, '')
    # Remove scripts
    source_code = source_code.gsub(/<script(.*?)<\/script>/msu, '')
    if source_code.match(/out of stock/i)
      # Flag URL for further processing
    end
  end
end
This works for 99% of all the URLs I tried it with, but certain URLs have become problematic. When I try to use these regular expressions on the source code returned for the URL "https://www.sunski.com", I get the following error message:
Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string))
The page is definitely UTF-8 encoded, so I don't really understand the error message. A few people on Stack Overflow recommended using the # encoding: UTF-8 magic comment at the top of the file, but this didn't work.
If anyone could help with this it would be hugely appreciated. Thank you!
The Net::HTTP standard library only returns binary (ASCII-8BIT) strings. See the long-standing feature request: Feature #2567: Net::HTTP does not handle encoding correctly. So if you want UTF-8 strings you have to manually set their encoding to UTF-8 with String#force_encoding:
source_code.force_encoding(Encoding::UTF_8)
If the website's character encoding isn't UTF-8, you have to implement a heuristic based on the Content-Type header or the charset attribute of a <meta> tag, and even then it might not be the correct encoding. You can validate a string's encoding with String#valid_encoding? if you need to handle such cases. Thankfully, most websites use UTF-8 nowadays.
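The heuristic itself is language-agnostic; a rough Python sketch of the lookup order (header first, then meta tag, then a UTF-8 default):

import re

def detect_charset(content_type, body):
    # 1. the charset parameter of the Content-Type header, if present
    match = re.search(r"charset=([\w-]+)", content_type or "", re.I)
    if match:
        return match.group(1)
    # 2. a charset attribute in a <meta> tag near the top of the raw body bytes
    match = re.search(rb"<meta[^>]*charset=[\"']?([\w-]+)", body[:4096], re.I)
    if match:
        return match.group(1).decode("ascii")
    # 3. default guess; correct for most sites nowadays
    return "utf-8"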
Also, as @WiktorStribiżew already wrote in the comments, the regexp encoding modifiers s (Windows-31J) and u (UTF-8) aren't necessary here, and only very rarely are. Especially the latter, since modern Ruby defaults to UTF-8 (or, if sufficient, its subset US-ASCII) anyway. In other programming languages these modifiers may have a different meaning; in Perl, for example, s means single-line mode.

Pulling bad chars from an xml feed

I need to figure out a way to remove any non-standard ASCII chars from a feed I am getting from a partner's system.
The issue is they are sending a mix of data: HTML formatted text and hard returns, along with bad chars.
I am already using the UDF DeMoronize(text) to pull out any Microsoft Latin-1 "extensions" chars.
The feed is coming over in UTF-8, and we have a tag on the page to ensure we are processing the feed in UTF-8.
Is there a simple way to code this to remove any non-UTF-8 char?
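The question is ColdFusion-flavoured, but the operation is one regex in most languages; for illustration, a Python sketch that keeps the printable ASCII range (plus tab, LF and CR) and drops everything else:

import re

def strip_non_ascii(text):
    # keep tab (\x09), LF (\x0A), CR (\x0D) and printable ASCII (\x20-\x7E)
    return re.sub(r"[^\x09\x0A\x0D\x20-\x7E]", "", text)

print(strip_non_ascii("Caf\u00e9 \u2122 menu"))  # "Caf  menu"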

URL encoding for multibyte character string in c++

I am trying to achieve URL encoding for some of my strings via C++. The strings can contain multibyte characters like ™, ®, ©, etc.
Input text: Something ™
Output should be: Something%20%E2%84%A2
I can achieve URL encoding and decoding in JS with encodeURIComponent and decodeURIComponent, but I have some native code in C++ and hence need to encode the text in C++ as well.
Any help here would be a great relief for me.
It's not too hard to do manually if you can't find a library. First encode the string as UTF-8 (there are other posts on SO about using the standard library to do that if the string is in another encoding), then replace every byte with a value above 127, and every character that's restricted in URLs, with the percent encoding of that byte: a percent sign followed by the two hexadecimal digits representing its value.
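That algorithm, sketched in Python for brevity (the byte-level loop ports directly to C++; the unreserved set is the RFC 3986 one):

UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "abcdefghijklmnopqrstuvwxyz"
                 "0123456789-._~")

def percent_encode(text):
    out = []
    for byte in text.encode("utf-8"):       # step 1: encode as UTF-8
        if chr(byte) in UNRESERVED:         # step 2: unreserved ASCII passes through
            out.append(chr(byte))
        else:                               # step 3: everything else becomes %XX
            out.append("%{:02X}".format(byte))
    return "".join(out)

print(percent_encode("Something ™"))  # Something%20%E2%84%A2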

asp-classic Request.Cookies brings this value "ϑ" for 1 cookie instead of "ÅÙÏ‘‹„‰Š„‹"

This is happening in one cookie with keys in one key only.
The value should be "ÅÙÏ‘‹„‰Š„‹".
Erm, really? That looks like the corrupted, wrong-character set version to me! :-) Either way, “ϑ” is what you get when you save that string in Windows Western European encoding (cp1252) and then read it back in as UTF-8, removing all the ‘invalid character’ codes that result because it's not a valid UTF-8 string. So you've got a classic reading-and-writing-using-different-encodings problem.
As a general rule you can't get away with putting non-ASCII characters in a cookie (name or value) directly. You'll need an application-level encoding mechanism of some sort; one of the most popular ways is to URL-encode the UTF-8 representation of the characters you want, similarly to how JavaScript's encodeURIComponent does it.
(Unfortunately ASP classic has very poor support for handling Unicode.)
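That scheme, sketched in Python (only the escaping idea matters; the cookie plumbing is whatever your framework provides):

from urllib.parse import quote, unquote

value = "ÅÙÏ‘‹„‰Š„‹"
encoded = quote(value, safe="")  # percent-encode the UTF-8 bytes into cookie-safe ASCII
decoded = unquote(encoded)       # reverse it when reading the cookie back
assert decoded == value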
Final solution: saved the file with the "correct" encoding, changing it from "Unicode (UTF-8 with signature) - Codepage 65001" to "Western European (Windows) - Codepage 1252".
We're using encoding on our cookies, and some of the resulting characters can cause problems, so what we did is take the cookie string and encode it in hex. Problem solved.

JSON character encoding - is UTF-8 well-supported by browsers or should I use numeric escape sequences?

I am writing a webservice that uses JSON to represent its resources, and I am a bit stuck thinking about the best way to encode the JSON. Reading the JSON RFC (http://www.ietf.org/rfc/rfc4627.txt), it is clear that the preferred encoding is UTF-8. But the RFC also describes a string escaping mechanism for specifying characters. I assume this would generally be used to escape non-ASCII characters, thereby making the resulting UTF-8 valid ASCII.
So let's say I have a JSON string that contains Unicode characters (code points) that are non-ASCII. Should my webservice just UTF-8 encode that and return it, or should it escape all those non-ASCII characters and return pure ASCII?
I'd like browsers to be able to execute the results using JSONP or eval. Does that affect the decision? My knowledge of various browsers' JavaScript support for UTF-8 is lacking.
EDIT: I wanted to clarify that my main concern about how to encode the results is really about browser handling of the results. What I've read indicates that browsers may be sensitive to the encoding when using JSONP in particular. I haven't found any really good info on the subject, so I'll have to start doing some testing to see what happens. Ideally I'd like to escape only those few characters that are required and just UTF-8 encode the results.
The JSON spec requires UTF-8 support by decoders. As a result, all JSON decoders can handle UTF-8 just as well as they can handle the numeric escape sequences. This is also the case for Javascript interpreters, which means JSONP will handle the UTF-8 encoded JSON as well.
The ability for JSON encoders to use the numeric escape sequences instead just offers you more choice. One reason you may choose the numeric escape sequences would be if a transport mechanism in between your encoder and the intended decoder is not binary-safe.
Another reason you may want to use numeric escape sequences is to prevent certain characters appearing in the stream, such as <, & and ", which may be interpreted as HTML sequences if the JSON code is placed without escaping into HTML or a browser wrongly interprets it as HTML. This can be a defence against HTML injection or cross-site scripting (note: some characters MUST be escaped in JSON, including " and \).
Some frameworks, including PHP's json_encode() (by default), always do the numeric escape sequences on the encoder side for any character outside of ASCII. This is a mostly unnecessary extra step intended for maximum compatibility with limited transport mechanisms and the like. However, this should not be interpreted as an indication that any JSON decoders have a problem with UTF-8.
So, I guess you could just decide which to use like this:
Just use UTF-8, unless any software you are using for storage or transport between encoder and decoder isn't binary-safe.
Otherwise, use the numeric escape sequences.
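Either way, a conforming decoder sees the same string; for example, in Python (json.dumps escapes non-ASCII by default):

import json

print(json.dumps("é"))                       # "\u00e9" (numeric escape, pure ASCII)
print(json.dumps("é", ensure_ascii=False))   # "é" (raw UTF-8)
print(json.loads('"\\u00e9"') == json.loads('"é"'))  # True: both decode identically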
I had a problem there.
When I JSON-encode a string with a character like "é", every browser returns the same "é", except IE, which returns "\u00e9".
Then PHP's json_decode() fails if it finds "é", so for Firefox, Opera, Safari and Chrome I have to call utf8_encode() before json_decode().
Note: in my tests, IE and Firefox are using their native JSON object; the other browsers are using json2.js.
ASCII isn't in it any more. Using UTF-8 encoding means that you aren't using ASCII encoding. What you should use the escaping mechanism for is what the RFC says:
All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)
I was facing the same problem. This works for me; please check it:
json_encode($array, JSON_UNESCAPED_UNICODE);
Reading the JSON RFC (http://www.ietf.org/rfc/rfc4627.txt), it is clear that the preferred encoding is UTF-8.
FYI, RFC 4627 is no longer the official JSON spec. It was obsoleted in 2014 by RFC 7159, which was then obsoleted in 2017 by RFC 8259, which is the current spec.
RFC 8259 states:
8.1. Character Encoding
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].
Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.
Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
I had a similar problem with the é char... I think the comment "it's possible that the text you're feeding it isn't UTF-8" is probably close to the mark here. I have a feeling the default collation in my instance was something else until I realized and changed it to utf8; the problem is the data was already there, so I'm not sure whether it converted the data or not when I changed it. It displays fine in MySQL Workbench. The end result is that PHP will not JSON-encode the data, it just returns false. It doesn't matter which browser you use, as it's the server causing my issue; PHP will not parse the data to UTF-8 if this char is present. Like I say, I'm not sure if it is due to converting the schema to utf8 after the data was present, or just a PHP bug. In this case, use json_encode(utf8_encode($string));