Qt QUrl: doesn't url-encode colon

Qt QUrl: doesn't url-encode colon - c++

I have URL: https://example.com/hello?param=first:last. I expect that it should be percent-encoded as https://example.com/hello?param=first%3Alast. But Qt leaves it as-is. My code:
QUrl url("https://example.com/hello?param=first:last");
printf("Encoded: %s\n", url.toEncoded().constData());
How should I encode colon? Manually format parameter with QString::toPercentEncoding?

You must replace colons because of security issues.
More information: http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
You can use percent encoding (":" -> "%3A") for colons, see http://qt-project.org/doc/qt-5.0/qtcore/qurl.html#fromPercentEncoding and http://qt-project.org/doc/qt-5.0/qtcore/qurl.html#toPercentEncoding.

There has been some discussion regarding colon safety in URLs. It sounds like it comes down to the RFC adherence which I'm not versed in.
Is a colon safe for friendly-URL use?
Looks like you may have to replace any ":" characters (after https:) yourself.

I think QUrl::toPercentEncoding() is your best bet. It encodes all standard characters by default, and you can manually specify your own list of additional characters to encode:
Reference URL:
http://doc.qt.digia.com/4.7/qurl.html#toPercentEncoding

Pretty sure you want QUrl::setEncodedURL and QUrl::toEncoded
http://harmattan-dev.nokia.com/docs/platform-api-reference/xml/daily-docs/libqt4/qurl.html
Which version of Qt are you using?
IIRC Qt 3 used QUrl:encode

Related

Using utf-8 characters in log4cxx

I need to be able to use utf-8-encoded strings with log4cxx. I can print the strings just fine with std::cout (the characters are displayed correctly). Using log4cxx, i.e. putting the strings into the LOG4CXX_DEBUG() macro with a ConsoleAppender will output "??" instead of the special character. I found one solution:
LOG4CXX_DECODE_CHAR(logstring, str);
LOG4CXX_DEBUG(logstring);
where str is my input string, but this does not work. Anyone have an idea how this might work? I google'd around a bit, but I couldn't find anything useful.

You can use
setlocale(LC_CTYPE, "UTF-8");
to set only the character encoding, without changing any other information about the locale.

I met the same problem and searched and searched. I found this post, It may work, but I don't like the setlocaleish solution. so i made more research, finally the solution came out.
I reconfigure log4cxx and build it, the problem was solved!
add two more configure options in log4cxx:
./configure --prefx=blabla --with-apr=blabla --with-apr-util=blabla --with-charset=utf-8 --with-logchar=utf-8
hope this will help anyone who need it.

One solution is to use
setlocale(LC_ALL, "en_US.UTF-8");
in my main function. This is OK for me, but if you want more localizable applications, this will probably become hard to track/use.

The first answer didn't work for me, the second one is more than i want. So I combined the two answers:
setlocale(LC_CTYPE, "xx_XX.UTF-8"); // or "xx_XX.utf8", it means the same
where xx_XX is some language tag. I tried to log strings in many languages with different alphabets (on LINUX, including Chinese, language left-to-right and rigth-to-left); so I tried:
setlocale(LC_CTYPE, "it_IT.UTF-8");
and it worked with any tested language. I cannot understand why the simple "UTF-8" without indicating a language xx_XX doesn't work, since i use UTF8 to be language-independent and one shouldn't indicate one. (If somebody know the reason also for that, would be an interesting improvement to the answer). Maybe this also depends by Operatin System.
Finally, on Linux you can get a list of the encodings by typing on shell:
# locale -a | grep utf

camelCase to underscore in vi(m)

If for some reason I want to selectively convert camelCase named things to being underscore separated in vim, how could I go about doing so?
Currently I've found that I can do a search /s[a-z][A-Z] and record a macro to add an underscore and convert to lower case, but I'm curious as to if I can do it with something like :
%s/([a-z])([A-Z])/\1\u\2/gc
Thanks in advance!
EDIT: I figured out the answer for camelCase (which is what I really needed), but can someone else answer how to change CamelCase to camel_case?

You might want to try out the Abolish plugin by Tim Pope. It provides a few shortcuts to coerce from one style to another. For example, starting with:
MixedCase
Typing crc [mnemonic: CoeRce to Camelcase] would give you:
mixedCase
Typing crs [mnemonic: CoeRce to Snake_case] would give you:
mixed_case
And typing crm [mnemonic: CoeRce to MixedCase] would take you back to:
MixedCase
If you also install repeat.vim, then you can repeat the coercion commands by pressing the dot key.

This is a bit long, but seems to do the job:
:%s/\<\u\|\l\u/\= join(split(tolower(submatch(0)), '\zs'), '_')/gc

I suppose I should have just kept trying for about 5 more minutes. Well... if anyone is curious:
%s/\(\l\)\(\u\)/\1\_\l\2/gc does the trick.
Actually, I realized this works for camelCase, but not CamelCase, which could also be useful for someone.

I whipped up a plugin that does this.
https://github.com/chiedojohn/vim-case-convert
To convert the case, select a block of text in visual mode and the enter one of the following (Self explanatory) :
:CamelToHyphen
:CamelToSnake
:HyphenToCamel
:HyphenToSnake
:SnakeToCamel
:SnakeToHyphen
To convert all occerences in your document then run one of the following commands:
:CamelToHyphenAll
:CamelToSnakeAll
:HyphenToCamelAll
:HyphenToSnakeAll
:SnakeToCamelAll
:SnakeToHyphen
Add a bang (eg. :CamelToHyphen!) to any of the above command to bypass the prompts before each conversion.
You may not want to do that though as the plugin wouldn't know the different between variables or other text in your file.

For the CamelCase case:%s#(\<\u\|\l)(\l+)(\u)#\l\1\2_\l\3#gc
Tip: the regex delimiters can be altered as in my example to make it (somewhat) more legible.

I have an API for various development oriented processing. Among other things, it provides a few functions for transforming names between (configurable) conventions (variable <-> attribute <-> getter <-> setter <-> constant <-> parameter <-> ...) and styles (camelcase (low/high) <-> underscores). These conversion functions have been wrapped into a plugin.
The plugin + API can be fetch from here: https://github.com/LucHermitte/lh-dev, for this names conversion task, it requires lh-vim-lib
It can be used the following way:
put the cursor on the symbol you want to rename
type :NameConvert + the type of conversion you wish (here : underscore). NB: this command supports auto-completion.
et voilà!

asp-classic Request.Cookies brings this value "ϑ" for 1 cookie instead of "ÅÙÏ‘‹„‰Š„‹"

This is happening in one cookie with keys in one key only.
The value should be "ÅÙÏ‘‹„‰Š„‹".

The value should be "ÅÙÏ‘‹„‰Š„‹".
Erm, really? That looks like the corrupted, wrong-character set version to me! :-) Either way, “ϑ” is what you get when you save that string in Windows Western European encoding (cp1252) and then read it back in as UTF-8, removing all the ‘invalid character’ codes that result because it's not a valid UTF-8 string. So you've got a classic reading-and-writing-using-different-encodings problem.
As a general rule you can't get away with putting non-ASCII characters in a cookie (name or value) directly. You'll need an application-level encoding mechanism of some sort; one of the most popular ways is to URL-encode the UTF-8 representation of the characters you want, similarly to how JavaScript's encodeURIComponent does it.
(Unfortunately ASP classic has very poor support for handling Unicode.)

Final Solution:
Save As different file with "correct" encoding
Changed encoding
From "Unicode (UTF-8 with signature) -Codepage 65001"
To "Western European (Windows) - Codepage 1252"

We're using encoding on our cookies and some of the resulting characters can cause problems. So what we did is take the cookie string and encode it in HEX. - Problems Solved.

Detecting Characters in an XSLT

I have encountered some odd characters that do not display properly in Internet Explorer, such as these: â€œ, â€“, and â€™. I think they're carried over from copy-and-paste Word content.
I am using XSLT to build the page content and it would be great to detect these characters in the XSLT and replace them with valid HTML codes. I already do string replacement in the style sheet, but I'm not sure how detect these encoded characters or whether it's possible.

What about simply changing the encoding for the Stylesheet as well as its output to UTF-8? The characters you mention are “, – and ’. Certainly not invalid or so, given the correct encoding (the characters are at least perfectly valid in Codepage 1252).

Using a good XML editor such as XMLSpy should highlight any errors in formatting your XSLT by validating at development time.

Jeni Tennison's Multiple string replacements may be a good starting point.

RegEx to parse or validate Base64 data

Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.
I have a Base64 decoder that can not fully rely on the input data to follow the RFC specs. So, the issues I face are issues like perhaps Base64 data that may not be broken up into 78 (I think it's 78, I'd have to double check the RFC, so don't ding me if the exact number is wrong) character lines, or that the lines may not end in CRLF; in that it may have only a CR, or LF, or maybe neither.
So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.
Content-Transfer-Encoding: base64
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
Ok, so parsing that is no problem, and is exactly the result we would expect. And in 99% of the cases, using any code to at least verify that each char in the buffer is a valid base64 char, works perfectly. But, the next example throws a wrench into the mix.
Content-Transfer-Encoding: base64
http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
This a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers desire to parse mime at all costs, versus ones that go strictly by the book, or rather RFC; if you will.
My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!
[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8
Anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?
UPDATE:
Do to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But to anyone using C#, and looking for a very quick way to at least detect whether a string, or byte[] contains valid Base64 data or not, I've found the following to work very well for me.
[^-A-Za-z0-9+/=]|=[^=]|={3,}$
And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above RegEx. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML Encoding, etc.), then it is highly recommend that you read RFC4648 that Gumbo mentioned in his answer as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.

From the RFC 4648:
Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.
So it depends on the purpose of usage of the encoded data if the data should be considered as dangerous.
But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
This one is good, but will match an empty String
This one does not match empty string :
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$

The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://www.rfc-editor.org/rfc/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.
Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$
The cited RFC considers the empty string as valid (see https://www.rfc-editor.org/rfc/rfc4648#section-10) therefore the above regex also does.
The equivalent regular expression for base64url (again, refer to the above RFC) is:
^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$

Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like
my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;
say decode_base64($sanitized_str);
might be what you want. It produces
This is simple ASCII Base64 for StackOverflow exmaple.

The best regexp which I could find up till now is in here
https://www.npmjs.com/package/base64-regex
which is in the current version looks like:
module.exports = function (opts) {
opts = opts || {};
var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';
return opts.exact ? new RegExp('(?:^' + regex + '$)') :
new RegExp('(?:^|\\s)' + regex, 'g');
};

Here's an alternative regular expression:
^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$
It satisfies the following conditions:
The string length must be a multiple of four - (?=^(.{4})*$)
The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
It can have up to two padding (=) characters on the end - ={0,2}
It accepts empty strings

To validate base64 image we can use this regex
/^data:image/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}
private validBase64Image(base64Image: string): boolean {
const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/;
return base64Image && regex.test(base64Image);
}

The shortest regex to check RFC-4648 compiliance enforcing canonical encoding (i.e. all pad bits set to 0):
^(?=(.{4})*$)[A-Za-z0-9+/]*([AQgw]==|[AEIMQUYcgkosw048]=)?$
Actually this is the mix of this and that answers.

I found a solution that works very well
^(?:([a-z0-9A-Z+\/]){4})*(?1)(?:(?1)==|(?1){2}=|(?1){3})$
It will match the following strings
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
YW55IGNhcm5hbCBwbGVhcw==
YW55IGNhcm5hbCBwbGVhc3U=
YW55IGNhcm5hbCBwbGVhc3Vy
while it won't match any of those invalid
YW5#IGNhcm5hbCBwbGVhcw==
YW55IGNhc=5hbCBwbGVhcw==
YW55%%%%IGNhcm5hbCBwbGVhc3V
YW55IGNhcm5hbCBwbGVhc3
YW55IGNhcm5hbCBwbGVhc
YW***55IGNhcm5hbCBwbGVh=
YW55IGNhcm5hbCBwbGVhc==
YW55IGNhcm5hbCBwbGVhc===

My simplified version of Base64 regex:
^[A-Za-z0-9+/]*={0,2}$
Simplification is that it doesn't check that its length is a multiple of 4. If you need that - use other answers. Mine is focusing on simplicity.
To test it: https://regex101.com/r/zdtGSH/1

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js