Are these characters safe to use in HTML, Postgres, and Bash? - regex

I have a project where I'm trying to enable other, possibly hostile, coders to label, in lowercase various properties that will be displayed in differing contexts, including embed in HTML, saved and manipulated in Postgres, used as attribute labels in JavaScript, and manipulated in the shell (say, saving a data file as продажи.zip) as well as various data analysis tools like graph-tool, etc.
I've worked on multilingual projects before, but they were either smaller customers that didn't need to especially worry about sophisticated attacks or they were projects that I came to after the multilingual aspect was in place, so I wasn't the one responsible for verifying security.
I'm pretty sure these should be safe, but I don't know if there are gotchas I need to look out for, like, say, a special [TAB] or [QUOTE] character in the Chinese character set that might escape my escaping.
Am I ok with these in my regex filter?
dash = '-'
english = 'a-z'
italian = ''
russain = 'а-я'
ukrainian = 'ґї'
german = 'äöüß'
spanish = 'ñ'
french = 'çéâêîôûàèùëï'
portuguese = 'ãõ'
polish = 'ąćęłńóśźż'
turkish = 'ğışç'
dutch = 'áíúýÿìò'
swedish = 'å'
danish = 'æø'
norwegian = ''
estonian = ''
romainian = 'șî'
greek = 'α-ωίϊΐόάέύϋΰήώ'
chinese = '([\p{Han}]+)'
japanese = '([\p{Hiragana}\p{Katakana}]+)'
korean = '([\p{Hangul}]+)'

If you restrict yourself to text encodings with a 7-bit ASCII compatible subset, you're reasonably safe treating anything above 0x7f (U+007f) as "safe" when interacting with most saneish programming languages and tools. If you use perl6 you're out of luck ;)
You should avoid supporting or take special care with input or output of text using the text encoding Shift-JIS, where the ¥ symbol is at 0x5c where \ would usually reside. This offers opportunities for nefarious trickery by exploiting encoding conversions.
Avoid or take extra care with other non-ascii-compatible encodings too. EBDIC is one, but you're unlikely to ever meet it in the wild. UTF-16 and UTF-32 obviously, but if you misprocess them the results are glaringly obvious.
Reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Personally I think your approach is backwards. You should define input and output functions to escape and unescape strings according to the lexical syntaxes of each target tool or language, rather than trying to prohibit any possible metacharacter. But then I don't know your situation, and maybe it's just impractical for what you're doing.

I'm not quite sure what your actual issue is. If you correctly convert your text to the target format, then you don't care what the text could possibly be. This will ensure both proper conversion AND security.
For instance:
If your text is to be included in HTML, it should be escaped using appropriate HTML quoting functions.
Example:
Wrong
// XXX DON'T DO THIS XXX
echo "<span>".$variable."</span>"
Right:
// Actual encoding function varies based your environment
echo "<span>".htmlspecialchars($variable)."</span>"
Yes, this will also handle properly the case of text containing & or <.
If your text is to be used in an SQL query, you should use parameterised queries.
Example:
Wrong
// XXX DON'T DO THIS XXX
perform_sql_query("SELECT this FROM that WHERE thing=".$variable")
Right
// Actual syntax and function will vary
perform_sql_query("SELECT this FROM that WHERE thing=?", [$variable]);
If you text is to be included in JSON, just use appropriate JSON-encoding functions.
Example:
Wrong
// XXX DON'T DO THIS XXX
echo '{"this":"'.$variable.'"}'
Right
// actual syntax and function may vary
echo json_encode({this: $variable});
The shell is a bit more tricky, and it's often a pain to deal with non-ASCII characters in many environments (e.g. FTP or doing an scp between different environments). So don't use explicit names for files, use identifiers (numeric id, uuid, hash...) and store the mapping to the actual name somewhere else (in a database).

Related

hyphen character and apostrophe character - the same ASCII code in different languages?

I need to specify a regex for validation of user input that allows the user to enter a hyphen character or apostrophe character on Windows Desktop operating systems or Mac OS/X desktop operating systems.
The user may have configured for the following languages:
English
French
Spanish
Portuguese
Hawaiian
I wan't to understand if I use a standard ASCII regex for hyphen and apostophe (e.g. ['-]) whether that will catch the hyphen or apostrophe keys typed by the user in most cases. I appreciate my definition is quite loose as there are many different keyboard layouts, OS versions, and language definitions (e.g. fr_FR, ca_FR).
I have checked the following resources and generally searched on google, but could not find anything in particular about saying that the ASCII code generated by a hyphen key or apostrophe key will always be ASCII code 45 and ASCII code 39 respectively.
http://en.wikipedia.org/wiki/Keyboard_layout
http://en.wikipedia.org/wiki/Hyphen
http://en.wikipedia.org/wiki/Apostrophe
NOTE: If you feel this question is badly worded, please add a comment to help me improve it.
You're mixing up a couple of things:
keyboard layout is what determines what value get assigned to a scancode.
localization settings determine in what language you should address the user, and wether the user expects a decimal point or comma.
character encoding is how a glyph is encoded into the bits memory and, in reverse, how to decode bits into glyphs
If you're validating user input, you shouldn't be interested in scancodes. A DVORAK layout user on a QWERTY keyboard will be pressing the Q key to input an '. And you shouldn't mess with that. So you have no business dealing with keyboard layouts.
The existence of this keyboard, should remind you, that what keys do is not your head-ache, but up to the user.
The localization settings will matter to you, but not for your regex. They will, however, tell you in what language you should put your error message, in case the user input is invalid. A good coding practice is to use a library like gettext to manage this.
What matters most, when you are validating input. Is just those 2 things: what is valid and what is the input.
You (or your domain expert) decide what is valid. Wether a hyphen-minus is just as acceptable as a hyphen or n-dash.
The input will be in encoded; computers work with bits, not strings of glyphs. It could be ASCII, but I'd steer towards unicode if I could help it.
As for your real concern, if I may rephrase it: "Can all users easily enter ' and -?". I guess they probably can. Many important programming languages use those glyphs to resp. denote strings and as a subtraction operator. And if your application needs to (dis)allow certain glyphs you can put unicode code points or categories in your regex.

Is there a set of "Lorem ipsums" files for testing character encoding issues?

For layouting we have our famous "Lorem ipsum" text to test how it looks like.
What I am looking for is a set of files containing Text encoded with several different encodings that I can use in my JUnit tests to test some methods that are dealing with character encoding when reading text files.
Example:
Having a ISO 8859-1 encoded test-file and a Windows-1252 encoded test-file. The Windows-1252 have to trigger the differences in region 8016 – 9F16. In other words it must contain at least one character of this region to distinguish it from ISO 8859-1.
Maybe the best set of test-files is that where the test-file for each encoding contains all its characters once. But maybe I am not aware of sth - we all like this encoding stuff, right? :-)
Is there such a set of test-files for character-encoding issues out there?
The Wikipedia article on diacritics is pretty comprehensive, unfortunately you have to extract these characters manually. Also there might exist some mnemonics for each language. For instance in Polish we use:
Zażółć gęślą jaźń
which contains all 9 Polish diacritics in one correct sentence. Another useful search hint are pangrams: sentences using every letter of the alphabet at least once:
in Spanish, "El veloz murciélago hindú comía feliz cardillo y kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja." (all 27 letters and diacritics).
in Russian, "Съешь же ещё этих мягких французских булок, да выпей чаю" (all 33 Russian Cyrillic alphabet letters).
List of pangrams contains an exhaustive summary. Anyone care to wrap this in a simple:
public interface NationalCharacters {
String spanish();
String russian();
//...
}
library?
How about trying to use the ICU test suite files? I don't know if they are what you need for your test, but they seem to have pretty complete from/to UTF mapping files at least: Link to the repo for ICU test files
I don't know of any complete text documents, but if you can start with a simple overview of all character sets there are some files available at the ftp.unicode.org server
Here's WINDOWS-1252 for example. The first column is the hexadecimal character value, and the second the unicode value.
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1250.TXT
There are a few ready-to-use comprehensive unicode setups straight-forward downloadable.
From w3c
Here, there's a nice test file by w3.org including: maths, linguistics, Greek, Georgian, Russian, Thai, Runes, Braille among many others in a single file:
https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html
Coming from w3.org should be nice to use, shouldn't it?
Cutting out the HTML part
If you want to get the "original txt file" without risk of your editor messing it up, 1) download, 2) tail+head it, 3) Check with a diff:
wget https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html
tail +8 UTF-8-demo.html | head -n -3 > UTF-8-demo.txt
diff UTF-8-demo.html UTF-8-demo.txt
This generates a UTF-8-demo.txt without human intervention and without risk of loosing data.
More from w3c
There are many more files one level up in the directory structure, still inside the dir utf-8-test:
https://www.w3.org/2001/06/utf-8-test/
From github
There's a very interesting file here too with ALL printable chars (including Chinese, Braille, Arab, etc.)
https://raw.githubusercontent.com/bits/UTF-8-Unicode-Test-Documents/master/UTF-8_sequence_separated/utf8_sequence_0-0x10ffff_assigned_printable.txt
Want also non printable characters?
There are also many more test files in the same repo:
https://github.com/bits/UTF-8-Unicode-Test-Documents
and also a generator if you don't trust the committed file and you want to generate it by yourself.
My personal choice
I have decided that for my projects I'll start with 2 files: The specific one I pointed out from w3c and the specific one I pointed out from the github repo by bits.
Well, I had used an online tool to create my text char sets from Lorem Ipsum. I believe it can help you. I dont have one which has all the different charsets in a single page.
http://generator.lorem-ipsum.info/

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for utf-8 mode in regexes,
You're better of on a whole with doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name & search and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds on insane modifiers for internationalized text in regexps.

How can I make a case-insensitive regexp match for Russian letters?

I have list of catalog paths and need to filter out some of them. My match pattern is in a non-Unicode encoding.
I tried the following:
require 5.004;
use POSIX qw(locale_h);
my $old_locale = setlocale(LC_ALL);
setlocale(LC_ALL, "ru_RU.cp1251");
#{$data -> {doc_folder_rights}} =
grep {
# catalog path pattern in $_REQUEST{q}
$_->{doc_folder} =~/$_REQUEST{q}/i;
}
#{$data -> {doc_folder_rights}};
setlocale(LC_ALL, $old_locale);
What I need is case-insensitive regexp pattern matching when pattern contains russsian letters.
There are several (potential) issues with your code:
Your code filters out all doc_folders that do not match the regexp in $_REQUEST{q}, however the question suggests that you want to do the opposite.
You might have an encoding issue. Setting the locale (using setlocale) changes the perl's handling of upper- & lower-case-conversions, but it does not change any encoding. You need to assure that $_REQUEST{q} is interpreted correctly.
For simplicity you can assume that any Perl-string contains Unicode-data in some internal representation that you need not know about in detail. Only when Perl does I/O there is an implicit or explicit conversion. When reading from stdin, ARGV or environment, Perl assumes that the bytes are encoded using the current locale and implicitly converts.
If you have an encoding issue, there are several ways to fix it:
Fix the environment in which Perl runs so that it knows about the correct locale from the very start. That will fix the implicit conversion.
In the unlikely case that $_REQUEST is loaded from a filehandle, you could explicitly tell Perl to convert using binmode($fh, ":encoding(cp1251)");. Do that prior the reading $_REQUEST.
There is the $string = Encode::decode(Encoding, $octets) function that tells Perl to forget its assumption about the encoding of $octets and instead treat the contents of $octets as byte-stream that needs to be converted to Unicode using Encoding. You need to do that before touching the contents of $octets, or strange things may happen.
Since $_REQUEST was probably loaded by some cgi-module, and was probably url-encoded in transit, you could just tell the cgi-module how to correctly do the decoding.

How can I detect Russian spam posts with Perl?

I have an English language forum site written in perl that is continually bombarded with spam in Russian. Is there a way using Perl and regex to detect Russian text so I can block it?
You can use the following to detect Cyrillic characters (used in Russian):
[\u0400-\u04FF]+
If you really just want Russian characters, you can take a look at the aforesaid document, which contains the exact range used for the Basic Russian alphabet which is [\u0410-\u044F]. Of course you'd also need to consider extension Cyrillic characters that are used exclusively in Russian -- also mentioned in the document.
using the unicode cyrillic charset as suggested by JG is fine if everything is encoded as such. however, this is spam and for the most part, things are not. additionally, spammers will very often use a mix of charsets in spams which further screws up this approach.
i find that the best way (or at least the preliminary step in the process) of detecting russian spam is to grep for the most commonly used charsets:
koi8-r
windows-1251
iso-8859-5
next step after that would be to try some language detection algorithms on what remains. if it's a big enough problem, use a paid service such as google translate (which also "detects") or xerox. these services provide IMO the best language detection around.