I need to specify a regex for validating user input that allows the user to enter a hyphen character or an apostrophe character, on Windows desktop operating systems or Mac OS X desktop operating systems.
The user may have configured for the following languages:
English
French
Spanish
Portuguese
Hawaiian
I want to understand whether, if I use a standard ASCII regex for hyphen and apostrophe (e.g. ['-]), that will catch the hyphen or apostrophe keys typed by the user in most cases. I appreciate my definition is quite loose, as there are many different keyboard layouts, OS versions, and language definitions (e.g. fr_FR, fr_CA).
I have checked the following resources and searched generally on Google, but could not find anything stating that the code generated by a hyphen key or an apostrophe key will always be ASCII code 45 or ASCII code 39 respectively.
http://en.wikipedia.org/wiki/Keyboard_layout
http://en.wikipedia.org/wiki/Hyphen
http://en.wikipedia.org/wiki/Apostrophe
NOTE: If you feel this question is badly worded, please add a comment to help me improve it.
You're mixing up a couple of things:
keyboard layout is what determines what value gets assigned to a scancode.
localization settings determine in what language you should address the user, and whether the user expects a decimal point or a decimal comma.
character encoding is how a glyph is encoded into bits in memory and, in reverse, how bits are decoded back into glyphs.
If you're validating user input, you shouldn't be interested in scancodes. A Dvorak-layout user typing on a QWERTY keyboard will be pressing the Q key to input an '. You shouldn't mess with that, so you have no business dealing with keyboard layouts.
The existence of this keyboard should remind you that what keys do is not your headache; it is up to the user.
The localization settings will matter to you, but not for your regex. They will, however, tell you in what language you should put your error message in case the user input is invalid. A good coding practice is to use a library like gettext to manage this.
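For instance, a minimal sketch in PHP (the 'myapp' domain and ./locale path are placeholders; it assumes the gettext extension and compiled .mo catalogues):

// Sketch: localize the error message, not the validation rule.
setlocale(LC_ALL, 'fr_FR.UTF-8');
bindtextdomain('myapp', './locale');
textdomain('myapp');
echo _("Only letters, hyphens and apostrophes are allowed."), "\n";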
What matters most when you are validating input is just these two things: what is valid, and what the input is.
You (or your domain expert) decide what is valid: whether a hyphen-minus is just as acceptable as a hyphen or an en dash.
The input will be encoded; computers work with bits, not strings of glyphs. It could be ASCII, but I'd steer towards Unicode if I could help it.
As for your real concern, if I may rephrase it: "Can all users easily enter ' and -?" I guess they probably can. Many important programming languages use those glyphs to denote strings and the subtraction operator, respectively. And if your application needs to (dis)allow certain glyphs, you can put Unicode code points or categories in your regex.
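For instance, a sketch in PHP (PCRE); which code points count as acceptable is an assumption your domain expert would need to confirm:

// Sketch: letters plus a few apostrophe/hyphen variants; adjust to taste.
// U+2019 right single quotation mark, U+2010 hyphen, U+2011 non-breaking hyphen;
// the trailing - is the plain ASCII hyphen-minus.
$pattern = '/^[\p{L}\'\x{2019}\x{2010}\x{2011} -]+$/u';

var_dump(preg_match($pattern, "O'Brien"));        // int(1)
var_dump(preg_match($pattern, "Jean-Luc"));       // int(1)
var_dump(preg_match($pattern, "O\u{2019}Brien")); // int(1), typographic apostrophe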
Related
I have a project where I'm trying to enable other, possibly hostile, coders to label, in lowercase, various properties that will be displayed in differing contexts: embedded in HTML, saved and manipulated in Postgres, used as attribute labels in JavaScript, and manipulated in the shell (say, saving a data file as продажи.zip), as well as in various data analysis tools like graph-tool.
I've worked on multilingual projects before, but they were either smaller customers that didn't need to especially worry about sophisticated attacks or they were projects that I came to after the multilingual aspect was in place, so I wasn't the one responsible for verifying security.
I'm pretty sure these should be safe, but I don't know if there are gotchas I need to look out for, like, say, a special [TAB] or [QUOTE] character in the Chinese character set that might escape my escaping.
Am I ok with these in my regex filter?
dash = '-'
english = 'a-z'
italian = ''
russian = 'а-я'
ukrainian = 'ґї'
german = 'äöüß'
spanish = 'ñ'
french = 'çéâêîôûàèùëï'
portuguese = 'ãõ'
polish = 'ąćęłńóśźż'
turkish = 'ğışç'
dutch = 'áíúýÿìò'
swedish = 'å'
danish = 'æø'
norwegian = ''
estonian = ''
romanian = 'șî'
greek = 'α-ωίϊΐόάέύϋΰήώ'
chinese = '([\p{Han}]+)'
japanese = '([\p{Hiragana}\p{Katakana}]+)'
korean = '([\p{Hangul}]+)'
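For reference, here is roughly how I intend to combine these into one whitelist (a sketch; the /u modifier matters so the ranges are read as code points rather than bytes):

// Sketch: concatenate the per-language lists into a single character class.
$letters = 'a-z' . 'а-я' . 'ґї' . 'äöüß' . 'ñ' . 'çéâêîôûàèùëï'; // ...and so on
$pattern = '/^[' . $letters . '\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}-]+$/u';

var_dump(preg_match($pattern, 'продажи'));  // int(1)
var_dump(preg_match($pattern, 'rm -rf /')); // int(0): space and / are rejected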
If you restrict yourself to text encodings with a 7-bit ASCII-compatible subset, you're reasonably safe treating anything above 0x7F (U+007F) as "safe" when interacting with most sane-ish programming languages and tools. If you use Perl 6 you're out of luck ;)
You should avoid supporting, or take special care with, input or output of text in the Shift-JIS encoding, where the ¥ symbol sits at 0x5C, where \ would usually reside. This offers opportunities for nefarious trickery by exploiting encoding conversions.
Avoid, or take extra care with, other non-ASCII-compatible encodings too. EBCDIC is one, but you're unlikely to ever meet it in the wild. UTF-16 and UTF-32 obviously, but if you misprocess those the results are glaringly obvious.
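On that note, it's worth rejecting anything that isn't well-formed UTF-8 before any whitelist runs; a sketch assuming PHP's mbstring extension (the 'label' field name is made up):

// Sketch: refuse malformed byte sequences up front.
$input = $_POST['label'] ?? '';
if (!mb_check_encoding($input, 'UTF-8')) {
    exit('invalid encoding'); // could be Shift-JIS trickery or plain garbage
}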
Reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Personally I think your approach is backwards. You should define input and output functions that escape and unescape strings according to the lexical syntax of each target tool or language, rather than trying to prohibit every possible metacharacter. But then I don't know your situation; maybe that's just impractical for what you're doing.
I'm not quite sure what your actual issue is. If you correctly convert your text to the target format, then you don't care what the text could possibly be. This will ensure both proper conversion AND security.
For instance:
If your text is to be included in HTML, it should be escaped using appropriate HTML quoting functions.
Example:
Wrong
// XXX DON'T DO THIS XXX
echo "<span>".$variable."</span>"
Right
// Actual encoding function varies based your environment
echo "<span>".htmlspecialchars($variable)."</span>"
Yes, this will also handle properly the case of text containing & or <.
If your text is to be used in an SQL query, you should use parameterised queries.
Example:
Wrong
// XXX DON'T DO THIS XXX
perform_sql_query("SELECT this FROM that WHERE thing=".$variable")
Right
// Actual syntax and function will vary
perform_sql_query("SELECT this FROM that WHERE thing=?", [$variable]);
If your text is to be included in JSON, just use appropriate JSON-encoding functions.
Example:
Wrong
// XXX DON'T DO THIS XXX
echo '{"this":"'.$variable.'"}'
Right
// actual syntax and function may vary
echo json_encode(["this" => $variable]);
The shell is a bit more tricky, and it's often a pain to deal with non-ASCII characters in many environments (e.g. FTP, or doing an scp between different systems). So don't use explicit names for files: use identifiers (a numeric id, uuid, hash...) and store the mapping to the actual name somewhere else (in a database).
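A sketch of that last idea in PHP (the upload field, path, and lookup table are all made up):

// Sketch: never let user-supplied text reach the filesystem.
$id = bin2hex(random_bytes(16)); // opaque identifier, e.g. "9f86d081..."
move_uploaded_file($_FILES['f']['tmp_name'], "/srv/uploads/$id.zip");
// store ($id, 'продажи.zip') in a database row and look the real name up on download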
I am trying to print out the following string using std::cout:
"Encryptor –pid1 0x34f –pid2"
The '–' characters appear as u's with a circumflex above them (I'm not sure how to type this).
How do I print out the hyphen as intended?
That was not a hyphen.
It was a "n-dash", which will render differently across consoles based on encoding settings.
The hyphen key is usually on the number row of your keyboard, on Western layouts.
Make sure your terminal's idea of the character encoding matches that of your source code. How to do this, of course, depends on your operating system, which terminal emulator (assuming it's an emulator at all) you're using, and so on, neither of which you state.
Also, that's not a hyphen in your example; it's too long. It's probably an "en dash".
I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try to locate "Māori" I've been trying various regexes like:
preg_match('/m(.{1})ori/i', $page_title)
This also returns page titles containing "Moorings" but not "Māori". How do preg_match/preg_replace see characters like "ā", and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
You're better off, on the whole, doing an iconv('utf-8','ascii//TRANSLIT',$string) on both the name and the search term and comparing those.
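Something like this (a sketch; what //TRANSLIT actually produces depends on the underlying iconv implementation, so test it on your platform):

// Sketch: fold both sides to plain ASCII, then do a case-insensitive compare.
function ascii_fold($s) {
    return iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $s);
}

$page_title = 'Māori History';
if (stripos(ascii_fold($page_title), ascii_fold('maori')) !== false) {
    echo "Suggest: $page_title\n"; // matches once ā has folded to a
}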
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside ASCII. I don't know if the string $page_title is being treated as a Unicode object or as a dumb byte string. If it's the byte-string option, you're going to have to use two dots there to catch it instead, or .{1,4}. And even then you're going to have to verify that the up-to-four bytes you grab between the M and the o form a single valid UTF-8 character. This is all moot if PHP does Unicode right; I haven't used it in years, so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways: as a single character (U+0101), or as TWO Unicode characters ('a' plus the combining macron, U+0304, from the U+0300 block of combining diacritics). You're likely only ever going to get the former, but be aware that the latter is also possible.
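If you do run into the decomposed form, normalising both sides first makes the two representations compare equal; a sketch using PHP's intl extension:

// Sketch: U+0101 (precomposed ā) versus 'a' + U+0304 (combining macron).
$precomposed = "M\u{0101}ori";
$decomposed  = "Ma\u{0304}ori";
var_dump($precomposed === $decomposed); // bool(false): different byte sequences
var_dump(Normalizer::normalize($precomposed, Normalizer::FORM_C)
     === Normalizer::normalize($decomposed, Normalizer::FORM_C)); // bool(true)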
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexes.
I've searched around and I can't seem to find a way to represent the arrow keys or the escape key as a single char in C++. Is this even possible? I would expect it to be similar to \t or \n for tab and newline respectively. Whenever I search for escape sequences, there's only ever a list of five or six well-known characters.
The short answer is no.
The long answer is that there are a number of control characters in the standard ASCII character set (decimal 1 to decimal 31, inclusive), among which are the control codes for linefeed, carriage return, end-of-file, and so on. A few are commonly interpreted as arrows and the escape key, but only for compatibility with terminals.
Standard PC keyboards send a 2- or 3-byte control code that represents the key that was pressed, what state it's in, which Ctrl/Alt/Shift keys are pressed, and a few other things. You'll want to look up "key codes" to see how to handle them. Handling them differs between operating systems and the base libraries you use, and their meaning differs based on the operating system's configured keyboard layout (which may include characters not found in the ASCII character set).
Not possible; keyboards built for some languages have characters that can't be represented in a char, and anyway, how do you represent control-option-command-shift-F11 in a char?
Keyboards send scancodes, which arrive either as some kind of event in a GUI system or as a short string of bytes representing the key. Which codes you get depends on your system, but on most terminal-like systems, ncurses knows how to deal with them.
char variables usually represent elements in the ASCII table.
http://www.asciitable.com/
There is also man ascii on Unix. If you want arrow keys you'll need a more direct way to access keyboard input: the arrow keys get translated into sequences of characters before hitting stdio. If you want direct keyboard access, consider a GUI library, SDL, or direct input APIs, to name a few.
There aren't any escape characters for the arrow keys; they are represented as keycodes, AFAIK. I suggest using a higher-level input library to detect key presses. If you want to do it from scratch, the approach taken will vary with the specific platform you are programming for.
In any case, back in the days of Turbo C, I used something like this:
int ch = getch();   // wait for a keypress and store the input in ch (conio.h)
if (ch == 0) {      // arrow keys send a 0 prefix, then a second code
    ch = getch();   // 72 = up, 80 = down, 75 = left, 77 = right
}
I have an English-language forum site written in Perl that is continually bombarded with spam in Russian. Is there a way, using Perl and regex, to detect Russian text so I can block it?
You can use the following to detect Cyrillic characters (used in Russian), in Perl's regex syntax:
[\x{0400}-\x{04FF}]+
If you really just want Russian characters, take a look at the Unicode code chart for the Cyrillic block, which gives the exact range used for the basic Russian alphabet: [\x{0410}-\x{044F}]. Of course, you'd also need to consider the additional Cyrillic characters used in Russian, such as Ё/ё (U+0401/U+0451), which fall outside that range.
Using the Unicode Cyrillic charset as suggested by JG is fine if everything is encoded as such. However, this is spam and, for the most part, it is not. Additionally, spammers will very often use a mix of charsets in their spam, which further defeats this approach.
I find that the best way (or at least a preliminary step in the process) of detecting Russian spam is to grep for the most commonly used charsets:
koi8-r
windows-1251
iso-8859-5
The next step after that would be to try some language-detection algorithms on what remains. If it's a big enough problem, use a paid service such as Google Translate (which also "detects") or Xerox; these services provide, IMO, the best language detection around.