PostgreSQL regexp - any language - regex

This
Check Load (1.0ms) SELECT "checks".* FROM "checks" WHERE (title ~* '[p{L}]+' and state ## 'saved')
matches only English characters, but how can I catch characters from any language?

AFAIK this functionality is not available in PostgreSQL. This answer seems to agree. It's 3ish years old, so something may have changed since then, but if it has I'm not aware of it.
From the original poster:
PostgreSQL doesn't support character classes based on the Unicode Character Database like .NET does. You get the more-standard [[:alpha:]] character class, but this is locale-dependent and probably won't cover it.
You may be able to get away with just blacklisting the ASCII characters you don't want, and allowing all non-ASCII characters. eg something like
[^\s!"#$%&'()*+,\-./:;<=>?\[\\\]^_`~]+
(JavaScript doesn't have non-ASCII character classes either. Or even [[:alpha:]].)
For example, given v_text as a text variable to be sanitized:
-- Allow internationalized text characters and remove undesired characters
v_text = regexp_replace( lower(trim(v_text)), '[!"#$%&()*+,./:;<=>?\[\\\]\^_\|~]+', '', 'g' );
EDIT: Please also note @depesz's answer below. It is possible to get the [[:lower:]] and [[:upper:]] character classes working on Postgres in Linux because Linux's ctype implementation (appears to be) based on UTF-8. I'm not sure if this is an "out of the box" configuration or some kind of upgrade, but it's good to know it's possible.
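The denylist idea above can be tried outside the database too. The following is a minimal Python sketch, not PostgreSQL code: it reuses the character class from the answer, and the name `is_allowed` is made up for illustration. Anything that is not ASCII whitespace or one of the listed ASCII punctuation characters is accepted, which includes every non-ASCII letter.

```python
import re

# Same denylist as the pattern quoted above: ASCII whitespace and punctuation
# are rejected, everything else (including all non-ASCII text) is allowed.
ALLOWED_RUN = re.compile(r"""[^\s!"#$%&'()*+,\-./:;<=>?\[\\\]^_`~]+""")


def is_allowed(text: str) -> bool:
    """Return True if the whole string consists of allowed characters."""
    return ALLOWED_RUN.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("żÓŁW", "продажи", "a b", "x<y"):
        print(sample, is_allowed(sample))
```

Note that this accepts digits and any character not explicitly listed, so it is a filter for unwanted ASCII punctuation rather than a true "letters only" test.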

I have written an extension that integrates PCRE into PostgreSQL: https://github.com/petere/pgpcre. It has better support for Unicode properties. You can write something like
title ~ pcre '^\p{L}'

Why don't you use normal classes - [:lower:] and [:upper:] ? Check this:
$ select w, w ~ '^[[:lower:][:upper:]]+$' from ( values ( 'aBc'::text ), ('żÓŁW'), ('123')) as x (w);
w | ?column?
------+----------
aBc | t
żÓŁW | t
123 | f
(3 rows)
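If the check can also be done in application code, the same three test values can be verified there. This is a small Python sketch, offered only as a cross-check and not as a replacement for the SQL above; `is_all_letters` is a hypothetical helper name. It relies on `str.isalpha()`, which is true only for non-empty strings whose characters all fall in a Unicode letter category.

```python
def is_all_letters(text: str) -> bool:
    # str.isalpha() is False for the empty string and for any string that
    # contains a character outside the Unicode letter categories.
    return text.isalpha()


if __name__ == "__main__":
    for w in ("aBc", "żÓŁW", "123"):
        print(w, is_all_letters(w))
```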

Are these characters safe to use in HTML, Postgres, and Bash?

I have a project where I'm trying to enable other, possibly hostile, coders to label, in lowercase, various properties that will be displayed in differing contexts, including being embedded in HTML, saved and manipulated in Postgres, used as attribute labels in JavaScript, and manipulated in the shell (say, saving a data file as продажи.zip), as well as in various data analysis tools like graph-tool, etc.
I've worked on multilingual projects before, but they were either smaller customers that didn't need to especially worry about sophisticated attacks or they were projects that I came to after the multilingual aspect was in place, so I wasn't the one responsible for verifying security.
I'm pretty sure these should be safe, but I don't know if there are gotchas I need to look out for, like, say, a special [TAB] or [QUOTE] character in the Chinese character set that might escape my escaping.
Am I ok with these in my regex filter?
dash = '-'
english = 'a-z'
italian = ''
russian = 'а-я'
ukrainian = 'ґї'
german = 'äöüß'
spanish = 'ñ'
french = 'çéâêîôûàèùëï'
portuguese = 'ãõ'
polish = 'ąćęłńóśźż'
turkish = 'ğışç'
dutch = 'áíúýÿìò'
swedish = 'å'
danish = 'æø'
norwegian = ''
estonian = ''
romanian = 'șî'
greek = 'α-ωίϊΐόάέύϋΰήώ'
chinese = '([\p{Han}]+)'
japanese = '([\p{Hiragana}\p{Katakana}]+)'
korean = '([\p{Hangul}]+)'
If you restrict yourself to text encodings with a 7-bit ASCII compatible subset, you're reasonably safe treating anything above 0x7f (U+007f) as "safe" when interacting with most saneish programming languages and tools. If you use perl6 you're out of luck ;)
You should avoid supporting or take special care with input or output of text using the text encoding Shift-JIS, where the ¥ symbol is at 0x5c where \ would usually reside. This offers opportunities for nefarious trickery by exploiting encoding conversions.
Avoid or take extra care with other non-ASCII-compatible encodings too. EBCDIC is one, but you're unlikely to ever meet it in the wild. UTF-16 and UTF-32 obviously, but if you misprocess them the results are glaringly obvious.
Reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Personally I think your approach is backwards. You should define input and output functions to escape and unescape strings according to the lexical syntaxes of each target tool or language, rather than trying to prohibit any possible metacharacter. But then I don't know your situation, and maybe it's just impractical for what you're doing.
I'm not quite sure what your actual issue is. If you correctly convert your text to the target format, then you don't care what the text could possibly be. This will ensure both proper conversion AND security.
For instance:
If your text is to be included in HTML, it should be escaped using appropriate HTML quoting functions.
Example:
Wrong
// XXX DON'T DO THIS XXX
echo "<span>".$variable."</span>";
Right:
// Actual encoding function varies based on your environment
echo "<span>".htmlspecialchars($variable)."</span>";
Yes, this will also handle properly the case of text containing & or <.
If your text is to be used in an SQL query, you should use parameterised queries.
Example:
Wrong
// XXX DON'T DO THIS XXX
perform_sql_query("SELECT this FROM that WHERE thing=".$variable);
Right
// Actual syntax and function will vary
perform_sql_query("SELECT this FROM that WHERE thing=?", [$variable]);
If your text is to be included in JSON, just use appropriate JSON-encoding functions.
Example:
Wrong
// XXX DON'T DO THIS XXX
echo '{"this":"'.$variable.'"}';
Right
// actual syntax and function may vary
echo json_encode(['this' => $variable]);
The shell is a bit more tricky, and it's often a pain to deal with non-ASCII characters in many environments (e.g. FTP or doing an scp between different environments). So don't use explicit names for files, use identifiers (numeric id, uuid, hash...) and store the mapping to the actual name somewhere else (in a database).
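The "escape for each target" advice above can be sketched in one place. The following Python example is illustrative only: the answer's own examples are PHP, `sqlite3` stands in for whatever database is actually used, and the sample `label` value is invented. It passes one hostile-looking string through HTML escaping, JSON encoding, a parameterised SQL query, and shell quoting, then checks that each target gets the original text back.

```python
import html
import json
import shlex
import sqlite3

# Hypothetical untrusted label containing markup, quotes and shell syntax.
label = '<b>продажи</b> & "quotes"; rm -rf /'

# HTML: escape &, <, > and quotes before embedding in markup.
html_fragment = "<span>" + html.escape(label) + "</span>"

# JSON: let the encoder handle quoting and escaping.
json_text = json.dumps({"this": label}, ensure_ascii=False)

# SQL: pass the value as a bound parameter, never by string concatenation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE that (thing TEXT)")
conn.execute("INSERT INTO that (thing) VALUES (?)", (label,))
row = conn.execute(
    "SELECT thing FROM that WHERE thing = ?", (label,)
).fetchone()

# Shell: quote the value so it is treated as a single argument.
shell_arg = shlex.quote(label)

if __name__ == "__main__":
    print(html_fragment)
    print(json_text)
    print(row[0] == label)
    print(shlex.split(shell_arg) == [label])
```

With this approach no character needs to be forbidden; each output path is responsible for its own encoding.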

Regex to match Egyptian Hieroglyphics [closed]

I want to know a regex to match the Egyptian Hieroglyphics. I am completely clueless and need your help.
I cannot post the letters as Stack Overflow doesn't seem to recognize them.
So can anyone let me know the Unicode range for these characters?
TL;DR: \p{Egyptian_Hieroglyphs}
Javascript
Egyptian_Hieroglyphs belong to the "astral" plane that uses more than 16 bits to encode a character. Javascript, as of ES5, doesn't support astral planes (more on that) therefore you have to use surrogate pairs. The first surrogate is
U+13000 = d80c dc00
the last one is
U+1342E = d80d dc2e
that gives
re = /(\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2E])+/g
t = document.getElementById("pyramid").innerHTML
document.write("<h1>Found</h1>" + t.match(re))
<div id="pyramid">
some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮
</div>
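The surrogate values used in the regex above follow from the UTF-16 encoding rule. Here is a short Python sketch of that arithmetic (the function name `surrogate_pair` is made up for illustration): subtract 0x10000, then put the top 10 bits into the high surrogate and the bottom 10 bits into the low surrogate.

```python
from typing import Tuple


def surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Return the UTF-16 (high, low) surrogates for an astral code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


if __name__ == "__main__":
    for cp in (0x13000, 0x1342E):
        high, low = surrogate_pair(cp)
        print(f"U+{cp:X} = {high:04x} {low:04x}")
```

The result can be cross-checked against Python's own encoder, for example `"\U00013000".encode("utf-16-be").hex()`.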
This is what it looks like with Noto Sans Egyptian Hieroglyphs installed:
Other languages
On platforms that support UCS-4 you can use Egyptian codepoints 13000 to 1342F directly, but the syntax differs from system to system. For example, in Python (3.3 up) it will be [\U00013000-\U0001342E]:
>>> s = "some \U00013000 really \U00013001 old \U0001342C stuff \U0001342D \U0001342E"
>>> s
'some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮'
>>> import re
>>> re.findall('[\U00013000-\U0001342E]', s)
['𓀀', '𓀁', '𓐬', '𓐭', '𓐮']
Finally, if your regex engine supports unicode properties, you can (and should) use these instead of hardcoded ranges. For example in php/pcre:
$str = " some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮";
preg_match_all('~\p{Egyptian_Hieroglyphs}~u', $str, $m);
print_r($m);
prints
[0] => Array
(
[0] => 𓀀
[1] => 𓀁
[2] => 𓐬
[3] => 𓐭
[4] => 𓐮
)
Unicode encodes Egyptian hieroglyphs in the range from U+13000 – U+1342F (beyond the Basic Multilingual Plane).
In this case, there are 2 ways to write the regex:
By specifying a character range from U+13000 – U+1342F.
While specifying a character range in regex for characters in BMP is as easy as [a-z], depending on the language support, doing so for characters in astral planes might not be as simple.
By specifying Unicode block for Egyptian hieroglyphs
Since we are matching any character in Egyptian hieroglyphs block, this is the preferred way to write the regex where support is available.
Java
(Currently, I don't have any idea how other implementation of Java Class Libraries deal with astral plane characters in Pattern classes).
Sun/Oracle implementation
I'm not sure if it makes sense to talk about matching characters in astral planes in Java 1.4, since support for characters beyond BMP was only added in Java 5 by retrofitting the existing String implementation (which uses UCS-2 for its internal String representation) with code point-aware methods.
Since Java continues to allow lone surrogates (one which can't form a pair with other surrogate) to be specified in String, it resulted in a mess, since surrogates are not real characters, and lone surrogates are invalid in UTF-16.
Pattern class saw a major overhaul from Java 1.4.x to Java 5, as the class was rewritten to provide support for matching Unicode characters in astral planes: the pattern string is converted to an array of code point before it is parsed, and the input string is traversed by code point-aware methods in String class.
You can read more about the madness in Java regex in this answer by tchrist.
I have written a detailed explanation on how to match a range of character which involves astral plane characters in this answer, so I am only going to include the code here. It also includes a few counter-examples of incorrect attempts to write regex to match astral plane characters.
Java 5 (and above)
"[\uD80C\uDC00-\uD80D\uDC2F]"
Java 7 (and above)
"[\\uD80C\\uDC00-\\uD80D\\uDC2F]"
"[\\x{13000}-\\x{1342F}]"
Since we are matching any code point that belongs to the Unicode block, it can also be written as:
"\\p{InEgyptian_Hieroglyphs}"
"\\p{InEgyptian Hieroglyphs}"
"\\p{InEgyptianHieroglyphs}"
"\\p{block=EgyptianHieroglyphs}"
"\\p{blk=Egyptian Hieroglyphs}"
Java supported \p syntax for Unicode block since 1.4, but support for Egyptian Hieroglyphs block was only added in Java 7.
PCRE (used in PHP)
PHP example is already covered in georg's answer:
'~\p{Egyptian_Hieroglyphs}~u'
Note that u flag is mandatory if you want to match by code points instead of matching by code units.
Not sure if there is a better post on StackOverflow, but I have written some explanation on the effect of u flag (UTF mode) in this answer of mine.
One thing to note is Egyptian_Hieroglyphs is only available from PCRE 8.02 (or a version not earlier than PCRE 7.90).
As an alternative, you can specify a character range with \x{h...hh} syntax:
'~[\x{13000}-\x{1342F}]~u'
Note the mandatory u flag.
The \x{h...hh} syntax is supported from at least PCRE 4.50.
JavaScript (ECMAScript)
ES5
The character range method (which is the only way to do this in vanilla JavaScript) is already covered in georg's answer. The regex is modified a bit to cover the whole block, including the reserved unassigned code point.
/(?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])/
The solution above demonstrates the technique to match a range of character in astral plane, and also the limitations of JavaScript RegExp.
JavaScript also suffers from the same problem of string representation as Java. While Java did fix Pattern class in Java 5 to allow it to work with code points, JavaScript RegExp is still stuck in the days of UCS-2, forcing us to work with code units instead of code point in the regular expression.
ES6
Finally, support for code point matching is added in ECMAScript 6, which is made available via u flag to prevent breaking existing implementations in previous versions of ECMAScript.
ES6 Specification - 21.2 RegExp (Regular Expression) Objects
Unicode-aware regular expressions in ECMAScript 6
Check Support section from the second link above for the list of browser providing experimental support for ES6 RegExp.
With the introduction of \u{h...hh} syntax in ES6, the character range can be rewritten in a manner similar to Java 7:
/[\u{13000}-\u{1342F}]/u
Or you can also directly specify the character in the RegExp literal, though the intention is not as clear cut as [a-z]:
/[𓀀-𓐯]/u
Note the u modifier in both regexes above.
Still stuck with ES5? Don't worry, you can transpile ES6 Unicode RegExp to ES5 RegExp with regexpu.

Tokenizing japanese string and converting to hiragana

I am using string tokenizer and transform APIs to convert kanji characters to hiragana.
The code in the linked question (What is the replacement for Language Analysis framework's Morpheme analysis deprecated APIs) converts most kanji characters to hiragana, but these APIs fail to convert kanji words of 3-4 characters.
like-
a) 現人神 is converted to latin - 'gen ren shen' and in hiragana- 'げんじんしん'
whereas it should be - in latin - 'Arahitogami ' and in hiragana- 'あらひとがみ'
b) 安本丹 is converted to latin - 'an ben dan' and in hiragana- 'やすもとまこと'
whereas it should be - in latin as - 'Yasumoto makoto ' and in hiragana- 'あんぽんたん'
My main purpose is to obtain the ruby text for a given Japanese text. I can't use the Language Analysis framework as it's unavailable in 64-bit.
Any suggestions? Are there other APIs to perform such string conversion?
So in both cases your API uses onyomi but shouldn't. So I assume it just guesses "3 or more characters ? onyomi should be more appropriate in most cases, so I use it". Sounds like an actual dictionary is needed for your problem, which you can download.
Names (as in b) should still be a problem, though. I don't see how a computer could get the correct name from kanji, as even native Japanese speakers sometimes fail at it. jisho.org doesn't even find a single name for 安本丹.
( Btw you mixed up your hiragana in b), and the latin for 'あんぽんたん'. I can't write comments yet with my rep so I'm leaving this here )

Unreserved characters in C/C++

I need to encode all occurrence of < character in a C/C++ code file. To prevent conflict, I need to know which characters are not reserved in C/C++ standard. For example, if $ is not reserved, I can encode < to $ temporarily and revive the original C/C++ code later.
I need this encoding for my C/C++ code in the XML-like intermediate language.
Thanks in advance.
Rather than list unreserved characters (there are infinitely many), here are the reserved ones from 2.3.1 of the standard:
space, horizontal tab, vertical tab, form feed, new line
a through z
A through Z
0 through 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
If you convert all < characters to $, how will you preserve any instances of $ in your original file?
Since you say you're targeting an XML-like intermediate language, why not use XML escaping and convert < to &lt; instead? (You'll also need to convert & in that case, say to &amp;.) There are lots of open source libraries available to help you do this. If you can't find any stand-alone module, here's code I've written which could have its XML (un)escaping functionality extracted.
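As a rough illustration of the XML-escaping suggestion, here is a Python sketch using the standard library's `xml.sax.saxutils`. The sample C snippet is invented, and it deliberately contains a `$` to show that no existing character is lost. Escaping replaces `&`, `<` and `>` with entities, and unescaping restores the original text exactly.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical C fragment containing <, & and even a literal $.
source = "if (a < b && s[0] == '$') { x = a << 2; }"

# escape() handles & first, then < and >, so entities are never double-escaped.
encoded = escape(source)

# unescape() reverses the mapping and recovers the original code.
decoded = unescape(encoded)

if __name__ == "__main__":
    print(encoded)
    print(decoded == source)
```

Because `&` itself is escaped, the mapping is reversible for any input, which is the property a single-character substitution such as `<` to `$` lacks.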
It depends on what you mean by "reserved". An implementation
is only required to understand a very limited number of
characters in input, with all others being input by means of
universal character names. An implementation is allowed (and
I would even say encouraged) to support more, see §2.2, point 1.
In practice, there are (or should be) no reserved characters
in comments, and in string and character literals (at least the
wide character forms, and in C++11, the Unicode forms). Your
best bet is probably something like quoted printable.

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
You're better off on the whole doing an iconv('utf-8','ascii//TRANSLIT',$string) on both the name and the search term and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.
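The two ideas from the answers above, comparing accent-folded strings and handling both the precomposed and the combining form of ā, can be sketched together. This is a Python illustration and not PHP: `fold_accents` is a hypothetical helper, and the sample titles are taken from the question. It decomposes each string with NFD, drops combining marks, and case-folds before doing a substring comparison.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks and fold case, e.g. 'Māori' -> 'maori'."""
    # NFD splits a precomposed character such as U+0101 into 'a' + U+0304.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks like the macron.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.casefold()


titles = ["Māori History", "Moorings", "Rubbish & Recycling"]
query = "maori"
matches = [t for t in titles if fold_accents(query) in fold_accents(t)]

if __name__ == "__main__":
    print(matches)
```

The folded form would normally be computed once when the cached array is built, so that only the search term needs folding at query time.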