Regex to match Egyptian Hieroglyphics [closed] - regex

I want to know a regex to match Egyptian hieroglyphs. I am completely clueless and need your help.
I cannot post the characters, as Stack Overflow doesn't seem to recognize them.
Can anyone tell me the Unicode range for these characters?

TL;DR: \p{Egyptian_Hieroglyphs}
JavaScript
Egyptian_Hieroglyphs belong to an "astral" plane, where a character needs more than 16 bits to encode. JavaScript, as of ES5, doesn't support astral planes in regular expressions, therefore you have to use surrogate pairs. The first character of the range is
U+13000 = d80c dc00
the last one is
U+1342E = d80d dc2e
that gives
re = /(\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2E])+/g
t = document.getElementById("pyramid").innerHTML
document.write("<h1>Found</h1>" + t.match(re))
<div id="pyramid">
some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮
</div>
The glyphs display correctly when a font such as Noto Sans Egyptian Hieroglyphs is installed.
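The surrogate pairs quoted above can be derived with a little arithmetic. The following is a minimal sketch in Python; the helper name to_surrogates is made up for illustration and is not part of any library.

```python
def to_surrogates(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into its UTF-16 surrogates."""
    offset = code_point - 0x10000          # 20-bit value to spread over two units
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low


for cp in (0x13000, 0x1342E):
    high, low = to_surrogates(cp)
    print("U+%X = %04x %04x" % (cp, high, low))
# prints:
# U+13000 = d80c dc00
# U+1342E = d80d dc2e
```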
Other languages
On platforms that support UCS-4 you can use the Egyptian code points U+13000 to U+1342F directly, but the syntax differs from system to system. For example, in Python (3.3 and up) the range up to the last glyph used here is [\U00013000-\U0001342E]:
>>> s = "some \U00013000 really \U00013001 old \U0001342C stuff \U0001342D \U0001342E"
>>> s
'some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮'
>>> import re
>>> re.findall('[\U00013000-\U0001342E]', s)
['𓀀', '𓀁', '𓐬', '𓐭', '𓐮']
Finally, if your regex engine supports Unicode properties, you can (and should) use these instead of hardcoded ranges. For example, in PHP/PCRE:
$str = " some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮";
preg_match_all('~\p{Egyptian_Hieroglyphs}~u', $str, $m);
print_r($m);
prints
[0] => Array
(
[0] => 𓀀
[1] => 𓀁
[2] => 𓐬
[3] => 𓐭
[4] => 𓐮
)

Unicode encodes Egyptian hieroglyphs in the range U+13000 to U+1342F (beyond the Basic Multilingual Plane).
In this case, there are 2 ways to write the regex:
1. By specifying a character range from U+13000 to U+1342F.
While specifying a character range in regex for characters in the BMP is as easy as [a-z], depending on the language support, doing so for characters in astral planes might not be as simple.
2. By specifying the Unicode block for Egyptian hieroglyphs.
Since we are matching any character in the Egyptian hieroglyphs block, this is the preferred way to write the regex where support is available.
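The two approaches listed above can be compared in a short Python sketch. Python's built-in re module has no \p{...} support, so the second function is only a rough stand-in that looks at the Unicode character name; it is an assumption made for illustration, not an equivalent of a real block or script property, and both helper names are hypothetical.

```python
import re
import unicodedata

# Way 1: an explicit code point range covering the whole block.
BLOCK_RANGE = re.compile("[\U00013000-\U0001342F]")


def in_block(ch):
    """True if the single character ch lies in U+13000..U+1342F."""
    return BLOCK_RANGE.fullmatch(ch) is not None


def named_hieroglyph(ch):
    """Rough property-style test based on the character's Unicode name.

    Unassigned code points have no name, so they yield False here.
    """
    return unicodedata.name(ch, "").startswith("EGYPTIAN HIEROGLYPH")


sample = "some \U00013000 really \U00013001 old stuff"
print([ch for ch in sample if in_block(ch)])
print([ch for ch in sample if named_hieroglyph(ch)])
```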
Java
(Currently, I don't have any idea how other implementations of the Java Class Library deal with astral plane characters in their Pattern classes.)
Sun/Oracle implementation
I'm not sure if it makes sense to talk about matching characters in astral planes in Java 1.4, since support for characters beyond BMP was only added in Java 5 by retrofitting the existing String implementation (which uses UCS-2 for its internal String representation) with code point-aware methods.
Since Java continues to allow lone surrogates (ones that can't form a pair with another surrogate) to be specified in a String, it resulted in a mess, since surrogates are not real characters, and lone surrogates are invalid in UTF-16.
The Pattern class saw a major overhaul from Java 1.4.x to Java 5, as the class was rewritten to provide support for matching Unicode characters in astral planes: the pattern string is converted to an array of code points before it is parsed, and the input string is traversed by code point-aware methods in the String class.
You can read more about the madness in Java regex in this answer by tchrist.
I have written a detailed explanation on how to match a range of characters which involves astral plane characters in this answer, so I am only going to include the code here. It also includes a few counter-examples of incorrect attempts to write a regex to match astral plane characters.
Java 5 (and above)
"[\uD80C\uDC00-\uD80D\uDC2F]"
Java 7 (and above)
"[\\uD80C\\uDC00-\\uD80D\\uDC2F]"
"[\\x{13000}-\\x{1342F}]"
Since we are matching any code point that belongs to the Unicode block, it can also be written as:
"\\p{InEgyptian_Hieroglyphs}"
"\\p{InEgyptian Hieroglyphs}"
"\\p{InEgyptianHieroglyphs}"
"\\p{block=EgyptianHieroglyphs}"
"\\p{blk=Egyptian Hieroglyphs}"
Java supported \p syntax for Unicode block since 1.4, but support for Egyptian Hieroglyphs block was only added in Java 7.
PCRE (used in PHP)
PHP example is already covered in georg's answer:
'~\p{Egyptian_Hieroglyphs}~u'
Note that the u flag is mandatory if you want to match by code points instead of matching by code units.
Not sure if there is a better post on Stack Overflow, but I have written some explanation on the effect of the u flag (UTF mode) in this answer of mine.
One thing to note is Egyptian_Hieroglyphs is only available from PCRE 8.02 (or a version not earlier than PCRE 7.90).
As an alternative, you can specify a character range with \x{h...hh} syntax:
'~[\x{13000}-\x{1342F}]~u'
Note the mandatory u flag.
The \x{h...hh} syntax is supported from at least PCRE 4.50.
JavaScript (ECMAScript)
ES5
The character range method (which is the only way to do this in vanilla JavaScript) is already covered in georg's answer. The regex is modified a bit to cover the whole block, including the reserved unassigned code point.
/(?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])/
The solution above demonstrates the technique to match a range of characters in an astral plane, and also the limitations of JavaScript RegExp.
JavaScript also suffers from the same problem of string representation as Java. While Java did fix the Pattern class in Java 5 to allow it to work with code points, JavaScript RegExp is still stuck in the days of UCS-2, forcing us to work with code units instead of code points in the regular expression.
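The ES5 pattern above can also be produced mechanically. The following Python sketch (the function names are hypothetical) builds the surrogate-pair alternation for an astral range, emitting one alternative per high surrogate. It assumes both ends of the range are astral code points and makes no attempt to merge alternatives for very wide ranges.

```python
def to_surrogates(code_point):
    """Split an astral code point into (high, low) UTF-16 surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def es5_astral_range(first, last):
    """Return ES5 regex source matching every code point in first..last."""
    hi_first, lo_first = to_surrogates(first)
    hi_last, lo_last = to_surrogates(last)
    alternatives = []
    for hi in range(hi_first, hi_last + 1):
        # Only the first and last high surrogates have a restricted low range.
        lo_from = lo_first if hi == hi_first else 0xDC00
        lo_to = lo_last if hi == hi_last else 0xDFFF
        alternatives.append("\\u%04X[\\u%04X-\\u%04X]" % (hi, lo_from, lo_to))
    return "(?:" + "|".join(alternatives) + ")"


print(es5_astral_range(0x13000, 0x1342F))
# prints: (?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])
```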
ES6
Finally, support for code point matching is added in ECMAScript 6, which is made available via u flag to prevent breaking existing implementations in previous versions of ECMAScript.
ES6 Specification - 21.2 RegExp (Regular Expression) Objects
Unicode-aware regular expressions in ECMAScript 6
Check the Support section of the second link above for the list of browsers providing experimental support for ES6 RegExp.
With the introduction of \u{h...hh} syntax in ES6, the character range can be rewritten in a manner similar to Java 7:
/[\u{13000}-\u{1342F}]/u
Or you can also directly specify the character in the RegExp literal, though the intention is not as clear cut as [a-z]:
/[𓀀-𓐯]/u
Note the u modifier in both regexes above.
Still stuck with ES5? Don't worry, you can transpile ES6 Unicode RegExp to ES5 RegExp with regexpu.

Related

How to split emojis that contain flags without the flag breaking into 2 characters in Google Sheets

This is my initial string:
🇦🇺🏅🏉
I used a not so elegant way to break up the emojis.
=if(len(I88) = 4, REGEXEXTRACT(I88,"(.+?)\s*(.+?)"),if(len(I88) = 6, REGEXEXTRACT(I88,"(.+?)\s*(.+?)\s*(.+?)"),if(len(I88) = 8, REGEXEXTRACT(I88,"(.+?)\s*(.+?)\s*(.+?)\s*(.+?)"),if(len(I88) = 10, REGEXEXTRACT(I88,"(.+?)\s*(.+?)\s*(.+?)\s*(.+?)\s*(.+?)"), REGEXEXTRACT(I88,"\s*(.+?)" )))))
The result is 4 columns instead of 3: this is what it looks like
🇦 | 🇺 | 🏅 | 🏉
I left the pipes to indicate a separate column
What I want is this:
🇦🇺 | 🏅 | 🏉
Short answer
To correctly separate the three emoticons we need to use a custom function. Fortunately, there are JavaScript libraries that can be used for this, like the one shared in the answer by Orlin Giorgiev to Get grapheme character count in javascript strings?
Explanation
The OP's formula returns four elements instead of three because Google Sheets built-in functions see four "characters" (actually code points) in the string, and the flag is made of two of them. Each of these code points needs more than 4 hexadecimal digits to represent it; such code points are called "astral code points".
From https://mathiasbynens.be/notes/javascript-unicode
Astral code points are pretty easy to recognize: if you need more than 4 hexadecimal digits to represent the code point, it’s an astral code point.
Internally, JavaScript [as well as Google Sheets built-in functions] represents astral symbols as surrogate pairs, and it exposes the separate surrogate halves as separate “characters”. If you represent the symbols using nothing but ECMAScript 5-compatible escape sequences, you’ll see that two escapes are needed for each astral symbol. This is confusing, because humans generally think in terms of Unicode symbols or graphemes instead.
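To make the counting concrete, here is a small Python sketch of the sample string. It shows the difference between code points and UTF-16 code units, and a simplified split that keeps a pair of regional indicator symbols together as one flag. This is only an approximation made for illustration: it does not implement the full grapheme cluster rules (ZWJ sequences, skin-tone modifiers and so on) that a library such as the one referred to above handles.

```python
import re

# The sample string, written with escapes: two regional indicators (the flag),
# a sports medal and a rugby football.
s = "\U0001F1E6\U0001F1FA\U0001F3C5\U0001F3C9"

code_points = len(s)                            # Python counts code points
utf16_units = len(s.encode("utf-16-le")) // 2   # UTF-16 code units, as counted by JavaScript's length

# Two consecutive regional indicators form one flag; anything else is taken
# one code point at a time.
FLAG_OR_CHAR = re.compile("[\U0001F1E6-\U0001F1FF]{2}|.", re.DOTALL)
pieces = FLAG_OR_CHAR.findall(s)

print(code_points, utf16_units, len(pieces))
# prints: 4 8 3
```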
Custom function
function SPLITGRAPHEMES(string) {
  // GraphemeSplitter is provided by the JavaScript library referred to above
  var splitter = new GraphemeSplitter();
  return splitter.splitGraphemes(string);
}
NOTE: Don't forget to include the referred JavaScript library
Syntax
Assume that A1 contains 🇦🇺🏅🏉. To split the three emoticons into a 1 x 3 array, use the following formula:
=TRANSPOSE(SPLITGRAPHEMES(A1))
Note: On Windows the emoticons (🇦🇺🏅🏉) in this Q&A don't look the same as in Chrome OS.

Translate accented to unaccented characters in Sublime Text snippet using regex

I'm writing a ST3 snippet that inserts a \subsection{} with a label. The label is created by converting the header text to conform with the LaTeX standards for labels using a (rather lengthy) regular expression:
${1/(?:([ \t_]+)?|\b)(?:([ÅÄÆÁÀÃ])?|\b)(?:([åäæâàáã])?|\b)(?:([ÉÈÊË])?|\b)(?:([éèëê])?|\b)(?:([ÌÌÎÏ])?|\b)(?:([íìïî])?|\b)(?:([Ñ])?|\b)(?:([ñ])?|\b)(?:([ÖØÓÒÔÖÕ])?|\b)(?:([öøóòôõ])?|\b)(?:([ÜÛÚÙ])?|\b)(?:([üûúù])?|\b)/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:N)(?9:n)(?10:O)(?11:o)(?12:U)(?13:u)/g}
Actually, I would like for it to be even longer. But if I add the extra groups that I would like, then ST3 crashes when I execute the snippet.
${1/(?:([ \t_]+)?|\b)(?:([ÅÄÆÁÀÃ])?|\b)(?:([åäæâàáã])?|\b)(?:([Ç])?|\b)(?:([ç])?|\b)(?:([ÉÈÊË])?|\b)(?:([éèëê])?|\b)(?:([ÌÌÎÏ])?|\b)(?:([íìïî])?|\b)(?:([Ñ])?|\b)(?:([ñ])?|\b)(?:([ÖØÓÒÔÖÕ])?|\b)(?:([öøóòôõ])?|\b)(?:([ÜÛÚÙ])?|\b)(?:([üûúù])?|\b)(?:([Ý])?|\b)(?:([ÿý])?|\b)/(?1:-)(?2:A)(?3:a)(?4:C)(?5:c)(?6:E)(?7:e)(?8:I)(?9:i)(?10:O)(?11:o)(?12:N)(?13:n)(?14:U)(?15:u)(?16:Y)(?17:y)/g}
Is there any more efficient way of doing this? Preferably one that won't cause ST3 to crash ;)
Edit:
Here are some example strings:
Flygande bæckasiner søka hwila på mjuka tuvor
Åke Staël hade en överflödig idé
And the results (with the current, working regex):
Flygande-backasiner-soka-hwila-pa-mjuka-tuvor
Ake-Stael-hade-en-overflodig-ide
But I would like to also replace the characters (ÇçÝÿý) with their unaccented counterparts (CcYyy) so that e.g.
Comment ça va
becomes
Comment-ca-va
I don't know this syntax, but I suspect that the problem comes from too many optional groups combined with a lot of alternatives, which makes the processing too complex.
So you can try to design your pattern like this, and you can add other groups of letters in the same way (take a look at the Unicode table to find character ranges):
${1/([ \t_]+)|([À-Å])|([à-å])|([È-Ë])|([è-ë])|([Ì-Ï])|([ì-ï])|([Ò-ÖØ])|([ò-öø])|([Ù-Ü])|([ù-ü])|(Æ)|(æ)|(Œ)|(œ)|(Ñ)|(ñ)/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:O)(?9:o)(?10:U)(?11:u)(?12:AE)(?13:ae)(?14:OE)(?15:oe)(?16:N)(?17:n)/g}
If the lookahead feature is available, you can improve this pattern so that non-accented characters are not tested against every alternative:
${1/(?=[ \t_À-ÆÈ-ÏÑ-ÖØ-Üà-æè-ïñ-öø-üŒœ])(?:([ \t_]+)|([À-Å])|([à-å])|([È-Ë])|([è-ë])|([Ì-Ï])|([ì-ï])|([Ò-ÖØ])|([ò-öø])|([Ù-Ü])|([ù-ü])|(Æ)|(æ)|(Œ)|(œ)|(Ñ)|(ñ))/(?1:-)(?2:A)(?3:a)(?4:E)(?5:e)(?6:I)(?7:i)(?8:O)(?9:o)(?10:U)(?11:u)(?12:AE)(?13:ae)(?14:OE)(?15:oe)(?16:N)(?17:n)/g}
Note: Æ (Aelig) must be transliterated as AE (the same for Œ => OE)
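Outside of Sublime Text, the same idea (one flat alternation instead of many optional groups) can be sketched in Python with a lookup table and a replacement callback. The table below is illustrative only and the name to_label is hypothetical; like the patterns above, it transliterates Æ and Œ as AE and OE.

```python
import re

# Illustrative mapping only; extend it with whatever letters you need.
GROUPS = {
    "ÀÁÂÃÄÅ": "A", "àáâãäå": "a",
    "Ç": "C", "ç": "c",
    "ÈÉÊË": "E", "èéêë": "e",
    "ÌÍÎÏ": "I", "ìíîï": "i",
    "Ñ": "N", "ñ": "n",
    "ÒÓÔÕÖØ": "O", "òóôõöø": "o",
    "ÙÚÛÜ": "U", "ùúûü": "u",
    "Ý": "Y", "ýÿ": "y",
    "Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe",
}

# One entry per accented character.
TABLE = {ch: repl for chars, repl in GROUPS.items() for ch in chars}

# Either a run of separators or a single character from the table.
PATTERN = re.compile("[ \\t_]+|[" + "".join(TABLE) + "]")


def to_label(text):
    """Replace separator runs with '-' and accented letters with plain ones."""
    return PATTERN.sub(lambda m: TABLE.get(m.group(), "-"), text)


print(to_label("Comment ça va"))
# prints: Comment-ca-va
```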

Range of UTF-8 Characters in C++11 Regex

This question is an extension of Do C++11 regular expressions work with UTF-8 strings?
#include <regex>
if (std::regex_match ("中", std::regex("中") )) // "\u4e2d" also works
std::cout << "matched\n";
The program is compiled on Mac Mountain Lion with clang++ with the following options:
clang++ -std=c++0x -stdlib=libc++
The code above works. This is a standard range regex "[一-龠々〆ヵヶ]" for matching any Japanese Kanji or Chinese character. It works in JavaScript and Ruby, but I can't seem to get ranges working in C++11, even when using a similar version, [\u4E00-\u9fa0]. The code below does not match the string.
if (std::regex_match ("中", std::regex("[一-龠々〆ヵヶ]")))
std::cout << "range matched\n";
Changing locale hasn't helped either. Any ideas?
EDIT
So I have found that all ranges work if you add a + to the end, in this case [一-龠々〆ヵヶ]+, but if you add {1}, as in [一-龠々〆ヵヶ]{1}, it does not work. Moreover, it seems to overreach its boundaries. It won't match Latin characters, but it will match は which is \u306f and ぁ which is \u3041. They both lie below \u4E00.
nhahtdh also suggested regex_search, which also works without adding +, but it still runs into the same problem as above by pulling in values outside of its range. I played with the locales a bit as well. Mark Ransom suggests it treats the UTF-8 string as a dumb set of bytes; I think this is possibly what it is doing.
Further supporting the theory that UTF-8 is getting jumbled somehow, [a-z]{1} and [a-z]+ match a, but only [一-龠々〆ヵヶ]+ matches any of the characters, not [一-龠々〆ヵヶ]{1}.
Encoded in UTF-8, the string "[一-龠々〆ヵヶ]" is equal to this one: "[\xe4\xb8\x80-\xe9\xbe\xa0\xe3\x80\x85\xe3\x80\x86\xe3\x83\xb5\xe3\x83\xb6]". And this is not the droid character class you are looking for.
The character class you are looking for is the one that includes:
any character in the range U+4E00..U+9FA0; or
any of the characters 々, 〆, ヵ, ヶ.
The character class you specified is the one that includes:
any of the "characters" \xe4 or \xb8; or
any "character" in the range \x80..\xe9; or
any of the "characters" \xbe, \xa0, \xe3, \x80, \x85, \xe3 (again), \x80 (again), \x86, \xe3 (again), \x83, \xb5, \xe3 (again), \x83 (again), \xb6.
Messy isn't it? Do you see the problem?
This will not match "latin" characters (which I assume you mean things like a-z) because in UTF-8 those all use a single byte below 0x80, and none of those is in that messy character class.
It will not match "中" either because "中" has three "characters", and your regex matches only one "character" out of that weird long list. Try assert(std::regex_match("中", std::regex("..."))) and you will see.
If you add a + it works because "中" has three of those "characters" in your weird long list, and now your regex matches one or more.
If you instead add {1} it does not match because we are back to matching three "characters" against one.
Incidentally "中" matches "中" because we are matching the three "characters" against the same three "characters" in the same order.
The regex with + will actually match some undesired things, because it does not care about order. Any character that can be made from that list of bytes in UTF-8 will match. It will match "\xe3\x81\x81" (ぁ U+3041) and it will even match invalid UTF-8 input like "\xe3\xe3\xe3\xe3".
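The byte-level behaviour described above can be reproduced in Python by compiling the UTF-8 encoded pattern as a bytes regex. This is only an illustration of the mechanism, under the assumption that the C++ library in question treats the pattern and input as raw bytes; it is not a reproduction of that library.

```python
import re

# The character class from the question, written with escapes:
# U+4E00..U+9FA0 plus U+3005, U+3006, U+30F5 and U+30F6.
char_class = "[\u4e00-\u9fa0\u3005\u3006\u30f5\u30f6]"

# Byte view: the class now contains single bytes and the range \x80-\xe9.
byte_one = re.compile(char_class.encode("utf-8"))
byte_many = re.compile((char_class + "+").encode("utf-8"))

zhong = "\u4e2d".encode("utf-8")     # b"\xe4\xb8\xad", three bytes
small_a = "\u3041".encode("utf-8")   # b"\xe3\x81\x81", outside the intended range

print(byte_one.fullmatch(zhong) is not None)                 # False: three bytes against one
print(byte_many.fullmatch(zhong) is not None)                # True: each byte is in the class
print(byte_many.fullmatch(small_a) is not None)              # True: the unwanted match
print(byte_many.fullmatch(b"\xe3\xe3\xe3\xe3") is not None)  # True: invalid UTF-8 matches too
print(byte_many.fullmatch(b"a") is not None)                 # False: ASCII bytes are below 0x80
```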
The bigger problem is that you are using a regex library that does not even have level 1 support for Unicode, the bare minimum required. It munges bytes and there isn't much your precious tiny regex can do about it.
And the even bigger problem is that you are using a hardcoded set of characters to specify "any Japanese Kanji or Chinese character". Why not use the Unicode Script property for that?
R"(\p{Script=Han})"
Oh right, this won't work with C++11 regexes. For a moment there I almost forgot those are annoyingly worse than useless with Unicode.
So what should you do?
You could decode your input into a std::u32string and use char32_t all over for the matching. That would not give you this mess, but you would still be hardcoding ranges and exceptions when you mean "a set of characters that share a certain property".
I recommend you forget about C++11 regexes and use some regular expression library that has the bare minimum level 1 Unicode support, like the one in ICU.
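For comparison, the decode-first approach mentioned above (the counterpart of working on a std::u32string) can be sketched in Python by decoding the bytes and testing each code point directly. The function names are hypothetical, and the ranges are still the hardcoded ones from the question rather than a real script property.

```python
# Extra characters from the question's class: U+3005, U+3006, U+30F5, U+30F6.
EXTRAS = {0x3005, 0x3006, 0x30F5, 0x30F6}


def in_class(code_point):
    """True if the code point is in U+4E00..U+9FA0 or one of the extras."""
    return 0x4E00 <= code_point <= 0x9FA0 or code_point in EXTRAS


def all_in_class(raw):
    """Decode UTF-8 bytes and check every code point against the class."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # invalid UTF-8 never matches
    return len(text) > 0 and all(in_class(ord(ch)) for ch in text)


print(all_in_class("\u4e2d".encode("utf-8")))   # True
print(all_in_class("\u3041".encode("utf-8")))   # False: U+3041 is below U+4E00
print(all_in_class(b"\xe3\xe3\xe3\xe3"))        # False: not valid UTF-8
```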

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
You're better off on the whole doing an iconv('utf-8','ascii//TRANSLIT',$string) on both the name and the search term and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to use double dots there to catch it instead, or {1,4}. And even then you're going to have to verify that the up to four bytes you grab between the M and the o form a single valid UTF-8 character. This is all moot if PHP does Unicode right; I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways: one as a single character (U+0101) and one as TWO Unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely only ever going to get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.
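Both spellings of ā can be handled by normalizing before comparing. The sketch below uses Python's unicodedata module rather than PHP, purely as an illustration; the function name fold and the sample titles are assumptions. It decomposes the text, drops combining marks and lowercases the rest, so the precomposed and the decomposed form fold to the same plain string.

```python
import unicodedata


def fold(text):
    """Decompose, strip combining marks, and case-fold for loose comparison."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


titles = [
    "M\u0101ori History",     # precomposed a with macron (U+0101)
    "Ma\u0304ori Language",   # 'a' followed by combining macron (U+0304)
    "Moorings",
]

query = "maori"
matches = [title for title in titles if fold(query) in fold(title)]
print(len(matches))
# prints: 2
```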

How can I detect Russian spam posts with Perl?

I have an English language forum site written in perl that is continually bombarded with spam in Russian. Is there a way using Perl and regex to detect Russian text so I can block it?
You can use the following to detect Cyrillic characters (used in Russian):
[\u0400-\u04FF]+
Note that Perl itself does not accept \uXXXX escapes in a regex; there the same range is written [\x{0400}-\x{04FF}]+, or more simply \p{Cyrillic}+.
If you really just want Russian characters, you can take a look at the aforesaid document, which contains the exact range used for the Basic Russian alphabet which is [\u0410-\u044F]. Of course you'd also need to consider extension Cyrillic characters that are used exclusively in Russian -- also mentioned in the document.
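As a rough sketch of how such a range could drive a filter, the Python snippet below measures what share of a post's letters fall in the Cyrillic block. The function names and the 0.5 threshold are arbitrary assumptions chosen for illustration, and the text is assumed to be already decoded to Unicode.

```python
import re

CYRILLIC = re.compile("[\u0400-\u04FF]")


def cyrillic_ratio(text):
    """Share of alphabetic characters that fall in U+0400..U+04FF."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if CYRILLIC.match(ch))
    return cyrillic / len(letters)


def looks_russian(text, threshold=0.5):
    """Flag text whose letters are mostly Cyrillic (threshold is a guess)."""
    return cyrillic_ratio(text) >= threshold


print(looks_russian("\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440"))  # True
print(looks_russian("Hello, world"))                                              # False
```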
using the unicode cyrillic charset as suggested by JG is fine if everything is encoded as such. however, this is spam and for the most part, things are not. additionally, spammers will very often use a mix of charsets in spams which further screws up this approach.
i find that the best way (or at least the preliminary step in the process) of detecting russian spam is to grep for the most commonly used charsets:
koi8-r
windows-1251
iso-8859-5
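One way to grep for those charsets is to look for a charset declaration in the raw, undecoded post or message. The sketch below is in Python rather than Perl and assumes the charset is announced in a charset=... parameter, as in a MIME Content-Type header; the function name is hypothetical.

```python
import re

# Matches e.g.  charset=koi8-r  or  charset="windows-1251"  (case-insensitive).
CHARSET_RE = re.compile(
    rb'charset\s*=\s*"?(koi8-r|windows-1251|iso-8859-5)',
    re.IGNORECASE,
)


def declared_russian_charset(raw_message):
    """Return the declared Cyrillic charset name in lowercase, or None."""
    match = CHARSET_RE.search(raw_message)
    return match.group(1).decode("ascii").lower() if match else None


sample = b'Content-Type: text/plain; charset="KOI8-R"\r\n\r\nbody'
print(declared_russian_charset(sample))
# prints: koi8-r
```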
next step after that would be to try some language detection algorithms on what remains. if it's a big enough problem, use a paid service such as google translate (which also "detects") or xerox. these services provide IMO the best language detection around.