I'm thinking about using the regular expression [0-9a-zA-Z]+ to match any alphanumeric string in the C++ standard library's regular expression library.
But I'm worried about portability. In an ASCII character set this will work, and I believe 0-9 must match only digits in any encoding, since the standard requires the digit characters '0' through '9' to be contiguous and in order. But the C++ standard doesn't insist on ASCII encoding, so the a-zA-Z part might give me strange results on some platforms, for example those with EBCDIC encoding.
I could use \d for the digit part, but in some engines that also matches digits from other scripts, such as Eastern Arabic numerals.
What should I use for a fully portable regular expression that only matches digits and English alphabet letters of either case?
It seems that PCRE (the current version of which is PCRE2) has support for other encoding types, including EBCDIC.
Within the source code on their website, I found "this file" with the following (formatting mine):
A program called dftables (which is distributed with PCRE2) can be used to build alternative versions of this file. This is necessary if you are running in an EBCDIC environment, or if you want to default to a different encoding, for example ISO-8859-1. When dftables is run, it creates these tables in the current locale. If PCRE2 is configured with --enable-rebuild-chartables, this happens automatically.
Well, if you're worried about supporting exotic encodings, you can just list all the characters manually:
[0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]+
This looks a bit dirty, but surely it will work everywhere.
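For instance, a minimal C++ sketch of this approach with std::regex; the sample strings are just illustrative:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Every character is listed explicitly, so nothing is assumed about
    // how the execution character set orders the letters.
    const std::regex alnum(
        "[0123456789"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz]+");

    for (const std::string s : {"Hello42", "caf\xc3\xa9"}) {
        std::cout << s << ": "
                  << (std::regex_match(s, alnum) ? "matches" : "no match")
                  << "\n";   // "café" fails: the é bytes are not in the class
    }
}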
Related
What does [[:print:]] mean in the following code?
echo 255 > /sys/devices/platform/[[:print:]]*/hwmon/hwmon[[:print:]]*/pwm1
What is the difference between using [[:print:]]* and only *?
echo 255 > /sys/devices/platform/*/hwmon/hwmon*/pwm1
Is there a known name for this feature, or any place where I could read and understand more about it?
[[:print:]], in either a glob-style expression or a POSIX-compliant regex, matches any printable character.
The reference for (the simplified, single-character case of) glob expressions is https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13_01; it references the regular expression portion of the standard at https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05 as being authoritative for square-bracket expressions, which describes [:print:] as one of the character-class expressions that all locales must provide. The details of this specific class are then provided in https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_01.
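For what it's worth, C++'s <regex> implements these POSIX bracket expressions too, so you can try the class in a small sketch; with the default "C" locale this is the ASCII notion of printable:

#include <iostream>
#include <regex>

int main() {
    std::regex printable("[[:print:]]+");
    // "hello world" consists entirely of printable characters (space included).
    std::cout << std::regex_match("hello world", printable) << "\n"; // 1
    // A tab is a control character, not a printable one.
    std::cout << std::regex_match("tab\there", printable) << "\n";   // 0
}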
[[:print:]] is a POSIX class, recognized by virtually all engines today, that matches at most 144,544 Unicode characters as of Unicode 14.
The raw representation of that is the UTF-8/32 regex below, based on the UCD (Unicode Character Database).
It matches ALL of them. Some engines match fewer, but that doesn't matter: as of Unicode 14 an engine always matches this set or a subset of it, never more.
[\\a-zA-Z0-9\t-\r-/:-#\[\]-`{-~\\ -¬®-ÿĀ-ͷͺ-Ϳ΄-ΊΌΎ-ΡΣ-ԯԱ-Ֆՙ-֊֍-֏֑-ׇא-תׯ-״؆-؛؝-ۜ۞-܍ܐ-݊ݍ-ޱ߀-ߺ߽-࠭࠰-࠾ࡀ-࡛࡞ࡠ-ࡪࡰ-ࢎ࢘-ࣣ࣡-ঃঅ-ঌএঐও-নপ-রলশ-হ়-ৄেৈো-ৎৗড়ঢ়য়-ৣ০-৾ਁ-ਃਅ-ਊਏਐਓ-ਨਪ-ਰਲਲ਼ਵਸ਼ਸਹ਼ਾ-ੂੇੈੋ-੍ੑਖ਼-ੜਫ਼੦-੶ઁ-ઃઅ-ઍએ-ઑઓ-નપ-રલળવ-હ઼-ૅે-ૉો-્ૐૠ-ૣ૦-૱ૹ-૿ଁ-ଃଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହ଼-ୄେୈୋ-୍୕-ୗଡ଼ଢ଼ୟ-ୣ୦-୷ஂஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந-பம-ஹா-ூெ-ைொ-்ௐௗ௦-௺ఀ-ఌఎ-ఐఒ-నప-హ఼-ౄె-ైొ-్ౕౖౘ-ౚౝౠ-ౣ౦-౯౷-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹ಼-ೄೆ-ೈೊ-್ೕೖೝೞೠ-ೣ೦-೯ೱೲഀ-ഌഎ-ഐഒ-ൄെ-ൈൊ-൏ൔ-ൣ൦-ൿඁ-ඃඅ-ඖක-නඳ-රලව-ෆ්ා-ුූෘ-ෟ෦-෯ෲ-෴ก-ฺ฿-๛ກຂຄຆ-ຊຌ-ຣລວ-ຽເ-ໄໆ່-ໍ໐-໙ໜ-ໟༀ-ཇཉ-ཬཱ-ྗྙ-ྼ྾-࿌࿎-࿚က-ჅჇჍა-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚ፝-፼ᎀ-᎙Ꭰ-Ᏽᏸ-ᏽ᐀-᚜ᚠ-ᛸᜀ-᜕ᜟ-᜶ᝀ-ᝓᝠ-ᝬᝮ-ᝰᝲᝳក-៝០-៩៰-៹᠀-᠍᠏-᠙ᠠ-ᡸᢀ-ᢪᢰ-ᣵᤀ-ᤞᤠ-ᤫᤰ-᤻᥀᥄-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉ᧐-᧚᧞-ᨛ᨞-ᩞ᩠-᩿᩼-᪉᪐-᪙᪠-᪭᪰-ᫎᬀ-ᭌ᭐-᭾ᮀ-᯳᯼-᰷᰻-᱉ᱍ-ᲈᲐ-ᲺᲽ-᳇᳐-ᳺᴀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ῄῆ-ΐῖ-Ί῝-`ῲ-ῴῶ-῾ -\ ‐-\
\ -\ ⁰ⁱ⁴-₎ₐ-ₜ₠-⃀⃐-⃰℀-↋←-␦⑀-⑊①-⭳⭶-⮕⮗-ⳳ⳹-ⴥⴧⴭⴰ-ⵧⵯ⵰⵿-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⷠ-⹝⺀-⺙⺛-⻳⼀-⿕⿰-⿻ -〿ぁ-ゖ゙-ヿㄅ-ㄯㄱ-ㆎ㆐-㇣ㇰ-㈞㈠-ꒌ꒐-꓆ꓐ-ꘫꙀ-꛷꜀-ꟊꟐꟑꟓꟕ-ꟙꟲ-꠬꠰-꠹ꡀ-꡷ꢀ-ꣅ꣎-꣙꣠-꥓꥟-ꥼꦀ-꧍ꧏ-꧙꧞-ꧾꨀ-ꨶꩀ-ꩍ꩐-꩙꩜-ꫂꫛ-꫶ꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-꭫ꭰ-꯭꯰-꯹가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗיִ-זּטּ-לּמּנּסּףּפּצּ-﯂ﯓ-ﶏﶒ-ﷇ﷏ﷰ-︙︠-﹒﹔-﹦﹨-﹫ﹰ-ﹴﹶ-ﻼ!-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ¢-₩│-○�𐀀-𐀋𐀍-𐀦𐀨-𐀺𐀼𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐄀-𐄂𐄇-𐄳𐄷-𐆎𐆐-𐆜𐆠𐇐-𐇽𐊀-𐊜𐊠-𐋐𐋠-𐋻𐌀-𐌣𐌭-𐍊𐍐-𐍺𐎀-𐎝𐎟-𐏃𐏈-𐏕𐐀-𐒝𐒠-𐒩𐒰-𐓓𐓘-𐓻𐔀-𐔧𐔰-𐕣𐕯-𐕺𐕼-𐖊𐖌-𐖒𐖔𐖕𐖗-𐖡𐖣-𐖱𐖳-𐖹𐖻𐖼𐘀-𐜶𐝀-𐝕𐝠-𐝧𐞀-𐞅𐞇-𐞰𐞲-𐞺𐠀-𐠅𐠈𐠊-𐠵𐠷𐠸𐠼𐠿-𐡕𐡗-𐢞𐢧-𐢯𐣠-𐣲𐣴𐣵𐣻-𐤛𐤟-𐤹𐤿𐦀-𐦷𐦼-𐧏𐧒-𐨃𐨅𐨆𐨌-𐨓𐨕-𐨗𐨙-𐨵𐨸-𐨿𐨺-𐩈𐩐-𐩘𐩠-𐪟𐫀-𐫦𐫫-𐫶𐬀-𐬵𐬹-𐭕𐭘-𐭲𐭸-𐮑𐮙-𐮜𐮩-𐮯𐰀-𐱈𐲀-𐲲𐳀-𐳲𐳺-𐴧𐴰-𐴹𐹠-𐹾𐺀-𐺩𐺫-𐺭𐺰𐺱𐼀-𐼧𐼰-𐽙𐽰-𐾉𐾰-𐿋𐿠-𐿶𑀀-𑁍𑁒-𑁵𑁿-𑂼𑂾-𑃂𑃐-𑃨𑃰-𑃹𑄀-𑄴𑄶-𑅇𑅐-𑅶𑆀-𑇟𑇡-𑇴𑈀-𑈑𑈓-𑈾𑊀-𑊆𑊈𑊊-𑊍𑊏-𑊝𑊟-𑊩𑊰-𑋪𑋰-𑋹𑌀-𑌃𑌅-𑌌𑌏𑌐𑌓-𑌨𑌪-𑌰𑌲𑌳𑌵-𑌹𑌻-𑍄𑍇𑍈𑍋-𑍍𑍐𑍗𑍝-𑍣𑍦-𑍬𑍰-𑍴𑐀-𑑛𑑝-𑑡𑒀-𑓇𑓐-𑓙𑖀-𑖵𑖸-𑗝𑘀-𑙄𑙐-𑙙𑙠-𑙬𑚀-𑚹𑛀-𑛉𑜀-𑜚𑜝-𑜫𑜰-𑝆𑠀-𑠻𑢠-𑣲𑣿-𑤆𑤉𑤌-𑤓𑤕𑤖𑤘-𑤵𑤷𑤸𑤻-𑥆𑥐-𑥙𑦠-𑦧𑦪-𑧗𑧚-𑧤𑨀-𑩇𑩐-𑪢𑪰-𑫸𑰀-𑰈𑰊-𑰶𑰸-𑱅𑱐-𑱬𑱰-𑲏𑲒-𑲧𑲩-𑲶𑴀-𑴆𑴈𑴉𑴋-𑴶𑴺𑴼𑴽𑴿-𑵇𑵐-𑵙𑵠-𑵥𑵧𑵨𑵪-𑶎𑶐𑶑𑶓-𑶘𑶠-𑶩𑻠-𑻸𑾰𑿀-𑿱𑿿-𒎙𒐀-𒑮𒑰-𒑴𒒀-𒕃𒾐-𒿲𓀀-𓐮𔐀-𔙆𖠀-𖨸𖩀-𖩞𖩠-𖩩𖩮-𖪾𖫀-𖫉𖫐-𖫭𖫰-𖫵𖬀-𖭅𖭐-𖭙𖭛-𖭡𖭣-𖭷𖭽-𖮏𖹀-𖺚𖼀-𖽊𖽏-𖾇𖾏-𖾟𖿠-𖿤𖿰𖿱𗀀-𘟷𘠀-𘳕𘴀-𘴈𚿰-𚿳𚿵-𚿻𚿽𚿾𛀀-𛄢𛅐-𛅒𛅤-𛅧𛅰-𛋻𛰀-𛱪𛱰-𛱼𛲀-𛲈𛲐-𛲙𛲜-𛲟𜼀-𜼭𜼰-𜽆𜽐-𜿃𝀀-𝃵𝄀-𝄦𝄩-𝅲𝅻-𝇪𝈀-𝉅𝋠-𝋳𝌀-𝍖𝍠-𝍸𝐀-𝑔𝑖-𝒜𝒞𝒟𝒢𝒥𝒦𝒩-𝒬𝒮-𝒹𝒻𝒽-𝓃𝓅-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔞-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕒-𝚥𝚨-𝟋𝟎-𝪋𝪛-𝪟𝪡-𝪯𝼀-𝼞𞀀-𞀆𞀈-𞀘𞀛-𞀡𞀣𞀤𞀦-𞀪𞄀-𞄬𞄰-𞄽𞅀-𞅉𞅎𞅏𞊐-𞊮𞋀-𞋹𞋿𞟠-𞟦𞟨-𞟫𞟭𞟮𞟰-𞟾𞠀-𞣄𞣇-𞣖𞤀-𞥋𞥐-𞥙𞥞𞥟𞱱-𞲴𞴁-𞴽𞸀-𞸃𞸅-𞸟𞸡𞸢𞸤𞸧𞸩-𞸲𞸴-𞸷𞸹𞸻𞹂𞹇𞹉𞹋𞹍-𞹏𞹑𞹒𞹔𞹗𞹙𞹛𞹝𞹟𞹡𞹢𞹤𞹧-𞹪𞹬-𞹲𞹴-𞹷𞹹-𞹼𞹾𞺀-𞺉𞺋-𞺛𞺡-𞺣𞺥-𞺩𞺫-𞺻𞻰𞻱🀀-🀫🀰-🂓🂠-🂮🂱-🂿🃁-🃏🃑-🃵🄀-🆭🇦-🈂🈐-🈻🉀-🉈🉐🉑🉠-🉥🌀-🛗🛝-🛬🛰-🛼🜀-🝳🞀-🟘🟠-🟫🟰🠀-🠋🠐-🡇🡐-🡙🡠-🢇🢐-🢭🢰🢱🤀-🩓🩠-🩭🩰-🩴🩸-🩼🪀-🪆🪐-🪬🪰-🪺🫀-🫅🫐-🫙🫠-🫧🫰-🫶🬀-🮒🮔-🯊🯰-🯹𠀀-𪛟𪜀-𫜸𫝀-𫠝𫠠-𬺡𬺰-𮯠丽-𪘀𰀀-𱍊󠄀-󠇯]
I am making notes in Notepad++ on Notepad++'s regular expressions, which supposedly use the same syntax as regular expressions in Perl, which supposedly uses the Boost library and its character classes here. However, that page leaves a lot to be desired, as what constitutes a control, graphical, or printable character is left undefined. After doing a lot of research I found that other languages define printable characters as non-control characters, and this source claims that anything adhering to the POSIX standard does the same. However, using the expression \p{cntrl}, I found that Notepad++'s Find and Replace feature will match many control characters as printable, including carriage returns, line feeds, and even form feeds. I don't have time to test \p{cntrl} against every character in Unicode, so can someone please just give me Notepad++'s definition?
I have a requirement wherein my C++ code needs to do case-insensitive comparison without worrying about whether the string is encoded, or about the type of encoding involved. The string could be ASCII or non-ASCII; I just need to store it as is and compare it with a second string, without worrying about whether the right locale is set, and so forth.
Use case: suppose my application receives a string (let's say a file name) initially as "Zoë Saldaña.txt" and stores it as is. Subsequently, it receives another string, "zoë saLdañA.txt", and a comparison between this and the first string should result in a match, using a few APIs. The same goes for the file names "abc.txt" and "AbC.txt".
I read about IBM's ICU and how it uses UTF-16 encoding by default. I'm curious to know:
1. Does ICU provide a means of solving my requirement by seamlessly handling the strings regardless of their encoding type?
2. If the answer to 1 is no, then, using ICU's APIs, is it safe to normalize all strings (both ASCII and non-ASCII) to UTF-16 and then do the case-insensitive comparison and other operations?
3. Are there alternatives that facilitate this?
I read this post, but it doesn't quite meet my requirements.
Thanks!
The requirement is impossible. Computers don't work with characters, they work with numbers. But "case insensitive" comparisons are operations that work on characters. Locales determine which numbers correspond to which characters, and are therefore indispensable.
The above isn't just true for all programming languages; it's even true for case-sensitive comparisons. The mapping from character to number isn't always unique, which means that comparing two numbers doesn't work. There could be a locale where character 42 is equivalent to character 43. In Unicode, it's even worse: there are number sequences of different lengths that are nonetheless equivalent (precomposed and decomposed characters, in particular).
Without knowing the encoding, you cannot do that. I will take one example using French accented characters and two different encodings: cp850, used as the OEM character set for Windows in the Western European zone, and the well-known ISO-8859-1 (also known as Latin-1, and not very different from the Windows-1252 ANSI character set).
In cp850, 0x96 is 'û', 0xca is '╩', and 0xea is 'Û'.
In Latin-1, 0x96 is non-printable(*), 0xca is 'Ê', and 0xea is 'ê'.
So if the string is cp850-encoded, 0xea should compare equal to 0x96 and 0xca is a different character;
but if the string is Latin-1-encoded, 0xea should compare equal to 0xca, 0x96 being a control character.
You could find similar examples with other ISO-8859-x encodings, but I only speak of languages I know.
(*) In cp1252, 0x96 is '–', the Unicode character U+2013, which is not related to 'ê'.
For UTF-8 (or other Unicode) encodings, it is possible to perform a "locale neutral" case-insensitive string comparison. This type of comparison is useful in multi-locale applications, e.g. network protocols (e.g. CIFS), international database data, etc.
The operation is possible due to Unicode metadata which clearly identifies which characters may be "folded" to/from which upper/lower case characters.
As of 2007, when I last looked, there were fewer than 2,000 upper/lower case character pairs. It was also possible to generate a perfect hash function to convert upper to lower case (and most likely vice versa as well, but I didn't try it).
At the time, I used Bob Burtle's perfect hash generator. It worked great in a CIFS implementation I was working on at the time.
There aren't many smallish, fixed sets of data out there you can point a perfect hash generator at. But this is one of 'em. :--)
Note: this is locale-neutral, so it will not support applications like German telephone books. There are a great many applications where you should definitely use locale-aware folding and collation. But there are also a large number where locale-neutral is actually preferable, especially now that folks are sharing data across so many time zones and, necessarily, cultures. The Unicode standard does a good job of defining a good set of shared rules.
If you're not using Unicode, the presumption is that you have a really good reason. As a practical matter, if you have to deal with other character encodings, you have a highly locale aware application. In which case, the OP's question doesn't apply.
See also:
The Unicode® Standard, Chapter 4, Section 4.2, Case
The Unicode® Standard, Chapter 5, Section 5.18, Case Mappings, subsection Caseless Matching
UCD - CaseFolding.txt
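To make the idea concrete, here is a rough sketch of fold-then-compare. Only a handful of fold pairs are hard-coded; a real implementation would generate the table (or the perfect hash mentioned above) from the full CaseFolding.txt:

#include <iostream>
#include <string>
#include <unordered_map>

// A tiny excerpt of Unicode simple case folding (code point -> folded
// code point); the real table has a couple of thousand entries.
const std::unordered_map<char32_t, char32_t> kFold = {
    {U'A', U'a'}, {U'B', U'b'}, {U'Z', U'z'},
    {U'\u00C0', U'\u00E0'},  // À -> à
    {U'\u00C9', U'\u00E9'},  // É -> é
};

char32_t fold(char32_t c) {
    auto it = kFold.find(c);
    return it == kFold.end() ? c : it->second;
}

// Locale-neutral, case-insensitive comparison of UTF-32 strings.
// Simple folding is 1:1, so equal lengths are required; full folding
// (e.g. ß -> "ss") and normalization are deliberately out of scope here.
bool equalsFolded(const std::u32string& a, const std::u32string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (fold(a[i]) != fold(b[i])) return false;
    return true;
}

int main() {
    std::cout << equalsFolded(U"\u00C9ZA", U"\u00E9za") << "\n"; // 1
}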
Well, first I must say that any programmer dealing with natural-language text has the utmost duty to know and understand Unicode well. Other ancient 20th-century encodings still exist, but things like EBCDIC and ASCII are not able to encode even simple English text, which may contain words like façade, naïve or fiancée, or a geographical sign, a mathematical symbol, or even emoji (conceptually, these are similar to ideograms). The majority of the world's population does not use Latin characters to write text. UTF-8 is now the prevalent encoding on the Internet, and UTF-16 is used internally by all present-day operating systems, including Windows, which unfortunately still gets it wrong. (For example, NTFS has a long-standing reported bug that allows a directory to contain two files with names that look exactly the same but are encoded in different normal forms. I get this a lot when synchronising files via FTP between Windows and macOS or Linux: all my files with accented characters get duplicated because, unlike the other systems, Windows uses a different normal form and only normalises the file names at the GUI level, not at the file-system level. I reported this against Windows 7 and the bug is still present today in Windows 10.)
If you still don't know what a normal form is, start here: https://en.wikipedia.org/wiki/Unicode_equivalence
Unicode has strict rules for lower- and uppercase conversion, and these should be followed to the letter in order for things to work nicely. First, make sure both strings use the same normal form (you should do this during input processing; the Unicode standard provides the algorithm). Please do not reinvent the wheel; use ICU's normalisation and comparison facilities. They have been extensively tested and they work correctly. Use them; IBM has made them available gratis.
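As an illustration, here is a sketch of the normalize-then-compare approach with ICU4C; the helper name equalsCaseInsensitive is mine, but caseCompare with U_FOLD_CASE_DEFAULT is ICU's locale-neutral, case-insensitive comparison:

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <unicode/ustring.h>
#include <iostream>
#include <string>

// Normalize both UTF-8 strings to NFC, then compare with default case folding.
bool equalsCaseInsensitive(const std::string& a, const std::string& b) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return false;

    icu::UnicodeString ua = nfc->normalize(icu::UnicodeString::fromUTF8(a), status);
    icu::UnicodeString ub = nfc->normalize(icu::UnicodeString::fromUTF8(b), status);
    if (U_FAILURE(status)) return false;

    return ua.caseCompare(ub, U_FOLD_CASE_DEFAULT) == 0;
}

int main() {
    // "Zoë Saldaña.txt" vs "zoë saLdañA.txt", written as explicit UTF-8 bytes.
    std::cout << equalsCaseInsensitive("Zo\xc3\xab Salda\xc3\xb1" "a.txt",
                                       "zo\xc3\xab saLda\xc3\xb1" "A.txt")
              << "\n"; // 1
}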
A note: if you plan on comparing strings for ordering, remember that collation is locale-dependent and highly influenced by the language and the context. For example, in a dictionary these Portuguese words would appear in this exact order: sabia, sabiá, sábia, sábio. The same ordering rules would not work for an address list, which would use phonetic rules to place names like Peçanha and Pessanha adjacently. The same phenomenon happens in German with ß and ss. Yes, natural language is not logical; or, better said, its rules are not simple.
C'est la vie. これが私たちの世界です。
string s="x1→(y1⊕y2)∧z3";
for(auto i=s.begin(); i!=s.end();i++){
if(*i=='→'){
...
}
}
Comparing chars like this is definitely wrong; what's the correct way to do it? I am using VS2013.
First you need some basic understanding of how programs handle Unicode. If you don't have it, you should read up; I quite like this post on Joel on Software.
You actually have 2 problems here:
Problem #1: getting the string into your program
Your first problem is getting that actual string in your string s. Depending on the encoding of your source code file, MSVC may corrupt any non-ASCII characters in that string.
Either save your C++ file as UTF-16 (which Windows confusingly calls "Unicode") and use wchar_t and std::wstring (effectively encoding the expression as UTF-16); saving as UTF-8 with a BOM will also work. With any other encoding, your L"..." character literals will contain the wrong characters.
Note that other platforms may define wchar_t as 4 bytes instead of 2. So the handling of characters above U+FFFF will be non-portable.
In all other cases, you can't just write those characters in your source file. The most portable way is encoding your string literals as UTF-8, using \x escape codes for all non-ASCII characters, like this: "x1\xe2\x86\x92" "(a\xe2\x8a\x95" "b)" rather than "x1→(a⊕b)". (The literals are split so that a following character can't be absorbed into a hex escape.)
And yes, that's as unreadable and cumbersome as it gets. The root problem is that MSVC doesn't really support using UTF-8. You can go through this question for an overview: How to create a UTF-8 string literal in Visual C++ 2008.
But, also consider how often those strings will actually show up in your source code.
Problem #2: finding the character
(If you're using UTF-16, you can just find the L'→' character, since that character is representable as one wchar_t. For characters above U+FFFF you'll have to use the wide version of the workaround below.)
It's impossible to define a single char representing the arrow character. You can, however, represent it with a string: "\xe2\x86\x92" (that's a string with 3 chars for the arrow, plus the \0 terminator).
You can now search for this string in your expression:
s.find("\xe2\x86\x92");
The UTF-8 encoding scheme guarantees this always finds the correct character, but keep in mind that the result is an offset in bytes, not characters.
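For illustration, here's a minimal, self-contained sketch of this byte-oriented search; the expression is the one from the question, written as explicit UTF-8 bytes so the source file's encoding doesn't matter:

#include <iostream>
#include <string>

int main() {
    // "x1→(y1⊕y2)∧z3" as UTF-8; the literals are split so that a
    // following character can't be absorbed into a \x escape.
    std::string s = "x1\xe2\x86\x92" "(y1\xe2\x8a\x95" "y2)\xe2\x88\xa7" "z3";
    std::string arrow = "\xe2\x86\x92"; // U+2192 RIGHTWARDS ARROW

    std::size_t pos = s.find(arrow);
    if (pos != std::string::npos)
        std::cout << "arrow found at byte offset " << pos << "\n"; // offset 2
}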
My comment is too large, so I am submitting it as an answer.
The problem is that everybody is concentrating on the issue of the different encodings that Unicode may use (UTF-8, UTF-16, UCS-2, etc.). But your problems will only begin there.
There is also an issue of composite characters, which will really mess up any search that you are trying to make.
Let's say you are looking for the character 'é'. You find it in Unicode as U+00E9 and do your search, but it is not guaranteed that this is the only way to represent this character: the document may also contain the combination U+0065 U+0301, which is exactly the same character.
Yes, not just a "character that looks the same", but exactly the same character, so any software, and even some programming libraries, will freely convert from one to the other without even telling you.
So if you wish to make a search that is robust, you will need something that represents not just the different encodings of Unicode, but Unicode characters themselves, with equivalence between precomposed and decomposed forms.
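To illustrate, a small sketch of the pitfall and the usual fix (normalize to a common form, e.g. NFC, before searching), here using ICU; both byte sequences below render as 'é':

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);

    // Precomposed U+00E9 vs decomposed U+0065 U+0301, both as UTF-8 bytes.
    icu::UnicodeString precomposed = icu::UnicodeString::fromUTF8("\xc3\xa9");
    icu::UnicodeString decomposed  = icu::UnicodeString::fromUTF8("e\xcc\x81");

    std::cout << "raw equal:        " << (precomposed == decomposed) << "\n"; // 0
    std::cout << "normalized equal: "
              << (nfc->normalize(precomposed, status) == nfc->normalize(decomposed, status))
              << "\n"; // 1
}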
English, of course, is a no-brainer for regex because that's what it was originally developed in/for:
Can regular expressions understand this character set?
French gets into some accented characters which I'm unsure how to match against - i.e. are è and e both considered word characters by regex?
Les expressions régulières peuvent comprendre ce jeu de caractères?
Japanese doesn't contain what I know as regex word characters to match against.
正規表現は、この文字を理解でき、設定?
Short answer: yes.
More specifically, it depends on your regex engine supporting Unicode matches (as described here).
Such matches can complicate your regular expressions enormously, so I recommend reading this Unicode regex tutorial (and also note that Unicode implementations themselves can be quite a mess, so you might benefit from reading Joel Spolsky's article about the inner workings of character sets).
"[\p{L}]"
This regular expression matches all characters that are letters, from any language, upper and lower case.
So letters like (a-z A-Z ä ß è 正 の文字を理解) are accepted, but signs like (, . ? > :) and other similar ones are not.
The brackets [] mean that this expression is a set.
If you want an unlimited number of letters from this set to be accepted, use an asterisk * after the brackets, like this: "[\p{L}]*"
It is always important to take care of white space in your regex, since your evaluation might fail because of it. To solve this you can use "[\p{L} ]*" (notice the white space inside the brackets).
If you want to include numbers as well, "[\p{L}\p{N} ]*" can help; \p{N} matches any kind of numeric character in any script.
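Note that C++'s std::regex does not understand \p{...}; engines such as Boost.Regex or ICU do. For example, a sketch using ICU's regex engine (the sample text is illustrative):

#include <unicode/regex.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // Letters from any script, digits from any script, and spaces.
    icu::UnicodeString pattern = icu::UnicodeString::fromUTF8("[\\p{L}\\p{N} ]*");
    icu::RegexMatcher matcher(pattern, 0, status);

    // "café 123 正" as UTF-8 bytes.
    icu::UnicodeString text =
        icu::UnicodeString::fromUTF8("caf\xc3\xa9 123 \xe6\xad\xa3");
    matcher.reset(text);
    std::cout << (matcher.matches(status) ? "matches" : "no match") << "\n";
}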
As far as I know, there isn't a specific pattern like [a-zA-Z] that you can use to match "è", but you can always match such characters separately, i.e. [a-zA-Zè正].
Obviously that can make your regex immense, but you can always control this by putting your strings into variables and passing only the variables into the expressions.
Generally speaking, regex is more suited to grokking machine-readable text than human-readable text. In many ways this is a more general version of the whole XML-with-regex problem: regex is by its very nature incapable of properly parsing human language, because the language is more complex than the tool being used to parse it.
If you want to break down human language (English included), you would want to use a language analysis tool or even an AI, not mere regular expressions.
/[\p{Latin}]/ should, for example, match the Latin alphabet. You can get the full explanation and reference here.
It is not about the regular expression but about the framework that executes it. Java and .NET, I think, are very good at handling Unicode, so there "è and e both considered word characters by regex" is true.
It depends on the implementation and the character set. In general the answer is "Yes," but it may require additional setup on your part.
In Perl, for example, the meaning of things like \w is altered by the chosen locale (use locale).
This SO thread might help. It includes the Unicode character classes you can use in a regex (e.g., \p{Ll} matches all lowercase letters, regardless of language).