What does [[:print:]] mean in the following code?
echo 255 > /sys/devices/platform/[[:print:]]*/hwmon/hwmon[[:print:]]*/pwm1
What is the difference between using [[:print:]]* and only *?
echo 255 > /sys/devices/platform/*/hwmon/hwmon*/pwm1
Is there a name for this feature, or somewhere I could read to understand it better?
[[:print:]], in either a glob-style expression or a POSIX-compliant regex, matches any printable character.
The reference for (the simplified, single-character case of) glob expressions is https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13_01. It defers to the regular expression portion of the standard, https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05, as authoritative for square-bracket expressions; that section describes [:print:] as one of the character-class expressions that all locales must provide. The details of this specific class are then given in https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_01.
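As a concrete (if contrived) way to see the difference between [[:print:]]* and a bare *, here is a sketch you could run in a scratch directory; the directory and file names are invented for the example, and the shell is assumed to be bash:
mkdir /tmp/globdemo && cd /tmp/globdemo
touch normal "$(printf '\001weird')"   # one ordinary name, one starting with a control character
echo [[:print:]]*                      # expands only to names whose first character is printable: normal
echo *                                 # expands to every non-hidden name, including the control-character one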
[[:print:]] is a POSIX class recognized by all engines today; as of Unicode 14 it matches at most 144,544 Unicode characters.
The raw representation of that set is the following UTF-8/32 character class, built from the Unicode Character Database (UCD).
It covers ALL of those characters. Some engines match fewer, but that doesn't matter: as of Unicode 14, an engine always matches this set or a subset of it, never more.
[\\a-zA-Z0-9\t-\r-/:-#\[\]-`{-~\\ -¬®-ÿĀ-ͷͺ-Ϳ΄-ΊΌΎ-ΡΣ-ԯԱ-Ֆՙ-֊֍-֏֑-ׇא-תׯ-״؆-؛؝-ۜ۞-܍ܐ-݊ݍ-ޱ߀-ߺ߽-࠭࠰-࠾ࡀ-࡛࡞ࡠ-ࡪࡰ-ࢎ࢘-ࣣ࣡-ঃঅ-ঌএঐও-নপ-রলশ-হ়-ৄেৈো-ৎৗড়ঢ়য়-ৣ০-৾ਁ-ਃਅ-ਊਏਐਓ-ਨਪ-ਰਲਲ਼ਵਸ਼ਸਹ਼ਾ-ੂੇੈੋ-੍ੑਖ਼-ੜਫ਼੦-੶ઁ-ઃઅ-ઍએ-ઑઓ-નપ-રલળવ-હ઼-ૅે-ૉો-્ૐૠ-ૣ૦-૱ૹ-૿ଁ-ଃଅ-ଌଏଐଓ-ନପ-ରଲଳଵ-ହ଼-ୄେୈୋ-୍୕-ୗଡ଼ଢ଼ୟ-ୣ୦-୷ஂஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதந-பம-ஹா-ூெ-ைொ-்ௐௗ௦-௺ఀ-ఌఎ-ఐఒ-నప-హ఼-ౄె-ైొ-్ౕౖౘ-ౚౝౠ-ౣ౦-౯౷-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹ಼-ೄೆ-ೈೊ-್ೕೖೝೞೠ-ೣ೦-೯ೱೲഀ-ഌഎ-ഐഒ-ൄെ-ൈൊ-൏ൔ-ൣ൦-ൿඁ-ඃඅ-ඖක-නඳ-රලව-ෆ්ා-ුූෘ-ෟ෦-෯ෲ-෴ก-ฺ฿-๛ກຂຄຆ-ຊຌ-ຣລວ-ຽເ-ໄໆ່-ໍ໐-໙ໜ-ໟༀ-ཇཉ-ཬཱ-ྗྙ-ྼ྾-࿌࿎-࿚က-ჅჇჍა-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚ፝-፼ᎀ-᎙Ꭰ-Ᏽᏸ-ᏽ᐀-᚜ᚠ-ᛸᜀ-᜕ᜟ-᜶ᝀ-ᝓᝠ-ᝬᝮ-ᝰᝲᝳក-៝០-៩៰-៹᠀-᠍᠏-᠙ᠠ-ᡸᢀ-ᢪᢰ-ᣵᤀ-ᤞᤠ-ᤫᤰ-᤻᥀᥄-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉ᧐-᧚᧞-ᨛ᨞-ᩞ᩠-᩿᩼-᪉᪐-᪙᪠-᪭᪰-ᫎᬀ-ᭌ᭐-᭾ᮀ-᯳᯼-᰷᰻-᱉ᱍ-ᲈᲐ-ᲺᲽ-᳇᳐-ᳺᴀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ῄῆ-ΐῖ-Ί῝-`ῲ-ῴῶ-῾ -\ ‐-\
\ -\ ⁰ⁱ⁴-₎ₐ-ₜ₠-⃀⃐-⃰℀-↋←-␦⑀-⑊①-⭳⭶-⮕⮗-ⳳ⳹-ⴥⴧⴭⴰ-ⵧⵯ⵰⵿-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⷠ-⹝⺀-⺙⺛-⻳⼀-⿕⿰-⿻ -〿ぁ-ゖ゙-ヿㄅ-ㄯㄱ-ㆎ㆐-㇣ㇰ-㈞㈠-ꒌ꒐-꓆ꓐ-ꘫꙀ-꛷꜀-ꟊꟐꟑꟓꟕ-ꟙꟲ-꠬꠰-꠹ꡀ-꡷ꢀ-ꣅ꣎-꣙꣠-꥓꥟-ꥼꦀ-꧍ꧏ-꧙꧞-ꧾꨀ-ꨶꩀ-ꩍ꩐-꩙꩜-ꫂꫛ-꫶ꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-꭫ꭰ-꯭꯰-꯹가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗיִ-זּטּ-לּמּנּסּףּפּצּ-﯂ﯓ-ﶏﶒ-ﷇ﷏ﷰ-︙︠-﹒﹔-﹦﹨-﹫ﹰ-ﹴﹶ-ﻼ!-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ¢-₩│-○�𐀀-𐀋𐀍-𐀦𐀨-𐀺𐀼𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐄀-𐄂𐄇-𐄳𐄷-𐆎𐆐-𐆜𐆠𐇐-𐇽𐊀-𐊜𐊠-𐋐𐋠-𐋻𐌀-𐌣𐌭-𐍊𐍐-𐍺𐎀-𐎝𐎟-𐏃𐏈-𐏕𐐀-𐒝𐒠-𐒩𐒰-𐓓𐓘-𐓻𐔀-𐔧𐔰-𐕣𐕯-𐕺𐕼-𐖊𐖌-𐖒𐖔𐖕𐖗-𐖡𐖣-𐖱𐖳-𐖹𐖻𐖼𐘀-𐜶𐝀-𐝕𐝠-𐝧𐞀-𐞅𐞇-𐞰𐞲-𐞺𐠀-𐠅𐠈𐠊-𐠵𐠷𐠸𐠼𐠿-𐡕𐡗-𐢞𐢧-𐢯𐣠-𐣲𐣴𐣵𐣻-𐤛𐤟-𐤹𐤿𐦀-𐦷𐦼-𐧏𐧒-𐨃𐨅𐨆𐨌-𐨓𐨕-𐨗𐨙-𐨵𐨸-𐨿𐨺-𐩈𐩐-𐩘𐩠-𐪟𐫀-𐫦𐫫-𐫶𐬀-𐬵𐬹-𐭕𐭘-𐭲𐭸-𐮑𐮙-𐮜𐮩-𐮯𐰀-𐱈𐲀-𐲲𐳀-𐳲𐳺-𐴧𐴰-𐴹𐹠-𐹾𐺀-𐺩𐺫-𐺭𐺰𐺱𐼀-𐼧𐼰-𐽙𐽰-𐾉𐾰-𐿋𐿠-𐿶𑀀-𑁍𑁒-𑁵𑁿-𑂼𑂾-𑃂𑃐-𑃨𑃰-𑃹𑄀-𑄴𑄶-𑅇𑅐-𑅶𑆀-𑇟𑇡-𑇴𑈀-𑈑𑈓-𑈾𑊀-𑊆𑊈𑊊-𑊍𑊏-𑊝𑊟-𑊩𑊰-𑋪𑋰-𑋹𑌀-𑌃𑌅-𑌌𑌏𑌐𑌓-𑌨𑌪-𑌰𑌲𑌳𑌵-𑌹𑌻-𑍄𑍇𑍈𑍋-𑍍𑍐𑍗𑍝-𑍣𑍦-𑍬𑍰-𑍴𑐀-𑑛𑑝-𑑡𑒀-𑓇𑓐-𑓙𑖀-𑖵𑖸-𑗝𑘀-𑙄𑙐-𑙙𑙠-𑙬𑚀-𑚹𑛀-𑛉𑜀-𑜚𑜝-𑜫𑜰-𑝆𑠀-𑠻𑢠-𑣲𑣿-𑤆𑤉𑤌-𑤓𑤕𑤖𑤘-𑤵𑤷𑤸𑤻-𑥆𑥐-𑥙𑦠-𑦧𑦪-𑧗𑧚-𑧤𑨀-𑩇𑩐-𑪢𑪰-𑫸𑰀-𑰈𑰊-𑰶𑰸-𑱅𑱐-𑱬𑱰-𑲏𑲒-𑲧𑲩-𑲶𑴀-𑴆𑴈𑴉𑴋-𑴶𑴺𑴼𑴽𑴿-𑵇𑵐-𑵙𑵠-𑵥𑵧𑵨𑵪-𑶎𑶐𑶑𑶓-𑶘𑶠-𑶩𑻠-𑻸𑾰𑿀-𑿱𑿿-𒎙𒐀-𒑮𒑰-𒑴𒒀-𒕃𒾐-𒿲𓀀-𓐮𔐀-𔙆𖠀-𖨸𖩀-𖩞𖩠-𖩩𖩮-𖪾𖫀-𖫉𖫐-𖫭𖫰-𖫵𖬀-𖭅𖭐-𖭙𖭛-𖭡𖭣-𖭷𖭽-𖮏𖹀-𖺚𖼀-𖽊𖽏-𖾇𖾏-𖾟𖿠-𖿤𖿰𖿱𗀀-𘟷𘠀-𘳕𘴀-𘴈𚿰-𚿳𚿵-𚿻𚿽𚿾𛀀-𛄢𛅐-𛅒𛅤-𛅧𛅰-𛋻𛰀-𛱪𛱰-𛱼𛲀-𛲈𛲐-𛲙𛲜-𛲟𜼀-𜼭𜼰-𜽆𜽐-𜿃𝀀-𝃵𝄀-𝄦𝄩-𝅲𝅻-𝇪𝈀-𝉅𝋠-𝋳𝌀-𝍖𝍠-𝍸𝐀-𝑔𝑖-𝒜𝒞𝒟𝒢𝒥𝒦𝒩-𝒬𝒮-𝒹𝒻𝒽-𝓃𝓅-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔞-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕒-𝚥𝚨-𝟋𝟎-𝪋𝪛-𝪟𝪡-𝪯𝼀-𝼞𞀀-𞀆𞀈-𞀘𞀛-𞀡𞀣𞀤𞀦-𞀪𞄀-𞄬𞄰-𞄽𞅀-𞅉𞅎𞅏𞊐-𞊮𞋀-𞋹𞋿𞟠-𞟦𞟨-𞟫𞟭𞟮𞟰-𞟾𞠀-𞣄𞣇-𞣖𞤀-𞥋𞥐-𞥙𞥞𞥟𞱱-𞲴𞴁-𞴽𞸀-𞸃𞸅-𞸟𞸡𞸢𞸤𞸧𞸩-𞸲𞸴-𞸷𞸹𞸻𞹂𞹇𞹉𞹋𞹍-𞹏𞹑𞹒𞹔𞹗𞹙𞹛𞹝𞹟𞹡𞹢𞹤𞹧-𞹪𞹬-𞹲𞹴-𞹷𞹹-𞹼𞹾𞺀-𞺉𞺋-𞺛𞺡-𞺣𞺥-𞺩𞺫-𞺻𞻰𞻱🀀-🀫🀰-🂓🂠-🂮🂱-🂿🃁-🃏🃑-🃵🄀-🆭🇦-🈂🈐-🈻🉀-🉈🉐🉑🉠-🉥🌀-🛗🛝-🛬🛰-🛼🜀-🝳🞀-🟘🟠-🟫🟰🠀-🠋🠐-🡇🡐-🡙🡠-🢇🢐-🢭🢰🢱🤀-🩓🩠-🩭🩰-🩴🩸-🩼🪀-🪆🪐-🪬🪰-🪺🫀-🫅🫐-🫙🫠-🫧🫰-🫶🬀-🮒🮔-🯊🯰-🯹𠀀-𪛟𪜀-𫜸𫝀-𫠝𫠠-𬺡𬺰-𮯠丽-𪘀𰀀-𱍊󠄀-󠇯]
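If you just want a quick sanity check of the class on your own system, rather than the full Unicode table above, something like this works; it assumes GNU grep and a UTF-8 locale:
printf 'plain text\n\001\002\n' | grep -c '^[[:print:]]*$'   # prints 1: only the first line consists entirely of printable characters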
Recently I was told that + (one or more occurrences of the previous pattern/character) is not part of basic regex, not even when written as \+.
It was on a question about maximum compatibility.
I was under the impression that ...
echo "Hello World, I am an example-text" | sed 's#[^a-z0-9]\+#.#ig'
... always results in:
Hello.World.I.am.an.example.text
But then I was told that "it replaces every character not lowercase or a digit followed by + " and that it is the same as [^a-z0-9][+].
So my real question: is there any regex definition or implementation that does not treat either x+ or x\+ the same as xx*?
POSIX "basic" regular expressions do not support + (nor ?!). Most implementations of sed add support for \+ but it's not a POSIX standard feature. If your goal is maximum portability you should avoid using it. Notice that you have to use \+ rather than the more common +.
echo "Hello World, I am an example-text" | sed 's#[^a-z0-9]\+#.#ig'
The -E flag enables "extended" regular expressions, which are a lot closer to the syntax used in Perl, JavaScript, and most other modern regex engines. With -E you don't need to have a backslash; it's simply +.
echo "Hello World, I am an example-text" | sed -E 's#[^a-z0-9]+#.#ig'
From https://www.regular-expressions.info/posix.html:
POSIX or "Portable Operating System Interface for uniX" is a collection of standards that define some of the functionality that a (UNIX) operating system should support. One of these standards defines two flavors of regular expressions. Commands involving regular expressions, such as grep and egrep, implement these flavors on POSIX-compliant UNIX systems. Several database systems also use POSIX regular expressions.
The Basic Regular Expressions or BRE flavor standardizes a flavor similar to the one used by the traditional UNIX grep command. This is pretty much the oldest regular expression flavor still in use today. One thing that sets this flavor apart is that most metacharacters require a backslash to give the metacharacter its flavor. Most other flavors, including POSIX ERE, use a backslash to suppress the meaning of metacharacters. Using a backslash to escape a character that is never a metacharacter is an error.
A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special features. Shorthands are not supported. Other features using the usual metacharacters are the dot to match any character except a line break, the caret and dollar to match the start and end of the string, and the star to repeat the token zero or more times. To match any of these characters literally, escape them with a backslash.
The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions, which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a or aa. Some implementations support \? and \+ as an alternative syntax to \{0,1\} and \{1,\}, but \? and \+ are not part of the POSIX standard. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to 9 groups are permitted. E.g. \(ab\)\1 matches abab, while (ab)\1 is invalid since there's no capturing group corresponding to the backreference \1. Use \\1 to match \1 literally.
POSIX BRE does not support any other features. Even alternation is not supported.
(Emphasis mine.)
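A small illustration of the quoting rules described in that excerpt, shown here with GNU grep:
echo 'aa'     | grep 'a\{1,2\}'    # BRE: \{ \} is the interval operator, so this matches "aa"
echo 'a{1,2}' | grep 'a{1,2}'      # BRE: unescaped braces are literal, so this matches the literal text
echo 'aa'     | grep -E 'a{1,2}'   # ERE: braces are the interval operator without backslashes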
So my real question: is there any regex definition or implementation that does not treat either x+ or x\+ the same as xx*?
I can't think of any real world language or tool that supports neither + nor \+.
In the formal mathematical definition of regular expressions there are commonly only three operations defined:
Concatenation: AB matches A followed by B.
Alternation: A|B matches either A or B.
Kleene star: R* matches 0 or more repetitions of R.
These three operations are enough to give the full expressive power of regular expressions†. Operators like ? and + are convenient in programming but not necessary in a mathematical context. If needed, they are defined in terms of the others: R? is R|ε and R+ is RR*.
† Mathematically speaking, that is. Features like back references and lookahead/lookbehind don't exist in formal language theory. Those features add additional expressive power not available in mathematical definitions of regular expressions.
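A throwaway check (GNU grep shown here) that x+ and xx* accept exactly the same strings:
printf 'b\nab\naab\n' | grep -E 'a+b'     # ERE: prints "ab" and "aab"
printf 'b\nab\naab\n' | grep -E 'aa*b'    # same result, using only *
printf 'b\nab\naab\n' | grep 'aa*b'       # and the same again in plain BRE, where + is not needed at all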
In some traditional sed implementations, you have to enable "extended" regular expressions to get support for + to mean "one or more."
For evidence of this, see: sed plus sign doesn't work
I am currently learning regular expressions. I ran into a problem whose answer I can't find on Stack Overflow, and I hope someone can help.
I use Vim on macOS, and Vim shows the line numbers.
If the file "regular_expression.txt" is:
"Open Source" is a good mechanism to develop programs.
You are the No. 1.
Then
grep -n '[:lower:]' regular_expression.txt
will return
1:"Open Source" is a good mechanism to develop programs.
2:You are the No. 1.
The command
grep -n '[[:lower:]]' regular_expression.txt
will return
2:You are the No. 1.
I can't understand the above difference, because it seems to me that [:lower:] is a set of lowercase characters, so [[:lower:]] should be the same as [:lower:]. It is also confusing that in the first case, where [:lower:] is used, all the lines in the file are returned.
POSIX character classes must be wrapped in bracket expressions.
The [:lower:] pattern is a positive bracket expression that matches a single char, :, l, o, w, e or r.
The [[:lower:]] pattern is a positive bracket expression that matches any char that is matched with the [:lower:] POSIX character class (that matches any lowercase letters).
See grep manual:
certain named classes of characters are predefined within bracket expressions... Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket expression.
If you mistakenly omit the outer brackets, and search for say, [:upper:], GNU grep prints a diagnostic and exits with status 2, on the assumption that you did not intend to search for the nominally equivalent regular expression: [:epru].
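To see the difference concretely, here is a small demonstration; the first command behaves the same everywhere, while the second depends on the implementation, as described above:
printf 'HELLO\nworld\nC:\n' | grep '[[:lower:]]'   # matches only "world"
printf 'HELLO\nworld\nC:\n' | grep '[:lower:]'     # GNU grep prints a diagnostic and exits, as the manual excerpt says;
                                                   # the asker's grep instead treated it as the set :, l, o, w, e, r
                                                   # and so matched both "world" and "C:"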
I'm thinking about using the regular expression [0-9a-zA-Z]+ to match any alphanumeric string in the C++ standard library's regular expression library.
But I'm worried about portability. Sure, in an ASCII character set, this will work, and I think that 0-9 must match only digits in any encoding system since the standard insists that all encodings have this property. But the C++ standard doesn't insist on ASCII encoding, so my a-zA-Z part might give me strange results on some platforms; for example those with EBCDIC encoding.
I could use \d for the digit part, but in some engines that also matches digits from other scripts, such as Eastern Arabic-Indic numerals.
What should I use for a fully portable regular expression that only matches digits and English alphabet letters of either case?
It seems that PCRE (the current version of which is PCRE2) has support for other encoding types, including EBCDIC.
Within the source code on their website, I found "this file" with the following (formatting mine):
A program called dftables (which is distributed with PCRE2) can be used to build alternative versions of this file. This is necessary if you are running in an EBCDIC environment, or if you want to default to a different encoding, for example ISO-8859-1. When dftables is run, it creates these tables in the current locale. If PCRE2 is configured with --enable-rebuild-chartables, this happens automatically.
Well, if you're worried about supporting exotic encodings, you can just list all the characters manually:
[0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]+
This looks a bit dirty, but surely it will work everywhere.
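A quick check of what the spelled-out class matches, using grep here purely for illustration (this does not address the encoding question itself):
echo 'foo_bar42!' | grep -oE '[0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]+'
# prints "foo" and "bar42" on separate lines; the underscore and "!" are outside the class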
Say I have an interval of characters ['A'-'Z']. I want to match every one of these characters except the letter 'F', and I need to do it through the ^ operator; I don't want to split it into two different intervals.
What is the best way to do it? I want to write something like ['A'-'Z']^'F' (all characters between A and Z except the letter F). This site can be used as a reference: http://regexr.com/
EDIT: The relation to OCaml is that I want to define a regular expression for a string literal in ocamllex that starts/ends with a double quote (") and allows characters in a certain range. Therefore I want to exclude the double quote, because it obviously ends the string. (I am not considering escaped characters for the moment.)
Since it is very rare to find two regular expressions libraries / processors with exactly the same regular expression syntax, it is important to always specify precisely which system you are using.
The tags in the question lead me to believe that you might be using ocamllex to build a scanner. In that case, according to the documentation for its regular expression syntax, you could use
['A'-'Z'] # 'F'
That's loosely based on the syntax used in flex:
[A-Z]{-}[F]
Java and Ruby regular expressions include a similar operator with very different syntax:
[A-Z&&[^F]]
If you are using a regular expression library which includes negative lookahead assertions (Perl, Python, ECMAScript/C++, and others), you could use one of those:
(?!F)[A-Z]
Or you could use a positive lookahead assertion combined with a negated character class:
(?=[A-Z])[^F]
In this simple case, both of those constructions effectively do a conjunction, but lookaround assertions are not really conjunctions. For a regular expression system which does implement a conjunction operator, see, for example, Ragel.
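If you want to try the lookahead form from a shell, grep with PCRE support (-P) can run it; this is just a sanity check of the pattern, not ocamllex syntax:
printf 'A\nF\nG\n' | grep -P '^(?!F)[A-Z]$'    # prints A and G; F is excluded by the negative lookahead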
The ocamllex syntax for character set difference is:
['A'-'Z'] # 'F'
which is equivalent to
['A'-'E' 'G'-'Z']
(?!F)[A-Z] or ((?!F)[A-Z])*
This will match any uppercase character except 'F'.
Use character class subtraction:
[A-Z&&[^F]]
The alternative of [A-EG-Z] is "OK" for a single exception, but breaks down quickly when there are many exceptions. Consider this succinct expression for consonants (non-vowels):
[B-Z&&[^EIOU]]
vs this train wreck
[B-DF-HJ-NP-TV-Z]
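For what it's worth, the "train wreck" version really does work; here is a quick check with grep, which has no && subtraction and so has to fall back on the explicit ranges:
printf 'b\ne\nt\nu\n' | grep -i '^[B-DF-HJ-NP-TV-Z]$'    # prints the consonants b and t; the vowels e and u are filtered out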
The regex below accomplishes what you want using ^ and without splitting the range into different intervals. It also resembles your original idea (['A'-'Z']^'F').
/(?=[A-Z])[^F]/ig
If only uppercase letters are allowed, simply remove the i flag.
Demo
English, of course, is a no-brainer for regex because that's what it was originally developed in/for:
Can regular expressions understand this character set?
French gets into some accented characters which I'm unsure how to match against - i.e. are è and e both considered word characters by regex?
Les expressions régulières peuvent comprendre ce jeu de caractères?
Japanese doesn't contain what I know as regex word characters to match against.
正規表現は、この文字を理解でき、設定?
Short answer: yes.
More specifically, it depends on your regex engine supporting Unicode matches (as described here).
Such matches can complicate your regular expressions enormously, so I recommend reading this Unicode regex tutorial (also note that Unicode implementations themselves can be quite a mess, so you might also benefit from reading Joel Spolsky's article about the inner workings of character sets).
"[\p{L}]"
This regular expression matches any character that is a letter, from any language, upper or lower case.
So letters like a-z, A-Z, ä, ß, è, 正, の文字を理解 are accepted, but signs like , . ? > : and similar ones are not.
The brackets [] mean that this expression is a set.
If you want an unlimited number of letters from this set to be accepted, add an asterisk * after the brackets, like this: "[\p{L}]*"
It is always important to take care of whitespace in your regex, since your match might otherwise fail because of it. To handle that you can use "[\p{L} ]*" (notice the whitespace inside the brackets).
If you want to include numbers as well, "[\p{L}\p{N} ]*" can help; \p{N} matches any kind of numeric character in any script.
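Here is a hedged shell illustration of the same idea; it assumes GNU grep with PCRE support (-P) and a UTF-8 locale:
printf 'abc è 正\nabc 123\n!?.\n' | grep -P '^[\p{L}\p{N} ]+$'
# the first two lines match (letters, digits and spaces only); the punctuation-only line does not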
As far as I know, there isn't a specific pattern like [a-zA-Z] that you can use to match "è", but you can always add such characters to the class explicitly, e.g. [a-zA-Zè正].
Obviously that can make your regexp immense, but you can always control this by adding your strings into variables, and only passing the variables into the expressions.
Generally speaking, regex is more for grokking machine-readable text than for human-readable text. It is in many ways a more general answer to the whole XML with regex thing; regex is by its very nature incapable of properly parsing human language, because the language is more complex than what you are using to parse it.
If you want to break down human language (English included), you would want to use a language analysis tool or even an AI, not mere regular expressions.
/[\p{Latin}]/ should, for example, include the Latin alphabet. You can get the full explanation and reference here.
It is not about the regular expression itself but about the framework that executes it. Java and .NET, I think, are very good at handling Unicode, so "è and e are both considered word characters by regex" is true there.
It depends on the implementation and the character set. In general the answer is "Yes," but it may require additional setup on your part.
In Perl, for example, the meaning of things like \w is altered by the chosen locale (use locale).
This SO thread might help. It includes the Unicode character classes you can use in a regex (e.g., \p{Ll} is all lowercase letters, regardless of language).
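As a concrete example of the Unicode category classes mentioned above, here is a small check; it has the same assumptions as the sketch further up (GNU grep with -P in a UTF-8 locale):
printf 'a\nA\nè\n' | grep -P '^\p{Ll}$'    # prints a and è, the lowercase letters regardless of language; the uppercase A is skipped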