Normative reference for POSIX regex class definitions?

At http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html (which looks sort of like an official specification for POSIX), it lists the character classes that must be supported in regular expressions, including e.g. [:space:].
But where are those character classes defined? Where can I find definitively which characters [:space:] should match? I'm looking for an actual standard, not a wiki-like-page-thing or somebody's blog. Thanks.

This set is locale dependent.
The POSIX one is detailed here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html
space
Define characters to be classified as white-space characters.
In the POSIX locale, exactly <space>, <form-feed>, <newline>, <carriage-return>, <tab>, and <vertical-tab> shall be included.
In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, graph, or xdigit shall be specified. The <space>, <form-feed>, <newline>, <carriage-return>, <tab>, and <vertical-tab> of the portable character set, and any characters included in the class blank are automatically included in this class.
In addition to the previously mentioned characters, locales are free to add any number of horizontal or vertical "space" characters, for example "non-breaking space", "fixed-width space", and similar.
To find out whether a given character is part of this class in the current locale, use the isspace function.
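As a small illustration (a sketch only, not part of the quoted standard text; it uses the "C" locale as a stand-in for the POSIX locale):
#include <cctype>
#include <clocale>
#include <cstdio>

int main() {
    std::setlocale(LC_CTYPE, "C");  // the "C" locale mirrors the POSIX locale
    // Exactly these six characters are white-space in the POSIX locale;
    // 'x' is included as a counter-example.
    const char candidates[] = {' ', '\f', '\n', '\r', '\t', '\v', 'x'};
    for (char c : candidates)
        std::printf("0x%02x isspace: %d\n",
                    (unsigned char)c, std::isspace((unsigned char)c) != 0);
}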

Related

What constitutes a control (vs printable) character in Boost's regular expression 'cntrl' character class?

I am making notes in Notepad++ on Notepad++'s regular expressions, which supposedly uses the same syntax as regular expressions do in Perl, which supposedly uses the Boost library and its character classes here. However, the previous page leaves a lot to be desired, as what constitutes a control, graphical, or printable character is undefined. After doing a lot of research I found that other languages define printable characters as non-control characters, and this source claims that anything that adheres to the POSIX standard does the same. However, using the expression \p{cntrl} I found Notepad++'s Find and Replace feature will match many control characters to printable characters, including carriage returns, line feeds, and even form feeds. I don't have time to test \p{cntrl} against every character in Unicode, so can someone just please give me Notepad++'s definition?
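(Not Notepad++'s definition, but as a rough sanity check: the C/C++ "C"-locale classification, which POSIX [:cntrl:] follows, does count carriage return, line feed, and form feed as control characters rather than printable ones. A minimal sketch:)
#include <cctype>
#include <cstdio>

int main() {
    // CR, LF and FF are control characters in the "C" locale; 'a' is printable.
    const unsigned char samples[] = {'\r', '\n', '\f', 'a'};
    for (unsigned char c : samples)
        std::printf("0x%02x iscntrl=%d isprint=%d\n",
                    c, std::iscntrl(c) != 0, std::isprint(c) != 0);
}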

Can C++ variables in a cpp file be defined as special symbols like β?

Can we define variables in C++/C using special characters, such as:
double ε,µ,β,ϰ;
If yes, how can this be achieved?
As per the working draft of the C++ standard (N4713),
5.10 Identifiers [lex.name]
...
An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified in Table 2. The initial element shall not be a universal-character-name designating a character whose encoding falls into one of the ranges specified in Table 3.
And when we look at table 3:
Table 3 — Ranges of characters disallowed initially (combining characters)
0300-036F 1DC0-1DFF 20D0-20FF FE20-FE2F
The symbols you have mentioned belong to the Greek alphabet block, which ranges from U+0370 to U+03FF; the Greek Extended block ranges from U+1F00 to U+1FFF, as per Wikipedia. Both of these ranges are allowed as the initial element of an identifier.
Note that not all compilers provide support for this.
GCC 8.2 with the -std=c++17 option fails to compile it.
However, Clang 7.0 with the -std=c++17 option compiles it.
Live Demo for both GCC and Clang
Since the question is tagged Visual Studio: Just write the code as you'd expect it.
double β = 0.1;
When you save the file, Visual Studio will warn you that it needs to save the file as Unicode. Accept it, and it works. AFAICT, this also works in C mode, even though most other C99 extensions are unsupported in Visual Studio.
However, as of g++ 8.2, g++ still does not support non-ASCII characters used directly in identifiers, so the code is then effectively not portable.
Yes, you can use special characters, but not all of them. You can find the allowed ones at the link below.
There is a detailed explanation of how to build an identifier (with the list of authorized Unicode characters) on the page Identifiers - cppreference.com.
An identifier is, quoting,
an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters, and most Unicode characters (see below for details). A valid identifier must begin with a non-digit character (Latin letter, underscore, or Unicode non-digit character). Identifiers are case-sensitive (lowercase and uppercase letters are distinct), and every character is significant.
Furthermore, Unicode characters may need to be written in escaped (universal-character-name) form.
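For instance, here is a minimal sketch (assuming a compiler that accepts Unicode identifiers in a UTF-8 source file, e.g. recent Clang or MSVC); the identifier can be written either directly or in escaped universal-character-name form, and both spellings name the same variable:
int main() {
    double β = 0.1;                 // Greek letter used directly in the identifier
    double twice = \u03B2 * 2.0;    // \u03B2 is the escaped spelling of the same β
    return twice > 0.0 ? 0 : 1;
}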

How to search for a Unicode character using its code point in Sublime Text

From what I understand, Unicode characters have various representations,
e.g., code point or hex bytes (these two representations are not always the same if UTF-8 encoding is used).
If I want to search for a visible Unicode character (e.g., 汉) I can just copy it and search for it. This works even if I do not know its underlying Unicode representation. But for other characters which may not be easily visible, such as the zero-width space, that approach does not work well. For these characters, we may want to search using the code point.
My question
If I know a character's code point, how do I search for it in Sublime Text using a regular expression? I highlight Sublime Text because different editors may use different formats.
Zero-width space characters can be found via:
\x{200b}
Non-breaking space characters can be found via:
\xa0
For a Unicode character whose code point is CODE_POINT (the code point must be in hexadecimal), we can safely use a regular expression of the form \x{CODE_POINT} to search for it.
General rules
For Unicode characters whose code points fit in two hex digits, it is fine to use \x without curly braces, but for characters whose code points take more than two hex digits, you have to use \x followed by the code point in curly braces.
Some examples
For example, in order to find the character A, you can use either \x{41} or \x41 to search for it.
As another example, in order to find 我 (according to here, its code point is U+6211), you have to use \x{6211} instead of \x6211; if you use \x6211, you will not find the character 我.
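Outside Sublime Text, the two-hex-digit form also works in C++'s std::regex (whose ECMAScript grammar accepts \xhh but not the braced \x{...} extension used above), so the two-digit rule can be sanity-checked with a small sketch like this:
#include <iostream>
#include <regex>
#include <string>

int main() {
    // \x41 is the two-hex-digit escape for U+0041 ('A').
    const std::regex re("\\x41");
    std::cout << std::boolalpha
              << std::regex_search(std::string("CAB"), re) << "\n"   // true: contains 'A'
              << std::regex_search(std::string("cab"), re) << "\n";  // false
}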

A portable regular expression for matching alphanumerics

I'm thinking about using the regular expression [0-9a-zA-Z]+ to match any alphanumeric string in the C++ standard library's regular expression library.
But I'm worried about portability. Sure, in an ASCII character set, this will work, and I think that 0-9 must match only digits in any encoding system since the standard insists that all encodings have this property. But the C++ standard doesn't insist on ASCII encoding, so my a-zA-Z part might give me strange results on some platforms; for example those with EBCDIC encoding.
I could use \d for the digit part, but that may also match other digit characters, such as the Eastern Arabic numerals.
What should I use for a fully portable regular expression that only matches digits and English alphabet letters of either case?
It seems that PCRE (the current version of which is PCRE2) has support for other encoding types, including EBCDIC.
Within the source code on their website, I found "this file" with the following (formatting mine):
A program called dftables (which is distributed with PCRE2) can be used to build alternative versions of this file. This is necessary if you are running in an EBCDIC environment, or if you want to default to a different encoding, for example ISO-8859-1. When dftables is run, it creates these tables in the current locale. If PCRE2 is configured with --enable-rebuild-chartables, this happens automatically.
Well, if you're worried about supporting exotic encodings, you can just list all the characters manually:
[0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]+
This looks a bit dirty, but surely it will work everywhere.
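For instance, a short sketch with C++'s std::regex (the inputs are just illustrative):
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Every character is listed explicitly, so the class does not rely on
    // 'a'-'z' / 'A'-'Z' being contiguous in the execution character set.
    const std::regex alnum(
        "[0123456789"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz]+");
    std::cout << std::boolalpha
              << std::regex_match(std::string("Abc123"), alnum) << "\n"   // true
              << std::regex_match(std::string("Abc 123"), alnum) << "\n"; // false (space)
}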

Why does underscore come under \w?

This may be a theoretical question.
Why does underscore _ come under \w in regex and not under \W?
I hope this isn't primarily opinion-based, because there should be a reason.
Citation would be great, if at all available.
From Wikipedia's Regular expression article (emphasis mine):
An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers.
In Perl, Tcl and Vim, this non-standard class is represented by \w (and characters outside this class are represented by \W).
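In the common ASCII-only case this boils down to [A-Za-z0-9_]; a quick sketch with C++'s std::regex (which in the default locale defines \w as alnum plus underscore) shows the underscore being accepted while a hyphen is not:
#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::regex word("\\w+");  // letters, digits and underscore
    std::cout << std::boolalpha
              << std::regex_match(std::string("my_var_1"), word) << "\n"  // true: '_' is \w
              << std::regex_match(std::string("my-var"), word) << "\n";   // false: '-' is \W
}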
\w matches any single code point that has any of the following properties:
\p{GC=Alphabetic} (letters and some more Unicode points)
\p{GC=Mark} (Mark: Spacing, non-spacing, enclosing)
\p{GC=Connector_Punctuation} (e.g. underscore)
\p{GC=Decimal_Number} (numbers and other variants of numbers)
\p{Join_Control} (code points U+200C and U+200D)
These properties are used in the composition of programming language identifiers in scripts. For instance[1]:
The Connector Punctuation (\p{GC=Connector_Punctuation}) is added in for programming language identifiers, thus adding "_" and similar characters.
There is a[2]:
general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores.
The \p{Join_Control} property was actually added to the character class \w fairly recently as well, and here's a message that Perl devs exchanged about its implementation, supporting my earlier point that \w is used to compose identifiers.