Implementation feature of character shorthand \s - regex

I am wondering why in Erlang in the regex library re, the character shorthand \s only selects a whitespace (32 ASCII character) and is not the equivalent of [ \\t\\n\\r] regular expression.
At the same time, the "anti-pattern" for \s - \S(Non-space character shorthand) implements predictable behavior.
Test labs
EUnit tests for \s.
EUnit tests for [ \\t\\n\\r].
EUnit tests for \S.

I still found the answer to my question in the documentation of the re library.
For compatibility with Perl, \s did not used to match the VT character
(code 11), which made it different from the the POSIX "space" class.
However, Perl added VT at release 5.18, and PCRE followed suit at
release 8.34. The default \s characters are now HT (9), LF (10), VT
(11), FF (12), CR (13), and space (32), which are defined as white
space in the "C" locale. This list may vary if locale-specific
matching is taking place. For example, in some locales the
"non-breaking space" character (\xA0) is recognized as white space,
and in others the VT character is not.
From this, I conclude that the expected work with is possible so only if there is a set locale value - "C".
Now I understand why everything works this way - it was conceived by the developers, that is, we need to take this feature into account when implementing regular expressions in Erlang.

To overcome the implementation limitations (related to the need to take into account the locale value), I implemented a project to be able to adapt the regular expression text to the available capabilities of my software (my operating system does not have the required parameter set of locale, but I would like to continue using it).
This is a helper library re_tuner.

Related

A portable regular expression for matching alphanumerics

I'm thinking about using the regular expression [0-9a-zA-Z]+ to match any alphanumeric string in the C++ standard library's regular expression library.
But I'm worried about portability. Sure, in an ASCII character set, this will work, and I think that 0-9 must match only digits in any encoding system since the standard insists that all encodings have this property. But the C++ standard doesn't insist on ASCII encoding, so my a-zA-Z part might give me strange results on some platforms; for example those with EBCDIC encoding.
I could use \d for the digit part but that also matches Arabic numerals.
What should I use for a fully portable regular expression that only matches digits and English alphabet letters of either case?
It seems that PCRE (the current version of which is PCRE2) has support for other encoding types, including EBCDIC.
Within the source code on their website, I found "this file" with the following (formatting mine):
A program called dftables (which is distributed with PCRE2) can be used to build alternative versions of this file. This is necessary if you are running in an EBCDIC environment, or if you want to default to a different encoding, for example ISO-8859-1. When dftables is run, it creates these tables in the current locale. If PCRE2 is configured with --enable-rebuild-chartables, this happens automatically.
Well, if you're worried about supporting an exotic encodings, you can just list all characters manually:
[0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]+
This looks a bit dirty, but surely it will work everywhere.

Why does underscore comes under \w?

This may be a theoretical question.
Why does underscore _ comes under \w in regex and not under \W
I hope this isn't primarily opinion based, because there should be a reason.
Citation would be great, if at all available.
From Wikipedia's Regular expression article (emphasis mine):
An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers.
In perl, tcl and vim, this non-standard class is represented by \w (and characters outside this class are represented by \W).
\w matches any single code point that has any of the following properties:
\p{GC=Alphabetic} (letters and some more unicode points)
\p{GC=Mark} (Mark: Spacing, non-spacing, enclosing)
\p{GC=Connector_Punctuation} (e.g. underscore)
\p{GC=Decimal_Number} (numbers and other variants of numbers)
\p{Join_Control} (code points U+0200C and U+0200D)
These properties are used in the composition of programming language identifiers in scripts. For instance[1]:
The Connector Punctuation (\p{GC=Connector_Punctuation}) is added in for programming language identifiers, thus adding "_" and similar characters.
There is a[2]:
general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores.
The \p{Join_Control} was actually recently added to the character class \w as well and here's a message that perl devs exchanged for its implementation, supporting my earlier mention that \w is used to compose identifiers.

Perl tainting via regular expression

Short version
In the code below, $1 is tainted and I don't understand why.
Long version
I'm running Foswiki on a system with perl v5.14.2 with -T taint check mode enabled.
Debugging a problem with that setup, I managed to construct the following SSCCE. (Note that I edited this post, the first version was longer and more complicated, and comments still refer to that.)
#!/usr/bin/perl -T
use strict;
use warnings;
use locale;
use Scalar::Util qw(tainted);
my $var = "foo.bar_baz";
$var =~ m/^(.*)[._](.*?)$/;
print(tainted($1) ? "tainted\n" : "untainted\n");
Although the input string $var is untainted and the regular expression is fixed, the resulting capture group $1 is tainted. Which I find really strange.
The perlsec manual has this to say about taint and regular expressions:
Values may be untainted by using them as keys in a hash; otherwise the
only way to bypass the tainting mechanism is by referencing
subpatterns from a regular expression match. Perl presumes that if
you reference a substring using $1, $2, etc., that you knew what you
were doing when you wrote the pattern.
I would imagine that even if the input were tainted, the output would still be untainted. To observe the reverse, tainted output from untainted input, feels like a strange bug in perl. But if one reads more of perlsec, it also points users at the SECURITY section of perllocale. There we read:
when use locale is in effect, Perl uses the tainting mechanism (see
perlsec) to mark string results that become locale-dependent, and
which may be untrustworthy in consequence. Here is a summary of the
tainting behavior of operators and functions that may be affected by
the locale:
Comparison operators (lt, le , ge, gt and cmp) […]
Case-mapping interpolation (with \l, \L, \u or \U) […]
Matching operator (m//):
Scalar true/false result never tainted.
Subpatterns, either delivered as a list-context result or as $1
etc. are tainted if use locale (but not use locale
':not_characters') is in effect, and the subpattern regular
expression contains \w (to match an alphanumeric character), \W
(non-alphanumeric character), \s (whitespace character), or \S
(non whitespace character). The matched-pattern variable, $&, $`
(pre-match), $' (post-match), and $+ (last match) are also
tainted if use locale is in effect and the regular expression contains
\w, \W, \s, or \S.
Substitution operator (s///) […]
[⋮]
This looks like it should be an exhaustive list. And I don't see how it could apply: My regex is not using any of \w, \W, \s or \S, so it should not depend on locale.
Can someone explain why this code taints the varibale $1?
There currently is a discrepancy between the documentation as quoted in the question, and the actual implementation as of perl 5.18.1. The problem are character classes. The documentation mentions \w,\s,\W,\S in what sounds like an exhaustive list, while the implementation taints on pretty much every use of […].
The right solution would probably be somewhere in between: character classes like [[:word:]] should taint, since it depends on locale. My fixed list should not. Character ranges like [a-z] depend on collation, so in my personal opinion they should taint as well. \d depends on what a locale considers a digit, so it, too, should taint even if it is neither one of the escape sequences mentioned so far nor a bracketed class.
So in my opinion, both the documentation and the implementation need fixing. Perl devs are working on this. For progress information, please look at the perl bug report I filed.
For a fixed list of characters, one viable workaround appears to be a formulation as a disjunction, i.e. (?:\.|_) instead of [._]. It is more verbose, but should work even with the current (in my opinion buggy) perl versions.

Does regular expression \d match minus sign and/or decimal point?

I'm look at some old PERL/CGI code to debug an issue and noticed a lot of uses of:
\d - Match non-digit character
\D - Match digit character
Most online docs mention that \d is the same as [0-9], which is what I've always thought of it as. But, I've also noticed Stackoverflow Questions that mention character set difference.
Does "\d" in regex mean a digit?
Does \d also match a minus sign and/or decimal point?
I'm off to do some testing.
Does \d also match a minus sign and/or decimal point?
NO
I don't know how Perl determine whether to use Unicode or ASCII or locale by default (no flag, no use). Regardless, by declaring use re '/a'; (ASCII), or use re '/u'; (Unicode), or use re '/l'; (locale), you will clearly signify to the Perl interpreter (and human reader) which mode you want to use and avoid unexpected behaviour.
Due to the effect of modifiers, \d has at least 2 meanings:
Under effect of /a flag (ASCII), \d will match digits from 0 to 9 (no more and no less).
Under effect of /u flag (Unicode), \d will match any decimal digit in any language, and is equivalent to \p{Digit}reference. This effectively makes \d+ pretty useless and dangerous to use, since it allows a mix of digits in any languages.
Quote from description of /u flag
And, \d+ , may match strings of digits that are a mixture from different writing systems, creating a security issue. num() in Unicode::UCD can be used to sort this out. Or the /a modifier can be used to force \d to match just the ASCII 0 through 9.
\d will not match any sign or punctuation, since those characters does not belong to Nd (Number, decimal digit) General Category of Unicode.
The answer is no. It merely does a digit check. However, Unicode makes things a bit more complex.
If you want to make sure something is a number -- a decimal number -- ake a look at the Scalar::Util module. One of the functions it has is look_like_number. This can be used to see if the string you're looking at could be a number or not, and works better than trying to use a regular expression.
This module has been part of standard Perl for a while, so you should have it on your system.

Can regular expressions work with different languages?

English, of course, is a no-brainer for regex because that's what it was originally developed in/for:
Can regular expressions understand this character set?
French gets into some accented characters which I'm unsure how to match against - i.e. are è and e both considered word characters by regex?
Les expressions régulières peuvent comprendre ce jeu de caractères?
Japanese doesn't contain what I know as regex word characters to match against.
正規表現は、この文字を理解でき、設定?
Short answer: yes.
More specifically it depends on your regex engine supporting unicode matches (as described here).
Such matches can complicate your regular expressions enormously, so I can recommend reading this unicode regex tutorial (also note that unicode implementations themselves can be quite a mess so you might also benefit from reading Joel Spolsky's article about the inner workings of character sets).
"[\p{L}]"
This regular expression contains all characters that are letters, from all languages, upper and lower case.
so letters like (a-z A-Z ä ß è 正 の文字を理解) are accepted but signs like (, . ? > :) or other similar ones are not.
the brackets [] mean that this expression is a set.
If you want unlimited number of letters from this set to be accepted, use an astrix * after the brackets, like this: "[\p{L}]*"
it is always important to make sure you take care of white space in your regex. since your evaluation might fail because of white space. To solve this you can use: "[\p{L} ]*" (notice the white space inside brackets)
If you want to include the numbers as well, "[\p{L|N} ]*" can help. p{N} matches any kind of numeric character in any script.
As far as I know, there isn't any specific pattern you can use i.e. [a-zA-Z] to match "è", but you can always match them in separately, i.e. [a-zA-Zè正]
Obviously that can make your regexp immense, but you can always control this by adding your strings into variables, and only passing the variables into the expressions.
Generally speaking, regex is more for grokking machine-readable text than for human-readable text. It is in many ways a more general answer to the whole XML with regex thing; regex is by its very nature incapable of properly parsing human language, because the language is more complex than what you are using to parse it.
If you want to break down human language (English included), you would want to use a language analysis tool or even an AI, not mere regular expressions.
/[\p{Latin}]/ should for example, include Latin alphabet. You can get the full explanation and reference here.
it is not about the regular expression but about framework that executes it. java and .net i think are very good in handling unicode. so "è and e both considered word characters by regex" is true.
It depends on the implementation and the character set. In general the answer is "Yes," but it may require additional setup on your part.
In Perl, for example, the meaning of things like \w is altered by the chosen locale (use locale).
This SO thread might help. It includes the Unicode character classes you can use in a regex (e.g., [Ll] is all lowercase letters, regardless of language).