I am new to regular expressions and am trying to understand what kind of string the following regular expression is meant to match:
set result [regexp "$PersonName\\|\[^\\n]*\\|\[^\\n]*\\|\\s*0x$PersonId\\|\\s*$gender" [split $outPut \n]]
What is the regular expression above trying to match? What is the value of result?
The complication here is that the regex specification is escaped to protect it from Tcl's string interpolation rules.
To detangle, you should think along these lines:
"$PersonName\\|\[^\\n]*\\|\[^\\n]*\\|\\s*0x$PersonId\\|\\s*$gender" is a double-quoted string, so the usual interpolation rules apply:
Each backslash escapes the following character;
Each $variable reference is substituted for its value;
[command ...] is substituted for the string returned by the executed command.
So each occurrence of \\ is there to produce a single '\' character in the interpolated string, and each \[ is meant to prevent Tcl from interpreting [^\n] as a command (named "^\n") to be executed.
So if we suppose that the PersonName variable contains "Joe", PersonId contains DEAD and gender contains "male", Tcl will get Joe\|[^\n]*\|[^\n]*\|\s*0xDEAD\|\s*male after performing all substitutions on the source string.
Now the resulting string is passed to the RE engine, which applies its own syntactic rules when it parses the string denoting a regex, as described in the re_syntax manual page.
According to these rules, each backslash, again, escapes the following character unless it's a special "character-entry escape" so here we have:
\s denotes "any whitespace character";
\| escapes the '|', making it lose its usual meaning (to introduce an alternation) so that it literally matches the character '|'.
The [^\n]* construct means "a longest series of zero or more characters not including the newline character". Read up on "character classes" in regexes for more info.
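The two passes described above can be imitated in Python as a rough sketch. This is not Tcl: an f-string stands in for Tcl's double-quoted string, Python's re module stands in for Tcl's RE engine, and the sample line is a made-up assumption about what the data might look like.

```python
import re

# Values assumed in the answer above.
person_name, person_id, gender = "Joe", "DEAD", "male"

# Pass 1: the host language processes the string literal.
# Each \\ collapses to one backslash and each {variable} is substituted.
pattern = f"{person_name}\\|[^\\n]*\\|[^\\n]*\\|\\s*0x{person_id}\\|\\s*{gender}"
print(pattern)

# Pass 2: the regex engine parses the resulting string.
# \| is a literal pipe, \s is whitespace, [^\n]* is a run of non-newlines.
line = "Joe| Smith | NYC | 0xDEAD| male"  # hypothetical input line
found = 1 if re.search(pattern, line) else 0
print(found)
```

Printing the pattern after the first pass is a quick way to check what the regex engine will actually receive.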
The value of result will be the number of times the regular expression matched. In the absence of the -all option, that will always be 0 or 1 (i.e., not-found/found).
Overall, that regular expression (which @kostix's answer explains well) is really ugly though. REs are a powerful tool, but you can get very confused with them very easily. Moreover, if you're splitting the output on newlines then you don't need to try to exclude them in the RE match; there will definitely be no newlines in the result of split in that case.
If we better understood what you were trying to do, we could direct you to far more effective methods of matching (e.g., using lsearch with suitable options, loading the data into an in-memory SQLite database).
I am struggling with writing a regular expression in Snowflake.
SELECT
'DEM7BZB01-123' AS SKU,
RLIKE('DEM7BZB01-123','^DEM.*\d\d$') AS regex
I would like to find all strings that start with "DEM" and end with two digits. Unfortunately the expression that I am using returns FALSE.
I was checking this expression in two regex generators and it worked.
In Snowflake, the backslash character \ is an escape character.
Reference: Escape Characters and Caveats
So you need to use 2 backslashes in a regex to express 1.
SELECT
'DEM7BZB01-123' AS SKU,
RLIKE('DEM7BZB01-123', '^DEM.*\\d\\d$') AS regex
Or you could write the regex pattern in such a way that the backslash isn't used.
For example, the pattern ^DEM.*[0-9]{2}$ matches the same as the pattern ^DEM.*\d\d$.
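A minimal Python sketch of the same effect follows. It assumes that the single backslash is consumed by the SQL string parser so that the regex engine only sees a plain d; that is my reading of the escape-character behavior described above, not something verified against Snowflake here. Python's re module is used only as a stand-in engine.

```python
import re

sku = "DEM7BZB01-123"

# Assumption: with single backslashes the string parser eats the '\',
# so the regex engine would see 'dd' instead of '\d\d'.
seen_with_single_backslash = "^DEM.*dd$"

# Doubling the backslash in a normal (non-raw) string literal leaves
# exactly one backslash for the regex engine, in Python as well.
seen_with_double_backslash = "^DEM.*\\d\\d$"

# A pattern that avoids the backslash altogether.
without_backslash = "^DEM.*[0-9]{2}$"

single_ok = re.search(seen_with_single_backslash, sku) is not None
double_ok = re.search(seen_with_double_backslash, sku) is not None
class_ok = re.search(without_backslash, sku) is not None
print(single_ok, double_ok, class_ok)
```

For this SKU the last two patterns agree, while the pattern with the lost backslashes fails because the string does not end in the letters dd.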
You need to escape your backslashes in your SQL before it can be parsed as a regex string. (sometimes it gets a bit silly with the number of backslashes needed)
Your example should look like this
RLIKE('DEM7BZB01-123','^DEM.*\\d\\d$') AS regex
RLIKE (which is an alias in Snowflake for the SQL Standard REGEXP_LIKE function) implicitly adds ^ and $ to your search pattern...
The function implicitly anchors a pattern at both ends (i.e. '' automatically becomes '^$', and 'ABC' automatically becomes '^ABC$').
so you can remove them, and that then allows you to use $$ quoting
In single-quoted string constants, you must escape the backslash character in the backslash-sequence. For example, to specify \d, use \\d. For details, see Specifying Regular Expressions in Single-Quoted String Constants (in this topic).
You do not need to escape backslashes if you are delimiting the string with pairs of dollar signs ($$) (rather than single quotes).
so you can simply use the regex DEM.*\d\d to find all strings that start with DEM and end with two digits without extra escaping, as follows
SELECT
'DEM7BZB01-123' AS SKU
, RLIKE('DEM7BZB01-123', $$DEM.*\d\d$$) AS regex
which gives
SKU |REGEX|
-------------+-----+
DEM7BZB01-123|true |
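The implicit anchoring can be pictured with Python's re.fullmatch, which also requires the whole subject to match. This is only an analogy: the helper name rlike is hypothetical, and a Python raw string plays the role of the $$ quoting.

```python
import re

# Raw string: no backslash doubling needed, similar in spirit to $$...$$.
pattern = r"DEM.*\d\d"

def rlike(subject: str, pattern: str) -> bool:
    """Hypothetical helper: match with implicit anchors at both ends."""
    return re.fullmatch(pattern, subject) is not None

print(rlike("DEM7BZB01-123", pattern))    # whole string fits the pattern
print(rlike("XDEM7BZB01-123", pattern))   # extra leading character breaks it
print(re.search(pattern, "XDEM7BZB01-123") is not None)  # unanchored search still finds it
```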
What do you call the "inner part" of a regular expression without the delimiters?
For example:
Given these regular expressions: /\d+/ and #(hello)# we can break each one down into 3 parts:
/ + \d+ + /
# + (hello) + #
We all call / or # the delimiter.
What do you call the inner part? The \d+ or (hello) part?
In this BNF https://www2.cs.sfu.ca/~cameron/Teaching/384/99-3/regexp-plg.html referenced here https://stackoverflow.com/a/265466/1315009 it seems they call the inner part the "regular expression". If that is true, then what do you call the regular expression with the delimiters attached?
The reason for asking this is Clean Code rules. I'm writing a tokenizer and I need to clearly name the "full thing" and the "inner thing" with proper names.
The regex delimiters delimit the following parts:
<action>/<pattern>(/<substitution>)/<modifiers>
Action
This part of the regex delimiter construction contains implicit (no char) or explicit (expressed with a char) information about what the regex will be doing: matching, replacing, and sometimes even if it is going to work on the entire file as in Vim. Actions are also called commands (or operators) in the POSIX tools context. The usual action chars are s and m that stand for substitution and match.
Pattern
The second part, which you called the inner part, is called a pattern (see the perlop reference). When describing the $var =~ m/mushroom/ expression, this reference explains:
The portion enclosed in '/' characters denotes the characteristic we are looking for. We use the term pattern for it.
So, when we say "regex" or "regexp" we basically refer to the regular expression pattern.
Substitution
This part only exists in substitution constructions, prefixed with the s action/command. Substitution pattern syntax is very different from regex pattern syntax, as substitutions can usually contain named or numbered backreferences, escape sequences to cancel the backreference syntax (cf. "dollar escaping"), and sometimes case-changing operators (like \l, \L...\E, \u and \U...\E).
Modifiers
Also called flags, these parts help "fine-tune" how regex engines match patterns. The most common modifiers are i (case-insensitive matching), g (global matching), and s (singleline/dotall, which makes . match across line breaks; Onigmo/Oniguruma uses m for this instead).
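Since the question is about naming things in a tokenizer, here is a small Python sketch that uses the four names above as field names. The class and function names are hypothetical, and the splitter is deliberately simplified: it assumes the same character opens and closes each part, so bracketing delimiters such as { } are not handled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RegexLiteral:
    action: str                  # "" (implicit match), "m" or "s"
    delimiter: str
    pattern: str
    substitution: Optional[str]  # only present for the "s" action
    modifiers: str

def split_regex_literal(text: str) -> RegexLiteral:
    # An explicit action is a leading m or s followed by a delimiter character.
    action = ""
    if text[:1] in ("m", "s") and len(text) > 1 \
            and not text[1].isalnum() and not text[1].isspace():
        action = text[0]
    rest = text[len(action):]
    delim = rest[0]
    wanted = 2 if action == "s" else 1   # number of delimited parts
    parts, current, i = [], [], 1
    while i < len(rest):
        ch = rest[i]
        if ch == "\\" and i + 1 < len(rest):
            current.append(rest[i:i + 2])   # keep escaped pairs intact
            i += 2
            continue
        if ch == delim and len(parts) < wanted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if len(parts) < wanted:
        raise ValueError("unterminated regex literal")
    substitution = parts[1] if action == "s" else None
    return RegexLiteral(action, delim, parts[0], substitution, "".join(current))

print(split_regex_literal(r"/\d+/"))
print(split_regex_literal("#(hello)#"))
print(split_regex_literal("s/foo/bar/gi"))
```

With this naming, the "full thing" is the regex literal and the "inner thing" is its pattern.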
I am trying to create a simple state machine in flex which has to ensure that strings spanning multiple lines must have \ for line breaks. Concretely:
"this is \
ok"
"this is not
ok"
The first one is valid. The second one is not.
I have the following state machine:
expectstring BEGIN(expectstr);
<expectstr>[^\n] {num_lines++;}
<expectstr>\ {flag = true;}
<expectstr>\n {printf("%s\n", flag ? "True" : "False");}
But when I try to compile this state machine, flex tells me that the rule with \ can not be matched. Why is that?
I have looked at this but cannot figure it out.
In flex, the following pattern matches anything other than a newline:
.
You can also write that as
[^\n]
but . is more normal.
In order to match a backslash you can write
\\
"\\"
[\\]
Again, the first would be the usual way.
It's important to understand that [...] is a way of representing a set of characters, and that most regular expression operators are just ordinary characters inside the brackets. Similarly, "..." is a way of representing a sequence of characters and most regular expression operators are just ordinary characters inside the quotes.
Thus,
[a|b] matches one character if it is an a, a |, or a b
"a|b" matches the three-character sequence a | b
and|but matches either of the three-character sequences and or but.
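The same three cases can be tried out in Python as a rough illustration. Python's re module has no "..." quoting like flex does, so re.escape is used here as a stand-in for the quoted form.

```python
import re

char_class = re.compile(r"[a|b]")        # a set: 'a', '|' or 'b'
quoted = re.compile(re.escape("a|b"))    # the literal three-character sequence a|b
alternation = re.compile(r"and|but")     # either 'and' or 'but'

print([s for s in ("a", "|", "b", "a|b") if char_class.fullmatch(s)])
print([s for s in ("a", "|", "b", "a|b") if quoted.fullmatch(s)])
print([s for s in ("and", "but", "d|b") if alternation.fullmatch(s)])
```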
Since flex lets you match regular expressions, you really don't need to manually build a state machine. Just use an appropriate regular expression. For example, the following will match strings which start and end with ", in which \ may be used to escape itself as well as newlines, and in which newlines (other than escaped ones) are illegal. I think that's your goal.
\"([^"\n\\]|\\(.|\n))*\"
You should make sure you understand how it works; there are lots of good explanations of regular expressions on the internet (and even more bad ones, so try to find one written by someone who knows what they are talking about). Here's the summary:
\"              A literal double-quote
(...)*          Any number of repetitions of:
  [^"\n\\]        Anything other than a double-quote, newline, or backslash
  |               Or
  \\              A literal backslash, followed by
  (...)           Grouping
    .               Anything other than a newline
    |               Or
    \n              A newline
\"              A literal double-quote
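To check the pattern against the two strings from the question, the same regular expression can be run through Python's re module. This is only a sanity check of the regex itself, not of flex: in Python the double quote needs no backslash, and a raw string is used so that the backslashes reach the regex engine unchanged.

```python
import re

# Same structure as the flex pattern: quote, then any number of either
# an ordinary character or a backslash followed by any character
# (including a newline), then the closing quote.
string_re = re.compile(r'"([^"\n\\]|\\(.|\n))*"')

ok = '"this is \\\nok"'      # backslash immediately before the line break
bad = '"this is not\nok"'    # bare line break inside the string

ok_matches = string_re.fullmatch(ok) is not None
bad_matches = string_re.fullmatch(bad) is not None
print(ok_matches, bad_matches)
```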
I'm curious as to what escape sequences get excluded from being matched in a Perl regular expression when interpolation is turned off, say by using an apostrophe (single-quote) as a delimiter for m'', and also why. The description of interpolation in perlop mentions that:
No interpolation is performed at this stage. Any backslashed sequences including \\ are treated at the stage to parsing regular expressions.
However, testing the escape sequences listed in perlre shows that not all escape sequences are treated the same.
So, I've tested all the simple escapes listed in the "Escape Sequences" section of the perlre, and found that some are "off" while some are "on". There appears to be a correspondence between the "on" and "off" escapes and the "character escapes" and "escape modifiers" descriptions in perlrebackslash, respectively. I haven't tested all the possible escapes listed on that page, just the ones from those two groups, thus far.
Even if I test all the possible escapes, I'm not sure I understand why some still work when interpolation is off, while others do not. Can anyone enlighten me?
update: As @tchrist suggested, here are some examples. I essentially used variations on the following shell code to test these against some user input from STDIN:
perl -e "use 5.012; while(<>) { say 'YES' if m'\t';}"
The escapes \e, \f, \n, \r, and \t, when used in a non-interpolated matching construct, such as m'\t' (etc.) will still match the special characters they escape instead of their literal string representations. This is the same matching behavior I see when I use an interpolated form of matching (e.g. m/\t/), which is what I meant by still "working".
On the other hand, modifiers like \L, \U, \l, and \u do not function the same inside of m'' as inside of m//. For example m'\uthis' does not match the input: "This is a string," while m/\uthis/ does match such an input. The first form will match the input: "\uthis is a string."
It's the difference between a single-quoted string and a double-quoted string; those rules are separate from regex patterns
so m'$foo' is like '$foo' and not like "$foo"
use Data::Dump;
$foo = 12;
dd qr/$foo/i;
dd qr'$foo'i;
__END__
qr/12/i
qr/$foo/i
so if using interpolation, you're matching 12
and if you've disabled interpolation, you're matching $, the end of line (or string) followed by foo
More on this in http://perldoc.perl.org/perlop.html#Quote-and-Quote-like-Operators
update: on a side note, in addition to Data::Dump, both Data::Dumper and Data::Dump::Streamer "dump" qr'$foo'i erroneously as qr/$foo/i
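A loose Python analogy may help separate the two layers. It is not Perl: an f-string stands in for the interpolating qr/$foo/, and a raw string stands in for the non-interpolating qr'$foo'. The last line illustrates why an escape like \t can keep working without interpolation: in this sketch it is the regex engine, not the string layer, that understands it.

```python
import re

foo = 12

# String layer substitutes the variable before the regex engine sees anything.
interpolated = re.compile(f"{foo}")

# No substitution: the engine receives '$foo', i.e. end-of-string then 'foo'.
not_interpolated = re.compile(r"$foo")

has_12 = interpolated.search("abc12") is not None
dollar_foo = not_interpolated.search("$foo") is not None

# The raw string passes backslash-t through untouched; the regex engine
# itself turns it into a tab character.
tab_found = re.search(r"\t", "a\tb") is not None

print(has_12, dollar_foo, tab_found)
```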
Sorry, but once again I need help to understand rather complicated snippet from the "Programming Perl" book. Here it is (what is obscure to me marked as bold):
patterns are parsed like double-quoted strings, all the normal double-quote conventions will work, including variable interpolation (unless you use single quotes
as the delimiter) and special characters indicated with backslash escapes. These are applied before the string is interpreted as a regular expression (This is one of the
few places in the Perl language where a string undergoes more than one pass of
processing). ...
Another consequence of this two-pass parsing is that the ordinary Perl tokener
finds the end of the regular expression first, just as if it were looking for the
terminating delimiter of an ordinary string. Only after it has found the end of the
string (and done any variable interpolation) is the pattern treated as a regular
expression. Among other things, this means you can’t “hide” the terminating
delimiter of a pattern inside a regex construct (such as a bracketed character class
or a regex comment, which we haven’t covered yet). Perl will see the delimiter
wherever it is and terminate the pattern at that point.
First, why does it say "Only after it has found the end of the string" rather than the end of the regular expression, which is what it was looking for, as stated before?
Second, what does "you can’t “hide” the terminating delimiter of a pattern inside a regex construct" mean? Why can't I hide the terminating delimiter /, when I can place it wherever I want, either in the regexp directly (/A\/C/) or in an interpolated variable (even without \):
my $s = 'A/';
my $p = 'A/C';
say $p =~ /$s/;
outputs 1.
While writing and re-reading my question, it occurred to me that this snippet may be talking about using a single quote as the regexp delimiter, in which case it all seems quite coherent. Is my assumption correct?
My appreciation.
It says "end of the string" instead of "end of the regular expression" because at that point it's treating the regex as if it were just a string.
It's trying to say that this does not work:
/foo[-/_]/
Even though normal regex metacharacters are not special inside [], Perl will see the regex as /foo[-/ and complain about an unterminated class.
It's trying to say that Perl does not parse the regex as it reads it. First it finds the end of the regex in your source code as if it were a quoted string, so the only special character is \. Then it interpolates any variables. Then it parses the result as a regular expression.
You can hide the terminating delimiter with \ because that works in ordinary strings. You can hide the delimiter inside an interpolated variable, because interpolation happens after the delimiter is found. If you use a bracketing delimiter (e.g. { } or [ ]), you can nest matching pairs of delimiters inside the regex, because q{} works like that too.
But you can't hide it inside any other regex construct.
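The first pass can be modelled with a short Python sketch. It is a simplified model of the behavior described above, not Perl's actual tokenizer; the function name is hypothetical and bracketing delimiters are not handled.

```python
import re

def find_closing_delimiter(source: str) -> int:
    """Return the index of the closing delimiter, treating only '\\' as special."""
    delim = source[0]
    i = 1
    while i < len(source):
        if source[i] == "\\":
            i += 2            # skip the escaped character, whatever it is
            continue
        if source[i] == delim:
            return i          # no knowledge of [...] or other regex syntax
        i += 1
    raise ValueError("no closing delimiter found")

source = "/foo[-/_]/"
end = find_closing_delimiter(source)
pattern = source[1:end]
print(pattern)                # the scan stops inside the character class

try:
    re.compile(pattern)
    compiled = True
except re.error:
    compiled = False          # the truncated class is not a valid regex
print(compiled)

escaped = r"/A\/C/"
print(escaped[1:find_closing_delimiter(escaped)])  # backslash hides the delimiter
```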
Say you want to match a *. You would use
m/\*/
But what if you used * as your delimiter? The following doesn't work:
m*\**
because it's interpreted as
m/*/
as seen in the following:
$ perl -e'm*\**'
Quantifier follows nothing in regex; marked by <-- HERE in m/* <-- HERE / at -e line 1.
Take the string literal
"a\"b"
It produces the string
a"b
Similarly, the match operator
m*a\*b*
produces the regex pattern
a*b
If you want to match a literal *, you have to use other means. In other words:
m*a\*b* === m/a*b/ matches pattern a*b
m*a\x{2A}b* === m/a\*b/ matches pattern a\*b
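The unescaping step can be sketched in Python under the model given in this answer, namely that a backslash in front of the delimiter is removed before the regex engine sees the pattern. The function name is hypothetical, the replace-based approach ignores corner cases such as an escaped backslash right before the delimiter, and Python spells the hex escape \x2A rather than \x{2A}.

```python
import re

def pattern_seen_by_engine(body: str, delim: str) -> str:
    """Simplified model: backslash + delimiter collapses to the delimiter."""
    return body.replace("\\" + delim, delim)

# With * as the delimiter, the backslash is used up hiding the delimiter.
star_delim = pattern_seen_by_engine(r"a\*b", "*")
# With / as the delimiter, the backslash survives and escapes the * for the engine.
slash_delim = pattern_seen_by_engine(r"a\*b", "/")
# A hex escape contains no delimiter, so it survives either way.
hex_escape = pattern_seen_by_engine(r"a\x2Ab", "*")

print(star_delim, slash_delim, hex_escape)
print(re.fullmatch(star_delim, "aaab") is not None)   # * acts as a quantifier
print(re.fullmatch(slash_delim, "a*b") is not None)   # literal *
print(re.fullmatch(hex_escape, "a*b") is not None)    # literal * via hex escape
```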