Regex to Glob and vice-versa conversion - regex

We have a requirement to convert regular expressions to CloudFront-supported globs and vice versa. Is this possible at all, and if so, how can we achieve it? I am especially unsure about the regex-to-glob direction: as I understand it, regex is a kind of superset of glob, so it might not be possible to convert every regex to a corresponding glob.

To convert from a glob you would need to write a parser that splits the pattern into an abstract syntax tree. For example, the glob *-{[0-9],draft}.docx might parse to [Anything(), "-", OneOf([Range("0", "9"), "draft"]), ".docx"].
Then you would walk the AST and output the equivalent regular expression for each node. For example, the rules you might use for this could be:
Anything() -> .*
Range(x, y) -> [x-y]
OneOf(x, y) -> (x|y)
resulting in the regular expression .*-([0-9]|draft).docx.
That's not perfect, because you also have to remember to escape any special characters; . is a special character in regular expressions, so you should escape it, yielding finally .*-([0-9]|draft)\.docx.
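As a rough illustration of the translation step, here is a minimal Python sketch. It rests on several assumptions: the function name glob_to_regex is hypothetical, it translates in a single pass instead of building an explicit AST, and it only handles *, ?, non-empty [...] and balanced {a,b} groups. It makes no claim about which glob features CloudFront actually supports.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a simple glob into a regex string (hypothetical helper)."""
    out = []
    depth = 0  # how many {...} groups are currently open
    i = 0
    while i < len(glob):
        c = glob[i]
        close = glob.find("]", i + 1) if c == "[" else -1
        if c == "*":
            out.append(".*")            # Anything() -> .*
        elif c == "?":
            out.append(".")             # any single character
        elif c == "[" and close != -1:
            # Copy the bracket expression, turning glob negation (!) into ^.
            body = glob[i + 1:close].replace("\\", "\\\\")
            if body.startswith("!"):
                body = "^" + body[1:]
            out.append("[" + body + "]")
            i = close                   # skip past the closing bracket
        elif c == "{":
            depth += 1
            out.append("(")             # OneOf(...) opens a group
        elif c == "}" and depth > 0:
            depth -= 1
            out.append(")")
        elif c == "," and depth > 0:
            out.append("|")             # option separator inside {...}
        else:
            out.append(re.escape(c))    # literal text, escaped for regex
        i += 1
    return "".join(out)


pattern = glob_to_regex("*-{[0-9],draft}.docx")
print(pattern)
print(bool(re.fullmatch(pattern, "report-3.docx")))
```

Escaping every literal character with re.escape is what takes care of the . in .docx. A production converter would also need to reject unbalanced braces and empty bracket expressions.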
Strictly speaking, not all regular expressions can be translated to glob patterns. The Kleene star operation does not exist in globbing; the simple regular expression a* (i.e., any number of a characters) cannot be translated to a glob pattern.
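The difference in what a* means can be checked quickly in Python. This is only a sketch; it uses the standard fnmatch module as a stand-in for shell-style globbing, which is an assumption and not necessarily how CloudFront behaves.

```python
import fnmatch
import re

# In a glob, "a*" means: a literal "a" followed by anything at all.
print(fnmatch.fnmatchcase("abc", "a*"))

# In a regex, "a*" means: zero or more "a" characters and nothing else.
print(re.fullmatch(r"a*", "aaa") is not None)
print(re.fullmatch(r"a*", "abc") is not None)
```

The glob * is a wildcard in its own right, while the regex * repeats the preceding item, and globbing has no operator for that repetition.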
I'm not sure what types of globs CloudFront supports (the documentation returned no hits for the term "glob"), but here is some documentation on commonly-supported shell glob pattern wildcards.
Here is a summary of some equivalent sequences:
Glob Wildcard    Regular Expression    Meaning
?                .                     Any single character
*                .*                    Zero or more characters
[a-z]            [a-z]                 Any character from the range
[!a-m]           [^a-m]                A character not in the range
[abc]            [abc]                 One of the given characters
{cat,dog,bat}    (cat|dog|bat)         One of the given options
{*.tar,*.gz}     (.*\.tar|.*\.gz)      One of the given options, considering nested wildcards
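For the simpler wildcards in this list (?, *, [seq] and [!seq]), Python's standard library already provides a translation via fnmatch.translate. The sketch below only shows that the translated pattern behaves as expected. The exact regex text it returns differs between Python versions, and brace groups such as {cat,dog,bat} are not supported by fnmatch.

```python
import fnmatch
import re

# Translate a shell-style glob into a regex string and compile it.
regex = fnmatch.translate("report-[0-9]?.txt")
compiled = re.compile(regex)

print(bool(compiled.match("report-12.txt")))
print(bool(compiled.match("report-x2.txt")))

# [!a-m] in the glob becomes a negated character class in the regex.
negated = re.compile(fnmatch.translate("[!a-m]*"))
print(bool(negated.match("zebra")), bool(negated.match("apple")))
```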

Regular expression in Snowflake - starts with string and ends with digits

I am struggling with writing a regular expression in Snowflake.
SELECT
'DEM7BZB01-123' AS SKU,
RLIKE('DEM7BZB01-123','^DEM.*\d\d$') AS regex
I would like to find all strings that start with "DEM" and end with two digits. Unfortunately, the expression that I am using returns FALSE.
I was checking this expression in two regex generators and it worked.
In Snowflake, the backslash character \ is an escape character.
Reference: Escape Characters and Caveats
So you need to use 2 backslashes in a regex to express 1.
SELECT
'DEM7BZB01-123' AS SKU,
RLIKE('DEM7BZB01-123', '^DEM.*\\d\\d$') AS regex
Or you could write the regex pattern in such a way that the backslash isn't used.
For example, the pattern ^DEM.*[0-9]{2}$ matches the same as the pattern ^DEM.*\d\d$.
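The effect can be imitated in Python as a rough sketch. The assumption here is that Snowflake's single-quoted string layer drops the backslash before the regex engine sees the pattern, so \d arrives as a plain d. The stripping step below only simulates that and is not Snowflake code.

```python
import re

sku = "DEM7BZB01-123"

with_backslash = r"^DEM.*\d\d$"
without_backslash = r"^DEM.*[0-9]{2}$"

# Simulate (assumption) what the regex engine receives when the string
# layer has already consumed the single backslashes: "^DEM.*dd$".
after_string_layer = with_backslash.replace("\\", "")

print(bool(re.search(with_backslash, sku)))      # intended pattern
print(bool(re.search(without_backslash, sku)))   # backslash-free equivalent
print(bool(re.search(after_string_layer, sku)))  # pattern with \d degraded to d
```

The backslash-free form sidesteps the problem because there is nothing left for the string layer to consume.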
You need to escape your backslashes in your SQL before it can be parsed as a regex string. (sometimes it gets a bit silly with the number of backslashes needed)
Your example should look like this
RLIKE('DEM7BZB01-123','^DEM.*\\d\\d$') AS regex
RLIKE (which is an alias in Snowflake for the SQL Standard REGEXP_LIKE function) implicitly adds ^ and $ to your search pattern...
The function implicitly anchors a pattern at both ends (i.e. '' automatically becomes '^$', and 'ABC' automatically becomes '^ABC$').
so you can remove them, and that then allows you to use $$ quoting
In single-quoted string constants, you must escape the backslash character in the backslash-sequence. For example, to specify \d, use \\d. For details, see Specifying Regular Expressions in Single-Quoted String Constants (in this topic).
You do not need to escape backslashes if you are delimiting the string with pairs of dollar signs ($$) (rather than single quotes).
so you can simply use the regex DEM.*\d\d to find all strings that start with DEM and end with two digits, without extra escaping, as follows:
SELECT
'DEM7BZB01-123' AS SKU
, RLIKE('DEM7BZB01-123', $$DEM.*\d\d$$) AS regex
which gives
SKU |REGEX|
-------------+-----+
DEM7BZB01-123|true |
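The implicit anchoring is comparable to the difference between re.fullmatch and re.search in Python. This is an analogy only, not a description of how Snowflake implements RLIKE.

```python
import re

sku = "DEM7BZB01-123"

# fullmatch behaves as if the pattern were wrapped in ^...$.
print(bool(re.fullmatch(r"DEM.*\d\d", sku)))

# A partial pattern finds a substring with search, but fails with fullmatch.
print(bool(re.search(r"DEM", sku)), bool(re.fullmatch(r"DEM", sku)))
```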

What do you call the inner part of a regex? (the one delimited by the delimiters)

What do you call the "inner part" of a regular expression without the delimiters?
For example:
Given these regular expressions: /\d+/ and #(hello)# we can break each one down into 3 parts:
/ + \d+ + /
# + (hello) + #
We all call / or # the delimiter.
What do you call the inner part? The \d+ or (hello) part?
In this BNF https://www2.cs.sfu.ca/~cameron/Teaching/384/99-3/regexp-plg.html referenced here https://stackoverflow.com/a/265466/1315009 it seems the inner part is called the "regular expression". If that is true, then what do you call the regular expression with the delimiters concatenated?
The reason for asking this is Clean Code rules. I'm writing a tokenizer and I need to clearly name the "full thing" and the "inner thing" with proper names.
The regex delimiters delimit the following parts:
<action>/<pattern>(/<substitution>)/<modifiers>
Action
This part of the regex delimiter construction contains implicit (no char) or explicit (expressed with a char) information about what the regex will be doing: matching, replacing, and sometimes even if it is going to work on the entire file as in Vim. Actions are also called commands (or operators) in the POSIX tools context. The usual action chars are s and m that stand for substitution and match.
Pattern
The second part, which you called the inner part, is called a pattern (see the perlop reference). When describing the $var =~ m/mushroom/ expression, this reference explains:
The portion enclosed in '/' characters denotes the characteristic we are looking for. We use the term pattern for it.
So, when we say "regex" or "regexp" we basically refer to the regular expression pattern.
Substitution
This part only exists in substitution constructions, prefixed with the s action/command. Substitution pattern syntax is very different from regex pattern syntax, as substitution patterns can usually contain named or numbered backreferences, escape sequences to cancel the backreference syntax (cf. "dollar escaping"), and sometimes case-changing operators (like \l, \L...\E, \u and \U...\E).
Modifiers
Also called flags, these parts help "fine-tune" how regex engines match patterns. The most common modifiers are the i case-insensitive flag, the g global-matching flag, and the s singleline/dotall modifier that makes . match across line breaks (in Onigmo/Oniguruma, the dotall modifier is m instead).
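Since the question is about naming things in a tokenizer, here is a minimal Python sketch that splits a delimited regex literal into the four parts named above. All names (RegexLiteral, split_on_delimiter, parse_regex_literal) are hypothetical, and the sketch assumes the action consists of leading letters and the delimiter is a single non-letter character.

```python
from typing import List, NamedTuple, Optional


class RegexLiteral(NamedTuple):
    action: str                  # "" (implicit match), "m", "s", ...
    delimiter: str
    pattern: str                 # the "inner part"
    substitution: Optional[str]  # only present for the s action
    modifiers: str


def split_on_delimiter(text: str, delimiter: str) -> List[str]:
    """Split on unescaped delimiters, keeping escape sequences intact."""
    parts, current, i = [], [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the backslash pair together
            i += 2
        elif text[i] == delimiter:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_regex_literal(literal: str) -> RegexLiteral:
    n = 0
    while n < len(literal) and literal[n].isalpha():
        n += 1  # leading letters form the action
    if n == len(literal):
        raise ValueError("no delimiter found")
    action, delimiter = literal[:n], literal[n]
    parts = split_on_delimiter(literal[n + 1:], delimiter)
    expected = 3 if action == "s" else 2
    if len(parts) != expected:
        raise ValueError(f"expected {expected} delimited parts, got {len(parts)}")
    substitution = parts[1] if action == "s" else None
    return RegexLiteral(action, delimiter, parts[0], substitution, parts[-1])


print(parse_regex_literal(r"/\d+/"))
print(parse_regex_literal("#(hello)#"))
print(parse_regex_literal("s/foo/bar/gi"))
```

With this naming, the "full thing" is the regex literal and the "inner thing" is its pattern field.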

Strange behavior when using regex to match parentheses in vim

I'm having some trouble understanding why a regular expression is not working. I'm searching for the phrase #Test(groups = {"broken"}), and I'm not able to find it with this expression:
#Test\(groups = {"broken"}\)
However, this expression yields results:
#Test\(.*groups = {"broken"}\)
Why is this happening? I can't see why the first expression would not work, but I understand why the second one does.
\( is used for capture groups in Vim, since Vim does not use extended ("very magic") regexes by default. If you want to search for a literal paren, use (.
The second expression works because .* matches (.
If you want to search for literal text, just prepend \V to the search pattern; then, only the backslash has special meaning and must be escaped:
/\V#Test(groups = {"broken"})
In contrast to most other regular expression dialects, many Vim atoms need to be prefixed with \ to be non-literal. To make Vim's patterns look more like Perl's, you can prepend \v; then, (...) do capture grouping (as you've expected), and you need to escape \( to match literal parentheses.
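For contrast, this is how the same search behaves in a Perl-style dialect, using Python's re module as the example. There the convention is reversed: an unescaped ( opens a group and \( is the literal. The sample text is made up for illustration, and re.escape plays roughly the role that \V plays in Vim.

```python
import re

# Hypothetical line of source code to search in.
text = 'public class A { #Test(groups = {"broken"}) }'
literal = '#Test(groups = {"broken"})'

# Unescaped, the parentheses form a group, so no literal "(" is matched.
print(re.search(literal, text))

# Escaping every special character searches for the literal text.
print(re.search(re.escape(literal), text) is not None)
```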

What is this regular expression trying to match in Tcl

I am new to regular expressions, and I am trying to understand what kind of string the following regular expression is trying to match:
set result [regexp "$PersonName\\|\[^\\n]*\\|\[^\\n]*\\|\\s*0x$PersonId\\|\\s*$gender" [split $outPut \n]]
What is the regular expression above trying to match? What is the value of result?
The complication here is that the regex specification is protected from Tcl's string interpolation rules.
To detangle, you should think along these lines:
"$PersonName\\|\[^\\n]*\\|\[^\\n]*\\|\\s*0x$PersonId\\|\\s*$gender" is a double-quoted string, so the usual interpolation rules apply:
Each backslash escapes the following character;
Each $variable reference is substituted for its value;
[command ...] is substituted for the string returned by the executed command.
So each occurrence of \\ is there to produce a single '\' character in the interpolated string, and \[ are meant to prevent Tcl from interpreting those [^\n] as commands (named "^\n") to be executed.
So if we suppose that the PersonName variable contains "Joe", PersonId contains DEAD and gender contains "male", Tcl will get Joe\|[^\n]*\|[^\n]*\|\s*0xDEAD\|\s*male after performing all substitutions on the source string.
Now the resulting string is passed to the RE engine, which applies its own syntactic rules when it parses the string denoting a regex, as described in the re_syntax manual page.
According to these rules, each backslash, again, escapes the following character unless it's a special "character-entry escape" so here we have:
\s denotes "any whitespace character";
\| escapes the '|', making it lose its usual meaning (to introduce an alternation) so that it literally matches the character '|'.
The [^\n]* construct means "a longest series of zero or more characters not including the newline character". Read up on "character classes" in regexes for more info.
The value of result will be the number of times the regular expression matched. In the absence of the -all option, that will always be 0 or 1 (i.e., not-found/found).
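The two layers can be mimicked in Python as a sketch. The variable values are the ones assumed above, while the sample line is invented for illustration. Python's re module is used here in place of Tcl's engine, on the assumption that \|, \s and [^\n] mean the same in both for this pattern.

```python
import re

person_name, person_id, gender = "Joe", "DEAD", "male"

# Layer 1: string interpolation produces the text the regex engine receives.
pattern = rf"{person_name}\|[^\n]*\|[^\n]*\|\s*0x{person_id}\|\s*{gender}"
print(pattern)

# Layer 2: the regex engine parses that text and matches it against the input.
line = "Joe|Smith|42| 0xDEAD| male"  # hypothetical input line
result = 1 if re.search(pattern, line) else 0
print(result)
```

The pattern therefore describes a pipe-separated record: the name, two arbitrary fields, a hexadecimal ID prefixed with 0x, and the gender.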
Overall, that regular expression (which @kostix's answer explains well) is really ugly though. REs are a powerful tool, but you can get very confused with them very easily. Moreover, if you're splitting the output on newlines then you don't need to try to exclude them in the RE match; there will definitely be no newlines in the result of split in that case.
If we better understood what you were trying to do, we could direct you to far more effective methods of matching (e.g., using lsearch with suitable options, loading the data into an in-memory SQLite database).

Regular expressions documentation while using grep

I am trying to find some comprehensive documentation on character classes in regular expressions that could be used while using grep. I tried
info grep
man grep
man 7 regex
but could not find all the characters classes listed down in the documentation.
I am looking for some comprehensive documentation on regex that grep uses. Is there such a documentation available?
grep has three options for regex: -E or --extended-regexp, -G or --basic-regexp, and -P or --perl-regexp.
Extended / Basic Regex Classes: Follow POSIX Classes
Perl Regex Classes: Follow Perl Classes
From the command line, POSIX regex information can be accessed via man 7 regex, whereas Perl regex documentation can be accessed via perldoc perlre.
http://linux.about.com/od/commands/l/blcmdl1_grep.htm
Finally, certain named classes of characters are predefined within
bracket expressions, as follows. Their names are self explanatory, and
they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:],
[:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].
For example, [[:alnum:]] means [0-9A-Za-z], except the latter form
depends upon the C locale and the ASCII character encoding, whereas
the former is independent of locale and character set.
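To make the named classes concrete, here is a small Python sketch that expands a few of them into explicit ranges. It deliberately uses the C-locale/ASCII interpretation mentioned in the quote, so it loses the locale independence of the named form. The table and the function name expand_posix_classes are hypothetical, and Python's re module is used only because it does not understand the named classes itself.

```python
import re

# ASCII-only (C locale) expansions for some POSIX named classes.
POSIX_CLASSES = {
    "[:alnum:]": "0-9A-Za-z",
    "[:alpha:]": "A-Za-z",
    "[:digit:]": "0-9",
    "[:lower:]": "a-z",
    "[:upper:]": "A-Z",
    "[:xdigit:]": "0-9A-Fa-f",
}


def expand_posix_classes(pattern: str) -> str:
    """Replace named classes inside bracket expressions with explicit ranges."""
    for name, chars in POSIX_CLASSES.items():
        pattern = pattern.replace(name, chars)
    return pattern


print(expand_posix_classes("[[:alnum:]]+"))
print(expand_posix_classes("[[:upper:][:digit:]_]+"))
```

Note that the outer brackets survive the expansion, which reflects the rule that the class name goes inside a bracket expression.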
When I do man grep, this is what I get:
REGULAR EXPRESSIONS
A regular expression is a pattern that describes a set of strings. Regular expressions are constructed
analogously to arithmetic expressions, by using various operators to combine smaller expressions.
grep understands two different versions of regular expression syntax: "basic" and "extended." In GNU grep,
there is no difference in available functionality using either syntax. In other implementations, basic
regular expressions are less powerful. The following description applies to extended regular expressions;
differences for basic regular expressions are summarized afterwards.
The fundamental building blocks are the regular expressions that match a single character. Most characters,
including all letters and digits, are regular expressions that match themselves. Any meta-character with
special meaning may be quoted by preceding it with a backslash.
The period . matches any single character.
Character Classes and Bracket Expressions
A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that
list; if the first character of the list is the caret ^ then it matches any character not in the list. For
example, the regular expression [0123456789] matches any single digit.
Within a bracket expression, a range expression consists of two characters separated by a hyphen. It
matches any single character that sorts between the two characters, inclusive, using the locale's collating
sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many
locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to
[abcd]; it might be equivalent to [aBbCcDd], for example. To obtain the traditional interpretation of
bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.
Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their
names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:],
[:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z],
except the latter form depends upon the C locale and the ASCII character encoding, whereas the former is
independent of locale and character set. (Note that the brackets in these class names are part of the
symbolic names, and must be included in addition to the brackets delimiting the bracket expression.) Most
meta-characters lose their special meaning inside bracket expressions. To include a literal ] place it
first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a
literal - place it last.
Anchoring
The caret ^ and the dollar sign $ are meta-characters that respectively match the empty string at the
beginning and end of a line.
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b
matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the
edge of a word. The symbol \w is a synonym for [[:alnum:]] and \W is a synonym for [^[:alnum:]].
Repetition
A regular expression may be followed by one of several repetition operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{,m} The preceding item is matched at most m times.
{n,m} The preceding item is matched at least n times, but not more than m times.
Concatenation
Two regular expressions may be concatenated; the resulting regular expression matches any string formed by
concatenating two substrings that respectively match the concatenated expressions.
Alternation
Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any
string matching either alternate expression.
Precedence
Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole
expression may be enclosed in parentheses to override these precedence rules and form a subexpression.
Back References and Subexpressions
The back-reference \n, where n is a single digit, matches the substring previously matched by the nth
parenthesized subexpression of the regular expression.
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead
use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead,
so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start
of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character
string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an
extension, but portable scripts should avoid it.
Regular expressions are awesome!