regex: practical example with ms modifier - regex

Is there a practical example of the "ms" modifiers? And when should I use them?
For example:
$data =~ /regex/ms
Thanks

Here is some sample text.
Begin 111
Match this
and This
End
Begin 222
Match this one too
End
Don't match this: Begin 333
Some stuff
End
This regex uses the s and m modifiers to match each Begin...End block while capturing the digits to Group 1:
(?sm)^Begin (\d+).*?End
(See the demo to examine the matches and captures.)
The s is important because we want the . in .*? to match characters on multiple lines. In s mode, the . can match newline characters, so it grabs characters over several lines.
The m is important because we only want the Begin to match at the beginning of the line (and the ^ allows us to do that when m is set). For instance, we don't want to match a Begin...End block in the middle of a line.
Explain Regex
(?ms)      # set flags for this block (with ^ and $ matching start and
           # end of line) (with . matching \n) (case-sensitive)
           # (matching whitespace and # normally)
^          # the beginning of a "line"
Begin      # 'Begin '
(          # group and capture to \1:
  \d+      #   digits (0-9) (1 or more times (matching the most amount
           #   possible))
)          # end of \1
.*?        # any character (0 or more times (matching the least amount
           # possible))
End        # 'End'
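The effect of each flag is easy to check by dropping one at a time. Here is a minimal sketch in Python, assuming its re module as a stand-in for the Perl-style engine (the inline (?s) and (?m) flags mean the same thing there); the variable names are only illustrative.

```python
import re

text = """Here is some sample text.
Begin 111
Match this
and This
End
Begin 222
Match this one too
End
Don't match this: Begin 333
Some stuff
End
"""

# Both flags: ^ anchors at each line start, and . may cross newlines.
both = re.findall(r"(?sm)^Begin (\d+).*?End", text)

# Only m: . stops at the newline, so "End" on a later line is never reached.
only_m = re.findall(r"(?m)^Begin (\d+).*?End", text)

# Only s: ^ matches at the very start of the text only, which is "Here".
only_s = re.findall(r"(?s)^Begin (\d+).*?End", text)

print(both)    # ['111', '222']
print(only_m)  # []
print(only_s)  # []
```

The third block (Begin 333) is skipped in every case because its Begin is not at the start of a line.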

Related

Sublime SQL REGEX highlighting

I'm trying to modify an existing language definition in Sublime Text.
It's for SQL.
Currently (\#|##)(\w)* is being used to match against locally declared parameters (e.g. #MyParam) and also system parameters (e.g. ##Identity or ##FETCH_STATUS).
I'm trying to split these into 2 groups.
System parameters I can get with (\#\#)(\w)*, but I'm having problems with the local parameters.
(\#)(\w)* matches both #MyParam and ##Identity.
(^\#)(\w)* only highlights the first match (i.e. #MyParam but not #MyParam2).
Can someone help me with the correct regex?
Try the below regex to capture local parameters and system parameters into two separate groups.
(?<=^| )(#\w*)(?= |$)|(?<=^| )(##\w*)(?= |$)
DEMO
Update:
Sublime Text 2 supports \K (which drops the text matched so far from the reported match):
(?m)(?:^| )\K(#\w*)(?= |$)|(?:^| )\K(##\w*)(?= |$)
DEMO
Explanation:
(?m)         set flags for this block (with ^ and $ matching start and
             end of line) (case-sensitive) (with . not matching \n)
             (matching whitespace and # normally)
(?:          group, but do not capture:
  ^            the beginning of a "line"
  |            OR
               ' '
)            end of grouping
\K           '\K' (resets the starting point of the reported match)
(            group and capture to \1:
  #            '#'
  \w*          word characters (a-z, A-Z, 0-9, _) (0 or more times)
)            end of \1
(?=          look ahead to see if there is:
               ' '
  |            OR
  $            before an optional \n, and the end of a "line"
)            end of look-ahead
|            OR
(?:          group, but do not capture:
  ^            the beginning of a "line"
  |            OR
               ' '
)            end of grouping
\K           '\K' (resets the starting point of the reported match)
(            group and capture to \2:
  ##           '##'
  \w*          word characters (a-z, A-Z, 0-9, _) (0 or more times)
)            end of \2
(?=          look ahead to see if there is:
               ' '
  |            OR
  $            before an optional \n, and the end of a "line"
)            end of look-ahead

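To see the two-group split outside of Sublime, here is a rough sketch in Python. Python's re module has neither \K nor variable-width lookbehind, so this sketch swaps in (?<!\S) and (?!\S) for the "start of line or space" and "space or end of line" checks. That substitution is an assumption of this example, not part of the answer above, and it also treats tabs and newlines as separators. The sample text is made up.

```python
import re

# Group 1: local parameters (#name). Group 2: system parameters (##name).
# (?<!\S) means "not preceded by a non-space", (?!\S) "not followed by one".
PARAM = re.compile(r"(?<!\S)(#\w*)(?!\S)|(?<!\S)(##\w*)(?!\S)")

sample = "SET #MyParam = ##Identity\nSET #MyParam2 = ##FETCH_STATUS"

local_params = []
system_params = []
for m in PARAM.finditer(sample):
    if m.group(1) is not None:
        local_params.append(m.group(1))
    else:
        system_params.append(m.group(2))

print(local_params)   # ['#MyParam', '#MyParam2']
print(system_params)  # ['##Identity', '##FETCH_STATUS']
```

The first alternative cannot swallow ##Identity: after the first #, \w* matches nothing and the lookahead then sees the second #, so only the second alternative succeeds.
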
Regex Help, anti-query replacement

How do I remove any lines that have 3 or fewer slashes, but keep the longer links?
A. http://two/three/four
B. http://two/three
C. http://two
A would stay; nothing else would.
Thanks
Search: (?m)^(?:[^/]*/){0,3}[^/]*$
Replace: ""
On the demo, see how only the lines with 3 or fewer slashes are matched. These are the ones to nix.
Explain Regex
(?m)       # set flags for this block (with ^ and $ matching start and
           # end of line) (case-sensitive) (with . not matching \n)
           # (matching whitespace and # normally)
^          # the beginning of a "line"
(?:        # group, but do not capture (between 0 and 3 times (matching
           # the most amount possible)):
  [^/]*    #   any character except: '/' (0 or more times (matching the
           #   most amount possible))
  /        #   '/'
){0,3}     # end of grouping
[^/]*      # any character except: '/' (0 or more times (matching the
           # most amount possible))
$          # before an optional \n, and the end of a "line"
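As a quick check of the pattern, here is a small Python sketch (an illustration only; the variable names are made up). It applies the same expression line by line with fullmatch, and also runs the (?m) form over the joined text to show which lines would be removed.

```python
import re

# A line with at most 3 slashes: up to three "stuff/" chunks, then no more slashes.
SHORT_LINK = re.compile(r"(?:[^/]*/){0,3}[^/]*")

lines = [
    "A. http://two/three/four",
    "B. http://two/three",
    "C. http://two",
]

# Keep only the lines the pattern cannot cover completely (4 or more slashes).
kept = [line for line in lines if not SHORT_LINK.fullmatch(line)]

# The multiline form from the answer, run over the whole text.
matched = re.findall(r"(?m)^(?:[^/]*/){0,3}[^/]*$", "\n".join(lines))

print(kept)     # ['A. http://two/three/four']
print(matched)  # ['B. http://two/three', 'C. http://two']
```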
sed
You can use following sed command to do that, assuming your lines are in foo.txt:
sed -n '/\(.*\/\)\{4,\}/p' foo.txt
The -n option suppresses the default output, so only lines matching the pattern between the /s are printed, thanks to the p command at the end of the sed expression.
The pattern is: at least 4 occurrences of /, each one potentially preceded by any other string.

How to express this in regular expression?

I need to capture all Upper and Lower case character strings other than "FOO" and "BAR". How to do this?
I tried [^(^FOO$)(^BAR$)] but it doesn't work.
Update:
Actually, I'm using this in a larger context: I am concatenating it with another regex:
["(\w)+": _this_regex_ ]
For example, ["abc":FOO] shouldn't be matched.
All other values, say ["abc":BAZ], should match.
You want a negative look ahead:
\["(\w+)"\s*:\s*(?!FOO\b|BAR\b)(\w+)]
The (\w+) parts are capturing groups: they store the key and the value in variables (I guess that's what you want to do?)
(?!...) is a negative lookahead: it will cause the regex to fail if what's inside matches.
\b is a word boundary: here it makes the lookahead match (and so fail the regex) only if FOO is followed by a non-word character (so ["foo": FOOLISH] will be accepted by the regex).
\s is shorthand for any whitespace character (spaces, tabs, newlines etc.).
Demo: http://regex101.com/r/fM3uZ7
What you tried, [^...], is a negated character class: it matches any single character that's not listed inside the brackets. And keep in mind that inside a character class only ], \, ^ and - are special characters (so $ means a literal $, and so on).
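A short Python sketch of the negative-lookahead pattern above, with a few made-up inputs, shows which pairs get through. The closing bracket is escaped here for clarity; the sample strings are assumptions for illustration.

```python
import re

# Key in group 1, value in group 2; the value must not be exactly FOO or BAR.
PAIR = re.compile(r'\["(\w+)"\s*:\s*(?!FOO\b|BAR\b)(\w+)\]')

samples = ['["abc":FOO]', '["abc": BAR]', '["abc":BAZ]', '["foo": FOOLISH]']

results = {}
for s in samples:
    m = PAIR.search(s)
    results[s] = m.groups() if m else None

for s, r in results.items():
    print(s, "->", r)
```

FOOLISH is accepted because \b requires the word to end right after FOO, which it does not.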
Have a try with:
(?i)^(?!.*foo)(?!.*bar)
In action within a Perl script:
use strict;
use warnings;
use feature 'say';    # say is not enabled by default
my $re = qr~(?i)^(?!.*foo)(?!.*bar)~;
while (<DATA>) {
    chomp;
    say( /$re/ ? "OK : $_" : "KO : $_" );
}
__DATA__
["abc":FOO]
["abc":BAZ]
output:
KO : ["abc":FOO]
OK : ["abc":BAZ]
Explanation:
The regular expression:
(?i)^(?!.*foo)(?!.*bar)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?i) set flags for this block (case-
insensitive) (with ^ and $ matching
normally) (with . not matching \n)
(matching whitespace and # normally)
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
foo 'foo'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
bar 'bar'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------

I can't find proper regexp

I have the following file (in this format, but much longer):
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
There are 2 words (e.g. LSE ZTX) on every line, with optional spaces and/or tabs at the beginning and at the end, and always some in between.
Could someone help me match these 2 words with a regexp? Following the example, I wish to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second, etc.
I have tried something like:
$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;
I don't know how to say that there could be either spaces or tabs (or both of them mixed, e.g. \t\s\t).
If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:
my ($word1, $word2) = $line =~ /\S+/g;
This will capture the first two words in $line into the variables, if they exist. Note that parentheses are not required when using the /g modifier. Use an array instead if you want to capture all existing matches.
Since there are always two words, you don't need to match the entire line, so the simplest regex would be:
/(\w+)\s+(\w+)/
I think this is what you want
^\s*([A-Z]+)\s+([A-Z]+)
See it here on Regexr, you find the first code of a row in group 1 and the second in group 2. \s is a whitespace character, it includes e.g. spaces, tabs and newline characters.
In Perl it is something like this:
($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;
I think you are reading the text file row by row, so you don't need the modifiers s and m, and g is also not needed.
In case the codes are not only ASCII letters, then replace [A-Z] with \p{L}. \p{L} is a Unicode property that will match every letter in every language.
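For comparison, here is the same capture written as a small Python sketch. The sample lines are invented to exercise leading, trailing and mixed whitespace. Python's built-in re module does not support \p{L}, so this sticks to [A-Z] with the case-insensitive flag.

```python
import re

# Optional leading whitespace, first code, at least one whitespace, second code.
CODE_PAIR = re.compile(r"^\s*([A-Z]+)\s+([A-Z]+)", re.IGNORECASE)

lines = ["  LSE\tZTX  ", "SWX   ZURN", "\tLSE ZYT", "NYSE \t CGI "]

pairs = []
for line in lines:
    m = CODE_PAIR.match(line)
    if m:
        pairs.append(m.groups())  # (group 1, group 2)

print(pairs)
# [('LSE', 'ZTX'), ('SWX', 'ZURN'), ('LSE', 'ZYT'), ('NYSE', 'CGI')]
```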
\s also matches tabs, so your regex could look like:
$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
The first word is in the first group ($1) and the second in $2.
You can change [A-Z] to whatever suits your needs better.
Here is the explanation from YAPE::Regex::Explain
The regular expression:
(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
With option "Multiline" this Regex:
^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$
Will give you N matches each containing 2 groups named:
- word1
- word2
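A rough Python equivalent of the named-group version, run over the whole text at once with the multiline flag. Python spells named groups (?P<name>...), and the sample text is made up. Because \s can also match a newline, this relies on every line really holding two words, as in the question.

```python
import re

TWO_WORDS = re.compile(
    r"^\s*(?P<word1>\S+)\s+(?P<word2>\S+)\s*$",
    re.MULTILINE,  # let ^ and $ match at every line, not only at the text ends
)

text = "  LSE\tZTX\nSWX   ZURN  \n\tLSE ZYT\nNYSE \t CGI"

named_pairs = [(m.group("word1"), m.group("word2")) for m in TWO_WORDS.finditer(text)]

print(named_pairs)
# [('LSE', 'ZTX'), ('SWX', 'ZURN'), ('LSE', 'ZYT'), ('NYSE', 'CGI')]
```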
^\s*([A-Z]{3,4})\s+([A-Z]{3,4})$
What this does
^ // Matches the beginning of a string
\s* // Matches a space/tab character zero or more times
([A-Z]{3,4}) // Matches any letter A-Z either 3 or 4 times and captures to $1
\s+ // Then matches at least one tab or space
([A-Z]{3,4}) // Matches any letter A-Z either 3 or 4 times and captures to $2
$ // Matches the end of a string
You can use split here:
use strict;
use warnings;
while (<DATA>) {
my ( $word1, $word2 ) = split;
print "($word1, $word2)\n";
}
__DATA__
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Output:
(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
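The same no-regex approach carries over to other languages. As a sketch in Python (sample text invented), str.split() with no arguments also splits on runs of whitespace and ignores leading and trailing whitespace:

```python
text = "  LSE\tZTX\nSWX   ZURN  \n\tLSE ZYT\nNYSE \t CGI"

# One tuple of words per non-blank line.
split_pairs = [tuple(line.split()) for line in text.splitlines() if line.strip()]

for word1, word2 in split_pairs:
    print(f"({word1}, {word2})")
```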
Assuming the leading whitespace is what identifies the lines holding the codes you want, split your string up at newlines, then try this regex:
^\s+(\w+\s+){2}$
This will only match lines that start with some whitespace, followed by a word, some whitespace and another word, and that then end with some whitespace.
# ^ --> String start
# \s+ --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some whitespace) x2
# $ --> String end.
However, if you want to capture the codes alone, try this:
$line =~ /^\s*(\w+)\s+(\w+)/;
# \s* --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+ --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),
This will match all your codes
/[A-Z]+/

What is this regex doing?

/^([a-z]:)?\//i
I don't quite understand what the ? is doing in this regex. If I had to explain it from what I understand:
Match at the beginning; group 1 is a letter a to z followed by :, then the ? outside the group (which I don't get), then \/ which matches a /, and the option /i makes it case-insensitive.
I realize that this will return 0 or 1, but I'm not quite sure why, because of the ?
Is this meant to match a directory path or something?
If I test it:
$var = 'test' would get 0 while $var ='/test'; would get 1 but $var = 'test/' gets 0
So anything that begins with / will get 1 and anything else 0.
Can someone explain this regex in elementary terms to me?
See YAPE::Regex::Explain:
#!/usr/bin/perl
use strict; use warnings;
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/^([a-z]:)?\//i)->explain;
The regular expression:
(?i-msx:^([a-z]:)?/)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?i-msx: group, but do not capture (case-insensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1 (optional
(matching the most amount possible)):
----------------------------------------------------------------------
[a-z] any character of: 'a' to 'z'
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
)? end of \1 (NOTE: because you're using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
It matches a lower- or upper case letter ([a-z] with the i modifier) positioned at the start of the input string (^) followed by a colon (:) all optionally (?), followed by a forward slash \/.
In short:
^ # match the beginning of the input
( # start capture group 1
[a-z] # match any character from the set {'A'..'Z', 'a'..'z'} (with the i-modifier!)
: # match the character ':'
)? # end capture group 1 and match it once or none at all
\/ # match the character '/'
? will match one or none of the preceding pattern.
? Match 1 or 0 times
See also: perldoc perlre
Explanation:
/.../i # case insensitive
^(...) # match at the beginning of the string
[a-z]: # one character between 'a' and 'z' followed by a colon
(...)? # zero or one time of the group, enclosed in ()
So in English: match anything which begins with a / (slash), or with some letter followed by a colon followed by a /. This looks like it matches absolute pathnames on both Unix and Windows, e.g.
it would match:
/home/user
and
C:/Applications
etc.
It looks like it is looking for a "rooted" path. It will successfully match any string that either starts with a forward slash (/test), or a drive letter followed by a colon, followed by a forward slash (c:/test).
Specifically, the question mark makes something optional. It applies to the part in parentheses, which is a letter followed by a colon.
Things that will match:
C:/
a:/
/
(That last item above is why the ? is there)
Things that will not match:
C:
a:
ab:/
a/
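The lists above can be checked mechanically. A minimal Python sketch (the helper name is_rooted is made up for this example) using the same pattern with the case-insensitive flag:

```python
import re

# Optional drive letter plus colon, then a mandatory forward slash, at the start.
ROOTED = re.compile(r"^([a-z]:)?/", re.IGNORECASE)

def is_rooted(path: str) -> bool:
    """Return True if path starts with '/' or with a drive letter, ':' and '/'."""
    return ROOTED.match(path) is not None

should_match = ["C:/", "a:/", "/", "/home/user", "C:/Applications", "/test"]
should_not_match = ["C:", "a:", "ab:/", "a/", "test", "test/"]

print([p for p in should_match if is_rooted(p)])
print([p for p in should_not_match if is_rooted(p)])  # []
```

When the drive letter is absent, group 1 simply does not participate in the match, which is what the ? allows.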