I found this regex and want to understand it. Are there any regex decompilers that will translate what the following regex does into words? It is really complicated.
$text =~ /(((\w)\W*(?{$^R.(0+( q{a}lt$3))})) {8}(?{print +pack"B8" ,$^Rand ""})) +/x;
Using YAPE::Regex::Explain (not sure if it is good, but it's the first result in searching):
use YAPE::Regex::Explain;
my $REx = qr/(((\w)\W*(?{$^R.(0+( q{a}lt$3))})) {8}(?{print +pack"B8" ,$^Rand ""})) +/x;
my $exp = YAPE::Regex::Explain->new($REx)->explain;
print $exp;
I've got the explanation as:
( group and capture to \1 (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
( group and capture to \2 (8 times):
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\W* non-word characters (all but a-z, A-Z,
0-9, _) (0 or more times (matching the
most amount possible))
----------------------------------------------------------------------
(?{$^R.(0+( run this block of Perl code
q{a}lt$3))})
----------------------------------------------------------------------
){8} end of \2 (NOTE: because you are using a
quantifier on this capture, only the
LAST repetition of the captured pattern
will be stored in \2)
----------------------------------------------------------------------
(?{print +pack"B8" run this block of Perl code
,$^Rand ""})
----------------------------------------------------------------------
)+ end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
There are 2 blocks of Perl code, which must be analyzed independently.
In the first block:
$^R . (0 + (q{a} lt $3))
here, $^R is "the result of evaluation of the last successful (?{ code }) regular expression assertion", and the expression (0 + (q{a} lt $3)) gives 1 if the 3rd capture is in [b-z], 0 otherwise.
In the second block:
print +pack "B8", $^R and ""
It interprets the string accumulated in $^R as a (big-endian) binary number, converts that number to the corresponding character, and prints it. The trailing and "" makes the block evaluate to an empty string, so $^R starts out empty again for the next group of eight.
Together, the regex takes the word characters eight at a time (skipping any non-word characters between them) and treats those in [b-z] as the binary digit 1 and all others as 0. Each group of 8 binary digits is then interpreted as a character code, and that character is printed.
For instance, the letter 'H' = 0b01001000 would be printed when matching the string
$test = 'OvERfLOW';
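To make the decoding concrete, here is a rough Python re-implementation of what the two code blocks compute. It is only a sketch, not a translation of the Perl: the function name decode is made up for illustration, it assumes ASCII word characters, and it ignores any backtracking subtleties of the original pattern.

```python
import re


def decode(text: str) -> str:
    """Turn every 8 word characters into one output character."""
    # Keep only word characters; the original pattern skips \W* between them.
    chars = re.findall(r"\w", text, flags=re.ASCII)
    out = []
    # Only complete groups of 8 produce a character.
    for i in range(0, len(chars) - len(chars) % 8, 8):
        # A character that sorts after "a" (i.e. b-z) counts as 1, anything else as 0.
        bits = "".join("1" if "a" < c else "0" for c in chars[i:i + 8])
        out.append(chr(int(bits, 2)))
    return "".join(out)


print(decode("OvERfLOW"))
```

With the sample string, the bits come out as 01001000, which is the code for 'H'.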
I'm not sure what all is in that statement, but for regex analysis I use this site:
http://xenon.stanford.edu/~xusch/regexp/analyzer.html
I always found OptiPerl's Regex Editor to be really good at this type of thing
Related
I have the following file(like this scheme, but much longer):
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Every line contains 2 words (e.g. LSE ZTX), with optional spaces and/or tabs at the beginning and at the end, and always some in between.
Could someone help me to match these 2 words with regexp? Following the example I wish to have LSE in $1 and ZTX in $2 for the first line, SWX in $1 and ZURN in $2 for the second etc.
I have tried something like:
$line =~ /(\t|\s)*?(.*?)(\t|\s)*?(.*?)/msgi;
$line =~ /[\t*\s*]?(.*?)[\t*\s*]?(.*?)/msgi;
I don't know how to say that there could be either spaces or tabs (or both mixed, e.g. \t\s\t).
If you want to just match the two first words, the most basic thing is to just match any sequence of characters that are not whitespace:
my ($word1, $word2) = $line =~ /\S+/g;
This will capture the first two words in $line into the variables, if they exist. Note that capturing parentheses are not required in the pattern when using the /g modifier in list context: each whole match is returned. Use an array instead if you want to capture all existing matches.
Since there are always two words, you don't need to match the entire line, so your simplest regex would be:
/(\w+)\s+(\w+)/
I think this is what you want
^\s*([A-Z]+)\s+([A-Z]+)
See it here on Regexr, you find the first code of a row in group 1 and the second in group 2. \s is a whitespace character, it includes e.g. spaces, tabs and newline characters.
In Perl it is something like this:
($code1, $code2) = $line =~ /^\s*([A-Z]+)\s+([A-Z]+)/i;
I think you are reading the text file row by row, so you don't need the modifiers s and m, and g is also not needed.
In case the codes are not only ASCII letters, then replace [A-Z] with \p{L}. \p{L} is a Unicode property that will match every letter in every language.
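For comparison, the same pattern can be tried in Python, whose re syntax is close enough to Perl's for this case. This is a minimal sketch; the names PAIR and parse_pair are made up for the example, and the sample lines are variations on the question's data with tabs and spaces added.

```python
import re

# Same pattern as above; re.IGNORECASE plays the role of Perl's /i.
PAIR = re.compile(r"^\s*([A-Z]+)\s+([A-Z]+)", re.IGNORECASE)


def parse_pair(line):
    """Return (code1, code2) for a line, or None if it does not match."""
    m = PAIR.match(line)
    return m.groups() if m else None


for line in ["LSE ZTX", "\t SWX \t ZURN  ", "  NYSE\tCGI\n", "   "]:
    print(repr(line), "->", parse_pair(line))
```

Returning None for a non-matching line makes it easy to skip blank or malformed rows.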
\s also matches tabs, so your regex looks like:
$line =~ /^\s*([A-Z]+)\s+([A-Z]+)/;
the first word is in the first group ($1) and the second in $2.
You can change [A-Z] to whatever's more convenient with your needs.
Here is the explanation from YAPE::Regex::Explain
The regular expression:
(?-imsx:^\s*([A-Z]+)\s+([A-Z]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[A-Z]+ any character of: 'A' to 'Z' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
With option "Multiline" this Regex:
^\s*(?<word1>\S+)\s+(?<word2>\S+)\s*$
Will give you N matches each containing 2 groups named:
- word1
- word2
^\s*([A-Z]{3,4})\s+([A-Z]{3,4})\s*$
What this does
^ // Matches the beginning of a string
\s* // Matches a space/tab character zero or more times
([A-Z]{3,4}) // Matches 3 or 4 letters A-Z and captures them to $1
\s+ // Then matches at least one tab or space
([A-Z]{3,4}) // Matches 3 or 4 letters A-Z and captures them to $2
\s* // Allows the optional trailing spaces/tabs mentioned in the question
$ // Matches the end of a string
You can use split here:
use strict;
use warnings;
while (<DATA>) {
my ( $word1, $word2 ) = split;
print "($word1, $word2)\n";
}
__DATA__
LSE ZTX
SWX ZURN
LSE ZYT
NYSE CGI
Output:
(LSE, ZTX)
(SWX, ZURN)
(LSE, ZYT)
(NYSE, CGI)
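The same idea carries over to other languages. As a sketch in Python (the helper name two_words is made up for the example), str.split() with no arguments behaves much like Perl's bare split here: it splits on runs of whitespace and ignores whitespace at the start and end of the line.

```python
def two_words(line):
    """Return the first two whitespace-separated words of a line as a tuple."""
    # split() with no arguments splits on any run of spaces, tabs or newlines
    # and never produces empty fields for leading/trailing whitespace.
    fields = line.split()
    return tuple(fields[:2])


for line in ["LSE ZTX", "\tSWX \t ZURN ", "LSE\tZYT\n", "  NYSE  CGI"]:
    print(two_words(line))
```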
Assuming the spaces at the start of the line are what you use to identify your codes you want, try this:
Split your string up at newlines, then try this regex:
^\s+(\w+\s+){2}$
This will only match lines that start with some space, followed by a (word - some space - word), then end with some space.
# ^ --> String start
# \s+ --> One or more whitespace characters
# (\w+\s+){2} --> A (word followed by some space)x2
# $ --> String end.
However, if you want to capture the codes alone, try this:
$line =~ /^\s*(\w+)\s+(\w+)/;
# \s* --> Zero or more whitespace,
# (\w+) --> Followed by a word (group #1),
# \s+ --> Followed by some whitespace,
# (\w+) --> Followed by a word (group #2),
This will match all your codes
/[A-Z]+/
I find the following statement in a perl (actually PDL) program:
/\/([\w]+)$/i;
Can someone decode this for me, an apprentice in perl programming?
Sure, I'll explain it from the inside out:
\w - matches a single character that can be used in a word (alphanumeric, plus '_')
[...] - matches a single character from within the brackets
[\w] - matches a single character that can be used in a word (kinda redundant here)
+ - matches the previous item, repeated as many times as possible, but it must appear at least once.
[\w]+ - matches a group of word characters, many times over. This will find a word.
(...) - grouping. remember this set of characters for later.
([\w]+) - match a word, and remember it for later
$ - end-of-line. match something at the end of a line
([\w]+)$ - match the last word on a line, and remember it for later
\/ - a single slash character '/'. it must be escaped by backslash, because slash is special.
\/([\w]+)$ - match the last word on a line, after a slash '/', and remember the word for later. This is probably grabbing the directory/file name from a path.
/.../ - match syntax
/.../i - i means case-insensitive.
All together now:
/\/([\w]+)$/i; - match the last word on a line and remember it for later; the word must come after a slash. Basically, grab the filename from an absolute path. The case insensitive part is irrelevant, \w will already match both cases.
More details about Perl regex here: http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
And as JRFerguson pointed out, YAPE::Regex::Explain is useful for tokenizing regex, and explaining the pieces.
You will find the YAPE::Regex::Explain module worth installing.
#!/usr/bin/env perl
use YAPE::Regex::Explain;
#...may need to single quote $ARGV[0] for the shell...
print YAPE::Regex::Explain->new( $ARGV[0] )->explain;
Assuming this script is named 'rexplain' do:
$ ./rexplain '/\/([\w]+)$/i'
...to obtain:
The regular expression:
(?-imsx:/\/([\w]+)$/i)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
/ '/'
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\w]+ any character of: word characters (a-z,
A-Z, 0-9, _) (1 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
/i '/i'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
UPDATE:
See also: https://stackoverflow.com/a/12359682/1015385 . As noted there and in the module's documentation:
There is no support for regular expression syntax added after Perl version 5.6, particularly any
constructs added in 5.10.
/\/([\w]+)$/i;
It is a regex, and if it is a complete statement, it is applied to the $_ variable, like so:
$_ =~ /\/([\w]+)$/i;
It looks for a slash \/, followed by an alphanumeric string \w+, followed by end of line $. It also captures () the alphanumeric string, which ends up in the variable $1. The /i on the end makes it case-insensitive, which has no effect in this case.
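A quick way to check this reading is to run the pattern against a few strings. Here is a small sketch in Python, whose re module treats this particular pattern the same way; the names LAST_WORD and last_word and the sample inputs are only for illustration.

```python
import re

# A slash, then one or more word characters (captured), then end of string.
LAST_WORD = re.compile(r"/(\w+)$")


def last_word(s):
    """Return what Perl would put in $1, or None when there is no match."""
    m = LAST_WORD.search(s)
    return m.group(1) if m else None


for s in ["/abcdes", "foo/bar2", "foobar", "/usr/lib/"]:
    print(repr(s), "->", last_word(s))
```

The last input shows that a trailing slash prevents a match, because no word characters sit between the final slash and the end of the string.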
While it doesn't help "explain" a regex, once you have a test case, Damian's new Regexp::Debugger is a cool utility to watch what actually occurs during the matching. Install it and then do rxrx at the command line to start the debugger, then type in /\/([\w]+)$/ and '/r' (for example), and finally m to start the matching. You can then step through the debugger by hitting enter repeatedly. Really cool!
This matches $_ against a slash followed by one or more word characters at the end of the string (case-insensitively) and stores those characters in $1.
$_ value then $1 value
------------------------------
"/abcdes" | "abcdes"
"foo/bar2" | "bar2"
"foobar" | undef # no slash so doesn't match
The Online Regex Analyzer deserves a mention. Here's a link to explain what your regex means, and pasted here for the record.
Sequence: match all of the followings in order
/ (slash)
--+
Repeat | (in GroupNumber:1)
AnyCharIn[ WordCharacter] one or more times |
--+
EndOfLine
I promise you all I've searched the site for about two hours now. I've found several that should have worked, but they didn't.
I have a line that consists of a varying amount of numbers separated by spaces. I want to delete everything after the third number.
I should say that everything I've been writing has been assuming that \S\s\S\s\S would match the first three numbers, with spaces between 1 and 2, and 2 and 3.
I anticipated the following working:
s/^.*?[\S\s\S\s\S].{5}//s;
but it did the exact opposite of what I wanted.
I would like 2 3 0 4 5 6 7 1 0 1 2 to become 2 3 0
I would really prefer to keep it substitution. I've tried look-behind as one person mentioned and I had no luck. Should I be saving the first 3 numbers as a string before I'm trying these commands?
EDIT:
I should have clarified that these numbers could be in the form 1.57 or 1.00E01 as well. I had integers when I was trying to get that to just baseline work.
\S\s\S\s\S will indeed match three non-space characters separated by space characters. However, ^.*?[\S\s\S\s\S].{5} does something completely different:
^ matches the beginning of the line.
.*? matches as few characters as possible (it is non-greedy), consuming more only when the rest of the pattern cannot match otherwise. Since you specify /s, . will match newline as well.
[\S\s\S\s\S] is a character class, and so is the same as [\S\s]—match either \S or \s, which is to say anything.
.{5} will match five characters.
Since [\S\s] and . with /s match the same things, the .*? will never match any characters as it wants to match as little as possible. Thus, this is the same as s/^.{6}//s—delete the first six characters from the string. As you can see, that's not what you wanted!
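The equivalence is easy to confirm by experiment. Here is a sketch in Python, where re.S stands in for Perl's /s and count=1 mimics a substitution without /g; the variable names are just for the example.

```python
import re

s = "2 3 0 4 5 6 7 1 0 1 2"

# The pattern from the question.
original = re.sub(r"^.*?[\S\s\S\s\S].{5}", "", s, count=1, flags=re.S)

# The simpler pattern it is equivalent to: drop the first six characters.
simplified = re.sub(r"^.{6}", "", s, count=1, flags=re.S)

print(original)
print(simplified)
```

Both substitutions remove "2 3 0 " from the front and leave the rest of the line, which is the opposite of what was wanted.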
One way to keep the first three numbers is to explicitly match them: s/^(\d \d \d).*/$1/s. Here, \d matches a single digit (0-9), with literal spaces in between them. We match the first three followed by anything at all, and then replace the whole match (since it ends in .*, that's the whole string) with just the bit in between parentheses, i.e. the first three numbers. If your numbers can be more than one digit long, then s/^(\d+ \d+ \d+).*/$1/s will do what you want; if you can have arbitrary space-like characters (space, tab, newline) separating them, then s/^(\d\s\d\s\d).*/$1/s is what you want (or \s+ if you can have multiple spaces). If you want to catch lines which have things other than digits, such as 1.57 or 1.00E01, you can use \S or \S+, just as you were.
Another approach, using lookbehind, would be s/(?<=^\d \d \d).*//s. In other words, delete any characters which are preceded by ^\d \d \d—the beginning of the string followed by three space-separated numbers. There's no real advantage to this approach—I'd probably do it the other way—but since you mentioned lookbehind, here's how you can do it. (Again, things like s/(?<=^\S\s\S\s\S).*//s are more general.)
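As a sketch of the capture-and-replace approach with the more general \S+ and \s+ pieces, here is the same substitution in Python (the function name keep_first_three is made up for the example; re.S stands in for /s):

```python
import re


def keep_first_three(line):
    """Keep the first three whitespace-separated tokens and drop the rest."""
    # \S+ is one "number" of any shape; \s+ is the whitespace between numbers.
    return re.sub(r"^(\S+\s+\S+\s+\S+).*", r"\1", line, count=1, flags=re.S)


print(keep_first_three("2 3 0 4 5 6 7 1 0 1 2"))
print(keep_first_three("1.57 1.00E01 3 4 5"))
```

Because \S+ does not care what the tokens look like, this also copes with values such as 1.57 or 1.00E01. A line with fewer than three tokens is left unchanged.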
So match the first three numbers explicitly, and drop everything else.
s/^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$/$1 $2 $3/;
This works as follows:
$ perl -MYAPE::Regex::Explain -E 'say YAPE::Regex::Explain->new(q{^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$})->explain;'
The regular expression:
(?-imsx:^([\dE.]+)\s+([\dE.]+)\s+([\dE.]+).*$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ") (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[\dE.]+ any character of: digits (0-9), 'E', '.'
(1 or more times (matching the most
amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(Updated in consideration of the changes the OP made to the original specification.)
Your code where you say s/^.*?[\S\s\S\s\S].{5}//s;
I would write as: s/^(\S\s\S\s\S).*$/$1/
You're forgetting to capture the part that you want to keep (in parentheses) and to put it back with $1 in the replacement, and having a .* at the beginning could lead to starting numbers being removed instead of trailing numbers.
Also, I'm not sure if you have some guarantee of single digit numbers, or of single whitespace characters, so you could write the code with s/^(\S+\s+\S+\s+\S+).*$/$1/ to capture all of the spaces and all of the digits.
Let me know if I need to clarify that a little more.
Here's a website I find super helpful for Perl regex: http://pubcrawler.org/perl-reference.html
The question is, why do you want to do such a thing with a regexp? It seems easier to me with:
substr $string, 0, 5;
or, if you really want to (I didn't test):
s/^(.{5})(.*)/$1/
Parentheses allow you to "remember" patterns; this is the way to say that you want to replace pretty much everything with just the first part of the pattern (the first five characters). This pattern will match any line of text and leave just the first 5 characters. Maybe you want to modify it to match 3 numbers with spaces between them.
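For illustration, keeping the first five characters is a plain slice in Python. The helper name first_chars is made up, and the second call shows why a fixed width only works while every number is a single digit.

```python
def first_chars(s, n=5):
    """Return the first n characters of s."""
    return s[:n]


print(repr(first_chars("2 3 0 4 5 6 7 1 0 1 2")))
# With multi-character numbers the fixed width cuts in the wrong place.
print(repr(first_chars("1.57 2 3 4")))
```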
If I run
"Year 2010" =~ /([0-4]*)/;
print $1;
I get empty string.
But
"Year 2010" =~ /([0-4]+)/;
print $1;
outputs "2010". Why?
You get an empty match right at the start of the string "Year 2010" for the first form because the * will immediately match 0 digits. The + form will have to wait until it sees at least one digit before it matches.
Presumably if you can go through all the matches of the first form, you'll eventually find 2010... but probably only after it finds another empty match before the 'e', then before the 'a' etc.
The first regular expression successfully matches zero digits at the start of the string, which results in capturing the empty string.
The second regular expression fails to match at the start of the string, but it does match when it reaches 2010.
The first matches the zero-length string at the beginning (before Y) and returns it. The second searches for one-or-more digits and waits until it finds 2010.
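The difference shows up clearly if you also print where each match starts. Here is a sketch using Python's re module, which behaves the same way for these two patterns; the variable names are just for the example.

```python
import re

text = "Year 2010"

star = re.search(r"([0-4]*)", text)   # may match zero characters
plus = re.search(r"([0-4]+)", text)   # needs at least one digit

print(repr(star.group(1)), "at offset", star.start())
print(repr(plus.group(1)), "at offset", plus.start())

# Listing every match of the * form shows the empty matches that come first.
print(re.findall(r"[0-4]*", text))
```

The * form succeeds immediately at offset 0 with an empty capture, while the + form has to move along to offset 5 before it can succeed.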
You can also use YAPE::Regex::Explain to explain a regular expression, like this:
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new('([0-4]*)')->explain();
print YAPE::Regex::Explain->new('([0-4]+)')->explain();
output:
The regular expression:
(?-imsx:([0-4]*))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-4]* any character of: '0' to '4' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The regular expression:
(?-imsx:([0-4]+))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[0-4]+ any character of: '0' to '4' (1 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
The star symbol tries to basically match 0 or more symbols in given set (in theory, the set {x,y}* consists of empty string and all possible finite sequences made of x and y), and therefore, it will match exactly zero characters (empty string) at the beginning of the string, zero characters after first character, zero characters after the second character, etc. Then finally it will find 2 and match whole 2010.
The plus symbol matches one or more characters from the given set ({x,y}+ consists of all possible finite sequences made of x and y, without the empty string, as opposed to {x,y}*). So the first matching character it meets is 2, then the next character, 0, is checked, then 1, then another 0, and then the string ends, so the captured group is '2010'.
This is standard behavior for regular expressions, defined in formal language theory. I strongly suggest learning a bit of the theory behind regular expressions; it can't hurt, and it can help :)
We have this as a trick question in Learning Perl. Any regex that can match zero characters, and that can't match anything longer at the beginning of the string, will match zero characters there.
Perl's regex engine returns the leftmost match: it tries each starting position from left to right and stops at the first one where the pattern succeeds, with greedy quantifiers taking as much as they can at that position. Not all regex engines work like that, though. If you want all of the technical details, read Mastering Regular Expressions, which explains how regex engines work and find matches.
To make your first RE match the digits, anchor it at the end of the string with '$':
"Year 2010" =~ /([0-4]*)$/;
print $1;
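The anchor works because an empty match at the start no longer satisfies the whole pattern, so the engine keeps moving right until the digits run up to the end of the string. Here is a sketch in Python (the helper name trailing_digits is made up), including a case where the trick does not help:

```python
import re


def trailing_digits(text):
    """Return the run of digits 0-4 that ends the string (possibly empty)."""
    return re.search(r"([0-4]*)$", text).group(1)


print(repr(trailing_digits("Year 2010")))
# If the digits are not at the end, the only match is the empty string there.
print(repr(trailing_digits("2010 Year")))
```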
I am trying to match what is before /../ but after / with a regular expressions, but I want it to look back and stop at the first /
I feel like I am close but it just looks at the first slash and then takes everything after it like... input is this:
this/is/a/./path/that/../includes/face/./stuff/../hat
and my regular expression is:
#\/(.*)\.\.\/#
matching /is/a/./path/that/../includes/face/./stuff/../ instead of just that/../ and stuff/../
How should I change my regex to make it work?
.* means "match any number of any character at all[1]". This is not what you want. You want to match any number of non-/ characters, which is written [^/]*.
Any time you are tempted to use .* or .+ in a regex, be very suspicious. Stop and ask yourself whether you really mean "any character at all[1]" or not - most of the time you don't. (And, yes, non-greedy quantifiers can help with this, but character classes are both more efficient for the regex engine to match against and more clear in their communication of your intent to human readers.)
[1] OK, OK... . isn't exactly "any character at all" - it doesn't match newline (\n) by default in most regex flavors - but close enough.
Change your pattern so that only characters other than / ([^/]) get matched:
#([^/]*)/\.\./#
Alternatively, you can use a lookahead.
#(\w+)(?=/\.\./)#
Explanation
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
\w+ word characters (a-z, A-Z, 0-9, _) (1 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
) end of look-ahead
I think you're essentially right, you just need to make the match non-greedy, or change the (.*) to not allow slashes: #/([^/]*)/\.\./#
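To compare the variants side by side, here is a sketch in Python, whose re module handles these patterns the same way for this input. The variable names are made up for the example.

```python
import re

path = "this/is/a/./path/that/../includes/face/./stuff/../hat"

# The original greedy pattern: .* runs to the last "../" in the string.
greedy = re.search(r"/(.*)\.\./", path).group(0)

# Negated character class: the capture cannot cross a slash.
negated = re.findall(r"([^/]*)/\.\./", path)

# Lookahead: match a word only if "/../" follows, without consuming it.
lookahead = re.findall(r"(\w+)(?=/\.\./)", path)

print(greedy)
print(negated)
print(lookahead)
```

The greedy pattern swallows everything from the first slash to the last ../, while both alternatives pick out only the directory name immediately before each /../.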
In your favourite language, do a few splits and some string manipulation, e.g. in Python:
>>> s="this/is/a/./path/that/../includes/face/./stuff/../hat"
>>> a=s.split("/../")[:-1] # the last item is not required.
>>> for item in a:
...     print(item.split("/")[-1])
...
that
stuff
In Python:
>>> import re
>>> test = 'this/is/a/./path/that/../includes/face/./stuff/../hat'
>>> regex = re.compile(r'/\w+?/\.\./')
>>> regex.findall(test)
['/that/../', '/stuff/../']
Or if you just want the text without the slashes:
>>> regex = re.compile(r'/(\w+?)/\.\./')
>>> regex.findall(test)
['that', 'stuff']
([^/]+) will capture all the text between slashes.
([^/]+)*/\.\. matches that/.. and stuff/.. in your string this/is/a/./path/that/../includes/face/./stuff/../hat. It captures that or stuff, and you can change that, obviously, by changing the placement of the capturing parens and your program logic.
You didn't state whether you want to capture or just match. The regex here captures only one occurrence per match, but it is easily changed to return that and then stuff when used in a global match.
NODE EXPLANATION
--------------------------------------------------------------------------------
( group and capture to \1 (0 or more times
(matching the most amount possible)):
--------------------------------------------------------------------------------
[^/]+ any character except: '/' (1 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
)* end of \1 (NOTE: because you're using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\. '.'
--------------------------------------------------------------------------------
\. '.'