Using a match from a regex in another regex: skipping over metacharacters - regex

I have a regular expression (REGEX 1) plus some Perl code that picks out a specific string of text, call it the START_POINT, from a large text document. This START_POINT is the beginning of a larger string of text that I want to extract from the large text document. I want to use another regular expression (REGEX 2) to extract from START_POINT to an END_POINT. I have a set of words to use in REGEX 2 which will easily find the END_POINT. Here is my problem: the START_POINT text string may contain metacharacters, which the regular expression will interpret as pattern syntax rather than as literal text. I don't know ahead of time which ones these will be. I am trying to process a large set of text documents, and the START_POINT will vary from document to document. How do I tell a regular expression to interpret a text string as just the text string and not as a text string with metacharacters?
Perhaps this code will help make this clearer. $START_POINT was identified in code above this piece of code and is an extracted part of the large text string $TEXT.
my $END_POINT = "(STOP|CEASE|END|QUIT)";
my @NFS = $TEXT =~ m/(($START_POINT).*?($END_POINT))/misog;
I have tried to use the quotemeta function, but haven't had any success. It seems to destroy the integrity of the $START_POINT text string by adding in slashes which change the text.
So to summarize I am looking for some way to tell the regular expression to look for the exact string in $START_POINT without interpreting any of the string as a metacharacter while still maintaining the integrity of the string. Although I may be able to get the quotemeta to work, do you know of any other options available?
Thanks in advance for your help!

You need to convert the text to a regex pattern. That's what quotemeta does.
my $start = '*';
my $start_pat = quotemeta($start); # * => \*
/$start_pat/ # Matches "*"
quotemeta can be accessed via \Q..\E:
my $start = '*';
/\Q$start\E/ # Matches "*"
Why reimplement quotemeta?
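The same idea carries over to other regex flavors. Here is a minimal sketch in Python, where re.escape plays the role of quotemeta. The function name extract_span and the sample text are made up for illustration; the END_POINT words are the ones from the question.

```python
import re

def extract_span(text, start_point, end_words=("STOP", "CEASE", "END", "QUIT")):
    """Return every span that runs from the literal start_point to the nearest end word."""
    # re.escape backslash-escapes metacharacters so start_point is matched literally.
    start_pat = re.escape(start_point)
    end_pat = "|".join(re.escape(word) for word in end_words)
    # DOTALL lets "." cross newlines; IGNORECASE mirrors the /i flag in the question.
    pattern = re.compile(f"({start_pat}.*?(?:{end_pat}))", re.IGNORECASE | re.DOTALL)
    return pattern.findall(text)

# Hypothetical sample: the start point contains "(", ")" and "*".
text = "intro Cost (USD)* is $5.00 [approx] and more STOP trailing"
spans = extract_span(text, "Cost (USD)*")
print(spans)
```

The escaped pattern contains extra backslashes, but the text it matches is unchanged, so the integrity of the original string is not affected.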

Related

Regular expression formatting issue

I'm VERY new to using regular expressions, and I'm trying to figure something simple out.
I have a simple string, and I'm trying to pull out the 590111 and place it into another string.
HMax_590111-1_v8980.bin
So the new string would simply be...
590111
The part number will ALWAYS have 6 digits, and will ALWAYS be accompanied by a version and such. The part number might change location inside the string, so it needs to work if it's like this:
590111-1_v8980_HMXAX.bin
What regular expression will do this? Currently, I'm using ^[0-9]* to find it if it's at the front of the file name.
Try the following Regex:
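Imports System.Text.RegularExpressions
Dim text As String = "590111-1_v8980_HMXAX.bin"
Dim pattern As String = "\d{6}"
'Instantiate the regular expression object.
Dim r As Regex = New Regex(pattern, RegexOptions.IgnoreCase)
'Match the regular expression pattern against a text string.
Dim m As Match = r.Match(text)
'When m.Success is True, m.Value holds the six matched digits.
Dim partNumber As String = If(m.Success, m.Value, String.Empty)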
In regex, \d denotes a digit, so first you write \d.
Then, because you know the number has a fixed length, you can specify the count in regex with "{}". If you specify \d{6}, it will match 6 consecutive digit characters.
I would recommend to use this site to try your own expressions. Here you can also find a little bit of information about the expressions you are building if you hover over it.
Regex Tester
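For comparison, here is a rough Python sketch of the same \d{6} search. The helper name find_part_number is made up for illustration; the two file names are the ones from the question.

```python
import re

def find_part_number(filename):
    """Return the first run of six consecutive digits, or None if there is none."""
    # re.search scans the whole string, so the position of the part number does not matter.
    match = re.search(r"\d{6}", filename)
    return match.group(0) if match else None

print(find_part_number("HMax_590111-1_v8980.bin"))
print(find_part_number("590111-1_v8980_HMXAX.bin"))
```

If a longer run of digits could ever appear in the name, this pattern would pick up its first six digits, so a stricter variant such as (?<!\d)\d{6}(?!\d) might be worth considering.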

Regexp match until

I have the following string:
{"name":"db-mysql","authenticate":false,"remove":false,"skip":false,"options":{"create":{"Image":"mysql:5.6/image-name:0.3.44","Env":{"MYSQL_ROOT_PASSWORD":"dummy_pass","MYSQL_DATABASE":"dev_db"}}}}
I need to get the version: 0.3.44
The pattern will always be "Image":"XXX:YYY/ZZZ:VVV"
Any help would be greatly appreciated
Rubular link
This regular expression will reliably match the given "pattern" in any string and capture the group designated VVV:
/"Image":"[^:"]+:[^"\/]+\/[^":]+:([^"]+)"/
where the pattern is understood to express the following characteristics of the input to be matched:
there is no whitespace around the colon between "Image" and the following " (though that could be matched with a small tweak);
the XXX and ZZZ substrings do not contain any colons;
the YYY substring does not contain a forward slash (/); and
none of the substrings XXX, YYY, ZZZ, or VVV contains a literal double-quote ("), whether or not escaped.
Inasmuch as those constraints are stronger than JSON or YAML requires to express the given data, you'd probably be better off using a bona fide JSON / YAML parser. A real parser will cope with semantically equivalent inputs that do not satisfy the constraints, and it will recognize invalid (in the JSON or YAML sense) inputs that contain the pattern. If none of that is a concern to you, however, then the regex will do the job.
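As an illustration, the regex above can be tried in Python roughly like this. The names IMAGE_VERSION and raw are made up for the sketch; the input is the string from the question, and the pattern is the same one minus the escaped slashes, which Python does not need.

```python
import re

# Same pattern as above; "/" needs no escaping in a Python raw string.
IMAGE_VERSION = re.compile(r'"Image":"[^:"]+:[^"/]+/[^":]+:([^"]+)"')

raw = ('{"name":"db-mysql","authenticate":false,"remove":false,"skip":false,'
       '"options":{"create":{"Image":"mysql:5.6/image-name:0.3.44",'
       '"Env":{"MYSQL_ROOT_PASSWORD":"dummy_pass","MYSQL_DATABASE":"dev_db"}}}}')

match = IMAGE_VERSION.search(raw)
# Group 1 is the VVV part of "XXX:YYY/ZZZ:VVV".
version = match.group(1) if match else None
print(version)
```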
This is extremely hard to do correctly using just a regular expression, and it would require a regex engine capable of recursion. Rather than writing a 50-line regexp, I'd simply use an existing JSON parser.
$ perl -MJSON::PP -E'
use open ":std", ":encoding(UTF-8)";
my $json = <>;
my $data = JSON::PP->new->decode($json);
my $image = $data->{options}{create}{Image};
my %versions = map { split(/:/, $_, 2) } split(/\//, $image);
say $versions{"image-name"};
' my.json
0.3.44
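A comparable sketch with Python's standard json module, for anyone not using Perl. The function name image_version and its image_name parameter are made up for illustration; the key path options/create/Image comes from the question's data.

```python
import json

def image_version(raw_json, image_name="image-name"):
    """Parse the document and return the version attached to image_name."""
    data = json.loads(raw_json)
    image = data["options"]["create"]["Image"]
    # "mysql:5.6/image-name:0.3.44" -> {"mysql": "5.6", "image-name": "0.3.44"}
    versions = dict(part.split(":", 1) for part in image.split("/"))
    return versions[image_name]

raw = ('{"name":"db-mysql","authenticate":false,"remove":false,"skip":false,'
       '"options":{"create":{"Image":"mysql:5.6/image-name:0.3.44",'
       '"Env":{"MYSQL_ROOT_PASSWORD":"dummy_pass","MYSQL_DATABASE":"dev_db"}}}}')
print(image_version(raw))
```

Because the parser handles whitespace and key order, this keeps working on semantically equivalent inputs that a regex over the raw text would miss.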

Regular expression using powershell

Here is the scenario: I have the lines below, and I want to extract only the middle part between the two dots.
"scvmm.new.resources" --> after a regular expression match, this should return only "new"
"sc.new1.rerces" --> after a regular expression match, this should return only "new1"
My basic requirement is to extract whatever lies between the two dots; anything can come before and after them.
(.*).<required code>.(.*)
Could anyone please help me out??
You can do that without using regex. Split the string on '.' and grab the middle element:
PS> "scvmm.new.resources".Split('.')[1]
new
Or this
'scvmm.new.resources' -replace '.*\.(.*)\..*', '$1'
Like this:
([regex]::Match("scvmm.new1.resources", '(?<=\.)([^\.]*)(?=\.)' )).value
You don't actually need regular expressions for such a trivial substring extraction. Like Shay's Split('.') one can use IndexOf() for similar effect like so,
$s = "scvmm.new.resources"
$l = $s.IndexOf(".")+1
$r = $s.IndexOf(".", $l)
$s.Substring($l, $r-$l) # Prints new
$s = "sc.new1.rerces"
$l = $s.IndexOf(".")+1
$r = $s.IndexOf(".", $l)
$s.Substring($l, $r-$l) # Prints new1
This looks for the first occurrence of a dot. Then it looks for the first occurrence of a dot after that hit. Then it extracts the characters between the two locations. This is useful in, say, scenarios in which the separation characters are not the same (though the Split() way would work in many cases too).
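For readers outside PowerShell, both non-regex approaches can be sketched in Python as follows. The function names are made up for illustration, and str.index raises ValueError when a dot is missing, which this sketch does not handle.

```python
def middle_by_split(s):
    """Split on dots and take the second element."""
    return s.split(".")[1]

def middle_by_index(s):
    """Find the first dot, then the next dot after it, and slice between them."""
    left = s.index(".") + 1
    right = s.index(".", left)
    return s[left:right]

for sample in ("scvmm.new.resources", "sc.new1.rerces"):
    print(middle_by_split(sample), middle_by_index(sample))
```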

Regular expression literal-text span

Is there any way to indicate to a regular expression a block of text that is to be searched for explicitly? I ask because I have to match a very, very long piece of text which contains all sorts of metacharacters (and has to match exactly), followed by some flexible stuff (enough to merit the use of a regex), followed by more text that has to be matched exactly.
Rinse, repeat.
Needless to say, I don't really want to have to run through the entire thing and have to escape every metacharacter. That just makes it a bear to read. Is there a way to wrap those portions so that I don't have to do this?
Edit:
Specifically, I am using Tcl, and by "metacharacters", I mean that there's all sorts of long strings like "**$^{*$%\)". I would really not like to escape these. I mean, it would add thousands of characters to the string. Does Tcl regexp have a literal-text span metacharacter?
The normal way of doing this in Tcl is to use a helper procedure to do the escaping, like this:
proc re_escape str {
# Every non-word char gets a backslash put in front
regsub -all {\W} $str {\\&}
}
set awkwardString "**$^{*$%\\)"
regexp "simpleWord *[re_escape $awkwardString] *simpleWord" $largeString
Where you have a whole literal string, you have two other alternatives:
regexp "***=$literal" $someString
regexp "(?q)$literal" $someString
However, both of these only permit patterns that are pure literals; you can't mix patterns and literals that way.
No, Tcl does not have such a feature.
If you're concerned about readability you can use variables and commands to build up your expression. For example, you could do something like:
set fixed1 {.*?[]} ;# match the literal five-byte sequence .*?[]
set fixed2 {???} ;# match the literal three byte sequence ???
set pattern "this.*and.*that"
regexp "[re_escape $fixed1]$pattern[re_escape $fixed2]" $someString
You would need to supply the definition for re_escape but the solution should be pretty obvious.
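The build-it-from-pieces approach looks much the same in other languages. A minimal Python sketch, assuming re.escape as the escaping helper; build_pattern and the sample strings are made up for illustration, while the two fixed strings and the flexible pattern are the ones from the Tcl example.

```python
import re

def build_pattern(fixed1, flexible, fixed2):
    """Escape the literal parts and leave the flexible part as regex syntax."""
    return re.compile(re.escape(fixed1) + flexible + re.escape(fixed2))

pattern = build_pattern(".*?[]", r"this.*and.*that", "???")

# Hypothetical inputs: only the first contains both literal sequences.
hit = pattern.search("x .*?[]this one and also that??? y")
miss = pattern.search("x this one and also that y")
print(bool(hit), bool(miss))
```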
A Tcl regular expression can be specified with the q metasyntactical directive to indicate that the expression is literal text:
% set string {this string contains *emphasis* and 2+2 math?}
% puts [regexp -inline -all -indices {*} $string]
couldn't compile regular expression pattern: quantifier operand invalid
% puts [regexp -inline -all -indices {(?q)*} $string]
{21 21} {30 30}
This does however apply to the entire expression.
What I would do is to iterate over the returned indices, using them as arguments to [string range] to extract the other stuff you're looking for.
I believe Perl and Java support the \Q ... \E escape, so
\Q.*.*()\E
..will actually match the literal ".*.*()"
OR
A bit of a hack, but you could replace the literal section with some text which does not need escaping and which will not appear elsewhere in your searched string, then build the regex using this metacharacter-free text. A 100-digit random sequence, for example. Then, when your regex matches at a certain position and length in the doctored string, you can calculate whereabouts it should appear in the original string and what length it should be.

How to cycle through delimited tokens with a Regular Expression?

How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)
/###(.+?)###/
if you want the ###'s then you need
/(###.+?###)/
the ? means non greedy, if you didn't have the ?, then it would grab too much.
e.g. '###token1### text text ###token2###' would all get grabbed.
My initial answer had a * instead of a +. * means 0 or more. + means 1 or more. * was wrong because that would allow ###### as a valid thing to find.
For playing around with regular expressions, I highly recommend http://www.weitz.de/regex-coach/ for Windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.
In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while($text =~ m/###(.+?)###/g) {
print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.+?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
@tokens = $text =~ m/###(.+?)###/g;
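For comparison, a rough Python sketch of the same collection step, with the greedy variant included to show what the ? prevents. The variable names are made up for illustration; the sample string is the one from the question.

```python
import re

text = "text ###token1### text text ###token2### text text"

# Non-greedy: each match stops at the nearest closing delimiter.
tokens = re.findall(r"###(.+?)###", text)
# No capture group: findall returns the whole match, delimiters included.
with_delimiters = re.findall(r"###.+?###", text)
# Greedy: .+ runs to the last ### in the string and swallows everything between.
greedy = re.findall(r"###(.+)###", text)

print(tokens)
print(with_delimiters)
print(greedy)
```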
Assuming you want to match ###token2### as well...
/###.+###/
Use () and \x. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 (\1 for the first set, \2 for the second) in the replacement expression, assuming you're doing a search/replace in an editor. For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
If you're using a regexp library in a program, you'd presumably call a function to get at the contents of the first and second tokens, which you've indicated with the ()s around them.
Well, when you are using delimiters such as these, basically you just grab the first one, then anything that does not match the ending delimiter, followed by the ending delimiter. A special caution: in cases like the example above, [^#] would not work as a check that the end delimiter is not there, since a single # inside a token would cause the regex to fail (e.g. "###foo#bar###"). In the case above, the regex to parse it would be the following, assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###
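A quick way to check that pattern is a small Python sketch like the one below. The names TOKEN and find_tokens are made up for illustration, and the grouping is adjusted slightly (a non-capturing inner group wrapped in one capturing group) so that the whole token body is returned.

```python
import re

# Body = any run of: a non-#, or "#" + non-#, or "##" + non-#; so "###" can only be the closer.
TOKEN = re.compile(r"###((?:[^#]|#[^#]|##[^#])*)###")

def find_tokens(s):
    """Return the text between each pair of ### delimiters."""
    return [m.group(1) for m in TOKEN.finditer(s)]

print(find_tokens("text ###token1### text text ###token2### text text"))
print(find_tokens("###foo#bar###"))
print(find_tokens("###a##b###"))
```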