Regexp match until - regex

I have the following string:
{"name":"db-mysql","authenticate":false,"remove":false,"skip":false,"options":{"create":{"Image":"mysql:5.6/image-name:0.3.44","Env":{"MYSQL_ROOT_PASSWORD":"dummy_pass","MYSQL_DATABASE":"dev_db"}}}}
I need to get the version: 0.3.44
The pattern will always be "Image":"XXX:YYY/ZZZ:VVV"
Any help would be greatly appreciated

This regular expression will reliably match the given "pattern" in any string and capture the group designated VVV:
/"Image":"[^:"]+:[^"\/]+\/[^":]+:([^"]+)"/
where the pattern is understood to express the following characteristics of the input to be matched:
there is no whitespace around the colon between "Image" and the following " (though that could be matched with a small tweak);
the XXX and ZZZ substrings do not contain any colons;
the YYY substring does not contain a forward slash (/); and
none of the substrings XXX, YYY, ZZZ, or VVV contains a literal double-quote ("), whether or not escaped.
Inasmuch as those constraints are stronger than JSON or YAML requires to express the given data, you'd probably be better off using a bona fide JSON / YAML parser. A real parser will cope with semantically equivalent inputs that do not satisfy the constraints, and it will recognize invalid (in the JSON or YAML sense) inputs that contain the pattern. If none of that is a concern to you, however, then the regex will do the job.
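For illustration, here is a minimal Python sketch of the same pattern applied to the string from the question. The helper name image_version is made up for this example, and the forward slash is left unescaped because Python patterns have no delimiter that would require escaping it.

```python
import re

# Sample input copied from the question.
text = '{"name":"db-mysql","authenticate":false,"remove":false,"skip":false,"options":{"create":{"Image":"mysql:5.6/image-name:0.3.44","Env":{"MYSQL_ROOT_PASSWORD":"dummy_pass","MYSQL_DATABASE":"dev_db"}}}}'

# Same pattern as above: "Image":"XXX:YYY/ZZZ:VVV", capturing VVV.
IMAGE_VERSION = re.compile(r'"Image":"[^:"]+:[^"/]+/[^":]+:([^"]+)"')

def image_version(s):
    """Return the VVV part, or None when the pattern is absent (hypothetical helper)."""
    m = IMAGE_VERSION.search(s)
    return m.group(1) if m else None

print(image_version(text))  # 0.3.44
```

The same constraints listed above apply: any input that breaks them (whitespace after the colon, extra colons, escaped quotes) simply yields None.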

This is extremely hard to do correctly using just a regular expression, and it would require a regex engine capable of recursion. Rather than writing a 50-line regex, I'd simply use an existing JSON parser.
$ perl -MJSON::PP -E'
use open ":std", ":encoding(UTF-8)";
my $json = <>;
my $data = JSON::PP->new->decode($json);
my $image = $data->{options}{create}{Image};
my %versions = map { split(/:/, $_, 2) } split(/\//, $image);
say $versions{"image-name"};
' my.json
0.3.44
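The same parser-based idea can be sketched in Python with the standard json module. This is only an illustration: the function name image_versions is invented here, the sample is a cut-down version of the question's string, and the sketch assumes every slash-separated part of the Image value has the form name:version.

```python
import json

def image_versions(json_text):
    """Map each name in the Image value to its version (hypothetical helper)."""
    data = json.loads(json_text)
    image = data["options"]["create"]["Image"]
    # "mysql:5.6/image-name:0.3.44" -> ["mysql:5.6", "image-name:0.3.44"]
    # Each part is split once on ":" into a (name, version) pair.
    return dict(part.split(":", 1) for part in image.split("/"))

# Reduced sample keeping only the keys the lookup needs.
sample = '{"options":{"create":{"Image":"mysql:5.6/image-name:0.3.44"}}}'
print(image_versions(sample)["image-name"])  # 0.3.44
```

A part without a colon would raise a ValueError here, which may or may not be the behaviour you want for malformed input.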

Related

Raku Regex to capture and modify the LFM code blocks

Update: Corrected code added below
I have a Leanpub flavored markdown* file named sample.md. I'd like to convert its code blocks into GitHub flavored markdown style using Raku regexes.
Here's a sample **ruby** code, which
prints the elements of an array:
{:lang="ruby"}
['Ian','Rich','Jon'].each {|x| puts x}
Here's a sample **shell** code, which
removes the ending commas and
finds all folders in the current path:
{:lang="shell"}
sed s/,$//g
find . -type d
In order to capture the lang value, e.g. ruby from the {:lang="ruby"} and convert it into
```ruby
I use this code
my @in="sample.md".IO.lines;
my @out;
for @in.kv -> $key,$val {
    if $val.starts-with("\{:lang") {
        if $val ~~ /^{:lang="([a-z]+)"}$/ { # capture lang
            @out[$key]="```$0"; # convert it into ```ruby
            $key++;
            while @in[$key].starts-with(" ") {
                @out[$key]=@in[$key].trim-leading;
                $key++;
            }
            @out[$key]="```";
        }
    }
    @out[$key]=$val;
}
The line containing the Regex gives
Cannot modify an immutable Pair (lang => True) error.
I've just started out using Regexes. Instead of ([a-z]+) I've tried (\w) and it gave the Unrecognized backslash sequence: '\w' error, among other things.
How to correctly capture and modify the lang value using Regex?
* The LFM format shown here is only approximate.
Corrected code:
my @in="sample.md".IO.lines;
my \len=@in.elems;
my @out;
my $k = 0;
while ($k < len) {
    if @in[$k] ~~ / ^ '{:lang="' (\w+) '"}' $ / {
        push @out, "```$0";
        $k++;
        while @in[$k].starts-with(" ") {
            push @out, @in[$k].trim-leading;
            $k++;
        }
        push @out, "```";
    }
    push @out, @in[$k];
    $k++;
}
for @out {print "$_\n"}
TL;DR
TL? Then read @jjemerelo's excellent answer, which not only provides a one-line solution but much more in a compact form;
DR? Aw, imo you're missing some good stuff in this answer that JJ (reasonably!) ignores. Though, again, JJ's is the bomb. Go read it first. :)
Using a Perl regex
There are many dialects of regex. The regex pattern you've used is a Perl regex but you haven't told Raku that. So it's interpreting your regex as a Raku regex, not a Perl regex. It's like feeding Python code to perl. So the error message is useless.
One option is to switch to Perl regex handling. To do that, this code:
/^{:lang="([a-z]+)"}$/
needs m :P5 at the start:
m :P5 /^{:lang="([a-z]+)"}$/
The m is implicit when you use /.../ in a context where it is presumed you mean to immediately match, but because the :P5 "adverb" is being added to modify how Raku interprets the pattern in the regex, one has to also add the m.
:P5 only supports a limited set of Perl's regex patterns. That said, it should be enough for the regex you've written in your question.
Using a Raku regex
If you want to use a Raku regex you have to learn the Raku regex language.
The "spirit" of the Raku regex language is the same as Perl's, and some of the absolute basic syntax is the same as Perl's, but it's different enough that you should view it as yet another dialect of regex, just one that's generally "powered up" relative to Perl's regexes.
To rewrite the regex in Raku format I think it would be:
/ ^ '{:lang="' (<[a..z]>+) '"}' $ /
(Taking advantage of the fact whitespace in Raku regexes is ignored.)
Other problems in your code
After fixing the regex, one encounters other problems in your code.
The first problem I encountered is that $key is read-only, so $key++ fails. One option is to make it writable, by writing -> $key is copy ..., which makes $key a read-write copy of the index passed by the .kv.
But fixing that leads to another problem. And the code is so complex I've concluded I'd best not chase things further. I've addressed your immediate obstacle and hope that helps.
This one-liner seems to solve the problem:
say S:g /\{\: "lang" \= \" (\w+) \" \} /```$0/ given "text.md".IO.slurp;
Let's try to explain what was going on, however. The error was a regular expression grammar error, caused by a : followed by a name, all inside curly braces: {} runs code inside a regex. Raiph's answer is (obviously) correct in changing it to a Perl regular expression. What I've done here instead is use Raku's non-destructive substitution, with the :g global flag so that it acts on the whole file (slurped at the end of the line; I've saved it to a file called text.md). So this slurps your target file, given stores it in the $_ topic variable, and the result is printed once the substitution has been made. The good thing is that if you want to make more substitutions you can put another such expression in front, and it will act on the output.
Using this kind of expression is always going to be conceptually simpler, and possibly faster, than dealing with a text line by line.
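For readers more at home in another language, the same whole-text substitution can be sketched in Python. This is an illustration rather than a translation of the full conversion: like the one-liner, it only rewrites the {:lang="..."} marker lines and does not add a closing fence or remove the indentation. The function name is made up, and the three backticks are built with string repetition purely to keep the example easy to read.

```python
import re

FENCE = "`" * 3  # three backticks

def convert_lang_markers(markdown):
    """Replace every {:lang="xyz"} marker with an opening fence plus the language."""
    return re.sub(r'\{:lang="(\w+)"\}', lambda m: FENCE + m.group(1), markdown)

# Tiny stand-in for the contents of the markdown file.
sample = 'Intro\n{:lang="ruby"}\n    puts 1\n'
print(convert_lang_markers(sample))
```

Further substitutions can be chained by passing the result of one call to the next, which mirrors stacking several S/// expressions in the Raku version.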

Possible to use one RegEx group for multiple matches?

Example Text:
[ABC[[value='123'SomeTextHere[]]][value='5463',SomedifferentTextwithdifferentlength]][[value='Text';]]]]][ABC [...]
Current RegEx:
[ABC.*?(?:value='(.*?)')+.*?]]]
What I want to achieve:
There's an extremely long text (HTTP Response) with data I want to grab. A single dataset contains multiple lines. On every line the data I want to collect is located inside the "value:''" tag. On each line there are multiple of those value tags. Is it somehow possible to use (optimize) the above regex to get the data of all value tags with just a single capturing group in the regex pattern?
To clarify what I want: alternatively I would have to use the following pattern:
[ABC.*?value='(.*?)'.*?value='(.*?)'.*?value='(.*?)'.*?value='(.*?)'.*?]]]
Using Perl, you can easily get at all matches of a regular expression, and most of the other regular expression libraries have similar capabilities. As you want to match a header, doing a repeated match with an anchor (\G) is the easiest:
use strict;
#use Regexp::Debugger;
my $data = "[ABC[[value='123'SomeTextHere[]]][value='5463',SomedifferentTextwithdifferentlength]][[value='Text';]]]]][ABC [...]";
my @matches = $data =~ /(?:^\[ABC|\G).*?\bvalue='([^']*)'/g;
print "[$_]" for @matches;
__END__
[123][5463][Text]
Most likely you will need to add the "global" flag to whatever regex library you are using for matching.
Personally, I would split this up into a two-step process. First, extract the string between [ABC[[ and ]]], and then extract all value='...' parts from that string. Also, most likely, you can parse the string [ABC[[...]]] in a sane way, counting opening and closing brackets. Or maybe that string is even JSON and you can just use a proper parser there?
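A rough Python sketch of that two-step idea follows. It rests on an assumption that is not in the question: a dataset is taken to run from one [ABC header to the next, because in the sample the ]]] sequence also occurs in the middle of a dataset and so cannot serve as a reliable end marker. The function name values_per_dataset is invented for the example.

```python
import re

# Sample text copied from the question.
data = "[ABC[[value='123'SomeTextHere[]]][value='5463',SomedifferentTextwithdifferentlength]][[value='Text';]]]]][ABC [...]"

def values_per_dataset(text):
    """Return one list of value='...' contents per [ABC dataset."""
    # Step 1: cut the text at every [ABC header; the piece before the
    # first header belongs to no dataset and is dropped.
    chunks = re.split(r"\[ABC", text)[1:]
    # Step 2: a single capturing group collects all values in each chunk.
    return [re.findall(r"\bvalue='([^']*)'", chunk) for chunk in chunks]

print(values_per_dataset(data))  # [['123', '5463', 'Text'], []]
```

The second, empty list comes from the trailing "[ABC [...]" placeholder in the sample, which contains no value tags.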

Annotating mismatches in regular expression

I need to "annotate" with an X character each mismatch in a regular expression. For example, if I have a text file like:
Line1Name: this is a (string).
Line2Name: (a string)
Line3Name this is a line without parenthesis
Line4Name: (a string 2)
Now following regular expression will match everything before a :
^[^:]+(?=:)
so the result will be
Line1Name:
Line2Name:
Line4Name:
However I would need to annotate the mismatch at the 3rd line, having this output:
Line1Name:
Line2Name:
X
Line4Name:
Is this possible with regular expressions?
If you have a look at what a regular expression is, you will realize that it is not possible to do logical operations with a regex alone. Quoting Wikipedia:
In computing, a regular expression provides a concise and flexible means to “match” (specify and recognize) strings of text, such as particular characters, words, or patterns of characters.
emphasis mine – simply put, a regex is a fancy way to find a string; it either does (it matches), or not.
To achieve what you are after, you need some kind of logic switch that operates on the match / not-match result of your regex search and triggers an action. You haven’t specified in what environment you are using your regex, so providing a solution is a bit pointless, but as an example, this would do what you are trying to do in pure bash:
# assuming your string is in $str
result="$([[ $str =~ ^[^:]+: ]] && echo "${str%%:*}" || echo "X")"
and this does the same thing in a language supporting your regex pattern (Ruby):
# assuming your string is in str
result = str[/^[^:]+(?=:)/] || "X"
As a side note, your example code does not match the output: you are using a lookahead for the colon, which excludes it in the match, but your output includes it. I’ve opted for sticking with your regex over your output pattern in my examples, thus excluding the colon from the result.
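To make the "regex plus a logic switch" idea concrete for a whole file rather than a single string, here is a small Python sketch. It keeps the question's pattern (so the colon is excluded from the result); the function name annotate and the marker parameter are assumptions made for the example.

```python
import re

# The pattern from the question: everything before the first colon.
PREFIX = re.compile(r"^[^:]+(?=:)")

def annotate(lines, marker="X"):
    """Return the matched prefix of each line, or marker where the regex fails."""
    out = []
    for line in lines:
        m = PREFIX.search(line)
        out.append(m.group(0) if m else marker)
    return out

lines = [
    "Line1Name: this is a (string).",
    "Line2Name: (a string)",
    "Line3Name this is a line without parenthesis",
    "Line4Name: (a string 2)",
]
print("\n".join(annotate(lines)))
```

The regex itself still only reports match or no match; the conditional in the loop is what turns a failed match into the X annotation.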

Using a match from a regex in another regex: skipping over metacharacters

I have a regular expression (REGEX 1) plus some Perl code that picks out a specific string of text, call it the START_POINT, from a large text document. This START_POINT is the beginning of a larger string of text that I want to extract from the large text document. I want to use another regular expression (REGEX 2) to extract from START_POINT to an END_POINT. I have a set of words to use in the regular expression (REGEX 2) which will easily find the END_POINT. Here is my problem. The START_POINT text string may contain metacharacters which will be interpreted differently by the regular expression. I don't know ahead of time which ones these will be. I am trying to process a large set of text documents and the START_POINT will vary from document to document. How do I tell a regular expression to interpret a text string as just the text string and not as a text string with metacharacters?
Perhaps this code will help this make more sense. $START_POINT was identified in code above this piece of code and is an extracted part of the large text string $TEXT.
my $END_POINT = "(STOP|CEASE|END|QUIT)";
my @NFS = $TEXT =~ m/(($START_POINT).*?($END_POINT))/misog;
I have tried to use the quotemeta function, but haven't had any success. It seems to destroy the integrity of the $START_POINT text string by adding in slashes which change the text.
So to summarize I am looking for some way to tell the regular expression to look for the exact string in $START_POINT without interpreting any of the string as a metacharacter while still maintaining the integrity of the string. Although I may be able to get the quotemeta to work, do you know of any other options available?
Thanks in advance for your help!
You need to convert the text to a regex pattern. That's what quotemeta does.
my $start = '*';
my $start_pat = quotemeta($start); # * => \*
/$start_pat/ # Matches "*"
quotemeta can be accessed via \Q..\E:
my $start = '*';
/\Q$start\E/ # Matches "*"
Why reimplement quotemeta?
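The same principle carries over to other regex libraries. As a hedged illustration, Python's re.escape plays the role of quotemeta: the escaped string is only used to build the pattern, while the original start string stays untouched. The function name extract, the sample text and the start string below are all invented for this sketch; the flags mirror the i and s modifiers from the question.

```python
import re

def extract(text, start_point, end_words=("STOP", "CEASE", "END", "QUIT")):
    """Return the span from the literal start_point to the first end word, or None."""
    end_pat = "|".join(re.escape(word) for word in end_words)
    # re.escape turns start_point into a pattern that matches it literally,
    # so characters such as ( ) * [ ] lose their special meaning.
    pattern = re.compile(
        re.escape(start_point) + r".*?(?:" + end_pat + r")",
        re.IGNORECASE | re.DOTALL,
    )
    m = pattern.search(text)
    return m.group(0) if m else None

# Made-up document whose start point is full of metacharacters.
text = "intro ... Cost (USD)* [net]\nline two\nand then we STOP here"
print(extract(text, "Cost (USD)* [net]"))
```

Note that the backslashes exist only inside the compiled pattern; the text that is matched and returned is the original, unescaped string.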

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
$values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
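The same approach, sketched in Python for comparison. The fourth line of the sample is not from the question; it is a hypothetical row added to show that a longer second field is handled as well.

```python
import re

# First three lines are the question's sample; the last line is a
# hypothetical row with a six-digit second field.
text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
bdfg34f;846777;72720;7;6637912;05072009;7;"""

# Anchor at line start, match the prefix normally, capture the wanted value.
VALUE = re.compile(r"^bdfg34f;\d{4,};(\d{0,9})", re.MULTILINE)

print(VALUE.findall(text))  # ['72719', '72720']
```

With a single capture group, findall returns just the captured values, which corresponds to $matches[1] in the PHP snippet.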
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
To match the number when the field before it has at least 4 digits (this needs an engine that supports variable-length lookbehind, such as .NET):
(?<=bdfg34f;\d{4,};)\d{0,9}
To match the number when the field before it has one or more digits:
(?<=bdfg34f;\d+;)\d{0,9}
Or to match the number only when the field before it has between 4 and 6 digits:
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in php, I have no idea what you're doing)
foreach (explode("\n", $string) as $line) {
$bits = explode(";", $line);
echo $bits[2]; // third column (indexes are zero-based)
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.
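For what it's worth, the split-based approach looks like this as a Python sketch. The function name third_field and the shortened sample rows are assumptions for illustration; unlike the regexes above, it returns the third column of every line rather than only the lines starting with bdfg34f.

```python
def third_field(text, sep=";"):
    """Yield the third sep-separated field of every line that has one."""
    for line in text.splitlines():
        fields = line.split(sep)
        if len(fields) >= 3:
            yield fields[2]  # zero-based index 2 is the third column

# Shortened rows based on the question's sample.
sample = "vfhnsirf;5234;72159;2;\nbdfg34f;8467;72719;7;\n"
print(list(third_field(sample)))  # ['72159', '72719']
```

Filtering on the first field (for example fields[0] == "bdfg34f") would be a one-line addition if only certain rows are wanted.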