Turning off greed not working in this regex - regex

I am trying to run the following search (with . made to match newlines either by adding the /s flag in perl or replacing it with \_. in vim):
/<output_channels>.*(?=Story).*?<\/output_channels>/
However the ? isn't turning off greed as it normally does - can anyone explain why? For example, it matches the entire contents of the following file rather than just the first element:
<output_channels>
<output_channel>RSS</output_channel>
<output_channel>Story</output_channel>
</output_channels>
<output_channels>
<output_channel>RSS</output_channel>
</output_channels>
Sorry if I'm missing something obvious.

I put your sample text into a vim buffer, and then executed the command
:%!perl -e '$text = join("", <STDIN>); $text =~ /<output_channels>.*(?=Story).*?<\/output_channels>/s; print $&;'
The result is just the first block of XML. I think this is what you want?
Note that I escaped the / within the regex. Other than this, it is the same one given in your question.
Also note that the equivalent vim RE would be (tested, works):
<output_channels>\_.*\(story\)\#=\_.\{-}<\/output_channels>
See :help perl-patterns for a rundown of the differences between perl and vim REs.
Further note that parsing heirarchical markup with regexps has been known to reawaken ancient demons.

The first .* in your regex is still greedy. You only added ? after the second one.

Related

Raku Regex to capture and modify the LFM code blocks

Update: Corrected code added below
I have a Leanpub flavored markdown* file named sample.md I'd like to convert its code blocks into Github flavored markdown style using Raku Regex
Here's a sample **ruby** code, which
prints the elements of an array:
{:lang="ruby"}
['Ian','Rich','Jon'].each {|x| puts x}
Here's a sample **shell** code, which
removes the ending commas and
finds all folders in the current path:
{:lang="shell"}
sed s/,$//g
find . -type d
In order to capture the lang value, e.g. ruby from the {:lang="ruby"} and convert it into
```ruby
I use this code
my #in="sample.md".IO.lines;
my #out;
for #in.kv -> $key,$val {
if $val.starts-with("\{:lang") {
if $val ~~ /^{:lang="([a-z]+)"}$/ { # capture lang
#out[$key]="```$0"; # convert it into ```ruby
$key++;
while #in[$key].starts-with(" ") {
#out[$key]=#in[$key].trim-leading;
$key++;
}
#out[$key]="```";
}
}
#out[$key]=$val;
}
The line containing the Regex gives
Cannot modify an immutable Pair (lang => True) error.
I've just started out using Regexes. Instead of ([a-z]+) I've tried (\w) and it gave the Unrecognized backslash sequence: '\w' error, among other things.
How to correctly capture and modify the lang value using Regex?
the LFM format just estimated
Corrected code:
my #in="sample.md".IO.lines;
my \len=#in.elems;
my #out;
my $k = 0;
while ($k < len) {
if #in[$k] ~~ / ^ '{:lang="' (\w+) '"}' $ / {
push #out, "```$0";
$k++;
while #in[$k].starts-with(" ") {
push #out, #in[$k].trim-leading;
$k++; }
push #out, "```";
}
push #out, #in[$k];
$k++;
}
for #out {print "$_\n"}
TL;DR
TL? Then read #jjemerelo's excellent answer which not only provides a one-line solution but much more in a compact form ;
DR? Aw, imo you're missing some good stuff in this answer that JJ (reasonably!) ignores. Though, again, JJ's is the bomb. Go read it first. :)
Using a Perl regex
There are many dialects of regex. The regex pattern you've used is a Perl regex but you haven't told Raku that. So it's interpreting your regex as a Raku regex, not a Perl regex. It's like feeding Python code to perl. So the error message is useless.
One option is to switch to Perl regex handling. To do that, this code:
/^{:lang="([a-z]+)"}$/
needs m :P5 at the start:
m :P5 /^{:lang="([a-z]+)"}$/
The m is implicit when you use /.../ in a context where it is presumed you mean to immediately match, but because the :P5 "adverb" is being added to modify how Raku interprets the pattern in the regex, one has to also add the m.
:P5 only supports a limited set of Perl's regex patterns. That said, it should be enough for the regex you've written in your question.
Using a Raku regex
If you want to use a Raku regex you have to learn the Raku regex language.
The "spirit" of the Raku regex language is the same as Perl's, and some of the absolute basic syntax is the same as Perl's, but it's different enough that you should view it as yet another dialect of regex, just one that's generally "powered up" relative to Perl's regexes.
To rewrite the regex in Raku format I think it would be:
/ ^ '{:lang="' (<[a..z]>+) '"}' $ /
(Taking advantage of the fact whitespace in Raku regexes is ignored.)
Other problems in your code
After fixing the regex, one encounters other problems in your code.
The first problem I encountered is that $key is read-only, so $key++ fails. One option is to make it writable, by writing -> $key is copy ..., which makes $key a read-write copy of the index passed by the .kv.
But fixing that leads to another problem. And the code is so complex I've concluded I'd best not chase things further. I've addressed your immediate obstacle and hope that helps.
This one-liner seems to solve the problem:
say S:g /\{\: "lang" \= \" (\w+) \" \} /```$0/ given "text.md".IO.slurp;
Let's try and explain what was going on, however. The error was a regular expression grammar error, caused by having a : being followed by a name, and all that inside a curly. {} runs code inside a regex. Raiph's answer is (obviously) correct, by changing it to a Perl regular expression. But what I've done here is to change it to a Raku's non-destructive substitution, with the :g global flag, to make it act on the whole file (slurped at the end of the line; I've saved it to a file called text.md). So what this does is to slurp your target file, with given it's saved in the $_ topic variable, and printed once the substitution has been made. Good thing is if you want to make more substitutions you can shove another such expression to the front, and it will act on the output.
Using this kind of expression is always going to be conceptually simpler, and possibly faster, than dealing with a text line by line.

Extracting string before pattern with sed (bash)

I need some help with sed to remove everything after matching pattern and remove the last "." if it exists..
Take this string as an example:
The.100.S02E05.720p.HDTV.x264-KILLERS.mkv
I want everything before the pattern "S[0-9][0-9]E[0-9[0-9]" except the last "."
What I want:
"The.100"
Does anyone have a great oneliner for this one?
It sounds like you can pretty much use exactly what you had in your question:
sed 's/\.*S[0-9][0-9]E[0-9][0-9].*//'
This matches an optional . character followed by the pattern you suggested (and anything after it), replacing with nothing. You were missing a ] in the question, which I have added.
Testing it out:
$ sed 's/\.*S[0-9][0-9]E[0-9][0-9].*//' <<<'The.100.S02E05.720p.HDTV.x264-KILLERS.mkv'
The.100

Is there a regex engine that supports "for each captured group" in replacement strings?

Here's my example. If I want to use a regex to replace tabs in the code with spaces, but wanted to preserve tab characters in the middle or end of a line of code, I would use this as my search string to capture each tab character at the start of a line: ^(\t)+
Now, how could I write a search string that replaces each captured group with four spaces? I'm thinking there must be some way to do this with backreferences?
I've found I can work around this by running similar regex-replacements (like s/^\t/ /g, s/^ \t/ /g, ...) multiple times until no more matches are found, but I wonder if there's a quicker way to do all the necessary replacements at once.
Note: I used sed format in my example, but I'm not sure if this is possible with sed. I'm wondering if sed supports this, and if not, is there a platform that does? (e.g., there's a Python/Java/bash extended regex lib that supports this.)
With perl and other languages that support this feature (Java, PCRE(PHP, R, libboost), Ruby, Python(the new regex module), .NET), you can use the \G anchor that matches the position after the last match or the start of the string:
s/(?:\G|^)\t/ /gm
This works in Perl. Maybe sed too, I don't know sed.
It relies on doing an eval, basically a callback.
It takes the length of $1 then cats ' ' that many times.
Perl sample.
my $str = "
\t\t\tThree
\t\tTwo
\tOne
None";
$str =~ s/^(\t+)/ ' ' x length($1) /emg;
print "$str\n";
Output
Three
Two
One
None
Just another idea that came to me, this could also be solved with positive lookbehind:
s/(?<=^[\t]*)\t/ /gm
It's ugly, but it works.
sed ':a
s/^\(\t*\)\t/\1 /
ta' YourFile
Use recursive action on 1 regex with sed, it's a workaround

Regex substituting opening parenthesis

As part of a parsing script I'm trying to convert strings like this:
<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">
into
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
The regex for the closing parenthesis works fine
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%29).)*)%29([^\"\']*[\"\'])~\1)\2~g" "$pageName".html
giving me
<a href="http://www.web.com/%20Special%20event%202013%20%282).pdf">
The problem arrises with the equivalent regex for the opening parenthesis:
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(\2~g" "$pageName".html
just returns the two groups with nothing in between:
<a href="http://www.web.com/%20Special%20event%202013%202%29.pdf">
Escaping the ( in the substitution with a backslash (or two) has no effect. If I wrap it in some other characters (say ~\1#(#\2~g ) the parenthesis still disappears (giving me %20##2%29 ).
If however in a fit of desperation I add seven parenthesises into the substitution, it works.
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(((((((\L\2~g" "$pageName".html
outputs
<a href="http://www.web.com/%20Special%20event%202013%20(2%29.pdf">
Can somebody please make sense of this.
Perhaps the following will be helpful or at least provide some direction. It will work on Perl version 10 and above.
use strict;
use warnings;
use v5.10.0; # For regex \K
use URI::Escape;
my $string = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">';
$string =~ s/.+2013%20\K([^.]+)(?=\.pdf)/uri_unescape($1)/e;
print $string;
Output:
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
Left enough of the date and the space (%20) as an anchor, then used \K to *K*eep all of that. Then captured the URI encoded text, which is later decoded and used as the substitution text.
The pattern you have doesn't match the string you show at all. It matches something that looks like
<a href=/"../$i-xxxxxxxxxxxxxxx%29xxxxxxxxxx">
with literal dots, and whatever $i contains.
Also, a couple of points about your substitution:
Don't escape characters that don't need escaping. It may take some experience to know without checking which characters you need to escape, but the main point of using ~ as a delimiter is to avoid having to escape slashes in the regex, so at least you could have avoided that.
Don't use \1, \2 etc. in the replacement string. Perl tries very hard to make this work, but normally in Perl those sequences mean to insert the characters \x01 and \x02. Use $1 and $2.
So your regex could be written
s~(href=/?["']\.\./$i-(?:(?!%29).)*)%29([^"']*["'])~$1)$2~;
but it still doesn't "work fine" with the string you gave, which would have to look something like
<a href=/"../$i-xxxxxxxxxxxxxxx%282%29xxxxxxxxxx">
again, containing whatever is in $i. I don't understand at all the optional slash before the href attribute value: it is invalid HTML.
However, using a string that your first regex matches, your second one also works, replacing opening parentheses correctly, so I can't guess at what the problem may be.
There is often no need to verify the entire string. You can just replace the parts you're interested in. So I would write something like
s/(href="[^"]+)%28(\d+)%29(\.pdf")/$1($2)$3/;
which works fine on the string you gave, and replaces both open and close parentheses at once.
I had some problems understanding your regex, but this might work:
perl -pe "s~(href\s*=\s*\"[^\"]*)%28(.*?)%29~\$1(\$2)~g" input

Change delimiter of grep command

I am using grep to detect something here
This is not working when the link is split on two lines in the input. I want to grep to check till it detects a </a> but right now it only is taking the input into grep till it detects a new line.
So if input is like something here it works, but if input is like
<a href="xxxx">
something here /a>
, then it doesn't.
Any solutions?
I'd use awk rather than grep. This should work:
awk '/a href="xxxx">/,/\/a>/' filename
I think you would have much less trouble using some xslt tool, but you can do it with sed, awk or an extended version of grep pcregrep, which is capable of multiline pattern (-M).
I'd suggest to fold input so openning and closing tags are on the same line, then check the line against the pattern. An idiomatic approach using sed(1):
sed '/<[Aa][^A-Za-z]/{ :A
/<\/[Aa]>/ bD
N
bA
:D
/\n/ s// /g
}
# now try your pattern
/<[Aa][^A-Za-z] href="xxx"[^>]*>[^<]*something here[^<]*<\/[Aa]>/ !d'
This is probably a repeat question:
Grep search strings with line breaks
You can try it with tr '\n' ' 'command as was explained in one of the answers, if all you need is to find the files and not the line numbers.
Consider egrep -3 '(<a|</a>)'
"-3" prints up to 3 surrounding lines around each regex match (3 lines before and 3 lines after the match). You can use -1 or -2 as well if that works better.
perl -e '$_=join("", <>); m#<a.*?>.*?<.*?/a>#s; print "$&\n";'
So the trick here is that the entire input is read into $_. Then a standard /.../ regex is run. I used the alternate syntax m#...# so that I do not have to backslash "/"s which are used in xml. Finally the "s" postfix makes multiline matches work by making "." also match newlines (note also option "m" which changes the meaning of ^ and $). "$&" is the matched string. It is the result you are looking for. If you want just the inner-text, you can put round brackets around that part and print $1.
I am assuming that you meant </a> rather than /a> as an xml closing delimiter.
Note the .*? is a non-greedy version of .* so for <a>1</a><a>2</a>, it only matches <a>1</a>.
Note that nested nodes may cause problems eg <a><a></a></a>. This is the same as when trying to match nested brackets "(", ")" or "{", "}". This is a more interesting problem. Regex's are normally stateless so they do not by themselves support keeping an unlimited bracket-nesting-depth. When programming parsers, you normally use regex's for low-level string matching and use something else for higher level parsing of tokens eg bison. There are bison grammars for many languages and probably for xml. xslt might even be better but I am not familiar with it. But for a very simple use case, you can also handle nested blocks like this in perl:
Nested bracket-handling code: (this could be easily adapted to handle nested xml blocks)
$_ = "a{b{c}e}f";
my($level)=(1);
s/.*?({|})/$1/; # throw away everything before first match
while(/{|}/g) {
if($& eq "{") {
++$level;
} elsif($& eq "}") {
--$level;
if($level == 1) {
print "Result: ".$`.$&."\n";
$_=$'; # reset searchspace to after the match
last;
}
}
}
Result: {b{c}e}