Raku Regex to capture and modify the LFM code blocks - regex

Update: Corrected code added below
I have a Leanpub flavored markdown* file named sample.md. I'd like to convert its code blocks into GitHub flavored markdown style using a Raku regex:
Here's a sample **ruby** code, which
prints the elements of an array:
{:lang="ruby"}
['Ian','Rich','Jon'].each {|x| puts x}
Here's a sample **shell** code, which
removes the ending commas and
finds all folders in the current path:
{:lang="shell"}
sed s/,$//g
find . -type d
In order to capture the lang value, e.g. ruby from the {:lang="ruby"} and convert it into
```ruby
I use this code
my @in="sample.md".IO.lines;
my @out;
for @in.kv -> $key,$val {
    if $val.starts-with("\{:lang") {
        if $val ~~ /^{:lang="([a-z]+)"}$/ { # capture lang
            @out[$key]="```$0"; # convert it into ```ruby
            $key++;
            while @in[$key].starts-with(" ") {
                @out[$key]=@in[$key].trim-leading;
                $key++;
            }
            @out[$key]="```";
        }
    }
    @out[$key]=$val;
}
The line containing the Regex gives
Cannot modify an immutable Pair (lang => True) error.
I've just started out using Regexes. Instead of ([a-z]+) I've tried (\w) and it gave the Unrecognized backslash sequence: '\w' error, among other things.
How to correctly capture and modify the lang value using Regex?
* The LFM format shown here is only approximate.
Corrected code:
my @in="sample.md".IO.lines;
my \len=@in.elems;
my @out;
my $k = 0;
while ($k < len) {
    if @in[$k] ~~ / ^ '{:lang="' (\w+) '"}' $ / {
        push @out, "```$0";
        $k++;
        while @in[$k].starts-with(" ") {
            push @out, @in[$k].trim-leading;
            $k++;
        }
        push @out, "```";
    }
    push @out, @in[$k];
    $k++;
}
for @out {print "$_\n"}

TL;DR
TL? Then read @jjemerelo's excellent answer, which not only provides a one-line solution but much more in a compact form.
DR? Aw, imo you're missing some good stuff in this answer that JJ (reasonably!) ignores. Though, again, JJ's is the bomb. Go read it first. :)
Using a Perl regex
There are many dialects of regex. The regex pattern you've used is a Perl regex but you haven't told Raku that. So it's interpreting your regex as a Raku regex, not a Perl regex. It's like feeding Python code to perl. So the error message is useless.
One option is to switch to Perl regex handling. To do that, this code:
/^{:lang="([a-z]+)"}$/
needs m :P5 at the start:
m :P5 /^{:lang="([a-z]+)"}$/
The m is implicit when you use /.../ in a context where it is presumed you mean to immediately match, but because the :P5 "adverb" is being added to modify how Raku interprets the pattern in the regex, one has to also add the m.
:P5 only supports a limited set of Perl's regex patterns. That said, it should be enough for the regex you've written in your question.
Using a Raku regex
If you want to use a Raku regex you have to learn the Raku regex language.
The "spirit" of the Raku regex language is the same as Perl's, and some of the absolute basic syntax is the same as Perl's, but it's different enough that you should view it as yet another dialect of regex, just one that's generally "powered up" relative to Perl's regexes.
To rewrite the regex in Raku format I think it would be:
/ ^ '{:lang="' (<[a..z]>+) '"}' $ /
(Taking advantage of the fact whitespace in Raku regexes is ignored.)
Other problems in your code
After fixing the regex, one encounters other problems in your code.
The first problem I encountered is that $key is read-only, so $key++ fails. One option is to make it writable, by writing -> $key is copy ..., which makes $key a read-write copy of the index passed by the .kv.
But fixing that leads to another problem. And the code is so complex I've concluded I'd best not chase things further. I've addressed your immediate obstacle and hope that helps.

This one-liner seems to solve the problem:
say S:g /\{\: "lang" \= \" (\w+) \" \} /```$0/ given "text.md".IO.slurp;
Let's try to explain what was going on, however. The error was a regular expression grammar error, caused by a : followed by a name, all of it inside curly braces: in a Raku regex, {} runs code. Raiph's answer is (obviously) correct in changing it to a Perl regular expression. What I've done here instead is use Raku's non-destructive substitution, S///, with the :g global flag so that it acts on the whole file (slurped at the end of the line; I've saved it to a file called text.md).
So what this does is slurp your target file; with given, the contents are stored in the $_ topic variable, and the result is printed once the substitution has been made. The good thing is that if you want to make more substitutions you can put another such expression in front, and it will act on the output.
Using this kind of expression is always going to be conceptually simpler, and possibly faster, than dealing with a text line by line.
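As far as I can tell, the one-liner above rewrites only the {:lang="..."} marker line; it does not add a closing fence or strip the indentation of the code lines. For comparison, here is a minimal line-by-line sketch in Python (not Raku) of the full conversion described in the question. The function name lfm_to_gfm is made up for illustration, and it assumes, as the question's code does, that every code line of an LFM block starts with at least one space.

```python
import re

# Matches a whole line such as {:lang="ruby"} and captures the language name.
LANG_LINE = re.compile(r'^\{:lang="(\w+)"\}$')

# Three backticks, built this way so the literal does not appear in this example.
FENCE = "`" * 3


def lfm_to_gfm(lines):
    """Convert LFM-style code blocks to fenced blocks (hypothetical helper).

    Assumption: the code lines that follow a {:lang="..."} marker are exactly
    the consecutive lines starting with a space.
    """
    out = []
    i = 0
    while i < len(lines):
        marker = LANG_LINE.match(lines[i])
        if marker:
            out.append(FENCE + marker.group(1))      # opening fence with language
            i += 1
            while i < len(lines) and lines[i].startswith(" "):
                out.append(lines[i].lstrip())        # drop the leading indentation
                i += 1
            out.append(FENCE)                        # closing fence
        else:
            out.append(lines[i])
            i += 1
    return out
```

Checking the index against the length of the list in the inner loop avoids running past the end when a code block is the last thing in the file.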

Related

Perl: how to use string variables as search pattern and replacement in regex

I want to use string variables for both search pattern and replacement in regex. The expected output is like this,
$ perl -e '$a="abcdeabCde"; $a=~s/b(.)d/_$1$1_/g; print "$a\n"'
a_cc_ea_CC_e
But when I moved the pattern and replacement to a variable, $1 was not evaluated.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/g; print "$a\n"'
a_$1$1_ea_$1$1_e
When I use "ee" modifier, it gives errors.
$ perl -e '$a="abcdeabCde"; $p="b(.)d"; $r="_\$1\$1_"; $a=~s/$p/$r/gee; print "$a\n"'
Scalar found where operator expected at (eval 1) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 1) line 1, near "$1_"
(Missing operator before _?)
Scalar found where operator expected at (eval 2) line 1, near "$1$1"
(Missing operator before $1?)
Bareword found where operator expected at (eval 2) line 1, near "$1_"
(Missing operator before _?)
aeae
What do I miss here?
Edit
Both $p and $r are written by myself. What I need is to do multiple similar regex replacing without touching the perl code, so $p and $r have to be in a separate data file. I hope this file can be used with C++/python code later.
Here are some examples of $p and $r.
^(.*\D)?((19|18|20)\d\d)年 $1$2<digits>年
^(.*\D)?(0\d)年 $1$2<digits>年
([TKZGD])(\d+)/(\d+)([^\d/]) $1$2<digits>$3<digits>$4
([^/TKZGD\d])(\d+)/(\d+)([^/\d]) $1$3分之$2$4
With $p="b(.)d"; you are getting a string with literal characters b(.)d. In general, regex patterns are not preserved in quoted strings and may not have their expected meaning in a regex. However, see Note at the end.
This is what qr operator is for: $p = qr/b(.)d/; forms the string as a regular expression.
As for the replacement part and /ee, the problem is that $r is first evaluated, to yield _$1$1_, which is then evaluated as code. Alas, that is not valid Perl code. The _ are barewords and even $1$1 itself isn't valid (for example, $1 . $1 would be).
The provided examples of $r have $Ns mixed with text in various ways. One way to parse this is to extract all $N and all else into a list that maintains their order from the string. Then, that can be processed into a string that will be valid code. For example, we need
'$1_$2$3other' --> $1 . '_' . $2 . $3 . 'other'
which is valid Perl code that can be evaluated.
The part of breaking this up is helped by split's capturing in the separator pattern.
sub repl {
    my ($r) = @_;
    my @terms = grep { $_ } split /(\$\d)/, $r;
    return join '.', map { /^\$/ ? $_ : q(') . $_ . q(') } @terms;
}
$var =~ s/$p/repl($r)/gee;
With capturing /(...)/ in split's pattern, the separators are returned as a part of the list. Thus this extracts from $r an array of terms which are either $N or other, in their original order and with everything (other than trailing whitespace) kept. This includes possible (leading) empty strings so those need be filtered out.
Then every term other than $Ns is wrapped in '', so when they are all joined by . we get a valid Perl expression, as in the example above.
Then /ee will have this function return the string (such as above), and evaluate it as valid code.
We are told that safety of using /ee on external input is not a concern here. Still, this is something to keep in mind. See this post, provided by Håkon Hægland in a comment. Along with the discussion it also directs us to String::Substitution. Its use is demonstrated in this post. Another way to approach this is with replace from Data::Munge
For more discussion of /ee see this post, with several useful answers.
Note on using "b(.)d" for a regex pattern
In this case, with parens and dot, their special meaning is maintained. Thanks to kangshiyin for an early mention of this, and to Håkon Hægland for asserting it. However, this is a special case. Double-quoted strings directly defeat many patterns since interpolation is done -- for example, "\w" is just an escaped w (which is unrecognized). Single quotes should work, as there is no interpolation. Still, strings intended for use as regex patterns are best formed using qr, as we then get a true regex. Then all modifiers may be used as well.
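Since the question mentions reusing the same rule file from Python later, here is a minimal sketch of how the $p / $r pairs could be applied there. It is not the Perl approach above: it translates each $N in the replacement into Python's \g<N> group reference and lets re.sub expand it, so nothing from the data file is evaluated as code. The function names are made up for illustration, and it assumes the patterns stay within syntax that Python's re module also accepts.

```python
import re


def perl_to_python_replacement(repl):
    """Turn a Perl-style replacement such as '_$1$1_' into a re.sub template."""
    # Escape backslashes first so literal text is not read as a template escape.
    escaped = repl.replace("\\", "\\\\")
    # $1, $2, ... become \g<1>, \g<2>, ...
    return re.sub(r"\$(\d+)", r"\\g<\1>", escaped)


def apply_rule(text, pattern, repl):
    """Apply one (pattern, replacement) rule read from the data file."""
    return re.sub(pattern, perl_to_python_replacement(repl), text)


print(apply_rule("abcdeabCde", "b(.)d", "_$1$1_"))  # prints a_cc_ea_CC_e
```

The \g<N> form is used rather than \N so that a group reference followed by a digit in the replacement text stays unambiguous.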

Change delimiter of grep command

I am using grep to detect <a href="xxxx">something here</a>.
This is not working when the link is split over two lines in the input. I want grep to keep checking until it detects a </a>, but right now it only takes input up to the next newline.
So if input is like <a href="xxxx">something here</a> it works, but if input is like
<a href="xxxx">
something here /a>
, then it doesn't.
Any solutions?
I'd use awk rather than grep. This should work:
awk '/a href="xxxx">/,/\/a>/' filename
I think you would have much less trouble using some xslt tool, but you can do it with sed, awk or an extended version of grep, pcregrep, which is capable of multiline patterns (-M).
I'd suggest folding the input so that opening and closing tags are on the same line, then checking the line against the pattern. An idiomatic approach using sed(1):
sed '/<[Aa][^A-Za-z]/{ :A
/<\/[Aa]>/ bD
N
bA
:D
/\n/ s// /g
}
# now try your pattern
/<[Aa][^A-Za-z] href="xxx"[^>]*>[^<]*something here[^<]*<\/[Aa]>/ !d'
This is probably a repeat question:
Grep search strings with line breaks
You can try it with the tr '\n' ' ' command, as was explained in one of the answers, if all you need is to find the files and not the line numbers.
Consider egrep -3 '(<a|</a>)'
"-3" prints up to 3 surrounding lines around each regex match (3 lines before and 3 lines after the match). You can use -1 or -2 as well if that works better.
perl -e '$_=join("", <>); m#<a.*?>.*?<.*?/a>#s; print "$&\n";'
So the trick here is that the entire input is read into $_. Then a standard /.../ regex is run. I used the alternate syntax m#...# so that I do not have to backslash "/"s which are used in xml. Finally the "s" postfix makes multiline matches work by making "." also match newlines (note also option "m" which changes the meaning of ^ and $). "$&" is the matched string. It is the result you are looking for. If you want just the inner-text, you can put round brackets around that part and print $1.
I am assuming that you meant </a> rather than /a> as an xml closing delimiter.
Note the .*? is a non-greedy version of .* so for <a>1</a><a>2</a>, it only matches <a>1</a>.
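The same two ideas (let . match newlines, and use a non-greedy .*?) can be sketched in Python as well. This is only an illustration, not a translation of the perl one-liner: the name find_anchors is made up, and it assumes the closing tag really is </a>.

```python
import re

# re.DOTALL plays the role of Perl's /s: "." also matches newlines.
# ".*?" is non-greedy, so each match stops at the first closing tag.
ANCHOR = re.compile(r"<a\b[^>]*>.*?</a>", re.DOTALL | re.IGNORECASE)


def find_anchors(text):
    """Return every <a ...>...</a> element, including ones split across lines."""
    return ANCHOR.findall(text)
```

Reading the whole input into one string before matching is what makes the multi-line case work; matching line by line would miss it for the same reason grep does.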
Note that nested nodes may cause problems, e.g. <a><a></a></a>. This is the same as when trying to match nested brackets "(", ")" or "{", "}". This is a more interesting problem. Regexes are normally stateless, so they do not by themselves support tracking an unlimited bracket-nesting depth. When programming parsers, you normally use regexes for low-level string matching and something else, e.g. bison, for higher-level parsing of tokens. There are bison grammars for many languages and probably for xml. xslt might even be better but I am not familiar with it. But for a very simple use case, you can also handle nested blocks like this in perl:
Nested bracket-handling code: (this could be easily adapted to handle nested xml blocks)
$_ = "a{b{c}e}f";
my($level)=(1);
s/.*?({|})/$1/; # throw away everything before first match
while(/{|}/g) {
if($& eq "{") {
++$level;
} elsif($& eq "}") {
--$level;
if($level == 1) {
print "Result: ".$`.$&."\n";
$_=$'; # reset searchspace to after the match
last;
}
}
}
Result: {b{c}e}
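For comparison, the same depth-counting idea can be written without any regex at all. The following Python sketch is an illustration only (first_balanced is a made-up name): it walks the string once, tracks the nesting level, and returns the first balanced block.

```python
def first_balanced(text, opener="{", closer="}"):
    """Return the first balanced opener...closer block in text, or None."""
    depth = 0
    start = None
    for pos, char in enumerate(text):
        if char == opener:
            if depth == 0:
                start = pos          # remember where the outermost block begins
            depth += 1
        elif char == closer and depth > 0:
            depth -= 1
            if depth == 0:           # outermost block just closed
                return text[start:pos + 1]
    return None


print(first_balanced("a{b{c}e}f"))  # prints {b{c}e}
```

A stray closing bracket before any opening one is simply ignored here; a real parser would probably want to report it instead.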

Turning off greed not working in this regex

I am trying to run the following search (with . made to match newlines either by adding the /s flag in perl or replacing it with \_. in vim):
/<output_channels>.*(?=Story).*?<\/output_channels>/
However the ? isn't turning off greed as it normally does - can anyone explain why? For example, it matches the entire contents of the following file rather than just the first element:
<output_channels>
<output_channel>RSS</output_channel>
<output_channel>Story</output_channel>
</output_channels>
<output_channels>
<output_channel>RSS</output_channel>
</output_channels>
Sorry if I'm missing something obvious.
I put your sample text into a vim buffer, and then executed the command
:%!perl -e '$text = join("", <STDIN>); $text =~ /<output_channels>.*(?=Story).*?<\/output_channels>/s; print $&;'
The result is just the first block of XML. I think this is what you want?
Note that I escaped the / within the regex. Other than this, it is the same one given in your question.
Also note that the equivalent vim RE would be (tested, works):
<output_channels>\_.*\(story\)\@=\_.\{-}<\/output_channels>
See :help perl-patterns for a rundown of the differences between perl and vim REs.
Further note that parsing hierarchical markup with regexps has been known to reawaken ancient demons.
The first .* in your regex is still greedy. You only added ? after the second one.
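To see the effect of that greedy first .*, here is a small Python sketch. The sample is a hypothetical variation of the one in the question (Story moved into the second element), and the second pattern is one possible fix, a "tempered" dot that refuses to step over a closing tag; it is not something proposed in the answers above.

```python
import re

# Hypothetical variation of the question's sample: Story is in the SECOND element.
SAMPLE = (
    "<output_channels>\n"
    "<output_channel>RSS</output_channel>\n"
    "</output_channels>\n"
    "<output_channels>\n"
    "<output_channel>Story</output_channel>\n"
    "</output_channels>\n"
)

CLOSE = "</output_channels>"

# The pattern from the question: the first .* is greedy and may cross elements.
NAIVE = re.compile(r"<output_channels>.*(?=Story).*?</output_channels>", re.DOTALL)

# Each "." is only allowed where a closing tag does not start, so the match
# cannot leave the element it began in.
TEMPERED = re.compile(
    r"<output_channels>(?:(?!</output_channels>).)*?Story.*?</output_channels>",
    re.DOTALL,
)


def blocks_spanned(pattern, text):
    """Count how many closing tags the first match contains (0 if no match)."""
    match = pattern.search(text)
    return match.group(0).count(CLOSE) if match else 0
```

With this sample the question's pattern swallows both elements, while the tempered version matches only the element that actually contains Story.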

Embedding evaluations in Perl regex

So I'm writing a quick perl script that cleans up some HTML code and runs it through an html -> pdf program. I want to lose as little information as possible, so I'd like to extend my textareas to fit all the text that is currently in them. This means, in my case, setting the number of rows to a calculated value based on the length of the string inside the textbox.
This is the regex I'm currently using:
$file=~s/<textarea rows="(.+?)"(.*?)>(.*?)<\/textarea>/<textarea rows="(?{ length($3)/80 })"$2>$3<\/textarea>/gis;
Unfortunately Perl doesn't seem to recognize what I was told was the syntax for embedding Perl code inside search-and-replace regexes.
Are there any Perl junkies out there willing to tell me what I'm doing wrong?
Regards,
Zach
The (?{...}) pattern is an experimental feature for executing code on the match side, but you want to execute code on the replacement side. Use the /e regular-expression switch for that:
#! /usr/bin/perl
use warnings;
use strict;
use POSIX qw/ ceil /;
while (<DATA>) {
s[<textarea rows="(.+?)"(.*?)>(.*?)</textarea>] {
my $rows = ceil(length($3) / 80);
qq[<textarea rows="$rows"$2>$3</textarea>];
}egis;
print;
}
__DATA__
<textarea rows="123" bar="baz">howdy</textarea>
Output:
<textarea rows="1" bar="baz">howdy</textarea>
The syntax you are using to embed code is only valid in the "match" portion of the substitution (the left hand side). To embed code in the right hand side (which is a normal Perl double quoted string), you can do this:
$file =~ s{<textarea rows="(.+?)"(.*?)>(.*?)</textarea>}
{<textarea rows="@{[ length($3)/80 ]}"$2>$3</textarea>}gis;
This uses the Perl idiom of "some string @{[ embedded_perl_code() ]} more string".
But if you are working with a very complex statement, it may be easier to put the substitution into "eval" mode, where it treats the replacement string as Perl code:
$file =~ s{<textarea rows="(.+?)"(.*?)>(.*?)</textarea>}
{'<textarea rows="' . (length($3)/80) . qq{"$2>$3</textarea>}}gise;
Note that in both examples the regex is structured as s{}{}. This not only eliminates the need to escape the slashes, but also allows you to spread the expression over multiple lines for readability.
Must this be done with regex? Parsing any markup language (or even CSV) with regex is fraught with error. If you can, try to utilize a standard library:
http://search.cpan.org/dist/HTML-Parser/Parser.pm
Otherwise you risk the revenge of Cthulhu:
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
(Yes, the article leaves room for some simple string-manipulation, so I think your soul is safe, though. :-)
I believe your problem is an unescaped /
If it's not the problem, it certainly is a problem.
Try this instead, note the \/80
$file=~s/<textarea rows="(.+?)"(.*?)>(.*?)<\/textarea>/<textarea rows="(?{ length($3)\/80 })"$2>$3<\/textarea>/gis;
The basic pattern for this code is:
$file =~ s/some_search/some_replace/gis;
The gis are modifiers: g = global, i = case insensitive, s = let . match newlines as well.
First, you need to quote the / inside the expression in the replacement text (otherwise perl will see a s/// operator followed by the number 80 and so on). Or you can use a different delimiter; for complex substitutions, matching brackets are a good idea.
Then you get to the main problem, which is that (?{...}) is only available in patterns. The replacement text is not a pattern, it's (almost) an ordinary string.
Instead, there is the e modifier to the s/// operator, which lets you write a replacement expression rather than replacement string.
$file =~ s(<textarea rows="(.+?)"(.*?)>(.*?)</textarea>)
("<textarea rows=\"" . (length($3)/80) . "\"$2>$3</textarea>")egis;
As per http://perldoc.perl.org/perlrequick.html#Search-and-replace, this can be accomplished with the "evaluation modifier s///e", i.e., your gis must have an extra e in it.
The evaluation modifier s///e wraps an eval{...} around the replacement string and the evaluated result is substituted for the matched substring. Some examples:
# convert percentage to decimal
$x = "A 39% hit rate";
$x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate"
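The idea behind s///e, computing the replacement from the match, exists in other regex libraries too. As an illustration only, here is a Python sketch of the textarea case in which re.sub is given a function instead of a string. The names resize and fit_textareas are made up, and the 80 characters per row and the rounding up follow the first answer above.

```python
import math
import re

# Same pattern as in the question; DOTALL and IGNORECASE stand in for /s and /i.
TEXTAREA = re.compile(
    r'<textarea rows="(.+?)"(.*?)>(.*?)</textarea>',
    re.IGNORECASE | re.DOTALL,
)


def resize(match):
    """Build the replacement for one textarea from its captured groups."""
    rows = math.ceil(len(match.group(3)) / 80)   # assumes 80 characters per row
    return f'<textarea rows="{rows}"{match.group(2)}>{match.group(3)}</textarea>'


def fit_textareas(html):
    """Rewrite the rows attribute of every textarea (re.sub replaces all matches)."""
    return TEXTAREA.sub(resize, html)
```

The usual caveat about parsing markup with regexes applies here just as much as in the Perl versions.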

Regex which ignores comments

Being a regex beginner, I need some help writing a regex. It should match a particular pattern, let's say "ABC". But the pattern shouldn't be matched when it is used in a comment (' being the comment sign). So XYZ ' ABC
shouldn't match. x("teststring ABC") also shouldn't match. But ABC("teststring ' xxx") has to match to the end, that is, xxx must not be cut off.
Also, does anybody know a free regex application that you can use to "debug" your regex? I often have trouble recognizing what's wrong with my attempts. Thanks!
Some will swear by RegexBuddy. I've never used the debugger, but I advise you to steer away from the regex generator it provides. It's just a bad idea.
You may be able to pull this off with whatever regex flavor you're using, but in general I think you're going to find it easier and more maintainable to do this the "hard" way. Regular expressions are for regular languages, and nested anything usually means that regexes aren't a good idea. Modern extensions to regex syntax mean it may be doable, but it's not going to be pretty, and you sure won't remember what happened in the morning. And one place where regular expressions fail quite spectacularly (even with modern non-regular extensions) is parsing nested structures - trying to parse any mixture of comments, quoted strings, and parentheses quickly devolves into an incomprehensible and unmaintainable mess. Don't get me wrong - I'm a fan of regular expressions in the right places. This isn't one of them.
On the topic of good regex tools, I really like RegexBuddy, but it's not free.
Other than that, a regex is the wrong tool for the job if you need to check inside string delimiters and all sorts too. You need a finite-state machine.
Odd that lots of people recommend their favorite tools, but nobody provides a solution for the problem at hand. (I'm the developer of RegexBuddy, so I'll refrain from recommending any tools.)
There's no good way of matching Y except when it's part of XYZ with a single regular expression. What you can do is write a regex that matches both Y and XYZ: Y|XYZ. Then use a bit of extra code to process the matches for Y, and ignore those for XYZ. One way to do that is with a capturing group: (Y)|XYZ. Now you can process the matches of the first capturing group. When XYZ matches, the capturing group doesn't match anything.
To do this for your VB-style comments, you can use the regex:
'.*|(ABC)
This regex matches a single quote and everything up to the end of the line, or ABC. This regex will match all comments (whether those include ABC or not). The capturing group will match all occurrences of ABC, except those in comments.
If you want your regex to both skip comments and strings, you can add strings to your regex:
'.*|"[^"\r\n]*"|(ABC)
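The "bit of extra code" that goes with this regex could look like the following Python sketch. It is only an illustration (find_real_abc is a made-up name): it runs the regex over one line and keeps only the matches in which the capturing group took part, so comments and strings are consumed but ignored.

```python
import re

# Alternatives, in order: a comment to end of line, a quoted string, or ABC.
# Only the last alternative fills capturing group 1.
SKIP_OR_ABC = re.compile(r"""'.*|"[^"\r\n]*"|(ABC)""")


def find_real_abc(line):
    """Return the offsets of every ABC that is outside comments and strings."""
    return [
        m.start(1)
        for m in SKIP_OR_ABC.finditer(line)
        if m.group(1) is not None
    ]
```

For the three lines in the question, this reports nothing for XYZ ' ABC and x("teststring ABC"), and offset 0 for ABC("teststring ' xxx"), because the quote character inside the string is consumed by the string alternative before it can start a comment.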
I find the best 'debugger' for regexes is just messing around in an interactive environment trying lots of small bits out. For Python, ipython is great; for Ruby, irb, for command-line type stuff, sed...
Just try out little pieces at a time, make sure you understand them, then add an extra little bit. Rinse and repeat.
For .NET development you might as well try RegexDesigner; this tool can generate code (VB/C#) for you. It is a very good tool for us regex starters.
Here is my solution to this problem:
1. Find and store all your comments in a hash
2. Do your regexp replacement
3. Bring comments back to file
Save your time :-)
using System.Collections.Generic;
using System.Text.RegularExpressions;

string fileTextWithComments = "Some text file contents";
Dictionary<string, string> comments = new Dictionary<string, string>();
// 1. Find and store all your comments in a hash
Regex rc = new Regex("(?:/\\*(?:[^*]|(?:\\*+[^*/]))*\\*+/)|(?://.*)");
MatchCollection matches = rc.Matches(fileTextWithComments);
int index = 0;
foreach (Match match in matches)
{
string key = string.Format("/*Comment#{0}*/", index++);
comments.Add(key, match.Value);
fileTextWithComments = fileTextWithComments.Replace(match.Value, key);
}
// 2. Do your regexp replacement
Regex r = new Regex("YOUR REGEXP PATTERN");
fileTextWithComments = r.Replace(fileTextWithComments, "NEW STRING");
// 3. Bring comments back to file :-)
foreach (string key in comments.Keys)
{
string comment = comments[key];
fileTextWithComments = fileTextWithComments.Replace(key, comment);
}
Could you clarify? I read it thrice, and I think you want to match a given pattern when it appears as a literal. As in not as part of a comment or a string.
What you're asking for is pretty tricky to do as a single regexp, because you want to skip strings. Multiple strings in one line would complicate matters.
I wouldn't even try to do it in one regexp. Instead, I'd pass each line through a filter first, to remove strings, and then comments in that order. And then try and match your pattern.
I'd do it in Perl because of its regexp processing power. Assume @lines is a list of lines you want to match, and $pattern is the pattern you want to match.
my @matches;
for (@lines) {
    my $line = $_;
    $line =~ s/"[^"]*?(?<!\\)"//g;   # strip double-quoted strings
    $line =~ s/'.*//g;               # strip the comment
    push @matches, $_ if $line =~ m/$pattern/;
}
The first substitution removes anything that starts with a double quotation mark and ends with the first unescaped double quote, using the standard escape character, a backslash.
The next strips comments. If the pattern still matches, it adds that line to the list of matches.
It's not perfect because it can't tell the difference between "a\\" and "a\". The first is usually a valid string, the latter is not. Either way these substitutions will continue to look for another "; if one isn't found, the string isn't thrown out. We could use another substitution to replace all double backslashes with something else, but this will cause problems if the pattern you're looking for contains a backslash.
You can use a zero width look-behind assertion if you only have single line comments, but if you're using multi-line comments, it gets a little trickier.
Ultimately, you really need to solve this kind of issue with some sort of parser, given that the definition of a comment is really driven by a grammar.
This answer to a different but related question looks good too...
If you have Emacs, there is a built-in regex tool called "regexp-builder". I don't really understand the specifics of your regex question well enough to suggest an answer to that.
RegEx1: (-user ")(.*?)"
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -user "test user"
Regex2: (-daterange ")(.*?)"
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -daterange "1/4/13 1/20/13"
RegEx3: (-date )(.*?)( -)
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -date 1/4/13 -
RegEx4: (-day )(.*?)( -)
Subject: report -user "test user" -date 1/4/13 -day monday -daterange "1/4/13 1/20/13" -
Result: -day monday -
Search for the quoted value first; if it is not found, search for the unquoted parameter. This expects only one occurrence of the parameter. It also expects the command to either use quotes to enclose a string with no quotes inside, or use any character other than a quote in the first position, have no occurrence of ' -' until the next parameter, and have a trailing ' -' (add it onto the string before the regex).
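The "quoted first, then unquoted" lookup described above can be wrapped in one small function. This Python sketch is an illustration only; get_param is a made-up name, and the regexes are the ones listed above with the parameter name filled in.

```python
import re


def get_param(command, name):
    """Return the value of -name from command, or None if it is absent.

    Tries the quoted form first (-name "value"), then the unquoted form
    (-name value), which must be followed by ' -'.
    """
    flag = "-" + re.escape(name)
    quoted = re.search(flag + r' "(.*?)"', command)
    if quoted:
        return quoted.group(1)
    # Append the trailing ' -' so an unquoted last parameter still matches.
    bare = re.search(flag + r" (.*?) -", command + " -")
    return bare.group(1) if bare else None
```

Because the flag is followed by a literal space in both regexes, asking for date does not accidentally pick up daterange.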