Why isn't this regex executing? - regex

I'm attempting to convert my personal wiki from Foswiki to Markdown files and then to a JAMstack deployment. Foswiki uses flat files and stores metadata in the following format:
%META:TOPICINFO{author="TeotiNathaniel" comment="reprev" date="1571215308" format="1.1" reprev="13" version="14"}%
I want to use a git repo for versioning and will worry about linking that to article metadata later. At this point I simply want to convert these blocks to something that looks like this:
---
author: Teoti Nathaniel
revdate: 1539108277
---
After a bit of tweaking I have constructed the following regex:
author\=\['"\]\(\\w\+\)\['"\]\(\?\:\.\*\)date\=\['"\]\(\\w\+\)\['"\]
According to regex101 this works and my two capture groups contain the desired results. Attempting to actually run it:
perl -0777 -pe 's/author\=\['"\]\(\\w\+\)\['"\]\(\?\:\.\*\)date\=\['"\]\(\\w\+\)\['"\]/author: $1\nrevdate: $2/gms' somefile.txt
gets me only this:
>
My previous attempt (which breaks if the details aren't in a specific order) looked like this and executed correctly:
perl -0777 -pe 's/%META:TOPICINFO\{author="(.*)"\ date="(.*)"\ format="(.*)"\ (.*)\}\%/author:$1 \nrevdate:$2/gms' somefile.txt
I think that this is an escape character problem but can't figure it out. I even went and found this tool to make sure that they are correct.
Brute-forcing my way to understanding here is feeling both inefficient and frustrating, so I'm asking the community for help.

The first major problem is that you're trying to use a single quote (') in the program, when the program is being passed to the shell in single quotes.
Escape any instance of ' in the program by using '\''. You could also use \x27 if the quote happens to be in a double-quoted string literal or regex literal (as is the case for every instance in your program).
perl -0777pe's/author=['\''"].../.../gs'
perl -0777pe's/author=[\x27"].../.../gs'
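Applied to your full pattern, the corrected one-liner might look something like this (a sketch using the \x27 form; I've also made the middle .* lazy so that, with the whole file slurped by -0777, a match can't run from the first author= all the way to the last date=):
perl -0777 -pe 's/author=[\x27"](\w+)[\x27"].*?date=[\x27"](\w+)[\x27"]/author: $1\nrevdate: $2/gs' somefile.txt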

I would try to break it down into a clean data structure and then process it. By separating the data processing from the printing, you can modify it to add extra data later. It also makes it far more readable. Please see the example below.
#!/usr/bin/env perl
use strict;
use warnings;
## yaml to print the data, not required for operation
use YAML::XS qw(Dump);
my $yaml;
my @lines = ('%META:TOPICINFO{author="TeotiNathaniel" comment="reprev" date="1571215308" format="1.1" reprev="13" version="14"}%');
for my $str (@lines)
{
### split line into component parts
my ( $type , $subject , $data ) = $str =~ /\%(.*?):(.*?)\{(.*)\}\%/;
## break data in {} into a hash
my %info = map( split(/=/), split(/\s+/, $data) );
## strip quotes if any exist
s/^"(.*)"$/$1/ for values %info;
#add to data structure
$yaml->{$type}{$subject} = \%info;
}
## yaml to print the data, not required for operation
print Dump($yaml);
## loop data and print
for my $t (keys %{ $yaml } ) {
for my $s (keys %{ $yaml->{$t} } ) {
print "-----------\n";
print "author: ".$yaml->{$t}{$s}{"author"}."\n";
print "date: ".$yaml->{$t}{$s}{"date"}."\n";
}
}

Ok, I kept fooling around with it by reducing the execution to a single term and expanding. I soon got to here:
$ perl -0777 -pe 's/author=['\"]\(\\w\+\)['"](?:.*)date=\['\"\]\(\\w\+\)\['\"\]/author\: \$1\\nrevdate\: \$2/gms' somefile.txt
Unmatched [ in regex; marked by <-- HERE in m/author=["](\w+)["](?:.*)date=\["](\w+)[ <-- HERE \"\]/ at -e line 1.
This eventually got me to here:
perl -0777 -pe 's/author=['\"]\(\\w\+\)['"](?:.*)date=['\"]\(\\w\+\)['\"]/\nauthor\ $1\nrevdate\:$2\n/gms' somefile.txt
Which produces messy output but works. (Note: the output is proof-of-concept, and this can now be used within a Python script to programmatically generate Markdown metadata.)
Thanks for being my rubber duckie, StackOverflow. Hopefully this is useful to someone, somewhere, somewhen.

Related

Edit within multi-line sed match

I have a very large file, containing the following blocks of lines throughout:
start :234
modify 123 directory1/directory2/file.txt
delete directory3/file2.txt
modify 899 directory4/file3.txt
Each block starts with the pattern "start : #" and ends with a blank line. Within the block, every line starts with "modify # " or "delete ".
I need to modify the path in each line, specifically appending a directory to the front. I would just use a general regex to cover the entire file for "modify #" or "delete ", but due to the enormous amount of other data in that file, there will likely be other matches to this somewhat vague pattern. So I need to use multi-line matching to find the entire block, and then perform edits within that block. This will likely result in >10,000 modifications in a single pass, so I'm also trying to keep the execution down to less than 30 minutes.
My current attempt is a sed one-liner:
sed '/^start :[0-9]\+$/ { :a /^[modify|delete] .*$/ { N; ba }; s/modify [0-9]\+ /&Appended_DIR\//g; s/delete /&Appended_DIR\//g }' file_to_edit
Which is intended to find the "start" line, loop while the lines either start with a "modify" or a "delete," and then apply the sed replacements.
However, when I execute this command, no changes are made, and the output is the same as the original file.
Is there an issue with the command I have formed? Would this be easier/more efficient to do in perl? Any help would be greatly appreciated, and I will clarify where I can.
I think you would be better off with perl
Specifically because you can work 'per record' by setting $/ - if your records are delimited by blank lines, setting it to "\n\n".
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n\n";
while (<>) {
#multi-lines of text one at a time here.
if (m/^start :\d+/) {
s/(modify \d+)/$1 Appended_DIR\//g;
s/(delete) /$1 Appended_DIR\//g;
}
print;
}
Each iteration of the loop will pick out a blank line delimited chunk, check if it starts with a pattern, and if it does, apply some transforms.
It'll take data from STDIN via a pipe, or myscript.pl somefile.
Output is to STDOUT and you can redirect that in the normal way.
Your limiting factors on processing files in this way are typically:
Data transfer from disk
pattern complexity
The more complex a pattern, and especially if it has variable-length matching going on, the more backtracking the regex engine has to do, which can get expensive. Your transforms are simple, so packaging them doesn't make very much difference, and your limiting factor will likely be disk IO.
(If you want to do an in place edit, you can with this approach)
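For example, a minimal in-place sketch (assuming paragraph mode via -00 is close enough to setting $/ to "\n\n", and that keeping a .bak backup is acceptable):
perl -00 -i.bak -pe 'if (/^start :\d+/) { s/(modify \d+)/$1 Appended_DIR\//g; s/(delete) /$1 Appended_DIR\//g }' file_to_edit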
If - as noted - you can't rely on a record separator, then what you can use instead is Perl's range operator (other answers already do this, I'm just expanding on it a bit):
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
    if ( /^start :/ .. /^$/ ) {
        s/(modify \d+)/$1 Appended_DIR\//g;
        s/(delete) /$1 Appended_DIR\//g;
    }
    print;
}
We don't change $/ any more, so it remains at its default of 'each line'. What we add, though, is a range operator that tests "am I currently within these two regular expressions?" - it's toggled true when you hit a "start" line and false when you hit a blank line (assuming that's where you would want to stop?).
It applies the pattern transformation if this condition is true, and it ... ignores and carries on printing if it is not.
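If the flip-flop operator is new to you, here's a minimal standalone illustration (hypothetical BEGIN/END markers, unrelated to the file above) - it prints only the lines between the two markers, inclusive:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
    # true from the line matching /^BEGIN$/ through the line matching /^END$/
    print if /^BEGIN$/ .. /^END$/;
}
__DATA__
before
BEGIN
inside the range
END
after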
sed's pattern ranges are your friend here:
sed -r '/^start :[0-9]+$/,/^$/ s/^(delete |modify [0-9]+ )/&prepended_dir\//' filename
The core of this trick is /^start :[0-9]+$/,/^$/, which is to be read as a condition under which the s command that follows it is executed. The condition is true if sed currently finds itself in a range of lines of which the first matches the opening pattern ^start :[0-9]+$ and the last matches the closing pattern ^$ (an empty line). -r is for extended regex syntax (-E for old BSD seds), which makes the regex more pleasant to write.
I would also suggest using perl. Although I would try to keep it in one-liner form:
perl -i -pe 'if ( /^start :/ .. /^$/){s/(modify [0-9]+ )/$1Append_DIR\//;s/(delete )/$1Append_DIR\//; }' file_to_edit
Or you can use redirection of stdout:
perl -pe 'if ( /^start :/ .. /^$/){s/(modify [0-9]+ )/$1Append_DIR\//;s/(delete )/$1Append_DIR\//; }' file_to_edit > new_file
with gnu sed (with BRE syntax):
sed '/^start :[0-9][0-9]*$/{:a;n;/./{s/^\(modify [0-9][0-9]* \|delete \)/\1NewDir\//;ba}}' file.txt
The approach here is not to store the whole block and then perform the replacements. Instead, when the start of the block is found, the next line is loaded into the pattern space; if the line is not empty, replacements are performed and the next line is loaded, and so on until the end of the block.
Note: gnu sed has the alternation feature | available, it may not be the case for some other sed versions.
a way with awk:
awk '/^start :[0-9]+$/,/^$/{if ($1=="modify"){$3="newdirMod/"$3;} else if ($1=="delete"){$2="newdirDel/"$2};}{print}' file.txt
This is very simple in Perl, and probably much faster than the sed equivalent
This one-line program inserts Appended_DIR/ after any occurrence of modify 999 or delete at the start of a line. It uses the range operator to restrict those changes to blocks of text starting with start :999 and ending with a line containing no printable characters
perl -pe"s<^(?:modify\s+\d+|delete)\s+\K><Appended_DIR/> if /^start\s+:\d+$/ .. not /\S/" file_to_edit
Good grief. sed is for simple substitutions on individual lines, that is all. Once you start using constructs other than s, g, and p (with -n) you are using the wrong tool. Just use awk:
awk '
/^start :[0-9]+$/ { inBlock=1 }
inBlock { sub(/^(modify [0-9]+|delete) /,"&Appended_DIR/") }
/^$/ { inBlock=0 }
{ print }
' file
start :234
modify 123 Appended_DIR/directory1/directory2/file.txt
delete Appended_DIR/directory3/file2.txt
modify 899 Appended_DIR/directory4/file3.txt
There are various ways you can do the above in awk, but I wrote it in the above style for clarity over brevity. I assume you aren't familiar with awk, but you should have no trouble following it, since it reuses your own sed script's regexps and replacement text.

Multiple replacement in sed

Is there a way to replace multiple captured groups and replace it with the value of the captured groups from a key-value format (delimited by =) in sed?
Sorry, that question is confusing so here is an example
What I have:
aaa="src is $src$ user is $user$!" src="over there" user="jason"
What I want in the end:
aaa="src is over there user is jason!"
I don't want to hardcode the position of the $var$ because they could change.
sed ':again
s/\$\([[:alnum:]]\{1,\}\)\$\(.*\) \1="\([^"]*\)"/\3\2/g
t again
' YourFile
As you can see, sed is not at all uninteresting for this kind of task ... even with elements spread over several lines it can work with only a few modifications, and it doesn't need a quick-and-dirty six complex lines in a more powerful language.
Principle:
create a label (for a future goto)
search for an occurrence of a pattern between $ $, take its associated content, and replace the pattern with that content followed by the rest of the string, but without the pattern's content definition
if a replacement occurs, try once again by restarting at the label at the start of the script; if not, print and treat the next line
This is a quick & dirty way to solve it using perl. It could fail in some ways (spaces, escaped double quotes, ...), but it would get the job done for most simple cases:
perl -ne '
## Get each key and value.
@captures = m/(\S+)=("[^"]+")/g;
## Extract first two elements as in the original string.
$output = join q|=|, splice @captures, 0, 2;
## Use a hash for a better look-up, and remove double quotes
## from values.
%replacements = @captures;
%replacements = map { $_ => substr $replacements{$_}, 1, -1 } keys %replacements;
## Use a regex to look-up into the hash for the replacements strings.
$output =~ s/\$([^\$]+)\$/$replacements{$1}/g;
printf qq|%s\n|, $output;
' infile
It yields:
aaa="src is over there user is jason!"

shell script: search and replace over multiple lines

I'm looking for a way to search and replace over multiple lines through a shell script. This is what I'm trying to do:
source:
[stuff before]
<!--WIERD_SPECIAL_COMMENT_BEGIN-->
[stuff here, possibly multiple lines.
<!--WIERD_SPECIAL_COMMENT_END-->
[stuff after]
target:
[stuff before]
[new content]
[stuff after]
In short, I want to delete the comments and everything between them and replace with some new content. Basically, I want to do a simple sed command over multiple lines, and if possible just using some basic *nix tools, no additional scripting language.
If you only need to match complete lines then you can do this task with
awk. Something like:
awk -v NEWTEXT=foo 'BEGIN{n=0} /COMMENT_BEGIN/ {n=1} {if (n==0) {print $0}} /COMMENT_END/ {print NEWTEXT; n=0}' < myfile.txt
If the file is not so well formatted, with comments on
the same line as text you want to keep or remove, then I
would use perl, read the entire file into a single string,
do a regular expression match and replace on that string, then write the new string to
a new file. This is not so simple and you need to write a perl script to do the work.
Something like:
#!/usr/bin/perl
$newtext = "foo\nbar";
undef $/; # no input separator, so the whole file is read.
$s = <>; # read whole file from stdin
$startPattern = quotemeta('<!--WIERD_SPECIAL_COMMENT_BEGIN-->');
$endPattern = quotemeta('<!--WIERD_SPECIAL_COMMENT_END-->');
$pattern = $startPattern . '.+' . $endPattern;
$s =~ s/$pattern/$newtext/sg;
print $s;
sed does this just fine. The following is as simple as it gets; if you need to extract stuff from the delimiter line before the start delimiter or after the end delimiter, that's going to be a little more complex.
sed '/<!--WIERD_SPECIAL_COMMENT_BEGIN-->/,/<!--WIERD_SPECIAL_COMMENT_END-->/d' input >output
If you have any control over this, fix the spelling of "weird".
Another solution... this can be done in a one-liner using Perl regular expressions, which I find easier to work with than sed or awk (which are cumbersome for multi-line match and replace):
perl -0 -i -pe 's/<!--WIERD_SPECIAL_COMMENT_BEGIN-->[\s\S]*<!--WIERD_SPECIAL_COMMENT_END-->/your new content here/gim' yourfile1.txt
please note that this will replace the file with the new, changed content.

How can I make this Perl one-liner to toggle character in line in a file?

I am attempting to write a one-line Perl script that will toggle a line in a configuration file from "commented" to not and back. I have the following so far:
perl -pi -e 's/^(#?)(\tDefaultServerLayout)/ ... /e' xorg.conf
I am trying to figure out what code to put in the replacement (...) section. I would like the replacement to insert a '#' if one was not matched on, and remove it if it was matched on.
pseudo code:
if ( $1 == '#' ) then
print $2
else
print "#$2"
My Perl is very rusty, and I don't know how to fit that into a s///e replacement.
My reason for this is to create a single script that will change (toggle) my display settings between two layouts. I would prefer to have this done in only one script.
I am open to suggestions for alternate methods, but I would like to keep this a one-liner that I can just include in a shell script that is doing other things I want to happen when I change layouts.
perl -pi -e 's/^(#?)(?=\tDefaultServerLayout)/ ! $1 && "#" /e' foo
Note the addition of ?= to simplify the replacement string by using a look-ahead assertion.
Some might prefer s/.../ $1 ? "" : "#" /e.
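Spelled out as the full command, that alternative would be (same one-liner, only the replacement expression differs):
perl -pi -e 's/^(#?)(?=\tDefaultServerLayout)/ $1 ? "" : "#" /e' xorg.conf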

Perl: Grabbing the nth and mth delimited words from each line in a file

Because of the more tedious way of adding hosts to be monitored in Nagios (it requires defining a host object, as opposed to the previous program which only required the IP and hostname), I figured it'd be best to automate this, and it'd be a great time to learn Perl, because all I know at the moment is C/C++ and Java.
The file I read from looks like this:
xxx.xxx.xxx.xxx hostname #comments. i.dont. care. about
All I want are the first 2 bunches of characters. These are obviously space delimited, but for the sake of generality, it might as well be anything. To make it more general, why not the first and third, or fourth and tenth? Surely there must be some regex action involved, but I'll leave that tag off for the moment, just in case.
The one-liner is great, if you're not writing more Perl to handle the result.
More generally though, in the context of a larger Perl program, you would either write a custom regular expression, for example:
if($line =~ m/(\S+)\s+(\S+)/) {
$ip = $1;
$hostname = $2;
}
... or you would use the split operator.
my @arr = split(/ /, $line);
$ip = $arr[0];
$hostname = $arr[1];
Either way, add logic to check for invalid input.
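For instance, a minimal sketch of that validation (the warning text and the comment-stripping step are additions of mine, assuming whitespace-delimited lines as in your example):
#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    $line =~ s/#.*//;                      # strip trailing comments
    next unless $line =~ /\S/;             # skip blank lines
    my ($ip, $hostname) = split ' ', $line;
    unless (defined $ip && defined $hostname) {
        warn "skipping malformed line: $line\n";
        next;
    }
    print "$ip $hostname\n";
}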
Let's turn this into code golf! Based on David's excellent answer, here's mine:
perl -ane 'print "@F[0,1]\n";'
Edit: A real golf submission would look more like this (shaving off five strokes):
perl -ape '$_="@F[0,1]
"'
but that's less readable for this question's purposes. :-P
Here's a general solution (if we step away from code-golfing a bit).
#!/usr/bin/perl -n
chop; # strip newline (in case next line doesn't strip it)
s/#.*//; # strip comments
next unless /\S/; # don't process line if it has nothing (left)
@fields = (split)[0,1]; # split line, and get wanted fields
print join(' ', @fields), "\n";
Normally split splits by whitespace. If that's not what you want (e.g., parsing /etc/passwd), you can pass a delimiter as a regex:
@fields = (split /:/)[0,2,4..6];
Of course, if you're parsing colon-delimited files, chances are also good that such files don't have comments and you don't have to strip them.
A simple one-liner is
perl -nae 'print "$F[0] $F[1]\n";'
you can change the delimiter with -F
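For example, splitting /etc/passwd on colons and printing the first and third fields (login name and UID) might look like this, assuming a standard passwd layout:
perl -F: -lane 'print "$F[0] $F[2]"' /etc/passwd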
David Nehme said:
perl -nae 'print "$F[0] $F[1]\n";'
which uses the -a switch. I had to look that one up:
-a turns on autosplit mode when used with a -n or -p. An implicit split
command to the @F array is done as the first thing inside the implicit
while loop produced by the -n or -p.
you learn something every day. -n causes each line to be passed to
LINE:
while (<>) {
... # your program goes here
}
And finally -e is a way to directly enter a single line of a program. You can have more than -e. Most of this was a rip of the perlrun(1) manpage.
Since ray asked, I thought I'd rewrite my whole program without using Perl's implicitness (except the use of <ARGV>; that's hard to write out by hand). This will probably make Python people happier (braces notwithstanding :-P):
while (my $line = <ARGV>) {
chop $line;
$line =~ s/#.*//;
next unless $line =~ /\S/;
@fields = (split ' ', $line)[0,1];
print join(' ', @fields), "\n";
}
Is there anything I missed? Hopefully not. The ARGV filehandle is special. It causes each named file on the command line to be read, unless none are specified, in which case it reads standard input.
Edit: Oh, I forgot. split ' ' is magical too, unlike split / /. The latter just matches a space. The former matches any amount of any whitespace. This magical behaviour is used by default if no pattern is specified for split. (Some would say, but what about /\s+/? ' ' and /\s+/ are similar, except for how whitespace at the beginning of a line is treated. So ' ' really is magical.)
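A quick illustration of the difference, with made-up values:
my @a = split ' ', "  10.0.0.1   myhost";   # ('10.0.0.1', 'myhost')
my @b = split / /, "  10.0.0.1   myhost";   # ('', '', '10.0.0.1', '', '', 'myhost')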
The moral of the story is, Perl is great if you like lots of magical behaviour. If you don't have a bar of it, use Python. :-P
To Find Nth to Mth Character In Line No. L --- Example For Finding Label
@echo off
REM Next line = Set command value to a file OR Just Choose Your File By Skipping The Line
vol E: > %temp%\justtmp.txt
REM Vol E: = Find Volume Label Of Drive E
REM Next line chooses the line; more +0 = line no. 1
for /f "usebackq delims=" %%a in (`more +0 %temp%\justtmp.txt`) DO (set findstringline=%%a& goto :nextstep)
:nextstep
REM Next line reads the nth to mth characters; here, 40 characters starting at offset 22
set result=%findstringline:~22,40%
echo %result%
pause
exit /b
Save as find label.cmd
The Result Will Be Your Drive E Label
Enjoy