Global Regex Substitution with Unique Arbitrary Values

I have one huge HTML file with many links, i.e. <a href="...">. I need to substitute each href with a unique arbitrary value. So, after substitution the first link will be <a href="http://link1">, the second link <a href="http://link2">, and so on.
Can we do this using a regex? Or, do I need to write a small script to scan over the file? Ideally, the solution will be a Perl or bash script (not something proprietary).
Thanks.

Perl is probably your best bet, but I wouldn't try to do it in one regex (might not even be possible). I think this is as short as you can make the script while still making it readable:
#!/usr/bin/perl
$link = 1;
while(<>) {
$link++ while( s/href="(?!link\d)[^"]*"/href="link$link"/ );
print;
}
Then call it like so:
./thatScript.pl inputFile.html > newInputFile.html
It will examine each line of input, and for each href="..." it finds, replaces it with a numbered link and increments the link number. There is also a negative lookahead to avoid replacing the same href continuously.
EDIT: Just for the hell of it, here's how you would compress the above into a single line of bash:
perl -pe '$link++ while( s/href="(?!link\d)[^"]*"/href="link$link"/ )' inFile.html > outFile.html
This makes use of Perl's amazing -p flag, as explained here.

I definitely don't recommend this (tchrist is right, of course, it should be a script) but it does have the virtue of being terse and fulfilling the literal requirements in a deterministic/repeatable way without needing to save state/mapping.
perl -MDigest::MD5=md5_hex -MXML::LibXML -le '$d = XML::LibXML->load_html( location => shift || die "need location" ); for $a ( $d->findnodes("//\@href") ) { $a->setValue( md5_hex $a->value ) }; print $d->serialize' targeted.html
Digest::MD5
XML::LibXML

untested:
perl -pe 's{(href=")[^"]+}{$1 . "http://link" . ++$count}ge' filename > newfile
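For comparison, here is a minimal sketch of the same "counter inside the replacement" idea in Python. The function name number_links is my own, and it assumes every href value is double-quoted, just like the Perl one-liners above. It is still a regex over HTML, so treat it as a quick hack rather than a robust rewrite.

```python
import itertools
import re


def number_links(html: str) -> str:
    """Replace every double-quoted href value with http://link1, http://link2, ..."""
    counter = itertools.count(1)

    def repl(match: re.Match) -> str:
        # Keep the leading href=" and substitute a freshly numbered URL.
        return f'{match.group(1)}http://link{next(counter)}"'

    return re.sub(r'(href=")[^"]*"', repl, html)


if __name__ == "__main__":
    print(number_links('<a href="a.html">x</a> <a href="b.html">y</a>'))
```

Because the callback is invoked once per match, no lookahead is needed to avoid rewriting the same link twice.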

Related

Why isn't this regex executing?

I'm attempting to convert my personal wiki from Foswiki to Markdown files and then to a JAMstack deployment. Foswiki uses flat files and stores metadata in the following format:
%META:TOPICINFO{author="TeotiNathaniel" comment="reprev" date="1571215308" format="1.1" reprev="13" version="14"}%
I want to use a git repo for versioning and will worry about linking that to article metadata later. At this point I simply want to convert these blocks to something that looks like this:
---
author: Teoti Nathaniel
revdate: 1539108277
---
After a bit of tweaking I have constructed the following regex:
author\=\['"\]\(\\w\+\)\['"\]\(\?\:\.\*\)date\=\['"\]\(\\w\+\)\['"\]
According to regex101 this works and my two capture groups contain the desired results. Attempting to actually run it:
perl -0777 -pe 's/author\=\['"\]\(\\w\+\)\['"\]\(\?\:\.\*\)date\=\['"\]\(\\w\+\)\['"\]/author: $1\nrevdate: $2/gms' somefile.txt
gets me only this:
>
My previous attempt (which breaks if the details aren't in a specific order) looked like this and executed correctly:
perl -0777 -pe 's/%META:TOPICINFO\{author="(.*)"\ date="(.*)"\ format="(.*)"\ (.*)\}\%/author:$1 \nrevdate:$2/gms' somefile.txt
I think that this is an escape character problem but can't figure it out. I even went and found this tool to make sure that they are correct.
Brute-forcing my way to understanding here is feeling both inefficient and frustrating, so I'm asking the community for help.
The first major problem is that you're trying to use a single quote (') in the program, when the program is being passed to the shell in single quotes.
Escape any instance of ' in the program by using '\''. You could also use \x27 if the quote happens to be inside a double-quoted string literal or a regex literal (as is the case for every instance in your program).
perl -0777pe's/author=['\''"].../.../gs'
perl -0777pe's/author=[\x27"].../.../gs'
I would try to break it down into a clean data structure and then process it. By separating the data processing from the printing, you can modify it to add extra data later. It also makes it far more readable. Please see the example below:
#!/usr/bin/env perl
use strict;
use warnings;
## yaml to print the data, not required for operation
use YAML::XS qw(Dump);
my $yaml;
my @lines = '%META:TOPICINFO{author="TeotiNathaniel" comment="reprev" date="1571215308" format="1.1" reprev="13" version="14"}%';
for my $str (@lines)
{
### split line into component parts
my ( $type , $subject , $data ) = $str =~ /\%(.*?):(.*?)\{(.*)\}\%/;
## break data in {} into a hash
my %info = map( split(/=/), split(/\s+/, $data) );
## strip quotes if any exist
s/^"(.*)"$/$1/ for values %info;
#add to data structure
$yaml->{$type}{$subject} = \%info;
}
## yaml to print the data, not required for operation
print Dump($yaml);
## loop data and print
for my $t (keys %{ $yaml } ) {
for my $s (keys %{ $yaml->{$t} } ) {
print "-----------\n";
print "author: ".$yaml->{$t}{$s}{"author"}."\n";
print "date: ".$yaml->{$t}{$s}{"date"}."\n";
}
}
Ok, I kept fooling around with it by reducing the execution to a single term and expanding. I soon got to here:
$ perl -0777 -pe 's/author=['\"]\(\\w\+\)['"](?:.*)date=\['\"\]\(\\w\+\)\['\"\]/author\: \$1\\nrevdate\: \$2/gms' somefile.txt
Unmatched [ in regex; marked by <-- HERE in m/author=["](\w+)["](?:.*)date=\["](\w+)[ <-- HERE \"\]/ at -e line 1.
This eventually got me to here:
perl -0777 -pe 's/author=['\"]\(\\w\+\)['"](?:.*)date=['\"]\(\\w\+\)['\"]/\nauthor\ $1\nrevdate\:$2\n/gms' somefile.txt
Which produces messy output but works. (Note: the output is a proof of concept; this can now be used within a Python script to programmatically generate Markdown metadata.)
Thanks for being my rubber duckie, StackOverflow. Hopefully this is useful to someone, somewhere, somewhen.
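Since the end goal is a Python script anyway, here is a minimal sketch that avoids the shell quoting problem entirely and does not care about attribute order. The names META_RE, ATTR_RE and topicinfo_to_front_matter are mine, and it assumes every attribute value is double-quoted with no embedded quotes, as in the sample line.

```python
import re

# The whole %META:TOPICINFO{...}% block, capturing what is between the braces.
META_RE = re.compile(r'%META:TOPICINFO\{(.*?)\}%')
# One key="value" pair inside the braces.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def topicinfo_to_front_matter(text):
    """Return a front-matter block for the first TOPICINFO line, or None."""
    match = META_RE.search(text)
    if match is None:
        return None
    attrs = dict(ATTR_RE.findall(match.group(1)))
    return (
        "---\n"
        f"author: {attrs.get('author', '')}\n"
        f"revdate: {attrs.get('date', '')}\n"
        "---"
    )


if __name__ == "__main__":
    line = ('%META:TOPICINFO{author="TeotiNathaniel" comment="reprev" '
            'date="1571215308" format="1.1" reprev="13" version="14"}%')
    print(topicinfo_to_front_matter(line))
```

Parsing the attributes into a dict first is the same design choice as the hash-based Perl answer: adding another field later is one more line in the output template.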

Edit within multi-line sed match

I have a very large file, containing the following blocks of lines throughout:
start :234
modify 123 directory1/directory2/file.txt
delete directory3/file2.txt
modify 899 directory4/file3.txt
Each block starts with the pattern "start : #" and ends with a blank line. Within the block, every line starts with "modify # " or "delete ".
I need to modify the path in each line, specifically appending a directory to the front. I would just use a general regex to cover the entire file for "modify #" or "delete ", but due to the enormous amount of other data in that file, there will likely be other matches to this somewhat vague pattern. So I need to use multi-line matching to find the entire block, and then perform edits within that block. This will likely result in >10,000 modifications in a single pass, so I'm also trying to keep the execution down to less than 30 minutes.
My current attempt is a sed one-liner:
sed '/^start :[0-9]\+$/ { :a /^[modify|delete] .*$/ { N; ba }; s/modify [0-9]\+ /&Appended_DIR\//g; s/delete /&Appended_DIR\//g }' file_to_edit
Which is intended to find the "start" line, loop while the lines either start with a "modify" or a "delete," and then apply the sed replacements.
However, when I execute this command, no changes are made, and the output is the same as the original file.
Is there an issue with the command I have formed? Would this be easier/more efficient to do in perl? Any help would be greatly appreciated, and I will clarify where I can.
I think you would be better off with perl
Specifically because you can work 'per record' by setting $/ - if your records are delimited by blank lines, set it to "\n\n".
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n\n";
while (<>) {
#multi-lines of text one at a time here.
if (m/^start :\d+/) {
s/(modify \d+ )/${1}Appended_DIR\//g;
s/(delete) /$1 Appended_DIR\//g;
}
print;
}
Each iteration of the loop will pick out a blank line delimited chunk, check if it starts with a pattern, and if it does, apply some transforms.
It'll take data from STDIN via a pipe, or myscript.pl somefile.
Output is to STDOUT and you can redirect that in the normal way.
Your limiting factors when processing files in this way are typically:
Data transfer from disk
pattern complexity
The more complex a pattern, and especially if it has variable matching going on, the more backtracking the regex engine has to do, which can get expensive. Your transforms are simple, so packaging them doesn't make very much difference, and your limiting factor will likely be disk I/O.
(If you want to do an in place edit, you can with this approach)
If - as noted - you can't rely on a record separator, then what you can use instead is Perl's range operator (other answers already do this, I'm just expanding it out a bit):
#!/usr/bin/env perl
use strict;
use warnings;
while (<>) {
if ( /^start :/ .. /^$/ ) {
s/(modify \d+ )/${1}Appended_DIR\//g;
s/(delete) /$1 Appended_DIR\//g;
}
print;
}
We don't change $/ any more, and so it remains on its default of 'each line'. What we add though is a range operator that tests "am I currently between these two regular expressions": it is toggled true when you hit a "start" line and false when you hit a blank line (assuming that's where you would want to stop?).
It applies the pattern transformation if this condition is true, and it ... ignores and carries on printing if it is not.
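If the flip-flop behaviour of the range operator is hard to picture, here is a rough Python sketch of the same logic with an explicit flag. The names prepend_dir, START and EDIT are made up for illustration, and the patterns assume the block format shown in the question (a "start :N" line, then "modify N path" or "delete path" lines, then a blank line).

```python
import re

START = re.compile(r'^start :\d+$')
EDIT = re.compile(r'^(modify \d+ |delete )')


def prepend_dir(lines, new_dir="Appended_DIR/"):
    """Yield lines, prefixing paths only between a start line and a blank line."""
    in_block = False
    for line in lines:
        stripped = line.rstrip("\n")
        if START.match(stripped):
            in_block = True            # the range "opens" here
        elif in_block and stripped == "":
            in_block = False           # the range "closes" on a blank line
        elif in_block:
            # A callable replacement avoids any escaping issues in new_dir.
            line = EDIT.sub(lambda m: m.group(1) + new_dir, line, count=1)
        yield line


if __name__ == "__main__":
    import sys
    sys.stdout.writelines(prepend_dir(sys.stdin))
```

It reads line by line, so memory use stays flat regardless of file size, which is the same property the line-at-a-time Perl and awk versions have.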
sed's pattern ranges are your friend here:
sed -r '/^start :[0-9]+$/,/^$/ s/^(delete |modify [0-9]+ )/&prepended_dir\//' filename
The core of this trick is /^start :[0-9]+$/,/^$/, which is to be read as a condition under which the s command that follows it is executed. The condition is true if sed currently finds itself in a range of lines of which the first matches the opening pattern ^start :[0-9]+$ and the last matches the closing pattern ^$ (an empty line). -r is for extended regex syntax (-E for old BSD seds), which makes the regex more pleasant to write.
I would also suggest using perl. Although I would try to keep it in one-liner form:
perl -i -pe 'if ( /^start :/ .. /^$/){s/(modify [0-9]+ )/$1Append_DIR\//;s/(delete )/$1Append_DIR\//; }' file_to_edit
Or you can use redirection of stdout:
perl -pe 'if ( /^start :/ .. /^$/){s/(modify [0-9]+ )/$1Append_DIR\//;s/(delete )/$1Append_DIR\//; }' file_to_edit > new_file
with gnu sed (with BRE syntax):
sed '/^start :[0-9][0-9]*$/{:a;n;/./{s/^\(modify [0-9][0-9]* \|delete \)/\1NewDir\//;ba}}' file.txt
The approach here is not to store the whole block and to proceed to the replacements. Here, when the start of the block is found the next line is loaded in pattern space, if the line is not empty, replacements are performed and the next line is loaded, etc. until the end of the block.
Note: GNU sed supports alternation (written \| in BRE syntax); this may not be the case for some other sed versions.
a way with awk:
awk '/^start :[0-9]+$/,/^$/{if ($1=="modify"){$3="newdirMod/"$3;} else if ($1=="delete"){$2="newdirDel/"$2};}{print}' file.txt
This is very simple in Perl, and probably much faster than the sed equivalent
This one-line program inserts Appended_DIR/ after any occurrence of modify 999 or delete at the start of a line. It uses the range operator to restrict those changes to blocks of text starting with start :999 and ending with a line containing no printable characters
perl -pe"s<^(?:modify\s+\d+|delete)\s+\K><Appended_DIR/> if /^start\s+:\d+$/ .. not /\S/" file_to_edit
Good grief. sed is for simple substitutions on individual lines, that is all. Once you start using constructs other than s, g, and p (with -n) you are using the wrong tool. Just use awk:
awk '
/^start :[0-9]+$/ { inBlock=1 }
inBlock { sub(/^(modify [0-9]+|delete) /,"&Appended_DIR/") }
/^$/ { inBlock=0 }
{ print }
' file
start :234
modify 123 Appended_DIR/directory1/directory2/file.txt
delete Appended_DIR/directory3/file2.txt
modify 899 Appended_DIR/directory4/file3.txt
There are various ways you can do the above in awk, but I wrote it in this style for clarity over brevity, since I assume you aren't familiar with awk but should have no trouble following it: it reuses your own sed script's regexps and replacement text.

sed/grep - get text between two strings (html)

I am trying to extract "pagename" from the following:
<a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a>
I tried to get it to work using "sed" but it only says invalid command code.
What line of code would you guys suggest to get the pagename? By the way: This is not a single line but there is more content on the same line - but that should not make a difference as it should just matter what is between the limiters, right?
Thanks in advance for helping me out!
I would use awk for this:
awk -F"[/?]" '/timetable work/ {print $4}' file
pagename
It searches for a line containing timetable work, then prints the fourth field, using / or ? as the separator.
As you commented, if you want to extract "<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>" you can use the following regex:
<a class="timetable.*?<\/a>
Working demo
If you want to grab the content just surround the regex with capturing groups:
(<a class="timetable.*?<\/a>)
The match is:
MATCH 1
1. [9-80] `<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>`
I think this is what you want:
sed 's_^.*<a [^<>]* href="https*://[^/]*/\([^"?]*\).*$_\1_'
Giving you exactly what you asked for using exactly the delimiters you told us to use:
$ sed -n 's|.*<a class="timetable work" href="http://www\.test\.com/\(.*\)?tag=meta376">Test</a>|\1|p' file
pagename
I know it may be tempting to handle this using a regular expression but here's an alternative.
You are trying to parse some HTML, so use an HTML parser. Here's an example in Perl:
use strict;
use warnings;
use feature qw(say);
use HTML::TokeParser::Simple;
use URI::URL;
my $filename = 'file.html';
my $parser = HTML::TokeParser::Simple->new($filename);
while (my $anchor = $parser->get_tag('a')) {
next unless defined(my $class = $anchor->get_attr('class'));
next unless $class =~ /\btimetable\b/ and $class =~ /\bwork\b/;
my $url = url $anchor->get_attr('href');
say substr($url->path, 1);
}
Parse the HTML using HTML::TokeParser::Simple. Loop through the <a> tags, skipping any that don't have the correct classes defined. For the ones that do, use URI::URL to parse the URL and extract the "path" component (which in your case would be "/pagename"). As you didn't want the leading slash, I used substr to remove the first character.
Output:
pagename
I know it's much longer than a single regex but it's also a lot more robust and will continue to work even when the format of your HTML changes slightly in the future. HTML parsers exist for a reason :)
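The same parser-based approach can be sketched with nothing but the Python standard library. The class name PageNameParser is my own; it assumes, as above, that the links of interest carry both the timetable and work classes and that the page name is the URL path.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageNameParser(HTMLParser):
    """Collect the URL path of every <a> tagged with both target classes."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "timetable" in classes and "work" in classes and attr.get("href"):
            # urlparse separates the path from the query string (?tag=...).
            self.pages.append(urlparse(attr["href"]).path.lstrip("/"))


if __name__ == "__main__":
    parser = PageNameParser()
    parser.feed('<a class="timetable work" '
                'href="http://www.test.com/pagename?tag=meta376">Test</a>')
    print(parser.pages)
```

As with the Perl version, other content on the same line or attributes in a different order make no difference to the result.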

what regex to extract all data except within <> in perl?

I have string
Message <Network=Data Center> All Verified
I need to extract the whole string except the part in angle brackets.
I tried
m/(?![^<]*\\>)/s
Not giving desired result.
Removing <..> regions
It's easier to remove the <..> parts from the string and then deal with the remaining string.
Try this oneliner:
cat file | perl -pne 's/<[^>]*?>//g;'
For your sample input, this is the output:
Message All Verified
Notice the non-greedy quantifier ? is used in the regex. Also, because this is a oneliner, the s/// search-and-replace construct is applied to $_ implicit variable (which is a line from standard input). So after search & replace has run in this oneliner, the $_ will be altered(there will be no <..> regions in it). Also the -p was used in order to print the variable $_ after running the block of code. You can read more about Perl commandline switches in perlrun.
This is one solution. Below there is another one:
Capturing regions outside of <..>
On the other hand, you can(if you want) match the parts outside of the <..> regions.
In order to do that let's build a regex. First, we want a < or > free region. The following regex matches just that
$p = ([^<>]*).
Next, we want to match everything before <, and for that we can write (?:$p<) and everything after >, and that's (?:>$p).
Now if we assemble all those parts together we get (?:>$p)|(?:$p<).
Notice that (?:) is a non-capturing group.
So now there are two capturing groups (the two $p you see above) but only one will match at a time, so some of the captures will be undef. We'll have to filter those out.
Finally, we can assemble all the captures, and we're done.
cat file | perl -ne '$p="([^<>]*)";@x=grep{defined} m{(?:>$p)|(?:$p<)}g; print join(" ",@x)."\n";'
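A related way to get the outside regions, sketched here in Python, is to split on the bracketed parts instead of matching around them. Note this is a swapped-in technique, not the alternation regex above; the function name outside_brackets is mine, and it assumes brackets are balanced and not nested.

```python
import re


def outside_brackets(text):
    """Return the non-empty pieces of text that lie outside <...> regions."""
    parts = re.split(r'<[^>]*>', text)
    # Trim surrounding whitespace and drop pieces that end up empty.
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    print(" ".join(outside_brackets("Message <Network=Data Center> All Verified")))
```

Splitting sidesteps the undefined captures that the two-branch alternation produces, so no filtering on defined values is needed.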
Parse::Yapp parser
You might think that using Parse::Yapp for this particular problem is a bit too much (usually, if you have something complicated to parse, you would use a grammar and a parser generator), but .. why not.. :)
Ok, so we need a grammar, here's one right here grammar_file.yp:
#header
%%
#rules
expression:
| exterior '<' interior '>' exterior
| exterior
;
exterior:
| TOK { $_[0]->YYData->{DATA} .= $_[1]; }
| expression
;
interior: TOK;
%%
#footer
sub Error { my ($parser)=shift; }
sub Lexer {
use Data::Dumper;
my($parser)=shift;
$parser->YYData->{INPUT} or return('',undef);
#$parser->YYData->{INPUT}=~s/^\s+//;
for ($parser->YYData->{INPUT}) {
return ('TOK',$1) if(s/^([^<>]+)//);
return ( $1,$1) if(s/^([<>])//);
};
}
You will notice in the grammar above that the interior is completely ignored, and only the terminals from exterior are collected.
Here's a small program that will use the parser(MyParser.pm generated from grammar_file.yp) parse.pl:
#!/usr/bin/env perl
use strict;
use warnings;
use MyParser;
my $parser=MyParser->new;
$parser->YYData->{INPUT} = "Message <Network=Data Center> All Verified";
my $value=$parser->YYParse(
yylex => \&MyParser::Lexer,
yyerror => \&MyParser::Error,
#yydebug => 0x1F,
);
my $nberr=$parser->YYNberr();
my $data=$parser->YYData->{DATA};
print "Result=$data"
And now a Makefile and we're done:
generate_parser_module:
yapp -m MyParser grammar_file.yp;
run:
perl parse.pl
all: generate_parser_module
Note
Some more Parser generators can be found here
Regexp::Grammars
Parse::RecDescent
Marpa::XS or Marpa::R2
You can do it other way: just remove the string in the angular brackets:
s#<.*>##
Or if > is not allowed:
s#<[^>]*>##
You can use sed for that:
cat yourfile |sed 's/<.*>//g' > newfile
If you need perl:
perl -i -pe "s/<.*?>//g" yourfile
Here is a compact approach. The following regex will capture your strings into Group 1:
<[^>]+>|([^<>]*)
What we are interested in here is not the overall match, but just the Group 1 matches.
So we need to iterate over Group 1 matches. I don't code in Perl, but following a recipe from the perlretut tutorial, this should do it:
while ($x =~ /<[^>]+>|([^<>]*)/g) {
print "$1","\n";
}
Please give it a try and let me know if it works for you.

How can I make this Perl one-liner to toggle character in line in a file?

I am attempting to write a one-line Perl script that will toggle a line in a configuration file from "commented" to not and back. I have the following so far:
perl -pi -e 's/^(#?)(\tDefaultServerLayout)/ ... /e' xorg.conf
I am trying to figure out what code to put in the replacement (...) section. I would like the replacement to insert a '#' if one was not matched on, and remove it if it was matched on.
pseudo code:
if ( $1 == '#' ) then
print $2
else
print "#$2"
My Perl is very rusty, and I don't know how to fit that into a s///e replacement.
My reason for this is to create a single script that will change (toggle) my display settings between two layouts. I would prefer to have this done in only one script.
I am open to suggestions for alternate methods, but I would like to keep this a one-liner that I can just include in a shell script that is doing other things I want to happen when I change layouts.
perl -pi -e 's/^(#?)(?=\tDefaultServerLayout)/ ! $1 && "#" /e' foo
Note the addition of ?= to simplify the replacement string by using a look-ahead assertion.
Some might prefer s/.../ $1 ? "" : "#" /e.
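If the surrounding shell script ever grows into something larger, the same toggle can be sketched in Python. The names TOGGLE_RE and toggle are made up; the pattern is the one from the answer above, with re.MULTILINE added so it can run over the whole file contents at once.

```python
import re

# An optional "#" at the start of a line, but only when the tab-indented
# DefaultServerLayout directive follows (look-ahead, so it is not consumed).
TOGGLE_RE = re.compile(r'^(#?)(?=\tDefaultServerLayout)', re.MULTILINE)


def toggle(text):
    """Comment the DefaultServerLayout line if it is active, uncomment it otherwise."""
    return TOGGLE_RE.sub(lambda m: "" if m.group(1) else "#", text)


if __name__ == "__main__":
    sample = 'Section "ServerFlags"\n\tDefaultServerLayout "dual"\nEndSection\n'
    print(toggle(sample))
```

Calling toggle twice returns the original text, which is the behaviour the one-liner gives when it is run twice on the same file.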