Working RegEx that fails in Perl find & replace one-liner

I have the following RegEx (<th>Password<\/th>\s*<td>)\w*(<\/td>) which matches <th>Password</th><td>root</td> in this HTML:
<tr>
<th>Password</th>
<td>root</td>
</tr>
However, this Terminal command fails to find a match:
perl -pi -w -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
It appears to have something to do with the whitespace between the </th> and <td>, but <\/th>\s*<td> handles that in the RegEx, so why not in Perl?
I have tried substituting \n*, \r*, \t*, and various combinations thereof for the \s*, but still no match.
A working example can be seen here.
Any help would be gratefully appreciated.

The substitution is only applied to one line of your file at a time.
You can read the entire file in at once using the -0777 option (slurp mode), like this:
perl -w -0777 -pi -e 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' file.html
Note that it is far preferable to use a proper HTML parser, such as HTML::TreeBuilder::XPath, to process data like this, as it is very difficult to account for all possible representations of a given HTML construct using regular expressions.
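For what it's worth, here is a minimal sketch of that parser-based route, assuming HTML::TreeBuilder::XPath is installed and that the table layout matches the snippet in the question:
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Parse the page, find the <td> that follows the <th>Password</th> cell,
# replace its contents, and print the modified document to stdout.
my $tree = HTML::TreeBuilder::XPath->new_from_file('file.html');
for my $td ($tree->findnodes('//th[normalize-space()="Password"]/following-sibling::td[1]')) {
    $td->delete_content;
    $td->push_content('NEWPASSWORD');
}
print $tree->as_HTML('<>&', '  ');
$tree->delete;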

Perl processes the file one line at a time. In your example you're trying to match across two lines, so Perl never finds the end of the pattern on the first line and never finds the beginning of it on the second line.
You can either flatten file.html to a single line temporarily (which might work if the file is small and performance is not important), or you'll need to write more sophisticated logic that keeps track of what has been found across lines.
Try searching for 'multiline regex perl' :)
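For example, a rough sketch of the flatten-then-substitute route (note that this permanently collapses the file onto a single line, and file.flat is just a temporary name used here):
tr -d '\n' < file.html | perl -pe 's/(<th>Password<\/th>\s*<td>)\w*(<\/td>)/$1NEWPASSWORD$2/g' > file.flat && mv file.flat file.html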

You could use sed to do this:
sed -i '/<th>Password<\/th>/{n;s!<td>[^<]*!<td>NEWPASSWORD!}' file.html
Another sed version:
sed -i '/<th>Password<\/th>/!b;n;s/<td>[^<]*/<td>NEWPASSWORD/' file.html

Related

How to use regex for multiple line pattern in shell script

I want to write a bash script that finds a pattern in an HTML file which spans multiple lines.
File for regex:
<td class="content">
some content
</td>
<td class="time">
13.05.2013 17:51
</td>
<td class="author">
A Name
</td>
Now I want to find the content of the <td> tag with class="time".
So in principle the following regex:
<td class="time">(\d{2}\.\d{2}\.\d{4}\s+\d{2}:\d{2})</td>
grep seems not to be the command I can use, because...
It only returns the complete line (or, with -o, the complete match), not just the part inside the round brackets (...).
It only looks for the pattern within a single line.
So how is it possible to get only the string 13.05.2013 17:51?
It's not quite there (it prints a leading blank line, because the emptied opening-tag line still gets printed), but maybe something like this?
$ sed -n '/<td class="time">/,/<\/td>/{s/^<td class="time">$//;/^<\/td>$/d;p}' file
13.05.2013 17:51
Inspired by https://stackoverflow.com/a/13023643/1076493
Edit: Well, there's always perl!
For more info see https://stackoverflow.com/a/1213996/1076493
$ perl -0777 -ne 'print "$1\n" while /<td class="time">\n (.*?)\n<\/td>/gs' regex.txt
13.05.2013 17:51
How fixed is your format? If you're sure it's going to look like that then you can use sed to match the first line, get the next line and print it, like this:
$ sed -n '/<td *class="time">/{n;p}' test
13.05.2013 17:51
You could add something to cover the case where it's on the same line as well. Alternatively, pre-process the file to strip all the newlines, and maybe collapse the whitespace too (apparently that part can't be done with sed itself, since it works line by line), and then go from there.
However, if it's an HTML file from somewhere else and you can't be sure of the format, I'd consider using some other scripting language or tool that has a proper XML/HTML parser; otherwise any solution is liable to break when the format changes.
Edited to add a link to my favorite sed resource for this sort of thing: http://www-rohan.sdsu.edu/doc/sed.html
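As a concrete parser-based alternative (a sketch assuming libxml2's xmllint is available; its HTML parser may complain about messy markup, hence the stderr redirect):
xmllint --html --xpath 'normalize-space(//td[@class="time"])' file 2>/dev/null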
Try:
awk '/^td class="time">/{gsub(ORS,x); print $2}' RS=\< FS=\> file
or
awk '/^td class="time">/{print $2}' ORS= RS=\< FS='>[[:space:]]*' file

Regular expression to extract text from XML-ish data using GNU sed

I have a file full of lines extracted from an XML file using "gsed regexp -i FILENAME". The lines in the file are all of one of either format:
<field number='1' name='Account' type='STRING'W/>
<field number='2' name='AdvId' type='STRING'W>
I've inserted a 'W' at the end, which represents optional whitespace. The order and number of properties are not necessarily the same in all lines throughout the file, although "number" always comes before "type".
What I'm searching for is a regular expression "regexp" that I can give to gnu sed so that this command:
gsed regexp -i FILENAME
gives me a file with lines looking like this:
1 STRING
2 STRING
I don't care about the amount of whitespace in the result as long as there is some after the number and a newline at the end of each line.
I'm sure it is possible, but I just can't figure out how in a reasonable amount of time. Can anyone help?
Thanks a lot,
jules
Using xsh, a Perl wrapper around XML::LibXML:
open file.xml ;
for //field echo @number @type ;
I'm sure this can be optimized, but it works for me and answers your question:
sed "s/^.*number='\([0-9]*\)'.*type='\(.*\)'.*$/\1 \2/" <filename>
Saying that, I think the others are right, if you have an XML-file you should use an XML-parser.
I think you're much better off using a command line XML tool such as XMLStarlet. That will integrate well with the shell and let you perform XPath searches. It's XML-aware so it'll handle character encodings, whitespace correctly etc.
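For example, assuming the <field> elements sit in a well-formed XML document, an XMLStarlet template along these lines should print the two desired columns (an untested sketch):
xmlstarlet sel -t -m '//field' -v '@number' -o ' ' -v '@type' -n FILENAME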
Simple cut should work for you:
cut -f2,6 -d"'" --output-delimiter=" "
If you really want sed:
sed -r "s/.'(.)'.type='(.)'.*/\1 \2/"
You can use this:
sed -r "s/<field [^>]*?number='([0-9]+)'[^>]*?type='([^']+)'[^>]*>/\1 \2/"
You would be better off using an XML parser, but if you had to use sed:
sed "s/<field number='\([^']*\)'.*type='\([^']*\)'.*/\1 \2/"
sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']\\+\\).*[[:space:]]type='\\([^']\\+\\).*#\1 \2#p" FILENAME
Or if you don't mind contents of number and type to be optional:
sed -ni "/<field .*>/s#^.*[[:space:]]number='\\([^']*\\).*[[:space:]]type='\\([^']*\\).*#\1 \2#p" FILENAME
Just change from [^']\\+ to [^']* at your preference.

Regex in perl/sed replacement not matching whitespace/characters

Given this file, I'm trying to do a super primitive sed or perl replacement of a footer.
Typically I use DOM to parse HTML files, but so far I've had no issues using sed/perl, because the HTML files I'm dealing with are so primitive (and time matters).
All I need is to replace the <div id="footer">, the whitespace and nested elements it contains, and its closing </div> with <?php include 'footer.php';?>.
For some reason I can't even get the pattern to match up to the <div id="stupid">. I know there are whitespace characters, so I used \s*:
perl -pe 's|<div id="footer">.*\s*.*\s*|<?php include INC_PATH . 'includes/footer.php'; ?>|' file.html | less
But that only matches the first line. The replacement looks like this:
<?php include INC_PATH . includes/footer.php; ?>
<div id="stupid"><img src="file.gif" width="206" height="252"></div>
</div>
Am I forgetting something simple, or should I specify some sort of flag to deal with a multiline match?
perl -v reports 5.14.2 and I'm only using the -pe flags.
You probably want -0777, which will force perl to read the entire file at once.
perl -0777 -n -e 's|something|else|g' file
Also, your strategy of doing .*\s*.*\s* is pretty fragile. It'll match e.g. <div id="foo", which is just a fragment...
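Putting those pieces together, one hedged guess at a complete command for the footer shown above (it assumes the footer div contains exactly one nested </div> before its own closing tag, and it avoids the shell-quoting problem by using double quotes inside the replacement):
perl -0777 -pi -e 's|<div id="footer">.*?</div>\s*</div>|<?php include INC_PATH . "includes/footer.php"; ?>|s' file.html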
Are you forgetting that almost all of these command-line tools feed the regex one line at a time?
I've always had to use tr to convert the newlines into some other character, and then back again after the regex.
Just found this: http://www.perlmonks.org/?node_id=17947
Once the whole file is in a single string, you also need to tell the regex engine to let . match newlines with the /s modifier (/m only changes how ^ and $ behave); otherwise the .* parts still won't cross line boundaries.
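A sketch of that tr round trip, applied to this footer (it assumes the file contains no carriage returns of its own, since \r is borrowed as a temporary stand-in for newline, and footer_replaced.html is just an illustrative output name):
tr '\n' '\r' < file.html | perl -pe 's|<div id="footer">.*?</div>\s*</div>|<?php include INC_PATH . "includes/footer.php"; ?>|' | tr '\r' '\n' > footer_replaced.html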
perl -p
works on the file on a line-by-line basis (see perl.com);
that means your regex will never see all the lines it needs at once: it only matches when it gets the line that starts with "<div id="footer">", and on the following lines it will not match any more.

Is there a truly universal wildcard in Grep? [duplicate]

Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is capture all the text on a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer. Does this make sense? Thanks in advance!
When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note I'm using a non-greedy pattern
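For example, in a Perl one-liner that slurps the whole page (a sketch; the HEADER TEXT / FOOTER TEXT strings and file name are placeholders):
perl -0777 -ne 'print $1 if /HEADER TEXT([\s\S]*?)FOOTER TEXT/' file.html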
You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
$ perl -000 -lne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible.
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
To do what you want, I'd probably use Perl (but you could use Python if you prefer). I'd consider slurping the whole file and then using a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by @gbacon is an almost exact transliteration of the sed script above into Perl, and is neater than slurping.
The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.
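For instance, awk's range pattern gives a quick line-oriented equivalent of the sed range shown earlier (a sketch; it prints everything from the header line through the footer line, inclusive):
awk '/HEADER TEXT/,/FOOTER TEXT/' file.html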
As this is tagged with 'bbedit', and BBEdit supports Perl-style pattern modifiers, you can allow the dot to match line breaks with the switch (?s):
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.
As pointed out elsewhere, grep will work for single-line stuff.
For multi-line matches (in Ruby with Regexp::MULTILINE, or in Python, awk, sed, whatever), "\s" should also capture line breaks, so
HEADER TEXT(.*\s*)FOOTER TEXT
might work ...
Here's one way to do it with gawk, if you have it:
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file

Awk/etc.: Extract Matches from File

I have an HTML file and would like to extract the text between <li> and </li> tags. There are of course a million ways to do this, but I figured it would be useful to get more into the habit of doing this in simple shell commands:
awk '/<li[^>]+><a[^>]+>([^>]+)<\/a>/m' cities.html
The problem is, this prints everything, whereas I simply want to print the match in parentheses -- ([^>]+) -- either awk doesn't support this, or I'm incompetent. The latter seems more likely. If you wanted to apply the supplied regex to a file and extract only the specified matches, how would you do it? I already know a half dozen other ways, but I don't feel like letting awk win this round ;)
Edit: The data is not well-structured, so using positional matches ($1, $2, etc.) is a no-go.
If you want to do this in the general case, where your list tags can contain any legal HTML markup, then awk is the wrong tool. The right tool for the job would be an HTML parser, which you can trust to get correct all of the little details of HTML parsing, including variants of HTML and malformed HTML.
If you are doing this for a special case, where you can control the HTML formatting, then you may be able to make awk work for you. For example, suppose you can guarantee that each list element never occupies more than one line, is always terminated with </li> on the same line, and never contains any nested markup (such as a list inside a list). In that case you can use awk, but you will need to write a small awk program that first finds the lines containing list elements and then uses other awk functions to pull out just the substring you are interested in.
But in general, awk is the wrong tool for this job.
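That said, under the narrow assumptions above (one complete <li>...</li> per line, no nested markup), a small awk program along these lines would pull out just the anchor text; this is only a sketch, using POSIX awk's match(), RSTART and RLENGTH:
awk 'match($0, /<li[^>]*><a[^>]*>/) { s = substr($0, RSTART + RLENGTH); sub(/<\/a>.*/, "", s); print s }' cities.html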
gawk -F'<li>' -v RS='</li>' 'RT{print $NF}' file
Worked pretty well for me.
Building on your script, you can get what you want if the <li> and <a> tags are on one line:
$ cat test.html | awk 'sub(/<li[^>]*><a[^>]*>/,"")&&sub(/<\/a>.*/,"")'
or
$ cat test.html | gawk '/<li[^>]*><a[^>]*>(.*?)<\/a>.*/&&$0=gensub(/<li[^>]*><a[^>]*>(.*?)<\/a>.*/,"\\1", 1)'
The first one works in every awk; the second one is for GNU awk.
There are several issues that I see:
The pattern has a trailing 'm' which is significant for multi-line matches in Perl, but Awk does not use Perl-compatible regular expressions. (At least, standard (non-GNU) awk does not.)
Ignoring that, the pattern seems to search for a 'start list item' followed by an anchor '<a>' to '</a>', not the end list item.
You search for anything that is not a '>' as the body of the anchor; that's not automatically wrong, but it might be more usual to search for anything that is not '<', or anything that is neither.
Awk does not do multi-line searches.
In Awk, '$1' denotes the first field, where the fields are separated by the field separator characters, which default to white space.
Classic nawk (as documented in the 'sed & awk' book, vintage 1991) does not have a mechanism for pulling sub-fields out of matches, etc.
It is not clear that Awk is the right tool for this job. Indeed, it is not entirely clear that regular expressions are the right tool for this job.
Don't really know awk, how about Perl instead?
tr -d '\012' < the.html | perl \
-e '$text = <>;' -e 'while ( $text =~ /<li>(.*?)<\/li>(.*)/ )' \
-e '{ $target = $1; $text = $2; print "$target\n" }'
1) remove newlines from the file and pipe it through perl
2) initialize a variable with the complete text, then loop for as long as another list item can be matched
3) do a "non greedy" match for stuff bounded by list-item tags, print the captured target, and keep the rest of the text for the next pass
Make sense? (warning, did not try this code myself, need to go home soon...)
P.S. - "perl -n" is Awk (nawk?) mode. Perl is largely a superset of Awk, so I never bothered to learn Awk.