Using a REGEX to replace words within a sub-match - regex

I hope this isn't a repetition...
I need a regex to do what should be a fairly simple task. I have code for an HTML table, and I want to replace all <td> tags with <th> tags in the first row of the table, i.e. within the first set of <tr> </tr> tags. The table might look something like this:
<table cellpadding="5" cellspacing="0" border="1">
<tr>
<td>Capacity %</td>
<td>Tension V</td>
<td>Acid kg/l</td>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>...etc
and I want:
<table cellpadding="5" cellspacing="0" border="1">
<tr>
<th>Capacity %</th>
<th>Tension V</th>
<th>Acid kg/l</th>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>...etc
I've tried regexes similar to this:
/(<table>\n<tr>\n)(.+?)(</tr>)
...and then tried to rebuild the table row using back references, but I can't seem to apply the regex to the multiple
</?td>
matches that there might be.
I'm doing this in javascript, which means I can't use look-behinds (although if anyone has a look behind solution I'd be interested in seeing it anyway...).
Thanks in advance for any help.

You could do it if your regex engine supports indefinite repetition inside lookbehind assertions, for example in .NET (C#):
resultString = Regex.Replace(subjectString,
@"(?<= # Assert that we can match this before the current position:
<table # <table
(?: # followed by...
(?! # (unless there's an intervening
</table # </table
| # or
</tr # </tr)
) # (End of lookahead assertion)
. # any character
)* # any number of times
) # (End of lookbehind assertion)
<td # Then match <td",
"<th", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
This works on your example. But even in .NET, I wouldn't use a regex for it; it's just too brittle. Better to manipulate the DOM directly; that's what it's there for.

In JavaScript you can't easily do this with a single regex. You've got a special condition ("only in the first <tr>"), so you'll need to write some conditional logic along with the regex to make it work.
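One way to sketch that "conditional logic plus regex" idea in JavaScript is a nested replace: an outer non-global regex picks out only the first row, and a callback rewrites the td tags inside it. The helper name promoteFirstRow is made up for this illustration, and the sketch assumes reasonably well-formed markup with no nested tables in the first row.

```javascript
// Sketch: convert <td> to <th> only inside the first <tr>...</tr> block.
// `promoteFirstRow` is a hypothetical helper name, not part of any library.
function promoteFirstRow(html) {
  // No `g` flag on the outer regex, so only the first row is matched.
  return html.replace(/<tr\b[^>]*>[\s\S]*?<\/tr>/i, (row) =>
    // Inside that row, rewrite every opening and closing td tag.
    row.replace(/<(\/?)td\b/gi, '<$1th')
  );
}

const table = [
  '<table cellpadding="5" cellspacing="0" border="1">',
  '<tr>',
  '<td>Capacity %</td>',
  '<td>Tension V</td>',
  '<td>Acid kg/l</td>',
  '</tr>',
  '<tr>',
  '<td>100</td>',
  '<td>12.70</td>',
  '<td>1.265</td>',
  '</tr>',
  '</table>',
].join('\n');

console.log(promoteFirstRow(table));
```

The later rows are left alone because the outer pattern stops after its first match. This is still as brittle as any regex over HTML, so the DOM advice above applies if a DOM is available.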

Related

Regex - Find and replace an url inside href attribute

I have an xlsx/csv file whose contents I am trying to modify with Notepad++.
Specifically, I want to change a URL inside an href attribute. For example:
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf""
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7609_Datasheet--de.pdf""
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/6/7981_Datasheet--de.pdf""
etc...
After replace, I want them to look like this:
href=""/docs/7521_Datasheet--de.pdf""
href=""/docs/7609_Datasheet--de.pdf""
href=""/docs/7981_Datasheet--de.pdf""
Right now, I have this pattern on find:
(?<=href=(""|''))[^"']+(?=(.pdf""|.pdf''))
EDIT:
After trying the suggested patterns, no string matches. Here is the full cell text:
"<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""10""><tbody><tr>
<td align=""left"" valign=""top"">
<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""0""><tbody><tr>
<td>
<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""0""><tbody><tr>
<td align=""left"" valign=""top"" class=""DocRepCell1""><img src=""/catalog/pdf.gif"" alt="" "" border=""0""></td>
<td align=""left"" width=""97%"" valign=""middle"" class=""DocRepCell2""><span class=""NavigationButtonMoreInfos"">Produktinformation breite</span> </td>
<td align=""right"" width=""1%"" nowrap=""nowrap"" valign=""middle"" class=""DocRepCell3"">0,1 MB</td>
<td align=""right"" width=""1%"" nowrap=""nowrap"" valign=""middle"" class=""DocRepCell4"">
<a class=""NavigationButtonMoreInfos"" target=""_blank"" href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf"">herunterladen</a></td></tr>
</tbody></table></td></tr></tbody>
</table></td></tr>
</tbody></table></td></tr>
</tbody></table>"
You can try the following find and replace in regex mode:
Find:
^href=""/.*?(\d+_Datasheet.*\.pdf"")$
Replace:
href=""/docs/$1
Note that the find pattern could be made more generic if it doesn't work on more of your data. But in general we would need some concrete way of identifying the start of the suffix which you wish to retain in the match. If my answer doesn't work for you, then state where it fails and provide logic which allows the suffix to be identified.
Here's a way to just match the part you want to replace with the path /docs
Find what :
^href=["']+\K(/.*?)(?=/\d+_[\w-]+\.pdf["']+$)
Replace with :
/docs
Search mode : Regular Expression (best with ". matches new lines" checked off)
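Both patterns above are anchored with ^ and $, which is probably why they find nothing in the full cell text, where the href sits in the middle of a line. As a rough illustration of an unanchored version, here is a JavaScript sketch; rewriteHrefs is a hypothetical helper, and it assumes every file name looks like digits, an underscore, then word characters or hyphens, ending in .pdf.

```javascript
// Sketch: rewrite the directory part of each PDF href to /docs,
// wherever the href appears in the text (no ^ or $ anchors).
// `rewriteHrefs` is a hypothetical helper name for this example.
function rewriteHrefs(text) {
  // Group 1: href= plus one or two opening quotes (the CSV export doubles them).
  // [^"']*\/ : the old directory path, up to the last slash before the file name.
  // Group 2: the file name to keep, e.g. 7521_Datasheet--de.pdf (assumed shape).
  return text.replace(/(href=["']{1,2})\/[^"']*\/(\d+_[\w-]+\.pdf)/g, '$1/docs/$2');
}

const cell =
  '<a class=""NavigationButtonMoreInfos"" target=""_blank"" ' +
  'href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf"">herunterladen</a>';

console.log(rewriteHrefs(cell));
```

The same pattern and replacement should carry over to the Notepad++ find/replace boxes with little change, but that is untested here.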

gregexpr function in R returning different results whether Perl is TRUE or FALSE

I have the following piece of HTML I'm trying to run regex on with gregexpr function in R
<div class=g-unit>
<div class=nwp style=display:inline>
<input type=hidden name=cid value="22144">
<input autocomplete=off class=id-fromdate type=text size=10 name=startdate value="Sep 6, 2013"> -
<input autocomplete=off class=id-todate type=text size=10 name=enddate value="Sep 5, 2014">
<input id=hfs type=submit value=Update style="height:1.9em; margin:0 0 0 0.3em;">
</div>
</div>
</div>
<div id=prices class="gf-table-wrapper sfe-break-bottom-16">
<table class="gf-table historical_price">
<tr class=bb>
<th class="bb lm lft">Date
<th class="rgt bb">Open
<th class="rgt bb">High
<th class="rgt bb">Low
<th class="rgt bb">Close
<th class="rgt bb rm">Volume
<tr>
...
...
</table>
</div>
I am trying to extract the table part from this HTML using the following regex:
<table\\s+class="gf-table historical_price">.+<
When I run the gregexpr function with perl=FALSE it works fine and I get a result.
However, if I run it with perl=TRUE I get nothing back; the pattern doesn't seem to match.
Does anyone know why the results differ just from switching perl on and off?
Many thanks in advance!
It seems that in the default (extended) regex mode the dot is able to match newline characters, which is not the case in Perl mode. To make it work in Perl mode, you need to use the (?s) modifier so that the dot can match newline characters too:
> m <- gregexpr('(?s)<table\\s+class="gf-table historical_price">.+</table>', str, perl = TRUE)
In many regex flavors, the dot doesn't match newlines by default, probably to make line-by-line processing more convenient.
The s in the inline modifier (?s) stands for "singleline". In other words, this means that the whole string is seen as a single line (for the dot) even if there are newline characters.
You need to use the inline (?s) modifier forcing the dot to match all characters, including line breaks.
The perl=T argument switches to the PCRE library for regex pattern matching.
gregexpr('(?s)<table\\s+class="gf-table historical_price">.+</table>', x, perl=T)
However as stated in the comments, a parser is recommended to do this. I would start out using the XML library.
cat(paste(xpathSApply(htmlParse(html), '//table[@class="gf-table historical_price"]', xmlValue), collapse = "\n"))
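The same dot-versus-newline behaviour can be seen outside R. As an illustration only, here is a small JavaScript sketch, where the s (dotAll) flag plays the role of the (?s) modifier; the sample string is a shortened, made-up version of the table from the question.

```javascript
// Shortened sample of the table markup, with real line breaks in it.
const html =
  '<table class="gf-table historical_price">\n<tr class=bb>\n<th class="bb lm lft">Date\n</table>';

// Without the s flag, "." cannot cross a line break, so the pattern fails.
const plain = /<table\s+class="gf-table historical_price">.+<\/table>/.exec(html);

// With the s (dotAll) flag, "." also matches line breaks.
const dotAll = /<table\s+class="gf-table historical_price">.+<\/table>/s.exec(html);

console.log(plain === null);  // true: no match without dotAll
console.log(dotAll !== null); // true: the whole table matches with dotAll
```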

Select URL in HTML table with regular expression

I have a table with names and URLs like this:
<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>
I want to select all URL-tabledata in a table.
I tried:
<td>w{3,3}.*(</td>){1,1}
But this expression doesn't "stop" at the first </td>. I get:
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td>
as result. Where is my mistake?
There are several ways to match a URL. I'll take the simplest approach for your needs: just correcting your regex. You can use this one instead:
<td>w{3}.*?</td>
Explanation:
<td> # this part is ok
w{3,3} # the notation {3} is simpler for this case and has the same effect
.* # the main problem: you have to use .*? to make .* non-greedy,
# that is, to make it match as little as possible
(</td>){1,1} # same as second line. As the number is 1, {1} is not needed
Your regex can be
\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]
or
"((((ht{2}ps?://)?)((w{3}\\.)?))?)[^.&&[a-zA-Z0-9]][a-zA-Z0-9.-]+[^.&&[a-zA-Z0-9]](\\.[a-zA-Z]{2,3})"
See this link - What is the best regular expression to check if a string is a valid URL?. Many answers are available.
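To make the greedy versus non-greedy point from the first answer concrete, here is a small JavaScript sketch. It uses [\s\S] instead of the dot so the match can cross line breaks, as it apparently did in the asker's tool; that substitution is an assumption about the original setup.

```javascript
const rows = [
  '<tr>',
  '<td>name1</td>',
  '<td>www.url.com</td> </tr>',
  '<tr>',
  '<td>name2</td>',
  '<td>www.url2.com</td> </tr>',
].join('\n');

// Greedy: [\s\S]* runs to the end and backs up to the LAST </td>.
const greedy = rows.match(/<td>w{3}[\s\S]*<\/td>/g);

// Lazy: [\s\S]*? stops at the FIRST </td> after each URL.
const lazy = rows.match(/<td>w{3}[\s\S]*?<\/td>/g);

console.log(greedy.length); // 1 (one long match spanning both rows)
console.log(lazy);          // [ '<td>www.url.com</td>', '<td>www.url2.com</td>' ]
```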

How to match a particular tag value and then get the result from the previous tag after matching?

Input file:
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
I would like to match the tag <TD><PRE> sample</PRE></TD> and, if it matches, get the result from the previous tag, which is <TD>This is a TD cell</TD>
Output:
This is a TD cell
I tried with the below code:
my $Output = m/<TD.*?\/TD>/;
I am able to match the tag, but I am unable to get the result from the previous tag based on that match. Can anyone help me out with this?
Thanks in advance.
Since you will need to go backwards, I think that probably building a full tree might be needed. Normally I recommend a DOM-style HTML parser (see Mojo::DOM) but for building a tree, try something like HTML::Tree.
EDIT:
So I decided to see if I could do this with Mojo::DOM, and it worked rather nicely:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.10.0;
use Mojo::DOM;
my $dom = Mojo::DOM->new->xml(1)->parse(<<'HTML');
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
HTML
my $collection = $dom->find('TR TD');
my $i = -1; # so that first increment makes 0
$collection->first(sub{$i++; /sample/;});
say $collection->[$i-1];
You have to force XML parsing since HTML5 doesn't use upper case tags, but the rest should be self explanatory.
Edit Nov 1, 2012
Mojolicious 3.54 was just released and it gave Mojo::DOM the new next and previous methods, which help here. (I used this post as a case example for their use). That means, now you can do:
say $dom->find('TR TD')->first(qr/sample/)->previous;
rather than the last 4 lines of the example above.
This isn't really a good problem for regex. The best you can do with a single expression is to match both cells and capture the contents of the first cell in a group. e.g.
<TD>(.*?)</TD>\s*<TD><PRE> sample</PRE></TD>
I guess you'd need to replace whatever <PRE> sample</PRE> would be with another expression, but you haven't provided enough information about that here.
Using a html parser which can actually traverse the document tree would be a better option if you need to do this more generically.
You can use lookbehind and lookahead to assert that a text is preceded or followed by another - the lookarounds are zero-width assertions which means that they don't capture anything:
(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)
which means:
(?<=TD>) - look behind from the current position and assert that the preceding text is TD> (the end of an opening <TD> tag)
[^>]+ - match one or more characters that are not >, so the match cannot run past the end of a tag
(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>) - and look ahead from the current position and assert that the following text is </TD>\s*<TD><PRE>\s*sample</PRE></TD> (closing tag, optional whitespace characters and your condition)
The result of this match is the text matched by the second part, [^>]+.
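For illustration, the same lookaround pattern can be tried in JavaScript, assuming an engine that supports lookbehind (ES2018 or later, e.g. current Node.js). The sample string is the first table row from the question, joined with line breaks.

```javascript
const doc = [
  '<TR>',
  '<TD>This is a TD cell</TD>',
  '<TD><PRE> sample</PRE></TD>',
  '<TH>This is a TH cell</TH>',
  '</TR>',
].join('\n');

// Lookbehind: must be preceded by TD>.
// Lookahead: must be followed by the closing tag and the "sample" cell.
const previousCell = /(?<=TD>)[^>]+(?=<\/TD>\s*<TD><PRE>\s*sample<\/PRE><\/TD>)/;

const found = doc.match(previousCell);
console.log(found ? found[0] : null); // This is a TD cell
```

Because both lookarounds are zero-width, found[0] holds only the cell text, with no tags around it.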
Although we're often cautioned against writing our own HTML regexes instead of using mature HTML parsers, sometimes the former may do the job. See if this option helps (you may want to match a little more of the <PRE> tag):
use Modern::Perl;
my $html = <<'html';
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
html
say $html =~ m|<TD>(.*?)</TD>.*<TD><PRE>|is;
Output:
This is a TD cell

Regex: Match a <tr> that contains a string

I am trying to match all <tr> elements that contain the word "Source", where the other attributes (colspan/width/height, contained <td>s and their attributes, etc.) are unknown. (I know this can be done with a javascript/jQuery selector, but I am just processing the HTML for a non-javascript context.)
Example of target:
<tr>
<td>Don't affect this</td>
</tr>
<tr>
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
(This is what I want to change it to:)
<tr>
<td>Don't affect this</td>
</tr>
<tr class="source">
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
Here are regex patterns I have tried that haven't worked:
/<tr>((?:.*?)Source(?:s?):(?:.*?))<\/tr>/gmi,
No matches.
/<tr>((?:[\s\S]*?)Source(?:s?):(?:[\s\S]*?))<\/tr>/gmi,
Matches the first tr, but not the second.
I think there's a regex principle I may be failing to grasp here, about greediness or something related. Any suggestions?
/<tr[^>]*>(?:(?!<|source)[\s\S])*(?:<(?!\/?tr)[^>]*>(?:(?!<|source)[\s\S])*)*source[\s\S]*?<\/tr>/i
Are you sure you can't use jQuery for this? :P But seriously, this will be easier to grasp if I put it in terms of Friedl's "unrolled loop" idiom:
opening normal ( special normal * ) * closing
opening: <tr[^>]*> - the opening <tr> tag
normal: (?:(?!<|source)[\s\S])* - zero or more of any characters, with the lookahead to make sure each time that the character is not the beginning of a tag or the word "source"
special: <(?!\/?tr)[^>]*> - any tag except another opening <tr> or a closing </tr>. By consuming a complete tag, we avoid false positives on the word "source" in the name or value of an attribute.
closing: source - The only other thing it could possibly encounter here is a <tr> or </tr> tag, which would indicate a failed match for our purposes. Finding "source" before one of those tags is how we know we've found a match. (The rest of the regex, [\s\S]*?<\/tr>, merely consumes the remainder of the tag so you can retrieve it via group[0].)
A <tr> there isn't necessarily invalid, of course; it could be the beginning of a nested TR element, presumably within a nested TABLE element. If that TR contains the word "source", the regex will match it on a separate match attempt. It will match only the innermost, complete TR tag with the word "source" in it.
As usual when using regexes on HTML, I'm making several simplifying assumptions involving well-formedness, SGML comments, CDATA sections, etc., etc. Caveat emptor.
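As a rough usage sketch, the unrolled-loop regex above can be combined with a replace callback to add the class the question asks for. The g flag is added here so every matching row is handled, and tagSourceRows is a hypothetical helper name; it simply prepends class="source" and does not check whether the row already has a class attribute.

```javascript
// The regex from the answer above, with the g flag added for replace-all.
const sourceRow =
  /<tr[^>]*>(?:(?!<|source)[\s\S])*(?:<(?!\/?tr)[^>]*>(?:(?!<|source)[\s\S])*)*source[\s\S]*?<\/tr>/gi;

// Hypothetical helper: add class="source" to the opening tag of each matched row.
function tagSourceRows(html) {
  return html.replace(sourceRow, (row) => row.replace(/^<tr/i, '<tr class="source"'));
}

const sample = [
  '<tr>',
  "<td>Don't affect this</td>",
  '</tr>',
  '<tr>',
  '<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>',
  '</tr>',
].join('\n');

console.log(tagSourceRows(sample));
```

On this sample only the second row gains the class, since the first row reaches its </tr> before the word "source" shows up.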
If you are using a library like jQuery you do not even need to use a regex:
$('tr:contains("Source")').something...