Select URL in HTML table with regular expression - regex

I have a table with names and URLs like this:
<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>
I want to select all URL-tabledata in a table.
I tried:
<td>w{3,3}.*(</td>){1,1}
But this expression doesn't "stop" at the first </td>. I get:
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td>
as result. Where is my mistake?

There are several ways to match a URL. I'll stick to the simplest fix for your needs: correcting your regex. You can use this one instead:
<td>w{3}.*?</td>
Explanation:
<td> # this part is OK
w{3,3} # the notation {3} is simpler here and has the same effect
.* # the main problem: .* is greedy, so it runs on to the last </td>; use .*? to make it non-greedy, that is, to make it match as little as possible
(</td>){1,1} # same as the second line: since the count is 1, the quantifier (and the group) is not needed
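To see the difference in practice, here is a minimal Python sketch. Python is used only for illustration, since the question does not say which tool is in use. The sketch assumes dot-matches-newline behaviour (re.DOTALL), which is consistent with the multi-line result shown in the question.

```python
import re

html = """<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>"""

# Greedy: .* runs to the end of the input and backtracks to the LAST </td>.
greedy = re.findall(r"<td>w{3}.*</td>", html, flags=re.DOTALL)

# Non-greedy: .*? stops at the FIRST </td> after each <td>www.
lazy = re.findall(r"<td>w{3}.*?</td>", html, flags=re.DOTALL)

print(len(greedy))  # one long match spanning both rows
print(lazy)         # one match per URL cell
```

The greedy version returns a single match that swallows both rows, while the non-greedy version returns each URL cell separately.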

Your regex can be
\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]
or
"((((ht{2}ps?://)?)((w{3}\\.)?))?)[^.&&[a-zA-Z0-9]][a-zA-Z0-9.-]+[^.&&[a-zA-Z0-9]](\\.[a-zA-Z]{2,3})"
See also the question "What is the best regular expression to check if a string is a valid URL?". Many answers are available there.
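A minimal Python sketch of how a scheme-based URL pattern like the first one behaves. It assumes the character classes contain @ and #, and it makes the scheme group non-capturing so that re.findall returns whole matches. The sample text is hypothetical. Note that this pattern requires a scheme such as http://, so a bare www.url.com as in the question's table is not matched by it.

```python
import re

# Scheme-based URL pattern; the final character class excludes trailing
# punctuation such as "," and "." so sentence punctuation is not captured.
url_re = re.compile(
    r"\b(?:https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]"
)

# Hypothetical sample text, not taken from the question.
text = ("See http://www.url.com/page?x=1, or ftp://files.example.org/a.txt. "
        "Plain www.url.com has no scheme.")

urls = url_re.findall(text)
print(urls)
```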

Related

Regex - Find and replace an url inside href attribute

I have an xlsx/csv file whose contents I am trying to modify with Notepad++.
Specifically, I want to change a URL inside an href attribute. Example:
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf""
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7609_Datasheet--de.pdf""
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/6/7981_Datasheet--de.pdf""
etc...
After replace, I want them to look like this:
href=""/docs/7521_Datasheet--de.pdf""
href=""/docs/7609_Datasheet--de.pdf""
href=""/docs/7981_Datasheet--de.pdf""
Right now, I have this pattern on find:
(?<=href=(""|''))[^"']+(?=(.pdf""|.pdf''))
EDIT:
After trying the given examples, no string matches. Here is the full cell text:
"<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""10""><tbody><tr>
<td align=""left"" valign=""top"">
<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""0""><tbody><tr>
<td>
<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""0""><tbody><tr>
<td align=""left"" valign=""top"" class=""DocRepCell1""><img src=""/catalog/pdf.gif"" alt="" "" border=""0""></td>
<td align=""left"" width=""97%"" valign=""middle"" class=""DocRepCell2""><span class=""NavigationButtonMoreInfos"">Produktinformation breite</span> </td>
<td align=""right"" width=""1%"" nowrap=""nowrap"" valign=""middle"" class=""DocRepCell3"">0,1 MB</td>
<td align=""right"" width=""1%"" nowrap=""nowrap"" valign=""middle"" class=""DocRepCell4"">
<a class=""NavigationButtonMoreInfos"" target=""_blank"" href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf"">herunterladen</a></td></tr>
</tbody></table></td></tr></tbody>
</table></td></tr>
</tbody></table></td></tr>
</tbody></table>"
You can try the following find and replace in regex mode:
Find:
^href=""/.*?(\d+_Datasheet.*\.pdf"")$
Replace:
href=""/docs/$1
Note that the find pattern could be made more generic if it doesn't work on more of your data. But in general we would need some concrete way of identifying the start of the suffix which you wish to retain in the match. If my answer doesn't work for you, then state where it fails and provide logic which allows the suffix to be identified.
Here's a way to just match the part you want to replace with the path /docs
Find what :
^href=["']+\K(/.*?)(?=/\d+_[\w-]+\.pdf["']+$)
Replace with :
/docs
Search mode : Regular Expression (best with ". matches new lines" checked off)
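For comparison, here is a rough Python sketch of the same replacement applied to text where the href sits in the middle of a line, as in the full cell text from the question. It is not a Notepad++ pattern: it drops the ^ and $ anchors and assumes the file name always has the form digits followed by _Datasheet and ends in .pdf. The doubled quotes are the CSV escaping shown in the question.

```python
import re

# Fragment of the cell text from the question (CSV-escaped quotes).
cell = ('<a class=""NavigationButtonMoreInfos"" target=""_blank"" '
        'href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf"">'
        'herunterladen</a>')

# Group 1: the href="" prefix. Group 2: the file name plus closing quotes.
# Everything between them (the old directory path) is discarded.
pattern = re.compile(r'(href="")/[^"]*/(\d+_Datasheet[^"]*\.pdf"")')

result = pattern.sub(r'\1/docs/\2', cell)
print(result)
```

Because the path part is matched with [^"]*, the match cannot run past the closing quotes into a later attribute.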

how to separate two regexp (for taking text from brackets in commented area)?

I have some html page, it looks like:
<span>Some text</span>
<p>And again</p>
<table>
<thead>
<tr>
<th>Text</th>
<th>Text [some text]</th>
<th>Text</th>
</tr>
</thead>
<tbody>
<!--[content-->
<tr>
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
</tr>
<!--content]-->
</tbody>
</table>
<span>here is [the text]</span>
I need to take the text from square brackets, but only inside the commented region. I have two regular expressions that work fine, but only separately:
/[^[\]]+(?=])/g - this is for text in brackets;
(?=<!--\[content)([\s\S]*?content]-->) - for commented fields.
But I can't combine them. I tried (?=<!--\[content)([^[\]]+(?=]))([\s\S]*?content]-->) but it doesn't work. I don't know much about regular expressions; how can I combine them?
UPD: for output I need only the text in brackets between the comment markers (this, oops, world).
First, I might start with a simple one:
(?<=\[)[^\]\[]*(?=\])(?=[\s\S]*?<!--content\]-->)
Explanation
(?<=\[)[^\]\[]*(?=\]) match text inside any square brackets,
(?=[\s\S]*?<!--content\]-->) looks ahead for any string that is followed by a closing content tag.
That sounds sensible, right? But check DEMO1: it doesn't work. So the question is, why?
The problem lies in the lookahead assertion. In the explanation above I described it as:
(?=[\s\S]*?<!--content\]-->) looks ahead for any string that is followed by a closing content tag.
This is WRONG. It should be:
(?=[\s\S]*?<!--content\]-->) looks ahead across any string, which may itself contain opening or closing content tags, up to a closing content tag.
So the issue is that [\s\S]*? can run across more than one content tag, which lets brackets before the opening tag match as well.
Workaround
To prevent this, we can pair a negative lookahead for the opening content tag with every character consumed by [\s\S]*?. Thus, we get:
(?<=\[)[^\]\[]*(?=\])(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)
Notice that
[\s\S]*?
is just changed to
(?:(?!<!--\[content-->)[\s\S])*?
which means (?!<!--\[content-->) is checked in front of every character consumed by [\s\S]*?. For example, if [\s\S]*? consumes ABCDEF..., the negative lookahead is applied like this:
(?!<!--\[content-->)A(?!<!--\[content-->)B(?!<!--\[content-->)C(?!<!--\[content-->)D(?!<!--\[content-->)E(?!<!--\[content-->)F...
Finally, please check DEMO2. See? It just works!
DISCLAIMER: My regex works only for simple examples like the one in the question. For more complex input, such as recursive structures, I cannot guarantee it.
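As a sanity check, here is a minimal Python sketch that runs the final pattern on the sample page. It also shows a two-step alternative that is not part of the answer above: first cut out the commented region, then collect the brackets inside it. The variable names are illustrative.

```python
import re

html = """<span>Some text</span>
<table>
<thead>
<tr>
<th>Text [some text]</th>
</tr>
</thead>
<tbody>
<!--[content-->
<tr>
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
</tr>
<!--content]-->
</tbody>
</table>
<span>here is [the text]</span>"""

# Single pattern: bracket contents whose lookahead reaches the closing
# marker without first crossing an opening marker.
single = re.findall(
    r"(?<=\[)[^\]\[]*(?=\])"
    r"(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)",
    html,
)

# Two-step alternative: extract each commented region, then its brackets.
regions = re.findall(r"<!--\[content-->([\s\S]*?)<!--content\]-->", html)
two_step = [text for region in regions
            for text in re.findall(r"\[([^\[\]]*)\]", region)]

print(single)
print(two_step)
```

The two-step version is longer but easier to read, and each regex can be tested on its own.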

how to match a particular tag value and the get the result from the previous tag after matching?

Input file:
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
I would like to match the tag <TD><PRE> sample</PRE></TD> and, if it matches, get the content of the previous tag, which is <TD>This is a TD cell</TD>
Output:
This is a TD cell
I tried the code below:
my $Output = m/<TD.*?\/TD>/;
I am able to match the tag, but I am unable to get the content of the previous tag from that match. Can anyone help me out with it?
Thanks in advance.
Since you need to go backwards, you will probably need to build a full tree. Normally I recommend a DOM-style HTML parser (see Mojo::DOM), but for building a tree, try something like HTML::Tree.
EDIT:
So I decided to see if I could do this with Mojo::DOM, and it worked rather nicely:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.10.0;
use Mojo::DOM;
my $dom = Mojo::DOM->new->xml(1)->parse(<<'HTML');
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
HTML
my $collection = $dom->find('TR TD');
my $i = -1; # so that first increment makes 0
$collection->first(sub{$i++; /sample/;});
say $collection->[$i-1];
You have to force XML parsing since HTML5 doesn't use upper case tags, but the rest should be self explanatory.
Edit Nov 1, 2012
Mojolicious 3.54 was just released and it gave Mojo::DOM the new next and previous methods, which help here. (I used this post as a case example for their use). That means, now you can do:
say $dom->find('TR TD')->first(qr/sample/)->previous;
rather than the last 4 lines of the example above.
This isn't really a good problem for regex. The best you can do with a single expression is to match both cells and capture the contents of the first cell in a group. e.g.
<TD>(.*?)</TD>\s*<TD><PRE> sample</PRE></TD>
I guess you'd need to replace whatever <PRE> sample</PRE> would be with another expression, but you haven't provided enough information about that here.
Using a html parser which can actually traverse the document tree would be a better option if you need to do this more generically.
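As a rough sketch of the parser-based approach, here is a version using Python's standard html.parser module instead of a Perl module. The class name CellCollector and the helper logic are illustrative, not part of any answer here. It collects the text of every TD cell in document order and then looks one cell back from the one containing "sample".

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collects the text content of each <td> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <TD> arrives as "td".
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data

html = """<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
</TABLE>"""

collector = CellCollector()
collector.feed(html)
collector.close()

# Index of the first cell whose text is "sample" (ignoring whitespace).
index = next(i for i, text in enumerate(collector.cells)
             if text.strip() == "sample")
previous = collector.cells[index - 1] if index > 0 else None
print(previous)
```

This flat list ignores row boundaries, so a real script would probably also track TR elements to avoid pairing cells from different rows.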
You can use lookbehind and lookahead to assert that a text is preceded or followed by another. Lookarounds are zero-width assertions, which means that they check the surrounding text without consuming it, so it does not become part of the match:
(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)
which means:
(?<=TD>) - look behind from the position where you are and assert that it is preceded by TD> (the end of a TD tag)
[^>]+ - match everything that is not the end of a tag
(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>) - and look ahead from the position where you are and assert that the following text is </TD>\s*<TD><PRE>\s*sample</PRE></TD> (closing tag, optional whitespace characters and your condition)
The result of this match is the text matched by the second part, [^>]+.
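Here is a minimal Python sketch that tries both regex approaches from this thread on the sample input: the capture-group pattern and the lookaround pattern. Python's re module only supports fixed-width lookbehind, which is sufficient here because (?<=TD>) has a fixed length.

```python
import re

html = """<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
</TABLE>"""

# Capture-group approach: match both cells, keep only the first cell's text.
captured = re.search(r"<TD>(.*?)</TD>\s*<TD><PRE> sample</PRE></TD>", html)
via_group = captured.group(1) if captured else None

# Lookaround approach: the match itself is only the first cell's text.
via_lookaround = re.findall(
    r"(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)", html
)

print(via_group)
print(via_lookaround)
```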
Although we're often cautioned against writing our own HTML regexes instead of using mature HTML parsers, sometimes a regex will do the job. See if this option helps (you may want to match a little more of the <PRE> tag):
use Modern::Perl;
my $html = <<'html';
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
html
say $html =~ m|<TD>(.*?)</TD>.*<TD><PRE>|is;
Output:
This is a TD cell

Regex: Match a <tr> that contains a string

I am trying to match all <tr> elements that contain the word "Source", where the other details (colspan/width/height, contained <td>s and their attributes, etc.) are unknown. (I know this can be done with a JavaScript/jQuery selector, but I am processing the HTML in a non-JavaScript context.)
Example of target:
<tr>
<td>Don't affect this</td>
</tr>
<tr>
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
(This is what I want to change it to:)
<tr>
<td>Don't affect this</td>
</tr>
<tr class="source">
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
Here are regex patterns I have tried that haven't worked:
/<tr>((?:.*?)Source(?:s?):(?:.*?))<\/tr>/gmi,
No matches.
/<tr>((?:[\s\S]*?)Source(?:s?):(?:[\s\S]*?))<\/tr>/gmi,
Matches the first tr, but not the second.
I think there's a regex principle I may be failing to grasp here, about greediness or something related. Any suggestions?
/<tr[^>]*>(?:(?!<|source)[\s\S])*(?:<(?!\/?tr)[^>]*>(?:(?!<|source)[\s\S])*)*source[\s\S]*?<\/tr>/i
Are you sure you can't use jQuery for this? :P But seriously, this will be easier to grasp if I put it in terms of Friedl's "unrolled loop" idiom:
opening normal ( special normal * ) * closing
opening: <tr[^>]*> - the opening <tr> tag
normal: (?:(?!<|source)[\s\S])* - zero or more of any characters, with the lookahead to make sure each time that the character is not the beginning of a tag or the word "source"
special: <(?!\/?tr)[^>]*> - any tag except another opening <tr> or a closing </tr>. By consuming a complete tag, we avoid false positives on the word "source" in the name or value of an attribute.
closing: source - The only other thing it could possibly encounter here is a <tr> or </tr> tag, which would indicate a failed match for our purposes. Finding "source" before one of those tags is how we know we've found a match. (The rest of the regex, [\s\S]*?<\/tr>, merely consumes the remainder of the element so you can retrieve the whole thing via group[0].)
A <tr> there isn't necessarily invalid, of course; it could be the beginning of a nested TR element, presumably within a nested TABLE element. If that TR contains the word "source", the regex will match it on a separate match attempt. It will match only the innermost, complete TR tag with the word "source" in it.
As usual when using regexes on HTML, I'm making several simplifying assumptions involving well-formedness, SGML comments, CDATA sections, etc., etc. Caveat emptor.
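To connect this back to the goal in the question (adding class="source" to the matching row), here is a minimal Python sketch that uses the pattern above with a replacement function. The function name add_class is illustrative, and the same simplifying assumptions about well-formed HTML apply.

```python
import re

html = """<tr>
<td>Don't affect this</td>
</tr>
<tr>
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>"""

# The unrolled-loop pattern: opening, normal*, (special normal*)*, closing.
row_with_source = re.compile(
    r"<tr[^>]*>(?:(?!<|source)[\s\S])*"
    r"(?:<(?!/?tr)[^>]*>(?:(?!<|source)[\s\S])*)*"
    r"source[\s\S]*?</tr>",
    re.IGNORECASE,
)

def add_class(match):
    row = match.group(0)
    # Insert the attribute right after the leading "<tr" (3 characters).
    return row[:3] + ' class="source"' + row[3:]

result = row_with_source.sub(add_class, html)
print(result)
```

Only the second row is matched, so the first row is left untouched.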
If you are using a library like jQuery you do not even need to use a regex:
$('tr:contains("Source")').something...

Using a REGEX to replace words within a sub-match

I hope this isn't a repetition...
I need a regex to do what should be a fairly simple task. I have code for an HTML table, and I want to replace all <td> tags with <th> tags in the first row of the table, i.e. within the first set of <tr> </tr> tags. The table might look something like this:
<table cellpadding="5" cellspacing="0" border="1">
<tr>
<td>Capacity %</td>
<td>Tension V</td>
<td>Acid kg/l</td>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>...etc
and I want:
<table cellpadding="5" cellspacing="0" border="1">
<tr>
<th>Capacity %</th>
<th>Tension V</th>
<th>Acid kg/l</th>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>...etc
I've tried regexes similar to this:
/(<table>\n<tr>\n)(.+?)(</tr>)
...and then tried to rebuild the table row using back references, but I can't seem to apply the regex to the multiple
</?td>
matches that there might be.
I'm doing this in javascript, which means I can't use look-behinds (although if anyone has a look behind solution I'd be interested in seeing it anyway...).
Thanks in advance for any help.
You could do it if your regex engine supports indefinite repetition inside lookbehind assertions, for example in .NET (C#):
resultString = Regex.Replace(subjectString,
    @"(?<=        # Assert that we can match this before the current position:
     <table       # <table
     (?:          # followed by...
      (?!         # (unless there's an intervening
       </table    # </table
       |          # or
       </tr       # </tr)
      )           # (End of lookahead assertion)
      .           # any character
     )*           # any number of times
    )             # (End of lookbehind assertion)
    <td           # Then match <td",
    "<th", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
works on your example. But even in .NET, I wouldn't use a regex for it, it's just too brittle. Better manipulate the DOM directly, that's what it's there for.
You can't easily do this with a single regex in JavaScript. A regex replace treats every match the same way, and you've got a special condition ("only in the first <tr>"), so you'll need to write some conditional logic along with the regex to make it work.
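One way to add that conditional logic is a two-level replace: an outer regex that matches only the first row, and an inner regex that swaps the tags inside it. The sketch below is in Python rather than JavaScript, purely for illustration. The function name header_first_row is hypothetical, and it assumes rows are not nested.

```python
import re

def header_first_row(table_html):
    """Turn <td>/</td> into <th>/</th> inside the first <tr>...</tr> only."""

    def to_th(match):
        # Inner replace: runs only on the text of the matched row.
        return re.sub(r"<(/?)td\b", r"<\1th", match.group(0))

    # Outer replace: count=1 restricts the change to the first row.
    return re.sub(r"<tr\b[^>]*>[\s\S]*?</tr>", to_th, table_html, count=1)

table = """<table cellpadding="5" cellspacing="0" border="1">
<tr>
<td>Capacity %</td>
<td>Tension V</td>
<td>Acid kg/l</td>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>"""

converted = header_first_row(table)
print(converted)
```

JavaScript's String.prototype.replace also accepts a function as the replacement, so the same structure should carry over.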