Regex - Find and replace an url inside href attribute

I have an xlsx/csv file whose contents I am trying to modify with Notepad++.
Specifically, I want to change a URL inside an href attribute. For example:
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf""
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7609_Datasheet--de.pdf""
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/6/7981_Datasheet--de.pdf""
etc...
After the replacement, I want them to look like this:
href=""/docs/7521_Datasheet--de.pdf""
href=""/docs/7609_Datasheet--de.pdf""
href=""/docs/7981_Datasheet--de.pdf""
Right now, I have this pattern on find:
(?<=href=(""|''))[^"']+(?=(.pdf""|.pdf''))
EDIT:
After trying the given examples, no string matches. Here is the full cell text:
"<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""10""><tbody><tr>
<td align=""left"" valign=""top"">
<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""0""><tbody><tr>
<td>
<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""0""><tbody><tr>
<td align=""left"" valign=""top"" class=""DocRepCell1""><img src=""/catalog/pdf.gif"" alt="" "" border=""0""></td>
<td align=""left"" width=""97%"" valign=""middle"" class=""DocRepCell2""><span class=""NavigationButtonMoreInfos"">Produktinformation breite</span> </td>
<td align=""right"" width=""1%"" nowrap=""nowrap"" valign=""middle"" class=""DocRepCell3"">0,1 MB</td>
<td align=""right"" width=""1%"" nowrap=""nowrap"" valign=""middle"" class=""DocRepCell4"">
<a class=""NavigationButtonMoreInfos"" target=""_blank"" href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf"">herunterladen</a></td></tr>
</tbody></table></td></tr></tbody>
</table></td></tr>
</tbody></table></td></tr>
</tbody></table>"

You can try the following find and replace in regex mode:
Find:
^href=""/.*?(\d+_Datasheet.*\.pdf"")$
Replace:
href=""/docs/$1
Note that the find pattern can be made more generic if it fails on more of your data, but we would need some concrete way of identifying where the suffix you wish to keep begins. If this answer doesn't work for you, state where it fails and describe the logic that identifies the suffix.
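To check the idea outside Notepad++, here is a minimal Python sketch of the same find-and-replace. It is a variant under stated assumptions, not the pattern above verbatim: the ^ and $ anchors are dropped so the href can sit in the middle of a longer cell, [^"]*/ is used instead of .*? so the match cannot run past the attribute value, and the name rewrite_hrefs is made up for illustration.

```python
import re

# Assumptions: attribute values are wrapped in doubled quotes ("") as in the
# CSV sample, and the part to keep starts with digits followed by "_Datasheet".
HREF_PATTERN = re.compile(r'href=""/[^"]*/(\d+_Datasheet[^"]*\.pdf"")')


def rewrite_hrefs(text: str) -> str:
    """Point every matching datasheet href at /docs/ and keep the file name."""
    return HREF_PATTERN.sub(r'href=""/docs/\1', text)


if __name__ == "__main__":
    cell = ('<a target=""_blank"" href=""/xs_db/DOKUMENT_DB/www/Datenblaetter'
            '/de/7/7521_Datasheet--de.pdf"">herunterladen</a>')
    print(rewrite_hrefs(cell))
```

Other attributes in the cell, such as the img src, do not start with href= and are left untouched.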

Here's a way to match just the part you want to replace with the path /docs:
Find what :
^href=["']+\K(/.*?)(?=/\d+_[\w-]+\.pdf["']+$)
Replace with :
/docs
Search mode : Regular Expression (best with ". matches newline" unchecked)

Related

How to match only one string and not another

This is String 1:
<td class="AAA"><span class="BBB">Text1</span></td>
I want to remove the span so it looks like this:
<td class="BBB">Text1</td>
Which is easy enough with this regex:
Search: <td class="AAA"><span class="BBB">(.*)</span></td>
Replace: <td class="BBB">$1</td>
The problem: Sometimes the string looks like this (String 2):
<td class="AAA"><span class="BBB">Text1</span>-<span class="BBB">Text2</span></td>
which also matches because of the 2 closing tags. But I don't want it to be matched at all. How do I find only String 1?

Instead of matching any character in your capturing group, match any character except the opening <:
Search: <td class="AAA"><span class="BBB">([^<]*)</span></td>
Replace: <td class="BBB">$1</td>
This is assuming your Text1 doesn't contain the < character.
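As a quick check of this approach, the following Python sketch applies the same search and replace to both strings from the question. The function name unwrap_span is hypothetical, and Python writes the backreference as \1 rather than $1.

```python
import re

# [^<]* cannot cross into another tag, so the match must end at the first </span>.
SPAN_PATTERN = re.compile(r'<td class="AAA"><span class="BBB">([^<]*)</span></td>')


def unwrap_span(html: str) -> str:
    """Drop the span wrapper, but only when the cell holds exactly one span."""
    return SPAN_PATTERN.sub(r'<td class="BBB">\1</td>', html)


STRING_1 = '<td class="AAA"><span class="BBB">Text1</span></td>'
STRING_2 = ('<td class="AAA"><span class="BBB">Text1</span>-'
            '<span class="BBB">Text2</span></td>')

if __name__ == "__main__":
    print(unwrap_span(STRING_1))  # rewritten
    print(unwrap_span(STRING_2))  # left as it was
```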

Select URL in HTML table with regular expression

I have a table with names and URLs like this:
<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>
I want to select all URL-tabledata in a table.
I tried:
<td>w{3,3}.*(</td>){1,1}
But this expression doesn't "stop" at the first </td>. I get:
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td>
as result. Where is my mistake?

There are several ways to match a URL. I'll go with the simplest one for your needs: just correcting your regex. You can use this one instead:
<td>w{3}.*?</td>
Explanation:
<td>          # this part is OK
w{3,3}        # the notation {3} is simpler for this case and has the same effect
.*            # the main problem: you have to use .*? to make .* non-greedy, that is, to make it match as little as possible
(</td>){1,1}  # same as the second line; as the number is 1, {1} is not needed
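The difference between the greedy and the non-greedy quantifier can be seen in a small Python sketch run against the table from the question. One assumption is labeled in the code: re.DOTALL is switched on so that "." also matches newlines, which is presumably how the asker's tool behaved, since the reported match spans several lines.

```python
import re

TABLE = """<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>"""

# Assumption: "." matches newlines in the asker's tool, hence re.DOTALL.
GREEDY = re.compile(r"<td>w{3}.*</td>", re.DOTALL)
LAZY = re.compile(r"<td>w{3}.*?</td>", re.DOTALL)

greedy_matches = GREEDY.findall(TABLE)  # .* runs on to the last </td>
lazy_matches = LAZY.findall(TABLE)      # .*? stops at the first </td>

if __name__ == "__main__":
    print(len(greedy_matches), "greedy match")
    print(lazy_matches)
```

The greedy pattern yields a single match that swallows everything between the first URL cell and the last </td>, while the non-greedy one returns each URL cell separately.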

Your regex can be
\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]
or
"((((ht{2}ps?://)?)((w{3}\\.)?))?)[^.&&[a-zA-Z0-9]][a-zA-Z0-9.-]+[^.&&[a-zA-Z0-9]](\\.[a-zA-Z]{2,3})"
See also the question "What is the best regular expression to check if a string is a valid URL?", which has many answers.

How to create a regex to match everything inside and including <div>...</div>?

This is the sample text that I'm working with. I'm using Coda to do a find and replace...
<td width="20%"><div > Item #</div></td>
<td width="20%"><div > Pole Tip</div></td>
<td width="20%"><div > Length</div></td>
<td width="20%"><div > Test Weight (lbs.)</div></td>
<td width="20%"><div > Price</div></td>
I want to get rid of the div tags that markup the text inside the td.
Ex...I want to change this:
<td width="20%"><div > Item #</div></td>
to this:
<td width="20%">Item #</td>
So far I have this as a regex:
<div >[\s\w\(\)#]*</div>
However this matches all of the above in my sample text EXCEPT:
<td width="20%"><div > Test Weight (lbs.)</div></td>
In my regex, I even tried to add the ( and )...what am I doing wrong?

In reply to Andy: I agree that parsing well-formed markup should be left to DOM navigation tools. XML parsers, or HTML-to-XML converters, are good choices. I don't know what Miles is working with, but I frequently work with HTML that is so malformed that markup parsers can't handle it.
In some of my regex tutorials on document parsing, I discuss the "trim" pattern, which is simply zero or more whitespace characters (\s*). You might shy away from it because it adds a little length to the pattern, but there is virtually no efficiency loss. That being said...
(<td[^>]*>)\s*<div[^>]*>\s*((?:[^<]*(?(?!</div>\s*</td>)<))*)\s*</div>\s*(</td>)
Replace this with $1$2$3 and you get back a clean result. Of course, you can keep or remove as many trims (\s*) as you like; using them is just my personal preference when parsing documents or malformed markup.

That's because you missed the period (.) in "(lbs.)". This works just fine:
<div >[\s\w\(\)#.]*</div>
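A short Python sketch can confirm that the missing "." is the whole problem. ORIGINAL and FIXED are the two character classes discussed here; STRIP_DIV and strip_div are hypothetical additions that capture the text so the div wrapper can be removed in one substitution.

```python
import re

# The asker's character class, and the same class with "." added.
ORIGINAL = re.compile(r"<div >[\s\w\(\)#]*</div>")
FIXED = re.compile(r"<div >[\s\w\(\)#.]*</div>")

# Hypothetical extension: capture the trimmed text between the div tags.
STRIP_DIV = re.compile(r"<div >\s*([\s\w\(\)#.]*?)\s*</div>")


def strip_div(line: str) -> str:
    """Replace '<div > text</div>' with just the trimmed text."""
    return STRIP_DIV.sub(r"\1", line)


LINE = '<td width="20%"><div > Test Weight (lbs.)</div></td>'

if __name__ == "__main__":
    print(ORIGINAL.search(LINE))       # no match: "." is not in the class
    print(FIXED.search(LINE).group())  # matches the whole div element
    print(strip_div(LINE))
```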

how to match a particular tag value and the get the result from the previous tag after matching?

Input file:
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
I would like to match the tag <TD><PRE> sample</PRE></TD> and, if it matches, get the content of the previous tag, which is <TD>This is a TD cell</TD>
Output:
This is a TD cell
I tried with the below code:
MY $Output = m/<TD.*?\/TD>/;
I am able to match the tag, but I am unable to get the content of the previous tag from that match. Can anyone help me out with this?
Thanks in advance.

Since you need to go backwards, you will probably need to build a full tree. Normally I recommend a DOM-style HTML parser (see Mojo::DOM), but for building a tree, try something like HTML::Tree.
EDIT:
So I decided to see if I could do this with Mojo::DOM, and it worked rather nicely:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.10.0;
use Mojo::DOM;
my $dom = Mojo::DOM->new->xml(1)->parse(<<'HTML');
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
HTML
my $collection = $dom->find('TR TD');
my $i = -1; # so that first increment makes 0
$collection->first(sub{$i++; /sample/;});
say $collection->[$i-1];
You have to force XML parsing since HTML5 doesn't use upper case tags, but the rest should be self explanatory.
Edit Nov 1, 2012
Mojolicious 3.54 was just released and it gave Mojo::DOM the new next and previous methods, which help here. (I used this post as a case example for their use). That means, now you can do:
say $dom->find('TR TD')->first(qr/sample/)->previous;
rather than the last 4 lines of the example above.

This isn't really a good problem for regex. The best you can do with a single expression is to match both cells and capture the contents of the first cell in a group. e.g.
<TD>(.*?)</TD>\s*<TD><PRE> sample</PRE></TD>
I guess you'd need to replace whatever <PRE> sample</PRE> would be with another expression, but you haven't provided enough information about that here.
Using an HTML parser that can actually traverse the document tree would be a better option if you need to do this more generically.
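For illustration, here is the two-cell expression above in a minimal Python sketch. The helper name previous_cell is hypothetical, and the sketch assumes the cell being searched for is literally <TD><PRE> sample</PRE></TD>, as in the question.

```python
import re
from typing import Optional

ROW = """<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>"""

# Group 1 captures the cell that sits directly before the "sample" cell.
PATTERN = re.compile(r"<TD>(.*?)</TD>\s*<TD><PRE> sample</PRE></TD>")


def previous_cell(html: str) -> Optional[str]:
    """Return the content of the TD right before the sample cell, if any."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(previous_cell(ROW))
```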

You can use lookbehind and lookahead to assert that a text is preceded or followed by another. The lookarounds are zero-width assertions, which means that they don't consume any characters, so the text they check is not part of the match:
(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)
which means:
(?<=TD>) - look behind from the current position and assert that the preceding text is TD>, the end of a tag
[^>]+ - match one or more characters that are not >, the end of a tag
(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>) - look ahead from the current position and assert that the following text is </TD>\s*<TD><PRE>\s*sample</PRE></TD> (closing tag, optional whitespace characters and your condition)
The result of this match is the text matched by the second part, [^>]+.
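The lookaround version can be tried in Python as well, since its re module supports fixed-width lookbehind such as (?<=TD>). This is only a sketch run against the first table row from the question; the variable names are arbitrary.

```python
import re

ROW = """<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>"""

# Only [^>]+ consumes text; both lookarounds merely check their surroundings.
LOOKAROUND = re.compile(
    r"(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)"
)

match = LOOKAROUND.search(ROW)

if __name__ == "__main__":
    # group(0) is the whole match, which excludes the asserted tags.
    print(match.group(0))
```

Because the surrounding tags are asserted rather than consumed, the whole match is already the wanted text and no capture group is needed.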

Although we're often cautioned against writing our own HTML regexes and advised to use mature HTML parsers instead, sometimes the former may do the job. See if this option helps (and you may want to match a little more of the <PRE> tag):
use Modern::Perl;
my $html = <<'html';
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
html
say $html =~ m|<TD>(.*?)</TD>.*<TD><PRE>|is;
Output:
This is a TD cell

Regex: Match a <tr> that contains a string

I am trying to match all <tr> elements that contain the word "Source" when the other attributes (colspan/width/height, contained <td>s and their attributes, etc.) are unknown. (I know this can be done with a JavaScript/jQuery selector, but I am processing the HTML in a non-JavaScript context.)
Example of target:
<tr>
<td>Don't affect this</td>
</tr>
<tr>
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
(This is what I want to change it to:)
<tr>
<td>Don't affect this</td>
</tr>
<tr class="source">
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
Here are regex patterns I have tried that haven't worked:
/<tr>((?:.*?)Source(?:s?):(?:.*?))<\/tr>/gmi,
No matches.
/<tr>((?:[\s\S]*?)Source(?:s?):(?:[\s\S]*?))<\/tr>/gmi,
Matches the first tr, but not the second.
I think there's a regex principle I may be failing to grasp here, about greediness or something related. Any suggestions?

/<tr[^>]*>(?:(?!<|source)[\s\S])*(?:<(?!\/?tr)[^>]*>(?:(?!<|source)[\s\S])*)*source[\s\S]*?<\/tr>/i
Are you sure you can't use jQuery for this? :P But seriously, this will be easier to grasp if I put it in terms of Friedl's "unrolled loop" idiom:
opening normal ( special normal * ) * closing
opening: <tr[^>]*> - the opening <tr> tag
normal: (?:(?!<|source)[\s\S])* - zero or more of any characters, with the lookahead to make sure each time that the character is not the beginning of a tag or the word "source"
special: <(?!\/?tr)[^>]*> - any tag except another opening <tr> or a closing </tr>. By consuming a complete tag, we avoid false positives on the word "source" in the name or value of an attribute.
closing: source - The only other thing it could possibly encounter here is a <tr> or </tr> tag, which would indicate a failed match for our purposes. Finding "source" before one of those tags is how we know we've found a match. (The rest of the regex, [\s\S]*?<\/tr>, merely consumes the remainder of the tag so you can retrieve it via group[0].)
A <tr> there isn't necessarily invalid, of course; it could be the beginning of a nested TR element, presumably within a nested TABLE element. If that TR contains the word "source", the regex will match it on a separate match attempt. It will match only the innermost, complete TR tag with the word "source" in it.
As usual when using regexes on HTML, I'm making several simplifying assumptions involving well-formedness, SGML comments, CDATA sections, etc., etc. Caveat emptor.

If you are using a library like jQuery you do not even need to use a regex:
$('tr:contains("Source")').something...
