regex challenge for htaccess 301 redirect - regex

I have a number of similar Tag and Category URLs that I would like to redirect in one clean regex statement (if possible). Example URLs are:
<table style="width:100%">
<tr>
<th>URL</th>
<th>Redirect To</th>
</tr>
<tr>
<td>/category/categoryname-2/</td>
<td>/blog/category/categoryname/</td>
</tr>
<tr>
<td>/category/categoryname-2/page/3/</td>
<td>/blog/category/categoryname/</td>
</tr>
<tr>
<td>/category/categoryname/</td>
<td>/blog/category/categoryname/</td>
</tr>
<tr>
<td>/category/categoryname/page/3/</td>
<td>/blog/category/categoryname/</td>
</tr>
<tr>
<td>/tag/tagname-2/</td>
<td>/blog/tag/tagname/</td>
</tr>
<tr>
<td>/tag/tagname-2/page/3/</td>
<td>/blog/tag/tagname/</td>
</tr>
<tr>
<td>/tag/tagname/</td>
<td>/blog/tag/tagname/</td>
</tr>
<tr>
<td>/tag/tagname/page/3/</td>
<td>/blog/tag/tagname/</td>
</tr>
<tr>
<td>/blog/page/5/</td>
<td>/blog/</td>
</tr>
</table>
Notice some of the tag and category names have "-2" at the end of the name, which I would like removing.
I have had a good attempt at doing this but am not getting very far; the -2 piece is stumping me unfortunately, hence turning to the expertise here on SO.

I think I've got it. I think this covers all category and tag redirects. I've explained what I think it's doing, but I could be mistaken and may have arrived at the solution by accident as much as by design.
Also it's hard to fully test due to browser caching and CDN caching, but seems to work when incognito:
RedirectMatch 301 ^/(category|tag)/(([a-z]+)|([a-z]+-[a-z]+))(?:-2)?/(?:page/.*)?$ /blog/$1/$2/
(category|tag) matches the words category or tag. This returns $1 (group 1)
(([a-z]+)|([a-z]+-[a-z]+))(?:-2)?
Some categories have a - in the name; therefore I have to handle the -2 (which we want rid of) and genuine - (which we want to keep).
(([a-z]+)|([a-z]+-[a-z]+)) - This returns group $2. The | in the middle says either return what is before the | or what is after.
([a-z]+) - This finds the first unbroken text block, i.e. stops when it hits a number or punctuation (a - in this case).
([a-z]+-[a-z]+) - This finds the unbroken string, then a hyphen, then another unbroken string. This accounts for tag names with a hyphen. If there isn't a text block after the hyphen, this would return nothing...which we want to happen.
(?:-2)? This says there might be a -2. The question mark at the end means it may be present, but it's fine if it isn't. The ?: inside the brackets makes the group non-capturing, so a -2 is matched but kept out of $2.
(?:page/.*)? Looks for the word "page" and anything after it and, if found, puts it in a non-capturing ("ignore") group. The question mark at the end means this doesn't need to be there.
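One way to check the pattern without fighting browser and CDN caches is to run it offline against the example URLs. The sketch below makes a few assumptions: it uses Python's re module rather than Apache's own regex engine (the two should agree on the syntax used here), and the helper name redirect_target and the hyphenated sample /tag/my-tag-2/ are made up for illustration.

```python
import re

# Same pattern as in the RedirectMatch line above.
PATTERN = re.compile(
    r"^/(category|tag)/(([a-z]+)|([a-z]+-[a-z]+))(?:-2)?/(?:page/.*)?$"
)

def redirect_target(path):
    """Return the rewritten path, or None if the rule does not apply.

    Hypothetical helper: mimics the "/blog/$1/$2/" substitution.
    """
    m = PATTERN.match(path)
    if m is None:
        return None
    return f"/blog/{m.group(1)}/{m.group(2)}/"

for path in [
    "/category/categoryname-2/",
    "/category/categoryname-2/page/3/",
    "/category/categoryname/",
    "/tag/tagname/page/3/",
    "/tag/my-tag-2/",   # made-up name with a genuine hyphen
    "/blog/page/5/",    # not covered by this rule
]:
    print(path, "->", redirect_target(path))
```

/blog/page/5/ comes back as None here, so the /blog/page/N/ case from the question would still need a rule of its own.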

Related

Regex not matching, need help figuring out why

The regex is:
<td class="description">(?<lineItem>[\w/ -]+)<\/td>\s+<td class="descriptionFormat">(?:[\w\$%\.,()\/ ]*)</td>\s+<td class="amount">[ ]*(?<lineValue>-?\$(?:\d{1,3},?)+\.\d{2})</td>
Ignore case, multi line, and single line are enabled.
An example of a statement that matches is:
<td class="description">Fuel Cost Adjustment</td>
<td class="descriptionFormat">18,640 KWH at -$0.00044</td>
<td class="amount">$-1.36</td>
But this one does not match:
<td class="description">Fuel Cost Adjustment</td>
<td class="descriptionFormat">18,640 KWH at -$0.00044 (25/30 Days)</td>
<td class="amount">$-6.84</td>
I'd really appreciate if anyone could tell me what's wrong.
Thank you!
(<td class="description">)(?<lineItem>[\w/ -]+)<\/td>\s*<td class="descriptionFormat">(?:[\w\$%\-\.,\(\)\/ ]*)\s*</td>\s*<td class="amount">\s*(?<lineValue>-?\$(-|)(?:\d{1,3},?)+\.\d{2})</td>
should match, but you might consider reworking this a bit, since parsing HTML with regex is rather tricky unless you're 100% sure the input data won't change in any way.
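To check the suggested pattern against both sample rows, something like the following may help. It is only a sketch in Python, which is an assumption, since the question does not name its regex engine. Python spells named groups (?P<name>...) rather than (?<name>...), and only the ignore-case flag is set because the pattern contains no unescaped dot or line anchors.

```python
import re

# The suggested pattern, with the named groups rewritten in Python's
# (?P<name>...) spelling; otherwise unchanged.
ROW = re.compile(
    r'(<td class="description">)(?P<lineItem>[\w/ -]+)<\/td>\s*'
    r'<td class="descriptionFormat">(?:[\w\$%\-\.,\(\)\/ ]*)\s*</td>\s*'
    r'<td class="amount">\s*(?P<lineValue>-?\$(-|)(?:\d{1,3},?)+\.\d{2})</td>',
    re.IGNORECASE,
)

first = '''<td class="description">Fuel Cost Adjustment</td>
<td class="descriptionFormat">18,640 KWH at -$0.00044</td>
<td class="amount">$-1.36</td>'''

second = '''<td class="description">Fuel Cost Adjustment</td>
<td class="descriptionFormat">18,640 KWH at -$0.00044 (25/30 Days)</td>
<td class="amount">$-6.84</td>'''

for sample in (first, second):
    m = ROW.search(sample)
    print(m.group("lineItem"), m.group("lineValue"))
```

Compared with the pattern in the question, the visible additions are the \- inside the descriptionFormat character class and the optional - after \$ in the amount.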

Select URL in HTML table with regular expression

I have a table with names and URLs like this:
<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>
I want to select all URL-tabledata in a table.
I tried:
<td>w{3,3}.*(</td>){1,1}
But this expression doesn't "stop" at the first </td>. I get:
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td>
as result. Where is my mistake?
There are several ways to match a URL. I'll go with the simplest one for your needs: just correcting your regex. You can use this one instead:
<td>w{3}.*?</td>
Explanation:
<td> # this part is ok
w{3,3} # the notation {3} is simpler for this case and has the same effect
.* # the main problem: you have to use .*? to make .* non-greedy, that is, to make it match as little as possible
(</td>){1,1} # same as second line. As the number is 1, {1} is not needed
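The difference between the greedy and the non-greedy form can be seen side by side in a short sketch. Python is used here as an assumption, since the question does not say which tool runs the regex, and re.DOTALL is switched on because the result quoted in the question spans several lines.

```python
import re

html = """<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>"""

# re.DOTALL lets "." cross line breaks, as it evidently did in the question.
greedy = re.findall(r"<td>w{3}.*</td>", html, re.DOTALL)
lazy = re.findall(r"<td>w{3}.*?</td>", html, re.DOTALL)

print(greedy)  # one long match running on to the last </td>
print(lazy)    # one match per URL cell
```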
Your regex can be
\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]
or
"((((ht{2}ps?://)?)((w{3}\\.)?))?)[^.&&[a-zA-Z0-9]][a-zA-Z0-9.-]+[^.&&[a-zA-Z0-9]](\\.[a-zA-Z]{2,3})"
See the question "What is the best regular expression to check if a string is a valid URL?". Many answers are available there.

how to match a particular tag value and the get the result from the previous tag after matching?

Input file:
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
I would like to match the tag <TD><PRE> sample</PRE></TD> and, if it matches, get the contents of the previous tag, which is <TD>This is a TD cell</TD>.
Output:
This is a TD cell
I tried with the below code:
my $Output = m/<TD.*?\/TD>/;
I am able to match the tag, but I am unable to get the contents of the previous tag from that match. Can anyone help me out with this?
Thanks in advance.
Since you will need to go backwards, you will probably need to build a full tree. Normally I recommend a DOM-style HTML parser (see Mojo::DOM), but for building a tree, try something like HTML::Tree.
EDIT:
So I decided to see if I could do this with Mojo::DOM, and it worked rather nicely:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.10.0;
use Mojo::DOM;
my $dom = Mojo::DOM->new->xml(1)->parse(<<'HTML');
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
HTML
my $collection = $dom->find('TR TD');
my $i = -1; # so that first increment makes 0
$collection->first(sub{$i++; /sample/;});
say $collection->[$i-1];
You have to force XML parsing since HTML5 doesn't use upper case tags, but the rest should be self explanatory.
Edit Nov 1, 2012
Mojolicious 3.54 was just released and it gave Mojo::DOM the new next and previous methods, which help here. (I used this post as a case example for their use). That means, now you can do:
say $dom->find('TR TD')->first(qr/sample/)->previous;
rather than the last 4 lines of the example above.
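The same index-based idea (find the matching cell, then step back one) can be sketched outside Perl as well. The following uses Python's standard html.parser instead of Mojo::DOM, so it is a different tool swapped in for illustration; the class name CellCollector and the variable names are invented here, and only the first table row is included to keep it short.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data

html = """<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
</TABLE>"""

collector = CellCollector()
collector.feed(html)
collector.close()

# Index of the first cell containing "sample", then the cell before it.
# (Assumes a match exists and is not the very first cell.)
i = next(n for n, text in enumerate(collector.cells) if "sample" in text)
previous_cell = collector.cells[i - 1]
print(previous_cell)
```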
This isn't really a good problem for regex. The best you can do with a single expression is to match both cells and capture the contents of the first cell in a group. e.g.
<TD>(.*?)</TD>\s*<TD><PRE> sample</PRE></TD>
I guess you'd need to replace whatever <PRE> sample</PRE> would be with another expression, but you haven't provided enough information about that here.
Using a html parser which can actually traverse the document tree would be a better option if you need to do this more generically.
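The capture-group pattern above can be tried out in a few lines. This is a sketch in Python (an assumption; the question itself uses Perl), run against the first row of the sample table only.

```python
import re

html = """<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>"""

# Group 1 holds the contents of the cell immediately before the
# <TD><PRE> sample</PRE></TD> cell.
m = re.search(r"<TD>(.*?)</TD>\s*<TD><PRE> sample</PRE></TD>", html)
captured = m.group(1) if m else None
print(captured)
```

Note that .*? can still run across several cells if they sit on one line, so [^<]* may be a safer choice when the cells hold plain text only.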
You can use lookbehind and lookahead to assert that a text is preceded or followed by another - the lookarounds are zero-width assertions which means that they don't capture anything:
(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)
which means:
(?<=TD>) - look behind from the position where you are and assert that the preceding text is TD> (the end of an opening <TD> tag)
[^>]+ - match everything that is not the end of a tag
(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>) - and look ahead from the position where you are and assert that the following text is </TD>\s*<TD><PRE>\s*sample</PRE></TD> (closing tag, optional whitespace characters and your condition)
The result of this match is the text matched by the second part, [^>]+.
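As a rough check, the lookaround pattern can be run as follows. Python is an assumption here; its re module only accepts fixed-width lookbehinds, which (?<=TD>) satisfies.

```python
import re

html = """<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>"""

# The lookbehind and lookahead only assert context; the match itself is
# just the text consumed by [^>]+.
pattern = re.compile(r"(?<=TD>)[^>]+(?=</TD>\s*<TD><PRE>\s*sample</PRE></TD>)")
m = pattern.search(html)
matched = m.group(0) if m else None
print(matched)
```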
Although we're often cautioned against writing our own HTML regexes instead of using mature HTML parsers, sometimes the former may do the job. See if this option helps (and you may want to match a little more of the <PRE> tag):
use Modern::Perl;
my $html = <<'html';
<TABLE BORDER="7" CELLPADDING="10">
<TR>
<TD>This is a TD cell</TD>
<TD><PRE> sample</PRE></TD>
<TH>This is a TH cell</TH>
</TR>
<TR>
<TH VALIGN="TOP">Text aligned top</TH>
<TH>Image in TH cell with default alignments ---></TH>
<TH><IMG SRC="blylplne.gif" ALT="airplane"></TH>
</TR>
</TABLE>
html
say $html =~ m|<TD>(.*?)</TD>.*<TD><PRE>|is;
Output:
This is a TD cell

Regex: Match a <tr> that contains a string

I am trying to match all <tr> elements that contain the word "Source", where everything else about them (colspan/width/height attributes, contained <td>s and their attributes, etc.) is unknown. (I know this can be done with a JavaScript/jQuery selector, but I am just processing the HTML for a non-JavaScript context.)
Example of target:
<tr>
<td>Don't affect this</td>
</tr>
<tr>
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
(This is what I want to change it to:)
<tr>
<td>Don't affect this</td>
</tr>
<tr class="source">
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
Here are regex patterns I have tried that haven't worked:
/<tr>((?:.*?)Source(?:s?):(?:.*?))<\/tr>/gmi,
No matches.
/<tr>((?:[\s\S]*?)Source(?:s?):(?:[\s\S]*?))<\/tr>/gmi,
Matches the first tr, but not the second.
I think there's a regex principle I may be failing to grasp here, about greediness or something related. Any suggestions?
/<tr[^>]*>(?:(?!<|source)[\s\S])*(?:<(?!\/?tr)[^>]*>(?:(?!<|source)[\s\S])*)*source[\s\S]*?<\/tr>/i
Are you sure you can't use jQuery for this? :P But seriously, this will be easier to grasp if I put it in terms of Friedl's "unrolled loop" idiom:
opening normal* ( special normal* )* closing
opening: <tr[^>]*> - the opening <tr> tag
normal: (?:(?!<|source)[\s\S])* - zero or more of any characters, with the lookahead to make sure each time that the character is not the beginning of a tag or the word "source"
special: <(?!\/?tr)[^>]*> - any tag except another opening <tr> or a closing </tr>. By consuming a complete tag, we avoid false positives on the word "source" in the name or value of an attribute.
closing: source - The only other thing it could possibly encounter here is a <tr> or </tr> tag, which would indicate a failed match for our purposes. Finding "source" before one of those tags is how we know we've found a match. (The rest of the regex, [\s\S]*?<\/tr>, merely consumes the remainder of the tag so you can retrieve it via group[0].)
A <tr> there isn't necessarily invalid, of course; it could be the beginning of a nested TR element, presumably within a nested TABLE element. If that TR contains the word "source", the regex will match it on a separate match attempt. It will match only the innermost, complete TR tag with the word "source" in it.
As usual when using regexes on HTML, I'm making several simplifying assumptions involving well-formedness, SGML comments, CDATA sections, etc., etc. Caveat emptor.
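To go from matching to the rewrite asked for in the question, the pattern can be paired with a replacement callback. The sketch below uses Python, which is an assumption since the question only says the context is not JavaScript; the names SOURCE_ROW and add_class are invented for illustration, and the callback assumes the matched row does not already carry a class attribute.

```python
import re

# The pattern from above, unchanged apart from Python string quoting.
SOURCE_ROW = re.compile(
    r"<tr[^>]*>(?:(?!<|source)[\s\S])*"
    r"(?:<(?!\/?tr)[^>]*>(?:(?!<|source)[\s\S])*)*"
    r"source[\s\S]*?<\/tr>",
    re.IGNORECASE,
)

html = """<tr>
<td>Don't affect this</td>
</tr>
<tr>
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>"""

def add_class(match):
    # Hypothetical helper: every match starts with "<tr", so splice the
    # class attribute in right after those three characters.
    row = match.group(0)
    return row[:3] + ' class="source"' + row[3:]

result = SOURCE_ROW.sub(add_class, html)
print(result)
```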
If you are using a library like jQuery you do not even need to use a regex:
$('tr:contains("Source")').something...

Using a REGEX to replace words within a sub-match

I hope this isn't a repetition...
I need a regex to do what should be a fairly simple task. I have code for an HTML table, and I want to replace all <td> tags with <th> tags in the first row of the table, i.e. within the first set of <tr> </tr> tags. The table might look something like this:
<table cellpadding="5" cellspacing="0" border="1">
<tr>
<td>Capacity %</td>
<td>Tension V</td>
<td>Acid kg/l</td>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>...etc
and I want:
<table cellpadding="5" cellspacing="0" border="1">
<tr>
<th>Capacity %</th>
<th>Tension V</th>
<th>Acid kg/l</th>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>...etc
I've tried regexes similar to this:
/(<table>\n<tr>\n)(.+?)(</tr>)
...and then tried to rebuild the table row using back references, but I can't seem to apply the regex to the multiple
</?td>
matches that there might be.
I'm doing this in javascript, which means I can't use look-behinds (although if anyone has a look behind solution I'd be interested in seeing it anyway...).
Thanks in advance for any help.
You could do it if your regex engine supports indefinite repetition inside lookbehind assertions, for example in .NET (C#):
resultString = Regex.Replace(subjectString,
@"(?<= # Assert that we can match this before the current position:
<table # <table
(?: # followed by...
(?! # (unless there's an intervening
</table # </table
| # or
</tr # </tr)
) # (End of lookahead assertion)
. # any character
)* # any number of times
) # (End of lookbehind assertion)
<td # Then match <td",
"<th", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
works on your example. But even in .NET, I wouldn't use a regex for it; it's just too brittle. Better to manipulate the DOM directly; that's what it's there for.
You can't easily do this with a single regex in JavaScript. Since you've got a special condition ("only in the first <tr>"), you'll need to write some conditional logic along with the regex to make it work.
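One way to read "conditional logic along with the regex" is a two-pass replacement: an outer pattern isolates the first row, and an inner pattern rewrites the cell tags inside it. The sketch below is in Python rather than the asker's JavaScript (an assumption made for illustration; JavaScript's String.prototype.replace with a callback should allow the same shape), and the name cells_to_headers is invented.

```python
import re

html = """<table cellpadding="5" cellspacing="0" border="1">
<tr>
<td>Capacity %</td>
<td>Tension V</td>
<td>Acid kg/l</td>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>
</table>"""

def cells_to_headers(match):
    # Inner pass: rewrite <td ...> and </td> within the matched row only.
    return re.sub(r"<(/?)td\b", r"<\1th", match.group(0), flags=re.IGNORECASE)

# Outer pass: count=1 restricts the rewrite to the first <tr>...</tr>.
result = re.sub(
    r"<tr\b.*?</tr>",
    cells_to_headers,
    html,
    count=1,
    flags=re.IGNORECASE | re.DOTALL,
)
print(result)
```

No lookbehind is needed this way, at the cost of assuming that rows are not nested inside the first row.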