Regex not matching, need help figuring out why - regex

The regex is:
<td class="description">(?<lineItem>[\w/ -]+)<\/td>\s+<td class="descriptionFormat">(?:[\w\$%\.,()\/ ]*)</td>\s+<td class="amount">[ ]*(?<lineValue>-?\$(?:\d{1,3},?)+\.\d{2})</td>
Ignore case, multi line, and single line are enabled.
An example of a statement that matches is:
<td class="description">Fuel Cost Adjustment</td>
<td class="descriptionFormat">18,640 KWH at -$0.00044</td>
<td class="amount">$-1.36</td>
But this one does not match:
<td class="description">Fuel Cost Adjustment</td>
<td class="descriptionFormat">18,640 KWH at -$0.00044 (25/30 Days)</td>
<td class="amount">$-6.84</td>
I'd really appreciate if anyone could tell me what's wrong.
Thank you!

(<td class="description">)(?<lineItem>[\w/ -]+)<\/td>\s*<td class="descriptionFormat">(?:[\w\$%\-\.,\(\)\/ ]*)\s*</td>\s*<td class="amount">\s*(?<lineValue>-?\$(-|)(?:\d{1,3},?)+\.\d{2})</td>
should match, but you might consider reworking this a bit, since parsing HTML with regex is rather tricky unless you're 100% sure the input data won't change in any way.
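For a quick check, here is a minimal sketch in Python. Python's re module writes named groups as (?P<name>...) rather than (?<name>...), so the pattern is translated accordingly, and the empty-alternative group (-|) is written as -?. The names AMOUNT_ROW and extract_line are made up for this example.

```python
import re

# Python translation of the pattern above (AMOUNT_ROW is a hypothetical name).
AMOUNT_ROW = re.compile(
    r'<td class="description">(?P<lineItem>[\w/ -]+)</td>\s*'
    r'<td class="descriptionFormat">[\w$%\-.,()/ ]*\s*</td>\s*'
    r'<td class="amount">\s*(?P<lineValue>-?\$-?(?:\d{1,3},?)+\.\d{2})</td>',
    re.IGNORECASE,
)

def extract_line(html):
    """Return (lineItem, lineValue) for the first matching row, or None."""
    m = AMOUNT_ROW.search(html)
    return (m.group("lineItem"), m.group("lineValue")) if m else None

# The statement from the question that previously failed to match.
sample = """<td class="description">Fuel Cost Adjustment</td>
<td class="descriptionFormat">18,640 KWH at -$0.00044 (25/30 Days)</td>
<td class="amount">$-6.84</td>"""

print(extract_line(sample))
```

The two changes that matter are the hyphen added to the descriptionFormat character class and the optional minus sign allowed after the dollar sign.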

Related

regex challenge for htaccess 301 redirect

I have a number of similar Tag and Category URLs that I would like to redirect in one clean regex statement (if possible). Example URLs are:
table, th, td {
border: 1px solid black;
}
<table style="width:100%">
<tr>
<th>URL</th>
<th>Redirect To</th>
</tr>
<tr>
<td>/category/categoryname-2/</td>
<td>/blog/category/categoryname/</td>
</tr>
<tr>
<td>/category/categoryname-2/page/3/</td>
<td>/blog/category/categoryname/</td>
</tr>
<tr>
<td>/category/categoryname/</td>
<td>/blog/category/categoryname/</td>
</tr>
<tr>
<td>/category/categoryname/page/3/</td>
<td>/blog/category/categoryname/</td>
</tr>
<tr>
<td>/tag/tagname-2/</td>
<td>/blog/tag/tagname/</td>
</tr>
<tr>
<td>/tag/tagname-2/page/3/</td>
<td>/blog/tag/tagname/</td>
</tr>
<tr>
<td>/tag/tagname/</td>
<td>/blog/tag/tagname/</td>
</tr>
<tr>
<td>/tag/tagname/page/3/</td>
<td>/blog/tag/tagname/</td>
</tr>
<tr>
<td>/blog/page/5/</td>
<td>/blog/</td>
</tr>
</table>
Notice that some of the tag and category names have "-2" at the end of the name, which I would like removed.
I have had a good attempt at doing this but am not getting very far; the -2 piece is stumping me unfortunately, hence turning to the expertise here on SO.
I think I've got it. I think this covers all category and tag redirects. I've explained what I think it's doing, but I could be mistaken and may have arrived at the solution by accident as much as by design.
Also it's hard to fully test due to browser caching and CDN caching, but seems to work when incognito:
RedirectMatch 301 ^/(category|tag)/(([a-z]+)|([a-z]+-[a-z]+))(?:-2)?/(?:page/.*)?$ /blog/$1/$2/
(category|tag) matches the words category or tag. This returns $1 (group 1)
(([a-z]+)|([a-z]+-[a-z]+))(?:-2)?
Some categories have a - in the name; therefore I have to handle the -2 (which we want rid of) and genuine - (which we want to keep).
(([a-z]+)|([a-z]+-[a-z]+)) - This returns group $2. The | in the middle says either return what is before the | or what is after.
([a-z]+) - This finds the first unbroken text block, i.e. stops when it hits a number or punctuation (a - in this case).
([a-z]+-[a-z]+) - This finds the unbroken string, then a hyphen, then another unbroken string. This accounts for tag names with a hyphen. If there isn't a text block after the hyphen, this would return nothing...which we want to happen.
(?:-2)? This says there might be a -2. The question mark at the end makes the group optional, so it's fine if it isn't there. The ?: inside the brackets makes the group non-capturing, so the -2 is matched but not stored, and it stays out of $2.
(?:page/.*)? Looks for the word "page" and anything after it and, if found, matches it in a non-capturing ("ignore") group. The question mark at the end means this doesn't need to be there.
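Because caching makes this hard to test in a browser, the pattern can also be checked offline. This is a rough sketch using Python's re module rather than Apache itself, so it only approximates how RedirectMatch would behave; the function name redirect_target and the hyphenated path /tag/foo-bar-2/ are made up for the example.

```python
import re

# Same pattern as in the RedirectMatch line, compiled with Python's re.
PATTERN = re.compile(
    r'^/(category|tag)/(([a-z]+)|([a-z]+-[a-z]+))(?:-2)?/(?:page/.*)?$'
)

def redirect_target(path):
    """Return the /blog/... target for a path, or None if it is not matched."""
    m = PATTERN.match(path)
    if m is None:
        return None
    # Group 1 is "category" or "tag"; group 2 is the name without any "-2".
    return f"/blog/{m.group(1)}/{m.group(2)}/"

for path in (
    "/category/categoryname-2/",
    "/category/categoryname-2/page/3/",
    "/tag/tagname/page/3/",
    "/tag/foo-bar-2/",   # hypothetical name containing a genuine hyphen
    "/blog/page/5/",     # not a category or tag URL
):
    print(path, "->", redirect_target(path))
```

Note that /blog/page/5/ is not matched by this pattern, so that redirect would need a rule of its own.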

How to match only one string and not another

This is String 1:
<td class="AAA"><span class="BBB">Text1</span></td>
I want to remove the span so it looks like this:
<td class="BBB">Text1</td>
Which is easy enough with this regex:
Search: <td class="AAA"><span class="BBB">(.*)</span></td>
Replace: <td class="BBB">$1</td>
The problem: Sometimes the string looks like this (String 2):
<td class="AAA"><span class="BBB">Text1</span>-<span class="BBB">Text2</span></td>
which also matches because of the 2 closing tags. But I don't want it to be matched at all. How do I find only String 1?
Instead of matching any character in your matching group, match all characters aside from the open <:
Search: <td class="AAA"><span class="BBB">([^<]*)</span></td>
Replace: <td class="BBB">$1</td>
This is assuming your Text1 doesn't contain the < character.
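A small sketch of the same search and replace in Python, to show that String 1 is rewritten while String 2 is left alone. Python uses \1 instead of $1 in the replacement, and the function name unwrap_span is made up for the example.

```python
import re

# [^<]* cannot run past the first "<", so the match cannot span two spans.
PATTERN = re.compile(r'<td class="AAA"><span class="BBB">([^<]*)</span></td>')

def unwrap_span(html):
    """Replace the td/span pair with a single td, keeping the inner text."""
    return PATTERN.sub(r'<td class="BBB">\1</td>', html)

string1 = '<td class="AAA"><span class="BBB">Text1</span></td>'
string2 = ('<td class="AAA"><span class="BBB">Text1</span>-'
           '<span class="BBB">Text2</span></td>')

print(unwrap_span(string1))  # rewritten
print(unwrap_span(string2))  # unchanged
```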

How to create a regex to match everything inside and including <div>...</div>?

This is the sample text that I'm working with. I'm using Coda to do a find and replace...
<td width="20%"><div > Item #</div></td>
<td width="20%"><div > Pole Tip</div></td>
<td width="20%"><div > Length</div></td>
<td width="20%"><div > Test Weight (lbs.)</div></td>
<td width="20%"><div > Price</div></td>
I want to get rid of the div tags that markup the text inside the td.
Ex...I want to change this:
<td width="20%"><div > Item #</div></td>
to this:
<td width="20%">Item #</td>
So far I have this as a regex:
<div >[\s\w\(\)#]*</div>
However this matches all of the above in my sample text EXCEPT:
<td width="20%"><div > Test Weight (lbs.)</div></td>
In my regex, I even tried to add the ( and )...what am I doing wrong?
In Reply to Andy, I agree that Data Parsing of Well-Formed Markup should be kept to DOM Navigational tools. XML for sure, or HTML>XML Converters are good. I don't know what Miles is working with, but I frequently work with HTML that is so malformed that it can't be parsed by Markup parsers.
In some of my Regex tutorials on Document Parsing, I discuss the Regex Trim pattern, which is simply Zero or More Whitespace {\s*}. Though you might shy away from it because it adds a tiny bit of length to the Regex Pattern, there is virtually zero efficiency loss. That being said...
(<td[^>]*>)\s*<div[^>]*>\s*((?:[^<]*(?(?!</div>\s*</td>)<))*)\s*</div>\s*(</td>)
Replace this with $1$2$3 and you win, as well as get back a clean result. Of course, you can replace or remove as many Trims (\s*) as you like, just a personal preference if I am parsing Documents or Malformed Markup.
That's because you missed the . (period) in your character class. This works just fine:
<div >[\s\w\(\)#.]*</div>
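As a rough illustration in Python (not Coda's find and replace), the corrected character class can be combined with a capture group so the div is dropped and its text is kept. The capture group, the whitespace trimming, and the name strip_div are additions for this sketch, not part of the pattern above.

```python
import re

# Character class from the answer, including the period; the parentheses
# capture the text so it can be written back without the div tags.
DIV = re.compile(r'<div >\s*([\s\w()#.]*?)\s*</div>')

def strip_div(html):
    """Remove the <div > wrapper and keep the trimmed inner text."""
    return DIV.sub(r'\1', html)

rows = [
    '<td width="20%"><div > Item #</div></td>',
    '<td width="20%"><div > Test Weight (lbs.)</div></td>',
]
for row in rows:
    print(strip_div(row))
```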

Vb.net help me with regex please

I haven't worked with regex before, but I need to parse values from about 500 URLs and I need a regex to automate it.
Each site contains about 10 values, and I need to separate them into their own lists.
1.
<td width="78" style="padding-left:9px;" align="left"><a style="font-weight:bold;color:#E93393;" href="/meanings/Example1.html">Example1</a> </td>
2.
<td width="78" style="padding-left:9px;" align="left"><a style="font-weight:bold;color:#004EFF;" href="/meanings/Example2.html">Example2</a> </td>
So, I need to get those 2 values into separate lists. It should look at the color code to determine which list each value goes in.
Could somebody help me? :)
NO..NO..NO..
Regex doesn't work for parsing HTML files.
HTML is neither strict nor regular in its format.
Use HtmlAgilityPack.
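To illustrate the parser-based approach, here is a minimal sketch in Python using the standard library's html.parser as a stand-in; it is not HtmlAgilityPack and not VB.NET. It assumes the color always appears as a six-digit hex code in the link's style attribute, as in the two samples, and the class name LinkColorParser is made up for the example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: the color is written as "color:#RRGGBB" inside the style value.
COLOR_RE = re.compile(r'(?<![-\w])color:\s*(#[0-9A-Fa-f]{6})')

class LinkColorParser(HTMLParser):
    """Collect link texts into lists keyed by the link's color code."""

    def __init__(self):
        super().__init__()
        self.by_color = defaultdict(list)
        self._current_color = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            style = dict(attrs).get("style") or ""
            m = COLOR_RE.search(style)
            self._current_color = m.group(1).upper() if m else None

    def handle_data(self, data):
        # Only text inside a colored link is recorded.
        if self._current_color and data.strip():
            self.by_color[self._current_color].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_color = None

html = (
    '<td width="78" style="padding-left:9px;" align="left">'
    '<a style="font-weight:bold;color:#E93393;" href="/meanings/Example1.html">Example1</a> </td>'
    '<td width="78" style="padding-left:9px;" align="left">'
    '<a style="font-weight:bold;color:#004EFF;" href="/meanings/Example2.html">Example2</a> </td>'
)

parser = LinkColorParser()
parser.feed(html)
print(dict(parser.by_color))
```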

Regex: Match a <tr> that contains a string

I am trying to match all <tr> elements that contain the word "Source" when the other attributes (colspan/width/height, contained <td>s and their attributes, etc.) are unknown. (I know this can be done with a javascript/jQuery selector, but I am just processing the HTML for a non-javascript context.)
Example of target:
<tr>
<td>Don't affect this</td>
</tr>
<tr>
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
(This is what I want to change it to:)
<tr>
<td>Don't affect this</td>
</tr>
<tr class="source">
<td colspan="3" width="288" height="57"><strong>Sources:</strong> Author</td>
</tr>
Here are regex patterns I have tried that haven't worked:
/<tr>((?:.*?)Source(?:s?):(?:.*?))<\/tr>/gmi,
No matches.
/<tr>((?:[\s\S]*?)Source(?:s?):(?:[\s\S]*?))<\/tr>/gmi,
Matches the first tr, but not the second.
I think there's a regex principle I may be failing to grasp here, about greediness or something related. Any suggestions?
/<tr[^>]*>(?:(?!<|source)[\s\S])*(?:<(?!\/?tr)[^>]*>(?:(?!<|source)[\s\S])*)*source[\s\S]*?<\/tr>/i
Are you sure you can't use jQuery for this? :P But seriously, this will be easier to grasp if I put it in terms of Friedl's "unrolled loop" idiom:
opening normal ( special normal * ) * closing
opening: <tr[^>]*> - the opening <tr> tag
normal: (?:(?!<|source)[\s\S])* - zero or more of any characters, with the lookahead to make sure each time that the character is not the beginning of a tag or the word "source"
special: <(?!\/?tr)[^>]*> - any tag except another opening <tr> or a closing </tr>. By consuming a complete tag, we avoid false positives on the word "source" in the name or value of an attribute.
closing: source - The only other thing it could possibly encounter here is a <tr> or </tr> tag, which would indicate a failed match for our purposes. Finding "source" before one of those tags is how we know we've found a match. (The rest of the regex, [\s\S]*?<\/tr>, merely consumes the remainder of the tag so you can retrieve it via group[0].)
A <tr> there isn't necessarily invalid, of course; it could be the beginning of a nested TR element, presumably within a nested TABLE element. If that TR contains the word "source", the regex will match it on a separate match attempt. It will match only the innermost, complete TR tag with the word "source" in it.
As usual when using regexes on HTML, I'm making several simplifying assumptions involving well-formedness, SGML comments, CDATA sections, etc., etc. Caveat emptor.
If you are using a library like jQuery you do not even need to use a regex:
$('tr:contains("Source")').something...