Regex won't find a match

I am trying to pull some info out of a page. Here is my regex:
<tr>
<td>([^<]+)<i><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/i><sup id="([^<]+)" class="([^<]+)"><a href="([^<]+)"><span>[<\/span>1<span>]<\/span><\/a><\/sup><\/td>
<td><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/td>
<td><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/td>
<td>([^<]+)<\/td>
<td>([^<]+)<\/td>
</tr>
Here is the sample HTML:
<tr>
<td><i>3Xtreme</i><sup id="cite_ref-18" class="reference"><span>[</span>18<span>]</span></sup></td>
<td>989 Studios</td>
<td>989 Studios</td>
<td>1999-03-31<sup>NA</sup></td>
<td>NA</td>
</tr>
For now I just want the regex to find matches. Can you see any reason why it would not match this?
For all the haters:
I don't care about your opinions on whether I should use regex on HTML or not. For this case it will work great. I have one page, and the data I need is in a table. Once I can get the data, I will save it to my DB and never have to use the regex again. So if your comment or answer is about your opinion on using regex with HTML, don't post.

The second line of your regex:
<td>([^<]+)<i>
cannot match:
<td><i>
because '+' is equivalent to '{1,}' (one or more characters), and there is nothing between those two tags in your sample. I didn't check the rest of your regex, but as written it can't work.
Edit:
Please also correct the "([^<]+)" groups inside quoted attributes, and so on: [^<]+ can run past the closing quote, so "([^"]+)" is the safer choice there. And edit your regex when you correct it.
Edit 2:
Seeing as it's quite a disaster (sorry, but it's the truth :/): please consider replacing all the ([^<]+) groups, which won't work for all your cases, with a simple (.*?)
Edit 3:
[ and ] must be escaped as \[ and \]. (\d+ will help you catch numbers.)

<span>[<\/span>1<span>]<\/span>
There are several problems here: the brackets must be escaped (\[ and \]), and a literal 1 obviously won't match 18; use \d+ instead.
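To put the corrections from these answers together, here is a minimal sketch in Python (the question does not say which regex engine is used, so Python's re module is an assumption). It applies escaped brackets, \d+ for the citation number, [^"]+ inside attribute values, lazy (.*?) groups for cell contents and \s* between tags. The names sample, row_pattern and fields are made up for illustration, and the pattern is only checked against the one sample row from the question.

```python
import re

# Sample row copied from the question.
sample = """<tr>
<td><i>3Xtreme</i><sup id="cite_ref-18" class="reference"><span>[</span>18<span>]</span></sup></td>
<td>989 Studios</td>
<td>989 Studios</td>
<td>1999-03-31<sup>NA</sup></td>
<td>NA</td>
</tr>"""

# Hypothetical pattern built from the advice above:
# - \[ and \] are escaped, \d+ matches any citation number
# - [^"]+ is used inside quoted attribute values
# - (.*?) lazily captures cell contents, \s* absorbs line breaks between tags
row_pattern = re.compile(
    r'<tr>\s*'
    r'<td><i>(.*?)</i>'
    r'<sup id="[^"]+" class="[^"]+">'
    r'<span>\[</span>(\d+)<span>\]</span></sup></td>\s*'
    r'<td>(.*?)</td>\s*'
    r'<td>(.*?)</td>\s*'
    r'<td>(.*?)</td>\s*'
    r'<td>(.*?)</td>\s*'
    r'</tr>'
)

match = row_pattern.search(sample)
fields = match.groups() if match else ()
print(fields)
```

Note that the fourth cell still contains its nested <sup> markup, since (.*?) captures everything up to the closing </td>.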

Related

Unable to accurately search a particular text in a html tag using Python

I have the regex below to identify text in an HTML tag, but it doesn't yield the expected result.
HTML Tag:
<td>Issue Amount</td>
<td>:</td>
<td>20,000,000.00</td>
Find = re.findall(?<=Issue Amount</td> <td>:</td> <td>) [0-9,]),soup_string)[0]
I need to get the numerical value 20,000,000.00 from this tag.
Any advice on what I am doing wrong here? I tried a couple of other ways, but with no success.
Do not under any circumstances try to parse XML with a regex unless you wish to invoke rite 666 Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.
Use an HTML parsing library; see this page for some ways to do it.
However, in your case the regex fails because it looks for a single space between your </td> and <td> tags, whereas your data has line breaks there. You can use the \s metacharacter (for example \s*) to match any whitespace character, including newlines.
Below is the regex piece that helped me get the desired output. Thanks all for your inputs.
(?<=Issue Amount[td\W]{21})([\d,.]+)
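For comparison, here is a minimal sketch of the \s approach suggested above. Python's re module only allows fixed-width lookbehind, so \s* cannot go inside (?<=...); this sketch uses a capturing group instead. The soup_string value is an assumption that mirrors the three lines of HTML in the question.

```python
import re

# Assumed input: the three table cells from the question, separated by newlines.
soup_string = "<td>Issue Amount</td>\n<td>:</td>\n<td>20,000,000.00</td>"

# \s* matches any run of whitespace (spaces, tabs, newlines) between the tags,
# and the capturing group replaces the lookbehind.
pattern = re.compile(r"Issue Amount</td>\s*<td>:</td>\s*<td>([\d,.]+)</td>")

m = pattern.search(soup_string)
amount = m.group(1) if m else None
print(amount)
```

Unlike the fixed {21} count, this does not depend on the exact number of characters between the label and the value.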

Regex Multiple Lines AHK

I am trying to work out a RegExMatch that captures a string between two pieces of text. I came up with a working solution:
RegExMatch(clipboard,"STUFF<p class=""figure"">(.*)</p></div><",match)
which captures 2140 from
<div><div>STUFF<p class="figure">2140</p></div></div>
But this is all on a single line, and I have no idea how to apply it to markup that spans multiple lines, such as
<tr>
<td>Qty</td>
<td>:</td>
<td>
310
</td>
</tr>
</table>
I would like to get 310. What should my regex be? I couldn't figure it out; I tried \s\s, but to no avail. Please help.
EDIT
I seem to be getting the hang of \s* now. I built the pattern up slowly to see where it stops matching, and with
<td>Qty</td>\s*<td>(.*?)</td>
it gave me :
But I couldn't get it past the <td>:</td> part; it always returns blank. I am wondering whether my (.*?) should be (.\s*?) instead?
I tried
RegExMatch(clipboard,"<td>Qty</td>\s*<td>(.\s*?)</td>\s*</tr>",match)
but to no avail.
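The basic idea is to add \s* wherever you think whitespace might occur. It matches zero or more whitespace characters (spaces, tabs, newlines, etc.). Note that . does not match a newline by default, so put \s* on both sides of the capture group as well. For example,
\s*<tr>\s*<td>Qty</td>\s*<td>:</td>\s*<td>\s*(.*?)\s*</td>\s*</tr>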
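The same idea can be tried outside AutoHotkey. The sketch below uses Python's re module rather than RegExMatch (an assumption made only for illustration), with the multi-line sample from the question. The \s* on either side of the capture group is what lets the value sit on its own line.

```python
import re

# Multi-line sample copied from the question.
html = """<tr>
<td>Qty</td>
<td>:</td>
<td>
310
</td>
</tr>
</table>"""

# \s* between every pair of tags, and around the captured value,
# absorbs the line breaks; (.*?) then captures only the value itself.
pattern = re.compile(
    r"<td>Qty</td>\s*<td>:</td>\s*<td>\s*(.*?)\s*</td>\s*</tr>"
)

m = pattern.search(html)
qty = m.group(1) if m else None
print(qty)
```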

how to separate two regexp (for taking text from brackets in commented area)?

I have an HTML page that looks like this:
<span>Some text</span>
<p>And again</p>
<table>
<thead>
<tr>
<th>Text</th>
<th>Text [some text]</th>
<th>Text</th>
</tr>
</thead>
<tbody>
<!--[content-->
<tr>
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
</tr>
<!--content]-->
</tbody>
</table>
<span>here is [the text]</span>
I need to extract the text in square brackets, but only inside the commented region. I have two regexes, and each works fine on its own:
/[^[\]]+(?=])/g - this is for text in brackets;
(?=<!--\[content)([\s\S]*?content]-->) - for the commented region.
But I can't combine them. I tried (?=<!--\[content)([^[\]]+(?=]))([\s\S]*?content]-->) but it doesn't work. I don't know much about regex; how can I combine them?
UPD: as output I need only the bracketed text between the comment markers (this, oops, world).
First, we might start with a simple one:
(?<=\[)[^\]\[]*(?=\])(?=[\s\S]*?<!--content\]-->)
Explanation
(?<=\[)[^\]\[]*(?=\]) matches the text inside any pair of square brackets,
(?=[\s\S]*?<!--content\]-->) looks ahead for any string that is followed by a closing content tag.
That sounds sensible, right? But check DEMO1: it doesn't work. So the question is, why?
The problem in the regex above lies in the lookahead assertion, which I described earlier as:
(?=[\s\S]*?<!--content\]-->) looks ahead for any string that is followed by a closing content tag.
This is WRONG. It should be:
(?=[\s\S]*?<!--content\]-->) looks ahead for any string, possibly containing opening or closing content tags, that is followed by a closing content tag.
So the issue is that [\s\S]*? can skip over content tags: bracketed text that sits before an opening tag (such as [some text] in the table header) still finds a closing tag further on.
Workaround
To prevent this, we can pair every character consumed by [\s\S]* with a negative lookahead for the opening content tag. Thus we get:
(?<=\[)[^\]\[]*(?=\])(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)
Notice that
[\s\S]*
is just modified to
(?:(?!<!--\[content-->)[\s\S])*?
which means (?!<!--\[content-->) is placed in front of every character consumed by [\s\S]*. For example, if [\s\S]* consumes ABCDEF..., the negative lookahead is applied like this:
(?!<!--\[content-->)A(?!<!--\[content-->)B(?!<!--\[content-->)C(?!<!--\[content-->)D(?!<!--\[content-->)E(?!<!--\[content-->)F...
Finally, please check DEMO2. See? It just works!
DISCLAIMER: My regex is only expected to work on simple examples like the one provided in the question. For more complex input, such as recursive structures, I cannot guarantee it.
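As a quick check, both patterns can be run against the page from the question. This is only a sketch: the question's /.../g syntax suggests JavaScript, and Python's re module is used here as an assumption, which works because the lookbehind (?<=\[) is fixed-width. The names naive, tempered, naive_hits and tempered_hits are made up for illustration.

```python
import re

# Page copied from the question.
html = """<span>Some text</span>
<p>And again</p>
<table>
<thead>
<tr>
<th>Text</th>
<th>Text [some text]</th>
<th>Text</th>
</tr>
</thead>
<tbody>
<!--[content-->
<tr>
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
</tr>
<!--content]-->
</tbody>
</table>
<span>here is [the text]</span>"""

# First attempt: the lookahead may skip over the opening content tag.
naive = re.compile(r"(?<=\[)[^\]\[]*(?=\])(?=[\s\S]*?<!--content\]-->)")

# Workaround: every consumed character is guarded by a negative lookahead
# for the opening tag, so the lookahead cannot cross into the block from outside.
tempered = re.compile(
    r"(?<=\[)[^\]\[]*(?=\])"
    r"(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)"
)

naive_hits = naive.findall(html)
tempered_hits = tempered.findall(html)
print(naive_hits)
print(tempered_hits)
```

The first pattern also picks up the bracketed text in the table header, while the second keeps only the three values between the comment markers.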

How to exclude an HTML tag in regex search

I'm using powerGrep to find instances of these:
BLOCK 1:
<td class="danish">kat
<?php audioButton("../../audio/words/dog","dog");?></td>
But there are also instances of these among my files:
BLOCK 2
<td class="danish">kat</td>
<td><?php audioButton("../../audio/words/dog","dog");?></td>
I want to find BLOCK 1 only, not BLOCK 2.
I've tried using (?!), as in
.*<td class(.*?)>(.*)(?!</td>)
.*<\?php audioButton\("(.*)/.*",".*"\);.*\?></td>
Can I somehow exclude the </td> tag from the search, so that BLOCK 2 is ignored?
I am not sure why you want to match only the first kind of cell, but to do this you can try:
<td[^>]*>\s*(?:.(?!</td>))+\s*<\?php audioButton\("[^"]*/(.+?)","(.*?)"\);.*?\?>.*?</td> with the g flag.
See it working here: https://regex101.com/r/tI4rL4/1
[EDIT]
You can also see the replacement of dog with cat here: https://regex101.com/r/tI4rL4/2
It may not be perfect, since the question is a bit vague, but it works. If you need to refine or adjust it, or need some explanation, just ask!
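For anyone who wants to check the behaviour outside PowerGREP or regex101, here is a small sketch using Python's re module (an assumption, since the engines differ in details). The pattern is the one from the answer; block1 and block2 are the two samples from the question, and hit and miss are names made up for illustration.

```python
import re

# BLOCK 1: the word and the PHP call share one cell.
block1 = (
    '<td class="danish">kat\n'
    '<?php audioButton("../../audio/words/dog","dog");?></td>'
)

# BLOCK 2: the word's cell is closed before the PHP call starts.
block2 = (
    '<td class="danish">kat</td>\n'
    '<td><?php audioButton("../../audio/words/dog","dog");?></td>'
)

# (?:.(?!</td>))+ consumes characters only while the next thing is not </td>,
# so the match cannot run through a cell that closes before the PHP call.
pattern = re.compile(
    r'<td[^>]*>\s*(?:.(?!</td>))+\s*'
    r'<\?php audioButton\("[^"]*/(.+?)","(.*?)"\);.*?\?>.*?</td>'
)

hit = pattern.search(block1)
miss = pattern.search(block2)
print(hit.groups() if hit else None)
print(miss)
```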

ASP.net REGEX Question: Find specific match, then skip everything to end tag

strRegex = New StringBuilder
strRegex.Append("<td class=""[\s\w\W]*?"">(?<strTKOWins>[^<]+)[\s]*?<span class='[\s\w\W]*?'>(T)KOs[\s\w\W]*?</span>[\s\S]*</td>")
Regex = New System.Text.RegularExpressions.Regex(strRegex.ToString, RegexOptions.None)
Matches = Regex.Match(results, strRegex.ToString)
This is my code. I want to match:
[? what ? Please insert here what you want to match]
The problem is that after the end of the SPAN tag, I want to skip everything inside the Table Cell and skip all the way to the end tag </td>
How can I do that?
I have no idea what you are trying to do, but this regex will find a table cell with a span inside it and then run to the cell's closing tag. Fill in the specifics you need and change it however you need to.
For example,
text:
<td class="td class"> anything at all in here?! <span class="span class">span text</span>text in the tablecell?</td>
regex:
<td\s+class=".*?">.*?<span\s+class=".*?">.*?</span>.*?</td>
I have no idea what the "strTKOWins" part is for, or whether you want to find specific content inside your span:
(T)KOs[\s\w\W]*?
I guess I can't really help until you respond anyway.
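A rough sketch of how that pattern behaves, written in Python rather than VB.NET (an assumption made for illustration only). The named groups before and span_text are hypothetical additions that play the role of the question's strTKOWins group, and re.DOTALL is added so that .*? can also cross line breaks inside the cell.

```python
import re

# Example text from the answer above.
text = (
    '<td class="td class"> anything at all in here?! '
    '<span class="span class">span text</span>text in the tablecell?</td>'
)

# The trailing .*?</td> skips whatever follows the span, up to the cell's closing tag.
cell_pattern = re.compile(
    r'<td\s+class=".*?">(?P<before>.*?)'
    r'<span\s+class=".*?">(?P<span_text>.*?)</span>'
    r'.*?</td>',
    re.DOTALL,
)

m = cell_pattern.search(text)
before = m.group("before").strip() if m else None
span_text = m.group("span_text") if m else None
print(before)
print(span_text)
```

Python spells named groups (?P<name>...), whereas .NET uses (?<name>...) as in the question's code.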