xpath+ regex: matches text - regex

I'm trying to write an XPath expression that returns only nodes whose text consists of numbers alone.
I wanted to use a regex and was hoping this would work:
td[matches(text(),'[\d.]')]
Can anyone please help me understand what I am doing wrong here?
<tr>
<td>1</td>
<td>10</td>
<td>a</td>
</tr>

It seems that you are missing a quantifier: [\d.] describes only a single character. To describe a whole run of digits such as 10 you need something like +, so try your regex like:
td[matches(text(),'\d+')]
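Also, the . inside the character class matches a literal dot, so the pattern accepts non-digit characters; leave it out. Note that matches() succeeds when any part of the text matches, so anchor the pattern as ^\d+$ if the cell must contain digits only.
You can test your regex queries on regex101.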

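AFAIK, Selenium supports only XPath 1.0 so far, so matches() is not available.
You can try the expression below instead:
//td[number(.) >= 0 or number(.) < 0]
This matches table cells whose content is numeric: number(.) returns NaN for non-numeric text, and NaN fails both comparisons. Note that it accepts decimals such as 1.5 as well as integers.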

Replace:
td[matches(text(),'[\d+]')]
with:
td[matches(text(),'\d+')]
Note: regex functions such as matches() are available only from XPath 2.0 onwards.
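Where only XPath 1.0 is available, one option is to select the cells with a plain path and do the regex test in the host language. The following is a minimal sketch in Python using only the standard library; the variable names are illustrative and the markup is the sample row from the question.

```python
import re
import xml.etree.ElementTree as ET

# Sample row from the question.
row = ET.fromstring("<tr><td>1</td><td>10</td><td>a</td></tr>")

# Keep only cells whose entire text is one or more digits.
# re.fullmatch anchors at both ends, so "a1" or "1.5" would be rejected.
numeric_cells = [
    td.text
    for td in row.iter("td")
    if td.text and re.fullmatch(r"\d+", td.text)
]

print(numeric_cells)  # ['1', '10']
```

The same filter can be applied to elements returned by any XPath 1.0 engine; only the selection step changes.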

Related

python Non greedy regular expression searching too many data

String: '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'
I want to find only the first "td" tag that contains the text "str2", so I tried two different non-greedy expressions:
>>> import re
>>> mystring = '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'
>>> print(re.search("(<td.*?str2.*?</td>)", mystring).group(1))
<td attr="0">str1</td><td attr="5">str2</td>
>>> print(re.search(".*(<td.*?str2.*?</td>).*", mystring).group(1))
<td attr="7">str2</td>
I expected the output to be '<td attr="5">str2</td>' because I used non-greedy quantifiers. What is wrong here, and how can I get the expected result?
Note: I cannot use an HTML parser because my actual data set is not well-formed enough for XML parsing.
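A non-greedy quantifier only keeps the match as short as possible from where it starts; the search still begins at the leftmost <td, and .*? is free to run across tag boundaries until it finds str2. Use [^>] instead of . so that the first part cannot leave the opening tag:
>>> print(re.search("(<td[^>]*?>str2.*?</td>)", mystring).group(1))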
<td attr="5">str2</td>
Or, better, use HTMLParser.
EDIT: This regex will match even sub-tags:
(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)
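To see the difference side by side, here is a small Python 3 sketch that runs the original lazy pattern and the [^>] version against the string from the question. The variable names lazy and fixed are only for illustration.

```python
import re

mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')

# Lazy dot: the match still starts at the first <td and crosses into the next tag.
lazy = re.search(r"<td.*?str2.*?</td>", mystring).group(0)

# [^>]* cannot cross the closing > of the opening tag, so the first cell is skipped.
fixed = re.search(r"<td[^>]*>str2.*?</td>", mystring).group(0)

print(lazy)   # <td attr="0">str1</td><td attr="5">str2</td>
print(fixed)  # <td attr="5">str2</td>
```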

Regular expression is correct, but doesn't work in Notepad++

I would like to drop a table cell from all of our XSL templates.
The code is the following:
<td width="100"><img src="/logos/code.png" border="0" width="100"/></td>
The code.png is different in every file. My regex is the following:
\<td.*\>\<img.*\/logos\/.*png.*\/\>\<\/td\>
I tested the expression on https://regex101.com/ and it matches to the above string, but when I try to find & replace with Notepad++, it gives me no match.
My xsl is all in one line, so line break cannot be the problem. Can someone help me, and give me a pattern that works in NP++?
You must not escape < and >: in Notepad++, \< and \> are word-boundary anchors (start and end of a word), not literal angle brackets.
Here is your regex: <td.*?><img.*?\/logos\/.*?png.*?\/><\/td>
I also added ? to your .* so that they are not greedy.
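One caveat when the whole file is on a single line: a lazy .*? still starts at the first <td it finds, so a neighbouring cell can be swallowed. The sketch below illustrates this in Python's re module rather than in Notepad++ itself; the sample line and the [^>]* variation are assumptions added for illustration, not part of the original answer.

```python
import re

# Hypothetical one-line template fragment with a neighbouring cell.
line = ('<tr><td>Name</td><td width="100">'
        '<img src="/logos/code.png" border="0" width="100"/></td></tr>')

# Pattern from the answer: unescaped angle brackets, lazy quantifiers.
lazy = re.compile(r'<td.*?><img.*?/logos/.*?png.*?/></td>')

# Variation: [^>]* cannot leave the tag it started in.
bounded = re.compile(r'<td[^>]*><img[^>]*/logos/[^>]*png[^>]*/></td>')

lazy_result = lazy.sub('', line)
bounded_result = bounded.sub('', line)

print(lazy_result)     # <tr></tr>
print(bounded_result)  # <tr><td>Name</td></tr>
```

If the logo cell is always the first cell on the line, both patterns behave the same; the bounded form is simply the safer choice when other cells precede it.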

how to separate two regexp (for taking text from brackets in commented area)?

I have some html page, it looks like:
<span>Some text</span>
<p>And again</p>
<table>
<thead>
<tr>
<th>Text</th>
<th>Text [some text]</th>
<th>Text</th>
</tr>
</thead>
<tbody>
<!--[content-->
<tr>
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
</tr>
<!--content]-->
</tbody>
</table>
<span>here is [the text]</span>
I need to extract the text in square brackets, but only inside the commented region. I have two regular expressions, and each works fine on its own:
/[^[\]]+(?=])/g - this is for the text in brackets;
(?=<!--\[content)([\s\S]*?content]-->) - this is for the commented region.
But I can't combine them. I tried (?=<!--\[content)([^[\]]+(?=]))([\s\S]*?content]-->), but it doesn't work. I don't know much about regular expressions; how can I combine them?
UPD: the output should be only the bracketed text between the comment markers (this, oops, world).
First, let's start with a simple attempt:
(?<=\[)[^\]\[]*(?=\])(?=[\s\S]*?<!--content\]-->)
Explanation
(?<=\[)[^\]\[]*(?=\]) matches the text inside any pair of square brackets,
(?=[\s\S]*?<!--content\]-->) looks ahead for a closing content tag somewhere after it.
That sounds reasonable, but it doesn't work: bracketed text before the commented region is matched too. Why?
The problem is in the lookahead assertion. I described it above as:
(?=[\s\S]*?<!--content\]-->) looks ahead for a closing content tag somewhere after it.
That description is incomplete. What it really does is:
(?=[\s\S]*?<!--content\]-->) looks ahead across any text, including an opening content tag, until it reaches a closing content tag.
So the issue is that [\s\S]*? can run straight past an opening content tag, which means text that sits before the commented region also passes the test.
Workaround
To prevent this, we pair every character consumed by [\s\S]*? with a negative lookahead for the opening content tag. Thus, we get:
(?<=\[)[^\]\[]*(?=\])(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)
Notice that
[\s\S]*?
has been changed to
(?:(?!<!--\[content-->)[\s\S])*?
which means (?!<!--\[content-->) is checked in front of every character that [\s\S]*? consumes. For example, if [\s\S]*? consumes ABCDEF..., the negative lookahead is applied like this:
(?!<!--\[content-->)A(?!<!--\[content-->)B(?!<!--\[content-->)C(?!<!--\[content-->)D(?!<!--\[content-->)E(?!<!--\[content-->)F...
With this change, only the bracketed text inside the commented region is matched.
DISCLAIMER: This regex is only known to work on simple input like the example in the question. For more complex input, such as nested or recursive structures, I cannot guarantee it.
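The two patterns can be compared with a short Python sketch. The input below is a trimmed version of the page from the question, and the names simple and tempered are only labels chosen here; Python's re module is used as a stand-in for whatever engine the asker has.

```python
import re

html = """<th>Text [some text]</th>
<!--[content-->
<td>again some txt but with [this]</td>
<td>in this td the same situation [oops]</td>
<td>hello [world]</td>
<!--content]-->
<span>here is [the text]</span>"""

# First attempt: any bracketed text that has a closing tag somewhere after it.
simple = re.findall(
    r"(?<=\[)[^\]\[]*(?=\])(?=[\s\S]*?<!--content\]-->)", html)

# Workaround: the lookahead may not step over an opening content tag.
tempered = re.findall(
    r"(?<=\[)[^\]\[]*(?=\])"
    r"(?=(?:(?!<!--\[content-->)[\s\S])*?<!--content\]-->)", html)

print(simple)    # ['some text', 'this', 'oops', 'world']
print(tempered)  # ['this', 'oops', 'world']
```

In both cases [the text] is rejected because no closing tag follows it; only the second pattern also rejects [some text].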

Select URL in HTML table with regular expression

I have a table with names and URLs like this:
<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>
I want to select all URL-tabledata in a table.
I tried:
<td>w{3,3}.*(</td>){1,1}
But this expression doesn't "stop" at the first </td>. I get:
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td>
as result. Where is my mistake?
There are several ways to match a URL. I'll take the simplest route for your needs and just correct your regex. You can use this one instead:
<td>w{3}.*?</td>
Explanation:
<td> # this part is ok
w{3,3} # the notation {3} is simpler for this case and has the same effect
.* # the main problem: you have to use .*? to make .* non-greedy, that is, to make it match as little as possible
(</td>){1,1} # same as the second line; as the number is 1, {1} is not needed
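The effect of the single ? can be checked with a few lines of Python. This is only a sketch: re.DOTALL is an assumption used here to mimic a tool in which . also matches line breaks, as the output in the question suggests.

```python
import re

table = """<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>"""

# Greedy: .* runs to the last </td> in the input, giving one oversized match.
greedy = re.findall(r"<td>w{3}.*</td>", table, flags=re.DOTALL)

# Lazy: .*? stops at the first </td> after each www cell.
lazy = re.findall(r"<td>w{3}.*?</td>", table, flags=re.DOTALL)

print(len(greedy))  # 1
print(lazy)         # ['<td>www.url.com</td>', '<td>www.url2.com</td>']
```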
Your regex can be
\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]
or
"((((ht{2}ps?://)?)((w{3}\\.)?))?)[^.&&[a-zA-Z0-9]][a-zA-Z0-9.-]+[^.&&[a-zA-Z0-9]](\\.[a-zA-Z]{2,3})"
See also the question "What is the best regular expression to check if a string is a valid URL?", which has many answers.

regex wont find a match

I am trying to pull some info. Here is my regex:
<tr>
<td>([^<]+)<i><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/i><sup id="([^<]+)" class="([^<]+)"><a href="([^<]+)"><span>[<\/span>1<span>]<\/span><\/a><\/sup><\/td>
<td><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/td>
<td><a href="([^<]+)" title="([^<]+)">([^<]+)<\/a><\/td>
<td>([^<]+)<\/td>
<td>([^<]+)<\/td>
</tr>
here is sample html
<tr>
<td><i>3Xtreme</i><sup id="cite_ref-18" class="reference"><span>[</span>18<span>]</span></sup></td>
<td>989 Studios</td>
<td>989 Studios</td>
<td>1999-03-31<sup>NA</sup></td>
<td>NA</td>
</tr>
As of now I just want to get the data to find matches. Can you see any reason why it would not match this?
For all the haters...
I don't care about your opinions on whether I should use regex on HTML or not. For this case it will work great. I have one page, and the data I need is in a table. Once I can get the data I will save it to my DB and never have to use the regex again. So if your comment or answer is about your opinion on using regex with HTML, don't post.
...Second line:
<td>([^<]+)<i>
cannot hope to match:
<td><i>
because '+' is equivalent to '{1,}' and requires at least one character, while there is nothing between those tags in your sample. I didn't check the rest of your regex, but as it stands it can't work.
Edit:
Please also correct the "([^<]+)" and so on (I hope you see why)... And edit your regex when you correct it.
Edit 2:
Seeing as it's quite a disaster (sorry but it's the truth :/): please consider replacing all your ([^<]+) things that won't work for all your cases by a simple (.*?)
Edit 3:
[ and ] must be escaped. (\d will help you catch numbers)
<span>[<\/span>1<span>]<\/span>
Lots of problems here: you must escape the brackets and obviously 1 won't match 18
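Putting those points together, a pattern shaped to the sample row might look like the sketch below. It is written for Python's re module and matches only the structure shown in the sample HTML (no links inside the cells); the group layout and variable names are assumptions, and a real page will likely need more variations.

```python
import re

sample = """<tr>
<td><i>3Xtreme</i><sup id="cite_ref-18" class="reference"><span>[</span>18<span>]</span></sup></td>
<td>989 Studios</td>
<td>989 Studios</td>
<td>1999-03-31<sup>NA</sup></td>
<td>NA</td>
</tr>"""

row_pattern = re.compile(
    r"<tr>\s*"
    r"<td><i>([^<]+)</i>"                                   # title
    r"<sup[^>]*><span>\[</span>(\d+)<span>\]</span></sup></td>\s*"  # escaped [ ] and \d+
    r"<td>([^<]+)</td>\s*"                                  # developer
    r"<td>([^<]+)</td>\s*"                                  # publisher
    r"<td>([^<]+)<sup>[^<]*</sup></td>\s*"                  # date plus region marker
    r"<td>([^<]+)</td>\s*"                                  # last column
    r"</tr>"
)

match = row_pattern.search(sample)
fields = match.groups() if match else None
print(fields)
# ('3Xtreme', '18', '989 Studios', '989 Studios', '1999-03-31', 'NA')
```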