Python: non-greedy regular expression matches too much data

String: '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'
I want to find only the first "td" tag that contains the text "str2", so I tried two different non-greedy expressions:
>>> mystring = '<td attr="0">str1</td><td attr="5">str2</td><td attr="7">str2</td><td attr="9">str4</td>'
>>> print re.search("(<td.*?str2.*?</td>)",mystring).group(1)
<td attr="0">str1</td><td attr="5">str2</td>
>>> print re.search(".*(<td.*?str2.*?</td>).*",mystring).group(1)
<td attr="7">str2</td>
Here I was expecting the output <td attr="5">str2</td>, because I used a non-greedy expression. What is wrong here, and how can I get the expected result?
Note: I cannot use an HTML parser because my actual data set is not well-formed enough for XML parsing.

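A non-greedy quantifier only shortens where the match ends; the search still begins at the first <td. Use [^>] instead of . so the match cannot run past the opening tag: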
>>> print re.search("(<td[^>]*?>str2.*?</td>)",mystring).group(1)
<td attr="5">str2</td>
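The three behaviours can be compared side by side with a small Python 3 sketch. The variable names (lazy, greedy_prefix, fixed) are only illustrative; the patterns are the ones from the question and from the answer above.

```python
import re

mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')

# The engine starts at the leftmost "<td" and lazily expands until "str2"
# appears, so the match swallows the first cell as well.
lazy = re.search(r'(<td.*?str2.*?</td>)', mystring).group(1)

# The leading greedy ".*" consumes as much as it can, so the group ends up
# on the LAST cell that contains "str2".
greedy_prefix = re.search(r'.*(<td.*?str2.*?</td>).*', mystring).group(1)

# "[^>]" cannot cross the ">" that closes the opening tag, so "str2" must
# follow that tag immediately and the match starts at the right cell.
fixed = re.search(r'(<td[^>]*?>str2.*?</td>)', mystring).group(1)

print(lazy)
print(greedy_prefix)
print(fixed)
```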
Or, better, use HTMLParser.
EDIT: This regex also matches when the cell contains sub-tags before the text:
(<td[^<]*?(?:<(?!td)[^<]*?)*str2.*?</td>)
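For the HTMLParser suggestion above, a minimal sketch with Python 3's html.parser module could look like the following. The class FirstCellFinder and its attributes are hypothetical names made up for this illustration, and the sketch assumes the cell text is not split across nested tags.

```python
from html.parser import HTMLParser

mystring = ('<td attr="0">str1</td><td attr="5">str2</td>'
            '<td attr="7">str2</td><td attr="9">str4</td>')


class FirstCellFinder(HTMLParser):
    """Hypothetical helper: keeps the attributes of the first <td> whose text equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.current_attrs = None  # attributes of the <td> currently open, if any
        self.found = None          # attributes of the first matching <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.current_attrs = dict(attrs)

    def handle_endtag(self, tag):
        if tag == "td":
            self.current_attrs = None

    def handle_data(self, data):
        inside_td = self.current_attrs is not None
        if self.found is None and inside_td and data.strip() == self.wanted:
            self.found = self.current_attrs


finder = FirstCellFinder("str2")
finder.feed(mystring)
print(finder.found)  # {'attr': '5'}
```

html.parser is tolerant of markup that is not well-formed XML, which may make it usable even when a strict XML parser is not.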

Related

Extract multiple variable values from a single regular expression

I want to extract ID and Name from a single regular expression, but I'm not able to get the correct response
<a href="/profiles/6635/Name"
I have used below regular expression
<a href="/profiles/(.*?)/(.*?)"
As @WiktorStribiżew suggested, you should change your regular expression to
<a href="/profiles/([^/]+)/([^/]+)"
Also use $1$ and $2$ in the Template field to get both values. For example,
$1$$2$
saves the concatenated value, 6635Name, to the variable.
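Both patterns can be checked quickly outside the test tool. The sketch below uses Python's re module as a stand-in for the Java regex engine, on the assumption that the two behave the same for patterns this simple; the variable names are illustrative.

```python
import re

html = '<a href="/profiles/6635/Name"'

# Original lazy pattern and the suggested negated-character-class pattern.
lazy = re.search(r'<a href="/profiles/(.*?)/(.*?)"', html)
negated = re.search(r'<a href="/profiles/([^/]+)/([^/]+)"', html)

print(lazy.groups())     # ('6635', 'Name')
print(negated.groups())  # ('6635', 'Name')

# Equivalent of a template that concatenates group 1 and group 2.
combined = negated.group(1) + negated.group(2)
print(combined)          # 6635Name
```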
The expression you use, <a href="/profiles/(.*?)/(.*?)", is fine for capturing the ID and name from <a href="/profiles/6635/Name". The first lazy (non-greedy) group (.*?) matches only between profiles/ and the next /, the same as [^\/]+ would, and the second matches between that / and the closing ". Check again that everything else is set up correctly.
You may need to escape / as \/, so change it to:
<a href="\/profiles\/(.*?)\/(.*?)"
You can check the same regex in an online demo, or in a Java regex tester if you need to confirm the Java behavior.

xpath+ regex: matches text

I'm trying to write an XPath expression that returns only nodes whose text consists of numbers alone.
I wanted to use a regex and was hoping this would work:
td[matches(text(),'[\d.]')]
Can anyone please help me understand what I am doing wrong here?
<tr>
<td>1</td>
<td>10</td>
<td>a</td>
</tr>
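It seems that you are missing a quantifier. [\d.] matches only one character, so 1 should be selected; 10, on the other hand, requires something like +. Try your regex like this:
td[matches(text(),'\d+')]
Also, the . inside the character class matches a literal dot, so leave it out if you want digits only. Because matches() succeeds when any substring matches, anchor the pattern as '^\d+$' if the entire text must be numeric.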
You can test all your regex queries on regex101.
AFAIK, Selenium so far supports only XPath 1.0, so matches() is not supported.
You can try the following instead:
//td[number(.) >= 0 or number(.) < 0]
To match table cells containing integers, replace:
td[matches(text(),'[\d+]')]
with:
td[matches(text(),'\d+')]
Note: regular expressions are available only in XPath 2.0 and later.
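When only an XPath 1.0 engine is available, one workaround is to select all cells and filter them in the host language. The sketch below does this in Python with the standard library's ElementTree, whose XPath subset has no matches() function; this is a substitute technique, not XPath 2.0, and the variable names are illustrative.

```python
import re
import xml.etree.ElementTree as ET

row = ET.fromstring("<tr><td>1</td><td>10</td><td>a</td></tr>")

# re.fullmatch requires the whole text to be digits, which is the anchored
# equivalent of matches(text(), '^\d+$') in XPath 2.0.
numeric = [td.text for td in row.iter("td")
           if td.text and re.fullmatch(r"\d+", td.text)]

print(numeric)  # ['1', '10']
```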

How to modify (.+?) to ignore \n, \t, or print integers only? (Regex - Python 3.x)

I want to retrieve the amount of funding from a website with the following HTML text:
</span></p></div></div><dl class="medium">
<dt>Funding:\n\t\t</dt>
<dd class="">10.000 €</dd><dt>
I use regex with Python 3 and the following source code:
import re
regex = r'<dt>Funding:(.+?) €</dd>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
But it delivers only the following result:
['\\n\\t\\t</dt><dd class="">10.000']
If I try to include \\n\\t\\t</dt><dd class=""> in the regex like this:
regex = '<dt>Funding:\n\t\t</dt><dd class="">(.+?) €</dd>'
It just returns []. None of the other modifications I tried with (.+?) gave a better result. How can I modify the (.+?) expression to get the following result for print(price)?
['10.000']
You should absolutely use an HTML parser, but since I know nothing about them, for the specific case you've shown the following should work:
regex = r'<dt>Funding:.*?(\d+(?:\.\d+)?)\s*€</dd>'
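As a quick check, the pattern can be run against a string shaped like the one in the question. The sketch assumes, based on the printed result, that htmltext contains literal backslash sequences (\\n, \\t) rather than real newlines; re.DOTALL is added as a precaution so that . would also cross real newlines.

```python
import re

# Assumed shape of the page source, with literal "\n" and "\t" sequences
# as suggested by the output shown in the question.
htmltext = ('</span></p></div></div><dl class="medium">'
            '<dt>Funding:\\n\\t\\t</dt>'
            '<dd class="">10.000 €</dd><dt>')

# Skip lazily from "Funding:" to the first number, then capture digits with
# an optional ".digits" part, followed by optional whitespace and the euro sign.
pattern = re.compile(r'<dt>Funding:.*?(\d+(?:\.\d+)?)\s*€</dd>', re.DOTALL)

price = pattern.findall(htmltext)
print(price)  # ['10.000']
```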
Why should you use an HTML parser? Because as soon as your HTML isn't formed exactly how you're expecting it, you'll start getting incorrect results. Imagine, for example, using the above regex with the following HTML:
</span></p></div></div><dl class="medium">
<dt>Funding:\n\t\t</dt> //start matching here
<dd class="">€</dd> //value is missing
</span>
...
</span></p></div></div><dl class="medium">
<dt>Funding:\n\t\t</dt>
<dd class="">€</dd> //matches the value from the next result down
</span>
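The failure mode described above can be reproduced with a small Python sketch. The two-block input is a simplified, made-up version of the HTML, and splitting the text into blocks first is just one possible guard, not a replacement for a real parser.

```python
import re

# Simplified, hypothetical input: the first entry has no amount.
htmltext = (
    '<dt>Funding:</dt><dd class="">€</dd>'
    '<dt>Funding:</dt><dd class="">20.000 €</dd>'
)

amount = re.compile(r'<dt>Funding:.*?(\d+(?:\.\d+)?)\s*€</dd>', re.DOTALL)

# The match starts at the FIRST "Funding:" and runs on into the second
# entry, so one value is returned and nothing signals that one is missing.
naive = amount.findall(htmltext)
print(naive)  # ['20.000']

# Guard: isolate each "Funding" block first, then look for a number in it.
block = re.compile(r'<dt>Funding:.*?</dd>', re.DOTALL)
per_block = []
for chunk in block.findall(htmltext):
    m = re.search(r'(\d+(?:\.\d+)?)\s*€</dd>', chunk)
    per_block.append(m.group(1) if m else None)

print(per_block)  # [None, '20.000']
```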

Invert match with regular expressions

How can I remove the style attribute from an HTML string with regular expressions?
For example, if we have the following inline HTML string:
<html><body style="background-color:yellow"><h2 style="background-color:red">This is a heading</h2><p style="background-color:green">This is a paragraph.</p></body></html>
After applying the regular expression, the result should look like:
<html><body ><h2 >This is a heading</h2><p >This is a paragraph.</p></body></html>
You can't parse HTML with regular expressions because HTML is not regular.
Of course you can cut corners at your own peril, for example by searching for style\s*=\s*"[^"]*" and replacing that with nothing, but that will remove any occurrence of style="anything" from your text.
You simply need to replace the style attributes with nothing. Here's an example of how to do so with PHP:
$text = preg_replace('/\s+style="[^"]*"/', '', $text);
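The same substitution can be sketched in Python with re.sub, using the sample string from the question. Like the PHP version, it assumes double-quoted attribute values and will also strip any style="..." text that appears outside a tag.

```python
import re

html = ('<html><body style="background-color:yellow">'
        '<h2 style="background-color:red">This is a heading</h2>'
        '<p style="background-color:green">This is a paragraph.</p>'
        '</body></html>')

# \s+ also consumes the whitespace before "style", so no stray space
# is left inside the tag.
cleaned = re.sub(r'\s+style="[^"]*"', '', html)

print(cleaned)
```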
The usual answer is that regexes are, in most cases, not suitable for HTML. You should also state the language in which you plan to implement this.
However, a regex like this will replace the heading:
<h2\s+style="background-color:red">
// replace with
<h2>
The regex for the paragraph tag is analogous (replace 'h2' with 'p' and 'red' with 'green').

Replace text, Jython, Regex

I am processing my website and want to change some things on the pages.
I want to replace the following string:
in the
<SPAN class="Bold">
More...
</SPAN>
column to your right.
Sometimes it does not have the <span> tags:
in the
More...
column to your right.
I would like to replace this with "below". I tried a simple replace() in Python, but because the text sometimes lacks the <span> tag and is spread over multiple lines, it does not work. My only thought is to use regular expressions, but I am not up to speed with them. Could anyone lend a hand?
Thanks
Eef
Assuming you have the HTML text in the string "foo", the code to do this in Python would look like this:
import re
# re.DOTALL makes . match every character, including newlines
regexp = re.compile(r'in the.*?More\.\.\..*?column to your right\.', re.DOTALL)
# re.sub returns the new string; it does not modify foo in place
foo = re.sub(regexp, 'below', foo)
Try this:
import re
pattern = re.compile(r'(?:<SPAN class="Bold">\s*)?More\.\.\.(?:\s*</SPAN>)?')
# "text" holds the page content; avoid naming it str, which shadows the built-in
text = re.sub(pattern, 'below', text)
The (?:…) syntax is a non-capturing group: it groups part of the pattern without creating a backreference.
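The two answers can be combined into one pattern that covers the whole phrase, with or without the <SPAN> wrapper. The following is a sketch under the assumption that only whitespace separates the pieces; the sample strings and variable names are made up for illustration.

```python
import re

with_span = ('See the list in the\n<SPAN class="Bold">\nMore...\n</SPAN>\n'
             'column to your right.')
without_span = 'See the list in the\nMore...\ncolumn to your right.'

# \s* absorbs the line breaks between the pieces; the two optional
# non-capturing groups allow the <SPAN> tags to be present or absent.
pattern = re.compile(
    r'in the\s*(?:<SPAN class="Bold">\s*)?More\.\.\.\s*'
    r'(?:</SPAN>\s*)?column to your right\.'
)

result_with = pattern.sub('below', with_span)
result_without = pattern.sub('below', without_span)

print(result_with)     # See the list below
print(result_without)  # See the list below
```

Using \s* instead of .*? with re.DOTALL keeps the match from running across unrelated text when "in the" appears earlier on the page.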