Replace text, Jython, Regex - regex

I am processing my website and want to change some things on the pages.
I want to replace the following string:
in the
<SPAN class="Bold">
More...
</SPAN>
column to your right.
Sometimes it does not have the <span> tags:
in the
More...
column to your right.
I would like to replace this with "below". I tried doing this with a simple replace() in Python, but because the text sometimes lacks the <span> tag and is spread over multiple lines, it does not seem to work. My only thought is to use regular expressions, but I am not up to speed with regexes. Could anyone lend a hand?
Thanks
Eef

Assuming you have the HTML text in the string "foo", the code to do this in Python would look like this:
import re
# re.DOTALL makes . match every character, including newlines
regexp = re.compile(r'in the.*?More\.\.\..*?column to your right\.', re.DOTALL)
# re.sub returns a new string, so assign the result
foo = re.sub(regexp, 'below', foo)
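To check how this pattern behaves on both variants from the question, here is a minimal sketch. The sample strings and the variable names `with_span` and `without_span` are made up for illustration; only the pattern comes from the answer above.

```python
import re

# Lazy ".*?" plus re.DOTALL lets the match span line breaks and the optional tags.
pattern = re.compile(r'in the.*?More\.\.\..*?column to your right\.', re.DOTALL)

# Hypothetical sample inputs modelled on the two variants in the question.
with_span = 'See in the\n<SPAN class="Bold">\nMore...\n</SPAN>\ncolumn to your right. Thanks.'
without_span = 'See in the\nMore...\ncolumn to your right. Thanks.'

result_with_span = pattern.sub('below', with_span)
result_without_span = pattern.sub('below', without_span)

print(result_with_span)
print(result_without_span)
```

One thing to watch: because ".*?" accepts any text, an unrelated "in the" earlier on the page would start the match there and swallow everything up to the next "More...".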

Try this:
import re
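pattern = re.compile(r'(?:<SPAN class="Bold">\s*)?More\.\.\.(?:\s*</SPAN>)?')
# "text" holds the page HTML; the name avoids shadowing the built-in str
text = re.sub(pattern, 'below', text)
The (?:…) syntax is a non-capturing group: it groups the tag so that ? can make it optional, but it does not create a numbered group that a backreference could refer to. Note that this replaces only the More... part (with its optional tags), not the surrounding sentence.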
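If the whole phrase should become "below", the optional groups can be combined with the surrounding words. This is only a sketch under the assumption that nothing but whitespace separates the pieces; the name `strict` and the sample strings are hypothetical.

```python
import re

# Only whitespace and the optional tags are allowed between the fixed words,
# so an unrelated "in the" elsewhere on the page cannot start a match.
strict = re.compile(
    r'in the\s*'
    r'(?:<SPAN class="Bold">\s*)?'   # optional opening tag
    r'More\.\.\.'
    r'(?:\s*</SPAN>)?'               # optional closing tag
    r'\s*column to your right\.'
)

# Hypothetical sample inputs; the first contains an unrelated "in the".
with_span = 'Look in the house. See in the\n<SPAN class="Bold">\nMore...\n</SPAN>\ncolumn to your right.'
without_span = 'See in the\nMore...\ncolumn to your right.'

fixed_with_span = strict.sub('below', with_span)
fixed_without_span = strict.sub('below', without_span)

print(fixed_with_span)
print(fixed_without_span)
```

No re.DOTALL is needed here, because \s already matches newlines.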

Related

Regex to select the string between two texts

I have a code like:
<p>Also: <a>text 1</a></p> <p><a> text 2 </a></p>
I am using a regex like this; I just want to remove everything up to the first </p>:
<p>Also:(.*?)</p>
and the output is
empty
How do I select from <p>Also: up to the first </p>?
I think you want a regex like this:
/(?<=<p>Also:).+?(?=<\/p>)/i
[Regex Demo]
or
/^.*?(?<=<p>Also:).+?(?=<\/p>)/gi
[Regex Demo]
I tried VB.NET and found that your regex pattern works for your input. However, when I tried it on regexr.com, I found that the forward slash "/" has to be escaped.
You could try this:
<p>Also:(.+?)<\/p>
Note: I would not recommend regex for HTML. It is better to use an HTML parser for your programming language.
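For comparison, here is how the same two ideas might look in Python, using the input from the question. This is a sketch rather than the answerers' own code; the variable names are made up, and Python does not need the "/" escaped.

```python
import re

html = '<p>Also: <a>text 1</a></p> <p><a> text 2 </a></p>'

# Lazy ".+?" stops at the first closing tag; the lookarounds keep the
# surrounding tags out of the matched text.
inner = re.search(r'(?<=<p>Also:).+?(?=</p>)', html, re.IGNORECASE).group(0)

# To drop the whole first paragraph instead, match the tags too and
# replace only the first occurrence.
remainder = re.sub(r'<p>Also:.*?</p>', '', html, count=1, flags=re.IGNORECASE)

print(repr(inner))
print(repr(remainder))
```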

Find & replace multiple keywords defined string

I'm trying to remove the following string/line in my SQL database:
<p><span style="font-size:16px"><strong>The quick brown </strong></span><strong><span style="font-size:16px">fox jumps.</span></strong></p>
String will always start with <p> and end with </p>
String will always contain these words, in the same order: The, quick, brown. But they might be separated by something else (space, or other HTML tags)
String is part of field with more text, nested HTML tags, so the solution must ignore higher level <p></p> tags.
We are talking about +20k matches, no manual edits solutions please :)
I have already tried doing it with RegExp but I can't filter for multiple keywords (AND operator).
I can export my DB to a sql file so I can use any solution you would recommend, Windows/Linux, text editor, js script etc. but I would appreciate the simplest and elegant solution.
I think you have to replace .* with the less efficient but more precise (?:(?!<\/?p[^<]*>).)*, which forces the words to be matched inside a single <p> tag:
(?i)<p>(?:(?!<\/?p[^<]*>).)*the(?:(?!<\/?p[^<]*>).)*?quick(?:(?!<\/?p[^<]*>).)*?brown(?:(?!<\/?p[^<]*>).)*?<\/p>
See demo
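Since the asker can export the database to a file, the same expression could be applied from a short Python script. The sketch below only shows the pattern at work on one made-up field value; `TEMPERED`, `field` and `cleaned` are hypothetical names, and reading and writing the SQL dump is left out.

```python
import re

# Matches any run of characters that does not cross a <p ...> or </p> tag.
TEMPERED = r'(?:(?!</?p[^<]*>).)*?'

pattern = re.compile(
    r'<p>' + TEMPERED + r'the' + TEMPERED + r'quick' + TEMPERED + r'brown'
    + TEMPERED + r'</p>',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical field value: the target paragraph sits between two others.
field = (
    '<p>Intro</p>'
    '<p><span style="font-size:16px"><strong>The quick brown </strong></span>'
    '<strong><span style="font-size:16px">fox jumps.</span></strong></p>'
    '<p>Outro</p>'
)

cleaned = pattern.sub('', field)
print(cleaned)
```

The lookahead is what stops the match from starting in the first paragraph and running into the second one.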
This expression, ^<p>.*The.*quick.*brown.*</p>$, worked for me with grep (the $ is written as \$ because it sits inside a double-quoted shell string):
[root#fedora ~]# grep "^<p>.*The.*quick.*brown.*</p>\$" test1.txt
<p><span style="font-size:16px"><strong>The quick brown </strong></span><strong><span style="font-size:16px">fox jumps.</span></strong></p>
<p><strong>The quick brown </strong></span><strong><span style="font-size:16px">fox jumps.</span></strong></p>
<p>The quick brown </strong></span><strong><span style="font-size:16px">fox jumps.</p>
[root#fedora ~]#
You can use the following in any editor (say Notepad++), in JavaScript, or in any PCRE engine with the g, m, and i modifiers, and replace each match with '' (empty string):
^<p>.*?the.*?quick.*?brown.*?<\/p>$
I used .* instead of .+ because of your statement that the words MIGHT be separated by something else.

Sublime: replace everything between quotes

I need some help with Regular expression to Search and Replace in Sublime to do the following.
I have HTML-code with links like
href="http://www.example.com/test=123"
href="http://www.example.com/test=6546"
href="http://www.example.com/test=3214"
I want to replace them with empty links:
href=""
href=""
href=""
Please help me to create a regex filter to match my case. I guess it would sound like "starts with a quote, followed by http:// ..., ends with a quote, and contains digits and an '=' sign", but I'm not very confident about how to write this as a regex.
(?<=href=")[^"]*
Try this. Replace with an empty string.
See demo.
https://regex101.com/r/sH8aR8/40
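The same lookbehind also works outside Sublime. A minimal Python sketch, using two of the links from the question (the variable names are made up):

```python
import re

html = (
    'href="http://www.example.com/test=123"\n'
    'href="http://www.example.com/test=6546"'
)

# The lookbehind anchors the match right after href=" without consuming it,
# and [^"]* then takes everything up to the closing quote.
emptied = re.sub(r'(?<=href=")[^"]*', '', html)
print(emptied)
```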

How to modify (.+?) to ignore \n, \t, or print integers only? (Regex - Python 3.x)

I want to retrieve the amount of funding from a website with the following htmltext:
</span></p></div></div><dl class="medium">
<dt>Funding:\n\t\t</dt>
<dd class="">10.000 €</dd><dt>
I use regex with Python 3 and the following source code:
regex = '<dt>Funding:(.+?) €</dd>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
But it delivers only the following result:
['\\n\\t\\t</dt><dd class="">10.000']
If I try to include \\n\\t\\t</dt><dd class=""> in the regex expression like this:
regex = '<dt>Funding:\n\t\t</dt><dd class="">(.+?) €</dd>'
It just returns []. Any other modification that I tried with (.+?) doesn't deliver any or a better result. How can I modify the (.+?) expression in order to get the following result for print(price)
['10.000']
You should absolutely use an HTML parser, but since I know nothing about them, for the specific case you've shown the following should work:
regex = r'<dt>Funding:.*?(\d+(?:\.\d+)?)\s*€</dd>'
Why should you use an HTML parser? Because as soon as your HTML isn't formed exactly how you're expecting it you'll start getting incorrect results. Imagine for example using the above regex with the following HTML:
</span></p></div></div><dl class="medium">
<dt>Funding:\n\t\t</dt> //start matching here
<dd class="">€</dd> //value is missing
</span>
...
</span></p></div></div><dl class="medium">
<dt>Funding:\n\t\t</dt>
<dd class="">€</dd> //matches the value from the next result down
</span>
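To make the failure mode concrete, here is a sketch that runs the regex and a small parser side by side. It uses Python's standard html.parser module; the class `FundingParser`, the sample markup and the amount 20.000 are all made up for illustration, and the parser only handles plain text directly inside <dt> and <dd>.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: the first entry has no amount, the second one does.
html = ('<dl><dt>Funding:</dt><dd class="">€</dd>'
        '<dt>Funding:</dt><dd class="">20.000 €</dd></dl>')

# The regex skips ahead and pairs the first "Funding:" with the second amount.
regex_values = re.findall(r'<dt>Funding:.*?(\d+(?:\.\d+)?)\s*€</dd>', html)


class FundingParser(HTMLParser):
    """Collects the text of the <dd> that follows each 'Funding:' <dt>."""

    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.expect_value = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if self.current_tag == 'dt':
            self.expect_value = text.startswith('Funding:')
        elif self.current_tag == 'dd' and self.expect_value:
            self.values.append(text)
            self.expect_value = False


parser = FundingParser()
parser.feed(html)
parser.close()

# Strip the currency sign; an entry without an amount stays visible as ''.
parser_values = [value.replace('€', '').strip() for value in parser.values]

print(regex_values)
print(parser_values)
```

The regex reports a single amount for two entries, while the parser keeps one value per entry, so a missing amount shows up as an empty string instead of being silently replaced.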

regex needed to match anything within p tags

I need a regular expression to match anything that is within <p> tags so for example if I had some text:
<p>Hello world</p>
The regex would match the Hello world part
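In JavaScript:
var str = "<p>Hello world</p>";
// match() returns the capture groups; search() would only return the match position
var match = str.match(/<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>/);
// match[1] is "Hello world"
In PHP:
$str = "<p>Hello world</p>";
// the captured text ends up in $matches[1]
preg_match_all("/<\s*p[^>]*>([^<]*)<\s*\/\s*p\s*>/", $str, $matches);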
These will match something as complex as this
< p style= "font-weight: bold;" >Hello world < / p >
EDIT: Don't do it. Just don't.
See this question
If you insist, use <p>(.+?)</p> and the result will be in the first group. It is not perfect, but no regexp solution to HTML parsing problem will ever be.
E.g. (in Python):
>>> import re
>>> r = re.compile('<p>(.+?)</p>')
>>> r.findall("<p>fo o</p><p>ba adr</p>")
['fo o', 'ba adr']
It seems that each of the solutions proposed above fails in at least one of these ways:
it does not return the text within <p>...</p> tags when that text contains other tags such as <a> or <em>,
or
it does not distinguish between <p> and <path>,
or
it does not match tags with attributes such as <p class="content">.
Consider using this regex:
<p(|\s+[^>]*)>(.*?)<\/p\s*>
Resulting text will be captured in group 2.
Obviously, this solution won't work properly when the closing </p> tag is for some reason enclosed in a comment, as in <p> ... <!-- ... </p> ... -->
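A quick way to check those three cases is to run the regex in Python. This is a sketch with a made-up input string; re.DOTALL is my addition so that paragraphs spanning several lines would match too, and uppercase tags would additionally need re.IGNORECASE.

```python
import re

# Group 1 holds the attributes (possibly empty), group 2 the paragraph content.
p_tag = re.compile(r'<p(|\s+[^>]*)>(.*?)</p\s*>', re.DOTALL)

# Hypothetical input covering nested tags, attributes and a <path> element.
html = '<p class="content">Hello <em>world</em></p><path d="M0 0"></path><p>Bye</p>'

texts = [match.group(2) for match in p_tag.finditer(html)]
print(texts)
```

The <path> element is skipped because the name p must be followed either by ">" directly or by whitespace.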
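You can use this in Python as a more comprehensive solution, using an HTML parser (Beautiful Soup) instead of a regex:
import bs4
import requests
# link is assumed to hold the URL of the page to parse
page = requests.get(link)
page_content = bs4.BeautifulSoup(page.content, 'html.parser')
# find_all returns every <p> element; call .get_text() on each for its text
result = page_content.find_all('p')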
Regex:
<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>
This will work for any pair of tags.
e.g <p class="foo">hello<br/></p>
The \1 makes sure that the opening tag matches the closing tag.
The content between the tags is captured in \2.
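Here is a small Python sketch of the backreference at work. The sample strings are made up, and the second one shows a limit that is easy to miss: when a tag is nested inside the same tag, the lazy match stops at the first closing tag.

```python
import re

# \1 must repeat whatever tag name group 1 captured in the opening tag.
any_tag = re.compile(r'<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>', re.DOTALL)

simple = any_tag.search('<p class="foo">hello<br/></p>')
tag_name, content = simple.group(1), simple.group(2)

# Nested identical tags: the match ends at the inner closing tag.
nested = any_tag.findall('<div><div>a</div></div>')

print(tag_name, content)
print(nested)
```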
For anybody looking into this regex or any other regex to match specific HTML tags, the regex below will work as needed:
<\s*p[^>]*>(.*?)<\s*\/\s*p\s*>
This will match strings like the ones below, as mentioned in xzyfer's answer:
<p>I would like <b>all</b> the text!</p>
< p style= "font-weight: bold;" >Hello world < / p >
Link to the Regex on Regex101 here: https://regex101.com/r/kjpLII
If you would like to use the Regex for other HTML tags instead of just p tags you can change the p's in the Regex to whichever HTML tag you wish to match:
<\s*div[^>]*>(.*?)<\s*\/\s*div\s*>
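Swapping the tag name by hand can be wrapped in a small helper. The function `tag_pattern` below is hypothetical, not part of any library, and simply substitutes the name into the regex above. The last call shows a caveat: because the name is followed by [^>]*, a pattern built for p also accepts an opening <pre>.

```python
import re

def tag_pattern(tag):
    # Hypothetical helper: same shape as the regex above, with the tag name swapped in.
    name = re.escape(tag)
    return re.compile(
        r'<\s*' + name + r'[^>]*>(.*?)<\s*/\s*' + name + r'\s*>',
        re.DOTALL,
    )

spaced = tag_pattern('p').findall('< p style= "font-weight: bold;" >Hello world < / p >')
divs = tag_pattern('div').findall('<div id="a">x</div>')

# Caveat: <pre> is taken as an opening <p>, so the match runs on to the real </p>.
confused = tag_pattern('p').findall('<pre>x</pre><p>y</p>')

print(spaced)
print(divs)
print(confused)
```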