Find & replace multiple keywords defined string - regex

I'm trying to remove the following string/line in my SQL database:
<p><span style="font-size:16px"><strong>The quick brown </strong></span><strong><span style="font-size:16px">fox jumps.</span></strong></p>
String will always start with <p> and end with </p>
String will always contain these words, in the same order: The, quick, brown. But they might be separated by something else (space, or other HTML tags)
String is part of field with more text, nested HTML tags, so the solution must ignore higher level <p></p> tags.
We are talking about 20k+ matches, so no manual-edit solutions, please :)
I have already tried doing it with a regex, but I can't filter for multiple keywords (AND operator).
I can export my DB to an SQL file, so I can use any solution you would recommend (Windows/Linux, text editor, JS script, etc.), but I would appreciate the simplest, most elegant solution.

I think you have to replace .* with a less efficient but more precise (?:(?!<\/?p[^<]*>).)* that forces the words to match inside a single <p> element:
(?i)<p>(?:(?!<\/?p[^<]*>).)*the(?:(?!<\/?p[^<]*>).)*?quick(?:(?!<\/?p[^<]*>).)*?brown(?:(?!<\/?p[^<]*>).)*?<\/p>
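As a rough illustration of the tempered pattern above, here is a minimal Python sketch. The sample field, the remove_paragraph helper, and the choice of Python's re module are assumptions made for demonstration, not part of the original answer. Every quantifier is written lazily here, and re.DOTALL is added in case a paragraph spans several lines.

```python
import re

# One "tempered" step: consume any character as long as no <p ...> or </p>
# tag starts at the current position. (Assumption: all steps are lazy.)
TEMPERED = r"(?:(?!</?p[^<]*>).)*?"

PATTERN = re.compile(
    r"<p>" + TEMPERED + "the" + TEMPERED + "quick" + TEMPERED + "brown"
    + TEMPERED + r"</p>",
    re.IGNORECASE | re.DOTALL,
)


def remove_paragraph(html: str) -> str:
    """Delete every <p>...</p> that contains the, quick, brown in that order."""
    return PATTERN.sub("", html)


# Hypothetical field value: the target paragraph sits between two others.
field = (
    "<p>Intro text.</p>"
    '<p><span style="font-size:16px"><strong>The quick brown </strong></span>'
    '<strong><span style="font-size:16px">fox jumps.</span></strong></p>'
    "<p>Outro text.</p>"
)

print(remove_paragraph(field))
# <p>Intro text.</p><p>Outro text.</p>
```

Because the lookahead refuses to step over any tag that starts with <p or </p, a match can never begin in one paragraph and end in another, which is the point of the restriction.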

This expression, ^<p>.*The.*quick.*brown.*</p>$, worked for me (the $ is escaped as \$ inside the double-quoted shell command):
[root@fedora ~]# grep "^<p>.*The.*quick.*brown.*</p>\$" test1.txt
<p><span style="font-size:16px"><strong>The quick brown </strong></span><strong><span style="font-size:16px">fox jumps.</span></strong></p>
<p><strong>The quick brown </strong></span><strong><span style="font-size:16px">fox jumps.</span></strong></p>
<p>The quick brown </strong></span><strong><span style="font-size:16px">fox jumps.</p>
[root@fedora ~]#

You can use the following in any editor (say, Notepad++), in JavaScript, or in any PCRE engine, with the g, m, and i modifiers:
^<p>.*?the.*?quick.*?brown.*?<\/p>$
Replace each match with '' (an empty string).
I used .* instead of .+ because you said the words MIGHT be separated by something else.
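For comparison, a small Python sketch of this line-anchored variant follows. It assumes each candidate paragraph sits on its own line of the exported dump, which the question does not guarantee. The sample dump is made up, and re.MULTILINE and re.IGNORECASE stand in for the m and i modifiers (re.sub already replaces every match, like g).

```python
import re

# ^ and $ match at line boundaries because of re.MULTILINE.
LINE_PATTERN = re.compile(
    r"^<p>.*?the.*?quick.*?brown.*?</p>$",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical three-line export; only the middle line should be emptied.
dump = "\n".join([
    "<p>Keep this line.</p>",
    "<p><strong>The quick brown </strong>fox jumps.</p>",
    "<p>Keep this one too.</p>",
])

cleaned = LINE_PATTERN.sub("", dump)
print(repr(cleaned))
# '<p>Keep this line.</p>\n\n<p>Keep this one too.</p>'
```

The matched line is replaced by an empty line rather than removed entirely. Unlike the tempered version, this pattern can span several paragraphs if they share one line.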

Related

Regex to delete js HTML attributes

I've got this file from Google that has these js attributes, like jsname="data", jscontroller="data", etc.
I'd like to use Atom's find and replace with regex feature to replace all attributes matching js*="*" with blanks.
What would the regex for this be?
So <div class="l-o-c-qd" jsname="name" jscontroller="somecontroller">Text</div>
will be <div class="l-o-c-qd">Text</div>
Search for a regex corresponding to js*="*" and replace it with nothing (include the space before or after each attribute in the match to avoid double spaces after replacement).
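One possible pattern, shown here as a Python sketch rather than as Atom's own dialog: \s+js\w*="[^"]*" matches the leading whitespace, an attribute name starting with js, and a double-quoted value. The pattern and the strip_js_attributes helper are assumptions for illustration. Single-quoted or unquoted values are not handled, and text outside tags that happens to look like such an attribute would be removed as well.

```python
import re

# Leading whitespace is part of the match, so no double spaces remain.
JS_ATTR = re.compile(r'\s+js\w*="[^"]*"')


def strip_js_attributes(html: str) -> str:
    """Remove attributes such as jsname="..." or jscontroller="..."."""
    return JS_ATTR.sub("", html)


html = '<div class="l-o-c-qd" jsname="name" jscontroller="somecontroller">Text</div>'
print(strip_js_attributes(html))
# <div class="l-o-c-qd">Text</div>
```

The same pattern should work in an editor's regex search with an empty replacement, provided the editor's regex flavor supports \s and \w.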

Regex Match All Characters Between Tags on nth occurrence

I need to match text between two tags, but starting at a specific occurrence of the tag.
Imagine this text:
Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>
In my example, I would like to match "here. And some".
I successfully matched the text between the first occurrence (between the first and second br tags) with:
<br>(.*?)<br>
But I am looking for the text in the next match (which would be between the second and third br tags). This is probably more obvious than I realize, but regex is not my strong suit.
Just extend your regex:
<br>(.*?)<br>(.*?)<br>
or, for an unlimited number of matches, and trimming the spaces:
<br>\s*(.*?)(?=\s*<br>)
EDIT: Now that I see that you are parsing an HTML document, be aware that regular expressions may not be the best tool for that job, especially if your parsing requirements are complex.
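To see how the lookahead version yields every segment, here is a short Python sketch using re.findall on the sample text from the question. Python is only an assumption here, since the question does not name a language.

```python
import re

text = "Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>"

# The lookahead (?=\s*<br>) checks for the closing <br> without consuming it,
# so that same tag can open the next match.
segments = re.findall(r"<br>\s*(.*?)(?=\s*<br>)", text)

print(segments)
# ['text goes', 'here. And some', 'more can', 'go here.']

# The text between the second and third <br> is the second segment.
print(segments[1])
# here. And some
```

Picking the nth occurrence then becomes a plain list index instead of a longer and longer pattern.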

Notepad++ ( perl ) regex match multiple line pattern

I want to remove a div from a couple hundred html files
<div id="mydiv">
blahblah blah
more blah blah
more html
<some javascript here too>
</div>
I thought that this would do the job but it doesn't
<div(.*)</div>
Does anyone know which is the proper regex for this?
Regex
<div[^>]+>(.*?)</div>
Don't forget to check the ". matches newline" option.
Alternatively, you can use this regex also: <div[^>]+>([\s\S]*?)</div> with or without the checkbox checked.
Discussion
Since the * metacharacter is greedy, you need to tell it to take as few characters as possible (by adding ?).
Check that the divs you want to remove DO NOT contain nested divs. If they do, the regex at the start of my answer won't help you.
If you face this case, I'd suggest using an HTML parser.
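Outside Notepad++, the same idea can be sketched in Python, where re.DOTALL plays the role of the ". matches newline" checkbox. The sample page below is invented for illustration, and the nested-div caveat applies unchanged.

```python
import re

# re.DOTALL lets "." match line breaks, so the lazy group can span lines.
DIV_BLOCK = re.compile(r"<div[^>]+>(.*?)</div>", re.DOTALL)

# Hypothetical page with one multi-line div and no nested divs.
page = "\n".join([
    "<body>",
    '<div id="mydiv">',
    "blahblah blah",
    "more html",
    "</div>",
    "<p>kept</p>",
    "</body>",
])

stripped = DIV_BLOCK.sub("", page)
print(repr(stripped))
# '<body>\n\n<p>kept</p>\n</body>'
```

Note that <div[^>]+> requires at least one character after <div, so a bare <div> without attributes is not matched by this pattern.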

REGEX Pattern - How do I match upto a certain tag in html

I have some HTML which I want to grab between 2 tags. However, nested tags exist in the HTML, so looking for </div> wouldn't work as it would stop at the first nested div.
Basically I want my regex to...
Match some text literally, followed by ANY character up to another literal text string. So my question is: how do I get [^<]* to continue matching until it sees the next div?
such as
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">
In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read up on greedy vs lazy matching and check to make sure that whatever language you're using supports it.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
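The difference between the greedy and lazy forms is easy to check side by side. The following Python sketch reuses the string from the PHP one-liner without its trailing newline. The variable names are arbitrary, and Python is used only as an illustration.

```python
import re

html = "<div>Test<div>foo</div></div>"

# Greedy: .* runs to the LAST </div> it can reach.
greedy = re.search(r"<div[^>]*>(.*)</div>", html).group(1)

# Lazy: .*? stops at the FIRST </div>, which closes the nested div.
lazy = re.search(r"<div[^>]*>(.*?)</div>", html).group(1)

# Lazy up to the start of the first nested div.
up_to_nested = re.search(r"<div[^>]*>(.*?)<div", html).group(1)

print(greedy)        # Test<div>foo</div>
print(lazy)          # Test<div>foo
print(up_to_nested)  # Test
```

Neither form understands nesting: the lazy match ends at the inner closing tag, not at the one that actually closes the outer div.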
Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use an HTML/XML parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...
/<div id="test"[^<]*(<([^ >]+).+<\/\2>[^<]*)*<\/div>/

regex for all characters on yahoo pipes

I have an apparently simple regex query for Pipes: I need to truncate each item from its <img> tag onwards. I thought a loop with a string regex of <img[.]* replaced by a blank field would have taken care of it, but to no avail.
Obviously I'm missing something basic here - can someone point it out?
The item as it stands goes along something like this:
sample text title
<a rel="nofollow" target="_blank" href="http://example.com"><img border="0" src="http://example.com/image.png" alt="Yes" width="20" height="23"/></a>
<a.... (a bunch of irrelevant hyperlinks I don't need)...
Essentially I only want the title text and hyperlink that's why I'm chopping the rest off
Going one better, since all I'm really doing here is making the item string more manageable by cutting it down before further manipulation: does anyone know if it's possible to extract an href from a certain link in the page (in this case the 1st one) using regex in Yahoo Pipes? I've seen the regex answer to this SO question, but I'm not sure how to use it to map a URL to an item attribute in a Pipes module.
You need to remove the line breaks first: use a RegEx pipe to replace the pattern [\r\n] with empty text on the content or description field, which turns it into a single line of text. Then you can use the .* wildcard, which will run to the end of the line.
http://www.yemkay.com/2008/06/30/common-problems-faced-in-yahoo-pipes/
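The two steps can be tried outside Pipes. The Python sketch below only mimics them and is not Yahoo Pipes syntax. The second link in the sample item is hypothetical, standing in for the "irrelevant hyperlinks". Note also that inside square brackets a dot is a literal dot, so <img[.]* matches only <img followed by dots, which is probably why the original attempt changed nothing.

```python
import re

# Sample item modelled on the question; the last link is made up.
item = (
    "sample text title\n"
    '<a rel="nofollow" target="_blank" href="http://example.com">'
    '<img border="0" src="http://example.com/image.png" alt="Yes" '
    'width="20" height="23"/></a>\n'
    '<a href="http://example.com/other">other</a>'
)

# Step 1: remove line breaks so the item is a single line.
single_line = re.sub(r"[\r\n]", "", item)

# Step 2: cut everything from the <img tag onwards.
truncated = re.sub(r"<img.*", "", single_line)
print(truncated)
# sample text title<a rel="nofollow" target="_blank" href="http://example.com">

# Pull the href out of the first remaining link.
first_href = re.search(r'href="([^"]*)"', truncated).group(1)
print(first_href)
# http://example.com
```

How the captured URL is then mapped to an item attribute depends on the Pipes modules available, which this sketch does not cover.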