Replacing a bunch of lines in a bunch of files - regex

Let's say I have some thousands of HTML files with some text inside 'em (articles, actually). Besides, let's say there are all sorts of scripts, styles, counters, other crap inside these HTMLs, somewhere above the actual text.
And my task is to replace everything from the very beginning of the file up to a certain tag (i.e., we start with <head> and end with <div class="StoryGoesBelow">) with a clean
<html>
<head>
</head>
<body>
block.
Is there any regex way I can do this? Vim? Any other editor? Scripting language?
Thanks.

The simplest regex for this would be (?s)\A.*?(?=<div class="StoryGoesBelow">) (assuming you want to keep the <div> tag). Replace that with the text from your question.
Explanation:
(?s) # Allow the dot to match newlines
\A # Anchor the search at the start of the string
.*? # Match any number of characters, as few as possible
(?=<div class="StoryGoesBelow">) # and stop right before this <div>
This will fail, of course, if the text <div class="StoryGoesBelow"> could also occur in a comment or a literal string somewhere above the actual tag.
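As a rough illustration, here is how that substitution might be applied from Python, whose re module accepts the same pattern. The helper names (replace_preamble, rewrite_tree), the *.html glob and the UTF-8 encoding are assumptions made for this sketch, not something given in the question. Run it on a copy of the files first.

```python
import re
from pathlib import Path

# Replacement block taken from the question.
HEADER = "<html>\n<head>\n</head>\n<body>\n"

# Same pattern as in the answer: everything from the start of the
# string up to (but not including) the marker <div>.
PATTERN = re.compile(r'(?s)\A.*?(?=<div class="StoryGoesBelow">)')


def replace_preamble(html: str) -> str:
    """Swap everything before the marker <div> for HEADER.

    If the marker is missing, the text is returned unchanged.
    """
    # A function is used as the replacement so HEADER is inserted literally.
    return PATTERN.sub(lambda m: HEADER, html, count=1)


def rewrite_tree(root: str) -> int:
    """Rewrite every *.html file under root in place; return how many changed."""
    changed = 0
    for path in Path(root).rglob("*.html"):
        old = path.read_text(encoding="utf-8")  # assumed encoding
        new = replace_preamble(old)
        if new != old:
            path.write_text(new, encoding="utf-8")
            changed += 1
    return changed
```

Calling rewrite_tree("articles") would process a hypothetical articles directory; files without the marker are left alone.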

Related

Regex Substitution to add a class to a specific html tag

I need to do a regex find-replace on the post content of a wordpress site in order to change all existing <h4> tags to <h2> tags. I then need to style the new <h2> tags to look like the old <h4> tags.
My plan was to add a class to the new <h2> tags...
<h4> Some poorly written html </h4>
becomes
<h2 class="pseudo-h4"> Some poorly written html </h2>
I feel like this should be doable with regex, but I just cannot seem to grok the more advanced parts of regex. My current working approach is to use this regex (?<=h4)(.+class=") to capture the 'class=' part of any h4 opening tag and then use $1pseudo-h4 as the substitution string. Once that is done I can go back and replace all h4s without regex because those which are "pseudo-h4s" will already be marked by the class.
I have a few problems...
1 - wp-cli hangs when I try to run this on wp_posts. Maybe this is normal?
2 - $1pseudo-h4 needs a trailing space to keep my class from running into the next class, but when I pass the argument with a space on the end I get "unknown --regex  parameter".
3 - It worked in my tester, but I don't actually know why this pattern won't match the tag of a previous element, for instance...
<h4>Sup<h4><p class="extra-cheese">Bla bla<p>
My lookbehind should see the <h4>, and .+ should go through as many characters as it needs to hit the "class=" section, right?
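On point 3, a quick check in Python (used here only as a stand-in for the tester, which is an assumption) suggests the pattern does run past the <h4> and into the next element, because .+ is free to cross tag boundaries. The sketch also shows one possible alternative that only looks inside the <h4> opening tag itself; the names OPEN_H4, CLOSE_H4, add_class and h4_to_h2 are made up for illustration.

```python
import re

sample = '<h4>Sup<h4><p class="extra-cheese">Bla bla<p>'

# The lookbehind pattern from the question: .+ is not confined to the tag.
leak = re.search(r'(?<=h4)(.+class=")', sample)
# leak.group(1) is '>Sup<h4><p class="'

# Alternative sketch: [^>]* cannot leave the opening tag, so a class
# attribute on a later element is never touched.
OPEN_H4 = re.compile(r'<h4\b([^>]*)>', re.IGNORECASE)
CLOSE_H4 = re.compile(r'</h4\s*>', re.IGNORECASE)


def add_class(match):
    """Build the <h2> opening tag, adding pseudo-h4 to its attributes."""
    attrs = match.group(1)
    if re.search(r'\bclass="', attrs):
        # Prepend to the existing class list, keeping a separating space.
        attrs = re.sub(r'\bclass="', 'class="pseudo-h4 ', attrs, count=1)
    else:
        attrs = ' class="pseudo-h4"' + attrs
    return '<h2' + attrs + '>'


def h4_to_h2(html):
    """Turn every <h4 ...> ... </h4> pair into <h2 class="pseudo-h4 ..."> ... </h2>."""
    return CLOSE_H4.sub('</h2>', OPEN_H4.sub(add_class, html))
```

This does both steps in one pass, which avoids the trailing-space argument from point 2. It still assumes the class attribute uses double quotes.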

How to Match Redundant Lines From Contenteditable Div in Regex

I'm trying to process the html inside a contenteditable div. It might look like:
<div>Hi I'm Jack...</div>
<div><br></div>
<div><br></div>
<div>More text.</div> *<div><br></div>*
*<div><br></div>**<div><br></div>*
*<div><br></div>*
*<div>
<br>
</div>*
What regular expression would match all trailing <div><br></div> but not the ones sandwiched between useful divs containing text, i.e., <div> text (not html) </div>?
I have enclosed all expressions I want to match in asterisks. The asterisks are for reference only and are not part of my string.
Thanks,
Jack
You can use the pattern:
(?:<div>[\n\s]*<br>[\n\s]*<\/div>)(?!.*?<div>[^<]+<\/div>)
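Note that the dot in the lookahead has to be able to cross line breaks, so enable the "dot matches newline" (s) flag, or use [\s\S] in place of the dot if your engine lacks that flag.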
Let me know if this works for all your cases and I will write a detailed explanation of the pattern.
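For reference, a small Python sketch of that pattern. It drops the redundant \n from [\n\s] (since \s already covers newlines) and passes re.DOTALL so the lookahead can see later lines. The function name strip_trailing_empty_divs is hypothetical.

```python
import re

# An empty <div><br></div> that is NOT followed, anywhere later in the
# string, by a <div> containing plain text.
TRAILING_EMPTY_DIV = re.compile(
    r'<div>\s*<br>\s*</div>(?!.*?<div>[^<]+</div>)',
    re.DOTALL,  # let . in the lookahead cross line breaks
)


def strip_trailing_empty_divs(html):
    """Remove the trailing empty divs; sandwiched ones are kept."""
    return TRAILING_EMPTY_DIV.sub('', html)
```

On the sample from the question this leaves the two empty divs between "Hi I'm Jack..." and "More text." in place and removes the rest. Whitespace between the removed divs is not touched.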

Find a block of descriptive text inside html using regex

I'm trying to figure out how to use look-ahead to try to capture the descriptive text in an html page such as
<div class="itemBanner" style="float:left; padding:10px">
<div style="padding-right:5px; padding-bottom:5px">
<div class="itemBanner">
HTML Tags Stripper is designed to strip HTML tags from the text. It will also strip embedded JavaScript code, style information (style sheets), as well as code inside php/asp tags (<?php ?> <%php ?> <% %>). It will also replace sequence of new line characters (multiple) with only one. <b>Allow tags</b> feature is session sticky, i.e. it will remember allowed tags list, so you will have to type them only once.<p></p>You can either provide text in text area below, or enter URL of the web page. If URL provided then HTML Tags Stripper will visit web-page for its contents.<p></p>
<b>Known issues:</b><br />
I figured a regex that looks for a '>' followed by at least 150 characters before a '<' would do the trick.
The closest I've gotten so far is:
(([^.<]){1,500})<
Which still misses on things like periods and other characters before and after the string.
Your regex will match anything that's neither "." nor "<" 1 to 500 times, then a "<".
Assuming you want to capture everything from the itemBanner div until the very next occurrence of a closing div, you can use these elements:
<div class="itemBanner"> - explicit match
() - parenthetical wrap for referencing, e.g. match[1]
.*? - any length of characters, non-greedily (as few as possible)
<\/div> - explicit match, with escaped '/'
to form this Ruby regex:
# The m flag lets . match newlines, since the div's text spans several lines.
item_banner_div_regex = /<div class="itemBanner">(.*?)<\/div>/m
match = item_banner_div_regex.match(html)
inside_item_banner_div = match && match[1]
Note: The exact regex will depend on the implementation you're using.
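The idea from the question itself (a '>' followed by at least 150 characters before the next '<') can also be written directly. This is only a sketch in Python: the function name find_long_text and the min_len parameter are assumptions, and any inline tag such as <b> inside the description will cut a run short.

```python
import re


def find_long_text(html, min_len=150):
    """Return runs of at least min_len non-'<' characters that sit between '>' and '<'."""
    pattern = re.compile(r'>([^<]{%d,})<' % min_len)
    return [run.strip() for run in pattern.findall(html)]
```

Unlike [^.<], the class [^<] lets periods through, so whole sentences are captured.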

REGEX Pattern - How do I match upto a certain tag in html

I have some html which I want to grab between 2 tags. However, nested tags exist in the html, so looking for </div> wouldn't work, as it would stop at the first nested div.
Basically I want my regex to...
Match some text literally, followed by ANY character up to another literal text string. So my question is: how do I get [^<]* to continue matching until it sees the next div?
such as
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">
In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read about greedy vs lazy here and check to make sure that whatever language you're using supports it.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
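The same greedy vs lazy comparison, sketched in Python on the input used in the PHP one-liner above (the variable names are just for illustration):

```python
import re

text = "<div>Test<div>foo</div></div>"

# Greedy: .* grabs as much as possible, so it runs to the LAST </div>.
greedy = re.search(r"<div[^>]*>(.*)</div>", text).group(1)

# Lazy: .*? grabs as little as possible, so it stops at the FIRST </div>.
lazy = re.search(r"<div[^>]*>(.*?)</div>", text).group(1)

# Lazy up to the start of the first nested div.
before_nested = re.search(r"<div[^>]*>(.*?)<div", text).group(1)

print(greedy)         # Test<div>foo</div>
print(lazy)           # Test<div>foo
print(before_nested)  # Test
```

Neither form understands nesting: the lazy version stops at the inner </div>, not at the one that closes the outer div.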
Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use an HTML/XML parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...
/<div id="test"[^<]*(<([^ >]+).+<\/\2>[^<]*)*<\/div>/
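To make the parser suggestion concrete, here is a minimal sketch using html.parser from the Python standard library. The choice of language, the class name DivTextExtractor and the helper text_in_div are assumptions. It tracks <div> nesting depth, so nested divs inside the target do not end the capture early.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the <div> with a given id, nested tags included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # a nested div inside the target
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1       # entering the target div

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1      # reaches 0 when the target div closes

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_in_div(html, target_id):
    """Return the concatenated text found inside the div with id target_id."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

For the example html, text_in_div(html, "test") would contain both "Test" and "some info"; comments are skipped and the stray extra </div> is ignored.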

Writing Regex pattern for HTML tags

I'm very new to PHP writing and regular expressions. I need to write a Regex pattern that will allow me to "grab" the headlines in the following html tags:
<title>My news</title>
<h1>News</h1>
<h2 class=\"yiv1801001177first\">This is my first headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is another headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is the third headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is the last headline</h2> <p>This is a summary of a fascinating article.</p>
So I need a pattern to match all the <h2> tags. This is my first attempt at writing a pattern, and I'm seriously struggling...
/(<h+[2])>(.*?)\<\/h2>/ is what I've attempted. Help is much appreciated!
I'm not too familiar with PHP, but in cases like this it's usually easier to use an XML parser (which will automatically detect <h2> as well as <h2 class="whatever">) rather than a regex, to which you'd have to add a bunch of special cases. JavaScript, for example, has the XML DOM for exactly this purpose; I'd be surprised if PHP didn't have something similar.
The easiest way to do it via regex is
#<h2\b[^>]*>(.*?)</h2>#is
This will match any h2 tag and capture its contents in backreference $1. I've used # as a regex delimiter to avoid escaping the / later on in the regex, and the i and s options to make the regex case-insensitive and to let the dot match newlines within the tag's contents.
There are circumstances where this regex will fail, though, as pointed out correctly by others in this thread.
I have only checked in RegexBuddy, but the following regex works there (note that .* is greedy, so several headlines on one line come back as a single match):
<h2.*</h2>
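To compare the two patterns above, here is a short sketch in Python rather than PHP (an assumption made only so the example is easy to run; re.I and re.S correspond to the i and s options). The sample string is a shortened version of the html from the question.

```python
import re

html = (
    '<title>My news</title>\n<h1>News</h1>\n'
    '<h2 class="first">This is my first headline</h2> <p>Summary.</p> '
    '<h2>This is another headline</h2> <p>Summary.</p>'
)

# Lazy pattern with a capture group: one result per headline.
headlines = re.findall(r'<h2\b[^>]*>(.*?)</h2>', html, re.I | re.S)

# Greedy pattern: runs from the first <h2 to the last </h2> on the line.
greedy_matches = re.findall(r'<h2.*</h2>', html)

print(headlines)            # ['This is my first headline', 'This is another headline']
print(len(greedy_matches))  # 1
```

The \b after h2 keeps the lazy pattern from matching a hypothetical tag name that merely starts with "h2".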