I have multiple html documents and each one has many occurrences of
<a name="pIDsomestring">
where 'somestring' varies with each occurrence.
I want to delete the entire tag, as well as the
</a>
closing HTML tag that immediately follows it, but importantly, not the text inside the anchor tag.
Is there an easy way to do this with sed?
HTML is much more complicated than what can be parsed with sed. Two pieces of HTML can be absolutely equivalent, and yet look completely different as far as a sed command is concerned. For example, you can't really write a sed command that will recognize that these two are equivalent:
<a name="foo">bar</a>
<A
NAME = "foo"
><!-- </A> --bar</>-- -->
(The </>, if you're wondering, means </a> in this case. And heh, even Stack Overflow's syntax highlighter gets confused by the <!-- comment -- not-a-comment -- comment --> notation.)
The above is a pathological example, of course, but even perfectly-ordinary real-world HTML often has line-breaks and other whitespace in random places that have no effect on the HTML but a great deal of effect on a sed command.
But if you're just doing a one-off task where you can manually verify the results afterward, you can try something like this:
's#<a name="[^"]*">\(\([^<]\|<[^/]\|</[^a]\|</a[^>]\)*\)</a>#\1#g'
which will usually work as long as the whole thing is on one line.
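If leaving sed is an option, the same one-off cleanup can be sketched in Python, where the match is allowed to span line breaks. This is only a rough illustration: the function name and the sample string are made up for the example, and it is still a regex, so the caveats above about oddly formatted HTML apply.

```python
import re

# Matches <a name="pID..."> ... </a>. DOTALL lets the anchor text span lines.
# Note: IGNORECASE also makes the "pID" prefix case-insensitive.
NAMED_ANCHOR = re.compile(
    r'<a\s+name\s*=\s*"pID[^"]*"\s*>(.*?)</a\s*>',
    re.IGNORECASE | re.DOTALL,
)

def strip_named_anchors(html: str) -> str:
    """Drop the opening and closing anchor tags but keep the text between them."""
    return NAMED_ANCHOR.sub(r"\1", html)

# Hypothetical sample input, not taken from the asker's documents.
sample = '<p><a name="pID42">Chapter 1</a> and <a href="x.html">a link</a></p>'
print(strip_named_anchors(sample))
```

Anchors that carry other attributes (such as href) are left alone, because the pattern only accepts a single name attribute.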
I need to extract an address which will change on every new page from a sample like this. So I need a regex to extract 100 E Faith Ter from the following html code snippet.
<span style="..." class="addr">100 E Faith Ter<br>
<span class="locality">Maitland</span>,
<span class="region">FL</span>
<span class="postal-code">32751</span>
</span>
I am using Javascript.
You don't specify a language, and regular expressions are fairly language-agnostic, but languages differ in how they handle multi-line input. In JavaScript, /^.*$/m matches the first line.
Having updated your question to be full HTML instead of raw text, you can use:
^\<.+?\>(.+?)\<br\>$
and retrieve the first parenthesized submatch (be sure you use the multiline option)
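For comparison, here is the same pattern tried out in Python rather than JavaScript (a sketch only; the helper name is invented, and the snippet is the one from the question). The backslashes before < and > are dropped because they are not needed.

```python
import re

snippet = '''<span style="..." class="addr">100 E Faith Ter<br>
<span class="locality">Maitland</span>,
<span class="region">FL</span>
<span class="postal-code">32751</span>
</span>'''

# MULTILINE makes ^ and $ match at every line boundary,
# not only at the start and end of the whole string.
STREET = re.compile(r'^<.+?>(.+?)<br>$', re.MULTILINE)

def street_line(html):
    """Return the first parenthesized submatch, or None if nothing matches."""
    match = STREET.search(html)
    return match.group(1) if match else None

print(street_line(snippet))
```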
The Pony He Comes!!
A regex is not necessary for the whole thing. Instead, just strip all HTML tags: if you're using PHP, strip_tags does this nicely; otherwise you can use a regex to replace <[^>]+> with an empty string. You should get the plain text of the address, which you can then split into its separate lines.
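A minimal Python sketch of that strip-then-split idea, assuming the snippet from the question as input (the function name is made up for illustration):

```python
import re

snippet = '''<span style="..." class="addr">100 E Faith Ter<br>
<span class="locality">Maitland</span>,
<span class="region">FL</span>
<span class="postal-code">32751</span>
</span>'''

def address_lines(html):
    """Remove every tag, then return the non-empty lines of what is left."""
    text = re.sub(r'<[^>]+>', '', html)
    return [line.strip() for line in text.splitlines() if line.strip()]

lines = address_lines(snippet)
print(lines[0])  # the street address is the first remaining line
```

This relies on the address parts sitting on separate lines in the source, as they do in the question's snippet.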
Or you could just be this guy:
I have a tag such as the following:
<div style="position:absolute;opacity:0.5" class="header">Home</div>
(there may or may not be a style or other attribute) and using sed I need to convert it to a span where the id of the span is the class of the div:
<span style="position:absolute;opacity:0.5" id="header">Home</span>
I know how to do this in PHP but unfortunately my Linux is lacking :).
The regex to find the eligible DIVs is something along the lines of:
#<div .* id=(.*)>.*</div>#
but I don't know how to write the replacement part, mainly because I need to keep the content between the div tag name and the id. It's 4:45 am so that may have something to do with it as well :p.
I'd appreciate any help on this, thank you.
Using sed, and if you want more specific handling:
sed '/<div/{s/<div /<span /;s/ class *=/ id =/;s!</div!</span!}' input
Still, this assumes the start and close tags are on the same line and that there is a single div tag on that line. It also assumes that the line contains only one class attribute.
A more strict command is:
sed 's!<div\([^>]*\) class *= *\([^>]*\)>\([^<]*\)</div>!<span\1 id=\2>\3</span>!g' input
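As a cross-check of the stricter substitution, here is roughly the same idea written in Python. It is a sketch under a few assumptions: the class value is double-quoted, divs are not nested, and the function name is invented for the example.

```python
import re

# Groups: 1 = attributes before class, 2 = quoted class value,
# 3 = attributes after class, 4 = element content.
DIV_WITH_CLASS = re.compile(
    r'<div\b([^>]*?)\sclass\s*=\s*("[^"]*")([^>]*)>(.*?)</div>',
    re.DOTALL,
)

def div_to_span(html: str) -> str:
    """Turn <div ... class="x">...</div> into <span ... id="x">...</span>."""
    return DIV_WITH_CLASS.sub(r'<span\1 id=\2\3>\4</span>', html)

print(div_to_span('<div style="position:absolute;opacity:0.5" class="header">Home</div>'))
```

Divs without a class attribute are left untouched, and other attributes such as style are carried over unchanged.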
sed 's/div/span/;s/class/id/' foo.html
Will output
<span style="position:absolute;opacity:0.5" id="header">Home</div>
Where foo.html is your document.
PAY ATTENTION
This replaces only the first occurrence of div and class on each line (note the closing tag above is still </div>). If you want to replace all of them, add the "g" flag at the end of each substitution, like s/div/span/g
And, no less important, if you want to overwrite your document (that is, make the replacement "in place"), with GNU sed you can proceed as follows: sed -i -e 's/div/span/g;s/class/id/' foo.html
Last thing: as Basile Starynkevitch correctly says in the comments, maybe sed isn't the best choice.
I have some HTML which I want to grab between two tags. However, nested tags exist in the HTML, so looking for the closing </div> wouldn't work, as it would stop at the first nested div.
Basically I want my regex to..
Match some text literally, followed by ANY character up to another literal text string. So my question is: how do I get [^<]* to continue matching until it sees the next div?
such as
<div id="test"[^<]*<div id="test2"
Example html
<div id="test" class="whatever">
<div class="wrapper">
<fieldset>Test</fieldset><div class="testclass">some info</div>
</div>
<!-- end test div--></div>
</div>
<div id="test2" class="endFind">
In general, I suspect you want to look at "greedy" vs "lazy" in your regex, assuming that's supported by your platform/language.
For example, <div[^>]*>(.*?)</div> would make $1 match all the text inside a div, but would try to keep it as small as possible. Some people call *? a "lazy star".
But it seems you're looking to find the text within a div that is before the start of the first nested div. That would be something like <div[^>]*>(.*?)<div
Read about greedy vs lazy here and check to make sure that whatever language you're using supports it.
$ php -r '$text="<div>Test<div>foo</div></div>\n"; print preg_replace("/<div[^>]*>(.*?)<div.*/", "\$1", $text);'
Test
$
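The greedy/lazy difference is easy to see side by side. A small Python sketch using the same test string as the PHP one-liner (the variable names are just for illustration):

```python
import re

text = "<div>Test<div>foo</div></div>"

# Greedy: .* runs as far as it can, so the match ends at the LAST </div>.
greedy = re.search(r'<div[^>]*>(.*)</div>', text).group(1)

# Lazy: .*? stops as early as it can, so the match ends at the FIRST </div>.
lazy = re.search(r'<div[^>]*>(.*?)</div>', text).group(1)

# Lazy up to the next opening <div: only the text before the nested div.
before_nested = re.search(r'<div[^>]*>(.*?)<div', text).group(1)

print(repr(greedy))
print(repr(lazy))
print(repr(before_nested))
```

Neither variant understands nesting; they differ only in where the match stops.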
Regex is not capable of parsing HTML. If this is part of an application, you're doing something wrong. If you absolutely have to parse a document, use a html/xml parser.
If you're trying to screen scrape something and don't want to bother with a parser, look for identifying marks in the page you're scraping. For example, maybe the embedded div ends just before the one you want to match, so you could match </div></div> instead.
Alternatively, here's a regex that meets your requirements. However, it is very fragile: it will break if, for example, #test's children have children, or the html isn't valid, or I missed something, etc, etc ...
/<div id="test"[^<]*(<([^ >]+).+<\/\2>[^<]*)*<\/div>/
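For the parser route recommended above, here is a rough sketch using Python's standard html.parser module. It counts div depth so that nested divs do not end the capture early. The class name, function name, and sample markup are invented for the example, and the sample is a balanced variant of the question's HTML.

```python
from html.parser import HTMLParser

class DivGrabber(HTMLParser):
    """Collects the inner markup of the first <div> whose id matches."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # 0 = outside the target div
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif not self.done and tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through as-is.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:   # closing tag of the target div itself
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_comment(self, data):
        if self.depth:
            self.parts.append(f"<!--{data}-->")

def inner_html(html, target_id):
    parser = DivGrabber(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

# Hypothetical, well-formed sample modeled on the question.
sample = (
    '<div id="test" class="whatever">'
    '<div class="wrapper">'
    '<fieldset>Test</fieldset><div class="testclass">some info</div>'
    '</div>'
    '<!-- end test div--></div>'
    '<div id="test2" class="endFind"></div>'
)
print(inner_html(sample, "test"))
```

One caveat: with the parser's default settings, character references in text (for example &amp;) are decoded, so the output is not guaranteed to be byte-for-byte identical to the source.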
I'm very new to PHP writing and regular expressions. I need to write a Regex pattern that will allow me to "grab" the headlines in the following html tags:
<title>My news</title>
<h1>News</h1>
<h2 class=\"yiv1801001177first\">This is my first headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is another headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is the third headline</h2> <p>This is a summary of a fascinating article.</p> <h2>This is the last headline</h2> <p>This is a summary of a fascinating article.</p>
So I need a pattern to match all the <h2> tags. This is my first attempt at writing a pattern, and I'm seriously struggling...
/(<h+[2])>(.*?)\<\/h2>/ is what I've attempted. Help is much appreciated!
I'm not too familiar with PHP, but in cases like this it's usually easier to use an XML parser (which will automatically detect <h2> as well as <h2 class="whatever">) rather than a regex, to which you'll have to add a bunch of special cases. JavaScript, for example, has the XML DOM exactly for this purpose; I'd be surprised if PHP didn't have something similar.
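To illustrate the parser idea in a language-neutral way, here is a minimal sketch with Python's built-in html.parser (not PHP, and not necessarily what a PHP solution would look like). The class and function names are invented, and the sample is a shortened version of the question's markup.

```python
from html.parser import HTMLParser

class HeadlineCollector(HTMLParser):
    """Gathers the text of every <h2>, whatever attributes the tag carries."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headlines[-1] += data

def headlines(html):
    collector = HeadlineCollector()
    collector.feed(html)
    collector.close()
    return collector.headlines

sample = (
    '<title>My news</title><h1>News</h1>'
    '<h2 class="yiv1801001177first">This is my first headline</h2> '
    '<p>This is a summary of a fascinating article.</p> '
    '<h2>This is another headline</h2> '
    '<p>This is a summary of a fascinating article.</p>'
)
print(headlines(sample))
```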
The easiest way to do it via regex is
#<h2\b[^>]*>(.*?)</h2>#is
This will match any h2 tag and capture its contents in backreference $1. I've used # as a regex delimiter to avoid escaping the / later on in the regex, and the is options to make the regex case-insensitive and to allow newlines within the tag's contents.
There are circumstances where this regex will fail, though, as pointed out correctly by others in this thread.
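The same pattern carries over to other regex flavors. A quick Python sketch, where re.IGNORECASE and re.DOTALL play the role of the i and s options (the function name and sample are made up for the example):

```python
import re

# \b keeps the pattern from matching tags like <h20>; [^>]* skips any attributes.
HEADLINE = re.compile(r'<h2\b[^>]*>(.*?)</h2>', re.IGNORECASE | re.DOTALL)

def find_headlines(html):
    """Return the contents of every h2 element found by the regex."""
    return HEADLINE.findall(html)

sample = (
    '<h1>News</h1>'
    '<h2 class="yiv1801001177first">This is my first headline</h2> '
    '<p>This is a summary of a fascinating article.</p> '
    '<h2>This is another headline</h2>'
)
print(find_headlines(sample))
```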
I have only checked this in RegexBuddy, where the following regex works:
<h2.*</h2>
I'm a total regexp noob. I'm working with WordPress and I'm desperately trying to deal with WordPress's wpautop, which I hate and love (more hate!). Anyway, I'm trying to remove <p> tags around certain commands.
Here's what I get:
<p>
[hide]
<img.../>
[/hide]
</p>
or
<p>
[imagelist]
<img .../>
<img .../>
[/imagelist]
</p>
Here's what I'd like:
[hide]
<img.../>
[/hide]
or
[imagelist]
<img .../>
<img .../>
[/imagelist]
I've tried:
preg_replace('/<p[^>]*>(\[[^>]*\])<\/p[^>]*>/', '$1', $content); // No luck!
EDIT:
When I am doing the regexp it is still just a variable containing text.. It is not parsed as html yet. I know it is possible because I already did it with getting rid of p tags around an image tag. So I just need a regexp to handle text that will be parsed as html at some point in the future.
Here's a similar question
Thanks!
Matt Mueller
You can't use regular expressions to parse HTML, because HTML is, by definition, a non-regular language. Period, end of discussion.
The language of matching HTML tags is context-free, not regular. This means regular expressions are probably not the right tool to use here. Context-free languages require parsers rather than regular expressions. So, you can either remove ALL <p> and </p> tags with a regular expression, or you can use an HTML parser to remove matching tags from certain parts of your document.
Try this regex:
'%<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>%s'
Explanation:
\[([^\[\]]+)\] matches the opening bbcode tag and captures the tag name in group #2.
\[/\2\] matches a corresponding closing tag.
.*? matches anything, reluctantly. Thanks to the s flag at the end, it also matches newlines. The effect of the reluctant .*? is that it stops matching the first time it finds a closing bbcode tag with the right name. If tags are nested (within tags with the same name) or improperly balanced, it won't work correctly. I wouldn't expect that to be a problem, but I have no experience with WordPress, so YMMV.
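To see the pattern in action outside PHP, here is a small Python sketch that uses the same regex with re.DOTALL standing in for the s flag. The function name and the sample content are invented for the example.

```python
import re

# Group 1 = the whole [tag]...[/tag] block, group 2 = the tag name.
WRAPPED = re.compile(
    r'<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>',
    re.DOTALL,
)

def unwrap_shortcodes(content: str) -> str:
    """Remove the <p>...</p> wrapper around [tag]...[/tag] blocks only."""
    return WRAPPED.sub(r"\1", content)

# Hypothetical sample: one wrapped block and one ordinary paragraph.
content = '<p>\n[hide]\n<img src="a.png"/>\n[/hide]\n</p>\n<p>Keep me</p>'
print(unwrap_shortcodes(content))
```

Ordinary paragraphs are left alone because the pattern requires an opening [tag] immediately after the <p>, ignoring whitespace.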