Grab contents of div with regex in Powershell - regex

I have a directory of similarly structured HTML files (two examples given):
File-1.html
<html>
<body>
<div class="foo">foo</div>
<div class="bar"><div><p>bar</p></div></div>
<div class="baz">baz</div>
</body>
</html>
File-2.html
<html>
<body>
<div class="foo">foo</div>
<div class="bar"><div><p>apple<br>banana</p></div></div>
<div class="baz">baz</div>
</body>
</html>
I am trying to create a PowerShell script to return the contents of the bar div, stripped of all HTML:
For File-1.html: bar
For File-2.html: apple banana
I now have:
$directory = "C:\Users\Public\Documents\Sandbox\HTML"
foreach ($file in Get-ChildItem($directory))
{
$content = Get-Content $file.fullname
$test = [regex]::matches($content, '(?i)<div class="bar">(.*)</div>')
echo $test[0]
}
However, this returns <div class="bar"><div><p>bar</p></div></div><div class="baz">baz</div>. In other words, the regex does not stop until the last </div>. How can I make it grab only what is in the <div class="bar"> div?

By default, quantifiers are greedy: they match as much as possible while still allowing the remainder of the regular expression to match. Use *? for a non-greedy match, meaning "zero or more, preferably as few as possible". The s flag additionally lets . match newlines, in case the div spans several lines.
(?si)<div class="bar">(.*?)</div>
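The difference between the greedy and the non-greedy pattern can be sketched in a few lines. Python is used here purely for illustration (the `(?si)` pattern itself is unchanged), and the `strip_tags` helper is a hypothetical name, not part of the original answer. Note that the non-greedy match stops at the first `</div>`, which is enough here because the remaining markup is stripped anyway.

```python
import re

# Sample taken from File-2.html in the question, joined onto one line.
html = (
    '<div class="foo">foo</div>'
    '<div class="bar"><div><p>apple<br>banana</p></div></div>'
    '<div class="baz">baz</div>'
)

# Greedy: .* runs on to the LAST </div> in the input.
greedy = re.search(r'(?si)<div class="bar">(.*)</div>', html).group(1)

# Non-greedy: .*? stops at the FIRST </div> after the opening tag.
lazy = re.search(r'(?si)<div class="bar">(.*?)</div>', html).group(1)

def strip_tags(fragment):
    """Hypothetical helper: replace each tag with a space, then collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", fragment).split())

text = strip_tags(lazy)
print(greedy)
print(lazy)
print(text)
```

In PowerShell the same two steps would be a `[regex]::Match` call followed by a `-replace '<[^>]+>', ' '`.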

Related

How to remove li tags with in Particular DIV tag in notepad ++ using regex

I have content like below
<div class="content1">
<ul>
<li>line1</li>
<li>line2</li>
<li>line3</li>
</ul>
</div>
<div class="content2">
<ul>
<li>line4</li>
<li>line5</li>
<li>line6</li>
</ul>
</div>
I want to strip all li tags within the content1 div and retain the contents inside them, like below:
<div class="content1">
<ul>
line1
line2
line3
</ul>
</div>
<div class="content2">
<ul>
<li>line4</li>
<li>line5</li>
<li>line6</li>
</ul>
</div>
I have about 500 HTML files to edit. Is there a regex to achieve this in Notepad++?
You can use a regex like this
<li>(.*?)<\/li>
With the replacement string:
$1
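The regexes to match those tags are
<li>
<\/li>
Neither < nor > is special in a regex, so they need no backslash. In Notepad++ and GNU sed, \< and \> are actually word-boundary anchors, so escaping them changes what the pattern matches. Only the / needs escaping, and only where it is also the delimiter.
If you use a terminal you can use the stream editor sed:
sed 's/<li>//g; s/<\/li>//g' input.txt > output.txt
In Notepad++ you can use Find and Replace (Ctrl+H) with the search mode set to "Regular expression".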
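None of the patterns above restricts the replacement to the content1 div. One way to do that is sketched below in Python, for illustration only: match the one div first, then remove the li tags inside the matched text. The sketch assumes the content1 div contains no nested div, and `unwrap_li` is a hypothetical helper name.

```python
import re

page = """<div class="content1">
<ul>
<li>line1</li>
<li>line2</li>
</ul>
</div>
<div class="content2">
<ul>
<li>line4</li>
</ul>
</div>"""

def unwrap_li(match):
    # Remove <li> and </li> only inside the text matched by the outer pattern.
    return re.sub(r"</?li>", "", match.group(0))

# Assumption: no <div> is nested inside content1, so the first </div> closes it.
result = re.sub(r'(?s)<div class="content1">.*?</div>', unwrap_li, page)
print(result)
```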

Regex to replace all html tag except br and p tag perl

I have a string that contains many HTML tags, and I want to replace each of them with a space, except for the br and p tags. How can I do this? This is my string:
Wrong html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p> and the string is <br> after that test.
I have tried this, but it does not work as expected:
$string =~ s/(<((?!br|p)[^>]+)>)//ig;
You need to deal with closing tags:
use Modern::Perl;
my $str = 'Wrong html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p> and the string is <br> after that test.';
$str =~ s~<(?!/?\s*(?:br|p)\b)[^>]+>~~ig; # \b stops tags such as <pre> from being mistaken for <p>
say $str;
You could also use package HTML::StripTags:
use HTML::StripTags qw(strip_tags);
my $str = 'Wrong html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p> and the string is <br> after that test.';
my $allowed_tags = '<p><br>';
say strip_tags( $str, $allowed_tags );
Your substitution has an empty replacement, so the matched tags are simply deleted. Since you want a space in their place, put a space in the replacement part.
The substitution becomes:
$msg =~s/(<((?!br|p)[^>]+)>)/ /ig;
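The two answers can be combined: a lookahead that also allows for the leading slash of closing tags, and a space as the replacement. The sketch below uses Python for illustration only. It adds a `\b` word boundary (an addition, not part of the original attempt) so that a tag such as `<pre>` is not treated as `<p>`.

```python
import re

s = ('Wrong html <a> </I> <p>My paragraph</p> <i>Italics</i> '
     '<p class="blue">second</p> and the string is <br> after that test.')

# (?!...) rejects a tag whose name, after an optional "/", is exactly br or p.
pattern = re.compile(r"<(?!/?\s*(?:br|p)\b)[^>]+>", re.IGNORECASE)

# Every other tag is replaced by a single space.
cleaned = pattern.sub(" ", s)
print(cleaned)
```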

PowerShell regular expression to get all the HTML tags

I have a string with HTML tags. I have to write PowerShell script to split this string using regular expression for HTML tags both opening and closing. I have tried many times but with no luck.
<([A-Z][A-Z0-9])[^>]>
I have tried this for opening tags. But it only removes the '<' and '>' from string not the whole tag.
My string is something like this:
<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
My desired output is: some text inside. This is text inside font. this is h1 text. This is a new paragraph.
Not sure how you're doing your split, but it shouldn't be that difficult:
$Text = @'
<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
'@
$text -split '<.+?>' -match '\S'
some text inside.
this is text inside font.
this is h1 text.
This is a new paragraph.
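The same split-then-filter idea is sketched below in Python, for illustration only: split on anything that looks like a tag, then keep the pieces that contain a non-whitespace character. The names `pieces` and `joined` are illustrative.

```python
import re

text = """<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>"""

# Split on each tag, trim the surrounding newlines, and drop the empty pieces.
pieces = [p.strip() for p in re.split(r"<.+?>", text) if p.strip()]

# Join with spaces to get the single-line output the question asks for.
joined = " ".join(pieces)
print(joined)
```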

Perl regexp to find an element inside an element

I need a regular expression that matches from <div id="class1"> to its closing </div>. The element may contain any number of nested <div> elements. Here is an example:
This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example
I have tried the code below, but it only matches up to the first </div>, the one that closes <div id="subclass1">.
Could anyone help me solve this?
The pattern I tried is:
<div id="class1">(?:(?!<\/div>).)*?</div>
Use a proper HTML parser.
use strict;
use warnings;
use feature qw( say );
use XML::LibXML qw( );
my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';
my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_string($html);
my $root = $doc->documentElement();
for my $div ($root->findnodes('//div[@id="class1"]')) {
say "[", $div->toString(), "]";
}
$ echo 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example' | sed -n 's/<div id="class1">\(.*\)<\/div>/\1/p'
This is example This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is This is example
You should use an appropriate HTML/XML parser. If you want to do it with a regex for any reason, a recursive regex can help. (Check perldoc perlre for details.)
$re = qr{
(
<div[^>]*>
(?:(??{$re}) | [^<>]*)*
</div>
)
}x;
print "$1\n" if(/$re/o);
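Where the regex engine has no recursion, the same balanced matching can be approximated by counting nesting depth. The sketch below is a swapped-in technique, not the recursive regex above: Python's standard `re` module does not support recursion, so it scans the opening and closing div tags and stops when the depth returns to zero. `outer_div` is a hypothetical helper, and the sketch assumes well-formed markup with no div-like text inside comments or attribute values.

```python
import re

def outer_div(html, open_tag):
    """Return the element that starts at open_tag, including nested divs."""
    start = html.find(open_tag)
    if start == -1:
        return None
    depth = 0
    # Walk every opening or closing div tag from the start tag onwards.
    for m in re.finditer(r"<div\b[^>]*>|</div>", html[start:]):
        depth += -1 if m.group(0) == "</div>" else 1
        if depth == 0:
            # This closing tag balances the opening tag we started from.
            return html[start:start + m.end()]
    return None  # unbalanced input

html = ('This is example <div id="class1">This is <div id="subclass1">This is </div> '
        '<div id="subclass2">This is </div> This is </div> This is example')
element = outer_div(html, '<div id="class1">')
print(element)
```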
A lot of people always say "Use a proper HTML parser" to parse HTML and not regex. What some people fail to realize is that there are requirements to be met and those requirements might require regex.
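<div id=".+?">.*</div> should work for this input, because the greedy .* runs to the last </div> in the string. It will overshoot if another </div> follows the element you want.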
http://regexr.com?33336

Delete the content between HTML tags including the tags themselves in Perl

There are about 100 files, and I need to go through each of them and delete everything between <style> and </style>, including the tags themselves.
For example
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
should become
<html>
<head> <title> Example </title> </head>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
Also, in some files the style pattern is like
<style type="text/css"> blah </style>
or
<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">
I need to remove all 3 patterns. How do I do this in Perl?
use strict;
use warnings;
use XML::LibXML qw( );
my $qfn = 'a.html';
my $doc = XML::LibXML->load_html( location => $qfn );
my $root = $doc->documentElement();
for my $style_node ($root->findnodes('//style')) {
$style_node->parentNode()->removeChild($style_node);
}
{
open(my $fh, '>', $qfn)
or die;
print($fh $doc->toStringHTML());
}
It correctly handles:
style elements with attributes or spaces in the tag,
style elements that span more than one line,
style tags that span more than one line,
lines that contain part of a style element and something else,
documents with multiple style elements,
something that looks like a style tag in attribute values,
something that looks like a style tag in CDATA blocks, and
something that looks like a style tag in comments.
As of this update, the other solutions only handle 2 or 3 of these.
Ikegami is right, you really should use at least an HTML/XML parser to do this task. Personally I like using the Mojo::DOM parser. This is a Document-Object Model interface to your HTML and it supports CSS3 selectors, making it really flexible when you need it. This is a pretty easy one for it however:
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;
my $content = <<'END';
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
END
my $dom = Mojo::DOM->new( $content );
$dom->find('style')->pluck('remove');
print $dom;
The pluck method is a little confusing, but it's really just shorthand for calling a method on each resulting object. The analogous line could be
$dom->find('style')->each(sub{ $_->remove });
which is a little more understandable but less cute.
After reading your edit that you have to deal with more than just your basic form, I have to stress even further that this is why you use a parser for modifying HTML rather than letting your regex grow to ridiculous proportions.
Now let's say that the $content variable also contained these lines
<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">
<link rel="icon" href="somefile.jpg">
where you want to remove the first one, and not the second. You can do this in one of two ways.
$dom->find('link')->each( sub{ $_->remove if $_->{rel} eq 'stylesheet' } );
This mechanism uses the object methods (and Mojo::DOM exposes attributes as hash keys) to remove only the link tags which have rel=stylesheet. You can, however, use CSS3 selectors to find only those elements, and since Mojo::DOM has full CSS3 selector support you can do
$dom->find('link[rel=stylesheet]')->pluck('remove');
CSS3 selector statements can be joined with a comma to find all tags matching either selector, so we can simply include the line
$dom->find('style, link[rel=stylesheet]')->pluck('remove');
and get rid of all your offending stylesheets in one fell swoop!
One more possible solution is to use HTML::TreeBuilder.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 5; # Ensure weak references in use
foreach my $file_name (@ARGV) {
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
# print "Hey, here's a dump of the parse tree of $file_name:\n";
# $tree->dump; # a method we inherit from HTML::Element
foreach my $e ($tree->look_down(_tag => "style")) {
$e->delete();
}
foreach my $e ($tree->look_down(_tag => "link", rel => "stylesheet")) {
$e->delete();
}
print "And here it is, bizarrely rerendered as HTML:\n",
$tree->as_HTML, "\n";
# Now that we're done with it, we must destroy it.
$tree = $tree->delete; # Not required with weak references
}
One way using sed:
sed '/<style>/,/<\/style>/d' file.txt
Results:
<html>
<head> <title> Example </title> </head>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
perl -lne 'print unless(/<style>/.../<\/style>/)' your_file
tested below:
> cat temp
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
> perl -lne 'print unless(/<style>/.../<\/style>/)' temp
<html>
<head> <title> Example </title> </head>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
>
If you want to do it in place, then:
perl -i -lne 'print unless(/<style>/.../<\/style>/)' your_file
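The range operator in the one-liner behaves like a small state machine over lines. The sketch below (Python, for illustration only; `drop_style_lines` is a hypothetical name) spells that out. Like the three-dot form, it does not test for the end pattern on the line that opened the range. It also shows the limitation of matching the literal text `<style>`: an opening tag with attributes, such as `<style type="text/css">`, never starts a range.

```python
def drop_style_lines(lines):
    """Drop every line from one containing <style> through one containing </style>."""
    kept = []
    inside = False
    for line in lines:
        if inside:
            # Inside a range: drop the line, and leave the range on </style>.
            if "</style>" in line:
                inside = False
            continue
        if "<style>" in line:
            # Range starts here; the end is not tested on this same line.
            inside = True
            continue
        kept.append(line)
    return kept

sample = [
    "<head> <title> Example </title> </head>",
    "<style>",
    "p{color: red;",
    "}",
    "</style>",
    "<body>",
]
cleaned = drop_style_lines(sample)

# A tag with attributes does not contain the literal text "<style>".
untouched = drop_style_lines(['<style type="text/css"> blah </style>', "<body>"])
print(cleaned)
print(untouched)
```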
I figured out one way, you can try the following:
#! /usr/bin/perl -w
use strict;
my $line = << 'END';
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
END
$line =~ s{<style[^>]*.*?</style>.}{}gs;
print $line;