Grab contents of div with regex in Powershell

I have a directory of similarly structured HTML files (two examples given):
File-1.html
<html>
<body>
<div class="foo">foo</div>
<div class="bar"><div><p>bar</p></div></div>
<div class="baz">baz</div>
</body>
</html>
File-2.html
<html>
<body>
<div class="foo">foo</div>
<div class="bar"><div><p>apple<br>banana</p></div></div>
<div class="baz">baz</div>
</body>
</html>
I am trying to create a PowerShell script that returns the contents of the bar div, stripped of all HTML:
For File-1.html: bar
For File-2.html: apple banana
I now have:
$directory = "C:\Users\Public\Documents\Sandbox\HTML"
foreach ($file in Get-ChildItem($directory))
{
$content = Get-Content $file.fullname
$test = [regex]::matches($content, '(?i)<div class="bar">(.*)</div>')
echo $test[0]
}
However, this returns <div class="bar"><div><p>bar</p></div></div><div class="baz">baz</div>. In other words, the regex does not stop until the last </div>. How can I make it grab only what is inside the <div class="bar"> div?

By default, quantifiers are greedy: they match as much as possible while still allowing the remainder of the regular expression to match. Use *? for a non-greedy match, meaning "zero or more, preferably as few as possible". The s flag in the pattern below lets . match newlines as well, which matters when the div spans several lines.
(?si)<div class="bar">(.*?)</div>
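To make the difference concrete, here is a minimal sketch in Python (used only for illustration; the two patterns are written the same way as in the .NET regex above). The sample string and the strip_tags helper are assumptions made for this example, not part of the original script. Note that the non-greedy match stops at the first </div>, so the capture still contains inner tags, which are removed in a second step.

```python
import re

# Sample input modelled on File-2.html from the question.
html = (
    '<div class="foo">foo</div>\n'
    '<div class="bar"><div><p>apple<br>banana</p></div></div>\n'
    '<div class="baz">baz</div>\n'
)

# Greedy: .* runs on to the last </div> in the input.
greedy = re.search(r'(?si)<div class="bar">(.*)</div>', html).group(1)
# Non-greedy: .*? stops at the first </div> it can.
lazy = re.search(r'(?si)<div class="bar">(.*?)</div>', html).group(1)

def strip_tags(fragment):
    """Hypothetical helper: replace each tag with a space, then collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", fragment).split())

print(repr(greedy))
print(repr(lazy))
print(strip_tags(lazy))
```

Replacing each tag with a space rather than with nothing is what keeps "apple" and "banana" apart when they are separated only by <br>.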

Related

How to remove li tags with in Particular DIV tag in notepad ++ using regex

I have content like below
<div class="content1">
<ul>
<li>line1</li>
<li>line2</li>
<li>line3</li>
</ul>
</div>
<div class="content2">
<ul>
<li>line4</li>
<li>line5</li>
<li>line6</li>
</ul>
</div>
I want to strip all li tags within the content1 div only and keep their contents, like below:
<div class="content1">
<ul>
line1
line2
line3
</ul>
</div>
<div class="content2">
<ul>
<li>line4</li>
<li>line5</li>
<li>line6</li>
</ul>
</div>
I have about 500 HTML files to edit. Is there a regex to achieve this in Notepad++?

You can use a regex like this
<li>(.*?)<\/li>
With the replacement string:
$1

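The regexes to match those tags are
<li>
<\/li>
Neither < nor > is special in a regular expression, so they need no backslash; the / only needs one when / is also the delimiter. Avoid writing \< and \>: in some flavors, including GNU sed, those are word-boundary anchors rather than literal angle brackets.
If you use a terminal, you can use the stream editor sed:
sed -e 's/<li>//g' -e 's/<\/li>//g' input.txt > output.txt
In Notepad++, I believe you can do the same with Find and Replace (Ctrl+H) in regular expression mode.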
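Both answers above remove every li tag in the file, while the question asks to touch only one div. A minimal sketch of the two-step idea in Python follows: first match the chosen div block, then strip li tags inside that block only. The function name strip_li_in_div and the sample string are hypothetical, and the sketch assumes the target div contains no nested div elements.

```python
import re

def strip_li_in_div(html, class_name):
    """Remove <li> and </li> only inside <div class="class_name"> ... </div>."""
    block = re.compile(
        r'(<div class="' + re.escape(class_name) + r'">)(.*?)(</div>)',
        re.DOTALL,
    )

    def unwrap(match):
        # Keep the opening and closing div tags, clean only the inner text.
        inner = re.sub(r"</?li>", "", match.group(2))
        return match.group(1) + inner + match.group(3)

    return block.sub(unwrap, html)

sample = (
    '<div class="content1">\n<ul>\n<li>line1</li>\n<li>line2</li>\n</ul>\n</div>\n'
    '<div class="content2">\n<ul>\n<li>line4</li>\n</ul>\n</div>\n'
)
result = strip_li_in_div(sample, "content1")
print(result)
```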

Regex to replace all html tag except br and p tag perl

I have a string that contains many HTML tags, and I want to replace all of them except br and p with a space. How can I do this? This is my string:
Wrong html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p> and the string is <br> after that test.
I have tried this, but it does not work as expected:
$string =~ s/(<((?!br|p)[^>]+)>)//ig;

You need to deal with closing tags:
use Modern::Perl;
my $str = 'Wrong html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p> and the string is <br> after that test.';
$str =~ s~<(?!/?\s*br|/?\s*p)[^>]+>~~ig;
say $str;
You could also use package HTML::StripTags:
use HTML::StripTags qw(strip_tags);
my $str = 'Wrong html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p> and the string is <br> after that test.';
my $allowed_tags = '<p><br>';
say strip_tags( $str, $allowed_tags );

Your substitution has an empty replacement, so the tags are deleted rather than replaced. Since you want a space in place of each tag, put a space in the replacement part:
$msg =~s/(<((?!br|p)[^>]+)>)/ /ig;
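One detail the patterns above leave open is that a bare "p" in the lookahead also protects tags that merely start with p, such as pre. A hedged sketch in Python of the same negative-lookahead idea, with a word boundary added and closing tags handled, might look like this (the names TAG_EXCEPT_BR_P and blank_other_tags are made up for the example):

```python
import re

# Keep tags whose name is exactly "br" or "p", opening or closing.
# The \b stops "p" from also matching names like "pre" or "param".
TAG_EXCEPT_BR_P = re.compile(r"<(?!/?\s*(?:br|p)\b)[^>]+>", re.IGNORECASE)

def blank_other_tags(text):
    """Replace every tag except <br> and <p> with a single space."""
    return TAG_EXCEPT_BR_P.sub(" ", text)

s = ('Wrong html <a> </I> <p>My paragraph</p> <i>Italics</i> '
     '<p class="blue">second</p> and the string is <br> after that test.')
cleaned = blank_other_tags(s)
print(cleaned)
print(blank_other_tags("<pre>x</pre><p>y</p>"))
```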

PowerShell regular expression to get all the HTML tags

I have a string with HTML tags. I have to write PowerShell script to split this string using regular expression for HTML tags both opening and closing. I have tried many times but with no luck.
<([A-Z][A-Z0-9]*)[^>]*>
I have tried this for opening tags. But it only removes the '<' and '>' from string not the whole tag.
My string is something like this:
<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
My desired output is: some text inside. This is text inside font. this is h1 text. This is a new paragraph.

Not sure how you're doing your split, but it shouldn't be that difficult:
$Text = @'
<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
'@
$text -split '<.+?>' -match '\S'
some text inside.
this is text inside font.
this is h1 text.
This is a new paragraph.
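The same split-then-filter idea can be sketched in Python: split on anything that looks like a tag, then keep only the pieces that contain non-whitespace. The variable names are arbitrary, and the sketch assumes each tag sits on a single line, as in the sample.

```python
import re

text = """<Div id="div1">
<Div>
some text inside.
</Div>
<font>this is text inside font.
</font>
<h1>this is h1 text.
</h1>
<p>
This is a new paragraph.
</p>
</Div>
"""

# Split on tags, trim each piece, and drop the pieces that are empty.
pieces = [piece.strip() for piece in re.split(r"<.+?>", text)]
lines = [piece for piece in pieces if piece]
print("\n".join(lines))
```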

Perl regexp to find an element inside an element

I need a regular expression that matches from <div id="class1"> to its closing </div>. The div may contain any number of nested <div> elements. Here is an example:
This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example
I have tried the pattern below, but it only matches up to the first </div>, the one that closes <div id="subclass1">.
Could anyone help me solve this?
The pattern I tried is:
<div id="class1">(?:(?!<\/div>).)*?</div>

Use a proper HTML parser.
use strict;
use warnings;
use feature qw( say );
use XML::LibXML qw( );
my $html = 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example';
my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_string($html);
my $root = $doc->documentElement();
for my $div ($root->findnodes('//div[@id="class1"]')) {
say "[", $div->toString(), "]";
}
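For comparison, the parser-based approach can also be sketched with Python's standard library html.parser, which needs no extra modules. The class DivGrabber and the function inner_html are hypothetical names for this example. The sketch counts div nesting depth, so the capture ends at the matching </div> rather than the first one.

```python
from html.parser import HTMLParser

class DivGrabber(HTMLParser):
    """Collect the inner HTML of the first <div id=target_id>, nested divs included."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target div
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif tag == "div" and not self.parts and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                return      # closing tag of the target div itself
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def inner_html(html, target_id):
    grabber = DivGrabber(target_id)
    grabber.feed(html)
    grabber.close()
    return "".join(grabber.parts)

html = ('This is example <div id="class1">This is <div id="subclass1">This is </div> '
        '<div id="subclass2">This is </div> This is </div> This is example')
print(inner_html(html, "class1"))
```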

$ echo 'This is example <div id="class1">This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is </div> This is example' | sed -n 's/<div id="class1">\(.*\)<\/div>/\1/p'
This is example This is <div id="subclass1">This is </div> <div id="subclass2">This is </div> This is This is example

You should use appropriate HTML/XML parser. If you want to do it with regex for any reason, nested regex helps you. (Check perldoc perlre for detail.)
$re = qr{
(
<div[^>]*>
(?:(??{$re}) | [^<>]*)*
</div>
)
}x;
print "$1\n" if(/$re/o);

A lot of people always say "Use a proper HTML parser" to parse HTML and not regex. What some people fail to realize is that there are requirements to be met and those requirements might require regex.
<div id=".+?">.*</div> should work for you.
http://regexr.com?33336

Delete the content between HTML tags including the tags themselves in Perl

There are about 100 files and I need to go through each of them and delete all the data which is between <style> and </style> + delete these tags too.
For example
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
should become
<html>
<head> <title> Example </title> </head>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
Also, in some files the style pattern is like
<style type="text/css"> blah </style>
or
<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">
I need to remove all 3 patterns. How do I do this in Perl?

use strict;
use warnings;
use XML::LibXML qw( );
my $qfn = 'a.html';
my $doc = XML::LibXML->load_html( location => $qfn );
my $root = $doc->documentElement();
for my $style_node ($root->findnodes('//style')) {
$style_node->parentNode()->removeChild($style_node);
}
{
open(my $fh, '>', $qfn)
or die;
print($fh $doc->toStringHTML());
}
It correctly handles:
style elements with attributes or spaces in the tag,
style elements that span more than one line,
style tags that span more than one line,
lines that contain part of a style element and something else,
documents with multiple style elements,
something that looks like a style tag in attribute values,
something that looks like a style tag in CDATA blocks, and
something that looks like a style tag in comments.
As of this update, the other solutions only handle 2 or 3 of these.

Ikegami is right, you really should use at least an HTML/XML parser to do this task. Personally I like using the Mojo::DOM parser. This is a Document-Object Model interface to your HTML and it supports CSS3 selectors, making it really flexible when you need it. This is a pretty easy one for it however:
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;
my $content = <<'END';
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
END
my $dom = Mojo::DOM->new( $content );
$dom->find('style')->pluck('remove');
print $dom;
The pluck method is a little confusing, but it's really just shorthand for calling a method on each resulting object. The analogous line could be
$dom->find('style')->each(sub{ $_->remove });
which is a little more understandable but less cute.
After reading your edit that you have to deal with more than just your basic form, I have to stress even further that this is why you use a parser for modifying HTML rather than let your regex grow to ridiculous proportions.
Now let's say that the $content variable also contained these lines
<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">
<link rel="icon" href="somefile.jpg">
where you want to remove the first one, and not the second. You can do this in one of two ways.
$dom->find('link')->each( sub{ $_->remove if $_->{rel} eq 'stylesheet' } );
This mechanism uses the object methods (Mojo::DOM exposes attributes as hash keys) to remove only the link tags that have rel=stylesheet. However, since Mojo::DOM has full CSS3 selector support, you can also use a selector to find only those elements:
$dom->find('link[rel=stylesheet]')->pluck('remove');
CSS3 selector statements can be joined with a comma to find all tags matching either selector, so we can simply include the line
$dom->find('style, link[rel=stylesheet]')->pluck('remove');
and get rid of all your offensive stylesheets in one fell swoop!

One more possible solution is to use HTML::TreeBuilder.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 5; # Ensure weak references in use
foreach my $file_name (@ARGV) {
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
# print "Hey, here's a dump of the parse tree of $file_name:\n";
# $tree->dump; # a method we inherit from HTML::Element
foreach my $e ($tree->look_down(_tag => "style")) {
$e->delete();
}
foreach my $e ($tree->look_down(_tag => "link", rel => "stylesheet")) {
$e->delete();
}
print "And here it is, bizarrely rerendered as HTML:\n",
$tree->as_HTML, "\n";
# Now that we're done with it, we must destroy it.
$tree = $tree->delete; # Not required with weak references
}

One way using sed:
sed '/<style>/,/<\/style>/d' file.txt
Results:
<html>
<head> <title> Example </title> </head>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>

perl -lne 'print unless(/<style>/.../<\/style>/)' your_file
tested below:
> cat temp
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
> perl -lne 'print unless(/<style>/.../<\/style>/)' temp
<html>
<head> <title> Example </title> </head>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
>
If you want to do it in place, then:
perl -i -lne 'print unless(/<style>/.../<\/style>/)' your_file
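The line-range technique used by the sed and perl one-liners above can be sketched in Python as a small state machine: start dropping lines at an opening style tag and stop after the closing one. The function name drop_style_lines is hypothetical. Unlike the one-liners, this sketch also accepts attributes on the opening tag and a style element that opens and closes on one line, but like them it drops whole lines, so anything else on those lines is lost too.

```python
def drop_style_lines(lines):
    """Return the lines that are not part of a <style> ... </style> range."""
    kept = []
    inside = False
    for line in lines:
        lowered = line.lower()
        if not inside and "<style" in lowered:
            inside = True
        if inside:
            # The closing tag may be on the same line as the opening one.
            if "</style>" in lowered:
                inside = False
            continue
        kept.append(line)
    return kept

page = [
    "<html>",
    "<head> <title> Example </title> </head>",
    "<style>",
    "p{color: red;",
    "}",
    "</style>",
    '<style type="text/css"> blah </style>',
    "<body>",
    "<p> hi I'm a paragraph. </p>",
    "</body>",
    "</html>",
]
print("\n".join(drop_style_lines(page)))
```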

I figured out one way; you can try the following:
#! /usr/bin/perl -w
use strict;
my $line = << 'END';
<html>
<head> <title> Example </title> </head>
<style>
p{color: red;
background-color: #FFFF;
}
div {......
...
}
</style>
<body>
<p> hi I'm a paragraph. </p>
</body>
</html>
END
$line =~ s{<style[^>]*.*?</style>.}{}gs;
print $line;
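A whole-string substitution like the one above can be extended to the other two patterns from the question (style tags with attributes and stylesheet link tags). The following Python sketch shows one way, under the assumption that the files are simple enough for regular expressions; it does not handle the comment, CDATA and attribute-value cases listed earlier, which is where a parser is the safer choice. The names STYLE_BLOCK, STYLESHEET_LINK and remove_styles are made up for the example.

```python
import re

# <style ...> ... </style>, possibly spanning lines, plus a trailing newline.
STYLE_BLOCK = re.compile(
    r"<style\b[^>]*>.*?</style\s*>[ \t]*\n?", re.IGNORECASE | re.DOTALL
)
# <link ...> tags whose rel attribute is "stylesheet"; other links are kept.
STYLESHEET_LINK = re.compile(
    r"<link\b[^>]*\brel\s*=\s*[\"']?stylesheet\b[^>]*>[ \t]*\n?", re.IGNORECASE
)

def remove_styles(html):
    html = STYLE_BLOCK.sub("", html)
    return STYLESHEET_LINK.sub("", html)

doc = (
    "<head>\n"
    "<style>\np{color: red;\n}\n</style>\n"
    '<style type="text/css"> blah </style>\n'
    '<link rel="stylesheet" type="text/css" href="$url_path/gridsorting.css">\n'
    '<link rel="icon" href="somefile.jpg">\n'
    "</head>\n"
)
print(remove_styles(doc))
```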
