Deleting HTML-Blocks with Regular-Expression - regex

I am trying to delete all HTML blocks that are closed.
For example, the following block should be deleted, because it has both an opening and a closing tag (<> ... </>):
<b> some text </b>
A block that is not closed (it lacks </>) should not be deleted.
Below is a snippet of the HTML to process:
<div id="MyDiv">div,
<strong>
<span>span2, </span> <-- This should be deleted
<em> Some text for em
<div> Some text for div </div> <-- This should be deleted
<p><b>b, <span id="MySpan"> Some text for span ...
After processing, it should look something like this:
<div id="MyDiv">div,
<strong>
<em> Some text for em
<p><b>b, <span id="MySpan">span1,
I need a regular expression to accomplish this, e.g. something like the following:
var sHTML = $('#MyDiv').html();
sHTML = sHTML.replace(/^<.*>.*?<\/.*>/ig, '');
Thanks in advance.

<([^>]*)>[^><]*<\/\s*\1\s*>|<(\w+)\s+[^>]*>[^><]*<\/\s*\2\s*>
Try this. Replace with an empty string.
See demo.
http://regex101.com/r/hQ1rP0/79
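The backreference idea in the pattern above can be sketched in Python. This is only an illustration under stated assumptions: the names CLOSED_LEAF and strip_closed_blocks are made up for this example, and the pattern is a simplified variant that only removes elements whose content contains no further tags, so it is not a general HTML parser.

```python
import re

# An opening tag, text that contains no other tag, and a closing tag with the
# same name. Group 1 captures the tag name so that \1 can demand a matching close.
CLOSED_LEAF = re.compile(r"<(\w+)\b[^>]*>[^<>]*</\1\s*>", re.IGNORECASE)


def strip_closed_blocks(html: str) -> str:
    """Remove innermost closed elements; unclosed opening tags are left alone."""
    return CLOSED_LEAF.sub("", html)
```

Calling strip_closed_blocks on a line such as <em> Some text <div> inner </div> removes only the closed <div> element and keeps the unclosed <em>.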

Never mind, this should work for every case (I am fairly sure):
(<[^>]*>[^<]*<[^>]*>)

Assuming your html is in a file called test.html, here's a perl one-liner:
perl -pi -e 's/<.*>.*<\/.*>//g' test.html

Related

Regex -> detect a pattern -> move it to the start of the line

This is my first time using this platform; I have not been able to find a solution on my own.
I have this html code:
<img ...></img><a ...><span ...
I need this:
<a ...><img ...></img><span ...
Here ... stands for content that varies (the pattern would be something like <img.*.</img>), because this will be done in bulk and the information changes. The file has this format:
<img ...></img><a ...><span ...
.....
<img ...></img><a ...><span ...
.....
<img ...></img><a ...><span ...
.....
<img ...></img><a ...><span ...
As you can guess, I need to put the <img> tag inside the <a> tag. I tried to take the pattern <a.*.> and move it to the beginning of the line but I have not succeeded.
You generally should not use regex to manipulate HTML content, which may be nested and have other complexities. However, assuming your <img> and <a> tags are always just one level deep, you could try the following find and replace in sed:
echo "<img ...></img><a ...><span ..." | sed 's/\(<img[^>]*><\/img>\)\(<a[^>]*>\)/\2\1/'
This prints:
<a ...><img ...></img><span ...
Here is a more general solution, also easier to read:
Find: (<img[^>]*><\/img>)(<a[^>]*>)
Replace: $2$1
This solution simply captures, in two separate groups $1 and $2, the <img> and <a> tags. Then, in the replacement, it swaps the two tags to give you the order you want.
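The same capture-and-swap can be sketched in Python. The names IMG_THEN_A and move_img_into_anchor are hypothetical, and the sketch assumes, like the answer above, that each <img ...></img> pair is immediately followed by an <a ...> tag with no nesting.

```python
import re

# Group 1: an <img ...></img> pair. Group 2: the <a ...> tag right after it.
IMG_THEN_A = re.compile(r"(<img[^>]*></img>)(<a[^>]*>)")


def move_img_into_anchor(text: str) -> str:
    """Swap the two captured groups so the <a> tag comes before the <img>."""
    return IMG_THEN_A.sub(r"\2\1", text)
```

Because sub replaces every non-overlapping match, the whole file can be passed in as one string.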
In the end I solved it like this:
sed -i -E "s/(<img.*)(<a .*.>)/\2\1/" file.txt

Find everything, including other tags, with regex

I am trying to match everything from one div to the start of another element, including everything in between, even if there are line breaks and even if there are other tags and PHP functions in between.
I want the search to start at <div="constant-strip"> and end at <?php include....
I want to delete everything inside that div and everything that comes after <div="constant-strip"> until it reaches the <?php include,
even though there are other <?php and <div> tags in between.
I have searched everywhere, but in all the regex and wildcard searches I can find, people want to stop at the end of the div, leave out the divs inside it, or only match text.
I have tried ([^<]), ([\s\S])+ and similar patterns, but none of them work.
Basically, I want to change this:
<div id="constant_strip" class="clearfix">
<div class="clearfix"><a href="<?php echo $division ?>_brands.php">
<img src="images/people.png" width="22" height="21" style="<?php echo $stripColour ?>" />View our suppliers</a></div>
<div id="call">Call us: 021 323 4088</div>
</div>
</div>
<?php include('/footer.php'); ?>
and turn it into just this: <?php include('/footer.php'); ?>
The problem is that the content is not exactly the same on every page.
The following regex will match the middle part that you want to replace:
(?<=<div id="constant_strip" class="clearfix">)[\s\S]*?(?=<\?php include)
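As a rough check of how this lookaround pattern behaves, here is a small Python sketch. The function names are invented for the example. Note that a lookbehind does not consume the opening <div ...> tag, so the first function leaves that tag in place; the second variant is an assumption about what the asker wants and makes the tag part of the match so that it is removed as well.

```python
import re

# Lookbehind and lookahead only assert context: the match itself is just the
# text between the opening div tag and the "<?php include" marker.
BETWEEN = re.compile(
    r'(?<=<div id="constant_strip" class="clearfix">)[\s\S]*?(?=<\?php include)'
)

# Variant that consumes the opening div tag too, leaving only the include.
WITH_DIV = re.compile(
    r'<div id="constant_strip" class="clearfix">[\s\S]*?(?=<\?php include)'
)


def strip_after_div(page: str) -> str:
    """Delete what follows the opening div, up to the include statement."""
    return BETWEEN.sub("", page)


def strip_with_div(page: str) -> str:
    """Delete the opening div and everything up to the include statement."""
    return WITH_DIV.sub("", page)
```

The lazy [\s\S]*? stops at the first <?php include, so other <?php echo ... ?> blocks inside the div do not end the match early.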

Remove file name portion of local file URL

I have an HTML document that includes links to a hundred or so local files. I want to use sed, awk or perl (in that order of preference) to remove the filename portion of each URL, that is, everything after the last slash. The example below shows only the portion of the HTML that forms the path to the local file.
Example:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
After Processing Example:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
In testing I have tried different regular expression combinations, but I only manage to match either ".dmg" alone or ".dmg" plus everything to its left, whereas I only want to remove the "SoftwarePackageName.dmg" portion. In some cases the file is "SoftwarePackageName.zip", and there may be a space, shown as "%20", in the "CompanyName" or "SoftwarePackageName.dmg" part. I have also reviewed the "Questions that may already have your answer" shown when creating this post.
EDIT: I appreciate the time taken to try to help, and I understand the difficulty when, due to policy, I cannot provide more than the example I did. As such I'll just edit the HTML document manually. I've already taken too much of my own time and others' on this, and will just have to do more reading on regex for next time. Thanks to all who contributed. :)
You could try the below sed command.
sed 's/\(<a href="[^."]*\/\)[^."\/]*\.[^."\/]*">/\1">/g' file
modded
I've deleted the previous sed regex (I have no way to test it).
Instead, I'm posting an expanded (verbose) regex that should help get you started.
# Unknown extension: (<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.[^/."'>]+)\2
# Known extension: (<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.dmg\b[^/."'>]*)\2
# Replacement: $1$2
( # (1 start), Tag and Url part to keep
<a \s+ [^>]*? href \s* = \s*
( ["'] ) # (2), Quote
[^>]*?
/ # End of directories
) # (1 end)
( # (3 start), Throw away filename
[^/."'>]+ # - Filename (not /."'> chars)
\. # - Dot
# - Extension and parameters
# ----------------------------
# Use one of these lines (but not both)
# Known extensions ->
#dmg \b [^/."'>]*
# Unknown extensions ->
[^/."'>]+
) # (3 end)
\2 # Backref to Quote
sed should be able to use much the same substitution structure, s///g.
You may have to escape the parenthesis metacharacters, but I think that is all this regex needs. The regexes above are in their raw state.
Here they are used in a sample Perl program. The same could easily be done with Perl from the command line.
use strict;
use warnings;
$/ = undef;
my $html = <DATA>; # slurp in the entire file
my $htmlcopy = $html;
$html =~ s|(<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.[^/."'>]+)\2|$1$2|g;
print "Replaced using Unknown extensions:\n", $html, "\n";
$htmlcopy =~ s|(<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.dmg\b[^/."'>]*)\2|$1$2|g;
print "Replace using Known extensions:\n", $htmlcopy, "\n\n";
__DATA__
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/Standardized-Test-for-English-Proficiency-(STEP).aspx">
<a rel="nofollow" class="external free" href="http://www.ielts.org/about_us.aspx">
<a href="/w/index.php?title=IELTS&redirect=no" title="IELTS">
<a href="/wiki/File:IELTS_logo.svg" class="image">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=1" title="Edit section: IELTS characteristics">
<a href="/w/index.php?title=Band_score&action=edit&redlink=1" class="new" title="Band score (page does not exist)">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=2" title="Edit section: IELTS test structure">
<a href="/wiki/University_of_St._Andrews" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/search.aspx">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/language-requirements.html#toc05">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
Output >>
Replaced using Unknown extensions:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/">
<a rel="nofollow" class="external free" href="http://www.ielts.org/">
<a href="/w/" title="IELTS">
<a href="/wiki/" class="image">
<a href="/w/" title="Edit section: IELTS characteristics">
<a href="/w/" class="new" title="Band score (page does not exist)">
<a href="/w/" title="Edit section: IELTS test structure">
<a href="/wiki/" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
Replace using Known extensions:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/Standardized-Test-for-English-Proficiency-(STEP).aspx">
<a rel="nofollow" class="external free" href="http://www.ielts.org/about_us.aspx">
<a href="/w/index.php?title=IELTS&redirect=no" title="IELTS">
<a href="/wiki/File:IELTS_logo.svg" class="image">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=1" title="Edit section: IELTS characteristics">
<a href="/w/index.php?title=Band_score&action=edit&redlink=1" class="new" title="Band score (page does not exist)">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=2" title="Edit section: IELTS test structure">
<a href="/wiki/University_of_St._Andrews" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/search.aspx">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/language-requirements.html#toc05">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
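For comparison, the "unknown extension" regex can be written with Python's re.VERBOSE flag, which allows the same commented layout as the expanded form above. This is a sketch only; HREF_FILENAME and drop_href_filenames are names invented for the example.

```python
import re

HREF_FILENAME = re.compile(
    r"""
    (                          # (1) tag and URL part to keep
      <a \s+ [^>]*? href \s* = \s*
      ( ["'] )                 # (2) opening quote
      [^>]*?
      /                        # end of directories
    )
    ( [^/."'>]+ \. [^/."'>]+ ) # (3) file name, dot, extension: thrown away
    \2                         # the same quote character closes the value
    """,
    re.VERBOSE,
)


def drop_href_filenames(html: str) -> str:
    """Keep group 1 and re-add the closing quote; group 3 is discarded."""
    return HREF_FILENAME.sub(r"\1\2", html)
```

As with the Perl version, links without a slash or without a dot in the last path component are left unchanged.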
Try this:
sed 's|\(<a href="file:///[^>]*/\).*">|\1">|g'
Demo:
$ sed 's|\(<a href="file:///[^>]*/\).*\.\(dmg\|zip\)">|\1">|g' <<EOF
> <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
> foo bar <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg"> baz quux
> EOF
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
foo bar <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/"> baz quux
First, I want to say again how much I truly appreciate the time taken by those who tried to help! Second, it pains me to say that nothing presented here worked in the real-world application. I attribute that, at least in part, to none of you having the actual file I wanted to modify, and I am sorry I was not allowed to provide it. Your demos worked, but unfortunately they were not representative of the actual HTML in the document. The "Generator" was "Cocoa HTML Writer", converting from an RTF document, and that may have had something to do with it; I am not sure at the moment. Even when I took one complete line that included the example code, placed it by itself in a file and processed it, all of the solutions presented failed. I wish I could provide the file or take the time to figure out why they fail in this real-world use, but I can't.
Some background on the document: when it was originally created as an RTF document in TextEdit, the FQP of the target file was included because that version of OS X would open the target file. Later versions of OS X only open Finder at the location of the target file, so there is no longer a need for the FQP of the target file, just the path to its location. This actually makes the RTF document easier to update over time. At times the RTF document is exported to HTML, modified, and then saved as RTF again. As I mentioned earlier, the "Generator" being "Cocoa HTML Writer" from an RTF document in TextEdit may be partly to blame for the proposed solutions failing.
Anyway, the reason for my long-winded reply is to put this issue in proper perspective and to explain how I resolved it. As I mentioned previously, I was going to edit the file manually; however, after the generous help already offered, I wanted to find an automated solution, and I did.
The primary constant was the example code presented earlier, so, focusing only on it, here is the command line I used to process the file:
grep -o 'file:///[^"]*' Build_Out_Template.html | rev | cut -d / -f 1 | rev | while read LINE; do sed -i "s/${LINE}//" Build_Out_Template.html; done
Using grep -o 'file:///[^"]*' let me extract just the target portion of each line in the document. I piped that through rev to reverse the character order, then through cut, which gave me only the portion up to the first slash in the reversed line (that is, after the last slash in the original line), and then through rev again to restore the original character order. The result was piped into a loop in which sed used a very simple instruction, rather than a complex regex, with literally just the file name (SoftwarePackageName.dmg, etc.). Although this took much more time than editing the file manually, I took it as a challenge. I will remember that the outside-the-box solution is sometimes faster and easier, and I will keep it in mind for other applications if and when needed.
Thanks again to all who tried to help, it's truly appreciated.
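The extract-then-replace idea behind that pipeline can be sketched in Python as follows. This is an illustration, not a drop-in replacement for the command above: the function name drop_file_names is hypothetical, and the sketch replaces each whole URL (together with its closing double quote) instead of deleting the bare file name, which is an assumption made here to avoid touching the same name elsewhere in the document.

```python
import re


def drop_file_names(html: str) -> str:
    """Cut every file:/// URL back to its directory, keeping the last slash."""
    # Same first step as grep -o 'file:///[^"]*': collect each file URL.
    for url in set(re.findall(r'file:///[^"]*', html)):
        # Split at the last slash, like the rev | cut | rev trick.
        directory, _, name = url.rpartition("/")
        if name:  # URLs that already end in a slash are left alone
            # Include the closing quote so a URL that is a prefix of a
            # longer URL is not rewritten by accident.
            html = html.replace(url + '"', directory + '/"')
    return html
```

The replacement is a plain string substitution, so characters such as % or . in the path need no escaping.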

Strip everything except for the complete anchor tag - Perl

I need to parse an HTML file and remove everything except the anchor tags in their entirety. For example:
<html>
<body>
<p>boom</p>
Example
</body>
</html>
I only need to keep:
Example
I am using cURL to retrieve the html and a small snippet of code I found that strips everything but the anchor text of the tag. This is what I am using:
curl http://www.google.com 2>&1 | perl -pe 's/\<.*?\>//g'
Is there a simple command-line way to do this? My end goal is to put this into a bash script and execute it. I am having a very difficult time understanding regular expressions and Perl.
Using the Mojolicious command line tool, mojo:
mojo get http://www.google.com 'a'
Outputs:
<a class="gb1" href="http://www.google.com/imghp?hl=en&tab=wi">Images</a>
<a class="gb1" href="http://maps.google.com/maps?hl=en&tab=wl">Maps</a>
<a class="gb1" href="https://play.google.com/?hl=en&tab=w8">Play</a>
<a class="gb1" href="http://www.youtube.com/?tab=w1">YouTube</a>
<a class="gb1" href="http://news.google.com/nwshp?hl=en&tab=wn">News</a>
<a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>
<a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a>
<a class="gb1" href="http://www.google.com/intl/en/options/" style="text-decoration:none"><u>More</u> »</a>
<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>
<a class="gb4" href="/preferences?hl=en">Settings</a>
<a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&continue=http://www.google.com/" id="gb_70" target="_top">Sign in</a>
Install Google Chrome
Advanced search
Language tools
Chromebook: For students
Advertising Programs
Business Solutions
+Google
About Google
Privacy & Terms
For a helpful 8 minute introductory video, check out: Mojocast Episode 5
Using Mojolicious, as @Miller did above, but selecting more precisely only the <a ... rel= tags:
If you have an HTML file:
perl -Mojo -E 'say $_ for x(b("my.html")->slurp)->find("a[rel]")->each'
or, for an online resource:
perl -Mojo -E 'say $_ for g("http://example.com")->dom->find("a[rel]")->each'
#or
perl -Mojo -E 'g("http://example.com")->dom->find("a[rel]")->each(sub{say $_})'
If you want more granular control over your HTML, you can use the HTML::TagParser module, available on CPAN.
use strict;
use warnings;
use HTML::TagParser;
my $html = HTML::TagParser->new( '<html>
<body>
<p>boom</p>
Example
</body>
</html>' );
my @list = $html->getElementsByTagName( "a" );
for my $elem ( @list ) {
my $name = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
print "<$name";
for my $key ( sort keys %$attr ) {
print " $key=\"$attr->{$key}\"";
}
print $text eq "" ? " />" : ">$text</$name>" , "\n";
}
Output:
Example
Ingy döt Net's pQuery deserves a mention:
perl -MpQuery -E 'pQuery("http://www.ubu.com/sound/barthes.html")
->find("a")->each(sub{say pQuery($_)->toHtml})'
Just the links:
perl -MpQuery -E 'pQuery("http://www.ubu.com/sound/barthes.html")
->find("a")->each(sub{say $_->{href}})'
Although - unlike mojo - there's no command line tool (i.e. not yet - it's not that kind of tool per se and is still "under construction"), it's a module to have on your watch list.
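The parser-based approach in these answers can also be sketched with only Python's standard library. The class and function names are invented for this example, and the sample href used in the usage note is hypothetical. Because the parser decodes character references in text, the output may not be byte-identical to the input.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects each <a ...>...</a> element, including nested tags, as text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.anchors = []
        self._parts = None  # pieces of the anchor being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._parts = [self.get_starttag_text()]
        elif self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        self._parts.append(f"</{tag}>")
        if tag == "a":
            self.anchors.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


def extract_anchors(html: str) -> list:
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors
```

For instance, extract_anchors('<p>boom</p><a href="example.html">Example</a>') yields a one-element list holding the complete anchor tag.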

delete html comment tags using regexp

This is what my text (HTML) file looks like:
<!--
| |
| This is a dummy comment |
| please delete me |
| asap |
| |
________________________________
| -->
this is another line
in this long dummy html file...
please do not delete me
I'm trying to delete the comment using sed:
cat file.html | sed 's/.*<!--\(.*\)-->.*//g'
It doesn't work :( What am I doing wrong?
Thank you very much for your help!
patrickmdnet has the correct answer. Here it is on one line using extended regex:
cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'
Here is a good resource for learning more about sed. This sed is an adaptation of one-liner #92
http://www.catonmat.net/blog/sed-one-liners-explained-part-three/
One problem with your original attempt is that your regex only handles comments that are entirely on one line. Also, the leading and trailing ".*" will remove non-comment text.
You would be better off using existing code instead of rolling your own.
http://sed.sourceforge.net/grabbag/scripts/strip_html_comments.sed
#! /bin/sed -f
# Delete HTML comments
# i.e. everything between <!-- and -->
# by Stewart Ravenhall <stewart.ravenhall@ukonline.co.uk>
/<!--/!b
:a
/-->/!{
N
ba
}
s/<!--.*-->//
(from http://sed.sourceforge.net/grabbag/scripts/)
See this link for various ways to use perl modules for removing HTML comments (using Regexp::Common, HTML::Parser, or File::Comments.) I am sure there are methods using other utilities.
http://www.perlmonks.org/?node_id=500603
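The point about multi-line comments can be illustrated in Python, where the whole file is read as one string and the dot is allowed to match newlines. This is a minimal sketch with made-up names; like the sed scripts, it does not try to handle comment-like text inside <script> blocks or attribute values.

```python
import re

# re.DOTALL lets "." match newlines, so a comment may span several lines.
# The lazy ".*?" stops at the first "-->" instead of the last one.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html: str) -> str:
    """Remove every <!-- ... --> comment, keeping the text around it."""
    return COMMENT.sub("", html)
```

Text before and after a comment on the same line is preserved, which avoids the problem caused by the leading and trailing .* in the original attempt.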
I think you can do this with awk if you want. Start:
[~] $ more test.txt
<!--
An HTML style comment
-->
Some other text
<div>
<p>blah</p>
</div>
<!-- Whoops
Another comment -->
<span>Something</span>
Result of the awk:
[~]$ cat test.txt | awk '/<!--/ {off=1} /-->/ {off=2} /([\s\S]*)/ {if (off==0) print; if (off==2) off=0}'
Some other text
<div>
<p>blah</p>
</div>
<span>Something</span>
Improving (hopefully) on the awk-based answer provided by eldarerathis --
The code below addresses the concern raised by john-jones.
In this version, the prefix leading up to the start of the html comment is preserved, as is the suffix following the close of the html comment.
$ cat some-file | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'
for example
$ cat test.txt
<!--
An HTML style comment
-->
<meta charset="utf-8"> <!-- charset encoding must be within the first 1024 bytes of the document -->
Some other text
<div>
<p>blah</p>
</div>
<!-- Whoops
Another comment -->
<span>Something</span>
<div> <!-- start of foo -->
foo
</div> <!-- end of foo -->
<div> <!-- start of multiline comment
bar
end of multiline comment --> </div>
$ cat test.txt | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'
Some other text
<div>
<p>blah</p>
</div>
<span>Something</span>
<meta charset="utf-8">
<div>
foo
</div>
<div> </div>
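The mode-tracking logic of the awk scripts can also be written as an explicit line-by-line state machine. The Python sketch below is an illustration with a hypothetical function name; unlike the awk version, it emits the kept prefix and suffix of a multi-line comment on their own lines and keeps the original line order, and it drops lines that end up empty.

```python
def strip_comments_by_line(lines):
    """Filter out HTML comments while keeping text before and after them.

    `lines` is an iterable of lines without trailing newlines.
    """
    in_comment = False
    kept_lines = []
    for line in lines:
        kept = ""
        rest = line
        while rest:
            if in_comment:
                end = rest.find("-->")
                if end == -1:
                    rest = ""  # the comment continues on the next line
                else:
                    rest = rest[end + 3:]  # resume after the closing marker
                    in_comment = False
            else:
                start = rest.find("<!--")
                if start == -1:
                    kept += rest
                    rest = ""
                else:
                    kept += rest[:start]  # keep the prefix before the comment
                    rest = rest[start + 4:]
                    in_comment = True
        if kept.strip():
            kept_lines.append(kept)
    return kept_lines
```

Because the inner loop keeps scanning the remainder of a line, several comments on one line are handled as well.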