Regex -> detect a pattern -> move it to the start of the line

This is the first time I have used this platform, because I have been unable to find a solution myself.
I have this html code:
<img ...></img><a ...><span ...
I need this:
<a ...><img ...></img><span ...
Where ... would be the content of the pattern (like <img.*.</img>) because it will be done in a bulk way and the information changes. The file has this format:
<img ...></img><a ...><span ...
.....
<img ...></img><a ...><span ...
.....
<img ...></img><a ...><span ...
.....
<img ...></img><a ...><span ...
As you can guess, I need to put the <img> tag inside the <a> tag. I tried to take the pattern <a.*.> and move it to the beginning of the line but I have not succeeded.

You generally should not use regex to manipulate HTML content, which may be nested and have other complexities. However, assuming your <img> and <a> tags are always at just one level, you could try the following find and replace in sed:
echo "<img ...></img><a ...><span ..." | sed 's/\(<img[^>]*><\/img>\)\(<a[^>]*>\)/\2\1/'
This prints:
<a ...><img ...></img><span ...
Here is a more general solution, also easier to read:
Find: (<img[^>]*><\/img>)(<a[^>]*>)
Replace: $2$1
This solution simply captures, in two separate groups $1 and $2, the <img> and <a> tags. Then, in the replacement, it swaps the two tags to give you the order you want.
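The same capture-and-swap idea can be sketched in Python. This is only an illustration under the same assumption as above (each <img ...></img> is immediately followed by a single-level <a ...> tag); the function name and the sample attribute values are made up for the example.

```python
import re

# Group 1: the whole <img ...></img> element; group 2: the opening <a ...> tag.
IMG_THEN_ANCHOR = re.compile(r'(<img[^>]*></img>)(<a[^>]*>)')

def move_img_into_anchor(line):
    """Swap the two captured groups so the <a ...> tag comes first."""
    return IMG_THEN_ANCHOR.sub(r'\2\1', line)

# Hypothetical sample line; the attribute values are placeholders.
print(move_img_into_anchor('<img src="x.png"></img><a href="y"><span>z'))
```

Using [^>]* instead of .* keeps each group inside a single tag, so a line containing several <img>/<a> pairs is handled pair by pair.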

In the end I solved it like this:
sed -i -E "s/(<img.*)(<a .*.>)/\2\1/" file.txt

Related

How to use sed to safely find and replace every instance of a regex match?

Let's say I have an html file that contains the following scenarios;
1. <p style="1">test</p>
2. <p style="2"><p style="3">test</p></p>
3. <p style="4">test</p><p style="5">test</p>
4. <td style="6"><p style="7">test</p></td>
5. <td style="8"><p style="9">test</p><p style="10">test</p></td>
I want to develop a way to find each instance of <p style="test"> and replace it with <p>. I already know that if I wanted to find each one, I would use a regex like <p .+?> or something similar <p .+?(?=>)> which would get me anything that starts with <p contains any character after that, and ends in >.
Here's what I've tried so far;
sed -r 's/<p .+?>\b/<p>/'
While this works for scenario one and four just fine, it starts to get very questionable on every other scenario that would contain more than one <p ...>.
sed -r 's/\b<p .+?>\b/<p>/'
This doesn't work at all.
I won't list every possible thing I've tried here as I don't think it would bring any meaningful data to someone versed in sed. I know very little about how to use it and what its capabilities are.
What's the best way to go about this? Thanks!
As mentioned in a comment, a tool that actually understands HTML is a better choice than trying to hack something together with regular expressions.
Example Perl script using the HTML::TreeBuilder module that strips style attributes from p tags:
#!/usr/bin/env perl
use warnings;
use strict;
use HTML::TreeBuilder;

# Takes the HTML file to process as a command line argument; outputs on
# standard output.
my $tree = HTML::TreeBuilder->new_from_file($ARGV[0]);
die "Unable to parse '$ARGV[0]': $!\n" unless defined $tree;

# Remove style attributes from all p tags that have one
foreach my $tag ($tree->look_down('style', qr//)) {
    $tag->attr('style', undef) if $tag->tag eq 'p';
}
print $tree->as_HTML(undef, ' ');
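For comparison with the sed attempts in the question: POSIX regular expressions, which sed uses, have no lazy quantifiers, so .+? there does not mean "as few characters as possible" and the match runs on to the last > on the line. A negated character class avoids the problem. The sketch below shows the difference in Python; it is still a regex and not an HTML parser, so it assumes no > appears inside an attribute value. The function name is made up for the example.

```python
import re

def strip_p_attributes(html):
    """Replace every <p ...> opening tag with a bare <p>."""
    # [^>]* cannot run past the end of the tag, so each <p ...> is matched on its own.
    return re.sub(r'<p\s[^>]*>', '<p>', html)

line = '<p style="4">test</p><p style="5">test</p>'

# A greedy .+ swallows everything up to the last '>' on the line.
print(re.sub(r'<p .+>', '<p>', line))

# The negated character class handles each tag separately.
print(strip_p_attributes(line))
```

Because the pattern requires whitespace after <p, tags such as <pre ...> are left alone.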

Remove file name portion of local file URL

I have an HTML document which includes links to a hundred or so local files. I want to use either sed, awk or perl (in that preferred order) to remove the filename portion of the URL, back to the last slash (/) in the URL. In the example below I'm only showing a portion of the HTML code forming the path of the local file.
Example:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
After Processing Example:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
In testing I have tried different regular expression combinations to accomplish this however I'm only getting ".dmg" or it and everything to the left of .dmg and I really only want to remove the "SoftwarePackageName.dmg" portion. BTW In some cases it's "SoftwarePackageName.zip" and there may be a space in the "CompanyName" or "SoftwarePackageName.dmg" shown as "%20". I've also reviewed "Questions that may already have your answer" shown when creating this post.
EDIT: I appreciate the time taken to try and help and certainly understand the difficulty when, due to policy, I cannot provide more than the example I did, and as such I'll just manually edit the HTML document. I've already taken too much of my time and others' on this. Will just have to do more reading on regex for the next time. Thanks to all that contributed. :)
You could try the below sed command.
sed 's/\(<a href="[^."]*\/\)[^."\/]*\.[^."\/]*">/\1">/g' file
modded
I've deleted the previous sed regex (I have no way to test it).
Instead, I'm posting an expanded (verbose) regex that should help get you started.
# Unknown extension: (<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.[^/."'>]+)\2
# Known extension: (<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.dmg\b[^/."'>]*)\2
# Replacement: $1$2
( # (1 start), Tag and Url part to keep
<a \s+ [^>]*? href \s* = \s*
( ["'] ) # (2), Quote
[^>]*?
/ # End of directories
) # (1 end)
( # (3 start), Throw away filename
[^/."'>]+ # - Filename (not /."'> chars)
\. # - Dot
# - Extension and parameters
# ----------------------------
# Use one of these lines (but not both)
# Known extensions ->
#dmg \b [^/."'>]*
# Unknown extensions ->
[^/."'>]+
) # (3 end)
\2 # Backref to Quote
Sed should use much the same s///g substitute structure.
You may have to escape the parenthesis metacharacters, but I think that's
all this regex needs. The regexes above are in their raw state.
Here they are used in a sample Perl program. The same could easily be done using Perl from the command line.
use strict;
use warnings;
$/ = undef;
my $html = <DATA>; # slurp in the entire file
my $htmlcopy = $html;
$html =~ s|(<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.[^/."'>]+)\2|$1$2|g;
print "Replaced using Unknown extensions:\n", $html, "\n";
$htmlcopy =~ s|(<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.[^/."'>]+\.dmg\b[^/."'>]*)\2|$1$2|g if 0;
$htmlcopy =~ s|(<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.dmg\b[^/."'>]*)\2|$1$2|g;
print "Replace using Known extensions:\n", $htmlcopy, "\n\n";
__DATA__
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/Standardized-Test-for-English-Proficiency-(STEP).aspx">
<a rel="nofollow" class="external free" href="http://www.ielts.org/about_us.aspx">
<a href="/w/index.php?title=IELTS&redirect=no" title="IELTS">
<a href="/wiki/File:IELTS_logo.svg" class="image">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=1" title="Edit section: IELTS characteristics">
<a href="/w/index.php?title=Band_score&action=edit&redlink=1" class="new" title="Band score (page does not exist)">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=2" title="Edit section: IELTS test structure">
<a href="/wiki/University_of_St._Andrews" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/search.aspx">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/language-requirements.html#toc05">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
Output >>
Replaced using Unknown extensions:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/">
<a rel="nofollow" class="external free" href="http://www.ielts.org/">
<a href="/w/" title="IELTS">
<a href="/wiki/" class="image">
<a href="/w/" title="Edit section: IELTS characteristics">
<a href="/w/" class="new" title="Band score (page does not exist)">
<a href="/w/" title="Edit section: IELTS test structure">
<a href="/wiki/" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
Replace using Known extensions:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/Standardized-Test-for-English-Proficiency-(STEP).aspx">
<a rel="nofollow" class="external free" href="http://www.ielts.org/about_us.aspx">
<a href="/w/index.php?title=IELTS&redirect=no" title="IELTS">
<a href="/wiki/File:IELTS_logo.svg" class="image">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=1" title="Edit section: IELTS characteristics">
<a href="/w/index.php?title=Band_score&action=edit&redlink=1" class="new" title="Band score (page does not exist)">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=2" title="Edit section: IELTS test structure">
<a href="/wiki/University_of_St._Andrews" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/search.aspx">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/language-requirements.html#toc05">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
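The expanded "unknown extension" regex above carries over almost unchanged to Python's re.VERBOSE mode, which also allows whitespace and comments inside the pattern. This is a sketch for illustration, not a tested replacement for the Perl program; the names HREF_FILENAME and strip_href_filename are made up here.

```python
import re

HREF_FILENAME = re.compile(r"""
    (                   # group 1: tag and URL part to keep
        <a \s+ [^>]*? href \s* = \s*
        ( ["'] )        # group 2: opening quote
        [^>]*?
        /               # end of directories
    )
    [^/."'>]+           # file name (none of the characters / . " ' >)
    \.                  # dot
    [^/."'>]+           # extension
    \2                  # backreference to the opening quote
""", re.VERBOSE)

def strip_href_filename(html):
    """Drop the trailing 'name.ext' from every <a href=...> value."""
    return HREF_FILENAME.sub(r"\1\2", html)

print(strip_href_filename(
    '<a href="file:///Volumes/VolumeName/Download/Mac%20Software/'
    'CompanyName/SoftwarePackageName.dmg">'
))
```

As in the Perl output above, this pattern is not limited to file:/// links: any href ending in name.ext is shortened, so the known-extension variant may be the safer choice on mixed documents.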
Try this:
sed 's|\(<a href="file:///[^>]*/\).*">|\1">|g'
Demo:
$ sed 's|\(<a href="file:///[^>]*/\).*\.\(dmg\|zip\)">|\1">|g' <<EOF
> <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
> foo bar <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg"> baz quux
> EOF
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
foo bar <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/"> baz quux
First I want to say again how much I truly appreciate the time taken by those who tried to help! Secondly, it pains me to have to say that nothing that was presented worked in the real-world application and I, at the least, attribute that to you all not having the actual file I wanted to modify to work with and sorry I was not allowed to provide it. Yes your demos worked however unfortunately they were not at all representative of the actual html coding in the document and maybe because the "Generator" was "Cocoa HTML Writer" from a RTF Document this might have had something to do with it, just not sure at the moment. Even if I took just one complete line that included the Example code, placing it by itself in a file and then processing it nonetheless all solutions presented failed. I wish I could provide the file or take the time to figure out why in this real-world use it fails, however I can't.
Some background on the document: when it was originally created as an RTF document in TextEdit, the FQP of the target file was included because that version of OS X would open the target file; later versions of OS X only open Finder to the location of the target file. As such there is no longer a need to use the FQP to the target file, just the path to its location. This actually makes it easier to update the RTF document over time. At times this RTF document is exported to an HTML document to be modified and then saved as an RTF document again. As I mentioned earlier, the "Generator" being "Cocoa HTML Writer" from an RTF document in TextEdit may be partly to blame for the proposed solutions failing.
Anyway the reason for my long winded reply is to put this issue in proper perspective and also explain how I resolved this issue. As I had previously mentioned I was just going to manually edit the file however after the generous help already presented I wanted to find some automated solutions and I did.
The primary constant was the Example code previously presented so focusing only on it here is the command line I used to process the file.
grep -o 'file:///[^"]*' Build_Out_Template.html | rev | cut -d / -f 1 | rev | while read LINE; do sed -i "s/${LINE}//" Build_Out_Template.html; done
Using "grep -o 'file:///[^"]*'" let me extract just the target portion of the lines in the document. I piped it through rev to reverse the character order, then through cut, which gave me only the portion up to the first slash in the reversed line (that is, after the last slash in the original line), and then through rev again to restore the original character order. The result was piped into a loop where sed used a very simple instruction rather than a complex regex, deleting literally just the SoftwarePackageName.dmg (etc.) file name. While much more time was spent on this than on manually editing the file, I took it as a challenge, and I will remember that sometimes the outside-the-box solution is faster and easier if I ever need it for some other application.
Thanks again to all who tried to help, it's truly appreciated.
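The rev | cut | rev trick above amounts to "split at the last slash". A minimal Python sketch of the same idea follows, assuming (as in the grep command) that every target URL starts with file:/// and ends at the closing double quote; the function names are made up for the example.

```python
import re

def strip_filename(url):
    """Keep everything up to and including the last slash of the URL."""
    return url.rsplit("/", 1)[0] + "/"

def strip_local_filenames(html):
    """Apply strip_filename to every file:/// URL found in the text."""
    # Same match as grep -o 'file:///[^"]*': from file:/// up to the next quote.
    return re.sub(r'file:///[^"]*', lambda m: strip_filename(m.group(0)), html)

print(strip_local_filenames(
    '<a href="file:///Volumes/VolumeName/Download/Mac%20Software/'
    'CompanyName/SoftwarePackageName.dmg">'
))
```

Unlike the sed loop, this never treats the file name as a regex, so names containing characters such as . or ( cannot match more than intended.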

Deleting HTML-Blocks with Regular-Expression

I am trying to delete all HTML blocks that are closed.
For example, the following block should be deleted, since it is closed (<> ... </>):
<b> some text </b>
But if a block isn't closed (it lacks </>), it should not be deleted.
Below is a snippet of the HTML code to process:
<div id="MyDiv">div,
<strong>
<span>span2, </span> <-- This should be deleted
<em> Some text for em
<div> Some text for div </div> <-- This should be deleted
<p><b>b, <span id="MySpan"> Some text for span ...
After processing, it should look something like this:
<div id="MyDiv">div,
<strong>
<em> Some text for em
<p><b>b, <span id="MySpan">span1,
I need a regular-expression statement to accomplish this, e.g. something like the following:
var sHTML = $('#MyDiv').html();
sHTML = sHTML.replace(/^<.*>.*?<\/.*>/ig, '');
Thanks in advance.
<([^>]*)>[^><]*<\/\s*\1\s*>|<(\w+)\s+[^>]*>[^><]*<\/\s*\2\s*>
Try this. Replace with an empty string.
See demo.
http://regex101.com/r/hQ1rP0/79
Never mind, this works for every case, or at least I am pretty sure it should:
(<[^>]*>[^<]*<[^>]*>)
Assuming your html is in a file called test.html, here's a perl one-liner:
perl -pi -e 's/<.*>.*<\/.*>//g' test.html
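A backreference to the tag name, as in the first pattern above, is what keeps an opening tag from being paired with some unrelated closing tag. Here is a small Python sketch of that idea, assuming the blocks to delete contain only text (no nested tags); the function name is made up for the example.

```python
import re

# <tag ...>text</tag>, where the closing name must equal the opening name (\1)
# and the content may not contain another tag.
CLOSED_BLOCK = re.compile(r'<(\w+)\b[^>]*>[^<>]*</\1\s*>')

def remove_closed_blocks(html):
    """Delete every innermost element that has both an opening and a closing tag."""
    return CLOSED_BLOCK.sub("", html)

snippet = (
    '<div id="MyDiv">div,\n'
    '<strong>\n'
    '<span>span2, </span>\n'
    '<em> Some text for em\n'
    '<div> Some text for div </div>\n'
    '<p><b>b, <span id="MySpan"> Some text for span ...'
)
print(remove_closed_blocks(snippet))
```

This makes a single pass. If removing an inner block leaves an outer block that now contains only text, calling the function again would remove that one too.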

Strip everything except for the complete anchor tag - Perl

I need to parse an HTML file and remove everything except for the anchor tags in their entirety. So for example:
<html>
<body>
<p>boom</p>
<a href="...">Example</a>
</body>
</html>
I only need to keep:
<a href="...">Example</a>
I am using cURL to retrieve the html and a small snippet of code I found that strips everything but the anchor text of the tag. This is what I am using:
curl http://www.google.com 2>&1 | perl -pe 's/\<.*?\>//g'
Is there a simple command line way to do this? My end goal is to put this into a bash script and execute it. I am having a very difficult time understanding regular expressions and perl.
Using Mojolicious command line tool mojo:
mojo get http://www.google.com 'a'
Outputs:
<a class="gb1" href="http://www.google.com/imghp?hl=en&tab=wi">Images</a>
<a class="gb1" href="http://maps.google.com/maps?hl=en&tab=wl">Maps</a>
<a class="gb1" href="https://play.google.com/?hl=en&tab=w8">Play</a>
<a class="gb1" href="http://www.youtube.com/?tab=w1">YouTube</a>
<a class="gb1" href="http://news.google.com/nwshp?hl=en&tab=wn">News</a>
<a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>
<a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a>
<a class="gb1" href="http://www.google.com/intl/en/options/" style="text-decoration:none"><u>More</u> »</a>
<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>
<a class="gb4" href="/preferences?hl=en">Settings</a>
<a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&continue=http://www.google.com/" id="gb_70" target="_top">Sign in</a>
Install Google Chrome
Advanced search
Language tools
Chromebook: For students
Advertising Programs
Business Solutions
+Google
About Google
Privacy & Terms
For a helpful 8 minute introductory video, check out: Mojocast Episode 5
Using Mojolicious, as @Miller does above, but selecting the <a ... rel= tags more precisely:
If you have an HTML file:
perl -Mojo -E 'say $_ for x(b("my.html")->slurp)->find("a[rel]")->each'
or, for an online resource:
perl -Mojo -E 'say $_ for g("http://example.com")->dom->find("a[rel]")->each'
# or
perl -Mojo -E 'g("http://example.com")->dom->find("a[rel]")->each(sub{say $_})'
If you want more granular control over your HTML, then you can use HTML::TagParser module available on CPAN.
use strict;
use warnings;
use HTML::TagParser;
my $html = HTML::TagParser->new( '<html>
<body>
<p>boom</p>
<a href="...">Example</a>
</body>
</html>' );
my @list = $html->getElementsByTagName( "a" );
for my $elem ( @list ) {
my $name = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
print "<$name";
for my $key ( sort keys %$attr ) {
print " $key=\"$attr->{$key}\"";
}
print $text eq "" ? " />" : ">$text</$name>" , "\n";
}
Output:
<a href="...">Example</a>
Ingy döt Net's pQuery deserves a mention:
perl -MpQuery -E 'pQuery("http://www.ubu.com/sound/barthes.html")
->find("a")->each(sub{say pQuery($_)->toHtml})'
Just the links:
perl -MpQuery -E 'pQuery("http://www.ubu.com/sound/barthes.html")
->find("a")->each(sub{say $_->{href}})'
Although - unlike mojo - there's no command line tool (i.e. not yet - it's not that kind of tool per se and is still "under construction"), it's a module to have on your watch list.
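For readers without Perl modules at hand, the parser-based approach used in the answers above can also be sketched with Python's standard-library html.parser. This is an illustrative sketch, not a port of any of the tools mentioned; the class and function names are made up, and the sample URL is a placeholder.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect every <a>...</a> element, markup included."""

    def __init__(self):
        # Keep entities such as &raquo; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.anchors = []   # finished anchor elements as strings
        self._buf = None    # pieces of the anchor currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self._buf is None:
            self._buf = [self.get_starttag_text()]
        elif self._buf is not None:
            self._buf.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> inside an anchor.
        if self._buf is not None:
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._buf is None:
            return
        self._buf.append(f"</{tag}>")
        if tag == "a":
            self.anchors.append("".join(self._buf))
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_entityref(self, name):
        if self._buf is not None:
            self._buf.append(f"&{name};")

    def handle_charref(self, name):
        if self._buf is not None:
            self._buf.append(f"&#{name};")

def extract_anchors(html):
    """Return all anchor elements found in the given HTML text."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors

page = '<html>\n<body>\n<p>boom</p>\n<a href="http://example.com">Example</a>\n</body>\n</html>'
for anchor in extract_anchors(page):
    print(anchor)
```

To use it in a pipeline with curl, the page could be read from standard input with sys.stdin.read() and passed to extract_anchors.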

Perl script to search and replace multiple lines in multiple html files

I have many HTML files in a folder. I need to somehow remove a <div id="user-info" ...>...</div> from all of them. As far as I know I need a Perl script for that, but I don't know Perl well enough to write one. Could someone help me with it?
Here is what the "bad" code looks like:
<div id="user-info" class="logged-in">
<a class="icon icon-key-delete" href="https://test.dev/login.php?0,logout=1">Log Out</a>
<a class="icon icon-user-edit" href="https://test.dev/control.php">Control Center</a>
</div> <!-- end of div id=user-info -->
Thank you in advance!
Using XML::XSH2:
for { glob '*.html' } {
open :F html (.) ;
delete //div[@id="user-info" and @class="logged-in"] ;
save :b ;
}
perl -0777 -i.withdiv -pe 's{<div[^>]+?id="user-info"[^>]*>.*?</div>}{}gsmi;' test.html
-0777 means split on nothing, so slurp in the whole file (instead of line by line, the default for -p).
-i.withdiv means alter files in place, leaving the original with the extension .withdiv (the default for -p is to just print).
-p means pass each line (here the whole file, since we are slurping) to the given code (see -e) and print the result.
-e expects code to run.
See man perlrun or perldoc perlrun for more info.
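The steps of that one-liner (slurp the whole file, substitute across line breaks, keep a backup) can be mirrored in Python. This is a sketch under the same assumption as the Perl regex, namely that the user-info div contains no nested div; the function names and the UTF-8 encoding are assumptions made for the example.

```python
import re
import shutil
from pathlib import Path

# re.DOTALL lets .*? cross line breaks, like the /s flag in the Perl version.
USER_INFO_DIV = re.compile(
    r'<div[^>]+?id="user-info"[^>]*>.*?</div>',
    re.DOTALL | re.IGNORECASE,
)

def strip_user_info(html):
    """Remove the <div id="user-info" ...>...</div> block from the text."""
    return USER_INFO_DIV.sub("", html)

def process_folder(folder, backup_suffix=".withdiv"):
    """Rewrite every *.html file in folder, keeping a backup of changed files."""
    for path in Path(folder).glob("*.html"):
        original = path.read_text(encoding="utf-8")
        cleaned = strip_user_info(original)
        if cleaned != original:
            shutil.copy2(path, path.with_name(path.name + backup_suffix))
            path.write_text(cleaned, encoding="utf-8")
```

Calling process_folder(".") would process the current directory. The non-greedy .*? stops at the first </div>, which is why a nested div inside the block would break this approach and a real HTML parser would be the safer tool.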
Here's another solution, which will be slightly more familiar to people that know jquery, as the syntax is similar. This uses Mojolicious' ojo module to load up the html content into a Mojo::DOM object, transform it, and then print that transformed version:
perl -Mojo -MFile::Slurp -E 'for (@ARGV) { say x(scalar(read_file $_))->at("#user-info")->replace("")->root; }' test.html test2.html test*.html
To replace content directly:
perl -Mojo -MFile::Slurp -E 'for (@ARGV) { write_file( $_, x(scalar(read_file $_))->at("#user-info")->replace("")->root ); }' test.html
Note, this won't JUST remove the div, it will also re-write the content based on Mojo's Mojo::DOM module, so tag attributes may not be in the same order. Specifically, I saw <div id="user-info2" class="logged-in"> rewritten as <div class="logged-in" id="user-info2">.
Mojolicious requires at least perl 5.10, but beyond that there are no non-core requirements.