Remove file name portion of local file URL - regex

I have an HTML document which includes links to a hundred or so local files. I want to use sed, awk or perl (in that order of preference) to remove the filename portion of each URL, that is, everything after the last forward slash. In the example below I'm only showing the portion of the HTML code that forms the path of the local file.
Example:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
After Processing Example:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
In testing I have tried different regular expression combinations to accomplish this. However, I either match only ".dmg" or match ".dmg" plus everything to its left, when I really only want to remove the "SoftwarePackageName.dmg" portion. BTW, in some cases it's "SoftwarePackageName.zip", and there may be a space in "CompanyName" or "SoftwarePackageName.dmg", shown as "%20". I've also reviewed the "Questions that may already have your answer" shown when creating this post.
EDIT: I appreciate the time taken to try and help, and I certainly understand the difficulty when, due to policy, I cannot provide more than the example I did. As such I'll just manually edit the HTML document. I've already taken too much of my time and others' on this. I will just have to do more reading on regex for the next time. Thanks to all that contributed. :)

You could try the below sed command.
sed 's/\(<a href="[^."]*\/\)[^."\/]*\.[^."\/]*">/\1">/g' file

modded
I've deleted the previous sed regex (I have no way to test it).
Instead, I'm posting an expanded (verbose) regex that should help get you started.
# Unknown extension: (<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.[^/."'>]+)\2
# Known extension: (<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.dmg\b[^/."'>]*)\2
# Replacement: $1$2
( # (1 start), Tag and Url part to keep
<a \s+ [^>]*? href \s* = \s*
( ["'] ) # (2), Quote
[^>]*?
/ # End of directories
) # (1 end)
( # (3 start), Throw away filename
[^/."'>]+ # - Filename (not /."'> chars)
\. # - Dot
# - Extension and parameters
# ----------------------------
# Use one of these lines (but not both)
# Known extensions ->
#dmg \b [^/."'>]*
# Unknown extensions ->
[^/."'>]+
) # (3 end)
\2 # Backref to Quote
Sed should use much the same s///g substitute structure.
You may have to escape the parenthesis metacharacters, but I think that's
all this regex needs. These regexes are shown in their raw state.
Here they are used in a sample Perl program. The same could easily be done using Perl from the command line.
use strict;
use warnings;
$/ = undef;
my $html = <DATA>; # slurp in the entire file
my $htmlcopy = $html;
# Use $1$2 in the replacement; \1\2 triggers a warning under "use warnings"
$html =~ s|(<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.[^/."'>]+)\2|$1$2|g;
print "Replaced using Unknown extensions:\n", $html, "\n";
$htmlcopy =~ s|(<a\s+[^>]*?href\s*=\s*(["'])[^>]*?/)([^/."'>]+\.dmg\b[^/."'>]*)\2|$1$2|g;
print "Replace using Known extensions:\n", $htmlcopy, "\n\n";
__DATA__
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/Standardized-Test-for-English-Proficiency-(STEP).aspx">
<a rel="nofollow" class="external free" href="http://www.ielts.org/about_us.aspx">
<a href="/w/index.php?title=IELTS&redirect=no" title="IELTS">
<a href="/wiki/File:IELTS_logo.svg" class="image">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=1" title="Edit section: IELTS characteristics">
<a href="/w/index.php?title=Band_score&action=edit&redlink=1" class="new" title="Band score (page does not exist)">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=2" title="Edit section: IELTS test structure">
<a href="/wiki/University_of_St._Andrews" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/search.aspx">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/language-requirements.html#toc05">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
Output >>
Replaced using Unknown extensions:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/">
<a rel="nofollow" class="external free" href="http://www.ielts.org/">
<a href="/w/" title="IELTS">
<a href="/wiki/" class="image">
<a href="/w/" title="Edit section: IELTS characteristics">
<a href="/w/" class="new" title="Band score (page does not exist)">
<a href="/w/" title="Edit section: IELTS test structure">
<a href="/wiki/" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">
Replace using Known extensions:
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
<a rel="nofollow" class="external text" href="http://www.ielts.org/researchers/analysis-of-test-data/">
<a rel="nofollow" class="external text" href="http://qiyas.sa/Sites/English/Tests/LanguageTests/Pages/Standardized-Test-for-English-Proficiency-(STEP).aspx">
<a rel="nofollow" class="external free" href="http://www.ielts.org/about_us.aspx">
<a href="/w/index.php?title=IELTS&redirect=no" title="IELTS">
<a href="/wiki/File:IELTS_logo.svg" class="image">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=1" title="Edit section: IELTS characteristics">
<a href="/w/index.php?title=Band_score&action=edit&redlink=1" class="new" title="Band score (page does not exist)">
<a href="/w/index.php?title=International_English_Language_Testing_System&action=edit&section=2" title="Edit section: IELTS test structure">
<a href="/wiki/University_of_St._Andrews" title="University of St. Andrews" class="mw-redirect">
<a rel="nofollow" class="external text" href="http://bandscore.ielts.org/search.aspx">
<a rel="nofollow" class="external text" href="http://www.bristol.ac.uk/university/governance/policies/admissions/language-requirements.html#toc05">
<a href="#cite_ref-11">
<a href="/wiki/Special:BookSources/1405833122" class="internal mw-magiclink-isbn">

Try this:
sed 's|\(<a href="file:///[^>]*/\).*">|\1">|g'
Demo:
$ sed 's|\(<a href="file:///[^>]*/\).*\.\(dmg\|zip\)">|\1">|g' <<EOF
> <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg">
> foo bar <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/SoftwarePackageName.dmg"> baz quux
> EOF
<a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/">
foo bar <a href="file:///Volumes/VolumeName/Download/Mac%20Software/CompanyName/"> baz quux
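For comparison, here is a minimal Python sketch of the same substitution. It is not one of the answers above, and it assumes the href values are double-quoted as in the question's example; the function and pattern names are made up for illustration.

```python
import re

# Keep everything up to and including the last slash of a file:// href,
# then drop the final path segment (which contains no slash or quote).
# Assumption: attribute values use double quotes, as in the question.
HREF_FILE = re.compile(r'(href="file://[^"]*/)[^"/]+(")')


def strip_filenames(html: str) -> str:
    """Remove the filename portion of every local file URL in an href."""
    return HREF_FILE.sub(r"\1\2", html)


sample = (
    '<a href="file:///Volumes/VolumeName/Download/'
    'Mac%20Software/CompanyName/SoftwarePackageName.dmg">'
)
print(strip_filenames(sample))
```

Because the last segment must contain at least one character, an href that already ends in a slash is left untouched, and nothing depends on the extension being .dmg or .zip.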

First I want to say again how much I truly appreciate the time taken by those who tried to help! Secondly, it pains me to say that nothing presented worked in the real-world application. I attribute that, at least in part, to none of you having the actual file I wanted to modify, and I'm sorry I was not allowed to provide it. Yes, your demos worked; unfortunately they were not at all representative of the actual HTML in the document. The "Generator" was "Cocoa HTML Writer", exporting from an RTF document, and this might have had something to do with it; I'm just not sure at the moment. Even when I took one complete line that included the example code and placed it by itself in a file, every solution presented failed on it. I wish I could provide the file or take the time to figure out why it fails in this real-world use, however I can't.
Some background on the document: when it was originally created as an RTF document in TextEdit, the FQP (fully qualified path) of the target file was included because that version of OS X would open the target file. Later versions of OS X only open Finder at the location of the target file, so there is no longer a need for the FQP of the target file, just the path to its location. This actually makes it easier to update the RTF document over time. At times this RTF document is exported to an HTML document, modified, and then saved as an RTF document again. As I mentioned earlier, the "Generator" being "Cocoa HTML Writer" from an RTF document in TextEdit may be partly to blame for the proposed solutions failing.
Anyway the reason for my long winded reply is to put this issue in proper perspective and also explain how I resolved this issue. As I had previously mentioned I was just going to manually edit the file however after the generous help already presented I wanted to find some automated solutions and I did.
The primary constant was the Example code previously presented so focusing only on it here is the command line I used to process the file.
grep -o 'file:///[^"]*' Build_Out_Template.html | rev | cut -d / -f 1 | rev | while read LINE; do sed -i "s/${LINE}//" Build_Out_Template.html; done
Using grep -o 'file:///[^"]*' let me extract just the target portion of each line in the document. I piped that through rev to reverse the character order, then through cut, which gave me only the portion up to the first slash in the reversed line (that is, after the last slash in the original line), and then through rev again to restore the original character order. The result was piped into a loop where sed used a very simple instruction, the literal SoftwarePackageName.dmg etc. file name, instead of a complex regex. While I spent much more time on this than manually editing the file would have taken, I took it as a challenge. Sometimes the outside-the-box solution is faster and easier, and I'll remember this for some other application if and when needed.
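The same extract-then-remove idea can be sketched in Python. One caveat with the sed loop is that each file name is used as a regex, so a "." in the name matches any character; the sketch below does a literal string replacement instead. The function name and the sample paths are assumptions for illustration, not taken from the actual document.

```python
import re


def strip_by_literal_replace(html: str) -> str:
    """Find each file:/// URL, split off the text after its last slash,
    and replace the full URL with just its directory part."""
    # Counterpart of: grep -o 'file:///[^"]*'
    for url in set(re.findall(r'file:///[^"]*', html)):
        # Counterpart of: rev | cut -d / -f 1 | rev
        directory, slash, filename = url.rpartition("/")
        if filename:
            # Include the closing quote so only complete URLs are replaced.
            html = html.replace(url + '"', directory + slash + '"')
    return html


doc = (
    '<a href="file:///Volumes/V/Mac%20Software/Company/Package.dmg">\n'
    '<a href="file:///Volumes/V/Other%20Co/Tool.zip">'
)
print(strip_by_literal_replace(doc))
```

Replacing the whole URL rather than the bare file name avoids touching the same name where it appears elsewhere in the document, for example in the link text.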
Thanks again to all who tried to help, it's truly appreciated.

Related

Search pattern between tags in html

I need to get value from a tag with specific title.
I have this command.
sed -n 's/title="view quote">\(.*\)<\/a>/\1/p' index.html
This is part of index.html; I need to extract 'Everything in life is luck':
<a title="view quote" href="https://www.brainyquote.com/quotes/donald_trump_106578" class="oncl_q">
<img id="qimage_106578" src="./Donald Trump Quotes - BrainyQuote_files/donaldtrump1.jpg" class="bqphtgrid" alt="Everything in life is luck. - Donald Trump">
</a>
</div>
Everything in life is luck.
Donald Trump
</div>
And I need all these values to fill an array in bash.
Your sed command is mostly good; it is just missing .* at each end of the regex to remove the extra head and tail of the line.
This command extracts all values with your specific title:
sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p' index.html
To put into an array:
IFS=$'\n' array=( $(sed -n 's/.*title="view quote">\(.*\)<\/a>.*/\1/p' index.html) )
To verify your result array:
for ((i=0;i<${#array[@]};i++)); do
echo "${array[$i]}"
done

Regex -> detect a pattern -> move it to the start of the line

It is the first time that I use this platform because it is impossible for me to find the solution.
I have this html code:
<img ...></img><a ...><span ...
I need this:
<a ...><img ...></img><span ...
Where ... would be the content of the pattern (like <img.*.</img>) because it will be done in a bulk way and the information changes. The file has this format:
<img ...></img><a ...><span ...
.....
<img ...></img><a ...><span ...
.....
<img ...></img><a ...><span ...
.....
<img ...></img><a ...><span ...
As you can guess, I need to put the <img> tag inside the <a> tag. I tried to take the pattern <a.*.> and move it to the beginning of the line but I have not succeeded.
You generally should not use regex to manipulate HTML content, which might be nested and have other complexities. However, assuming your <img> and <a> tags are always just one level deep, you could try the following find and replace in sed:
echo "<img ...></img><a ...><span ..." | sed 's/\(<img[^>]*><\/img>\)\(<a[^>]*>\)/\2\1/'
This prints:
<a ...><img ...></img><span ...
Here is a more general solution, also easier to read:
Find: (<img[^>]*><\/img>)(<a[^>]*>)
Replace: $2$1
This solution simply captures, in two separate groups $1 and $2, the <img> and <a> tags. Then, in the replacement, it swaps the two tags to give you the order you want.
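The same two-group swap can be written as a short Python sketch. It carries the same assumption as above (tags are not nested, and the <img> is immediately followed by the <a>); the sample attributes are placeholders.

```python
import re

# Group 1: the <img ...></img> pair; group 2: the opening <a ...> tag.
# \b keeps <a from also matching tags such as <abbr>.
IMG_THEN_A = re.compile(r"(<img\b[^>]*></img>)(<a\b[^>]*>)")


def move_img_inside_anchor(line: str) -> str:
    """Swap the two captured tags so the <img> ends up inside the <a>."""
    return IMG_THEN_A.sub(r"\2\1", line)


line = '<img src="x.png"></img><a href="y.html"><span>text'
print(move_img_inside_anchor(line))
```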
In the end I solved it like this:
sed -i -E "s/(<img.*)(<a .*.>)/\2\1/" file.txt

Deleting HTML-Blocks with Regular-Expression

I am trying to delete all HTML blocks that are closed.
For example, the following block should be deleted, since it is closed (<> ... </>):
<b> some text </b>
But if a block isn't closed (it lacks </>), it should not be deleted.
Below is a snippet of the HTML code to process:
<div id="MyDiv">div,
<strong>
<span>span2, </span> <-- This is to delete
<em> Some text for em
<div> Some text for div </div> <-- This is to delete
<p><b>b, <span id="MySpan"> Some text for span ...
After processing it should look something like the following:
<div id="MyDiv">div,
<strong>
<em> Some text for em
<p><b>b, <span id="MySpan">span1,
I need a regular expression to accomplish this, e.g. something like the following:
var sHTML = $('#MyDiv').html();
sHTML = sHTML.replace(/^<.*>.*?<\/.*>/ig, '');
Thanks in advance.
<([^>]*)>[^><]*<\/\s*\1\s*>|<(\w+)\s+[^>]*>[^><]*<\/\s*\2\s*>
Try this. Replace with an empty string.
See demo.
http://regex101.com/r/hQ1rP0/79
Never mind, this works for every case, or I am pretty sure it should:
(<[^>]*>[^<]*<[^>]*>)
Assuming your html is in a file called test.html, here's a perl one-liner:
perl -pi -e 's/<.*>.*<\/.*>//g' test.html
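As a hedged illustration of the back-reference idea from the first regex above, here is a Python sketch that removes only elements whose closing tag name matches the opening one and whose body contains no other tags. It makes a single pass, so an element that only becomes "closed and tag-free" after an inner removal would need a second pass; the names are illustrative.

```python
import re

# <name ...> text-without-tags </name>, with \1 forcing the closing tag
# to repeat the opening tag's name.
CLOSED_BLOCK = re.compile(r"<(\w+)\b[^>]*>[^<>]*</\1\s*>")


def drop_closed_blocks(html: str) -> str:
    """Delete innermost elements that are opened and closed."""
    return CLOSED_BLOCK.sub("", html)


snippet = """<div id="MyDiv">div,
<strong>
<span>span2, </span>
<em> Some text for em
<div> Some text for div </div>
<p><b>b, <span id="MySpan"> Some text for span ..."""
print(drop_closed_blocks(snippet))
```

Unclosed tags such as the outer <div id="MyDiv">, <strong> and <em> survive because no matching closing tag follows them before the next tag starts.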

Strip everything except for the complete anchor tag - Perl

I need to parse an HTML file and remove everything except the anchor tags, which should be kept in their entirety. So for example:
<html>
<body>
<p>boom</p>
Example
</body>
</html>
I only need to keep:
Example
I am using cURL to retrieve the html and a small snippet of code I found that strips everything but the anchor text of the tag. This is what I am using:
curl http://www.google.com 2>&1 | perl -pe 's/\<.*?\>//g'
Is there a simple command line way to do this? My end goal is to put this into a bash script and execute it. I am having a very difficult time understanding regular expressions and perl.
Using Mojolicious command line tool mojo:
mojo get http://www.google.com 'a'
Outputs:
<a class="gb1" href="http://www.google.com/imghp?hl=en&tab=wi">Images</a>
<a class="gb1" href="http://maps.google.com/maps?hl=en&tab=wl">Maps</a>
<a class="gb1" href="https://play.google.com/?hl=en&tab=w8">Play</a>
<a class="gb1" href="http://www.youtube.com/?tab=w1">YouTube</a>
<a class="gb1" href="http://news.google.com/nwshp?hl=en&tab=wn">News</a>
<a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>
<a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a>
<a class="gb1" href="http://www.google.com/intl/en/options/" style="text-decoration:none"><u>More</u> »</a>
<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>
<a class="gb4" href="/preferences?hl=en">Settings</a>
<a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&continue=http://www.google.com/" id="gb_70" target="_top">Sign in</a>
Install Google Chrome
Advanced search
Language tools
Chromebook: For students
Advertising Programs
Business Solutions
+Google
About Google
Privacy & Terms
For a helpful 8 minute introductory video, check out: Mojocast Episode 5
Using Mojolicious, as @Miller does above, but selecting more precisely the <a ... rel= tags:
If you have an html file
perl -Mojo -E 'say $_ for x(b("my.html")->slurp)->find("a[rel]")->each'
or for the online resource
perl -Mojo -E 'say $_ for g("http://example.com")->dom->find("a[rel]")->each'
#or
perl -Mojo -E 'g("http://example.com")->dom->find("a[rel]")->each(sub{say $_})'
If you want more granular control over your HTML, then you can use HTML::TagParser module available on CPAN.
use strict;
use warnings;
use HTML::TagParser;
my $html = HTML::TagParser->new( '<html>
<body>
<p>boom</p>
Example
</body>
</html>' );
my @list = $html->getElementsByTagName( "a" );
for my $elem ( @list ) {
my $name = $elem->tagName;
my $attr = $elem->attributes;
my $text = $elem->innerText;
print "<$name";
for my $key ( sort keys %$attr ) {
print " $key=\"$attr->{$key}\"";
}
print $text eq "" ? " />" : ">$text</$name>" , "\n";
}
Output:
Example
Ingy döt Net's pQuery deserves a mention:
perl -MpQuery -E 'pQuery("http://www.ubu.com/sound/barthes.html")
->find("a")->each(sub{say pQuery($_)->toHtml})'
Just the links:
perl -MpQuery -E 'pQuery("http://www.ubu.com/sound/barthes.html")
->find("a")->each(sub{say $_->{href}})'
Although - unlike mojo - there's no command line tool (i.e. not yet - it's not that kind of tool per se and is still "under construction"), it's a module to have on your watch list.
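The answers above all rely on a real HTML parser rather than a regex. As a rough sketch of the same approach using only Python's standard library html.parser (a different tool from the Perl modules above), the class below collects each anchor element as text. The class name and the example.com URL are placeholders, since the question's original link is not shown.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect every <a>...</a> element, including nested tags, as a string."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._parts = None  # pieces of the anchor currently open, else None

    def handle_starttag(self, tag, attrs):
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())
        elif tag == "a":
            self._parts = [self.get_starttag_text()]

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        self._parts.append(f"</{tag}>")
        if tag == "a":
            self.anchors.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


page = (
    "<html><body><p>boom</p>"
    '<a href="http://example.com">Example</a>'  # hypothetical link
    "</body></html>"
)
collector = AnchorCollector()
collector.feed(page)
collector.close()
print("\n".join(collector.anchors))
```

Note that the parser decodes character references in text by default, so an &amp; inside link text comes back as a plain &, while attribute values are kept exactly as written in the start tag.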

Sed program - deleted strings reappearing?

I'm stumped. I have an HTML file that I'm trying to convert to plain text and I'm using sed to clean it up. I understand that sed works on the 'stream' and works one line at a time, but there are ways to match multiline patterns.
Here is the relevant section of my source file:
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>
I would like this to be made into the following plaintext format:
My Name
123 street
City Region 1A1 A1A
my@email.ca
000-000-0000
The key is that City, Region, and Post code are all on one line now.
I use sed -f commands.sed file.html > output.txt and I believe that the following sed program (commands.sed) should put it in that format:
#using the '#' symbol as delimiter instead of '/'
#remove tags
s#<.*>\(.*\)</.*>#\1#g
#remove the nbsp
s#\(&nbsp;\)*##g
#add a newline before the address (actually typing a newline in the file)
s#\(123 street\)#\
\1#g
#and now the command that matches multiline patterns
#find 'City',read in the next two lines, and separate them with spaces
/City/ {
N
N
s#\(.*\)\n\(.*\)\n\(.*\)#\1 \2 \3#g
}
Seems to make sense. Tags are all stripped and then three lines are put into one.
Buuuuut it doesn't work that way. Here is the result I get:
My Name
123 street
City <span class="region">Region</span> <span class="postal-code">1A1 A1A</span>
my@email.ca
000-000-0000
To my (relatively inexperienced) eyes, it looks like sed is 'forgetting' the changes it made (stripping off the tags). How would I solve this? Is the solution to write the file after three commands and re-run sed for the fourth? Am I misusing sed? Am I misunderstanding the 'stream' part?
I'm running Mac OS X 10.4.11 with the bash shell and using the version of sed that comes with it.
I think you're confused. Sed operates line-by-line, and runs all commands on the line before moving to the next. You seem to be assuming it strips the tags on all lines, then goes back and runs the rest of the commands on the stripped lines. That's simply not the case.
See RegEx match open tags except XHTML self-contained tags ... and stop using sed for this.
Sed is a wonderful tool, but not for processing HTML. I suggest using Python and BeautifulSoup, which is basically built just for this sort of task.
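To make the ordering problem concrete: in the sed script, the tag-stripping command has already run on the "City" line by the time N pulls in the next two lines, so those appended lines are never stripped. Below is a small Python sketch (standard library only, not BeautifulSoup) that strips tags from every line of a group, including the lines pulled in. The locality trigger and the function name are assumptions based on the sample in the question.

```python
import re

TAG = re.compile(r"<[^>]*>")


def html_to_text(lines):
    """Strip tags from each line; merge the locality line with the next two."""
    out = []
    i = 0
    while i < len(lines):
        # Assumption: the city line is the one carrying class="locality".
        size = 3 if 'class="locality"' in lines[i] else 1
        group = lines[i:i + size]
        # Tags are removed from every line in the group, not just the first.
        out.append(" ".join(TAG.sub("", item).strip() for item in group))
        i += size
    return out


source = [
    '<h1 class="fn" id="myname">My Name</h1>',
    '<span class="street-address">123 street</span>',
    '<span class="locality">City</span>',
    '<span class="region">Region</span>',
    '<span class="postal-code">1A1 A1A</span>',
    '<span class="email">my@email.ca</span>',
    '<span class="tel">000-000-0000</span>',
]
print("\n".join(html_to_text(source)))
```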
If you have only one data block per php file, try the following (using sed)
kent$ cat t
<h1 class="fn" id="myname">My Name</h1>
<span class="street-address">123 street</span>
<span class="locality">City</span>
<span class="region">Region</span>
<span class="postal-code">1A1 A1A</span>
<span class="email">my@email.ca</span>
<span class="tel">000-000-0000</span>
kent$ sed 's/<[^>]*>//g; s/&nbsp;//g' t |sed '1G;3{N;N; s/\n/ /g}'
My Name
123 street
City Region 1A1 A1A
my@email.ca
000-000-0000