How to match internal links with Regex?

I am trying to build a regex which will match every line which doesn't contain the word "stylesheet" in it, and has an "a href" which has a value NOT starting with http or www.
This is how far I got, but it doesn't seem to do what I want:
grep -rin "href=\"\/*\/*\/|^((?!stylesheet).)*$" *.html
The goal is that this would be caught:
<a href="/api_supplier/">
<a href="/other-internal-link/abc/">
but this wouldn't:
<a href="http://github.com/">
<a href="www.github.com/index.html">
<a href="/other-internal-link/test/" rel="stylesheet">
The ultimate goal of mine is to append "index.html" at the end of every internal link, so they would look like this:
<a href="/api_supplier/index.html">
<a href="/other-internal-link/abc/index.html">

This regex may do the job (the negative lookahead needs a PCRE-capable engine, e.g. grep -P):
^(.*a href)((?!http|www|stylesheet).)*$

A Perl way to append index.html to the right URLs:
~cat file.txt
<a href="/api_supplier/">
<a href="/other-internal-link/abc/">
<a href="http://github.com/">
<a href="www.github.com/index.html">
<a href="/other-internal-link/test/" rel="stylesheet">
~perl -ape 's~^(?!.*stylesheet).*?\bhref="/[^"]+\K~index.html~' file.txt
<a href="/api_supplier/index.html">
<a href="/other-internal-link/abc/index.html">
<a href="http://github.com/">
<a href="www.github.com/index.html">
<a href="/other-internal-link/test/" rel="stylesheet">
If you want to do the replacement in-place, use the -i option:
perl -i -ape 's~^(?!.*stylesheet).*?\bhref="/[^"]+\K~index.html~' file.txt
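For comparison, here is a minimal Python sketch of the same idea. The function name append_index is made up for illustration, and, like the Perl one-liner, it assumes one link per line and that every internal href ends with a slash.

```python
import re

# Skip any line that mentions "stylesheet"; otherwise capture everything up to
# the end of the first href value that starts with "/".
INTERNAL_HREF = re.compile(r'^(?!.*stylesheet)(.*?\bhref="/[^"]+)')


def append_index(line: str) -> str:
    """Append index.html to the first internal href on the line, if any."""
    return INTERNAL_HREF.sub(r'\g<1>index.html', line, count=1)


if __name__ == "__main__":
    for sample in ('<a href="/api_supplier/">', '<a href="http://github.com/">'):
        print(append_index(sample))
```

Lines with external links or a stylesheet attribute come back unchanged, so the function can be mapped over every line of a file.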

Related

How can I fix this regex in order to get html tag only from a particular url?

Hello I have a html file with several img tags:
<img src="https://www.pokeyplay.com/imagenes/backend/publicidad.gif" alt="Publicidad" align="left" />
<img src="https://www.pokeyplay.com/imagenes/backend/spacer.gif" alt="sp" />
<img src="imagenes/backend/etiqueta-pyp-pokedex.gif" alt="P&P PokéDex" width="184" height="100" />
<img src="imagenes/backend/spacer.gif" alt="sp" />
<img src="http://urpgstatic.com/img_library/pokemon_sprites/187.png" style="vertical-align:middle" />
In order to extract all img tags I am using the following regexp:
'<img[^>]* src=\"([^\"]*)\"[^>]*>'
But I want to extract only the img tags from urpgstatic.com.
How can I do this?
I made several attempts like this:
<img.*?src="(http[s]?:\/\/)urpgstatic.com?([^\/\s]+\/)(.*)[png]$"[^\>]+>
Thanks
Try this
<img[^>]*(?=\"https?:\/\/(www\.)?urpgstatic\.com)\"([^\"]*)\"[^>]*>
Also, this will work with grep
grep -iP '<img[^>]*(?=\"https?:\/\/(www\.)?urpgstatic\.com)\"([^\"]*)\"[^>]*>' index.html
You may use this grep command:
grep -ioE '<img [^>]*src="https?://(www\.)?urpgstatic\.com/[^>]*>' file.html
<img src="http://urpgstatic.com/img_library/pokemon_sprites/187.png" style="vertical-align:middle" />
Though please remember that parsing HTML with a regex can be error prone; an HTML parser such as DOM in PHP is more reliable.
RegEx Details:
<img [^>]*src=: Match <img <anything-except->src= text
"https?://: Match http:// or https://
(www\.)?urpgstatic\.com/: Match optional www. followed by urpgstatic.com/
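If a parser is an option, the same filter can be sketched with Python's standard html.parser instead of a regex. The class and function names here (ImgSrcCollector, img_sources_from) are invented for the example, and it returns the src values rather than the whole tags.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> whose URL host matches the given host."""

    def __init__(self, host):
        super().__init__()
        self.host = host
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        hostname = urlparse(src).hostname or ""  # empty for relative paths
        if hostname in (self.host, "www." + self.host):
            self.sources.append(src)


def img_sources_from(html, host="urpgstatic.com"):
    collector = ImgSrcCollector(host)
    collector.feed(html)
    return collector.sources
```

Relative paths such as imagenes/backend/spacer.gif have no host, so they are skipped along with images from other domains.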

Replacing HTML text elements with increment variable

In the below HTML part, I want to replace, whenever a text is found, with an incremental variable:
<li class="cat-item">
<a href="#" >Beautiful Reclessness</a>
</li>
<li class="cat-item">
<a href="#" >Comfort vs. Appearance</a>
</li>
<li class="cat-item">
<a href="#" >Highlights of the Runway</a>
<ul class='children'>
<li class="cat-item">
<a href="#" >Christian Louboutin Show</a>
</li>
<li class="cat-item">
<a href="#" >Givenchy F/W 2016</a>
</li>
<li class="cat-item">
<a href="#" >Spring by Gaultier</a>
To this using the x++ increment:
<li class="cat-item">
<a href="#" >x1</a>
</li>
<li class="cat-item">
<a href="#" >x2</a>
</li>
<li class="cat-item">
<a href="#" >x3</a>
<ul class='children'>
<li class="cat-item">
<a href="#" >x4</a>
</li>
<li class="cat-item">
<a href="#" >x5</a>
</li>
<li class="cat-item">
<a href="#" >x6</a>
Is there a way in Notepad++ or Vim to find the text contents (whatever sits between > and <) using a regex and replace them with an x counter?
Simple vim answer:
Open the file: vim filename
Set up a convenience variable: :let num=1
Do the replacement: :g/href/execute printf("normal! citx%d", num) | let num=num+1
The :global command allows one to perform an operation on all lines matching a pattern (in this case, href). The operation we want to do is change the text inside the <a> tag to x followed by the contents of num, and increment num.
execute lets us build a command line from strings; I often combine with printf() because I find it easier to read. normal! is an Ex-command that lets us execute normal-mode commands. cit is a vim'ism for "change inside tag" from normal mode. Then we just feed it the appropriate replacement text (x%d) and increment the counter.
If you're wondering how I came up with this, it's a pretty well-established pattern among vimmers. In practice, it took me probably about a minute to get the whole sequence done (faster if I used it more often), so it isn't one of those "spend 30 minutes trying to write a good regex" answers; this can be done in a live editing session without too much thought, if the person editing has a good grasp of vim fundamentals.
Hope that helps.
Download python script plugin
plugins > python script > new script > save as "increment.py"
Develop your regex at regex101 or somewhere else and write the script
i=0
def increment(match):
    global i
    i=i+1
    return "x"+str(i)
editor.rereplace('(?<=>)\\b[^><]+', increment)
Save and run your script: plugins > python script > scripts > increment
A slightly different approach on vim:
:let c=1 | g/a href="#" >\zs.*\ze</ s//\='x'.c/g | let c+=1
Using \zs and \ze we mark the part of the match that should be replaced. The
replacement expression concatenates 'x' with the counter to produce the numbered sequence:
\='x'.c ................. concatenate 'x' with the counter
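Outside an editor, the same counter idea can be sketched in plain Python. The function name number_link_texts is hypothetical, and the pattern assumes each link text sits directly between <a ...> and </a> with no nested tags.

```python
import itertools
import re

# Group 1 keeps the opening <a ...> tag; the text up to </a> is what gets replaced.
LINK_TEXT = re.compile(r'(<a\b[^>]*>)[^<]+(?=</a>)')


def number_link_texts(html, prefix="x"):
    """Replace each link text with prefix + a running counter starting at 1."""
    counter = itertools.count(1)
    return LINK_TEXT.sub(lambda m: m.group(1) + prefix + str(next(counter)), html)
```

The counter lives inside the function, so every call starts again from x1.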

Replace dynamic string with notepad++

I have an HTML page that has a list of href="<myUrls>" attributes. I'm now using Angular, so I need to find each one and replace it with (click)="redirectTo('myUrls')". Is it possible to do this with Notepad++?
Example :
mypage.html
<a href="home.html" class="myClass1 myClass2">
<a href="myProfile.html" class="myClass3 myClass4>
<a href="aboutUs.html" class="myClass1 myClass2">
<a href="gallery.html" class="myClass3 myClass4>
I want this code to be replaced with:
<a (click)="redirectTo('home.html')" class="myClass1 myClass2">
<a (click)="redirectTo('myProfile.html')" class="myClass3 myClass4">
<a (click)="redirectTo('aboutUs.html')" class="myClass1 myClass2">
<a (click)="redirectTo('gallery.html')" class="myClass3 myClass4">
Ctrl+H
Find what: href="([^"]+)
Replace with: \(click\)="redirectTo\('$1'\) (don't forget to escape the parentheses)
check Wrap around
check Regular expression
DO NOT CHECK . matches newline
Replace all
Explanation:
href=" : literally
( : start group 1
[^"]+ : 1 or more any character that is not a quote
) : end group 1
Result for given example:
<a (click)="redirectTo('home.html')" class="myClass1 myClass2">
<a (click)="redirectTo('myProfile.html')" class="myClass3 myClass4>
<a (click)="redirectTo('aboutUs.html')" class="myClass1 myClass2">
<a (click)="redirectTo('gallery.html')" class="myClass3 myClass4>

regex to remove recurring instances of comment tag

Hello, I want to remove all instances of comment tags that occur in my data.
The data I am using is shown below:
<!-- <li><a class="topitemlink" href="/About-Us/Career-Centre.aspx">Career Centre</a></li>
<li><img alt="" width="7" height="22" src="/images/common/separator.gif" /></li>-->
<li><a class="topitemlink" href="/ContactUs">Contact Us</a> <!-- <ul class="topcontactusmenu"><li>Contact Us</li><li>Contact the IR Team</li><li>Contact the Media Team</li></ul> --></li>
</ul>
</div>
<!--<img width="92" height="40" src="/ABMB/media/MyLibrary/Shared/Images/bizSmart_logo.gif" alt="" /><img width="76" height="40" src="/ABMB/media/MyLibrary/Shared/Images/sabah-run2015_top-icon.jpg" alt="" />-->
The regex I am using just captures the first instance but I want all instances to be captured.
<!--.*\s.*-->
You could use something like so: <!--.+?--> (Example here). Make sure that you have the sg flag enabled.
The s flag would allow the period character to also match new line feeds, thus allowing you to capture comments which span multiple lines.
The g flag will apply the pattern globally, that is, to the entire text.
You didn't specify the language you're using but for php you can use /<!--.*?-->/s , i.e.:
$html = '<!-- <li><a class="topitemlink" href="/About-Us/Career-Centre.aspx">Career Centre</a></li>
<li><img alt="" width="7" height="22" src="/images/common/separator.gif" /></li>-->
<li><a class="topitemlink" href="/ContactUs">Contact Us</a> <!-- <ul class="topcontactusmenu"><li>Contact Us</li><li>Contact the IR Team</li><li>Contact the Media Team</li></ul> --></li>
</ul>
</div>
<!--<img width="92" height="40" src="/ABMB/media/MyLibrary/Shared/Images/bizSmart_logo.gif" alt="" /><img width="76" height="40" src="/ABMB/media/MyLibrary/Shared/Images/sabah-run2015_top-icon.jpg" alt="" />-->';
$html = preg_replace('/<!--.*?-->/s', '', $html);
echo $html;
/*<li><a class="topitemlink" href="/ContactUs">Contact Us</a> </li>
</ul>
</div>*/
DEMO:
https://ideone.com/It6HvW
EXPLANATION:
<!--.*?-->
Options: Case sensitive; Exact spacing; Dot matches line breaks; ^$ don’t match at line breaks; Greedy quantifiers; Regex syntax only
Match the character string “<!--” literally «<!--»
Match any single character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character string “-->” literally «-->»
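The same lazy, dot-matches-newline pattern carries over to other languages. As a sketch in Python, where re.DOTALL plays the role of the s flag and sub() already replaces every match (strip_comments is an illustrative name):

```python
import re

# re.DOTALL lets "." match newlines, so comments spanning several lines are caught;
# the lazy ".*?" stops at the first "-->" instead of the last one.
COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)


def strip_comments(html):
    """Remove every HTML comment, including multi-line ones."""
    return COMMENT.sub('', html)
```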

How can I extract URLs from html content with ruby regexp?

Let's go directly to an example, since it is not easy to explain:
<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> random strings - 4 <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&a_aid=&a_bid=&chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style
In the above content I want to extract from
javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')
the string "f6a1ok3n4d4p" and "site2.com" then make it as
http://site2.com/f6a1ok3n4d4p
and same for
javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')
to become
http://site1.com/zsgn82c4b96d
I need it to be done with a Ruby regex.
This should give you some insight of how to do it.
https://regex101.com/r/wD4oT8/2
javascript:show\(\'(.*?)'.*?\'([^\']*)\'\) will capture the first argument as $1, last part within ' as $2, so you get what you want by substituting as $2/$1.
That's the regex part of it, and, of course, you can adjust the regex as you see fit, for example to also accept " as the quote character (javascript:show\((?:\'|\")(.*?)(?:\'|\").*?(?:\'|\")([^\'\"]*)(?:\'|\")\)) or to allow only calls with 3 arguments.
/yourregex/.match(yourstring) will extract the information you need.
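To check the capture groups end to end, here is the pattern exercised in a short Python sketch (Python only for illustration, since the question asks for Ruby; the pattern itself is not Python-specific). The name show_links is hypothetical, and single-quoted arguments are assumed.

```python
import re

# Group 1: first argument (the id); group 2: last quoted argument (the host).
SHOW_CALL = re.compile(r"javascript:show\('([^']*)'.*?'([^']*)'\)")


def show_links(html):
    """Build http://host/id for every javascript:show(...) call found."""
    return [f"http://{host}/{ident}" for ident, host in SHOW_CALL.findall(html)]
```

findall returns one (id, host) tuple per call, so a page with several show(...) links yields one URL for each.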