regex to remove hyperlinks - regex

Input:
source http://www.emaxhealth.com/1275/misdiagnosing from here http://www.cancerresearchuk.org/about-cancer/type recounting her experiences and thoughts blog http://fty720.blogspot.com even carried the new name. She was far from home.
From the above input I want to remove the hyperlinks. Below is the regex that I am trying:
http://[\w|\W|\d|\s]*(?=[ ])
I expected this regex to match all characters, digits and whitespace after encountering 'http' and to continue until the first blank space.
Unfortunately, it is not working as expected. Please help me find my error. Thanks.

Try this sed command
sed 's/http[^ ]\+//g' FileName
Output :
source from here recounting her experiences and thoughts blog even carried the new name. She was far from home.
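For comparison, here is a minimal Python sketch of the same idea, using the sample input from the question. It also shows why the original pattern misbehaves: the class [\w|\W|\d|\s] matches any character and * is greedy, so the match runs up to the last space in the input rather than the first. The \s* at the end of the replacement pattern is an addition of this sketch, not part of the sed command; it swallows the space after each URL so no double spaces are left behind.

```python
import re

text = (
    "source http://www.emaxhealth.com/1275/misdiagnosing from here "
    "http://www.cancerresearchuk.org/about-cancer/type recounting her "
    "experiences and thoughts blog http://fty720.blogspot.com even carried "
    "the new name. She was far from home."
)

# Original attempt: the class matches every character and * is greedy,
# so the match extends from the first "http" to just before the LAST space.
greedy = re.sub(r'http://[\w|\W|\d|\s]*(?=[ ])', '', text)

# Stop at the first whitespace instead (\S+), then drop the spaces after the URL.
cleaned = re.sub(r'https?://\S+\s*', '', text)

print(repr(greedy))
print(cleaned)
```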

To find the hyperlink use:
\b(https?)://[A-Z0-9+&@#/%?=~_|$!:,.;-]*[A-Z0-9+&@#/%=~_|$]
Or, if you want to find the HTML a tag, use:
<a\b[^>]*>(.*?)</a>

Related

How to use Regex to replace a tag in a word document with Powershell

First post on stackoverflow for me so sorry if something is out of norm or similar ^^
Currently I'm trying to find a way to read vouchers out of a .csv that I get from my pfsense.
Plan is to read it out of the .csv and write it down in a Word document so that secretaries can print it out and give them out to coworkers.
So far I have no problems replacing names and room numbers, all I gotta do now is to find a way to replace the voucher codes, but since they obviously always change I tried to use regex, here's the current state of that part of my code:
if ($Vouchers -match '((\d|\w){11})*') {
$matches.0 }
ReplaceTag -Document $Doc -FindText '<Vouchers>' -replacewithtext $matches
The regex itself is working perfectly fine (already tested it on regex101) so I guess it's the code.
I'm assuming that it's trying to literally match "((\d|\w){11})*" instead of using the pattern :\
Any kinda help would be welcomed!
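One possible cause, sketched in Python rather than PowerShell: the trailing * allows zero repetitions of the 11-character group, so the pattern can succeed with an empty match at the very start of the string. The sample line and the stricter pattern below are assumptions for illustration, not taken from the real pfSense export. Separately, in PowerShell $matches is the whole match table, while $matches[0] is the matched text.

```python
import re

# Hypothetical CSV line; the real pfSense export format is not shown in the question.
line = "room 12;voucher;a1b2c3d4e5f"

# The pattern from the question: * permits zero repetitions,
# so this succeeds immediately with an empty match at position 0.
loose = re.search(r'((\d|\w){11})*', line)

# Assumed alternative: require exactly one run of 11 word characters.
# (\w already includes digits, so \d|\w is redundant.)
strict = re.search(r'\b\w{11}\b', line)

print(repr(loose.group(0)))
print(strict.group(0))
```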

grep complete resource url within a file

I have to search and extract within a file addresses like these:
http://deimos.apple.com/WebObjects/Core.woa/DownloadRedirectedTrackPreview/unina.it-dz.5373092572.05373092574.12739786322/enclosure.m4v
There are 38 links, and only the last series of digits changes.
I tried with this regexp:
grep -io 'http://ex[a-z.-]*/[a-z0-9+-]*/[a-z0-9.,-+]*[.m4v]'
It extracts all the URLs in the file that point to an m4v file, but instead of the complete URL I get a partial one, as follows:
http://deimos.apple.com/WebObjects/Core.woa/DownloadRedirectedTrackPreview/unina.
Where am I wrong?
I can't figure out why it happens.
Thanks a lot for your effort.
Your regex and your extracted URL do not match. The URL that you list does not begin with:
http://ex
which your regex requires. You could change your regex to something more like this, which would match your URL (the grouping needs grep -E, and the dot has to be allowed inside each path segment):
'http://([a-z0-9.+-]+/)*[a-z0-9.+-]+\.m4v'
Sorry Jonathan, that was a typo in my post; the regex I actually ran used dei, not ex as written.
But the problem persisted.
Marc opened my mind.
I knew how the address starts so I tried with
grep -io 'http://dei*/*.m4v'
no success :-(
fedorqui gave the last hint, maybe the problem was a dot
so I tried
grep -io 'http://deimos.*/*.m4v' :-D
and it did the trick!
Now I have the file to give to wget to automate multiple file downloads without soiling my system with emulators and proprietary software.
The files are podcasts of law lectures, released free as in freedom but easy to get only for those who would buy Apple or Microsoft (iTunes).
Thanks to all indeed!!
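One detail from the original attempt is worth illustrating: [.m4v] is a character class that matches a single character (one of ., m, 4 or v), not the literal extension .m4v. The Python sketch below uses simplified patterns, not the exact grep expression from the question, to show how that can cut a match short at the first dot. The surrounding text in the sample is made up.

```python
import re

url = ("http://deimos.apple.com/WebObjects/Core.woa/"
       "DownloadRedirectedTrackPreview/"
       "unina.it-dz.5373092572.05373092574.12739786322/enclosure.m4v")
text = "some text " + url + " more text"

# [.m4v] matches ONE character, so the match may end at the first dot it reaches.
partial = re.search(r'http://[a-z0-9/]*[.m4v]', text, re.IGNORECASE).group(0)

# Escaping the dot and spelling out the extension matches the literal ".m4v".
full = re.search(r'http://[a-z0-9./-]*\.m4v', text, re.IGNORECASE).group(0)

print(partial)
print(full)
```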

Capturing content from a string

I am attempting to parse some logs to get the specific catalog numbers for the items viewed. I have broken out all the necessary fields and am now parsing the referer field to get the catalog id of the page viewed.
The strings are in the following formats:
/catalog/AAA1111111
/catalog/BBB-22222-1/
/catalog/CCC-333333/XXX
http://url/catalog/DDD-44444444
http://url/catalog/EEE-555555555/ZZZ
I am using the following regex to strip out the catalog id:
.*\/catalog\/([^\/]+)
The problem is that I cannot stop the regex from grabbing everything after the next forward slash. It looks like it is too greedy?
The results are:
AAA1111111
BBB-22222-1/
CCC-333333/XXX
DDD-44444444
http:EEE-555555555/ZZZ
I've been banging my head on this one for a couple of hours.
I am just looking for a regex that will split out just the catalog id (the string after catalog/.)
Can anyone help guide this old coder in the proper direction?
Many thanks.
using sed
cat catalogs | sed -E 's/.*\/catalog\/([^/]+)\/?.*/\1/g'
results in
AAA1111111
BBB-22222-1
CCC-333333
DDD-44444444
EEE-555555555
Note that the only modification is matching the trailing stuff.
Why use a regex when you can split on "/catalog/", take the last item, then split on "/" and take the first item?
In Python, this could be done like this:
line.split('/catalog/')[-1].split('/')[0]
Just wanted to point out that regexes are not the solution to every string parsing problem.
Often, when you're faced with "greedy" parsing, doing a "manual" modification before using a regex helps.
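Both approaches from this thread can be compared side by side in Python. This is only a sketch using the five sample strings from the question. The regex variant searches for the capture group instead of replacing the whole line, which is an assumption about how the log-parsing tool could be used.

```python
import re

lines = [
    "/catalog/AAA1111111",
    "/catalog/BBB-22222-1/",
    "/catalog/CCC-333333/XXX",
    "http://url/catalog/DDD-44444444",
    "http://url/catalog/EEE-555555555/ZZZ",
]

# Regex: capture everything after "/catalog/" up to (not including) the next slash.
CATALOG_ID = re.compile(r'/catalog/([^/]+)')
by_regex = [CATALOG_ID.search(line).group(1) for line in lines]

# Plain string splitting, as suggested in the answer above.
by_split = [line.split('/catalog/')[-1].split('/')[0] for line in lines]

print(by_regex)
print(by_split)
```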

Get value between <b> tag using regex in Yahoo Pipes

I have searched up and down trying to find an answer that will work for me but haven't been able to figure this out. I'm using Yahoo Pipes for this.
Lake Harmony Estates <b>Sleeps: 16</b>
What I need to do is extract the Sleeps: 16 out from the B tag and output just that value and nothing else. I don't suspect this is very hard to do, but given my limited regex knowledge it's giving me troubles. I've tried adapting regex code pertaining to other tags, but just can't seem to get this one to work.
Any help on this would be appreciated. Thanks.
Edit:
Here is my pipe if you wanted to take a look at the regex horrible-ness I've created. The one I'm trying to work though is the item.sleeps, last entry in the 2nd regex
http://pipes.yahoo.com/pipes/pipe.info?_id=567026d850223b0075d80fd3c9bf7e75
This should fit your needs, assuming the HTML isn't laden with quotes and such. Note that the + means that empty <b> tags are ignored. Also, HTML is not truly parsable via regex, so this will only work for basic tags. It should work even if the tag has an ID or a class property, but there are definitely ways to break this regex.
/<b[^>]*>([^<]+)<\/b>/
I posted this question to Twitter and got a response back that worked for me.
(?s)^.*<b>(.*?)</b>.*
Replace with $1 and have G flag checked.
This solution did everything I needed. I had additional data that I had already excluded in my example that became unnecessary with this regex.
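The two patterns in this thread can be tried outside Yahoo Pipes. The Python sketch below runs them against the sample string from the question. Python's re module stands in for the Pipes regex engine here, which is an assumption: the two engines may differ in details such as flag handling.

```python
import re

item = "Lake Harmony Estates <b>Sleeps: 16</b>"

# First answer: capture the non-empty text inside a <b ...> tag.
captured = re.search(r'<b[^>]*>([^<]+)</b>', item).group(1)

# Second answer: match the whole string and replace it with the captured group
# ($1 in Yahoo Pipes corresponds to \1 in Python's re.sub).
replaced = re.sub(r'(?s)^.*<b>(.*?)</b>.*', r'\1', item)

print(captured)
print(replaced)
```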

Removing Different URLs with Regex

I am looking to remove a ton of bad spam URL links from my forums using regex in either grep or vim and subsequently using find/replace commands. I am looking for a way to select just the bad URLs to do that.
All of the URLs are different and are preceded by \n________\n. (That's 8 underscores.)
Here is an example of one of the URLs:
\n________\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
So basically I was trying to use the \n... and the [/URL] as boundaries to select that and everything in between. What I came up with is this:
[\\]n[_][_][_][_][_][_][_][_][\\]n.*\[\/URL\]]
Using that does not correctly close the search and selects pretty much everything. I am very new at this and appreciate any insight. Thanks.
Assuming GNU ERE, this should work:
\\n_{8}\\n\s\[URL=(.*)].*\[/URL]
RegexBuddy seems to agree with me:
That said,
> grep -E \\n_{8}\\n\s\[URL=(.*)].*\[/URL] test.txt
doesn't work on my system (Cygwin with GNU grep 2.6.3; test.txt's contents are shown in the screenshot above).
If you want to give sed a chance following will do the job:
sed 's/^.*\(\[URL.*\)$/\1/' file.txt
PS: You can do the same with :s/^.*\(\[URL.*\)$/\1/ in your vi session as well.
OUTPUT
For the file.txt that contains:
\n__\n[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
It produces:
[URL=http://boxvaporizers.com]Box Vaporizers[/URL]
In Vim this should remove all lines that match the pattern:
:g/\\n\%(\\_\)\{8}\\n \[URL=.\{-}\/URL\]/d
That pattern matches the sample text taken literally, all in one line.
I was actually able to do this in Microsoft Word using the following:
[\\]n_{8}[\\]n?*/URL\]
Thank you for all the input, couldn't have done it without the help!
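For anyone who would rather script the cleanup, here is a minimal Python sketch of the same removal. It assumes, as the Vim answer does, that \n in the forum data is a literal backslash followed by n rather than a real newline. The surrounding post text is made up for illustration, and \s* is included only to tolerate an optional space before [URL=.

```python
import re

# Literal backslash-n, eight underscores, literal backslash-n, then the [URL=...]...[/URL] block.
SPAM = re.compile(r'\\n_{8}\\n\s*\[URL=[^\]]*\].*?\[/URL\]')

# Hypothetical forum post ending with the spam signature from the question.
post = ("Nice thread!" + r"\n" + "_" * 8 + r"\n"
        + "[URL=http://boxvaporizers.com]Box Vaporizers[/URL]")

cleaned = SPAM.sub('', post)
print(cleaned)
```

The non-greedy .*? keeps each match from running past the first [/URL] that follows it, which addresses the "selects pretty much everything" symptom when several links share a line.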