Regex and Yahoo Pipes: How to replace end of url - regex

Here's the Pipe though you may not need it to answer the question: http://pipes.yahoo.com/pipes/pipe.info?_id=85a288a1517e615b765df9603fd604bd
I am trying to rewrite every URL of this form:
http://mediadownloads.mlb.com/mlbam/2009/08/12/mlbf_6073553_th_3.jpg
as:
http://mediadownloads.mlb.com/mlbam/2009/08/12/mlbtv_6073553_1m.mp4
The rule should be something like:
In item.mediaUrl, replace f with tv, and in item.mediaUrl, replace the last 8 characters with 1m.mp4
I tried replacing mlbf_(\d+)_.* with mlbtv_$1_1m.mp4
but that breaks the RSS feed, though I know I am close.
Any idea what syntax I need there?

Your regex and replacement look okay to me, assuming the regex is being applied only to the URLs. If it were being applied to the surrounding text as well, the .* would tend to consume a lot more than you wanted. See what happens if you change the regex to this:
mlbf_(\d+)_[\w.]+
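As a rough check outside Yahoo Pipes, the difference between the two patterns can be sketched in Python. The surrounding img markup and the variable names are assumptions made up for illustration, and Python writes the back-reference as \1 where Pipes uses $1.

```python
import re

# Hypothetical feed text: the URL is followed by other content on the same line.
text = ('<img src="http://mediadownloads.mlb.com/mlbam/2009/08/12/'
        'mlbf_6073553_th_3.jpg" alt="highlight"/>')

# Original pattern: the greedy .* runs to the end of the line.
greedy = re.sub(r"mlbf_(\d+)_.*", r"mlbtv_\1_1m.mp4", text)

# Suggested pattern: [\w.]+ stops at the first character that is
# neither a word character nor a dot, here the closing quote.
bounded = re.sub(r"mlbf_(\d+)_[\w.]+", r"mlbtv_\1_1m.mp4", text)

print(greedy)
print(bounded)
```

With the greedy version the closing quote and the rest of the tag are swallowed, which is one plausible way for a feed to end up broken.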

I do not know how Yahoo Pipes works, but according to this site, this regex should do it:
Regex:
.*?/([0-9]*)/([0-9]*)/([0-9]*)/mlbf_([0-9]*)_.*
Substitution:
http://mediadownloads.mlb.com/mlbam/$1/$2/$3/mlbtv_$4_1m.mp4
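For anyone who wants to try this regex and substitution without Pipes, here is a minimal Python sketch. It assumes the pattern is applied to the bare URL only, and it uses Python's \1 to \4 in place of $1 to $4.

```python
import re

url = "http://mediadownloads.mlb.com/mlbam/2009/08/12/mlbf_6073553_th_3.jpg"

# The three date components and the numeric ID are captured,
# then the whole URL is rebuilt around them.
pattern = r".*?/([0-9]*)/([0-9]*)/([0-9]*)/mlbf_([0-9]*)_.*"
replacement = r"http://mediadownloads.mlb.com/mlbam/\1/\2/\3/mlbtv_\4_1m.mp4"

result = re.sub(pattern, replacement, url)
print(result)
```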

Related

use regex to get both link and text associated with it (anchor tag)

I created a regex string that I hoped would get both the link and the associated text in an html page. For instance, if I had a link such as:
<a href='www.la.com/magic.htm'>magicians of los angeles</a>
Then the link I want is 'www.la.com/magic.htm' and the text I want is 'magicians of los angeles'.
I used the following regex:
strsearch = "\<a\s+(.*?)\>(.*?)\</a\s*?\>|"
But my VB program told me I was getting too many matches.
Is there something wrong with the regex?
The parentheses are meant to capture groups that can be back-referenced.
Thanks
What about this one:
\<a href=.+\</a>
All that is left to do is to go over each match and extract the substrings using regular string manipulation.
You can check it on RegExr (although RegExr follows the JavaScript regex implementation, it is still useful in our scenario).
That said, I often see people stating that regexes are not suited for parsing HTML. You might need to use an HTML parser for this. The ones I know of and can recommend are HtmlAgilityPack, which is not maintained anymore, and AngleSharp.
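To illustrate the parser route, here is a minimal sketch in Python using the standard library's html.parser as a stand-in for the .NET libraries named above. The LinkCollector class and its attribute names are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


collector = LinkCollector()
collector.feed("<a href='www.la.com/magic.htm'>magicians of los angeles</a>")
collector.close()
print(collector.links)
```

The parser deals with quoting and attribute order itself, so no hand-written pattern has to anticipate them.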
I tried the following pattern, and it worked:
\<a href=(.*?)\>(.*?)\<\/a\s*?\>|
I also found two errors in your original string:
the / in /a was not escaped
the attribute name 'href' is captured in the first group
Finally, I would like to recommend a great site for testing regex strings; it makes debugging much faster. See this example, which also demonstrates the result you want:
REGEX101
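Here is a small Python sketch of pulling both groups out with that pattern. The escapes \< \> \/ are dropped because Python does not need them, and the sample sentence around the link is an assumption. It also shows one thing worth checking: in Python's re module, a trailing | adds an empty alternative that matches the empty string at other positions, which could be related to the "too many matches" report. Whether the VB engine behaves the same way is not confirmed here.

```python
import re

html = "See <a href='www.la.com/magic.htm'>magicians of los angeles</a> now"

# The pattern with the trailing "|" dropped.
pattern = r"<a href=(.*?)>(.*?)</a\s*?>"
match = re.search(pattern, html)
link = match.group(1).strip("'\"")   # drop the surrounding quotes
text = match.group(2)
print(link)
print(text)

# With the trailing "|" restored, the empty alternative also matches.
with_pipe = [m.group(0) for m in re.finditer(pattern + "|", html)]
non_empty = [m for m in with_pipe if m]
print(len(with_pipe), "matches in total,", len(non_empty), "of them non-empty")
```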

regular expression in excel for numbers before a slash

In the example below, I need to change everything before the final slash to jreviews/
so the first line would become
jreviews/159256_0907131531001639107_std.jpg
I am using the OpenOffice find and replace tool. I see there is an option for regex, but I don't know how to use it. How can I find the img.agoda URLs, including every number and slash up to the final slash, and replace that with jreviews/,
while keeping the numbers after the final slash, because these are the filenames?
http://img.agoda.net/hotelimages/159/159256/159256_0907131531001639107_std.jpg
http://img.agoda.net/hotelimages/161/161941/161941_1001051215002307125_std.jpg
http://img.agoda.net/hotelimages/288/288595/288595_111017161615_std.jpg
http://img.agoda.net/hotelimages/289/289890/289890_13081511070014319856_std.jpg
http://img.agoda.net/hotelimages/305/305075/305075_120427175058_std.jpg
http://img.agoda.net/hotelimages/305/305078/305078_120427175537_std.jpg
Regex seems like overkill, at least for your examples. Since they all have the same number of subfolders, a simple Find and Replace with wildcards works for me. Here's how I did it in Excel:
Just replace http://*/*/*/*/ with jreviews/.
Try this:
Replace whatever the pattern below matches with "CustomName/":
^.+[/$]
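The effect of that pattern can be sketched in Python. OpenOffice's regex dialect may differ in details, so treat this as an illustration of the idea rather than of the Find and Replace dialog itself.

```python
import re

urls = [
    "http://img.agoda.net/hotelimages/159/159256/159256_0907131531001639107_std.jpg",
    "http://img.agoda.net/hotelimages/288/288595/288595_111017161615_std.jpg",
]

# ^.+ is greedy, so the match ends at the last "/" in the URL.
# Inside a character class "$" is a literal dollar sign,
# so [/$] behaves like a plain "/" for these URLs.
renamed = [re.sub(r"^.+[/$]", "jreviews/", url) for url in urls]
print(renamed)
```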

Parsing words separated with hyphen

I need to parse the string below using regular expressions. I came up with two variants, both of which seem a bit ugly to me. Please advise which would be better suited for the job.
The main task is to parse the url in scrapy.
Sample expression -
/article/2014/01/16/hcl-tech-earnings-shares-idINDEEA0F02920140116
Regex -
/article/(\d+)/(\d+)/(\d+)/([0-9A-Za-z-]+)
/article/(\d+)/(\d+)/(\d+)/\w+(-\w+)*
And yes, I need to capture the whole ending expression, which the first regex handles perfectly. I verified both regexes using https://pythex.org/.
Edit -
Expected Format -
/article/(yyyy)/(mm)/(dd)/(words-separated-by-hyphen)
I want to capture all the stuff separated by / after /article
Simply use:
/article/(\d+)/(\d+)/(\d+)/(.*)
The hyphens don't seem to have anything to do with what you need from the URL, so...
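To compare the three patterns side by side, here is a short Python sketch using only the standard re module (nothing Scrapy-specific is assumed). It shows why the second variant from the question does not capture the whole ending: a repeated group keeps only its last repetition.

```python
import re

path = "/article/2014/01/16/hcl-tech-earnings-shares-idINDEEA0F02920140116"

first = re.fullmatch(r"/article/(\d+)/(\d+)/(\d+)/([0-9A-Za-z-]+)", path)
second = re.fullmatch(r"/article/(\d+)/(\d+)/(\d+)/\w+(-\w+)*", path)
catch_all = re.fullmatch(r"/article/(\d+)/(\d+)/(\d+)/(.*)", path)

print(first.groups())
print(second.groups())     # last group holds only the final "-word" piece
print(catch_all.groups())
```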

How to use regex to replace text between tags in Notepad++

I have code like this:
<pre><code>Some HTML code</code></pre>
I need to escape the HTML between the <pre><code></code></pre> tags. I have lots of tags, so I thought: why not let regex do it for me? The problem is that I don't know how. I've seen lots of examples on Google and Stack Overflow, but nothing I could use. Can someone here help me?
Example:
<pre><code>Some <a href="http">HTML</a> code</code></pre>
To
<pre><code>Some &lt;a href="http"&gt;HTML&lt;/a&gt; code</code></pre>
Or just a regex so I can replace anything between the <pre><code> and </code></pre> tags one by one. I'm almost certain that this can be done.
This regex will match the parts of the anchor tag you need to put back:
<pre><code>([^<]*?)<a href="(.*?)">(.*?)</a>(.*?)</code></pre>
See a live demo, which shows it matching correctly and also shows the various parts being captured as groups, which we'll refer to in the replacement string (see below).
Use the regex above with the following replacement:
<pre><code>\1&lt;a href="\2"&gt;\3&lt;/a&gt;\4</code></pre>
The \1, \2, etc. are the captured groups from the regex; they put back what we're keeping from the match.
A regular expression to return "the thing between <pre><code> and </code></pre>" could be
/(?<=<pre><code>).*?(?=<\/code><\/pre>)/
This uses lookaround expressions to delimit the "thing that gets matched". Typically, using regex in situations with nested tags is fraught with danger, and you are much better off using "real tools" made specifically for the job of parsing XML, HTML, etc. I am a huge fan of Beautiful Soup (Python) myself. I am not familiar with Notepad++, so I am not sure whether its dialect of regex supports this expression exactly.
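Outside Notepad++, the lookaround expression can be combined with a real escaping function. The sketch below is in Python and assumes a made-up sample page. It uses html.escape from the standard library, so every tag inside the block is escaped, not just anchors.

```python
import html
import re

# Hypothetical input: one <pre><code> block containing raw markup.
page = '<p>Intro</p><pre><code>Some <a href="http">HTML</a> code</code></pre>'

# Look-arounds keep the delimiters out of the match,
# so only the inner text is handed to the replacement function.
inner = re.compile(r"(?<=<pre><code>).*?(?=</code></pre>)", re.DOTALL)

# html.escape with quote=False rewrites &, < and > and leaves quotes alone.
escaped = inner.sub(lambda m: html.escape(m.group(0), quote=False), page)
print(escaped)
```

Text that is already escaped would be escaped a second time, so this is only meant for blocks that still contain raw markup.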

Regular expression with negative look aheads

I am trying to construct a regular expression to remove links from content unless they meet one of two conditions.
<a.*?href=[""'](http[s]?:\/\/(.*?)\.link\.com)?\/(?!m\/).*?<\/a>
This matches any link to link.com that does not have m/ directly after the domain section. I want to change this slightly so that it also doesn't match URLs that point to PDF files, regardless of whether they have m/ in the URL. I came up with:
<a.*?href=["'](http[s]?:\/\/(.*?)\.brodies\.com)?\/(?!m\/).*?\.(?!pdf)["'].*?<\/a>
This is oh so very close, except that now it only matches if the URL has a "." at the end, and I can see why it does that. I can't seem to make the "." optional, as this causes the non-greedy pattern before the "." to keep going until it hits the ["']
Any help would be good to help solve this.
Thanks
Paul
You probably want to use (?<!\.pdf)["'] instead of \.(?!pdf)["'].
But note that this expression has several issues; the best way to solve them is to use a proper HTML parser.
First, see "RegEx match open tags except XHTML self-contained tags".
That said (since it probably will not deter you), here is a slightly better-constrained version of what you're trying to do, with the caveat that this is still not good enough!
<a[^>]+?href\s*=\s*["'](https?:\/\/[^"']*?\.link\.com)?\/(?!m\/)[^"']*?\.(?!pdf)[^"']*?["'][^>]*?>.*?<\/a>
You can see a running example of this regex at: http://rubular.com/r/obkKrKpB8B.
Your problem was actually just that you were looking for a quote character immediately after the dot, here: \.(?!pdf)["'].
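As a quick way to test the look-behind idea, here is a Python sketch. The pattern is an assumed combination, not something either answer posted verbatim: it places (?<!\.pdf) from the first answer inside the tighter character classes of the second. The sample links are made up, and link.com is the question's placeholder domain.

```python
import re

# Assumed combination of the two suggestions above.
link_re = re.compile(
    r"""<a[^>]+?href\s*=\s*["'](https?://[^"']*?\.link\.com)?/(?!m/)"""
    r"""[^"']*?(?<!\.pdf)["'][^>]*?>.*?</a>"""
)

samples = [
    '<a href="http://www.link.com/about">About</a>',           # ordinary link
    '<a href="http://www.link.com/m/page">Mobile</a>',         # path starts with m/
    '<a href="http://www.link.com/files/report.pdf">PDF</a>',  # PDF link
    '<a href="/m/files/report.pdf">Mobile PDF</a>',            # both conditions
]

# True means the link would be removed, False means it is kept.
matched = [bool(link_re.search(s)) for s in samples]
print(matched)
```

The check is case-sensitive and only looks at the characters right before the closing quote, so a URL ending in .PDF or in a query string after .pdf would still match. That is another reason to prefer a parser for real content.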