Parsing words separated with hyphen - regex

I need to parse the string below using regular expressions. I came up with two variants, both of which seem a bit ugly to me. Which one is better suited for the job?
The main task is to parse the url in scrapy.
Sample expression -
/article/2014/01/16/hcl-tech-earnings-shares-idINDEEA0F02920140116
Regex -
/article/(\d+)/(\d+)/(\d+)/([0-9A-Za-z-]+)
/article/(\d+)/(\d+)/(\d+)/\w+(-\w+)*
And yes, I need to capture the whole trailing segment, which the first regex handles perfectly. I verified both regexes using https://pythex.org/.
Edit -
Expected Format -
/article/(yyyy)/(mm)/(dd)/(words-separated-by-hyphen)
I want to capture all the stuff separated by / after /article

Simply use:
/article/(\d+)/(\d+)/(\d+)/(.*)
The hyphens don't seem to matter for what you want to capture from the URL, so there is no need to match them explicitly.
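To compare the candidates side by side, here is a minimal Python sketch run against the sample URL from the question. The variable names are illustrative, and the last pattern (`strict`) is not from the thread: it is an assumed variant that wraps the hyphen-separated words in a single capturing group and pins the date widths.

```python
import re

url = "/article/2014/01/16/hcl-tech-earnings-shares-idINDEEA0F02920140116"

# The two patterns from the question and the one from the answer.
explicit = re.compile(r"/article/(\d+)/(\d+)/(\d+)/([0-9A-Za-z-]+)")
repeated = re.compile(r"/article/(\d+)/(\d+)/(\d+)/\w+(-\w+)*")
catch_all = re.compile(r"/article/(\d+)/(\d+)/(\d+)/(.*)")

# Assumed variant (not in the thread): one group around the whole slug,
# with a non-capturing group for the repeated "-word" part.
strict = re.compile(r"/article/(\d{4})/(\d{2})/(\d{2})/(\w+(?:-\w+)*)")

explicit_groups = explicit.match(url).groups()
repeated_groups = repeated.match(url).groups()
catch_all_groups = catch_all.match(url).groups()
strict_groups = strict.match(url).groups()

# A repeated capturing group keeps only its last repetition,
# so the second pattern loses most of the slug.
print(repeated_groups[-1])
print(explicit_groups[-1])
```

The second pattern from the question matches the URL but only captures the final `-word` piece, which is why the first pattern (or the catch-all) is the one that returns the whole ending.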

regular expressions: catch any URLs of the domain example.com

I'm trying to write a regexp for the case below. I have made multiple attempts, but in vain.
I need to catch any URL of the domain site.com. I tried the regexp '^site.com/*$'
but it does not recognize them.
I'm just looking for a regexp that matches site.com/*
Your expression ^site.com/*$ matches only strings that start with site.com and are followed by zero or more trailing / characters (/*), with nothing after them.
If you want to match any string starting with site.com/, you might want to try ^site\.com/.*$.
There are already a lot of other regex questions about domain names on SO, but it is not clear to me in what context you are trying to do this or what goal you actually want to achieve. If you describe your needs more precisely, you could probably find some answers on this forum.
I generally use a helper website like regex101.com.
Also, a few things to note: . has a special meaning in regex (it matches any character), so it needs escaping, and if you want to capture site.com/foo you need a pattern that does not limit the number of characters after the slash. I'd do this with groups.
^(site\.com\/)(.+)$
You can see this in action here: https://regex101.com/r/AU2iYC/2
Your regex ^site.com/*$ only matches strings like the following:
ex) site.com/ site.com//////// site.com
because the asterisk (*) in a regex means "match 0 or more of the preceding token".
So this should work:
^site.com\/.*$
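The difference between the original expression and the suggested fix can be checked with a short Python sketch. The candidate strings are made up for illustration; `siteXcom/` is included to show what the unescaped dot lets through.

```python
import re

too_narrow = re.compile(r"^site.com/*$")    # the expression from the question
prefix = re.compile(r"^site\.com/.*$")      # escaped dot, anything after the slash

candidates = ["site.com", "site.com////", "site.com/foo", "siteXcom/"]

narrow_hits = [c for c in candidates if too_narrow.match(c)]
prefix_hits = [c for c in candidates if prefix.match(c)]

print(narrow_hits)
print(prefix_hits)
```

Note that the second pattern requires the slash, so a bare `site.com` no longer matches; make the slash and the rest optional if that case matters.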

Postgres invalid regular expression: invalid character range

I'm using the following line in a postgres function:
regexp_replace(input, '[^a-z0-9\-_]+', sep, 'gi');
But I'm getting ERROR: invalid regular expression: invalid character range when I try to use it. The regex works fine in Ruby, is there a reason it'd be different in postgres?
Some regexp parsers accept a dash (-) in the middle of a character class if it comes right after a range, as you have it, but others don't. I suspect the postgres regexp parser is in the latter class. The canonical way to include a dash in a character class is to put it first, i.e. change the regexp to '[^-a-z0-9_]+', which might get it past the parser. Some regexp parsers, however, can be really fussy and not accept that either.
I don't have a postgres to test with, but I expect they'll accept the regexp above and deal correctly. Otherwise you have to find the regexp portion of their manual and understand what it says about this.
I had the same problem. Using \- instead of only - worked for me.
For me it worked to move the dash (-) to the end of the list:
replacing [A-Za-z0-9-_.+=] with [A-Za-z0-9_.+=-] seems to work.
[^[:digit:]\-.]
The above code will work.
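The three placements suggested above (escaped, first, last) are meant to describe the same character set. The sketch below only illustrates that equivalence using Python's `re` module, which is a different engine from the postgres one, so it says nothing about which form postgres accepts. The input string and separator are made up.

```python
import re

text = "Hello, World! foo_bar-baz"
sep = "-"

# Same character class, with the literal dash escaped, first, or last.
patterns = [
    r"[^a-z0-9\-_]+",
    r"[^-a-z0-9_]+",
    r"[^a-z0-9_-]+",
]

# re.IGNORECASE plays the role of the 'i' flag; re.sub replaces every match,
# like the 'g' flag in regexp_replace.
results = [re.sub(p, sep, text, flags=re.IGNORECASE) for p in patterns]

print(results)
```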

Regular expression: find abc.com except xyz.abc.com or #abc.com

In Eclipse I want to find a string, but the normal search returns hundreds of irrelevant results. So I'm trying to use regular expressions, but so far they haven't given me the proper results.
This is what I need: find "abc.com", but not "xyz.abc.com" or "#abc.com". To make it clear, it should return www.abc.com.
I've tried the following regex but I'm not sure if this is how it should be:
[^#xyz\.]abc.com
Using a negative lookbehind should suit your needs:
(?<!xyz[.]|#)abc[.]com
Every "abc.com" that is not preceded by "xyz." nor by "#".
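For anyone trying the same idea outside Eclipse: Python's `re` module requires every lookbehind to have a fixed width, so the alternation `xyz[.]|#` (four characters versus one) is rejected there. A minimal sketch of an equivalent, assuming two separate lookbehinds are acceptable:

```python
import re

# Two fixed-width negative lookbehinds instead of one with alternation.
pattern = re.compile(r"(?<!xyz[.])(?<!#)abc[.]com")

samples = ["www.abc.com", "xyz.abc.com", "#abc.com", "abc.com"]
matches = [s for s in samples if pattern.search(s)]

print(matches)
```

Both lookbehinds must succeed at the same position, which gives the same "not preceded by xyz. and not preceded by #" condition.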

Regular Expressions with conditions

I have a string that looks like:
this is a string [[and]] it is [[awesome|amazing]]
I have the following regular expression so far:
(?<mygroup>(?<=\[\[).+?(?=\]\]))
I am basically trying to capture everything inside the brackets. However, I need to add another condition that says: If the matched result contains a pipe delimiter then only return the word to the right of the pipe delimiter. If there is no pipe then just return everything inside the brackets.
The parsing result I am looking for given the example above should look like:
and
amazing
Any input is appreciated.
(?<mygroup>(?<=\[\[)([^|\]]*|)?([^|]+?)(?=\]\]))
You could use this regex:
(?<=\[\[[^\]]*?)(?!\w+\|)\w+(?=\]\])
It matches both "and" and "amazing" in your test example. You can check it out; I created a test app on Ideone.
From the regex info page:
The tremendous power and expressivity
of modern regular expressions can
seduce the gullible — or the foolhardy
— into trying to use regexes on every
string‐related task they come across.
My advice: Just grab what is between the brackets and parse it after.
Regular expressions are not the answer to everything. May those who follow after you be spared from deciphering the regex you come up with.
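The "grab it, then parse it" advice could look like the following Python sketch. The function name `bracket_values` is hypothetical, and the code assumes that "the word to the right of the pipe" means everything after the last pipe.

```python
import re

def bracket_values(text):
    """Return the content of each [[...]] pair, keeping only the part after the last pipe."""
    # Step 1: a simple regex pulls out whatever sits between the brackets.
    inner = re.findall(r"\[\[(.+?)\]\]", text)
    # Step 2: plain string handling applies the pipe rule.
    return [part.rsplit("|", 1)[-1] for part in inner]

sample = "this is a string [[and]] it is [[awesome|amazing]]"
values = bracket_values(sample)
print(values)
```

Splitting the work this way keeps the regex trivial and puts the conditional logic where it is easy to read and change.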

Regex and Yahoo Pipes: How to replace end of url

Here's the Pipe though you may not need it to answer the question: http://pipes.yahoo.com/pipes/pipe.info?_id=85a288a1517e615b765df9603fd604bd
I am trying to rewrite all URLs like so, replacing
http://mediadownloads.mlb.com/mlbam/2009/08/12/mlbf_6073553_th_3.jpg with
http://mediadownloads.mlb.com/mlbam/2009/08/12/mlbtv_6073553_1m.mp4
The rule should be something like:
In item.mediaUrl replace f with tv, and in item.mediaUrl replace the last 8 characters with 1m.mp4
Replacing mlbf_(\d+)_.* with mlbtv_$1_1m.mp4
breaks the RSS feed, though I know I am close.
Any idea what syntax I need there?
Your regex and replacement look okay to me, assuming the regex is being applied only to the URLs. If it were being applied to the surrounding text as well, the .* would tend to consume a lot more than you wanted. See what happens if you change the regex to this:
mlbf_(\d+)_[\w.]+
I do not know how Yahoo Pipes works, but according to this site the following regex should do it:
Regex:
.*?/([0-9]*)/([0-9]*)/([0-9]*)/mlbf_([0-9]*)_.*
Substitution:
http://mediadownloads.mlb.com/mlbam/$1/$2/$3/mlbtv_$4_1m.mp4
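Both suggestions can be tried outside Yahoo Pipes. The sketch below uses Python's `re.sub`, which writes group references as `\1` rather than `$1`; whether Pipes behaves identically is an assumption, since its regex module is not tested here.

```python
import re

url = "http://mediadownloads.mlb.com/mlbam/2009/08/12/mlbf_6073553_th_3.jpg"

# First answer: replace only the file-name part.
short_result = re.sub(r"mlbf_(\d+)_[\w.]+", r"mlbtv_\1_1m.mp4", url)

# Second answer: match the whole URL and rebuild it from the captured pieces.
full_result = re.sub(
    r".*?/([0-9]*)/([0-9]*)/([0-9]*)/mlbf_([0-9]*)_.*",
    r"http://mediadownloads.mlb.com/mlbam/\1/\2/\3/mlbtv_\4_1m.mp4",
    url,
)

print(short_result)
print(full_result)
```

On a bare URL both give the same output; the first is less likely to swallow surrounding text if the field ever contains more than the URL.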