Get all URLs from text by regex

I need to get all URLs from a text file using regex. Not every URL, though, only URLs that start with a certain template. For example, I have this text:
{"Field_Name1":"http://google.ru","FieldName2":
"["some text", "http://example.com/view/...&id..&.."]",
"FieldName3": "http://example.com/edit/&id..."}someText"
["some text", "http://example.com/view/...&id..&.."]",
"FieldName3": "http://example.com/view/&id..."}someText2{..}someText.({})
I need to take all URLs like http://example.com/view/.....
I tried this regex, but it doesn't work. Maybe I have a mistake in it.
^(http|https|ftp)\://example\.com\/view\/+[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?[^\.\,\)\(\s]$
I don't need a pure URL checker; I need one that can extract URLs that start with a given template.

What about this?
((ftp|http[s]?):\/\/example.com\/view\/.*?)\"
The first part until "/view/" should be clear. The rest ".*?)\"" means, show me everything before a double quote.

I think this will work! I gave it a go on regexr.com and it seemed to select just the url part, given that the text string doesn't actually have multiple periods in a row.
(?!")h.+.+[a-z]*
EDIT: Made a better one, or at least I think I did. Basically the expression says: "look for a quotation mark, and if the next character is an h, include that in the match and make it the starting point; then include any characters after that leading up to a single period, followed by any lowercase letters." There could be a million of them. As long as there was a period before them, you're good, and it won't select beyond that unless there's another period after the string.

Universal:
/(ftp|http|https)\:\/\/([\d\w\W]*?)(?=\")/igm
Template:
/(ftp|http|https)\:\/\/example\.com\/view\/([\d\w\W]*?)(?=\")/igm
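The template idea above can be tried out in Python as a minimal sketch. The sample text is the one from the question; the function name find_view_urls and the choice of [^"]* (everything up to the next double quote) in place of the lazy match plus lookahead are assumptions made for illustration.

```python
import re

# Sample text copied from the question (elisions kept as-is).
TEXT = r'''{"Field_Name1":"http://google.ru","FieldName2":
"["some text", "http://example.com/view/...&id..&.."]",
"FieldName3": "http://example.com/edit/&id..."}someText"
["some text", "http://example.com/view/...&id..&.."]",
"FieldName3": "http://example.com/view/&id..."}someText2{..}someText.({})'''

# Scheme, then the fixed template, then everything up to the next double quote.
VIEW_URL = re.compile(r'(?:ftp|https?)://example\.com/view/[^"]*', re.IGNORECASE)

def find_view_urls(text):
    """Return every URL in text that starts with the example.com/view/ template."""
    return VIEW_URL.findall(text)

if __name__ == "__main__":
    for url in find_view_urls(TEXT):
        print(url)
```

Because the group is non-capturing, findall returns the whole matched URL rather than just the scheme.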

Related

Reliably match an url inside a line

I'm having some trouble figuring out what I thought to be pretty simple regex. I'm trying to make a Twitter bot in Python that tweets quotes from some author.
I need it to:
read a quote and a URL from a file
parse the quote and the URL apart so that it can add quotation marks around the quote part and use the URL part to determine which book the quote is from and add the relevant book cover
I also need to get the URL separately to calculate the tweet length after Twitter has shortened the URL
One last thing: some quotes might not have url, I need it to identify that and add some random pics as a fallback.
After trials and errors, I came up with this regex that seemed to do the job when I tested it : r'(?P<quote>.*)(?P<link>https.*)?'
Since I don't need to validate url I don't think that I need any complicated regex like the ones I came across in my research.
But when I fired up the bot, I realized it doesn't parse the quote correctly and instead catches the whole line as "quote" (failing to identify the URL).
What puzzles me is that it doesn't fail consistently, instead it seems that sometimes it works, and sometimes it doesn't.
Here is an example of what I'm trying to do that fails unreliably: https://regex101.com/r/mODPUq/1/
Here is the whole function I've written:
import re

def parseText(text):
    # Separate the quote from the link
    tweet = {}
    regex = r'(?P<quote>.*)?(?P<link>https.*)?'
    m = re.search(regex, text)
    # Groups that did not take part in the match default to ""
    tweet = m.groupdict("")
    return tweet
[EDIT] OK, I didn't quite solve the problem this way, but I found a workaround that might not be very elegant but at least seems to do the job:
I have two separate functions: one gets the URL, the other splits the URL out of the line and returns the quote alone.
I first call getUrl(), and only if it returns something other than None do I call getQuote(). If url is None, I can tweet the whole line directly.
This way the regex part became very straightforward, and it seems to work so far with or without a URL. I have just one minor issue: when there's no URL, the newline character must still be there even though I use str.split('/n') to cut it out, because when I add quotation marks the last one ends up on a new line.
I'm leaving the issue open for now since technically it's not resolved. Thanks to those who gave me answers, but they don't seem to work.
You can also change regex string to r'(?P<quote>.*)?.(?P<link>https.*)' which also takes care of any extra characters between the quote and the link
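One likely reason the original pattern grabs the whole line: (?P<quote>.*) is greedy, and because the link group is optional, the engine has no reason to backtrack and give the URL back. A minimal sketch of one way around this, assuming one quote per line with the URL (if any) at the end, is to make the quote lazy and anchor the pattern at both ends. The name parse_text and the https?://\S+ link pattern are illustrative assumptions, not the asker's code.

```python
import re

# Lazy quote, optional trailing link, both anchored so the lazy part must
# stretch exactly as far as needed and no further.
LINE = re.compile(r'^(?P<quote>.*?)\s*(?P<link>https?://\S+)?\s*$')

def parse_text(text):
    """Split a line into its quote and optional trailing URL."""
    m = LINE.match(text)
    # groupdict("") turns a missing link into an empty string
    return m.groupdict("")

if __name__ == "__main__":
    print(parse_text("Some quote https://example.com/book"))
    print(parse_text("A quote without a link\n"))
```

The trailing \s* also absorbs the newline at the end of a line read from a file, so the closing quotation mark does not end up on a new line.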

RegEx: Match Mr. Ms. etc in a "Title" Database field

I need to build a RegEx expression which gets its text strings from the Title field of my Database. I.e. the complete strings being searched are: Mr. or Ms. or Dr. or Sr. etc.
Unfortunately this field was a free field and anything could be written into it. e.g.: M. ; A ; CFO etc.
The expression needs to match on everything except: Mr. ; Ms. ; Dr. ; Sr. (NOTE: The list is a bit longer but for simplicity I keep it short.)
WHAT I HAVE TRIED SO FAR:
This is what I am using successfully on another field:
^(?!(VIP)$).* (This will match every string except "VIP")
I rewrote that expression to look like this:
^(?!(Mr.|Ms.|Dr.|Sr.)$).*
Unfortunately this did not work. I assume this is because the "." (dot) is a reserved symbol in RegEx and needs special handling.
I also tried:
^(?!(Mr\.|Ms\.|Dr\.|Sr\.)$).*
But no luck there either.
I looked around in the forum and tested some other solutions but could not find any which works for me.
I would like to know how I can build my expression so that it searches the complete (short) string and matches everything except "Mr." etc. Any help is appreciated!
Note: My Question might seem unusual and seems to have many open ends and possible errors. However the rest of my application is handling those open ends. Please trust me with this.
If you want your string simply to not start with one of those prefixes, then do this:
^(?!([MDS]r|Ms)\.).*$
The above simply ensures that the beginning of the string (^) is not followed by one of your listed prefixes. (You shouldn't even need the .*$ but this is in case you're using some engine that requires a complete match.)
If you want your string to not have those prefixes anywhere, then do:
^(.(?!([MDS]r|Ms)\.))*$
The above ensures that every character (.) is not followed by one of your listed prefixes, to the end (so the $ is necessary in this one).
I just read that your list of prefixes may be longer, so let me expand for you to add:
^(.(?!(Mr|Ms|Dr|Sr)\.))*$
You say the string must not consist entirely of one of the prefixes? Then just do this:
^(?!(?:Mr|Ms|Dr|Sr)\.$).*$
And if you want to make the dot optional:
^(?!(?:Mr|Ms|Dr|Sr)\.?$).*$
^
Using | we can define any number of prefix patterns to match against the string.
var pattern = /^(Mrs\.|Mr\.|Ms\.|Dr\.|Er\.).?[A-Za-z]+$/;
var str = "Mrs.Panchal";
console.log(str.match(pattern));
this may do it
/(?!.*?(?:^|\W)(?:(?:Dr|Mr|Mrs|Ms|Sr|Jr)\.?|Miss|Phd|\+|&)(?:\W|$))^.*$/i
from that page I mentioned
Rather than trying to construct a regex that matches anything except Mr., Ms., etc., it would be easier (if your application allows it) to write a regex that matches only those strings:
/^(Mr|Ms|Dr|Sr)\.$/
and just swap the logic for handling matching vs non-matching strings.
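In Python, the "match only the known titles and invert the result" idea could look like the sketch below. The function name is_free_text_title and the four-item title list are assumptions taken from the shortened list in the question; a real list would be longer.

```python
import re

# Matches only the exact, known titles (short list from the question).
KNOWN_TITLE = re.compile(r'(?:Mr|Ms|Dr|Sr)\.')

def is_free_text_title(title):
    """Return True when the whole field is anything other than a known title."""
    # fullmatch requires the entire string to match, so "Mr. X" is not a known title
    return KNOWN_TITLE.fullmatch(title) is None

if __name__ == "__main__":
    for value in ["Mr.", "Ms.", "M.", "A", "CFO"]:
        print(repr(value), is_free_text_title(value))
```

Keeping the pattern positive makes it easy to extend the list without touching any lookahead logic.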
re.sub(r'^([MmDdSs][RSrs]{1,2}|[Mm]iss)\.{0,1} ','',name)

Regex, optional match in url

I spent a couple of hours on this with no good result (maybe my mood is not helping).
I am trying to build a regex to help me match both urls:
/reservables/imagenes/4/editar/6
/reservables/imagenes/4/subir
As you can see above, the last segment of the first URL, 6, is not present at the end of the second URL, because this segment is optional. So I need to match both URLs with one regex. For that, I tried this:
reservables/(editar|imagenes)/([0-9]+)/(imagen|editar|actualizar|subir)/([0-9]+)
That works only for the first URL. From reading a few notes about regex, it seems I need the ? symbol, right? So I tried this one, but it did not work:
reservables/(editar|imagenes)/([0-9]+)/(imagen|editar|actualizar|subir)/([0-9]+)?
Well, I do not know what I am doing wrong.
You want to put the ? around the / as well, like so:
reservables/(editar|imagenes)/([0-9]+)/(imagen|editar|actualizar|subir)(?:/([0-9]+))?
You can see that it matches correctly on debuggex.
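A quick way to check the optional-segment pattern against both URLs from the question is a short Python sketch; the helper name route_parts is an assumption for illustration.

```python
import re

# The trailing "/<digits>" is wrapped in an optional non-capturing group,
# so the slash and the number are either both present or both absent.
ROUTE = re.compile(
    r'reservables/(editar|imagenes)/([0-9]+)/'
    r'(imagen|editar|actualizar|subir)(?:/([0-9]+))?'
)

def route_parts(url):
    """Return the captured groups for a matching URL, or None."""
    m = ROUTE.search(url)
    return m.groups() if m else None

if __name__ == "__main__":
    print(route_parts("/reservables/imagenes/4/editar/6"))
    print(route_parts("/reservables/imagenes/4/subir"))
```

When the optional segment is missing, its group comes back as None rather than an empty string.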
This one will work, as long as the final slash is made optional too:
reservables/(editar|imagenes)/([0-9]+)/(imagen|editar|actualizar|subir)/?([0-9]*)

regex, find last part of a url

Let's take a URL like
www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension
If I want to capture "set_of_random_characters_everything_possible.randomextension", will [^/\n]+$ work? (solution taken from Trying to get the last part of a URL with Regex)
My question is: what does the "\n" part mean (it works even without it)? And is it safe if the URL contains any arbitrary combination of characters apart from "/"?
First, please note that www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension is not a URL without a scheme like http:// in front of it.
Second, don't parse URLs yourself. What language are you using? You probably don't want to use a regex, but rather an existing module that has already been written, tested, and debugged.
If you're using PHP, you want the parse_url function.
If you're using Perl, you want the URI module.
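The question does not name a language, so as an assumption, here is the same "use an existing parser" idea sketched in Python with the standard library's urllib.parse. The function name last_path_segment is illustrative.

```python
import posixpath
from urllib.parse import urlsplit

def last_path_segment(url):
    """Return the final path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path
    # URL paths always use forward slashes, so posixpath is the right flavour
    return posixpath.basename(path)

if __name__ == "__main__":
    print(last_path_segment("http://www.url.com/some_thing/abc/file_name.ext?x=1"))
```

Letting the parser split off the query string and fragment first avoids having to account for ? and # in a hand-written regex.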
Have a look at this explanation: http://regex101.com/r/jG2jN7
Basically what is going on here is "match any character besides slash and new line, infinite to 1 times". People insert \r\n into negated char classes because in some programs a negated character class will match anything besides what has been inserted into it. So [^/] would in that case match new lines.
For example, if there was a line break in your text, you would not get the data after the linebreak.
This is however not true in your case. You need to use the s-flag (PCRE_DOTALL) for this behavior.
TL;DR: You can leave it or remove it, it won't matter.
Ask away if anything is unclear or if I've explained it a little sloppily.
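To see what the \n does in practice, here is a small check using Python's re module (an assumption, since the question names no language; other engines may differ in details). In Python a negated class such as [^/] does match a newline, so the two patterns agree on a single-line URL and differ only when the input contains a line break. The sample strings and the helper last_part are made up for illustration.

```python
import re

SINGLE_LINE = "www.url.com/some_thing/abc/file_name.ext"
MULTI_LINE = "www.url.com/some_thing/abc/file\nname.ext"

def last_part(pattern, text):
    """Return what the pattern matches at the end of text, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

if __name__ == "__main__":
    for text in (SINGLE_LINE, MULTI_LINE):
        print(repr(last_part(r'[^/]+$', text)), repr(last_part(r'[^/\n]+$', text)))
```

For a URL that sits on one line, both patterns return the same last segment, which is the case the question is about.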

Capture string until first caret sign hit in regex?

I am working with legacy systems at the moment, and a lot of work involves breaking up delimited strings and testing against certain rules.
With this string, how could I return "Active" in a back reference and search terms, stopping when it hits the first caret (^)?:
Active^20080505^900^LT^100
Can it be done with an inclusion in the regex of this "(.+)" ? The reason I ask is that the actual regex "(.+)" is defined in a database as cutting up these messages and their associated rules can be set from a front-end system. The content could be anything ('Active' in this case), that's why ".+" has been used in this case.
Rule: The caret sign cannot appear between the brackets, as that would result in it being stored in the database field too, and it is defined elsewhere in another system field.
If you have a better suggestion than "(.+)", I will be happy to hear it.
Thanks in advance.
(.+?)\^
Should grab up to the first ^
If you have to include (.+) w/o modifications you could use this:
(.+?)\^(.+)
The first backreference will still be the correct one and you can ignore the second.
A regex is really overkill here.
Just take the first n characters of the string where n is the position of the first caret.
Pseudo code:
InputString.Left(InputString.IndexOf("^"))
^([^\^]+)
That should work if your RE library doesn't support non-greediness.
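The suggestions above can be compared side by side in Python as a rough sketch. The sample string is the one from the question; the function names are assumptions for illustration.

```python
import re

MESSAGE = "Active^20080505^900^LT^100"

def first_field_lazy(text):
    """Lazy match: as few characters as possible before the first caret."""
    return re.match(r'(.+?)\^', text).group(1)

def first_field_negated(text):
    """Negated class: everything up to, but not including, the first caret."""
    return re.match(r'^([^\^]+)', text).group(1)

def first_field_no_regex(text):
    """No regex at all: take everything before the first caret."""
    return text.partition("^")[0]

if __name__ == "__main__":
    print(first_field_lazy(MESSAGE))
    print(first_field_negated(MESSAGE))
    print(first_field_no_regex(MESSAGE))
```

The negated-class version is the one to reach for when the regex library has no lazy quantifiers, and the partition version returns the whole string if no caret is present.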