Reliably match a URL inside a line - regex

I'm having some trouble with what I thought would be a pretty simple regex. I'm trying to make a Twitter bot in Python that tweets quotes from some author.
I need it to:
- read a quote and a URL from a file
- parse the quote and the URL apart, so that it can add quotation marks around the quote part and use the URL part to determine which book the quote is from and add the relevant book cover
- keep the URL separate so I can calculate the tweet length after Twitter has shortened the URL
One last thing: some quotes might not have a URL. I need the bot to detect that and add a random picture as a fallback.
After some trial and error, I came up with this regex, which seemed to do the job when I tested it: r'(?P<quote>.*)(?P<link>https.*)?'
Since I don't need to validate the URL, I don't think I need a complicated regex like the ones I came across in my research.
But when I fired up the bot, I realized it doesn't parse the quote correctly. Instead it captures the whole line as "quote" and fails to identify the URL.
What puzzles me is that it doesn't fail consistently: sometimes it works, and sometimes it doesn't.
Here is an example of what I'm trying to do that fails unreliably: https://regex101.com/r/mODPUq/1/
Here is the whole function I've written:
import re
def parseText(text):
    # Separate the quote from the link
    tweet = {}
    regex = r'(?P<quote>.*)?(?P<link>https.*)?'
    m = re.search(regex, text)
    # groupdict("") returns "" for any group that did not take part in the match
    tweet = m.groupdict("")
    return tweet
[EDIT] OK, I didn't quite solve the problem this way, but I found a workaround that might not be very elegant but at least seems to do the job:
I have two separate functions, one to get the URL, the other to split the URL out of the line and return the quote alone.
I first call getUrl(), and only if it returns something other than None do I call getQuote(). If url is None, I can tweet the whole line directly.
This way the regex part became very straightforward, and it seems to work so far with or without a URL. I just have one minor issue: when there's no URL, even if I use str.split('/n') to cut out the newline character, it must still be there, because when I add quotation marks the closing one ends up on a new line.
I'm leaving the issue open for now since technically it's not resolved. Thanks to those who answered, but the suggestions don't seem to work.

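You can also change the regex string to r'(?P<quote>.*)?.(?P<link>https.*)', which also consumes one extra character (such as a space) between the quote and the link. Note that the link group is no longer optional here, so a line without a URL will not match at all.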
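For what it's worth, the original pattern fails because the greedy .* in the quote group consumes the whole line, and the optional link group is then satisfied by matching nothing. A minimal sketch of one way around that, assuming each call receives a single line and that the link, if present, is the last thing on it (the name parse_line and the example.com URLs are made up for illustration):

```python
import re

# Lazy quote group, optional link group, anchored at the end of the line.
# Because the pattern must reach the end of the line, the lazy quote group
# is forced to stop right before the link (or before trailing whitespace).
LINE_RE = re.compile(r'^(?P<quote>.*?)\s*(?P<link>https?://\S+)?\s*$')

def parse_line(line):
    """Split one line into its quote and its (possibly empty) link."""
    m = LINE_RE.match(line)
    # groupdict("") turns a missing link into an empty string.
    return m.groupdict("")

print(parse_line("To be or not to be https://example.com/hamlet\n"))
print(parse_line("A quote without any link\n"))
```

The trailing \s*$ also swallows the newline at the end of the line, so the closing quotation mark should stay on the same line as the quote.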


Django - Accepting full sentence in query parameters

I'm putting together a small API that just converts the message coming in to Spongebob mockcase.
I've got everything rolling, but coming back to it I realized I had only been testing with a single word, and just noticed that the following URL entry cannot accept spaces/%20.
url(r'^mock/(?P<message>\w+)/$',mock, name='mock'),
I've looked all over, but I'm not sure how to phrase what I'm looking for well enough to find anything useful. What should I be looking for to accept a full sentence?
Worth noting this will be coming from a chat message, therefore, it will be sent as is, not percent encoded.
You don't really want to put things like that as URL parameters. Instead it should be in the querystring: for example mysite.com/mock/?message=Message+goes+here.
The URL should just be:
url('^mock/$', ...)
and the view then just gets the data from request.GET['message'].
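To illustrate what the query string approach buys you, here is a small standard-library sketch of the decoding step. In Django, request.GET does this for you; the snippet below only mimics it with urllib.parse, and the URL is the example one from above:

```python
from urllib.parse import urlsplit, parse_qs

url = "http://mysite.com/mock/?message=Message+goes+here"

# Take the part after "?" and decode it into a dict of lists.
query = urlsplit(url).query
params = parse_qs(query)

# "+" (and %20) are decoded back to spaces.
message = params["message"][0]
print(message)
```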
You may fix your immediate issue using
r'^mock/(?P<message>[^/]+)/?$'
Here, [^/]+ matches any one or more chars other than / and the /? matches an optional / at the end of the string ($).
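A quick way to see the difference between the two patterns is to try them with plain re, outside Django. This is only a sketch; the sample path is made up, and it assumes the path is matched without its leading slash, as Django does for url() patterns:

```python
import re

old = re.compile(r'^mock/(?P<message>\w+)/$')
new = re.compile(r'^mock/(?P<message>[^/]+)/?$')

path = "mock/hello there world/"

# \w+ stops at the first space, so the old pattern does not match.
print(old.match(path))

# [^/]+ accepts anything except a slash, spaces included.
print(new.match(path).group("message"))
```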

Get all url from text by regex

I need to get all URLs from a text file using regex. Not every URL, though: only URLs that start with a given template. For example, I have this text:
{"Field_Name1":"http://google.ru","FieldName2":
"["some text", "http://example.com/view/...&id..&.."]",
"FieldName3": "http://example.com/edit/&id..."}someText"
["some text", "http://example.com/view/...&id..&.."]",
"FieldName3": "http://example.com/view/&id..."}someText2{..}someText.({})
I need to extract all URLs like http://example.com/view/.....
I tried this regex, but it doesn't work. Maybe I've made a mistake in it.
^(http|https|ftp)\://example\.com\/view\/+[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?[^\.\,\)\(\s]$
I don't need a general URL validator; I need something that extracts URLs that start with a given template.
What about this?
((ftp|http[s]?):\/\/example.com\/view\/.*?)\"
The first part, up to "/view/", should be clear. The rest, ".*?)\"", means: lazily match everything up to the next double quote.
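The same idea can be sketched in Python with re.findall. This version uses [^"]* instead of .*? followed by a quote, so the closing quote is not part of the match. The sample text is modeled on the question, with made-up values where the original has "...":

```python
import re

# Hypothetical sample text shaped like the one in the question.
text = (
    '{"Field_Name1":"http://google.ru","FieldName2":'
    '"["some text", "http://example.com/view/a&id=1"]",'
    '"FieldName3": "http://example.com/edit/&id=2",'
    '"FieldName4": "http://example.com/view/&id=3"}someText'
)

# Fixed prefix, then everything up to (but not including) the next quote.
VIEW_URL = re.compile(r'(?:ftp|https?)://example\.com/view/[^"]*')

urls = VIEW_URL.findall(text)
print(urls)
```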
I think this will work! I gave it a go on regexr.com and it seemed to select just the URL part, provided the text doesn't have multiple periods in a row.
(?!")h.+.+[a-z]*
EDIT: Made a better one, or at least I think I did. Basically the expression says: "look for a quotation mark, and if the next character is an h, include that in the match and make it the starting point; then include any characters after that leading up to a single period, followed by any lowercase letters." There could be a million of them. As long as there was a period before them, you're good, and it won't select beyond that unless there's another period after the string.
Universal:
/(ftp|http|https)\:\/\/([\d\w\W]*?)(?=\")/igm
Template:
/(ftp|http|https)\:\/\/example\.com\/view\/([\d\w\W]*?)(?=\")/igm

Regular Expression to match a specific URL broken up by arbitrary characters

I run a Django-based forum (the framework is probably not important to the question, but still) and it has been increasingly getting spammed with posts that link to a specific website constantly (www.solidwoodkitchen.co.uk - these people are apparently the worst).
I've implemented a string blocking system that stops them posting to the forum if the URL of the website is included in the post, but as spam bots usually do, it has figured out a way around that by breaking up the URL with other characters (eg. w_w_w.s*olid_wood*kit_ch*en._*co.*uk .). So a couple of questions:
Is it even possible to build a regex capable of finding the specific URL within a block of text even when it has been modified like that?
If it is, would this cause a performance hit?
Description
You could break the url into a string of characters, then join them together with [^a-z0-9]*?. So in this case with www.solidwoodkitchen.co.uk the resulting regex would look like:
w[^a-z0-9]*?w[^a-z0-9]*?w[^a-z0-9]*?[.][^a-z0-9]*?s[^a-z0-9]*?o[^a-z0-9]*?l[^a-z0-9]*?i[^a-z0-9]*?d[^a-z0-9]*?w[^a-z0-9]*?o[^a-z0-9]*?o[^a-z0-9]*?d[^a-z0-9]*?k[^a-z0-9]*?i[^a-z0-9]*?t[^a-z0-9]*?c[^a-z0-9]*?h[^a-z0-9]*?e[^a-z0-9]*?n[^a-z0-9]*?[.][^a-z0-9]*?c[^a-z0-9]*?o[^a-z0-9]*?[.][^a-z0-9]*?u[^a-z0-9]*?k
This would basically search for the entire string of characters separated by zero or more non-alphanumeric characters.
Or you could take the input text and strip out all punctuation then simply search for wwwsolidwoodkitchencouk.
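Both suggestions can be sketched in a few lines of Python. The function names are hypothetical, and this is an illustration rather than a tested spam filter. On the performance question: I have not benchmarked either variant, but the normalize-and-search version is a single pass over the text plus a substring search, so it is likely the cheaper of the two.

```python
import re

def build_obfuscation_regex(domain):
    """Join the characters of the domain with a lazy non-alphanumeric filler."""
    filler = r"[^a-z0-9]*?"
    parts = ["[.]" if ch == "." else re.escape(ch) for ch in domain]
    return re.compile(filler.join(parts), re.IGNORECASE)

def contains_after_stripping(text, domain):
    """Drop everything that is not a letter or digit, then search for the bare domain."""
    strip = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    return strip(domain) in strip(text)

SPAM = build_obfuscation_regex("www.solidwoodkitchen.co.uk")
post = "visit w_w_w.s*olid_wood*kit_ch*en._*co.*uk now"

print(bool(SPAM.search(post)))
print(contains_after_stripping(post, "www.solidwoodkitchen.co.uk"))
```

Note that the regex variant still requires the dots to be present, while the stripping variant also catches posts where the dots were removed.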

Regex help: Identifying websites in text

I am trying to write a function which removes websites from a piece of text. I have:
removeWebsites <- function(text){
  text = gsub("(http://|https://|www.)[[:alnum:]~!#$%&+-=?,:/;._]*",'',text)
  return(text)
}
This handles a large part of the problem, but misses a common case: websites of the form xyz.com
I do not wish to add .com at the end of the above regex, as that would limit its scope. However, I tried writing some more regexes like:
gsub("[[:alnum:]~!#$%&+-=?,:/;._]*.com",'',testset[10])
This worked, but it also turned email ids of the form abc#xyz.com into abc#. I don't want that, so I modified it to
gsub("*((^#)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
This left the email ids alone but stopped recognising websites of the form xyz.com
I understand that I need some sort of a set difference here, of the form of what was explained here but I was not able to implement it (mainly because I was not able to completely understand it). Any idea on how I go about solving my problem?
Edit: I tried negative lookaheads:
gsub("[[:alnum:]~!#$%&+-=?,:/;._](?!#)[^(?!.*#)]*.com",'',testset[10])
I got an 'invalid regex' error. I believe a little help in correcting it may get this to work...
I can't believe it. There actually is a simple solution to it.
gsub(" ([[:alnum:]~!#$%&+-=?,:/;._]+)((.com)|(.net)|(.org)|(.info))",' ',text)
This works by:
Start with a space.
Put all sorts of things, except an '#' in.
end with a .com/net/org/info/
Please do look into breaking it! I'm sure there will be cases that will break this as well.
Your lookarounds look a bit odd to me: you can't use a lookaround inside a character class, and why are you looking ahead? A lookbehind is, in my opinion, more appropriate.
I think the following expression should work, although I didn't test it:
gsub("*((?<!#)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
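Also note that lookbehinds must have a fixed length, so no quantifiers are allowed inside them. In R, lookarounds are only supported by the PCRE engine, so gsub needs perl = TRUE for them to work.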
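The lookbehind idea can be illustrated outside R as well. Below is a rough Python sketch, not a translation of the gsub call above: it removes bare domains ending in .com/.net/.org/.info unless they are directly preceded by a word character, a dot, or the '#' separator used in the question's email examples. The function name and the sample sentence are made up.

```python
import re

# The lookbehind is one character wide, so it satisfies the fixed-length rule.
BARE_DOMAIN = re.compile(
    r'(?<![\w#.])'                  # not glued to an email id or a longer token
    r'[A-Za-z0-9-]+'                # first label, e.g. "xyz"
    r'(?:\.[A-Za-z0-9-]+)*'         # optional further labels
    r'\.(?:com|net|org|info)\b'     # one of the listed endings
)

def strip_bare_domains(text):
    """Remove bare website names but leave ids like abc#xyz.com alone."""
    return BARE_DOMAIN.sub("", text)

cleaned = strip_bare_domains("contact abc#xyz.com or visit xyz.com today")
print(cleaned)
```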

regex, find last part of a url

Let's take an url like
www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension
If I want to capture "set_of_random_characters_everything_possible.randomextension", will [^/\n]+$ work? (solution taken from Trying to get the last part of a URL with Regex)
My question is: what does the "\n" part mean (it works even without it)? And is it safe if the URL contains arbitrary combinations of characters other than "/"?
First, please note that www.url.com/some_thing/random_numbers_letters_everything_possible/set_of_random_characters_everything_possible.randomextension is not a URL without a scheme like http:// in front of it.
Second, don't parse URLs yourself. What language are you using? You probably don't want to use a regex, but rather an existing module that has already been written, tested, and debugged.
If you're using PHP, you want the parse_url function.
If you're using Perl, you want the URI module.
Have a look at this explanation: http://regex101.com/r/jG2jN7
Basically what is going on here is "match one or more characters that are neither a slash nor a newline, up to the end of the string". People add \n (or \r\n) to negated character classes because a negated class matches everything that is not listed in it, so [^/] on its own also matches newlines.
For example, if your text contained a line break after the last slash, [^/]+$ would match across it, while [^/\n]+$ would only return the part after the line break.
That does not apply in your case, since a URL contains no line breaks. (The s flag, PCRE_DOTALL, only changes what . matches; it has no effect on character classes.)
TL;DR: You can leave it in or remove it; it won't matter.
Ask away if anything is unclear or I've explained it a little sloppy.
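The question does not say which language is in use, so here is a small Python sketch comparing the regex with a URL parser from the standard library. The URL is a made-up stand-in for the one in the question, with a scheme added:

```python
import re
from urllib.parse import urlsplit

url = "http://www.url.com/some_thing/abc123/file-name_42.ext"

# Regex from the question: everything after the last slash.
via_regex = re.search(r'[^/\n]+$', url).group(0)

# Parser-based alternative: take the path, then its last segment.
via_parser = urlsplit(url).path.rsplit("/", 1)[-1]

print(via_regex, via_parser)

# The two differ as soon as a query string is present.
with_query = url + "?page=2"
print(re.search(r'[^/\n]+$', with_query).group(0))
print(urlsplit(with_query).path.rsplit("/", 1)[-1])
```

The parser keeps the query string out of the result, which is one reason the answers above recommend an existing module over a hand-written regex.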