QRegExp Pattern For URLs - c++

I am trying to match Google URLs in some text that is stored in a variable, using the pattern below.
The URLs are enclosed in double quotes.
QRegExp regExp;
regExp.setPattern("http://www.google.com/(.*)");
I manage to match the URL, but the pattern also matches all of the text that follows it, which I don't want. I have tried similar variants like the ones below, but they don't seem to work.
regExp.setPattern("http://www.google.com/(.*)\"is");
regExp.setPattern("http://www.google.com/^(.*)$\"");
Any help getting a regular expression that matches just the URL alone would be appreciated.
Thanks in advance

Is there a reason you need/want to use a QRegExp?
You could use a QUrl most likely.

Even though we cannot know what surrounds the URLs in your text (quotes? parentheses? whitespace?), we can build a better regular expression by using a negated character class that excludes characters that cannot be part of the URL:
QRegExp regExp;
regExp.setPattern("http://www.google.com/([^()\"' ]*)");
Then you just need to add any other delimiter characters to this negated character class.
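To see the difference outside Qt, here is a minimal sketch using Python's re module rather than QRegExp. The sample text is made up, and the dots in the host name are escaped here, which the pattern above does not do. It compares the greedy (.*) from the question with the negated character class.

```python
import re

# Hypothetical sample text; the markup around the URL is an assumption.
text = 'see <a href="http://www.google.com/search?q=qt">results</a> and "more" text'

# Greedy .* keeps consuming characters up to the end of the line.
greedy = re.search(r'http://www\.google\.com/(.*)', text).group(0)

# The negated class stops at the first parenthesis, quote, or space.
bounded = re.search(r'http://www\.google\.com/([^()"\' ]*)', text).group(0)

print(greedy)
print(bounded)
```

The second pattern stops at the closing double quote, so only the URL itself is returned.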

Related

What's the right regular expression to match the exact word at the end of a string and excluding all other urls with more chars at the end?

I have to match an exact string at the end of a URL, but not match other URLs that have more characters after that string.
An example explains it better.
I need to match the URL that has the string 'white' at its end: http://mysite.com/white
But I must not match URLs that have one or more characters appended to it, like http://mysite.com/white__blue, http://mysite.com/white/yellow, or http://mysite.com/white/
How can I do that?
Thanks
Regex to match any URL:
^(https?:\/\/)?([\da-z\.-]+\.[a-z\.]{2,6}|[\d\.]+)([\/:?=&#]{1}[\da-z\.-]+)*[\/\?]?$
Regex to match a URL ending in white:
^(https?:\/\/)?([\da-z\.-]+\.[a-z\.]{2,6}|[\d\.]+)([\/:?=&#]{1}[\da-z\.-]+)*[\/\?]?white$
You can check the regex on regexr.com.
It does not match URLs (which are not valid anyway) like
httpabrakadabra.co//
http:google.com
http://no-tld-here-folks.a
http://potato.54.211.192.240/
Based on your limited sample inputs, I'd say you could get away with this very minimal pattern:
^http[^\s]+white$
However, depending on what you are truly trying to achieve, what language/function you are implementing this pattern with, and what the full input string looks like, this pattern may need to be refined.
It would be best if you would improve your question to include all of the above relevant information.
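As a quick check of the minimal pattern, here is a sketch in Python (an assumption, since the question does not name a language). The test URLs are the ones from the question, written with the http:// scheme.

```python
import re

# \S is equivalent to [^\s]; the $ anchor is what rejects trailing characters.
ends_with_white = re.compile(r'^http\S+white$')

urls = [
    'http://mysite.com/white',
    'http://mysite.com/white__blue',
    'http://mysite.com/white/yellow',
    'http://mysite.com/white/',
]

matches = [u for u in urls if ends_with_white.search(u)]
print(matches)
```

Only the first URL survives, because every other one has at least one character between 'white' and the end of the string.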

Delphi XE2 Regex: Quantifier does not work inside positive lookbehind?

I have a complete HTML document string from a web page containing this BASE tag:
<BASE href="http://whatreallyhappened.com/">
In Delphi XE2, I use this regular expression with the whole HTML document as subject to get the URL from the BASE tag between the double quotes:
BaseURL := TRegEx.Match(HTMLDocStr, '(?<=<base(\s)href=").*(?=")', [roIgnoreCase]).Value;
This works, but only if there is only ONE space character in the subject between BASE and href.
I tried to add a quantifier to the space part in the regex (\s), but it did not work.
So how can I make this regex match the URL even if there are several spaces between BASE and href?
You're making this far too complicated by using lookaround. If you want to extract only part of the regex match, simply add a capturing group. Then you can use the text matched by the capturing group instead of the overall match. In most cases you'll also get much better performance this way.
To find the base tag in a file and extract its URL you can use the regex <base[^>]+href=["']([^"']*)["']. Call TRegex.Match() to get a TMatch. This has a Groups property that you can use to retrieve group 1 if a match was found.
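The capturing-group approach is not specific to Delphi. As a rough sketch, the same regex in Python's re module looks like this; the helper name base_url is made up for illustration, and re.IGNORECASE plays the role of roIgnoreCase.

```python
import re

# Same pattern as above: group 1 holds the URL between the quotes.
BASE_HREF = re.compile(r'<base[^>]+href=["\']([^"\']*)["\']', re.IGNORECASE)

def base_url(html):
    """Return the href of the first BASE tag, or None if there is none."""
    match = BASE_HREF.search(html)
    return match.group(1) if match else None

one_space = '<head><BASE href="http://whatreallyhappened.com/"></head>'
many_spaces = '<head><BASE    href="http://whatreallyhappened.com/"></head>'

print(base_url(one_space))
print(base_url(many_spaces))
print(base_url('<p>no base tag here</p>'))
```

Because [^>]+ accepts any run of characters inside the tag, the number of spaces between BASE and href no longer matters.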
With lookaround
You can try applying a quantifier in different ways, like these:
(?<=<BASE)\s+href=".*(?=")
(?<=<BASE)\s{0,30}href=".*(?=")
Without lookaround
By the way, if you just want the content of href, there is no need for lookaround; you can simply use:
<BASE\s+href="(.*?)"
EDIT: after reading your comments I figured out a workaround (ugly but could work). You can try using something like this:
((?<=<BASE\shref=")|(?<=<BASE\s\shref=")|(?<=<BASE\s\s\shref=")).*(?=")
Notice that the three lookbehind alternatives use \s, \s\s and \s\s\s respectively.
I know that this is horrible, but if none of the above works you can try that.

Regex for the value of an HTML Property

I have a load of links that look like this:
Taboola - Content you may like
I want to delete the entire ICON and ADD_DATE attributes and their values.
I'm using Sublime Text with a regex find/replace, but I'm not sure how to write the regex to grab everything between ICON=" and the closing "
Any help would be appreciated!
This should work (escaping quotes as necessary):
ICON="[^"]*"
The reason ICON="(.*)" won't work is that quantifiers are greedy by default: if the pattern can match more of the string and still succeed, it will, so .* runs on to the last double quote on the line.
You can either specify a non-greedy search, such as ICON=".*?", or explicitly match only characters that are not quotes, as in the answer above.
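Here is a small sketch of the three variants in Python's re module (an assumption; Sublime Text uses its own regex engine, though these constructs are common to both). The bookmark line is a made-up stand-in for the links in the question.

```python
import re

# Hypothetical bookmark line; the attribute values are invented.
line = ('<A HREF="http://example.com/" ICON="data:image/png;base64,AAAA" '
        'ADD_DATE="1400000000">Example link</A>')

# Greedy: .* runs to the LAST double quote on the line.
greedy = re.search(r'ICON="(.*)"', line).group(1)

# Non-greedy and negated-class versions both stop at the first closing quote.
lazy = re.search(r'ICON="(.*?)"', line).group(1)
negated = re.search(r'ICON="([^"]*)"', line).group(1)

# Delete both attributes, including the whitespace in front of each one.
cleaned = re.sub(r'\s+(?:ICON|ADD_DATE)="[^"]*"', '', line)

print(greedy)
print(lazy)
print(cleaned)
```

In a find/replace dialog the last pattern would go in the Find field with an empty Replace field.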

How can I write a regular expression match for a string that must contain the following characters: thankYou.sjs?donate_page

I need to create a regex query for a Google Analytics goal. The goal url must contain:
thankYou.sjs?donate_page
However, there are many URLs that can contain this string, all with other modifiers that I don't care about.
Please advise.
@ExplosionPills: I think you forgot about the special meaning of the question mark.
If you don't escape it, your expression:
^thankYou.sjs?donate_path$
Would match
thankYou.sjsdonate_path
or
thankYou.sjdonate_path
Not to mention the special meaning of dot.
So I guess something like this should work:
thankYou\.sjs\?donate_path
Furthermore, if donate_path might not be the first parameter in the query string, you can use this:
thankYou\.sjs\?([^&]*&)*donate_path
Just the string itself will work. If you want only this string, just use the start/end of string zero-width assertions:
^thankYou\.sjs\?donate_path$
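The effect of escaping can be checked with a short sketch in Python's re module (an assumption; Google Analytics has its own regex engine, but the question mark and the dot mean the same thing there). The sample URLs are hypothetical.

```python
import re

unescaped = re.compile(r'thankYou.sjs?donate_path')
escaped = re.compile(r'thankYou\.sjs\?donate_path')
anywhere = re.compile(r'thankYou\.sjs\?([^&]*&)*donate_path')

# Hypothetical goal URLs.
first_param = '/give/thankYou.sjs?donate_path=abc'
later_param = '/give/thankYou.sjs?id=1&donate_path=abc'

# Unescaped, "s?" means "an optional s", so the literal "?" is never matched.
print(unescaped.search(first_param))                 # no match
print(unescaped.search('thankYou.sjsdonate_path'))   # matches the wrong text

print(escaped.search(first_param) is not None)
print(escaped.search(later_param) is not None)
print(anywhere.search(later_param) is not None)

# re.escape builds the escaped form from the literal string.
literal = re.compile(re.escape('thankYou.sjs?donate_path'))
print(literal.search(first_param) is not None)
```

The escaped pattern only matches when donate_path comes right after the question mark, while the last pattern from the answer also accepts other parameters in front of it.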

parsing url for specific param value

I'm looking to use a regular expression to parse a URL and get a specific section of it, or nothing if the pattern cannot be found.
An example URL is
/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5#c452fds-634d-f424fds-cdsa&bf_action=jildape
I wish to get the bolded text in it.
Currently I'm using the regex "d=([^#]*)", but the problem is that I'm also running across URLs of this pattern:
/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape
and I'm getting the bold section of it.
I would prefer that this URL produce no match, because it doesn't contain the #.
Regexes are not a magic tool that you should always use just because the problem involves a string. In this case, your language probably has a tool to break apart URLs for you. In PHP, this is parse_url(). In Perl, it's the URI::URL module.
You should almost always prefer an existing, well-tested solution to a common problem like this rather than writing your own.
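As an illustration of the parser route, here is a sketch using Python's urllib.parse (an assumption, since the question does not name a language). The helper name id_if_fragment is made up, and it assumes the id always sits inside the nested URL carried by the jklfe parameter, as in the two sample URLs.

```python
from urllib.parse import urlsplit, parse_qs

def id_if_fragment(url, outer_param="jklfe"):
    """Return the nested id value, but only when the URL has a '#' fragment."""
    outer = urlsplit(url)
    if not outer.fragment:
        return None
    # The id lives in the query string of the URL stored in outer_param.
    nested = parse_qs(outer.query).get(outer_param)
    if not nested:
        return None
    ids = parse_qs(urlsplit(nested[0]).query).get("id")
    return ids[0] if ids else None

with_hash = ("/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://"
             "value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-"
             "jfkaliejs5#c452fds-634d-f424fds-cdsa&bf_action=jildape")
without_hash = ("/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://"
                "value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-"
                "jfkaliejs5&bf_action=jildape")

print(id_if_fragment(with_hash))
print(id_if_fragment(without_hash))
```

This is longer than a one-line regex, but the rule about the '#' becomes an explicit check on the fragment instead of a side effect of the pattern.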
So you want to match the value of the id parameter, but only if it has a trailing section containing a '#' symbol (without matching the '#' or what's after it)?
Not knowing the specifics of what style of regexes you're using, how about something like:
id=([^#&]*)#
regex = "id=([\\w-]+?)#"
This will grab everything in the character class [a-zA-Z_0-9-] between 'id=' and '#', assuming everything between 'id=' and '#' is in that character class (i.e. if an '&' is in there, the regex will fail).
id=
-Self-explanatory: this looks for the exact match of 'id='.
([\\w-]
-This opens the capturing group and defines a character class. The \\w is an escaped \w. '\w' is a predefined character class in Java that is equal to [a-zA-Z_0-9]. I added '-' to this class because of the assumed pattern from your examples.
+?)
-This is a reluctant quantifier that looks for the shortest possible match of the regex. It sits inside the group so that the group captures the whole value rather than only its last character.
#
-The end of the regex, the last character we are looking for to match the pattern.
If you are looking to grab every character between 'id=' and the first '#' following it, the following will work and it uses the same logic as above, but replaces the character class [\\w-] with ., which matches anything.
regex = "id=(.+?)#"
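To compare the patterns from these answers, here is a sketch in Python's re module (an assumption; the last answer targets Java, where the backslashes are doubled inside string literals). The helper name capture and the short URL are made up; the two long URLs are the ones from the question.

```python
import re

def capture(pattern, url):
    """Return group 1 of the first match, or None when nothing matches."""
    match = re.search(pattern, url)
    return match.group(1) if match else None

with_hash = ("/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://"
             "value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-"
             "jfkaliejs5#c452fds-634d-f424fds-cdsa&bf_action=jildape")
without_hash = ("/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://"
                "value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-"
                "jfkaliejs5&bf_action=jildape")

# Requiring a literal '#' after the value rejects URLs without one.
print(capture(r'id=([^#&]*)#', with_hash))
print(capture(r'id=([^#&]*)#', without_hash))

# Hypothetical URL where another parameter sits between the id and the '#'.
mixed = 'topic?id=abc&x=1#frag'
print(capture(r'id=(.+?)#', mixed))      # the dot also crosses the '&'
print(capture(r'id=([^#&]*)#', mixed))   # the negated class refuses to
```

Which behaviour is right depends on whether an '&' between the id and the '#' should count as part of the value.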