Write regex without negations

In a previous post I asked for help rewriting a regex without negation.
Starting regex:
https?:\/\/(?:.(?!https?:\/\/))+$
Ended up with:
https?:[^:]*$
This works fine, but I've noticed that if the URL contains a : besides the one after http or https, it is not selected.
Here is a string which is not working:
sometextsometexhttp://websites.com/path/subpath/#query1sometexthttp://websites.com/path/subpath/:query2
Note the :query2.
How can I modify the second regex listed here so that it selects URLs which contain :?
Expected output:
http://websites.com/path/subpath/cc:query2
Also, I would like to select everything up to the first occurrence of ?=param
Input:
sometextsometexhttp://websites.com/path/subpath/#query1sometexthttp://websites.com/path/subpath/cc:query2/text/?=param
Output:
http://websites.com/path/subpath/cc:query2/text/

It is a pity that Go regex does not support lookarounds.
However, you can obtain the last link with a sort of a trick: match all possible links and other characters greedily and capture the last link with a capturing group:
^(?:https?://|.)*(https?://\S+?)(?:\?=|$)
Together with the lazy \S+? (one or more non-whitespace characters, as few as possible), this also lets you capture the link only up to the ?=.
See regex demo and Go demo
var r = regexp.MustCompile(`^(?:https?://|.)*(https?://\S+?)(?:\?=|$)`)
fmt.Printf("%q\n", r.FindAllStringSubmatch("sometextsometexhttp://websites.com/path/subpath/#query1sometexthttp://websites.com/path/subpath/:query2", -1)[0][1])
fmt.Printf("%q\n", r.FindAllStringSubmatch("sometextsometexhttp://websites.com/path/subpath/#query1sometexthttp://websites.com/path/subpath/cc:query2/text/?=param", -1)[0][1])
Results:
"http://websites.com/path/subpath/:query2"
"http://websites.com/path/subpath/cc:query2/text/"
In case there can be spaces in the last link, use just .+?:
^(?:https?://|.)*(https?://.+?)(?:\?=|$)

How can I write a regex that will match a nested [quote] BB tag?

As part of a forum that uses BBCode to store posts, I'm trying to write a way to detect mentions and quotes, in order to notify the users.
I have it working for all cases except nested quotes.
This is my regex so far (Python 2.7):
regex = r'\[url=.*?\/users\/(.*?)\/\]#.*?\[\/url\]|\[quote="(.*?)"\].*?\[\/quote\]'
These are my test cases:
# This works fine, I get the `user1` group.
Hello [url=/users/user1/]#Foo Bar[/url]
# This works fine, I get the `user2` and `user3` groups.
[quote="user2"]Test message[/quote] OK [quote="user3"]Test message[/quote]
# This doesn't work as I'd like. I only get the `user4` group, but not `user5`.
[quote="user4"][quote="user5"]Test message[/quote][/quote]
How can I modify the regular expression to match also the third test with the nested [quote] block?
Here's a link to regex101 for your convenience: https://regex101.com/r/Ov5SI1/1
Thank you!
A minor change in the original regex will solve your problem. Here is the original regex:
\[url=.*?\/users\/(.*?)\/\]#.*?\[\/url\]|\[quote="(.*?)"\].*?\[\/quote\]
Error
Consider the input string:
[quote="user4"][quote="user5"]Test message[/quote][/quote]
The second alternative of the alternation does match it. However, the first match is
[quote="user4"][quote="user5"]Test message[/quote]
The next match attempt then starts after that [/quote]. It cannot start anywhere earlier, because all the preceding text, including [quote="user5"], has already been consumed by the first match.
Correction
Solution 1:
Changing the .*?\[\/quote\] part of the original regex to a lookahead lets both user4 and user5 match, because a lookahead checks for the closing tag without consuming any text.
\[quote=\"(.*?)\"\](?=.*?\[\/quote\])
final regex: \[url=.*?\/users\/(.*?)\/\]#.*?\[\/url\]|\[quote=\"(.*?)\"\](?=.*?\[\/quote\])
Solution 2:
Focus on just the right-hand part of the alternation: \[quote="(.*?)"\].*?\[\/quote\]
Only \[quote="(.*?)"\] is necessary if you want to find any pattern of the form [quote="..."]. The remaining portion is unnecessary.
Here is the final regex:
\[url=.*?\/users\/(.*?)\/\]#.*?\[\/url\]|\[quote=\"(.*?)\"\]
Please do remember that the regex must be applied globally to find all the matches.
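A minimal sketch of Solution 1 in Python, assuming the `re` module; the names `MENTION_RE` and `mentioned_users` are made up for illustration. `finditer` applies the pattern globally, as the note above requires.

```python
import re

# Solution 1 from above, written without the redundant escapes before / and ".
MENTION_RE = re.compile(
    r'\[url=.*?/users/(.*?)/\]#.*?\[/url\]'
    r'|\[quote="(.*?)"\](?=.*?\[/quote\])'
)


def mentioned_users(post):
    """Return user names from [url] mentions and [quote] tags, in order."""
    # Group 1 is set for a [url=/users/NAME/] mention,
    # group 2 for a [quote="NAME"] tag; the other group is None.
    return [m.group(1) or m.group(2) for m in MENTION_RE.finditer(post)]


print(mentioned_users('[quote="user4"][quote="user5"]Test message[/quote][/quote]'))
```

For the nested test case this yields both user4 and user5. If a quote can span several lines, the lookahead would also need `re.DOTALL`, since `.` does not match a newline by default.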

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work from is the linked image. The creators can't even confirm which flavour of regex it is. They believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the answers below have worked so far. I tested each one in the live system (rather than in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature, and as mentioned before, I'm not sure whether it is Python. I appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
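A minimal sketch of the two operations, assuming Python's `re` module is available (the question could not confirm the regex flavour, and this only helps where a replace step exists). The names `POSITIONER_RE`, `TARGET_RE` and `find_codes` are hypothetical.

```python
import re

POSITIONER_RE = re.compile(r'\?[0-9]{2}')  # a positioner such as ?21
TARGET_RE = re.compile(r'T[0-9]{7}')       # T followed by seven digits


def find_codes(line):
    """Strip positioners first, then look for T codes in what is left."""
    cleaned = POSITIONER_RE.sub('', line)
    return TARGET_RE.findall(cleaned)


for line in ['T123?214567', 'T?211234567']:
    print(find_codes(line))
```

Both sample lines reduce to T1234567 once the positioner is removed.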
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Demo
and you may try this regex to find what you want
(T(?:\d{7}|[\#\d]{8}))\s
Demo, in which the target string is captured in group 1 (or \1).
Finally, replace # with ?21 or something you like.
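A Python script may look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre = re.compile(r'\?21')  # the positioner to hide
# A T followed by 7 digits, or by 8 characters that are digits or the # placeholder.
# The trailing \s means the last line (nothing after it) is not matched.
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# Print the matches with the placeholder still in place.
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Print the matches again with the placeholder turned back into ?21.
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))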
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group 1 followed by group 2: \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
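A minimal sketch of this replacement, assuming Python's `re.sub` (the linked example itself is not reproduced here). `CODE_RE` and `strip_positioner` are names chosen for illustration.

```python
import re

# Group 1: T plus any leading digits; optional positioner; group 2: trailing digits.
CODE_RE = re.compile(r'\b(T\d*)(?:\?\d{2})?(\d+)\b')


def strip_positioner(text):
    """Rebuild each code from groups 1 and 2, dropping the positioner between them."""
    return CODE_RE.sub(r'\1\2', text)


print(strip_positioner('T123?214567'))
print(strip_positioner('T?211234567'))
```

A code without a positioner, such as T0000001, passes through unchanged because the optional part simply matches nothing.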
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

How can I match multiple hits between 2 delimiters?

Hi, my fellow RegEx'ers ;)
I'm trying to match multiple Texts between every two quotes
Here's my text:
...random code
someArray[] = ["Come and",
"get me,",
"or fail",
"trying!",
"Yours truly"]
random code...
So far, I managed to get the correct matches with two patterns, executed after each other:
(?s)someArray\[\].*?=.*?\[(.*?)\]
this extracts the text between the two brackets and on the result, I use this one:
"(.*?)"
This is working just fine, but I'd love to get the Texts in one regex.
Any help is highly appreciated!
Consider using \G. With its help, you may match "(.*?)" preceded by either someArray[] = [ or previous match of "(.*?)" (well, strictly speaking previous match of entire regex). Then just grab first capture groups from all matches:
(?:(?s).*someArray\[\].*?=.*?\[|\G[^"\]]+)"(.*?)"
Demo: https://regex101.com/r/eBQWdU/3
How you grab the first capture groups depends on the language you're using. For example, in PHP you may do something like this:
preg_match_all('/(?:(?s).*someArray\[\].*?=.*?\[|\G[^"\]]+)"(.*?)"/', $input, $matches);
$array_items = $matches[1];
Demo: https://ideone.com/mZgU1x
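Not every engine supports \G. Python's built-in `re` module, for instance, does not, so there the two-pass approach from the question remains the practical fallback. A minimal sketch, with `source`, `block` and `items` as illustrative names:

```python
import re

source = '''...random code
someArray[] = ["Come and",
"get me,",
"or fail",
"trying!",
"Yours truly"]
random code...'''

# Pass 1: isolate the text between the brackets of the array literal.
block = re.search(r'(?s)someArray\[\].*?=.*?\[(.*?)\]', source)
# Pass 2: pull every quoted string out of that isolated text.
items = re.findall(r'"(.*?)"', block.group(1)) if block else []

print(items)
```

This sketch assumes, like the original patterns, that no item contains a `]` or an escaped quote.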

Get all the characters until a new date/hour is found

I have to parse a lot of content with a regular expression.
The content might, for example, be:
14-08-2015 14:18 : Example : Hello =) How are you?
What are you doing?
14-08-2015 14:19: Example2 : I'm fine thanks!
I have this regular expression, which returns 2 matches with the groups that I need (date, hour, name, multi-line message):
(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):([^\d]+)
The problem is that if the message contains a digit, the last group stops capturing at that digit.
For example in this case this will not work:
14-08-2015 14:18 : Example : Hello =) How are you?
What are you 2 doing?
14-08-2015 14:19: Example2 : I'm fine thanks!
How do I get all the characters until a new date/hour is found?
The problem is with your final capturing group ([^\d]+).
Instead you can use ((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)
The outer parentheses, ( ... ), indicate a capturing group.
The next set of parentheses, (?: ... )+, indicates a non-capturing group that we want to match one or more times.
Inside we have a negative lookahead: (?!\d{2}-\d{2}-\d{4}). This says that the character we are about to match cannot be the start of a date.
What we actually match, [\s\S], is any single character, including a newline.
The entire regex that works looks like this:
(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)
https://regex101.com/r/wH5xR2/2
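A minimal sketch of this regex applied to the sample from the question, assuming Python's `re` module; `MESSAGE_RE`, `chat` and `parsed` are illustrative names.

```python
import re

MESSAGE_RE = re.compile(
    r'(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):'
    # Consume one character at a time, as long as no date starts here.
    r'((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)'
)

chat = (
    "14-08-2015 14:18 : Example : Hello =) How are you?\n"
    "What are you 2 doing?\n"
    "14-08-2015 14:19: Example2 : I'm fine thanks!"
)

# Strip the padding that the groups pick up around the name and message.
parsed = [
    (date, hour, name.strip(), message.strip())
    for date, hour, name, message in MESSAGE_RE.findall(chat)
]

for entry in parsed:
    print(entry)
```

The digit in "What are you 2 doing?" no longer ends the first message, because only a full dd-mm-yyyy date stops the last group.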
Use a lookahead for dates and get everything up to that.
/^(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):\s?((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)/sm
I've edited your regex in two ways:
Added ^ to the front, ensuring you only start from timestamps at the start of a line, which should filter out most issues with people posting timestamps
Replaced the last capturing group with ((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)
(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}) is a negative lookahead for a date and time
(?:(lookahead).)* matches any number of characters, stopping where a date anchored to the start of a line begins.
((?:(lookahead).)*) just captures the group for you.
It's not that efficient, but it works. Note the s flag for dotall (dot matches newlines) and the m flag that lets ^ match at the start of each line. The ^ is necessary in the lookahead so that the match doesn't stop when someone posts a timestamp inside a message, and at the start of the pattern to make sure you only match dates at the start of a line.
DEMO: https://regex101.com/r/rX8eH0/3
DEMO with flags in regex: https://regex101.com/r/rX8eH0/4
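A minimal sketch of the line-anchored variant, assuming Python's `re` module, where the s and m flags become `re.S` and `re.M`. The sample `log` is invented for illustration: its first message contains a timestamp in the middle of a line.

```python
import re

ENTRY_RE = re.compile(
    r'^(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):\s?'
    # Stop only where a timestamp begins at the start of a line.
    r'((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)',
    re.S | re.M,
)

log = (
    "14-08-2015 14:18 : Example : Call me on 15-08-2015 10:00 please\n"
    "What are you 2 doing?\n"
    "14-08-2015 14:19: Example2 : I'm fine thanks!"
)

entries = [m.groups() for m in ENTRY_RE.finditer(log)]

for entry in entries:
    print(entry)
```

The timestamp inside the first message is not at the start of a line, so it stays part of that message instead of starting a new entry.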

regex optional part in prefix, but do not include it in matches if it present

The problem is easier to see in code than to describe. I have the following regex
(?<=First(Second)?)\w{5}
and following sample data
FirstSecondText1
FirstText2
I only want the matches Text1 and Text2, but I get three: Secon is matched as well, and I don't want that.
I've played around with it but can't seem to get it to work.
You need an additional negative lookahead:
(?<=First(Second)?)(?!Second)\w{5}
If you want to avoid using Second twice, you could do it without lookaround and take the result of the first capturing group:
First(?:Second)?(\w{5})
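A minimal sketch of the capturing-group variant, assuming Python's `re` module. This variant is also the portable one: Python's `re` rejects a variable-width lookbehind such as (?<=First(Second)?). `PREFIXED_RE`, `data` and `matches` are illustrative names.

```python
import re

# The optional prefix is matched but not captured; only the five
# word characters after it end up in group 1.
PREFIXED_RE = re.compile(r'First(?:Second)?(\w{5})')

data = "FirstSecondText1\nFirstText2"

# With a single capturing group, findall returns just that group.
matches = PREFIXED_RE.findall(data)
print(matches)
```

Because (?:Second)? is greedy, it consumes Second whenever it is present, so Secon is never captured.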
You can try this regex: (?<=First(Second)?)\w{5}$. All you have to do is add a $ at the end so that the regex does not match the text Secon. You can use this as long as you are sure of the pattern that comes at the end of the input text. In this case it is \w{5}$.