Regex to find the last occurrence that is NOT the end character

I have two URLs from which I need to extract (actually split before) the pagename, i.e. last text string after the last /. For example:
https://example.com/en/pagename
https://example.com/en/pagename/
My current regex finds the last occurrence of the "/" character, but when the / is at the end of the URL, I need to select the PREVIOUS / in order to split before the pagename. Current regex is:
\/(?!.*\/)

Method 1
I'd guess that matching the last path segment, with an optional trailing slash, would work:
[^/]+/?$
or, if you wish to capture the pagename,
([^/]+)/?$
Method 2
To select the forward slash right before the pagename, we can try a positive lookahead:
/(?=[^/]+/?$)
If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top-right panel and shows how it matches against sample inputs.
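As a quick check of both methods, here is a minimal Python sketch run against the two URLs from the question. The function names `pagename` and `split_before_pagename` are made up for illustration; only the two patterns come from the answer.

```python
import re

# Method 1: capture the last path segment, ignoring an optional trailing slash.
LAST_SEGMENT = re.compile(r"([^/]+)/?$")
# Method 2: match only the slash that sits right before the last segment.
SLASH_BEFORE_SEGMENT = re.compile(r"/(?=[^/]+/?$)")


def pagename(url):
    """Return the last path segment, or None if there is none."""
    match = LAST_SEGMENT.search(url)
    return match.group(1) if match else None


def split_before_pagename(url):
    """Split the URL at the slash that precedes the pagename."""
    return SLASH_BEFORE_SEGMENT.split(url)


for url in ("https://example.com/en/pagename", "https://example.com/en/pagename/"):
    print(pagename(url), split_before_pagename(url))
```

Note that Method 2 keeps the trailing slash on the second piece when the URL ends with one, since the lookahead only decides where to split.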

Related

Regex to get all data before second last special character

I couldn't find any previously asked question similar to this. I need a regex to get all the data before the second-to-last special character.
For example:
suite 1, street 1, zip city, country
I need only suite 1, street 1.
I know how to get the data before just the last special character using [^,]*$ but not the second last one.
You can use the following regex and the first capturing group will have your desired substring:
(.*)(?:,[^,]*){2}$
Demo: https://regex101.com/r/AWpsL3/1
Or, if the tool you're using doesn't support capturing groups, you can use the following regex with a lookahead instead:
.*(?=(?:,[^,]*){2}$)
Demo: https://regex101.com/r/AWpsL3/4
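Both variants can be tried in Python, which supports capturing groups and lookahead alike. This is only a sketch; the helper names are hypothetical, and the sample string is the one from the question.

```python
import re

address = "suite 1, street 1, zip city, country"


def before_second_last_comma(text):
    """Capturing-group variant: group 1 holds everything before the last two fields."""
    match = re.match(r"(.*)(?:,[^,]*){2}$", text)
    return match.group(1) if match else None


def before_second_last_comma_lookahead(text):
    """Lookahead variant: the whole match is the wanted substring."""
    match = re.match(r".*(?=(?:,[^,]*){2}$)", text)
    return match.group(0) if match else None


print(before_second_last_comma(address))
print(before_second_last_comma_lookahead(address))
```

Both helpers return None when the input has fewer than two commas, because the `{2}` quantifier cannot be satisfied.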
You can use a lookahead:
.+(?=.*,.*,)
Explanation:
.+ greedily matches as much as possible, backtracking to the last position at which the lookahead still succeeds.
The positive lookahead (?=.*,.*,)
asserts that at least two commas remain after that position.
check demo
Depending on the regex implementation, lookaround (which the above solution uses) may not be supported. A workaround is to split the string on your delimiter character (in this case, a comma) and then join the first two elements.
const mystr = 'suite 1, street 1, zip city, country';
const parts = mystr.split(',');
// Rejoin the first two comma-separated fields.
const result = parts.slice(0, 2).join(',');
Try (([ \w]+,?){2})(?=,) and don't use the global flag (with it, matching doesn't stop after the first match).

Extracting address with Regex

I'm trying to look for Street|St|Drive|Dr and then get all the contents of the line to extract the address:
(?:(?!\s{2,}|\$).)*(Street|St|Drive|Dr).*?(?=\s{2,})
.. but it also matches:
Full match 420-442 ` Tax Invoice/Statement`
Group 1. 433-435 `St`
Full match 4858-4867 `163.66 DR`
Group 1. 4865-4867 `DR`
Full match 11053-11089 ` Permanent Water Saving Plan, please`
Group 1. 11077-11079 `Pl`
How do I match only whole words and not substrings, so that it ignores words that merely contain those words (the first match, for example)?
One option is to use the word-boundary anchor, \b, to accomplish this:
(?:(?!\s{2,}|\$).)*\b(Street|St|Drive|Dr)\b.*?(?=\s{2,})
If you provide an example of the raw text you're parsing, I'll be able to give additional help if this doesn't work.
Edit:
From the link you posted in a comment, it seems that the \b solution answers your question:
"How do I match only whole words and not substrings, so that it ignores words that merely contain those words (the first match, for example)?"
However, it seems like there are additional issues with your regex.
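The effect of the \b anchors can be seen in a small Python sketch. The sample text below is made up for illustration (the original raw text was not posted), and only the alternation itself is tested, not the full address pattern.

```python
import re

# Hypothetical sample line: "Statement" contains "St" as a substring.
sample = "Tax Invoice/Statement   12 Main St   Melbourne"

# Without word boundaries, "St" is also found inside "Statement".
loose = re.findall(r"(Street|St|Drive|Dr)", sample)

# With \b on both sides, only the standalone word "St" is found.
strict = re.findall(r"\b(Street|St|Drive|Dr)\b", sample)

print(loose)
print(strict)
```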

Get all the characters until a new date/hour is found

I have to parse a lot of content with a regular expression.
The content might, for example, be:
14-08-2015 14:18 : Example : Hello =) How are you?
What are you doing?
14-08-2015 14:19: Example2 : I'm fine thanks!
I have this regular expression that will of course return 2 matches, and the groups that I need (date, hour, name, multi-line message):
(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):([^\d]+)
The problem is that if a number is written inside the message this will not be OK, because the regex will stop getting more characters.
For example in this case this will not work:
14-08-2015 14:18 : Example : Hello =) How are you?
What are you 2 doing?
14-08-2015 14:19: Example2 : I'm fine thanks!
How do I get all the characters until a new date/hour is found?
The problem is with your final capturing group ([^\d]+).
Instead you can use ((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)
The outer parentheses, ( ... ), indicate a capturing group.
The next set of parentheses, (?: ... )+, indicates a non-capturing group that we want to match one or more times.
Inside we have a negative lookahead: (?!\d{2}-\d{2}-\d{4}). This says that whatever we are matching cannot include a date.
What we actually capture is [\s\S], which means we capture every character, including a newline.
The entire regex that works looks like this:
(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)
https://regex101.com/r/wH5xR2/2
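To see the full expression at work on the failing example from the question, here is a minimal Python sketch. The `parse` helper and the stripping of whitespace are additions for readability, not part of the answer.

```python
import re

# The complete pattern from above: date, time, name, then every character
# (newlines included) up to the next date.
MESSAGE = re.compile(
    r"(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):"
    r"((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)"
)

chat = (
    "14-08-2015 14:18 : Example : Hello =) How are you?\n"
    "What are you 2 doing?\n"
    "14-08-2015 14:19: Example2 : I'm fine thanks!"
)


def parse(text):
    """Return (date, time, name, message) tuples with surrounding whitespace removed."""
    return [tuple(group.strip() for group in m.groups()) for m in MESSAGE.finditer(text)]


for entry in parse(chat):
    print(entry)
```

The digit in "What are you 2 doing?" no longer ends the message, because only a full date pattern stops the repetition.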
Use a lookahead for dates and get everything up to that.
/^(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):\s?((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)/sm
I've edited your regex in two ways:
Added ^ to the front, ensuring you only start from timestamps on their own line, which should filter out most issues with people posting timestamps
Replaced the last capturing group with ((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)
(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}) is a negative lookahead, with date
(?:(lookahead).)* Looks for any amount of characters that aren't followed by a date anchored to the start of a line.
((?:(lookahead).)*) Just captures the group for you.
It's not that efficient, but it works. Note the s flag for dotall (dot matches newlines) and m flag that lets ^ match at the start of line. ^ is necessary in the lookahead so that you don't stop the match in case someone posts a timestamp, and in the start to make sure you only match dates from the start of a line.
DEMO: https://regex101.com/r/rX8eH0/3
DEMO with flags in regex: https://regex101.com/r/rX8eH0/4
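The difference the ^ anchor makes shows up when a date appears in the middle of a message. The sketch below compares the anchored pattern (with Python's re.S and re.M standing in for the s and m flags) against an unanchored lookahead; the sample chat text is invented for the comparison.

```python
import re

# Anchored variant: a message only ends at a timestamp that starts a line.
ANCHORED = re.compile(
    r"^(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):\s?"
    r"((?:(?!^\d{2}-\d{2}-\d{4}\s?\d{2}:\d{2}).)*)",
    re.S | re.M,
)
# Unanchored variant: any date, anywhere, ends the message.
UNANCHORED = re.compile(
    r"(\d{2}-\d{2}-\d{4})\s?(\d{2}:\d{2})\s?:([^:]+):"
    r"((?:(?!\d{2}-\d{2}-\d{4})[\s\S])+)"
)

# Hypothetical chat in which a user mentions a date mid-message.
chat = (
    "14-08-2015 14:18 : Example : Meet on 15-08-2015 at noon\n"
    "ok?\n"
    "14-08-2015 14:19: Example2 : Fine!"
)


def messages(pattern, text):
    """Return only the message bodies (group 4), stripped of outer whitespace."""
    return [m.group(4).strip() for m in pattern.finditer(text)]


print(messages(ANCHORED, chat))
print(messages(UNANCHORED, chat))
```

With the anchored pattern the first message keeps the embedded date; with the unanchored one it is cut off at "Meet on".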

Write regex without negations

In a previous post I've asked for some help on rewriting a regex without negation
Starting regex:
https?:\/\/(?:.(?!https?:\/\/))+$
Ended up with:
https?:[^:]*$
This works fine, but I've noticed that if the URL contains a : besides the one after https?, it will not be selected.
Here is a string which is not working:
sometextsometexhttp://websites.com/path/subpath/#query1sometexthttp://websites.com/path/subpath/:query2
You can notice the :query2
How can I modify the second regex listed here so that it selects URLs which contain a : character?
Expected output:
http://websites.com/path/subpath/cc:query2
Also, I would like to select everything up to the first occurrence of ?=param
Input:
sometextsometexhttp://websites.com/path/subpath/#query1sometexthttp://websites.com/path/subpath/cc:query2/text/?=param
Output:
http://websites.com/path/subpath/cc:query2/text/
It is a pity that Go regex does not support lookarounds.
However, you can obtain the last link with a sort of a trick: match all possible links and other characters greedily and capture the last link with a capturing group:
^(?:https?://|.)*(https?://\S+?)(?:\?=|$)
Together with the lazy \S+? (non-whitespace) matching, this also lets you capture the link up to the ?=.
See regex demo and Go demo
var r = regexp.MustCompile(`^(?:https?://|.)*(https?://\S+?)(?:\?=|$)`)
fmt.Printf("%q\n", r.FindAllStringSubmatch("sometextsometexhttp://websites.com/path/subpath/#query1sometexthttp://websites.com/path/subpath/:query2", -1)[0][1])
fmt.Printf("%q\n", r.FindAllStringSubmatch("sometextsometexhttp://websites.com/path/subpath/#query1sometexthttp://websites.com/path/subpath/cc:query2/text/?=param", -1)[0][1])
Results:
"http://websites.com/path/subpath/:query2"
"http://websites.com/path/subpath/cc:query2/text/"
In case there can be spaces in the last link, use just .+?:
^(?:https?://|.)*(https?://.+?)(?:\?=|$)
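Since the trick uses no lookarounds, the same pattern also runs unchanged in other engines. Below is a minimal Python sketch of the .+? variant; the `last_link` helper and the second input (a link containing a space) are made up for illustration.

```python
import re

# Greedy prefix swallows earlier links; group 1 captures the last one,
# lazily, up to "?=" or the end of the string.
LAST_LINK = re.compile(r"^(?:https?://|.)*(https?://.+?)(?:\?=|$)")


def last_link(text):
    """Return the last http(s) link in text, cut before '?=' if present."""
    match = LAST_LINK.search(text)
    return match.group(1) if match else None


print(last_link(
    "sometextsometexhttp://websites.com/path/subpath/#query1sometext"
    "http://websites.com/path/subpath/cc:query2/text/?=param"
))
# Hypothetical input with a space inside the last link.
print(last_link("xhttp://a.com/http://b.com/my page/?=param"))
```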

Using regex to find a pattern which does not start with a certain String

I need to regex-match numbers in a certain pattern which works already, but only if there is not (+ right in front of it.
Example Strings I want to have a valid match within: 12, 12.5, 200/300, 200/300/400%, 1/2/3/4/5/6/7
Example Strings I want to have no valid match within: (+10% juice), (+4)
I can already get all the valid matches with (\d+[/%.]?)+, but I need help to exclude the example Strings I want to have no valid match in (which means, only match if there is NOT the String (+ right in front of the mentioned pattern).
Can someone help me out? I have already experimented with the ! (like ?!(\(\+)(\d+[/%.]?)+) but for some reason I can't get it to work the way I need it.
(You can use http://gskinner.com/RegExr/ to test regex live)
EDIT: I may have used the wrong words. I don't want to check whether the search string starts with (+; I want to make sure that there is no (+ right in front of my match.
Try regexr with the following inputs:
Match: (\d+[/%.]?)+
Check the checkbox for global (to search for more than one match within the text)
Text:
this should find a match: 200/300/400
this shouldnt find any match at all: (+100%)
this should find a match: 40/50/60%
this should find a match: 175
Currently it will find a match in all 4 lines. I want a regex that no longer finds a match in line 2.
The regex construct you want is a "negative lookbehind"; see http://www.regular-expressions.info/lookaround.html. A negative lookbehind is written (?<!DONTMATCHME), where DONTMATCHME is the expression you don't want to find just before the next bit of the expression. Crucially, the lookbehind is not considered part of the match itself.
Your expression should be:
(?<![+\d/\.])(\d+[/%.]?)+
Edit - changed negative lookbehind to any character that is not a + or another digit!
Edit #2 - moved the lookbehind outside the main capture brackets. Expanded the range of not acceptable characters before the match to include / & .
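The final expression can be checked against the four test lines from the question with a short Python sketch. The `first_match` helper is hypothetical; Python accepts this lookbehind because a character class has a fixed width of one.

```python
import re

# Negative lookbehind: the match may not be preceded by +, a digit, / or .
NUMBERS = re.compile(r"(?<![+\d/\.])(\d+[/%.]?)+")

lines = [
    "this should find a match: 200/300/400",
    "this shouldnt find any match at all: (+100%)",
    "this should find a match: 40/50/60%",
    "this should find a match: 175",
]


def first_match(text):
    """Return the first matched number pattern in text, or None."""
    match = NUMBERS.search(text)
    return match.group(0) if match else None


for line in lines:
    print(repr(first_match(line)))
```

Including \d in the lookbehind is what keeps the engine from simply starting the match one digit later, at the "00%" inside "(+100%)".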