Stop when a special word is met - Regex - regex

I'm looking for help with a specific regex: (THE_WORD_I_WANT_TO_FIND)[^.?!\w]+([^.?!\s]+[^.?!\w]+){0,NUMBER_OF_WORDS}(MY_WORD_AT_END)
To explain, I'm looking for a specific word that occurs before another word. There are two conditions: the match must stay within the sentence that contains the WORD_AT_END, and within a specific number of words before it.
This regex does the job, but I want to add another sentence delimiter: (\s\-\s) (in addition to . ? !).
Example :
Blablabla. A full Reference - Help is available in the Library, or watch the video Tutorial.
With this text, the regex (Help)[^.?!\w]+([^.?!\s]+[^.?!\w]+){0,}(watch) matches, and (Reference)[^.?!\w]+([^.?!\s]+[^.?!\w]+){0,}(watch) must not match.
Could you please help me?
Thank you !
SOLUTION (Thanks to #MostafaHussein) :
(Help)((?!\s-\s)\s(([\w|\w-|\pL|\pL-])+(?!\s-\s)\s+){0,})?(watch)
Here, - is a sentence delimiter if it is surrounded by two spaces.

The following Regex:
(Help)\s(?!-)(?s).+?(watch)
would match only:
Help is available in the Library, or watch
And not:
Reference - Help is available in the Library, or watch
This is because a - follows the first specified word and a space, e.g. Reference -
Update:
This regex will match any sentence as long as it does not contain a - surrounded by whitespace:
Help((?!\s-\s)\s(([\w|\w-|\pL|\pL-])+\s+){0,7})?watch
Note: at most 7 words may appear between Help and watch (the {0,7} quantifier), and nothing matches if there is a - surrounded by spaces. Unicode letters are also taken into consideration, so a character such as ê is matched correctly.
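For anyone who wants to try the idea outside PCRE, here is a minimal Python sketch. It is not the accepted regex: Python's built-in re module has no \pL, so this variant treats a word as any run of characters that are neither whitespace nor . ? !, and rejects a lone - between spaces with a lookahead. The names find_before and max_words are made up for the illustration.

```python
import re

def find_before(first, last, text, max_words=7):
    """Return the text from `first` to `last` if both are in one sentence.

    A "-" surrounded by whitespace counts as a sentence delimiter,
    in addition to . ? and !
    """
    pattern = re.compile(
        rf"\b{re.escape(first)}"
        # up to max_words tokens; a token may not be a lone "-"
        rf"(?:\s+(?!-\s)[^\s.?!]+){{0,{max_words}}}?"
        rf"\s+{re.escape(last)}\b"
    )
    match = pattern.search(text)
    return match.group(0) if match else None

text = ("Blablabla. A full Reference - Help is available in the Library, "
        "or watch the video Tutorial.")

print(find_before("Help", "watch", text))       # Help is available in the Library, or watch
print(find_before("Reference", "watch", text))  # None
```

Because the tokens exclude . ? !, an abbreviation such as "e.g." would also end the sentence in this sketch.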

Related

Regex: extract characters from two patterns

I have the following strings:
https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&...
https://www.google.com/today/monday/jbkwe3.ho4eho.8495GETD?weatherType=WHTDSG&...
I'd like to extract jbkwe3.ho4eho.8495GETD or abcde2.hopeho.3345GETD. Anything between the {weekday}/ and the ?weatherType=.
I've tried (?<=sunday\/)$.*?(?=\?weatherType=) but it only works for the first line, and I want to make it applicable to all strings regardless of the value of {weekday}.
I tried (?<=\/.*\/)$.*?(?=\?weatherType=) but it didn't work. Could anyone familiar with regex lend some help? Thank you!
[Update]
I'm new to regex, but I was experimenting with it in the Sublime Text editor via the "find" functionality, which I think should be PCRE (according to this post).
Try this regex:
(?:sun|mon|tues|wednes|thurs|fri|satur)day\/\K[^?]+(?=\?weatherType)
Explanation:
(?:sun|mon|tues|wednes|thurs|fri|satur)day - matches a day of the week, i.e., sunday, monday, tuesday, wednesday, thursday, friday, saturday
\/ - matches /
\K - discards whatever has been matched so far and pretends that the match starts from the current position. This is available in PCRE.
[^?]+ - matches 1 or more occurrences of any character that is not a ?
(?=\?weatherType) - the above subpattern [^?]+ will match all characters that are not ? until it reaches a position which is immediately followed by a ? followed by weatherType
To make the match case-insensitive, you can prepend the regex with (?i).
In the examples given, you actually only need to grab the characters between the last forward slash ("/") and the first question mark ("?").
You didn't mention which regex flavor (i.e., PCRE, grep, Oracle, etc.) you're using, and the actual syntax will vary depending on this, but in general, something like the following (Perl) replacement regex would handle the examples given:
s/.*\/([^?]*)\?.*/$1/gm
There are other (and more efficient) ways, but this will do the job.
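As a rough illustration of both answers in a flavor without \K, here is a Python sketch. Python's built-in re module does not support \K, so a capture group stands in for it; the function names are hypothetical and only the two sample URLs from the question are assumed.

```python
import re

# Weekday-based pattern; the capture group replaces \K from the PCRE version.
DAY_RE = re.compile(
    r"(?:sun|mon|tues|wednes|thurs|fri|satur)day/([^?]+)\?weatherType",
    re.IGNORECASE,
)

def extract_after_weekday(url):
    match = DAY_RE.search(url)
    return match.group(1) if match else None

def extract_last_segment(url):
    # Same idea as the Perl substitution s/.*\/([^?]*)\?.*/$1/:
    # keep only what sits between the last "/" and the first "?".
    return re.sub(r"^.*/([^?]*)\?.*$", r"\1", url)

urls = [
    "https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&...",
    "https://www.google.com/today/monday/jbkwe3.ho4eho.8495GETD?weatherType=WHTDSG&...",
]

for url in urls:
    print(extract_after_weekday(url), extract_last_segment(url))
```

The second function does not depend on the weekday at all, which makes it the more tolerant of the two if the path segment before the ID ever changes.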

How to extract characters from a string with optional string afterwards using Regex?

I am in the process of learning regex and have been stuck on this case. I have a URL that can be in one of two states. EXAMPLE 1:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA
OR EXAMPLE 2:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA
I need to extract the 1HYcYZCOpaLjg51qUg8ilA ID
So far I am using this: (?<=track\/)(.*)(?=\?)? which works well for Example 2 but it includes the ?si=Nf5w1q9MTKu3zG_CJ83RWA when matching with Example 1.
BUT if I remove the ? at the end of the expression then it works for Example 1 but not Example 2! Doesn't that mean that last group (?=\?) is optional and should match?
Where am I going wrong?
Thanks!
I searched a handful of "Questions that may already have your answer" suggestions from SO, and didn't find this case, so I hope asking this is okay!
The capturing group in your regular expression is trying to match anything (.) as much as possible due to the greediness of the quantifier (*).
When you use:
(?<=track\/)(.*)(?=\?)
only 1HYcYZCOpaLjg51qUg8ilA from the first example is captured, as there is no question mark in your second example.
When using:
(?<=track\/)(.*)(?=\??)
You are effectively making the positive lookahead optional, so the capturing group will try to match as much as possible (including the question mark), so that 1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA and 1HYcYZCOpaLjg51qUg8ilA are matched, which is not the desired output.
Rather than matching anything, it is perhaps more appropriate for you to match alphanumerical characters \w only.
(?<=track\/)(\w*)(?=\??)
Alternatively, if you are expecting other characters, let's say a hyphen - or an underscore _, you may use a character class.
(?<=track\/)([a-zA-Z0-9_-]*)(?=\??)
Or you might want to capture everything except a question mark ? with a negated character class.
(?<=track\/)([^?]*)(?=\??)
As pointed out by gaganso, a lookbehind is not necessary in this situation (nor, indeed, is the lookahead); however, it is a good idea to start playing around with them. The look-around assertions do not actually consume characters in the string. As you can see here, the full match in both cases consists only of what is captured by the capture group. You may find more information here.
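To see the difference between the greedy .* and the negated character class side by side, here is a small Python sketch using the two URLs from the question. The constant names are only for illustration.

```python
import re

URLS = [
    "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA",
    "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA",
]

# Greedy: .* runs to the end of the string, and the optional "?" in the
# lookahead is then satisfied by matching nothing.
GREEDY = re.compile(r"(?<=track/)(.*)(?=\??)")

# Negated class: [^?]* cannot cross a "?", so it stops at the ID.
NEGATED = re.compile(r"(?<=track/)([^?]*)(?=\??)")

greedy_ids = [GREEDY.search(url).group(1) for url in URLS]
track_ids = [NEGATED.search(url).group(1) for url in URLS]

print(greedy_ids[0])  # 1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA
print(track_ids)      # ['1HYcYZCOpaLjg51qUg8ilA', '1HYcYZCOpaLjg51qUg8ilA']
```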
This should work:
track\/(\w+)
Please see here.
Since track is part of both the strings, and the ID is formed from alphanumeric characters, the above regex which matches the string "track/" and captures the alphanumeric characters after that string, should provide the required ID.
Regex : (\w+(?=\?))|(\w+&)
See the demo for the regex, https://regexr.com/3s4gv .
This will first try to search for a word which has '?' just after it, and if that's unsuccessful it will fetch the last word.

Regex how to get a full match of nth word (without using non-capturing groups)

I am trying to use Regex to return the nth word in a string. This would be simple enough using other answers to similar questions; however, I do not have access to any of the code. I can only access a regex input field and the server only returns the 'full match' and cannot be made to return any captured groups such as 'group 1'
EDIT:
From the developers explaining the version of regex used:
"...its javascript regex so should mostly be compatible with perl i
believe but not as advanced, its fairly low level so wasn't really
intended for use by end users when originally implemented - i added
the dropdown with the intention of having some presets going
forwards."
/EDIT
Sample String:
One Two Three Four Five
Attempted solution (which is meant to get just the 2nd word):
^(?:\w+ ){1}(\S+)$
The result is:
One Two
I have also tried other variations of the regex:
(?:\w+ ){1}(\S+)$
^(?:\w+ ){1}(\S+)
But these just return the entire string.
I have tried replicating the behaviour that I see using regex101 but the results seem to be different, particularly when changing around the ^ and $.
For example, I get the same output on regex101 if I use the altered regex:
^(?:\w+ ){1}(\S+)
In any case, none of the comparing has helped me actually achieve my stated aim.
I am hoping that I have just missed something basic!
===EDIT===
Thanks to all of you who have contributed thus far, however, I am still running into issues. I am afraid that I do not know the language or restrictions on the regex other than what I can ascertain through trial and error, therefore here is a list of attempts and results all of which are trying to return "Two" from a sample of:
One Two Three Four Five
\w+(?=( \w+){1}$)
returns all words
^(\w+ ){1}\K(\w+)
returns no words at all (so I assume that \K does not work)
(\w+? ){1}\K(\w+?)(?= )
returns no words at all
\w+(?=\s\w+\s\w+\s\w+$)
returns all words
^(?:\w+\s){1}\K\w+
returns all words
====
With all of the above not working, I thought I would test out some others to see the limitations of the system
Attempting to return the last word:
\w+$
returns all words
This leads me to believe that something strange is going on with the start ^ and end $ characters, perhaps the server puts these in automatically if they are omitted? Any more ideas greatly appreciated.
I don't know if your language supports positive lookbehind, so using your example,
One Two Three Four Five
here is a solution which should work in every language :
\w+ match the first word
\w+$ match the last word
\w+(?=\s\w+$) match the 4th word
\w+(?=\s\w+\s\w+$) match the 3rd word
\w+(?=\s\w+\s\w+\s\w+$) match the 2nd word
So if a string contains 10 words :
The first and the last word are easy to find. To find a word at a position, then you simply have to use this rule :
\w+(?= followed by \s\w+ (10 - position) times followed by $)
Example
In this string :
One Two Three Four Five Six Seven Eight Nine Ten
I want to find the 6th word.
10 - 6 = 4
\w+(?= followed by \s\w+ 4 times followed by $)
Our final regex is
\w+(?=\s\w+\s\w+\s\w+\s\w+$)
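The rule above can be turned into a small pattern builder. This is a sketch in Python (the question is about a JavaScript engine, but the lookahead syntax used here is the same in both); nth_word and its parameters are hypothetical names, and the total word count has to be known in advance, as in the rule.

```python
import re

def nth_word(text, position, total_words):
    """Return the word at `position` (1-based) as the full match."""
    remaining = total_words - position
    # \w+ followed by exactly `remaining` more words and then the end
    pattern = rf"\w+(?=(?:\s\w+){{{remaining}}}$)"
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(nth_word("One Two Three Four Five Six Seven Eight Nine Ten", 6, 10))  # Six
print(nth_word("One Two Three Four Five", 2, 5))                            # Two
```

Only the full match (group 0) is read, which is the constraint stated in the question.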
It's possible to use reset match (\K) to reset the position of the match and obtain the third word of a string as follows:
(\w+? ){2}\K(\w+?)(?= )
I'm not sure what language you're working in, so you may or may not have access to this feature.
I'm not sure whether your language supports \K, but I'm sharing this anyway in case it does:
^(?:\w+\s){3}\K\w+
to get the 4th word.
^ represents starting anchor
(?:\w+\s){3} is a non-capturing group that matches three words (ending with spaces)
\K is a match reset, so it resets the match and the previously matched characters aren't included
\w+ helps consume the nth word
And similarly,
^(?:\w+\s){1}\K\w+ for the 2nd word
^(?:\w+\s){2}\K\w+ for the 3rd word
^(?:\w+\s){3}\K\w+ for the 4th word
and so on...
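Whether \K is available depends on the engine. As one data point, Python's built-in re module rejects it, so the sketch below probes for support and otherwise falls back to a capture group. That fallback only helps where groups can be read, which is not the case for the server in the question. Both function names are made up for the illustration.

```python
import re

def supports_match_reset():
    """Probe whether this engine accepts \\K (Python's re module does not)."""
    try:
        re.compile(r"a\Kb")
    except re.error:
        return False
    return True

def nth_word_from_left(text, n):
    # Capture-group equivalent of ^(?:\w+\s){n-1}\K\w+
    pattern = rf"^(?:\w+\s){{{n - 1}}}(\w+)"
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(supports_match_reset())
print(nth_word_from_left("One Two Three Four Five", 4))  # Four
```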
So, on the down side, you can't use look behind because that has to be a fixed width pattern, but the "full match" is just the last thing that "full matches", so you just need something whose last match is your word.
With Positive look-ahead, you can get the nth word from the right
\w+(?=( \w+){n}$)
If your server has extended regex, \K can "clear matched items", but most regex engines don't support this.
^(\w+ ){n}\K(\w+)
Unfortunately, regex doesn't have a standard "match only the n'th occurrence", so counting from the right is the best you can do. (Also, Regex101 has a searchable quick reference in the bottom right corner for looking up special characters; just remember that not every regex engine supports all of those characters.)

Regex for capitalizing first letter in a tag, alt=", etc

I've found regular expressions that capitalize the first letter in a sentence. But does anyone know a regex that capitalizes the first letter inside a tag, including URL and image attributes (e.g. title="antelope" or alt="antelope")?
I used another regex to change all my image paths to lower case, and it zapped a bunch of my tags as well (alt, title, h2, etc.). So now I'd like to get a head start fixing them by capitalizing the first letters.
I'm working on a Mac, using Dreamweaver and TextWrangler as my text editors.
Before...
alt="antelope" title="antelope" <h2>antelope
After...
alt="Antelope" title="Antelope" <h2>Antelope
Regex
(="\w|>\w)
Replace Regex
\U$1\E
Description: This will work for your example, depending on the regex engine you are using.
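In an engine whose replacement syntax has no \U, the same effect can be had with a replacement function. Here is a minimal Python sketch; capitalize_after_markup is a hypothetical name, and the find pattern is split into two groups so that only the letter is uppercased.

```python
import re

def capitalize_after_markup(html):
    # Group 1 is =" or >, group 2 is the first word character after it.
    return re.sub(
        r'(="|>)(\w)',
        lambda m: m.group(1) + m.group(2).upper(),
        html,
    )

before = 'alt="antelope" title="antelope" <h2>antelope'
print(capitalize_after_markup(before))
# alt="Antelope" title="Antelope" <h2>Antelope
```

Note that this capitalizes the first letter of every attribute value, including src and href paths, so it is best run on a selection rather than a whole file.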
This replaces the value in parameters in a url. NOT in html, as I now see that is what you mean. Oh well.
Find what: (\?|\&)([a-z_]+=)([a-z])([^&]+)
Replace (all) with: $1$2\u$3$4
Free spaced:
(\?|\&)
Capture group 1: Either the literal question mark or ampersand.
([a-z_]+=)
Capture group 2: One or more of any lowercase letter or underscore, followed by the equals sign.
([a-z])
Capture group 3: The first letter in the value of the url parameter. Note this does not even notice parameters whose values don't start with a letter.
([^&]+)
Capture group 4: Every other character in the value. Or more specifically, one or more of any character as long as it's not an ampersand. This is a negated character class.
The \u in the replace-with is an option in TextWrangler (and in TextPad, which is what I use...so TextWrangler might also use the Boost regex engine) replacement that uppercases the immediately-following character. I'm not sure if this would work if capture groups 3 and 4 were merged.
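For engines without \u in the replacement string, the four capture groups can be reassembled in a callback instead. Here is a sketch in Python, assuming a made-up sample URL; capitalize_param_values is a hypothetical name.

```python
import re

# Same four groups as the find pattern above.
PARAM_RE = re.compile(r"(\?|&)([a-z_]+=)([a-z])([^&]+)")

def capitalize_param_values(url):
    return PARAM_RE.sub(
        lambda m: m.group(1) + m.group(2) + m.group(3).upper() + m.group(4),
        url,
    )

print(capitalize_param_values("page.html?name=antelope&kind=animal"))
# page.html?name=Antelope&kind=Animal
```

Because group 4 needs at least one character, a one-letter value such as x=a is left unchanged by this pattern.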
Try it (although it doesn't have the \u option.)
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. There's a lot of helpful information in it, including a list of online regex testers (in the bottom section), so you can try things out yourself. All the links in this answer come from the FAQ.

regex negative lookbehind - pcre

I'm trying to write a rule to match a top-level domain followed by five digits. My problem arises because my existing PCRE matches what I have described, but much later in the URL than I want it to. I want it to match the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match the TLD only when it has not been preceded at some point by the "/" character. I tried using a negative lookbehind, but that doesn't work because it only looks back a single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e: (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest that you simply prepend ^[^\/]*\. to your current regex.
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase letters, then a / and five digits).
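A quick way to check the anchored pattern against the two URLs from the question is a sketch like the following, written in Python (the ^[^/]* idea is not PCRE-specific). The function name is hypothetical.

```python
import re

# ^[^/]* forbids any "/" before the TLD, so only the first host can match.
FIRST_TLD_RE = re.compile(r"^[^/]*\.([a-z]{2,4})/\d{5}")

def first_tld_with_digits(url):
    match = FIRST_TLD_RE.search(url)
    return match.group(1) if match else None

print(first_tld_with_digits("domain.net/stuff/stuff=www.google.com/12345"))  # None
print(first_tld_with_digits("domain.net/12345/stuff=www.google.com/12345"))  # net
```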
Here is a permalink to a working example on regex101.
Cheers!
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/