Writing two regex patterns inside another - regex

I have some strings that can be written in two different ways, and I am trying to handle both with a single regex.
First, I want to extract the substring that precedes a given substring (I'll call it "endWord").
So
Title Text (Descriptor Text) endword - More words i don't want
would turn into "Title Text (Descriptor Text)".
Next, from the substring I just extracted, I want to keep only the text before the " (" (if it exists).
So the final result will be "Title Text".
(.+?(?= endWord))(.+?(?= \()) returns no result.

You can use
^(.*?)\s+\([^()]*\)(?=\s+endword\b)
See the regex demo.
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
\s+ - one or more whitespaces
\([^()]*\) - (, zero or more chars other than ( and ), and then )
(?=\s+endword\b) - a positive lookahead that requires one or more whitespaces and a whole word endword immediately to the right of the current location.
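To check the pattern outside the demo, here is a minimal Python sketch. The helper name extract_title and the OPTIONAL_PARENS variant are assumptions added for illustration, not part of the original answer; the variant makes the parenthesized part optional, which the question hints at with "(if it exists)".

```python
import re

# Pattern from the answer: Group 1 is the text before " (...)" that precedes "endword".
PATTERN = re.compile(r"^(.*?)\s+\([^()]*\)(?=\s+endword\b)")

# Hypothetical variant: the parenthesized descriptor is optional.
OPTIONAL_PARENS = re.compile(r"^(.*?)(?:\s+\([^()]*\))?(?=\s+endword\b)")

def extract_title(text, pattern=PATTERN):
    """Return Group 1 of the first match, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None

print(extract_title("Title Text (Descriptor Text) endword - More words i don't want"))
# Title Text
print(extract_title("Title Text endword - More words", OPTIONAL_PARENS))
# Title Text
```

Note that the match is case-sensitive, so "endWord" would need re.I or an adjusted literal.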

Related

Postgresql - Regex to get all words in string without special characters except -

Input
Word-Word, Some other words and this is another word et another one
Expected output
Word-Word
Some
other
words
this
is
another
word
another
one
I have a table (t) with many strings like the one shown in the input.
I'm trying to get every word in the sentence, excluding the commas (','), the words 'and', 'et' and 'und', and of course any whitespace or sequence of whitespace between words.
Regexes that I have tried:
Don't match whitespace \\s+
Don't match whitespace as well as special characters ((\b[^\s]+\b)((?<=\.\w).)?) - doesn't work in postgres for some reason
Don't match a particular word ^(?!et$|and$|und$) - doesn't work either
The query that I'm running:
SELECT word FROM t,
unnest(regexp_split_to_array(t.word, E'Missing expression')) as word;
You can use an extracting approach here in the following way:
SELECT regexp_matches(
'Word-Word, Some other words and this is another word et another one ',
E'\\y(?!(?:et|[ua]nd)\\y)\\w+(?:-\\w+)*',
'g');
See the online demo. Regex details:
\y - a word boundary
(?!(?:et|[ua]nd)\y) - a negative lookahead that fails the match if there is et, und or and as whole words immediately to the right of the current location
\w+(?:-\w+)* - one or more word chars and then zero or more occurrences of - and one or more word char sequences
See the regex demo (converted to PCRE).
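For comparison, the same extraction can be sketched in Python, where the word boundary is written \b instead of the PostgreSQL \y. This is only an illustration of the pattern logic, not a replacement for the SQL query; the function name extract_words is an assumption.

```python
import re

# Same logic as the PostgreSQL pattern, with \y rewritten as \b for Python's re module.
WORDS = re.compile(r"\b(?!(?:et|[ua]nd)\b)\w+(?:-\w+)*")

def extract_words(text):
    """Return all words (hyphenated words kept whole) except 'et', 'and' and 'und'."""
    return WORDS.findall(text)

sample = "Word-Word, Some other words and this is another word et another one "
print(extract_words(sample))
# ['Word-Word', 'Some', 'other', 'words', 'this', 'is', 'another', 'word', 'another', 'one']
```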

How to remove words that consist of the same character repeated multiple times?

Here's an example input:
text = "John Vv"
I need to remove Vv because it is the same character repeated multiple times.
If the word also contains other characters, it should not be removed, e.g. VvSh.
I tried something like this, which worked, but I don't want to write the same line for every English letter. Any help is appreciated.
re.sub("vv", "", text, flags=re.I)
You can use
re.sub(r"\b([A-Za-z])\1+\b", "", text, flags=re.I)
Details:
\b - a word boundary
([A-Za-z]) - Group 1: an ASCII letter
\1+ - one or more chars equal to the one captured in Group 1
\b - a word boundary
See the regex demo.
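A small runnable sketch of the call above, wrapped in a hypothetical helper named remove_repeated_char_words. With re.I the backreference \1 also matches the other case of the captured letter, which is why Vv is removed.

```python
import re

# A whole word made of one ASCII letter repeated two or more times, in any case.
REPEATED = re.compile(r"\b([A-Za-z])\1+\b", flags=re.I)

def remove_repeated_char_words(text):
    """Delete words that consist of a single letter repeated; leave all other words alone."""
    return REPEATED.sub("", text)

print(repr(remove_repeated_char_words("John Vv")))    # 'John '
print(repr(remove_repeated_char_words("John VvSh")))  # 'John VvSh'
```

The space before the removed word is left in place, so a caller may want to strip or normalize whitespace afterwards.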

Regexp Substring From URL

I need to extract part of a URL:
WebViewActivity - https://google.com/search/?term=iphone_5s&utm_source=google&utm_campaign=search_bar&utm_content=search_submit
The result I want is:
search/iphone_5s
but I'm stuck and don't really understand how to use regexp_substr to get that data.
I tried this query
regexp_substr(web_url, '\google.com/([^}]+)\/', 1,1,null,1)
which only returns the word 'search', and when I try
regexp_substr(web_url, '\google.com/([^}]+)\&', 1,1,null,1)
I get everything up to the last '&'.
You may use a REGEXP_REPLACE to match the whole string but capture two substrings and replace with two backreferences to the capture group values:
REGEXP_REPLACE(
'WebViewActivity - https://google.com/search/?term=iphone_5s&utm_source=google&utm_campaign=search_bar&utm_content=search_submit',
'.*//google\.com/([^/]+/).*[?&]term=([^&]+).*',
'\1\2')
See the regex demo and the online Oracle demo.
Pattern details
.* - any zero or more chars other than line break chars as many as possible
//google\.com/ - a //google.com/ substring
([^/]+/) - Capturing group 1: one or more chars other than / and then a /
.* - any zero or more chars other than line break chars as many as possible
[?&]term= - ? or & and a term= substring
([^&]+) - Capturing group 2: one or more chars other than &
.* - any zero or more chars other than line break chars as many as possible
NOTE: To use this approach and get an empty result if the match is not found, append |.+ at the end of the regex pattern.
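The two capture groups can be tried out in Python as well. This sketch swaps the whole-string replace for a plain search, so the leading and trailing .* are dropped and an empty string is returned when nothing matches; the function name path_and_term is an assumption.

```python
import re

# Group 1: the first path segment with its trailing slash; Group 2: the value of "term".
URL_PARTS = re.compile(r"//google\.com/([^/]+/).*[?&]term=([^&]+)")

def path_and_term(web_url):
    """Return '<segment>/<term>' or an empty string when the URL does not match."""
    match = URL_PARTS.search(web_url)
    return match.group(1) + match.group(2) if match else ""

url = ("WebViewActivity - https://google.com/search/?term=iphone_5s"
       "&utm_source=google&utm_campaign=search_bar&utm_content=search_submit")
print(path_and_term(url))  # search/iphone_5s
```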

Regex for finding words with no or only one word between them

I need to find, in multiple strings, two words with no words or only one word between them. I created a regex that checks whether both words exist in the string:
^(?=[\s\S]*\bFirst\b)(?=[\s\S]*\bSecond\b)[\s\S]+
and it works correctly.
Then I tried to insert in this regex additional code:
^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b)[\s\S]+
but it didn't work. It also selects text with two or more words between the searched words, which is not what I need.
First Second - must be selected
First word1 Second - must be selected
First word1 word2 Second - must not be selected, but my regex selects it.
Can I get advice on how to solve this problem?
Root cause
You should bear in mind that lookarounds match strings without moving along the string, they "stand their ground". Once you write ^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b), the execution is as follows:
^ - the regex engine checks if the current position is the start of string
(?=[\s\S]*\bFirst\b) - the positive lookahead requires the presence of any 0+ chars followed with a whole word First - note that the regex index is still at the start of the string after the lookahead returns true or false
(\b\w+\b){0,1} - this subpattern is checked only if the above check was true (i.e. there is a whole word First somewhere) and matches (consumes, moves the regex index) 1 or 0 occurrences of a whole word (i.e. there must be 1 or more word chars right at the string start)
(?=[\s\S]*\bSecond\b) - another positive lookahead that makes sure there is a whole word Second somewhere after the first whole word consumed with \b\w+\b - if any. Even if the word Second is the first word in the string, this will return true since backtracking will step back the word matched with (\b\w+\b){0,1} (see, it is optional), and the Second will get asserted, and [\s\S]+ will grab the whole string (Group 1 will be empty). See the regex demo with Second word word2 First string.
So, your approach cannot guarantee the order of First and Second in the string, they are just required to be present but not necessarily in the order you expect.
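The behavior described above can be reproduced with a short Python check. The pattern is the one from the question; the helper name original_matches is an assumption.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r"^(?=[\s\S]*\bFirst\b)(\b\w+\b){0,1}(?=[\s\S]*\bSecond\b)[\s\S]+"
)

def original_matches(text):
    """Return True if the question's pattern matches the text."""
    return ORIGINAL.search(text) is not None

# Both strings match, although neither should: the order is wrong in the first one,
# and there are two words between First and Second in the second one.
print(original_matches("Second word word2 First"))   # True
print(original_matches("First word1 word2 Second"))  # True
```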
Solution
If you need to check the order of First and Second in the string, you need to combine all the checks into a single lookahead. This approach may turn out very inefficient with longer strings and multiple alternatives in the lookaround, so consider either unrolling the patterns or trying multiple regex patterns (like this pseudo-code: if /\bFirst\b/.finds_match().index < /\bSecond\b/.finds_match().index => Good, go on...).
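The multiple-pattern idea could look like this in Python. It is only a sketch of the order check: it compares the first occurrence of each word and does not limit the number of words between them. The function name first_precedes_second is an assumption.

```python
import re

def first_precedes_second(text):
    """Return True if both whole words exist and the first 'First' starts before the first 'Second'."""
    first = re.search(r"\bFirst\b", text)
    second = re.search(r"\bSecond\b", text)
    return bool(first and second and first.start() < second.start())

print(first_precedes_second("First word1 Second"))  # True
print(first_precedes_second("Second word1 First"))  # False
```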
If you plan to go on with the regex approach, you may match a string that contains First....Second only in this order:
^(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b)[\s\S]+
See the regex demo
Details:
^ - start of string
(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b) - there must be:
[\s\S]* - any zero or more chars, as many as possible, up to the last occurrence of the subsequent subpatterns
\bFirst - whole word First
(?:\W+\w+)? - optional sequence (1 or 0 occurrences) of 1+ non-word chars and 1+ word chars
\W+ - 1+ non-word chars
Second\b - Second as a whole word
[\s\S]+ - any 1 or more characters (empty string won't match).
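The three sample strings from the question can be checked against this pattern with a short Python sketch; the helper name is_selected is an assumption.

```python
import re

# The pattern from the answer: First, at most one word in between, then Second.
ORDERED = re.compile(r"^(?=[\s\S]*\bFirst(?:\W+\w+)?\W+Second\b)[\s\S]+")

def is_selected(text):
    """Return True if the text contains First ... Second with zero or one word between them."""
    return ORDERED.search(text) is not None

for sample in ("First Second", "First word1 Second", "First word1 word2 Second"):
    print(sample, "->", is_selected(sample))
# First Second -> True
# First word1 Second -> True
# First word1 word2 Second -> False
```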

Sublime Regex extract

<.*>|\n.*\s.*\sid="(\w*)".*\n+|.*>\n|\n.+
replaced with $1
This regex extracts every id from the file. For example, given
<a href="java" class="total" id="maker" placeholder="getTheResult('local6')">master6<a>
the result is maker
How can I extract the getTheResult key name instead,
so that my result is local6?
I tried <.*>|\n.*\s.*\sgetTheResult('(\w*)').*\n+|.*>\n|\n.+ but it didn't help.
I assume that:
you have files with text like getTheResult('local6')
you may have several values like that on a line
you'd like to keep only those values, one value per line.
I suggest
getTheResult\('([^']*)'\)|(?:(?!getTheResult\(')[\s\S])*
and replace with $1\n. The \n will insert a newline between the values. You can then use ^\n regex (to replace with empty string) to remove empty lines.
Pattern details:
getTheResult\(' - matches getTheResult(' as a literal string (note the ( is escaped)
([^']*) - Group 1 capturing 0+ chars other than '
'\) - a literal ')
| - or
(?:(?!getTheResult\(')[\s\S])* - 0+ chars that are not starting chars of the getTheResult(' character sequence (this is a tempered greedy token).
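Outside Sublime Text, the same values can be collected without the second alternative. This Python sketch uses only the first branch of the pattern with findall, which is a different technique from the search-and-replace described above; the function name extract_keys is an assumption.

```python
import re

# First alternative of the pattern only: capture the text between getTheResult(' and ').
KEY = re.compile(r"getTheResult\('([^']*)'\)")

def extract_keys(text):
    """Return every key passed to getTheResult('...') in the text, in order."""
    return KEY.findall(text)

html = '''<a href="java" class="total" id="maker" placeholder="getTheResult('local6')">master6<a>'''
print("\n".join(extract_keys(html)))  # local6
```

Joining the list with newlines gives one value per line, which is what the replace with $1\n aims for in the editor.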