Regex, Word contains x but doesn't start with x - regex

I'm not sure how to do this. I want a string to contain "end" but not start with end. An example would be the word "ascending". I've tried ^[^(end)]*end.*, but this finds "end" too.
So how would this be done correctly?

Your attempt didn't work because you confused a negated character class (defined with [^...], but able to cover just a single character by definition) with a negative lookahead (defined with (?!...)).
Have you written this properly, it would have looked like that:
^(?!end).+end.*$
... with + instead of * right after the preceding . for obvious reasons. )
Note that more straight-forward attempt - ^.+end.*$ - while looking pretty normal, actually fails on words looking like endendless (i.e., when there's an end substring in the target string, yet it does start with end too).
Also, if you're looking for words in the target string, and not just validate it, you should adjust the regex accordingly, using word boundary anchors instead of string anchors like that:
\b(?!end)\w+end\w*
Demo.

Related

Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
, can I also add some prefix or suffix words or space which should be present where ever this logic gets checked. In above case if its update regex should pick test.table, but if the statement is like select test.table regex should not pick it up this combinations and same applies for suffix.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w][.][\w])+/gm;
in Java:
final String regex = "([\w][.][\w])+";
in Python:
regex = r"([\w][.][\w])+"
in PHP:
$re = '/([\w][.][\w])+/m';
Note that: this solution is written for your use case (to be used for SQL strings), because now if you have something like '.word' or 'word..word', it will still catch it which I assume you don't have a string like that.
See this screenshot for more details

How write a regex starts and ends with particular string?

I hava a string, like this:
{"content":(uint32)123", "id":(uint64)111, "test":{"hi":"(uint32)456"}}
I want to get result:
(uint32)123
(uint64)111
so I write regex like this:
[^(?!\")](\(uint32\)|\(uint64\))(\d)+[^(?!\")$]
but the result is:
:(uint32)123
:(uint64)111,
here the result adds : and ,
I hope that the regex does not begin with " and does not end with " , now I should how change my regex?
(\(uint(?:32|64)\)\d+) Works for me. It captures the entire string (uint[32/64])<any number of digits\> without bothering about the characters that come before or after.
Tested the following one in python
(?<!\")(\(uint32\)|\(uint64\))\d+(?!(\"|\d))
It looked like you was trying to use negative lookahead and negative lookbehind checks. But you did couple of mistakes:
You put them inside symbol group like this: [^(?!\")] what this regexp really mean - not any of symbols inside square bracket (^ - stands for not). How it should be instead: (?!\") - which mean symbol after current position shouldn't be quote (note: this will also work if there is no symbol after
To check symbol before you need to use look ahead check which have syntax (?<!some_regexp). So it would be (?<!\")
You don't need checks for start or end of the line. If you do you can put then into separate negative look ahead/behind statement.
Here is corrected example without line start/end checks:
(?<!\")(\(uint32\)|\(uint64\))(\d)+(?!\")(?!\d)
Note: you need to add (?!\d) at the end, cause otherwise it would match everything except last digit if there is quote.
Here is example with start/end of line checks:
(?<!^)(?<!\")(\(uint32\)|\(uint64\))(\d)+(?!\")(?!\d)(?!$)
P.S.: depending on language you using - you might not need to escape quote - you do need to escape quote only in case it is string escape sequence not regexp escape sequence.

Get all matches for a certain pattern using RegEx

I am not really a RegEx expert and hence asking a simple question.
I have a few parameters that I need to use which are in a particular pattern
For example
$$DATA_START_TIME
$$DATA_END_TIME
$$MIN_POID_ID_DLAY
$$MAX_POID_ID_DLAY
$$MIN_POID_ID_RELTM
$$MAX_POID_ID_RELTM
And these will be replaced at runtime in a string with their values (a SQL statement).
For example I have a simple query
select * from asdf where asdf.starttime = $$DATA_START_TIME and asdf.endtime = $$DATA_END_TIME
Now when I try to use the RegEx pattern
\$\$[^\W+]\w+$
I do not get all the matches(I get only a the last match).
I am trying to test my usage here https://regex101.com/r/xR9dG0/2
If someone could correct my mistake, I would really appreciate it.
Thanks!
This will do the job:
\$\$\w+/g
See Demo
Just Some clarifications why your regex is doing what is doing:
\$\$[^\W+]\w+$
Unescaped $ char means end of string, so, your pattern is matching something that must be on the end of the string, that's why its getting only the last match.
This group [^\W+] doesn't really makes sense, groups starting with [^..] means negate the chars inside here, and \W is the negation of words, and + inside the group means literally the char +, so you are saying match everything that is Not a Not word and that is not a + sign, i guess that was not what you wanted.
To match the next word just \w+ will do it. And the global modifier /g ensures that you will not stop on the first match.
This should work - Based on what you said you wanted to match this should work . Also it won't match $$lower_case_strings if that's what you wanted. If not, add the "i" flag also.
\${2}[A-Z_]+/g

regex remove all numbers from a paragraph except from some words

I want to remove all numbers from a paragraph except from some words.
My attempt is using a negative look-ahead:
gsub('(?!ami.12.0|allo.12)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
But this doesn't work. I get this:
"." "" "ami.. " "allo."
Or my expected output is:
"." "" 'ami.12.0','allo.12'
You can't really use a negative lookahead here, since it will still replace when the cursor is at some point after ami.
What you can do is put back some matches:
(ami.12.0|allo.12)|[[:digit:]]+
gsub('(ami.12.0|allo.12)|[[:digit:]]+',"\\1",
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)
I kept the . since I'm not 100% sure what you have, but keep in mind that . is a wildcard and will match any character (except newlines) unless you escape it.
Your regex is actually finding every digit sequence that is not the start of "ami.12.0" or "allo.12". So for example, in your third string, it gets to the 12 in ami.12.0 and looks ahead to see if that 12 is the start of either of the two ignored strings. It is not, so it continues with replacing it. It would be best to generalize this, but in your specific case, you can probably achieve this by instead doing a negative lookbehind for any prefixes of the words (that can be followed by digit sequences) that you want to skip. So, you would use something like this:
gsub('(?<!ami\\.|ami\\.12\\.|allo\\.)[[:digit:]]+','',
c('0.12','1245','ami.12.0 00','allo.12 1'),perl=TRUE)

Regex to include one thing but exclude another

I've been having a lot of trouble finding how to write a regex to include certain URLs starting with a specified phrase while excluding another.
We want to include pages that start with:
/womens
/mens
/kids-clothing/boys
/kids-clothing/girls
/homeware
But we want to exclude anything that has /sXXXXXXX in the URL - where the X's are numbers.
I've written this so far to match the below URLs but it's behaving very oddly. Should I be using lookarounds or something?
\/(womens|mens|kids\-clothing\/boys|kids\-clothing\/boys|homeware).*[^s[0-9]+].*
/homeware/bathroom/s2522424/4-tier-pastel-pop-drawers-approx-91cm-x25cm-x-28cm
/homeware/bathroom/towels-and-bathmats
/homeware/bathroom/towels-and-bathmats/s2506420/boutique-luxury-towels
/homeware/bathroom/towels-and-bathmats?page=3&size=36&cols=4&sort=&id=/homeware/bathroom/towels-and-bathmats&priceRange[min]=1&priceRange[max]=14
/homeware/bathroom?page=3&size=36&cols=4&sort=&id=/homeware/bathroom&priceRange[min]=1&priceRange[max]=35
/homeware/bedroom
/homeware/bedroom/bedding-sets
/homeware/bedroom/bedding-sets/s2471012/striped-reversible-printed-duvet-set
/homeware/bedroom/bedding-sets/s2472706/check-printed-reversible-duvet-set
/homeware/bedroom/bedding-sets/s2475332/union-jack-duvet-set
/kids-clothing/boys/shop-by-age/toddler-3mnths-5yrs/s2520246/boys-lollipop-slogan-t-shirt
/kids-clothing/boys/shop-by-age/toddler-3mnths-5yrs/s2520253/boys-2-pack-dinosaur-t-shirts
/kids-clothing/girls/great-value/sale?page=1&size=36&cols=4&sort=price.asc&id=/kids-clothing/girls/great-value/sale&priceRange[min]=0.5&priceRange[max]=7
/kids-clothing/girls/mini-shops/ballet-outfits
/kids-clothing/girls/shop-by-age/baby--newborn-0-18mths
/kids-clothing/girls/shop-by-age/baby--newborn-0-18mths/s2484120/3-pack-frill-pants-pinks
/kids-clothing/girls/shop-by-age/baby--newborn-0-18mths/s2504431/3-pack-l-s-bodysuit
/mens/categories/tops?page=5&size=36&cols=4&sort=&id=/mens/categories/tops&priceRange[min]=2&priceRange[max]=22.5
/mens/categories/trousers-and-chinos
/mens/categories/trousers-and-chinos/s2438566/easy-essential-cuffed-jogging-bottoms
/mens/categories/trousers-and-chinos/s2438574/easy-essential-cuffed-jogging-bottoms
/mens/categories/trousers-and-chinos/s2458939/regatta-zip-off-lightweight-outdoor-trousers
You are on the right track. A negative lookahead will do it:
"^(?!.*\/s\d+)\/(womens|mens|kids\-clothing\/boys|kids\-clothing\/girls|homeware)\/.*"
The ^ anchors to the start of the string. The (?!.*\/s\d+) means that "/sXXXXXXX" can't appear anywhere in the string, and the rest of it matches your required starting tokens.
The reason [^s[0-9]+] didn't work is that [^xyz] matches only one single character. What you're effectively saying there is that you're looking for any character that isn't any combination of "s", "[" and "0-9", followed by "]". e.g. "s[234[s]".
The reason you need to put your negative lookahead at the start of the string is so nothing is matched at all. If you put it after the \/(womens|mens|kids\-clothing\/boys|kids\-clothing\/girls|homeware)\/.*, you would still successfully match everything before the "/sXXXXXXX". i.e. for line 1 of your data, you would match "/homeware/bathroom/".
Yes, you need a negative lookaround:
/^\/(womens|mens|kids\-clothing\/boys|kids\-clothing\/boys|homeware)(?:\/(?:(?!s\d+).)*)+$/gm
If you're comparing one line at a time you don't need the multiline (m) flag. It's probably behaving strangely because you had a character class (denoted by square brakcets) nested inside more square brackets, which doesn't work; you can't nest character classes. This was tested and works on refiddle.