For example, I have the word: sh0rt-t3rm.
How can I get the t3rm part using perl regex?
I could get sh0rt by using [(a-zA-Z0-9)+]\[-\], but \[-\][(a-zA-Z0-9)+] doesn't work to get t3rm.
The syntax used for the regex is not correct to get either sh0rt or t3rm.
You swapped the square brackets and the parentheses, and the hyphen does not have to be between square brackets.
To get sh0rt in sh0rt-t3rm you might use, for example, one of:
| Regex | Demo | Explanation |
| --- | --- | --- |
| \b([a-zA-Z0-9]+)- | Demo 1 | \b is a word boundary to prevent a partial word match; the value is in capture group 1. |
| \b[a-zA-Z0-9]+(?=-) | Demo 2 | Match the allowed chars in the character class, and assert a - to the right using a positive lookahead (?=-). |
To get t3rm in sh0rt-t3rm you might use for example one of:
| Regex | Demo | Explanation |
| --- | --- | --- |
| -([a-zA-Z0-9]+)\b | Demo 3 | The other way around, with a leading -; get the value from capture group 1. |
| -\K[a-zA-Z0-9]+\b | Demo 4 | Match - and use \K to keep out what is matched so far. Then match the allowed chars in the character class 1 or more times. |
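For readers who want to try these patterns outside Perl, here is a rough Python sketch. Python's re module has no \K, so a fixed-width lookbehind (?<=-) is swapped in for the fourth pattern; the variable names are illustrative only.

```python
import re

word = "sh0rt-t3rm"

# Part before the hyphen: capture group vs. positive lookahead
first_group = re.search(r"\b([a-zA-Z0-9]+)-", word).group(1)
first_lookahead = re.search(r"\b[a-zA-Z0-9]+(?=-)", word).group(0)

# Part after the hyphen: capture group vs. lookbehind (stand-in for \K)
second_group = re.search(r"-([a-zA-Z0-9]+)\b", word).group(1)
second_lookbehind = re.search(r"(?<=-)[a-zA-Z0-9]+\b", word).group(0)

print(first_group, first_lookahead, second_group, second_lookbehind)
```

With a capture group the hyphen is part of the overall match and only the group holds the wanted text; with a lookaround the whole match is already the wanted text.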
If your whole target string is literally just sh0rt-t3rm then you want all that comes after the -.
So the barest and minimal version, cut precisely for this description, is
my ($capture) = $string =~ /-(.+)/;
We need parentheses on the left-hand side to make the regex run in list context, because that's when it returns the matches (otherwise it returns true/false, normally 1 or '').
But what if the preceding text may have - itself? Then make sure to match all up to that last -
my ($capture) = $string =~ /.*-(.+)/;
Here the "greedy" nature of the * quantifier makes the previous . match all it possibly can so that the whole pattern still matches; thus it goes up until the very last -.
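The difference between the two patterns is easiest to see on a string with more than one hyphen. This is a small Python sketch rather than Perl; the helper names after_first_hyphen and after_last_hyphen are made up for illustration.

```python
import re

def after_first_hyphen(s):
    # -(.+) starts capturing right after the first hyphen found
    m = re.search(r"-(.+)", s)
    return m.group(1) if m else None

def after_last_hyphen(s):
    # The greedy .* swallows as much as it can, so the literal -
    # ends up matching the last hyphen in the string
    m = re.search(r".*-(.+)", s)
    return m.group(1) if m else None

print(after_first_hyphen("very-sh0rt-t3rm"))
print(after_last_hyphen("very-sh0rt-t3rm"))
```

On a string with a single hyphen, such as sh0rt-t3rm, both helpers return the same text.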
There are of course many other variations on what the data may look like, other than just being one hyphenated word. In particular, if it's part of a text, you may want to include word boundaries
my ($capture) = $string =~ /\b.*?-(.+?)\b/;
Here we also need to adjust our "wild-card"-like pattern .+ by limiting it using ? so that it is not greedy. This matches the first such hyphenated word in the $string. But if indeed only "word" characters fly then we can just use \w (instead of . and word-boundary anchors)
my ($capture) = $string =~ /\w*?-(\w+)/;
Note that \w matches [a-zA-Z0-9_] only, which excludes some characters that may appear in normal text (English, not to mention all other writing systems).
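As a quick check of the last two patterns on a word embedded in text, here is a Python sketch (the sample strings are invented). One assumption to keep in mind: in Python 3, \w on a str is Unicode-aware by default, and the re.ASCII flag narrows it to [a-zA-Z0-9_], which makes the effect of the narrower class visible.

```python
import re

text = "a sh0rt-t3rm plan"

# Word boundaries with non-greedy wildcards
with_boundaries = re.search(r"\b.*?-(.+?)\b", text).group(1)
# Word characters only, no explicit boundaries needed
with_word_chars = re.search(r"\w*?-(\w+)", text).group(1)

# Accented letters: kept by the Unicode-aware \w, cut off under re.ASCII
accented = "na\u00efve-caf\u00e9"
unicode_capture = re.search(r"\w*?-(\w+)", accented).group(1)
ascii_capture = re.search(r"\w*?-(\w+)", accented, re.ASCII).group(1)

print(with_boundaries, with_word_chars, unicode_capture, ascii_capture)
```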
But this is clearly getting pickier and cookier and would need careful close inspection and testing, and more complete knowledge of what the data may look like.
Perl offers its own tutorial, perlretut, and the main full reference is perlre.
-([a-zA-Z0-9]+) will match a - followed by a word, with just the word being captured.
I need to exclude a string from being matched if it's preceded by a certain character, and my regex engine is POSIX. I was able to get the desired result using a negative lookbehind on https://regexr.com/ but just discovered that won't work on my POSIX SnowFlake platform :-(
I'm trying to standardize variations of company names and want to match the strings that end in 'COMPANY', 'CO', or 'CO.', but not match them if preceded by an ' & '. So 'COMPANY' would get matched in 'POWERWASH COMPANY', but not in 'JONES & COMPANY'.
Is there a way I can accomplish this in POSIX regex? I was able to get this to work using a negative lookbehind as follows:
(?<!&)( COMPANY$| CO[.]?$)
You may use a capturing group (as you're already doing) and put the irrelevant parts outside of the group:
[^&]( COMPANY| CO\.?)$
I'm not that familiar with SnowFlake but according to the documentation, you can extract the value captured by group 1 using the regexp_substr method as follows:
regexp_substr(input, '[^&]( COMPANY| CO\.?)$', 1, 1, 'e', 1)
-- The last argument (1) is the capture group number.
Note that [^&] will match any character other than '&'. If you'd like the match to succeed even if the target word is at the beginning of the string, you may use (^|[^&]) in place of [^&]. In that case, you may extract the value from group 2 rather than group 1.
You can use
(^|[^&])( COMPANY| CO[.]?)$
See the regex demo.
Without lookarounds, the pattern also consumes the character before the space. That is usually of no importance in POSIX regex, and where it does matter it is usually easy to work around using additional capturing groups and code logic.
Regex details:
(^|[^&]) - start of string or any char other than &
( COMPANY| CO[.]?) - either a space and COMPANY, or a space, CO and an optional .
$ - end of string
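The logic of this lookaround-free pattern can be checked with a short Python sketch. Python's re module is not a POSIX engine and says nothing about Snowflake's behaviour; the point is only that the pattern uses no lookbehind and that the suffix ends up in group 2. The helper name company_suffix and the ACME sample names are made up.

```python
import re

# No lookarounds: group 1 consumes the start of string or the
# character before the space, group 2 holds the suffix of interest.
SUFFIX = re.compile(r"(^|[^&])( COMPANY| CO[.]?)$")

def company_suffix(name):
    m = SUFFIX.search(name)
    return m.group(2) if m else None

for name in ["POWERWASH COMPANY", "JONES & COMPANY", "ACME CO.", "ACME CO"]:
    print(repr(name), "->", repr(company_suffix(name)))
```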
I'm using Atom's regex search and replace feature and not JavaScript code.
I thought this JavaScript-compatible regex would work (I want to match the commas that come right after Or rather):
(?!\b(Or rather)\b),
?! = Negative lookahead
\b = word boundary
(...) = search the words as a whole not character by character
\b = word boundary
, = the actual character.
However, if I remove characters from "Or rather" the regex still matches. I'm confused.
https://regexr.com/4keju
You probably meant to use a positive lookbehind instead of a negative lookahead:
(?<=\b(Or rather)\b),
You can activate lookbehind in Atom using flags; read this thread.
The (?!\b(Or rather)\b), pattern is equal to , as the negative lookahead always returns true since , is not equal to O.
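To make that concrete, the following Python sketch compares the positions matched by a bare comma, by the original negative-lookahead pattern, and by a positive lookbehind. The sample sentence is invented, and Python is used only as a convenient engine that supports fixed-width lookbehind.

```python
import re

text = "Or rather, an image. Well, not quite, but close."

# Positions of every comma
plain = [m.start() for m in re.finditer(r",", text)]
# The original pattern: the lookahead is tested AT the comma, where
# "Or rather" can never start, so it never rejects anything
with_lookahead = [m.start() for m in re.finditer(r"(?!\b(Or rather)\b),", text)]
# Positive lookbehind: only the comma right after "Or rather"
with_lookbehind = [m.start() for m in re.finditer(r"(?<=\b(Or rather)\b),", text)]

print(plain, with_lookahead, with_lookbehind)
```

The first two lists are identical, which is why trimming characters from "Or rather" in the original pattern changed nothing.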
To remove commas after Or rather in Atom, use
Find What: \b(Or rather),
Replace With: $1
Make sure you select the .* option to enable regular expressions (the Aa option toggles case sensitivity).
\b(Or rather), matches
\b - a word boundary
(Or rather) - Capturing group 1 that matches and saves the Or rather text in a memory buffer that can be accessed using $1 in the replacement pattern
, - a comma.
JS regex demo:
var s = "Or rather, an image.\nor rather, an image.\nor rather, friends.\nor rather, an image---\nOr rather, another time they.";
console.log(s.replace(/\b(Or rather),/g, '$1'));
// Case insensitive:
console.log(s.replace(/\b(Or rather),/gi, '$1'));
To match any comma after "Or rather", you can simply use
(or rather)(,) and access the second group using match[2].
Alternatively, make or rather a non-capturing group:
(?:or rather)(,) so that the first group is the comma after "Or rather".
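The effect on group numbering can be seen in a small Python sketch. The re.IGNORECASE flag is an assumption added here, since the pattern is written in lowercase while the text starts with a capital "Or".

```python
import re

line = "Or rather, an image."

# Two capturing groups: the comma is group 2
both = re.search(r"(or rather)(,)", line, re.IGNORECASE)
# Non-capturing (?:...) does not get a number: the comma is group 1
only_comma = re.search(r"(?:or rather)(,)", line, re.IGNORECASE)

print(both.groups())
print(only_comma.groups())
```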
I'm trying to match first occurrence of window.location.replace("http://stackoverflow.com") in some HTML string.
Especially I want to capture the URL of the first window.location.replace entry in whole HTML string.
So, to capture the URL, I formulated these 2 rules:
it should be after this string: window.location.redirect("
it should be before this string ")
To achieve it I think I need to use lookbehind (for 1st rule) and lookahead (for 2nd rule).
I end up with this Regex:
.+(?<=window\.location\.redirect\(\"?=\"\))
It doesn't work. I'm not even sure that it is legal to mix both rules like I did.
Can you please help me with translating my rules to Regex? Other ways of doing this (without lookahead/lookbehind) are also appreciated.
The pattern you wrote is really not the one you need as it matches something very different from what you expect: text window.location.redirect("=") in text window.location.redirect("=") something. And it will only work in PCRE/Python if you remove the ? from before \" (as lookbehinds should be fixed-width in PCRE). It will work with ? in .NET regex.
If it is JS, you can only use a lookbehind in engines that implement ES2018; older JS regex engines do not support it.
Instead, use a capturing group around the unknown part you want to get:
/window\.location\.redirect\("([^"]*)"\)/
or
/window\.location\.redirect\("(.*?)"\)/
See the regex demo
Leaving out the /g modifier means only the first occurrence is matched. Access the value you need inside Group 1.
The ([^"]*) captures 0+ characters other than a double quote (URLs you need should not have it). If these URLs you have contain a ", you should use the second approach as (.*?) will match any 0+ characters other than a newline up to the first ").
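The capturing-group approach carries over to other engines unchanged. Below is a minimal Python sketch; the HTML snippet, the second URL and the helper name first_redirect_url are invented for illustration. re.search stops at the first match, which plays the role of omitting /g.

```python
import re

html = """<script>
window.location.redirect("http://stackoverflow.com");
window.location.redirect("http://example.com");
</script>"""

# Group 1 is everything between the quotes, with no quote allowed inside
REDIRECT = re.compile(r'window\.location\.redirect\("([^"]*)"\)')

def first_redirect_url(page):
    m = REDIRECT.search(page)  # first occurrence only
    return m.group(1) if m else None

print(first_redirect_url(html))
```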
The main thing I am trying to do here is learn regex so that I have a better understanding of it. What I am trying to do is a find and replace using regex to remove only the commas that are within the numbers.
I can do this using multiple find/replace patterns, and I can also do this using a brute force method of matching a large number and ignoring commas, however I am wondering if there is some way to place the numbers and comma into a capture group but ignore the commas from output.
Here is an example of a list of numbers:
"7,033.00","0.00","7,033.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00",1,1,1,!!$,,"123,123,123.00","123,444,38.01"
So my 'brute-force' method is the following:
\"([0-9]+)[,]?([0-9]*)[,]?([0-9]*)[,]?([0-9]*[.]+[0-9]+)\"
This would account for any number up to 999,999,999,999.00. It contains the four capture groups $1$2$3$4 and will output any number I would expect in the format that I want.
Example of wanted output using a replace of $1$2$3$4:
7033.00,0.00,7033.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1,1,1,!!$,,123123123.00,12344438.01
What I would like to do is something like this (pseudo code):
[\"]([0-9]+)([(?:,)[0-9]*][.]+[0-9]+)[\"]
The idea behind this is:
Match the first quotation mark but ignore it
Match a group of numbers and place in capture group $1
Match either a number or comma followed by a period and one or more numbers and store in a capture group, but leave the commas out of the capture group.
Match the last quotation mark but ignore it
I've been reading and reading but can't seem to find a way to ignore part of a capture group the way I want to do it. Any suggestions or can it not be done?
A two step method would be to match the commas first then remove the quotes, which might work too:
(,)(?=([0-9]{2,3}[.,]))
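A Python sketch of this two-step idea, run on a shortened version of the sample row. Note the limitation: the first step looks only at what follows the comma, so it would also remove a field separator that happens to sit in front of an unquoted 2-3 digit field.

```python
import re

row = '"7,033.00","0.00",1,1,1,!!$,,"123,123,123.00","123,444,38.01"'

# Step 1: drop commas followed by 2-3 digits and then a '.' or ','
no_inner_commas = re.sub(r",(?=[0-9]{2,3}[.,])", "", row)
# Step 2: drop the quotes
cleaned = no_inner_commas.replace('"', "")

print(cleaned)
```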
Well, regexr uses ECMAScript regex, so you might use something like
"|([0-9]),(?=[0-9])(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
And replace with $1.
Otherwise, with PCRE, you might use something like:
"|(?<=[0-9]),(?=[0-9])(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
And replace with nothing, where it makes use of lookarounds to make sure that the comma in question is surrounded by [0-9] (ECMAScript engines that predate ES2018 don't support lookbehinds).
" matches a literal quote character.
| means OR, so the regex matches a " or a ([0-9]),(?=[0-9]) (or (?<=[0-9]),(?=[0-9]))
([0-9]) is a capture group to get one digit.
, matches a literal comma.
(?=[0-9]) is a positive lookahead and ensures that the comma is followed by a digit, without matching the digit itself.
(?<=[0-9]) is a positive lookbehind and ensures that the comma is preceded by a digit, again without matching the digit itself.
(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$) ensures that there is an odd number of quotes ahead, which in turn means that this will match a comma only within quotes, assuming that there are no unbalanced or escaped quotes.
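The lookbehind variant can be tried in a single pass with Python's re module, which supports fixed-width lookbehind. This is a sketch on a shortened version of the sample row; the pattern is the same as the PCRE one, only split across lines so each part can carry a comment.

```python
import re

row = '"7,033.00","0.00",1,1,1,!!$,,"123,123,123.00","123,444,38.01"'

PATTERN = re.compile(
    r'"'                                   # any quote
    r'|(?<=[0-9]),(?=[0-9])'               # or a comma between two digits ...
    r'(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)'   # ... with an odd number of quotes ahead
)

# Replace every match with nothing
cleaned = PATTERN.sub("", row)
print(cleaned)
```

The unquoted 1,1,1 keeps its commas because an even number of quotes follows them, so the parity lookahead fails there.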
In two steps:
First remove all commas within quotes (i.e. commas that are followed by an odd number of quotes. This even works with escaped quotes since in CSV files, quotes are escaped by doubling):
>>> import re
>>> s = '"7,033.00","0.00","7,033.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00",1,1,1,!!$,,"123,123,123.00","123,444,38.01"'
>>> s = re.sub(r',(?!(?:[^"]*"[^"]*")*[^"]*$)', '', s)
>>> s
'"7033.00","0.00","7033.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00",1,1,1,!!$,,"123123123.00","12344438.01"'
Then remove all the quotes:
>>> s.replace('"', '')
'7033.00,0.00,7033.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1,1,1,!!$,,123123123.00,12344438.01'