Postgres asterisk regex quantifier not working - regex

In Postgres 9.5.1 the following command works:
select regexp_replace('JamesBond007','\d+','');
Output:
JamesBond
However, the asterisk does not seem to work:
select regexp_replace('JamesBond007','\d*','');
it produces:
JamesBond007
Even stranger things happen when I supply a replacement string:
select regexp_replace('JamesBond007','\d+','008');
results in:
JamesBond008
while
select regexp_replace('JamesBond007','\d*','008');
gives me back:
008JamesBond007
The Postgres documentation says * = a sequence of 0 or more matches of the atom.
So what is happening here? (N.B. in Oracle all the above works as expected)

The thing is that \d* can match an empty string and you are not passing the flag g.
See regexp_replace:
The flags parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. Flag i specifies case-insensitive matching, while flag g specifies replacement of each matching substring rather than only the first one.
The \d* matches the empty location at the beginning of the JamesBond007 string, and since g is not passed, that empty string is replaced with 008 when you use select regexp_replace('JamesBond007','\d*','008'); and the result is expected - 008JamesBond007.
With select regexp_replace('JamesBond007','\d*','');, again, \d* matches the empty location at the beginning of the string, and replaces it with an empty string (no visible changes).
Note that Oracle's REGEXP_REPLACE replaces all occurrences by default:
By default, the function returns source_char with every occurrence of the regular expression pattern replaced with replace_string.
In general, you should be cautious when using patterns matching empty strings inside regex-based replace functions/methods. Do it only when you understand what you are doing. If you want to replace digit(s) you usually want to find at least 1 digit. Else, why remove something that is not present in the string in the first place?
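The same effect can be reproduced outside Postgres. The following is a minimal sketch in Python, assuming that `re.sub` with `count=1` is a fair stand-in for `regexp_replace` without the `g` flag; the helper name `replace_first` is made up for this illustration.

```python
import re


def replace_first(pattern: str, replacement: str, text: str) -> str:
    """Replace only the first match, similar to regexp_replace without 'g'."""
    return re.sub(pattern, replacement, text, count=1)


# \d* is allowed to match zero digits, so its first match is the empty
# string at position 0, before the "J".
print(re.search(r"\d*", "JamesBond007").span())      # (0, 0)

print(replace_first(r"\d+", "008", "JamesBond007"))  # JamesBond008
print(replace_first(r"\d*", "008", "JamesBond007"))  # 008JamesBond007
print(replace_first(r"\d*", "", "JamesBond007"))     # JamesBond007
```

The leftmost match always wins, even when it is empty, which is why `\d+` is the safer pattern here.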

Related

Regex to match(extract) string between dot(.)

I want to select certain string combinations (words joined by dots) from a very long string (SQL). The full string could be a single line or multiple lines separated by newlines, and the combination could appear at the start (on the first line), on a later line, or in both places.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Can I also require a prefix or suffix word (or a space) that must be present wherever this logic is checked? In the case above, if the statement is an UPDATE, the regex should pick up test.table, but if the statement is like select test.table, the regex should not pick up this combination. The same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: The words before and after the combination could be anything, such as update, select, { or (, so we have to find every occurrence of words joined together with a . (dot), and the number of such occurrences.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
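As a rough check, both patterns can be run against the examples from the question. This sketch uses Python's `re` module, which is an assumption since the question does not name a language; the constant names are made up for illustration. Python accepts this lookbehind because both alternatives (`INTO` and `FROM`) have the same length.

```python
import re

# Any run of non-whitespace that contains a dot with text on both sides.
DOTTED = re.compile(r"[^\s]+\.[^\s]+")

# Same pattern, but only directly after INTO or FROM plus one whitespace.
AFTER_KEYWORD = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

EXAMPLE_2 = """UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )"""

EXAMPLE_3 = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

print(DOTTED.findall(EXAMPLE_2))         # ['test.table', 'project.dataset.tablename']
print(AFTER_KEYWORD.findall(EXAMPLE_3))  # ['test.table', 'test.table']
```

Engines that only support fixed-width lookbehind will reject keyword lists of differing lengths (for example `UPDATE|FROM`); in that case a capture group after the keyword is the usual workaround.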
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): it will also match something like '.word' or 'word..word', which I assume does not occur in your input.
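The caveat about stray dots is easy to see on the first example sentence. The sketch below follows the description above (zero or more word characters on each side of a dot) and compares it with a stricter variant that is not part of the original answer: `\w+(?:\.\w+)+`, which requires at least one word character on both sides of every dot. Both constant names are hypothetical.

```python
import re

# As described: zero or more word characters around each dot.
LOOSE = re.compile(r"(?:\w*\.\w*)+")

# Stricter variant (an assumption, not from the answer): every dot must
# have at least one word character on each side.
STRICT = re.compile(r"\w+(?:\.\w+)+")

sentence = "I am testing something like test.test.test in sentence."

print(LOOSE.findall(sentence))   # ['test.test.test', 'sentence.']
print(STRICT.findall(sentence))  # ['test.test.test']
```

The loose pattern also picks up the word before the sentence-ending period, which may or may not matter for SQL input.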

Azure Data Explorer, Kusto: regex not semantically correct in extract()

I am trying to grab a substring of a column value in Kusto.
I know that the string is always preceded by the format 'text-for-fun-' then the string of letters I want, followed by anything that is not a letter.
I thought I should use extract() as that allows me to enter a regular expression to handle the multiple possibilities of characters that can follow the string I want.
However, when I attempt to enter the regex, I keep getting a SEM0420: Semantic error: Regex pattern is ill formed.
Can you help me figure out how to enter the regex properly?
Example string: stuff milk-cow-cocoa a/123
Desired substring: cocoa
Current regex: (?<=milk-cow-\s*).*?(?=\s*[^A-Za-z])
Note: looks like the single asterisks are being removed. They appear below in code.
At this point, the \s are to defensively parse the string and remove whitespaces. The end of the overall string may also exist immediately after the desired substring.
I have tried something similar to this Data Explorer statement:
cluster("mine").database("mine").
DataTable
| where PreciseTimeStamp >ago(5h) and resourceProvider == "Provider"
| where info has "cow-milk-"
| take 200
| project extract("(?<=milk-cow-\\s*).*?(?=\\s*[^A-Za-z])", 0, info), info
I had to add an extra \ before each \ for the Data Explorer to parse the strings correctly.
Your regex engine chokes on a lookbehind, and possibly on lookahead, too.
You have a second argument to extract that tells the function to return the capture only, so you may use
| project extract("milk-cow-\\s*([a-zA-Z]+)", 1, info)
It means
milk-cow- - match milk-cow-
\s* - match 0 or more whitespaces
([a-zA-Z]+) - match and capture into Group 1 only one or more ASCII letters.
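The capture-group approach can be tried outside Kusto as well. This is a sketch in Python, whose `re` module is a different engine from the one Kusto uses, so it only illustrates the idea; `extract_group` and `compiles` are hypothetical helpers. Python also rejects the variable-width lookbehind from the question, although for its own reason (it supports only fixed-width lookbehind).

```python
import re


def extract_group(pattern: str, group: int, text: str) -> str:
    """Rough analogue of extract(): return the given group, or '' if no match."""
    match = re.search(pattern, text)
    return match.group(group) if match else ""


def compiles(pattern: str) -> bool:
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


info = "stuff milk-cow-cocoa a/123"

print(extract_group(r"milk-cow-\s*([a-zA-Z]+)", 1, info))   # cocoa
print(compiles(r"(?<=milk-cow-\s*).*?(?=\s*[^A-Za-z])"))    # False
```

Because the group stops at the first non-letter, the pattern also works when the desired substring is the very end of the input.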

specify pattern at the beginning of string in regular expression

I have some string with multiple possible values:
e
(space)Exact
Exact
exact
phase
I want to get only the first four values, the regular expression I came up with is:
^\s*e
It means that the string starts with 0 or more whitespace characters followed by e (or E, case-insensitive). However, it always filters out the case
(space)Exact
My guess is that it takes ^ as "not" instead of as the beginning of the string. How can I correct that? I use Perl Compatible Regular Expressions (PCRE) as the matching engine.
Try using the mode modifiers in your regex to make ^ and $ match at line breaks and, if necessary, to turn on case-insensitive matching:
(?mi)^\s*e
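A quick way to see the effect of the two modifiers is to run the pattern over the sample values. This sketch uses Python's `re` module rather than PCRE, on the assumption that `^`, `\s`, and the `m` and `i` flags behave the same way for this pattern.

```python
import re

values = ["e", " Exact", "Exact", "exact", "phase"]

# One value per string: only case-insensitivity is needed.
kept = [v for v in values if re.match(r"^\s*e", v, re.IGNORECASE)]
print(kept)  # ['e', ' Exact', 'Exact', 'exact']

# All values in one multi-line string: ^ must also match after newlines.
text = "\n".join(values)
print(len(re.findall(r"(?mi)^\s*e", text)))  # 4
print(len(re.findall(r"(?i)^\s*e", text)))   # 1
```

Without the `m` flag only the very first line can match, which is one way a value such as " Exact" might appear to be filtered out.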
The ^ character means only the beginning of a string. The beginning of a new line does not count as the beginning of a string, so this would not work if more than one value is inside the same "string" object. I'm not sure how PCRE handles this, but if you want to match the beginning of a line as well, you have to enable the multi-line flag.
Edit: If you want to pick up the beginning of a new line, go this route instead: put \r\n at the beginning of the expression and remove the "^".
Edit #2 (because I feel like doing regex): here's what you're looking for:
(\b)[eE]+\w*

regular expression matching issue

I've got a string which has the following format
some_string = ",,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,"
and this is the content of a text file called f
I want to search for a specific term within the xxx (let's say that term is 'silicon')
note that the xxx can all be different and can contain any special characters (including meta characters) except for a new line
match = re.findall(r",{3}(.*?silicon.*?),{3}", f.read())
print match
But this doesn't seem to work because it returns results which are in the format:
["xxx,,,xxx,,,xxx,,,xxx,,,silicon", "xxx,,,xxx,,,xxx,,,xxsiliconxx"] but I only want it to return ["silicon", "xxsiliconxx"]
What am I doing wrong?
Try the following regex:
(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})
Example:
>>> s = ',,,xxx,,,silicon,,,xxx,,,xxsiliconxx,,,xxx'
>>> re.findall(r'(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})', s)
['silicon', 'xxsiliconxx']
I am assuming that the content in the xxx can contain commas, just not three consecutive commas or it would end the field. If the content in the xxx sections cannot contain any commas, you can use the following instead:
(?<=,{3})[^,\r\n]*?silicon.*?(?=,{3})
The reason your current approach doesn't work is that even though .*? will try to match as few characters as possible, the match will still start as early as possible. So for example the regex a*?b would match the entire string "aaaab". The only time the regex will advance the starting position is when the regex fails to match, and since ,,, can be matched by the .*?, your match will always start at the beginning of the string or just after the previous match.
The lookbehind and lookahead are used to address the issue raised by JaredC in comments, basically re.findall() won't return overlapping matches, so you need the leading and trailing ,,, to not be a part of the match.
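The "starts as early as possible" behaviour can be checked directly. The sketch below reuses the sample string from the example above and compares the original pattern with the lookaround version; the variable names are made up for illustration.

```python
import re

s = ",,,xxx,,,silicon,,,xxx,,,xxsiliconxx,,,xxx"

# A lazy quantifier still yields the leftmost match: the a's are included.
print(re.search(r"a*?b", "aaaab").group())  # aaaab

# Original pattern: the first match starts at the first ,,, and .*? grows
# across the next ,,, until it reaches "silicon".
original = re.findall(r",{3}(.*?silicon.*?),{3}", s)
print(original)  # ['xxx,,,silicon', 'xxsiliconxx']

# Lookaround version: the tempered dot (?:(?!,{3}).) cannot cross a ,,,.
fixed = re.findall(r"(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})", s)
print(fixed)  # ['silicon', 'xxsiliconxx']
```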

regex with 3 backreferences but one optional

I have a regular expression that captures three backreferences though one (the 2nd) may be null.
Given the following string:
http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajonathonoat.es&source=web&cd=1&ved=0CC8QFjAA&url=http%3A%2F%2Fjonathonoat.es%2Fbritish-mozcast%2F&ei=MQj9UKejDYeS0QWruIHgDA&usg=AFQjCNHy1cDoWlIAwyj76wjiM6f2Rpd74w&bvm=bv.41248874,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1
I wish to capture the TLD (in this case .co.uk), q param and cd param.
I'm using the following RegEx:
/.*\.google([a-z\.]*).*q=(.*[^&])?.*cd=(\d*).*/i
This works, except that the 2nd backreference includes the other parameters up to the cd param. I currently get this:
["http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajo…,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1 ", ".co.uk", "site%3Ajonathonoat.es&source=web", "1", index: 0, input: "http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajo…,d.d2k,.co.uk,site%3Ajonathonoat.es&source=web,1"]
The 1st backreference is correct, it's .co.uk and so is the 3rd; it's 1. I want the 2nd backreference to be either null (or undefined or whatever) or just the q param, in this example site%3Ajonathonoat.es. It currently includes the source param too (site%3Ajonathonoat.es&source=web).
Any help would be much appreciated, thanks!
I've added a JSFiddle of the code, look in your browser console for the output, thanks!
When negating character classes, I always add a quantifier to the class itself:
/.*\.google([a-z\.]*).*q=([^&]*?)?.*cd=(\d*).*/i
I also recommend not using * or +, as they are "greedy"; always use *? or +? when you are going to find delimiters inside your string. For more on greediness, see Jeffrey Friedl's Mastering Regular Expressions.
You want the middle group to be:
q=([^&]*)
This will capture characters other than ampersand. This also allows zero characters, so you can remove the optional group (?).
Working example: http://rubular.com/r/AJkXxgeX5K
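For a quick check without the linked demo, here is a sketch of the full pattern with `q=([^&]*)` as the middle group. It is written in Python rather than the JavaScript used in the question, and the two URLs are shortened or made-up test inputs, not the exact string from the question.

```python
import re

PATTERN = re.compile(r".*\.google([a-z.]*).*q=([^&]*).*cd=(\d*).*", re.IGNORECASE)

# Shortened form of the URL from the question.
url = ("http://www.google.co.uk/url?sa=t&rct=j&q=site%3Ajonathonoat.es"
       "&source=web&cd=1&ved=0CC8QFjAA")

# Hypothetical URL with an empty q parameter.
url_empty_q = "http://www.google.com/url?sa=t&q=&source=web&cd=2"

print(PATTERN.match(url).groups())          # ('.co.uk', 'site%3Ajonathonoat.es', '1')
print(PATTERN.match(url_empty_q).groups())  # ('.com', '', '2')
```

Because `[^&]*` cannot cross an ampersand, the group stops at the end of the q value, and an empty q yields an empty string rather than a failed match.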