I'm trying to extract some information from a string in one of my columns with a RegEx.
I need to define a second column equal to what is between the 2nd and 3rd occurrence of a hyphen in my first column.
After much googling around I managed to get this far:
IFNULL(SAFE.REGEXP_EXTRACT(Final.CampaignName, r"(?:\w+\s+-\s+){2}(\w+)\s+-"), "Other") AS CampaignCategory
Examples of how a string in Final.CampaignName could look:
S - Oranges - Bar - Apples
S - Apples - Foo Bar - Oranges - Bananas
S - Apples - Bar
My Regex will only return the value if there is 1 word between the 2nd and 3rd hyphens, but I need to have the entire text returned (minus leading and trailing whitespace).
Can anyone guide me in the right direction to doing this?
Thanks!
If the regex engine supports \K (loosely, forget everything matched so far), one could use the following regular expression to match the text between the second and third hyphen.
^(?:[^-]+-){2}\K[^-]+(?=-)
Note that this regex does not contain a capture group.
This does not match Bar in the third example because there are only two hyphens. To match Bar simply remove the lookahead (?=-).
The regex engine performs the following operations.
^ match beginning of line
(?:[^-]+-) match 1+ chars other than '-' followed by '-' in a non-capture group
{2} execute non-capture group twice
\K discard everything matched so far (reset the starting point of the reported match)
[^-]+ match 1+ chars other than '-'
(?=-) match '-' in a positive lookahead
If [^-] is not to match newlines change it to [^-\r\n].
If \K is not supported, a capture group is needed (and the lookahead is not):
^(?:[^-]+-){2}([^-]+)-
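For a quick check outside BigQuery, here is a minimal Python sketch of the capture-group form. Python's built-in re module does not support \K, so only the capture-group variant is shown. The function name campaign_category and the "Other" default are illustrative assumptions that mirror the IFNULL in the question.

```python
import re

# Capture-group form: the text between the 2nd and 3rd hyphen.
CATEGORY_PATTERN = re.compile(r"^(?:[^-]+-){2}([^-]+)-")


def campaign_category(name, default="Other"):
    """Return the trimmed text between the 2nd and 3rd hyphen, else `default`."""
    match = CATEGORY_PATTERN.search(name)
    # strip() removes the spaces that surround each hyphen in the samples.
    return match.group(1).strip() if match else default


samples = [
    "S - Oranges - Bar - Apples",
    "S - Apples - Foo Bar - Oranges - Bananas",
    "S - Apples - Bar",  # only two hyphens, so the pattern does not match
]
categories = [campaign_category(s) for s in samples]
print(categories)
```

The third sample falls back to the default because the pattern requires a third hyphen, the same behaviour described above for the lookahead version.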
You were almost there. Below is as close to your original idea as I could get (BigQuery Standard SQL):
SELECT IFNULL(REGEXP_EXTRACT(final.CampaignName, r"(?: - .*?){2}(.*?)(?: -|$)"), "Other") AS CampaignCategory
Use the following pattern with a capture group to isolate what you really want to extract:
SAFE.REGEXP_EXTRACT(Final.CampaignName, r"[^-]+-[^-]+-\s*([^-]+?)\s*-") AS CampaignCategory
You could match what is between the second and third hyphen using a capturing group, and make matching the rest optional using a repeating pattern with *:
\w+(?:\s+-\s+\w+)\s+-\s+(\w+(?: \w+)*)(?:\s+-\s+\w+)*
I always prefer to avoid regex where possible.
So for your problem, I can recommend this code:
split(Final.CampaignName, ' - ')[safe_offset(2)]
An example with your sample data:
select campaignName, split(campaignName, ' - ')[safe_offset(2)] as third_item
from unnest(['S - Oranges - Bar - Apples', 'S - Apples - Foo Bar - Oranges - Bananas', 'S - Apples - Bar']) as campaignName
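Output looks like:
campaignName | third_item
S - Oranges - Bar - Apples | Bar
S - Apples - Foo Bar - Oranges - Bananas | Foo Bar
S - Apples - Bar | Bar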
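The same split-and-index idea can be sketched in Python. The helper name third_item is hypothetical, and returning None for a missing piece is an assumption chosen to mimic how safe_offset yields NULL instead of raising an error.

```python
def third_item(name, sep=" - "):
    """Return the third piece of `name` split on `sep`, or None if it is absent."""
    parts = name.split(sep)
    # Guard the index so short strings return None instead of raising IndexError.
    return parts[2] if len(parts) > 2 else None


campaign_names = [
    "S - Oranges - Bar - Apples",
    "S - Apples - Foo Bar - Oranges - Bananas",
    "S - Apples - Bar",
]
third_items = [third_item(name) for name in campaign_names]
print(third_items)
```

Unlike the regex answers, this also returns a value for the third sample, because it does not require a third hyphen.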
I have a column of names in my spreadsheet that are structured like this...
Albarran Basten, Thalia Aylin
I'm using the formula below to extract every word BEFORE the comma (last name), and then only the first word AFTER the comma (first name), and then switch their places. It works great.
=join(" ",REGEXEXTRACT(D2,",\s(\S+)"),REGEXEXTRACT(D2,"^(.*?),"))
The formula above returns the name mentioned above like this, exactly as I need it to...
Thalia Albarran Basten
But when I try to apply it to the entire column of names using ARRAYFORMULA, it joins all the names in the column together into one cell, and repeats that in every cell down the column. Here's the formula I'm using that won't work...
={"Student Full Name";arrayformula(if(D2:D="",,join(" ",REGEXEXTRACT(D2:D,",\s(\S+)"),REGEXEXTRACT(D2:D,"^(.*?),"))))}
Any idea on what I could change in this arrayformula to make it work? Thanks for your help.
You can replace your REGEXEXTRACTs with a single REGEXREPLACE:
REGEXREPLACE(D2:D, "^(.*?),\s*(\S+).*", "$2 $1")
Or,
REGEXREPLACE(D2:D, "^([^,]*),\s*(\S+).*", "$2 $1")
See the regex demo.
Details:
^ - start of string
(.*?) - Group 1 ($1): zero or more chars other than line break chars as few as possible
, - a comma
\s* - zero or more whitespaces
(\S+) - Group 2 ($2): one or more non-whitespaces
.* - zero or more chars other than line break chars as many as possible.
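To try the first pattern outside Sheets, here is a small Python sketch. The function name swap_name is hypothetical, and Python writes the group references as \2 and \1 where Sheets uses $2 and $1.

```python
import re


def swap_name(full_name):
    """Turn 'Last Names, First Middle' into 'First Last Names'."""
    # Group 1 is everything before the comma, group 2 the first word after it.
    return re.sub(r"^(.*?),\s*(\S+).*", r"\2 \1", full_name)


names = ["Albarran Basten, Thalia Aylin", "Smith, John"]
swapped = [swap_name(n) for n in names]
print(swapped)
```

The list comprehension plays the role of ARRAYFORMULA here: the replacement is applied to each value separately, so nothing is joined across rows.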
With your shown samples, please try the following regex with REGEXREPLACE.
REGEXREPLACE(D2:D, "^([^,]*),\s*([^\s]+)\s\S*$", "$2 $1")
Here is the online demo for the regex used.
Explanation of the regex:
^ ##Match from the start of the value.
([^,]*) ##1st capturing group: everything before the comma.
,\s* ##Match the comma followed by 0 or more whitespace characters.
([^\s]+) ##2nd capturing group: one or more non-whitespace characters.
\s\S*$ ##Match one whitespace character followed by 0 or more non-whitespace characters up to the end of the value.
I was able to get close to the desired output, but not exactly. I am using replaceAll with a regex; the sample code is below.
final String label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";
System.out.println(label.replaceAll(
"([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)", "$3"));
I want this output:
abc-nyd-request-xyxpt
but I am getting:
abc-nyd-request-xyxpt-
Here is the code: https://ideone.com/UKnepg
You may use this .replaceFirst solution:
String label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";
label.replaceFirst("(?:[^-]*-){2}(.+?)(?:--1)?-[^-]+$", "$1");
//=> "abc-nyd-request-xyxpt"
RegEx Details:
(?:[^-]*-){2}: Match 2 repetitions of zero or more non-hyphen characters followed by a hyphen
(.+?): Match 1+ of any characters, as few as possible, and capture in group #1
(?:--1)?: Match optional --1
-: Match a -
[^-]+: Match a non-hyphenated string
$: End
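As a cross-check of the pattern breakdown above, here is a minimal Python sketch. It is a translation, not the Java answer itself: re.sub with count=1 stands in for replaceFirst, and \1 replaces $1. The variable names are illustrative.

```python
import re

label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9"

# Skip the first two hyphen-terminated pieces, lazily capture the middle,
# then consume an optional "--1" and the final "-suffix".
MIDDLE_PATTERN = re.compile(r"(?:[^-]*-){2}(.+?)(?:--1)?-[^-]+$")

# count=1 mirrors Java's replaceFirst: only the first match is replaced.
middle = MIDDLE_PATTERN.sub(r"\1", label, count=1)
print(middle)
```

Because the capture is lazy and the optional --1 is tried before the final hyphen, the trailing hyphen that appeared in the question's output stays outside group 1.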
The following works for your example case
([^-]+)-([^-]+)-(.+[^-])-+([^-]+)-([^-]+)
https://regex101.com/r/VNtryN/1
The third group must end in a non-hyphen character ([^-]), so it never captures a trailing -. The following -+ then allows more than one hyphen, which is how the double -- is matched outside the capture.
With your shown samples and attempts, please try the following regex. It creates one capturing group that can be used in the replacement; use $1 as the replacement in your function.
^(?:.*?-){2}([^-]*(?:-[^-]*){3})--.*
Here is the online demo for the above regex.
Explanation of the above regex:
^(?:.*?-){2} ##Non-capturing group, anchored at the start of the value, that lazily matches up to the nearest - and is repeated 2 times.
([^-]*(?:-[^-]*){3}) ##The only capturing group: non-hyphen characters, followed by 3 repetitions of - plus more non-hyphen characters, which gives the required output.
--.* ##Match -- and everything after it to the end of the value.
I'm trying to match a string that contains three consecutive identical characters at the beginning of the line and the same character six times in a row at the end.
For example:
CCC i love regex CCCCCC
The C's would be highlighted by the search.
I have found a way to get the first three and the last six using these two regexes, but I'm struggling to combine them:
^([0-9]|[aA-zZ])\1\1 and ([0-9]|[aA-zZ])\1\1\1\1\1$
I appreciate any help.
If you want just one regular expression to "highlight" only the 1st three characters and last six, maybe use:
(?:^([0-9A-Za-z])\1\1(?=.*\1{6}$)|([0-9A-Za-z])\2{5}(?<=^\2{3}.*)$)
See an online demo
(?: - Open non-capture group to allow for alternations;
^([0-9A-Za-z])\1\1(?=.*\1{6}$) - Start-line anchor with a 1st capture group followed by two backreferences to that same group. This is followed by a positive lookahead to assert that the very last 6 characters are the same;
| - Or;
([0-9A-Za-z])\2{5}(?<=^\2{3}.*)$ - The alternative is to match a 2nd capture group with 5 backreferences to the same followed by a positive lookbehind (zero-width) to check that the first three characters are the same.
Now, if you don't want to be too strict about "highlighting" the other parts, just use capture groups:
^(([0-9A-Za-z])\2\2).*(\2{6})$
See an online demo. You can now refer to capture groups 1 and 3.
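The capture-group version works unchanged in Python's re module, so it can be tried with a short sketch like the one below. The helper name edge_runs is hypothetical. The lookbehind variant is left out because Python's built-in re only accepts fixed-width lookbehinds.

```python
import re

# Group 2 is the repeated character, group 1 the leading run of three,
# group 3 the trailing run of six of that same character.
EDGE_PATTERN = re.compile(r"^(([0-9A-Za-z])\2\2).*(\2{6})$")


def edge_runs(line):
    """Return (leading run, trailing run) if the line qualifies, else None."""
    match = EDGE_PATTERN.search(line)
    return (match.group(1), match.group(3)) if match else None


matched = edge_runs("CCC i love regex CCCCCC")
mismatched = edge_runs("CCC i love regex DDDDDD")
print(matched, mismatched)
```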
Using PCRE, I want to capture only and all digits in a line which follows a line in which a certain string appears. Say the string is "STRING99". Example:
car string99 house 45b
22 dog 1 cat
women 6 man
In this case, the desired result is:
221
I asked a similar question some time ago, but back then I was trying to capture the numbers in the SAME line where the string appears (Regex (PCRE): Match all digits conditional upon presence of a string). While the question is similar, I don't think the answer, if there is one at all, will be similar. The approach using the start-of-line anchor ^ does not work in this case.
I am looking for a single regular expression without any other programming code. It would be easy to accomplish with two consecutive regex operations, but this is not what I'm looking for.
Maybe you could try:
(?:\bstring99\b.*?\n|\G(?!^))[^\d\n]*\K\d
See the online demo
(?: - Open non-capture group:
\bstring99\b - Literally match "string99" between word-boundaries.
.*?\n - Lazy match up to (including) nearest newline character.
| - Or:
\G(?!^) - Asserts the position at the end of the previous match, and uses a negative lookahead to prevent \G from matching at the start of the string on the first attempt.
) - Close non-capture group.
[^\d\n]* - Match 0+ non-digit/newline characters.
\K - Resets the starting point of the reported match.
\d - Match a digit.
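Python's built-in re module supports neither \G nor \K, so the single-regex answer cannot be run there as written. The sketch below is therefore not that technique: it is a plain two-step version, useful only for checking what the PCRE pattern should return on the sample. The function name digits_after_marker is hypothetical.

```python
import re


def digits_after_marker(text, marker="string99"):
    """Return the digits found on the line that follows the marker line."""
    # Step 1: find the marker, skip the rest of its line, capture the next line.
    match = re.search(rf"\b{re.escape(marker)}\b.*\n(.*)", text)
    if match is None:
        return ""
    # Step 2: keep only the digits of the captured line.
    return "".join(re.findall(r"\d", match.group(1)))


sample_text = "car string99 house 45b\n22 dog 1 cat\nwomen 6 man"
found_digits = digits_after_marker(sample_text)
print(found_digits)
```

Joining the individual digit matches gives the 221 from the question; neither the 45 on the marker line nor the 6 two lines below is included.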
I've got a one-line string that includes several dates. In a JScript regex I need to extract the dates that are preceded by the case-insensitive substrings "dat" and "wy", in that order. The substrings can be preceded and followed by any character (except a newline).
reg = new RegExp('dat.{0,}wy.{0,}\\d{1,4}([\-/ \.])\\d{1,2}([\-/ \.])\\d{1,4}','ig');
str = ('abc18.Dat wy.03/12/2019FFF*Dato dost2009/03/03**data wy2020-09-30')
result = str.match(reg).toString()
Received result: 'Dat wy.03/12/2019FFF*Dato dost2009/03/03**data wy2020-09-30'
Expected result: 'Dat wy.03/12/2019,data wy2020-09-30' or preferably: '03/12/2019,2020-09-30'
Thanks.
Several issues.
You want to match as few characters as possible between the substrings and the date, but your current regex uses the greedy .{0,} (the same as .*). Use the lazy .*? instead.
dat.*?wy.*?FOO can still skip over another dat. To avoid that, use what some call a tempered greedy token: .*? becomes (?:(?!dat).)*? so that it cannot skip over a dat.
Not really an issue, but you can capture the date separator and reuse it.
If you want to extract only the date part, also use capturing groups. I put a demo at regex101.
dat(?:(?!dat).)*?wy.*?(\d{1,4}([/ .-])\d{1,2}\2\d{1,4})
There are many ways to achieve your desired outcome. Another idea: if you know that no digits will ever appear between the substrings and the date, use \D (non-digit) instead of the dot.
dat\D*?wy\D*(\d{1,4}([/ .-])\d{1,2}\2\d{1,4})
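The tempered-token pattern can be checked against the string from the question with a short Python sketch. This is a translation rather than JScript: re.IGNORECASE stands in for the i flag, and finditer plays the role of the g flag. The variable names are illustrative.

```python
import re

text = "abc18.Dat wy.03/12/2019FFF*Dato dost2009/03/03**data wy2020-09-30"

# (?:(?!dat).)*? refuses to step over another "dat" on the way to "wy";
# \2 forces the second date separator to equal the first.
DATE_PATTERN = re.compile(
    r"dat(?:(?!dat).)*?wy.*?(\d{1,4}([/ .-])\d{1,2}\2\d{1,4})",
    re.IGNORECASE,
)

# Group 1 holds only the date part of each match.
dates = [m.group(1) for m in DATE_PATTERN.finditer(text)]
print(",".join(dates))
```

The date after "Dato dost" is skipped because no "wy" occurs before the next "dat", which is the filtering the question asks for.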
You might use a capturing group with a backreference to make sure the separators like - and / are the same in the matched date.
\bdat\w*\s*wy\.?(\d{4}([-/ .])\d{2}\2\d{2}|\d{2}([-/ .])\d{2}\3\d{4})
\bdat\w*\s*wy\.? A word boundary, match dat followed by 0+ word chars and 0+ whitespace chars. Then match wy and an optional .
( Capture group 1
\d{4}([-/ .])\d{2}\2\d{2} Match a date like format starting with the year where \2 is a backreference to what is captured in group 2
| Or
\d{2}([-/ .])\d{2}\3\d{4} Match a date like format ending with the year where \3 is a backreference to what is captured in group 3
) Close group
The value is in capture group 1
Note that you could make the date more specific by specifying ranges for the year, month and day.