Extracting a numerical value from a paragraph based on preceding words - regex

I'm working with some big text fields in columns. After some cleanup I have something like below:
truth_val: ["5"]
xerb Scale: ["2"]
perb Scale: ["1"]
I want to extract the number 2. I'm trying to match the string "xerb Scale" and then extract 2. I tried capturing the group including 2 as (?:xerb Scale:\s\[\")\d{1} and tried to exclude the matched group through a negative look ahead but had no luck.
This is going to be in a SQL query and I'm trying to extract the numerical value through a REGEXP_EXTRACT() function. This query is part of a pipeline that loads this information into the database.
Any help would be much appreciated!

Match the text you do not need in order to set the context for your match, and capture only the part you want to extract:
xerb Scale:\s*\["(\d+)"]
                 ^^^^^
See the regex demo. In Presto, use REGEXP_EXTRACT to get the first match:
SELECT regexp_extract(col, 'xerb Scale:\s*\["(\d+)"]', 1); -- 2
                                                     ^^^
Note the 1 argument:
regexp_extract(string, pattern, group) → varchar
Finds the first occurrence of the regular expression pattern in string and returns the capturing group number group
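For readers who want to try the pattern outside the database, here is a minimal Python sketch of the same idea: the label is matched for context and only the digits are captured. The variable names are illustrative, and Python's re module stands in for Presto's regex engine, so treat it as an approximation rather than the exact SQL behaviour.

```python
import re

# Sample text taken from the question; "text" is an illustrative name.
text = 'truth_val: ["5"]\nxerb Scale: ["2"]\nperb Scale: ["1"]'

# Match the label for context, capture only the digits in group 1.
pattern = re.compile(r'xerb Scale:\s*\["(\d+)"]')

match = pattern.search(text)
value = match.group(1) if match else None
print(value)
```

Asking for group 1 here plays the same role as the third argument of regexp_extract.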

Populates the Necessary Data and skip others

I am using REGEXREPLACE to populate the necessary data, and my formula works fine, but I also want to populate the values that contain (Ordered but Pending), together with the word before each of them.
My formula populates the values that have a ? sign.
There are some parts I want to keep as they are, such as the start *** PEELS and the end ZA - Date ***; the ZA and Date parts can differ from one data cell to another.
Your help will be much appreciated.
=TRIM(REGEXREPLACE(A2,"(\*{3}.*?)(?:\s*?\.{3}DONE=>.*)?(\*{3})$","$1$2"))
Link to Sheet
You might use a pattern that captures everything up to the first occurrence of a date in group 1.
Then capture the date-like pattern in group 2, because it has to go in a different place in the final replaced string.
Then either match the part to remove up to the repeated (Ordered but Pending) parts, which are captured in group 3, or match the rest of the line.
At the end, capture the asterisk pattern in group 4.
In the replacement, use the 4 groups: $1$3 $2$4
^(.*?)\s*(-\s*[A-Z]+\s*-\s*\d{2}-\d{2}-\d{4})(?:\s*\.+DONE=>.*?((?:\s*[A-Z]+\s*\(Ordered but Pending\))+)|.*)\s*(\*{3})$
Regex demo
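The four-group replacement can be tried in Python as a rough sketch. The cell values below are hypothetical, since the sheet itself is not shown here, and Python writes the replacement as \1\3 \2\4 instead of $1$3 $2$4.

```python
import re

# Same pattern as above, split over three lines for readability.
pattern = re.compile(
    r'^(.*?)\s*(-\s*[A-Z]+\s*-\s*\d{2}-\d{2}-\d{4})'
    r'(?:\s*\.+DONE=>.*?((?:\s*[A-Z]+\s*\(Ordered but Pending\))+)|.*)'
    r'\s*(\*{3})$'
)

# Hypothetical cell contents, invented for illustration only.
with_pending = ('*** PEELS item one - ZA - 01-02-2021 '
                '...DONE=>foo? ABC (Ordered but Pending) ***')
without_pending = '*** PEELS item two - ZA - 03-04-2021 ***'

# Group 3 is empty when the second alternative (.*) is the one that matches.
print(pattern.sub(r'\1\3 \2\4', with_pending))
print(pattern.sub(r'\1\3 \2\4', without_pending))
```

Note that the replacement does not put a space back before the closing asterisks; adjust the replacement string if your data needs one.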

How to duplicate regex search result within one line?

I have a csv table following the scheme:
"text1","text2",3
"text5","text?",5
"baa","foo",99
...
Which I need to transform to:
"text1","text2","-text2-",3
"text5","text?","-text?-",5
"baa","foo","-foo-",99
...
I'm sorry but I have no idea how to duplicate a part of a line using a regex.
I'm using VS Code find-replace engine.
How could I do this?
See regex101 demo.
Find: ^(\s*"[^"]*?","([^"]*?)",)
Replace: $1"-$2-",
Group 1: the first two values in each line, like "text1","text2",
Group 2: just the inner second value, like text2
Replace: Use Group 1 and then replicate Group 2 with surrounding "-Group2-"
Make sure you have this in your settings.json:
"search.usePCRE2": true,
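The same find/replace can be checked outside the editor. This is a small Python sketch using the rows from the question; Python spells the group references \1 and \2 rather than $1 and $2, and the variable names are illustrative.

```python
import re

# Rows taken from the question.
rows = ['"text1","text2",3', '"text5","text?",5', '"baa","foo",99']

# Group 1: the first two fields with their commas; group 2: the inner second value.
pattern = re.compile(r'^(\s*"[^"]*?","([^"]*?)",)')

# Re-emit group 1, then add a new quoted field built from group 2.
converted = [pattern.sub(r'\1"-\2-",', row) for row in rows]
for row in converted:
    print(row)
```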
"text1","text2",3
"text5","text?",5
Capture three groups: group 1, group 2 and group 3. The class [\w?] matches word characters (A-Za-z0-9 and _) plus the "?" character. I did not know how long the final number can be, so I assumed 1 to 3 digits; you can adjust this to fit your data.
("[\w?]+"),"([\w?]+)",(\d{1,3})
Replace with the following:
$1,"$2","-$2-",$3
The result should be as follows:
"text1","text2","-text2-",3
"text5","text?","-text?-",5
Feel free to ask me if you have any questions.

Regex capture into group everything from string except part of string

I'm trying to create a regex that captures everything from a string except for specific parts of it. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a contiguous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
However, this common kind of task is easily solved by replacing the unwanted words with empty strings, like this:
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
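In Python, an equivalent of the substitution above might look like the following sketch, with re.sub playing the role of s///g. The sample strings come from the question; the variable names are illustrative.

```python
import re

# The \b variant from above: drop each keyword together with one adjacent hyphen.
unwanted = re.compile(r'-production\b|\bproduction-|-public\b|\bpublic-')

samples = [
    'california-public-local-card-production',
    'production-nevada-public',
]

cleaned = [unwanted.sub('', s) for s in samples]
print(cleaned)
```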
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.
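A short Python sketch of matching multiple times: re.findall returns group 1 of every match. It uses a plain non-capturing group (?:...) in place of the atomic group (?>...), on the assumption that the reader's Python may predate 3.11, where atomic groups were added; for this pattern the matches are the same.

```python
import re

# Group 1 is the text in front of each keyword.
pattern = re.compile(r'([\s\S]*?)(?:production|public)')

first = pattern.findall('california-public-local-card-production')
second = pattern.findall('production-nevada-public')

print(first)
print(second)
```

The pieces still carry their hyphens, and any text after the last keyword is not part of any match, so some cleanup is needed before joining them.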

regex: substitute character in captured group

EDIT
In a regex, can a matching capturing group be replaced with the same match altered substituting a character with another?
ORIGINAL QUESTION
I'm converting a list of products into a CSV text file. Every line in the list has: number name[ description] price in this format:
1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18
The resulting format must also include a slug (with - instead of spaces) as the second field:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product-name-two-2:second description, maybe:80
3 THIRD PROD:third-prod-3::18
The regex I'm using is this:
(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)
and substitution string is:
\1 \2:\L$2-\1:\3:\4
This way my result is:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product name two-2:second description, maybe:80
3 THIRD PROD:third prod-3::18
What I am missing is the hyphen separator I need in the second field, that is, group \2 with '-' instead of ' '.
Is it possible with a single regex, or should I go for a second pass?
(For now I'm using the Sublime Text editor.)
Thanks.
I don't think doing this in a single pass is reasonable, and it may not even be possible. To replace the spaces with hyphens, you need either multiple passes or continuous matching, and both lose the context of the capturing groups you need to rearrange your structure. So after your first replace, I would search for (?m)(?:^[^:\n]*:|\G(?!^))[^: \n]*\K followed by a single literal space, and replace with -. I'm not sure whether Sublime uses the multiline modifier by default; if it does, you can drop the (?m).
The answer might be different if you were using a programming language that supports callback functions for regex replace operations, where you could do the space-to-hyphen replacement inside the callback.
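As an illustration of the callback approach, here is a minimal Python sketch. It reuses the regex from the question and builds the slug inside a replacement function; the function and variable names are made up for this example.

```python
import re

# Regex from the question: number, upper-case name, optional description, price.
LINE = re.compile(r'(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):(\d+)')

def add_slug(m):
    """Rebuild one product line, inserting a lower-case hyphenated slug."""
    number, name, description, price = m.groups()
    slug = name.lower().replace(' ', '-')
    return f'{number} {name}:{slug}-{number}:{description}:{price}'

products = """1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18"""

# re.sub calls add_slug once per match and inserts its return value.
converted = LINE.sub(add_slug, products)
print(converted)
```

Because the callback sees all groups at once, the lower-casing and the space-to-hyphen step happen in the same pass as the reordering.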

regex to find value at a particular location

Presently the regex is:
[A-Z]+(?=-\d+$)
This pulls out the correct value for most of the strings which follow the below format:
ANG-RGN-SOR-BCP-0004 i.e. BCP
However it pulls out SS for the following document instead of PMR:
ANG-B31-OPS-PMR-MACE-SS-0229
So basically I want to pull out the fourth term (between the hyphens), so it should pick BCP and PMR.
The following regex will get the 4th item in group 1:
(?:[A-Z0-9]+-){3}([A-Z0-9]+)
The first bit in (?:...) is a "non-capturing group" which acts like a group but won't appear in the backreference list.
The next bit means "3 of these non-capturing groups".
And finally, a capturing group to collect what you want.
I have assumed here that all the groups contain only uppercase letters and digits; modify the parts in [square brackets] to match what these groups can actually contain.
A more easily understandable method in Python:
a = "ANG-B31-OPS-PMR-MACE-SS-0229"
part = a.split('-')[3]
print(part)  # Python 3 syntax; the original used the Python 2 print statement
This gives "PMR".
This should suit your needs (demo):
(?:.+?-){3}([^-]+)
You'll be able to access the fourth term in the first capturing group.
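Both regexes from the answers can be compared side by side in Python. This is only a sketch: the two identifiers come from the question, the names strict and loose are made up here, and re.match is used so that matching starts at the beginning of the string.

```python
import re

ids = ['ANG-RGN-SOR-BCP-0004', 'ANG-B31-OPS-PMR-MACE-SS-0229']

# Skip three hyphen-terminated terms, then capture the fourth.
strict = re.compile(r'(?:[A-Z0-9]+-){3}([A-Z0-9]+)')  # upper-case letters and digits only
loose = re.compile(r'(?:.+?-){3}([^-]+)')             # anything that is not a hyphen

strict_terms = [strict.match(s).group(1) for s in ids]
loose_terms = [loose.match(s).group(1) for s in ids]
print(strict_terms, loose_terms)
```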