regex: substitute character in captured group - regex

EDIT
In a regex replacement, can a captured group be emitted with one of its characters substituted by another?
ORIGINAL QUESTION
I'm converting a list of products into a CSV text file. Every line in the list has: number name[ description] price in this format:
1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18
The resulting format must also include a slug (with - instead of spaces) as the second field:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product-name-two-2:second description, maybe:80
3 THIRD PROD:third-prod-3::18
The regex I'm using is this:
(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)
and substitution string is:
\1 \2:\L$2-\1:\3:\4
This way my result is:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product name two-2:second description, maybe:80
3 THIRD PROD:third prod-3::18
What I'm missing is the hyphen separator I need in the second field, that is, group \2 with '-' instead of ' '.
Is it possible with a single regex, or should I go for a second pass?
(For now I'm using the Sublime Text editor.)
Thanks.

I don't think doing this in a single pass is reasonable, and it may not even be possible. To replace the spaces with hyphens you need either multiple passes or continuous matching (\G), and both lose the context of the capturing groups you need to rearrange the structure. So after your first replace, I would search for (?m)(?:^[^:\n]*:|\G(?!^))[^: \n]*\K followed by a single literal space, and replace it with -. This only touches spaces in the second field: each match either starts right after the first colon of a line or continues from the end of the previous match. I'm not sure whether Sublime enables the multiline modifier by default; if it does, you can drop the (?m).
The answer might be different if you were using a programming language that supports callback functions for regex replace operations, where you could do the space-to-hyphen replacement inside the callback.
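As a rough illustration of the callback approach, here is a minimal Python sketch. Python is an assumption (the question itself uses Sublime Text), the pattern is the question's regex lightly simplified, and the helper name to_csv_line is hypothetical.

```python
import re

# The question's pattern, lightly simplified ([ ]? -> " ?", [\d]+ -> \d+).
PATTERN = re.compile(r"(\d+) ([A-Z ]+?) ?([a-z ,]*):(\d+)")

def to_csv_line(match):
    """Callback for re.sub: rebuild one line, deriving the slug from group 2."""
    number, name, description, price = match.groups()
    # The space-to-hyphen step happens here, inside the callback.
    slug = name.lower().replace(" ", "-") + "-" + number
    return f"{number} {name}:{slug}:{description}:{price}"

source = """1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18"""

converted = PATTERN.sub(to_csv_line, source)
print(converted)
```

Because the callback receives the whole match object, the capturing groups stay available while the slug is built, so no second pass is needed.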

Related

Extracting a numerical value from a paragraph based on preceding words

I'm working with some big text fields in columns. After some cleanup I have something like below:
truth_val: ["5"]
xerb Scale: ["2"]
perb Scale: ["1"]
I want to extract the number 2. I'm trying to match the string "xerb Scale" and then extract 2. I tried capturing the group including 2 as (?:xerb Scale:\s\[\")\d{1} and tried to exclude the matched group through a negative look ahead but had no luck.
This is going to be in a SQL query and I'm trying to extract the numerical value through a REGEXP_EXTRACT() function. This query is part of a pipeline that loads this information into the database.
Any help would be much appreciated!
Match the text you do not need in order to set the context for the match, and capture only the part you need to extract:
xerb Scale:\s*\["(\d+)"]
                 ^^^^^
See the regex demo. In Presto, use REGEXP_EXTRACT to get the first match:
SELECT regexp_extract(col, 'xerb Scale:\s*\["(\d+)"]', 1); -- 2
                                                     ^^^
Note the 1 argument:
regexp_extract(string, pattern, group) → varchar
Finds the first occurrence of the regular expression pattern in string and returns the capturing group number group
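To try the same pattern outside the database, a minimal Python sketch follows. Python's re module stands in for the SQL engine here, which is an assumption, and the variable names are illustrative only.

```python
import re

text = '''truth_val: ["5"]
xerb Scale: ["2"]
perb Scale: ["1"]'''

# Match the context ("xerb Scale: [") and capture only the digits.
match = re.search(r'xerb Scale:\s*\["(\d+)"\]', text)
value = match.group(1) if match else None
print(value)
```

match.group(1) plays the role of the group argument of regexp_extract: it returns the first capturing group rather than the whole match.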

Regex capture into group everything from string except part of string

I'm trying to create a regex that captures everything from a string except for specific parts of it. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a contiguous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is solved very easily by replacing the unwanted words with empty strings, like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
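For instance, the \b variant of the substitution could look like this in Python (a sketch; the language choice and the helper name strip_unwanted are assumptions, not part of the answer):

```python
import re

# Same alternation as the s///g expression above, with word boundaries.
UNWANTED = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")

def strip_unwanted(name):
    """Remove 'production' and 'public' together with one adjacent hyphen."""
    return UNWANTED.sub("", name)

for sample in ("california-public-local-card-production",
               "production-nevada-public"):
    print(strip_unwanted(sample))
```

A string that consists of nothing but "production" or "public" has no adjacent hyphen and is left as-is by this pattern.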
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.

preg_match_all with multiple OR conditions

I'm trying to create a single regex pattern to match a string where 2 fields (separated by a comma) could either be
a) empty,
b) a single word, or
c) 2 words separated by a backslash (\).
This is a log file where position 1 is a source username field and position 2 is a destination user field, but both could be separated with a backslash if domain name is present (domain\username)
I've tried everything I can think of and can get 2 out of 3 to match, but not all conditions. Below are the possible variants that this string could be in. (something1 and something2 are known patterns that occur before and after this condition)
something1,,,something2
something1,,dstuser,something2
something1,,dstdomain\dstuser,something2
something1,srcdomain\srcuser,,something2
something1,srcdomain\srcuser,dstdomain\dstuser,something2
something1,srcuser,dstdomain\dstuser,something2
something1,srcuser,dstuser,something2
something1,srcuser,,something2
something1,srcdomain\srcuser,dstuser,something2
something1,srcdomain\srcuser,dstdomain\dstuser,something2
For example, I've tried this:
^.*something1,(,|(?J)(?<src_username>[^\\]*),|(?<src_domain>.*?)\\(?<src_username>[^\\]*),).*?,something2*
This matches some of the time, but I'm curious whether this is possible with a single line of regex.
Thanks in advance....
I think you are looking for this regex:
(?J)^.*something1,(?:,|(?<src_username>[^,\\]+),|(?<src_domain>[^,\\]+)\\(?<src_username>[^,\\]+),)(?:,|(?<dst_username>[^\\,]+),|(?<dst_domain>[^,\\]+)\\(?<dst_username>[^,\\]*),)something2.*
Check the demo
I am using the negated character class [^,\\] extensively so as not to overmatch and to stay within the boundaries of a "cell". I also use (?:...) non-capturing groups, which avoids cluttering the captured groups and keeps the output clean.
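The (?J) duplicate-names flag is PCRE-specific, and Python's re module does not allow two groups with the same name. As a sketch of a different technique (not the pattern above), an optional "domain\" prefix in front of each username gives similar results. The group names follow the answer; parse_users is a hypothetical helper.

```python
import re

# Each field is an optional "domain\" prefix followed by a possibly empty username.
FIELDS = re.compile(
    r"something1,"
    r"(?:(?P<src_domain>[^,\\]+)\\)?(?P<src_username>[^,\\]*),"
    r"(?:(?P<dst_domain>[^,\\]+)\\)?(?P<dst_username>[^,\\]*),"
    r"something2"
)

def parse_users(line):
    """Return the four named fields as a dict, or None if the line does not match."""
    match = FIELDS.search(line)
    return match.groupdict() if match else None

print(parse_users(r"something1,srcdomain\srcuser,dstuser,something2"))
print(parse_users(r"something1,,,something2"))
```

A missing domain comes back as None and an empty user field as an empty string, so the caller can tell the two cases apart. Unlike the answer's pattern, this one also accepts a domain followed by an empty username.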

regex to find value at a particular location

Presently the regex is:
[A-Z]+(?=-\d+$)
This pulls out the correct value for most of the strings which follow the below format:
ANG-RGN-SOR-BCP-0004 i.e. BCP
However it pulls out SS for the following document instead of PMR:
ANG-B31-OPS-PMR-MACE-SS-0229
So basically I want to pull out the fourth term (between the hyphens), so it should pick BCP and PMR.
The following regex will get the 4th item in group 1:
(?:[A-Z0-9]+-){3}([A-Z0-9]+)
The first bit in (?:...) is a "non-capturing group" which acts like a group but won't appear in the backreference list.
The next bit means "3 of these non-capturing groups".
And finally, a capturing group to collect what you want.
I have assumed here that all the groups contain only uppercase letters and digits; you should modify the parts in [square brackets] to represent what these groups could be.
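Applied in Python, the regex above might be used like this (a sketch; fourth_term is a hypothetical helper name, and re.match is used on the assumption that the identifier starts at the beginning of the string):

```python
import re

# Three "TERM-" prefixes skipped by a non-capturing group, then the fourth term captured.
FOURTH_TERM = re.compile(r"(?:[A-Z0-9]+-){3}([A-Z0-9]+)")

def fourth_term(doc_id):
    """Return the fourth hyphen-separated term, or None if there are fewer than four."""
    match = FOURTH_TERM.match(doc_id)
    return match.group(1) if match else None

for doc_id in ("ANG-RGN-SOR-BCP-0004", "ANG-B31-OPS-PMR-MACE-SS-0229"):
    print(doc_id, "->", fourth_term(doc_id))
```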
A more easily understandable method in Python:
a = "ANG-B31-OPS-PMR-MACE-SS-0229"
part = a.split('-')[3]
print(part)
This gives "PMR".
This should suit your needs (demo):
(?:.+?-){3}([^-]+)
You'll be able to access the fourth term in the first capturing group.

Matching a word if preceding text is not "class=' "

I'm trying to create a regex for a search that will look at the following code and return only the ids and not the classes:
1 id="contact"
2 class="contact"
3 #contact
4 .contact
I want to return contact from the 1st and 3rd lines and NOT 2nd and 4th lines.
This is for a search across multiple files to avoid going through each one individually and checking whether it needs changing or not.
Is this possible?
Here you go:
/(?:#|id=")(\w+)"?/g
This matches strings beginning with either # or id=" followed by word characters. You'll probably want to enhance it to handle dashes as well (\w already covers underscores), I'd bet.
In this case, the first group is non-capturing, and the ID text will be your first capture group $1.
UPDATE
this one:
(?:(?<=id=")|(?<=#))(contact)
uses a positive lookbehind to find your prefixes and matches just the string "contact". This will NOT work in JavaScript engines that lack lookbehind support (it was only added in ES2018), so you may not be able to test it online, but it will work in a text editor or CLI tool like ack.
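Both patterns can be checked quickly in Python, whose re module supports fixed-width lookbehind. This is a sketch only; the sample text mirrors the four lines from the question and the variable names are illustrative.

```python
import re

sample = '''id="contact"
class="contact"
#contact
.contact'''

capture_pattern = re.compile(r'(?:#|id=")(\w+)"?')
lookbehind_pattern = re.compile(r'(?:(?<=id=")|(?<=#))(contact)')

lines = sample.splitlines()

# 1-based numbers of the lines on which each pattern finds an id.
capture_hits = [n for n, line in enumerate(lines, start=1)
                if capture_pattern.search(line)]
lookbehind_hits = [n for n, line in enumerate(lines, start=1)
                   if lookbehind_pattern.search(line)]

print(capture_hits, lookbehind_hits)
```

Both lists contain lines 1 and 3 only, so the class="contact" and .contact lines are skipped by either pattern.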