preg_match_all with multiple OR conditions - regex

Im trying to create a single regex pattern to match a string where 2 fields (separated by a comma) could either be
a) empty,
b) a single word, or
c) 2 words separated by a backslash (\).
This is a log file where position 1 is a source username field and position 2 is a destination user field, but both could be separated with a backslash if domain name is present (domain\username)
I've tried everything I can think of and can get 2 out of 3 to match, but not all conditions. Below are the possible variants that this string could be in. (something1 and something2 are known patterns that occur before and after this condition)
something1,,,something2
something1,,dstuser,something2
something1,,dstdomain\dstuser,something2
something1,srcdomain\srcuser,,something2
something1,srcdomain\srcuser,dstdomain\dstuser,something2
something1,srcuser,dstdomain\dstuser,something2
something1,srcuser,dstuser,something2
something1,srcuser,,something2
something1,srcdomain\srcuser,dstuser,something2
something1,srcdomain\srcuser,dstdomain\dstuser,something2
For example, I've tried this:
^.*something1,(,|(?J)(?<src_username>[^\\]*),|(?<src_domain>.*?)\\(?<src_username>[^\\]*),).*?,something2*
this matches some of the time, but I'm curious if this is possible with a single line of regex.
Thanks in advance....

I think you are looking for this regex:
(?J)^.*something1,(?:,|(?<src_username>[^,\\]+),|(?<src_domain>[^,\\]+)\\(?<src_username>[^,\\]+),)(?:,|(?<dst_user>[^\\,]+),|(?<dst_domain>[^,\\]+)\\(?<dst_username>[^,\\]*),)something2.*
Check the demo
I am using negated character class [^,\\] extensively to not overmatch and stay in the boundaries of a "cell". Also, I make use of (?:...) non-capturing groups to not make a mess with the captured groups and helps keep the output clean.

Related

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches has what I can describe as positioners. These are shown as a question mark, followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators cant even confirm the flavour of Regex it is - they believe its Python but I'm unsure.
Regex Image
Update
Unfortunately none of the codes below have worked so far. I thought to test each code in live (Rather than via regex thinking may work different but unfortunately still didn't work)
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
At first, match the ?21 and repace it with a distinctive character, #, etc
\?21
Demo
and you may try this regex to find what you want
(T(?:\d{7}|[\#\d]{8}))\s
Demo,,, in which target string is captured to group 1 (or \1).
Finally, replace # with ?21 or something you like.
Python script may be like this
ss="""T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre= re.compile(r'\?21')
regx= re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
for m in regx.findall(rexpre.sub('#',ss)):
print(m)
print()
for m in regx.findall(rexpre.sub('#',ss)):
print(re.sub('#',r'?21', m))
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group1group2 \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2
Usage Example:

Regex capture into group everything from string except part of string

I'm trying to create a regex, which will capture everything from a string, except for specific parts of the string. The he best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a continuous range of the sample string. Thus, the anwswer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But, this popular kind of task is being solved very easily by replacing unnecessary words by empty strings. Like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So You have to match multiple times to retrieve the result.

regex: substitute character in captured group

EDIT
In a regex, can a matching capturing group be replaced with the same match altered substituting a character with another?
ORIGINAL QUESTION
I'm converting a list of products into a CSV text file. Every line in the list has: number name[ description] price in this format:
1 PRODUCT description:120
2 PRODUCT NAME TWO second description, maybe:80
3 THIRD PROD:18
The resulting format must include also a slug (with - instead of ) as second field:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product-name-two-2:second description, maybe:80
3 THIRD PROD:third-prod-3::18
The regex i'm using is this:
(\d+) ([A-Z ]+?)[ ]?([a-z ,]*):([\d]+)
and substitution string is:
`\1 \2:\L$2-\1:\3:\4
This way my result is:
1 PRODUCT:product-1:description:120
2 PRODUCT NAME TWO:product name two-2:second description, maybe:80
3 THIRD PROD:third prod-3::18
what i miss is the separator hyphen - i need in the second field, that is group \2 with '-' instead of ''.
Is it possible with a single regex or should i go for a second pass?
(for now i'm using Sublime text editor)
Thanx.
I don't think doing this in a single pass is reasonable and maybe it's not even possible. To replace the spaces with hyphens, you will need either multiple passes or use continous matching, both will lose the context of the capturing groups you need to rearrange your structure. So after your first replace, I would search for (?m)(?:^[^:\n]*:|\G(?!^))[^: \n]*\K and replace with -. I'm not sure if Sublime uses multiline modifier per default, you might drop the (?m) then.
The answer might be a different one, if you were to use a programming language, that supports callback function for regex replace operations, where you could do the to - replace inside this function.

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the fda.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours to find a regex solution to retrieve both the delimiter and the value that follows it and, though there seems to be posts that handle this, the code found in the post haven't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>'
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
Working demo
The idea is that the first group contains the delimiters and the 2nd group the content. I assume the content can be word characters (a to z and digits with underscore). You have then to grab the content of every capturing group.
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
See live demo.
If your values can contain non-alphanumeric characters, it get complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
See live demo.
Code the delimiters longest first, eg "=," before "=", otherwise the alternation, which matches left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.

Replacing part of delimited string with R's regex

I have the following list of strings:
name <- c("hsa-miR-555p","hsa-miR-519b-3p","hsa-let-7a")
What I want to do is for each of the above strings
replace the text after second delimiter (-) with "zzz".
Yielding:
hsa-miR-zzz
hsa-miR-zzz
hsa-let-zzz
What's the way to do it?
Might as well use something like:
gsub("^((?:[^-]*-){2}).*", "\\1zzz", name)
(?:[^-]*-) is a non-capturing group which consists of several non-dash characters followed by a single dash character and the {2} just after means this group occurs twice only. Then, match everything else for the replacement. Note I used an anchor just in case to avoid unintended substitutions.
Perhaps something like this:
> gsub("([A-Za-z]+-)([A-Za-z]+-)(.*)", "\\1\\2zzz", name)
[1] "hsa-miR-zzz" "hsa-miR-zzz" "hsa-let-zzz"
There are actually several ways to approach this, depending on how "regular" your expressions actually are. For example, do they all start with "hsa-"? What are the options for the "middle" group? Might there be more than three dashes?