Regex optional groups for matching different sections of optional data

I have the following type of data
Apple:Red
Kiwi:brown,Box:no
Grapes:"Black,Green",Box:yes,qty:55
I created this regex:
(.*):(.*)(?:(,)?),(Box):(.*),(qty):(.*)
but the issue is that it matches only the 3rd line.
The first line has one set of data, the second line has one comma, and the third line has two commas. In other words, I have up to 3 sets of data and I need all the keys and values in capture groups. How can I make the section after each comma optional, so that I can match all 3 lines?

You can nest the later sections in optional non-capturing groups:
([^:\n]*):("[^"]*"|[^,\n]*)(?:,Box:([^,\n]+)(?:,qty:(\d+))?)?
([^:\n]*) Capture any character except : or \n into capture group 1
: Match this literally
("[^"]*"|[^,\n]*) Capture either of the following into capture group 2
  "[^"]*" Match ", then any character except " any number of times, then "
  [^,\n]* Match any character except , or \n any number of times
(?:,Box:([^,\n]+)(?:,qty:(\d+))?)? Optionally match the following
  ,Box: Match this literally
  ([^,\n]+) Capture any character except , or \n one or more times into capture group 3
  (?:,qty:(\d+))? Optionally match the following
    ,qty: Match this literally
    (\d+) Capture one or more digits into capture group 4
Results in the following:
Apple
Red
Kiwi
brown
no
Grapes
"Black,Green"
yes
55
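For anyone who wants to try the pattern outside an online tester, here is a minimal Python sketch. The helper name parse is made up for illustration, and fullmatch is used on the assumption that each input line should be consumed completely. Groups that did not take part in the match come back as None.

```python
import re

# Pattern from the answer: the Box and qty sections sit in nested optional groups.
PATTERN = re.compile(
    r'([^:\n]*):("[^"]*"|[^,\n]*)(?:,Box:([^,\n]+)(?:,qty:(\d+))?)?'
)

lines = [
    'Apple:Red',
    'Kiwi:brown,Box:no',
    'Grapes:"Black,Green",Box:yes,qty:55',
]


def parse(line):
    """Return the four capture groups, or None if the line does not match.

    'parse' is a hypothetical helper name, not part of the original answer.
    """
    match = PATTERN.fullmatch(line)
    return match.groups() if match else None


for line in lines:
    print(parse(line))
```

Optional groups that were skipped are reported as None, so the caller can tell "no Box section" apart from an empty value.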

Related

How to conditionally expect particular characters if a prior regex matched?

I want to expect some characters only if a prior regex matched. If not, no characters (empty string) is expected.
For instance, if after the first four characters appears a string out of the group (A10, B32, C56, D65) (kind of enumeration) then a "_" followed by a 3-digit number like 123 is expected. If no element of the mentioned group appears, no other string is expected.
My first attempt was this but the ELSE branch does not work:
^XXX_(?<DT>A12|B43|D14)(?(DT)(_\d{1,3})|)\.ZZZ$
XXX_A12_123.ZZZ --> match
XXX_A11.ZZZ --> match
XXX_A12_abc.ZZZ --> no match
XXX_A23_123.ZZZ --> no match
These are examples of filenames. If the filename contains a string of the mentioned group like A12 or C56, then I expect this element to be followed by an underscore and 1 to 3 digits. If the filename does not contain a string of that group (no character or a character sequence different from the strings in the group) then I don't want to see the underscore followed by 1 to 3 digits.
For instance, I could extend the regex to
^XXX_(?<DT>A12|B43|D14)_\d{5}(?(DT)(_\d{1,3})|)_someMoreChars\.ZZZ$
...and then I want these filenames to be valid:
XXX_A12_12345_123_wellDone.ZZZ
XXX_Q21_00000_wellDone.ZZZ
XXX_Q21_00000_456_wellDone.ZZZ
...but this is invalid:
XXX_A12_12345_wellDone.ZZZ
How can I make the ELSE branch of the conditional statement work?
In the end I intend to have two groups like
Group A: (A11, B32, D76, R33)
Group B: (A23, C56, H78, T99)
If an element of group A occurs in the filename then I expect to find _\d{1,3} in the filename.
If an element of group B occurs in the filename then the _\d{1,3} shall be optional (it may or may not occur in the filename).
I ended up in this regex:
^XXX_(?:(?<DT>A12|B43|D14))?(?(DT)(_\d{5}_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).*\.ZZZ$
^XXX_(?:(?<DT>A12|B43|D14))?_\d{5}(?(DT)(_\d{1,3})|(?!(?&DT))(?!.*_\d{3}(?!\d))).+\.ZZZ$
Since I have to use this regex in the OpenAPI @Pattern annotation, I run into this error:
Conditionals are not supported in this regex dialect.
As @The fourth bird suggested, alternation seems to do the trick:
XXX_((((A12|B43|D14)_\d{5}_\d{1,3}))|((?:(A10|B10|C20)((?:_\d{5}_\d{3})|(?:_\d{3}))))).*\.ZZZ$
The else branch is the part after the |, but if you also want to match the 2nd example, the if clause will not work, because by that point you have already matched one of A12|B43|D14.
The named capture group is not optional, so the condition is always true.
What you can do instead is use an alternation to match either the enumeration part followed by an underscore and 1 to 3 digits, or an uppercase char and 2 digits.
^XXX_(?:(?<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
If you want to make use of the if/else clause, you can make the named capture group optional, and then check whether group DT has been set.
^XXX_(?<DT>A12|B43|D14)?(?(DT)_\d{1,3}|[A-Z]\d{2})\.ZZZ$
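As a rough check of both variants, the sketch below runs them against the four filenames from the question. It assumes Python's re module, which writes named groups as (?P<DT>...) but uses the same (?(DT)yes|no) conditional syntax. The helper name accepts is made up for illustration.

```python
import re

# Same two patterns as above, with the named group in Python's (?P<name>...) spelling.
ALTERNATION = re.compile(r'^XXX_(?:(?P<DT>A12|B43|D14)_\d{1,3}|[A-Z]\d{2})\.ZZZ$')
CONDITIONAL = re.compile(r'^XXX_(?P<DT>A12|B43|D14)?(?(DT)_\d{1,3}|[A-Z]\d{2})\.ZZZ$')

# Filenames and expected outcomes taken from the question.
samples = {
    'XXX_A12_123.ZZZ': True,
    'XXX_A11.ZZZ': True,
    'XXX_A12_abc.ZZZ': False,
    'XXX_A23_123.ZZZ': False,
}


def accepts(pattern, filename):
    """Hypothetical helper: True if the compiled pattern matches the filename."""
    return pattern.search(filename) is not None


for name, expected in samples.items():
    print(name, accepts(ALTERNATION, name), accepts(CONDITIONAL, name), expected)
```

Both variants agree on these samples, so the alternation form is a reasonable fallback in dialects that reject conditionals.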
For the updated question:
^XXX_(?<DT>A12|B43|D14)?(?(DT)(?:_\d{5})?_\d{3}(?!\d)|(?!A12|B43|D14|[A-Z]\d{2}_\d{3}(?!\d))).*\.ZZZ$
The pattern matches:
^ Start of string
XXX_ Match literally
(?<DT>A12|B43|D14)? Optional named group DT, matching one of the alternatives
(?(DT) If group DT is set
  (?:_\d{5})? Optionally match _ and 5 digits
  _\d{3}(?!\d) Match _ and 3 digits, not followed by another digit
  | Or
  (?! Negative lookahead, assert that what is directly to the right is not
    A12|B43|D14| One of the alternatives, or
    [A-Z]\d{2}_\d{3}(?!\d) 1 char A-Z, 2 digits, _ and 3 digits not followed by a digit
  ) Close lookahead
) Close if clause
.* Match the rest of the line
\.ZZZ Match . and ZZZ
$ End of string
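To see how the pattern for the updated question behaves, here is a small Python sketch using the four filenames the question lists as valid or invalid. Python's (?P<DT>...) spelling is assumed for the named group, and the helper name is_valid is made up for illustration.

```python
import re

# Pattern for the updated question, with the named group in Python syntax.
UPDATED = re.compile(
    r'^XXX_(?P<DT>A12|B43|D14)?'
    r'(?(DT)(?:_\d{5})?_\d{3}(?!\d)'
    r'|(?!A12|B43|D14|[A-Z]\d{2}_\d{3}(?!\d)))'
    r'.*\.ZZZ$'
)


def is_valid(filename):
    """Hypothetical helper: True if the filename satisfies the pattern."""
    return UPDATED.search(filename) is not None


# Filenames from the question: the first three should be valid, the last one invalid.
for name in (
    'XXX_A12_12345_123_wellDone.ZZZ',
    'XXX_Q21_00000_wellDone.ZZZ',
    'XXX_Q21_00000_456_wellDone.ZZZ',
    'XXX_A12_12345_wellDone.ZZZ',
):
    print(name, is_valid(name))
```

The last filename is rejected because, once A12 is present, the 3-digit part is required, and the negative lookahead blocks the fallback through the else branch.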

Match certain string on second line of text with regex

I'm new to regex, and would appreciate some guidance/help.
Currently, I'm looking to write an expression, that derives a certain part of text from the 2nd line of the provided text.
Here is the text:
123 anywhere Avenue
Winnipeg, Manitoba R3E 0L7
Canada
Pharmacy Manager: person person
Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd.
My goal is to derive the 'Manitoba' string from the second line, however I'd like to make it dynamic rather than writing an expression to always fetch Manitoba as a static. I used the below code to target the second line:
(.*)(?=(\n.*){3}$)
(It matches 3 lines up from the last line, thus targeting the desired line)
I noticed that, within the dataset, the province (Manitoba) is always between two spaces.
Is there any addition I can make to the expression so that it only targets the second line and then matches the first string between spaces?
Perhaps using a lazy expression with a positive lookaround?
If I target all matches between spaces, it would take both 'Manitoba' and 'R3E 0L7', which I don't want.
I want it to match only the first piece of text between spaces on the second line.
Any help is much appreciated :-)
Thanks.
One option could be to match the first line, then capture the second word of the second line in capturing group 1.
Then match the rest of the second line and assert that what follows is 3 more lines, up to the end of the string.
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)
In parts:
^ Start of string
.*\r?\n Match the whole first line and a newline
\S+ Match 1+ non whitespace chars (the first "word")
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
(\S+) Capture group 1, match 1+ times a non whitespace char (the second "word")
.* Match the rest of the line
(?= Positive lookahead, assert what follows on the right is
(?:\r?\n.*){3}$ Match 3 times a newline followed by 0+ times any except a newline and assert the end of the string
) Close lookahead
You could also turn the lookahead into a match instead:
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?:\r?\n.*){3}$
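A quick way to try the lookahead variant is the Python sketch below, run against the address block from the question. The helper name province is made up for illustration. No flags are set, so ^ and $ anchor to the whole string, as the explanation above assumes.

```python
import re

# Lookahead variant: group 1 holds the second word of the second line.
SECOND_WORD = re.compile(r'^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)')

address = (
    "123 anywhere Avenue\n"
    "Winnipeg, Manitoba R3E 0L7\n"
    "Canada\n"
    "Pharmacy Manager: person person\n"
    "Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd."
)


def province(text):
    """Hypothetical helper: return the captured word, or None if the layout differs."""
    match = SECOND_WORD.search(text)
    return match.group(1) if match else None


print(province(address))
```

Because the pattern insists on exactly three lines after the second one, input with a different number of lines yields None rather than a wrong word.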

Is there a regex for adding the first 4 characters to end of string and the last 4 characters to start of string?

I have some lines which I need to alter. They are protein sequences. How would I copy the first 4 characters of the line to the end of the line, and also copy the last 4 characters to the beginning of the line?
The strings are variable which complicates it, for example:
>X
LTGLGIGTGMAATIINAISVGLSAATILSLISGVASGGAWVLAGAKQALKEGGKKAGIAF
>Y
LVATGMAAGVAKTIVNAVSAGMDIATALSLFSGAFTAAGGIMALIKKYAQKKLWKQLIAA
Moreover, how could I exclude lines with a '>' at the beginning (these are names of the corresponding sequence)?
Does anyone know a regex which will allow this to work?
I've already tried some regex solutions but I'm not very experienced with this sort of thing and I can find the end string but can't get it to replace:
Find:
(...)$
Replace:
^$2$1"
An example of what I want to achieve is:
>1
ABCDEFGHIJKLMNOPQRSTUVWXYZ
becomes:
>1
WXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCD
Thanks
Try doing a find, in regex mode, on the following pattern:
^([A-Z]{4}).*([A-Z]{4})$
Then replace with the first four and last four characters swapped:
$2$0$1
You can use the regex below.
^(([A-Z]{4})([A-Z]*)([A-Z]{4}))$
^ asserts the position at the start of the line, so nothing can come before it.
( is the start of a capture group, this is group 1.
( is the start of a capture group, this is group 2. This group is inside group 1.
[A-Z]{4} means exactly 4 capital characters from A to Z.
) is the end of capture group 2.
( is the start of a capture group, this is group 3.
[A-Z]* matches capital characters from A to Z between zero and infinite times.
) is the end of capture group 3.
( is the start of a capture group, this is group 4.
[A-Z]{4} means exactly 4 capital characters from A to Z.
) is the end of capture group 4.
$ asserts the position at the end of the line, so nothing can come after it.
See how it works with a replace here: https://regex101.com/r/W786uL/3.
$4$1$2
$4 means put capture group 4 here. Which is the last 4 characters.
$1 means put capture group 1 here. Which is everything in the entire string.
$2 means put capture group 2 here. Which is the first 4 characters.
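You can use
^(.{4})(.*?)(.{4})$
^ - Start of string
(.{4}) - Match any four characters except newline (group 1, the first four)
(.*?) - Match any character zero or more times (lazy mode, group 2)
(.{4}) - Match any four characters except newline (group 3, the last four)
$ - End of string
A replacement of $3$1$2$3$1 then puts the last four characters in front and the first four at the end.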
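For reference, the first answer's find/replace can be sketched in Python as follows. This assumes the whole file content is held in one string and that re.MULTILINE is set so ^ and $ work per line. Python writes the $2$0$1 replacement as \2\g<0>\1. The function name extend_sequences is made up for illustration.

```python
import re

# First four and last four uppercase letters of each line.
# Header lines starting with '>' cannot match because '>' is not in [A-Z].
EDGES = re.compile(r'^([A-Z]{4}).*([A-Z]{4})$', re.MULTILINE)


def extend_sequences(text):
    """Prefix each sequence with its last 4 letters and append its first 4.

    Hypothetical helper; \\g<0> is the whole matched line.
    """
    return EDGES.sub(r'\2\g<0>\1', text)


example = ">1\nABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(extend_sequences(example))
```

Lines shorter than eight letters are left untouched, since the pattern needs room for both four-letter groups.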

I need to combine multiple lines starting with the same ID

I have multiple lines in a text file that I need to combine together. The file is about 200 million lines long, so opening it with Excel and using their built-in tools is out of the picture.
The first set of lines looks like this:
1,example#gmail.com,Username
3,example#gmail.com,Username
4,example#gmail.com,Username
5,example#gmail.com,Username
9,example#gmail.com,Username
10,example#gmail.com,Username
The second set, whose values I want to append to the end of the matching lines in the first set, is:
1,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
If anyone has experience with this, I'd love some help
Code
Regex
^(\d+),(.*$)(?=[\s\S]*^\1,(.*))
Formatting output
$1,$2,$3
Results
Input
1,example#gmail.com,Username
3,example#gmail.com,Username
4,example#gmail.com,Username
5,example#gmail.com,Username
9,example#gmail.com,Username
10,example#gmail.com,Username
1,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
Output
1,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
3,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
4,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
5,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
9,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
10,example#gmail.com,Username,$2a$10$gdsZkf62vUfwHQX8pUGe2.7zqvBvcIPWseaJmboJw3U2sxDj18y5q
Explanation
^ Assert position at the start of a line
(\d+) Capture one or more digits into capture group 1
, Match the comma character , literally
(.*$) Capture any number of any character (except newline characters) until the asserted position at the end of the line (asserting end of line position dramatically reduces steps) into capture group 2
(?=[\s\S]*^\1,(.*)) Positive lookahead asserting what follows matches
  [\s\S]* Match any number of any character (\s: any whitespace character; \S: any non-whitespace character)
  ^ Assert position at the start of a line
  \1 Matches the same text as most recently matched by the 1st capturing group
  , Matches the comma character , literally
  (.*) Capture any number of any character into capture group 3
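Here is a minimal Python sketch of the same regex, assuming the multiline flag is on and the whole input fits in one string. The sample values (hash-for-1 and so on) and the function name merge_by_id are placeholders invented for illustration, not data from the question. Each match is formatted as $1,$2,$3, mirroring the formatting output shown above.

```python
import re

# Same pattern as above; MULTILINE makes ^ and $ work per line.
MERGE = re.compile(r'^(\d+),(.*$)(?=[\s\S]*^\1,(.*))', re.MULTILINE)


def merge_by_id(text):
    """Return one merged line per ID that also appears further down.

    Hypothetical helper: group 3 is captured inside the lookahead.
    """
    return [f"{m.group(1)},{m.group(2)},{m.group(3)}" for m in MERGE.finditer(text)]


# Placeholder sample data: three IDs, each with a distinct second-set value.
sample = "\n".join([
    "1,example#gmail.com,Username",
    "3,example#gmail.com,Username",
    "10,example#gmail.com,Username",
    "1,hash-for-1",
    "3,hash-for-3",
    "10,hash-for-10",
])

for line in merge_by_id(sample):
    print(line)
```

The lookahead rescans the rest of the input for every line, so on a file of roughly 200 million lines this is likely to be very slow. A dictionary keyed by ID, filled while streaming the file, may be the more practical route at that size.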

I need a regex to validate a name that can be 1, 2, or 3 words

In this example I try to validate for a city name. It works if I enter San Louis Obispo but not if I enter Boulder Creek or Boulder. I thought ? was supposed to make a block optional.
if (!/^[a-zA-Z'-]+\s[a-zA-Z'-]*\s([a-zA-Z']*)?$/.test(field)){
return "Enter City only a-z A-Z .\' allowed and not over 20 characters.\n";
}
I think the spaces (\s) are the problem. You made the second and third words optional (by using * instead of +), but not the spaces. The question mark only applies to the third word because of the parentheses.
The issue with your regex is that, in plain English, it says: match a word that must be followed by a space, optionally followed by another word, then another required space, then optionally a third word. So a single word would not match, but a word followed by two spaces would. Two words with a trailing space would also match, but neither case matches without the trailing spaces.
To fix your exact regex, wrap everything from the second word to the end in another group, a non-capturing one that uses (?: instead of a plain (, and make that group optional with ?. Also move each \s inside its optional group.
Try this:
^[a-zA-Z'-]+(?:\s[a-zA-Z'-]+(?:\s[a-zA-Z']+)?)?$
Regex explained:
^ # beginning of line
[a-zA-Z'-]+ # first matching word
(?: # start of optional group for the second word
  \s[a-zA-Z'-]+ # space followed by matching word
  (?: # start of optional group for the third word
    \s[a-zA-Z']+ # space followed by matching word
  )? # third word is optional
)? # second word (and everything after it) is optional
$ # end of line
Alternatively, you can try the following regex:
^([a-zA-Z'-]+(?:\s[a-zA-Z'-]+){0,2})$
This will match 1 through 3 words, or "cities", in a given line with the ability to adjust the range of words without having to further-duplicate the matching set for each new word.
Regex explained:
^( # start of line & capturing group
  [a-zA-Z'-]+ # required first matching word
  (?: # start a non-capturing group (it must match, but is not returned as an individual group)
    \s # sub-group required to start with a space
    [a-zA-Z'-]+ # sub-group matching word
  ){0,2} # sub-group can match 0 to 2 times
)$ # end of capturing group & line
So, if you want to add the ability to match more than 3 words, you can change the 2 in the {0,2} range above to be the number of words you want to match minus 1 (i.e. if you want to match 4 words, you'll set it to {0,3}).
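A small Python sketch of the {0,2} variant, for anyone who wants to test inputs quickly. The function name is_valid_city is made up for illustration, and fullmatch is used on the assumption that trailing newlines should be rejected too. The outer capturing group is dropped here because only a yes/no answer is needed.

```python
import re

# One required word, then up to two more words, each preceded by a single whitespace char.
CITY = re.compile(r"^[a-zA-Z'-]+(?:\s[a-zA-Z'-]+){0,2}$")


def is_valid_city(name):
    """Hypothetical helper: True for names made of one to three words."""
    return CITY.fullmatch(name) is not None


for candidate in ("San Louis Obispo", "Boulder Creek", "Boulder", "Boulder ", "One Two Three Four"):
    print(repr(candidate), is_valid_city(candidate))
```

The trailing-space and four-word inputs are rejected, while the one, two and three word names from the question pass.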