Regex to match three words separated by two commas

I am trying to match at least three words separated by two commas. So far I have managed to match two words with one comma using
/([A-z]|[0-9])(,{1})([A-z]|[0-9])/
How can I add another comma and word to this? I tried repeating the same pattern, but it did not work.

/^(?:\w+,){2,}(?:\w+)$/
This matches a comma-separated list of at least three words, where a word is \w+ (that is, [a-zA-Z0-9_]+).
/^\s*(?:\w+\s*,\s*){2,}(?:\w+\s*)$/
This is a slightly more user-friendly version of the first, allowing whitespace around the words and commas.
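As a quick check of the two patterns above, here is a minimal Python sketch. The function name and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# The two patterns from the answer above, written as Python raw strings.
STRICT = re.compile(r"^(?:\w+,){2,}(?:\w+)$")
LENIENT = re.compile(r"^\s*(?:\w+\s*,\s*){2,}(?:\w+\s*)$")

def has_three_words(text, pattern=STRICT):
    """Return True if text is a comma-separated list of at least three words."""
    return pattern.search(text) is not None

# Illustrative inputs: (strict result, lenient result) per sample.
samples = ["a,b,c", "one,two", "one, two , three", "a,,c"]
results = {s: (has_three_words(s), has_three_words(s, LENIENT)) for s in samples}
```

Only the lenient pattern accepts "one, two , three", and both reject "a,,c" because \w+ needs at least one character between the commas.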

If it's a Perl-derived regex flavor, as most implementations I've encountered are, /[^,]+(?:,[^,]+){2,}/ tests well against any string that contains at least two commas, provided that each comma has something on both sides of it. The (?:) construct groups without capturing. The {2,} quantifier requires two or more repetitions of the preceding group. Note that the pattern is unanchored, so it also succeeds when only part of the string matches. In JavaScript, you can test it:
/[^,]+(?:,[^,]+){2,}/.test("hello,world,whats,up,next"); // returns true
/[^,]+(?:,[^,]+){2,}/.test("hello,world"); // returns false

Try this one:
([a-zA-Z0-9]+)(,[a-zA-Z0-9]+){2,}

A few general suggestions from a performance perspective:
Don't use a [ ]|[ ] alternation - you can put several character ranges inside one [ ], e.g. [A-Za-z0-9]
Don't overuse () - each pair usually stores the captured text, which adds overhead. If you only need to group a few pieces together, use a grouping operator that does not store the match (typically (?: ... ) )
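The two suggestions above can be sketched in Python as follows. The variable names are illustrative, and the snippet only shows that the rewritten patterns accept the same input with fewer capture groups; it makes no claim about actual timings.

```python
import re

# Alternation of two classes (as in the question) vs. one merged class.
alternation = re.compile(r"([A-Za-z]|[0-9])+")
merged = re.compile(r"[A-Za-z0-9]+")

# Capturing vs. non-capturing grouping of the repeated ",word" part.
capturing = re.compile(r"([a-zA-Z0-9]+)(,[a-zA-Z0-9]+){2,}")
non_capturing = re.compile(r"[a-zA-Z0-9]+(?:,[a-zA-Z0-9]+){2,}")

# Both spellings accept the same sample input.
same_match = (alternation.fullmatch("abc123") is not None
              and merged.fullmatch("abc123") is not None)

# Number of capture groups each pattern has to track.
group_counts = (capturing.groups, non_capturing.groups)

# [A-z] spans ASCII 65..122, so it also accepts the punctuation
# that sits between 'Z' and 'a', such as '^'.
a_to_z_accepts_caret = re.fullmatch(r"[A-z]", "^") is not None
```

The last line is also a reason to prefer [A-Za-z] over the [A-z] used in the question.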

This should solve your problem. Try this:
([a-zA-Z0-9]+,[a-zA-Z0-9]+,([a-zA-Z0-9]+))

Related

Regexp - Get everything before two different strings. One can contain both

I have to use regexp.
Current state:
.+?((/=\.czxy)|(?=\.zzzz))
It's working for the first two cases (that's obvious)
So I have decided to do something like this:
.+?((/=\.czxy)|(?=\.zzzz)|(?=\-\-[0-9]))
But this still doesn't work. (There is OR).
I want to have everything before the extension. (Examples 1 and 2)
When the string ends with '--1', '--2', '--3' and so on before the extension, I need everything before that. (Examples 3 and 4)
Note: I cannot use an if construct.
Examples:
123_abc_cb1.czxy -> 123_abc_cb1
123_23c_cb1.zzzz -> 123_23c_cb1
123_abc_cb1--1.czxy -> 123_abc_cb1
123_23c_cb1--1.zzzz -> 123_23c_cb1
EDIT:
123_abc_cb1 is a random combination of letters, numbers and special characters; it can contain anything.
Your attempt has these issues:
A typo: (/= should be (?=
The regex does not require that the --[0-9] part is still followed by the extension. That part should actually be an optional part that precedes the pattern for the extension.
So change to this:
^.+?(?=(?:--\d)?\.(?:czxy|zzzz))
Or, if matches do not necessarily start at the start of the input/line:
(?<!\S).+?(?=(?:--\d)?\.(?:czxy|zzzz))
You don't need any lookarounds if you can use a capture group. To match letters, digits and underscores you can use, for example, \w:
(\w+)(?:--\d+)?\.(?:czxy|zzzz)\b
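To compare the lookahead and capture-group approaches on the four examples from the question, here is a minimal Python sketch. The variable names are assumptions for illustration; the patterns are the ones given above.

```python
import re

# Lookahead version: the match itself is the wanted prefix.
LOOKAHEAD = re.compile(r"^.+?(?=(?:--\d)?\.(?:czxy|zzzz))")
# Capture-group version: the wanted prefix is group 1.
CAPTURE = re.compile(r"(\w+)(?:--\d+)?\.(?:czxy|zzzz)\b")

# The four examples from the question.
names = [
    "123_abc_cb1.czxy",
    "123_23c_cb1.zzzz",
    "123_abc_cb1--1.czxy",
    "123_23c_cb1--1.zzzz",
]

lookahead_results = [LOOKAHEAD.search(n).group(0) for n in names]
capture_results = [CAPTURE.search(n).group(1) for n in names]
```

Both lists come out identical for these inputs. Since the question's edit says the prefix may contain special characters, the \w-based version would need a wider class for such names, while the lookahead version accepts any characters.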
Why not use the recurring "_cb1" suffix?
/.*_cb1/

Regex capture into group everything from string except part of string

I'm trying to create a regex that captures everything from a string except for specific parts of it. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group; this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a contiguous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is easily solved by replacing the unwanted words with empty strings, like this:
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
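In Python, the substitution above could be written with re.sub. This is a minimal sketch; the function name strip_words and the extra sample strings are illustrative assumptions.

```python
import re

# The \b-guarded alternation from the answer above.
UNWANTED = re.compile(r"-production\b|\bproduction-|-public\b|\bpublic-")

def strip_words(text):
    """Remove 'production' and 'public' together with one adjacent dash."""
    return UNWANTED.sub("", text)

cleaned = [
    strip_words("california-public-local-card-production"),
    strip_words("production-nevada-public"),
]

# Words that merely contain the unwanted words are left alone.
untouched = [strip_words("subproduction-x"), strip_words("a-publication")]
```

Each alternative removes exactly one dash with the word, so the remaining parts stay joined by a single dash.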
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.

How can I match groups separated by other groups in regex?

I am writing a regex to match a list of items that follow a specific complex format, so the regex for that is very long. The items on this list have to be separated by either a comma, which can optionally be padded with either one space on the right or spaces on both sides, so the regex for matching the delimiter is ( , )|(, ?). Also, I want the list to be between square brackets.
For example, it should match the following:
[]
[validItem]
[validItem,validItem, validItem]
But not the following:
[validItem,invalidItem]
[validItemvalidItem]
[validItem, validItem ]
The regex I currently have is: \[verylongregex(?:(?: , )|(?:, ?)verylongregex)*\], but I'd like to simplify this to include the regex pattern that matches the element format only once.
Does regex have a method to match X groups separated by another group?
Here is an answer. I don't know if it is what you are looking for, but here it is nonetheless.
1/ Assuming you want to capture the list in one group:
(\[(?:complexRegex(?: , |, ?|\]))+)
Demo: http://regex101.com/r/pW2oZ1/1
2/ Assuming you want all elements of the list matched separately, this is a much more complex thing (at least to my knowledge...). Here is a working (complex) solution:
(?:\[|(?!\[)\G(?: , |, ?))(complexRegex)(?=(?:(?: , |, ?)complexRegex)*\])
Demo: http://regex101.com/r/iB3jD1/2
I don't have the time to write an explanation right now if it's needed. Ask for it in the comments if you want one, I'll write it later today. Sorry...
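If the goal is mainly to write the long item pattern only once in the source, one option is to build the regex from pieces in the host language. The following Python sketch assumes a hypothetical stand-in, ITEM, for the long item pattern, and uses the delimiter from the question; it is not the method from the answer above.

```python
import re

# Hypothetical stand-in for the very long item pattern from the question.
ITEM = r"validItem"
# Delimiter from the question: " , " or "," with an optional space after it.
DELIM = r"(?: , |, ?)"

# The item pattern is written once and interpolated twice;
# the outer optional group also allows the empty list "[]".
LIST_RE = re.compile(rf"\[(?:{ITEM}(?:{DELIM}{ITEM})*)?\]")

def is_valid_list(text):
    """Return True if text is a bracketed, correctly delimited list of items."""
    return LIST_RE.fullmatch(text) is not None

accepted = [is_valid_list(s) for s in
            ["[]", "[validItem]", "[validItem,validItem, validItem]"]]
rejected = [is_valid_list(s) for s in
            ["[validItem,invalidItem]", "[validItemvalidItem]", "[validItem, validItem ]"]]
```

The compiled pattern still contains the item pattern twice, but the source defines it in one place, so a change to ITEM cannot get out of sync.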

Replacing part of delimited string with R's regex

I have the following list of strings:
name <- c("hsa-miR-555p","hsa-miR-519b-3p","hsa-let-7a")
What I want to do is for each of the above strings
replace the text after second delimiter (-) with "zzz".
Yielding:
hsa-miR-zzz
hsa-miR-zzz
hsa-let-zzz
What's the way to do it?
Might as well use something like:
gsub("^((?:[^-]*-){2}).*", "\\1zzz", name)
(?:[^-]*-) is a non-capturing group that matches zero or more non-dash characters followed by a single dash, and the {2} right after it means the group must occur exactly twice. The outer parentheses capture that prefix as \1, and .* then matches everything else so that it gets replaced. Note that I used the ^ anchor to avoid unintended substitutions elsewhere in the string.
Perhaps something like this:
> gsub("([A-Za-z]+-)([A-Za-z]+-)(.*)", "\\1\\2zzz", name)
[1] "hsa-miR-zzz" "hsa-miR-zzz" "hsa-let-zzz"
There are actually several ways to approach this, depending on how "regular" your expressions actually are. For example, do they all start with "hsa-"? What are the options for the "middle" group? Might there be more than three dashes?
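For readers outside R, the first gsub call above ports directly to Python's re.sub. This is a sketch of the same idea in another language, not part of the original answers; the variable names are illustrative.

```python
import re

names = ["hsa-miR-555p", "hsa-miR-519b-3p", "hsa-let-7a"]

# Capture everything up to and including the second dash,
# then replace the rest of the string with "zzz".
PREFIX = re.compile(r"^((?:[^-]*-){2}).*")

replaced = [PREFIX.sub(r"\1zzz", n) for n in names]
```

Because the pattern only counts dashes, it also handles "hsa-miR-519b-3p", where the part after the second dash contains another dash.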

Can I shorten this regular expression?

I have the need to check whether strings adhere to a particular ID format.
The format of the ID is as follows:
aBcDe-fghIj-KLmno-pQRsT-uVWxy
A sequence of five blocks of five letters upper case or lower case, separated by one dash.
I have the following regular expression that works:
string idFormat = "[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}";
Note that there is no trailing dash, but all of the blocks within the ID follow the same format. Therefore, I would like to be able to represent this sequence of four blocks with a trailing dash inside the regular expression and avoid the duplication.
I tried the following, but it doesn't work:
string idFormat = "[[a-zA-Z]{5}[-]{1}]{4}[a-zA-Z]{5}";
How do I shorten this regular expression and get rid of the duplicated parts?
What is the best way to also ensure that no block contains any digits?
Edit:
Thanks for the replies, I now understand the grouping in regular expressions.
I'm running a few tests against the regular expression, the following are relevant:
Test 1: aBcDe-fghIj-KLmno-pQRsT-uVWxy
Test 2: abcde-fghij-klmno-pqrst-uvwxy
With the following regular expression, both tests pass:
^([a-zA-Z]{5}-){4}[a-zA-Z]{5}$
With the next regular expression, test 1 fails:
^([a-z]{5}-){4}[a-z]{5}$
Several answers have said that it is OK to omit the A-Z when using a-z, but in this case it doesn't seem to be working.
You can try:
([a-z]{5}-){4}[a-z]{5}
and make it case insensitive.
If you can set regex options to be case insensitive, you could replace all [a-zA-Z] with just plain [a-z]. Furthermore, [-]{1} can be written as -.
Your grouping should be done with (, ), not with [, ] (although you're correctly using the latter in specifying character sets).
Depending on context, you probably want to throw in ^...$ which matches start and end of string, respectively, to verify that the entire string is a match (i.e. that there are no extra characters).
In JavaScript, something like this:
/^([a-z]{5}-){4}[a-z]{5}$/i
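The failing test from the question's edit comes down to the case-insensitive flag: [a-z] alone only matches lower case unless that option is set. A minimal Python sketch, with illustrative names, where re.fullmatch plays the role of the ^...$ anchors:

```python
import re

BODY = r"(?:[a-z]{5}-){4}[a-z]{5}"

# Same pattern, with and without the case-insensitive option.
ID_RE = re.compile(BODY, re.IGNORECASE)
CASE_SENSITIVE = re.compile(BODY)

def is_valid_id(text, pattern=ID_RE):
    """Return True if the whole string is five dash-separated blocks of five letters."""
    return pattern.fullmatch(text) is not None

test1 = "aBcDe-fghIj-KLmno-pQRsT-uVWxy"
test2 = "abcde-fghij-klmno-pqrst-uvwxy"

with_flag = (is_valid_id(test1), is_valid_id(test2))
without_flag = (is_valid_id(test1, CASE_SENSITIVE), is_valid_id(test2, CASE_SENSITIVE))
```

Digits are rejected in both variants, since the character class contains letters only.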
This works for me, though you might want to check it:
[a-zA-Z]{5}(-[a-zA-Z]{5}){4}
(One group of five letters, followed by [dash+group of five letters] four times)
([a-zA-Z]{5}[-]{1}){4}[a-zA-Z]{5}
Try
string idFormat = "([a-zA-Z]{5}[-]{1}){4}[a-zA-Z]{5}";
I.e. you basically replace your brackets by parentheses. Brackets are not meant for grouping but for defining a class of accepted characters.
However, be aware that with shortened versions, you can use the expression for validating the string, but not for analyzing it. If you want to process the 5 groups of characters, you will want to put them in 5 groups:
string idFormat =
"([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})";
so you can address each group and process it.
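The difference between validating and analyzing can be sketched in Python as follows. The names are illustrative, and the five-group pattern is built by joining one block pattern, which is an assumption of this sketch rather than something from the answer.

```python
import re

sample = "aBcDe-fghIj-KLmno-pQRsT-uVWxy"

# Five explicit capture groups, built from a single block pattern.
BLOCK = r"([a-zA-Z]{5})"
ID_GROUPS = re.compile("-".join([BLOCK] * 5))

m = ID_GROUPS.fullmatch(sample)
blocks = m.groups() if m else ()

# With the shortened pattern, a repeated group keeps only its last repetition.
SHORT = re.compile(r"([a-zA-Z]{5}-){4}[a-zA-Z]{5}")
last_repetition = SHORT.fullmatch(sample).group(1)
```

Here blocks holds all five parts, while the shortened pattern only exposes the fourth block with its trailing dash.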