Regex that matches every nth occurrence of a character - regex

I have found solutions for matching the nth occurrence of a character, but nothing about matching every nth occurrence.
I have string such as "key1~value1~key2~value2~key3~value3~".
What is the regex that will match every second occurrence of the ~?
key1~value1~key2~value2~key3~value3~
I am trying to create a custom Pattern Analyzer for Elasticsearch, so the regex should match the token separators rather than the tokens.

You may use
~(?=(?:[^~]*~[^~]*~)*[^~]*$)
The pattern matches:
~ - a tilde that is followed by...
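(?=(?:[^~]*~[^~]*~)*[^~]*$) - a positive lookahead that requires zero or more repetitions of (0+ non-tildes, ~, 0+ non-tildes, ~), and then 0+ non-tildes up to the end of the string. So, this check makes sure the matched tilde is followed by an even number of tildes. Note that this selects every second tilde only when the string contains an even number of tildes in total, as in the sample.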
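As a quick check of the lookahead pattern, here is a minimal JavaScript sketch that splits the sample string on the matched separators. The variable names are only illustrative, and Elasticsearch itself uses Java regex, so treat this as a demonstration of the pattern rather than of the analyzer.

```javascript
// Sample string from the question; it contains six tildes (an even number).
const sample = "key1~value1~key2~value2~key3~value3~";

// A tilde that is followed by an even number of tildes up to the end of the string.
const everySecondTilde = /~(?=(?:[^~]*~[^~]*~)*[^~]*$)/;

// Splitting on the separator keeps each key~value pair together.
// The trailing empty string appears because the sample ends with a separator.
const tokens = sample.split(everySecondTilde);
console.log(tokens);
// ["key1~value1", "key2~value2", "key3~value3", ""]
```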

You need to ensure that the number of ~ before the match is not even. This relies on variable-length lookbehind, which not every regex engine supports:
(?<!^([^~]*~[^~]*~)*[^~]*)~
How it works:
(?<!^([^~]*~[^~]*~)*[^~]*)~ Our regex.
~ Matches a tilde (~).
(?<! ) Assert that before it is not:
^ the beginning
( )* followed by zero or more times:
[^~]*~[^~]*~ two tildes, no matter what comes within
[^~]* followed by non-tildes.
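To see how the lookbehind variant differs from the lookahead one, the sketch below lists the match positions for both. It assumes a JavaScript engine with lookbehind support (ES2018 or later); the helper `indexes` and the short string `odd` are made up for illustration.

```javascript
// Lookbehind variant: the text before the tilde must NOT contain an even number of tildes.
const lookbehind = /(?<!^([^~]*~[^~]*~)*[^~]*)~/g;
// Lookahead variant: the tilde must be followed by an even number of tildes.
const lookahead = /~(?=(?:[^~]*~[^~]*~)*[^~]*$)/g;

// Hypothetical helper: return the index of every match of re in s.
const indexes = (re, s) => [...s.matchAll(re)].map((m) => m.index);

const even = "key1~value1~key2~value2~key3~value3~"; // six tildes
const odd = "a~b~c~d"; // three tildes

console.log(indexes(lookbehind, even)); // [11, 23, 35]
console.log(indexes(lookahead, even)); // [11, 23, 35]
console.log(indexes(lookbehind, odd)); // [3]     (the 2nd tilde)
console.log(indexes(lookahead, odd)); // [1, 5]  (the 1st and 3rd tildes)
```

Counting from the start (lookbehind) keeps picking the 2nd, 4th, ... tilde even when the total count is odd, while counting from the end (lookahead) does not.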

Alternatively, take the first capture group of each non-overlapping match of ~.*?(~). Try: http://regexr.com/3dc15.
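A small sketch of this capture-group approach in JavaScript, assuming you want the position of each separator. Group 1 is always the last character of the full match, so its index can be derived from the match itself; the variable names are illustrative.

```javascript
const input = "key1~value1~key2~value2~key3~value3~";

// Each match runs from an odd-numbered tilde to the next tilde; group 1 is that next tilde.
const pairs = [...input.matchAll(/~.*?(~)/g)];

// Group 1 is the final character of each full match.
const secondTildeIndexes = pairs.map((m) => m.index + m[0].length - 1);
console.log(secondTildeIndexes); // [11, 23, 35]
```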

Related

How can I get the first and last part of one word combination using regex

How can I get only the middle part of a combined name with PCRE regex?
name: 211103_TV_storyname_TYPE
result: storyname
I have used this single line: .(\d)+.(_TV_) to remove the first part: 211103_TV_
Another idea is to use (_TYPE)$, but the problem is that not all variations of the names contain a space, so I cannot treat the parts as separate words and anchor the first with ^ and the second with $.
The _TYPE and _TV_ parts of the combined name are fixed.
The numbers are changing according to the date. And the storyname is variable.
Any ideas?
Thanks
With your shown samples, please try the following regex. It creates one capturing group that contains the matched value.
.*?_TV_([^_]*)(?=_TYPE)
OR (a small variation of the above solution, following fourth bird's nice suggestion), the following works without the lazy match .*? used above:
_TV_([^_]*)(?=_TYPE)
Explanation: Adding detailed explanation for above.
.*?_ ##Using Lazy match to match till 1st occurrence of _ here.
TV_ ##Matching TV_ here.
([^_]*) ##Creating 1st capturing group which has everything before next occurrence of _ here.
(?=_TYPE) ##Making sure previous values are followed by _TYPE here.
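For illustration, here is how the capturing-group patterns behave in JavaScript on the sample name. The variable names are made up, and the result is read from group 1 rather than from the full match.

```javascript
const name = "211103_TV_storyname_TYPE";

// Variant with the leading lazy match.
const withLazy = name.match(/.*?_TV_([^_]*)(?=_TYPE)/);
// Variant without it.
const withoutLazy = name.match(/_TV_([^_]*)(?=_TYPE)/);

// The story name is in capturing group 1 in both cases.
const storyA = withLazy ? withLazy[1] : null;
const storyB = withoutLazy ? withoutLazy[1] : null;
console.log(storyA, storyB); // storyname storyname
```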
You could match as few characters as possible after _TV_ until you reach _TYPE
\d_TV_\K.*?(?=_TYPE)
\d_TV_ Match a digit and _TV_
\K Forget what is matched until now
.*? Match as few characters as possible
(?=_TYPE) Assert _TYPE to the right
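The \K construct is PCRE syntax and is not available in JavaScript. If you want to try the same idea there, a lookbehind is a reasonable stand-in; the sketch below is an adaptation under that assumption, not the PCRE pattern itself.

```javascript
// (?<=\d_TV_) plays the role of "\d_TV_\K": the prefix must be present but is not part of the match.
const middle = /(?<=\d_TV_).*?(?=_TYPE)/;

const result = "211103_TV_storyname_TYPE".match(middle);
const storyName = result ? result[0] : null;
console.log(storyName); // storyname
```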
Another option without a non greedy quantifier, and leaving out the digit at the start:
_TV_\K[^_]*+(?>_(?!TYPE)[^_]*)*(?=_TYPE)
_TV_ Match literally
\K[^_]*+ Forget what is matched until now, then possessively match zero or more characters other than _
(?>_(?!TYPE)[^_]*)* Only allow matching _ when not directly followed by TYPE
(?=_TYPE) Assert _TYPE to the right
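The point of this variant is that the story name may itself contain underscores. JavaScript supports neither \K nor possessive quantifiers nor atomic groups, so the sketch below swaps them for a lookbehind and plain groups; it is an approximation of the pattern above, and the second input is a hypothetical name.

```javascript
// Lookbehind instead of \K, ordinary quantifier and group instead of *+ and (?>...).
const storyPattern = /(?<=_TV_)[^_]*(?:_(?!TYPE)[^_]*)*(?=_TYPE)/;

// Hypothetical helper: return the matched story name or null.
const extract = (s) => {
  const m = s.match(storyPattern);
  return m ? m[0] : null;
};

console.log(extract("211103_TV_storyname_TYPE")); // storyname
console.log(extract("211103_TV_my_story_TYPE")); // my_story
```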
Edit
If you want to replace the 2 parts, you can use an alternation and replace with an empty string.
If it should be at the start and the end of the string, you can prepend ^ and append $ to the pattern.
\b\d{6}_TV_|_TYPE\b
\b\d{6}_TV_ A word boundary, match 6 digits and _TV_
| Or
_TYPE\b Match _TYPE followed by a word boundary
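A short sketch of the replace approach in JavaScript, with both the word-boundary pattern and an anchored variant. The anchored form here uses ^ and $ in place of the word boundaries, which is one way to read the suggestion above.

```javascript
const combined = "211103_TV_storyname_TYPE";

// Remove the date prefix and the type suffix; the g flag is needed to replace both parts.
const stripped = combined.replace(/\b\d{6}_TV_|_TYPE\b/g, "");

// Variant anchored to the start and end of the string.
const strippedAnchored = combined.replace(/^\d{6}_TV_|_TYPE$/g, "");

console.log(stripped, strippedAnchored); // storyname storyname
```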
Here I have added some screenshots to the post, together with the documentation that appears behind the help button, so you can see the forms and what I see.
Documentation
The regular expressions we use are based on PCRE - Perl Compatible Regular Expressions. Full specification can be found here: http://www.pcre.org and http://perldoc.perl.org/perlre.html
Summary of some useful terms:
Metacharacters
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
Quantifiers
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
Character Classes
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
Capture buffers
The bracketing construct (...) creates capture buffers. To refer to the current contents of a buffer later on, within the same pattern, use \1 for the first, \2 for the second, and so on. Outside the match use "$" instead of "\". The \<digit> notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.
Referring back to another part of the match is called a backreference.
Examples
Replace story with certain prefix letters M N or E to have the prefix "AA":
`srcPattern "(M|N|E ) ([A-Za-z0-9\s]*)"`
`trgPattern "AA$2" `
`"N StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
`"E StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
`"M StoryWord1 StoryWord2" -> "AA StoryWord1 StoryWord2"`
"NoMatchWord StoryWord1 StoryWord2" -> "NoMatchWord StoryWord1 StoryWord2" (no match found, name remains the same)

A regular expression for matching a group followed by a specific character

So I need to match the following:
1.2.
3.4.5.
5.6.7.10
((\d+)\.(\d+)\.((\d+)\.)*) works fine for the very first line, but the problem is that there can be one line or several.
\n only appears if there is more than one line.
As a string, the input looks like this: "1.2.\n3.4.5.\n1.2."
So my issue is: if there is only one line, there must be no \n at the end, but if there are several lines, every line except the very last must end with \n.
Here is the pattern I suggest:
^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$
Here is a brief explanation of the pattern:
^ from the start of the string
\d+ match a number
(?:\.\d+)* followed by dot, and another number, zero or more times
\.? followed by an optional trailing dot
(?:\n followed by a newline
\d+(?:\.\d+)*\.?)* and another dotted number sequence, zero or more times
$ end of the string
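A quick way to check this whole-input pattern is to test it against the cases from the question. The sketch below uses JavaScript without the m flag, so ^ and $ refer to the start and end of the entire string.

```javascript
// Validates the entire input: lines are separated by \n, with no \n after the last line.
const wholeInput = /^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$/;

console.log(wholeInput.test("1.2.")); // true  (single line, no newline)
console.log(wholeInput.test("1.2.\n3.4.5.\n1.2.")); // true  (newline between lines only)
console.log(wholeInput.test("1.2.\n")); // false (trailing newline after the last line)
```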
You might check if there is a newline at the end using a positive lookahead (?=.*\n):
(?=.*\n)(\d+)\.(\d+)\.((\d+)\.)*
Edit
You could use an alternation to either match when on the next line there is the same pattern following, or match the pattern when not followed by a newline.
^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))
^ Start of string
(?: Non capturing group
\d+\.\d+\. Match 1+ digits and a dot, two times
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?=.*\n\d+\.\d+\.) Positive lookahead, assert that what follows is a newline and then the start of the same pattern
| Or
\d+\.\d+\. Match 1+ digits and a dot, two times
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?!.*\n) Negative lookahead, assert that no newline follows
) Close non capturing group
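To try the alternation pattern on several lines at once, the sketch below runs it in JavaScript with the g and m flags. The flags are an assumption (the answer does not state them); with m, ^ matches at the start of each line. The helper name and the second input are illustrative.

```javascript
// Alternation pattern from above, with assumed flags: g (all matches) and m (^ at each line start).
const linePattern =
  /^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))/gm;

// Hypothetical helper: return all matches, or an empty array when there are none.
const matchLines = (s) => s.match(linePattern) || [];

console.log(matchLines("1.2.\n3.4.5.\n1.2.")); // ["1.2.", "3.4.5.", "1.2."]
console.log(matchLines("1.2.\nfoo")); // []  (the next line does not follow the pattern)
```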
(\d+\.*)+\n* will match the text you provided. If you need to make sure the final line also ends with a . then (\d+\.)+\n* will work.
Most programming languages offer the m flag, which is the multiline modifier. Enabling it lets $ match at the end of each line as well as at the end of the string.
The solution below only appends $ to your current regex and sets the m flag. The syntax for this may vary depending on your programming language.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
    regex = /((\d+)\.(\d+)\.((\d+)\.)*)$/gm,
    match;
// Log every match; the m flag lets $ match at the end of each line.
while ((match = regex.exec(text))) {
    console.log(match);
}
You could simplify the regex to /(\d+\.){2,}$/gm, then split the full match based on the dot character to get all the different numbers. I've given a JavaScript example below, but getting a substring and splitting a string are pretty basic operations in most languages.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
    regex = /(\d+\.){2,}$/gm;
/* Slice is used to drop the dot at the end, otherwise resulting in
 * an empty string on split.
 *
 * "1.2.3.".split(".") //=> ["1", "2", "3", ""]
 * "1.2.3.".slice(0, -1) //=> "1.2.3"
 * "1.2.3".split(".") //=> ["1", "2", "3"]
 */
console.log(
    text.match(regex)
        .map(match => match.slice(0, -1).split("."))
);
For more info about regex flags/modifiers have a look at: Regular Expression Reference: Mode Modifiers

Regex match for multiple characters

I want to write a regex pattern that matches a string starting with "Z" whose next 2 characters are not "IU", followed by any other characters.
I am using this pattern, but it is not working: Z[^(IU)]+.*$
ZISADR - should match
ZIUSADR - should not match
ZDDDDR - should match
Try this regex:
^Z(?:I[^U]|[^I]).*$
Explanation:
^ - asserts the start of the line
Z - matches Z
I[^U] - matches I followed by any character that is not a U
| - OR
[^I] - matches any character that is not an I
.* - matches 0+ occurrences of any character that is not a new line
$ - asserts the end of the line
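Running the three sample strings from the question through this pattern in JavaScript gives the expected results. The variable names are illustrative.

```javascript
// Z, then either "I + anything but U" or "anything but I", then the rest of the line.
const zNotIU = /^Z(?:I[^U]|[^I]).*$/;

const samples = ["ZISADR", "ZIUSADR", "ZDDDDR"];
const results = samples.map((s) => zNotIU.test(s));
console.log(results); // [true, false, true]
```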
When you want to exclude individual characters, you can use a negated character class, but when you want to exclude a particular sequence of characters, you need a negative lookahead and should write your regex like this,
^Z(?!IU).*$
Also note that your first word, ZISADR, will match, because Z is not followed by IU.
Your regex, Z[^(IU)]+.*$, matches Z, then the character class [^(IU)]+ matches one or more characters other than (, I, U and ), and finally .* matches any characters zero or more times. That is not the behavior you wanted.
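The difference between the original pattern and the negative lookahead is easy to see side by side. A minimal JavaScript sketch using the three strings from the question (the helper name is made up):

```javascript
const original = /Z[^(IU)]+.*$/; // character class: excludes (, I, U and ) individually
const withLookahead = /^Z(?!IU).*$/; // lookahead: excludes the sequence "IU" right after Z

const words = ["ZISADR", "ZIUSADR", "ZDDDDR"];

// Hypothetical helper: test every word against a pattern.
const check = (re) => words.map((w) => re.test(w));

console.log(check(original)); // [false, false, true]  ZISADR is wrongly rejected
console.log(check(withLookahead)); // [true, false, true]
```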
Edit: To provide a solution without look ahead
A non-lookahead based solution would be to use this regex,
^Z(?:I[^U]|[^I]U|[^I][^U]).*$
This regex has three alternatives, which together cover all the required cases.
I[^U] - Ensures that if the second character is I, then the third is not U
[^I]U - Ensures that if the third character is U, then the second is not I
[^I][^U] - Covers the case where the second character is not I and the third character is not U.
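One thing worth checking with the non-lookahead version is very short input: every alternative consumes two characters after the Z, so it behaves differently from the lookahead version there. A small JavaScript sketch, where "ZDU" and "ZD" are hypothetical inputs added for illustration:

```javascript
const noLookahead = /^Z(?:I[^U]|[^I]U|[^I][^U]).*$/;
const lookaheadVersion = /^Z(?!IU).*$/;

const inputs = ["ZISADR", "ZIUSADR", "ZDDDDR", "ZDU", "ZD"];

console.log(inputs.map((s) => noLookahead.test(s)));
// [true, false, true, true, false]  "ZD" fails: two characters are required after Z
console.log(inputs.map((s) => lookaheadVersion.test(s)));
// [true, false, true, true, true]
```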

Regex for alphanumeric with / or -

The regex should match letters or digits with / or - between them, but the string should not start or end with / or -.
I tried this using RegExr, but it does not work:
[a-zA-Z0-9]+[/|-]*[a-zA-Z0-9]+$
Your current regex has the following problems:
it matches multiple / and -, but only in one spot (e.g. it will match 0123/-/-456 but not 0123/456/789)
it also matches |, which you don't need to use in a [character class]
it is anchored to the end of the string with $ but not to the start with ^, so it can match just part of the input (e.g. it finds 0123/456 inside -0123/456, although it doesn't match 0123/456-)
You can use the following regex that Avinash Raj proposed :
^[a-zA-Z0-9]+(?:[/-][a-zA-Z0-9]+)*$
The first point is fixed by putting both the character class matching slashes and dashes and the one matching alnum characters inside a (?:non-capturing group), which we can quantify with * to specify that it can occur any number of times. Each repetition of this group matches one slash or dash followed by one or more alnum characters.
The other two points are straightforward, we remove the useless | and add a ^ at the start of the regex.
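Here is a minimal JavaScript sketch that runs the fixed pattern against the examples mentioned above. The slash is escaped inside the regex literal, and the input "abc-123" is an extra illustrative case.

```javascript
// Alnum runs separated by single / or - characters, anchored at both ends.
const alnumWithSeparators = /^[a-zA-Z0-9]+(?:[\/-][a-zA-Z0-9]+)*$/;

const cases = ["0123/456/789", "abc-123", "0123/-/-456", "-0123/456", "0123/456-"];
const verdicts = cases.map((c) => alnumWithSeparators.test(c));
console.log(verdicts); // [true, true, false, false, false]
```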

How to get the first match in regexp?

I have three strings, listed below:
Levofloxacin 500mg/100mL
Levofloxacin 500mg
Procaterol Hydrochloride …………… 25μg
From the first line, I want to get just 'mg', without 'mL'.
From the second line, I want to get 'mg'.
From the third line, I want to get 'μg'.
I have tried a regexp pattern like:
(?!(.*[ ]{1}[0-9]+))[a-zA-Zμ]+
However, the first line always returns 'mg' together with 'mL'...
How can I get just 'mg' with a regexp?
Any suggestions will be appreciated.
As mentioned in the comment section, try this regex:
^\D*[\d.]+\K[a-zμ]+
Explanation:
^ - asserts the start of the string
\D* - matches 0+ occurrences of any character that is not a digit
[\d.]+ - matches 1+ occurrences of a digit or a dot
\K - discards what has been matched so far, so it is not part of the reported match
[a-zμ]+ - this is what you want. This will contain the units like mg, ml appearing after the first number. If there are any other special characters like μ, you can add them too in this character list
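If your engine has no \K (JavaScript, for example), a capturing group gives the same result. The sketch below is such an adaptation, not the PCRE pattern itself; the character class lists both the Greek mu (U+03BC) and the micro sign (U+00B5) since either may appear in real data, the dots in the third string are shortened, and the helper name is made up.

```javascript
// Same idea as ^\D*[\d.]+\K[a-zμ]+, with the unit captured in group 1 instead of using \K.
const unitPattern = /^\D*[\d.]+([a-z\u00b5\u03bc]+)/;

// Hypothetical helper: return the unit after the first number, or null.
const firstUnit = (s) => {
  const m = s.match(unitPattern);
  return m ? m[1] : null;
};

console.log(firstUnit("Levofloxacin 500mg/100mL")); // mg
console.log(firstUnit("Levofloxacin 500mg")); // mg
console.log(firstUnit("Procaterol Hydrochloride \u2026\u2026 25\u03bcg")); // μg
```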