pcre regex match nth occurrence [duplicate] - regex

This is not a duplicate. I have checked before asking.
I have this string separated with | and I want to match the nth element.
aaaaaaaaa aaa|bb bbbbb|cccc ccccccc|ddd ddddddd|aaa aaa aaaaa|zzz zzz zzzzzzz
The closest I got is this pattern, but it is buggy:
([^\|]*\|){2}[^\|]*
https://regex101.com/r/EYZbK5/1
This is plain PCRE. In this context, JavaScript methods such as .split() cannot be used.
Say I want to get the 3rd element, cccc ccccccc. What regex should I use?

You could use an anchor to assert the start of the line, then match a run of characters other than | followed by a |, repeated 2 times. The third part is then captured in a capturing group, which will contain cccc ccccccc:
^(?:[^|]*\|){2}([^|]*)
Regex demo
Explanation
^ Assert the start of the line
(?: Start non capturing group
[^|]*\| Match zero or more characters other than | using a negated character class, followed by a |
){2} Close the non-capturing group and repeat it 2 times
([^|]*) Capture in group 1 zero or more characters other than |
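As a minimal sketch of the capture-group approach in Python's `re` module (the helper name `nth_field` and the idea of parameterizing the repeat count are assumptions added here, not part of the answer above), the repeat count is simply n - 1:

```python
import re

def nth_field(text, n):
    """Return the n-th (1-based) pipe-separated field, or None if there is none."""
    # Skip n-1 "field + pipe" pairs from the start, then capture the next field.
    pattern = r"^(?:[^|]*\|){%d}([^|]*)" % (n - 1)
    m = re.match(pattern, text)
    return m.group(1) if m else None

line = "aaaaaaaaa aaa|bb bbbbb|cccc ccccccc|ddd ddddddd|aaa aaa aaaaa|zzz zzz zzzzzzz"
third = nth_field(line, 3)  # "cccc ccccccc"
```

Asking for a field past the end (for example the 7th here) returns None, because the repeated group cannot find enough pipes.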

You may use
^(?:[^|]*\|){2}\K[^|]*
See the regex demo.
Details
^ - start of string
(?:[^|]*\|){2} - a non-capturing group matching two consecutive occurrences of
[^|]* - a negated character class matching 0+ chars other than |
\| - a | char
\K - match reset operator that discards the text matched so far
[^|]* - 0+ chars other than |.
To avoid empty string matches, you may replace the last [^|]* with [^|]+.
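To illustrate the difference between [^|]* and [^|]+ on an empty field, here is a small sketch. Python's standard `re` module does not support \K, so a capturing group stands in for it; the input row and the helper name are made up for the example.

```python
import re

# Capture group used in place of \K, which Python's stdlib re does not support.
ALLOW_EMPTY = re.compile(r"^(?:[^|]*\|){2}([^|]*)")
NON_EMPTY = re.compile(r"^(?:[^|]*\|){2}([^|]+)")

def third_field(pattern, text):
    m = pattern.match(text)
    return m.group(1) if m else None

row = "a|b||d"  # hypothetical input whose third field is empty

third_field(ALLOW_EMPTY, row)  # "" (an empty match)
third_field(NON_EMPTY, row)    # None (no match at all)
```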

You may try this and take group 2:
(\|?(.*?)(?:\|)){3}
demo and explanation

Related

regex getting words between '|'

I am trying to get the full words between two '|' characters
example string: {{person label|Jens Addle|border=red}}
here I would like to get the string: Jens Addle
I have attempted with the following:
(([A-Z]\w+))
However, this separates the result into two words and I would like to get it as a single entity.
This should put the value into $1.
Key is escaping the pipes, capturing what is in between and being non-greedy about it.
\|(.+?)\|
This should work in your case: /\|(.*?)\|/gm, or without the flags \|(.*?)\|.
This regex matches all characters between two | characters. (\| - the | character, (.*?) - match everything and capture)
Here is the regex101 page.
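A quick way to check the non-greedy capture outside regex101, sketched in Python (Python exposes the capture as `group(1)` rather than `$1`):

```python
import re

text = "{{person label|Jens Addle|border=red}}"

# Escaped pipes on both sides, lazy capture of whatever sits between them.
m = re.search(r"\|(.+?)\|", text)
name = m.group(1) if m else None  # "Jens Addle"
```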
You can use
\|\K[^|]*(?=\|)
(?<=\|)[^|]*(?=\|)
See the regex #1 demo and regex #2 demo.
Details:
(?<=\|) - a location that is immediately preceded with a | char
\|\K - matches a | char and then "forgets" it
[^|]* - zero or more chars other than a | char
(?=\|) - a location that is immediately followed with a | char.
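The lookaround form can be tried in Python as well, since the lookbehind is fixed-width. The sketch below (function names are invented for the comparison) also shows one practical difference: lookarounds do not consume the pipes, so a pipe can close one match and open the next.

```python
import re

def between_pipes(text):
    # Lookarounds assert the pipes without consuming them.
    return re.findall(r"(?<=\|)[^|]*(?=\|)", text)

def between_pipes_consuming(text):
    # Both pipes are consumed, so the closing pipe cannot open the next match.
    return re.findall(r"\|([^|]*)\|", text)

between_pipes("{{person label|Jens Addle|border=red}}")  # ["Jens Addle"]
between_pipes("a|b|c|d")                                  # ["b", "c"]
between_pipes_consuming("a|b|c|d")                        # ["b"]
```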
Matching 1 or more words between the pipe chars can be done using a capture group.
Note that [A-Z]\w+ matches at least 2 characters.
\|([A-Z]\w+(?: \w+)*)(?=\|)
\| Match |
( Capture group 1
[A-Z]\w+ Match an uppercase char A-Z and 1+ word characters
(?: \w+)* Optionally repeat matching a space and 1+ word characters
) Close group 1
(?=\|) Positive lookahead, assert | to the right
See a regex demo.
To take the format of the example string into account, you might also make the pattern a bit more specific:
{{[^|]*\|([A-Z]\w+(?: \w+)*)\|[^|]*}}
See another regex demo.
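A sketch of the more specific pattern in Python. The braces are escaped here to keep them unambiguous as literals, and the lowercase test input is made up to show that the [A-Z] requirement rejects it.

```python
import re

# Same idea as the pattern above, with the literal braces escaped.
TEMPLATE = re.compile(r"\{\{[^|]*\|([A-Z]\w+(?: \w+)*)\|[^|]*\}\}")

def template_name(text):
    m = TEMPLATE.search(text)
    return m.group(1) if m else None

template_name("{{person label|Jens Addle|border=red}}")  # "Jens Addle"
template_name("{{person label|jens addle|border=red}}")  # None
```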

regular expression with If condition question

I have the following regular expression that extracts everything after the first two letters:
^([A-Za-z]{2})(\w+)($) $2
Now I want to extract nothing if the data doesn't start with letters.
Example:
AA123 -> 123
123 -> ""
Can this be accomplished by regex?
Introduce an alternative to match any one or more chars from start to end of string if your regex does not match:
^(?:([A-Za-z]{2})(\w+)|.+)$
See the regex demo. Details:
^ - start of string
(?: - start of a container non-capturing group:
([A-Za-z]{2})(\w+) - Group 1: two ASCII letters, Group 2: one or more word chars
| - or
.+ - one or more chars other than line break chars, as many as possible (use [\w\W]+ to match any chars including line break chars)
) - end of a container non-capturing group
$ - end of string.
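In code, the alternation means Group 2 is simply unset when the fallback branch matched. A minimal Python sketch (the function name is an assumption) that maps an unset group to an empty string:

```python
import re

PATTERN = re.compile(r"^(?:([A-Za-z]{2})(\w+)|.+)$")

def tail_after_two_letters(value):
    m = PATTERN.match(value)
    # group(2) is None when the ".+" branch matched instead of the letter branch.
    return (m.group(2) or "") if m else ""

tail_after_two_letters("AA123")  # "123"
tail_after_two_letters("123")    # ""
```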
Your pattern already captures 1 or more word characters after matching 2 letters. The $ does not have to be in a group, and the $2 should not be in the pattern.
^([A-Za-z]{2})(\w+)$
See a regex demo.
Another option could be a pattern with a conditional, capturing data in group 2 only if group 1 exist.
^([A-Z]{2})?(?(1)(\w+)|.+)$
^ Start of string
([A-Z]{2})? Capture 2 uppercase chars in optional group 1
(? Conditional
(1)(\w+) If we have group 1, capture 1+ word chars in group 2
| Or
.+ Match the whole line with at least 1 char to not match an empty string
) Close conditional
$ End of string
Regex demo
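Python's `re` also supports the (?(1)...|...) conditional syntax, so the pattern can be tried there as a rough check. Note that this variant only accepts uppercase letters in group 1; the function name is an assumption.

```python
import re

CONDITIONAL = re.compile(r"^([A-Z]{2})?(?(1)(\w+)|.+)$")

def tail_if_prefixed(value):
    m = CONDITIONAL.match(value)
    # Group 2 is only set when group 1 (two uppercase letters) took part in the match.
    return (m.group(2) or "") if m else ""

tail_if_prefixed("AA123")  # "123"
tail_if_prefixed("123")    # ""
tail_if_prefixed("aa123")  # "" because [A-Z] does not match lowercase
```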
For a match only, you could use other variations: with \K, as in ^[A-Za-z]{2}\K\w+$, or with a lookbehind assertion, as in (?<=^[A-Za-z]{2})\w+$

how to match a list of fixed length words separated by space or comma?

Each word's length can be 2 or 6-10, and the words can be separated by a space or a comma. The words contain only letters and are not case sensitive.
Here is the groups of words that should be matched:
RE,re,rereRE
Not matching groups:
RE,rere,rel
RE,RERE
Here is the pattern that I have tried
((([a-zA-Z]{2})|([a-zA-Z]{6,10}))(,|\s+)?)
But unfortunately this pattern can match string like this: RE,RERE
It looks like the word boundary has not been set.
You could match chars a-z either 2 or 6-10 times using an alternation.
Then repeat that pattern 0+ times, each time preceded by a comma or a space [ ,].
^(?:[A-Za-z]{6,10}|[A-Za-z]{2})(?:[, ](?:[A-Za-z]{6,10}|[A-Za-z]{2}))*$
Explanation
^ Start of string
(?:[A-Za-z]{6,10}|[A-Za-z]{2}) Match chars a-z 6 -10 or 2 times
(?: Non capturing group
[, ](?:[A-Za-z]{6,10}|[A-Za-z]{2}) Match comma or space and repeat previous pattern
)* Close non capturing group and repeat 0+ times
$ End of string
Regex demo
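The anchored pattern can be checked against the examples from the question with a short Python sketch. Building the pattern from a `WORD` fragment is a convenience added here; the resulting regex is the same as the one above.

```python
import re

WORD = r"(?:[A-Za-z]{6,10}|[A-Za-z]{2})"
# One word, then zero or more "separator + word" repetitions, anchored at both ends.
LIST = re.compile(rf"^{WORD}(?:[, ]{WORD})*$")

def is_valid_list(text):
    return LIST.match(text) is not None

is_valid_list("RE,re,rereRE")  # True
is_valid_list("RE,rere,rel")   # False
is_valid_list("RE,RERE")       # False
```

Because of the anchors, a 4-letter word can no longer be accepted as a 2-letter word followed by leftovers.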
If lookarounds are supported, you might also assert what is directly on the left and on the right is not a non whitespace character \S.
(?<!\S)(?:[A-Za-z]{6,10}|[A-Za-z]{2})(?:[ ,](?:[A-Za-z]{6,10}|[A-Za-z]{2}))*(?!\S)
Regex demo
([a-zA-Z]{2}(,|\s)|[a-zA-Z]{6,10}|(,|\s))
This one will match only the words that have 2 letters, or between 6 and 10:
\b,?([a-zA-Z]{6,10}|[a-zA-Z]{2}),?\b
You can use this
^(?!.*\b[a-z]{4}\b)(?:(?:[a-z]{2}|[a-z]{6,10})(?:,|[ ]+)?)+$
Regex Demo
This regex will match your first case, but neither of your two other cases:
^((([a-zA-Z]{2})|([a-zA-Z]{6,10}))(,|[ ]+|$))+$
I'm making the assumption here that each line should be a single match.
Here it is in action.

Regex select 3 group

How can I select the third group of numbers using regex,
with the following string?
21|2|964|Texto 02
I want to select only 964.
I only managed to extract all the digit chunks with \d+ regex.
Thanks.
If you cannot split with | and get Item 3 from the resulting array, you may use
^(?:[^|]*\|){2}\K\d+
See the regex demo.
Alternatively, use
^(?:[^|]*\|){2}(\d+)
and grab Group 1 value. See another regex demo.
Details
^ - start of string
(?:[^|]*\|){2} - 2 sequences of:
[^|]* - any 0+ chars other than |
\| - a literal | symbol
\K - a match reset operator discarding the text matched so far
\d+ - 1 or more digits.
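For comparison, here is a sketch of the Group 1 variant next to the split approach mentioned at the top of this answer, in Python. The second input string is made up to show that \d+ rejects a non-numeric third field.

```python
import re

row = "21|2|964|Texto 02"

# Regex route: skip two "field + pipe" pairs, then capture the digits.
m = re.match(r"^(?:[^|]*\|){2}(\d+)", row)
third_number = m.group(1) if m else None  # "964"

# Split route, when splitting is available.
via_split = row.split("|")[2]  # "964"

# Hypothetical row with a non-numeric third field: \d+ finds nothing to match.
no_digits = re.match(r"^(?:[^|]*\|){2}(\d+)", "21|2|abc|x")  # None
```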

Tokenizing a string with a regular expression

Suppose I have a string like this: abc def ghi jkl (I put a space at the end for the sake of simplicity but it doesn't really matter for me) and I want to capture its "chunks" as follows:
abc
def
ghi
jkl
if and only if there are 1-4 "chunks" in the string. I have already tried the following regex:
^([^ ]+ ){1,4}$
at Regex101.com but it only captures the last occurrence. A warning about it is issued:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
How to correct the regular expression to achieve my goal?
Since you have no access to the code, the only solution you might use is a regex based on the \G operator that will only allow consecutive matches and a lookahead anchored at the start that will require 1 to 4 non-whitespace chunks in the string.
(?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^))\s*\K\S+
See the regex demo
Details:
(?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^)) - a custom boundary that checks if:
^(?=\s*\S+(?:\s+\S+){0,3}\s*$) - the string start position (^) that is followed with 1 to 4 non-whitespace chunks, separated with 1+ whitespaces, and trailing/leading whitespaces are allowed, too
| - or
\G(?!^) - the current position at the end of the previous successful match (\G also matches the start of a string, thus we have to use the negative lookahead to exclude that matching position, since there is a separate check performed)
\s* - zero or more whitespaces
\K - a match reset operator discarding all the text matched so far
\S+ - 1 or more characters other than whitespace
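When code is available, the two jobs the single pattern combines (validate the whole string, then pull out the chunks) can be done as two steps. The sketch below is in Python, whose standard `re` module has neither \G nor \K, so it is an equivalent of the logic, not the same regex; the function name is an assumption.

```python
import re

# Same shape as the lookahead: 1 to 4 non-whitespace chunks, optional outer whitespace.
VALID = re.compile(r"\s*\S+(?:\s+\S+){0,3}\s*")

def chunks(text):
    # Step 1: the whole string must consist of 1 to 4 chunks.
    if VALID.fullmatch(text) is None:
        return []
    # Step 2: collect the chunks themselves.
    return re.findall(r"\S+", text)

chunks("abc def ghi jkl ")  # ["abc", "def", "ghi", "jkl"]
chunks("a b c d e")         # [] because there are 5 chunks
```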
It can be done on Linux using tr:
tr -sc 'a-zA-Z' '\n' < text.txt > out_text.txt
where text.txt is the file containing the string to be normalized.
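For readers without tr at hand, a rough Python equivalent of that command is sketched below (the function name is an assumption): every run of characters other than ASCII letters is replaced with a single newline.

```python
import re

def one_word_per_line(text):
    # Mirrors tr -sc 'a-zA-Z' '\n': non-letter runs collapse into one newline.
    return re.sub(r"[^a-zA-Z]+", "\n", text)

one_word_per_line("abc def ghi jkl ")  # "abc\ndef\nghi\njkl\n"
```

Note that, like the tr command, this does not enforce the 1-4 chunk limit; it only splits.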