How to match sub-conditions that each occur at most once using regex?

Hi everybody, I'm trying to match the following condition using regex:
The string starts with P,
followed by one or more of the following:
-------- and/or a number (digits 0-9) followed by D, occurring once
-------- and/or a number (digits 0-9) followed by M, occurring once
-------- and/or a number (digits 0-9) followed by Y, occurring once
-------- and/or T followed by one or more of the following:
------------------------------------------ and/or a number (digits 0-9) followed by H, occurring once
------------------------------------------ and/or a number (digits 0-9) followed by M, occurring once
------------------------------------------ and/or a number (digits 0-9) followed by S, occurring once
I tried the following, with some success:
P?(([0-9]{1,}D)|([0-9]{1,}M)|([0-9]{1}Y)|(T?(([0-9]{1,}H)|([0-9]{1,}M)|([0-9]{1,}S))))
but it matches any number of occurrences of each of the conditions listed above.
Any idea how I can achieve this with regex?
Edit
In the end I found what I was looking for:
/^P(?=\w*\d)(?:\d+Y|Y)?(?:\d+M|M)?(?:\d+W|W)?(?:\d+D|D)?(?:T(?:\d+H|H)?(?:\d+M|M)?(?:\d+(?:\.\d{1,2})?S|S)?)?$/

I'm not sure I fully understand your spec, but is this getting close?
P([0-9]+D)?([0-9]+M)?([0-9]+Y)?(T([0-9]+H)?([0-9]+M)?([0-9]+S)?)?
Everything is optional except the leading P, order matters, each section can occur only
once, and the number of digits used in each case is one or more. T is required if anything
after it is included.
The RE above matches "P" and "PT", while the spec presumably requires at least one of the optional components to follow P and T. Using lookahead with grep -P (for Perl regular expressions), we can require P to be followed by a digit.
$ RE='P(?=[0-9])([0-9]+D)?([0-9]+M)?([0-9]+Y)?(T(?=[0-9])([0-9]+H)?([0-9]+M)?([0-9]+S)?)?'
$ for s in P1DT5S P1DT5 P1DT P1D P1 P
do
    printf "%-10s %s\n" $s $(echo $s | grep -P -o "$RE")
done
P1DT5S     P1DT5S
P1DT5      P1DT
P1DT       P1D
P1D        P1D
P1         P
P
$
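For comparison, here is a minimal Python sketch of the same lookahead idea. It assumes Python's re module as a stand-in for grep -P, and the helper name is_duration is made up for illustration. Because re.fullmatch requires the whole string to match, an input such as P1DT5 is rejected outright instead of yielding the partial match P1DT.

```python
import re

# Same pattern as the grep -P example above; fullmatch() anchors both ends.
DURATION = re.compile(
    r"P(?=[0-9])([0-9]+D)?([0-9]+M)?([0-9]+Y)?"
    r"(T(?=[0-9])([0-9]+H)?([0-9]+M)?([0-9]+S)?)?"
)


def is_duration(text):
    """Hypothetical helper: True only if the entire string matches."""
    return DURATION.fullmatch(text) is not None


for s in ["P1DT5S", "P1DT5", "P1DT", "P1D", "P1", "P"]:
    print(f"{s:<10} {is_duration(s)}")
```

Only P1DT5S and P1D are accepted here. Note that the pattern as written also rejects PT5S, because the first lookahead demands a digit immediately after P.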

Related

Regex match until third occurrence of a char is found, counting occurrence of said char starting from the end of string

Let's dive in : Input :
p9_rec_tonly_.cr_called.seg
p9_tonly_.cr_called.seg
p10_nor_nor_.cr_called.seg
p10_rec_tn_.cr_called.seg
p10_tn_.cr_called.seg
p26_rec_nor_nor_.cr_called.seg
p26_rec_tn_.cr_called.seg
p26_tn_.cr_called.seg
Desired output :
p9_rec
p9
p10_nor
p10_rec
p10
p26_rec_nor
p26_rec
p26
Starting from the beginning of my string, I need to match until the third occurrence of " _ " (underscore) is found, but I need to count " _ " (underscore) occurrence starting from end of string.
Any tips are appreciated,
Best regards
I believe this regex should do the trick!
^.*?(?=_[^_]*_[^_]*_[^_]*$)
Explanation:
^ the start of the line
.*? matches as few characters as possible (lazy), expanding only until the lookahead succeeds
(?=...) asserts that its contents follow our match
_[^_]*_[^_]*_[^_]* Looks for exactly three underscores after our match.
$ the end of the line
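A minimal sketch of this pattern in Python, assuming the re module and a few of the sample names from the question; the variable names are made up for illustration.

```python
import re

# Lazy prefix, then a lookahead requiring exactly three underscores
# between the end of the match and the end of the string.
PREFIX = re.compile(r"^.*?(?=_[^_]*_[^_]*_[^_]*$)")

names = [
    "p9_rec_tonly_.cr_called.seg",
    "p9_tonly_.cr_called.seg",
    "p10_nor_nor_.cr_called.seg",
]

for name in names:
    match = PREFIX.match(name)
    # A name with fewer than three underscores produces no match at all.
    print(match.group(0) if match else None)
```

This prints p9_rec, p9 and p10_nor, in line with the desired output in the question.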
You should think beyond regex to solve this problem. For example, if you are using Python just use rsplit with a limit of 3 and get the first resulting string:
>>> data = [
...     'p9_rec_tonly_.cr_called.seg',
...     'p9_tonly_.cr_called.seg',
...     'p10_nor_nor_.cr_called.seg',
...     'p10_rec_tn_.cr_called.seg',
...     'p10_tn_.cr_called.seg',
...     'p26_rec_nor_nor_.cr_called.seg',
...     'p26_rec_tn_.cr_called.seg',
...     'p26_tn_.cr_called.seg',
... ]
>>> for d in data:
...     print(d.rsplit('_', 3)[0])
...
p9_rec
p9
p10_nor
p10_rec
p10
p26_rec_nor
p26_rec
p26
bash you say? Well it's not a regular expression but you can do pattern substitutions (or stripping with bash):
while read var ; do echo ${var%_*_*_*} ; done <<EOT
p9_rec_tonly_.cr_called.seg
p9_tonly_.cr_called.seg
p10_nor_nor_.cr_called.seg
p10_rec_tn_.cr_called.seg
p10_tn_.cr_called.seg
p26_rec_nor_nor_.cr_called.seg
p26_rec_tn_.cr_called.seg
p26_tn_.cr_called.seg
EOT
${var%_*_*_*} expands the variable var, stripping the shortest suffix that matches _*_*_*.
Otherwise, to perform regex operations in the shell, you would normally use a utility such as sed and feed your lines through it, for instance:
sed -e 's#_[^_]*_[^_]*_[^_]*$##'
or for short:
sed -e 's#\(_[^_]*\)\{3\}$##'
This finds three groups, each made of _ followed by zero or more characters that are not _, at the end of the line ($) and replaces them with nothing ('').
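The same substitution can be sketched in Python with re.sub, assuming the re module in place of sed; the function name strip_last_three is hypothetical.

```python
import re


def strip_last_three(name):
    """Remove the last three '_...' groups, anchored at the end of the string."""
    return re.sub(r"(?:_[^_]*){3}$", "", name)


print(strip_last_three("p26_rec_nor_nor_.cr_called.seg"))
print(strip_last_three("p10_tn_.cr_called.seg"))
```

This prints p26_rec_nor and p10. A string with fewer than three underscores is returned unchanged, since the pattern cannot match.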

Regular Expressions with multiple dots in Linux bash shell give strange results

I tried to match a substring containing a lot of dots, and it failed in the Debian Linux shell. I wrote a simple script to see how dots are processed and found that it does not follow the usual rules at all. I retried it in Bash, Perl, and the Ubuntu shell, and the result is the same everywhere. The script and output are below.
#!/bin/sh
my_regex=u2734523abcABCB.C123.ABC.abc.1..2.34.2
Numbering=123456789_123456789_123456789_123456789
echo "$my_regex"
echo "$Numbering"
echo `expr index "$my_regex" '(ABC)'`
echo `expr index "$my_regex" '(ABC\.)'`
echo `expr index "$my_regex" '(\.\.)'`
echo `expr index "$my_regex" '(.)'`
echo `expr index "$my_regex" '(\.1)'`
Output:
u2734523abcABCB.C123.ABC.abc.1..2.34.2
123456789_123456789_123456789_123456789
12
12
16
16
16
The first regex should match ABC and return number-position of first character. It works.
The second one should find ABC followed by dot, it looks like it ignores dot.
The third one should find two dots but it finds first occurrence of one dot. Ignores again?
The fourth should find first any character, but it still finds the dot on position 16.
The fifth should find a dot followed by 1, it still finds the first occurrence of dot.
It seems like neither \ nor [ ] (I tried it too), nor the dot itself works as in common regular expression.
Why?
expr index has nothing to do with regular expressions.
expr index STRING CHARS outputs the index of the first occurrence of any of the CHARS in STRING. So your first search for '(ABC)' finds the first left parenthesis, A, B, C, or right parenthesis in your string. The first one is the A at position 12.
'(ABC\.)' does the same thing, except it's now also looking for a backslash or period. But the A is still the first match at position 12.
'(\.\.)' looks only for a parenthesis, backslash, or period. The first match is the period at position 16.
Likewise, all your other searches find the period at position 16, because none of the other characters you're listing come before that.
(On a side note, it's silly to capture the output with backticks only to immediately echo it. You'd get the same result by omitting the echo and backticks.)
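To make the behaviour concrete, here is a rough Python emulation of what expr index does. It is a sketch of the documented semantics, not the coreutils implementation, and the function name expr_index is made up.

```python
def expr_index(string, chars):
    """Return the 1-based position of the first character of `string`
    that appears anywhere in `chars`, or 0 if there is none."""
    for position, char in enumerate(string, start=1):
        if char in chars:
            return position
    return 0


s = "u2734523abcABCB.C123.ABC.abc.1..2.34.2"
print(expr_index(s, r"(ABC)"))   # set of characters: ( A B C )
print(expr_index(s, r"(\.\.)"))  # set of characters: ( \ . )
print(expr_index(s, "xyz"))      # none of these occur in s
```

This prints 12, 16 and 0: the second argument is treated as a set of characters, never as a pattern.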
You are incorrectly using index function of expr. As per man expr:
index STRING CHARS - index in STRING where any CHARS is found, or 0
So 2 things to note here:
index doesn't do any regex matching
index finds the position of the first character in STRING that is any one of CHARS
If you want regex matching then use:
STRING : REGEXP
like this:
my_regex='u2734523abcABCB.C123.ABC.abc.1..2.34.2'
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*ABC'
24
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*ABC\.'
25
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*\.\.'
32
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*.'
38
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*\.1'
30
The number after each expr command is the length of the match.
There is no need to use echo here as expr anyway writes output on stdout.
You might want to take a look at BASH built-in =~ operator for regex matching.
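As a cross-check, the match lengths above can be reproduced with a short Python sketch. It assumes that Python's re syntax agrees with expr's basic regular expressions for these particular patterns; the helper name expr_match_length is hypothetical.

```python
import re


def expr_match_length(string, pattern):
    """Length of the match anchored at the start of `string`, or 0 if none.

    re.match only matches at the beginning of the string, which mirrors
    the implicit anchoring of expr's ':' operator.
    """
    match = re.match(pattern, string)
    return match.end() if match else 0


s = "u2734523abcABCB.C123.ABC.abc.1..2.34.2"
for pattern in [r".*ABC", r".*ABC\.", r".*\.\.", r".*\.1"]:
    print(pattern, expr_match_length(s, pattern))
```

The lengths come out as 24, 25, 32 and 30, the same values expr reports above.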

How to select a part of a string OR another with REGEXP in MATLAB

I've been trying to solve this problem in the last few days with no success. I have the following string:
comment = '#disabled, Fc = 200Hz'
What I need to do is: if there's the string 'disabled' it needs to be matched. Otherwise I need to match the number that comes before 'Hz'.
The closest solution I found so far was:
regexpi(comment,'\<#disabled\>|\w*Hz\>','match') ;
It will match the word '#disabled' or anything that comes before 'Hz'. The problem is that when it finds '#disabled' first, it also returns the result '200Hz'.
So I'm getting:
ans = '#disabled' '200Hz'
Summing up, I need to select only the 'disabled' part of a string if there is one, otherwise I need to get the number before 'Hz'.
Can someone give me a hand ?
Suppose your input is:
comment = {'#disabled, Fc = 200Hz';
'Fc = 300Hz'}
The regular expression (match disabled if it follows a leading #, otherwise match digits that are followed by Hz):
regexp(comment, '(?<=^#)disabled|\d+(?=Hz)','match','once')
Explaining it:
^# - match # at the beginning of the line
(?<=expr)disabled - match disabled if follows expr
expr1 | expr2 - otherwise match expr2
\d+ - match 1 or more digits, equivalently [0-9]+
expr(?=Hz) - match expr only if followed by 'Hz'
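For readers without MATLAB, the same pattern can be tried in Python. This is a sketch that assumes Python's re.search plays the role of regexp(..., 'match', 'once') by returning only the first match; the variable names are made up.

```python
import re

# Lookbehind for a leading '#', or digits followed by 'Hz'.
PATTERN = re.compile(r"(?<=^#)disabled|\d+(?=Hz)")

comments = ["#disabled, Fc = 200Hz", "Fc = 300Hz"]

for comment in comments:
    # search() stops at the first match, so '200' is never reported
    # for the comment that starts with '#disabled'.
    match = PATTERN.search(comment)
    print(match.group(0) if match else None)
```

This prints disabled for the first comment and 300 for the second.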

procmail find first 3 upper case chars in subject

I am new to procmail and struggling to understand the syntax.
What I want to do is check whether the subject line begins with 3 upper case chars followed by a colon and, if it does, remove the colon from the end and perform an action, i.e.:
Subject: ABC: Other parts of the subject
:0
* $ ^Subject:/^[A-Z]{3}:$/
| /usr/bin/zarafa-dagent -C -P 'Support\\$1' vmail
Firstly I'm not sure if my regex is correct, and secondly, despite a lot of googling I can't figure out how to save my search into a variable to use elsewhere, I tried $1 for the first returned variable but that does not appear to work.
Any help would be much appreciated.
You can post-process the value of $MATCH to trim the colon.
:0 D
* ^Subject:[^ ]*\/[A-Z][A-Z][A-Z]:
{
:0
* MATCH ?? ^^\/[A-Z][A-Z][A-Z]
| /usr/bin/zarafa-dagent -C -P "Support\\$MATCH" vmail
}
The first condition captures the three uppercase characters and the colon into MATCH. The second matches this value against three uppercase characters, and captures just that part into the new value for MATCH.
As usual, the whitespace inside the brackets after Subject: consists of a space and a tab.
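The matching logic of the recipe can be illustrated outside procmail with a small Python sketch. This is only an analogy: procmail's \/ operator and MATCH variable are replaced here by an ordinary capture group, and the function name subject_tag is hypothetical.

```python
import re

# Header name, optional spaces or tabs, then exactly three uppercase
# letters followed by a colon. Only the letters are captured.
SUBJECT_TAG = re.compile(r"^Subject:[ \t]*([A-Z]{3}):")


def subject_tag(header_line):
    """Return the three-letter tag without the colon, or None."""
    match = SUBJECT_TAG.match(header_line)
    return match.group(1) if match else None


print(subject_tag("Subject: ABC: Other parts of the subject"))
print(subject_tag("Subject: abc: lower case is ignored"))
```

This prints ABC and then None. Python's re is case-sensitive by default, which corresponds to what the D flag requests in the recipe.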
OK, solved this; procmail has its own version of regex:
:0 D
* ^Subject:.*\/([A-Z]+[A-Z]+[A-Z]):
| /usr/bin/zarafa-dagent -C -P "Support\\$MATCH" vmail
EXITCODE=$?
It does not support the iterator brackets [A-Z]{3} and so you have to repeat the expression.
Also, it is case-insensitive, so you need to add the "D" flag.
Problem is I seem to be unable to remove the colon : from the end.

How to match the whole expression only, even when there are sub parts that match?

I'm trying to write an input validation pattern that allows wildcard characters. The input field is 9 chars max and should follow these rules:
* + 1-8 characters
1-8 chars + *
* + 1-7 chars + *
I've written this regex using the regex documentation and testing it on one of the regex testers.
\*{1}[0-9]{1,7}\*{1}|[0-9]{1,8}\*{1}|\*{1}[0-9]{1,8}|[0-9]{9}
It matches all these correctly
123456789
*1*
*12*
*123*
*1234*
*12345*
*123456*
*1234567*
1234567*
123456*
12345*
1234*
123*
12*
1*
*1
*12
*123
*1234
*12345
*123456
*1234567
*12345678
But it also matches when I don't want it to. For example, it finds 2 matches in *123456789*: the first match is *12345678 and the second one is 9*.
I don't want in this case to find any matches. Either the whole string matches one of the patterns or not. How does one do that?
Use anchors that make sure the regex always matches the entire string:
^(\*[0-9]{1,7}\*|[0-9]{1,8}\*|\*[0-9]{1,8}|[0-9]{9})$
Note the parentheses to make sure that the alternation is contained within the group:
^
(
    \*[0-9]{1,7}\*
  |
    [0-9]{1,8}\*
  |
    \*[0-9]{1,8}
  |
    [0-9]{9}
)
$
Also, {1} is always superfluous - one match per token is the default.
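The effect of the anchors can be checked with a short Python sketch. It assumes Python's re module rather than whichever regex tester the question used, and the constant names are made up for illustration.

```python
import re

ALTERNATIVES = r"\*[0-9]{1,7}\*|[0-9]{1,8}\*|\*[0-9]{1,8}|[0-9]{9}"

# Without anchors the engine is free to match pieces of the input.
UNANCHORED = re.compile(ALTERNATIVES)
# The group keeps ^ and $ applied to every alternative, not just the
# first and the last one.
ANCHORED = re.compile(r"^(?:" + ALTERNATIVES + r")$")

print(UNANCHORED.findall("*123456789*"))
print(bool(ANCHORED.match("*123456789*")))
print(bool(ANCHORED.match("*1234567*")))
```

The unanchored pattern finds the two partial matches described in the question, *12345678 and 9*, while the anchored one rejects *123456789* and still accepts *1234567*.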
You could use start and end string anchors:
http://www.regular-expressions.info/anchors.html
So, your regex would be something like this (note first and last symbol):
^(\*{1}[0-9]{1,7}\*{1}|[0-9]{1,8}\*{1}|\*{1}[0-9]{1,8}|[0-9]{9})$