Parse value using Regex [duplicate] - regex

This question already has answers here:
How to capture an arbitrary number of groups in JavaScript Regexp?
(5 answers)
Closed 7 years ago.
I have long strings taken from a VCF file, such as the following (truncated for the example):
chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;
chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;
I want to write a single regex to return all values of FAO on a given line.
The valid format for FAO is: FAO=SomeNumber; or FAO=SomeNumber, SomeNumber, SomeNumber, etc...;
Is there a way to write a REGEX capture group that takes into account both a single value and an infinite number of values separated by a comma until you see a ';'?
I've tried
FAO=((([0-9]+);)|(([0-9]+),([0-9])+))
But it only takes into account up to 2 numbers and I need matcher group 1 to be the first value, matcher group 2 to be the second etc...

You can use a negated character class: [^;]+ This says to match any characters that are not a semicolon. Since it's a greedy match it will continue until it sees the first semicolon.
var strings = [
'chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;',
'chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;'
];
strings.forEach(function(str) {
  alert(str.match(/(FAO=[^;]+)/)[1]);
});
From there you can edit the group match to only grab the values /FAO=([^;]+)/ and then you can split that value on the comma delimiter.
var strings = [
'chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;',
'chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;'
];
strings.forEach(function(str) {
  alert(str.match(/FAO=([^;]+)/)[1].split(','));
});
As stated in this SO answer it's not possible in most languages to have an arbitrary number of group matches.
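For comparison, here is a minimal Python sketch of the same two-step approach: match everything up to the next semicolon, then split on commas. The helper name fao_values is hypothetical, and the sample lines are the truncated ones from the question.

```python
import re

# Truncated sample lines from the question.
LINES = [
    "chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;",
    "chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;",
]

# \b stops "FAO=" from matching inside a longer key name.
FAO_RE = re.compile(r"\bFAO=([^;]+)")

def fao_values(line):
    """Return the FAO values of one line as a list of strings (hypothetical helper)."""
    m = FAO_RE.search(line)
    return m.group(1).split(",") if m else []

for line in LINES:
    print(fao_values(line))
```

Returning an empty list when FAO is absent avoids the error that indexing a failed match would raise.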

You could use a regex like this:
FAO=([0-9]+(,[0-9]+)*);
The outer parentheses let you extract the value or values with the first matching group.
EDIT
Since you want to capture the individual values in different matching groups, this approach won't work: a capturing group inside * only keeps the last match. See the accepted answer to this question for a solution.
EDIT 2
See this demo based on that answer for an example of a PCRE regex that matches each number with the same capturing group.
(?:FAO=|\G,)\K(\d+)
Note that not all regex flavours support \G and \K. \G matches at the end of the previous match (or at the start of the string), and \K resets the start of the current match.
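The point made in the first EDIT, that a capturing group under * keeps only its last repetition, can be checked with a short Python sketch. Python's built-in re module does not support \G or \K, so the sketch matches the whole list first and then collects the numbers from it. The input string is made up for illustration.

```python
import re

text = "DP=1201;FAO=3,14,159;FDP=1201;"  # made-up input, not from the question

m = re.search(r"FAO=([0-9]+(,[0-9]+)*);", text)
whole = m.group(1)  # the full comma-separated list
last = m.group(2)   # only the final repetition of the inner group survives

# Second pass: pull every number out of the captured list.
values = re.findall(r"[0-9]+", whole)

print(whole)   # 3,14,159
print(last)    # ,159
print(values)  # ['3', '14', '159']
```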

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start to the end of the string
[...] one of the given characters
...{3} exactly three repetitions of the preceding item
(...)* zero or more repetitions of the group in parentheses
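As a rough illustration, the pattern can be tried in Python against the examples from the question. This is only a sketch; the function name is_valid is made up here.

```python
import re

# One triple, then zero or more ",triple" groups, anchored at both ends.
TRIPLES = re.compile(r"^[abc]{3}(,[abc]{3})*$")

def is_valid(s):
    """True if s consists solely of comma-separated triples of a, b, c."""
    return TRIPLES.match(s) is not None

for s in ["abc,bca,cbb", "ccc,abc,aab,baa", "bcb", "abc,defx,df", "abc,"]:
    print(s, is_valid(s))
```

The last two inputs are rejected: one contains a group that is not an a/b/c triple, the other ends in a comma.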
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', data)
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over the sequence of found values.
import re
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
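So a regex that checks whether the whole string matches the pattern would be
data_string = "abc,bca,df"
# The trailing $ forces the match to cover the whole string, so "abc,bca,df" is rejected.
match = re.match(r'^[abc]{3}(,[abc]{3})*$', data_string)
if match:
    print("data string is correct")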
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters from [abc] followed by a comma, one or more times, and re.match only requires that match at the start of the string. So the most important part is to make sure that there is nothing more in the string, as scessor suggests, by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     # All 27 three-letter strings over 'a', 'b', 'c'
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
import re
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall behave like re.findall.
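The re.findall effect mentioned above can be seen in a small sketch. The input string is just one of the examples from the question; the variable names are illustrative.

```python
import re

s = "abc,bca,cbb"

# With a capturing group, findall returns the group's text, not the whole match.
with_group = re.findall(r"[abc]{3}(,|$)", s)

# With a non-capturing group, findall returns the whole match.
without_group = re.findall(r"[abc]{3}(?:,|$)", s)

print(with_group)     # [',', ',', '']
print(without_group)  # ['abc,', 'bca,', 'cbb']
```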

Combining 2 regular expressions

I have 2 strings and I would like to get a result that gives me everything before the first '\n\n'.
'1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\nÁltalános kitöltési előírások\nI.\nA felügyeleti jelentésre vonatkozó általános szabályok\n\n1.
'12. melléklet a 40/2018. (XI. 14.) MNB rendelethez\n\nÁltalános kitöltési előírások\n\nKapcsolódó jogszabályok\naz Önkéntes Kölcsönös Biztosító Pénztárakról szóló 1993. évi XCVI. törvény (a továbbiakban: Öpt.);\na személyi jövedelemadóról szóló 1995. évi CXVII.
I have been trying to combine 2 regular expressions to solve my problem; however, I may be on the wrong track. Maybe a function would be easier; I do not know.
Here is one that, as far as I understand, finds the character 'z':
extended regex : [\z+$]
I guess finding the first number is: [^0-9.].+
My problem is how to combine these two expressions to get the string in between them.
Is there a more efficient way to do this?
You may use
re.findall(r'^(\d.*?)(?:\n\n|$)', s, re.S)
Or with re.search, since it seems that only one match is expected:
m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
if m:
print(m.group(1))
See the Python demo.
Pattern details
^ - start of a string
(\d.*?) - Capturing group 1: a digit and then any 0+ chars, as few as possible
(?:\n\n|$) - a non-capturing group matching either two newlines or end of string.
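A short sketch of the pattern applied to a shortened version of the first string from the question. It also shows str.partition as a regex-free alternative; that alternative is an assumption about what is acceptable here, since it does not check that the text starts with a digit.

```python
import re

# Shortened version of the first sample string from the question.
s = ("1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\n"
     "Általános kitöltési előírások\nI.\n")

# re.S lets "." cross newlines; the lazy .*? stops at the first blank line.
m = re.search(r"^(\d.*?)(?:\n\n|$)", s, re.S)
via_regex = m.group(1) if m else None

# Regex-free alternative: everything before the first "\n\n".
via_partition = s.partition("\n\n")[0]

print(via_regex)
print(via_partition)
```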

Access multiple captures of one capture group in substitution string

Suppose I have the regex (\d)+.
In .NET I can access all captures of this capture group using match.Groups[1].Captures.
Can I also access these captures in a substitution string?
So for example, for the input string 523, I need to use 5, 2 and 3 in my substitution string (and not just 3, which is $1).
If you intend to capture each digit in its own capturing group, then you need to write a separate capturing group for every digit, like this:
(\d)(\d)(\d)
NOTE: This does not scale well, and you cannot match numbers of any length other than 3 digits. In other words, no match on either 23 or 345667!
A good page with a long and detailed explanation of why this can't be done with (\d)+ can be found here:
https://www.regular-expressions.info/captureall.html
So if this is indeed what you want, then you need to write your own loop that searches the string for every digit separately.
If, on the other hand, you need to capture the number and not the individual digits, then you simply put the + sign in the wrong position. I think you should write:
(\d+)
I think the OP wants to get every single digit match separately.
Perhaps this will help you then:
' Create a list to put the resulting matches in
Dim ResultList As StringCollection = New StringCollection()
Dim RegexObj As New Regex("(\d)")
' strName is the input string, e.g. "523"
Dim MatchResult As Match = RegexObj.Match(strName)
While MatchResult.Success
    ResultList.Add(MatchResult.Groups(1).Value)
    ' Console.WriteLine(MatchResult.Groups(1).Value)
    MatchResult = MatchResult.NextMatch()
End While
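A comparable loop in Python, offered as a sketch rather than as a .NET answer: re.finditer yields one match per digit, and a replacement string can then be built from the collected values. The "-" separator is an arbitrary choice for illustration.

```python
import re

text = "523"

# One match per digit, instead of one group repeated with +.
digits = [m.group(1) for m in re.finditer(r"(\d)", text)]

# Build whatever replacement is needed from the individual captures.
replacement = "-".join(digits)

print(digits)       # ['5', '2', '3']
print(replacement)  # 5-2-3
```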

1 to 5 of the same groups in REGEX

For a string such as:
abzyxcabkmqfcmkcde
Notice that there are string patterns between ab and c (zyx and kmqf, shown in bold in the original). To capture the first string pattern:
ab([a-z]{3,5})c
Is it possible to match both of the groups from the sample string? Actually, there should be 1 to 5 groups.
Note: python style regex.
You can verify that a given string conforms to the 1-5 repetitions of ab([a-z]{3,5})c using this regex
(?:ab([a-z]{3,5})c){1,5}
or this one if there are characters expected between the groups
(?:ab([a-z]{3,5})c.*?){1,5}
You will only be able to extract the last matching group from that string, however, not any of the previous ones. To get the earlier ones you need to use hsz's approach.
Just match all results - i.e. with g flag:
/ab([a-z]{3,5})c/g
or some method like in Python:
re.findall(pattern, string, flags=0)
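Applied to the sample string from the question, a minimal sketch looks like this. The 1-to-5 check at the end is an assumption about how the count requirement might be enforced.

```python
import re

s = "abzyxcabkmqfcmkcde"
pattern = r"ab([a-z]{3,5})c"

# findall returns the text of the single capturing group for every match.
groups = re.findall(pattern, s)
print(groups)  # ['zyx', 'kmqf']

# Assumed way to enforce "1 to 5 groups": count the matches afterwards.
count_ok = 1 <= len(groups) <= 5
print(count_ok)  # True
```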

Extract numbers between brackets within a string [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Extract info inside all parenthesis in R (regex)
I imported data from Excel, and one cell consists of long strings that contain numbers and letters. Is there a way to extract only the numbers from that string and store them in a new variable? Unfortunately, some of the entries have two sets of brackets, and I only want the second one. Could I use grep for that?
The strings look more or less like this; their length varies, however:
"East Kootenay C (5901035) RDA 01011"
or like this:
"Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020"
All I want from this is 5901035 and 5933039
Any hints and help would be greatly appreciated.
There are many possible regular expressions to do this. Here is one:
x=c("East Kootenay C (5901035) RDA 01011","Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020")
> gsub('.+\\(([0-9]+)\\).+?$', '\\1', x)
[1] "5901035" "5933039"
Let's break down the syntax of that first expression '.+\\(([0-9]+)\\).+?$'
.+ one or more of anything
\\( parentheses are special characters in a regular expression, so if I want to represent the actual thing ( I need to escape it with a \. I have to escape it again for R (hence the two \s).
([0-9]+) I mentioned special characters, here I use two. the first is the parentheses which indicate a group I want to keep. The second [ and ] surround groups of things. see ?regex for more information.
?$ The final piece: ? makes the trailing .+ lazy and $ anchors the match at the end of the string. Together with the greedy .+ at the start, this ensures that I am grabbing the LAST set of numbers in parens, as noted in the comments.
I could also use * instead of + (that is, .* rather than .+), which would mean 0 or more rather than one or more, in case your paren string comes at the beginning or end of a string.
The second piece of the gsub is what I am replacing the first portion with. I used: \\1. This says use group 1 (the stuff inside the ( ) from above). I need to escape it twice again, once for the regex and once for R.
Clear as mud to be sure! Enjoy your data munging project!
Here is a gsubfn solution:
library(gsubfn)
strapplyc(x, "[(](\\d+)[)]", simplify = TRUE)
[(] matches an open paren, (\\d+) matches a string of digits creating a back-reference owing to the parens around it and finally [)] matches a close paren. The back-reference is returned.
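For readers more at home in Python than in R, the same idea can be sketched as follows: collect every all-digit parenthesised group and keep the last one. The function name last_bracketed_number is made up for this example.

```python
import re

x = [
    "East Kootenay C (5901035) RDA 01011",
    "Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020",
]

def last_bracketed_number(s):
    """Return the last all-digit group in parentheses, or None if there is none."""
    found = re.findall(r"\((\d+)\)", s)
    return found[-1] if found else None

codes = [last_bracketed_number(s) for s in x]
print(codes)  # ['5901035', '5933039']
```

Because the pattern only accepts digits between the parentheses, "(Copper Desert Country)" is skipped without any special handling.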