Regular expression with multiple endings - regex

I have a pandas DataFrame like this:
idx name
1 "NM_014855.2(AP5Z1):c.80_83delGGATinsTGCTGTAAACTGTAACTGTAAA (p.Arg27_Ala362delinsLeuLeuTer)"
2 "NM_014630.2(ZNF592):c.3136G>A (p.Gly1046Arg)"
3 "NM_000410.3(HFE):c.892+48G>A"
4 "NC_000014.9:g.(31394019_31414809)_(31654321_31655889)del"
I need to extract whatever follows the ':' character, until any of the following:
" ("
"del"
{end of string}
I have tried the following:
df.str.extract(r"\):(.*) \(|\n")
But it doesn't work for all the cases.
How can I properly specify the condition I need?

Use a lazy match *? to minimize how much the .* will capture, then specify the stop conditions you're looking for:
df.str.extract(r":(.*?)(?:\(|del|$)")
Regular expressions normally match the longest possible string, but ? switches it to match the shortest possible string.

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start till the end of the string
[...] one of the given character
...{3} three time of the phrase before
(...)* 0 till n times of the characters in the brackets
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$'
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
print match.group('value')
So the regex to check if the string matches pattern will be
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+', data_string)
if match:
print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters of [abc] followed by a comma one ore more times anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests with adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> def matcher(x):
total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
for i in x.split(','):
if i not in total:
return False
return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplet of letters a,b,and c:
for ss in ("abc,bbc,abb,baa,bbb",
"acc",
"abc,bbc,abb,bXa,bbb",
"abc,bbc,ab,baa,bbb"):
print ss,' ',bool(re.match('([abc]{3},?)+\Z',ss))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence obliges the match to be until the very end of the string
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) like contruct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See a classical issue described at re.findall behaves weird post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that this pattern has match groups. To actually get the groups, use str.extract. and
the Series.str.extract, Series.str.extractall and Series.str.findall will behave as re.findall.

How to use regexp in MATLAB to strictly match a substring and not a larger string containing that substring?

I would like to find whether a cell contains the substring foo and only this string (nothing before, nothing after) in a series of cells that may contain foobar.
I am currently using regexp in MATLAB and would like to tweak the searched pattern regexp to exclude cells that contain a string that contains the substring I defined.
I know it kind of goes against the very idea of regexp, but I am fairly certain there is a way to do what I want.
As a MWE, here is a snippet of the data I have (in cell format), called potentialfields:
'horaracha'
'sol'
'presmax'
'horapresmax'
'presmin'
'horapresmin'
and the regexp expression that I am currently using:
selected_fields={'sol','presmin'};
diffset=setdiff(potentialfields,selected_fields);
pattern=strjoin(diffset,'|');
idx_to_delete=~cellfun(#isempty,regexp(potentialfields,pattern));
The expected output of idx_to_delete is the following:
1 0 1 1 0 1
At the moment, the output is 1 0 1 1 1 1 because horapresmin contains presmin.
Thank you very much in advance.
regexp is overkill here, ismember is an in-built function specifically designed for finding exact strings in a cell
idx_to_delete = ismember( potentialfields, selected_fields );
If you're really set on regexp you can use the start anchor (^) and end anchor ($) like so:
pattern = ['^(', strjoin( selected_fields, '|' ), ')$'];
idx_to_delete2 = ~cellfun( #isempty, regexp( potentialfields, pattern ) );
You can build the word boundary based regex dynamically:
pattern = strcat('\<(', strjoin(diffset,'|'), ')\>')
idx_to_delete=~cellfun(#isempty,regexp(potentialfields,pattern))
With strjoin(diffset,'|'), you get the alternation pattern created, and the \<(...)\> is a grouping construct wrapped with word boundaries to only match whole words where word boundaries apply to every alternative start and end char.

Regex to parse json formatted string

I have a JSON formatted string which I am trying to parse using a regex. I would like to parse each key value pair for later use in grafana (the regex itself is used in logstash).
The test string looks like this:
{
"version":"1.1",
"nameId":"test",
"productId":"B2",
"total customers":99,
"full_description":"asdf"
}
I am using the following regex expression, but it seems that if the value is a number (without " "), it groups the comma inthe the value. For example, the group value for the key "total customers" is "99," and not just "99".
(?i)["'](?<key>[^"]*)["'](?:\:)["'\{\[]?([\r\n]?\t+\")?(?<value>\w(?:\s[a-zA-Z0-9_=]\.?)+\w+#(?:(?:\w[a-z\d\-]+\w)\.)+[a-z]{2,10}|true|false|[\w+-.$\s=-]*)(",[\r\n])?(?2)?(?J)(?<value>(?&value))?
What do I have to add to the regex expression in order to parse JSON-values which are numbers?
This part in the pattern [\w+-.$\s=-] has a range +-. instead of matching either a + - or .
The range matches ASCII chars decimals number 43-46 where number 44 matches the unwanted ,
As the character class already matches - at the end, so you can omit the middle -.
The pattern contains some superfluous escapes and capture groups and seems a bit complicated. The updated pattern with just 2 capture groups could look like;
(?i)["'](?<key>[^"]*)["']:["'{\[]?(?:[\r\n]?\t+")?(?<value>\w(?:\s[a-zA-Z\d_=]\.?)+\w+#(?:\w[a-z\d-]+\w\.)+[a-z]{2,10}|true|false|[\w+.$\s=-]*)(?:",[\r\n])?(?2)?(?J)(?<value>(?&value))?
Regex demo

Value matching in regex and Openrefine

I am trying to use the value.match command in OpenRefine 2.6 for splitting two columns based on a 4 number date.
A sample of the text is:
"first sentence, second sentence, third sentences, 2009"
What I do is going to "Add column based on this column" and insert
value.match(\d{4})
but I get the error
Parsing error at offset 12: Missing number, string, identifier, regex,
or parenthesized expression
any idea of the possible solution?
You need to fix 3 things to get this working:
1) As Wiktor says you need to start & end the regular expression with a forward slash /
2) The 'match' function requires you to match the whole string in the cell, not just the fragment you need - so your regular expression needs to match the whole string
3) To extract part of a string with 'match' you need to have capture groups in your regular expression- that is use ( ) around the bit of the regular expression you want to extract. The captured values will be put in an array and you will need to get the string out of tge array to store it in a cell
So you'll need something like:
value.match(/.*(\d{4})/)[0]
To get the four digit year from the end of the string

Regular expression which will match if there is no repetition

I would like to construct regular expression which will match password if there is no character repeating 4 or more times.
I have come up with regex which will match if there is character or group of characters repeating 4 times:
(?:([a-zA-Z\d]{1,})\1\1\1)
Is there any way how to match only if the string doesn't contain the repetitions? I tried the approach suggested in Regular expression to match a line that doesn't contain a word? as I thought some combination of positive/negative lookaheads will make it. But I haven't found working example yet.
By repetition I mean any number of characters anywhere in the string
Example - should not match
aaaaxbc
abababab
x14aaaabc
Example - should match
abcaxaxaz
(a is here 4 times but it is not problem, I want to filter out repeating patterns)
That link was very helpful, and I was able to use it to create the regular expression from your original expression.
^(?:(?!(?<char>[a-zA-Z\d]+)\k<char>{3,}).)+$
or
^(?:(?!([a-zA-Z\d]+)\1{3,}).)+$
Nota Bene: this solution doesn't answer exaactly to the question, it does too much relatively to the expressed need.
-----
In Python language:
import re
pat = '(?:(.)(?!.*?\\1.*?\\1.*?\\1.*\Z))+\Z'
regx = re.compile(pat)
for s in (':1*2-3=4#',
':1*1-3=4#5',
':1*1-1=4#5!6',
':1*1-1=1#',
':1*2-a=14#a~7&1{g}1'):
m = regx.match(s)
if m:
print m.group()
else:
print '--No match--'
result
:1*2-3=4#
:1*1-3=4#5
:1*1-1=4#5!6
--No match--
--No match--
It will give a lot of work to the regex motor because the principle of the pattern is that for each character of the string it runs through, it must verify that the current character isn't found three other times in the remaining sequence of characters that follow the current character.
But it works, apparently.