Parse formula name and arguments with regex [duplicate] - regex

This question already has answers here:
How to get function parameter names/values dynamically?
(34 answers)
Closed 6 years ago.
The objective of this Regex (\w*)\s*\([(\w*),]*\) is to get a function name and its arguments.
For example, given f1 (11,22,33)
the Regex should capture four elements:
f1
11
22
33
What's wrong with this regex?

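You can do it with split. Here is an example in JavaScript:
var str = "f1 (11,22,33)";
var ar = str.match(/\((.*?)\)/);
if (ar) {
  // ar[1] is the captured text between the parentheses; ar[0] would include the parentheses
  var result = ar[1].split(","); // ["11", "22", "33"]
}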
Reference: https://stackoverflow.com/a/13953005/1827594

Some things are hard for regexes :-)
As the commenters above are saying, '*' can be too lax. It means zero or more. So foo(,,) also matches. Not so good.
(\w+)\s*\((\w+)(?:,\s*(\w+)\s*)*\)
That is closer to what you want I think. Let's break that down.
\w+ <-- The function name, has to have at least one character
\s* <-- zero or more whitespace
\( <-- parens to start the function call
(\w+) <-- at least one parameter
(?: ... ) <-- a non-capturing group: it groups the pattern without saving the match
,\s* <-- a comma with optional space
(\w+) <-- another parameter
\s* <-- followed by optional space
This is the result from Python:
>>> m = re.match(r'(\w+)\s*\((\w+)(?:,\s*(\w+)\s*)*\)', "foo(a,b,c)")
>>> m.groups()
('foo', 'a', 'c')
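Note that 'b' is missing from the result: a capture group inside a repetition keeps only its last match. And what about something like this: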
foo(a,b,c
d,e,f)
?? Yeah, it gets hard fast with regexes and you move on to richer parsing tools.
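A two-step approach sidesteps the repeated-group problem: match the name and the raw argument list first, then split the list. The following is only a sketch in Python, assuming flat calls with no nested parentheses; the names CALL_RE and parse_call are hypothetical and not part of the original answers.

```python
import re

# Name, optional whitespace, then everything between one pair of parentheses.
# Assumption: no nested parentheses inside the argument list.
CALL_RE = re.compile(r'(\w+)\s*\(([^()]*)\)')

def parse_call(text):
    """Return [name, arg1, arg2, ...] for a flat call, or None if it does not parse."""
    m = CALL_RE.fullmatch(text.strip())
    if m is None:
        return None
    name, raw_args = m.groups()
    # An empty argument list is allowed: foo() gives ['foo'].
    args = [a.strip() for a in raw_args.split(',')] if raw_args.strip() else []
    # Reject empty or malformed arguments such as foo(,,).
    if not all(re.fullmatch(r'\w+', a) for a in args):
        return None
    return [name, *args]

print(parse_call('f1 (11,22,33)'))  # ['f1', '11', '22', '33']
```

Input such as foo(,,) returns None here because every argument must contain at least one word character.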

Related

Extract all chars between parenthesis [duplicate]

This question already has answers here:
Regular Expression to get a string between parentheses in Javascript
(10 answers)
Closed 2 years ago.
I used
let regExp = /\(([^)]+)\)/;
to extract
(test(()))
from
aaaaa (test(())) bbbb
but I get only this
(test(()
How can I fix my regex ?
Don't use a negative character set, since parentheses (both ( and )) may appear inside the match you want. Greedily repeat instead, so that you match as much as possible, until the engine backtracks and finds the first ) from the right:
console.log(
'aaaaa (test(())) bbbb'
.match(/\(.*\)/)[0]
);
Keep in mind that this (and JS regex solutions in general) cannot guarantee balanced parentheses, at least not without additional post-processing/validation.
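When balanced parentheses do matter, a small depth counter is often simpler than a regex. Here is a minimal sketch in Python (not JavaScript, and not taken from the answer above); the function name first_balanced_group is hypothetical.

```python
def first_balanced_group(text):
    """Return the first balanced '(...)' substring of text, or None if there is none."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == '(':
            if depth == 0:
                start = i  # remember where the outermost group opens
            depth += 1
        elif ch == ')':
            if depth == 0:
                continue  # stray closing parenthesis, ignore it
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # never returned to depth 0: unbalanced or no parentheses

print(first_balanced_group('aaaaa (test(())) bbbb'))  # (test(()))
```

Unlike the greedy /\(.*\)/ match, this stops at the parenthesis that actually closes the first group, so trailing parentheses elsewhere in the string are not swallowed.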

Python regex to parse '#####' text in description field [duplicate]

This question already has answers here:
regex to extract mentions in Twitter
(2 answers)
Extracting #mentions from tweets using findall python (Giving incorrect results)
(3 answers)
Closed 3 years ago.
Here's the line I'm trying to parse:
#abc def#gmail.com #ghi j#klm #nop.qrs #tuv
And here's the regex I've gotten so far:
#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
My goal is to get ['#abc', '#ghi', '#tuv'], but no matter what I do, I can't get 'j#klm' to not match. Any help is much appreciated.
Try using re.findall with the following regex pattern:
(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)
inp = "#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"
matches = re.findall(r'(?:(?<=^)|(?<=\s))#[A-Za-z]+(?=\s|$)', inp)
print(matches)
This prints:
['#abc', '#ghi', '#tuv']
The regex calls for an explanation. The leading lookbehind (?:(?<=^)|(?<=\s)) asserts that what precedes the # symbol is either a space or the start of the string. We can't use a word boundary here because # is not a word character. We use a similar lookahead (?=\s|$) at the end of the pattern to rule out matching things like #nop.qrs. Again, a word boundary alone would not be sufficient.
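The same whitespace-boundary idea can also be written with negative lookarounds, which cover the start and end of the string without an explicit ^ or $. This is a sketch of an equivalent variant, not the pattern from the answer; TAG_RE and find_tags are hypothetical names.

```python
import re

# (?<!\S) : not preceded by a non-space character (start of string or whitespace)
# (?!\S)  : not followed by a non-space character (end of string or whitespace)
TAG_RE = re.compile(r'(?<!\S)#[A-Za-z]+(?!\S)')

def find_tags(text):
    """Return every #word that stands alone between whitespace or string boundaries."""
    return TAG_RE.findall(text)

print(find_tags("#abc def#gmail.com #ghi j#klm #nop.qrs #tuv"))
# ['#abc', '#ghi', '#tuv']
```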
Just add the start-of-line anchor at the beginning:
^#[A-Za-z]+[^0-9. ]+\b | #[A-Za-z]+[^0-9. ]
It should work!

RegEx for matching characters unless they are contained in certain string [duplicate]

This question already has answers here:
Regex Pattern to Match, Excluding when... / Except between
(6 answers)
Closed 7 years ago.
Let's say I wanna match the letters E, Q and W. However, I don't want them matched if they're found in a certain string of characters, for instance, HELLO.
LKNSDG8E94GO98SGIOWNGOUH09PIHELLOBN0D98HREINBMUE
       ^          ^          ^           ^     ^
      yes        yes         NO         yes   yes
There's a nifty regex trick you can use for this. Here's some code in JavaScript, but it can be adapted to any language:
var str = 'LKNSDG8E94GO98SGIOWNGOUH09PIHELLOBN0D98HREINBMUE',
rgx = /HELLO|([EQW])/g,
match;
while ((match = rgx.exec(str)) != null) {
if (match[1]) output(match[1] + '\n');
}
function output(x) { document.getElementById('out').textContent += x; }
<pre id='out'></pre>
Basically, you match on HELLO|([EQW]). The regex engine scans left to right and tries the alternatives in order at each position, so when it reaches a HELLO it matches the whole word and resumes after it, thereby skipping the E inside of it.
Then you can just check the capture group. If there's something in that capture group, we know it's something we want. Otherwise, it must have been part of the HELLO, so we ignore it.
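The same trick carries over to Python. The sketch below is an adaptation, assuming the string to skip and the wanted letters are passed as parameters; the function name letters_outside is hypothetical.

```python
import re

def letters_outside(text, skip='HELLO', wanted='EQW'):
    """Return (index, letter) for each wanted letter that is not inside `skip`."""
    # First alternative consumes the string to ignore; second captures a wanted letter.
    pattern = re.compile(f'{re.escape(skip)}|([{re.escape(wanted)}])')
    hits = []
    for m in pattern.finditer(text):
        if m.group(1):  # group 1 is only set when the second alternative matched
            hits.append((m.start(1), m.group(1)))
    return hits

s = 'LKNSDG8E94GO98SGIOWNGOUH09PIHELLOBN0D98HREINBMUE'
print(letters_outside(s))
# [(7, 'E'), (18, 'W'), (41, 'E'), (47, 'E')]
```

The E at index 29 is absent from the result because it was consumed as part of HELLO.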

Parse value using Regex [duplicate]

This question already has answers here:
How to capture an arbitrary number of groups in JavaScript Regexp?
(5 answers)
Closed 7 years ago.
I have long strings taken from a VCF file such as these (truncated for example purposes):
chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;
chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;
I want to write a single regex to return all values of FAO on a given line.
The valid format for FAO is: FAO=SomeNumber; or FAO=SomeNumber, SomeNumber, SomeNumber, etc...;
Is there a way to write a REGEX capture group that takes into account both a single value and an infinite number of values separated by a comma until you see a ';'?
I've tried
FAO=((([0-9]+);)|(([0-9]+),([0-9])+))
But it only takes into account up to 2 numbers and I need matcher group 1 to be the first value, matcher group 2 to be the second etc...
You can use a negated character class: [^;]+. This says to match any characters that are not a semicolon. Since it's a greedy match, it will continue until it sees the first semicolon.
var strings = [
'chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;',
'chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;'
];
strings.forEach(function(str) {
alert(str.match(/(FAO=[^;]+)/)[1]);
});
From there you can edit the group match to only grab the values /FAO=([^;]+)/ and then you can split that value on the comma delimiter.
var strings = [
'chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;',
'chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;'
];
strings.forEach(function(str) {
alert(str.match(/FAO=([^;]+)/)[1].split(','));
});
As stated in this SO answer it's not possible in most languages to have an arbitrary number of group matches.
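For comparison, the match-then-split approach looks like this in Python. It is a sketch only; FAO_RE and fao_values are hypothetical names, and the \b is an added assumption so that a longer key ending in FAO would not match.

```python
import re

# Capture everything after "FAO=" up to, but not including, the next semicolon.
FAO_RE = re.compile(r'\bFAO=([^;]+)')

def fao_values(line):
    """Return the FAO values on a line as a list of strings (empty if FAO is absent)."""
    m = FAO_RE.search(line)
    return m.group(1).split(',') if m else []

lines = [
    'chr1 11189845 COSM462604;COSM893813 G C,T 158.16 PASS AF=0,0;AO=0,0;DP=1201;FAO=0,0;FDP=1201;FR=.;',
    'chr1 11190804 COSM180789 C T 134.06 PASS AF=0;AO=0;DP=1016;FAO=0;FDP=1018;FR=.;FRO=1018;',
]
for line in lines:
    print(fao_values(line))
# ['0', '0']
# ['0']
```

The list returned by split plays the role of the "arbitrary number of groups" that a single regex cannot provide.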
You could use a regex like this:
FAO=([0-9]+(,[0-9]+)*);
The outer parentheses allow you to extract the value or values with the first matching group.
EDIT
Considering that you want to capture the individual values with different matching groups, this approach won't work (capturing groups inside * will only capture the last match). See the accepted answer to this question for a solution.
EDIT 2
See this demo based on that answer for an example of a PCRE regex that will match each number with the same capturing group.
(?:FAO=|\G,)\K(\d+)
Note that not all regex flavours support \G and \K. \G matches the end of the previous match (or the start of the string), and \K resets the start of the current match.

Extract numbers between brackets within a string [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Extract info inside all parenthesis in R (regex)
I imported data from Excel, and one cell consists of these long strings that contain numbers and letters. Is there a way to extract only the numbers from that string and store them in a new variable? Unfortunately, some of the entries have two sets of brackets, and I would only want the second one. Could I use grep for that?
The strings look more or less like this; their length varies, however:
"East Kootenay C (5901035) RDA 01011"
or like this:
"Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020"
All I want from this is 5901035 and 5933039
Any hints and help would be greatly appreciated.
There are many possible regular expressions to do this. Here is one:
x=c("East Kootenay C (5901035) RDA 01011","Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020")
> gsub('.+\\(([0-9]+)\\).+?$', '\\1', x)
[1] "5901035" "5933039"
Let's break down the syntax of that expression, '.+\\(([0-9]+)\\).+?$':
.+ one or more of anything
\\( parentheses are special characters in a regular expression, so if I want to represent the actual thing ( I need to escape it with a \. I have to escape it again for R (hence the two \s).
([0-9]+) I mentioned special characters; here I use two. The first is the parentheses, which indicate a group I want to keep. The second is [ and ], which surround a set of characters to match (here the digits 0 to 9). See ?regex for more information.
.+?$ The final piece matches the rest of the string up to its end ($), so the whole string gets replaced. Because the leading .+ is greedy, the match lands on the LAST set of numbers in parens, as noted in the comments.
I could also use * instead of +, which would mean zero or more rather than one or more, in case your paren string comes at the beginning or end of a string.
The second piece of the gsub is what I am replacing the first portion with. I used \\1, which says to use group 1 (the stuff inside the ( ) from above). The backslash is doubled again because R string literals need it escaped.
Clear as mud to be sure! Enjoy your data munging project!
Here is a gsubfn solution:
library(gsubfn)
strapplyc(x, "[(](\\d+)[)]", simplify = TRUE)
[(] matches an open paren, (\\d+) matches a string of digits creating a back-reference owing to the parens around it and finally [)] matches a close paren. The back-reference is returned.
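For readers working outside R, the same idea in Python is to collect every all-digit parenthesized group and keep the last one. This is a sketch under that assumption; last_bracketed_number is a hypothetical name.

```python
import re

def last_bracketed_number(text):
    """Return the digits of the last '(digits)' group in text, or None if there is none."""
    hits = re.findall(r'\((\d+)\)', text)  # only groups made up entirely of digits
    return hits[-1] if hits else None

x = [
    "East Kootenay C (5901035) RDA 01011",
    "Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020",
]
print([last_bracketed_number(s) for s in x])
# ['5901035', '5933039']
```

Because the pattern requires digits only between the parentheses, "(Copper Desert Country)" is never a candidate.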