Regular expression with an optional group - regex

I want to split a comma-separated list of email addresses, and I also want to get the user-friendly name from each address where one is present.
Now I use this regular expression:
(?<value>(?<normalized>.*?)\[.*?\])\s*,*\s*
This regular expression works for the input string
"Eline[Elinek#yahoo.com],raymond[raymondc#yahoo.com]"
It returns two pairs:
value 'Eline[Elinek#yahoo.com]' normalized 'Eline'
value 'raymond[raymondc#yahoo.com]' normalized 'raymond'
but it doesn't work for the input string
"Eline[Elinek#yahoo.com],piet#yahoo.com,raymond[raymondc#yahoo.com]"
It should return three email addresses, with normalized empty for the second one.

Why should your second example return 3 matches? The second email has no [...], which your pattern requires, so this address is swallowed by the (?<normalized>.*?) part of the match for the third email address.
Try this here instead:
(?<value>(?<normalized>[^,]*?)\[.*?\]|[^,\[\]]*)\s*,?\s*
See it here on Regexr
But this is getting unreadable. Why not split on commas first and then work on the resulting array?
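The split-first idea can be sketched in Python as follows. This is only an illustration, not the asker's .NET code: the function name parse_recipients is hypothetical, and it assumes that neither the display names nor the addresses contain commas or nested brackets.

```python
import re

# Matches "name[address]"; names and addresses are assumed to contain no brackets.
BRACKETED = re.compile(r"(?P<normalized>[^\[\]]*)\[[^\[\]]*\]")

def parse_recipients(text):
    """Split on commas, then pull the optional friendly name out of each item."""
    results = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries such as a trailing comma
        match = BRACKETED.fullmatch(item)
        normalized = match.group("normalized") if match else ""
        results.append((item, normalized))
    return results

pairs = parse_recipients(
    "Eline[Elinek#yahoo.com],piet#yahoo.com,raymond[raymondc#yahoo.com]"
)
for value, normalized in pairs:
    print(repr(value), repr(normalized))
```

Each item is handled on its own, so an address without brackets simply yields an empty normalized name instead of bleeding into the next match.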

You can try this pattern:
(?<value>(?<normalized>[^\[,]*?)\[?[^,]*\]?)
It seems that your pattern is not intended to match the whole input string and that you intend to iterate through the individual matches, so there's no need to add the patterns for commas at the end.
The normalized group matches characters that are neither [ nor ,. The value group makes [ and ] optional and matches any characters in between that are not a comma.

Related

Regex that either matches a string OR returns a non empty value

I am writing a regex for a REST API that extracts specific numeric values, up to five decimal places. I am trying to write a regex that either matches the alphanumeric value or returns a non-empty value if no match is found. The alphanumeric values are embedded in a lengthy string. An example block of the string is below:
,"circulatingSupply":18687562,"totalSupply":18687562,"maxSupply":21000000,"marketCapDominance":50.5442,"
We need the numeric maxSupply value, and I built a regex that extracts it:
/\,\"maxSupply\"\:([0-9]+[.]?[0-9]{0,5})[0-9]*\,\"/
The problem is that the substring may or may not be present. If it is not present, the regex should return a non-empty value (not NULL or undefined). I tried the following, but could not get a working regex, as all of them are rejected by the REST API:
/\,\"maxSupply\"\:([0-9]+[.]?[0-9]{0,5})[0-9]*\,\"|([ ])/
/(?:\,\"maxSupply\"\:([0-9]+[.]?[0-9]{0,5})[0-9]*\,\")?/
/\,\"maxSupply\"\:([0-9]+[.]?[0-9]{0,5})[0-9]*\,\"|( )/
/(?:\,\"maxSupply\"\:([0-9]+[.]?[0-9]{0,5})[0-9]*\,\"){0,1}/
I tried using the OR operator, but I think I have a problem with its right-hand operand. The required substring is always in the middle of the large string, so I guess ^ and $ will not be of any use. To emphasize: if the substring is not present, the regex should return a non-empty value, say zero or a space character.
I found a solution elsewhere. The following gave me the desired result; it was all about the placement of the capture group and the OR operator.
$_ =~ /(?:\,\"maxSupply\"\:([0-9]+[.]?[0-9]{0,5}[0-9])*\,\"|(No Data))/
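Where the calling code can be changed, the fallback can also live outside the regex. The Python sketch below illustrates that alternative under stated assumptions: the pattern is a simplified variant written for this example (not the one accepted by the REST API in the question), and the names MAX_SUPPLY and extract_max_supply are hypothetical.

```python
import re

# Simplified pattern for this sketch: the digits after "maxSupply":,
# optionally followed by a dot and up to five decimal places.
MAX_SUPPLY = re.compile(r',"maxSupply":([0-9]+(?:\.[0-9]{1,5})?)')

def extract_max_supply(payload, default="No Data"):
    """Return the captured number, or a non-empty default when the key is absent."""
    match = MAX_SUPPLY.search(payload)
    return match.group(1) if match else default

block = (',"circulatingSupply":18687562,"totalSupply":18687562,'
         '"maxSupply":21000000,"marketCapDominance":50.5442,"')
print(extract_max_supply(block))
print(extract_max_supply(',"totalSupply":18687562,"'))
```

Keeping the default in code avoids relying on an alternation branch that can only succeed when its literal text actually occurs in the input.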

regex needed for parsing string

I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the FDA.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours for a regex solution that retrieves both the delimiter and the value that follows it. Although there seem to be posts that handle this, the code in those posts hasn't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>"
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
Working demo
The idea is that the first group contains the delimiters and the second group the content. I assume the content consists of word characters (letters, digits, and underscore). You then have to grab the content of each capturing group.
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
See live demo.
If your values can contain non-alphanumeric characters, it gets complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
See live demo.
Code the delimiters longest first, e.g. "=," before "=". Otherwise the alternation, which tries its branches left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.
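The longest-first ordering and the negative lookahead described above can be sketched in Python. Note one deliberate difference: this version places the lookahead before each character of the value rather than after it, an equivalent form that also accepts one-character values. The delimiter list and the names alternation and parse are assumptions made for this example.

```python
import re

# Delimiters from the question, including ones that share a prefix.
delimiters = ["=", "=,", "=/", "=>", "&,1"]

# Sort longest first so that "=," is tried before "=".
alternation = "|".join(
    re.escape(d) for d in sorted(delimiters, key=len, reverse=True)
)

# Group 1: the delimiter. Group 2: every following character that does not
# start another delimiter (checked by the negative lookahead).
pattern = re.compile(rf"({alternation})((?:(?!{alternation}).)+)")

def parse(text):
    """Return a list of (delimiter, value) pairs."""
    return pattern.findall(text)

pairs = parse("=/A9999XYZ=>100T0479&,1Blah")
for delimiter, value in pairs:
    print(repr(delimiter), repr(value))
```

Building the alternation from a list keeps the ordering rule in one place, so adding a delimiter later cannot accidentally put a shorter one first.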

Regex to Match a single or multiple email addresses separated by comma

I need to have a regex pattern that matches the following kind of string
#keyword1 a#b.com or #keyword2 a#b.com;b#c.com;d#e.com
The following regex pattern doesn't do exactly what I want:
/(#)(?:keyword1|keyword2)\s([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)/g
The regex above only matches #keyword1 a#b.com correctly.
But for the second it matches everything before the first semicolon. I need it to match the entire thing. How can I do that please?
I would suggest parsing the string in two steps. First distinguish the keyword from the array of email addresses and then split the array.
First retrieve both the keyword and the array, assuming that is all that the string consists of. I'm using the JavaScript RegExp notation, but you should be able to understand what is happening.
Assume the string is "#keyword2 a#b.com;b#c.com;d#e.com".
/^#(keyword1|keyword2) (.*)$/g
Group 1 will be "keyword2" and group 2 will be "a#b.com;b#c.com;d#e.com". Now apply the following pattern to group 2 and loop through the matches to retrieve each email address.
/([^;]*)(?:;|$)/g
This pattern makes no assumptions about whether or not the email addresses are properly formatted, just that they are separated by a semicolon. This also works if there's only a single email address.
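For illustration, the same two-step approach can be sketched in Python. The function name parse_line is hypothetical, and the second step uses str.split on the semicolon instead of the second regex, which gives the same addresses for well-formed input.

```python
import re

def parse_line(line):
    """Step 1: separate the keyword from the address list. Step 2: split the list."""
    match = re.match(r"^#(keyword1|keyword2) (.*)$", line)
    if match is None:
        return None  # the line does not start with a known keyword
    keyword, rest = match.groups()
    # Drop empty strings so a trailing semicolon does not add a blank entry.
    addresses = [address for address in rest.split(";") if address]
    return keyword, addresses

print(parse_line("#keyword2 a#b.com;b#c.com;d#e.com"))
print(parse_line("#keyword1 a#b.com"))
```

As with the patterns above, nothing here validates the addresses themselves; it only relies on the semicolon separator.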

regular expression matching issue

I've got a string which has the following format
some_string = ",,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,xxx,,,"
This is the content of a text file that I have opened as f.
I want to search for a specific term within the xxx fields (let's say that term is 'silicon').
Note that each xxx can be different and can contain any special characters (including metacharacters) except a newline.
match = re.findall(r",{3}(.*?silicon.*?),{3}", f.read())
print match
But this doesn't seem to work because it returns results which are in the format:
["xxx,,,xxx,,,xxx,,,xxx,,,silicon", "xxx,,,xxx,,,xxx,,,xxsiliconxx"] but I only want it to return ["silicon", "xxsiliconxx"]
What am I doing wrong?
Try the following regex:
(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})
Example:
>>> s = ',,,xxx,,,silicon,,,xxx,,,xxsiliconxx,,,xxx'
>>> re.findall(r'(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})', s)
['silicon', 'xxsiliconxx']
I am assuming that the content in the xxx can contain commas, just not three consecutive commas or it would end the field. If the content in the xxx sections cannot contain any commas, you can use the following instead:
(?<=,{3})[^,\r\n]*?silicon.*?(?=,{3})
The reason your current approach doesn't work is that even though .*? will try to match as few characters as possible, the match will still start as early as possible. So for example the regex a*?b would match the entire string "aaaab". The only time the regex will advance the starting position is when the regex fails to match, and since ,,, can be matched by the .*?, your match will always start at the beginning of the string or just after the previous match.
The lookbehind and lookahead address the issue raised by JaredC in the comments: re.findall() won't return overlapping matches, so the leading and trailing ,,, must not be part of the match.
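The points above can be checked with a short Python sketch. The variable names are chosen for this example; the sample string is the one from the example earlier in this answer.

```python
import re

# A lazy quantifier still starts at the leftmost possible position,
# so a*?b consumes all the leading a's before it can reach b.
lazy_match = re.search(r"a*?b", "aaaab").group()

s = ",,,xxx,,,silicon,,,xxx,,,xxsiliconxx,,,xxx"

# Pattern from the question: .*? is free to run across the ,,, separators.
original = re.findall(r",{3}(.*?silicon.*?),{3}", s)

# Pattern from this answer: the lookarounds keep the separators out of the
# match, and (?!,{3}) stops the prefix from crossing into another field.
fixed = re.findall(r"(?<=,{3})(?:(?!,{3}).)*?silicon.*?(?=,{3})", s)

print(lazy_match)
print(original)
print(fixed)
```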

Regular expression to check comma separated number values in Flex

Can anyone please help me find a suitable regular expression to validate a string of comma-separated numbers, e.g. '1,2,3' or '111,234234,-09'? Anything else should be considered invalid, e.g. '121as23' or '123-123'.
I suppose this must be possible in Flex using regular expression but I can not find the correct regular expression.
@Justin, I tried your suggestion /(?=^)(?:[,^]([-+]?(?:\d*\.)?\d+))*$/ but I am facing two issues:
It will invalidate '123,12' which should be true.
It won't invalidate '123,123,aasd' which is invalid.
I tried another regex - [0-9]+(,[0-9]+)* - which works quite well except for one issue: it validates '12,12asd'. I need something that will only allow numbers separated by commas.
Your example data consists of three decimal integers, each having an optional leading plus or minus sign, separated by commas with no whitespace. Assuming this describes your requirements, the JavaScript/ActionScript/Flex regex is simple:
var re_valid = /^[-+]?\d+(?:,[-+]?\d+){2}$/;
if (re_valid.test(data_string)) {
    // data_string is valid
} else {
    // data_string is NOT valid
}
However, if your data can contain any number of integers and may have whitespace, the regex becomes a bit longer:
var re_valid = /^[ \t]*[-+]?\d+[ \t]*(,[ \t]*[-+]?\d+[ \t]*)*$/;
If your data can be even more complex (i.e. the numbers may be floating point, the values may be enclosed in quotes, etc.), then you may be better off parsing the string as a CSV record and then checking each value individually.
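As a quick check against the inputs mentioned in the question, the second pattern can be exercised in Python. This is only a sketch: the function name is_valid_list is hypothetical, and re.fullmatch stands in for the ^ and $ anchors of the JavaScript version.

```python
import re

# Signed integers separated by commas, with optional spaces or tabs around each.
VALID_LIST = re.compile(r"[ \t]*[-+]?\d+[ \t]*(?:,[ \t]*[-+]?\d+[ \t]*)*")

def is_valid_list(text):
    """True only if the whole string is a comma-separated list of integers."""
    return VALID_LIST.fullmatch(text) is not None

samples = ["1,2,3", "111,234234,-09", "123,12", "121as23",
           "123-123", "12,12asd", "123,123,aasd"]
for sample in samples:
    print(repr(sample), is_valid_list(sample))
```

Matching the whole string is what rejects inputs such as '12,12asd', which an unanchored pattern like [0-9]+(,[0-9]+)* accepts.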
Looks like what you want is this:
/(?!,)(?:(?:,|^)([-+]?(?:\d*\.)?\d+))*$/
I don't know Flex, so replace the / at the beginning and end with whatever's appropriate in Flex regex syntax. Your numbers will be in match set 1. Get rid of the (?:\d*\.)? if you only want to allow integers.
Explanation:
(?!,) #Don't allow a comma at the beginning of the string.
(?:,|^) #Your groups are going to be preceded by ',' unless they're the very first group in the string. The '(?:blah)' means we don't want to include the ',' in our match groups.
[-+]? #Allow an optional plus or minus sign.
(?:\d*\.)?\d+ #The meat of the pattern, this matches '123', '123.456', or '.456'.
* #Means we're matching zero or more groups. Change this to '+' if you don't want to match empty strings.
$ #Don't stop matching until you reach the end of the string.