Regex (PCRE) exclude certain words from match result - regex

I need to get only the names (shown in bold) from this string:
author={Trainor, Sarah F and Calef, Monika and Natcher, David and Chapin, F Stuart and McGuire, A David and Huntington, Orville and Duffy, Paul and Rupp, T Scott and DeWilde, La'Ona and Kwart, Mary and others},
Is there a way to exclude every 'and' and 'others' from the match result?
I tried lots of things, but nothing works as I expect:
(?<=\{).+?(?<=and\s).+(?=\})

Instead of using omission, you may be better off writing rules that expect a specific format, based on the examples you've provided:
([A-Z]+[A-Za-z]*('[A-Za-z]+)*, [A-Z]? ?[A-Z]+[A-Za-z]*('[A-Za-z]+)*( [A-Z])?)
https://regex101.com/r/9LGqn3/3
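A minimal Python sketch of this format-based approach, assuming the standard re module and the sample line from the question (the names NAME, line and names are only illustrative):

```python
import re

# Pattern from the answer above: "Surname, Given" with optional initials
# and apostrophes (as in La'Ona).
NAME = re.compile(
    r"([A-Z]+[A-Za-z]*('[A-Za-z]+)*, [A-Z]? ?[A-Z]+[A-Za-z]*('[A-Za-z]+)*( [A-Z])?)"
)

line = ("author={Trainor, Sarah F and Calef, Monika and Natcher, David and "
        "Chapin, F Stuart and McGuire, A David and Huntington, Orville and "
        "Duffy, Paul and Rupp, T Scott and DeWilde, La'Ona and Kwart, Mary and others},")

# finditer + group(1) avoids the tuples that findall returns
# when a pattern has several capture groups.
names = [m.group(1) for m in NAME.finditer(line)]
print(names)
```

Because 'and' and 'others' never fit the "Surname, Given" shape, they are simply not matched.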

You could make use of \G and a capturing group to get you the matches.
The values are in capturing group 1.
(?:author={|\G(?!^))([^\s,]+,(?:\h+[^\s,]+)+)\h+and\h+(?=[^{}]*\})
About the pattern
(?: Non capturing group
author={ Match literally
| Or
\G(?!^) Assert position at the end of previous match, not at the start
) Close non capturing group
( Capture group 1
[^\s,]+, Match not a whitespace char or comma, then match a comma
(?:\h+[^\s,]+)+ Repeat 1+ times matching 1+ horizontal whitespace chars followed by matching any char except a whitespace char and a comma
) Close group 1
\h+and\h+ Match and between 1+ horizontal whitespaces
(?=[^{}]*\}) Assert what is on the right is a closing }
Regex demo
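Note that \G and \h are PCRE features; Python's built-in re module supports neither. If you are in Python, a two-step alternative (a different technique from the \G pattern above, shown only as a sketch) is to isolate the author field first and then split it on 'and'. The function name author_names is hypothetical:

```python
import re

def author_names(entry):
    """Return the names inside author={...}, without 'and' and 'others'."""
    # Step 1: grab everything between author={ and the last } on the line.
    field = re.search(r"author=\{(.*)\}", entry)
    if field is None:
        return []
    # Step 2: split on 'and' surrounded by whitespace, then drop 'others'.
    parts = re.split(r"\s+and\s+", field.group(1))
    return [p for p in parts if p != "others"]

print(author_names("author={Rupp, T Scott and DeWilde, La'Ona and Kwart, Mary and others},"))
```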

Related

How to match names separated by "and" excluding "and" itself using regex?

I am trying to solve http://play.inginf.units.it/#/level/10
I have some strings as follows:
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},
title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},
title={Fuel cells and their applications},
author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},
volume={117},
I need to match the names in bold. I tried the following regex:
(?<=author={).+(?=})
But it matches the entire string inside {}. I understand why, but how can I break the match at each and?
It took me a little while to get the samples to show up in your link. What about:
(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},).)+
See an online demo
(?:^\s*author={|\G(?!^) and ) - Either match the start of a line followed by 0+ whitespace chars and the literal 'author={', or assert the position at the end of the previous match (but not at the start of a line) and match ' and ';
\K - Reset starting point of reported match;
(?:(?! and |},).)+ - Match 1+ characters, each one only if neither ' and ' nor '},' starts at that position.
Above will also match 'others' as per last sample in linked test. If you wish to exclude 'others' then maybe add the option to the negated list as per:
(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},|\bothers\b).)+
See an online demo
In the comments we established that the above does not work on the linked website. Apparently it is JS-based, which does support variable-width lookbehind. Therefore try:
(?<=\bauthor={(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*
See the demo
(?<= - Open lookbehind;
\bauthor={ - Match word-boundary and literally 'author={';
(?:(?!\},).*?)) - Open non-capture group to match a negative lookahead for '},' and 0+ (lazy) characters. Close lookbehind;
\b[A-Z]\S*\b - Match anything between two word-boundaries starting with a capital letter A-Z followed by 0+ non-whitespace chars;
(?:,? [A-Z]\S*\b)* - A 2nd non-capture group to keep matching comma/space separated parts of a name.
If a variable-length lookbehind assertion is supported and the names consist of word characters, you might use:
(?<=\bauthor={[^{}]*(?:{[^{}]*}[^{}]*)*)[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b
Explanation
(?<= Positive lookbehind, assert that to the left of the current position is
\bauthor={ Match author={ preceded by a word boundary
[^{}]*(?:{[^{}]*}[^{}]*)* Match optional chars other than { } or match {...}
) Close the lookbehind
[A-Z] Match an uppercase char A-Z
[^\s,]*, Optionally match non whitespace chars except , and then match ,
(?: Non capture group to repeat as a whole part
\s+[A-Z][^\s,]* Match 1+ whitespace chars, uppercase char A-Z, optional non whitespace chars except ,
)+ Close the non capture group and repeat it 1 or more times
\b a word boundary
See a regex101 demo.
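Python's built-in re module only allows fixed-width lookbehind, so the patterns above will not compile there. A rough equivalent, offered as a sketch rather than a port, is to locate the author field first and then apply just the name part of the pattern to its contents. FIELD, NAME and names_in are illustrative names:

```python
import re

# Everything between author={ and the last } on the line.
FIELD = re.compile(r"\bauthor=\{(.*)\}")
# The name part of the pattern above: "Surname," followed by 1+ capitalised words.
NAME = re.compile(r"[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b")

def names_in(line):
    field = FIELD.search(line)
    # NAME has no capture groups, so findall returns the full matches.
    return NAME.findall(field.group(1)) if field else []

for line in ['title={Fuel cells and their applications},',
             'author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},']:
    print(names_in(line))
```

The lowercase 'and' can never start a name part, which is what separates the matches.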

How to make optional capturing groups be matched first

For example, I want to match three values: a required text and optional times and id, where the format of id is [id=100000]. How can I match the data correctly when text contains spaces?
My regex: (?<text>[\s\S]+) (?<times>\d+)? (\[id=(?<id>\d+)])?
Example source text: hello world 1 [id=10000]
In this example, the entire source text is matched by text.
The problem with your pattern is that [\s\S]+ matches any whitespace or non-whitespace character one or more times, so it captures everything and leaves nothing for the other capture groups. With the help of a positive lookahead and an alternation (|), we can make the last two capture groups optional.
The final pattern (?<text>[a-zA-Z ]+)(?=$|(?<times>\d+)? \[id=(?<id>\d+)])
Group text will match any letters and spaces.
The lookahead avoids consuming characters; it asserts that either the string ends here, or an optional number followed by [id=number] comes next.
That said, see regex101 for further explanation and some examples.
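As a rough illustration, here is the same pattern in Python, assuming the standard re module (which spells named groups (?P<name>...)). The helper name parse is hypothetical, and the strip() call is an addition because [a-zA-Z ]+ can include a trailing space:

```python
import re

PATTERN = re.compile(
    r"(?P<text>[a-zA-Z ]+)(?=$|(?P<times>\d+)? \[id=(?P<id>\d+)])"
)

def parse(source):
    m = PATTERN.match(source)
    if m is None:
        return None
    # Groups captured inside the lookahead are still available afterwards.
    return m.group("text").strip(), m.group("times"), m.group("id")

print(parse("hello world 1 [id=10000]"))
print(parse("hello world [id=7]"))
print(parse("hello world"))
```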
You could use:
:\s*(?<text>[^][:]+?)\s*(?<times>\d+)? \[id=(?<id>\d+)]
Explanation
: Match literally
\s* Match optional whitespace chars
(?<text> Group text
[^][:]+? match 1+ occurrences of any char except [ ] :
) Close group text
\s* Match optional whitespace chars
(?<times>\d+)? Optional group times, match 1+ digits
\[id= Match a space followed by [id=
(?<id>\d+) Group id, match 1+ digits
] Match literally
Regex demo

What is the proper regex for capturing everything after "String" and between two delimiters ('=' and a non-alphanumeric character)?

Details={
AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, VpcId=vpc-123,
IpPermissions=[{FromPort=3306, ToPort=3306, IpProtocol=tcp, IpRanges=[{CidrIp=1.1.1.1/32}, {CidrIp=2.2.2.2/32}, {CidrIp=0.0.0.0/0}, {CidrIp=3.3.3.3/32}],
UserIdGroupPairs=[{UserId=123, GroupId=sg-123abc}]}], IpPermissionsEgress=[{IpProtocol=-1, IpRanges=[{CidrIp=0.0.0.0/0}]}], GroupId=sg-123abc}},
Region=us-east-1, Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc}]
}
I want to capture exactly arn:aws:ec2:us-east-1:123:security-group/sg-123abc in this example. Generically, I want to capture the value of Id regardless of placement. My current solution is /Details={.*Id=(.*\w)/, but this only works if it's the last object in the data. How can I take into account the following potential scenario:
Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]
Your pattern contains .* twice. The first .* matches until the end of the line/string (depending on whether the dot matches a newline) and then backtracks to the last occurrence where this part of the pattern, Id=(.*\w), can match.
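A small Python sketch of that backtracking behaviour, using a shortened, made-up data string (not the original payload) in which another key ending in Id comes last:

```python
import re

# Hypothetical, shortened input: the wanted Id= is followed by GroupId=.
data = "Details={Id=arn:aws:ec2:sg-1, GroupId=sg-2}]"

m = re.search(r"Details={.*Id=(.*\w)", data)
captured = m.group(1)

# The greedy .* backtracks only as far as the LAST "Id=",
# which here is the tail of "GroupId=".
print(captured)
```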
If you want to use a capture group, you can make the format and the allowed characters a bit more specific:
\bId=(\w+(?:[:\/-]\w+)+)
The pattern in parts
\b A word boundary to prevent a partial word match
Id= Match literally
( Capture group 1
\w+ Match 1+ word chars
(?:[:\/-]\w+)+ Repeat 1+ times either : / - and 1+ word chars
) Close group 1
Regex demo
Or if you know that it starts with Id=arn:
\bId=(arn:[\w:\/-]+)
Regex demo
Note that you only have to escape the / when the delimiters of the regex are forward slashes, but there is no language tagged.
You can use look-behind to check that there is the Id= prefix, and then match anything that is not a space, comma or closing brace:
(?<=\bId=)[^,}\s]*
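Both ideas can be tried in Python, assuming the standard re module. The data string below is a shortened version of the example with an extra key after Id, and the variable names are illustrative:

```python
import re

data = ("Details={AwsEc2SecurityGroup={GroupName=m.com-rds, OwnerId=123, "
        "GroupId=sg-123abc}}, Region=us-east-1, "
        "Id=arn:aws:ec2:us-east-1:123:security-group/sg-123abc, Thing=123abc}]")

# Capture group: \b keeps OwnerId= and GroupId= from matching.
by_group = re.search(r"\bId=(\w+(?:[:/-]\w+)+)", data).group(1)

# Fixed-width lookbehind: the match itself is the value.
by_lookbehind = re.search(r"(?<=\bId=)[^,}\s]*", data).group(0)

print(by_group)
print(by_lookbehind)
```

Neither version depends on Id being the last key in the data.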

Regex pattern for matching float followed by some fixed strings

I want a regex pattern that could match the following cases:
0, 1, 0.1, .1, 1g, 0.1g, .1g, 1(g/100ml), .1(g/ml)
If the regex matches the pattern, I want to capture only the numerical part(0,1,0.1..)
I tried using following regex but it matches many cases:
((?=\.\d|\d)(?:\d+)?(?:\.?\d*))|((?=\.\d|\d)(?:\d+)?(?:\.?\d*))[a-zA-Z]+?|\([^)]*\)
How to achieve above with single regex pattern?
Edit:
To make the question solution more generic
What would be a single regex that would match below
Any numerical ( 0, 1, 0.1, ...)
Any numerical followed by g, mg any characters (0.1g, .1mg, 100kg)
Any numerical followed by anything in parentheses - .1(g/100ml), 100(mg/1kg)
And just capture the numerical part
You could make the pattern a bit more specific and use a capture group for the digits, then either optionally match what follows or (updated with the comment of @anubhava) add a word boundary to prevent another partial match.
(\d*\.?\d+)(?:\(g\/\d*ml\)|g?\b)
(\d*\.?\d+) Capture group 1, match optional digits, optional . and 1+ digits
(?: Non capture group for the alternation
\(g\/\d*ml\) Match (g/ optional digits and ml)
| Or
g?\b Match an optional g followed by a word boundary
) Close non capture group
Regex demo
If the values should match in the comma separated string, you can assert either a , or the end of the string to the right.
(\d*\.?\d+)(?:\(g\/\d*ml\)|g)?(?=,|$)
Regex demo
Edit
A broad pattern to match anything between parenthesis or optional chars a-zA-Z after the digits:
(\d*\.?\d+)(?:\([^()]*\)|[a-zA-Z]*\b)
(\d*\.?\d+) Capture group 1, match optional digits, optional . and 1+ digits
(?: Non capture group
\([^()]*\) Match from opening till closing parenthesis
| Or
[a-zA-Z]*\b Optionally match chars in the ranges a-zA-Z followed by a word boundary
) Close non capture group
Regex demo
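If each value arrives as a separate token, one option (an assumption about the use case, not part of the answer) is to apply the broad pattern with re.fullmatch so that tokens which do not fit the format are rejected outright. The names QUANTITY and numeric_part are hypothetical:

```python
import re

QUANTITY = re.compile(r"(\d*\.?\d+)(?:\([^()]*\)|[a-zA-Z]*\b)")

def numeric_part(token):
    """Return the leading number of a token, or None if it does not fit."""
    m = QUANTITY.fullmatch(token.strip())
    return m.group(1) if m else None

tokens = ["0.1g", ".1mg", "100kg", ".1(g/100ml)", "100(mg/1kg)", "abc"]
print([numeric_part(t) for t in tokens])
```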
EDIT2: With OP's edited samples (to match 0, 1, 0.1 OR 0.1g, .1mg, 100kg OR .1(g/100ml), 100(mg/1kg)), adding the following solution here. The explanation is the same as for the very first solution; the only difference is that instead of matching specific strings, the regex now matches any letters.
(\d*\.?\d+)(?:[a-zA-Z]+|\([a-zA-Z]+(?:\/\d*(?:[a-zA-Z]+))?\)|(?:,\s+|$))
Online Demo for above regex
EDIT1: As per OP's comments to match .01c and 100(g/1000L) kind of examples adding following regex, which is small edit to 1st solution here.
(\d*\.?\d+)(?:g|cc|\(g(?:\/\d*(?:ml|L))?\)|(?:,\s+|$))
Online demo for above regex
With your shown samples, please try following regex here.
(\d*\.?\d+)(?:g|\(g(?:\/\d*ml)?\)|(?:,\s+|$))
Online demo for above regex
Explanation: Adding detailed explanation for above.
(\d*\.?\d+) ##Matching 0 or more digits, followed by an optional ., followed by 1 or more digits here.
(?: ##Starting a non-capturing group here.
g| ##matching only g here OR.
\(g(?:\/\d*ml)?\)| ##Matching (g) OR (g/digits ml) here OR.
(?:,\s+|$) ##Matching comma followed by 1 or more spaces occurrences OR end of value here.
) ##Closing non-capturing group here.
try this:
[\d]?\.?\d+(?:g|(?<p>\()(?(p)g\/(?:\d+)?ml\)))?
Demo

How can I match everything between 2 commas?

I want to match basically any text that has a comma separated list of weekdays.
(?i)(every (mon|tue|wed|thu|fri|sat|sun)[A-Za-z]{3,5}, .*+,
(mon|tue|wed|thu|fri|sat|sun)[A-Za-z]{3,5})
Above is what I have, and I want to make it match the following strings. I don't need help in the case that only 2 weekdays are supplied.
Every mon, tue, wednesday
Every wed, Saturday, Friday, sun.
Try pattern: (?<=,|^)[^,\n]+
Explanation
(?<=,|^) - positive lookbehind: assert what precedes is comma , or beginning of the string ^
[^,\n]+ - match one or more characters other than comma , or newline \n
Demo
You might list the abbreviations and optionally match the full name by listing them using an alternation followed by a comma and a space.
Add that to a group and repeat that 0+ times. After that add the group without a comma to make sure you match at least a single day.
(?i)\bevery (?:(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)?), )*(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)?)\b
Explanation
(?i)\bevery Case insensitive modifier, a word boundary, then match every followed by a space
(?: Non capturing group
(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)?), Match any of the listed followed by a comma and space
)* Close non capturing group and repeat 0+ times
(?: Non capturing group
mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?|fri(?:day)?|sat(?:urday)?|sun(?:day)? Match any of the listed
)\b Close non capturing group and add a word boundary to prevent being part of a larger word
Regex demo
To match only multiple days, you could change the * quantifier of the first non capturing group to, for example, + or {2,}.
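Since the day alternation appears twice, it can be easier to build the pattern in code. A Python sketch under that assumption, where DAYS, DAY and weekday_list are hypothetical names and the quantifier of the first group is a parameter:

```python
import re

# Abbreviation -> remainder of the full day name.
DAYS = {"mon": "day", "tue": "sday", "wed": "nesday", "thu": "rsday",
        "fri": "day", "sat": "urday", "sun": "day"}
DAY = "|".join(f"{abbr}(?:{rest})?" for abbr, rest in DAYS.items())

def weekday_list(quantifier="*"):
    # quantifier applies to the "day, " group: "*" allows a single day,
    # "+" requires at least two days in total.
    return re.compile(rf"\bevery (?:(?:{DAY}), ){quantifier}(?:{DAY})\b",
                      re.IGNORECASE)

any_count = weekday_list()
two_or_more = weekday_list("+")

print(any_count.search("Every wed, Saturday, Friday, sun.").group(0))
print(two_or_more.search("Every mon"))
```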