Regex: need some help match using semicolon

I have the following input: Mobileapp/1.19.2 (SM-S908B; Android 12; da-DK)
I either need to match (SM-S908B; and da-DK) or just (SM-S908B;
That is, match the text between ( and the first ; and the text between the last ; and ).
I tried and was able to come up with this expression: ([^(;]+);([^;]+)
But it matches SM-S908B; Android 12
Would really appreciate if anyone could help since I am still learning Regex.

Assuming at least one occurrence of the semicolon is present, maybe chuck both options in their own group:
(?:\(([^;]+)|;\s*([^;)]+)\))
See an online demo
(?: - Open non-capture group;
\(([^;]+) - Match a literal open parenthesis followed by a 1st capture group to match 1+ non-semicolon characters;
| - Or;
;\s*([^;)]+)\) - Match a semicolon and 0+ whitespace characters before a 2nd capture group to match 1+ characters other than semicolon or closing parenthesis, followed by a literal closing parenthesis.
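As a rough sketch of how the two capture groups might be read in JavaScript (the variable names are only illustrative), each match fills either group 1 or group 2, so the code takes whichever one is set:

```javascript
const input = "Mobileapp/1.19.2 (SM-S908B; Android 12; da-DK)";
// Same pattern as above, with the global flag so matchAll can iterate.
const pattern = /(?:\(([^;]+)|;\s*([^;)]+)\))/g;

// Each match sets exactly one of the two groups.
const parts = [];
for (const m of input.matchAll(pattern)) {
  parts.push(m[1] ?? m[2]);
}
console.log(parts); // [ 'SM-S908B', 'da-DK' ]
```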
Another option is to match just these parts:
(?:\(|;.*;\s*|\G(?!^))\K[^;)]+
See an online demo
(?: - Open non-capture group;
\( - Match an open parenthesis;
| - Or;
;.*;\s* - Match from 1st semicolon to last semicolon with possible 0+ whitespace chars;
| - Or;
\G(?!^) - Assert position at end of previous match but exclude start-line with negative lookahead;
\K - Reset starting point of reported match;
[^;)]+ - Match 1+ characters other than semicolon or closing parenthesis.

Related

How to match names separated by "and" excluding "and" itself using regex?

I am trying to solve http://play.inginf.units.it/#/level/10
I have some strings as follows:
title={AUTOMATIC ROCKING DEVICE},
author={Diaz, Navarro David and Gines, Rodriguez Noe},
year={2006},
title={The sitting position in neurosurgery: a retrospective analysis of 488 cases},
author={Standefer, Michael and Bay, Janet W and Trusso, Russell},
journal={Neurosurgery},
title={Fuel cells and their applications},
author={Kordesch, Karl and Simader, G{"u}nter and Wiley, John},
volume={117},
I need to match the names in bold. I tried the following regex:
(?<=author={).+(?=})
But it matches the entire string inside {}. I understand why that is, but how can I split the match on and?

It took me a little while to get the samples to show up in your link. What about:
(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},).)+
See an online demo
(?:^\s*author={|\G(?!^) and ) - Either match the start of a line followed by 0+ whitespace chars and a literal 'author={', or assert the position at the end of the previous match (but not at the start of a line) followed by a literal ' and ';
\K - Reset starting point of reported match;
(?:(?! and |},).)+ - Match 1+ characters, each one only if it does not start ' and ' or a '}' followed by a comma.
Above will also match 'others' as per last sample in linked test. If you wish to exclude 'others' then maybe add the option to the negated list as per:
(?:^\s*author={|\G(?!^) and )\K(?:(?! and |},|\bothers\b).)+
See an online demo
In the comment section we established that the above would not work on the linked website. Apparently it's JS-based, which would support zero-width lookbehind. Therefore try:
(?<=\bauthor={(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*
See the demo
(?<= - Open lookbehind;
\bauthor={ - Match word-boundary and literally 'author={';
(?:(?!\},).*?)) - Open non-capture group to match a negative lookahead for '},' and 0+ (lazy) characters. Close lookbehind;
\b[A-Z]\S*\b - Match anything between two word-boundaries starting with a capital letter A-Z followed by 0+ non-whitespace chars;
(?:,? [A-Z]\S*\b)* - A 2nd non-capture group to keep matching comma/space separated parts of a name.
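A minimal sketch of trying the lookbehind pattern in JavaScript, assuming an engine with variable-length lookbehind support (recent V8, as in Node.js or Chrome). The sample line is taken from the question; the variable names are illustrative:

```javascript
const line = "author={Diaz, Navarro David and Gines, Rodriguez Noe},";
// Pattern from this answer; the lookbehind only checks that 'author={'
// appears earlier on the same line.
const authorPattern = /(?<=\bauthor={(?:(?!\},).*?))\b[A-Z]\S*\b(?:,? [A-Z]\S*\b)*/g;

const names = line.match(authorPattern);
console.log(names); // [ 'Diaz, Navarro David', 'Gines, Rodriguez Noe' ]
```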

If a lookbehind assertion is supported and you are matching word characters, you might use:
(?<=\bauthor={[^{}]*(?:{[^{}]*}[^{}]*)*)[A-Z][^\s,]*,(?:\s+[A-Z][^\s,]*)+\b
Explanation
(?<= Positive lookbehind, assert that to the left of the current position is
\bauthor={ Match author={ preceded by a word boundary
[^{}]*(?:{[^{}]*}[^{}]*)* Match optional chars other than { } or match {...}
) Close the lookbehind
[A-Z] Match an uppercase char A-Z
[^\s,]*, Optionally match non whitespace chars except , and then match ,
(?: Non capture group to repeat as a whole part
\s+[A-Z][^\s,]* Match 1+ whitespace chars, uppercase char A-Z, optional non whitespace chars except ,
)+ Close the non capture group and repeat it 1 or more times
\b a word boundary
See a regex101 demo.

Regex - Find string that has 5 or more mentions in it

Trying to detect whether a message has 5 or more mentions in it.
For example:
#Boddy is doing great with #shirly #rebecca #jimmy and #mom
Above will count as one match. This will not count as a match:
#Boddy is doing great with #shirly #rebecca #jimmy and ...
Preferably, ##### should not count either, but not too important!
I've tried
#([^# ]+){5,}
But no luck, it highlights all 5 instead of the whole string.

Use a pattern which covers the entire string:
^.*#.*#.*#.*#.*#.*$
Demo

Assumptions:
Looking at your data you'd maybe want to assert that the '#' is preceded with either the start-line anchor or a space;
You'd like to avoid concatenated '#'s to prevent false positives.
With these in mind, maybe you could try:
^(?:[^#]*(?<!\S)#\w+){5}.*$
See an online demo.
^ - Start-line anchor;
(?: - Open non-capture group;
[^#]* - 0+ (Greedy) characters other than '#';
(?<!\S) - Negative lookbehind to assert position is not preceded by a non-whitespace;
#\w+ - A literal '#' with 1+ (Greedy) word-characters;
){5} - Close non-capture group and match it 5 times;
.* - Any 0+ (Greedy) characters;
$ - An end-line anchor.
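To see how this pattern behaves on the samples from the question, here is a small JavaScript sketch. It assumes lookbehind support in the engine, and the array and variable names are made up for illustration:

```javascript
// Pattern from this answer: five mentions, each '#' at line start or after whitespace.
const fiveMentions = /^(?:[^#]*(?<!\S)#\w+){5}.*$/;

const samples = [
  "#Boddy is doing great with #shirly #rebecca #jimmy and #mom",
  "#Boddy is doing great with #shirly #rebecca #jimmy and ...",
  "#####",
];
const results = samples.map((s) => fiveMentions.test(s));
console.log(results); // [ true, false, false ]
```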

You can change your pattern by adding the # after the negated character class, and also end the pattern with the negated character class.
Using the negated character class also prevents unnecessary backtracking.
If you don't need the capture group, you can use a non capture group (?:
Note that [^#]* can also match a newline
^([^#]*#){5,}[^#]*$
See a regex demo.
If you want to match mentions, so that ##### does not match, and you don't want to match across newlines, you can prepend \B before the # to assert a non-word boundary.
Then match at least a single char other than a whitespace char or # after matching the #.
^[^#\n\r]*(\B#[^#\s][^#\n\r]*){5,}$
See another regex demo.
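The difference between the two patterns in this answer can be checked with a short JavaScript sketch (the constant names are illustrative). The first one only counts '#' characters, while the second expects each '#' to start a mention:

```javascript
// Counts any five '#' characters, so "#####" passes.
const anyFiveHashes = /^([^#]*#){5,}[^#]*$/;
// Requires each '#' to start a mention and stays on one line.
const fiveMentionsStrict = /^[^#\n\r]*(\B#[^#\s][^#\n\r]*){5,}$/;

const checks = [
  "#Boddy is doing great with #shirly #rebecca #jimmy and #mom",
  "#####",
].map((s) => [anyFiveHashes.test(s), fiveMentionsStrict.test(s)]);
console.log(checks); // [ [ true, true ], [ true, false ] ]
```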

Regex - add a zero after second period

I have the following example of numbers, and I need to add a zero after the second period (.).
1.01.1
1.01.2
1.01.3
1.02.1
I would like them to be:
1.01.01
1.01.02
1.01.03
1.02.01
I have the following so far:
Search:
^([^.])(?:[^.]*\.){2}([^.].*)
Substitution:
0\1
but this returns:
01 only.
I need the 1.01. to be captured in a group as well, but now I'm getting confuddled.
Does anyone know what I am missing?
Thanks!!

You may try this regex replacement with 2 capture groups:
Search:
^(\d+\.\d+)\.([1-9])
Replacement:
\1.0\2
RegEx Demo
RegEx Details:
^: Start
(\d+\.\d+): Match 1+ digits + dot followed by 1+ digits in capture group #1
\.: Match a dot
([1-9]): Match digits 1-9 in capture group #2 (this is to avoid putting 0 before already existing 0)
Replacement: \1.0\2 inserts 0 just before capture group #2
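For illustration, the same search and replacement could look like this in JavaScript, where group references are written $1 and $2 instead of \1 and \2. The third input is an assumed extra case to show that an existing leading zero is left alone:

```javascript
const search = /^(\d+\.\d+)\.([1-9])/;
const inputs = ["1.01.1", "1.02.1", "1.01.01"];
// "$1.0$2" is the JavaScript spelling of the replacement \1.0\2.
const padded = inputs.map((s) => s.replace(search, "$1.0$2"));
console.log(padded); // [ '1.01.01', '1.02.01', '1.01.01' ]
```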

You could try:
^([^.]*\.){2}\K
Replace with 0. See an online demo
^ - Start line anchor.
([^.]*\.){2} - Negated character 0+ times (greedy) followed by a literal dot, matched twice.
\K - Reset starting point of reported match.
EDIT:
Or/And if the \K meta escape isn't supported, then see if the following works:
^((?:[^.]*\.){2})
Replace with ${1}0. See the online demo
^ - Start line anchor.
( - Open 1st capture group;
(?: - Open non-capture group;
[^.]*\. - Negated character 0+ times (greedy) followed by a literal dot.
){2} - Close non-capture group and match twice.
) - Close capture group.

Using your pattern, you can use 2 capture groups and put a zero before the second group in the replacement, for example \g<1>0\g<2> or ${1}0${2} or $10$2 depending on the language.
^((?:[^.]*\.){2})([^.])
^ Start of string
((?:[^.]*\.){2}) Capture group 1, match 2 times: optional chars other than a dot, then a dot
([^.]) Capture group 2, match any char except a dot
Regex demo
A more specific pattern could be matching the digits
^(\d+\.\d+\.)(\d)
^ Start of string
(\d+\.\d+\.) Capture group 1, match 2 times 1+ digits and a dot
(\d) Capture group 2, match a digit
Regex demo
For example in JavaScript
const regex = /^(\d+\.\d+\.)(\d)/;
[
"1.01.1",
"1.01.2",
"1.01.3",
"1.02.1",
].forEach(s => console.log(s.replace(regex, "$10$2")));

Obviously, there will be tons of solutions for this, but if this pattern holds (i.e. the trailing group is always a single digit), \.(\d)$ => .0\1 would suffice. To merely insert a 0, you don't need to match the whole thing, only enough context to uniquely identify the places targeted. In this case, finding all lines ending in a . followed by a single digit is enough.
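A quick JavaScript sketch of this trailing-digit idea, with a hypothetical helper named pad. The second call uses an assumed input with two trailing digits to show that such lines are not touched:

```javascript
const trailingDigit = /\.(\d)$/;
// ".0$1" inserts a dot, a zero, then the captured digit.
const pad = (s) => s.replace(trailingDigit, ".0$1");

console.log(pad("1.01.1"));  // 1.01.01
console.log(pad("1.01.12")); // 1.01.12 (two trailing digits, so no match)
```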

Regex to capture everything after optional token

I have fields which contain data in the following possible formats (each line is a different possibility):
AAA - Something Here
AAA - Something Here - D
Something Here
Note that the first group of letters (AAA) can be of varying lengths.
What I am trying to capture is the "Something Here" or "Something Here - D" (if it exists) using PCRE, but I can't get the Regex to work properly for all three cases. I have tried:
- (.*) which works fine for cases 1 and 2 but obviously not 3;
(?<= - )(.*) which also works fine for cases 1 and 2;
(?! - )(.+)| - (.+) works for cases 2 and 3 but not 1.
I feel like I'm on the verge of it but I can't seem to crack it.
Thanks in advance for your help.
Edit: I realized that I was unclear in my requirements. If there is a trailing " - D" (the letter in the data is arbitrary but should only be a single character), that needs to be captured as well.

About the patterns that you tried:
- (.*) This pattern will match the first occurrence of - followed by matching the rest of the line. It will match too much for the second example as the .* will also match the second occurrence of -
(?<= - )(.*) This pattern will match the same as the first example without the - as it asserts that it should occur directly to the left
(?! - )(.+)| - (.+) This pattern uses a negative lookahead which asserts what is directly to the right is not (?! - ). As none of the examples start with - , the whole line will be matched directly after the negative lookahead due to .+ and the second part after the alternation | will not be evaluated
If the first group of letters can be of varying length, you could make the match either specific matching 1 or more uppercase characters [A-Z]+ or 1+ word characters \w+.
To get a more broad match, you could match 1 or more non whitespace characters using \S+
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*
Explanation
^ Start of string
(?:\S+\h-\h)? Optionally match the first group of non whitespace chars followed by - between horizontal whitespace chars
\K Clear the match buffer (Forget what is currently matched)
\S+ Match 1+ non whitespace characters
(?: Non capture group
\h(?!-\h) Match a horizontal whitespace char and assert what is directly to the right is not - followed by another horizontal whitespace char
\S+ Match 1+ non whitespace chars
)* Close non capture group and repeat 0+ times to match more "words" separated by spaces
Regex demo
Edit
To match an optional hyphen and trailing single character, you could add an optional non capturing group (?:-\h\S\h*)?$ and assert the end of the string if the pattern should match the whole string:
^(?:\S+\h-\h)?\K\S+(?:\h(?!-\h)\S+)*\h*(?:-\h\S\h*)?$
Regex demo
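JavaScript supports neither \K nor \h, so the following is only an adaptation of the pattern above, not the PCRE pattern itself: a capture group stands in for \K, and [ \t] is an assumed approximation of \h (which covers more horizontal whitespace than space and tab).

```javascript
// Adaptation of the pattern above: [ \t] stands in for \h and a capture
// group stands in for \K, since JavaScript supports neither.
const pattern = /^(?:\S+[ \t]-[ \t])?(\S+(?:[ \t](?!-[ \t])\S+)*[ \t]*(?:-[ \t]\S[ \t]*)?)$/;

const fields = ["AAA - Something Here", "AAA - Something Here - D", "Something Here"];
const captured = fields.map((f) => f.match(pattern)?.[1]);
console.log(captured);
// [ 'Something Here', 'Something Here - D', 'Something Here' ]
```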

You may use
^(?:.*? - )?\K.*?(?= - | *$)
^(?:.*?\h-\h)?\K.*?(?=\h-\h|\h*$)
See the regex demo
Details
^ - start of string
(?:.*? - )? - an optional non-capturing group matching any 0+ chars other than line break chars, as few as possible, up to the first space-hyphen-space
\K - match reset operator
.*? - any 0+ chars other than line break chars as few as possible
(?= - | *$) - space-hyphen-space or 0+ spaces till the end of string should follow immediately on the right.
Note that \h matches any horizontal whitespace chars.

^(?:[A-Z]+ - \K)?.*\S
demo
Since "Something Here" can be anything, there's no reason to specially describe the possible trailing letter in the pattern. You don't need something more complicated.
With this pattern I assume that you are not interested in the trailing spaces, that's why I ended it with \S. If you want to keep them, remove the \S and change the previous quantifier to +.
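As a sketch for engines without \K (JavaScript, for instance), the same idea can be expressed by moving the wanted part into a capture group. This is a port under that assumption rather than the pattern as written, and the third input adds trailing spaces to show they are dropped:

```javascript
// Port of ^(?:[A-Z]+ - \K)?.*\S with a capture group in place of \K.
const pattern = /^(?:[A-Z]+ - )?(.*\S)/;

const fields = ["AAA - Something Here", "AAA - Something Here - D", "Something Here  "];
const captured = fields.map((f) => f.match(pattern)[1]);
console.log(captured);
// [ 'Something Here', 'Something Here - D', 'Something Here' ]
```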

RegEx multiple lines

I have this text:
text1 without brackets
text2 (with brackets)
and I need two groups in every line:
group#1: text1 without brackets
group#2:
group#1: text2
group#2: with brackets
Here is a link for this example: regexr.com
Thanks for help!

You may use
^(.*?)(?:\s*\(([^()]*)\))?$
See the regex demo and the regex graph:
Details
^ - start of string
(.*?) - Group 1: any 0+ chars as few as possible
(?:\s*\(([^()]*)\))? - an optional sequence of patterns that is tried at least once:
\s* - 0+ whitespaces
\( - a ( char
([^()]*) - Group 2: 0+ chars other than ( and )
\) - a ) char
$ - end of the string.
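A minimal JavaScript sketch of applying this pattern, assuming the text is split into lines first so that ^ and $ refer to each line. Group 2 is undefined for lines without brackets, so the code substitutes an empty string:

```javascript
const text = "text1 without brackets\ntext2 (with brackets)";
const linePattern = /^(.*?)(?:\s*\(([^()]*)\))?$/;

// Apply the pattern line by line; group 2 is undefined when there are no brackets.
const rows = text.split("\n").map((line) => {
  const m = line.match(linePattern);
  return [m[1], m[2] ?? ""];
});
console.log(rows);
// [ [ 'text1 without brackets', '' ], [ 'text2', 'with brackets' ] ]
```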

Try pattern: ([^(\n]+)(?:\n|\(([^)]+))
Explanation:
([^(\n]+) - first capturing group: match one or more characters other than ( or \n so it will match everything until opening bracket or newline character
(?:...) - used in order to make use of alternation and not create second capturing group
\n|\(([^)]+) - match newline or bracket ( and one or more characters other than closing bracket ) storing it into second capturing group.
Demo
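A short JavaScript sketch of this second pattern on the sample text. Two things are worth noting when trying it: group 1 of the bracketed line keeps the space before the bracket (hence the trim below), and a final line with no brackets and no trailing newline is not matched, because the pattern needs either a newline or an opening bracket after group 1.

```javascript
const text = "text1 without brackets\ntext2 (with brackets)";
const pattern = /([^(\n]+)(?:\n|\(([^)]+))/g;

// Trim group 1 because it includes the space before "(" on bracketed lines.
const rows = [...text.matchAll(pattern)].map((m) => [m[1].trim(), m[2] ?? ""]);
console.log(rows);
// [ [ 'text1 without brackets', '' ], [ 'text2', 'with brackets' ] ]
```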
