regex capture with g modifier captures only first occurrence - regex

Using EcmaScript 6 RegExp
From this : "-=Section A=- text A -=Section B=- text b"
I want to get this: ['Section A', 'text A', 'Section B', 'text b']
Apart from the delimiters, everything else is variable. (Eventually '-=someString=-' will be '' but for now I did not want to clutter things up or create errors with characters that need escaping.)
I am not a regex expert, but I have searched all day for an example or guidance to make this work without success.
For example using this code:
let templateString = "-=Section A=- text A -=Section B=- text b";
let regex = RegExp('-=(.*?)=-(.*?)','g');
I only get this: ["-=Section A=-", "Section A", ""]
I am not sure how to make the second of the captures capture 'text A'. Also I do not understand why the g modifier is not making it continue after the first match and go on to find 'Section B' and 'text B'.
Any pointers to some examples would be appreciated - I have failed to find any.

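Note that the (.*?) at the end of the pattern will always match an empty string: it is lazy and nothing follows it, so the engine never has to expand it. text A cannot be captured because every match ends right after =-. As for the g modifier, a single RegExp#exec call returns only one match even with g; call it in a loop (or use String#matchAll) to get the remaining matches.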
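To see both effects in isolation, here is a minimal sketch. The variable names are illustrative, and the lookahead variant is just one possible way to give the lazy group a stopping point; it is not the approach suggested in this answer. Collecting all matches shows that the g flag does reach Section B, and that the trailing lazy group stays empty unless something after it forces it to expand:

```javascript
const templateString = "-=Section A=- text A -=Section B=- text b";

// Original pattern: nothing follows the last lazy group, so it stays empty.
const original = [...templateString.matchAll(/-=(.*?)=-(.*?)/g)];
const originalGroups = original.map((m) => [m[1], m[2]]);
console.log(originalGroups);

// Variant (an assumption for illustration): the lookahead makes the lazy
// group expand until the next "-=" or the end of the string.
const anchored = [...templateString.matchAll(/-=(.*?)=-(.*?)(?=-=|$)/g)];
const pairs = anchored.map((m) => [m[1], m[2].trim()]);
console.log(pairs);
```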
You may use
let templateString = "-=Section A=- text A -=Section B=- text b";
let regex = /\s*-=(.*?)=-\s*/;
console.log(templateString.split(regex).filter(Boolean));
The \s*-=(.*?)=-\s* pattern finds
\s* - 0+ whitespaces
-= - a -= substring
(.*?) - Group 1: any 0+ chars, as few as possible up to the first occurrence of the subsequent subpatterns
=- - a =- substring
\s* - 0+ whitespaces.
The String#split method adds to the resulting array all substrings captured into Group 1.
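The effect of a capturing group on String#split can be seen on a smaller input. This is a minimal sketch with a made-up sample string, comparing a separator pattern without and with a capturing group:

```javascript
const sample = "one, two, three";

// No capturing group: only the pieces between separators are returned.
const plain = sample.split(/,\s*/);

// Capturing group: the text matched by the group is spliced into the result.
const withSeparators = sample.split(/(,)\s*/);

console.log(plain);
console.log(withSeparators);
```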
If you want to use a matching approach, you would need to match any char, 0 or more occurrences, that does not start the leading char sequence, which seems to be -= in your scenario:
let templateString = "-=Section A=- text A -=Section B=- text b";
let regex = /-=(.*?)=-\s*([^-]*(?:-(?!=)[^-]*)*)/g;
let m, res = [];
while (m = regex.exec(templateString)) {
  res.push([m[1], m[2].trim()]);
}
console.log(res);
See this regex demo
Details
-=(.*?)=-\s* - same as in the first regex (see the split regex above)
([^-]*(?:-(?!=)[^-]*)*) - Group 2 that matches and captures:
[^-]* - 0+ chars other than -
(?: - start of a non-capturing group that matches
-(?!=) - a hyphen that is not immediately followed with =
[^-]* - 0+ chars other than -
)* - ...zero or more times
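The point of the -(?!=) part is that a plain hyphen inside the text does not end the capture. A minimal sketch, assuming a hypothetical helper name parseSections and a made-up input that contains a hyphenated word:

```javascript
// Hypothetical helper: returns [sectionName, text] pairs.
function parseSections(input) {
  const regex = /-=(.*?)=-\s*([^-]*(?:-(?!=)[^-]*)*)/g;
  const result = [];
  for (const m of input.matchAll(regex)) {
    result.push([m[1], m[2].trim()]);
  }
  return result;
}

// The hyphen in "well-known" is kept because it is not followed by "=".
const sections = parseSections("-=Intro=- a well-known text -=Outro=- bye");
console.log(sections);
```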

Related

RegEx string to find two strings and delete the rest of the text in the file including lines that don't contain the strings [duplicate]

I need to find certain words in a text file with Notepad++ and delete everything else.
I want to use a regex to find variations of thban..... The word is always followed by at most 5 characters (see the dots).
My search pattern hits the last line, but it matches the whole line; I just want the word preserved.
Once this works, I also want to keep the words containing C3.....
The rest of the text file can be deleted.
The search should also be case-insensitive.
(?!thban\w+).*\r?\n?
THBANES900 and C3950 bla bla
THBAN
..THBANES901.. C3850 bla bla
THBANMP900
**..thbanes900..**
This should result in
THBANES900 C3950
THBAN
THBANES901 C3850
THBANMP900
thbanes900
Maybe just capture those words of interest instead of replacing everything else? In Notepad++ search for pattern:
^.*\b(thban\S{0,5})(?:.*(\sC3\w+))?.*$|.+
See the Online Demo
^ - Start string anchor.
.*\b - Any character other than newline zero or more times up to a word boundary.
( - Open 1st capture group.
thban\S{0,5} - Match "thban" and zero to 5 non-whitespace chars.
) - Close 1st capture group.
(?: - Open non-capturing group.
.* - Any character other than newline zero or more times.
( - Open 2nd capture group.
\sC3\w+ - A whitespace character, then "C3" and one or more word characters.
) - Close 2nd capture group.
)? - Close non-capturing group and make it optional.
.* - Any character other than newline zero or more times.
$ - End string anchor.
| - Alternation (OR).
.+ - Any character other than newline once or more.
Replace with:
$1$2
After this, you may end up with empty lines, which you can swiftly remove using the built-in option (Edit > Line Operations > Remove Empty Lines in the English interface).
The search has to be case-insensitive, so make sure the "Match case" checkbox is not ticked.
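For readers who want to try the pattern outside Notepad++, here is a rough JavaScript port. It is only a sketch: Notepad++ uses a different regex engine, so the flags g, i and m stand in for the Find dialog settings, and the last input line is an extra made-up line added to show what happens to lines without a keyword.

```javascript
const input = [
  "THBANES900 and C3950 bla bla",
  "THBAN",
  "..THBANES901.. C3850 bla bla",
  "THBANMP900",
  "**..thbanes900..**",
  "a line without any keyword", // not in the question, added for illustration
].join("\n");

// g: process every line, i: ignore case, m: ^ and $ match at line boundaries
const pattern = /^.*\b(thban\S{0,5})(?:.*(\sC3\w+))?.*$|.+/gim;

// "$1$2" keeps only the captured words; lines without a keyword become empty
// and are dropped afterwards (the "remove empty lines" step).
const kept = input
  .replace(pattern, "$1$2")
  .split("\n")
  .filter((line) => line !== "");

console.log(kept.join("\n"));
```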
You may use
Find What: (?|\b(thban\S{0,5})|\s(C3\w+))|(?s:.)
Replace With: (?1$1\n:)
Details
(?| - start of a branch reset group:
\b(thban\S{0,5}) - Group 1: a word boundary, then thban and any 0 to 5 non-whitespace chars
| - or
\s(C3\w+) - a whitespace char, and then Group 1: C3 and one or more word chars
) - end of the branch reset group
| - or
(?s:.) - any one char (including line break chars)
The replacement is
(?1 - if Group 1 matched,
$1\n - Group 1 value with a newline
: - else, replace with empty string
) - end of the conditional replacement pattern
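JavaScript has neither branch reset groups nor conditional replacement strings, but the same idea can be approximated with two ordinary groups and a replacement callback. This is a sketch under that assumption, not a translation that Notepad++ itself would accept:

```javascript
const text = [
  "THBANES900 and C3950 bla bla",
  "THBAN",
  "..THBANES901.. C3850 bla bla",
  "THBANMP900",
  "**..thbanes900..**",
].join("\n");

// Group 1 or Group 2 holds a word to keep; [\s\S] consumes any other single
// character (including line breaks), like (?s:.) in the original pattern.
const pattern = /\b(thban\S{0,5})|\s(C3\w+)|[\s\S]/gi;

const result = text.replace(pattern, (match, thbanWord, c3Word) => {
  const word = thbanWord ?? c3Word;
  // Keep the word followed by a newline, drop everything else.
  return word === undefined ? "" : word + "\n";
});

console.log(result);
```

As with the original replacement, every kept word ends up on its own line.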

how to capture from group from end line in js regex?

I'm trying to capture a text into 3 groups. I have managed to capture 2 groups but am having an issue with the 3rd group.
This is the text :
<13>Apr 5 16:09:47 node2 Services: 2016-04-05 16:09:46,914 INFO [3]
Drivers.KafkaInvoker - KafkaInvoker.SendMessages - After sending
itemsCount=1
I'm using the following regex:
(?=- )(.*?)(?= - )|(?=])(.*?)(?= -)
My 3rd group should be : "After sending itemsCount=1"
any suggestions?
Your original expression is fine, just missing a $:
(?=- )(.*?)(?= - |$)|(?=])(.*?)(?= -)
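A quick way to check the difference is to run both versions against the log line (joined into a single line here, which is an assumption about the real input) and compare the full matches. The helper name allMatches is made up for this sketch:

```javascript
const line =
  "<13>Apr 5 16:09:47 node2 Services: 2016-04-05 16:09:46,914 INFO [3] " +
  "Drivers.KafkaInvoker - KafkaInvoker.SendMessages - After sending itemsCount=1";

const withoutEnd = /(?=- )(.*?)(?= - )|(?=])(.*?)(?= -)/g;
const withEnd = /(?=- )(.*?)(?= - |$)|(?=])(.*?)(?= -)/g;

// Hypothetical helper: list the full text of every match.
const allMatches = (regex) => [...line.matchAll(regex)].map((m) => m[0]);

const before = allMatches(withoutEnd); // the last part is missing
const after = allMatches(withEnd);     // the last part is found via |$

console.log(before);
console.log(after);
```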
and maybe we would slightly modify that to an expression similar to:
(?=-\s+).*?([A-Z].*?)(?=\s+-\s+|$)|(?=]\s+).*?([A-Z].*?)(?=\s+-)
You have 2 capturing groups. You don't get a match for the third part because the positive lookahead in the first alternative does not consider the end of the string. You can solve that by using an alternation that either looks for the " - " separator or asserts the end of the string:
(?=[-\]] )(.*?)(?= - |$)
                     ^^
If those matches are OK, you can simplify the pattern: a character class such as [-\]] matches either - or ], so the outer alternation is no longer needed, and the capturing group can be dropped because you only use the full matches.
Your pattern might then look like this (the matches still include the leading - or ], as your first 2 matches do):
(?=[-\]] ).*?(?= - |$)
If this is your string and you want to have 3 capturing groups, you might use:
^.*?\[\d+\]([^-]+)-([^-]+)-\s*([^-]+)$
^ Start of string
.*? Match any char except a newline non greedy
\[\d+\] match [ 1+ digits ]
([^-]+)- Capture group 1, match 1+ times not -, then match -
([^-]+)- Capture group 2, match 1+ times not -, then match -
\s* Match 0+ whitespace chars
([^-]+) Capture group 3, match 1+ times not -
$ End of string
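Applied in JavaScript, the three-group pattern could be used like this. It is a minimal sketch that assumes the log entry is a single line and trims the surrounding spaces that the [^-]+ groups pick up:

```javascript
const logLine =
  "<13>Apr 5 16:09:47 node2 Services: 2016-04-05 16:09:46,914 INFO [3] " +
  "Drivers.KafkaInvoker - KafkaInvoker.SendMessages - After sending itemsCount=1";

const regex = /^.*?\[\d+\]([^-]+)-([^-]+)-\s*([^-]+)$/;
const match = regex.exec(logLine);

// Groups 1..3 hold the three parts; trim the padding around them.
const fields = match ? match.slice(1).map((part) => part.trim()) : [];

console.log(fields);
```

Note that this pattern relies on the three parts themselves not containing a hyphen.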
For example, to create the desired object from the comments, you could first collect all the matches (match[0]) in an array.
After you have all the values, assemble the object using the keys and the values.
var output = {};
var regex = new RegExp(/(?=[-\]] ).*?(?= - |$)/g);
var str = `<13>Apr 5 16:09:47 node2 Services: 2016-04-05 16:09:46,914 INFO [3] Drivers.KafkaInvoker - KafkaInvoker.SendMessages - After sending itemsCount=1`;
var match;
var values = [];
var keys = ['Thread', 'Class', 'Message'];
while ((match = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (match.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  values.push(match[0]);
}
keys.forEach((key, index) => output[key] = values[index]);
console.log(output);

Regular Expression to parse group of strings with quotes separated by space

Given a line of string that does not have any linebreak, I want to get groups of strings which may consist of quotes and separated by space. Space is allowed only if it's within quotes. E.g.
a="1234" gg b=5678 c="1 2 3"
The result should have 4 groups:
a="1234"
gg
b=5678
c="1 2 3"
So far I have this
/[^\s]+(=".*?"|=".*?[^s]+|=[^\s]+|=)/g
but this cannot capture the second group "gg". I can't check if there is space before and after the text, as this will include the string that has space within quotes.
Any help will be greatly appreciated! Thanks.
Edited
This is for javascript
In JavaScript, you may use the following regex:
/\w+(?:=(?:"[^"]*"|\S+)?)?/g
See the regex demo.
Details
\w+ - 1+ letters, digits or/and _
(?:=(?:"[^"]*"|\S+)?)? - an optional sequence of:
= - an equal sign
(?:"[^"]*"|\S+)? - an optional sequence of:
"[^"]*" - a ", then 0+ chars other than " and then "
| - or
\S+ - 1+ non-whitespace chars
JS demo:
var rx = /\w+(?:=(?:"[^"]*"|\S+)?)?/g;
var s = 'a="1234" gg b=5678 c="1 2 3" d=abcd e=';
console.log(s.match(rx));
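If the keys and values are needed separately, the same pattern can be given capturing groups. This is only a sketch: the function name parsePairs and the choice of null for a missing value are assumptions, not part of the question.

```javascript
// Same structure as the pattern above, with capturing groups added:
// Group 1 = key, Group 2 = quoted value (without quotes), Group 3 = bare value.
const tokenRe = /(\w+)(?:=(?:"([^"]*)"|(\S+))?)?/g;

// Hypothetical helper: returns one { key, value } object per token.
function parsePairs(line) {
  const result = [];
  for (const m of line.matchAll(tokenRe)) {
    result.push({ key: m[1], value: m[2] ?? m[3] ?? null });
  }
  return result;
}

const parsed = parsePairs('a="1234" gg b=5678 c="1 2 3"');
console.log(parsed);
```

With this sketch a bare token such as gg and a token with an empty value such as e= both end up with a null value.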
If I did not misunderstand you, this is what you are looking for:
\w+=(?|"([^"]*)"|(\d+))|(?|[a-z]+)
Think of the alternation (|) as a fallback: put the more specific alternatives in front of the more generic ones.
Alternatively, you can remove the second ?| so that this part is captured as a separate group (Group 2) that you can check. Note that the branch reset group (?|...) is a PCRE feature and is not supported by JavaScript's RegExp.

How to get the first date in a script

I have lines of text as follows. I only want the first date after Examination date so that the expected output is 10.08.2017
Examination Date
date: 10.08.2017
423432
tert
g
534534
Examination Date: 04-07-2017
so far I have tried:
Examination Date.*?\d{2}.?{2}?.\d{4}
but the match extends all the way to 04-07-2017
Fix the pattern by adding \d before the {2}?, removing the unnecessary ?s, and capturing the value you need:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "Examination Date \n\ndate: 10.08.2017 \n423432\n\ntert\n\ng\n\n534534\n\nExamination Date: 04-07-2017";
Pattern pattern = Pattern.compile("Examination Date.*?\\b(\\d{2}\\W\\d{2}\\W\\d{4})\\b", Pattern.DOTALL);
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // => 10.08.2017
}
See the Java demo and the regex demo. In the code, you only get the first match as if is used, not while, and the . matches line breaks thanks to the Pattern.DOTALL modifier.
Details
Examination Date - a literal substring
.*? - any 0+ chars, as few as possible
\\b - a word boundary (if you do not care about matching the date as a "whole" word, remove the \\b)
(\\d{2}\\W\\d{2}\\W\\d{4}) - Group 1:
\\d{2} - 2 digits
\\W - any non-word char (punctuation, space, symbol)
\\d{2}\\W - as above
\\d{4} - 4 digits
\\b - a trailing word boundary.
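The role of the DOTALL modifier can be illustrated with the same pattern in JavaScript, where the s flag plays that part. This is a sketch for comparison only; the variable names are made up:

```javascript
const report =
  "Examination Date \n\ndate: 10.08.2017 \n423432\n\ntert\n\ng\n\n534534\n\n" +
  "Examination Date: 04-07-2017";

// With the s flag, "." also matches line breaks, so the lazy .*? can reach
// the first date after the first "Examination Date".
const withDotAll = /Examination Date.*?\b(\d{2}\W\d{2}\W\d{4})\b/s;

// Without it, "." stops at the end of the line, so only the last line,
// which has the date on the same line, can match.
const withoutDotAll = /Examination Date.*?\b(\d{2}\W\d{2}\W\d{4})\b/;

const firstDate = withDotAll.exec(report)?.[1];
const sameLineDate = withoutDotAll.exec(report)?.[1];

console.log(firstDate);
console.log(sameLineDate);
```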

Get the first occurrence of a string in a variable REGEX

I have the following variable in a database: PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT and I want to split it into two variables, the first will be PSC-CAMPO-GRANDE-I08 and the second V00-C09-H09-IPRMKT.
I'm trying the regex .*(\-I).*(\-V), this doesn't work. Then I tried .*(\-I), but it gets the last -IPRMKT string.
So my question is: is there a way to split the string PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT at the first occurrence of -I?
This should do the trick:
regex = r"(.*?-I[\d]{2})-(.*)"
Here is a test script in Python
import re
# Raw string, so that \d reaches the regex engine unchanged
regex = r"(.*?-I[\d]{2})-(.*)"
match = re.search(regex, "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT")
if match:
    print("yep")
    print(match.group(1))
    print(match.group(2))
else:
    print("nope")
In the regex, I'm grabbing everything up to the first -I then 2 numbers. Then match but don't capture a -. Then capture the rest. I can help tweak it if you have more logic that you are trying to do.
You may use
^(.*?-I[^-]*)-(.*)
See the regex demo
Details:
^ - start of a string
(.*?-I[^-]*) - Group 1:
.*? - any 0+ chars other than line break chars, as few as possible, up to the first occurrence of the subsequent subpattern (because *? is a lazy quantifier)
-I - a literal substring -I
[^-]* - any 0+ chars other than a hyphen (your pattern was missing it)
- - a hyphen
(.*) - Group 2: any 0+ chars other than line break chars up to the end of a line.
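The difference between the greedy and the lazy quantifier, and the resulting split, can be checked with a short JavaScript sketch. The function name splitAtFirstI is hypothetical:

```javascript
const code = "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT";

// Greedy .* runs to the last "-I", lazy .*? stops at the first one.
const greedy = /.*-I/.exec(code)[0];
const lazy = /.*?-I/.exec(code)[0];

// Hypothetical helper: split at the hyphen that follows the first "-Ixx" part.
function splitAtFirstI(value) {
  const match = /^(.*?-I[^-]*)-(.*)/.exec(value);
  return match ? [match[1], match[2]] : null;
}

const parts = splitAtFirstI(code);

console.log(greedy);
console.log(lazy);
console.log(parts);
```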