Regex Lookahead/Lookbehind if more than one occurrence - regex

I have string formulas like this:
?{a,b,c,d}
It can be embedded like this:
?{a,b,c,?{x,y,z}}
or, equivalently:
?{a,b,c,
?{x,y,z}
}
So I have to find the commas that are inside the second and deeper "levels" of brackets.
In the example below I marked the "levels" where I have to find all commas:
?{a,b,c,
?{x,y, <--Those
?{1,2,3} <--Those
}
}
I've tried with lookahead and lookbehind, but I'm totally confused now :/
Here is my latest working try, but it is not good at all:
OnlineRegex
Update:
To avoid misunderstanding, I don't want to count the commas.
I'd like to get groups of commas to replace them.
The condition: find the commas that have more than one "open tag" like this before them: ?{
.. without a matching closing tag like this: }
Example:
In this case I must not replace any commas:
?{1,2,3} ?{a,b,c}
But in this case I have to replace the commas between a, b and c:
?{1,2,3,?{a,b,c}}

For the examples you have provided, the following regex works (it gives the desired output you described):
(?<!^\?{[^{}]*),(?=[\s\S]*(?:\s*}){2,})
For String ?{a,b,c,d}, see Demo1 No Match
For String, ?{a,b,c,?{x,y,z}}, see Demo2 Match successful
For String,
?{a,b,c,
?{x,y,z}
}
see Demo3 Match Successful
For String,
?{a,b,c,
?{x,y,
?{1,2,3}
}
}
see Demo4 Match Successful
For String ?{1,2,3} ?{a,b,c} ?{1,2,3} ?{a,b,c}, see Demo5 No Match
Explanation:
(?<!^\?{[^{}]*), - a negative lookbehind that discards the 1st-level commas. The logic is that it must not match a comma that is preceded by the start of the string, followed by ?{, followed by 0+ occurrences of any character except { or }. Note that this lookbehind is variable-length, which not every regex engine supports.
(?=[\s\S]*(?:\s*}){2,}) - the comma matched above must be followed, somewhere later in the string, by at least 2 occurrences of } (consecutive or with only whitespace between them).

Your question is rather unclear, @norbre, but I presume you'd like to extract (i.e. "count") the commas.
You can't do this with a regex alone: regexps can't count the number of occurrences. However, you can use this one to extract the "internal part" and then use a spreadsheet formula to count the commas:
^(?:\?{[a-zA-Z0-9,]+?,\n??\s*?\?{)([a-zA-Z0-9,?{}\n\s]+?(?:\n*?\s*?|})+)(?:[a-zA-Z0-9,\n\s]*})$
Try: https://regex101.com/r/Rr0eFo/5
Examples
1.
Input:
?{a,b,c,?{e,f},1,2,3}
Output:
e,f}
2.
Input:
?{a,b,c,
?{x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
,d,e,f}
Output:
x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
3.
Input:
?{a,b,c,?{e},1,2,3}
Output:
e}
(note that there are no commas here!)
One caveat, however. As I have said, regexps can't count the number of occurrences.
Hence, the following sample (I don't know whether it's valid for your case) would return a wrong match:
?{a,b,c,?{e,f}
,1,2,3,?{a,b}
}
Output:
e,f}
,1,2,3,?{a,b}

OK replacing commas is another story so I'll add another answer.
Your regexp engine would need to support recursion.
Still I don't see a way to do it with one regex - one match would either contain the first comma or contain everything between the braces!
What I suggest is to use one regexp to get "what is inside the inner braces", run a replace (, => "") and assemble the whole line again using submatches from the regexp.
Here it is: (\?{[^?{}]*)((?>[^?{}]|(?R))+?)([^?{}]*?\})
Try: https://regex101.com/r/IzTeY0/3
Example 1:
Input:
?{a,b,c,
?{x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
,d,e,f}
Submatches:
1. ?{a,b,c,
2. ?{x,y,z,e,
?{1,2,3,?{f,g,3},4,5,6}
}
3.
,d,e,f}
Replace all commas in submatch 2 with anything you want, then reassemble the whole string using submatches 1 and 3.
Again, this input would break the regexp:
?{a,b,c,?{e,f}
,1,2,3,?{a,b}
}
Submatch 2 would look like this:
?{e,f}
,1,2,3,?{a,b}
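For engines without recursion, the nesting depth can also be tracked by hand instead of with a regex. Below is a minimal Python sketch of that different technique (a plain character scan, not the recursive regex above). It assumes every group opens with ?{ and closes with }; the function name replace_nested_commas and the ";" replacement are made up for illustration.

```python
def replace_nested_commas(text, replacement=";"):
    """Replace commas that sit at nesting depth 2 or deeper inside ?{...} groups."""
    out = []
    depth = 0  # number of currently open "?{" groups
    i = 0
    while i < len(text):
        if text.startswith("?{", i):
            depth += 1
            out.append("?{")
            i += 2
            continue
        ch = text[i]
        if ch == "}" and depth > 0:
            depth -= 1
        if ch == "," and depth >= 2:
            out.append(replacement)  # nested comma: swap it
        else:
            out.append(ch)           # everything else is copied unchanged
        i += 1
    return "".join(out)


print(replace_nested_commas("?{1,2,3} ?{a,b,c}"))   # ?{1,2,3} ?{a,b,c}
print(replace_nested_commas("?{1,2,3,?{a,b,c}}"))   # ?{1,2,3,?{a;b;c}}
```

Because the depth is counted rather than matched, the input that breaks the regexp (two sibling inner groups) should be handled as well: only the commas inside ?{e,f} and ?{a,b} are replaced.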

Related

Regular expression to extract string from urls

I need to extract a string from an URL. Here are some examples:
Input: https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html – Output: bas-026-009
Input: https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html – Output: aw18-245-b86
Input: https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html – Output: ss20-028-e70
I want to be able to extract the string that goes from the first character after the "/eur_en/" until the third dash. Can someone help me? Thanks
You're looking for regexp: \/eur_en\/([^-]+-[^-]+-[^-]+)
Play & test it at regex101: https://regex101.com/r/RvGROG/1
You need something like this:
const urls = [
"https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html",
"https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html",
"https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html",
]
const rg = new RegExp(`\/eur_en\/([^-]+-[^-]+-[^-]+)`)
const strs = urls.map(url => url.match(rg)[1])
console.log(strs)
// Output:
// [
// "bas-026-009",
// "aw18-245-b86",
// "ss20-028-e70"
// ]
Of course, this is a simple example. In real code, don't forget that .match returns null when nothing matches, so check the result before indexing into it.
The first element of the returned array is the full matched string; the second and following elements are the substrings captured by the parentheses.
We can improve (and complicate) our regex like so:
\/((?:[^-\/]+-){2}[^-\/]+)
This lets us drop the specific /eur_en/ anchor and control the number of dash-separated parts.
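To show how the repeat count controls the number of dash-separated parts, here is a rough Python sketch of the same pattern. The function name leading_dash_parts and its parts parameter are hypothetical, and the sketch assumes no earlier path segment (or the host name) contains enough dashes to match first.

```python
import re


def leading_dash_parts(url, parts=3):
    """Return the first `parts` dash-separated pieces of the first path
    segment that has at least that many pieces, or None if there is none."""
    # (parts - 1) repetitions of "piece-" followed by one final piece.
    pattern = r"/((?:[^-/]+-){%d}[^-/]+)" % (parts - 1)
    match = re.search(pattern, url)
    return match.group(1) if match else None


url = "https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html"
print(leading_dash_parts(url))     # bas-026-009
print(leading_dash_parts(url, 2))  # bas-026
```

Returning None instead of raising plays the same role as the null check on .match in the JavaScript version.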
The expression you're looking for is the following:
/(?<=eur_en\/)[^-]*-[^-]*-[^-]*/
Here is how it works:
(?<=eur_en\/): will look behind for eur_en/ but will not use it in the output
[^-]*: it will match any character that is not a dash. So it will get everything up to the first dash (not including the dash)
[^-]*: it will match any character that is not a dash. So it will get everything up to the second dash (not including the dash)
[^-]*: it will match any character that is not a dash. So it will get everything up to the third dash (not including the dash).
/(?<=\/eur_en\/)\w+-\w+-\w+/g
Tokens and descriptions:
(?<=\/eur_en\/)
Lookbehind: if /eur_en/ is found, match whatever follows it.
\w+-\w+-\w+
Three runs of one or more word characters (\w = [A-Za-z0-9_]) joined by two literal hyphens.
Review: https://regex101.com/r/Ge0zA3/1
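The lookbehind variant carries over to Python as well, since /eur_en/ has a fixed width. A small sketch, where the names CODE_AFTER_PREFIX and extract_codes are invented for this example:

```python
import re

# Fixed-width lookbehind: "/eur_en/" must precede the match but is not part of it.
CODE_AFTER_PREFIX = re.compile(r"(?<=/eur_en/)\w+-\w+-\w+")


def extract_codes(urls):
    """Return the matched code for every URL that contains one."""
    codes = []
    for url in urls:
        match = CODE_AFTER_PREFIX.search(url)
        if match:  # URLs without the prefix are skipped
            codes.append(match.group(0))
    return codes


urls = [
    "https://www.example.net/eur_en/bas-026-009-basic-baby-hat-beige.html",
    "https://www.example.net/eur_en/aw18-245-b86-big-cherries-snow-jacket-plum-red.html",
    "https://www.example.net/eur_en/ss20-028-e70-hearts-tee-off-white-yellow.html",
]
print(extract_codes(urls))  # ['bas-026-009', 'aw18-245-b86', 'ss20-028-e70']
```

With the lookbehind, the whole match (group 0) is already the wanted string, so no capture group is needed.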

Python Regex - How to extract the third portion?

My input is of this format: (xxx)yyyy(zz)(eee)fff where {x,y,z,e,f} are all numbers. The fff part is optional, though.
Input: x = (123)4567(89)(660)
Expected output: only the eee part, i.e. the number inside the 3rd "()", which is 660 in my example.
I am able to achieve this so far:
re.search("\((\d*)\)", x).group()
Output: (123)
Expected: (660)
I am surely missing something fundamental. Please advise.
Edit 1: Just added fff to the input data format.
You could find all the parenthesized matches and print the third one with findall:
import re
n = "(123)4567(89)(660)999"
r = re.findall(r"\(\d*\)", n)  # raw string avoids invalid-escape warnings
print(r[2])
Output:
(660)
The (eee) part is identical to the (xxx) part in your regex. If you don't provide an anchor, or some sequencing requirement, then an unanchored search will match the first thing it finds, which is (xxx) in your case.
If you know the (eee) always appears at the end of the string, you could append an "at-end" anchor ($) to force the match at the end. Or perhaps you could append a following character, like a space or comma or something.
Otherwise, you might do well to match the other parts of the pattern and not capture them:
pattern = r'[0-9()]{13}\((\d{3})\)'
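Since fff is optional, a bare $ right after the group is not quite enough. One way to sketch the end-anchor idea in Python is to allow trailing digits before the anchor. This assumes (eee) is always the last parenthesized group in the string; the names LAST_GROUP and last_group are hypothetical.

```python
import re

# A "(digits)" group, then optional trailing digits (the fff part), then end of string.
LAST_GROUP = re.compile(r"\((\d+)\)\d*$")


def last_group(text):
    """Return the digits of the final parenthesized group, or None."""
    match = LAST_GROUP.search(text)
    return match.group(1) if match else None


print(last_group("(123)4567(89)(660)"))     # 660
print(last_group("(123)4567(89)(660)999"))  # 660
```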
If you want to get the third group of numbers in brackets, you need to skip the first two groups which you can do with a repeating non-capturing group which looks for a set of digits enclosed in () followed by some number of non ( characters:
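import re
x = '(123)4567(89)(660)'
# Skip two "(digits)" groups (each followed by any non-"(" text), then capture the third.
print(re.search(r"(?:\(\d+\)[^(]*){2}(\(\d+\))", x).group(1))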
Output:
(660)
Demo on rextester
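The same skipping idea can be generalized to any position by making the repeat count a parameter. A minimal sketch, where the helper name nth_group is made up and, unlike the one-liner above, only the digits are captured, without the brackets:

```python
import re


def nth_group(text, n):
    """Return the digits inside the n-th "(...)" group (1-based), or None."""
    # Skip n-1 groups, each followed by any run of non-"(" characters,
    # then capture the digits of the next group.
    pattern = r"(?:\(\d+\)[^(]*){%d}\((\d+)\)" % (n - 1)
    match = re.search(pattern, text)
    return match.group(1) if match else None


x = "(123)4567(89)(660)999"
print(nth_group(x, 3))  # 660
print(nth_group(x, 1))  # 123
```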

Regex: Separate a string of characters with a non-consistent pattern (Oracle) (POSIX ERE)

EDIT: This question pertains to Oracle implementation of regex (POSIX ERE) which does not support 'lookaheads'
I need to separate a string of characters with a comma, however, the pattern is not consistent and I am not sure if this can be accomplished with Regex.
Corpus: 1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25
The pattern is basically 4 digits, followed by 4 characters, followed by a dot, followed by 1, 2, or 3 digits. To make the string above clear, this is how it looks when separated by spaces: 1710ABCD.13 1711ABCD.43 1711ABCD.4 1711ABCD.404 1711ABCD.25
So the output of a replace operation should look like this:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
I was able to match the pattern using this regex:
(\d{4}\w{4}\.\d{1,3})
It does insert a comma, but after the third digit beyond the dot (wrong: it should have been after the second digit), and I cannot get it to insert the comma in the right position and globally.
Here is a link to a fiddle
https://regex101.com/r/qQ2dE4/329
All you need is a lookahead at the end of the regular expression, so that the greedy \d{1,3} backtracks until it's followed by 4 digits (indicating the start of the next substring):
(\d{4}\w{4}\.\d{1,3})(?=\d{4})
                     ^^^^^^^^^
https://regex101.com/r/qQ2dE4/330
To expand on @CertainPerformance's answer, if you want to be able to match the last token, you can use an alternative match of $:
(\d{4}\w{4}\.\d{1,3})(?=\d{4}|$)
Demo: https://regex101.com/r/qQ2dE4/331
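For an engine that does support lookaheads, the pattern with the $ alternative can be used to pull out every token and join them. A short Python sketch (not Oracle; the names TOKEN and split_tokens are invented here):

```python
import re

# Token = 4 digits, 4 word characters, a dot, then 1-3 digits that must be
# followed either by the 4 digits of the next token or by the end of the string.
TOKEN = re.compile(r"(\d{4}\w{4}\.\d{1,3})(?=\d{4}|$)")


def split_tokens(corpus):
    """Return the list of tokens found in the concatenated corpus."""
    return TOKEN.findall(corpus)


corpus = "1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25"
print(",".join(split_tokens(corpus)))
# 1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
```

Joining the tokens, rather than replacing each match with itself plus a comma, avoids a trailing comma after the last token.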
EDIT: Since you now mentioned in the comment that you're using Oracle's implementation, you can simply do:
regexp_replace(corpus, '(\d{1,3})(\d{4})', '\1,\2')
to get your desired output:
1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
Demo: https://regex101.com/r/qQ2dE4/333
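The lookahead-free pattern can be sanity-checked outside Oracle too. The Python sketch below only mirrors the pattern logic of the regexp_replace call. The function name insert_commas is hypothetical, and it assumes the 4 characters after the leading digits are letters, since a run of 5 or more digits anywhere else would also get a comma.

```python
import re


def insert_commas(corpus):
    """Insert a comma between the 1-3 digits ending one token and the
    4 digits starting the next one."""
    # Group 1 backtracks from 3 digits down to 1 until 4 digits remain for group 2.
    return re.sub(r"(\d{1,3})(\d{4})", r"\1,\2", corpus)


corpus = "1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25"
print(insert_commas(corpus))
# 1710ABCD.13,1711ABCD.43,1711ABCD.4,1711ABCD.404,1711ABCD.25
```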
In order to continue finding matches after the first one you must use the global flag /g. The pattern is very tricky but it's feasible if you reverse the string.
Demo
var str = `1710ABCD.131711ABCD.431711ABCD.41711ABCD.4041711ABCD.25`;
// Reverse String
var rts = str.split("").reverse().join("");
// Do a reverse version of RegEx
/*In order to continue searching after the first match,
use the `g`lobal flag*/
var rgx = /(\d{1,3}\.\w{4}\d{4})/g;
// Replace on reversed String with a reversed substitution
var res = rts.replace(rgx, ` ,$1`);
// Revert the result back to normal direction
var ser = res.split("").reverse().join("");
console.log(ser);

How to Select a line only text with special character using regex

Input:
1.\frac{[a+b]}{xjch}
2.\frac{pqz}{xjch}
Wanted output is
1.[a+b]/(xjch)
2.(pqz)/(xjch)
My regex is:
\\frac\{(.{2,})\}\{(.{2,})\}
If I apply this regex,
the output will be:
1.([a+b])/(xjch)
2.(pqz)/(xjch)
But I don't want () around [a+b]. That is, if there is any special character inside the {...}, the round brackets should not be added. Otherwise (without special characters), the round brackets should be added, as in (pqz) and (xjch).
I want two regexes, one for case 1 and one for case 2; only then will I get the wanted output.
Could anyone help me?
You can write a regex that captures what is inside the braces, then replace groups 1 and 2 depending on a condition:
if (nextchar == "[")
TypeOfYourInstruction = 1;
else
TypeOfYourInstruction = 2;
The regex is:
\\frac\{\[?([a-zA-Z1-9\+]{2,})\]?\}\{\[?([a-zA-Z1-9\+]{2,})\]?\}
http://regex101.com/r/dN8sA5/18
But, as you mentioned, you can also write two regexes, one for the first type and one for the second:
the first regex: \[[^\]]{2,}\] // Demo = http://regex101.com/r/dN8sA5/20
the second regex: \{[^\[^\}]*\} // Demo = http://regex101.com/r/dN8sA5/19
For the second type, you have to replace the braces with parentheses.
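As an alternative to maintaining two regexes, the decision can be made in a replacement callback. The following Python sketch swaps in that technique. It assumes "special character" means anything that is not a letter or digit and that the braces are not nested; the names FRAC, wrap and convert_frac are invented for the example.

```python
import re

# Two brace groups after \frac; neither may contain nested braces.
FRAC = re.compile(r"\\frac\{([^{}]+)\}\{([^{}]+)\}")


def wrap(part):
    """Add round brackets only when the part is purely alphanumeric."""
    return "(%s)" % part if part.isalnum() else part


def convert_frac(text):
    """Rewrite \\frac{a}{b} as a/b, bracketing the plain parts."""
    return FRAC.sub(lambda m: wrap(m.group(1)) + "/" + wrap(m.group(2)), text)


print(convert_frac(r"1.\frac{[a+b]}{xjch}"))  # 1.[a+b]/(xjch)
print(convert_frac(r"2.\frac{pqz}{xjch}"))    # 2.(pqz)/(xjch)
```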

Vim: How can I search and replace on just part of each line?

I've got a list of phrases with spaces:
Part One
Part Two
Parts Three And Four
I'd like to use Vim to process the list to produce this:
part_one,"Part One"
part_two,"Part Two"
parts_three_and_four,"Parts Three And Four"
I can use this regex:
:%s/.*/\L\0\E,"\0"/
to get me to this, which is really close:
part one,"Part One"
part two,"Part Two"
parts three and four,"Parts Three And Four"
Is there any way to replace all spaces before the comma on each line with underscores? Either as a modification of the above regex or as a second command?
Assuming there will never be commas in your original data, you should be able to use the following:
:%s/.*/\L\0\E,"\0"/ | %s/ \([^,]*,\)\@=/_/g
This just does another replacement after your current one to replace all of the spaces that come before a comma (using a positive lookahead).
:g/./let @s='"'.getline('.').'"'|s/ /_/g|exec "norm! guuA,\<ESC>\"sp"
If I had to do this, I might do it with a macro instead.
You can try with an expression in the replacement part, like:
:%s/\v^.*$/\=tolower(substitute(submatch(0), "\\s\\+", "_", "g")) . ",\"" . submatch(0) . "\""/
It first substitutes every run of whitespace with _, and the returned string is lowercased. Then it joins the result with the matched line surrounded by double quotes.
It yields:
part_one,"Part One"
part_two,"Part Two"
parts_three_and_four,"Parts Three And Four"
From where you left off, I would split all the lines after the comma:
:%s/,/,\r/
Then you can replace the spaces on just the lines with a comma and join your lines back together:
:g/,/:s/ /_/g|j!