The string I want to match against is as follows:
5 + __FXN1__('hello', 1, 3, '__HELLO__(hello) + 5') + 5 + (2/2) + __FXN2__('Good boy')
I tried the regular expression [A-Z0-9_]+\(.*?\), which matches
__FXN1__('hello', 1, 3, '__HELLO__(hello) and __FXN2__('Good boy')
What I am expecting is:
__FXN1__('hello', 1, 3, '__HELLO__(hello) + 5') and __FXN2__('Good boy')
How can we achieve this? Please help.
If the parentheses are always balanced, you can use a recursion-based regex like
__[A-Z0-9_]+__(\((?:[^()]++|(?-1))*\))
It may fail if there is an unbalanced number of ( or ) inside the strings, see this regex demo. In brief:
__[A-Z0-9_]+__ - __, one or more uppercase letters, digits or _ and then __
(\((?:[^()]++|(?-1))*\)) - Group 1: (, then any zero or more occurrences of one or more chars other than ( and ) or the whole Group 1 pattern recursed, and then a ) (so the (...) substring with any amount of paired nested parentheses is matched).
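Recursion such as (?-1) is a PCRE feature, and some engines (JavaScript's RegExp, for one) do not support it. There, the same balanced-parentheses idea can be approximated with a small scanner. This is only a sketch: the helper name findBalancedCalls is made up for illustration, and like the recursive pattern it assumes the parentheses are balanced, including those inside quoted strings.

```javascript
// Hypothetical helper: finds __NAME__(...) calls by counting parentheses
// instead of using regex recursion.
function findBalancedCalls(text) {
  const results = [];
  const opener = /__[A-Z0-9_]+__\(/g; // function name plus the opening "("
  let match;
  while ((match = opener.exec(text)) !== null) {
    let depth = 1;
    let i = opener.lastIndex; // first char after the opening "("
    while (i < text.length && depth > 0) {
      if (text[i] === "(") depth++;
      else if (text[i] === ")") depth--;
      i++;
    }
    if (depth === 0) {
      results.push(text.slice(match.index, i));
      opener.lastIndex = i; // resume the search after the whole call
    }
  }
  return results;
}

const formula =
  "5 + __FXN1__('hello', 1, 3, '__HELLO__(hello) + 5') + 5 + (2/2) + __FXN2__('Good boy')";
console.log(findBalancedCalls(formula));
```

Calls nested inside an already matched call (such as __HELLO__ here) are skipped on purpose, which mirrors what the recursive regex returns.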
If you need to support unbalanced parentheses, it is safer to use a regex that just matches all allowed data formats, e.g.
__[A-Z0-9_]+__\(\s*(?:'[^']*'|\d+)(?:\s*,\s*(?:'[^']*'|\d+))*\s*\)
See the regex demo. Or, if ' can be escaped with a \ char inside the '...' strings, you can use
__[A-Z0-9_]+__\(\s*(?:'[^'\\]*(?:\\.[^'\\]*)*'|\d+)(?:\s*,\s*(?:'[^'\\]*(?:\\.[^'\\]*)*'|\d+))*\s*\)
See this regex demo.
Details:
__[A-Z0-9_]+__ - __, one or more uppercase letters, digits or _ and then __
\( - ( char
\s* - zero or more whitespaces
(?:'[^']*'|\d+) - ', zero or more non-' and then a ' or one or more digits
(?:\s*,\s*(?:'[^']*'|\d+))* - zero or more occurrences of a , enclosed with optional whitespace and then either a '...' substring or one or more digits
\s*\) - zero or more whitespace and then a ).
Note: if you need to support other kinds of numbers, replace \d+ with a more sophisticated pattern such as [+-]?\d+(?:\.\d+)?.
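For a quick check outside a regex tester, the first explicit-format pattern above (quoted strings or integers, no escapes) can be run as is in JavaScript, since it uses no engine-specific features. A minimal sketch; the helper name findCalls is an assumption made for this example:

```javascript
// Matches __NAME__( ... ) where the arguments are single-quoted strings or
// unsigned integers separated by commas (the pattern from the answer above).
const callPattern =
  /__[A-Z0-9_]+__\(\s*(?:'[^']*'|\d+)(?:\s*,\s*(?:'[^']*'|\d+))*\s*\)/g;

// Hypothetical helper: returns every call found in the text, or an empty array.
function findCalls(text) {
  return text.match(callPattern) || [];
}

const expression =
  "5 + __FXN1__('hello', 1, 3, '__HELLO__(hello) + 5') + 5 + (2/2) + __FXN2__('Good boy')";

for (const call of findCalls(expression)) {
  console.log(call);
}
// __FXN1__('hello', 1, 3, '__HELLO__(hello) + 5')
// __FXN2__('Good boy')
```

Because the quoted argument is consumed as a whole, the parentheses inside '__HELLO__(hello) + 5' never need to be balanced.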
I have to rename the toString output variables in several hundred files with many occurrences in each. In the most efficient way possible, how could I parse this text:
.append(", myVariable=").append(myVariable)
.append(", myOtherVariable=").append(myOtherVariable)
.append(", mylowervariable=").append(myLowerVariable) // note the left is already lowercase
.append(", myVarWithURL=").append(myVarWithURL);
and it becomes:
.append(", my_variable=").append(myVariable)
.append(", my_other_variable=").append(myOtherVariable)
.append(", mylowervariable=").append(myLowerVariable) // note the left is already lowercase
.append(", my_var_with_url=").append(myVarWithURL);
The ones on the right are to remain unchanged, while the ones to the left of the equals sign are to be changed, if they contain uppercase characters.
These will be of arbitrary lengths with a varying number of upper case letters. I was thinking I had to do some sort of lookahead but could not get the replacement value to work correctly.
I have the flexibility of being able to do this in IntelliJ or Notepad++, so I can easily use the \l and \L operators to make a replacement value lowercase.
This was my thought process:
in: myLongCamelCasedVariable
re: ([a-z]+)([A-Z]{1})([a-z]+) // repeat grouping for capturing
group 1 group 2 group 3 group 4
my + [ L + ong ] + [ C + amel ] + [ C + ased ] + [ V + ariable ]
Is it possible to use a regular expression to effectively capture the various groups of 'text' in the larger text string, and 'loop' over that and apply the output?
Out: $1_\l$2 .... etc
Now I am just stuck
You may use
Find What: (?:\G(?!\A)|",\h*)\K(\b|[a-z]+)([A-Z]+)(?=\w*=")
Replace With: $1_\L$2
Match case: True
Details:
(?:\G(?!\A)|",\h*) - start matching from the end of the previous successful match (\G(?!\A)) or (|) a ", and zero or more horizontal whitespaces (",\h*)
\K - remove the text matched so far from the match memory buffer
(\b|[a-z]+) - Group 1: word boundary or one or more lowercase letters
([A-Z]+) - Group 2: one or more uppercase letters
(?=\w*=") - immediately to the right, there must be zero or more word chars followed by =".
The replacement is $1_\L$2: Group 1, _, and then lowercased Group 2 value.
You could match sequences of an uppercase char followed by optional uppercase chars and then optional lowercase chars.
In the replacement use _ followed by the lowercased match \L$0
Find what:
(?>,\h+[a-z]+|\G(?!^))\K[A-Z][A-Z]*[a-z]*
(?> Atomic group
,\h+[a-z]+ Match a comma, 1 or more horizontal whitespace chars and 1 or more lowercase chars
| Or
\G(?!^) Assert the current position at the end of the previous match but not at the start of the string (so the first part of the alternation has to match first)
) Close atomic group
\K Forget what is matched so far
[A-Z][A-Z]*[a-z]* Match an uppercase char followed by optional upper and lowercase chars
Replace with:
_\L$0
Regex demo
Without using \K you can use 2 capture groups.
(?>(, [a-z]+)|\G(?!^))([A-Z][A-Z]*[a-z]*)
In the replacement use $1_\L$2
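If the renaming has to be scripted rather than done in an editor, note that \G, \K and \L are not available in JavaScript's RegExp. A replacement callback can stand in for them. The following is a sketch of that substitute technique, not the editor regexes above; the function name snakeCaseKeys is hypothetical.

```javascript
// Hypothetical helper: rewrites only the key inside `", key="` literals.
function snakeCaseKeys(source) {
  // Group 1: the `", ` prefix; group 2: the key; the lookahead requires `="`.
  return source.replace(/(",\s*)(\w+)(?==")/g, (whole, prefix, key) =>
    // Each run of uppercase letters becomes "_" plus its lowercase form.
    prefix + key.replace(/[A-Z]+/g, (upper) => "_" + upper.toLowerCase())
  );
}

const before = [
  '.append(", myVariable=").append(myVariable)',
  '.append(", mylowervariable=").append(myLowerVariable)',
  '.append(", myVarWithURL=").append(myVarWithURL);',
].join("\n");

console.log(snakeCaseKeys(before));
// .append(", my_variable=").append(myVariable)
// .append(", mylowervariable=").append(myLowerVariable)
// .append(", my_var_with_url=").append(myVarWithURL);
```

The identifiers on the right-hand side stay untouched because they are never preceded by `",` and followed by `="`.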
My regex is something like (A)(([+-]\d{1,2}[YMD])*), which matches strings such as A+3M and A-3Y+5M+3D as expected.
But I want to capture every repetition of the sub-pattern ([+-]\d{1,2}[YMD])*.
For the example A-3M+2D, I can see only 4 groups: A-3M+2D (group 0), A (group 1), -3M+2D (group 2), +2D (group 3).
Is there a way I can get -3M as a separate group?
Repeated capturing groups usually capture only the last iteration. This is true for Kotlin, as well as Java, as the languages do not have any method that would keep track of each capturing group stack.
What you may do as a workaround, is to first validate the whole string against a certain pattern the string should match, and then either extract or split the string into parts.
For the current scenario, you may use
val text = "A-3M+2D"
if (text.matches("""A(?:[+-]\d{1,2}[YMD])*""".toRegex())) {
    val results = text.split("(?=[-+])".toRegex())
    println(results)
}
// => [A, -3M, +2D]
See the Kotlin demo
Here,
text.matches("""A(?:[+-]\d{1,2}[YMD])*""".toRegex()) makes sure the whole string matches A and then 0 or more occurrences of + or -, 1 or 2 digits followed with Y, M or D
.split("(?=[-+])".toRegex()) splits the text with an empty string right before a - or +.
Pattern details
^ - implicit in .matches() - start of string
A - an A substring
(?: - start of a non-capturing group:
[+-] - a character class matching + or -
\d{1,2} - one to two digits
[YMD] - a character class that matches Y or M or D
)* - end of the non-capturing group, repeat 0 or more times (due to * quantifier)
\z - implicit in matches() - end of string.
When splitting, we just need to find locations before - or +, hence we use a positive lookahead, (?=[-+]), that matches a position that is immediately followed with + or -. It is a non-consuming pattern, the + or - matched are not added to the match value.
Another approach with a single regex
You may also use a \G based regex to check the string format first at the start of the string, and only start matching consecutive substrings if that check is a success:
val regex = """(?:\G(?!^)[+-]|^(?=A(?:[+-]\d{1,2}[YMD])*$))[^-+]+""".toRegex()
println(regex.findAll("A-3M+2D").map{it.value}.toList())
// => [A, -3M, +2D]
See another Kotlin demo and the regex demo.
Details
(?:\G(?!^)[+-]|^(?=A(?:[+-]\d{1,2}[YMD])*$)) - either the end of the previous successful match and then + or - (see \G(?!^)[+-]) or (|) start of string that is followed with A and then 0 or more occurrences of +/-, 1 or 2 digits and then Y, M or D till the end of the string (see ^(?=A(?:[+-]\d{1,2}[YMD])*$))
[^-+]+ - 1 or more chars other than - and +. We need not be too careful here since the lookahead did the heavy lifting at the start of string.
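The "last iteration only" behaviour is not specific to Kotlin or Java. JavaScript behaves the same way, and the validate-then-extract workaround carries over directly. A small sketch, with the helper name extractOffsets invented for the example:

```javascript
// A repeated capturing group only remembers its last iteration.
const whole = "A-3M+2D".match(/^(A)(([+-]\d{1,2}[YMD])*)$/);
console.log(whole[2]); // -3M+2D
console.log(whole[3]); // +2D

// Hypothetical helper: validate the overall format, then extract each part.
function extractOffsets(text) {
  if (!/^A(?:[+-]\d{1,2}[YMD])*$/.test(text)) {
    return null; // the string does not have the expected shape
  }
  // Safe to match loosely now: every +/- part is known to be well formed.
  return text.match(/[+-]\d{1,2}[YMD]/g) || [];
}

console.log(extractOffsets("A-3M+2D")); // [ '-3M', '+2D' ]
console.log(extractOffsets("A-3X"));    // null
```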
I have a regex query which works fine for most of the input patterns, but not for a few.
Regex query I have is
("(?!([1-9]{1}[0-9]*)-(([1-9]{1}[0-9]*))-)^(([1-9]{1}[0-9]*)|(([1-9]{1}[0-9]*)( |-|( ?([1-9]{1}[0-9]*))|(-?([1-9]{1}[0-9]*)){1})*))$")
I want to filter out a certain type of expression from the input strings: except when it is the last character of the input string, every dash (-) must be surrounded by two separate integers, i.e. (integer)(dash)(integer).
Two dashes sharing 3 integers are not allowed, even if they have integers on either side, like (integer)(dash)(integer)(dash)(integer).
If the dash is the last character of the input and is preceded by an integer, that is acceptable input, i.e. (integer)(dash)(end of the string).
Also, two consecutive dashes are not allowed. Any of the above-mentioned formats can have space(s) between them.
To give the gist, these dashes are used in my input string to provide a range.
Some examples of expressions that I want to filter out are
1-5-10, 1 - 5 - 10, 1 - - 5, -5
Update - There are some rules which will drive the input string. My job is to make sure I allow only those input strings which follow the format. Rules for the format are -
1. Space (‘ ’) delimited numbers. But a dash line doesn’t need to have a space. For example, “10 20 - 30” or “10 20-30” are all valid values.
2. A dash line (‘-’) is used to set a range (from, to). It can also be used to set a range to the end of the job queue list. For example, “100-150 200-250 300-” is a valid value.
3. A dash line without a start job number is not allowed. For example, “-10” is not allowed.
Thanks
You might use:
^(?:(?:[1-9][0-9]*[ ]?-[ ]?[1-9][0-9]*|[1-9][0-9]*)(?: (?:[1-9][0-9]*[ ]?-[ ]?[1-9][0-9]*|[1-9][0-9]*))*(?: [1-9][0-9]*-)?|[1-9][0-9]*-?)[ ]*$
Regex demo
Explanation
^ Assert start of the string
(?: Non capturing group
(?: Non capturing group
[1-9][0-9]*[ ]?-[ ]?[1-9][0-9]* Match number > 0, an optional space, a dash, an optional space and number > 0. The space is in a character class [ ] for clarity.
| Or
[1-9][0-9]* Match number > 0
) Close non capturing group
(?:[ ] Non capturing group followed by a space
(?: Non capturing group
[1-9][0-9]*[ ]?-[ ]?[1-9][0-9]* Match number > 0, an optional space, a dash, an optional space and number > 0.
| Or
[1-9][0-9]* Match number > 0
) close non capturing group
)* close non capturing group and repeat zero or more times
(?: [1-9][0-9]*-)? Optional part that matches a space followed by a number > 0 and a dash
| Or
[1-9][0-9]*-? Match a number > 0 followed by an optional dash
) close non capturing group
[ ]*$ Match zero or more times a space and assert the end of the string
Note: If you want to match zero or more times a space instead of an optional space, you could update [ ]? to [ ]*. You can write [1-9]{1} as [1-9].
After the update the question got quite a lot of complexity. Since some parts of the regex are reused multiple times I took the liberty of working this out in Ruby and cleaned it up afterwards. I'll show you the build process so the regex can be understood. Ruby uses #{variable} for regex and string interpolation.
integer = /[1-9][0-9]*/
integer_or_range = /#{integer}(?: *- *#{integer})?/
integers_or_ranges = /#{integer_or_range}(?: +#{integer_or_range})*/
ending = /#{integer} *-/
regex = /^(?:#{integers_or_ranges}(?: +#{ending})?|#{ending})$/
#=> /^(?:(?-mix:(?-mix:(?-mix:[1-9][0-9]*)(?: *- *(?-mix:[1-9][0-9]*))?)(?: +(?-mix:(?-mix:[1-9][0-9]*)(?: *- *(?-mix:[1-9][0-9]*))?))*)(?: +(?-mix:(?-mix:[1-9][0-9]*) *-))?|(?-mix:(?-mix:[1-9][0-9]*) *-))$/
Cleaning up the above regex leaves:
^(?:[1-9][0-9]*(?: *- *[1-9][0-9]*)?(?: +[1-9][0-9]*(?: *- *[1-9][0-9]*)?)*(?: +[1-9][0-9]* *-)?|[1-9][0-9]* *-)$
You can replace [0-9] with \d if you like, but since you used the [0-9] syntax in your question I used it for the answer as well. Keep in mind that if you do replace [0-9] with \d you'll have to escape the backslash in a string context, e.g. "[0-9]" becomes "\\d".
You mention in your question that
Any of the above-mentioned formats can have space(s) between them.
I assumed that this means space(s) are not allowed before or after the actual content, only between the integers and -.
Valid:
"15 - 10"
"1234 -"
Invalid:
" 15 - 10" (leading space)
"123 " (trailing space)
If this is not the case, simply add a space followed by * to the start and end:
^ *... *$
Where ... is the rest of the regex.
You can test the regex in my demo, but it should be clear from the build process what the regex does.
var
inputs = [
'1-5-10',
'1 - 5 - 10',
'1 - - 5',
'-5',
'15-10',
'15 - 10',
'15 - 10',
'1510',
'1510-',
'1510 -',
'1510 ',
' 1510',
' 15 - 10',
'10 20 - 30',
'10 20-30',
'100-150 200-250 300-',
'100-150 200-250 300- ',
'1-2526-27-28-',
'1-25 26-2728-',
'1-25 26-27 28-',
],
regex = /^(?:[1-9][0-9]*(?: *- *[1-9][0-9]*)?(?: +[1-9][0-9]*(?: *- *[1-9][0-9]*)?)*(?: +[1-9][0-9]* *-)?|[1-9][0-9]* *-)$/,
logInputAndMatch = input => {
console.log(`input: "${input}"`);
console.log(input.match(regex))
};
inputs.forEach(logInputAndMatch);
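JavaScript has no regex interpolation like Ruby's #{...}, but the same build process can be mirrored with template strings. A minimal sketch, assuming the rules exactly as interpreted in this answer (no leading or trailing spaces); the names isValidJobList and jobListPattern are made up for illustration:

```javascript
// Build the pattern from named parts, mirroring the Ruby build process.
const integer = "[1-9][0-9]*";
const integerOrRange = `${integer}(?: *- *${integer})?`;
const integersOrRanges = `${integerOrRange}(?: +${integerOrRange})*`;
const ending = `${integer} *-`; // open-ended range such as "300-"
const jobListPattern = new RegExp(
  `^(?:${integersOrRanges}(?: +${ending})?|${ending})$`
);

// Hypothetical helper: true when the whole input follows the format.
function isValidJobList(input) {
  return jobListPattern.test(input);
}

console.log(isValidJobList("100-150 200-250 300-")); // true
console.log(isValidJobList("10 20 - 30"));           // true
console.log(isValidJobList("1 - 5 - 10"));           // false
console.log(isValidJobList("-5"));                   // false
```

Keeping the parts as separate strings makes it easier to change one rule, for example allowing leading spaces, without touching the rest.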
Consider the example below:
AT+CEREG?
+CEREG: "4s",123,"7021","28","8B7200B",8,,,"00000010","10110100"
The desired result is to pick the nth parameter of the response:
n=1 => "4s"
n=2 => 123
n=8 =>
n=10 => 10110100
In my case, I am enquiring some details from an LTE modem and above is the type of response I receive.
I have created this regex which captures the (n+1)th member under group 2 including the last member, however, I can't seem to work out how to pick the 1st parameter in the approach I have taken.
(?:([^,]*,)){5}([^,].*?(?=,|$))?
Could you suggest an alternative method or complete/correct mine?
You may start matching from : (or +CEREG: if it is a static piece of text) and use
:\s*(?:[^,]*,){min}([^,]*)
where min is n-1, i.e. the number of comma-terminated fields to skip before the expected value.
See the regex demo. This solution is std::regex compatible.
Details
: - a colon
\s* - 0+ whitespaces
(?:[^,]*,){min} - min occurrences of any 0+ chars other than , followed with ,
([^,]*) - Capturing group 1: 0+ chars other than ,.
A boost::regex solution might look neater since you may easily capture substrings inside double quotes or substrings consisting of chars other than whitespace and commas using a branch reset group:
:\s*(?:[^,]*,){0}(?|"([^"]*)"|([^,\s]+))
See the regex demo
Details
:\s*(?:[^,]*,){min} - same as in the first pattern
(?| - start of a branch reset group where each alternative branch shares the same IDs:
"([^"]*)" - a ", then Group 1 holding any 0+ chars other than " and then a " is just matched
| - or
([^,\s]+)) - (still Group 1): one or more chars other than whitespace and ,.
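To show how min is derived from n, here is a sketch of the first pattern built at run time. It is written in JavaScript purely for illustration (the question itself concerns std::regex), and the function name nthField is hypothetical. Surrounding double quotes are kept in the result, as in the first pattern.

```javascript
// Hypothetical helper: returns the n-th (1-based) comma-separated field
// after the first colon, or null when the response has too few fields.
function nthField(response, n) {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  // Skip n-1 "field," chunks, then capture the next field.
  const pattern = new RegExp(`:\\s*(?:[^,]*,){${n - 1}}([^,]*)`);
  const match = response.match(pattern);
  return match ? match[1] : null;
}

const reply = '+CEREG: "4s",123,"7021","28","8B7200B",8,,,"00000010","10110100"';
console.log(nthField(reply, 1));  // "4s"
console.log(nthField(reply, 2));  // 123
console.log(nthField(reply, 8));  // (empty string)
console.log(nthField(reply, 10)); // "10110100"
```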
I have this scenario:
Ex1:
Valid:
12345678|abcdefghij|aaaaaaaa
Invalid:
12345678|abcdefghijk|aaaaaaaaa
Which means that between pipes the maximum length is 8. How can I express this in a regex?
I tried this:
^(?:[^|]+{0,7}(?:\|[^|]+)?$ but it's not working
Try the following pattern:
^.{1,8}(?:\|.{1,8})*$
The basic idea is to match between one and eight characters, followed by | and another 1 to 8 characters, that term repeated zero or more times. Explore the demo with any data you want to see how it works.
Sample data:
123
12345678
abcdefghi (no match)
12345678|abcdefgh|aaaaaaaa
12345678|abcdefghijk|aaaaaaaaa (no match)
Demo here:
Regex101
When you want to match delimited data, you should refrain from using a plain unrestricted . (dot). You need to match the parts between |, so you should consider the [^|] negated character class, which matches any char but |.
Since you need to limit the number of the pattern occurrences of the negated character class, restrict it with a limiting quantifier {1,8} that matches 1 to 8 consecutive occurrences of the quantified subpattern.
Use
^[^|]{1,8}(?:\|[^|]{1,8})*$
See the regex demo.
Details
^ - start of a string
[^|]{1,8} - any 1 to 8 chars other than |
(?:\|[^|]{1,8})* - 0 or more consecutive sequences of:
\| - a literal pipe symbol
[^|]{1,8} - any 1 to 8 chars other than |
$ - end of string.
Then, the [^|] can be restricted further as per requirements. If you only need to validate a string that has ASCII letters, digits, (, ), +, ,, ., /, :, ?, whitespace and -, you need to use
^[A-Za-z0-9()+,.\/:?\s-]{1,8}(?:\|[A-Za-z0-9()+,.\/:?\s-]{1,8})*$
See another regex demo.
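A quick way to try the negated-character-class pattern against the sample data is a small script. This is a sketch in JavaScript with single-line inputs assumed, and the helper name hasValidSegments is made up for the example:

```javascript
// Every pipe-separated segment must be 1 to 8 chars long and contain no "|".
const segmentsPattern = /^[^|]{1,8}(?:\|[^|]{1,8})*$/;

// Hypothetical helper wrapping the pattern.
function hasValidSegments(line) {
  return segmentsPattern.test(line);
}

console.log(hasValidSegments("123"));                            // true
console.log(hasValidSegments("12345678|abcdefgh|aaaaaaaa"));     // true
console.log(hasValidSegments("abcdefghi"));                      // false
console.log(hasValidSegments("12345678|abcdefghijk|aaaaaaaaa")); // false
console.log(hasValidSegments("1234||5678"));                     // false (empty segment)
```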