Regex for KeyValue pattern - regex

I have to check if a string follows the following patterns:
Field1=value1
Field1=value1,Field2=value2
7645a=fds23,Field2=dsd$
The words 'Field1' and 'value1' are just placeholders. What matters is that each pair has the form something=something and that, if there is more than one pair, the pairs are separated by commas.
I reached the following regex:
((\w+)[^=])=((\w+)[^=])
"Match any one or more word except if it has =, then there should be an = and then match any one or more word except if it has =".
The problem is that it also accepts the comma, which I think is because of \w. I don't think this is correct.
I'm using https://regexr.com/ to check for the correct regular expression.

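If you need to match symbols like $, then don't use \w. Note that the comma is accepted because of [^=], which matches any single character other than =, not because of \w. The following pattern satisfies all your conditions; wrap it in ^ and $ if you need to validate a whole string rather than find a match inside it: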
(?:([^,=\n]+)=([^,=\n]+))(?:,([^,=\n]+)=([^,=\n]+))*
Explanation:
(?: // Begin non-capturing group (first key=value pair)
( // Begin capturing group (key)
[^,=\n]+ // Match one or more characters that aren't comma, equals, or new line
) // End capturing group (key)
= // Equals
( // Begin capturing group (value)
[^,=\n]+ // Match one or more characters that aren't comma, equals, or new line
) // End capturing group (value)
) // End non-capturing group (first key=value pair)
(?: // Begin non-capturing group (additional key=value pairs)
, // Starts with comma (otherwise entire group fails)
( // Begin capturing group (key)
[^,=\n]+ // Match one or more characters that aren't comma, equals, or new line
) // End capturing group (key)
= // Equals
( // Begin capturing group (value)
[^,=\n]+ // Match one or more characters that aren't comma, equals, or new line
) // End capturing group (value)
) // End non-capturing group (additional key=value pairs)
* // Match 0 or more of the additional key value pairs
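To check a whole string rather than find a match somewhere inside it, the pattern has to cover the entire input. Below is a minimal Python sketch of that idea, assuming Python's `re` module is an acceptable stand-in for the flavor used on regexr. The names `PAIR`, `KV_LIST`, `is_kv_list`, and `parse_pairs` are made up for illustration.

```python
import re

# One key=value pair: neither side may contain a comma, an equals sign, or a newline.
PAIR = r"[^,=\n]+=[^,=\n]+"
# One pair, followed by zero or more comma-prefixed pairs.
KV_LIST = re.compile(rf"{PAIR}(?:,{PAIR})*")


def is_kv_list(text):
    """Return True only if the whole string is a comma-separated list of pairs."""
    return KV_LIST.fullmatch(text) is not None


def parse_pairs(text):
    """Return (key, value) tuples, or None if the string is not a valid list."""
    if not is_kv_list(text):
        return None
    # A repeated capture group only keeps its last iteration, so split instead.
    return [tuple(item.split("=")) for item in text.split(",")]


for sample in ["Field1=value1", "7645a=fds23,Field2=dsd$", "Field1=value1,", "a==b"]:
    print(sample, is_kv_list(sample))
```

Using `fullmatch` plays the role of the `^` and `$` anchors, so a trailing comma or a doubled `=` is rejected instead of being skipped over.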

Related

How to match in a single/common Regex Group matching or based on a condition

I would like to extract two different test strings /i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive
and
/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8
with a single regex, capturing the result in group 1.
With this regex, ^.[i,na,fm,d]+\/(.+([,\/])?(\/|.+=.+,\/).+\/[,](live.([^,]).).+_)?.+(640).*$, the second string gives the desired result in group 1: int/2021/11/25/,live_20211125_215206_
The first string, however, does not give the expected result in group 1. The expected extraction for the first string is int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45
Any pointers on this are appreciated.
Thanks!
If you want both values in group 1, you can use:
^/(?:[id]|na|fm)/([^/\s]*/\d{4}/\d{2}/\d{2}/\S*?)(?:/,|[^_]+_)640(?:\D|$)
The pattern matches:
^ Start of string
/ Match literally
(?:[id]|na|fm) Match one of i d na fm
/ Match literally
( Capture group 1
[^/\s]*/ Match optional chars other than / or a whitespace char, then match /
\d{4}/\d{2}/\d{2}/ Match a date like pattern
\S*? Match optional non whitespace chars, as few as possible
) Close group 1
(?:/,|[^_]+_) Match either /, or 1+ chars other than _ and then match _
640 Match literally
(?:\D|$) Match either a non-digit or assert the end of the string
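As a quick check of the pattern above, here is a minimal Python sketch that runs it against the two test strings from the question. It assumes Python's `re` module treats this pattern like the flavor used in the answer; the names `PATTERN`, `URL_1`, `URL_2`, and `extract` are illustrative only.

```python
import re

# Pattern copied from the answer; group 1 is the part to extract.
PATTERN = re.compile(
    r"^/(?:[id]|na|fm)/([^/\s]*/\d{4}/\d{2}/\d{2}/\S*?)(?:/,|[^_]+_)640(?:\D|$)"
)

# The two test strings from the question, split across lines for readability.
URL_1 = (
    "/i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/"
    ",640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,"
    ".mp4.csmil/master.m3u8?set-segment-duration=responsive"
)
URL_2 = (
    "/i/int/2021/11/25/"
    ",live_20211125_215206_sendeton_640x360-50p-1200kbit"
    ",live_20211125_215206_sendeton_480x270-50p-700kbit"
    ",live_20211125_215206_sendeton_960x540-50p-1600kbit"
    ",live_20211125_215206_sendeton_1280x720-50p-3200kbit"
    ",live_20211125_215206_sendeton_1920x1080-50p-5000kbit"
    ",.mp4.csmil/master.m3u8"
)


def extract(url):
    """Return capture group 1, or None when the pattern does not match."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


for url in (URL_1, URL_2):
    print(extract(url))
```

The lazy `\S*?` is what lets group 1 stop at the first position where either `/,640` or an underscore-terminated run followed by `640` can match.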
We can't know all the rules for how the strings you are matching are constructed, but for just the two example strings provided, this works:
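package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Group 1 captures the path from /i/int/<date>/ up to (not including)
	// the "/," or "_word_" run that precedes 640.
	var re = regexp.MustCompile(`(?m)(\/i/int/\d{4}/\d{2}/\d{2}/.*)(?:\/,|_[\w_]+)640`)
	var str = `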
/i/int/2021/11/18/019e1691-614c-4402-a8c1-d0239ad1ac45/,640-1_999899,480-1_999899,960-1_999899,1280-1_999899,1920-1_999899,.mp4.csmil/master.m3u8?set-segment-duration=responsive
/i/int/2021/11/25/,live_20211125_215206_sendeton_640x360-50p-1200kbit,live_20211125_215206_sendeton_480x270-50p-700kbit,live_20211125_215206_sendeton_960x540-50p-1600kbit,live_20211125_215206_sendeton_1280x720-50p-3200kbit,live_20211125_215206_sendeton_1920x1080-50p-5000kbit,.mp4.csmil/master.m3u8`
	match := re.FindAllStringSubmatch(str, -1)
	for _, val := range match {
		fmt.Println(val[1]) // print capture group 1 of each match
	}
}

Pattern match for (length)%code with before length

I have a pattern like x%c, where x is a single digit integer and c is an alphanumeric code of length x. % is just a token separator of length and code
For instance 2%74 is valid since 74 is of 2 digits. Similarly, 1%8 and 4%3232 are also valid.
I have tried a regex of the form ^([0-9])(%)([A-Z0-9]){\1}, where I am trying to limit the length by the value of group 1. It does not work, apparently because the group is treated as a string, not a number.
If I change the above regex to ^([0-9])(%)([A-Z0-9]){2}, it works for 2%74, but it is of no use since the length has to be controlled by the first group, not by a fixed number.
If it is not possible with regex, is there a better approach in Java?
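One way could be to use 2 capture groups, convert the first group to an int, and compare it with the number of characters in the second group. Note that the pattern below only matches digits; for an alphanumeric code, the second \d+ could be replaced with a class such as [A-Z0-9]+.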
\b(\d+)%(\d+)\b
\b Word boundary
(\d+) Capture group 1, match 1+ digits
% Match literally
(\d+) Capture group 2, match 1+ digits
\b Word boundary
For example
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String regex = "\\b(\\d+)%(\\d+)\\b";
        Pattern pattern = Pattern.compile(regex);
        String[] strings = { "2%74", "1%8", "4%3232", "5%123456", "6%0" };
        for (String s : strings) {
            Matcher matcher = pattern.matcher(s);
            if (matcher.find()) {
                // Valid only when the declared length equals the actual code length
                if (Integer.parseInt(matcher.group(1)) == matcher.group(2).length()) {
                    System.out.println("Match for " + s);
                } else {
                    System.out.println("No match for " + s);
                }
            }
        }
    }
}
Output
Match for 2%74
Match for 1%8
Match for 4%3232
No match for 5%123456
No match for 6%0
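The same capture-then-compare idea can be sketched in Python. This version is only an illustration under two assumptions taken from the question rather than from the answer above: the length is a single digit, and the code may contain uppercase letters as well as digits. The names `LENGTH_PREFIXED` and `is_valid` are hypothetical.

```python
import re

# One length digit, a literal %, then the code. The code is assumed to use
# uppercase letters and digits, following the question's [A-Z0-9] class.
LENGTH_PREFIXED = re.compile(r"([0-9])%([A-Z0-9]+)")


def is_valid(token):
    """Return True when the digit before % equals the length of the code after it."""
    match = LENGTH_PREFIXED.fullmatch(token)
    if match is None:
        return False
    return int(match.group(1)) == len(match.group(2))


for token in ["2%74", "1%8", "4%3232", "5%123456", "6%0", "3%A1B"]:
    print(token, is_valid(token))
```

The regex only checks the shape of the token; the length comparison happens in ordinary code, because a quantifier cannot take its count from a captured group.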

Regex: Deal \r\n as normal word

I'm doing a small project that counts the functions in C++ files (.cpp).
I used the following regex as the "function pattern":
/[a-z|A-Z]+\s*::\s*~?[a-z|A-Z]+\(.*\)/gm
It works for most cases, but fails when there are line breaks inside the ().
void CXYZRScanPanel::OnPrepareScanning()
{
//This one is ok.
}
void CXYZRScanPanel::OnPrepareScanning(int k)
{
//This one is ok.
}
void CXYZRScanPanel::OnPrepareScanning(int k,
int j)
{
//This one fails.
}
I'm wondering whether there is anything "stronger" than .* that can also skip over \r\n.
Thanks for any help.
If there is no such thing, I will probably remove all \r\n within () before doing the search.
You could write the pattern using a negated character class, [^()], which matches any char except ( and ), and therefore also matches a newline.
Note that you can omit the | in the character class.
[a-zA-Z]+\s*::\s*~?[a-zA-Z]+(\([^()]*\))
The pattern matches:
[a-zA-Z]+ Match 1+ times chars a-zA-Z
\s*::\s* Match :: between optional whitespace chars
~? Match an optional ~ char
[a-zA-Z]+ Match 1+ times chars a-zA-Z
( Capture group 1
\([^()]*\) Match (, then optional chars other than ( and ), then )
) Close group 1
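To see the effect of the negated character class on the multi-line case, here is a minimal Python sketch. It assumes Python's `re` module and a shortened version of the question's sample code; `FUNCTION`, `SOURCE`, and `find_functions` are illustrative names.

```python
import re

# Pattern from the answer: [^()]* also matches newlines, unlike .*
FUNCTION = re.compile(r"[a-zA-Z]+\s*::\s*~?[a-zA-Z]+(\([^()]*\))")

SOURCE = """\
void CXYZRScanPanel::OnPrepareScanning()
{
}
void CXYZRScanPanel::OnPrepareScanning(int k)
{
}
void CXYZRScanPanel::OnPrepareScanning(int k,
    int j)
{
}
"""


def find_functions(source):
    """Return the full text of every Class::Method(...) occurrence."""
    return [match.group(0) for match in FUNCTION.finditer(source)]


matches = find_functions(SOURCE)
print(len(matches))
```

This is a text-level heuristic, not a parser: it would also count qualified calls such as `Foo::bar(x)` inside a function body, and it stops matching when the parameter list itself contains parentheses.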

regex match longest substring with equal first and last char

/(\w)(\w*)\1/
For the string "mgntdygtxrvxjnwksqhxuxtrv", this matches "txrvxjnwksqhxuxt" (using Ruby), but not the even longer valid substring "tdygtxrvxjnwksqhxuxt".
For a given string, here are two ways to find the longest substring that begins and ends with the same character.
Suppose
str = "mgntdygtxrvxjnwksqhxuxtrv"
Use a regular expression
r = /(.)(?=(.*\1))/
str.gsub(r).map { $1 + $2 }.max_by(&:length)
#=> "tdygtxrvxjnwksqhxuxt"
When, as here, the regular expression contains capture groups, it may be more convenient to use String#gsub without a second argument or block (in which case it returns an enumerator, which can be chained) than String#scan, whose documentation notes: "If the pattern contains groups, each individual result is itself an array containing one entry per group." Here gsub performs no substitutions; it merely generates the matches of the regular expression.
The regular expression can be made self-documenting by writing it in free-spacing mode.
r = /
    (.)        # match any char and save to capture group 1
    (?=        # begin a positive lookahead
      (.*\1)   # match >= 0 characters followed by the contents of capture group 1
    )          # end the positive lookahead
    /x         # free-spacing regex definition mode
The following intermediate calculation is performed:
str.gsub(r).map { $1 + $2 }
#=> ["gntdyg", "ntdygtxrvxjn", "tdygtxrvxjnwksqhxuxt", "txrvxjnwksqhxuxt",
# "xrvxjnwksqhxux", "rvxjnwksqhxuxtr", "vxjnwksqhxuxtrv", "xjnwksqhxux",
# "xux"]
Notice that this does not enumerate all substrings beginning and ending with the same character (because .* is greedy). It does not generate, for example, the substring "xrvx".
Do not use a regular expression
v = str.each_char.with_index.with_object({}) do |(c,i),h|
  if h.key?(c)
    h[c][:size] = i - h[c][:start] + 1
  else
    h[c] = { start: i, size: 1 }
  end
end.max_by { |_,h| h[:size] }.last
str[v[:start], v[:size]]
#=> "tdygtxrvxjnwksqhxuxt"
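For comparison, both approaches can be sketched in Python. This is an illustrative translation, not the answer's code; it assumes Python's `re` module handles the lookahead with a backreference the same way, and the function names are hypothetical.

```python
import re


def longest_with_regex(text):
    """For each character, capture the stretch up to its last later occurrence."""
    candidates = (
        match.group(1) + match.group(2)
        for match in re.finditer(r"(.)(?=(.*\1))", text)
    )
    return max(candidates, key=len, default="")


def longest_without_regex(text):
    """Track each character's first index and measure the span to every repeat."""
    first_index = {}
    best = ""
    for i, ch in enumerate(text):
        start = first_index.setdefault(ch, i)
        if i - start + 1 > len(best):
            best = text[start:i + 1]
    return best


sample = "mgntdygtxrvxjnwksqhxuxtrv"
print(longest_with_regex(sample))
print(longest_without_regex(sample))
```

The two functions differ on input without any repeated character: the regex version finds no match and returns an empty string, while the index-based version returns the first single character.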

Split string on commas ignoring commas, brackets, braces in parenthesis, quotes

I am attempting to split a comma-separated list. I want to ignore commas that are inside parentheses, brackets, braces and quotes using regex. To be more precise, I am trying to do this with the POSIX regexp_split_to_array function in Postgres.
My knowledge of regex is not great. By searching on Stack Overflow I was able to get a partial solution: I can split the string as long as it does not contain nested parentheses, brackets, or braces. Here is the regex:
,(?![^()]*+\))(?![^{}]*+})(?![^\[\]]*+\])(?=(?:[^"]|"[^"]*")*$)
Test case:
0, (1,2), (1,2,(1,2)) [1,2,3,[1,2]], [1,2,3], "text, text (test)", {a1:1, a2:3, a3:{a1=1, s2=2}, a4:"asasad, sadsas, asasdasd"}
The problem is that when parentheses are nested, e.g. in (1,2,(1,2)), the first 2 commas still get matched.
Even though regex is not the best way to go, here is a solution with recursive matching (it relies on (?R) recursion and atomic groups, so it needs a PCRE-compatible engine):
(?>(?>\([^()]*(?R)?[^()]*\))|(?>\[[^[\]]*(?R)?[^[\]]*\])|(?>{[^{}]*(?R)?[^{}]*})|(?>"[^"]*")|(?>[^(){}[\]", ]+))(?>[ ]*(?R))*
If we break it down, there is a group with some stuff inside, followed by more of the same kind of matching, separated by optional spaces.
(?> <---- start matching
... <---- some stuff inside
) <---- end matching
(?>
[ ]* <---- optional spaces
(?R) <---- match the entire thing again
)* <---- can be repeated
From your example 0, (1,2), (1,2,(1,2)) [1,2,3,[1,2]], [1,2,3],..., we want to match:
0
(1,2)
(1,2,(1,2)) [1,2,3,[1,2]]
[1,2,3]
...
For the third match, the stuff inside will match (1,2,(1,2)) and [1,2,3,[1,2]], which are separated by a space.
The stuff inside is a series of options:
(?>
(?>...)| <---- will match balanced ()
(?>...)| <---- will match balanced []
(?>...)| <---- will match balanced {}
(?>...)| <---- will match "..."
(?>...) <---- will match anything else without space or comma
)
Here are the options:
\( <---- literal (
[^()]* <---- any number of chars except ( or )
(?R)? <---- match the entire thing optionally
[^()]* <---- any number of chars except ( or )
\) <---- literal )
\[ <---- literal [
[^[\]]* <---- any number of chars except [ or ]
(?R)? <---- match the entire thing optionally
[^[\]]* <---- any number of chars except [ or ]
\] <---- literal ]
{ <---- literal {
[^{}]* <---- any number of chars except { or }
(?R)? <---- match the entire thing optionally
[^{}]* <---- any number of chars except { or }
} <---- literal }
" <---- literal "
[^"]* <---- any number of chars except "
" <---- literal "
[^(){}[\]", ]+ <---- one or more chars except comma, or space, or these: (){}[]"
Note that this does not match a comma-separated list, but the items in such a list. The exclusion of comma and space in the last option above causes it to stop matching at comma or space (except for space we explicitly allowed between repeated matches).
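Since regex is described above as not the best way to go, here is a minimal non-regex sketch in Python that splits on top-level commas by tracking nesting depth. It is an alternative technique, not the answer's method, and it makes simplifying assumptions: brackets are balanced, only double quotes are used, and quotes are never escaped. The name `split_top_level` is hypothetical.

```python
def split_top_level(text):
    """Split on commas that are outside (), [], {} and double quotes."""
    openers = {"(", "[", "{"}
    closers = {")", "]", "}"}
    parts, current = [], []
    depth = 0          # how many brackets are currently open
    in_quotes = False  # True while inside "..."
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch in openers:
                depth += 1
            elif ch in closers:
                # Only the depth is tracked; bracket types are not cross-checked.
                depth -= 1
            elif ch == "," and depth == 0:
                parts.append("".join(current).strip())
                current = []
                continue
        current.append(ch)
    parts.append("".join(current).strip())
    return parts


test_case = (
    '0, (1,2), (1,2,(1,2)) [1,2,3,[1,2]], [1,2,3], "text, text (test)", '
    '{a1:1, a2:3, a3:{a1=1, s2=2}, a4:"asasad, sadsas, asasdasd"}'
)
for item in split_top_level(test_case):
    print(item)
```

Because the scan keeps a counter instead of matching patterns, arbitrary nesting depth is handled without recursion support in the regex engine.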