A regex capturing multiple groups starting with a pattern - regex

I am trying to figure out a regex that would capture multiple groups in a string where each group is defined as follows:
The group's title starts with ${{
An optional string may follow
The group's title ends with }}
Optional content may follow the title
An example would be
'${{an optional title}} some optional content'
Here are some examples of inputs and expected results
Input 1: '${{}} some text '
Result 1: ['${{}} some text ']
Input 2: '${{title1}} some text1 ${{title 2}} some text2'
Result 2: ['${{title1}} some text1 ', '${{title 2}} some text2']
Input 3 (no third group as the second ending curly bracket is missing)
'${{title1}} some text1 ${{}} some text2 ${{title2} some text3'
Result 3: ['${{title1}} some text1 ', '${{}} some text2 ${{title2} some text3']
Input 4 (a group with empty content immediately followed by another group)
'${{title1}}${{}} some text2'
Result 4: ['${{title1}}', '${{}} some text2']
Any suggestions will be appreciated!

You can achieve that with Lookaheads. Try the following pattern:
\$\{\{.*?\}\}.*?(?=\$\{\{.*?\}\}|$)
Demo.
Breakdown:
\$\{\{.*?\}\} # Matches a "group" (i.e., "${{}}") containing zero or more chars (lazy).
.*? # Matches zero or more characters after the "group" (lazy).
(?= # Start of a positive Lookahead.
\$\{\{.*?\}\} # Ensure that the match is either followed by a "group"...
| # Or...
$ # ..is at the end of the string.
) # Close the Lookahead.
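To check the pattern against the four inputs from the question, here is a minimal Python sketch. The function name split_groups is only illustrative, and the sketch assumes single-line input (with multi-line strings, `.` would stop at newlines unless re.DOTALL is used).

```python
import re

# The pattern from the answer: a "${{...}}" title, then lazy content up to
# the next complete title or the end of the string.
GROUP_PATTERN = re.compile(r"\$\{\{.*?\}\}.*?(?=\$\{\{.*?\}\}|$)")


def split_groups(text: str) -> list:
    """Return every '${{title}} content' chunk found in text."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return GROUP_PATTERN.findall(text)


print(split_groups("${{title1}} some text1 ${{title 2}} some text2"))
# ['${{title1}} some text1 ', '${{title 2}} some text2']
```

Because the lookahead requires a complete `${{...}}`, the malformed `${{title2}` in Input 3 does not start a new group and stays inside the content of the previous one.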

Related

Match specific letter from group N Regex

I have the following log message:
Aug 25 03:07:19 localhost.localdomainASM:unit_hostname="bigip1",management_ip_address="192.168.41.200",management_ip_address_2="N/A",http_class_name="/Common/log_to_elk_policy",web_application_name="/Common/log_to_elk_policy",policy_name="/Common/log_to_elk_policy",policy_apply_date="2020-08-10 06:50:39",violations="HTTP protocol compliance failed",support_id="5666478231990524056",request_status="blocked",response_code="0",ip_client="10.43.0.86",route_domain="0",method="GET",protocol="HTTP",query_string="name='",x_forwarded_for_header_value="N/A",sig_ids="N/A",sig_names="N/A",date_time="2020-08-25 03:07:19",severity="Eror",attack_type="Non-browser Client,HTTP Parser Attack",geo_location="N/A",ip_address_intelligence="N/A",username="N/A",session_id="0",src_port="39348",dest_port="80",dest_ip="10.43.0.201",sub_violations="HTTP protocol compliance failed:Bad HTTP version",virus_name="N/A",violation_rating="5",websocket_direction="N/A",websocket_message_type="N/A",device_id="N/A",staged_sig_ids="",staged_sig_names="",threat_campaign_names="N/A",staged_threat_campaign_names="N/A",blocking_exception_reason="N/A",captcha_result="not_received",microservice="N/A",tap_event_id="N/A",tap_vid="N/A",vs_name="/Common/adv_waf_vs",sig_cves="N/A",staged_sig_cves="N/A",uri="/random",fragment="",request="GET /random?name=' or 1 = 1' HTTP/1.1\r\n",response="Response logging disabled"
And I have the following RegEx:
request="(?<Flag1>.*?)"
I am now trying to match part of the text captured by the previous group, named "Flag1". The new match that I want to flag as Flag2 is /random?name=' or 1 = 1'.
How can I match the needed text by referring to the other group's number or name, without inserting the new group inside the targeted group like this:
request="(?<Flag1>\w+\s+(?<Flag2>.*?)\s+HTTP.*?)"
https://regex101.com/r/EcBv7p/1
Thanks.
You can use
request="(?<Flag1>[A-Z]+\s+(?<Flag2>\/\S+='[^']*')[^"]*)"
See the regex demo.
Details:
(?<Flag1> - Flag1 group:
[A-Z]+ - one or more uppercase ASCII letters
\s+ - one or more whitespaces
(?<Flag2>\/\S+='[^']*') - Group Flag2: /, one or more non-whitespace chars, =', zero or more chars other than ', and then a ' char
[^"]* - zero or more chars other than "
) - end of Flag1 group.
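The question does not name a regex flavor. If the pattern were used from Python, named groups would have to be written as (?P<name>...). The sketch below makes that assumption and uses a shortened version of the log line; the function name extract_flags is hypothetical.

```python
import re
from typing import Optional

# Same pattern as above, with Python's (?P<name>...) syntax for named groups.
REQUEST_PATTERN = re.compile(
    r'''request="(?P<Flag1>[A-Z]+\s+(?P<Flag2>/\S+='[^']*')[^"]*)"'''
)


def extract_flags(line: str) -> Optional[dict]:
    """Return the Flag1 and Flag2 values, or None if the line does not match."""
    match = REQUEST_PATTERN.search(line)
    return match.groupdict() if match else None


# Shortened sample of the log line from the question (raw string, so the
# backslashes of \r\n stay literal, as they appear in the log).
log_line = r'''uri="/random",fragment="",request="GET /random?name=' or 1 = 1' HTTP/1.1\r\n",response="Response logging disabled"'''

flags = extract_flags(log_line)
print(flags["Flag2"])  # /random?name=' or 1 = 1'
```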
If I understand you correctly, you want to match whatever string a previous group has matched, right?
In that case you can use a backreference, \n (where n is the group number), or in this case \1, to match the same text that your first capture group matched.
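A short Python sketch may help to show what a backreference does and does not do. It re-matches text identical to an earlier capture, so it cannot extract a sub-part of that capture. The second half shows one possible alternative that avoids nesting groups: match Flag1 first, then run a second regex on its value. The sample strings and the second pattern are assumptions made for this illustration.

```python
import re

# A backreference (\1) only matches a repeat of what group 1 captured.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")
print(repeated.group(0))  # is is

# Two-pass alternative: capture Flag1, then search inside the captured value.
line = r'''request="GET /random?name=' or 1 = 1' HTTP/1.1\r\n",response="x"'''
flag1 = re.search(r'request="(?P<Flag1>.*?)"', line).group("Flag1")
flag2 = re.search(r"\s(?P<Flag2>/.*?)\sHTTP", flag1).group("Flag2")
print(flag2)  # /random?name=' or 1 = 1'
```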

Regex to generate dynamic sql

I want to generate dynamic sql on Notepad++ based on some rules. These rules include everything, so no sql knowledge is needed, and are the following:
Dynamic sql must have each single quote escaped by another single quote ( 'hello' becomes ''hello'')
Each line should begin with "+#lin"
If a line has only whitespace, nothing should follow the "+#lin", regardless of the following rules
Replace each \t directly following "+#lin" with "+#tab"
Add " +' " after the #lin/#tab sequence
Add a single quote at the end of line
So, as an example, this input:
select 1,'hello'
from --two tabs exist after from
table1
should become:
+#lin+'select 1,''hello'''
+#lin+'from --two tabs exist after from'
+#lin
+#lin+#tab+'table1'
What I have for now is the following 4 steps:
Replace each single quote with two single quotes to cover rule 1
Replace ^(\t*)(.*)$ with \+#lin\1\+'\2' to cover rules 2,5,6
Replace \t with \+#tab to cover rule 4
Replace (\+#tab)*\+''$ with nothing to cover rule 3
Notice that this mostly works, except for the third replacement, which replaces all tabs and not only the ones at the beginning. I tried (?<=^\t*)\t with no success: it matches nothing.
I'm looking for a solution which satisfies the rules in as few replacement steps as possible.
After replacing single quotes with 2 quotes, you can do the rest in a single step:
Not very elegant for processing multiple TABs, but it works.
Ctrl+H
Find what: ^(?:(\t)(\t)?(\t)?(\t)?(\t)?(\S.*)|\h*|(.+))$
Replace with: +#lin(?1+#tab+(?2#tab+)(?3#tab+)(?4#tab+)(?5#tab+)'$6')(?7+'$7')
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(?: # non capture group
(\t) # group 1, tabulation
(\t)? # group 2, tabulation, optional
(\t)? # group 3, tabulation, optional
(\t)? # group 4, tabulation, optional
(\t)? # group 5, tabulation, optional
(\S.*) # group 6, a non-space character followed by 0 or more any character but newline
| # OR
\h* # 0 or more horizontal spaces
| # OR
(.+) # group 7, 1 or more any character but newline
) # end group
$ # end of line
Replacement:
+#lin # literally
(?1 # if group 1 exists
+#tab+ # add this
(?2#tab+) # if group 2 exists, add a second #tab+
(?3#tab+) # same for group 3
(?4#tab+) # same for group 4
(?5#tab+) # same for group 5
'$6' # content of group 6 with single quotes
) # endif
(?7 # if group 7 exists
+ # plus sign
'$7' # content of group 7 with single quotes
) # endif
You can use three substitutions here. It is not really possible (without additional assumptions) to reduce the number of steps, since several replacements have to happen at the same positions.
Step 1: Replace each single quote with two single quotes, i.e. ' with ''. No regex is needed for this, but you can leave the regex checkbox on.
Step 2: Add +#lin+ at the start of the line and only wrap its contents with ' if there is any non-whitespace char on the line (while keeping all TABs before the first '):
Find What: ^(\t*+)(\h*\S)?+(.*)
Replace With: +#lin+$1(?2'$2$3':)
Details:
^ - start of a line
(\t*+) - Group 1 ($1): zero or more TABs
(\h*\S)?+ - Group 2 ($2): an optional sequence of any zero or more horizontal whitespace chars and then a non-whitespace char
(.*) - Group 3 ($3): the rest of the line
+#lin+$1(?2'$2$3':) - replaces the match with +#lin+ + Group 1 value (i.e. tabs found), and then - only if Group 2 matches - ' + Group 2 + Group 3 values + '
Step 3: Replace each TAB after +#lin+ with #tab+:
Find What: (\G(?!^)|^\+#lin\+)\t
Replace With: $1#tab+
Details:
(\G(?!^)|^\+#lin\+) - Group 1: either
\G(?!^) - end of the previous match
| - or
^\+#lin\+ - start of a line and +#lin+ string
\t - a TAB char.
The replacement is the concatenation of Group 1 value and #tab+ string.
See this regex online demo.
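For comparison, the six rules from the question can also be sketched outside Notepad++. The Python function below is not a translation of the replacements above; it is a line-by-line restatement of the rules, and the name to_dynamic_sql is hypothetical.

```python
import re


def to_dynamic_sql(text: str) -> str:
    """Apply the question's rules to every line of text."""
    result = []
    for line in text.split("\n"):
        # Rule 3: a whitespace-only line becomes a bare "+#lin".
        if not line.strip():
            result.append("+#lin")
            continue
        # Rule 1: escape each single quote with another single quote.
        escaped = line.replace("'", "''")
        # Rule 4: only the TABs at the very start of the line become "+#tab".
        leading_tabs = re.match(r"\t*", escaped).group(0)
        body = escaped[len(leading_tabs):]
        # Rules 2, 5 and 6: "+#lin" prefix, "+'" before the body, "'" at the end.
        result.append("+#lin" + "+#tab" * len(leading_tabs) + "+'" + body + "'")
    return "\n".join(result)


print(to_dynamic_sql("select 1,'hello'\n\ttable1"))
# +#lin+'select 1,''hello'''
# +#lin+#tab+'table1'
```

TABs that appear later in a line (such as the two after "from" in the question's example) are left inside the quoted body, which is the behaviour the third replacement in the question failed to achieve.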

Regex handling multiple groups from a potentially comma delimited list

I'm trying to parse a comma separated list with multiple capture groups in each element via regex.
Sample Text
col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37
I've tried using various variants of this regex
(.*?)\s?=\s?(.*?)\s?,?
But it never gives me what I want or if it gets close it can't cope with there being just one element or vice versa.
What I'm expecting is a list of Matches with 3 groups
Match1 group 0 the whole match
Match1 group 1 col1
Match1 group 2 'Test String'
Match2 group 0 the whole match
Match2 group 1 col2
Match2 group 2 'Next Test String'
Match3 group 0 the whole match
Match3 group 1 col3
Match3 group 2 'Last Text String'
Match4 group 0 the whole match
Match4 group 1 col4
Match4 group 2 37
(Note I'm only interested in groups 1 & 2)
I'm deliberately making this non language specific as I can't get it to work in online Regex debuggers, however, my target language is Python 3
Thank you in advance and I hope I've made myself clear
The (.*?)\s?=\s?(.*?)\s?,? regex has only one obligatory pattern, =. The (.*?) at the start expands up to the leftmost =, so Group 1 captures any text before it (minus one optional whitespace, which the first \s? consumes). None of the remaining subpatterns has to match: a whitespace right after = is matched by \s?, a second one is matched by the next \s?, and a comma, if it comes next, is matched and consumed by ,?. The second .*? is lazy and nothing after it is required, so it is simply skipped and Group 2 stays empty.
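Running the original pattern in Python (the asker's target language) makes this visible. The snippet is only a demonstration of the failure; the variable names are illustrative.

```python
import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"

# The pattern from the question: group 2 is lazy and followed only by
# optional subpatterns, so it always matches the empty string.
broken = re.findall(r"(.*?)\s?=\s?(.*?)\s?,?", text)
for key, value in broken:
    print(repr(key), repr(value))
# 'col1' ''
# "'Test String' , col2" ''
# "'Next Test String',col3" ''
# "'Last Text String', col4" ''
```

Each match ends right after the = (and one optional whitespace), so the previous value leaks into the next match's Group 1.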
If you want to get the second capturing group with single quotes included, you can use
(?:,|^)\s*([^\s=]+)\s*=\s*('[^']*'|\S+)
See this regex pattern. It matches
(?:,|^) - a non-capturing group matching a , or start of string
\s* - zero or more whitespaces
([^\s=]+) - Group 1: one or more chars other than whitespace and =
\s*=\s* - a = char enclosed with zero or more whitespaces
('[^']*'|\S+) - Group 2: either ', zero or more non-'s, and a ', or one or more non-whitespaces.
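A quick Python check of this pattern against the sample text, as a sketch (the variable names are illustrative):

```python
import re

text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"

# Group 1 is the key; group 2 is the value, with quotes kept when present.
pairs = re.findall(r"(?:,|^)\s*([^\s=]+)\s*=\s*('[^']*'|\S+)", text)
for key, value in pairs:
    print(key, value)
# col1 'Test String'
# col2 'Next Test String'
# col3 'Last Text String'
# col4 37
```

One caveat with this sketch: the \S+ branch for unquoted values also consumes commas, so an unquoted value that is directly followed by ,nextcol=... without a space would swallow the next element.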
If you want to exclude single quotes you can post-process the matches, or use an extra capturing group in '([^']*)', and then check if the group matched or not:
import re
text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
# Group 1: key, Group 2: value inside single quotes, Group 3: unquoted value
pattern = r"([^,\s=]+)\s*=\s*(?:'([^']*)'|(\S+))"
matches = re.findall(pattern, text)
# For each match, only one of Group 2 and Group 3 is non-empty
print({key: quoted or bare for key, quoted, bare in matches})
# => {'col1': 'Test String', 'col2': 'Next Test String', 'col3': 'Last Text String', 'col4': '37'}
See this Python demo.
If you want to do that with a pure regex, you can use a branch reset group:
import regex # pip install regex
text = "col1 = 'Test String' , col2= 'Next Test String',col3='Last Text String', col4=37"
print( dict(regex.findall(r"([^,\s=]+)\s*=\s*(?|'([^']*)'|(\S+))", text)) )
See the Python demo (regex demo).

RegExp: match everything till next occurrence

I'm trying to split lyrics into sections with name = group(1) and lyrics = group(2) using RegExp:
#Chorus <-- section name [group(1)]
This is the chorus <-- section lyrics [group(2)]
Got no words for it <-- [group(2)]
#Verse <-- next section name
This is the verse <-- next section lyrics
I've managed to split the first occurrence of group(1) from the rest but it matches all the other occurrences to group(2).
List<SongSection> _sections = [];
RegExp regExp = RegExp(r'\n#([a-zA-Z0-9]+)\n((.|\n)*)', multiLine: true);
List<RegExpMatch> matches = regExp.allMatches('\n' + lyrics).toList();
for (RegExpMatch match in matches) {
_sections.add(
SongSection(
name: match.group(1)!,
lyrics: match.group(2)!.trim(),
),
);
}
print(_sections.toString());
OUTPUT:
[SongSection(name: 'Chorus', lyrics: 'This is the chorus\nGot no words for it\n\n#Verse\nThis is the verse')]
How can I always match everything up to the next occurrence of group(1)?
You could match the # and the allowed characters for the name, and for the lyrics capture all lines that do not start with the name pattern.
As the leading newline is part of group 2, you can remove that from the group 2 value afterwards.
#([a-zA-Z0-9]+)((?:\n(?!#[a-zA-Z0-9]+$).*)*)
#([a-zA-Z0-9]+) Match # and capture 1 or more of the listed characters in group 1
( Capture group 2
(?: Non capture group
\n Match a newline
(?!#[a-zA-Z0-9]+$) Assert not # and 1 or more listed chars to the right
.* Match the rest of the line
)* Close the non capture group and optionally repeat it
) Close group 2
Regex demo
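The question uses Dart, but the same pattern can be tried in Python as a rough sketch. It assumes the multiline flag is on so that $ matches at the end of each line (as in the question's RegExp); the function name split_sections is hypothetical.

```python
import re

# Pattern from the answer; re.MULTILINE lets $ match at every line end.
SECTION = re.compile(
    r"#([a-zA-Z0-9]+)((?:\n(?!#[a-zA-Z0-9]+$).*)*)", re.MULTILINE
)


def split_sections(lyrics: str) -> list:
    """Return (name, lyrics) pairs, one per '#Name' section."""
    # Group 2 starts with the newline after the name, so strip it afterwards.
    return [(m.group(1), m.group(2).strip()) for m in SECTION.finditer(lyrics)]


song = "#Chorus\nThis is the chorus\nGot no words for it\n\n#Verse\nThis is the verse"
for name, text in split_sections(song):
    print(name, repr(text))
# Chorus 'This is the chorus\nGot no words for it'
# Verse 'This is the verse'
```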

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way I can split the name and the version using regex? I want to use the results to make two columns, 'Name' and 'Version', in Excel.
So I want the results from the regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when I do it for a large text file, the results from the two regexes are not in the same order. So is there any way to get the results in the way I have mentioned above?
Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
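If the file is processed with a script instead of Notepad++, the same find/replace can be sketched in Python with re.sub. This assumes the whole file content is available as one string; the function name split_name_version is illustrative.

```python
import re

# Same pattern as in "Find what"; re.MULTILINE makes ^ and $ work per line.
NAME_VERSION = re.compile(r"^(.+?)[-_](\d.*)$", re.MULTILINE)


def split_name_version(text: str) -> str:
    """Replace the separator before the version with a TAB on every line."""
    return NAME_VERSION.sub(r"\1\t\2", text)


print(split_name_version("andal-4.1.0.jar\nastrix\ncom_lab_2.0.jar"))
```

Lines without a version, such as astrix, do not match and are left unchanged. The TAB-separated result can be pasted into Excel as two columns.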
I'm not quite sure what problem you meant with large files, and I believe the two regexes you showed do the opposite of what you said: the first one gets the name and the second one gets the version.
Anyway, here are the assumptions I had to make about what would make sense for you:
"Name" may be followed by - or _, followed by the version string.
"Version" is a string preceded by - or _, starting with some digits, followed by a dot or underscore, followed by some digits, and then any string.
If these assumptions make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 will be the name, Group 2 will be the version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^                   start of line
(.+?)               "Name" part, using a reluctant match of at least 1 character
(?:           )?    optional group for the version string, which consists of:
   [-_]               - or _
   (         )        followed by the "Version" (Group 2), which is
    \d+                 at least 1 digit,
    [._]                then 1 dot or underscore,
    \d+                 then at least 1 digit,
    .*                  then any string
$                   end of line
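Because the version part is optional in this regex, every non-empty line matches, which keeps names and versions in the same order. A minimal Python sketch of that idea follows; parse_entry is a hypothetical helper name, and an empty string stands in for a missing version.

```python
import re

# Pattern from the answer: group 1 is the name, optional group 2 the version.
NAME_VERSION = re.compile(r"^(.+?)(?:[-_](\d+[._]\d+.*))?$")


def parse_entry(line: str) -> tuple:
    """Return (name, version); version is '' when the line has none."""
    match = NAME_VERSION.match(line.strip())
    if match is None:  # only an empty line fails to match
        return ("", "")
    return (match.group(1), match.group(2) or "")


entries = ["andal-4.1.0.jar", "besc_2.1.0-beta", "com_lab_2.0.jar", "astrix", "lis-2_0_1.jar"]
for name, version in map(parse_entry, entries):
    print(f"{name}\t{version}")
```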