Extract folder and keep regex group order intact - regex

Given the following strings:
/folder/subfolder/all
/folder/subfolder/all?a=b
/folder/anothersubfolder/all?a=b
/folder/all
/folder/all?a=b
/folder/anothersubfolder
/folder/anothersubfolder/all
/folder
The subfolder "all" is predefined and needs to be extracted separately from any other subfolder that may or may not exist in the string.
A regex like
^\/(folder)(\/[^/?]*)?(\/[^/?]*)?(\?.*)?$
does not work for me. Each folder should always land in the same group, but with this regex the subfolder "all" ends up in either group 2 or group 3.
The results of the regex should be something like:
Group 1: /folder (mandatory, can only be "/folder")
Group 2: /subfolder (optional, can be any string except "/all")
Group 3: /all (optional, can only be "/all")
Group 4: ?a=b (optional, any set of parameters)

^\/(folder)((?:\/(?!all)[^/?]*)?)((?:\/all)?)((?:\?.*)?)$
[["folder", "/subfolder", "/all", "" ],
["folder", "/subfolder", "/all", "?a=b"],
["folder", "/anothersubfolder", "/all", "?a=b"],
["folder", "", "/all", "" ],
["folder", "", "/all", "?a=b"],
["folder", "/anothersubfolder", "", "" ],
["folder", "/anothersubfolder", "/all", "" ],
["folder", "", "", "" ]]
There are two main tricks here:
Non-capturing groups (?:...), which group parts of the pattern together without capturing what they match. They let us write things like ((?:stuff)?): a capturing group that always participates in the match but can be empty.
Negative lookahead (?!...), which asserts that a certain pattern does NOT match at the current position. In this case (?!all) says that "all" can't be in the second directory block. (Note: this also means the second directory can't start with "all".)
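As a rough check of the pattern above, here is a minimal Python sketch. Python and the helper name split_path are assumptions made for illustration; the question does not say which engine is used. It returns the four groups in fixed order for each input.

```python
import re

# Pattern from the answer; "/" needs no escaping in Python.
PATTERN = re.compile(r"^/(folder)((?:/(?!all)[^/?]*)?)((?:/all)?)((?:\?.*)?)$")


def split_path(path):
    """Return the four groups as a tuple, or None if the path does not match."""
    match = PATTERN.match(path)
    return match.groups() if match else None


if __name__ == "__main__":
    samples = (
        "/folder/subfolder/all?a=b",
        "/folder/all",
        "/folder/anothersubfolder",
        "/folder",
    )
    for sample in samples:
        print(sample, "->", split_path(sample))
```

Because every capturing group wraps an optional non-capturing group, a missing part shows up as an empty string rather than as an unset group. A path such as /folder/allegro is rejected entirely, which is the "can't start with all" caveat mentioned above.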

Related

Regex multiple exclusion and match for different patterns

I want to exclude some specific words and, if those words don't match, then match an MD5 hash, for example.
Here is a small log as an example:
"value": "ef51be4506d7d287abc8c26ea6c495f6", "u_jira_status": "", "u_quarter_closed": "", "file_hash": "ef51be4506d7d287abc8c26ea6c495f6", "escalation": "0", "upon_approval": "proceed", "correlation_id": "", "cyber_kill_change": "ef51be4506d7d287abc8c26ea6c495f6", "sys_id": "ef51be4506d7d287abc8c26ea6c495f6", "u_business_service": "", "destination_ip": "ef51be4506d7d287abc8c26ea6c495f6", u'test': u'9db92f08db4f951423c87d84f39619ef'
As you can see, there are multiple values that should match; only "value" and "id" should be excluded.
Here is the regex I am using so far:
([^value|^id](\":\s\"|':\su')\b)[a-fA-F\d]{32}\b
There are two forms the text after the excluded key can take:
"something": "hash"
'something': u'hash'
With the previous regex, "value" and "id" are excluded as expected, but the key "cyber_kill_change" does not match for some reason, while "file_hash", "destination_ip" and 'test' match as expected.
The matches are:
h": "ef51be4506d7d287abc8c26ea6c495f6
p": "ef51be4506d7d287abc8c26ea6c495f6
t': u'9db92f08db4f951423c87d84f39619ef
Instead of just the MD5 hash (the same problem affects all 3 matches), for example:
9db92f08db4f951423c87d84f39619ef
Can someone explain to me how to match correctly, please?
Note
For the exclusions I cannot use something similar to this
(?<!value|id)
The < and ! are not accepted by the software where I want to add the regex.
If it helps, I am trying to use this regex in XSOAR; here is some documentation of the permitted syntax.
"cyber_kill_change" ends with the character 'e', which also appears in "value", which is why it was excluded as well. The problem comes from the brackets []: they define a character class, so [^value|^id] matches one single character that is not among the listed characters; it does not exclude the words "value" or "id". In terms of the characters it lists, the class behaves like:
[value|id]=(v|a|l|u|e|i|d)
To match the exact word, you can use (value|id). You may try this expression:
((?<!(value|id))(\":\s\"|':\su')\b)[a-fA-F\d]{32}\b
I used CyrilEx Regex Tester to check the expression.
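The character-class behaviour described above can be reproduced with a short Python sketch. Python's re module stands in for the XSOAR engine here, which is an assumption, and key_is_matched is a hypothetical helper name. The asker's pattern is used unchanged, so only the last character of each key decides whether the line matches.

```python
import re

# The asker's original pattern, unchanged. [^value|^id] is a negated character
# class: one character that is none of v, a, l, u, e, |, ^, i, d.
ASKER_PATTERN = re.compile(r"""([^value|^id](\":\s\"|':\su')\b)[a-fA-F\d]{32}\b""")

SAMPLE_HASH = "ef51be4506d7d287abc8c26ea6c495f6"


def key_is_matched(key):
    """Build a '"key": "hash"' fragment and report whether the pattern matches it."""
    line = f'"{key}": "{SAMPLE_HASH}"'
    return ASKER_PATTERN.search(line) is not None


if __name__ == "__main__":
    for key in ("value", "sys_id", "file_hash", "destination_ip", "cyber_kill_change"):
        print(key, "->", key_is_matched(key))
```

Keys ending in a character from the class (such as the 'e' of cyber_kill_change or the 'd' of sys_id) are rejected regardless of the rest of the name.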

Matching pattern repeats for unknown times. How to replace each matched string?

I have this string
mark:: string1, string2, string3
I want it to be
mark:: xxstring1xx, xxstring2xx, xxstring3xx
The point is, I don't know how many times the matched string is repeated. Sometimes there are 10 strings in the line, sometimes there are none. So far I have come up with this matching pattern mark:: ((.*)(, )+)*, but I'm unable to find a way to substitute each individual matched string.
If possible I would like to have this output:
mark:: xxstring1xx
mark:: xxstring2xx
mark:: xxstring3xx
But if it's not possible it's fine to have the one-line solution
By using snippets you can make use of their ability to use conditionals.
If you can select the line first, this is quite easy. Use this keybinding in your keybindings.json:
{
"key": "alt+w", // whatever keybinding you want
"command": "editor.action.insertSnippet",
"args": {
"snippet": "${TM_SELECTED_TEXT/(mark::\\s*)|([^,]+)(, )?/$1${2:+xx}$2${2:+xx}$3/g}"
}
}
The find is simple: (mark::\\s*)|([^,]+)(, )?
replace: $1${2:+xx}$2${2:+xx}$3
The replacement emits capture group 1, then xx only if group 2 matched (the conditional ${2:+xx}), then group 2 itself, then the same conditional again, and finally group 3.
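To see the find/replace logic outside the editor, here is a minimal Python sketch of the same idea. It is an illustration only: VS Code evaluates the snippet transform itself, and the function name wrap_items is hypothetical. The callback plays the role of the ${2:+xx} conditionals.

```python
import re

# Same alternation as the snippet: the "mark::" prefix, or one item plus an optional ", ".
PATTERN = re.compile(r"(mark::\s*)|([^,]+)(, )?")


def wrap_items(line):
    """Wrap every comma-separated item after 'mark::' in xx...xx."""

    def replace(match):
        prefix = match.group(1) or ""
        item = match.group(2) or ""
        separator = match.group(3) or ""
        # Emit the xx markers only when an item (group 2) was matched.
        wrapped = f"xx{item}xx" if item else ""
        return prefix + wrapped + separator

    return PATTERN.sub(replace, line)


if __name__ == "__main__":
    print(wrap_items("mark:: string1, string2, string3"))
```

A line with no items, such as "mark:: ", is returned unchanged because group 2 never matches.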
If you have a bunch of these lines in a file and you want to transform them all at once, then follow these steps:
In the Find widget, Find: (mark::\s*)(.*)$ with the regex option enabled.
Alt+Enter to select all matches.
Trigger your snippet keybinding from above.
For your other version with separate lines for each entry, use this in the keybinding:
{
"key": "alt+w",
"command": "editor.action.insertSnippet",
"args": {
// single line version
// "snippet": "${TM_SELECTED_TEXT/(mark::\\s*)|([^,]+)(, )?/$1${2:+xx}$2${2:+xx}$3/g}"
// each on its own line
"snippet": "${TM_SELECTED_TEXT/(mark::\\s*)|([^,]+)(, )?/${2:+mark:: }${2:+xx}$2${2:+xx}${3:+\n}/g}"
}
}
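The one-entry-per-line variant can be sketched in Python in the same way. Again this is an illustration under the assumption that Python's re behaves like the snippet engine for this pattern, and split_marks is a hypothetical name. The prefix match is dropped and each item gets its own "mark:: " prefix.

```python
import re

PATTERN = re.compile(r"(mark::\s*)|([^,]+)(, )?")


def split_marks(line):
    """Turn 'mark:: a, b' into one 'mark:: xxaxx' line per item."""

    def replace(match):
        item = match.group(2)
        if not item:
            # This match was the original 'mark::' prefix; drop it.
            return ""
        # A newline replaces the ", " separator when one was matched.
        newline = "\n" if match.group(3) else ""
        return f"mark:: xx{item}xx{newline}"

    return PATTERN.sub(replace, line)


if __name__ == "__main__":
    print(split_marks("mark:: string1, string2, string3"))
```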
You can use
(\G(?!\A)\s*,\s*|mark::\s*)([^\s,](?:[^,]*[^\s,])?)
And replace with $1xx$2xx.
See the regex demo. Details:
(\G(?!\A)\s*,\s*|mark::\s*) - Group 1 ($1):
\G(?!\A)\s*,\s* - end of the previous successful match and then a comma enclosed with zero or more whitespaces
| - or
mark::\s* - mark:: and zero or more whitespaces
([^\s,](?:[^,]*[^\s,])?) - Group 2 ($2):
[^\s,] - a char other than whitespace and comma
(?:[^,]*[^\s,])? - an optional sequence of zero or more non-commas and then a char other than a whitespace and a comma.
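Python's built-in re module has no \G anchor, so the pattern above cannot be used there as written. The following sketch approximates the same idea for a single line by calling Pattern.match with a start position, which only succeeds right where the previous match ended. The names FIRST, NEXT and wrap_items are hypothetical, and only the first mark:: in the line is handled.

```python
import re

# Group 2 of the answer's pattern: an item without leading/trailing whitespace.
ITEM = r"([^\s,](?:[^,]*[^\s,])?)"
FIRST = re.compile(r"(mark::\s*)" + ITEM)  # the "mark::\s*" alternative
NEXT = re.compile(r"(\s*,\s*)" + ITEM)     # the "\G(?!\A)\s*,\s*" alternative


def wrap_items(line):
    """Wrap each item after the first 'mark::' in xx...xx."""
    match = FIRST.search(line)
    if match is None:
        return line
    parts = [line[:match.start()]]
    pos = match.start()
    while match is not None:
        parts.append(f"{match.group(1)}xx{match.group(2)}xx")
        pos = match.end()
        # match() is anchored at pos, which mimics continuing from \G.
        match = NEXT.match(line, pos)
    parts.append(line[pos:])
    return "".join(parts)


if __name__ == "__main__":
    print(wrap_items("mark:: string1, string2, string3"))
```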
In Visual Studio Code file search and replace feature, you can use a Rust regex compliant regex:
(mark::(?:\s*(?:,\s*)?xx\w*xx)*\s*(?:,\s*)?)([^\s,](?:[^,]*[^\s,])?)
Replace with the same $1xx$2xx replacement pattern. Caveat: you need to hit the replace button as many times as there are matches.
See this regex demo showing the replacement stages.

Extract text starting from negated set up until (but not including) first occurrence of #

Good day, community.
Say I have the following line:
[ ] This is a sentence about apples. #fruit #tag
I wish to create a regex that can generically extract the portion:
"This is a sentence about apples." only.
That is, ignore the [ ] before the sentence, and ignore #fruit #tag after.
What I have so far is: ([^\s*\[\s\]\s])(.*#)
Which is creating the following match:
This is a sentence about apples. #fruit #
How would I match up to, but not including the first occurrence of # symbol, while still negating [ ] pattern with ([^\s*\[\s\]\s]) group?
EDIT: Thanks to Wiktor Stribiżew for the critical piece to help:
RegExMatch(str, "O)\[\s*]\s*([^#]*[^#\s])", output)
Final code:
; Zim Inbox txt file
FileEncoding, UTF-8
File := "C:\Users\dragoon\Desktop\anki_cards.txt"
; sleep is necessary
;;Highlight line and copy
#IfWinActive ahk_exe zim.exe
{
clipboard=
sleep, 500
Send ^+c
ClipWait
Send ^{Down}
clipboardQuestion := clipboard
FoundQuestion := RegExMatch(clipboardQuestion,"O)\[\s*]\s*([^#]*[^#\s])",outputquestion)
clipboard=
sleep, 500
Send ^+c
ClipWait
clipboardAnswer := clipboard
FoundAnswer := RegExMatch(clipboardAnswer,"O)\[\s*]\s*([^#]*[^#\s])",outputanswer)
quotedQuestionAnswer := outputquestion[1] """" outputanswer[1] """"
Fileappend, %quotedQuestionAnswer%, %File%
}
What it does:
In Zim Wiki notebook, on Windows, press Win+V hotkey over Question? in the following structure:
[ ] Question Header
[ ] Question?
[ ] Answer about dogs #cat #dog
This will result in the text being formatted as such in an external file:
Question?"Answer about dogs"
This is an acceptable format for Anki card importing, and can be used to quickly make cards from a review structure. Thanks again for all the help on my first SO question.
You can use
\[\s*]\s*\K[^#]*[^#\s]
See the regex demo. Details:
\[\s*]\s* - [, zero or more whitespaces, ], zero or more whitespaces
\K - "forget" what has just been matched
[^#]* - zero or more chars other than #
[^#\s] - a char other than # and whitespace.
Note that in AutoHotkey, you can also capture part of a match if you use Object mode:
RegExMatch(str, "O)\[\s*]\s*([^#]*[^#\s])", output)
The string you want to use is captured with Group 1 pattern (defined with a pair of unescaped parentheses) and you can access it via output[1]. See documentation:
Object mode. [v1.1.05+]: This causes RegExMatch() to yield all information of the match and its subpatterns to a match object in OutputVar. For details, see OutputVar.
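For readers who want to try the capturing-group variant outside AutoHotkey, here is a minimal Python sketch. Python is used only for illustration, and the names extract_sentence and make_card are hypothetical. Python's re has no \K, so the group-based pattern is the one shown.

```python
import re

# Same pattern as the Object-mode call: skip "[ ]", capture up to the last
# non-space character before the first "#".
PATTERN = re.compile(r"\[\s*]\s*([^#]*[^#\s])")


def extract_sentence(line):
    """Return the text between the checkbox and the first tag, or None."""
    match = PATTERN.search(line)
    return match.group(1) if match else None


def make_card(question_line, answer_line):
    """Mimic the script's output format: Question"Answer"."""
    return f'{extract_sentence(question_line)}"{extract_sentence(answer_line)}"'


if __name__ == "__main__":
    print(extract_sentence("[ ] This is a sentence about apples. #fruit #tag"))
    print(make_card("[ ] Question?", "[ ] Answer about dogs #cat #dog"))
```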

vscode snippet - transform and replace filename

my filename is
some-fancy-ui.component.html
I want to use a vscode snippet to transform it to
SOME_FANCY_UI
So basically
apply upcase to each character
Replace all - with _
Remove .component.html
Currently I have
'${TM_FILENAME/(.)(-)(.)/${1:/upcase}${2:/_}${3:/upcase}/g}'
which gives me this
'SETUP-PRINTER-SERVER-LIST.COMPONENT.HTML'
The docs don't explain how to combine a replacement with their transforms on regex groups.
If the chunks you need to uppercase are separated by - or ., you may use
"Filename to UPPER_SNAKE_CASE": {
"prefix": "usc_",
"body": [
"${TM_FILENAME/\\.component\\.html$|(^|[-.])([^-.]+)/${1:+_}${2:/upcase}/g}"
],
"description": "Convert filename to UPPER_SNAKE_CASE dropping .component.html at the end"
}
You may check the regex workings here.
\.component\.html$ - matches .component.html at the end of the string
| - or
(^|[-.]) capture start of string or - / . into Group 1
([^-.]+) capture any 1+ chars other than - and . into Group 2.
The ${1:+_}${2:/upcase} replacement means:
${1:+ - if Group 1 is not empty,
_ - replace with _
} - end of the first group handling
${2:/upcase} - put the uppered Group 2 value back.
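The same transform can be approximated in Python to check the regex logic. This is a sketch for illustration only: VS Code applies the snippet transform itself, and to_upper_snake is a hypothetical name. The callback stands in for ${1:+_}${2:/upcase}.

```python
import re

# Either the trailing ".component.html", or a separator (or start) plus a chunk.
PATTERN = re.compile(r"\.component\.html$|(^|[-.])([^-.]+)")


def to_upper_snake(filename):
    """Convert e.g. 'some-fancy-ui.component.html' to 'SOME_FANCY_UI'."""

    def replace(match):
        chunk = match.group(2)
        if chunk is None:
            # The ".component.html" alternative matched: drop it.
            return ""
        # An underscore is emitted only when group 1 holds a real separator.
        separator = "_" if match.group(1) else ""
        return separator + chunk.upper()

    return PATTERN.sub(replace, filename)


if __name__ == "__main__":
    print(to_upper_snake("some-fancy-ui.component.html"))
```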
Here is a pretty simple alternation regex:
"upcaseSnake": {
"prefix": "rf1",
"body": [
"${TM_FILENAME_BASE/(\\..*)|(-)|(.)/${2:+_}${3:/upcase}/g}",
"${TM_FILENAME/(\\..*)|(-)|(.)/${2:+_}${3:/upcase}/g}"
],
"description": "upcase and snake the filename"
},
Either version works.
(\\..*)|(-)|(.) alternation of three capture groups is conceptually simple. The order of the groups is important, and it is also what makes the regex so simple.
(\\..*) everything after and including the first dot . in the filename goes into group 1 which will not be used in the transform.
(-) group 2, if there is a group 2, replace it with an underscore ${2:+_}.
(.) group 3, all other characters go into group 3 which will be upcased ${3:/upcase}.
See regex101 demo.
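The three-way alternation can be traced in Python as well. This is a sketch under the assumption that Python's re treats this pattern like the snippet engine does, and upcase_snake is a hypothetical name. Group 1 is discarded, group 2 becomes an underscore and group 3 is uppercased.

```python
import re

# Order matters: the dot alternative must come before the catch-all (.).
PATTERN = re.compile(r"(\..*)|(-)|(.)")


def upcase_snake(filename):
    """Drop everything from the first dot, turn '-' into '_', uppercase the rest."""

    def replace(match):
        if match.group(2):
            return "_"
        if match.group(3):
            return match.group(3).upper()
        # Group 1 (first dot and everything after it) is not used.
        return ""

    return PATTERN.sub(replace, filename)


if __name__ == "__main__":
    print(upcase_snake("some-fancy-ui.component.html"))
```

The result is the same whether the input is the full filename or the filename without its last extension, which matches the remark that either version works.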

Issue on parsing logs using regex

I have tried separating the wowza logs using regex for data analysis, but I couldn't separate the section below.
I need a SINGLE regex pattern that satisfies both log formats below.
Format 1:
live wowz://test1.example.com:443/live/_definst_/demo01|wowz://test2.example.com:443/live/_definst_/demo01 test
Format 2:
live demo01 test
I am trying to split the line on the 3 parameters and capturing them in the groups app, streamname and id, but streamname should only capture the text after the last /.
This is what I've tried:
(?<stream_name>[^/]+)$ --> Using this pattern I could only separate the format 1 "wowz" section. Not entire Format 1 example mentioned above.
Expected Output
{
"app": [
[
"live"
]
],
"streamname": [
[
"demo01"
]
],
"id": [
[
"test"
]
]
}
You can achieve what you specified using the following regex:
^(?<app>\S+) (?:\S*/)?(?<streamname>\S+) (?<id>\S+)$
regex101 demo
\S+ matches one or more non-whitespace characters.
(?:\S*/)? optionally consumes the characters of the second parameter up to and including the last /. It is a non-capturing group, so this part is not captured.
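A quick way to check the pattern against both formats is the Python sketch below. Python is an assumption, since the question does not name the tool, and parse_line is a hypothetical helper. Note that Python spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

# Same pattern as above, with Python's named-group syntax.
PATTERN = re.compile(r"^(?P<app>\S+) (?:\S*/)?(?P<streamname>\S+) (?P<id>\S+)$")


def parse_line(line):
    """Return a dict with app, streamname and id, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    format_1 = (
        "live wowz://test1.example.com:443/live/_definst_/demo01"
        "|wowz://test2.example.com:443/live/_definst_/demo01 test"
    )
    format_2 = "live demo01 test"
    print(parse_line(format_1))
    print(parse_line(format_2))
```

In format 1 the greedy \S* runs to the end of the second field and backtracks to the last /, so only the text after that slash lands in streamname.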