Regular Expression to Match List of File Extensions

I would like a regular expression that matches a list of file extensions delimited with a pipe |, such as doc|xls|pdf. The list could also be a single extension such as pdf, or a wildcard * or ?. I would also like to reject a | at the start or the end of the list, and to reject the \<>/:" characters.
I have tried the following but it doesn't account for a single * wildcard.
^([^|\\<>\/:"]|[^\\<>:"])[^\/\\<>:"]*[^|\/\\<>:"]$
I have been on one of the online testers but can't seem to get over the final hurdle. If someone could point me in the right direction I would be most grateful.

You can construct this from smaller building blocks. A single extension, excluding the characters you mention, would be:
[^\\<>/:"]+
We should probably also exclude | since that's our delimiter:
[^\\<>/:"|]+
This can automatically match wildcards as well, since they're not forbidden.
To construct the |-separated list from those is then easy:
[^\\<>/:"|]+
followed by an arbitrary number of the same thing with a | before that:
[^\\<>/:"|]+(\|[^\\<>/:"|]+)*
And if you want a complete string to match this, add the ^ and $ anchors:
^[^\\<>/:"|]+(\|[^\\<>/:"|]+)*$
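As a quick check of the anchored pattern, here is a minimal Python sketch. The function name is_valid_extension_list and the sample inputs are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer above: one or more allowed characters,
# then any number of "|extension" repetitions, anchored at both ends.
EXTENSION_LIST = re.compile(r'^[^\\<>/:"|]+(\|[^\\<>/:"|]+)*$')


def is_valid_extension_list(text: str) -> bool:
    """Return True if text is a well-formed, pipe-delimited extension list."""
    return EXTENSION_LIST.match(text) is not None


# Illustrative inputs: the first four should be accepted, the rest rejected.
samples = ["doc|xls|pdf", "pdf", "*", "?", "|doc", "doc|", "doc||pdf", "do:c"]
for sample in samples:
    print(f"{sample!r}: {is_valid_extension_list(sample)}")
```

Because the delimiter is excluded from the character class, a leading, trailing, or doubled | cannot match.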

Related

Azure Data Explorer, Kusto: regex not semantically correct in extract()

I am trying to grab a substring of a column value in Kusto.
I know that the string is always preceded by the format 'text-for-fun-' then the string of letters I want, followed by anything that is not a letter.
I thought I should use extract() as that allows me to enter a regular expression to handle the multiple possibilities of characters that can follow the string I want.
However, when I attempt to enter the regex, I keep getting a SEM0420: Semantic error: Regex pattern is ill formed.
Can you help me figure out how to enter the regex properly?
Example string: stuff milk-cow-cocoa a/123
Desired substring: cocoa
Current regex: (?<=milk-cow-\s*).*?(?=\s*[^A-Za-z])
Note: looks like the single asterisks are being removed. They appear below in code.
At this point, the \s are to defensively parse the string and remove whitespaces. The end of the overall string may also exist immediately after the desired substring.
I have tried something similar to this Data Explorer statement:
cluster("mine").database("mine").
DataTable
| where PreciseTimeStamp >ago(5h) and resourceProvider == "Provider"
| where info has "cow-milk-"
| take 200
| project extract("(?<=milk-cow-\\s*).*?(?=\\s*[^A-Za-z])", 0, info), info
I had to add an extra \ before each \ for the Data Explorer to parse the strings correctly.
Kusto's regex engine follows RE2 syntax, which supports neither lookbehind nor lookahead, so the pattern is rejected as ill formed.
You have a second argument to extract that tells the function to return the capture only, so you may use
| project extract("milk-cow-\\s*([a-zA-Z]+)", 1, info)
It means
milk-cow- - match milk-cow-
\s* - match 0 or more whitespaces
([a-zA-Z]+) - match and capture into Group 1 only one or more ASCII letters.
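The capture-group approach is not specific to Kusto. As a rough illustration, the same pattern can be tried in Python; note that Python's re module is a different engine from the one Kusto uses, and the helper name extract_word is an assumption made for this sketch.

```python
import re
from typing import Optional

# Same pattern as in the extract() call above; Group 1 holds the letters.
PATTERN = re.compile(r"milk-cow-\s*([a-zA-Z]+)")


def extract_word(info: str) -> Optional[str]:
    """Return the run of ASCII letters that follows 'milk-cow-', if any."""
    match = PATTERN.search(info)
    return match.group(1) if match else None


print(extract_word("stuff milk-cow-cocoa a/123"))
```

No lookaround is needed: the capture stops at the first non-letter, or at the end of the string.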

Remove columns from CSV

I don't know anything about Notepad++ Regex.
This is the data I have in my CSV:
6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389
How can I use a Regex to remove the 5th and 6th column? The numbers in the 5th and 6th column are variable in length.
Another problem is the User row can also contain a |, to make it even worse.
I can use a macro to fix this, but the file is a few millions lines long.
This is the final result I want to achieve:
6454345|User1-2ds3|62562012032|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|411b0fdf54fe29745897288c6ad699f7be30f389
I am open for suggestions on how to do this with another program, command line utility, either Linux or Windows.
Match \|[^|]+\|[^|]+(\|[^|]+$)
Replace $1
Basically, anchor to the end of the line and remove the two columns that precede the last one. (I assume columns can't be empty; replace + with * if they can.)
If you need finer control than that, I'd recommend writing a Java or Python script to manually parse and rewrite the file for you.
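If a script is preferred, a minimal Python sketch could split each line from the right, so that extra | characters in the user field do not matter. The function name drop_two_columns_before_last is hypothetical, and the sketch assumes the last three fields never contain a |.

```python
def drop_two_columns_before_last(line: str, sep: str = "|") -> str:
    """Remove the second- and third-to-last fields, keeping everything else."""
    stripped = line.rstrip("\n")
    # Split from the right at most three times: [head, col, col, last].
    parts = stripped.rsplit(sep, 3)
    if len(parts) < 4:
        return stripped  # too few fields: leave the line untouched
    head, _dropped_a, _dropped_b, last = parts
    return f"{head}{sep}{last}"


# Sample rows taken from the question.
rows = [
    "6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1",
    "3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433",
    "8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389",
]
for row in rows:
    print(drop_two_columns_before_last(row))
```

For a file with millions of lines, the same function can be applied while iterating over the open file line by line, so the whole file never has to be held in memory.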
I've captured three groups and given them names. If you use a replace utility like sed or vimregex, you can replace remove with nothing. Or you can use a programming language to concatenate keep_before and keep_after for the desired result.
^(?<keep_before>(?:[^|]+\|){3})(?<remove>(?:[^|]+\|){2})(?<keep_after>.*)$
You may have to remove the group namings and use \1 etc. instead, depending on what environment you use.
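For example, Python spells named groups (?P<name>...) rather than (?<name>...). The following sketch adapts the pattern accordingly and applies it to the first sample row from the question; the variable names are illustrative.

```python
import re

# Same three groups as above, using Python's (?P<name>...) syntax.
PATTERN = re.compile(
    r"^(?P<keep_before>(?:[^|]+\|){3})"
    r"(?P<remove>(?:[^|]+\|){2})"
    r"(?P<keep_after>.*)$"
)

line = "6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1"
match = PATTERN.match(line)

# Concatenate the two groups to keep; the "remove" group is dropped.
result = match.group("keep_before") + match.group("keep_after")
print(result)
```

Because this pattern counts three fields from the start of the line, a row whose user field itself contains a | (like the third sample row) is split at the wrong place; anchoring to the end of the line avoids that.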
In Notepad++, press Ctrl+H, then enter the following in the dialog:
Find what: \|\d+\|\d+(\|[0-9a-z]+)$
Replace with: $1
Search mode: Regular Expression
Click Replace All and you are done.
Regex explanation:
\|\d+ : matches the first field to remove, a | followed by digits
\|\d+ : matches the second field to remove, a | followed by digits
(\|[0-9a-z]+) : matches and captures the field after the second number
$ : forces the match to end at the end of the line
Replacement:
$1 : replaces the whole match with the captured group, that is, whatever matched between the parentheses in (\|[0-9a-z]+)

Regex to match PowerShell drive path

PowerShell's New-PsDrive Cmdlet allows for drives to be created with more-flexible names like HKLM.
I'd like to match these drive\path\file patterns in the NavigationCmdletProvider that I'm building:
csb:
csb:\
csb:\foo\bar
csb:\foo\bar\
csb:\foo\bar bar\test.txt
but not these
csb:\\
csb:\\\
([a-zA-Z]+:)?(\\[a-zA-Z0-9_.-: :]+)*\\? matches everything that I want, but still includes the two that I don't. I can't seem to get it to match 0 or 1 \ at the end of the string.
What am I missing?
All you should need to do is tie your regular expression to the beginning and end of the line using a ^ and a $ respectively:
^([a-zA-Z]+:)?(\\[a-zA-Z0-9_.-: :]+)*\\?$
This is necessary almost any time you are trying to match a specific number of characters in a regex, because without anchors the pattern is free to match just a prefix or substring of the input.
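The effect of the anchors can be checked with a short Python sketch. PowerShell uses the .NET regex engine, so treat this only as an approximation; the names ANCHORED and UNANCHORED are assumptions for illustration.

```python
import re

# Patterns copied verbatim from above. Note that ".-:" inside the character
# class is a range (from "." to ":"), so the class also admits "/" in
# addition to the characters that appear to be listed.
ANCHORED = re.compile(r"^([a-zA-Z]+:)?(\\[a-zA-Z0-9_.-: :]+)*\\?$")
UNANCHORED = re.compile(r"([a-zA-Z]+:)?(\\[a-zA-Z0-9_.-: :]+)*\\?")

should_match = [
    "csb:",
    "csb:\\",
    "csb:\\foo\\bar",
    "csb:\\foo\\bar\\",
    "csb:\\foo\\bar bar\\test.txt",
]
should_not_match = ["csb:\\\\", "csb:\\\\\\"]

for path in should_match + should_not_match:
    print(f"{path!r}: {ANCHORED.match(path) is not None}")

# Without anchors the pattern still "matches", but only a prefix of the input.
print(repr(UNANCHORED.search("csb:\\\\").group(0)))
```

The last line shows why the unanchored version appeared to accept the doubled backslash: it matched only the part up to the first backslash and ignored the rest.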

How to match either a subset (preferred), or the whole line in a regex?

I have a string that looks something like this:
"Element 1 | Element 2| Element 3: element 4"
I want to substring the portion of the source string that follows the colon (to the end of the source string), but if there is no colon, then I want to grab the whole string.
What I've tried so far are variations around this:
:.*|.*
:?.*
etc.
However, while they'll match if either the colon is present or not, they don't prefer the substring when the colon is found.
I've been playing with this on http://regexpal.com.
Ultimately, this will be used in a CMDB tool for matching CIs - so a general solution would be ideal, rather than language- or engine-specific.
You can use the following:
(:.*|[^:]*)$
Explanation:
if there is no colon, then I want to grab the whole string
This condition can be expressed with a negated character class that excludes the colon: [^:]* only matches text that contains no colon.
You can use:
(?:^|:)[^:\n]*$
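A small Python sketch of how this pattern behaves, assuming single-line input. The match includes the leading colon when one is present, so the hypothetical helper after_colon_or_all strips it afterwards.

```python
import re

# Either start at the beginning of the string or at a colon, then take
# everything up to the end as long as no further colon appears.
SUFFIX = re.compile(r"(?:^|:)[^:\n]*$")


def after_colon_or_all(text: str) -> str:
    """Return the text after the last colon, or the whole text if none."""
    match = SUFFIX.search(text)
    if match is None:
        return text  # e.g. multi-line input that the pattern does not cover
    found = match.group(0)
    # Drop the single leading colon that belongs to the match, if any.
    return found[1:] if found.startswith(":") else found


print(repr(after_colon_or_all("Element 1 | Element 2| Element 3: element 4")))
print(repr(after_colon_or_all("Element 1 | Element 2| Element 3")))
```

The regex engine tries the earliest starting position first, but the trailing $ forces it to move on until it reaches the last colon, which is why the substring is preferred whenever a colon exists.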

Regular expression to extract all words starting with colon

I would like to use a regular expression to extract "bind variable" parameters from a string that contains a SQL statement. In Oracle, the parameters are prefixed with a colon.
For example, like this:
SELECT * FROM employee WHERE name = :variable1 OR empno = :variable2
Can I use a regular expression to extract "variable1" and "variable2" from the string? That is, get all words that start with colon and end with space, comma, or the end of the string.
(I don't care if I get the same name multiple times if the same variable has been used several times in the SQL statement; I can sort that out later.)
This might work:
:\w+
This just means "a colon, followed by one or more word-class characters".
This assumes a regex engine that supports the \w word-class shorthand, as Perl-compatible engines do; strict POSIX syntax would spell it [[:alnum:]_] instead.
Of course, this only matches a single such reference. To get both, and skip the noise, something like this should work:
(:\w+).+(:\w+)
To handle simple cases like this yourself, have a look at a regex quick-start guide.
In the meantime, use:
:\w+
If your regex parser supports word boundaries,
:[a-zA-Z_0-9]+\b
Try the following:
sed -e 's/[ ,]/\n/g' yourFile.sql | grep '^:.*$' | sort | uniq
assuming your SQL is in a file called "yourFile.sql" and your sed (GNU sed, for example) treats \n in the replacement as a newline.
This should give a list of variables with no duplicates.
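For comparison, a rough Python equivalent that finds every :name occurrence and removes duplicates. The function name bind_variables is an assumption for this sketch, and like the answers above it does not try to skip colons inside string literals or comments.

```python
import re
from typing import List


def bind_variables(sql: str) -> List[str]:
    """Return the distinct :name parameters found in sql, sorted by name."""
    # The capture group drops the leading colon from each hit.
    names = re.findall(r":(\w+)", sql)
    return sorted(set(names))


statement = "SELECT * FROM employee WHERE name = :variable1 OR empno = :variable2"
print(bind_variables(statement))
```

Using a find-all call avoids having to know in advance how many parameters the statement contains.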