powershell regex capture to pull version number - regex

I'm struggling to understand regexes in PowerShell. On Friday user mike-z helped me out with a script to extract the number from a group of folders with a naming convention like this:
Core_1.1.2
Core_1.3.4
The following regex;
-replace '.*_(\d+(\.\d+){1,3})', '$1')
works perfectly to extract only the numbers (eg "1.1.2").
Unfortunately, I later realized that a couple of the folder names had some other junk text trailing the version numbers (e.g. Core_1.2.4_Prod). I tried on my own to tweak the above regex to make it also ignore the trailing text, but I didn't get far. I used various online regex generators as well as my own limited regex experience, but the regexes I came up with, which should capture the text I need, didn't work in PowerShell. Conversely, the regex above, which does work in PowerShell, fails in every regex tool I tried.
Basically, given a list of folder names like this
Core_1.1.2
Core_1.2.4_Prod
Core_1.2.6
Core_1.3.1_Prod
Core_1.4.4
I need to capture just the version number. Also, it would be greatly appreciated if you could explain why the regex works as I'm extremely confused by PS regexes at this point.

You just need to add .* at the end of your pattern,
.*_(\d+(\.\d+){1,3}).*
So your code would be,
-replace '.*_(\d+(\.\d+){1,3}).*', '$1')
By default, a replace function in any language replaces only the matched characters. Your regex .*_(\d+(\.\d+){1,3}) matches up to the last digit in the version number; it doesn't match what follows. So when you replace the matched characters with $1, the trailing _Prod is still printed after the contents of the first capturing group, because it was never matched. Match the trailing part as well in order to replace the whole line with $1 (i.e. the version number).
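To see the difference outside PowerShell, here is a minimal Python sketch. It uses Python's re.sub as a stand-in for -replace (an assumption: the two engines differ in places, but both leave unmatched text untouched), and the helper name extract_version is made up for illustration.

```python
import re

# Folder names from the question.
folders = ["Core_1.1.2", "Core_1.2.4_Prod", "Core_1.2.6", "Core_1.3.1_Prod", "Core_1.4.4"]

ORIGINAL = r".*_(\d+(\.\d+){1,3})"    # stops after the last version digit
FIXED = r".*_(\d+(\.\d+){1,3}).*"     # also consumes whatever trails the version


def extract_version(name, pattern):
    # Hypothetical helper: replace the matched part of the name with group 1.
    return re.sub(pattern, r"\1", name)


before = [extract_version(f, ORIGINAL) for f in folders]
after = [extract_version(f, FIXED) for f in folders]
print(before)  # names with a suffix keep it, because the suffix was never matched
print(after)   # only the version numbers remain
```

With the original pattern, Core_1.2.4_Prod comes out as 1.2.4_Prod; with the trailing .* the whole name is matched and only 1.2.4 is left.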

Related

Regex: extract characters from two patterns

I have the following string:
https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&...
https://www.google.com/today/monday/jbkwe3.ho4eho.8495GETD?weatherType=WHTDSG&...
I'd like to extract jbkwe3.ho4eho.8495GETD or abcde2.hopeho.3345GETD. Anything between the {weekday}/ and the ?weatherType=.
I've tried (?<=sunday\/)$.*?(?=\?weatherType=) but it only works for the first line and I want to make it applicable to all strings regardless the value of {weekday}.
I tried (?<=\/.*\/)$.*?(?=\?weatherType=) but it didn't work. Could anyone familiar with regex lend some help? Thank you!
[Update]
I'm new to regex, but I was experimenting with it in the Sublime Text editor via the "Find" functionality, which I think uses PCRE (according to this post).
Try this regex:
(?:sun|mon|tues|wednes|thurs|fri|satur)day\/\K[^?]+(?=\?weatherType)
Explanation:
(?:sun|mon|tues|wednes|thurs|fri|satur)day - matches the day of a week i.e, sunday,monday,tuesday,wednesday,thursday,friday,saturday
\/ - matches /
\K - unmatches whatever has been matched so far and pretends that the match starts from the current position. This is available in PCRE.
[^?]+ - matches 1 or more occurrences of any character that is not a ?
(?=\?weatherType) - the above subpattern [^?]+ will match all characters that are not ? until it reaches a position which is immediately followed by a ? followed by weatherType
To make the match case-insensitive, you can prepend the regex with (?i).
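For anyone trying this outside a PCRE tool: Python's built-in re module does not support \K, so the sketch below swaps in a capturing group to get the same text. The re.IGNORECASE flag plays the role of (?i). The variable names are illustrative only.

```python
import re

urls = [
    "https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&...",
    "https://www.google.com/today/monday/jbkwe3.ho4eho.8495GETD?weatherType=WHTDSG&...",
]

# Same weekday alternation and lookahead as above; group 1 stands in for \K.
pattern = re.compile(
    r"(?:sun|mon|tues|wednes|thurs|fri|satur)day/([^?]+)(?=\?weatherType)",
    re.IGNORECASE,
)

segments = [pattern.search(u).group(1) for u in urls]
print(segments)
```

Note that "today" in the URL is not picked up as a weekday, because the alternation requires one of the seven prefixes before "day".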
In the examples given, you actually only need to grab the characters between the last forward slash ("/") and the first question mark ("?").
You didn't mention what flavor regex (ie, PCRE, grep, Oracle, etc) you're using, and the actual syntax will vary depending on this, but in general, something like the following (Perl) replacement regex would handle the examples given:
s/.*\/([^?]*)\?.*/$1/gm
There are other (and more efficient) ways, but this will do the job.
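As a rough cross-check, the same substitution can be sketched in Python (my choice of language for illustration; the answer itself is Perl). No multiline flag is needed here because the pattern has no ^ or $, and . never crosses a newline, so each line is handled separately.

```python
import re

text = (
    "https://www.google.com/today/sunday/abcde2.hopeho.3345GETD?weatherType=RAOM&...\n"
    "https://www.google.com/today/monday/jbkwe3.ho4eho.8495GETD?weatherType=WHTDSG&..."
)

# In these URLs the greedy .*/ ends at the last slash before the "?";
# group 1 is whatever lies between that slash and the "?".
result = re.sub(r".*/([^?]*)\?.*", r"\1", text)
print(result)
```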

VSCode Regex Find/Replace In Files: can't get a numbered capturing group followed by numbers to work out

I have a need to replace this:
fixed variable 123
with this:
fixed variable 234
In VSCode this matches fine:
fixed(.*)123
I can't find any way to make it put the capture in the output if a number follows:
fixed$1234
fixed${1}234
But the find replace window just looks like this:
I read that VSCode uses Rust-flavoured regexes, and what I read indicates that ${1}234 should work, but VSCode just puts it in the output verbatim.
Tried named capture in a style according to here
fixed(?P<n>.*)123 //"invalid regular expression" error
VSCode doesn't seem to understand ${1}:
ps; I appreciate I could hack it in the contrived example with
FIND: fixed (.*) 123
REPL: fixed $1 234
And this does work in vscode:
but not all my data consistently has the same character before the number
After a lot of investigation by myself and @Wiktor, we discovered a workaround for this apparent bug in VS Code's search (find across files) and replace functionality, in the specific case where the replacement has a single capture group followed by digits, like
$1234, where the intent is to replace with capture group 1 ($1) followed by 234 or any other digits, but the literal text $1234 is the actual, undesired output.
[This works fine in the find/replace widget for the current file but not in the find/search across files.]
There are (at least) two workarounds. Using two consecutive groups, like $1$2234, works properly, as does $1$`234 (or putting the $` before $1).
So you could create a sham capture group, as in (.*?)()(\d{3}), where capture group 2 is empty, just to get two consecutive capture groups in the replacement, or
use your initial search regex (.*?)(\d{3}) and then put $` just before or after your "real" capture group $1.
OP has filed an issue https://github.com/microsoft/vscode/issues/102221
Oddly, I just discovered that a group followed by a single digit, like $11, works fine, but as soon as you add two or more digits it fails, so $112 fails.
I'd like to share some more insights and my reasoning when I searched for a workaround.
Main workaround idea is using two consecutive backreferences in the replacement.
I tried all the backreference syntaxes described at Replacement Strings Reference: Matched Text and Backreferences. It appeared that none of \g<1>, \g{1}, ${1}, $<1>, $+{1}, etc. work. There are some other backreferences, though, like $' (inserts the portion of the string that follows the matched substring) and $` (inserts the portion of the string that precedes the matched substring). These two do not work in the VS Code file search and replace feature either: they insert no text at all when used in the replacement pattern.
So, we may use $` or $' as empty placeholders in the replacement pattern.
Find What:      fix(.*?)123
Replace With:
fix$'$1234
fix$`$1234
Or, as in my preliminary test, already provided in Mark's answer, a "technical" capturing group matching an empty string, (), can be introduced into the pattern so that a backreference to that group can be used as a "guard" before the subsequent "meaningful" backreference:
Find What: fixed()(.*)123 (see () in the pattern that can be referred to using $1)
Replace With: fixed$1$2234
Here, $1 is a "guard" placeholder allowing correct parsing of $2 backreference.
Side note about named capturing groups
Named capturing groups are supported, but you should use the .NET/PCRE/Java named capturing group syntax, (?<name>...). Unfortunately, none of the known named backreference syntaxes work in the replacement pattern. I tried the $+{name} Boost/Perl syntax, $<name>, and ${name}; none of them work.
Conclusion
So, there are several issues here that need to be addressed:
We need an unambiguous numbered backreference syntax (\g<1>, ${1}, or $<1>)
We need to make sure $' and $` either work as expected or are parsed as literal text, like the $_ (entire input string) and $+ (text matched by the highest-numbered capturing group that actually participated in the match) backreferences, which the Visual Studio Code file search and replace feature does not recognize. The current behavior, where they insert no text at all, is rather undefined
We need to introduce named backreference syntax (like \g<name> or ${name}).
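VS Code's replace box can't be exercised from a script here, so the following is not a VS Code fix. It only illustrates what an unambiguous numbered backreference looks like in an engine that already has one: Python's re module, where \g<1> marks the end of the group number explicitly.

```python
import re

line = "fixed variable 123"

# \g<1> ends the group reference explicitly, so the digits 234 stay literal text
# instead of being read as part of the group number.
replaced = re.sub(r"fixed(.*)123", r"fixed\g<1>234", line)
print(replaced)
```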

Improving a regex

I am looking for alternate methods to get john from the provided example.
My expression works as is but was hoping for some examples of better methods.
Example: john&home
my regexp: [a-z]{3,6}[^&home]
I'm matching any characters of length 3-6 up to but not including &home
Every item I run the regexp on is in the same format: 3-6 characters followed by &home
I have looked at other posts but was hoping for a reply specific to my regexp.
Most regex engines allow you to capture parts of a regex with capture groups. For instance:
^([A-Za-z]{3,6})&home$
The parentheses here mean that you are interested in the part before the &home. The ^ and $ mean that you want to match the entire string. Without them, averylongname&homeofsomeone would be matched as well.
Since you use rubular, I assume you use the Ruby regex engine. In that case you can for instance use:
full = "john&home"
name = full.match(/^([A-Za-z]{3,6})&home$/)[1]
And name will in this case contain john.
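For comparison, here is a small Python sketch of the same idea (Python rather than Ruby is my choice for illustration). It also shows why the original [a-z]{3,6}[^&home] only works by accident: [^&home] is a character class that matches one character other than &, h, o, m or e, not the literal text &home.

```python
import re

ANCHORED = re.compile(r"^([A-Za-z]{3,6})&home$")
UNANCHORED = re.compile(r"[A-Za-z]{3,6}&home")
ORIGINAL = re.compile(r"[a-z]{3,6}[^&home]")  # the pattern from the question

name = ANCHORED.match("john&home").group(1)

# Without ^ and $, a longer string still contains a match.
loose_hit = UNANCHORED.search("averylongname&homeofsomeone") is not None
strict_hit = ANCHORED.search("averylongname&homeofsomeone") is not None

# The original pattern finds "john" only by backtracking to "joh" + "n";
# a name ending in h, o, m or e (such as "mike") is not found at all.
original_john = ORIGINAL.search("john&home").group(0)
original_mike = ORIGINAL.search("mike&home")

print(name, loose_hit, strict_hit, original_john, original_mike)
```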

Regex to replace email address domains?

I need a regex to obfuscate emails in a database dump file I have. I'd like to replace all domains with a set domain like @fake.com so I don't risk sending out emails to real people during development. The emails do have to be unique to match database constraints, so I only want to replace the domain and keep the usernames.
I currently have this regex for finding emails
\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
How do I convert this search regex into a regex I can use in a find and replace operation in either Sublime Text or SED or Vim?
EDIT:
Just a note, I just realized I could replace all strings found by @[A-Z0-9.-]+\.[A-Z]{2,4}\b in this case, but academically I am still interested in how you could treat each section of the email regex as a token and replace the username / domain independently.
SublimeText
SublimeText uses Boost syntax, which supports quite a large subset of features in Perl regex. But for this task, you don't need all those advanced constructs.
Below are 2 possible approaches:
If you can assume that @ doesn't appear in any other context (which is quite a fair assumption for normal text), then you can just search for the domain part @[A-Z0-9.-]+\.[A-Z]{2,4}\b and replace it.
Alternatively, use a capturing group (pattern) in the search and a backreference in the replacement string.
Find what
\b([A-Z0-9._%-]+)@[A-Z0-9.-]+\.[A-Z]{2,4}\b
([A-Z0-9._%-]+) is the first (and only) capturing group in the regex.
Replace with
$1@fake.com
$1 refers to the text captured by the first capturing group.
Note that for both methods above, you need to turn off case-sensitivity (indicated as the 2nd button on the lower left corner), unless you specifically want to remove only emails written in ALL CAPS.
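If you end up scripting the clean-up instead of using an editor, the capturing-group approach might look like this in Python. This is a sketch only: the sample dump line is invented, and a second group around the domain is added to show that each part can be addressed on its own, as the question asks.

```python
import re

# Group 1 is the user name, group 2 the domain; IGNORECASE replaces the
# editor's case-insensitivity toggle.
EMAIL = re.compile(
    r"\b([A-Z0-9._%-]+)@([A-Z0-9.-]+\.[A-Z]{2,4})\b",
    re.IGNORECASE,
)

# Invented sample data, not taken from a real dump.
dump = "INSERT INTO users VALUES ('alice@example.com'), ('BOB@Mail.Example.org');"

# Keep the user name (group 1) and swap the domain (group 2) for a fixed one.
scrubbed = EMAIL.sub(r"\1@fake.com", dump)
domains = [m.group(2) for m in EMAIL.finditer(dump)]

print(scrubbed)
print(domains)
```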
You may use the following command for Vim:
:%s/\(\<[A-Za-z0-9._%-]\+@\)[A-Za-z0-9.-]\+\.[A-Za-z]\{2,4}\>/\1fake.com/g
Everything between \( and \) will become a group that will be replaced by an escaped number of the group (\1 in this case). I've also modified the regexp to match the small letters and to have Vim-compatible syntax.
Also you may turn off the case sensitivity by putting \c anywhere in your regexp like this:
:%s/\c\(\<[A-Z0-9._%-]\+@\)[A-Z0-9.-]\+\.[A-Z]\{2,4}\>/\1fake.com/g
Please also note that % in the beginning of the line asks Vim to do the replacement in a whole file and g at the end to do multiple replacements in the same line.
One more approach is using the zero-width matching (\@<=):
:%s/\c\(\<[A-Z0-9._%-]\+@\)\@<=[A-Z0-9.-]\+\.[A-Z]\{2,4}\>/fake.com/g

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of files a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group numbers after the "x", group everything following until a 3-letter uppercase word is met, then capture everything that follows)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
cià
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
Since you are on a Mac, you could use the shell
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo "$filename" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files
for file in *.ext
do
    newfile=${file#*x}
    newfile=${newfile%.[A-Z][A-Z][A-Z].*}
    # or
    # newfile=$(echo "$file" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
    mv "$file" "$newfile"
done
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
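Both points can be tried out in a few lines of Python (a sketch for illustration; the renaming app itself is unknown, and the second sample string is made up). The first part does the dot-to-space conversion as a separate step after the match; the second shows how a leading .* moves the match to the last candidate position.

```python
import re

name = "Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"

# Step 1: match without the .* padding. Step 2: turn the dots in group 2 into spaces.
m = re.search(r"x(\d+)\.(.*?)\.[A-Z]{3}", name)
new_name = f"{m.group(1)} {m.group(2).replace('.', ' ')}"
print(new_name)

# Invented input with two candidate positions for x(\d+)\.
sample = "ax1.b.ABC.x2.c.DEF"
first = re.search(r"x(\d+)\.(.*?)\.[A-Z]{3}", sample).group(1)    # unpadded: first candidate
last = re.search(r".*x(\d+)\.(.*?)\.[A-Z]{3}", sample).group(1)   # leading .*: last candidate
print(first, last)
```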
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
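In Python the three behaviours are separate functions, which makes the difference easy to see. The comparison of fullmatch() with Java's matches() is my own rough analogy, not something the tools document about each other.

```python
import re

pattern = re.compile(r"x\d+")
text = "021x212.The"

anchored_start = pattern.match(text)        # must match at position 0
anywhere = pattern.search(text)             # may match anywhere in the string
whole_string = pattern.fullmatch("x212")    # must cover the entire string
not_whole = pattern.fullmatch(text)         # extra text on either side: no match

print(anchored_start, anywhere.group(0), whole_string.group(0), not_whole)
```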
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.