PowerShell regex for string between two special characters

I have a file name like this:
$inpFiledev = "abc_XYZ.bak"
I need only XYZ in a variable so I can compare it with another file name.
I tried the following:
[String]$findev = [regex]::match($inpFiledev ,'_*.').Value
Write-Host $findev

Asterisks in regex don't behave in the same way as they do in filesystem listing commands. As it stands your regex is looking for underscore, repeated zero or more times, followed by any character (represented in regex by a period). So the regex finds zero underscores right at the start of the string, then it finds 'a', and that's the match it returns.
First, correct that bit:
'_*.'
Becomes "underscore, followed by any number of characters, followed by a literal period". The 'literal period' means we need to escape the period in the regex, by using \., remembering that period means any character:
'_.*\.'
_ underscore
.* any number of characters
\. a literal period
That returns:
_XYZ.
So, not far off.
If you're looking to return something from between characters, you'll need to use capturing groups. Put parentheses around the bit you want to keep:
'_(.*)\.'
Then you'll need to use PowerShell regex groups to get the value:
[regex]::match($inpFiledev ,'_(.*)\.').Groups[1].Value
Which returns: XYZ
The number 1 in the Groups[1] just means the first capturing group, you can add as many as you like to the expression by using more parentheses, but you only need one in this case.
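The three stages above behave the same way in most regex engines. As a rough illustration only, here is a Python sketch in which Python's re module stands in for the .NET [regex] class; the variable names are made up for this example:

```python
import re

inp_file = "abc_XYZ.bak"

# '_*.'  : zero or more underscores, then any one character.
# Zero underscores match at position 0, so only the 'a' is returned.
first_try = re.search(r"_*.", inp_file).group(0)

# '_.*\.' : underscore, any characters, then a literal period.
whole_match = re.search(r"_.*\.", inp_file).group(0)

# '_(.*)\.' : same pattern, but group 1 keeps only the part in parentheses.
between = re.search(r"_(.*)\.", inp_file).group(1)

print(first_try, whole_match, between)
```

Group 0 is always the whole match, which is why the captured text is read from group 1.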

To complement mjsqu's helpful answer with two PowerShell-idiomatic alternatives:
For an overview of how regexes (regular expressions) are used in PowerShell, see Get-Help about_regular_expressions.
Using -split to split by _ and ., extracting the resulting 3-element array's middle element:
PS> ("abc_XYZ.bak" -split '[_.]')[1]
XYZ
-split's (first) RHS operand is a regex; regex [_.] is a character set ([...]) that matches a single char. that is either a literal _ or a literal . Therefore, input abc_XYZ.bak is broken into an array containing the strings abc, XYZ, and bak. Applying index [1] therefore extracts the middle token, XYZ.
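The same split can be sketched outside PowerShell. The following Python snippet is only an illustration of how the character set [_.] breaks the name apart; re.split plays the role of -split here:

```python
import re

# Split on every single character that is either '_' or '.'
parts = re.split(r"[_.]", "abc_XYZ.bak")

# The middle element of the three-element list is the token of interest
middle = parts[1]

print(parts, middle)
```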
Using -replace to extract the token of interest via a capture group ((...), referred to in the replacement operand as $1):
PS> "abc_XYZ.bak" -replace '^.+_([^.]+).+$', '$1'
XYZ
-replace too operates on a regex as the first RHS operand - what to replace - whereas the second operand specifies what to replace the matched (sub)string with.
Regex ^.+_([^.]+).+$:
^.+_ matches one or more (+) characters (.) at the start of the input (^) - note how . - used outside of a character set ([...]) - is a regex metacharacter that represents any character (in a single-line input string).
([^.]+) is a capture group ((...)) that matches a negated character set ([^...]): [^.] matches any literal char. that isn't a literal ., one or more times (+).
Whatever matched the sub-expression inside (...) can be referenced in the replacement operand as $<n>, where <n> represents the 1-based index of the capture group in the regex; in this case, $1 can be used to refer to this first (and only) capture group.
.+$ matches one or more (+) remaining characters (.) until the end of the input is reached ($).
Replacement operand $1 simply refers to what the first capture group matched; in this case: XYZ.
For a comprehensive overview of the syntax of -replace replacement operands, see this answer.
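For comparison, here is a hedged Python sketch of the same extract-by-replacing idea. Python writes the group reference as \1 rather than $1, and re.sub stands in for -replace; the second call shows that an input without a match comes back unchanged:

```python
import re

pattern = r"^.+_([^.]+).+$"

# The whole input matches, so it is replaced by what group 1 captured
token = re.sub(pattern, r"\1", "abc_XYZ.bak")

# No underscore means no match, so the input is returned as-is
unchanged = re.sub(pattern, r"\1", "plain")

print(token, unchanged)
```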

Because you're using the [regex] accelerator, you need a backslash to escape the final . (if you want to match it literally), and you need a dot before the asterisk to match any characters after the underscore. If the characters in between are all letters, you can use \w+ instead of .*
$findev = [regex]::match($inpFiledev ,'_.*\.').Value
$findev
_XYZ.

This demos two other ways to get the desired info from the sample string. The first uses the basic .Split() string method on the raw string. The second presumes you are dealing with file objects and starts by getting the .BaseName of the file. That already removes the extension, so you need not do it yourself.
If you are dealing with a large number of strings, and not file objects, then the previous regex answers will likely be faster. [grin]
$inpFiledev = 'abc_XYZ.bak'
$findev = $inpFiledev.Split('.')[0].Split('_')[-1]
# fake reading in a file with Get-Item or Get-ChildItem
$File = [System.IO.FileInfo]'c:\temp\testing\abc_XYZ.bak'
$WantedPart = $File.BaseName.Split('_')[-1]
'split on a string = {0}' -f $findev
'split on BaseName of file = {0}' -f $WantedPart
output ...
split on a string = XYZ
split on BaseName of file = XYZ
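The same two approaches, sketched in Python purely for comparison. PureWindowsPath and its stem attribute are used as a stand-in for [System.IO.FileInfo] and .BaseName; nothing is read from disk:

```python
from pathlib import PureWindowsPath

inp_file = "abc_XYZ.bak"

# Approach 1: plain string splitting, no regex involved
from_string = inp_file.split(".")[0].split("_")[-1]

# Approach 2: let the path type drop the extension first
file_path = PureWindowsPath(r"c:\temp\testing\abc_XYZ.bak")
from_stem = file_path.stem.split("_")[-1]

print(from_string, from_stem)
```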


Perl: How to substitute the content after pattern

So I can't use the $' variable.
But I need to find a pattern in a file that starts with the string “by: ” followed by any characters, then replace whatever characters come after “by: ” with an existing string $foo.
I'm using $^I and a while loop since I need to update multiple fields in a file.
I was thinking of something along the lines of s///:
s/(by\:[a-z]+)/$foo/i
I need help. Yes, this is an assignment question, but I'm 5 hours in and I've lost many brain cells in the process.
Some problems with your substitution:
You say you want to match by: (space after colon), but your regex will never match the space.
The pattern [a-z]+ means to match one or more occurrences of letters a to z. But you said you want to match "any characters". That might be zero characters, and it might contain non-letters.
You've replaced the match with $foo, but have lost by:. The entire matched string is replaced with the replacement.
No need to escape : in your pattern.
You're capturing the entire match in parentheses, but not using that anywhere.
I'm assuming you're processing the file line by line. You want "starts with the string by: followed by any characters". This is the regex:
/^by: .*/
^ matches the beginning of the line. Then by: matches exactly those characters (including the trailing space). . matches any character except a newline, and * means zero or more of the preceding item. So .* matches all the rest of the characters on the line.
"Replace whatever characters come after by: with an existing string $foo." I assume you mean the contents of the variable $foo and not the literal characters $foo. This is:
s/^by: .*/by: $foo/;
Since we matched by:, I repeated it in the replacement string because you want to preserve it. $foo will be interpolated in the replacement string.
Another way to write this would be:
s/^(by: ).*/$1$foo/
Here we've captured the text by: in the first set of parentheses. That text will be available in the $1 variable, so we can interpolate that into the replacement string.
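To see the capture-and-reuse version at work on a few lines, here is a small Python sketch (not Perl; the sample lines and the value of foo are invented for the illustration). A function is used as the replacement so that any backslashes in foo are inserted literally:

```python
import re

foo = "Jane Doe"  # hypothetical replacement text

lines = ["title: Report", "by: someone else", "by: "]

# Keep the captured 'by: ' prefix and swap the rest of the line for foo
updated = [
    re.sub(r"^(by: ).*", lambda m: m.group(1) + foo, line)
    for line in lines
]

print(updated)
```

Lines that do not start with by: are left alone, and a bare "by: " still matches because .* accepts zero characters.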

What is the meaning of this line in perl?

$line =~ s/^<(\w+)=\"(.*?)\">//;
What is the meaning of this line in perl?
The s/.../.../ is the substitution operator. It matches its first operand, which is a regular expression and replaces it with its second operand.
By default, the substitution operator works on a string stored in $_. But your code uses the binding operator (=~) to make it work on $line instead.
The two operands to the substitution operator are the bits delimited by the / characters (there are more advanced versions of these delimiters, but we'll ignore them for now). So the first operand is ^<(\w+)=\"(.*?)\"> and the second operand is an empty string (because there is nothing between the second and third / characters).
So your code says:
Examine the variable $line
Look for a section of the string which matches ^<(\w+)=\"(.*?)\">
Replace that part of the string with an empty string
All that is left now is to untangle the regular expression and see what it matches.
^ - matches the start of the string
< - matches a literal < character
(...) - means capture this bit of the match and store it in $1
\w+ - matches one or more "word characters" (where a word character is a letter, a digit or an underscore)
= - matches a literal = character
\" - matches a literal " character (the \ is unnecessary here)
(...) - means capture this bit of the match and store it in $2
.*? - matches zero or more instances of any character (except a newline), as few as possible because the ? makes the * non-greedy
\" - matches a literal " character (once again, the \ is unnecessary here)
> - matches a literal >
So, all in all, this looks like a slightly broken attempt to match XML or HTML. It matches tags of the form <foo="bar"> (which isn't valid XML or HTML) and replaces them with an empty string.
It's searching for an XML tag at the start of a string, and substituting it with nothing (i.e. removing it).
For example, in the input:
<hello="world">example
The regex will match <hello="world">, and substitute it with nothing - so the final result is just:
example
In general, this is something that you shouldn't do with regex. There are a dozen different ways you could create false negatives here, that don't get stripped from the string.
But if this is a "quick and dirty" script, where you don't need to worry about all possible edge cases, then it may be OK to use.
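The pattern itself can be tried out in other engines too. This Python sketch (an illustration, not the original Perl) shows what the two capture groups hold and that the ^ anchor stops the substitution from touching a tag later in the string:

```python
import re

tag = re.compile(r'^<(\w+)="(.*?)">')

line = '<hello="world">example'

m = tag.match(line)
name, value = m.group(1), m.group(2)   # what Perl would put in $1 and $2

# Remove the tag at the start of the string
stripped = tag.sub("", line, count=1)

# A tag that is not at the start is not matched, so nothing changes
untouched = tag.sub("", 'text <hello="world">', count=1)

print(name, value, stripped, untouched)
```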

regular expression what's the meaning of this regular expression s#^.*/##s

What is the meaning of s#^.*/##s?
I know that in a pattern '.' can represent any character except \n,
so '.*' should represent any number of arbitrary characters.
But the book says that this deletes the directory part of a Unix-style path.
My question is: does this mean I can replace any number of arbitrary characters with a space?
s -> substitution
# -> pattern delimiter
^.* -> any characters, 0 or more times, from the beginning
/ -> literal /
## -> replace with nothing (2 delimiters)
s -> single-line mode (the dot can also match a newline)
Substitutions conventionally use the / character as a delimiter (s/this/that/), but you can use other punctuation characters if it's more convenient. In this case, # is used because the regexp itself contains a / character; if / were used as the delimiter, any / in the pattern would have to be escaped as \/. (# is not the character I would have chosen, but it's perfectly valid.)
^ matches the beginning of the string (or line; see below)
.*/ matches any sequence of characters up to and including a / character. Since * is greedy, it will match all characters up to and including the last / character; any preceding / characters are "eaten" by the .*. (The final / is not, because if .* matched all / characters the final / would fail to match.)
The trailing s modifier treats the string as a single line, i.e., causes . to match any character including a newline. See the m and s modifiers in perldoc perlre for more information.
So this:
s#^.*/##s
replaces everything from the beginning of the string ($_ in this case, since that's the default) up to the last / character by nothing.
If there are no / characters in $_, the match fails and the substitution does nothing.
This might be used to replace all directory components of an absolute or relative path name, for example changing /home/username/dir/file.txt to file.txt.
It will delete all characters in a string up to and including the last slash; because of the s modifier, this includes line breaks.
Please excuse a little pedantry. But I keep seeing this and I think it's important to get it right.
s#^.*/##s is not a regular expression.
^.*/ is a regular expression.
s/// is the substitution operator.
The substitution operator takes two arguments. The first is a regular expression. The second is a replacement string.
The substitution operator (like many other quote-like operators in Perl) allows you to change the delimiter character that you use.
So s### is also a substitution operator (just using # instead of /).
s#^.*/## means "find the text that matches the regular expression ^.*/ and replace it with an empty string". And the s on the end is an option which changes the regex so that the . matches "\n" as well as all other characters.
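As a rough cross-check of the behaviour described in these answers, here is a Python sketch. re.DOTALL is Python's counterpart of the trailing s modifier, and strip_dirs is a name invented for this example:

```python
import re

def strip_dirs(path: str) -> str:
    # Greedy .* runs to the LAST '/', so everything up to it is removed
    return re.sub(r"^.*/", "", path, count=1, flags=re.DOTALL)

full = strip_dirs("/home/username/dir/file.txt")

# No '/' at all: the match fails and the string is returned unchanged
bare = strip_dirs("file.txt")

# With DOTALL the dot also crosses the newline before the last '/'
multiline = strip_dirs("a\nb/c")

print(full, bare, multiline)
```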

Replace repeating characters with one with a regex

I need a regex to remove repetitions of these particular characters. If one of these characters occurs repeatedly, replace the run with a single occurrence.
/[\s.'-,{2,0}]
These are the characters that, when they repeat, I need to replace with a single instance of the same character.
Is this the regex you're looking for?
/([\s.',-])\1+/
Okay, now that will match it (the - is placed last in the character class so that it is a literal hyphen rather than a range operator). If you're using Perl, you can replace it using the following expression:
s/([\s.',-])\1+/$1/g
Edit: If you're using :ahem: PHP, then you would use this syntax:
$out = preg_replace('/([\s.\',-])\1+/', '$1', $in);
The () group matches the character and the \1 means that the same thing it just matched in the parentheses occurs at least once more. In the replacement, the $1 refers to the match in first set of parentheses.
Note: this is Perl-Compatible Regular Expression (PCRE) syntax.
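The backreference technique carries over to other engines with similar syntax. A minimal Python sketch, assuming the characters of interest are whitespace, ., ', , and - (the hyphen is written last in the character class so that it is taken literally rather than as a range); collapse is a made-up helper name:

```python
import re

def collapse(text: str) -> str:
    # ([...]) captures one character from the set; \1+ requires the SAME
    # character at least once more; the run is replaced by one copy.
    return re.sub(r"([\s.',-])\1+", r"\1", text)

collapsed = collapse("a  b....,,--''c")

# Repeated letters are outside the set and stay untouched
letters = collapse("aab")

print(collapsed, letters)
```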
From the perlretut man page:
Matching repetitions
The examples in the previous section display an annoying weakness. We were only matching 3-letter words, or chunks of words of 4 letters or less. We'd like to be able to match words or, more generally, strings of any length, without writing out tedious alternatives like \w\w\w\w|\w\w\w|\w\w|\w.
This is exactly the problem the quantifier metacharacters ?, *, +, and {} were created for. They allow us to delimit the number of repeats for a portion of a regexp we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify. They have the following meanings:
a? means: match 'a' 1 or 0 times
a* means: match 'a' 0 or more times, i.e., any number of times
a+ means: match 'a' 1 or more times, i.e., at least once
a{n,m} means: match at least "n" times, but not more than "m" times.
a{n,} means: match at least "n" or more times
a{n} means: match exactly "n" times
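The quantifier table can be checked mechanically. This Python sketch (Python's re uses the same quantifier syntax as the Perl documentation quoted above; the sample strings and the accepted helper are invented) lists which strings of a's each pattern accepts in full:

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]

def accepted(pattern: str) -> list:
    # fullmatch requires the pattern to cover the entire sample
    return [s for s in samples if re.fullmatch(pattern, s) is not None]

optional = accepted(r"a?")          # 1 or 0 times
any_number = accepted(r"a*")        # 0 or more times
at_least_one = accepted(r"a+")      # 1 or more times
two_to_three = accepted(r"a{2,3}")  # at least 2, at most 3
two_or_more = accepted(r"a{2,}")    # at least 2
exactly_two = accepted(r"a{2}")     # exactly 2

print(optional, any_number, at_least_one, two_to_three, two_or_more, exactly_two)
```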
As others said, it depends on your regex engine, but here is a small example of how you could do this:
/([ _,.-])\1*/\1/g
With sed:
$ echo "foo    ,    bar" | sed 's/\([ _,.-]\)\1*/\1/g'
foo , bar
$ echo "foo,.       bar" | sed 's/\([ _,.-]\)\1*/\1/g'
foo,. bar
Using JavaScript, as mentioned in a comment, and assuming (it's not too clear from your question) the characters you want to replace are space characters, ., ', -, and ,:
var str = 'a b....,,';
str = str.replace(/(\s){2}|(\.){2}|('){2}|(-){2}|(,){2}/g, '$1$2$3$4$5');
// Now str === 'a b..,'
If I understand correctly, you want to do the following: given a set of characters, replace any multiple occurrence of each of them with a single character. Here's how I would do it in perl:
perl -pi.bak -e "s/\.{2,}/\./g; s/\-{2,}/\-/g; s/'{2,}/'/g" text.txt
If, for example, text.txt originally contains:
Here is . and here are 2 .. that should become a single one. Here's
also a double -- that should become a single one. Finally here we have
three ''' which should be substituted with one '.
it is modified as follows:
Here is . and here are 2 . that should become a single one. Here's
also a double - that should become a single one. Finally here we have
three ' which should be substituted with one '.
I simply use the same replacement regex for each character in the set, for example:
s/\.{2,}/\./g;
replaces 2 or more occurrences of a dot character with a single dot. I concatenate several of these expressions, one for each character of your original set.
There may be more compact ways of doing this, but, I think this is simple and it works :)
I hope it helps.
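The one-substitution-per-character idea can also be written as a loop. A hedged Python sketch follows; squeeze and its default character set are assumptions made for this example, not part of the Perl one-liner:

```python
import re

def squeeze(text: str, chars: str = ".-'") -> str:
    for ch in chars:
        # re.escape makes characters such as '.' literal; {2,} means
        # "two or more in a row". The lambda inserts ch as plain text.
        text = re.sub(re.escape(ch) + r"{2,}", lambda _m, c=ch: c, text)
    return text

result = squeeze("2 .. and a double -- and three '''")

print(result)
```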

extract word with regular expression

I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.
I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!
Why?
Your whole match evaluates to '1/temperatoA,2/CelcieusB' because that matches the following expression:
qr{ ( # begin group
\d+ # at least one digit
/ # followed by a slash
(\w+) # followed by at least one word characters
,? # maybe a comma
)* # ANY number of repetitions of this pattern.
}x;
'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to capture as many of those as it can it goes back and finds that the pattern is repeated in '2/CelcieusB' (the comma not being necessary). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.
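This overwrite-on-repeat behaviour is easy to observe. The Python sketch below is only an illustration using the pattern from the question; Python's numbered groups behave like Perl's $1 and $2 in this respect:

```python
import re

s = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

m = re.match(r"(\d+/(\w+),?)*!", s)

whole = m.group(0)        # everything the pattern consumed
last_outer = m.group(1)   # only the LAST repetition of the outer group
last_inner = m.group(2)   # only the LAST repetition of the inner group

print(whole, last_outer, last_inner)
```

The first repetition, 1/temperatoA, is matched but no longer available once the group repeats.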
Anytime you want to capture anything that fits a certain pattern in a certain string it is always best to use the global flag and assign the captures into an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.
When I do this:
use Data::Dumper;    # provides Dumper()
my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
# In list context, /g returns every capture from every match
if ( my @matches = $str =~ /$regex/g ) {
    print Dumper( \@matches );
}
I get this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB',
'23/33',
'33',
'55/66',
'66'
];
Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so--coming after a slash--they comply with the expression.
So, if this is an issue, you can change your regex to the equivalent: qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be an alpha followed by any number of word characters. Then the dump looks like this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB'
];
And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.
However, the secret to capturing more than one chunk in a capture expression is to assign the match to an array, you can then sort through the array to see if it contains the data you want.
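Collecting every match into a list looks like this in Python, shown here only as a sketch of the same idea. Python's re module has no \p{Alpha}, so [^\W\d_] (a word character that is neither a digit nor an underscore) is used as a stand-in:

```python
import re

s = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# One capture group: findall returns what the group matched, per match
all_words = re.findall(r"\d+/(\w+)", s)

# Require the first character after the slash to be a letter
alpha_words = re.findall(r"\d+/([^\W\d_]\w*)", s)

print(all_words, alpha_words)
```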
The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?
The expression you want is simply as follows:
(\w+)
With a Perl-compatible regex engine you can search for
(?<=\d/)\w+(?=.*!)
(?<=\d/) asserts that there is a digit and a slash before the start of the match
\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.
(?=.*!) asserts that there is a ! ahead in the string, i.e. the regex will fail once we have passed the !.
Depending on the language you're using, you might need to escape some of the characters in the regex.
E.g., for use in C (with the PCRE library), you need to escape the backslashes:
myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
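The lookaround pattern can be tried without compiling any C. A short Python sketch, offered as an illustration: Python's re accepts this pattern because the lookbehind \d/ has a fixed width of two characters:

```python
import re

s = "1/temperatoA,2/CelcieusB!23/33/44,55/66/77"

# (?<=\d/) : a digit and a slash must come right before the match
# (?=.*!)  : a '!' must still lie somewhere ahead of the match
words = re.findall(r"(?<=\d/)\w+(?=.*!)", s)

print(words)
```

The numbers after the ! are preceded by a digit and a slash as well, but the lookahead rejects them because no ! follows.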
Will this work?
/([[:alpha:]]\w+)\b(?=.*!)
I made the following assumptions...
A word begins with an alphabetic character.
A word always immediately follows a slash. No intervening spaces, no words in the middle.
Words after the exclamation point are ignored.
You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.
[[:alpha:]] matches any alphabetic character.
The \b matches a word boundary.
And the (?=.*!) came from Tim Pietzcker's post.