What's special about a "space" character in an "expr match" regexp?

In a bash shell, I set line like so:
line="total active bytes: 256"
Now, I just want to get the digits from that line so I do:
echo $(expr match "$line" '.*\([[:digit:]]*\)' )
and I don't get anything. But, if I add a space character before the first backslash in the regexp, then it works:
echo $(expr match "$line" '.* \([[:digit:]]*\)' )
Why?

The space isn't special at all. What's happening is that in the first case, the .* matches the entire string (i.e., it matches "greedily"), including the numbers, and since you've quantified the digits with * (as opposed to \+), that part of the regex is allowed to match 0 characters.
By putting a space before the digit match, the first part can only match up to but not including the last space in the string, leaving the digits to be matched by \([[:digit:]]*\).
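
The difference can be checked side by side. The following is a minimal sketch, not part of the original answer; it uses the POSIX `expr STRING : REGEX` form, which behaves like `expr match`, and the variable names and the third pattern are illustrative assumptions.

```shell
line="total active bytes: 256"

# Greedy .* consumes the whole line, so the group is left with the empty string.
# expr exits non-zero when it prints an empty result, hence "|| true".
greedy=$(expr "$line" : '.*\([[:digit:]]*\)' || true)

# The literal space forces .* to stop before the last space in the line.
with_space=$(expr "$line" : '.* \([[:digit:]]*\)' || true)

# Alternative (assumption): keep digits out of the prefix and require
# at least one digit inside the group.
anchored=$(expr "$line" : '[^[:digit:]]*\([[:digit:]][[:digit:]]*\)' || true)

printf 'greedy=[%s] with_space=[%s] anchored=[%s]\n' "$greedy" "$with_space" "$anchored"
```

The third pattern does not depend on a space being present, which may be preferable when the separator before the number can vary.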

Related

Powershell regex for string between two special characters

I have a file name as below:
$inpFiledev = "abc_XYZ.bak"
I need only XYZ in a variable so that I can compare it with another file name.
I tried the following:
[String]$findev = [regex]::match($inpFiledev ,'_*.').Value
Write-Host $findev

Asterisks in regex don't behave in the same way as they do in filesystem listing commands. As it stands your regex is looking for underscore, repeated zero or more times, followed by any character (represented in regex by a period). So the regex finds zero underscores right at the start of the string, then it finds 'a', and that's the match it returns.
First, correct that bit:
'_*.'
becomes "underscore, followed by any number of characters, followed by a literal period". Because an unescaped period means any character, the literal period has to be escaped as \.:
'_.*\.'
_ underscore
.* any number of characters
\. a literal period
That returns:
_XYZ.
So, not far off.
If you're looking to return something from between characters, you'll need to use capturing groups. Put parentheses around the bit you want to keep:
'_(.*)\.'
Then you'll need to use PowerShell regex groups to get the value:
[regex]::match($inpFiledev ,'_(.*)\.').Groups[1].Value
Which returns: XYZ
The number 1 in Groups[1] refers to the first capturing group. You can add as many groups as you like by using more parentheses, but you only need one in this case.

To complement mjsqu's helpful answer with two PowerShell-idiomatic alternatives:
For an overview of how regexes (regular expressions) are used in PowerShell, see Get-Help about_regular_expressions.
Using -split to split by _ and ., extracting the resulting 3-element array's middle element:
PS> ("abc_XYZ.bak" -split '[_.]')[1]
XYZ
-split's (first) RHS operand is a regex; regex [_.] is a character set ([...]) that matches a single character that is either a literal _ or a literal . (period). Therefore, input abc_XYZ.bak is broken into an array containing the strings abc, XYZ, and bak. Applying index [1] extracts the middle token, XYZ.
Using -replace to extract the token of interest via a capture group ((...), referred to in the replacement operand as $1):
PS> "abc_XYZ.bak" -replace '^.+_([^.]+).+$', '$1'
XYZ
-replace too operates on a regex as the first RHS operand - what to replace - whereas the second operand specifies what to replace the matched (sub)string with.
Regex ^.+_([^.]+).+$:
^.+_ matches one or more (+) characters (.) at the start of the input (^) - note how . - used outside of a character set ([...]) - is a regex metacharacter that represents any character (in a single-line input string).
([^.]+) is a capture group ((...)) that matches a negated character set ([^...]): [^.] matches any literal char. that isn't a literal ., one or more times (+).
Whatever matched the sub-expression inside (...) can be referenced in the replacement operand as $<n>, where <n> represents the 1-based index of the capture group in the regex; in this case, $1 can be used to refer to this first (and only) capture group.
.+$ matches one or more (+) remaining characters (.) until the end of the input is reached ($).
Replacement operand $1 simply refers to what the first capture group matched; in this case: XYZ.
For a comprehensive overview of the syntax of -replace replacement operands, see this answer.

Because you're using the [regex] accelerator, the pattern is treated as a regular expression, so you need the backslash to escape your final . (if you want to match it literally), and you need a dot before your asterisk to match any characters after your underscore. If the characters in between are all letters, then use \w+ instead of .*. Use .Value to get the matched text:
$findev = [regex]::match($inpFiledev ,'_.*\.').Value
$findev
_XYZ.

This demos two other ways to get the desired info from the sample string. The 1st uses the basic .Split() string method on the raw string. The 2nd presumes you are dealing with file objects and starts off by getting the .BaseName for the file. That already removes the extension, so you need not bother doing it yourself.
If you are dealing with a large number of strings, and not file objects, then the previous regex answers will likely be faster. [grin]
$inpFiledev = 'abc_XYZ.bak'
$findev = $inpFiledev.Split('.')[0].Split('_')[-1]
# fake reading in a file with Get-Item or Get-ChildItem
$File = [System.IO.FileInfo]'c:\temp\testing\abc_XYZ.bak'
$WantedPart = $File.BaseName.Split('_')[-1]
'split on a string = {0}' -f $findev
'split on BaseName of file = {0}' -f $WantedPart
output ...
split on a string = XYZ
split on BaseName of file = XYZ

Perl: How to substitute the content after pattern

So I can't use the $' variable.
But I need to find the pattern in a file that starts with the string "by: " followed by any characters, then replace whatever characters come after "by: " with an existing string $foo.
I'm using $^I and a while loop since I need to update multiple fields in a file.
I was thinking of something along the lines of [s///]:
s/(by\:[a-z]+)/$foo/i
I need help. Yes, this is an assignment question, but I'm 5 hours in and I've lost many brain cells in the process.

Some problems with your substitution:
You say you want to match by: (space after colon), but your regex will never match the space.
The pattern [a-z]+ means to match one or more occurrences of letters a to z. But you said you want to match "any characters". That might be zero characters, and it might contain non-letters.
You've replaced the match with $foo, but have lost by:. The entire matched string is replaced with the replacement.
No need to escape : in your pattern.
You're capturing the entire match in parentheses, but not using that anywhere.
I'm assuming you're processing the file line by line. You want "starts with the string by: followed by any characters". This is the regex:
/^by: .*/
^ matches the beginning of the line. Then by: (including the trailing space) matches exactly those characters. . matches any character except a newline, and * means zero or more of the preceding item. So .* matches all the rest of the characters on the line.
You also want to "replace whatever characters come after by: with an existing string $foo". I assume you mean the contents of the variable $foo and not the literal characters $foo. This is:
s/^by: .*/by: $foo/;
Since we matched by:, I repeated it in the replacement string because you want to preserve it. $foo will be interpolated in the replacement string.
Another way to write this would be:
s/^(by: ).*/$1$foo/
Here we've captured the text by: in the first set of parentheses. That text will be available in the $1 variable, so we can interpolate that into the replacement string.

Regular Expression to follow a specific pattern

I'm trying to make sure the input to my shell script follows the format Name_Major_Minor.extension
where Name is any number of digits/characters/"-" followed by "_"
Major is any number of digits followed by "_"
Minor is any number of digits followed by "."
and Extension is any number of characters followed by the end of the file name.
I'm fairly certain my regular expression is just messed up slightly. Any file I currently run through it evaluates to "yes", but if I add "[A-Z]$" instead of "*$" it always evaluates to "no". Regular expressions confuse the hell out of me, as you can probably tell.
if echo $1 | egrep -q [A-Z0-9-]+_[0-9]+_[0-9]+\.*$
then
echo "yes"
else
echo "nope"
exit
fi
edit: realized I am missing the pattern for "minor". Still doesn't work after adding it though.

Use =~ operator
Bash supports regular expression matching through its =~ operator, and there is no need for egrep in this particular case:
if [[ "$1" =~ ^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$ ]]
Errors in your regular expression
The \.*$ sequence in your regular expression means "zero or more dots". You probably meant "a dot and some characters after it", i.e. \..*$.
Your regular expression is anchored only at the end of the string ($), so the match may start anywhere in the input. To match the entire string, add the ^ anchor, which matches the beginning of the line.
Escape the command line arguments
If you still want to use egrep, quote the pattern, as you should quote any command-line argument that contains special characters, so that the shell does not reinterpret it. Wrap the argument in single or double quotes, e.g.:
if echo "$1" | egrep -q '^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'
Use printf instead of echo
Don't use echo for arbitrary data: its handling of option-like arguments such as -n and of backslash escapes varies between shells. Use printf instead:
printf '%s\n' "$1"
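
The last two recommendations can be combined into one small check. This is only a sketch under assumptions: the function name is_valid_name and the sample file names are made up for illustration, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical helper: succeeds when $1 looks like Name_Major_Minor.extension.
is_valid_name() {
    # printf avoids echo's portability problems; the single quotes keep the
    # shell from touching the pattern before grep sees it.
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'
}

# Sample inputs (assumptions): one valid name and three malformed ones.
for candidate in 'my-app_1_2.tar' 'my-app_1.tar' 'my-app_1_2' 'my app_1_2.tar'; do
    if is_valid_name "$candidate"; then
        printf '%s: yes\n' "$candidate"
    else
        printf '%s: nope\n' "$candidate"
    fi
done
```

Because the pattern ends in \..*$, a name with an empty extension such as my-app_1_2. is still accepted; replacing .* with .+ would require at least one character after the dot.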

Try this regex instead: ^[A-Za-z0-9-]+(?:_[0-9]+){2}\..+$.
[A-Za-z0-9-]+ matches Name
_[0-9]+ matches _ followed by one or more digits
(?:...){2} matches the group two times: _Major_Minor. Note that (?:...) is Perl-style syntax that POSIX extended regular expressions (egrep, bash's =~) do not define; use a plain group (...) there.
\..+ matches a period followed by one or more characters
The problem in your regex seems to be at the end with \.*, which matches a period (\.) any number of times. Also, [A-Z0-9-] will only match uppercase letters, digits, and hyphens, which might not be what you wanted.
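
For grep -E (the modern spelling of egrep), the repeated group can be written with ordinary parentheses. The following is a rough sketch rather than the answerer's own code; check_name and the sample file names are hypothetical.

```shell
# Plain ERE group (...) in place of the Perl-style (?:...).
pattern='^[A-Za-z0-9-]+(_[0-9]+){2}\..+$'

# Hypothetical helper that prints yes or nope for one file name.
check_name() {
    if printf '%s\n' "$1" | grep -Eq "$pattern"; then
        printf 'yes\n'
    else
        printf 'nope\n'
    fi
}

check_name 'report-a_10_3.txt'   # Name, Major, Minor and extension present
check_name 'report-a_10.txt'     # Minor is missing
check_name 'report-a_10_3.'      # extension is empty
```

The {2} interval keeps Major and Minor in one place, so changing the number of numeric fields later only means changing that count.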

matching two chars with multiple lines in between

I am new to regex and I am using Perl.
I have below tag:
<CFSC>cfsc_service=TRUE
SEC=1
licenses=10
expires=20170511
</CFSC>
I want to match anything between <CFSC> and </CFSC> tags.
I tried /<CFSC>.*?\n.*?\n.*?\n.*?\n<\/CFSC>/
and /<CFSC>(.*)<\/CFSC>/ but had no luck.

You need the /s single line modifier to make the regex engine include line breaks in what . matches:
Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
See this example:
my $foo = qq{<CFSC>cfsc_service=TRUE
SEC=1
licenses=10
expires=20170511
</CFSC>};
$foo =~ m{>(.*)</CFSC>}s;
print $1;
You also need to use a different delimiter than /, or escape it.

Try
/<CFSC>(.*)<\/CFSC>/s
The final s makes the . match newline chars (\n = 0x0a), which it usually doesn't match:
Treat string as single line. That is, change "." to match any
character whatsoever, even a newline, which normally it would not
match.
from http://perldoc.perl.org/perlre.html#Modifiers

Try this:
$foo =~ m/<CFSC>((?:(?!<\/CFSC>).)*)<\/CFSC>/gs;
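Modifiers:
g - global: match all occurrences
s - single line: . also matches newlines
i - case-insensitive matching (not used above)
\ - not a modifier; it escapes the / delimiter inside the pattern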

Only allow some characters with grep?

I would like to check a string, so it only contains the characters 0-9 a-z -.
When I do
regex='[-a-z0-9]*'
string='abcd!'
if [[ $string =~ $regex ]]
then
echo "valid"
else
echo "not valid"
fi
it outputs valid, where I would have expected not valid because $string contains a !.

Try this: regex='^[-a-z0-9]*$'. It forces the complete line to match this class. Without the anchors, a match anywhere in the string, or even an empty match (due to *), returns valid. ^...$ says that nothing between the start and the end of the string may fail to match.

You will have to add anchors for this regex to work.
'[-a-z0-9]*' says: match these characters 0 or more times anywhere in the string.
So adding start-of-line and end-of-line anchors to the regex will do what you are looking for:
regex='^[-a-z0-9]*$'
The next step is to limit the number of occurrences of '-' to at most one:
regex='^[a-z0-9]*-?[a-z0-9]*$'
Note that this still accepts a dash at the start or at the end of the string. If the dash may only occur between other characters, try:
regex='^[a-z0-9]+(-[a-z0-9]+)?$'
Hope this helps.
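
Since the question title mentions grep, here is a minimal sketch of the same anchored check with grep -E, which uses the same extended regular expression syntax as bash's =~. The function names are hypothetical, and the sketch assumes the string contains no embedded newlines, because grep tests each line separately.

```shell
# Hypothetical helper: succeeds only if the whole string consists of
# the characters 0-9, a-z, or "-".
only_allowed_chars() {
    printf '%s\n' "$1" | grep -Eq '^[-a-z0-9]*$'
}

# The unanchored pattern from the question, kept for comparison:
# it succeeds for any input because * also allows an empty match.
unanchored_match() {
    printf '%s\n' "$1" | grep -Eq '[-a-z0-9]*'
}

for string in 'abcd' 'abcd!' 'ab-cd-9'; do
    if only_allowed_chars "$string"; then
        printf '%s: valid\n' "$string"
    else
        printf '%s: not valid\n' "$string"
    fi
done
```

With *, the empty string also counts as valid; using + instead would require at least one allowed character.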
