Perl Split using "*" - regex

If I use split like this:
my @split = split(/\s*/, $line);
print "$split[1]\n";
with input:
cat dog
I get:
a
However if I use \s+ in split, I get:
dog
I'm curious as to why they don't produce the same result? Also, what is the proper way to split a string by character?
Thanks for your help.

\s* effectively means zero or more whitespace characters. Between c and a in cat are zero spaces, yielding the result you're seeing.
To the regex engine, your string looks as follows:
c
zero spaces
a
zero spaces
t
multiple spaces
d
zero spaces
o
zero spaces
g
Following this logic, if you use \s+ as a separator, it will only match the multiple spaces between cat and dog.
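To see the two quantifiers side by side, here is a minimal sketch. The sample line (cat and dog separated by three spaces) is an assumption, since the exact spacing in the question's input isn't shown.

```perl
use strict;
use warnings;

# Assumed sample line: "cat" and "dog" separated by three spaces.
my $line = 'cat   dog';

# \s* can match the empty string, so split cuts between every character.
my @by_star = split /\s*/, $line;

# \s+ needs at least one whitespace character, so only the gap matches.
my @by_plus = split /\s+/, $line;

print join('|', @by_star), "\n";   # c|a|t|d|o|g
print join('|', @by_plus), "\n";   # cat|dog
```

With \s*, $by_star[1] is a; with \s+, $by_plus[1] is dog, which matches what the question observed.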

* matches 0 or more times, which means it can match the empty string between characters. + matches 1 or more times, which means it must match at least one character.
This is described in the documentation for split:
If PATTERN matches the empty string, the EXPR is split at the match position (between characters).
Additionally, when you split on whitespace, most of the time you really want to use a literal space:
.. split ' ', $line;
As described here:
As another special case, "split" emulates the default behavior of the
command line tool awk when the PATTERN is either omitted or a literal
string composed of a single space character (such as ' ' or "\x20",
but not e.g. "/ /"). In this case, any leading whitespace in EXPR is
removed before splitting occurs, and the PATTERN is instead treated as
if it were "/\s+/"; in particular, this means that any contiguous
whitespace (not just a single space character) is used as a separator.
However, this special treatment can be avoided by specifying the
pattern "/ /" instead of the string " ", thereby allowing only a
single space character to be a separator.
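A small sketch of how the three forms differ in practice. The input string (leading spaces plus a space, tab, space run) is made up for illustration.

```perl
use strict;
use warnings;

# Made-up input: two leading spaces, then "cat", space-tab-space, "dog".
my $line = "  cat \t dog";

my @awk_mode = split ' ',   $line;   # leading whitespace dropped, splits on runs
my @regex    = split /\s+/, $line;   # leading whitespace yields an empty first field
my @single   = split / /,   $line;   # every single space is its own separator

print scalar(@awk_mode), "\n";   # 2  ('cat', 'dog')
print scalar(@regex),    "\n";   # 3  ('', 'cat', 'dog')
print scalar(@single),   "\n";   # 5  ('', '', 'cat', "\t", 'dog')
```

The empty leading field from /\s+/ is the usual reason the ' ' form is the more convenient one for whitespace-separated input.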

If you want to split a string into a list of individual characters then you should use an empty regex pattern for split, like this
my $line = 'cat';
my @split = split //, $line;
print "$_\n" for @split;
output
c
a
t
Some people prefer unpack, like this
my @split = unpack '(A1)*', $line;
which gives the same result for this input.
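One caveat worth knowing: the two approaches only agree when the string has no spaces, because the A format strips trailing whitespace when unpacking. A quick sketch, with a made-up input string; the lowercase a format is shown as an alternative that keeps the character as-is.

```perl
use strict;
use warnings;

# Made-up input containing a space.
my $line = 'a b';

my @chars_split  = split //, $line;          # ('a', ' ', 'b')
my @chars_unpack = unpack '(A1)*', $line;    # ('a', '',  'b'): "A" strips the space
my @chars_raw    = unpack '(a1)*', $line;    # ('a', ' ', 'b'): "a" keeps it

print join(',', map { "[$_]" } @chars_split),  "\n";   # [a],[ ],[b]
print join(',', map { "[$_]" } @chars_unpack), "\n";   # [a],[],[b]
print join(',', map { "[$_]" } @chars_raw),    "\n";   # [a],[ ],[b]
```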

Related

Powershell regex for string between two special characters

A file name as below
$inpFiledev = "abc_XYZ.bak"
I need only XYZ in a variable, so that I can compare it with another file name.
I tried the following:
[String]$findev = [regex]::match($inpFiledev ,'_*.').Value
Write-Host $findev
Asterisks in regex don't behave in the same way as they do in filesystem listing commands. As it stands your regex is looking for underscore, repeated zero or more times, followed by any character (represented in regex by a period). So the regex finds zero underscores right at the start of the string, then it finds 'a', and that's the match it returns.
First, correct that bit. The pattern
'_*.'
needs to become "underscore, followed by any number of characters, followed by a literal period". The literal period has to be escaped in the regex as \., because an unescaped period means any character:
'_.*\.'
_ underscore
.* any number of characters
\. a literal period
That returns:
_XYZ.
So, not far off.
If you're looking to return something from between characters, you'll need to use capturing groups. Put parentheses around the bit you want to keep:
'_(.*)\.'
Then you'll need to use PowerShell regex groups to get the value:
[regex]::match($inpFiledev ,'_(.*)\.').Groups[1].Value
Which returns: XYZ
The number 1 in the Groups[1] just means the first capturing group, you can add as many as you like to the expression by using more parentheses, but you only need one in this case.
To complement mjsqu's helpful answer with two PowerShell-idiomatic alternatives:
For an overview of how regexes (regular expressions) are used in PowerShell, see Get-Help about_regular_expressions.
Using -split to split by _ and ., extracting the resulting 3-element array's middle element:
PS> ("abc_XYZ.bak" -split '[_.]')[1]
XYZ
-split's (first) RHS operand is a regex; regex [_.] is a character set ([...]) that matches a single char. that is either a literal _ or a literal . Therefore, input abc_XYZ.bak is broken into an array containing the strings abc, XYZ, and bak. Applying index [1] therefore extracts the middle token, XYZ.
Using -replace to extract the token of interest via a capture group ((...), referred to in the replacement operand as $1):
PS> "abc_XYZ.bak" -replace '^.+_([^.]+).+$', '$1'
XYZ
-replace too operates on a regex as the first RHS operand - what to replace - whereas the second operand specifies what to replace the matched (sub)string with.
Regex ^.+_([^.]+).+$:
^.+_ matches one or more (+) characters (.) at the start of the input (^) - note how . - used outside of a character set ([...]) - is a regex metacharacter that represents any character (in a single-line input string).
([^.]+) is a capture group ((...)) that matches a negated character set ([^...]): [^.] matches any literal char. that isn't a literal ., one or more times (+).
Whatever matched the sub-expression inside (...) can be referenced in the replacement operand as $<n>, where <n> represents the 1-based index of the capture group in the regex; in this case, $1 can be used to refer to this first (and only) capture group.
.+$ matches one or more (+) remaining characters (.) until the end of the input is reached ($).
Replacement operand $1 simply refers to what the first capture group matched; in this case: XYZ.
For a comprehensive overview of the syntax of -replace replacement operands, see this answer.
Because you're using the [regex] accelerator, you need the backslash to escape your end . (if you want to match it), and you need a dot before your asterisk to match any characters after your underscore. If the characters in between are all letters, then use \w+
$findev = [regex]::match($inpFiledev ,'_.*\.').Value
$findev
_XYZ.
This demos two other ways to get the desired info from the sample string. The 1st uses the basic .Split() string method on the raw string. The 2nd presumes you are dealing with file objects and starts off by getting the .BaseName for the file. That already removes the extension, so you need not bother doing it yourself.
If you are dealing with a large number of strings, and not file objects, then the previous regex answers will likely be faster. [grin]
$inpFiledev = 'abc_XYZ.bak'
$findev = $inpFiledev.Split('.')[0].Split('_')[-1]
# fake reading in a file with Get-Item or Get-ChildItem
$File = [System.IO.FileInfo]'c:\temp\testing\abc_XYZ.bak'
$WantedPart = $File.BaseName.Split('_')[-1]
'split on a string = {0}' -f $findev
'split on BaseName of file = {0}' -f $WantedPart
output ...
split on a string = XYZ
split on BaseName of file = XYZ

escape special character in perl when splitting a string

I have a file in this format:
string: string1
string: string2
string: string3
I want to split the lines on whitespace and on :, so initially I wrote this:
my @array = split(/[:\s]/,$lineOfFile);
The result wasn't what I expected, because split also put white space into @array. After some research I concluded that I had to escape the \s, so I wrote:
my @array = split(/[:\\s]/,$lineOfFile);
Why do I have to escape \s? And is the character : a special character or not?
Can someone explain this to me?
Thanks in advance.
You don't have to double up the backslash. Have you tried it?
split /[:\\s]/, $line
will split on a colon : or a backslash \ or a small S s, giving
("", "tring", " ", "tring1")
which isn't what you want at all. I suggest you split on a colon followed by zero or more spaces
my @fields = split /:\s*/, $line
which gives this result
("string", "string1")
which I think is what you want.
You do not need to double escape \s and the colon is not a character of special meaning. But in your case, it makes sense to avoid using a character class altogether and split on a colon followed by whitespace "one or more" times.
my @array = split(/:\s+/, $lineOfFile);
The problem is, that /[:\s]/ only searches for a single character. Thus, when applying this regex, you get something like
print $array[0], ' - ', $array[1], ' - ', $array[2];
string - - string1
because it splits between : and the whitespace before string1. The string string: string1 is therefore split into three parts: string, the empty string between : and the whitespace, and string1. However, allowing more characters
my @array = split(/[:\s]+/,$lineOfFile);
works well, since :+whitespace is used for splitting.
print $array[0], ' - ', $array[1];
string - string1
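For comparison, here is a short sketch that runs the three patterns discussed above on one line of the file. The bracket formatting in the output is only there to make the empty field visible.

```perl
use strict;
use warnings;

my $line = 'string: string1';

my @single = split /[:\s]/,  $line;   # one separator character at a time
my @run    = split /[:\s]+/, $line;   # a run of colons/whitespace is one separator
my @colon  = split /:\s*/,   $line;   # a colon plus any whitespace after it

print join(',', map { "[$_]" } @single), "\n";   # [string],[],[string1]
print join(',', map { "[$_]" } @run),    "\n";   # [string],[string1]
print join(',', map { "[$_]" } @colon),  "\n";   # [string],[string1]
```

The last two give the same result here; /:\s*/ is the stricter choice if a field could itself contain spaces.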

perl: how to check alphanumeric values and limit the string size to 30 by using regex

I am trying to write a regex for Perl that checks for alphanumeric values (spaces allowed, but not the underscore "_") and limits the length to 30 characters. This is what I tried, but it is not working; could anyone please tell me what I am doing wrong? The code even accepts special characters as alphanumeric values: $currLine = 'Kapil# 123' should not be a valid value.
(Apologies: where I wrote $currLine = "regex", I meant $currLine =~ "regex".)
if ($currLine = /^[a-zA-Z0-9]{1,30}$/){
say "Line3 Good: ", $currLine;
} else {
say "Error in Line 3: Name not alphamumeric ";
}
$currLine = /^[a-zA-Z0-9]{1,30}$/
means
$currLine = $_ =~ /^[a-zA-Z0-9]{1,30}$/
You want to use
$currLine =~ /^[a-zA-Z0-9]{1,30}$/
Now on to the other problems.
You didn't allow spaces. (What follows allows whitespace. If you mean SPACE specifically, use that instead of \s).
You allow a trailing newline.
You allow 31 characters if the 31st is a newline.
You forbid many alphanumeric characters.
You forbid zero characters.
$currLine =~ /^[\p{Alnum}\s]{0,30}\z/
You are using = (assignment) where you should have =~ (bind).
Enabling warnings may have alerted you to this. The code you have is matching $_ and then assigning the results of the match to $currLine.
For your regular expression to match all alphanumeric values including spaces, you need to include whitespace inside your character class. You should also be using the bind operator =~ instead of = here.
if ( $currLine =~ /^[a-z0-9\s]{1,30}$/i ) { ...
Note: I included the i modifier for case-insensitive matching.
You are using assignment operator(=) instead of match operator(=~). You should change the if statement to:
if ($currLine =~ /^[a-zA-Z0-9]{1,30}$/)
This can also be shortened to:
if ($currLine =~ /^[^\W_]{1,30}$/)
[^\W] is a double negative: it matches exactly what \w matches. To exclude _, we add it to the negated character class, giving [^\W_]. Note, however, that this matches much more than [a-zA-Z0-9]: it includes other Unicode characters that count as word characters. To restrict the regex to ASCII text, add the /a character set modifier:
/^[^\W_]{1,30}$/a
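The points raised in the answers above can be checked with a small sketch. The helper name is_valid_name and the test strings other than 'Kapil# 123' are made up for illustration; the patterns are the ones quoted in the answers.

```perl
use strict;
use warnings;

# Hypothetical helper; the pattern is the \p{Alnum} one suggested above.
sub is_valid_name {
    my ($text) = @_;
    return $text =~ /^[\p{Alnum}\s]{0,30}\z/ ? 1 : 0;
}

my %check = (
    spaced     => is_valid_name('Kapil 123'),    # 1
    special    => is_valid_name('Kapil# 123'),   # 0: '#' is not alphanumeric
    underscore => is_valid_name('Kapil_123'),    # 0: '_' is not in \p{Alnum}
    too_long   => is_valid_name('x' x 31),       # 0: 31 characters
);

# $ tolerates one trailing newline, \z does not.
my $dollar = "abc\n" =~ /^[a-zA-Z0-9]{1,30}$/  ? 1 : 0;   # 1
my $strict = "abc\n" =~ /^[a-zA-Z0-9]{1,30}\z/ ? 1 : 0;   # 0

print "$_: $check{$_}\n" for sort keys %check;
print "dollar: $dollar, z: $strict\n";
```

If an empty string should be rejected, change {0,30} to {1,30} in the helper.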

Wildcard beginning of a line in perl

How to use wildcard for beginning of a line?
Example, I want to replace abc with def.
This is what my file looks like
abc
abc
abc
hg abc
Now I want that abc should be replaced in only first 3 lines. How to do it?
$_ =~ s/['\s'] * abc ['\s'] * /def/g;
What condition to be put before beginning of first space?
Thanks
What about:
s/(^ *)abc/$1def/g
(^ *) -> zero or more spaces at start of line
This will strictly replace abc with def.
Also note I've used a real space and not \s because you said "beginning of first space". \s matches more characters than only space.
You are making a couple of mistakes in your regex
$_ =~ s/['\s'] * abc ['\s'] * /def/g;
You don't need /g (global, match as many times as possible) if you only want to replace from the beginning of the string (since that can only match once).
Inside a character class most characters are literal; the exceptions are ], -, ^ and the backslash, so ['\s'] means "match whitespace or apostrophe '"
Spaces inside the regex is interpreted literally, unless the /x modifier is used (which it is not)
Quantifiers apply to whatever they immediately precede, so \s* means "zero or more whitespace", but \s * means "exactly one whitespace, followed by zero or more space". Again, unless /x is used.
You do not need to supply $_ =~, since that is the variable any regex uses unless otherwise specified.
If you want to replace abc, and only abc when it is the first non-whitespace in a line, you can do this:
s/^\s*\Kabc/def/
An alternate for the \K (keep) escape is to capture and put back
s/^(\s*)abc/$1def/
If you want to keep the whitespace following the target string abc, you do not need to do anything. If you want it removed, just add \s* at the end
s/^\s*\Kabc\s*/def/
Also note that this is simply a way to condense logic into one statement. You can also achieve the same by using very simple building blocks:
if (/^\s*abc/) { # if abc is the first non-whitespace
s/abc/def/; # ...substitute it
}
Since the substitution only happens once (if the /g modifier is not used), and only the first match is affected, this will reliably replace abc with def.
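As a quick check of the \K version, here is a sketch run against lines modelled on the question's file. The exact leading whitespace in the sample lines is an assumption, and \K needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Assumed sample lines: abc with different leading whitespace, plus "hg abc".
my @lines = ('abc', '  abc', "\tabc", 'hg abc');

# Work on a copy of each line so the originals stay untouched.
my @result = map {
    (my $copy = $_) =~ s/^\s*\Kabc/def/;   # \K keeps the leading whitespace
    $copy;
} @lines;

print "[$_]\n" for @result;
# [def]
# [  def]
# [	def]
# [hg abc]
```

The last line is left alone because abc is not the first non-whitespace text on it.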
Try this:
$_ =~ s/^['\s'] * abc ['\s'] * /def/g;
If you need to check from start of a line then use ^.
Also, I am not sure why you have ' and spaces in your regex. This should also work for you:
$_ =~ s/^[\s]*abc[\s]*/def/g;
Use ^ character, and remove unnecessary apostrophes, spaces and [ ] :
$_ =~ s/^\s*abc/def/g
If you want to keep those spaces that were before the "abc":
$_ =~ s/^(\s*)abc/$1def/g

Extract strings between two separators using regex in perl

I have a file which looks like:
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
and I wish to extract strings between : and | separators, the output should be:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
tab delimited between the two columns.
I wrote in unix a perl command:
perl -l -ne '/:([^|]*)?[^:]*:([^|]*)/ and print($1,"\t",$2)' <file>
the output that I got is:
Q9VNB0 EBI-102551 uniprotkb:A1ZBG6
P91682 EBI-142245 uniprotkb:Q24117
P92177-3 EBI-204491 uniprotkb:Q9VDK2
I wish to know what am I doing wrong and how can I fix the problem.
I don't wish to use split function.
Thanks,
Tom.
The expression you give is too greedy and thus consumes more characters than you wanted. The following expression works on your sample data set:
perl -l -ne '/:([^|]*)\|.*:([^|]*)\|/ and print($1,"\t",$2)'
It anchors the search with explicit matches for something between a ":" and "|" pair. If your data doesn't match exactly, it should ignore the input line, but I have not tested this. I.e., this regex assumes exactly two entries between ":" and "|" will exist per line.
Try m/: ( [^:|]+ ) \| .+ : ( [^:|]+ ) \| /x instead.
A fix could be to use a greedy expression between the first string and the second one. With .* it goes until the end and begins to backtrack, searching for the last colon whose field is followed by a pipe.
perl -l -ne '/:([^|]*).*:([^|]*)\|/ and print($1,"\t",$2)' <file>
Output:
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
See it in action:
:([\w\-]*?)\|
Another method:
:(\S*?)\|
The way you've specified it, it has to match that way. You want a single colon
followed by any number of non-pipe, followed by any number of non-colon.
single colon -> :
non-pipe -> Q9VNB0
non-colon -> |intact
colon -> :
non-pipe -> EBI-102551 uniprotkb:A1ZBG6
Instead I make a space the end-of-contract, and require all my patterns to begin
with a colon, end with a pipe and consist of non-space/non-pipe characters.
perl -M5.010 -lne 'say join( "\t", m/[:]([^\s|]+)[|]/g )';
perl -nle'print "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Or with 5.10+:
perl -nE'say "$1\t$2" if /:([^|]*)\S*\s[^:]*:([^|]*)/'
Explanation:
: Matches the start of the first "word".
([^|]*) Matches the desired part of the first "word".
\S* Matches the end of the first "word".
\s Matches the "word" separator.
[^:]*: Matches the start of the second "word".
([^|]*) Matches the desired part of the second "word".
This isn't the shortest answer (although it's close) because each part is quite independent of the others. This makes it more robust, less error-prone, and easier to maintain.
Why do you not want to use the split function? On the face of it this would be easily solved by writing
my @fields = map /:([^|]+)/, split;
I am not sure how your regex is supposed to work. Using the /x modifier to allow non-significant whitespace it looks like this
/ : ([^|]*)? [^:]* : ([^|]*) /x
which finds a colon and optionally captures as many non-pipe characters as possible. Then it skips over as many non-colon characters as possible to the next colon. Then it captures as many non-pipe characters as possible. Because all of your matches are greedy, any one of them is allowed to consume all of the rest of the string as long as the characters match the character class. Note that a ? that indicates an optional sequence will first of all match all that it can, and the option to skip the sequence will be taken only if the rest of the pattern cannot then be made to match
It is hard to judge from your examples the precise criteria for a field, but this code should do the trick. It finds sequences of characters that are neither a colon nor a pipe that are preceded by a colon and terminated by a pipe
use strict;
use warnings;
while (<DATA>) {
my @fields = /:([^:|]+)\|/g;
print join("\t", @fields), "\n";
}
__DATA__
uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768
uniprotkb:P91682|intact:EBI-142245 uniprotkb:Q24117|intact:EBI-156442
uniprotkb:P92177-3|intact:EBI-204491 uniprotkb:Q9VDK2|intact:EBI-87444
output
Q9VNB0 A1ZBG6
P91682 Q24117
P92177-3 Q9VDK2
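To see side by side why the question's pattern captures too much while the global match does not, here is a minimal sketch on the first data line. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = 'uniprotkb:Q9VNB0|intact:EBI-102551 uniprotkb:A1ZBG6|intact:EBI-195768';

# The question's pattern: the second group starts right after the second
# colon and runs on until the next pipe.
my @from_question = $line =~ /:([^|]*)?[^:]*:([^|]*)/;

# Global match: each field must sit between a colon and a pipe.
my @fields = $line =~ /:([^:|]+)\|/g;

print join(' / ', @from_question), "\n";   # Q9VNB0 / EBI-102551 uniprotkb:A1ZBG6
print join(' / ', @fields),        "\n";   # Q9VNB0 / A1ZBG6
```

The intact:EBI-... parts are skipped by the global match because they are followed by a space or the end of the line rather than a pipe.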