Perl split string based on forward slash - regex

I am new to Perl, so this is a basic question. I have a string as shown below. I am interested in taking the date out of it, so I am thinking of splitting it on the slash
my $path = "/bla/bla/bla/20160306";
my $date = (split(/\//,$path))[3];#ideally 3 is date position in array after split
print $date;
However, I don't see the expected output, but instead I see 5 getting printed.

Since the path starts with the pattern / itself, split returns a list with an empty string first (to the left of the first /); one element more. Thus the posted code miscounts by one and returns the one before last element (subdirectory) in the path, not the date.
If the date is always the last thing in the string, you can pick the last element
my $date = (split '/', $path)[-1];
where I've used '' as delimiters so as not to have to escape /. (This, however, may confuse, since the separator pattern is a regex and // conveys that, while '' may appear to merely quote a string.)
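The off-by-one is easy to see by printing every field with its index. This is a minimal sketch using the same path as in the question; path_fields is a hypothetical helper name, not something from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: split a path on '/' and return every field,
# including the empty one produced by a leading slash.
sub path_fields {
    my ($path) = @_;
    return split m{/}, $path;
}

my @fields = path_fields("/bla/bla/bla/20160306");

# Show each index next to its value; index 0 holds the empty string.
printf "%d: '%s'\n", $_, $fields[$_] for 0 .. $#fields;

# The date sits at index 4, or simply at index -1.
print "date: $fields[-1]\n";
```

Indexing with -1 avoids having to know how many directories precede the date.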
This can also be done with regex
my @parts = $path =~ m{([^/]+)}g;
With this there can be no initial empty string. Or, the last part can be picked out of the full list as above, with ($path =~ m{...}g)[-1], but if you indeed only need the last bit then extract it directly
my ($last_part) = $path =~ m{.*/(.*)};
Here the "greedy" .* matches everything in the string up to the last instance of the next subpattern (/ here), thus getting us to the last part of the path, which is then captured. The regex match operator returns its matches only when it is in the list context so parens on the left are needed.
Which brings us to the fact that you are parsing a path, and there are libraries dedicated to that.
For splitting a path into its components one tool is splitdir from File::Spec
use File::Spec;
my @parts = File::Spec->splitdir($path);
If the path starts with a / we'll again get an empty string for the first element (by design, see docs). That can then be removed, if there
shift @parts if $parts[0] eq '';
Again, the last element alone can be had like in the other examples.
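Put together, the splitdir approach might look like the sketch below. It assumes Unix-style paths and therefore calls File::Spec::Unix directly so the result does not depend on the platform; path_components is a hypothetical helper name. Unlike the shift above, it drops every empty component, including one left by a trailing slash.

```perl
use strict;
use warnings;
use File::Spec::Unix;

# Hypothetical helper: return the non-empty components of a Unix-style path.
sub path_components {
    my ($path) = @_;
    my @parts = File::Spec::Unix->splitdir($path);
    # Drop empty components such as the one before a leading '/'.
    return grep { length } @parts;
}

my @parts = path_components("/bla/bla/bla/20160306");
print "last component: $parts[-1]\n";
```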

Simply bind it to the end:
(\d+)$
# look for digits at the end of the string
See a demo on regex101.com. The capturing group is only for clarification though not really needed in this case.
In Perl this would be (I am a PHP/Python guy, so bear with me when it is ugly)
my $path = "/bla/bla/bla/20160306";
$path =~ /(\d+)$/;
print $1;
See a demo on ideone.com.
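One caveat with printing $1 unconditionally: if the match fails, $1 keeps whatever an earlier successful match left in it. A small sketch that captures in list context and reports a failed match explicitly (trailing_digits is a hypothetical helper name):

```perl
use strict;
use warnings;

# Hypothetical helper: return the run of digits at the end of $path,
# or undef when the path does not end in digits.
sub trailing_digits {
    my ($path) = @_;
    my ($digits) = $path =~ /(\d+)$/;
    return $digits;
}

my $date = trailing_digits("/bla/bla/bla/20160306");
print defined $date ? "$date\n" : "no date found\n";
```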

Try this
Use a lookahead to do it. Splitting on a lookahead keeps the / at the start of each piece, so afterwards use a substitution to remove the leading slash.
my $path = "/a/b/c/20160306";
my $date = (split(/(?=\/)/,$path))[3];
$date=~s/^\///;
print $date;
Or else use pattern matching with a capture group to do it.
my $path = "/a/b/c/20160306";
my @data = $path =~ m/\/(\w+)/g;
print $data[3];

Related

PwSh RegEx capture Version information - omit (surrounding) x(32|64|86) & 32|64(-)Bit

I'm pulling my hair, to RegEx-tract the bare version information from some filenames.
e.g. "1.2.3.4"
Let's assume, I have the following Filenames:
VendorSetup-x64-1.23.4.exe
VendorSetup-1-2-3-4.exe
Vendor Setup 1.23.456Update.exe
SoftwareName-1.2.34.5-x64.msi
SoftwareName-1.2.3.4-64bit.msi
SoftwareName-64-Bit-1.2.3.4.msi
VendorName_SoftwareName_64_1.2.3_Setup.exe
(And I know there are still some filenames out there, that have "x32" as well as "x86" in them, so I've added them to the title)
First of all, I replaced the _'s and -'s with .'s, which I'd like to avoid in general, but I haven't found a cleverer approach. To be honest, it only works well if there is no other "digit" information in the string, unlike, for example, the 2nd filename.
I then tried to extract the Version information using Regex like
-replace '^(?:\D+)?(\d+((\.\d+){1,4})?)(?:.*)?', '$1'
Which lacks the ability to omit "x64", "64Bit", "64-Bit" or any variation of that generally.
Additionally, I played around with RegExes like
-replace '^(?:[xX]*\d{2})?(?:\D+)?(\d+((\.\d+){1,4})?)(?:.*)?$', '$1'
to try to omit a leading "x64" or "64", but with no success (most probably because of the replacement of -'s with .'s).
And before it gets even worse, I'd like to ask if there's anybody who could help me or lead me in the right direction?
Thanks in advance!
This could be done using a single pattern, but by splitting it up into two separate patterns and letting PowerShell do some of the work, the overall solution can be much easier.
Pattern 1 matches version numbers that are separated by . (dot):
(?<=[\s_-])\d+(?:\.\d+){1,3}
Pattern 2 matches version numbers that are separated by - (dash):
(?<=[\s_-])\d+(?:-\d+){1,3}
The patterns start with (?<=[\s_-]), which is a positive lookbehind assertion that makes sure that the version is separated by space, underscore or dash on the left side, without including these in the captured value. This prevents the substring 64-1 in the first sample from matching as a version.
Detailed explanations of the pattern can be found at regex101.
Powershell code:
# Create an array of sample filenames
$names = @'
VendorSetup-2022-05-x64-1.23.4.exe
VendorSetup-x64-1.23.4-2022-05.exe
VendorSetup-1-2-3-4.exe
VendorSetup_2022-05_1-2-3-4.exe
Vendor Setup 1.23.456Update.exe
SoftwareName-1.2.34.5-x64.msi
SoftwareName-1.2.3.4-64bit.msi
SoftwareName-64-Bit-1.2.3.4.msi
VendorName_SoftwareName_64_1.2.3_Setup.exe
NoVersion.exe
'@ -split '\r?\n'
# Array of RegEx patterns in order of precedence.
$versionPatterns = '(?<=[\s_-])\d+(?:\.\d+){1,3}', # 2..4 numbers separated by '.'
'(?<=[\s_-])\d+(?:-\d+){1,3}' # 2..4 numbers separated by '-'
foreach( $name in $names ) {
$version = $versionPatterns.
ForEach{ [regex]::Match( $name, $_, 'RightToLeft' ).Value }. # Apply each pattern from right to left of string.
Where({ $_ }, 'First'). # Get first matching pattern (non-empty value).
ForEach{ $_ -replace '\D+', '.' }[0] # Normalize the number separator and get single string.
# Output custom object for nice table formatting
[PSCustomObject]@{ Name = $name; Version = $version }
}
Output:
Name                                       Version
----                                       -------
VendorSetup-2022-05-x64-1.23.4.exe         1.23.4
VendorSetup-x64-1.23.4-2022-05.exe         1.23.4
VendorSetup-1-2-3-4.exe                    1.2.3.4
VendorSetup_2022-05_1-2-3-4.exe            1.2.3.4
Vendor Setup 1.23.456Update.exe            1.23.456
SoftwareName-1.2.34.5-x64.msi              1.2.34.5
SoftwareName-1.2.3.4-64bit.msi             1.2.3.4
SoftwareName-64-Bit-1.2.3.4.msi            1.2.3.4
VendorName_SoftwareName_64_1.2.3_Setup.exe 1.2.3
NoVersion.exe
Explanation of the Powershell code:
To resolve ambiguities when a filename has multiple matches of the patterns, we use the following rules:
Version with . separator is preferred over version with - separator. We simply apply the patterns in this order and stop when the first pattern matches.
Rightmost version is preferred (by passing the RightToLeft flag to [regex]::Match()).
.ForEach and .Where are PowerShell intrinsic methods. They are basically faster variants of the ForEach-Object and Where-Object cmdlets.
The index [0] operator after the last .ForEach is required because .ForEach and .Where always return arrays, even if there is only a single value, contrary to the behaviour of cmdlets.

Telling regex search to only start searching at a certain index

Normally, a regex search will start searching for matches from the beginning of the string I provide. In this particular case, I'm working with a very large string (up to several megabytes), and I'd like to run successive regex searches on that string, but beginning at specific indices.
Now, I'm aware that I could use the substr function to simply throw away the part at the beginning I want to exclude from the search, but I'm afraid this is not very efficient, since I'll be doing it several thousand times.
The specific purpose I want to use this for is to jump from word to word in a very large text, skipping whitespace (regardless of whether it's simple spaces, tabs, newlines, etc). I know that I could just use the split function to split the text into words by passing \s+ as the delimiter, but that would make things far more complicated for me later on, as there are various other possible word delimiters such as quotes (ok, I'm using the term 'word' a bit generously here), so it would be easier for me if I could just hop from word to word using successive regex searches on the same string, always specifying the next index at which to start looking as I go. Is this doable in Perl?
So you want to match against the words of a body of text.
(The examples find words that contain i.)
You think having the starting positions of the words would help, but it isn't useful. The following illustrates what it might look like to obtain the positions and use them:
my @positions;
while ($text =~ /\w+/g) {
push @positions, $-[0];
}
my @matches;
for my $pos (@positions) {
pos($text) = $pos;
push @matches, $1 if $text =~ /\G(\w*i\w*)/g;
}
It would be far simpler not to use the starting positions at all. Aside from being far simpler, we also remove the need for two different regex patterns to agree as to what constitutes a word. The result is the following:
my @matches;
while ($text =~ /\b(\w*i\w*)/g) {
push @matches, $1;
}
or
my @matches = $text =~ /\b(\w*i\w*)/g;
A far better idea, however, is to extract the words themselves in advance. This approach allows for simpler patterns and more advanced definitions of "word"[1].
my @matches;
while ($text =~ /(\w+)/g) {
my $word = $1;
push @matches, $word if $word =~ /i/;
}
or
my @matches = grep { /i/ } $text =~ /\w+/g;
[1] For example, a proper tokenizer could be used.
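As a rough illustration of a broader definition of "word", the sketch below treats a double-quoted phrase as a single token and otherwise falls back to runs of word characters. The tokenize helper and its quoting rule are assumptions made for this example, not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: yields double-quoted phrases as single tokens
# and runs of word characters as plain words; everything else is skipped.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /"([^"]*)"|(\w+)/g) {
        push @tokens, defined $1 ? $1 : $2;
    }
    return @tokens;
}

# Same filter as above: keep the tokens that contain an i.
my @matches = grep { /i/ } tokenize('say "hi there" to this old dog');
print "$_\n" for @matches;
```

Because the word definition lives in one place, the filter applied afterwards can stay as simple as /i/.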
In the absence of more information, I can only suggest the pos function.
When doing a global regex search, the engine saves the position where the previous match ended so that it knows where to start searching for the next iteration. The pos function gives access to that value and allows it to be set explicitly, so that a subsequent m//g will start looking at the specified position instead of at the start of the string.
This program gives an example. The string is searched for the first non-space character at or after each of a list of offsets, and the character found, if any, is displayed.
Note that the global match must be done in scalar context, which is applied by if here, so that only the next match will be reported. Otherwise the global search will just run on to the end of the string and leave information about only the very last match.
use strict;
use warnings 'all';
use feature 'say';
my $str = 'a b c  d  e  f  g  h  i  j  k  l  m  n';
#          0123456789012345678901234567890123456789
#                    1         2         3
for ( 4, 31, 16, 22 ) {
pos($str) = $_;
say $1 if $str =~ /(\S)/g;
}
output
c
l
g
i
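Applied to the word-hopping case from the question, the same idea could be sketched as follows. The \G anchor ties the match to the position set with pos, and /c keeps that position if the match fails. next_word_at is a hypothetical helper, and "word" here simply means a run of non-whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: starting at $index, skip whitespace and return the
# next run of non-whitespace plus the index just past it.
# Takes a reference so the large string is not copied on each call.
sub next_word_at {
    my ($text_ref, $index) = @_;
    pos($$text_ref) = $index;
    if ($$text_ref =~ /\G\s*(\S+)/gc) {
        return ($1, pos($$text_ref));
    }
    return;   # nothing but whitespace is left
}

my $text  = "alpha  beta\n\tgamma";
my $index = 0;
while (my ($word, $next) = next_word_at(\$text, $index)) {
    print "$word\n";
    $index = $next;
}
```

No substr copy is made; each call resumes the search directly at the given index.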

matching two strings which differ in elements and spaces in perl

I want to match two strings which differ only in tag elements, spaces, and newlines
$string1 = "perl is <match>scripting language</match>";
$string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage";
Note: spaces, <TAG>, and newlines can appear anywhere in string2. Spaces may or may not be present in string2; for example, in $string2 above the space between the words "scripting" and "language" is missing. We have to ignore spaces, tags, and newlines while matching string1 against string2. The <match> tag in string1 marks the data to be matched against string2.
output required :
the whole content of string2, with the <match> tag added.
perl<TAG> is<TAG> <match>scr<TAG>ipt<TAG>inglanguage</match>
Code i tried :
while($string =~ /<match>(.*?)<\/match>/gs)
{
my $data_to_match = $1;
$data_to_match = add_pat($data_to_match);
$string2 =~ s{($data_to_match)}
{
"<match>$&<\/match>"
}esi;
}
sub add_pat
{
my ($data) = (@_);
my @array = split//,$data;
foreach my $each(@array)
{
$each = quotemeta $each;
$each = '(?:(<TAG>|\s)+)?'.$each.'(?:(<TAG>|\s)+)?';
}
$data = join '',@array;
return $data;
}
Problem: since the space is missing in string2, it does not match. I tried making the space optional while appending the pattern to each character, but with the space optional the pattern match goes on running without finishing.
In reality, I have a large string to match, and these spaces are causing the problem. Please suggest a fix.
Use regular expressions to remove all the characters that you wish to ignore from both of the strings. Then compare the remaining values of the two strings.
So you will end up with both strings reduced to the same form, for example:
'perlisscriptinglanguage' and 'perlisscriptinglanguage'
If you want you can also upper/lower case them to match too.
If they match then just return the original string 2.
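A minimal sketch of this normalize-then-compare idea, assuming the ignorable pieces are exactly whitespace, <TAG>, and the <match> markers from the question (normalize is a hypothetical helper name):

```perl
use strict;
use warnings;

# Hypothetical helper: strip everything the comparison should ignore
# and fold case so the comparison is case-insensitive.
sub normalize {
    my ($s) = @_;
    $s =~ s{</?match>|<TAG>|\s+}{}g;
    return lc $s;
}

my $string1 = "perl is <match>scripting language</match>";
my $string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage";

print normalize($string1) eq normalize($string2) ? "same\n" : "different\n";
```

This only answers whether the two strings agree; it does not say where in string2 the <match> tags belong.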
I think it's weird that you are expected to "match", but $string2, if you take out the tags, doesn't match the original string.
Anyway, since your code is tolerant of additional spaces and tags in $string2, you can wipe all spaces (and tags if applicable) from $string1.
I added $data_to_match =~ s/ +//; before your call to add_pat. That didn't quite work because this line "$each = '(?:(<TAG>|\s)+)?'.$each.'(?:(<TAG>|\s)+)?';" adds the (?:(<TAG>|\s)+)? even before the first letter of the match from $string1. You actually have a lot of redundant TAG patterns: you add one to the front and back of each letter. I don't know what quotemeta does so I'm not sure how to fix the code there. I just added a
$data_to_match =~ s/\Q(?:(<TAG>|\s)+)?\E//; line after the call to add_pat to strip off the first TAG pattern from the front of the pattern. Otherwise it'll match wrong and output this: 'perl<TAG> is<match><TAG> scr<TAG>ipt<TAG>inglanguage</match>'
Really you should only be putting one "(?:(<TAG>|\s)+)?" in between each letter of the $string1 match, and more importantly, you should not be putting "(?:(<TAG>|\s)+)?" before the first letter or after the last letter.
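That last point can be sketched by joining the escaped letters with a single gap pattern, so nothing optional sits before the first or after the last letter. mark_match is a hypothetical helper, and the gap pattern assumes only <TAG> and whitespace need to be ignored. quotemeta simply backslash-escapes regex metacharacters so each letter is matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap in <match>...</match> the part of $target that
# equals $data once <TAG> and whitespace are ignored on both sides.
sub mark_match {
    my ($data, $target) = @_;
    $data =~ s/\s+//g;               # spaces in the needle become optional
    my $gap     = '(?:<TAG>|\s)*';   # what may sit between two letters
    my $pattern = join $gap, map { quotemeta($_) } split //, $data;
    $target =~ s{($pattern)}{<match>$1</match>}i;
    return $target;
}

my $string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage";
print mark_match("scripting language", $string2), "\n";
```

With one gap between letters instead of two adjacent optional groups, the regex engine has far fewer ways to divide up the same run of tags and spaces, which is a likely cause of the long running time described in the question.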

Sensethising domains

So I'm trying to put all numbered domains into one element of a hash by doing this:
### Domanis ###
my $dom = $name;
$dom =~ /(\w+\.\w+)$/; #this regex get the domain names only
my $temp = $1;
if ($temp =~ /(^d+\.\d+)/) { # this regex will take out the domains with number
my $foo = $1;
$foo = "OTHER";
$domain{$foo}++;
}
else {
$domain{$temp}++;
}
where $name will be something like:
something.something.72.154
something.something.72.155
something.something.72.173
something.something.72.175
something.something.73.194
something.something.73.205
something.something.73.214
something.something.abbnebraska.com
something.something.cableone.net
something.something.com.br
something.something.cox.net
something.something.googlebot.com
My code currently prints this:
72.175
73.194
73.205
73.214
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
lstn.net
but I want it to print like this:
abbnebraska.com
cableone.net
com.br
cox.net
googlebot.com
OTHER
lstn.net
where OTHER is all the numbered domains, so any ideas how?
You really shouldn't need to split the variable into two, e.g. this regex will match the case you want to trap:
/\d{1,3}\.\d{1,3}$/ -- returns true if the string ends with two numbers of 1-3 digits each, separated by a dot
but I mean, if you only need to separate out the domains that are not numbered, you could just check whether the last character of the domain is a letter, because ordinary TLDs such as com or net contain no digits, so you would do something like
/[a-zA-Z]$/ -- if it returns true, it is not a numbered domain (provided you've stripped spaces and newlines)
But I suppose it is better to be more specific in the regex, which also better illustrates the logic you are looking for in your script, so I'd use the former regex.
And actually you could do something like this:
if (my ($domain) = $name =~ /\.(\w+\.[a-zA-Z]+)$/)
{
#the domain is assigned to the variable $domain
} else {
#it is a number domain
}
Take what it currently puts, and use the regex:
/\d+\.\d+/
If it matches this, then it's a pair of numbers, so remove it.
This way you'll be able to keep any words with numbers in them.
Please, please indent your code correctly, and use whitespace to separate out various bits and pieces. It'll make your code so much easier to read.
Interestingly, you mentioned that you're getting the wrong output, but the section of the code you post has no print, printf, or say statement. It looks like you're attempting to count up the various domain names.
If these are the value of $name, there are several issues here:
if ($temp =~ /(^d+\.\d+)/) {
Matches nothing. This is saying that your string starts with one or more letter d followed by a period followed by one or more digits. The ^ anchors your regular expression to the beginning of the string.
I think, but not 100% sure, you want this:
if ( $temp =~ /\d\.\d/ ) {
This will find all cases where there are two digits with a period in between them. This is a sub-pattern of /\d+\.\d+/, so both regular expressions will match the same strings.
The
$dom =~ /(\w+\.\w+)$/;
Is anchored to the end of $dom: it matches one or more word characters (letters, digits, or underscores), a period, and then one or more word characters. Is that what you want?
I also believe this may indicate an error of some sort:
my $foo = $1;
$foo = "OTHER";
$domain{$foo} ++;
This sets $foo to whatever $temp matched, but then immediately resets $foo to OTHER and increments $domain{OTHER}.
We need a sample of your initial data, and maybe the actual routine that prints your output.
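In the meantime, here is a sketch of the counting logic as I understand it, using a few made-up names shaped like the ones in the question. domain_key is a hypothetical helper, and it assumes a "numbered domain" is one whose last two labels are both purely numeric.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last two dot-separated labels of $name,
# or "OTHER" when both of those labels are purely numeric.
sub domain_key {
    my ($name) = @_;
    my ($tail) = $name =~ /(\w+\.\w+)$/
        or return;                       # no two-label tail at all
    return $tail =~ /^\d+\.\d+$/ ? "OTHER" : $tail;
}

my %domain;
for my $name (qw(a.b.72.154 a.b.73.194 a.b.cox.net a.b.com.br a.b.cox.net)) {
    my $key = domain_key($name);
    $domain{$key}++ if defined $key;
}
printf "%-10s %d\n", $_, $domain{$_} for sort keys %domain;
```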

Save Matched Perl Regex as Variable

I have a simple Perl regex that I need to save as a variable.
If I print it:
print($html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/g);
It prints what I want to save, but when trying to save it with:
$link = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/g;
I get back a '1' as the value of $link. I assume this is because it found '1' match. But how do I save the content of the match instead?
Note the /g to get all matches. Those can't possibly be put into a scalar. You need an array.
my #links = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/g;
If you just want the first match:
my ($link) = $html_data =~ m/<iframe id="pdfDocument" src=.(.*)pdf/;
Note the parens (and the lack of now-useless /g). You need them to call m// in list context.
The matched subexpressions of a pattern are saved in variables $1, $2, etc. You can also get the entire matched pattern ($&), but on Perl versions before 5.20 this slows down every regex match in the program and should be avoided.
The distinction in behavior here, by the way, is the result of scalar vs. list context; you should get to know them, how they differ, and how they affect the behavior of various Perl expressions.
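The difference can be seen side by side in the sketch below. The HTML fragment is made up for illustration, and match_scalar and match_list are hypothetical helper names.

```perl
use strict;
use warnings;

my $LINK_RE = qr/<iframe id="pdfDocument" src=.(.*)pdf/;

# Scalar context: the match reports success or failure only.
sub match_scalar {
    my ($html) = @_;
    my $result = $html =~ $LINK_RE;
    return $result;
}

# List context (note the parens): the match returns the captured text.
sub match_list {
    my ($html) = @_;
    my ($result) = $html =~ $LINK_RE;
    return $result;
}

# Made-up sample input.
my $html_data = '<iframe id="pdfDocument" src="/docs/report.pdf"></iframe>';
print "scalar: ", match_scalar($html_data), "\n";
print "list:   ", match_list($html_data), "\n";
```

Note that the capture stops just before the literal pdf, so the saved text ends with the dot.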
From 'perlfunc' docs:
print LIST
Prints a string or a list of strings.
So, in print m//, print asks m// for a list (list context, as wantarray would report).
(It appears that m// without capture groups returns 1 or 0, for match pass or fail, whereas m//g returns a list of matches.)
and
$link = m// can only be scalar (as opposed to list) context.
So, m// returns the match result, 1 (true) or 0 (false).
I just wrote code like this. It may help. It's basically like yours except mine has a few more parentheses.
my $path = `ls -l -d ~/`;
#print "\n path is $path";
($user) = ($path=~/\.*\s+(\w+)\susers/);
So yours, going by this example, may be something like this if you're trying to store the whole thing? I'm not sure, but you can use mine as an example. I am storing whatever is in (\w+):
($link) = ($html_data =~ (m/<iframe id="pdfDocument" src=.(.*)pdf/g));