Extract the id from a long file path in Perl (regex)

I am trying to extract an id (here, e.g., 11894373690) from a file path that I read into my Perl script:
/my/local/projects/Samplename/analysis/test/output/tool1/11894373690_cast/A1/A1a/
and I will then use it to create a new path like
/my/local/projects/Samplename/analysis/test/output/tool2/11894373690_NEW/
I am not able to extract just the id from the path. Can anyone please suggest an easy method in Perl? I should definitely start learning regular expressions!
Thanks.
I am able to get only the last directory name
$file = "/my/local/projects/Samplename/analysis/test/output/tool1/11894373690_cast/A1/A1a/ ";
my ($id) = $file =~ /\.(A1[^]+)/i;
Update - Sorry all I misspelled "not" as "now" earlier! I am not able to extract the id. Thanks!

A simple regex or split are fine, but there are multiple core packages for working with paths.
This uses File::Spec to split the path and to later join the new one. Note that there is no escaping or such, no / counting -- in fact no need to even mention the separator.
use warnings 'all';
use strict;
use File::Spec::Functions qw(splitdir catdir);
my $path_orig = '...';
my @path = splitdir $path_orig;
my ($mark, $dir);
foreach my $i (0..$#path)
{
if ($path[$i] =~ m/(\d+)_cast/)
{
$dir = $1;
$mark = $i;
last;
}
}
my $path_new = catdir @path[0..$mark-1], $dir . '_NEW';
You can manipulate the @path array in other ways, of course -- peel components off of the back of it (pop @path while $path[-1] !~ /.../), or iterate and copy into a new array, etc.
The code above is simple and doesn't need extra data copy nor multiple regex matches.
Apparently the old and new path have another difference (tool1 vs tool2), please adjust. The main point is that once the path is split it is simple to go through the array.
As for a simple regex to fetch the id
my ($id) = $path =~ m{/(\d+)_cast/};
If \d+_cast is certain to be un-ambiguous (only one dir with that in its name) drop the / above.
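As a minimal sketch of the regex route, assuming the id is the run of digits directly before _cast and that the new directory lives under tool2 (both taken from the example paths in the question; the variable names $base and $path_new are only illustrative):

```perl
use strict;
use warnings;

my $path = '/my/local/projects/Samplename/analysis/test/output/tool1/11894373690_cast/A1/A1a/';

# Capture everything before "tool1/" as the base, and the digits before "_cast" as the id.
# Assumption: the path contains exactly one "tool1/<digits>_cast/" component.
my ($base, $id) = $path =~ m{^(.*/)tool1/(\d+)_cast/}
    or die "No id found in $path\n";

# Rebuild the path under tool2 with the _NEW suffix from the question.
my $path_new = "${base}tool2/${id}_NEW/";

print "$id\n";
print "$path_new\n";
```

The match is done in list context, so a failed match yields an empty list and the die fires instead of silently leaving $id undefined.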

What do you need to be fixed, and what will be dynamic? For this path, supposing that the postfix will always be _cast, you can use the expression:
(\d+)_cast
The ID will then be in the first capture group.

I did find a way to get the id - it may not be efficient but works for now
I did
my $dir_path = "/my/local/projects/Samplename/analysis/test/output/tool1/11894373690_cast/A1/A1a/ ";
my @keys = split(/[\/_]+/, $dir_path);
print "Key is $keys[9]\n";
it prints out 11894373690
Thanks all for the suggestions!

Related

Perl regex store matches in array

I have a file with strings in each row as follows
"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1
the next line could look like
84545,X,2
I'm trying to parse this text in Perl. Note: quotes are present when there are several items in a row, but not when there is only one item.
I would like to parse each item into an array. I tried the following regex
@fields = ($_ =~ /(\d+\_\d+),*/g);
but it is missing the last 2714. How do I capture that edge case? Any help appreciated. Thanks in advance
It looks like you have a CSV File, so use an actual CSV parser for it like Text::CSV.
After you parse the columns, you can separate your first field into the array:
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1 } ) # should set binary attribute.
or die "Cannot use CSV: ".Text::CSV->error_diag ();
my $line = qq{"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1 the next line could look like 84545,X,2};
if ($csv->parse($line)) {
my @columns = $csv->fields();
my @nums = split ',', $columns[0];
print "@nums\n";
}
Outputs:
229269_2 190594_2 94552_2 266076_2 269628_2 165328_2 99319_2 263339_2 263300_2 99315_2 271509_2 2714
Why not a regex?
Yes, of course it's possible to use a regex for practically anything. But what you need to understand is that this will make your code extremely fragile and difficult to maintain.
Even if you want to use a regular expression, you should STILL do this in two steps. First separate the initial column(s) of your CSV, and then process the specific column that you're worried about.
Because you're just working with the first column, you could use code like the following:
use strict;
use warnings;
my $line = qq{"229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1 the next line could look like 84545,X,2};
if ($line =~ /^"(.*?)"|^([^,]*)/) {
my $column0 = $1 // $2;
my @nums = split ',', $column0;
print "@nums\n";
}
The above happens to accomplish the same thing as the previous code. However, it has one big flaw: it's not nearly as obvious to the maintaining programmer what's going on.
Whenever a new coder, or even yourself in 6 months, views the first set of code, it is extremely obvious what format your data is in. You're working with a CSV file, and the first column is a list separated by commas. The second code also works, but the new maintainer must actually read the regex and figure out what's going on to understand both what format the data is in, and whether the code is actually doing it correctly.
Anyway, do whatever you will, but I strongly advise you to use an actual CSV Parser for parsing csv files.
If all you want is all but the last two fields...
my $string = qq("229269_2,190594_2,94552_2,266076_2,269628_2,165328_2,99319_2,263339_2,263300_2,99315_2,271509_2,2714",A,1);
$string =~ s/"//g; # delete the quotes
my @f = split (/,/, $string); # split on the comma
pop @f; pop @f; # jettison the last two columns
# @f contains what you're looking for

Efficiently matching a set of filenames with regex in Perl

I'm using Perl to capture the names of files in some specified folders that have certain words in them. The keywords in those filenames are "offers" or "cleared" and "regup" or "regdn". In other words, one of "offers" or "cleared" AND one of "regup" or "regdn" must appear in the filename to be a positive match. The two words could be in any order and there are characters/words that will appear in front of and behind them. A sample matching filename is:
2day_Agg_AS_Offers_REGDN-09-JUN-11.csv
I have a regex that successfully captures each of the matching filenames as a full path, which is what I wanted, but it seems inelegant and inefficient. Attempts at slightly better code have all failed.
Working approach:
# Get the folder names
my @folders = grep /^\d{2}-/, readdir DIR;
foreach my $folder ( @folders ) {
# glob the contents of the folder (to get the file names)
my @contents = <$folder/*>;
# For each filename in the list, if it matches, print it
foreach my $item ( @contents ) {
if ($item =~ /^$folder(?=.*(offers|cleared))(?=.*(regup|regdn)).*csv$/i){
print "$item\n";
}
}
}
Attempt at something shorter/cleaner:
foreach my $folder ( @folders ) {
# glob the contents of the folder (to get the file names)
my @contents = <$folder/*>;
# Seems to determine that there are four matches in each folder
# but then prints the first matching filename four times
my $single = join("\n", @contents);
for ($single =~ /^$folder(?=.*(offers|cleared))(?=.*(regup|regdn)).*csv$/im) {
print "$&\n";#"Matched: |$`<$&>$'|\n\n";
}
}
I've tried other formatting with the regex, using other options (/img, /ig, etc.), and sending the output of the regex to an array, but nothing has worked properly. I'm not great with Perl, so I'm positive I'm missing some big opportunities to make this whole procedure more efficient. Thanks!
Collect only these file names which contain offers or cleared AND regup or regdn
my @contents = grep { /offers|cleared/i && /regup|regdn/i } <$folder/*>;
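To see the grep filter above in isolation, here is a small sketch that runs it over a hard-coded list instead of a glob. The file names other than the sample from the question are made up for illustration, and the .csv check is added as a third condition:

```perl
use strict;
use warnings;

# Hypothetical file names standing in for the glob results.
my @contents = (
    '09-JUN/2day_Agg_AS_Offers_REGDN-09-JUN-11.csv',
    '09-JUN/2day_Agg_AS_Cleared_REGUP-09-JUN-11.csv',
    '09-JUN/2day_Agg_AS_Offers_RRS-09-JUN-11.csv',      # no regup/regdn
    '09-JUN/2day_Agg_AS_Cleared_REGUP-09-JUN-11.txt',   # not a csv
);

# Each condition is a separate, simple match against $_;
# the order of the keywords in the name does not matter.
my @matches = grep {
       /offers|cleared/i
    && /regup|regdn/i
    && /\.csv$/i
} @contents;

print "$_\n" for @matches;
```

Only the first two names pass all three tests. Keeping the conditions as separate matches avoids the lookaheads and reads closer to the requirement as stated.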
Why would it be shorter or cleaner to use join instead of a loop? I'd say it makes it more complicated. What you seem to be doing is just matching loosely based on the conditions
name contains offers or cleared
name contains regup or regdn
name ends with .csv.
So why not just do this:
if ( $file =~ /offers|cleared/i and
$file =~ /regup|regdn/i and
$file =~ /csv$/i)
You might be interested in something like this:
use strict;
use warnings;
use File::Find;
my $dir = "/some/dir";
my @files;
find(sub { /offers|cleared/i &&
/regup|regdn/i &&
/csv$/i && push @files, $File::Find::name }, $dir);
Which would completely exclude the use of readdir and other loops. File::Find is recursive.

Perl regexp how to get the file name out?

I have this directory path:
\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1
How can I get the file name testQEM.txt from the above string?
I use this:
$file =~ /(.+\\)(.+\..+)(\\.+)/;
But get this result:
file = testQEM.txt\main\ABC_QEM
Thanks,
Jirong
I'm not sure I understand, as paths cannot have a file node half way through them! Have multiple paths got concatenated somehow?
Anyway, I suggest you work though the path looking for the first node that validates as a real file using -f
Here is an example
use strict;
use warnings;
my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
my @path = split /\\/, $path;
my $file = shift @path;
$file .= '\\'.shift @path until -f $file or @path == 0;
print "$file\n";
/[^\\]+\.[^\\]+/
Capture anything separated by a . between two backslashes. Is this what you were looking for?
This is a bit difficult, as directory names can contain periods. This is especially true for *nix systems, but is valid under Windows as well.
Therefore, each possible subpath has to be tested iteratively for file-ness.
I'd maybe try something like this:
my $file;
my $weirdPath = q(/main/ABC_PRD/ABC_QEM/1/testQEM.txt/main/ABC_QEM/1);
my @parts = split m{/}, $weirdPath;
for my $i (0 .. $#parts) {
my $path = join "/", @parts[0 .. $i];
if (-f $path) { # optionally "not -d $path"
$file = $parts[$i];
last;
}
}
print "file=$file\n"; # "file=testQEM.txt\n"
I split the weird path at all slashes (change to backslashes if interoperability is not an issue for you). Then I join the first $i+1 elements together and test if the path is a normal file. If so, I store the last part of the path and exit the loop.
If you can guarantee that the file is the only part of the path that contains periods, then using one of the other solutions will be preferable.
my $file = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
my ($result) = $file =~ /\\([^\\]+\.[^\\]+)\\/;
Parentheses around $result force the list context on the right hand side expression, which in turn returns what matches in parentheses.
Use regex pattern /(?=[^\\]+\.)([^\\]+)/
my $path = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';
print $1 if $path =~ /(?=[^\\]+\.)([^\\]+)/;
>echo "\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1"|perl -pi -e "s/.*([\\][a-zA-Z]*\.txt).*/\1/"
\testQEM.txt
I suggest you study how regexp backtracking works, in particular how greedy quantifiers such as * and + behave.
You only need to make a small change to your regexp, making the last quantifier in the second group non-greedy:
/(.+\\)(.+\..+?)(\\.+)/
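A short sketch comparing the two patterns on the path from the question may make the difference concrete (the variable names $greedy and $lazy are only illustrative):

```perl
use strict;
use warnings;

my $file = '\main\ABC_PRD\ABC_QEM\1\testQEM.txt\main\ABC_QEM\1';

# Original pattern: the last .+ in group 2 is greedy, so it runs on
# as far as it can while group 3 can still match a backslash plus text.
my (undef, $greedy) = $file =~ /(.+\\)(.+\..+)(\\.+)/;

# Changed pattern: .+? stops at the first backslash after the extension.
my (undef, $lazy) = $file =~ /(.+\\)(.+\..+?)(\\.+)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy version reproduces the result from the question (testQEM.txt\main\ABC_QEM), while the non-greedy version yields testQEM.txt. This still assumes that the file name is the only path component containing a period.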

How do I get the host name from a URL in Perl using regex?

so what I want to do is remove everything after and including the first "/" to appear after a "."
so: http://linux.pacific.net.au/primary.xml.gz
would become: http://linux.pacific.net.au
How do I do this using regex? The system I'm running on can't use URI tool.
$url = 'http://linux.pacific.net.au/primary.xml.gz';
($domain) = $url =~ m!(https?://[^:/]+)!;
print $domain;
output:
http://linux.pacific.net.au
And this is the official regular expression (it follows the one given in RFC 3986, Appendix B) that can be used to decompose a URI:
my($scheme, $authority, $path, $query, $fragment) =
$uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;
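As a rough sketch of how that general-purpose pattern applies to the URL from the question, the scheme and authority can be captured and glued back together. The variable $scheme_host is an illustrative name, and the code assumes the URL actually has both parts:

```perl
use strict;
use warnings;

my $uri = 'http://linux.pacific.net.au/primary.xml.gz';

# Five captures: scheme, authority (host), path, query, fragment.
# Query and fragment are undef here because the URL has neither.
my ($scheme, $authority, $path, $query, $fragment) =
    $uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;

# Assumption: scheme and authority are both present in the input.
my $scheme_host = "${scheme}://${authority}";

print "$scheme_host\n";
print "$path\n";
```

For this input the first print gives http://linux.pacific.net.au and the second gives /primary.xml.gz.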
I suggest you use URI::Split which will separate a standard URL into its constituent parts for you and rejoin them. You want the first two parts - the scheme and the host.
use strict;
use warnings;
use URI::Split qw/ uri_split uri_join /;
my $scheme_host = do {
my (@parts) = uri_split 'http://linux.pacific.net.au/primary.xml.gz';
uri_join @parts[0,1];
};
print $scheme_host;
output
http://linux.pacific.net.au
Update
If your comment "The system I'm running on can't use URI tool" means you can't install modules, then here is a regular expression solution.
You say you want to remove everything after and including the first "/" to appear after a ".", so /^.*?\./ finds the first dot, and m|[^/]+| finds everything after it up to the next slash.
The output is identical to that of the preceding code
use strict;
use warnings;
my $url = 'http://linux.pacific.net.au/primary.xml.gz';
my ($scheme_host) = $url =~ m|^( .*?\. [^/]+ )|x;
print $scheme_host;
The system I'm running on can't use URI tool.
I really recommend doing whatever you can to fix that problem first. If you're not able to use CPAN modules then you'll be missing out on a lot of the power of Perl and your Perl programming life will be far more frustrating than it needs to be.

How could I get this regex statement to capture just a select piece

I am trying to get the regex in this loop,
my $vmsn_file = $snapshots{$snapshot_num}{"filename"};
my @current_vmsn_files = $ssh->capture("find -name $vmsn_file");
foreach my $vmsn (@current_vmsn_files) {
$vmsn =~ /(.+\.vmsn)/xm;
print "$1\n";
}
to capture the filename from this line,
./vmfs/volumes/4cbcad5b-b51efa39-c3d8-001517585013/MX01/MX01-Snapshot9.vmsn
The only part I want is the actual filename, not the path.
I tried using an expression that was anchored to the end of the line using $ but that did not seem to make any difference. I also tried using 2 .+ inputs, one before the capture group and the one inside the capture group. Once again no luck, also that felt kinda messy to me so I don't want to do that unless I must.
Any idea how I can get at just the file name after the last / to the end of the line?
More can be added as needed, I am not sure what I needed to post to give enough information.
--Update--
With 5 minutes of tinkering I seemed to have figured it out. (what a surprise)
So now I am left with this, (and it works)
my $vmsn_file = $snapshots{$snapshot_num}{"filename"};
my @current_vmsn_files = $ssh->capture("find -name $vmsn_file");
foreach my $vmsn (@current_vmsn_files) {
$vmsn =~ /.+\/(\w+\-Snapshot\d+\.vmsn)/xm;
print "$1\n";
}
Is there any way to make this better?
Probably the best way is using the core module File::Basename. That will make your code most portable.
If you really want to do it with a regex and you are based on Unix, then you could use:
$vmsn =~ m%.*/([^/]+)$%;
$file = $1;
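A minimal sketch of the File::Basename route mentioned above, using the sample line from the question. The trailing newline is an assumption about how lines captured from a remote find typically arrive:

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Sample line as it might come back from the remote find (assumed to end in a newline).
my $vmsn = "./vmfs/volumes/4cbcad5b-b51efa39-c3d8-001517585013/MX01/MX01-Snapshot9.vmsn\n";

chomp $vmsn;                  # drop the newline so it does not end up in the name
my $file = basename($vmsn);   # everything after the last path separator

print "$file\n";
```

This prints MX01-Snapshot9.vmsn without hard-coding the "-Snapshot" naming scheme into a pattern.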
Well, if you are going to use the find command from the shell, and considering you stated that you only want the file name, why not
... $ssh->capture("find -name $vmsn_file -printf \"%f\n\" ");
If not, the simplest way is to split() your string on "/" and then take the last element. There is no need for regular expressions that are too long or complicated.
See perldoc -f split for more information on usage