How to extract a substring in perl - regex

I am new in perl and need your help.
I am reading the contents of files in a directory.
I need to extract the substring from the files containing *.dat
Sample strings:
1) # ** Template Name: IFDOS_ARCHIVE.dat
2) # ** profile for IFNEW_UNIX_CMD.dat template **
3) # ** Template IFWIN_MV.dat **
Need to Extract:
1) IFDOS_ARCHIVE.dat
2) IFNEW_UNIX_CMD.dat
3) IFWIN_MV.dat
My code:
if(open(my $jobprofile, "./profiles/$vitem[0].profile")) {
my #jobprofiletemp = <$jobprofile>;
close($jobprofile);
#proftemplates = grep /.dat/,#jobprofiletemp;
my $strproftemp = $proftemplates[0];
my ($tempksh) = $strproftemp =~ / ([^_]*)./;
print "tempksh: $tempksh","\n";
} else { warn "problem opening ./$_\n\n"; }
my regex is not working.
what do you suggest?

I think you'll be better with something like:
while (<$jobprofile>) {
if ( /(\S+)\.dat/ ) {
print "$1\n";
}
}
(the while is there to make sure you parse every single line)
The regular expressions looks for a sequence of non-white-space characters (\S) followed by .dat.
The parenthesis surrounding \S+ capture the match of that part into the special variable $1.

Try this
open my $fh,"<","file.txt";
while (<$fh>)
{
next if /^\s+/; #skip the loop for empty line
($match) = /\s(\w+\.dat)/g; # group the matching word and store the input into the $match
print "$match\n";
}
or simply try the perl one liner
perl -ne' print $1,"\n" if (m/(\w+\.dat)/) ' file.txt
Or you are working in linux try the linux command for to do it
grep -ohP '\w+\.dat' one.txt
o display the matched element only
h for display the output without filename
P for perl regex

Related

How to verify if a variable value contains a character and ends with a number using Perl

I am trying to check if a variable contains a character "C" and ends with a number, in minor version. I have :
my $str1 = "1.0.99.10C9";
my $str2 = "1.0.99.10C10";
my $str3 = "1.0.999.101C9";
my $str4 = "1.0.995.511";
my $str5 = "1.0.995.AC";
I would like to put a regex to print some message if the variable has C in 4th place and ends with number. so, for str1,str2,str3 -> it should print "matches". I am trying below regexes, but none of them working, can you help correcting it.
my $str1 = "1.0.99.10C9";
if ( $str1 =~ /\D+\d+$/ ) {
print "Candy match1\n";
}
if ( $str1 =~ /\D+C\d+$/ ) {
print "Candy match2\n";
}
if ($str1 =~ /\D+"C"+\d+$/) {
print "candy match3";
}
if ($str1 =~ /\D+[Cc]+\d+$/) {
print "candy match4";
}
if ($str1 =~ /\D+\\C\d+$/) {
print "candy match5";
}
if ($str1 =~ /C[^.]*\d$/)
C matches the letter C.
[^.]* matches any number of characters that aren't .. This ensures that the match won't go across multiple fields of the version number, it will only match the last field.
\d matches a digit.
$ matches the end of the string. So the digit has to be at the end.
I found it really helpful to use https://www.regextester.com/109925 to test and analyse my regex strings.
Let me know if this regex works for you:
((.*\.){3}(.*C\d{1}))
Following your format, this regex assums 3 . with characters between, and then after the third . it checks if the rest of the string contains a C.
EDIT:
If you want to make sure the string ends in a digit, and don't want to use it to check longer strings containing the formula, use:
^((.*\.){3}(.*C\d{1}))$
Lets look what regex should look like:
start{digit}.{digit}.{2-3 digits}.{2-3 digits}C{1-2 digits}end
very very strict qr/^1\.0\.9{2,3}\.101?C\d+\z/ - must start with 1.0.99[9]?.
very strict qr/^1\.\0.\d{2,3}\.\d{2,3}C\d{1,2}\z/ - must start with 1.0.
strict qr/^\d\.\d\.\d{2,3}\.\d{2,3}C\d{1,2}\z/
relaxed qr/^\d\.\d\.\d+\.\d+C\d+\z/
very relaxed qr/\.\d+C\d+\z/
use strict;
use warnings;
use feature 'say';
my #data = qw/1.0.99.10C9 1.0.99.10C10 1.0.999.101C9 1.0.995.511 1.0.995.AC/;
#my $re = qr/^\d\.\d\.\d+\.\d+C\d+\z/;
my $re = qr/^\d\.\d\.\d{2,3}\.\d{2,3}C\d+\z/;
say '--- Input Data ---';
say for #data;
say '--- Matching -----';
for( #data ) {
say 'match ' . $_ if /$re/;
}
Output
--- Input Data ---
1.0.99.10C9
1.0.99.10C10
1.0.999.101C9
1.0.995.511
1.0.995.AC
--- Matching -----
match 1.0.99.10C9
match 1.0.99.10C10
match 1.0.999.101C9

How to send a regex through another regex match without any affection

I use below perl code to split a given string by comma or space in general. I also need to pass a regex content,but that regex getting wrong.
sub specificStrChecker
{
my $input_line = shift;
my #specificStrs = split(/[,\s]+/, $input_line);
print "----------------------------\n";
foreach (#specificStrs) # list of perl regex | normal str
{
print "$_\n";
}
print "----------------------------\n\n";
}
my $str_1 = "abc,pqr";
my $str_2 = "/ {2,}/, ghi"; # i need to print / {2,}/
specificStrChecker($str_1);
specificStrChecker($str_2); # i need to print / {2,}/
Output :
abc
pqr
/
{2
}/
ghi
In above second part, i need "/ {2,}/" but here it affects with the split regex. How do i avoid that.?
If your regexes don't contain any / characters, this is simply changing your split to instead match a series of /-delimited substrings or strings of non-comma, non-whitespace characters:
my #specificStrs = $input_line =~ m{(/[^/]*/|[^,\s]+)}g;

How can I properly stop and start metacharacter interpolation in regexp in Perl

Editing to be more concise, pardon.
I need to be able to grep from an array using a string that may contain one of the following characters: '.', '+', '/', '-'. The string will be captured via from the user. The array contains each line of the file I'm searching through (I'm chomping the file into the array to avoid keeping it open while the user is interfacing with the program because it is on a cron and I do not want to have it open when the cron runs), and each line has a unique identifier within it which is the basis for the search string used in the regexp. The code below shows the grep statement I am using, and I use OUR and MY in my programs to make the variables I want access to in all namespaces available, and the ones I use only in subroutines not. If you do want to try and replicate the issue
#!/usr/bin/perl -w
use strict;
use Switch;
use Data::Dumper;
our $pgm_path = "/tmp/";
our $device_info = "";
our #new_filetype1 = ();
our #new_filetype2 = ();
our #dev_info = ();
our #pgm_files = ();
our %arch_rtgs = ();
our $file = "/path/file.csv";
open my $fh, '<', $file or die "Couldn't open $file!\n";
chomp(our #source_file = <$fh>);
close $fh;
print "Please enter the device name:\n";
chomp(our $dev = <STDIN>);
while ($device_info eq "") {
# Grep the device info from the sms file
my #sms_device = grep(/\Q$dev\E/, #source_file);
if (scalar(#sms_device) > 1) {
my $which_dup = find_the_duplicate(\#sms_device);
if ($which_dup eq "program") {
print "\n-> $sms_dev <- must be a program name instead of a device name." .
"\nChoose the device from the list you are working on, specifically.\n";
foreach my $fix(#sms_device) {
my #fix_array = split(',', $fix);
print "$fix_array[1]\n";
undef #fix_array;
}
chomp($sms_dev = <STDIN>);
} else { $device_info = $which_dup; }
} elsif (scalar(#sms_device) == 1) {
($device_info) = #sms_device;
#sms_device = ();
}
}
When I try the code with an anchor:
my #sms_device = grep(/\Q$dev\E^/, #source_file);
No more activity from the program is noticed. It just sits there like it's waiting on some more input from the user. This is not what I expected to happen. The reason I would like to anchor the search pattern is because there are many, many examples of similarly named devices that have the same character order as the search pattern, but also include additional characters that are ignored in the regexp evaluation. I don't want them to be ignored, in the sense that they are included in matches. I want to force an exact match of the string in the variable.
Thanks in advance for wading through my terribly inexperienced code and communication attempts at detailing my problem.
The device id followed by the start of the string? /\Q$dev\E^/ makes no sense. You want the device id to be preceded by the start of the string and followed by the end of the string.
grep { /^\Q$dev\E\z/ }
Better yet, let's avoid spinning up the regex engine for nothing.
grep { $_ eq $dev }
For example,
$ perl -e'my $dev = "ccc"; CORE::say for grep { /^\Q$dev\E\z/ } qw( accc ccc ccce );'
ccc
$ perl -e'my $dev = "ccc"; CORE::say for grep { $_ eq $dev } qw( accc ccc ccce );'
ccc
I would use quotemeta. Here is an example of how it compares:
my $regexp = '\t';
my $metaxp = quotemeta ($regexp);
while (<DATA>) {
print "match \$regexp - $_" if /$regexp/;
print "match \$metaxp - $_" if /$metaxp/;
}
__DATA__
This \t is not a tab
This is a tab
(there is literally a tab in the second line)
The meta version will match line 1, as it turned "\t" into essentially "\t," and the non-meta (original) version will match line 2, which assumes you are looking for a tab.
match $metaxp - This \t is not a tab
match $regexp - This is a tab
Hopefully you get my meaning.
I think adding $regexp = quotemeta ($regexp) (or doing it when you capture the standard input) should meet your need.

Dynamic regular expression for Nesting brackets failed due to unknow bugs

rencently I have met a strange bug when use a dynamic regular expressions in perl for Nesting brackets' match. The origin string is " {...test{...}...} ", I want to grep the pair brace begain with test, "test{...}". actually there are probably many pairs of brace before and end this group , I don't really know the deepth of them.
Following is my match scripts: nesting_parser.pl
#! /usr/bin/env perl
use Getopt::Long;
use Data::Dumper;
my %args = #ARGV;
if(exists$args{'-help'}) {printhelp();}
unless ($args{'-file'}) {printhelp();}
unless ($args{'-regex'}) {printhelp();}
my $OpenParents;
my $counts;
my $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;$counts++; print "\nLeft:".$OpenParents." ;"})
| \} (?(?{$OpenParents ne 0; $counts++}) (?{$OpenParents--;print "Right: ".$OpenParents." ;"})) (?(?{$OpenParents eq 0}) (?!))
)*
)
}x;
my $string = `cat $args{'-file'}`;
my $partten = $args{'-regex'} ;
print "####################################################\n";
print "Grep [$partten\{...\}] from $args{'-file'}\n";
print "####################################################\n";
while ($string =~ /($partten$NestedGuts)/xmgs){
print $1."}\n";
print $2."####\n";
}
print "Regex has seen $counts brackts\n";
sub printhelp{
print "Usage:\n";
print "\t./nesting_parser.pl -file [file] -regex '[regex expression]'\n";
print "\t[file] : file path\n";
print "\t[regex] : regex string\n";
exit;
}
Actually my regex is:
our $OpenParents;
our $NestedGuts = qr {
(?{$OpenParents = 0})
(?>
(?:
[^{}]+
| \{ (?{$OpenParents++;})
| \} (?(?{$OpenParents ne 0}) (?{$OpenParents--})) (?(?{$OpenParents eq 0} (?!))
)*
)
}x;
I have add brace counts in nesting_parser.pl
I also write a string generator for debug: gen_nesting.pl
#! /usr/bin/env perl
use strict;
my $buffer = "{{{test{";
unless ($ARGV[0]) {print "Please specify the nest pair number!\n"; exit}
for (1..$ARGV[0]){
$buffer.= "\n\{\{\{\{$_\}\}\}\}";
#$buffer.= "\n\{\{\{\{\{\{\{\{\{$_\}\}\}\}\}\}\}\}\}";
}
$buffer .= "\n\}}}}";
open TEXT, ">log_$ARGV[0]";
print TEXT $buffer;
close TEXT;
You can generate a test file by
./gen_nesting.pl 1000
It will create a log file named log_1000, which include 1000 lines brace pairs
Now we test our match scripts:
./nesting_parser.pl -file log_1000 -regex "test" > debug_1000
debug_1000 looks like a great perfect result, matched successfully! But when I gen a 4000 lines test log file and match it again, it seem crashed:
./gen_nesting.pl 4000
./nesting_parser.pl -file log_4000 -regex "test" > debug_4000
The end of debug_4000 shows
{{{{3277}
####
Regex has seen 26213 brackts
I don't know what's wrong with the regex expresions, mostly it works well for paired brackets, untill recently I found it crashed when I try to match a text file more than 600,000 lines.
I'm really confused by this problems,
I really hope to solve this problem.
thank you all!
First for matching nested brackets I normally use Regexp::Common.
Next, I'm guessing that your problem is that Perl's regular expression engine breaks after matching 32767 groups. You can verify this by turning on warnings and looking for a message like Complex regular subexpression recursion limit (32766) exceeded.
If so, you can rewrite your code using /g and \G and pos. The idea being that you match the brackets in a loop like this untested code:
my $start = pos($string);
my $open_brackets = 0;
my $failed;
while (0 < $open_brackets or $start == pos($string)) {
if ($string =~ m/\G[^{}]*(\{|\})/g) {
if ($1 eq '{') {
$open_brackets++;
}
else {
$open_brackets--;
}
}
else {
$failed = 1;
break; # WE FAILED TO MATCH
}
}
if (not $failed and 0 == $open_brackets) {
my $matched = substr($string, $start, pos($string));
}

Regex to read the id

I have the log file with the following content:
(2947:_dRW00T3WEeSkhZ9pqkt5dQ) ---$ ABC XY "Share" 16-Sep-2014 03:22 PM
(2948:_3nFSwz3TEeSkhZ9pqkt5dQ) ---$ ABC XY "Share" 16-Sep-2014 03:05 PM
(2949:_voeYED3AEeSkhZ9pqkt5dQ) ---$ ABC XY "Initial for Re,oved" 16-Sep-2014 12:44 PM
I want to read the unique id say _dRW00T3WEeSkhZ9pqkt5dQ from each line and store it in a array.
My current code is:
while(<$fh>) {
if ($_ =~ /\((.*?)\)/) {
push #cs_ids , $1;
}
}
Try this:
while(<$fh>) {
if ($_ =~ /\(\d+:(.+?)\)/) {
push #cs_ids , $1;
}
}
The regexp checks all string which starts with ( then one or more digits a double point and than one or more characters ( Which will be stored in $1). THe end of the string is a ).
You were almost there:
perl -e '$string = "(2947:_dRW00T3WEeSkhZ9pqkt5dQ)"; if ($string =~ /^\((\d+:)(.*?)\)$/) { die $2; }'
_dRW00T3WEeSkhZ9pqkt5dQ at -e line 1.
Change your regular expression condition to:
/^\((\d+:)(.*?)\)$/
What that does is match and group the 4 digits and colon into special var $1 and the id you want into special var $2.
If every line of the log file is guaranteed to have an ID string, then you can write just
while (<$fh>) {
/:(\w+)/ and push #cs_ids , $1;
}
The \w ("word") character class matches alphanumeric characters or underscore, and this regex just snags the first sequence of word characters that follow a colon. It is best to avoid the non-greedy modifier if possible as it is a sloppy specification and can be much slower than a simple multiple character match.