Regex in perl and variable name - regex

I have some issues with a regular expression in Perl. I'm trying to add a string at the beginning of another string (in fact, to insert a string at the beginning of a file name). Before inserting it, I want to check whether the file name already begins with that string.
This is the code I have:
if ($ficheroSinExt !~ m/^$strCadena/){
# if it doesn't exist at the beginning, I insert it...
$ficheroSinExt = $strCadena . " " . $ficheroSinExt;
}
else{
print "---->It already exists!!!\n";
}
I'm testing it with two file names, only one of which begins with [Perl] ("[Perl] File1.pdf" and "File2.pdf"), and $strCadena contains [Perl]. I end up adding [Perl] to both files, so their new names are "[Perl] [Perl] File1.pdf" and "[Perl] File2.pdf".
I think the problem comes from the ^$strCadena in the match operator, but I can't find a workaround. Could you please give me a hand?
Thanks in advance,
Diego

Quote the special characters:
if ($ficheroSinExt !~ m/^\Q$strCadena/){
# here __^

You want to disable pattern metacharacters (see perlre). Without that, the brackets in [Perl] are read as a character class rather than as literal text:
if ($ficheroSinExt !~ m/^\Q$strCadena\E/){
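To make the effect of \Q...\E concrete, here is a minimal sketch using the two file names from the question. The helper name add_prefix_once is hypothetical and only serves the illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: prepend $prefix to $name unless $name already starts with it.
sub add_prefix_once {
    my ($name, $prefix) = @_;
    # \Q...\E escapes regex metacharacters, so "[Perl]" is matched literally
    # instead of being read as the character class [Perl].
    return $name if $name =~ m/^\Q$prefix\E/;
    return "$prefix $name";
}

print add_prefix_once('[Perl] File1.pdf', '[Perl]'), "\n";   # [Perl] File1.pdf
print add_prefix_once('File2.pdf',        '[Perl]'), "\n";   # [Perl] File2.pdf
```

Without \Q, the pattern ^[Perl] only asks for one of the letters P, e, r or l at the start of the name, which is why the check in the question never recognised the existing prefix.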

Tcl - How to Add Text after last character through regex?

I need a tip or suggestion, ideally with an example, on how to append a .txt extension after the last character of a variable's value.
For example:
set txt " ONLINE ENGLISH COURSE - LESSON 5 "
set result [concat "$txt" .txt]
Print:
Note that there are spaces at the start, in the middle and at the end of the string (txt). The spaces at the start and in the middle must be kept, but the last space, after the end of the sentence, should be replaced by the extension [.txt].
Tcl's built-in concat command does not achieve the desired effect.
The expected result was something like this:
ONLINE ENGLISH COURSE - LESSON 5.txt
I know I could remove spaces with string map, but I don't know how to remove just the last occurrence on the line.
Nor do I know another way to remove the last space so that I can add the text [.txt].
If anyone can point me to one or more solutions, thank you in advance.
set result "[string trimright $txt].txt"
or
set result [regsub {\s*$} $txt ".txt"]

Including regex on variable before matching string

I'm trying to find and extract occurrences of words, read from one text file, in another text file. So far I can only find a word when it is written correctly and not munged (a changed to # or i changed to 1). Is it possible to add a regex to my strings for matching, or something similar? This is my code so far:
sub getOccurrenceOfStringInFileCaseInsensitive
{
my $fileName = $_[0];
my $stringToCount = $_[1];
my $numberOfOccurrences = 0;
my @wordArray = wordsInFileToArray ($fileName);
foreach (@wordArray)
{
my $numberOfNewOccurrences = () = (m/$stringToCount/gi);
$numberOfOccurrences += $numberOfNewOccurrences;
}
return $numberOfOccurrences;
}
The routine receives the name of a file and the string to search. The routine wordsInFileToArray () just gets every word from the file and returns an array with them.
Ideally I would like to perform this search directly reading from the file in one go instead of moving everything to an array and iterating through it. But the main question is how to hard code something into the function that allows me to capture munged words.
Example: I would like to extract both lines from the file.
example.txt:
russ1#anh#ck3r
russianhacker
# this variable also will be read from a blacklist file
$searchString = "russianhacker";
getOccurrenceOfStringInFileCaseInsensitive ("example.txt", $searchString);
Thanks in advance for any responses.
Edit:
The possible substitutions will be defined by a user and the regex must be set to fit. A user could say that a common substitution is to change the letter "a" to "#" or even "1". The possible change is completely arbitrary.
When searching for a specific word ("russian" for example) this could be done with something like:
(m/russian/i); # would just match the word as it is
(m/russi[a#1]n/i); # would match the munged word
But I'm not sure how to do that if I have the string to match stored in a variable, such as:
$stringToSearch = "russian";
This is sort of a full-text search problem, so one method is to normalize the document strings before matching against them.
use strict;
use warnings;
use Data::Munge 'list2re';
...
my %norms = (
'#' => 'a',
'1' => 'i',
...
);
my $re = list2re keys %norms;
s/($re)/$norms{$1}/ge for @wordArray;
This approach only works if there's only a single possible "normalized form" for any given word, and may be less efficient anyway than just trying every possible variation of the search string if your document is large enough and you recompute this every time you search it.
As a note your regex m/$randomString/gi should be m/\Q$randomString/gi, as you don't want any regex metacharacters in $randomString to be interpreted that way. See docs for quotemeta.
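The normalization idea can also be sketched with core Perl only. The mapping below is a made-up example (the real one would come from the user), and normalize_word is an illustrative name, not part of any module:

```perl
use strict;
use warnings;

# Hypothetical substitution table: munged character => normal letter.
my %norms = ( '#' => 'a', '1' => 'i', '3' => 'e' );

# Character class of every munged character, escaped with quotemeta so
# that none of them is treated as a regex metacharacter.
my $class     = join '', map { quotemeta($_) } keys %norms;
my $munged_re = qr/([$class])/;

# Replace each munged character by its normal letter and lowercase the result.
sub normalize_word {
    my ($word) = @_;
    $word =~ s/$munged_re/$norms{$1}/g;
    return lc $word;
}

print normalize_word('russ1#nh#ck3r'), "\n";   # russianhacker
```

After normalizing, a plain (quoted) match against the blacklist word is enough. The caveat from the answer still applies: this only works when each munged character has a single normal form.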
There are parts of the problem which aren't specified precisely enough (yet).
Some roll-your-own approaches, which depend on the details, are:
If user-defined substitutions are global (replace every occurrence of a character in every string), the user can submit a mapping, as a hash say, and you can fix them all. The process will identify all candidates for the words (along with the actual, unmangled, words, if found). There may be false positives, so also plan on some post-processing.
If the user can supply a list of substitutions along with the words that they apply to (the mangled or the corresponding unmangled ones), then we can have a more targeted run.
Before this is clarified, here is another way: use a module for approximate ("fuzzy") matching.
The String::Approx module seems to fit quite a few of your requirements.
The match of the target with a given string relies on the notion of the Levenshtein edit distance: how many insertions, deletions, and replacements ("edits") it takes to make the given string into the sought target. The maximum accepted number of edits can be set.
A simple-minded example:
use warnings;
use strict;
use feature 'say';
use String::Approx qw(amatch);
my $target = qq(russianhacker);
my @text = qw(that h#cker was a russ1#anh#ck3r);
my @matches = amatch($target, ["25%"], @text);
say for @matches; #==> russ1#anh#ck3r
See the documentation for what the module offers, but at least two comments are in order.
First, note that the second argument to amatch specifies the percentage deviation from the target string that is acceptable. For this particular example we need to allow every fourth character to be "edited." So much room for tweaking can result in accidental matches, which then need to be filtered out, so there will be some post-processing to do.
Second -- we didn't catch the easier one, h#cker. The module takes a fixed "pattern" (target), not a regex, and can search for only one at a time. So, in principle, you need a pass for each target string. This can be improved a lot, but there'll be more work to do.
Please study the documentation; the module offers a whole lot more than this simple example.
I ended up solving the problem by putting the regex directly into the variable that I match against the lines of my file. It looks something like this:
sub getOccurrenceOfMungedStringInFile
{
my $fileName = $_[0];
my $mungedWordToCount = $_[1];
my $numberOfOccurrences = 0;
open (my $inputFile, "<", $fileName) or die "Can't open file: $!";
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
while (my $currentLine = <$inputFile>)
{
chomp ($currentLine);
$numberOfOccurrences += () = ($currentLine =~ m/$mungedWordToCount/gi);
}
close ($inputFile) or die "Can't close file: $!";
return $numberOfOccurrences;
}
Where the line:
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
Is just one of the substitutions that are needed and others can be added similarly.
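Adding the substitutions one by one can be generalized by building the pattern from a table. The following is only a sketch under assumptions: the table contents and the name munged_regex are hypothetical, and the test strings are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical table of user-defined substitutions: each letter maps to
# the characters that may replace it. The entries are illustrative only.
my %substitutions = (
    a => [ '#', '4' ],
    i => [ '1' ],
    e => [ '3' ],
);

# Turn a plain word into a compiled regex that also matches munged spellings.
sub munged_regex {
    my ($word, $subs) = @_;
    my $pattern = '';
    for my $char (split //, $word) {
        my $alternatives = $subs->{ lc $char } || [];
        if (@$alternatives) {
            # Character class holding the letter itself plus its substitutes.
            $pattern .= '[' . join('', map { quotemeta($_) } $char, @$alternatives) . ']';
        }
        else {
            # No substitutes known: match the character literally.
            $pattern .= quotemeta($char);
        }
    }
    return qr/$pattern/i;
}

my $re = munged_regex('russianhacker', \%substitutions);
for my $line ('russianhacker', 'russ1#nh#ck3r', 'unrelated') {
    print "$line\n" if $line =~ $re;    # prints the first two lines only
}
```

Because every character goes through quotemeta, a search word containing regex metacharacters is still matched literally; only the bracketed classes generated here act as regex syntax.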
I didn't know that Perl would just interpret the regex inside of the variable since I've tried that before and could only get the wanted results defining the variables inside the function using single quotes. I must've done something wrong the first time.
Thanks for the suggestions, people.

Match a file name that includes the path and a period within the name

I have a file path from which I need to take just the file name:
/var/www/foo/dog.tur-tles.chickens.txt
I want to match just the:
dog.tur-tles.chickens
I have tried this in regexer:
([^\/]*)$
This matches:
dog.tur-tles.chickens.txt
I can't figure out how to only exclude that last period.
You can assume it will always be a .txt, but I wanted to build in the ability that if a file was named dog-turtles.txt.txt it would see that the name is dog-turtles.txt.
You could use something like so: ([^\/]*)(\.).+?$.
Note, though, that this will fail for multi-part extensions such as .tar.gz.
You may use fileparse from File::Basename to get the file name, then use rindex to get the index of the last . and then get the required substring using substr:
use File::Basename;
$x = fileparse('/var/www/foo/dog.tur-tles.chickens.txt');
print substr($x, 0, rindex($x, '.')) . "\n";
Output of a sample program:
dog.tur-tles.chickens
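As a variation on the same module, fileparse also accepts suffix patterns, which removes the need for rindex and substr. This is a sketch; the wrapper name name_without_last_ext is hypothetical:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical wrapper: return the file name without directory and without
# its last extension (whatever follows the final dot).
sub name_without_last_ext {
    my ($path) = @_;
    # The suffix pattern is matched at the end of the base name, so only
    # the final ".something" is split off.
    my ($name, $dir, $suffix) = fileparse($path, qr/\.[^.]*/);
    return $name;
}

print name_without_last_ext('/var/www/foo/dog.tur-tles.chickens.txt'), "\n";   # dog.tur-tles.chickens
print name_without_last_ext('/var/www/foo/dog-turtles.txt.txt'), "\n";         # dog-turtles.txt
```

As with the other answers, a name such as archive.tar.gz loses only .gz with this approach.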
$name = ($pathname =~ s{.*/}{}r =~ s{\.[^.]+$}{}r)
substitution 1: just remove the directory
substitution 2: just remove the extension, if present
Just add .txt to your regex and since * is greedy by default it will match everything till last .txt
([^\/]*)\.txt$
Input:
/var/www/foo/dog.tur-tles.chickens.txt.txt
/var/www/foo/dog.tur-tles.chickens.txt
Output:
dog.tur-tles.chickens.txt
dog.tur-tles.chickens

Perl String Regular Expression - Need Explanation

I am pretty new to Perl. I have the following code fragment that works just fine, but I don't fully understand it:
for ($i = 1; $i <= $pop->Count(); $i++) {
foreach ( $pop->Head( $i ) ) {
/^(From|Subject):\s+/i and print $_, "\n";
}
}
$pop->Head returns a string or an array of strings; $pop is a Mail::POP3Client object, and the result is the headers of a bunch of emails. Line 3 is some kind of regular expression that extracts the FROM and the SUBJECT from the header.
My question is how does the print function only print the From and the Subject without all the other stuff in the header? What does "and" mean - this surely can't be a boolean and can it? Most important, I want to put the From string into its own variable (my $fromline). How do I do this?
I am hoping that this will be easy for some Perl professional, it has got me baffled!
Thanks in advance.
ARGHHH... The question was edited while I was typing the answer. OK, throwing out the part of my answer that's no longer relevant, and focusing on the specific questions:
The outer loop iterates over all the messages in the mailbox.
The inner loop doesn't specify a loop variable, so the special variable $_ is used.
In each iteration through the inner loop, $_ is one header line from message number $i.
/^(From|Subject):\s+/i and print $_, "\n";
The first part of this line, up to the and is a pattern. We didn't specify what to do with the pattern, so it's implicitly matched against $_. (That's one of the things that makes $_ special.) This gives us a yes/no test: does the pattern match the header line or not?
The pattern tests whether that item begins with (^) either of the words "From" or "Subject", followed immediately by a colon and one or more whitespace characters. (This is not the correct pattern to match an RFC 822 header. Whitespace is optional on both sides of the colon. The pattern should more properly be /^(From|Subject)\s*:\s*/i. But that's a separate issue.) The i at the end of the pattern says to ignore case, so from or SUBJECT would be OK.
The and says to continue evaluating (i.e., executing) the expression if there is a match. If there's no match, whatever follows and is ignored.
The rest of the expression prints the header line ($_) and a newline ("\n").
In Perl, and and or are boolean operators. They're synonyms for && and ||, except that they have much lower precedence, making it easier to write short-circuit expressions without clutter from lots of parentheses.
The smallest change that captures the From line into a separate variable would be to add the following line to the inner loop:
/^From\s*:\s*(.*)$/i and $fromline = $1;
You should probably also put
$fromline = undef
before the loop so you can test, after the loop, whether there was a From: line.
There are other ways to do it. In fact, that's one of the mantras of perl: "There's more than one way to do it." I've stripped out the "From: " from the beginning of the line before storing the balance in $fromline, but I don't know your needs.
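The capture can be tried out without a mail server. In this sketch the header lines are invented sample data standing in for what $pop->Head($i) returns, and extract_from is a hypothetical helper name:

```perl
use strict;
use warnings;

# Invented sample data standing in for the list returned by $pop->Head($i).
my @header_lines = (
    'Received: from mail.example.org',
    'From: Alice <alice@example.org>',
    'Subject: Hello',
);

# Hypothetical helper: return the text after "From:", or undef if there is
# no From line. If several From lines exist, the last one wins.
sub extract_from {
    my @lines = @_;
    my $fromline;
    for (@lines) {
        # The match runs against $_; "and" only evaluates the assignment
        # when the pattern matched.
        /^From\s*:\s*(.*)$/i and $fromline = $1;
    }
    return $fromline;
}

my $fromline = extract_from(@header_lines);
print defined $fromline ? "$fromline\n" : "no From line\n";   # Alice <alice@example.org>
```

Because and binds more loosely than =, the assignment needs no extra parentheses.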
It's a logical and with short-circuiting. If the left side evaluates to true -- say, if that regular expression matches -- it'll evaluate the right side, the print.
If the expression on the left is false, it doesn't need to evaluate the right hand side, because the net result would still be false, so it skips it.
See also: perldoc perlop

Excluding a file extension while parsing a CSV file

So I'm new to Perl and writing a script that would read through rows in a CSV file, and rename a directory of files associated with a certain column in that CSV file.
my $filename_formatted = "$row->[3]"."_"."$row->[4]"."_"."$row->[2]\n";
my $resume_id = $row->[1];
if (-e $resume_id){
rename($resume_id, $filename_formatted);
}
Basically, how could I format $resume_id to accept only the contents up to the file extension? The $row->[1] variable contains something like "resume_1231.pdf" or "resume_1231.doc". I basically want everything up to the .
I understand I would probably need a regex, but, I've never utilized it in Perl.
$formatted_resume_id = /($row->[1])?!\..*$/
I don't know.
I suppose you would want everything up to the final dot in the file name (so you would get the full name even if the filename contained dots).
Something like this should do it:
if ( $row->[1] =~ /(.*)\./ ) {
$formatted_resume_id = $1;
}
The $row->[1] variable contains something like "resume_1231.pdf" or "resume_1231.doc".
I basically want everything up to the .
Try a capturing group:
^([^.]*)
Or use a lazy quantifier:
^(.*?)\.
Sample code:
$mystring = "resume_1231.pdf";
if($mystring =~ m/^([^.]*)/) {
print "The file name is $1";
}
So the answer was apparently this,
my $resume_file = "bogus_filename.doc";
my ($name) = $resume_file =~ /(.+?)(\.[^.]*$|$)/;
my($ext) = $resume_file =~ /(\.[^.]+)$/;
This would account for any extra periods, as it only accepts up to the very last period.
I'm still a bit unsure as to how this works, so if anyone can break down the first regex, that would be great. I understand (.+?) but I'm lost as to how the second part of that regex means to not include the extension.
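One way to read the first regex: (.+?) is lazy, so it takes as few characters as it can, while the second group forces whatever remains to be either a final extension (\.[^.]*$, a dot followed by non-dot characters up to the end) or nothing at all ($). The lazy group therefore has to keep growing until the remainder is exactly the last extension. A small sketch to check this on a few names; split_name is a hypothetical wrapper and the sample file names are invented:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two regexes from the post above.
sub split_name {
    my ($file) = @_;
    # Lazy (.+?) grows until the rest is a final ".ext" or the end of string.
    my ($name) = $file =~ /(.+?)(\.[^.]*$|$)/;
    # The last dot plus everything after it; undef when there is no dot.
    my ($ext)  = $file =~ /(\.[^.]+)$/;
    return ($name, $ext);
}

for my $file ('bogus_filename.doc', 'resume.v2.pdf', 'README') {
    my ($name, $ext) = split_name($file);
    printf "%s => name=%s ext=%s\n", $file, $name, defined $ext ? $ext : '(none)';
}
# bogus_filename.doc => name=bogus_filename ext=.doc
# resume.v2.pdf => name=resume.v2 ext=.pdf
# README => name=README ext=(none)
```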