regex maching after new line in perl - regex

i am trying to match with regex in perl different parts of a text which are not in the same line.
I have a file sized 200 mb aprox with all cases similar to the following example:
rewfww
vfresrgt
rter
*** BLOCK 049 Aeee/Ed "ewewew"U 141202 0206
BLAH1
BLAH2
END
and i want to extract all what is in the same line after the "***" in $1, BLAH1 in $2 and BLAH2 in $3.
i have tried the following without success:
open(archive, "C:/Users/g/Desktop/blahs.txt") or die "die\n";
while(< archive>){
if($_ =~ /^\*\*\*(.*)\n(.*)/s){
print $1;
print $2;
}
}
One more complexity: i don´t know how many BLAH´s are in each case. Perhaps one case have only BLAH1, other case with BLAH1, BLAH2 and BLAH3 etc. The only thing thats sure is the final "END" who separates the cases.
Regards

\*\*\*([^\n]*)\n|(?!^)\G\s*(?!\bEND\b)([^\n]+)
Try this.See demo.
https://regex101.com/r/vN3sH3/17

How about:
#!/usr/bin/perl
use strict;
use warnings;
open(my $archive, '<', "C:/Users/g/Desktop/blahs.txt") or die "die: $!";
while(<$archive>){
if (/^\*{3}/ .. /END/) {
s/^\*{3}//;
print unless /END/;
}
}

As far as I understand your question the following works for me. Please update or provide feedback if you are looking for something more or less strict (or spot any mistakes!).
^(\*{3}.*\n{2})(([a-zA-Z])*([0-9]*)\n{2})*(END)$
^(\*{3}\n{2}) - Find line consisting of three *s followed by two newlines - You could repeat this by adding * after the last closing parenthesis if you want/need to check for a "false" start. While it looks like you may have data in the file before this but this is the start of the data you actually care about/want to capture.
(([a-zA-Z])*([0-9]*)\n{2})* -The desired word characters followed by a number (or numbers if your BLAH count >9) and also check for two trailing spaces. The * at the end denotes that this can repeat zero or more times which accounts for the case where you have no data. If you want a fail if there is not data use ? instead of * to denote it must repeat 1 or more times. this segment assumes you wanted to check for data in the format word+number. If that is not the case this part can be easily modified to accept a wider range of data - let me know if you want/need a more or less strict case
(END)$ - The regex ends with sequence "END". If it is permissible for the data to continue and you just want to stop capture at this point do not include the $
I don't have permissions to post pics yet but a great site to check and to see a visual representation of your regex imo is https://www.debuggex.com/

Related

Including regex on variable before matching string

I'm trying to find and extract the occurrence of words read from a text file in a text file. So far I can only find when the word is written correctly and not munged (a changed to # or i changed to 1). Is it possible to add a regex to my strings for matching or something similar? This is my code so far:
sub getOccurrenceOfStringInFileCaseInsensitive
{
my $fileName = $_[0];
my $stringToCount = $_[1];
my $numberOfOccurrences = 0;
my #wordArray = wordsInFileToArray ($fileName);
foreach (#wordArray)
{
my $numberOfNewOccurrences = () = (m/$stringToCount/gi);
$numberOfOccurrences += $numberOfNewOccurrences;
}
return $numberOfOccurrences;
}
The routine receives the name of a file and the string to search. The routine wordsInFileToArray () just gets every word from the file and returns an array with them.
Ideally I would like to perform this search directly reading from the file in one go instead of moving everything to an array and iterating through it. But the main question is how to hard code something into the function that allows me to capture munged words.
Example: I would like to extract both lines from the file.
example.txt:
russ1#anh#ck3r
russianhacker
# this variable also will be read from a blacklist file
$searchString = "russianhacker";
getOccurrenceOfStringInFileCaseInsensitive ("example.txt", $searchString);
Thanks in advance for any responses.
Edit:
The possible substitutions will be defined by an user and the regex must be set to fit. A user could say that a common substitution is to change the letter "a" to "#" or even "1". The possible change is completely arbitrary.
When searching for a specific word ("russian" for example) this could be done with something like:
(m/russian/i); # would just match the word as it is
(m/russi[a#1]n/i); # would match the munged word
But I'm not sure how to do that if I have the string to match stored in a variable, such as:
$stringToSearch = "russian";
This is sort of a full-text search problem, so one method is to normalize the document strings before matching against them.
use strict;
use warnings;
use Data::Munge 'list2re';
...
my %norms = (
'#' => 'a',
'1' => 'i',
...
);
my $re = list2re keys %norms;
s/($re)/$norms{$1}/ge for #wordArray;
This approach only works if there's only a single possible "normalized form" for any given word, and may be less efficient anyway than just trying every possible variation of the search string if your document is large enough and you recompute this every time you search it.
As a note your regex m/$randomString/gi should be m/\Q$randomString/gi, as you don't want any regex metacharacters in $randomString to be interpreted that way. See docs for quotemeta.
There are parts of the problem which aren't specified precisely enough (yet).
Some of the roll-your-own approaches, that depend on the details, are
If user defined substitutions are global (replace every occurrence of a character in every string) the user can submit a mapping, as a hash say, and you can fix them all. The process will identify all candidates for the words (along with the actual, unmangled, words, if found). There may be false positives so also plan on some post-processing
If the user can supply a list of substitutions along with words that they apply to (the mangled or the corresponding unmangled ones) then we can have a more targeted run
Before this is clarified, here is another way: use a module for approximate ("fuzzy") matching.
The String::Approx seems to fit quite a few of your requirements.
The match of the target with a given string relies on the notion of the Levenshtein edit distance: how many insertions, deletions, and replacements ("edits") it takes to make the given string into the sought target. The maximum accepted number of edits can be set.
A simple-minded example:
use warnings;
use strict;
use feature 'say';
use String::Approx qw(amatch);
my $target = qq(russianhacker);
my #text = qw(that h#cker was a russ1#anh#ck3r);
my #matches = amatch($target, ["25%"], #text);
say for #matches; #==> russ1#anh#ck3r
See documentation for what the module avails us, but at least two comments are in place.
First, note that the second argument in amatch specifies the percentile-deviation from the target string that is acceptable. For this particular example we need to allow every fourth character to be "edited." So much room for tweaking can result in accidental matches which then need be filtered out, so there will be some post-processing to do.
Second -- we didn't catch the easier one, h#cker. The module takes a fixed "pattern" (target), not a regex, and can search for only one at a time. So, in principle, you need a pass for each target string. This can be improved a lot, but there'll be more work to do.
Please study the documentation; the module offers a whole lot more than this simple example.
I've ended solving the problem by including the regex directly on the variable that I'll use to match against the lines of my file. It looks something like this:
sub getOccurrenceOfMungedStringInFile
{
my $fileName = $_[0];
my $mungedWordToCount = $_[1];
my $numberOfOccurrences = 0;
open (my $inputFile, "<", $fileName) or die "Can't open file: $!";
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
while (my $currentLine = <$inputFile>)
{
chomp ($currentLine);
$numberOfOccurrences += () = ($currentLine =~ m/$mungedWordToCount/gi);
}
close ($inputFile) or die "Can't open file: $!";
return $numberOfOccurrences;
}
Where the line:
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
Is just one of the substitutions that are needed and others can be added similarly.
I didn't know that Perl would just interpret the regex inside of the variable since I've tried that before and could only get the wanted results defining the variables inside the function using single quotes. I must've done something wrong the first time.
Thanks for the suggestions, people.

PERL: What's missing in my regexes?

This is part of a school project. I cannot figure out what's the problem in my regexes. I have more that work but these are giving me a hard time. Apache doesn't tell you exactly where you went wrong.
First and Last name must be two simple names and output in Lastname, Firstname format
my $name = param('name');
if($name =~ {2}) {
print "Name will be stored as $2, $1<br/><br/>";
} else {
print "Bad name. Enter just two names, your first and last<br/><br/>";
}
Password must be in this order of regexes. Begin with a single upper case character, 2 digits, a single space, 2-3 lower case letters, one special character (not a letter or digit).
my $password = param('password');
if ($password =~ /[A-Z]+\d{2}+\s+[a-z]{2,3}+-]\^$/) {
print "Password $password was accepted<br/><br/>";
} else {
print "Bad password, $password was not accepted<br/><br/>";
}
Apache doesn't tell you exactly where you went wrong.
First, find your Apache error log. It will contain the actual error. I can't tell you where it is, but I'd start with /var/log.
Second, debugging code through a web server just makes things more difficult. You're probably using CGI.pm which can accept arguments on the command line for debugging.
perl /path/to/your/program name='Michael Schwern'
Second, turn on strict and warnings. They will point out typos and silly mistakes like this one...
$ perl -w ~/tmp/test.plx name=foo
Odd number of elements in anonymous hash at /Users/schwern/tmp/test.plx line 5.
Bad name. Enter just two names, your first and last<br/><br/>
That's this.
$name =~ {2}
That says to make an anonymous hash with the key 2 and an undefined value. Then stringify it to something like HASH(0x7fca01805668) and then use that as a regex. In other words: nonsense.
What you're looking for is something like this that looks for two words separated by some spaces.
$name =~ m{^(\w+)\s+(\w+)$};
Read the Perl regex tutorial for more info.
It's hard to tell what you're trying to do, and getting other people to do your homework for you won't teach you a thing
$name =~ {2}
isn't a regular expression at all: you're building an anonymous hash { 2 => undef } and using its stringified reference as a pattern. It will be something like HASH(0x71c328) so that isn't going to work
And this one is incomprehensible
$password =~ /[A-Z]+\d{2}+\s+[a-z]{2,3}+-]\^$/
It will match something like A99 aaa-]^, but I doubt if that is what you want. What is the plus sign + for after {2,3}?

Perl regex from file.txt, match columns greater than x

I have a file containing several rows of code, like this:
160101, 0100, 58.8,
160101, 0200, 59.3,
160101, 0300, 59.5,
160101, 0400, 59.1,
I'm trying to print out the third column with a regex, like this:
# Read the text file.
open( IN, "file.txt" ) or die "Can't read words file: $!";
# Print out.
while (<IN>) {
print "Number: $1\n"
while s/[^\,]+\,[^\,]+\,([^\,]+)\,/$1/g;
}
And it works fairly well, however, I'm trying to only fetch the numbers that are greater than or equal to 59 (that includes numbers like 59.1 and 59.0). I've tried several numeric regex combinations (the one below will not give me the right number, obviously, but just making a point), including:
while s/[^\,]+\,[^\,]+\,([^\,]+)\,^[0-9]{3}$/$1/g;
but none seem to work. Any ideas?
My first idea would be to split that line and then pick and choose
while (my $line = <IN>) {
my #nums = split ',\s*', $line;
print "$nums[2]\n" if $nums[2] >= $cutoff;
}
If you insist on doing it all in the regex then you may want to use /e modifier, so in the substitution part you can run code. Then you can test the particular match and print it there.
Assuming that the numbers can't reach 100 (three digits) you could use
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+)\,
which uses your regex except for the capture group which captures the number 59 and it's decimals, or two digit numbers from 60-99 and it's decimals.
Regards
Edit:
To go above 100 you can add another alternative in the capture group:
[^\,]+\,[^\,]+\,\s*(59\.\d+|[6-9]\d\.\d+|[1-9]\d{2,}\.\d+)\,
which allows larger numbers (>=100.0).
Why do you use while? Is it possible to have more than one third column on a line? A simple if will work the same, comunicating the intent more clearly.
Also, if you want to extract, you don't need to substitute. Use m// instead of s///.
Regexes aren't the right tool to do numberic comparisons. Use >= instead:
print "Number: $1\n" if /[^\,]+\,[^\,]+\,([^\,]+)\,/
&& $1 >= 59
Assuming the line ends with a comma :
print foreach map{s/.+?(\d+.\d+),$/$1/;$_} ;
In case there might be someting after the rightmost comma :
print foreach map{s/.+?(\d+.\d+),[^,]*$/$1/;$_} ;
But i wouldn't use regexp in that case :
print foreach map{(split, ',')[-2]} ;
I would suggest not using a regex when split is a better tool for the job. Likewise - regex is very bad at detecting numeric values - it works on text based patterns.
But how about:
while ( <> ) {
print ((split /,\s*/)[2],"\n");
}
If you want to test a conditional:
while ( <> ) {
my #fields = split /,\s*/;
print $fields[2],"\n" if $fields[2] >= 59;
}
Or perhaps:
print join "\n", grep { $_ >= 59 } map { (split /,\s*/)[2] } <>;
map takes your input, and extracts the third field (returning a list). grep then applies a filter condition to every element. And then we print it.
Note - in the above, I use <> which is the magic file handle (reads files specified on command line, or STDIN) but you can use your filehandle.
However it's probably worth noting - 3 argument open with lexical file handles are recommended now.
open ( my $input, '<', 'file.txt' ) or die $!;
It has a number of advantages and is generally good style.

Perl - Regexp to manipulate .csv

I've got a function in Perl that reads the last modified .csv in a folder, and parses it's values into variables.
I'm finding some problems with the regular expressions.
My .csv look like:
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP OUT TOTAL","PDP OUT OK","PDP OUT NOK","PDP OUT OK Rate"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","ARG - NAME 1","536","536","0","100%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","USA - NAME 2","1850","1438","412","77.72%"
"04/12/2014 11:00:00","3600","1","GPRS_OUT","AUS - NAME 3","8","6","2","75%"
.(ignore this dot, you will understand later)
So far, I've had some help to parse the values into some variables, by:
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
while ( my $line = <$file> ) {
my ($date_time, $duration, $sample, $corner, $country_name, $pdp_in_total, $pdp_in_ok, $pdp_in_not_ok, $pdp_in_ok_rate)
= parse_line ',', 0, $line;
my ($date, $time) = split /\s+/, $date_time;
my ($country, $name) = $country_name =~ m/(.+) - (.*)/;
print "$date, $time, $country, $name, $pdp_in_total, $pdp_in_ok_rate";
}
The problems are:
I don't know how to make the first AND second line (that are the column names from the .csv) to be ignored;
The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
How can I do this?
When you have a csv file with column headers and want to parse the data into variables, the simplest choice would be to use Text::CSV. This code shows how you get your data into the hash reference $row. (I.e. my %data = %$row)
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
# open the file, I use the DATA internal file handle here
my $title = <DATA>;
# Set the headers using the header line
$csv->column_names( $csv->getline(*DATA) );
while (my $row = $csv->getline_hr(*DATA)) {
# you can now access the variables via their header names, e.g.:
if (defined $row->{Duration}) { # this will skip the blank lines
say $row->{Duration};
}
}
__DATA__
Title is: "NAME_NAME_NAME"
"Period end","Duration","Sample","Corner","Line","PDP IN TOTAL","PDP IN OK","PDP IN NOT OK","PDP IN OK Rate"
"04/12/2014 10:00:00","3600","1","GRPS_INB","CHN - Name 1","1198","1195","3","99.74%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","ARG - Name 2","1198","1069","129","89.23%"
"04/12/2014 10:00:00","3600","1","GRPS_INB","NLD - Name 3","813","798","15","98.15%"
If we print one of the $row variables with Data::Dumper, it shows the structure we are getting back from Text::CSV:
$VAR1 = {
'PDP IN TOTAL' => '1198',
'PDP IN NOT OK' => '3',
'PDP IN OK' => '1195',
'Period end' => '04/12/2014 10:00:00',
'Line' => 'CHN - Name 1',
'Duration' => '3600',
'Sample' => '1',
'PDP IN OK Rate' => '99.74%',
'Corner' => 'GRPS_INB'
};
open ...
my $names_from_first_line = <$file>; # you can use them or just ignore them
while($my line = <$file>) {
unless ($line =~ /\S/) {
# skip empty lines
next;
}
..
}
Also, consider using Text::CSV to handle CSV format
1) I don't know how to make the first line (that are the column names from the .csv) to be ignored;
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
2) The file sometimes come with 2-5 empty lines in the end of the file, as I show in my sample (ignore the dot in the end of it, it doesn't exists in the file).
while ( my $line = <$file> ) {
chomp $line;
next if $. == 1 || $. == 2;
next if $line =~ /^\s*$/;
You know that the valid lines will start with dates. I suggest you simply skip lines that don't start with dates in the format you expect:
while ( my $line = <$file> ) {
warn qq(next if not $line =~ /^"\d{2}-\d{2}-d{4}/;); # Temp debugging line
next if not $line =~ /^"\d{2}-\d{2}-d{4}/;
warn qq($line matched regular expression); # Temp debugging line
...
}
The /^"\d{2}-\d{2}-d{4}",/ is a regular expression pattern. The pattern is between the /.../:
^ - Beginning of the line.
" - Quotation Mark.
\d{2} - Followed by two digits.
- - Followed by a dash.
\d{2] - Followed by two more digits.
- - Followed by a dash.
\d{4} - Followed by four more digits
This should be describing the first part of your line which is the date in MM-DD-YYYY format surrounded by quotes and followed by a comma. The =~ tells Perl that you want the thing on the left to match the regular expression on the right.
Regular expressions can be difficult to understand, and is one of the reasons why Perl has such a reputation of being a write-only language. Regular expressions have been likened to sailor cussing. However, regular expressions is an extremely powerful tool, and worth the effort to learn. And with some experience, you'll be able to easily decode them.
The next if... syntax is similar to:
if (...) {
next;
}
Normally, you shouldn't use post-fix if and never use unless (which is if's opposite). They can make your program more difficult to understand. However, when placed right after the opening line of a loop like this, they make a clear statement that you're filtering out lines you don't want. I could have written this (and many people would argue this is preferable):
next unless $line =~ /^"\d{2}-\d{2}-d{4}",/;
This is saying you want to skip lines unless they match your regular expression. It's all a matter of personal preference and what do you think is easier for the poor schlub who comes along next year and has to figure out what your program is doing.
I actually thought about this and decided that if not ... was saying that I expect almost all lines in the file to match my format, and I want to toss away the few exceptions. To me, next unless ... is saying that there are some lines that match my regular expression, and many lines that don't, and I want to only work on lines that match.
Which gets us to the next part of programming: Watching for things that will break your program. My previous answer didn't do a lot of error checking, but it should. What happens if a line doesn't match your format? What if the split didn't work? What if the fields are not what I expect? You should really check each statement to make sure it actually worked. Almost all functions in Perl will return a zero, a null string, or an undef if they don't work. For example, the open statement.
open my $file, "<", $newest_file
or die qq(Cannot open file "$newest_file" for reading.);
If open doesn't work, it returns a file handle value of zero. The or states that if open doesn't return a non-zero file handle, execute the line that follows which kills your program.
So, look through your program, and see any place where you make an assumption that something works as expected and think what happens if it didn't. Then, add checks in your program to something if you get that exception. It could be that you want to report the error or log the error and skip to the next line. It could be that you want your program to come to a screeching halt. It could be that you can recover from the error and continue. What ever you do, check for possible errors (especially from user input) and handle possible errors.
Debugging
I told you regular expressions are tricky. Yes, I made a mistake assuming that your date was a separate field. Instead, it's followed by a space then the time which means that the final ", in the regular expression should not be there. I've fixed the above code. However, you may still need to test and tweak. Which brings us into debugging in Perl.
You can use warn statements to help debug your program. If you copy a statement, then surround it with warn qq(...);, Perl will print out the line (filling out variables) and the line number. I even create macros in my various editors to do this for me.
The qq(...) is a quote like operator. It's another way to do double quotes around a string. The nice thing is that the string can contain actual quotation marks, and the qq(...); will still work.
Once you've finished debugging, you can search for your warn statements and delete them. Perl comes with a powerful built in debugger, and many IDEs integrate with it. However, sometimes it's just easier to toss in a few warn statements to see what's going on in your code -- especially if you're having issues with regular expressions acting up.

find many matches in nucleotide sequence with a regex

I have some gene sequence (see below), and I want to find all open reading frame (start with ATG and stop TAG).
I have tried this:
my $file = ('ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG');
while($file =~ /(ATG\w+?TAG)/g){
print $1;
}
but it only gives
ATGGCCCATGATGCATCGACTAGATGTCGATCGATACAGCTAATAG
how can i get every one?
The trick to find all occurences is to use a zero-width assertion, this will prevent "the eating" of our characters: (?=ATG\w+?TAG).
The problem with this is that we'll get empty matches, so the solution is to use a group:
(?=(ATG\w+?TAG)). You will find all occurences in group 1.
Group 1 output:
ATGGCCCATGATGCATCGACTAG
ATGATGCATCGACTAG
ATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
Online demo
Result is ok, simply separate them in output:
print "$1\n";
You are getting two matches. To see them, I suggest you print some separator between them:
print "$1\n";
Then we get the output:
ATGGCCCATGATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
If you want to find frames that also occur inside another, then you must make sure to not consume too many characters. Work around that via a looahead:
/ATG(?=([ACTG]*+TAG))/g;
Then print "ATG$1\n", Output:
ATGGCCCATGATGCATCGACTAG
ATGATGCATCGACTAG
ATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
If you want to have the start and stop codons in the same frame don't forget to filter the results to the only ones with a length multiple of 3:
print "ATG$1\n" if (length($1)%3) == 0 ;
If you want to check the six frames available in one sequence, don't forget to check also the complementary chain:
$comp_chain = reverse($chain) ;
$comp_chain =~ tr/ATCG/TAGC/ ;
You will then obtain the open reading frames from the six reading frames available in a single sequence.