I received some amazing help from Stack Overflow ... however ... it was so amazing I need a little more help to get closer to the finish line. I'm parsing multiple enormous 4 GB files twice per month. I need to be able to skip the header and count the total lines, the matched lines, and the unmatched lines. I'm sure this is super simple for a PowerShell superstar, but at my newbie PS level my skills are not yet strong. Perhaps a little help from you would save the week. :)
Data Sample:
ID FIRST_NAME LAST_NAME COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE MOUSE COLUMN VALUE LONGSTARTS
10000000002MICKLE ROONEY MOUSE COLUMN VALUE LONGSTARTS
Code Block (based on this answer):
#$match_regex matches each fixed length field by length; the () specifies that each matched field be stored in a capture group:
[regex]$match_regex = '^(.{10})(.{50})(.{50})(.{50})(.{50})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{4})(.{25})(.{2})(.{10})(.{3})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{10})(.{10})(.{10})(.{2})(.{10})(.{50})(.{50})(.{50})(.{50})(.{8})(.{4})(.{50})(.{2})(.{30})(.{6})(.{3})(.{2})(.{25})(.{2})(.{10})(.{3})(.{4})(.{2})(.{4})(.{10})(.{38})(.{38})(.{15})(.{1})(.{10})(.{2})(.{10})(.{10})(.{10})(.{10})(.{38})(.{38})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})(.{10})$'
Measure-Command {
    & {
        switch -File $infile -Regex {
            $match_regex {
                # Join what all the capture groups matched with a tab char.
                $Matches[1..($Matches.Count-1)].Trim() -join "`t"
            }
        }
    } | Out-File $outFile
}
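The same approach, sketched in Python for comparison: one capture group per fixed-width field, joined with tabs. The 10/6/5-character layout here is made up, standing in for the real 91-field record.

```python
import re

# Hypothetical 10/6/5-character layout standing in for the real 91-field record.
match_regex = re.compile(r'^(.{10})(.{6})(.{5})$')

line = "1000000001MINNIEMOUSE"
m = match_regex.match(line)
# Join what all the capture groups matched with a tab char, trimming padding.
out = "\t".join(g.strip() for g in m.groups()) if m else None
```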
You only need to keep track of two counts (matched and unmatched lines), plus a Boolean to indicate whether you've skipped the first line:
$first = $false
$matched = 0
$unmatched = 0

. {
    switch -File $infile -Regex {
        $match_regex {
            if ($first) {
                # Join what all the capture groups matched with a tab char.
                $Matches[1..($Matches.Count-1)].Trim() -join "`t"
                $matched++
            }
            $first = $true
        }
        default {
            $unmatched++
            # you can remove this, if the pattern always matches the header
            $first = $true
        }
    }
} | Out-File $outFile

$total = $matched + $unmatched
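For reference, the same skip-header-and-count bookkeeping can be sketched in Python (again with a made-up 10/6/5-character layout):

```python
import re

match_regex = re.compile(r'^(.{10})(.{6})(.{5})$')  # hypothetical 3-field layout

lines = [
    "ID        FIRST LAST ",   # header: happens to match the width pattern
    "1000000001MINNIEMOUSE",
    "short line",              # bad record
]

matched = unmatched = 0
first = False  # becomes True once the first line has been consumed
out_lines = []

for line in lines:
    m = match_regex.match(line)
    if m:
        if first:
            out_lines.append("\t".join(g.strip() for g in m.groups()))
            matched += 1
        first = True
    else:
        unmatched += 1
        first = True  # remove if the pattern always matches the header

total = matched + unmatched
```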
Using System.IO.StreamReader reduced the processing time to about 20% of what it had been, which was absolutely needed for my requirement.
I added logic and counters without sacrificing much performance. The field counter and row-by-row comparison are particularly helpful for finding bad records.
This is a copy/paste of actual code, but I shortened some things and made some parts slightly pseudo-code, so you may have to play with it to get things working just so for yourself.
Function Get-Regx-Data-Format() {
    Param ([String] $filename)
    if ($filename -eq 'FILE NAME') {
        [regex]$match_regex = '^(.{10})(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{1})'
    }
    return $match_regex
}
Foreach ($file in $cutoff_files) {
    $starttime_for_file = (Get-Date)
    $source_file = $file + '_' + $proc_yyyymm + $source_file_suffix
    $source_path = $source_dir + $source_file
    $parse_file = $file + '_' + $proc_yyyymm + '_load' + $parse_target_suffix
    $parse_file_path = $parse_target_dir + $parse_file
    $error_file = $file + '_err_' + $proc_yyyymm + $error_target_suffix
    $error_file_path = $error_target_dir + $error_file

    [regex]$match_data_regex = Get-Regx-Data-Format $file

    Remove-Item -path "$parse_file_path" -Force -ErrorAction SilentlyContinue
    Remove-Item -path "$error_file_path" -Force -ErrorAction SilentlyContinue

    [long]$matched_cnt = 0
    [long]$unmatched_cnt = 0
    [long]$loop_counter = 0
    [boolean]$has_header_row = $true
    [int]$field_cnt = 0
    [int]$previous_field_cnt = 0

    $parse_minutes = Measure-Command {
        try {
            $stream_log = [System.IO.StreamReader]::new($source_path)
            $stream_in  = [System.IO.StreamReader]::new($source_path)
            $stream_out = [System.IO.StreamWriter]::new($parse_file_path)
            $stream_err = [System.IO.StreamWriter]::new($error_file_path)

            # compare against $null so a blank line doesn't end the loop early
            while ($null -ne ($line = $stream_in.ReadLine())) {
                if ($line -match $match_data_regex) {
                    # if matched and it's the header, parse and write to the beginning of the output file
                    if (($loop_counter -eq 0) -and $has_header_row) {
                        $stream_out.WriteLine(($Matches[1..($Matches.Count - 1)].Trim() -join "`t"))
                    } else {
                        $previous_field_cnt = $field_cnt
                        # add year-month to line start, trim and join every captured field with tabs
                        $stream_out.WriteLine("$proc_yyyymm`t" + `
                            ($Matches[1..($Matches.Count - 1)].Trim() -join "`t"))
                        $matched_cnt++
                        $field_cnt = $Matches.Count
                        if (($previous_field_cnt -ne $field_cnt) -and $loop_counter -gt 1) {
                            write-host "`nError on line $($loop_counter + 1). The field count does not match the previous correctly formatted (non-error) row."
                        }
                    }
                } else {
                    if (($loop_counter -eq 0) -and $has_header_row) {
                        # if the header, write to the beginning of the output file
                        $stream_out.WriteLine($line)
                    } else {
                        $stream_err.WriteLine($line)
                        $unmatched_cnt++
                    }
                }
                $loop_counter++
            }
        } finally {
            $stream_in.Dispose()
            $stream_out.Dispose()
            $stream_err.Dispose()
            $stream_log.Dispose()
        }
    } | Select-Object -Property TotalMinutes

    write-host "`n$file_list_idx. File $file parsing results....`nMatched Count = $matched_cnt UnMatched Count = $unmatched_cnt Parse Minutes = $parse_minutes`n"
    $file_list_idx++

    $endtime_for_file = (Get-Date)
    write-host "`nEnded processing file at $endtime_for_file"
    $TimeDiff_for_file = (New-TimeSpan $starttime_for_file $endtime_for_file)
    $Hrs_for_file  = $TimeDiff_for_file.Hours
    $Mins_for_file = $TimeDiff_for_file.Minutes
    $Secs_for_file = $TimeDiff_for_file.Seconds
    write-host "`nElapsed Time for file $file processing: $Hrs_for_file`:$Mins_for_file`:$Secs_for_file"
}
$endtime = (Get-Date -format "HH:mm:ss")
$TimeDiff = (New-TimeSpan $starttime $endtime)
$Hrs = $TimeDiff.Hours
$Mins = $TimeDiff.Minutes
$Secs = $TimeDiff.Seconds
write-host "`nTotal Elapsed Time: $Hrs`:$Mins`:$Secs"
I need to replace a string in multiple text files with the same string, except with capture group 2 replaced by the sum of itself and capture group 4.
String: Total amount $11.39 | Change $0.21
Desired Result: Total amount $11.60 | Change $0.21
I have attempted several methods. Here is my last attempt, which seems to run without error but makes no changes to the string.
$Originalfolder = "$ENV:userprofile\Documents\folder\"
$Originalfiles = Get-ChildItem -Path "$Originalfolder\*"
$RegexPattern = '\b(Total\s\amount\s\$)(\d?\d?\d?\d?\d\.?\d?\d?)(\s\|\sChange\s\$)(\d?\d?\d\.?\d?\d?)\b'
$Substitution = {
Param($Match)
$Result = $GP1 + $Sumtotal + $GP3 + $Change
$GP1 = $Match.Groups[1].Value
$Total = $Match.Groups[2].Value
$GP3 = $Match.Groups[3].Value
$Change = $Match.Groups[4].Value
$Sumtotal = ($Total + $Change)
return [string]$Result
}
foreach ($file in $Originalfiles) {
$Lines = Get-Content $file.FullName
$Lines | ForEach-Object {
[Regex]::Replace($_, $RegexPattern, $Substitution)
} | Set-Content $file.FullName
}
For one thing, your regular expression doesn't even match what you're trying to replace, because you escaped the a in amount:
\b(Total\s\amount\s\$)(\d?\d?\d?...
# ^^
\a is an escape sequence that matches the "alarm" or "bell" character \u0007.
Also, if you want to calculate the sum of two captures you need to convert them to numeric values first, otherwise the + operator would just concatenate the two strings.
$Total = $Match.Groups[2].Value
$Change = $Match.Groups[4].Value
$Sumtotal = $Total + $Change # gives 11.390.21
$Sumtotal = [double]$Total + [double]$Change # gives 11.6
And you need to build $Result after you defined the other variables, otherwise the replacement function would just return an empty string.
Change this:
$RegexPattern = '\b(Total\s\amount\s\$)(\d?\d?\d?\d?\d\.?\d?\d?)(\s\|\sChange\s\$)(\d?\d?\d\.?\d?\d?)\b'
$Substitution = {
param ($Match)
$Result = $GP1 + $Sumtotal + $GP3 + $Change
$GP1 = $Match.Groups[1].Value
$Total = $Match.Groups[2].Value
$GP3 = $Match.Groups[3].Value
$Change = $Match.Groups[4].Value
$Sumtotal = ($Total + $Change)
return [string]$Result
}
into this:
$RegexPattern = '\b(Total\samount\s\$)(\d?\d?\d?\d?\d\.?\d?\d?)(\s\|\sChange\s\$)(\d?\d?\d\.?\d?\d?)\b'
$Substitution = {
    Param($Match)
    $GP1 = $Match.Groups[1].Value
    $Total = [double]$Match.Groups[2].Value
    $GP3 = $Match.Groups[3].Value
    $Change = [double]$Match.Groups[4].Value
    $Sumtotal = ($Total + $Change)
    $Result = $GP1 + $Sumtotal + $GP3 + $Change
    return [string]$Result
}
and the code will mostly do what you want. "Mostly", because it will not format the calculated number to two decimal places. You need to do that yourself. Use the format operator (-f) and change your replacement function to something like this:
$Substitution = {
    Param($Match)
    $GP1 = $Match.Groups[1].Value
    $Total = [double]$Match.Groups[2].Value
    $GP3 = $Match.Groups[3].Value
    $Change = [double]$Match.Groups[4].Value
    $Sumtotal = $Total + $Change
    return ('{0}{1:n2}{2}{3:n2}' -f $GP1, $Sumtotal, $GP3, $Change)
}
As a side note: the sub-expression \d?\d?\d?\d?\d\.?\d?\d? could be shortened to \d+(?:\.\d+)? (one or more digits, optionally followed by a period and one or more digits) or, more precisely, to \d{1,4}(?:\.\d{0,2})? (one to four digits, optionally followed by a period and up to 2 digits).
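The same callback-replacement technique, sketched in Python for comparison (re.sub accepts a function, analogous to [Regex]::Replace with a script block), using the shortened digit pattern from the note above:

```python
import re

pattern = re.compile(
    r'\b(Total\samount\s\$)(\d{1,4}(?:\.\d{0,2})?)(\s\|\sChange\s\$)(\d{1,4}(?:\.\d{0,2})?)\b'
)

def substitution(m):
    gp1, gp3 = m.group(1), m.group(3)
    total, change = float(m.group(2)), float(m.group(4))
    # The sum must be numeric, and the result is built *after* the parts are defined.
    return '{0}{1:.2f}{2}{3:.2f}'.format(gp1, total + change, gp3, change)

line = "Total amount $11.39 | Change $0.21"
result = pattern.sub(substitution, line)
# result == "Total amount $11.60 | Change $0.21"
```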
Here's how I'd do it. This is pulled out of a larger script that regularly scans a directory for files and then does a similar manipulation. I've changed variable names quickly to obfuscate, so shout if it doesn't work and I'll take a more detailed look tomorrow.
It takes a backup of each file as well, and works on a temp copy before renaming.
Note it also sends an email alert (code at the end) to say whether any processing was done; this is because the original is designed to run as a scheduled task.
$backupDir = "$pwd\backup"
$stringToReplace = "."
$newString = "."
$files = @(Get-ChildItem $directoryOfFiles)
$intFiles = $files.count
$tmpExt = ".tmpDataCorrection"
$DataCorrectionAppend = ".DataprocessBackup"

foreach ($file in $files) {
    $content = Get-Content -Path ( $directoryOfFiles + $file )
    # Check whether there are any instances of the string
    If (!($content -match $stringToReplace)) {
        # Do nothing if we didn't match
    }
    Else {
        # Create another blank temporary file which the corrected file contents will be written to
        $tmpFileName_DataCorrection = $file.Name + $tmpExt
        $tmpFile_DataCorrection = $directoryOfFiles + $tmpFileName_DataCorrection
        New-Item -ItemType File -Path $tmpFile_DataCorrection
        foreach ( $line in $content ) {
            If ( $line.Contains("#")) {
                Add-Content -Path $tmpFile_DataCorrection -Value $line.Replace($stringToReplace,$newString)
                # Counter to know whether any processing was done or not
                $processed++
            }
            Else {
                Add-Content -Path $tmpFile_DataCorrection -Value $line
            }
        }
        # Backup (rename) the original file, and rename the temp file to be the same name as the original
        Rename-Item -Path $file.FullName -NewName ($file.FullName + $DataCorrectionAppend) -Force -Confirm:$false
        Move-Item -Path ( $file.FullName + $DataCorrectionAppend ) -Destination $backupDir -Force -Confirm:$false
        Rename-Item -Path $tmpFile_DataCorrection -NewName $file.FullName -Force -Confirm:$false
        # Check to see if anything was done, then populate a variable to use in the final email alert if there was
        If (!$processed) {
            # no message as we did nothing
        }
        Else {
            New-Variable -Name ( "processed" + $file.Name) -Value $processed
        }
    } # end of Else
}
I have a command that I ran on PowerShell 2.0 that gives me the output below.
Everything in that output is one giant string. I want to pull out the table part of the string so that I can potentially format it as a list. Ultimately, I want the output to look like:
INSTANCE_NAME: Sample Name
STATUS: MOUNTED
DATABASE_STATUS: ACTIVE
My first thought was to use regex to pull out the table. I thought something like this might work, but I've so far been unsuccessful.
$tabletext = [regex]::match($rawtext, "(INSTANCE_NAME(.+\r\n)+)")
EDIT:
Here is the text as a string.
SQL*Plus: Release 12.1.0.1.0 Production on Wed Apr 20 16:34:57 2016
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL>
INSTANCE_NAME STATUS DATABASE_STATUS
---------------- ------------ -----------------
sample_name OPEN ACTIVE
SQL> Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
I've done something very similar to parse Firebird sql output.
Here's a script that works on your sample data:
function parse-headers($lines) {
    $h = $lines[0]
    $headers = $h.Split(" ", [System.StringSplitOptions]::RemoveEmptyEntries)
    $headers = $headers | % {
        new-object pscustomobject -property @{
            Length = $_.Length
            Offset = $h.IndexOf($_)
            Text   = $_
        }
    }
    for ($i = 1; $i -lt $headers.length; $i++) {
        $headers[$i-1].Length = $headers[$i].Offset - $headers[$i-1].Offset - 1
    }
    $headers[$headers.length-1].Length = $h.length - $headers[$headers.length-1].Offset
    return $headers
}

function parse-sqloutput($lines) {
    $headers = parse-headers $lines
    $result = @()
    for ($l = 2; $l -lt $lines.Length; $l++) {
        $line = $lines[$l]
        $headernames = $headers | % { $h = @{} } { $h[$_.Text] = $null } { $h }
        $r = New-Object -type pscustomobject -Property $headernames
        for ($i = 0; $i -lt $headers.length; $i++) {
            try {
                $h = $headers[$i]
                $name = $h.text
                if ($i -eq $headers.length - 1) {
                    $value = $line.Substring($h.Offset).Trim()
                }
                else {
                    $value = $line.Substring($h.Offset, $h.Length).Trim()
                }
                $r.$name = $value
            } catch {
                Write-Warning "failed to parse line $l col $i"
                throw
            }
        }
        $result += $r
    }
    return $result
}

function get-sqltable($sqlout) {
    # find sql table output
    $startidx = -1
    $endidx = -1
    for ($i = 0; $i -lt $sqlout.Length; $i++) {
        $line = $sqlout[$i]
        if ($line -match "^\s*([\-]+\s*)+$") {
            $startidx = $i - 1
        }
        if ($startidx -ge 0 -and $line -match "^\s*$") {
            $endidx = $i
        }
        if ($startidx -ge 0 -and $endidx -ge 0) { break }
    }
    $sqlout = $sqlout | select -Skip $startidx -First ($endidx - $startidx)
    return $sqlout
}

$sqlout = get-content "sqlout.txt" | out-string
# split giant string into lines
$sqlout = $sqlout -split "`r`n"
$sqlout = get-sqltable $sqlout
# we should now have only the sql table in $sqlout
$result = parse-sqloutput $sqlout
$result | Format-List
The idea is:
Find a line that contains only runs of - characters and assume the line above it is the header row.
Look for the first empty line after the header row and assume it is the end of the table output.
Parse the header row to get the columns' names and lengths.
Get the values based on the parsed column lengths.
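Those steps can be sketched in Python as well; the header and row literals below are made up from the sample output:

```python
def parse_headers(header_line):
    # Column name -> [name, offset, length], inferred from the header row's spacing
    cols = []
    for name in header_line.split():
        cols.append([name, header_line.index(name), len(name)])
    # Widen each column to reach the start of the next one
    for i in range(1, len(cols)):
        cols[i-1][2] = cols[i][1] - cols[i-1][1] - 1
    cols[-1][2] = None  # last column runs to the end of the line
    return cols

header = "INSTANCE_NAME    STATUS       DATABASE_STATUS"
row    = "sample_name      OPEN         ACTIVE"

record = {}
for name, off, length in parse_headers(header):
    record[name] = row[off:off+length].strip() if length else row[off:].strip()
```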
$string1 = "peachbananaapplepear";
$string2 = "juicenaapplewatermelonpear";
I want to know what's the longest common substring containing the word "apple".
$string2 =~ m/.+apple.+/;
print $string2;
So I use the match operator, and .+ for matching any character before and after the keyword "apple". When I print $string2 it doesn't return naapple but returns the original $string2 instead.
Here is one approach. First get the locations where 'apple' appears in the strings. And for each of those locations in string1, look at all locations in string2.
Look to the left and right to see how far the commonality extends from the initial location.
$string1 = "peachbananaapplepear12345applegrapeapplebcdefghijk";
$string2 = "juicenaapplewatermelonpearkiwi12345applebcdefghijkberryapple";
my $SearchFor = "apple";
my $SearchStrLen = length($SearchFor);

# Get locations in first string where the search term appears
my @FirstPositions = getPostions($string1);
# Get locations in second string where the search term appears
my @SecondPositions = getPostions($string2);

CheckForMaxMatch();

sub getPostions
{
    my $GivenString = shift;
    my @Positions;
    my $j = 0;
    for (my $i = 0; $i < length($GivenString); $i += ($SearchStrLen + 1))
    {
        $j = index($GivenString, $SearchFor, $i);
        if ($j == -1) {
            last;
        }
        push (@Positions, $j);
        $i = $j;
    }
    return @Positions;
}

sub CheckForMaxMatch
{
    my $MaxLeft = 0;
    my $MaxRight = 0;
    # From the location of 'apple', look to the left and right
    # to see how far the characters are the same
    for my $i (@FirstPositions) {
        for my $j (@SecondPositions) {
            my $LeftMatchPos = getMaxMatch($i, $j, -1);
            my $RightMatchPos = getMaxMatch($i, $j, 1);
            if ( ($RightMatchPos - $LeftMatchPos) > ($MaxRight - $MaxLeft) ) {
                $MaxLeft = $LeftMatchPos;
                $MaxRight = $RightMatchPos;
            }
        }
    }
    my $LongestSubString = substr($string1, $MaxLeft, $MaxRight - $MaxLeft);
    print "Longest common substring is: $LongestSubString\n";
    print "It begins at $MaxLeft and ends at $MaxRight in string1\n";
}

sub getMaxMatch
{
    my $i = shift;
    my $j = shift;
    my $direction = shift;
    my $k = ($direction >= 1 ? $SearchStrLen : 0);
    my $FirstChar = substr($string1, $i + ($k * $direction), 1);
    my $SecondChar = substr($string2, $j + ($k * $direction), 1);
    for ( ; $FirstChar && $SecondChar; $k++ )
    {
        $FirstChar = substr($string1, $i + ($k * $direction), 1);
        $SecondChar = substr($string2, $j + ($k * $direction), 1);
        if ( $FirstChar ne $SecondChar ) {
            $direction < 1 ? $k-- : "";
            my $pos = ($k ? ($i + $k * $direction) : $i);
            return $pos;
        }
    }
    return $i;
}
The =~ operator is not going to reassign the value of $string2. Try this:
$string2 =~ m/(.+apple.+)/;
my $match = $1;
print $match;
Based on the general algorithm, but this tracks not only the length of the current run (@l) but also whether it includes the keyword (@k). Only runs that include the keyword are considered for the longest run.
use strict;
use warnings;
use feature qw( say );
sub find_substrs {
    our $s;   local *s   = \shift;
    our $key; local *key = \shift;

    my @positions;
    my $position = -1;
    while (1) {
        $position = index($s, $key, $position + 1);
        last if $position < 0;
        push @positions, $position;
    }
    return @positions;
}

sub lcsubstr_which_include {
    our $s1;  local *s1  = \shift;
    our $s2;  local *s2  = \shift;
    our $key; local *key = \shift;

    my @key_starts1 = find_substrs($s1, $key)
        or return;
    my @key_starts2 = find_substrs($s2, $key)
        or return;

    my @is_key_start1; $is_key_start1[$_] = 1 for @key_starts1;
    my @is_key_start2; $is_key_start2[$_] = 1 for @key_starts2;

    my @s1 = split(//, $s1);
    my @s2 = split(//, $s2);

    my $length = 0;
    my @rv;

    my @l = ( 0 ) x ( @s1 + 1 );  # Last ele is read when $i1==0.
    my @k = ( 0 ) x ( @s1 + 1 );  # Same.
    for my $i2 (0..$#s2) {
        for my $i1 (reverse 0..$#s1) {
            if ($s1[$i1] eq $s2[$i2]) {
                $l[$i1] = $l[$i1-1] + 1;
                $k[$i1] = $k[$i1-1] || ( $is_key_start1[$i1] && $is_key_start2[$i2] );
                if ($k[$i1]) {
                    if ($l[$i1] > $length) {
                        $length = $l[$i1];
                        @rv = [ $i1, $i2, $length ];
                    }
                    elsif ($l[$i1] == $length) {
                        push @rv, [ $i1, $i2, $length ];
                    }
                }
            } else {
                $l[$i1] = 0;
                $k[$i1] = 0;
            }
        }
    }

    # $i1/$i2 are the (inclusive) end indexes of each run; convert to start indexes.
    for (@rv) {
        $_->[0] -= $length - 1;
        $_->[1] -= $length - 1;
    }
    return @rv;
}

{
    my $s1 = "peachbananaapplepear";
    my $s2 = "juicenaapplewatermelonpear";
    my $key = "apple";
    for (lcsubstr_which_include($s1, $s2, $key)) {
        my ($s1_pos, $s2_pos, $length) = @$_;
        say substr($s1, $s1_pos, $length);
    }
}
This solution is O(NM), meaning it scales amazingly well (for what it does).
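The same O(NM) dynamic-programming idea can be sketched in Python (without the multiple-result and key-run bookkeeping, just the longest common run that contains the keyword):

```python
def lcs_containing(s1, s2, key):
    # Longest common substring of s1 and s2 that contains `key`,
    # via the classic O(N*M) dynamic-programming table.
    best = ""
    prev = [0] * (len(s2) + 1)  # prev[j+1] = run length ending at s1[i-1], s2[j]
    for i in range(len(s1)):
        cur = [0] * (len(s2) + 1)
        for j in range(len(s2)):
            if s1[i] == s2[j]:
                cur[j + 1] = prev[j] + 1
                run = s1[i - cur[j + 1] + 1 : i + 1]  # the common run ending here
                if key in run and len(run) > len(best):
                    best = run
        prev = cur
    return best

result = lcs_containing("peachbananaapplepear", "juicenaapplewatermelonpear", "apple")
# result == "naapple"
```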
I have written a Perl script for the following bioinformatics question, but unfortunately there is a problem with the output.
Question
1) From a file of 40,000 unique sequences (unique meaning the sequence ID numbers), extract the following pattern:
$gpat = '[G]{3,5}'; $npat = '[A-Z]{1,25}';
$pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
2) For each sequence, find if $pattern occurs between the values of
0-100
100-200
200-300
...
900-1000
1000
If a certain sequence is <1000 characters long, the division must still be maintained, i.e. 0-100, 100-200, etc.
The Issue
The main issue I am having is counting the number of times $pattern occurs for each sequence subdivision and then adding its count across all the sequences.
For example, for sequence 1, say $pattern occurs 5 times at a length >1000. For sequence 2, say $pattern occurs 3 times at length >1000. Then the total count should be 5+3 = 8.
Instead, my result comes out like (5+4+3+2+1) + (3+2+1) = 21, i.e. a cumulative total.
I am facing the same issue with the count for the first 10 subdivisions of 100 characters each.
I would be grateful if a correct code could be provided for this calculation.
The code I have written is as under. It is heavily derived from Borodin's answer to one of my previous questions here : Perl: Search a pattern across array elements
His answer is here: https://stackoverflow.com/a/11206399/1468737
The Code :
use strict;
use warnings;

my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;

open my $fh, '<', 'small.fa' or die $!;
my ($id, $seq);

my @totals = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0); # initialize the @totals array...
# ...it should contain 10 parts for 10 divisions up to 1000bp
my @thousandcounts = (0); # counting total occurrences of $pattern at >1000 length

while (<$fh>) {
    chomp;
    if (/^>(\w+)/) {
        process_seq($seq) if $id;
        $id = $1;
        $seq = '';
        print "$id\n";
    }
    elsif ($id) {
        $seq .= $_;
        process_seq($seq) if eof;
    }
}

print "Totals : @totals\n";
print "Thousand Counts total : @thousandcounts\n";

##**SUBROUTINE**
sub process_seq {
    my $sequence = shift @_;
    my $subseq = substr $sequence, 0, 1000;
    my $length = length $subseq;
    print $length, "\n";

    if ($length == 1000) {
        my @offsets = map { sprintf '%.0f', $length * $_ / 10 } 1..10;
        print "Offsets of 10 divisions: @offsets\n";
        my @counts = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
        my @count = (0);
        while ($sequence =~ /$regex/g) {
            my $place = $-[0];
            print $place, "\n\n";
            if ($place <= 1000) {
                for my $i (0..9) {
                    next if $place >= $offsets[$i];
                    $counts[$i]++;
                    last;
                }
            }
            print "Counts : @counts\n\n";
            $totals[$_] += $counts[$_] for 0..9;
            if ($place > 1000) {
                for my $i (0) {
                    $count[$i]++;
                    last;
                }
            }
            print "Count greater than 1000 : @count\n\n";
            $thousandcounts[$_] += $count[$_] for 0;
        }
    }
    # This region of code is for those sequences whose total length is less than 1000
    # It is working great! No issues here
    elsif ($length != 1000) {
        my $substr = join ' ', unpack '(A100)*', $sequence;
        my @offsets = map { sprintf '%.0f', $length * $_ / ($length/100) } 1..10;
        print "Offsets of 10 divisions: @offsets\n";
        my @counts = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
        while ($sequence =~ /$regex/g) {
            my $place = $-[0];
            print "Place : $place", "\n\n";
            for my $i (0..9) {
                next if $place >= $offsets[$i];
                $counts[$i]++;
                last;
            }
        }
        print "Counts : @counts\n\n";
        $totals[$_] += $counts[$_] for 0..9;
    }
} # subroutine ends
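The cumulative total comes from adding the running @counts into @totals inside the while loop, so earlier matches get re-added on every later match. One way to tally non-cumulatively, sketched here in Python with hypothetical match-position lists standing in for the $-[0] values:

```python
def bin_matches(match_positions, seq_len=1000, bin_size=100):
    # One bin per 100 bp up to 1000, plus one overflow count for positions > 1000.
    bins = [0] * (seq_len // bin_size)
    over = 0
    for place in match_positions:
        if place > seq_len:
            over += 1  # each match is counted exactly once
        else:
            bins[min(place // bin_size, len(bins) - 1)] += 1
    return bins, over

totals = [0] * 10
thousand_total = 0
# Hypothetical per-sequence match positions (two sequences)
for positions in ([50, 150, 1200], [90, 1500, 1600]):
    bins, over = bin_matches(positions)
    totals = [t + b for t, b in zip(totals, bins)]  # add once per sequence
    thousand_total += over
```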
I am also attaching a small segment of the file I am working with. This one is titled small.fa, and I have been experimenting with this file only before moving on to the bigger file containing >40,000 sequences.
>NR_037701 1
aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
agctgtagttatggctggtggagttcagttagtcagcatctggtggagct
gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
gctccctcttttaaagattttccttccctctttccaactccctgggtcct
ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
cacagactcaaaccctctctcacacacatacacatatacattgttattcc
acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
agggttgggacttcaacacagctttttgggggatcataattcaacccatg
acagccactgagattattatatctccagagaataaatgtgtggagttaaa
aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
>NR_002714 1
gttatacatctctaccattacctagcctgaaaagccacctcagattcagc
caacaagtaagtgggcattacaggagaagggtacctttcacaagggctgt
aatctaaaatcttggggaagatacagcgtcatctgtccaagaggtgtcag
cagtaacgaagcctcagtagaagccaaagttattttggattactgagcct
gtatagtttccagattctcaagagaaatatatgggaatgtagatatctca
gaggaccttcctgctgtcaggaattcagaggaggaaataaggaaggtaat
aggtgctctgctctcattctctcaaaccctcttccctgtgttttcctata
gagattgctgatttgctccttaagcaagagattcactgctgctcagcatg
gctcagaccaactcatgcttcatgctgatctcctgcctgatgttcctgtc
tctgagccaaggtgagattgttttccccacacatacctcccacaacccca
gccctgaagccctcactctatcctcatgcatatgagttcacttgagaaaa
agcagagtcaagttcaggggttgttttgtgttgttcagtgatatttattg
ctgatctcatcccattcaaaaacatcctgacctccctaaggagttagaga
tggaacttagcataaccctttatcagtgaccactgcagttggcattggtt
tgtcatattaacactactcatgatgggggtgttgaggatgtctgtttgta
gacagtcattagtggaatggggaactgaggggagctttgtgtgtagagaa
actggacaggcttgagaaagaagcctcagtccttcaaggaagaaaaagcc
ataagtaaaagggacaatggggacacttttcatgagcctattcattgtgt
gctcttgtcttgagcaaagacatcttgagagcctataggtaagatgcaga
agggcagaagtgaccaatcgcttcgtgacctataggatccttctattcct
ataaagaatcctcagaagctcctacctcatattttagcctttaccttgcc
ctgagggtctttcttaattgtctctcttttcccaggacaggaggcccatg
ctgagttgcccaaggcccagatcagctgcccagaaggcaccagtgcctaa
ggctcccactgctactactttaatgaagagcatgagacctgggtttatgc
agatgtgagtgaggagagcagtgtgggaagggaggctcacgaagggaggg
gaagctgccactctccagtgtgttcagtggctgatatgagatgagactaa
tcccctccctatccaatcatcagcccaaaactttccaatctactttatcc
catcattcagcacagagatgctggtggtcagtgacagcatcatcagggac
atttctgtgctgtcctttttctgttacatcctctgggagggctcaatatg
tctcccacactttcctccttcactgagtgctccattttcttctccaacag
ctctactgccagaacatgaattcaggtaacctggtgtctgtgctcaccca
ggctgagggtgcctttgtggcttcgctgattaaagagagtggcaccaagg
atagcaatgtctggattggcctccatgacccccaccggatcagtctgctg
catcttctacctcctgattatcaggttccagagggtctgatgtctggcac
ctcaagcatcagtttttactatattatgataaaagcaacctctctataaa
tcatataatgtaaaggatatcaaggttctccataggttcttcgagataag
cttaaagctgaatttcctgtgtgtttcaggcattcacagataaactcatt
ctctgtacttctagggtagcatctttatgtatctattatgtacctcttat
ctattgtgttatcatctctgttatagaagagccttctgtagaccatatag
aaaaagattatagaggaggagaatctactgctggcaattgggaaccgcaa
ggtatactaaataatatatcaacaactaatggccatctaatgctatgctg
gatatgaacttttggggcctcaggaaagaaaaaccaggaactagtttcaa
taatgaggtgtcatggttccctgtggcaaatttagaacgcttatcgtttg
gcaggacacagagaggtaggtgaacattccaggaaagaagcagcttagag
aaaatgtggaggaaataatatgacacttagagaaaaaggaaggtttattc
ttgtcttatgtcttgacctgtttctgagtgcgaacacaaaccaggtgttt
ctgtctctttctgagtcacgtctgcccctgttctggcccttccccatcta
gaactgccattatcagtggagtagtgggtccctggtctcctacaaatcct
gggacattggatccccaagctgtgccaatactgcctactgtgctagcctg
acttcaagctcaggtgaggggcacagaatccacacacttattgccatcct
ctcctatttatctctgaggatcgaccggggactgggatagaggaagggtg
agctcctcattcaggaaatagaggagtgtttcctctttatttttgctgag
tcctgcagccaggagggtaatacactctgatcccctcagtctgaatcttc
tcattgtcttataggattcaagaaatggaaggatgattcttgtaaggaga
agttctcctttgtttgcaagttcaaatactggaggcaattgtaaaatgga
cgtctagaattggtctaccagttactatggagtaaaagaattaaactgga
ccatctctctccatatcaatctggaccatctctcctctgctaaatttgca
tgactgatctttagtatctttacctacctcaatttctggagccctaaaca
ataaaaataaacatgtttcccccat
>NR_003569 1
ctgggacccacgacgacagaaggcgccgatggccgcgcctgctgagccct
gcgcggggcagggggtctggaaccagacagagcctgaacctgccgccacc
agcctgctgagcctgtgcttcctgagaacagcaggggtctgggtaccccc
catgtacctctgggtccttggtcccatctacctcctcttcatccaccacc
atggccggggctacctccggatgttccccactcttcaaagccaagatggt
gcttggattcgccctcatagtcctgtgtacctccagcgtggctgtcgctc
tttggaaaatccaacagggaacgcctgaggccccagaattcctcattcat
cctactgtgtggctcaccacgatgagcttcgcagtgttcctgattcacac
caagaggaaaaagggagtccagtcatctggagtgctgtttggttactggc
ttctctgctttgtcttgccagctaccaacgctgcccagcaggcctccgga
gcgggcttccagagcgaccctgtccgccacctgtccacctacctatgcct
gtctctggtggtggcacagtttgtgctgtcctgcctggcggatcaacccc
ccttcttccctgaagacccccagcagtctaacccctgtccagagactggg
gcagccttcccctccaaagccacgttctggtgggtttctggcctggtctg
gaggggatacaggaggccactgagaccaaaagacctctggtcgcttggga
gagaaaactcctcagaagaacttgtttcccggcttgaaaaggagtggatg
aggaaccgcagtgcagcccgggggcacaacaaggcaatagcatttaaaag
gaaaggcggcagtggcatggaggctccagagactgagcccttcctacggc
aagaagggagccagtggcgcccactgctgaaggccatctggcaggtgttc
cattctaccttcctcctggggaccctcagcctcgtcatcagtgatgtctt
caggttcactgtccccaagctgctcagccttttcctggagtttattggtg
atcccaagcctccagcctggaagggctacctcctcgccgtgctgatgttc
ctctcggcctgcctgcaaacgctgtttgagcagcagaacatgtacaggct
caaggtgctgtagatgaggctgcggtcggccatcactggcctggtgtaca
gaaaggcatccacagcatatctgaagaaatattcagaagttaactaatct
cagatgatttcagcaggagtaaagaagagaaacagactcagaaatgccat
tacaacagttaattatgtcaaatttatcaccctgattgatcacgcagcat
taacctcaagaacgccaagccaagtttttttgacaaatgtgagccaaggt
ttccgaaaaactagcagatatgactgtgacttacaaaatggaaaaagtaa
acgagaaacacaatttgatatgatttaataaaagatttgtttccaccact
tctcctgggaacctcagcacattttctttccactgacagttattatctct
acctttattgaacaaagacacccggaacacagctgctgaggatcagtaaa
gaaaatcattcttttattaataagactgttattagcaggaaaaaaaaatc
catgtttgggagtttgcactgaagttacaggccattttgaagaaatatgg
ctgactagtgccaacattatttcaggcaatttcatgatcaaatgtcttat
taggttgtttaaaatttttatagagattgtaaatcagaactattttctat
ttgccctaaatatttagatgctacagggaaagcagatcaaattaaagggt
actgtgcacatttttttactgggaactcccagggatataaatcatttcgc
ctgcagcatggaattcttcagtacacatgcttgtggaaacattccacgct
ccgccagcacgctcattaaagtgatgatttgggttgcaacaacagtgcca
agtacttcctgtgttcaactggggaccatgtggcaagacccaaagcttcc
ccagagatcctatgggaataagttttttgagccaccatattccattattt
cagcctaaaataacaccatgggacaagaatcagaagacagaggagcagac
aaatgtgtgtagacatgctggaaggaatctttctttttagaaacagggtc
aatatctattaaactttaagatgtgtatctcttgacctggcagtttctgt
atttgagttttaacctactgatatacccatgcatgtgaataaagtatctt
cctgcatgtaacaggatatttaatgtaaccttgattatagttgcaaatgc
tgggaaacgatccaaatgtctttcaatatggcactgattaaataaattat
ggcacagtctcacaatgaaaaacaaatgtagccattaaacagaatgaaat
gggtctagctaaattgaaataggactacctctaagatatgttgttaaaaa
gaaaaaaaagaaagtgcagaggaacaagtatgataccattttgtattttt
taacatatgcaagcgtgattgtgcccacacagaatacctttgaaaataaa
ctcagtatttgcctcagtggataaaaacaagaaccagccttattttcact
gttatatcttttggtgccactttttgaactttttaccatatgtgcatatg
taactttctaaataaattttgtaaaaaaaaaaaaaaaaaa
>NR_002817 2
aactcggtctccactgcactgctggccagacgagggatgttattttgggc
agtgcatctggacttggttcaagtggcaccagccaaatccctgccttact
gacctctcccctggaggagcaggagcagtgctcaaggccgccctgggagg
gctgagaggcaggctctggactggggacacagggatagctgagccccagc
tgggggtggaagctgagccagggacagtcacagaggaacaagatcaagat
gcgctttaactgagaagcccccaaggcagaggctgagaatcagaagacat
ttcagcagacatctacaaatctgaaggacaaaacatggttcaagcatctg
ggcacaggcggtccacccgtggctccaaaatggtctcctggtccgtgata
gcaaagatccaggaaatatggtgcgaggaagatgagaggaagatggcgcg
agagttcctggccgagttcatgagcacatatgtcatgatggagtggctga
ccgggatgctccagctgtgtctcttcgccatcgtggaccaggagaacaac
ccagcactgccaggaacacacgcactggtgataggcatcctcgtggtcat
catcagggtgtaccatggcatgaacacaggatatgccatcaatccgtccc
gggacctgccccccccccccgcatcttcaccttcattgctggttggggca
aactggtcttcaggtactgcccctgcccaggcccattcctttgagatttt
ctgtggggcccctgtgtgttgaggtgtggggggtgatgtgaggggcagca
caggagggtcctgcagagcccccaggtggcctggggagcaggagtgagtc
ccaacatttccccaggccagtagagatacagatcctgcacctgcactgag
tgtcaaccctgtccctgagtcgggctgaggctgaccagggccccgggttg
ggggtgtttcctgggttagcctgaggatgactcctctgctcaaccagtct
tggcccgaggtggatgagggtgctgtcctgggcatcagccccctcagccg
gcctctgcctcttgcctgcagcgatggggagaacttgtggtgggtgccag
tggtggcaccacttctgggtgcctctctaggtggcatcatctacctggtc
ttcattggctccaccatcccacgggagcccctgaaattggaggactctgt
ggcatatgaagaccacgggataaccgtattgcccaagatgggatctcatg
aacccatgatctctccccttaccctcatctccgtgagccctgccaacaga
tcttcagtccaccctgccccacccttacatgaatccatggccctagagca
cttctaagcagagattatttgtgatcccatcccttccccaataaagagaa
gcttgtcccacagcagtacccccacttcctgggggcctcctgtggttggg
cttccctcctgggttcttccaggagctctagggctatgtcttagcccaag
gtgtagaggtgaggcacctcaagtctttcatgccctgggaactggggtgc
cccagggggagaatggggaagagctgacctgcgccctcagtaggaacaag
gtaagatgaaagaatgacagaaacagaatgagggattttcaggcaagggg
gaaggaagggcagttttggtgaaaggactgtagctgactggtggggggct
ggctttggaaatactttgaggggatcctgagactggactctagactctcc
cctggttgttcccttccccgagttctggccggttcttggaccagacaagg
catggcccaagaaggtagatcagaattttttagcctttttttcattagtg
ccttccctagtataattccagattttttttcttaatcacatgaaatttta
ataccacagatatactatacatctgtttatgttctgtatatgttctgtgc
tttatacgtaaaaaagagtaagattttttttcacctccccttttaagaat
cagttttaattcccttgagaatgcttgttatagattgaaggctggtaagg
ggttgggctcctctttcttcttcctggtgccagagtgctcccacatgaag
gaataggaaaggaagatgcaaagagggaaatccttcgaacacatgaagac
acaggaagaggcctcttagggctccaagggctccagggaagcagctgcag
aggttgggtggggtgaggggccaggatccactgaccctggggccaggcag
gaatcactctgttgcctggggctcagaaggcagtatcacccatggttcct
gtcattgctcatgtattttgcctttcaacaattattgtgcacctactgtg
tgcaggccctgcctggacactggggatgcgcagtggatgcactgggctct
gcctttgagggttgcagtttaatgggtgacaggtaattataaggaagaag
gtgagtgcagagtgggaggcttggaggctgtggggcttggggtgggggag
ctcacatccagcctctgggccaaggccaggaggcttcccagagcaggaga
cagagcagggtattgtggtggggggtgtcctttttggggctgggatctgc
actttacagtttgaggggatgggcagaggaggctgggcttcattctggag
gtggggacatggtgaggtgaggtttagaaagcacacctgagccgcagtgt
gtaggatgctggaaatggtggagatgggcctgcgaagagagtgctgggaa
gtgatgacccaggagcagcagccgggcacctaacaatgggtcagcaccgt
gggcgtggagacaaaggccgggattgatcaatacccgagaagtacaatgt
acaggacttgggctccatttggatggagtgggtgagggaggagtcagaaa
tggcttccgatttccagcttgggcctggggattggagatgtccccactga
gagtagggcacaagtgaggaaatggtttggagaggaagatgataagttac
atcatggatgtgctgagtctgagttgcctatgggacttggaatggggggt
ggcaaaaggtgtgtgatcttgagcaagatattcaactcttctgggccttg
gtcttctcatttgtaaaacggtgataagaatattacttcccatttgtgtt
gctgtgaatattaaatgcgctaccacatgt
Thank you for taking the time to go through my problem.
Any help and input would be deeply appreciated.
This is pretty much the same as your previous problem except that the intervals are independent of the length of the sequence and so can be defined just once instead of changing them for every sequence.
This program is a modification of my previous solution. As I described, it starts with a fixed set of values in @offsets, from 100 to 1000 in steps of 100; the final range, > 1000, is terminated at 2E9, or 2 billion. This is close to the maximum positive 32-bit integer and serves to catch all offsets above 1000. I assume you won't be dealing with sequences any bigger than this?
The @totals and @counts arrays are initialised to zeroes with the same number of elements as the @offsets array.
Otherwise the functionality is much as before.
use strict;
use warnings;
use List::MoreUtils 'firstval';

my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat . $npat . $gpat . $npat . $gpat . $npat . $gpat;
my $regex = qr/$pattern/i;

open my $fh, '<', 'small.fa' or die $!;

my @offsets = map $_ * 100, 1 .. 10;
push @offsets, 2E9;
my @totals = (0) x @offsets;

my ($id, $seq);

while (<$fh>) {
    chomp;
    if (/^>(\w+)/) {
        process_seq($seq) if $id;
        $id  = $1;
        $seq = '';
        print "$id\n";
    }
    elsif ($id) {
        $seq .= $_;
        process_seq($seq) if eof;
    }
}

print "Total: @totals\n";

sub process_seq {
    my $sequence = shift;
    my @counts = (0) x @offsets;
    while ($sequence =~ /$regex/g) {
        my $place = $-[0];
        my $i = firstval { $place < $offsets[$_] } keys @offsets;
        $counts[$i]++;
    }
    print "Counts: @counts\n\n";
    $totals[$_] += $counts[$_] for keys @totals;
}
Output:
Running this program against your new data file small.fa produces
Total: 1 1 0 0 0 0 0 1 0 1 10
But with sample.fa, the data from the previous question, the result is much more interesting:
Total: 5 4 1 0 0 2 2 1 0 0 1
The following seems to work. While playing around, I put the data you posted in the __DATA__ section at the end of the script. To use it with a real data file, you'll need to open it and pass the file handle to run.
#!/usr/bin/env perl

use strict; use warnings;
use Data::Dumper;
use List::MoreUtils qw( first_index );

if (@ARGV) {
    my ($input_file) = @ARGV;
    open my $input, '<', $input_file
        or die "Cannot open '$input_file': $!";
    run($input);
    close $input
        or die "Cannot close '$input_file': $!";
}
else {
    run(\*DATA);
}

sub run {
    my ($fh, $start_pat, $stop_pat) = @_;

    # These are your patterns. I changed $npat because I don't
    # think, e.g., q is a valid character in your input.
    my $gpat = '[g]{3,5}';
    my $npat = '[acgt]{1,25}';
    my $wanted = qr/$gpat$npat$gpat$npat$gpat$npat$gpat/;

    # These just tell us where a sequence begins and ends.
    my $start = qr/\A>([A-Za-z_0-9]+)/;
    my $stop  = qr/[^acgt]/;

    # Set up the bins and labels for the histogram.
    my @bins   = map 100 * $_, 1 .. 10;
    my @labels = map sprintf('%d - %d', $_ - 100, $_), @bins;

    # Initialize the histogram with all zero counts.
    my %hist = map { $_ => 0 } @labels;

    my $id;
    while (my $line = <$fh>) {
        # Whenever you see a new sequence, read it completely
        # and pass it to build_histogram.
        if (($id) = ($line =~ $start)) {
            print "Start sequence: '$id':\n";
            my $seq_ref;
            ($line, $seq_ref) = read_sequence($fh, $stop);
            my $hist = build_histogram(
                $seq_ref,
                $wanted,
                \@bins,
                \@labels,
            );
            # Add the counts from this sequence to the overall
            # histogram.
            for my $key ( keys %$hist ) {
                $hist{ $key } += $hist->{$key};
            }
            # Exit the loop if read_sequence stopped because of EOF.
            last unless defined $line;
            # Otherwise, see if the line that stopped input is the
            # start of a new sequence.
            redo;
        }
    }
    print Dumper \%hist;
}

sub build_histogram {
    my ($seq_ref, $wanted, $bins, $labels) = @_;
    my %hist;
    while ($$seq_ref =~ /$wanted/g) {
        # Whenever we find a segment which matches what we want,
        # store the position,
        my $pos = $-[0];
        # and find the bin where it fits.
        my $idx = first_index { $_ > $pos } @$bins;
        # If you do not have List::MoreUtils, you should install it;
        # however, grep can be used instead of first_index:
        # my ($idx) = grep { $bins->[$_] > $pos } 0 .. $#$bins;
        # $idx = -1 unless defined $idx;
        # If it did not fit in the bins, then the position must
        # be greater than the upper limit of the last bin, so put
        # it in "> upper limit of last bin".
        my $key = ($idx == -1 ? "> $bins->[-1]" : $labels->[$idx]);
        $hist{ $key } += 1;
    }
    # We're done matching; return the histogram for this sequence.
    return \%hist;
}

sub read_sequence {
    my ($fh, $stop) = @_;
    my ($line, $seq);
    while ($line = <$fh>) {
        $line =~ s/\s+\z//;
        last if $line =~ $stop;
        $seq .= $line;
    }
    return ($line, \$seq);
}

__DATA__
-- Either paste your data here, or pass the name
-- of your input file on the command line
Output:
Start sequence: 'NR_037701':
Start sequence: 'NR_002714':
Start sequence: 'NR_003569':
Start sequence: 'NR_002817':
$VAR1 = {
          '700 - 800' => 0,
          '> 1000' => 10,
          '200 - 300' => 1,
          '900 - 1000' => 1,
          '800 - 900' => 1,
          '500 - 600' => 0,
          '0 - 100' => 0,
          '100 - 200' => 1,
          '300 - 400' => 0,
          '400 - 500' => 0,
          '600 - 700' => 0
        };
Also, you should take Chris Charley's advice and use Bio::SeqIO to read sequences rather than my homebrewed read_sequence function. I was just too lazy to install BioPerl just for the purpose of answering this question.
Generally, in Perl you can count the occurrences of a pattern like this:
$_ = $input;
my $c = 0;
$c++ while s/pattern//s;
I was finally able to figure out where I was going wrong with my code. It turned out to be a looping problem. The following code works perfectly. I have marked in comments the places where I made the modifications.
#!/usr/bin/perl -w
use strict;
use warnings;

my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat . $npat . $gpat . $npat . $gpat . $npat . $gpat;
my $regex = qr/$pattern/i;

open OUT, '>', 'Quadindividual.refMrna.fa' or die $!;
open my $fh, '<', 'refMrna.fa' or die $!;

my ( $id, $seq );    # can be written as my $id; my $seq;
my @totals = (0) x 10;    # initialize the @totals array
my @thousandcounts = (0);

while (<$fh>) {
    chomp;
    if (/^>(\w+)/) {
        process_seq($seq) if $id;
        $id  = $1;
        $seq = '';
        print "$id\n";
        print OUT "$id\n";
    }
    elsif ($id) {
        $seq .= $_;
        process_seq($seq) if eof;
    }
}

print "Totals : @totals\n";
print OUT "Totals : @totals\n";
print "Thousand Counts total : @thousandcounts\n";
print OUT "Thousand Counts total : @thousandcounts\n";

sub process_seq {
    my $sequence = shift @_;
    my $subseq = substr $sequence, 0, 1000;
    my $length = length $subseq;
    print $length, "\n";

    my @offsets = map { sprintf '%.0f', $length * $_ / 10 } 1 .. 10;
    print "Offsets of 10 divisions: @offsets\n";

    my @counts = (0) x 10;
    my @count  = (0);

    # *MODIFICATION*
    # This if block now comes after @offsets is initialized above
    if ( $length == 1000 ) {
        while ( $sequence =~ /$regex/g ) {
            my $place = $-[0];
            print $place, "\n\n";
            if ( $place <= 1000 ) {
                for my $i ( 0 .. 9 ) {
                    next if $place >= $offsets[$i];
                    $counts[$i]++;
                    last;
                }
            }
            if ( $place > 1000 ) {
                $count[0]++;
            }
        }
        # *MODIFICATION*
        # The following statements were also subsequently moved here
        # to properly compute the totals
        print "Counts : @counts\n\n";
        $totals[$_] += $counts[$_] for 0 .. 9;
        print "Count : @count\n\n";
        $thousandcounts[0] += $count[0];
    }
    elsif ( $length != 1000 ) {
        my $substr = join ' ', unpack '(A100)*', $sequence;
        my @offsets =
            map { sprintf '%.0f', $length * $_ / ( $length / 100 ) } 1 .. 10;
        print "Offsets of 10 divisions: @offsets\n";
        my @counts = (0) x 10;
        while ( $sequence =~ /$regex/g ) {
            my $place = $-[0];
            print "Place : $place", "\n\n";
            for my $i ( 0 .. 9 ) {
                next if $place >= $offsets[$i];
                $counts[$i]++;
                last;
            }
        }
        print "Counts : @counts\n\n";
        $totals[$_] += $counts[$_] for 0 .. 9;
    }
}    # subroutine ends