Read all the lines between a pattern match using Perl - regex

I want to read a csv file. The csv files is like below,
cat,dog,0
apple,banana,0
#start_loop
jug,ball,0
tub, jar,3
#stop_loop
phone,bottle,10
#per head
#start_loop
cup,book,7
laptop,charger,9
#stop_loop
For above csv, I want to read between the #start and #stop. Also, I want to differentiate two #start_loop 's for ex, the lines from first #start_loop goes to one array and another #start_loop goes to another array.
The expected result would be:
#array1 = {jug ball 0, tub jar 3}
and
#array2 = { cup book 7, laptop charger 9}
How do I solve this problem?

You can use the flip-flop operator. It's value will be 1 for the #start line and will contain E for the #stop line.
#!/usr/bin/perl
use warnings;
use strict;
my #arr;
while (<>) {
chomp;
my $inside = /^#start_loop/ .. /^#stop_loop/;
if ($inside) {
push #arr, [] if 1 == $inside;
undef $inside if 1 == $inside
|| $inside =~ /E/;
push #{ $arr[-1] }, $_ if $inside;
}
}
use Data::Dumper; print Dumper \#arr;

Simplified version of choroba's solution:
my #arr;
while (<>) {
chomp;
my $inside = /^#start_loop/ .. /^#stop_loop/;
if ( $inside == 1 ) { push #arr, []; } # Start line
elsif ( $inside !~ /E/ ) { push #{ $arr[-1] }, $_; } # Middle line
}
Without the flip-flop:
my #arr;
my $inside;
while (<>) {
chomp;
if ( /^#start_loop/ ) { $inside = 1; push #arr, []; } # Start line
elsif ( /^#stop_loop/ ) { $inside = 0; } # Stop line
elsif ( $inside ) { push #{ $arr[-1] }, $_; } # Middle line
}

Related

perl to check if sequence contains at least 3 unique bases and if not delete

I have a fasta file. I need to remove sequences containing ā€œNā€ or did not contain at least 3 unique bases.
The code so far is below. Also how would I remove the sequence ID line as go along for sequences I delete.
#!/usr/bin/perl
use strict;
use warnings;
open FILE, '<', $ARGV[0] or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open match_fh, ">$ARGV[0]_trimmed.fasta"
or die qq{Failed to open for output: $!\n};
while ( my $line = <FILE> ) {
chomp($line);
if ( $line =~ m/^>/ ) {
print match_fh "$line\n";
my #data = split( /\|/, $line );
my $nextline = <FILE>;
if ( $nextline !~ /N+/g ) {
if ( $nextline =~ /[ATGC]{3}/g ) {
}
print match_fh "$nextline";
}
}
}
close FILE;
close match_fh;
INPUT
>seq1
ATGCGGGATGATCCGAACGTTTAATCTCGTATGCCGTCTTCTATCTCNNN
>seq2
GATGAGCTTGACTCTAGTCCATCTCGTATGCCGTCTTCTGCTATCTCGTA
>seq3
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTC
>seq4
TGGTACTGTAAGCATGAGAGTAATCTCGTATGCCGTCTTCTGCTTGAAAA
OUTPUT
>seq2
GATGAGCTTGACTCTAGTCCATCTCGTATGCCGTCTTCTGCTATCTCGTA
>seq4
TGGTACTGTAAGCATGAGAGTAATCTCGTATGCCGTCTTCTGCTTGAAAA
while(my $head = <FILE>) {
next if($head !~ /^>/);
$_=<FILE>;
if(!/N+/ && /A/+/T/+/G/+/C/ >= 3) {
print match_fh $head, $_;
}
}

perl count line in double looping, if match regular expression plus 1

I open a file by putting the line to an array. Inside this file based on the regular expression that contains a duplicate value. If the regular expression is a match I want to count it. The regular expression may look like this
$b =~ /\/([^\/]+)##/. I want to match $1 value.
my #array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is the way I do, it will get the same line in my file. Thank for help.
foreach my $b (#array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (#array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is the some sample data, two sentences are duplicates
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the variables that have more than one (e.g. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
The problem has been solved by the previous answer - I just want to offer an alternate flavour that;
Spells out the regex
Uses the %seen hash to record the line the pattern first appears; to enable
slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
## # Must be immdiately followed by literal ##
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Ran on data above will produce;
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11

Perl: How to extract sequences based on gene number and nucleotide length?

I have 2 files, as follows:
file1.txt:
0 117nt, >gene_73|GeneMark.hm... *
0 237nt, >gene_3097|GeneMark.... *
0 237nt, >gene_579|GeneMark.h... *
0 237nt, >gene_988|GeneMark.h... *
0 189nt, >gene_97|GeneMark.hm... *
0 183nt, >gene_97|GeneMark.hm... *
file2.fasta:
>gene_735|GeneMark.hmm|237_nt|+|798985|799221
TTGTGGTTCGTGCCGCGCGACGCGTTGCGTCTGCAAACGCCCGACGAAGACATCGCGACCTATCTGTTCAACAAGCATGTGATTCGGCATCGGTTCTGTCCGACCTGCGGGATTCATCCGTTCGCGGAAGGCACGGACCCGAAGGGCAACGCGATGGCGGCCGTCAATCTTCGCTGCGTCGACGGCGTCGATCTCGACGCGTTGAGCGTCCGCCATTTCGACGGGCGCGCGCTCTGA
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCCAACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTCAACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACCAAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_1876|GeneMark.hmm|234_nt|-|2168498|2168731
ATGCTGTTCTTTTCGCGCGCGGGCGTGTCGCGTGCGGCCGGCGGCCAATCATGCGGCGAGTCGTTTTGTCGCGGCTCGCGGCGCTTGCCGACGTTGGAATCGCGCGCGCCGATGCGCGGATCGGGGCGGCAACGTTTGCGTATGAGGAATGATGCGTTTGCGCATCGGGAATGGGCGCCTCGCCCCGGTTTCGCCGCGATTCCGCCCGACTCGAGGCAGTCGTTTTTCCGCTAA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGGCGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGCTCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAAACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCGGTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTCCGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGGCTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|105_nt|+|90122|90226
GTGACGCGTTTCGCGACGCGCGTCGATGGGGCGGGCGCGAAACCCGTTCGCCGCGATGCGGCGGACGGGGTATGGCCGAGCGCCGTCCGTCGCGGCGAGAGTTGA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTCATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGCGACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCACTAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCGAACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGCGATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGCCATCGATGA
>gene_97|GeneMark.hmm|234_nt|+|105494|105727
ATGAAGATTCAAATCGCCATTGTTTATTTTGTCGCCCGTCACGCAAACGAGCAGGCGCGAAGCGGATCGGCGCGCATTGGCGAAGAGCCGGCGCGCATCGGCATCGCGCTCGCGCGACACATGCGCGCCGCGCGCGGCCGGTCGACGCCGGATTCGCCTGTCGATCGATCCGGTGCGCCCCGAGCCGATGAGCGGTACGCTTCGGCGCGCGCGCGACACGCGCGACACGCGTGA
>gene_979|GeneMark.hmm|225_nt|-|1115442|1115666
TTGATCGACGCGCGGGGCCGGCCGGGCCGCGGGGTATCGAAGGCGATCGACGCGCAACACGAATCGCCGCCGCGCGCCGAAACCTCGCTATGCGCGTCGCGCGCACGCGCGGCCGGCGGCGCACGCGCGGGTGTGCGCGGGCCGGCGGCGCGGCCGCTCGCACTGCGCGACCGCTCGCGCGCACGCCTTCCTCGGCACGCGCCGGGAATCCCGGCCCTTCAATGA
The output that I expect is:
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCCAACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTCAACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACCAAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGGCGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGCTCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAAACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCGGTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTCCGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGGCTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTCATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGCGACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCACTAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCGAACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGCGATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGCCATCGATGA
There are 4 sequences with gene number 97, but all in different length. I want the sequence with the correct gene length only which listed in file1.txt to output in the output.fasta file. What I've done so far is as follows (but failed and have some errors):
#!/usr/bin/perl
use strict;
use warnings;
my #genes;
open my $list, '<file1.txt';
while (my $line = <$list>) {
push (#genes, $1) if $line =~/\>(.*?)\|/gs;
}
my $tag1 = "0\t";
my $tag2 = "nt";
while (my $line = <$list>) {
if ($line =~ /$tag1(.*?)$tag2/) {
my $match1 = $1;
}
}
my $input;
{
local $/ = undef;
open my $fasta, '<file2.fasta';
my $tag3 = "GeneMark.hmm";
my $tag4 = "_nt";
while (my $input = <$fasta>) {
if ($input =~ /$tag3(.*?)$tag4/) {
my $match2 = $1; }}
close $fasta;
}
my #lines = split(/>/,$input);
foreach my $l (#lines) {
if ($l =~ /(.+?)\|/) {
my $real_name = $1;
if ($real_name ~~ #genes) {
if ($match2 = $match1) {
open (OUTFILE, '>>output.fasta');
print OUTFILE ">$l"; }
}
}
}
Can anyone give me some guide to correct the code? Or is there any better way to do this? Any help will be very much appreciated! Thanks! :)
Here's an option that uses Bio::SeqIO:
use strict;
use warnings;
use Bio::SeqIO;
my %hash;
open my $fh, '<', $ARGV[0] or die $!;
while (<$fh>) {
push #{ $hash{$2} }, $1 if /\s+(\d+)nt,.+?>(gene_\d+)\|/;
}
close $fh;
my $in = Bio::SeqIO->new( -file => $ARGV[1], -format => 'Fasta' );
my $out = Bio::SeqIO->new( -fh => \*STDOUT, -format => 'Fasta' );
while ( my $seq = $in->next_seq() ) {
$out->write_seq($seq)
if $seq->id =~ /(gene_\d+)\|.+?\|(\d+)_nt\|/ and grep /$2/, #{ $hash{$1} };
}
Usage: perl script.pl file1.txt file2.fasta [>outFile.fasta]
The second, optional parameter directs output to a file.
Output from your data:
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCC
AACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTC
AACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACC
AAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGG
CGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGC
TCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAA
ACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCG
GTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTC
CGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGG
CTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTC
ATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGC
GACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCAC
TAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCG
AACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGC
GATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGC
CATCGATGA
Bio::SeqIO lives to parse fasta (and other such) files, so the above leverages this capability. After creating a hash of arrays (HoA) from file1.txt, the fasta file is processed, and only matching fasta records are printed.
Hope this helps!

Perl Regular Expression to insert/substitute in a string at specific places

Given a url the following regular expression is able insert/substitute in words at certain points in the urls.
Code:
#!/usr/bin/perl
use strict;
use warnings;
#use diagnostics;
my #insert_words = qw/HELLO GOODBYE/;
my $word = 0;
my $match;
while (<DATA>) {
chomp;
foreach my $word (#insert_words)
{
my $repeat = 1;
while ((my $match=$_) =~ s|(?<![/])(?:[/](?![/])[^/]*){$repeat}[^/]*\K|$word|)
{
print "$match\n";
$repeat++;
}
print "\n";
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
10.15.16.17/dog/cat/rabbit/
The output given (for the first example url in __DATA__ with the HELLO word):
http://www.stackoverflow.com/dogHELLO/cat/rabbit/
http://www.stackoverflow.com/dog/catHELLO/rabbit/
http://www.stackoverflow.com/dog/cat/rabbitHELLO/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
Where I am now stuck:
I would now like to alter the regular expression so that the output will look like what is shown below:
http://www.stackoverflow.com/dogHELLO/cat/rabbit/
http://www.stackoverflow.com/dog/catHELLO/rabbit/
http://www.stackoverflow.com/dog/cat/rabbitHELLO/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
#above is what it already does at the moment
#below is what i also want it to be able to do as well
http://www.stackoverflow.com/HELLOdog/cat/rabbit/ #<-puts the word at the start of the string
http://www.stackoverflow.com/dog/HELLOcat/rabbit/
http://www.stackoverflow.com/dog/cat/HELLOrabbit/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
http://www.stackoverflow.com/HELLO/cat/rabbit/ #<- now also replaces the string with the word
http://www.stackoverflow.com/dog/HELLO/rabbit/
http://www.stackoverflow.com/dog/cat/HELLO/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
But I am having trouble getting it to automatically do this within the one regular expression.
Any help with this matter would be highly appreciated, many thanks
One solution:
use strict;
use warnings;
use URI qw( );
my #insert_words = qw( HELLO );
while (<DATA>) {
chomp;
my $url = URI->new($_);
my $path = $url->path();
for (#insert_words) {
# Use package vars to communicate with /(?{})/ blocks.
local our $insert_word = $_;
local our #paths;
$path =~ m{
^(.*/)([^/]*)((?:/.*)?)\z
(?{
push #paths, "$1$insert_word$2$3";
if (length($2)) {
push #paths, "$1$insert_word$3";
push #paths, "$1$2$insert_word$3";
}
})
(?!)
}x;
for (#paths) {
$url->path($_);
print "$url\n";
}
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
http://10.15.16.17/dog/cat/rabbit/
Without crazy regexes:
use strict;
use warnings;
use URI qw( );
my #insert_words = qw( HELLO );
while (<DATA>) {
chomp;
my $url = URI->new($_);
my $path = $url->path();
for my $insert_word (#insert_words) {
my #parts = $path =~ m{/([^/]*)}g;
my #paths;
for my $part_idx (0..$#parts) {
my $orig_part = $parts[$part_idx];
local $parts[$part_idx];
{
$parts[$part_idx] = $insert_word . $orig_part;
push #paths, join '', map "/$_", #parts;
}
if (length($orig_part)) {
{
$parts[$part_idx] = $insert_word;
push #paths, join '', map "/$_", #parts;
}
{
$parts[$part_idx] = $orig_part . $insert_word;
push #paths, join '', map "/$_", #parts;
}
}
}
for (#paths) {
$url->path($_);
print "$url\n";
}
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
http://10.15.16.17/dog/cat/rabbit/
one more solution:
#!/usr/bin/perl
use strict;
use warnings;
my #insert_words = qw/HELLO GOODBYE/;
while (<DATA>) {
chomp;
/(?<![\/])(?:[\/](?![\/])[^\/]*)/p;
my $begin_part = ${^PREMATCH};
my $tail = ${^MATCH} . ${^POSTMATCH};
my #tail_chunks = split /\//, $tail;
foreach my $word (#insert_words) {
for my $index (1..$#tail_chunks) {
my #new_tail = #tail_chunks;
$new_tail[$index] = $word . $tail_chunks[$index];
my $str = $begin_part . join "/", #new_tail;
print $str, "\n";
$new_tail[$index] = $tail_chunks[$index] . $word;
$str = $begin_part . join "/", #new_tail;
print $str, "\n";
}
print "\n";
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
10.15.16.17/dog/cat/rabbit/

Perl regex syntax generation

This is a follow up to the question posted here: Perl Regex syntax
The results from that discussion yielded this script:
#!/usr/bin/env perl
use strict;
use warnings;
my #lines = <DATA>;
my $current_label = '';
my #ordered_labels;
my %data;
for my $line (#lines) {
if ( $line =~ /^\/(.*)$/ ) { # starts with slash
$current_label = $1;
push #ordered_labels, $current_label;
next;
}
if ( length $current_label ) {
if ( $line =~ /^(\d) "(.*)"$/ ) {
$data{$current_label}{$1} = $2;
next;
}
}
}
for my $label ( #ordered_labels ) {
print "$label <- as.factor($label\n";
print " , levels= c(";
print join(',',map { $_ } sort keys %{$data{$label}} );
print ")\n";
print " , labels= c(";
print join(',',
map { '"' . $data{$label}{$_} . '"' }
sort keys %{$data{$label}} );
print ")\n";
print " )\n";
}
__DATA__
...A bunch of nonsense I do not care about...
...
Value Labels
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"
execute .
Essentially, I need to build the Perl to accommodate a syntax shorthand found in the SPSS file. For adjacent columns, SPSS allows one to type something like:
VALUE LABELS
/agree1 to agree5
1 "Strongly disagree"
2 "Disagree"
3 "Neutral"
4 "Agree"
5 "Strongly agree"
As the script currently exists, it will generate this:
agree1 to agree5 <- factor(agree1 to agree5
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
and I need it to produce something like this:
agree1 <- factor(agree1
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
agree2 <- factor(agree2
, levels= c(1,2,3,4,5,6)
, labels= c("Strongly disagree","Disagree","Neutral","Agree","Strongly agree","N/A")
)
ā€¦
use strict;
use warnings;
main();
sub main {
my #lines = <DATA>;
my $vlabels = get_value_labels(#lines);
write_output_delim($vlabels);
}
# Extract the value label information from SPSS syntax.
sub get_value_labels {
my (#vlabels, $i, $j);
for my $line (#_){
if ( $line =~ /^\/(.+)/ ){
my #vars = parse_var_range($1);
$i = #vlabels;
$j = $i + #vars - 1;
push #vlabels, { var => $_, codes => [] } for #vars;
}
elsif ( $line =~ /^\s* (\d) \s+ "(.*)"$/x ){
push #{$vlabels[$_]{codes}}, [$1, $2] for $i .. $j;
}
}
return \#vlabels;
}
# A helper function to handle variable ranges: "agree1 to agree3".
sub parse_var_range {
my $vr = shift;
my #vars = split /\s+ to \s+/x, $vr;
return $vr unless #vars > 1;
my ($stem) = $vars[0] =~ /(.+?)\d+$/;
my #n = map { /(\d+)$/ } #vars;
return map { "$stem" . $_ } $n[0] .. $n[1];
}
sub write_output_delim {
my $vlabels = shift;
for my $vlab (#$vlabels){
print $vlab->{var}, "\n";
print join("\t", '', #$_), "\n" for #{$vlab->{codes}}
}
}
sub write_output_factors {
# You get the idea...
}
__DATA__
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"
/agree1 to agree3
1 "Disagree"
2 "Neutral"
3 "Agree"