Parsing plain text database into data structures

I have a 100MB plain text database file that I would like to parse and convert into a data structure for easy access. The environment is Perl on Cygwin. Since we receive the plain text file from a third party, I cannot use an existing parser such as one for XML or Google Protocol Buffers.
The text file looks like this:
Class=Instance1
parameterA = <val>
parameterB = <val>
parameterC = <val>
ref = Instance2
Class=Instance2
parameterA = <val>
parameterB = <val>
parameterC = <val>
The file contains a huge number of class variants.
What would be the best option for parsing this? Will yacc/lex help me, or should I write my own Perl parser?

This should do the trick. It auto-detects the line ending by checking the first line, and it assumes that records are separated by a blank line.
Within each record, key/value pairs are assumed to be joined with an equal sign (=), and maybe some whitespace.
Here's my code:
#!/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Getopt::Long;
my $db_file;
GetOptions(
'file=s' => \$db_file,
);
sub detect_line_ending {
my ($fh) = @_;
my $line = <$fh>;
# Rewind to the beginning
seek($fh, 0, 0);
my ($ending) = $line =~ m/([\f\n\r]+$)/s;
return $ending;
}
sub process_chunk {
my ($chunk, $line_ending) = @_;
my @lines = split(/$line_ending/, $chunk);
my $section = {};
foreach my $line (@lines) {
my ($key, $value) = split(/[ \t]*=[ \t]*/, $line, 2);
$section->{$key} = $value;
}
return $section;
}
sub read_db_file {
my ($file) = @_;
my $data = [];
open (my $fh, '<', $file) or die $!;
my $line_ending = detect_line_ending($fh);
{
local $/ = $line_ending.$line_ending;
while (my $chunk = <$fh>) {
chomp $chunk;
my $section = process_chunk($chunk, $line_ending);
push @$data, $section;
}
}
close $fh;
return $data;
}
print Dumper read_db_file($db_file);

Is this what you want?
#!/usr/bin/perl
use Data::Dumper;
use Modern::Perl;
my %classes;
my $current;
while(<DATA>) {
chomp;
if (/^Class\s*=\s*(\w+)/) {
$classes{$1} = {};
$current = $1;
} elsif (/^(\w+)\s*=\s*(.+)$/) {
$classes{$current}{$1} = $2;
}
}
say Dumper\%classes;
Output:
$VAR1 = {
'Instance2' => {
'parameterC' => '<val>',
'parameterB' => '<val>',
'parameterA' => '<val>'
},
'Instance1' => {
'parameterC' => '<val>',
'ref' => 'Instance2',
'parameterB' => '<val>',
'parameterA' => '<val>'
}
};
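The sample data also carries a `ref = Instance2` line that points from one record to another. Here is a minimal sketch of following such a link, assuming every record starts with a `Class=` line and that a `ref` value names another record. The helper names `parse_records` and `resolve_ref` are made up for illustration, and the parameter values in the demo are placeholders. It reads line by line from a filehandle, so a large file is never held in memory as raw text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse "Class=Name" headers followed by "key = value" lines from a filehandle.
sub parse_records {
    my ($fh) = @_;
    my (%classes, $current);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;    # tolerate Unix and DOS line endings
        if ($line =~ /^Class\s*=\s*(\S+)/) {
            $current = $1;
            $classes{$current} = {};
        }
        elsif (defined $current && $line =~ /^(\w+)\s*=\s*(.*?)\s*$/) {
            $classes{$current}{$1} = $2;
        }
    }
    return \%classes;
}

# Follow a record's "ref" field to the record it names (undef if there is none).
sub resolve_ref {
    my ($classes, $name) = @_;
    my $record = $classes->{$name} or return;
    my $target = $record->{ref};
    return defined $target ? $classes->{$target} : undef;
}

# Demo on an in-memory copy of the sample; a real run would open the file instead.
my $sample = "Class=Instance1\nparameterA = 1\nref = Instance2\n"
           . "Class=Instance2\nparameterA = 2\n";
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $classes = parse_records($fh);
close $fh;

my $linked = resolve_ref($classes, 'Instance1');
print "Instance1 refers to a record with parameterA = $linked->{parameterA}\n";
```

Storing the target name and resolving it on demand keeps the parser single-pass, so a record may refer to one that appears later in the file.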

Related

To replace a number with strings in an XML file using a Perl script

I have an XML file. I need to replace the digits in comment="18" with comment="my string", where my string comes from my @array ($array[18] = my string).
<rule ccType="inst" comment="18" domain="icc" entityName="thens" entityType="toggle" excTime="1605163966" name="exclude" reviewer="hpanjali" user="1" vscope="default"></rule>
This is what I have tried.
while (my $line = <FH>) {
chomp $line;
$line =~ s/comment="(\d+)"/comment="$values[$1]"/ig;
#print "$line \n";
print FH1 $line, "\n";
}

Here is an example using XML::LibXML:
use strict;
use warnings;
use XML::LibXML;
my $fn = 'test.xml';
my @array = map { "string$_" } 0..20;
my $doc = XML::LibXML->load_xml(location => $fn);
for my $node ($doc->findnodes('//rule')) {
my $idx = $node->getAttribute('comment');
$node->setAttribute('comment', $array[$idx]);
}
print $doc->toString();

Here's an XML::Twig example. It's basically the same idea as the XML::LibXML example done in a different way with a different tool:
use XML::Twig;
my $xml =
qq(<rule ccType="inst" comment="18"></rule>);
my @array;
$array[18] = 'my string';
my $twig = XML::Twig->new(
twig_handlers => {
rule => \&update_comment,
},
);
$twig->parse( $xml );
$twig->print;
sub update_comment {
my( $t, $e ) = @_;
my $n = $e->{att}{comment};
$e->set_att( comment => $array[$n] );
}
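For a quick one-off edit where installing an XML module is not possible, the substitution from the question can be made a little safer with the `/e` modifier. This is only a sketch under stated assumptions: the attribute is always written exactly as `comment="digits"` with double quotes, and the replacement text needs no XML escaping. The `replace_comment` name and the sample `@values` contents are illustrative. A real XML parser, as in the answers above, remains the more robust choice.

```perl
use strict;
use warnings;

my @values;
$values[18] = 'my string';    # sample data; the real array comes from elsewhere

# Swap comment="N" for comment="$values[N]", leaving N alone when no entry exists.
sub replace_comment {
    my ($line, $lookup) = @_;
    $line =~ s{\bcomment="(\d+)"}{
        my $text = $lookup->[$1];
        defined $text ? qq(comment="$text") : qq(comment="$1")
    }ge;
    return $line;
}

my $xml = '<rule ccType="inst" comment="18" domain="icc"></rule>';
print replace_comment($xml, \@values), "\n";
```

Checking `defined` before substituting avoids writing an empty attribute (and an "uninitialized value" warning) when the index has no entry in the array.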

How to read data from files whose names are listed in another file in Perl?

I have a file, list2.txt, that lists several file names such as 02x5.txt and 03x5.txt. How can I read the data inside each listed file (for example 02x5.txt) in one pass? 02x5.txt contains entries such as height:20, length:5, colour:blue.
inside list2.txt:
02x5.txt
03x5.txt
inside 02x5.txt:
height:20,
length:5,
colour:blue
inside 03x5.txt:
height:25,
length:10,
colour:green
#!/usr/bin/perl
use strict;
use warnings;
# Reading a line from a file (or rather from a filehandle)
my $filename = "list2.txt";
if (open my $data, "<", $filename) {
while (my $row = <$data>) {
chomp $row;
if ($row =~ m/02x5.txt$/ ){
my $m = $row;
print "$m\n";
}
}
}
How can I read the height and length data from a particular txt file?
Thank you

Please see the following code, which performs the tasks you described. The data is stored in the hash %data, which you can use any way you like.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $debug = 1;
my $filename = 'list2.txt';
my %data;
open my $fh, '<', $filename
or die "Couldn't open $filename";
my @filenames = <$fh>;
close $fh;
chomp @filenames;
foreach $filename (@filenames) {
open $fh, '<', $filename
or die "Couldn't open $filename";
while( <$fh> ) {
chomp;
my($k,$v) = split ':';
$data{$filename}{$k} = $v;
}
close $fh;
}
say Dumper(\%data);
Output
$VAR1 = {
'02x5.txt' => {
'colour' => 'blue',
'height' => '20, ',
'length' => '5, '
},
'03x5.txt' => {
'length' => '10, ',
'colour' => 'green',
'height' => '25, '
}
};

The answer is always to break the problem down into smaller chunks.
#!/usr/bin/perl
use strict;
use warnings;
my @files = get_list_of_files();
my %data;
foreach my $file (@files) {
my $file_data = get_data_from_file($file);
$data{$file} = $file_data;
}
# You now have your data in a hash called %data.
# Do whatever you want with it.
sub get_list_of_files {
open my $list_fh, '<', 'list2.txt'
or die $!;
my @files = <$list_fh>;
chomp(@files);
return @files;
}
sub get_data_from_file {
my ($filename) = @_;
my $record;
open my $fh, '<', $filename or die $!;
while (<$fh>) {
chomp;
# Remove trailing comma
s/,$//;
my ($key, $value) = split /:/;
$record->{$key} = $value;
}
return $record;
}

Build a hash by taking input from file, append the values if key is not unique

I have this file
affaire,chose,question
chose,emploi,fonction,service,travail,tâche
cause,chose,matière
chose,point,question,tête
chose,objet,élément
chose,machin,truc
I would like to have an associative array like this :
affaire => chose, question
cause => chose, matière
chose => emploi, fonction, service, travail, tache, point, question, tete, objet élément, machin, truc
or even better, whenever I found a new word, save the word as a key and the context (left or/and right) as a value... So for example:
affaire => chose, question
cause => chose, matière
chose => affaire, question, cause, matière, emploi, fonction, service, travail, tache, point, question, tete, objet élément, machin, truc
At present time I'm trying to create the associative array in this way:
$in = "test.txt";
$out = "res_test.txt";
open(IN, "<", $in);
open(OUT, ">", $out);
%list = '';
while(defined($l = <IN>)){
if ($l =~ /((\w+),(.*))/){
#2,3
$list{$2} = $3;
}
}
while(my($k,$v) = each(%list)){
print OUT $k." => ".$v."\n";
}
But the result is:
affaire => chose,question
=>
chose => machin,truc
cause => chose,matière
Why doesn't it add new values?
Thank you for help.

You overwrite the old hash values when you actually want to append to them, so
one solution is to concatenate the strings,
my %list;
while (my $l = <IN>) {
if ($l =~ /((\w+),(.*))/) {
# $list{$2} //= ""; # initialize to empty string
# # add comma in front depending on $list{$2} content
# $list{$2} .= length($list{$2}) ? ",$3" : $3;
if (defined $list{$2}) { $list{$2} .= ",$3" }
else { $list{$2} = $3 }
}
}
or to use the more common hash of arrays for storing the values,
my %list;
while (my $l = <IN>) {
chomp $l; # keep the newline out of the last value
my ($k, @vals) = split /,/, $l;
push @{ $list{$k} }, @vals;
}
use Data::Dumper; print Dumper \%list;

Each time you read a new value, you assign it to the hash key, which overwrites the old value.
A simple fix:
#!/usr/bin/perl
use strict;
use warnings;
my $in = "in";
my $out = "out";
open IN, "<", $in
or die "$!";
open OUT, ">", $out
or die "$!";
my %list = ();
while (defined(my $l = <IN>)) {
if ($l =~ /(\w+),(.*)/) {
$list{$1} .= exists($list{$1}) ? ",$2" : $2;
}
}
while(my($k,$v) = each(%list)){
print OUT $k." => ".$v."\n";
}

use Data::Dumper;
$in = "test.txt";
$out = "res_test.txt";
open(IN, "<", $in);
open(OUT, ">", $out);
%list = ();
while(defined($l = <IN>)){
chomp($l);
if ($l =~ /((\w+),(.*))/){
#2,3
push @{ $list{$2} }, $3;
}
}
foreach $k (sort keys %list) {
my @val = @{$list{$k}};
print join ', ', sort @val;
print ".\n";
}
It works!

In a hash (associative array), the keys must be unique. That is why chose causes issues in your case.
#!/usr/bin/perl
# your code goes here
use strict;
use warnings;
use Data::Dumper;
my %hash;
while (my $line = <DATA>) {
chomp $line;
my (@values) = split /,/,$line;
my $key = shift @values;
if(exists $hash{$key}){
my $ref_value = $hash{"$key"};
push @values, @$ref_value;
$hash{"$key"} = [@values];
}
else{
$hash{"$key"} = [@values];
}
}
print Dumper \%hash;
__DATA__
affaire,chose,question
chose,emploi,fonction,service,travail,tâche
cause,chose,matière
chose,point,question,tête
chose,objet,élément
chose,machin,truc
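The second request in the question (store every word as a key, with the other words on its line as context) is not covered by the answers above. Here is a minimal sketch, assuming each line is one comma-separated group and every other word on the line counts as context. The `build_context` name is made up for illustration. A hash of hashes is used so repeated context words are stored only once. The demo uses two ASCII-only lines from the sample; for the accented words, the input would also need to be read as UTF-8.

```perl
use strict;
use warnings;

# Map every word to the set of words that share a line with it.
sub build_context {
    my (@lines) = @_;
    my %context;
    for my $line (@lines) {
        my @words = grep { length } split /\s*,\s*/, $line;
        for my $word (@words) {
            # Inner hash keys act as a set, so duplicates collapse.
            $context{$word}{$_} = 1 for grep { $_ ne $word } @words;
        }
    }
    return \%context;
}

my $context = build_context(
    'affaire,chose,question',
    'chose,machin,truc',
);

for my $word (sort keys %$context) {
    print "$word => ", join(', ', sort keys %{ $context->{$word} }), "\n";
}
```

With these two lines, chose ends up with affaire, machin, question and truc as its context, because it appears on both lines.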

Perl: How to extract sequences based on gene number and nucleotide length?

I have 2 files, as follows:
file1.txt:
0 117nt, >gene_73|GeneMark.hm... *
0 237nt, >gene_3097|GeneMark.... *
0 237nt, >gene_579|GeneMark.h... *
0 237nt, >gene_988|GeneMark.h... *
0 189nt, >gene_97|GeneMark.hm... *
0 183nt, >gene_97|GeneMark.hm... *
file2.fasta:
>gene_735|GeneMark.hmm|237_nt|+|798985|799221
TTGTGGTTCGTGCCGCGCGACGCGTTGCGTCTGCAAACGCCCGACGAAGACATCGCGACCTATCTGTTCAACAAGCATGTGATTCGGCATCGGTTCTGTCCGACCTGCGGGATTCATCCGTTCGCGGAAGGCACGGACCCGAAGGGCAACGCGATGGCGGCCGTCAATCTTCGCTGCGTCGACGGCGTCGATCTCGACGCGTTGAGCGTCCGCCATTTCGACGGGCGCGCGCTCTGA
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCCAACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTCAACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACCAAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_1876|GeneMark.hmm|234_nt|-|2168498|2168731
ATGCTGTTCTTTTCGCGCGCGGGCGTGTCGCGTGCGGCCGGCGGCCAATCATGCGGCGAGTCGTTTTGTCGCGGCTCGCGGCGCTTGCCGACGTTGGAATCGCGCGCGCCGATGCGCGGATCGGGGCGGCAACGTTTGCGTATGAGGAATGATGCGTTTGCGCATCGGGAATGGGCGCCTCGCCCCGGTTTCGCCGCGATTCCGCCCGACTCGAGGCAGTCGTTTTTCCGCTAA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGGCGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGCTCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAAACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCGGTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTCCGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGGCTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|105_nt|+|90122|90226
GTGACGCGTTTCGCGACGCGCGTCGATGGGGCGGGCGCGAAACCCGTTCGCCGCGATGCGGCGGACGGGGTATGGCCGAGCGCCGTCCGTCGCGGCGAGAGTTGA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTCATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGCGACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCACTAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCGAACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGCGATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGCCATCGATGA
>gene_97|GeneMark.hmm|234_nt|+|105494|105727
ATGAAGATTCAAATCGCCATTGTTTATTTTGTCGCCCGTCACGCAAACGAGCAGGCGCGAAGCGGATCGGCGCGCATTGGCGAAGAGCCGGCGCGCATCGGCATCGCGCTCGCGCGACACATGCGCGCCGCGCGCGGCCGGTCGACGCCGGATTCGCCTGTCGATCGATCCGGTGCGCCCCGAGCCGATGAGCGGTACGCTTCGGCGCGCGCGCGACACGCGCGACACGCGTGA
>gene_979|GeneMark.hmm|225_nt|-|1115442|1115666
TTGATCGACGCGCGGGGCCGGCCGGGCCGCGGGGTATCGAAGGCGATCGACGCGCAACACGAATCGCCGCCGCGCGCCGAAACCTCGCTATGCGCGTCGCGCGCACGCGCGGCCGGCGGCGCACGCGCGGGTGTGCGCGGGCCGGCGGCGCGGCCGCTCGCACTGCGCGACCGCTCGCGCGCACGCCTTCCTCGGCACGCGCCGGGAATCCCGGCCCTTCAATGA
The output that I expect is:
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCCAACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTCAACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACCAAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGGCGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGCTCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAAACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCGGTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTCCGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGGCTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTCATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGCGACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCACTAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCGAACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGCGATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGCCATCGATGA
There are 4 sequences with gene number 97, all of different lengths. I want only the sequences whose gene number and length are both listed in file1.txt to be written to output.fasta. What I've done so far is as follows (it fails with errors):
#!/usr/bin/perl
use strict;
use warnings;
my @genes;
open my $list, '<file1.txt';
while (my $line = <$list>) {
push (@genes, $1) if $line =~/\>(.*?)\|/gs;
}
my $tag1 = "0\t";
my $tag2 = "nt";
while (my $line = <$list>) {
if ($line =~ /$tag1(.*?)$tag2/) {
my $match1 = $1;
}
}
my $input;
{
local $/ = undef;
open my $fasta, '<file2.fasta';
my $tag3 = "GeneMark.hmm";
my $tag4 = "_nt";
while (my $input = <$fasta>) {
if ($input =~ /$tag3(.*?)$tag4/) {
my $match2 = $1; }}
close $fasta;
}
my @lines = split(/>/,$input);
foreach my $l (@lines) {
if ($l =~ /(.+?)\|/) {
my $real_name = $1;
if ($real_name ~~ @genes) {
if ($match2 = $match1) {
open (OUTFILE, '>>output.fasta');
print OUTFILE ">$l"; }
}
}
}
Can anyone give me some guide to correct the code? Or is there any better way to do this? Any help will be very much appreciated! Thanks! :)

Here's an option that uses Bio::SeqIO:
use strict;
use warnings;
use Bio::SeqIO;
my %hash;
open my $fh, '<', $ARGV[0] or die $!;
while (<$fh>) {
push @{ $hash{$2} }, $1 if /\s+(\d+)nt,.+?>(gene_\d+)\|/;
}
close $fh;
my $in = Bio::SeqIO->new( -file => $ARGV[1], -format => 'Fasta' );
my $out = Bio::SeqIO->new( -fh => \*STDOUT, -format => 'Fasta' );
while ( my $seq = $in->next_seq() ) {
$out->write_seq($seq)
if $seq->id =~ /(gene_\d+)\|.+?\|(\d+)_nt\|/ and grep /$2/, @{ $hash{$1} };
}
Usage: perl script.pl file1.txt file2.fasta [>outFile.fasta]
The optional redirection at the end sends the output to a file.
Output from your data:
>gene_579|GeneMark.hmm|237_nt|+|667187|667423
ATGTACCACGGCGCCGAATTTGCCGCTGCCAAGGGCATGCGCTGGCTGCGAGATGCCGCC
AACGGCTCTGCCTTCATCGCACCGGGCAGTCCGTGGCAAAACGGTTTCGTCGAGCGTTTC
AACGGCAAGCTGCATGACGAATTGCTGAACCGGGAATGGTTCCGCGGCCGTGCCGAGACC
AAGATGCTCATCGAACGCTCCGGCTACGGTCCGTCGAGTCTGACCGGATTCCGATGA
>gene_3097|GeneMark.hmm|237_nt|-|3467022|3467258
GTGTCGAACGAACGTCGCGGCGAACGGCCGCTGCGGGCATCGCCGCAGGACGTCACACGG
CGAACGTCGCGCGCGATCCTCGGCGGCCGCGAACGTGGGCCGTCCCGTGGCACGTTCGGC
TCGCTCGGCATGGCGAACGACCGCCGCATCGCGCATCGCCGTCGCGCGGCCTCCAAAAAA
ACGGCGGTCAGCGACCGCCGGCTTTGGCCGAAACCGATGCGTCGTACGAATCAGTGA
>gene_988|GeneMark.hmm|237_nt|+|1121027|1121263
ATGACCTTGTCAGGCAACATCAAGGACGGCGACTGGACGGTCGAGGTGACGACATCGCCG
GTGCAGGGCGGTTACGTGTGCGACATCGAGGTGATGCACGGCGCGCCGGGCGGCGCGTTC
CGGCACGCGTTCCGGCACGGCGGCACTTATCCGGCCGAGCGCGACGCGATGATCGAGGGG
CTGCGCGCGGGCATGACCTGGATCGAGCTGAAGATGTCGAAAGCATTCAATCTGTAA
>gene_97|GeneMark.hmm|183_nt|-|107002|107184
ATGGAGGCAATCGTGATCGAGCAAGTGATACTGGGCGTCTTTCTCGTACTGCCGCTTCTC
ATCGTCGCGGTGCTGTACTCCGACGAACTCTGGCAAGAACACCGCCTGCAGCATCCGCGC
GACGAGCACACGCCACATATCGACTGGCGTCATCCGTGGCGGATCCTGCGGCGAGGGCAC
TAA
>gene_97|GeneMark.hmm|189_nt|-|98624|98812
GTGAAATACACGAGCGACCATTACGCGGGCGTCAAATTTGGCGCGCTGTACGGGTTCTCG
AACGCGGCGAACTTCGCCGACAACCGCGCTCGCCGGCGCATGCGCGGCGTTCGCATACGC
GATCGGCAAAAGCGGCGTGATGTGCGGTTGCCTGCCGCGCTCGCGCTATGCGCGGCACGC
CATCGATGA
Bio::SeqIO lives to parse fasta (and other such) files, so the above leverages this capability. After creating a hash of arrays (HoA) from file1.txt, the fasta file is processed, and only matching fasta records are printed.
Hope this helps!
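If BioPerl is not installed, the same filter can be sketched in core Perl. This assumes each FASTA record is a header line followed by its sequence lines, and that headers follow the `>gene_N|...|LEN_nt|` layout shown in the question. The names `wanted_keys` and `filter_fasta` are made up for illustration, and the sequences in the demo are shortened placeholders. Gene and length are checked with an exact hash lookup rather than a pattern match, so a listed length cannot accidentally match a longer number that contains it.

```perl
use strict;
use warnings;

# Collect "gene_N|LEN" keys from list lines such as "0 237nt, >gene_579|GeneMark.h... *".
sub wanted_keys {
    my ($fh) = @_;
    my %wanted;
    while (my $line = <$fh>) {
        $wanted{"$2|$1"} = 1 if $line =~ /\b(\d+)nt,\s*>(gene_\d+)\|/;
    }
    return \%wanted;
}

# Return only the FASTA records whose header gene and length are both wanted.
sub filter_fasta {
    my ($fh, $wanted) = @_;
    my ($keep, $out) = (0, '');
    while (my $line = <$fh>) {
        if ($line =~ /^>/) {
            # A new header decides whether the following sequence lines are kept.
            $keep = ( $line =~ /^>(gene_\d+)\|[^|]*\|(\d+)_nt\|/
                      && $wanted->{"$1|$2"} ) ? 1 : 0;
        }
        $out .= $line if $keep;
    }
    return $out;
}

# Demo on in-memory data; real use would open file1.txt and file2.fasta instead.
my $list  = "0\t183nt, >gene_97|GeneMark.hm... *\n";
my $fasta = ">gene_97|GeneMark.hmm|105_nt|+|90122|90226\nGTGACG\n"
          . ">gene_97|GeneMark.hmm|183_nt|-|107002|107184\nATGGAG\n";

open my $list_fh,  '<', \$list  or die "Cannot open list: $!";
open my $fasta_fh, '<', \$fasta or die "Cannot open fasta: $!";
print filter_fasta($fasta_fh, wanted_keys($list_fh));
```

Unlike Bio::SeqIO, this keeps the sequence lines exactly as they appear in the input and does not re-wrap them.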

Perl Regular Expression to insert/substitute in a string at specific places

Given a URL, the following regular expression is able to insert/substitute words at certain points in the URL.
Code:
#!/usr/bin/perl
use strict;
use warnings;
#use diagnostics;
my @insert_words = qw/HELLO GOODBYE/;
my $word = 0;
my $match;
while (<DATA>) {
chomp;
foreach my $word (@insert_words)
{
my $repeat = 1;
while ((my $match=$_) =~ s|(?<![/])(?:[/](?![/])[^/]*){$repeat}[^/]*\K|$word|)
{
print "$match\n";
$repeat++;
}
print "\n";
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
10.15.16.17/dog/cat/rabbit/
The output given (for the first example url in __DATA__ with the HELLO word):
http://www.stackoverflow.com/dogHELLO/cat/rabbit/
http://www.stackoverflow.com/dog/catHELLO/rabbit/
http://www.stackoverflow.com/dog/cat/rabbitHELLO/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
Where I am now stuck:
I would now like to alter the regular expression so that the output will look like what is shown below:
http://www.stackoverflow.com/dogHELLO/cat/rabbit/
http://www.stackoverflow.com/dog/catHELLO/rabbit/
http://www.stackoverflow.com/dog/cat/rabbitHELLO/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
#above is what it already does at the moment
#below is what i also want it to be able to do as well
http://www.stackoverflow.com/HELLOdog/cat/rabbit/ #<-puts the word at the start of the string
http://www.stackoverflow.com/dog/HELLOcat/rabbit/
http://www.stackoverflow.com/dog/cat/HELLOrabbit/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
http://www.stackoverflow.com/HELLO/cat/rabbit/ #<- now also replaces the string with the word
http://www.stackoverflow.com/dog/HELLO/rabbit/
http://www.stackoverflow.com/dog/cat/HELLO/
http://www.stackoverflow.com/dog/cat/rabbit/HELLO
But I am having trouble getting it to automatically do this within the one regular expression.
Any help with this matter would be highly appreciated, many thanks

One solution:
use strict;
use warnings;
use URI qw( );
my @insert_words = qw( HELLO );
while (<DATA>) {
chomp;
my $url = URI->new($_);
my $path = $url->path();
for (@insert_words) {
# Use package vars to communicate with /(?{})/ blocks.
local our $insert_word = $_;
local our @paths;
$path =~ m{
^(.*/)([^/]*)((?:/.*)?)\z
(?{
push @paths, "$1$insert_word$2$3";
if (length($2)) {
push @paths, "$1$insert_word$3";
push @paths, "$1$2$insert_word$3";
}
})
(?!)
}x;
for (@paths) {
$url->path($_);
print "$url\n";
}
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
http://10.15.16.17/dog/cat/rabbit/

Without crazy regexes:
use strict;
use warnings;
use URI qw( );
my @insert_words = qw( HELLO );
while (<DATA>) {
chomp;
my $url = URI->new($_);
my $path = $url->path();
for my $insert_word (@insert_words) {
my @parts = $path =~ m{/([^/]*)}g;
my @paths;
for my $part_idx (0..$#parts) {
my $orig_part = $parts[$part_idx];
local $parts[$part_idx];
{
$parts[$part_idx] = $insert_word . $orig_part;
push @paths, join '', map "/$_", @parts;
}
if (length($orig_part)) {
{
$parts[$part_idx] = $insert_word;
push @paths, join '', map "/$_", @parts;
}
{
$parts[$part_idx] = $orig_part . $insert_word;
push @paths, join '', map "/$_", @parts;
}
}
}
for (@paths) {
$url->path($_);
print "$url\n";
}
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
http://10.15.16.17/dog/cat/rabbit/

One more solution:
#!/usr/bin/perl
use strict;
use warnings;
my @insert_words = qw/HELLO GOODBYE/;
while (<DATA>) {
chomp;
/(?<![\/])(?:[\/](?![\/])[^\/]*)/p;
my $begin_part = ${^PREMATCH};
my $tail = ${^MATCH} . ${^POSTMATCH};
my @tail_chunks = split /\//, $tail;
foreach my $word (@insert_words) {
for my $index (1..$#tail_chunks) {
my @new_tail = @tail_chunks;
$new_tail[$index] = $word . $tail_chunks[$index];
my $str = $begin_part . join "/", @new_tail;
print $str, "\n";
$new_tail[$index] = $tail_chunks[$index] . $word;
$str = $begin_part . join "/", @new_tail;
print $str, "\n";
}
print "\n";
}
}
__DATA__
http://www.stackoverflow.com/dog/cat/rabbit/
http://www.superuser.co.uk/dog/cat/rabbit/hamster/
10.15.16.17/dog/cat/rabbit/
