I have two text files. I want to take the text that follows each <sup>N</sup> tag in the first file (that is, the text between </sup> and the next <sup>) and insert it into the empty {} in the second file.
A better example (something like a dictionary):
Text1:
<sup>1</sup>dog
<sup>2</sup>cat
<sup>3</sup>lion
<sup>1</sup>flower
<sup>2</sup>tree
.
.
Text2:
\chapter1
\pkt{1}{}{labrador retirever is..}
\pkt{2}{}{home pets..}
\pkt{3}{}{wild cats..}
\chapter2
\pkt{1}{}{red rose}
\pkt{2}{}{lemon tree}
.
.
What I want:
Text3:
\chapter1
\pkt{1}{dog}{labrador retirever is..}
\pkt{2}{cat}{home pets..}
\pkt{3}{lion}{wild cats..}
\chapter2
\pkt{1}{flower}{red rose}
\pkt{2}{tree}{lemon tree}
The text is arbitrary, but you can see what I want. Perl would be best.
So take
</sup>**text**<sup>
and paste it into
\pkt{nr}{**here**}{this is the translation of the word, already stored in Text2}.
Text1 and Text2 are in the same order, so it would be great if I could read the first </sup>text<sup> from Text1, save it in a temporary variable, delete that line from Text1, put it into the first free {} slot in Text2, and start over. The numbers will match because the order is preserved.
Sorry for my English :)
Thanks!
This code puts all dictionary items into an array, in the order they appear. It then loops over the TeX file, and each time \pkt{num}{} is found, the next item from the array is inserted.
Newlines in dictionary entries are replaced with spaces (just remove that substitution from the map if you don't want this behavior). \pkt will be found as long as the \pkt{num}{} part does not span multiple lines. Otherwise, I think the easiest solution would be to undef $/ (the input record separator), read the whole file into a string, and loop the replacement over it (this could be a bit memory hungry, though).
#!/usr/bin/perl -wT
use strict;
my $dict_filename = 'text1';
my $tex_filename = 'text2';
my $out_filename = 'text3';
open(DICT, $dict_filename) or die "Can't open $dict_filename: $!";
my @dict;
{
    # Set the input record separator to <sup>
    local $/ = '<sup>';
    # Throw away the first "line"; it holds no dictionary entry
    <DICT>;
    # Extract the string and replace newlines with spaces
    @dict = map { $_ =~ m#</sup>\s*(.*?)\s*(?:<sup>|$)#s; $_ = $1; $_ =~ s/\n/ /g; $_; } <DICT>;
}
close(DICT);
open(TEX, $tex_filename) or die "Can't open $tex_filename: $!";
open(OUT, ">$out_filename") or die "Can't open $out_filename: $!";
my $tex_line;
my $dict_pos = 0;
while($tex_line = <TEX>)
{
    # Replace any \pkt{num}{} with \pkt{num}{text}
    $tex_line =~ s|(\\pkt\{\d+\}\{)(\})|$1$dict[$dict_pos++]$2|g;
    print OUT $tex_line;
}
close(TEX);
close(OUT);
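The whole-file alternative mentioned above (read everything into a string, then loop the replacement) might look roughly like the sketch below. The two strings are made-up stand-ins for the contents of text1 and text2, and the pattern additionally tolerates whitespace or a newline between \pkt{num} and the empty {}; both are assumptions for illustration, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the slurped contents of text1 and text2
my $dict_text = "<sup>1</sup>dog\n<sup>2</sup>cat\n<sup>1</sup>flower\n";
my $tex_text  = "\\chapter1\n\\pkt{1}{}{labrador..}\n\\pkt{2}\n{}{home pets..}\n"
              . "\\chapter2\n\\pkt{1}{}{red rose}\n";

# Collect every dictionary entry, in order of appearance
my @dict = $dict_text =~ m{</sup>\s*(.*?)\s*(?=<sup>|\z)}sg;

# Fill each empty second argument; \s* lets the match span a line break
my $pos = 0;
(my $out = $tex_text) =~ s!(\\pkt\{\d+\}\s*\{)(\})!$1 . ($dict[$pos++] // '') . $2!ge;

print $out;
```

With real files, the two strings would come from reading each file with $/ undefined, at the cost of holding both files in memory at once.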
I assume some sort of regex would be used to accomplish this?
I need each word to consist of 2 or more characters, start with a letter, and have its remaining characters be letters, digits, or underscores.
This is the code I currently have, although it isn't very close to my desired output:
while (my $line=<>) {
# remove leading and trailing whitespace
$line =~ s/^\s+|\s+$//g;
$line = lc $line;
@array = split / /, $line;
foreach my $a (@array){
$a =~ s/[\$#\@~!&*()\[\];.,:?^ `\\\/]+//g;
push(@list, "$a");
}
}
A sample input would be:
#!/usr/bin/perl -w
use strict;
# This line will print a hello world line.
print "Hello world!\n";
exit 0;
And the desired output would be (alphabetical order):
bin
exit
hello
hello
line
perl
print
print
strict
this
use
usr
will
world
my @matches = $string =~ /\b([a-z][a-z0-9_]+)/ig;
If case-insensitive matching needs to apply only to a subpattern, the modifier can be embedded in it:
/... \b((?i)[a-z][a-z0-9_]+) .../
(or it can be turned off after the subpattern: (?i)pattern(?-i)).
Note that [a-zA-Z0-9_] can be written as \w, a "word character", if that's indeed exactly what is needed.
The above regex picks out the words as required, with no need to first split the line on spaces as the shown program does. It can be applied to the whole line (or to the whole text, for that matter), perhaps after the shown stripping of the various special characters.†
There is the question of some other cases: what about hyphens? Apostrophes? Tildes? Those aren't found in identifiers, and while this appears to be intended for processing program text, comments are included; what other legitimate characters may there be?
A note on splitting on whitespace
The shown split / /, $line splits on exactly one space. Better is split /\s+/, $line, or, better yet, split's special pattern split ' ', $line, which splits on any run of consecutive whitespace and discards leading and trailing whitespace.
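The difference between the three split forms can be seen in a small sketch; the sample string is made up for illustration.

```perl
use strict;
use warnings;

# Made-up line with leading, trailing, and repeated whitespace (including a tab)
my $line = "  use   strict;\tuse warnings;  ";

my @single  = split / /,   $line;   # one field boundary per single space: empty fields appear
my @regex   = split /\s+/, $line;   # runs of whitespace collapse, but a leading empty field remains
my @special = split ' ',   $line;   # leading whitespace is skipped: only the words remain

printf "%d %d %d\n", scalar @single, scalar @regex, scalar @special;
print join('|', @special), "\n";
```

Only the special ' ' pattern returns just the four words; the other two forms leave empty strings that would have to be filtered out afterwards.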
† The shown example is processed as desired by the given regex alone:
use strict;
use warnings;
use feature 'say';
use Path::Tiny qw(path); # convenience, to slurp the file
my $fn = shift // die "Usage: $0 filename\n";
my @matches = sort map { lc }
path($fn)->slurp =~ /\b([a-z][a-z0-9_]+)/ig;
say for @matches;
I threw in sorting and lower-casing to match the sample code in the question but all processing is done with the shown regex on the file's content in a string.
Output is as desired (except that line and world come out twice here, which is correct).
Note that lc can instead be applied to the string with the file content, which is then processed with the regex; that is more efficient. While this is not the same in principle, in this case it may be done:
perl -MPath::Tiny -wE'$f = shift // die "Need filename\n";
@m = sort lc(path($f)->slurp) =~ /\b([a-z]\w+)/ig;
say for @m'
Here I actually used \w. Adjust to the actual characters to match, if different.
Curiously, this can be done with one of those long, typical Perl one-liners
$ perl -lwe'print for sort grep /^\pL/ && length > 1, map { split /\W+/ } map lc, <>' a.txt
bin
exit
hello
hello
line
line
perl
print
print
strict
this
use
usr
will
world
world
Let's go through that and see what we can learn. The line reads from right to left.
a.txt is the argument: the file to read.
<> is the diamond operator, reading the lines from the file. Since this is list context, it will exhaust the file handle and return all the lines.
map lc, short for map { lc($_) }, applies the lc function to each line and returns the results.
map { split /\W+/ } is a multi-purpose operation. It removes the unwanted characters (the non-word characters), splits the line at those places, and returns a list of all the resulting words.
grep /^\pL/ && length > 1 keeps only the strings that begin with a letter (\pL) and are longer than one character, and returns them.
sort sorts the list coming in from the right alphabetically and returns it to the left.
for is a for-loop, applied to the incoming list, in the post-fix style.
print is short for print $_, and it will print once for each list item in the for loop.
The -l switch in the perl command will "fix" line endings for us (remove them from input, add them in output). This will make the print pretty at the end.
I won't say this will produce a perfect result, but you should be able to pick up some techniques to finish your own program.
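For readers who find the right-to-left chain hard to follow, the same steps can be written out one statement at a time. This is only a sketch: the input lines are a hard-coded excerpt of the sample, whereas the one-liner reads them from a file.

```perl
use strict;
use warnings;

# Hard-coded stand-in for the lines that <> would read from a.txt
my @lines = ("#!/usr/bin/perl -w\n", "print \"Hello world!\\n\";\n", "exit 0;\n");

my @lower  = map { lc } @lines;                          # lower-case every line
my @words  = map { split /\W+/ } @lower;                 # split on runs of non-word characters
my @kept   = grep { /^\pL/ && length($_) > 1 } @words;   # start with a letter, at least 2 chars
my @sorted = sort @kept;                                 # alphabetical order

print "$_\n" for @sorted;
```

Each intermediate array corresponds to one link of the chain, which makes it easy to print a stage and see where an unwanted string (such as the empty leading field from the shebang line) gets filtered out.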
I have a perl script that reads text file line by line and splits the line into 4 different columns (shown by dashes & referred to as $cols[0-3] in code; important parts are bolded). For each distinct value before the decimal point in column 0, it should randomly generate a hex color.
Essentially, I need to compare if the Xth column in the current line matches that of the previous line.
A----last_column----221----18
A----last_column----221----76
A----last_column----221----42
B----last_column----335----18
C----last_column----467----83
So far, I am randomly generating a new #random_hex_color for every line, but desired output is below:
221.18-------#EB23AE1-------#$some/random/path/A.txt-------last_column
221.76-------#EB23AE1-------#$some/random/path/A.txt-------last_column
221.42-------#EB23AE1-------#$some/random/path/A.txt-------last_column
335.18-------#AC16D6E-------#$some/random/path/B.txt-------last_column
467.83-------#FD89A1C-------#$some/random/path/C.txt-------last_column
my @cols;
my $row;
my $color = color_gen();
my $path = "\t#\some_random_path/";
my $newvar = dir_contents();
my @array = ($color, $path, $newvar);
my %hash;
while ($row = <$fh>){
    next if $row =~ /^(#|\s|\t)/; #skip lines beginning with comments and spaces
    @cols = split(" ", $row);
    %hash = (
        "$cols[2]" => ["$color", "$path", "$newvar"]
    );
    say Dumper (\%hash);
    print("$cols[2].$cols[3]\t#");
    print(color_gen());
    printf("%-65s", $path.dir_contents());
    print("\t\t$cols[0]_"."$cols[1]"." 1 1\n");
}
Use a hash to store, and thus be able to check for, the distinct values in the first column.
I assume that color_gen() returns a new random color at each invocation. The desired output is unclear to me so it is only indicated in the code.
use warnings;
use strict;
my $file = shift @ARGV;
die "Usage: $0 filename\n" if not $file or not -f $file;
open my $fh, '<', $file or die "Can't open $file: $!";
my %c0;
while (<$fh>) {
    next if /^(?:\s*$|\s*#)/; # skip: spaces only or empty, comment
    my @cols = split;
    my ($num) = $cols[0] =~ /^([0-9]+)/;
    if (not exists $c0{$num}) { # this number not seen yet; assign color
        $c0{$num} = color_gen();
    }
    # write line of output, with $c0{$num} and @cols
}
The value "before the decimal point in column 0" is extracted using regex as the leading number in that string and stored in $num. The parens around are needed to provide the list context for the match operator, in which case it returns the captured values. See perlretut.
This number is stored as a key in a hash, with the generated color as its value. If the key already exists, the number has been seen before and already has a color, so no new one is generated. This way you can keep track of the distinct numbers in that column. Then you can write output using $c0{$num}.
This can be written far more compactly but I hoped for clarity.
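As one possible compact form, the exists check and the assignment can be folded into a single //= (defined-or assignment). The color_gen below is a hypothetical stand-in for the question's function, and the rows are made-up samples with the number in column 0, as this answer assumes.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the question's color_gen(): a random 6-digit hex color
sub color_gen { return sprintf '#%06X', int rand 0x1000000 }

my @rows = ('221.18 A', '221.76 A', '335.18 B');   # made-up sample rows
my (%c0, @out);
for my $row (@rows) {
    my @cols = split ' ', $row;
    my ($num) = $cols[0] =~ /^([0-9]+)/ or next;   # skip rows without a leading number
    $c0{$num} //= color_gen();                     # assign a color only on first sight
    push @out, "$cols[0]\t$c0{$num}";
}
print "$_\n" for @out;
```

Rows that share the number before the decimal point come out with the same color, while each new number triggers exactly one call to color_gen.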
The skipped lines here aren't those "beginning with comments and spaces" but are ones with only spaces (or empty), or comments. If you really mean to skip lines that merely start with whitespace (or #) then indeed use /^(?:\s|#)/, where ?: makes () only group and not capture.
A few comments on the code:
Always have use warnings; and use strict; at the beginning of each program.
The \s in a regex matches most types of whitespace, so there is no need for a separate pattern for tab.
A variable can be declared right in the while condition, which scopes it to exactly that loop. However, you can also omit it and use $_.
If the while condition contains only the input read, such as <$fh>, then the value is assigned to the $_ variable; also see I/O Operators in perlop.
I use that here since the regex is then simpler (it matches on $_ by default), and so is split.
split without arguments defaults to split ' ', $_;, where ' ' stands for any amount of any whitespace (and leading whitespace is removed before splitting).
Please provide exact samples of input and desired output for a more complete example.
I have been working on this for a little while now and can't seem to figure it out. I have a file containing a bunch of lines, all structured like the one below: each line starts with "!" and has three "<DIV>" separators.
!the<DIV>car<DIV>drove down the<DIV>road off into the distance
I am interested in retrieving the last string, "road off into the distance", but I can't seem to get it to work. My current code is listed below.
while($line = <INFILE>) {
$line =~ /<SEP>{3}(.*)/;
print $1;
}
Any help would be greatly appreciated!
The statement
@b = $a =~ /^!(.*?)<DIV>(.*?)<DIV>(.*?)<DIV>(.*)/
will split the string into a list, and you can then extract the 4th element with
$b[3]
If you really want only the last one, do this instead:
($text) = $a =~ /^!.*<DIV>(.*)/
I don't know whether you insist on a regex or simply didn't think of anything else, but split will do this nicely:
$text = (split '<DIV>', $str)[-1];
If you regularly have such repeating patterns split may well be better for the job than a pure regex. (Split also uses full regular expressions in its pattern, of course.)
ADDED
All this can be done directly, if you simply only need to pull the last thing from each line:
open my $fh, '<', $file or die "Can't open $file: $!";
my @text = map { chomp; (split '<DIV>')[-1] } <$fh>; # chomp so the print below adds the only newline
close $fh;
print "$_\n" for @text;
The split by default uses $_, which inside the map is the current element processed. For lines without a <DIV> this returns the whole line. A file handle in the list context serves all lines as a list; the list context is imposed by map here.
In case you want all text between delimiters you can do
my @rlines = map { [ split '<DIV>' ] } <$fh>;
where [ ] creates a reference to an anonymous array holding the list returned by split, so @rlines contains array references, each with the text between the <DIV>s on a line. The leading ! is still there, though, and dropping it takes a little more processing.
Of course, for the map block you can use { (/.*<DIV>(.*)/)[0] } from Jim Garrison's answer for a single match, or modify the regex a little to catch'em all.
If performance is a factor then that's a little different question.
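The "little more processing" needed to drop the leading ! (and the trailing newline) could look like the sketch below. The input is the sample line from the question held in an array, standing in for the lines read from the file handle.

```perl
use strict;
use warnings;

# Sample line from the question, standing in for the lines read from <$fh>
my @lines = ("!the<DIV>car<DIV>drove down the<DIV>road off into the distance\n");

my @rlines = map {
    chomp(my $line = $_);        # work on a copy, without the newline
    $line =~ s/^!//;             # drop the leading "!"
    [ split /<DIV>/, $line ];    # one array reference per line
} @lines;

print join(' | ', @{ $rlines[0] }), "\n";
```

The last field of any line is then available as $rlines[$i][-1], and the other fields are clean as well.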
A simple substitution could work too:
while(<DATA>){
chomp;
my $text = (s/.*<DIV>//g, $_);
say $text;
}
Simple regex which answers your question:
while (my $line = <INFILE>) {
    my ($match) = $line =~ /.*<DIV>(.*)/;
    print "$match\n" if defined $match; # print inside the loop, once per matching line
}
I am a first year grad student who's relatively new in computational biology. I recently started using Perl and it's not the easiest language to learn, at least not for me.
I need help applying my idea/logic the right way to figure out the solution to my problem.
I have a dna string and I want to split it at specific sites to get multiple fragments using information from an enzyme file that contains lines of recognition sites. Once the fragments are obtained, I want to output the list of dna fragments in an output file. I want to create an output file for every line in the enzyme file I am going to extract the information from, to apply it to the dna string.
Here's what I mean exactly:
Hypothetical scenario:
Enzyme.File contains:
abc/at'gtct// (abc is the name of the enzyme. (atgtct) is the recognition site.)
def/cgg'ataaa// ........
Suppose the dna string is: $dna = "accggttatgtctaaacggataaagtctcggataaattt" (recognition sites are bolded)
For line 1
When I extract the info from the first line/enzyme(abc) from the enzyme file and apply it to this string, the output should be:
accggttat
gtctaaacggataaagtctcggataaattt
(split between at'gtct) the apostrophe represents the cut point
(note: Even though there is another gtct in the string, it does not split it because at ought to precede it.)
For line 2
$dna = accggttatgtctaaacggataaagtctcggataaattt (Info is applied to same dna string)
Info from line/enzyme 2 (def) would split the dna as follows:
accggttatgtctaaacgg (split between cgg'ataaa)
ataaagtctcgg
ataaattt
I want to put each output from the different lines in separate file with distinct names. (I can take care of assigning the names)
So in conclusion, this example would create two new files, one named "abc_whatever" and one named "def_whatever". Important: if the enzyme file had 8 lines with different enzymes, I would get 8 new output files with their distinct dna fragments.
Here's what I've tried so far:
#!/usr/bin/perl;
use warnings;
use strict;
open(ENZ,$ARGV[0]) || die; # ENZ(file handle for enzyme file)
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
while (<ENZ>) {
if ( match pattern etc..) { # I took care of that and created captured groups of
$1 = holds "abc" # the info I needed from the line e.g. I captured
$2 = ..."at" # (abc)/(at)'(gtct)//, so they are stored in $1,$2,$3
$3 = ..."gtct" # respectively
}
while (<$dna>){
my @fragments_array = split(/$3/, $dna);
open (OutFile, ">$dna"."_"."$1")
print OutFile shift @fragments_array,"\n";
foreach (@fragments_array) {
print OutFile "$3$_\n";
close OutFile;
}
}
}
close ENZ;
FIRST
I can only create an output only for the 1st line in the Enzyme file. I want to create and output file for all the lines.
SECOND
I am not properly cutting the dna. From other examples I have seen online, it looks like I am gonna have to use the following functions to properly apply the enzyme information on the dna. The functions include:
the for loop, length and substr(),
If you can, please demonstrate your work in the simplest form (no extravagant, impressing codes lol :-) since I am just learning this language)
Thanks in advance!
FIRST I can only create an output only for the 1st line in the Enzyme file. I want to create and output file for all the lines.
That's simply because you put close OutFile; inside the foreach (@fragments_array) loop, instead of placing the close after the loop body.
SECOND I am not properly cutting the dna.
That's because you forgot to include $2, the head of the recognition site (e.g. the at of atgtct), in the split pattern as well as in the output.
The problem is solved more easily if we just insert a newline character between the head and the tail wherever the recognition site occurs:
#!/usr/bin/perl
use warnings;
use strict;
open(ENZ, $ARGV[0]) || die; # ENZ (file handle for enzyme file)
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
while (<ENZ>)
{
if (m-(.*)/(.*)'(.*)//-)
{
my ($head, $tail) = ($2, $3); # $2$3 is the recognition site; save it
open(OutFile, ">${dna}_$1");
(my $fragments = $dna) =~ s/$head$tail/$head\n$tail/g; # insert NLs
print OutFile $fragments, "\n";
close OutFile;
}
}
close ENZ;
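Since the question asks about a loop with length and substr, here is a rough sketch of the same cut done that way, using index to find each recognition site. The helper name cut_dna is made up for illustration; the sites are the two from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $dna between $head and $tail at every occurrence of the site
sub cut_dna {
    my ($dna, $head, $tail) = @_;
    my $site = $head . $tail;
    my @fragments;
    my $start = 0;    # where the current fragment begins
    my $from  = 0;    # where to resume searching
    while ((my $hit = index($dna, $site, $from)) >= 0) {
        my $cut = $hit + length($head);    # the cut point sits right after the head
        push @fragments, substr($dna, $start, $cut - $start);
        $start = $cut;
        $from  = $hit + 1;
    }
    push @fragments, substr($dna, $start);   # remainder after the last cut
    return @fragments;
}

my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
print join("\n", cut_dna($dna, 'cgg', 'ataaa')), "\n";
```

This is longer than the substitution above but shows the mechanics: index locates the site, length of the head gives the offset of the cut, and substr copies out each fragment.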
I changed your code a bit, hope it works now
#!/usr/bin/perl
use warnings;
use strict;
open(ENZ, $ARGV[0]) or die "Can't open enzyme file: $!";
my $dna = "accggttatgtctaaacggataaagtctcggataaattt";
for my $line (<ENZ>) {
    chomp($line); # remove \n at the end of string
    # split string into tokens (e.g. abc/at'gtct// => (abc, at, gtct));
    # split drops the trailing empty fields produced by the final "//"
    my ($enzyme, $firstPart, $secondPart) = split(/\/|'/, $line);
    next unless defined $secondPart; # skip lines that are not enzyme entries
    # cut at every position preceded by $firstPart and followed by $secondPart
    my @fragments = split(/(?<=\Q$firstPart\E)(?=\Q$secondPart\E)/, $dna);
    open(OUTPUT, ">$enzyme" . "_something") or die "Can't write ${enzyme}_something: $!";
    print OUTPUT "$_\n" for @fragments;
    close(OUTPUT);
}
close ENZ;
EDIT: this is the working version. I suggest you learn regular expressions if you want to use Perl for your studies; they are the strongest tool in Perl.
I have a giant text data file (~100MB) that is a concatenation of a bunch of data files with various header information then some columns of data. Here's the problem. I want to extract a particular number from the header info before each of these data sets and then append that to another column in the data (and write out that data to a different file).
The header info that I want is of the format ex: BGA 1
Where what I want for that extra data column is the # after word BGA. It will be a number between 1 and maybe 20000. I can write the regex to pull the word BGA, but I don't seem to be able to figure out how to just get the digit after it.
To add EXTRA fun, that text "BGA 1" is repeated in each data section TWICE.
Here's what I have so far, which actually doesn't work... I want it to at least print "BGA" everytime it encounters the word BGA, but it prints nothing.... Any help would be appreciated.
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'alldata.txt';
open my $info, $file or die "Could not open $file: $!";
$_="";
while(my $line = <$info>){
if ($line eq "/BGA/"){
print <>,"\n";
}
}
close $file;
if ($line =~ /BGA\s(\d+)/){
#your code
print "BGA number $1 \n";
#your code
}
The $1 variable will then hold the number you want.
If there is more than one BGA per line, you'll need to allow the regex to match more than once per line:
while (my $line = <$info>) {
while ( $line =~ /BGA\s(\d+)/g ) {
print "$1\n";
}
}
This should print out all the BGA numbers as a single column. Without any further information it's hard to answer this any better.
First, a 100 MB file is not giant. Don't be so defeatist. You could even slurp it into memory.
Let's look at the few critical places in your code:
while(my $line = <$info>) {
if ($line eq "/BGA/") {
Your condition $line eq "/BGA/" tests whether the line consists literally of the string "/BGA/". But that can never be true, because the line will at least have the input record separator (i.e. the contents of $/) at its end, since you did not chomp it. In any case, what you want is to match lines that contain "BGA" anywhere, and the proper Perl syntax to do that is
if ($line =~ /BGA/) {
Now, once you fix that, you are going to run into a problem with the following statement:
print <>,"\n";
What you really want is print $line;. The diamond operator, <>, in list context is going to try to slurp from STDIN or any files specified as arguments on the command line. Not a good idea.
Others have pointed out how to match the string "BGA" followed by a digit. For better answers, you are going to need to show examples of input and expected output.
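Until such examples are available, the overall idea (remember the most recent BGA number and append it to each data row) can only be sketched under assumptions. Here the input is a made-up array of lines, and a data row is assumed to be any line that starts with a digit, a minus sign, or a dot; neither comes from the question.

```perl
use strict;
use warnings;

# Made-up sample; the real file layout is not shown in the question
my @input = (
    "Header BGA 1\n",
    "BGA 1 again\n",     # the header text is repeated twice per section
    "0.1 0.2\n",
    "0.3 0.4\n",
    "Header BGA 17\n",
    "0.5 0.6\n",
);

my $bga;    # most recent BGA number seen
my @out;
for my $line (@input) {
    chomp(my $copy = $line);
    if ($copy =~ /BGA\s+(\d+)/) {
        $bga = $1;       # a repeated header simply sets the same value again
        next;
    }
    # Assumption: data rows start with a digit, minus sign, or dot
    push @out, "$copy $bga" if defined $bga && $copy =~ /^[-\d.]/;
}
print "$_\n" for @out;
```

Because the number is only stored when a header line is seen, the fact that "BGA 1" appears twice per section does no harm.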