Create Matrix from redundant list


I have an Input with a redundant list which looks like this:
Sample1.14 Water
Sample2.45 Air
Sample1.16 Dirt
Sample1.14 Water
Sample2.45 Air
Sample1.16 Dirt
Sample1.14 Water
Sample2.45 Air
Sample1.16 Dirt
Sample1.16 Dirt
Sample1.14 Dirt
Sample2.45 Air
Sample1.16 Air
I created a hash which counts how often each sample gives the result Water,Air,Dirt (note this is just example data but the structure is identical).
use warnings;
use strict;
my $inPut = "ExampleSample";
open(my $read, '<', $inPut) || die "Could not read $inPut: $!";
my %sampleHash;
while (<$read>) {
    chomp;
    my @temp = split("\t", $_);
    my $sample = $temp[0];
    my $type = $temp[1];
    $sampleHash{$type}{$sample} += 1;
}
This works as intended and gives as output:
$VAR1 = {
'Dirt' => {
'Sample1.16' => 4,
'Sample1.14' => 1
},
'Air' => {
'Sample1.16' => 1,
'Sample2.45' => 4
},
'Water' => {
'Sample1.14' => 3
}
};
Since this is quite a bad data structure for further downstream processing, I would like to put this data into a matrix, but I am somewhat lost on how to do that.
Desired output (or its transpose, it does not really matter):
        Sample1.14  Sample2.45  Sample1.16
Air     0           4           1
Dirt    4           0           4
Water   3           0           0
I am really stuck here, any help would be very much appreciated! Thanks.

You can "munge" your hash of hashes into an array of arrays and then input that into Acme::Tools::pivot() or Data::Pivot::pivot(). Like this:
use Acme::Tools;
my $data={
'Dirt' => {
'Sample1.16' => 4,
'Sample1.14' => 1
},
'Air' => {
'Sample1.16' => 1,
'Sample2.45' => 4
},
'Water' => {
'Sample1.14' => 3
}
};
my @sample=uniq(sort map keys(%$_), values %$data);
my @element=sort keys %$data;
my $data2=[ map { my $x=$_; map [$x,$_,$$data{$x}{$_}||' 0'], @sample } @element ];
print tablestring([Acme::Tools::pivot($data2,"Element")]);
Output:
Element Sample1.14 Sample1.16 Sample2.45
------- ---------- ---------- ----------
Air     0          1          4
Dirt    1          4          0
Water   3          0          0

The easiest way to create a unique list in Perl is to use the elements as hash keys with dummy values. After filling the hash you can get the unique list of values with keys.
my %samples;
$samples{"some value"} = 1;
$samples{"some other value"} = 1;
$samples{"some value"} = 1;
my @samples = sort keys %samples;
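If the original order of the values matters, a common variant of the same idea is to filter the list with grep while recording in a hash what has already been seen. A minimal sketch (the names @values, %seen and @unique are only for illustration):

```perl
use strict;
use warnings;

# Illustrative input with repeated entries.
my @values = qw(Sample1.14 Sample2.45 Sample1.16 Sample1.14 Sample2.45);

# $seen{$_}++ is false (0) the first time a value appears and true afterwards,
# so grep keeps only the first occurrence of each value, in input order.
my %seen;
my @unique = grep { !$seen{$_}++ } @values;

print "@unique\n";    # Sample1.14 Sample2.45 Sample1.16
```

Unlike sort keys, this keeps the values in the order in which they first appear.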
If you want to make Perl behave like awk, you can use the split function with a single space argument. And if you want to assign the result of a split to two variables, you can use Perl's list notation.
my ($a, $b) = split ' ';
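To illustrate what the single-space pattern does, here is a small sketch with a made-up input line. The variables are named $sample and $type rather than $a and $b, because $a and $b are also the special variables used by sort:

```perl
use strict;
use warnings;

# Illustrative line with leading whitespace, mixed separators and a newline.
my $line = "  Sample1.14 \t Water\n";

# The special pattern ' ' skips leading whitespace and splits on any run of
# whitespace, so no empty leading field is produced.
my ($sample, $type) = split ' ', $line;

print "$sample|$type\n";    # Sample1.14|Water
```

Because the trailing newline counts as whitespace, a chomp is not strictly needed before this kind of split.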
The complicated part is to build the table. This can be done with for loops or by map. The use of for loops might be easier to read, but maps allow a more compact notation.
The following creates an array reference (square brackets) and fills the array with the return list of the map expression, prefixed by the $t value. The map expression takes a piece of code and a list and executes the code for each element of the list. The value of the current list element is available in the variable $_.
[ $t, map { $sampleHash{$t}{$_} or '0' } @samples ]
If you nest map expressions, you have to give the outer $_ a name to access it from the inner map, because the inner $_ shadows the outer.
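For comparison, the same table can be built with plain for loops, which avoids the shadowing problem because every loop variable has its own name. This is only a sketch: the counts are typed in by hand to match the question's data, and @table, $t and $s are illustrative names.

```perl
use strict;
use warnings;

# Hand-written counts, shaped like %sampleHash in the question.
my %sampleHash = (
    Air   => { 'Sample1.16' => 1, 'Sample2.45' => 4 },
    Dirt  => { 'Sample1.14' => 1, 'Sample1.16' => 4 },
    Water => { 'Sample1.14' => 3 },
);
my @samples = ( 'Sample1.14', 'Sample1.16', 'Sample2.45' );
my @types   = sort keys %sampleHash;

# First row is the header; every following row is a type plus its counts.
my @table = ( [ '', @samples ] );
for my $t (@types) {
    my @row = ($t);
    for my $s (@samples) {
        push @row, $sampleHash{$t}{$s} // 0;    # missing combination counts as 0
    }
    push @table, \@row;
}

print join( "\t", @$_ ), "\n" for @table;
```

The defined-or operator // needs Perl 5.10 or later; on older versions the `or '0'` form from the map expression works as well.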
A basic way to format tables in Perl is to use Perl's report feature perlform. For that you have to define a list of alternating lines: first a pattern line and then a value line.
If you put it all together, your example becomes this:
#! /usr/bin/perl
use strict;
use warnings;
my %sampleHash;
my %samples;
my %types;
while (<DATA>)
{
    chomp;
    my ($sample, $type) = split ' ';
    $sampleHash{$type}{$sample} += 1;
    $samples{$sample} = 1;
    $types{$type} = 1;
}
my @samples = sort keys %samples;
my @types = sort keys %types;
my @table =
    (['', @samples],
     map { my $t=$_; [ $t, map { $sampleHash{$t}{$_} or '0' } @samples ] } @types );
my $row;
format =
@<<<<<< @|||||||||| @|||||||||| @||||||||||
@$row
.
for $row (@table) { write; }
__DATA__
Sample1.14 Water
Sample2.45 Air
Sample1.16 Dirt
Sample1.14 Water
Sample2.45 Air
Sample1.16 Dirt
Sample1.14 Water
Sample2.45 Air
Sample1.16 Dirt
Sample1.16 Dirt
Sample1.14 Dirt
Sample2.45 Air
Sample1.16 Air
which outputs this
        Sample1.14  Sample1.16  Sample2.45
Air          0           1           4
Dirt         1           4           0
Water        3           0           0
Note: Your desired output does not match your input.
In order to read a file, you have to keep the part of your code that uses open. I have used the __DATA__ section only to simplify the example in order to get an MCVE.

Related

Substitute the markdown italic to html using regex in Perl

To convert the markdown italic text $script into html, I've written this:
my $script = "*so what*";
my $res =~ s/\*(.)\*/$1/g;
print "<em>$1</em>\n";
The expected result is:
<em>so what</em>
but it gives:
<em></em>
How to make it give the expected result?

Problems:
You print the wrong variable.
You switch variable names halfway through.
. won't match more than one character.
You always add one EM element, even if no stars are found.
You always add one EM element, even if multiple pairs of stars are found.
You add the EM element around the entire output, not just the portion in stars.
Fix:
$script =~ s{\*([^*]+)\*}{<em>$1</em>}g;
print "$script\n";
or
my $res = $script =~ s{\*([^*]+)\*}{<em>$1</em>}gr;
print "$res\n";
But that's not it. Even with all the aforementioned problems fixed, your parser still has numerous other bugs. For example, it misapplies italics for all of the following:
**Important**
    Correct: <strong>Important</strong>
    Your code: *<em>Important</em>*
4 * 5 * 6 = 120
    Correct: 4 * 5 * 6 = 120
    Your code: 4 <em> 5 </em> 6 = 120
4 * 6 = 20 is *wrong*
    Correct: 4 * 6 = 20 is <em>wrong</em>
    Your code: 4 <em> 6 = 20 is </em>wrong*
`foo *bar* baz`
    Correct: <code>foo *bar* baz</code>
    Your code: `foo <em>bar</em> baz`
\*I like stars\*
    Correct: *I like stars*
    Your code: \<em>I like stars\</em>
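The behaviour on these inputs can be checked with a short script. This is only a sketch that runs the substitution from the fix over the five inputs listed above; @inputs and @outputs are illustrative names.

```perl
use strict;
use warnings;

# The inputs from the list above (single-quoted, so backslashes are literal).
my @inputs = (
    '**Important**',
    '4 * 5 * 6 = 120',
    '4 * 6 = 20 is *wrong*',
    '`foo *bar* baz`',
    '\*I like stars\*',
);

# Apply the substitution from the fix; /r returns the result instead of
# modifying the input (needs Perl 5.14 or later).
my @outputs;
for my $in (@inputs) {
    push @outputs, $in =~ s{\*([^*]+)\*}{<em>$1</em>}gr;
}

print "$_\n" for @outputs;
```

Each printed line shows where an EM element ends up for an input that a real Markdown parser would treat differently.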

Perl list interpolation performance

Background
Perldoc for List::Util suggests that some uses of map may be replaced by reduce in order to avoid creating an unnecessary intermediate list:
For example, to find the total length of the all the strings in a
list, we could use
$total = sum map { length } @strings;
However, this produces a list of temporary integer values as long as
the original list of strings, only to reduce it down to a single value
again. We can compute the same result more efficiently by using reduce
with a code block that accumulates lengths by writing this instead as:
$total = reduce { $a + length $b } 0, @strings;
That makes sense. However, for reduce to work in this example it needs an "identity value" that is prepended to the input list:
$total = reduce { $a + length $b } 0, @strings;
# ^^^^^^^^^^^
That makes me think: doesn't 0, @strings create a new list, thus offsetting any gains from not creating a list in map?
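To make the role of the identity value concrete, here is a small sketch with made-up strings. Without the leading 0, the first string is used as the initial value of $a as-is, so its length is never taken:

```perl
use strict;
use warnings;
use List::Util qw(reduce sum);

# Illustrative data; the first two strings happen to look like numbers.
my @strings = ( '10', '200', 'abcd' );

my $via_map       = sum map { length } @strings;              # 2 + 3 + 4
my $with_identity = reduce { $a + length $b } 0, @strings;    # 0 + 2 + 3 + 4
my $without       = reduce { $a + length $b } @strings;       # '10' + 3 + 4

print "$via_map $with_identity $without\n";    # 9 9 17
```

This only shows why the 0 is needed for correctness; it says nothing about which variant is faster.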
Question
How does list interpolation ($scalar, @list) work in Perl? Does it involve copying elements from the source list, or is it done in some smarter way? My simple benchmark suggests that copying takes place:
use strict;
use warnings;
use Benchmark qw/cmpthese/;
my @a1 = 1..10;
my @a2 = 1..100;
my @a3 = 1..1000;
my @a4 = 1..10000;
my @a5 = 1..100000;
my @a6 = 1..1000000;
cmpthese(10000, {
    'a1' => sub { my @l = (0, @a1); },
    'a2' => sub { my @l = (0, @a2); },
    'a3' => sub { my @l = (0, @a3); },
    'a4' => sub { my @l = (0, @a4); },
    'a5' => sub { my @l = (0, @a5); },
    'a6' => sub { my @l = (0, @a6); },
});
Results:
(warning: too few iterations for a reliable count)
        Rate       a6      a5     a4    a3    a2    a1
a6    17.6/s       --    -90%   -99% -100% -100% -100%
a5     185/s     952%      --   -90%  -99% -100% -100%
a4    1855/s   10438%    902%     --  -90%  -99% -100%
a3   17857/s  101332%   9545%   862%    --  -91%  -98%
a2  200000/s 1135940% 107920% 10680% 1020%    --  -80%
a1 1000000/s 5680100% 540000% 53800% 5500%  400%    --
Bonus question: If my assumptions are correct (i.e. 0, @strings creates a new list), does replacing map with reduce make sense?

doesn't 0, @strings create a new list
Not really. If you decompile the code, it's just one additional SVOP.
But you're measuring the wrong thing. The values are flattened and passed into the map or reduce subroutine in both cases!
The documentation is talking about what happens inside the subroutine. map creates a list with as many values as its input and returns it, and then sum takes that list and condenses it into a single value. The return list is ephemeral and is not represented directly in the code. (This list passing is not that efficient; it could be made faster by using references.)
In contrast, with reduce there is no such return list. reduce only works on the input list of values and returns a single value.

"This produces a list of temporary integer values as long as the original list of strings" refers to map putting N scalars on the stack. The thing is, the reduce approach creates just as many scalars, and they also all go on the stack. The only difference is that the reduce approach only keeps one of them on the stack at once. That means the reduce approach uses less memory, but it doesn't speak to its performance at all. The reason it gives for reduce computing the same result more efficiently is nonsense.
There could be a performance difference, but not for that reason. If you want to find out which one performs better for you, you will need to run a benchmark.
That makes me think, doesn't 0, @strings create a new list
No. reduce creates a single list unconditionally. This is unrelated to the number of expressions in the argument list.
Lists aren't arrays. When we say "the sub returns a list" or "the op evaluates to a list", we actually mean "the sub or op places some quantity of scalars on the stack".
Lists are created for ops that will pop a variable number of scalars from the stack. This is done by simply pushing a mark onto the stack. For example, reduce { ... } 0, @a would create a list for the entersub op. { ... } will end up leaving one code ref on the list/stack, 0 will end up leaving a number on the list/stack, and @a will end up leaving its elements on the list/stack. One last thing is added to the list/stack before the sub is called: the glob *reduce.
Note that creating the list is effectively free, since it's simply pushing a mark on the stack. Placing an array on the stack takes time proportional to the number of its elements, but it's still quite cheap since we're only copying a block of pointers (in the C sense of the word).
That means there's effectively no performance difference between reduce { ... } @strings and reduce { ... } 0, @strings. Both create a single list, and both add roughly the same number of elements to the list/stack.
Exceptions:
for (@a) is optimized to be for* (\@a). This saves memory, and it saves time if the loop is exited prematurely.
sub f(\@); f(@a) is equivalent to &f(\@a).
AFAIK, map and grep aren't optimized in this manner.
In detail:
$ perl -MO=Concise,-exec -MList::Util=reduce -e'reduce { ... } @a'
...
3  <0> pushmark s                 <-- Creates list (adds mark to the stack).
4  <$> anoncode[CV ] sRM          <-- Adds CV to the stack.
5  <1> srefgen sKM/1              <-- Replaces CV with a ref to the CV.
6  <#> gv[*a] s                   <-- Places *a on the stack.
7  <1> rv2av[t4] lKM/1            <-- Replaces *a with the contents of @a.
8  <#> gv[*reduce] s              <-- Places *reduce on the stack.
9  <1> entersub[t5] vKS/TARG      <-- Will remove the entire list from the stack.
...
$ perl -MO=Concise,-exec -MList::Util=reduce -e'reduce { ... } 0, @a'
...
3  <0> pushmark s
4  <$> anoncode[CV ] sRM
5  <1> srefgen sKM/1
6  <$> const[IV 0] sM             <-- The only difference.
7  <#> gv[*a] s
8  <1> rv2av[t4] lKM/1
9  <#> gv[*reduce] s
a  <1> entersub[t5] vKS/TARG
...

The direct question can be answered by a benchmark:
use strict;
use warnings;
use List::Util qw(sum reduce);
use Benchmark qw(cmpthese);
my @ary = 1..10_000;
sub by_reduce { my $res = reduce { $a + length $b } 0, @ary }
sub by_map { my $res = sum map { length } @ary }
cmpthese(-3, {
    reduce => sub { by_reduce },
    map => sub { by_map },
});
which prints on my v5.16 at hand
         Rate    map reduce
map     780/s     --   -41%
reduce 1312/s    68%     --
Thus reduce does something significantly better for this task.
As for the question of lists in general, it would have to depend on how the full list is used.
In your benchmark there is an assignment to a new array, so the data clearly must be copied. Longer arrays therefore take longer, by about an order of magnitude at each step, which matches the ratio of their sizes.
With list inputs for functions like map and reduce I don't see a reason for an additional data copy. This can be checked by a benchmark, comparing an identical operation
my @ary = 1..10_000;
# benchmark:
my $r1 = sum map { length } @ary;
my $r2 = sum map { length } (1..5000, 5001..10_000);
The reported rates are nearly identical, for example 780/s and 782/s, showing that the flattening of the ranges for map input doesn't involve a data copy. (The ranges are converted to arrays at compile time, thanks to ikegami for comments.)

parsing multiline nested tokens from a file in perl

I have a file that looks like this:
Alpha 27600
Beta 1
Charlie true
BEGIN Delta
BEGIN Epsilon Setting High Hook 50 END
BEGIN Foxtrot Corp 71 END
BEGIN "Jelly Bean" Corp 88 END
END
BEGIN Hotel
Height 25
Lawn 85
END
Basically it is several key/value pairs separated by one or more spaces. The tricky part is the BEGIN/END blocks, which might be nested and might span multiple lines. I need to go through the file and take some action based on what follows the BEGIN. For example, if it's Delta, I might need to process each of the sub BEGIN lines, whereas if it is Hotel, I can skip that completely.
I looked at Parse::RecDescent a little bit but wasn't sure how to make it handle the BEGIN/END situation properly. Speed isn't as important as having an easier to understand and maintain solution.
Any suggestions?
EDIT: I liked Miller's solution, but then looking over the data realized why I didn't just split on whitespace. Some of the labels have whitespace in them. Added "Jelly Bean" label in above data file to reflect that.

Just parse the whole data structure, and filter out sections you don't need after the fact:
use strict;
use warnings;
use Text::ParseWords;
my @tokens = parse_line( qr{\s+}, 0, do { local $/; <DATA> } );
my %hash;
my @levels = \%hash;
while ( defined( my $key = shift @tokens ) ) {
    if ( $key eq 'BEGIN' ) {
        push @levels, $levels[-1]{ shift @tokens } = {};
    } elsif ( $key eq 'END' ) {
        pop @levels;
    } else {
        $levels[-1]{$key} = shift @tokens;
    }
}
use Data::Dump;
dd \%hash;
__DATA__
Alpha 27600
Beta 1
Charlie true
BEGIN Delta
BEGIN Epsilon Setting High Hook 50 END
BEGIN Foxtrot Corp 71 END
BEGIN "Jelly Bean" Corp 88 END
END
BEGIN Hotel
Height 25
Lawn 85
END
Outputs:
{
Alpha => 27600,
Beta => 1,
Charlie => "true",
Delta => {
"Epsilon" => { Hook => 50, Setting => "High" },
"Foxtrot" => { Corp => 71 },
"Jelly Bean" => { Corp => 88 },
},
Hotel => { Height => 25, Lawn => 85 },
}
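Filtering after the fact then comes down to walking only the parts of the hash you care about. A minimal sketch, assuming a structure shaped like the dump above (typed in by hand here; @report is an illustrative name): it processes each sub-block of Delta and never touches Hotel.

```perl
use strict;
use warnings;

# Same shape as the dump above, typed in by hand for this sketch.
my %hash = (
    Alpha   => 27600,
    Beta    => 1,
    Charlie => 'true',
    Delta   => {
        'Epsilon'    => { Hook => 50, Setting => 'High' },
        'Foxtrot'    => { Corp => 71 },
        'Jelly Bean' => { Corp => 88 },
    },
    Hotel => { Height => 25, Lawn => 85 },
);

# Process every sub-block of Delta; Hotel is simply never visited.
my @report;
for my $label ( sort keys %{ $hash{Delta} } ) {
    my $block = $hash{Delta}{$label};
    my @pairs = map { $_ . '=' . $block->{$_} } sort keys %$block;
    push @report, "$label: @pairs";
}

print "$_\n" for @report;
```

Note that a plain hash does not remember the order in which the blocks appeared in the file, which is why the sketch sorts the keys.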

Personally I'd hack something up with Parser::MGC (though perhaps I'm biased because I wrote it).
Using a nested scope of its scope_of method will easily handle those BEGIN/END markers for you.

Perl Deleting element from array

I have a 2D array in Perl. I want to delete all elements which match the pattern <<< or >>>.
I have written some Perl code. It works fine up to matching the pattern, but it cannot delete the element; an error occurs.
foreach my $x(@array)
{
    foreach my $y(@$x)
    {
        if($y =~ (m/^(\<+)|(\>+)$/ig))
        {
            delete $y;
        }
    }
}
Can you help me to delete that particular element that matches the pattern. (I want to delete and remove from array, not undef it)

Let's say your array looks like this:
1 2 3 4
5 X 6 7
8 9 A B
You want to delete X. What do you want to happen? What should your new array look like after the delete?
Do you want this:
1 2 3 4
5 6 7
8 9 A B
Or this?
1 2 3 4
5 9 6 7
8   A B
That's the first thing you need to decide. Second, delete is not the right tool here. The delete command is meant for removing a keyed value from a hash; it does not remove a slot from an array. If you have an array like this:
my @array = qw(0 1 2 3 4 5 X 7 8 9);
And you want to delete the X (which is $array[6]), you'd use the splice command:
splice @array, 6, 1;
Finally, Perl does not have 2 dimensional arrays, so you can't delete a value from a 2 dimensional array.
What you have is an array of references to a second array. Think of it this way:
my @row0 = qw(1 2 3 4);
my @row1 = qw(5 X 6 7);
my @row2 = qw(8 9 A B);
my @two_d_array = (\@row0, \@row1, \@row2);
Or, I could do this by column:
my @col0 = qw(1 5 8);
my @col1 = qw(2 X 9);
my @col2 = qw(3 6 A);
my @col3 = qw(4 7 B);
my @two_d_array = (\@col0, \@col1, \@col2, \@col3);
When you write:
if ( $two_d_array[1][1] eq "X" ) {
What is going on is that Perl is messing with your mind. It is making you think a two dimensional array is involved, but it's not really there.
A more accurate way of writing this would be:
if ( ${ $two_d_array[1] }[1] eq "X" ) {
or, more cleanly:
if ( $two_d_array[1]->[1] eq "X" ) {
So first, decide what you mean by deleting a value. In a two dimensional array, if you actually delete that value, you end up ruining the dimensional structure of that array. Maybe you can replace the value at that point with an undef.
Once you do that, you must understand what you're actually dealing with: An array of references to arrays.
for my $array_reference ( @two_d_array ) {
    for my $value ( @{ $array_reference } ) {
        if ( $value =~ /^(<+|>+)$/ ) {
            $value = undef; #See Note #1
        }
    }
}
Note #1: When you use a for loop, the loop variable is an alias for the actual element in the array. Therefore, when you change the loop variable, you're changing the actual value in the array. That's why this will work.
If you really, really want to delete the element using splice, you will have to decide if you want your elements moving up to replace the deleted value or moving to the left to replace the deleted value. If you want the values to move left, you want an array of references to row arrays. If you want the values to move up to fill in the deleted value, you want an array of references to column arrays.
Remember that computers will do exactly what you tell them to do and not what you want them to do. Make sure you understand exactly what you want.

You are applying delete to a scalar value, $y, and delete is only meant to be applied to hash and array elements. You would need to do
for my $x (0 .. $#array) {
    for my $y (0 .. $#{$array[$x]}) {
        if (...) { delete $array[$x][$y]; }
    }
}
The best solution, in my opinion, is to remove the value before storing it in the array. I am guessing you read it in from some data source such as a file, and that would be the best place to filter it out. E.g.
while (<$fh>) {
    ....
    @values = grep !/^[<>]+/, @values; # filtering
    push @array, \@values; # storing
}
On that note, you can also do it afterwards, of course, with something like:
for (@array) {
    @$_ = grep !/^[<>]+/, @$_;
}

You can delete elements from arrays with the splice function:
splice(@array, $index, 1); # the 1 is the number of elements you want to delete
The delete function only sets the array value to undef; it does not shift the following elements down.
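The difference is easy to see on a small array. A minimal sketch with made-up data (@deleted and @spliced are illustrative names):

```perl
use strict;
use warnings;

my @deleted = qw(a b c d);
delete $deleted[1];       # leaves a hole; the array still has 4 slots

my @spliced = qw(a b c d);
splice @spliced, 1, 1;    # removes the slot; later elements shift down

printf "delete: %d elements, splice: %d elements\n",
    scalar @deleted, scalar @spliced;    # delete: 4 elements, splice: 3 elements
```

After the delete, $deleted[1] is undefined while 'c' and 'd' keep their indices 2 and 3.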

delete does not alter array indices so it is not what you want. If you want to delete elements by value, use something like this:
foreach my $x(@array)
{
    $x = [ grep { !/^(?:<+|>+)$/ } @$x ];
    print join(",", @$x), "\n";
}
or, use splice. But then you will need to iterate the array using indices rather than values.
Also see Perl-delete, Perl-splice.
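A splice-based version could look like the sketch below. It is an illustration with made-up data, not the only way to do it: the indices of each row are walked from the end, so removing an element never shifts the position of an element that has not been checked yet.

```perl
use strict;
use warnings;

# Illustrative array of array references.
my @array = ( [ 1, '<<<', 2 ], [ '>>>', 3, '>>' ], [ 4, 5 ] );

for my $row (@array) {
    # Walk the indices from the end so that removing an element does not
    # shift the positions of elements that are still to be checked.
    for my $i ( reverse 0 .. $#$row ) {
        splice @$row, $i, 1 if $row->[$i] =~ /^(?:<+|>+)$/;
    }
}

print join( ',', @$_ ), "\n" for @array;
```

This modifies the inner arrays in place, whereas the grep version replaces each row with a new array reference.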

Perl regex & data extraction/manipulation

I'm not sure where to start with this one... my client gets stock figures from his supplier but they are now being sent in a different format, here is a sample snippet:
[["BLK",[["Black","0F1315"]],[["S","813"],["M","1378"],["L","1119"],["XL","1069"],["XXL","412"],["3XL","171"]]],["BOT",[["Bottle","15451A"]],[["S","226"],["M","425"],["L","772"],["XL","509"],["XXL","163"]]],["BUR",[["Burgundy","73002E"]],[["S","402"],["M","530"],["L","356"],["XL","257"],["XXL","79"]]],["DNA",[["Deep Navy","000F33"]],[["S","699"],["M","1161"],["L","1645"],["XL","1032"],["XXL","350"]]],["EME",[["Emerald","0DAB5E"]],[["S","392"],["M","567"],["L","613"],["XL","431"],["XXL","97"]]],["HEA",[["Heather","C0D4D7"]],[["S","374"],["M","447"],["L","731"],["XL","386"],["XXL","115"],["3XL","26"]]],["KEL",[["Kelly","0FFF00"]],[["S","167"],["M","285"],["L","200"],["XL","98"],["XXL","45"]]],["NAV",[["Navy","002466"]],[["S","451"],["M","1389"],["L","1719"],["XL","1088"],["XXL","378"],["3XL","177"]]],["NPU",[["Purple","560D55"]],[["S","347"],["M","553"],["L","691"],["XL","230"],["XXL","101"]]],["ORA",[["Orange","FF4700"]],[["S","125"],["M","273"],["L","158"],["XL","98"],["XXL","98"]]],["RED",[["Red","FF002E"]],[["S","972"],["M","1186"],["L","1246"],["XL","889"],["XXL","184"]]],["ROY",[["Royal","1500CE"]],[["S","1078"],["M","1346"],["L","1102"],["XL","818"],["XXL","135"]]],["SKY",[["Sky","91E3FF"]],[["S","567"],["M","919"],["L","879"],["XL","498"],["XXL","240"]]],["SUN",[["Sunflower","FFC700"]],[["S","843"],["M","1409"],["L","1032"],["XL","560"],["XXL","53"]]],["WHI",[["White","FFFFFF"]],[["S","631"],["M","2217"],["L","1666"],["XL","847"],["XXL","410"],["3XL","74"]]]]
Firstly, the initial [ and the final ] can be removed.
Then it needs to be broken down into segments of colours, i.e.:
["BLK",[["Black","0F1315"]],[["S","813"],["M","1378"],["L","1119"],["XL","1069"],["XXL","412"],["3XL","171"]]]
The BLK is needed here, the next block [["Black","0F1315"]] can be disregarded.
Next I need to take the stock data for each size ["S","813"] etc
Therefore I should have a data such as:
$col = BLK
$size = S
$qty = 813
$col = BLK
$size = M
$qty = 1378
and repeat this for every colour segment in the data.
The number of colour segments in the data will vary, as will the number of sizing segments within each. The number of sizing segments will also vary from colour to colour, i.e. there may be 6 sizes for BLK but only 5 for RED.
The data will be written out while in the loop for these so something like print "$col:$size:$qty" will be fine as this would then be in a format ready to be processed.
Sorry for the long message, I just can't seem to get my head round this today!!
Regards,
Stu

This looks like valid JSON to me, why not use a JSON parser instead of trying to solve this with a regex?
use JSON;
my $json_string = '[["BLK",[["Black","0F1315"]],[["S","813"...<snip>';
my $deserialized = from_json( $json_string );
Then you can iterate over the array and extract the pieces of information you need.

Building on Tim Pietzcker's answer:
...
my $deserialized = from_json( $json_string );
foreach my $group ( @$deserialized ) {
    my ( $color, undef, $sizes ) = @$group;
    print join( ":", $color, @$_ ), "\n" for @$sizes;
}
(And yes, for this particular format, eval should do as well as from_json, although the latter is safer. However, you should really try to find an official spec for the format: is it really JSON or something else?)
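If installing the JSON module is not an option, the same approach can be sketched with JSON::PP, which ships with Perl since 5.14. This is a swapped-in module, not what the answers above use, and the input below is a shortened, hand-made sample in the same shape as the supplier data:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Shortened, hand-made sample in the same shape as the supplier data.
my $json_string =
    '[["BLK",[["Black","0F1315"]],[["S","813"],["M","1378"]]],'
  . '["RED",[["Red","FF002E"]],[["S","972"]]]]';

my $data = decode_json($json_string);

my @lines;
for my $group (@$data) {
    # Each group is: colour code, colour details (ignored), list of sizes.
    my ( $col, undef, $sizes ) = @$group;
    for my $pair (@$sizes) {
        my ( $size, $qty ) = @$pair;
        push @lines, "$col:$size:$qty";
    }
}

print "$_\n" for @lines;
```

Unlike eval, the parser only accepts data, so nothing in the supplier's file can be executed as Perl code.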

Assuming you have your data in $str, then eval(EXPR) (Danger Will Robinson!) and process the resulting data structure:
my $struct = eval $str;
foreach my $cref (@$struct) {
    my($color, undef, $sizerefs) = @$cref; # 3 elements in each top level
    foreach my $sizeref (@$sizerefs) {
        my($size, $qty) = @$sizeref;
        print "$color:$size:$qty\n";
    }
}
