parsing multiline nested tokens from a file in perl - regex

I have a file that looks like this:
Alpha 27600
Beta 1
Charlie true
BEGIN Delta
BEGIN Epsilon Setting High Hook 50 END
BEGIN Foxtrot Corp 71 END
BEGIN "Jelly Bean" Corp 88 END
END
BEGIN Hotel
Height 25
Lawn 85
END
Basically it is several key/value pairs separated by one or more spaces. The tricky part is the BEGIN/END blocks that might be nested and might span multiple lines. I need to go through the file and take some action based on what follows the BEGIN. For example, if it's Delta, I might need to process each of the sub BEGIN lines, whereas if it is Hotel, I can skip it completely.
I looked at Parse::RecDescent a little bit but wasn't sure how to make it handle the BEGIN/END situation properly. Speed isn't as important as having an easier to understand and maintain solution.
Any suggestions?
EDIT: I liked Miller's solution, but then looking over the data realized why I didn't just split on whitespace. Some of the labels have whitespace in them. Added "Jelly Bean" label in above data file to reflect that.

Just parse the whole data structure, and filter out sections you don't need after the fact:
use strict;
use warnings;
use Text::ParseWords;
my @tokens = parse_line( qr{\s+}, 0, do { local $/; <DATA> } );

my %hash;
my @levels = \%hash;

while ( defined( my $key = shift @tokens ) ) {
    if ( $key eq 'BEGIN' ) {
        push @levels, $levels[-1]{ shift @tokens } = {};
    } elsif ( $key eq 'END' ) {
        pop @levels;
    } else {
        $levels[-1]{$key} = shift @tokens;
    }
}
use Data::Dump;
dd \%hash;
__DATA__
Alpha 27600
Beta 1
Charlie true
BEGIN Delta
BEGIN Epsilon Setting High Hook 50 END
BEGIN Foxtrot Corp 71 END
BEGIN "Jelly Bean" Corp 88 END
END
BEGIN Hotel
Height 25
Lawn 85
END
Outputs:
{
  Alpha   => 27600,
  Beta    => 1,
  Charlie => "true",
  Delta   => {
    "Epsilon"    => { Hook => 50, Setting => "High" },
    "Foxtrot"    => { Corp => 71 },
    "Jelly Bean" => { Corp => 88 },
  },
  Hotel => { Height => 25, Lawn => 85 },
}
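For readers outside Perl, the same stack-of-dicts technique can be sketched in Python. `shlex.split` plays the role of Text::ParseWords here, honoring the quoted "Jelly Bean" label; the function name is illustrative:

```python
import shlex

def parse_blocks(text):
    """Parse whitespace-separated key/value tokens with nested BEGIN/END blocks."""
    tokens = shlex.split(text)   # honors quoted labels like "Jelly Bean"
    root = {}
    stack = [root]               # stack[-1] is the innermost open block
    it = iter(tokens)
    for tok in it:
        if tok == "BEGIN":
            child = {}
            stack[-1][next(it)] = child  # the label after BEGIN names the block
            stack.append(child)
        elif tok == "END":
            stack.pop()
        else:
            stack[-1][tok] = next(it)    # plain key/value pair
    return root

data = '''Alpha 27600
BEGIN Delta
BEGIN "Jelly Bean" Corp 88 END
END'''
print(parse_blocks(data))
```

Once the whole file is in a nested dict, skipping Hotel or walking Delta's sub-blocks is an ordinary dictionary lookup rather than a parsing concern.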

Personally I'd hack something up with Parser::MGC (though perhaps I'm biased, because I wrote it).
A nested use of its scope_of method will easily handle those BEGIN/END markers for you.


Convert a word's characters into its ascii code list concisely in Raku

I'm trying to convert the word wall into its ascii code list (119, 97, 108, 108) like this:
my @ascii="abcdefghijklmnopqrstuvwxyz";
my @tmp;
map { push @tmp, $_.ord if $_.ord == @ascii.comb.any.ord }, "wall".comb;
say @tmp;
Is there a way to use @tmp without declaring it on a separate line?
Is there a way to produce the ascii code list in one line instead of 3 lines? If so, how to do it?
Note that I have to use the @ascii variable i.e. I can't make use of the consecutively increasing ascii sequence (97, 98, 99 ... 122) because I plan to use this code for non-ascii languages too.
There are a couple of things we can do here to make it work.
First, let's tackle the @ascii variable. The @ sigil indicates a positional variable, but you assigned a single string to it. This creates a 1-element array ['abc...'], which will cause problems down the road. Depending on how general you need this to be, I'd recommend either creating the array directly:
my @ascii = <a b c d e f g h i j k l m n o p q r s t u v w x y z>;
my @ascii = 'a' .. 'z';
my @ascii = 'abcdefghijklmnopqrstuvwxyz'.comb;
or going ahead and handling the any part:
my $ascii-char = any <a b c d e f g h i j k l m n o p q r s t u v w x y z>;
my $ascii-char = any 'a' .. 'z';
my $ascii-char = 'abcdefghijklmnopqrstuvwxyz'.comb.any;
Here I've used the $ sigil, because any really specifies any single value, and so will function as such (which also makes our life easier). I'd personally use $ascii, but I'm using a separate name to make later examples more distinguishable.
Now we can handle the map function. Based on the above two versions of ascii, we can rewrite your map function to either of the following
{ push @tmp, $_.ord if $_ eq @ascii.any }
{ push @tmp, $_.ord if $_ eq $ascii-char }
Note that if you prefer to use ==, you can go ahead and create the numeric values in the initial ascii creation, and then use $_.ord. As well, personally, I like to name the mapped variable, e.g.:
{ push @tmp, $^char.ord if $^char eq @ascii.any }
{ push @tmp, $^char.ord if $^char eq $ascii-char }
where $^foo replaces $_ (if you use more than one, they map in alphabetical order to @_[0], @_[1], etc).
But let's get to the more interesting question here. How can we do all of this without needing to predeclare @tmp? Obviously, that just requires creating the array in the map loop. You might think that's tricky when we don't have an ASCII value, but the fact that an if statement returns Empty (or () ) if it's not run makes life really easy:
my @tmp = map { $^char.ord if $^char eq $ascii-char }, "wall".comb;
my @tmp = map { $^char.ord if $^char eq @ascii.any }, "wall".comb;
If we used "wáll", the list collected by map would be 119, Empty, 108, 108, which is automagically returned as 119, 108, 108. Consequently, @tmp is set to just 119, 108, 108.
Yes there is a much simpler way.
"wall".ords.grep('az'.ords.minmax);
Of course this relies on a to z being an unbroken sequence. This is because minmax creates a Range object based on the minimum and maximum value in the list.
If they weren't in an unbroken sequence you could use a junction.
"wall".ords.grep( 'az'.ords.minmax | 'AZ'.ords.minmax );
But you said that you want to match other languages. Which to me screams regex.
"wall".comb.grep( /^ <:Ll> & <:ascii> $/ ).map( *.ord )
This matches Lowercase Letters that are also in ASCII.
Actually we can make it even simpler. comb can take a regex which determines which characters it takes from the input.
"wall".comb( / <:Ll> & <:ascii> / ).map( *.ord )
# (119, 97, 108, 108)
"ΓΔαβγδε".comb( / <:Ll> & <:Greek> / ).map( *.ord )
# (945, 946, 947, 948, 949)
# Does not include Γ or Δ, as they are not lowercase
Note that the above only works with ASCII if you don't have a combining accent.
"de\c[COMBINING ACUTE ACCENT]f".comb( / <:Ll> & <:ascii> / )
# ("d", "f")
The Combining Acute Accent combines with the e which composes to Latin Small Letter E With Acute.
That composed character is not in ASCII so it is skipped.
It gets even weirder if there isn't a composed value for the character.
"f\c[COMBINING ACUTE ACCENT]".comb( / <:Ll> & <:ascii> / )
# ("f́",)
That is because the f is lowercase and in ASCII. The composing codepoint gets brought along for the ride though.
Basically if your data has, or can have combining accents and if it could break things, then you are better off dealing with it while it is still in binary form.
$buf.grep: {
    .uniprop() eq 'Ll'                    # lowercase letter
    && .uniprop('Block') eq 'Basic Latin' # ASCII
}
The above would also work for single character strings because .uniprop works on either integers representing a codepoint, or on the actual character.
"wall".comb.grep: {
.uniprop() eq 'Ll' #
&& .uniprop('Block') eq 'Basic Latin' # ASCII
}
Note again that this would have the same issues with composing codepoints since it works with strings.
You may also want to use .uniprop('Script') instead of .uniprop('Block') depending on what you want to do.
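For comparison, the same "lowercase and ASCII" codepoint filter can be sketched outside Raku, here in Python using the built-in str.islower and str.isascii predicates (the latter requires Python 3.7+):

```python
def lowercase_ascii_ords(s):
    """Return the codepoints of characters that are both lowercase and ASCII."""
    return [ord(c) for c in s if c.islower() and c.isascii()]

print(lowercase_ascii_ords("wall"))     # [119, 97, 108, 108]
print(lowercase_ascii_ords("ΓΔαβγδε"))  # [] -- Greek letters are lowercase but not ASCII
```

Like the Raku string-based versions, this operates on characters rather than raw bytes, so the same caveats about combining accents apply.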
Here's a working approach using Raku's trans method (code snippet performed in the Raku REPL):
> my #a = "wall".comb;
[w a l l]
> #a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put;
119 97 108 108
Above, we handle an ascii string. Below I add the "é" character, and show a 2-step solution:
> my #a = "wallé".comb;
[w a l l é]
> my #b = #a.trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') );
[119 97 108 108 é]
> #b.trans("é" => ords("é")).put
119 97 108 108 233
Nota bene #1: Although all the code above works fine, when I tried shortening the alphabet to 'a'..'z' I ended up seeing erroneous return values...hence the use of the full 'abcdefghijklmnopqrstuvwxyz'.
Nota bene #2: One question in my mind is trying to suppress output when trans fails to recognize a character (e.g. how to suppress assignment of "é" as the last element of @b in the second-example code above). I've tried adding the :delete argument to trans, but no luck.
EDITED: To remove unwanted characters, here's code using grep (à la @Brad Gilbert), followed by trans:
> my #a = "wallé".comb;
[w a l l é]
> @a.grep('a'..'z'.comb.any).trans('abcdefghijklmnopqrstuvwxyz' => ords('abcdefghijklmnopqrstuvwxyz') ).put
119 97 108 108

How to get the current line number in a multi-line list initializer of testcases?

Is there a way to reliably get the current line number during a Perl
multiline list assignment without explicitly using __LINE__? I am
storing testcases in a list and would like to tag each with its line
number.* That way I can do (roughly)
ok($_->[1], 'line ' . $_->[0]) for @tests;
And, of course, I would like to save typing compared to
putting __LINE__ at the beginning of each test case :) . I have
not been able to find a way to do so, and I have encountered some
confusing behaviour in the lines reported by caller.
* Possible XY, but I can't find a module to do it.
Update I found a hack and posted it as an answer. Thanks to @zdim for helping me think about the problem a different way!
MCVE
A long one, because I've tried several different options. my_eval,
L(), and L2{} are some I've tried so far — L() was the one
I initially hoped would work. Jump down to my #testcases to see how
I'm using these. When testing, do copy the shebang line.
Here's my non-MCVE use case, if you are interested.
#!perl
use strict; use warnings; use 5.010;
# Modified from https://www.effectiveperlprogramming.com/2011/06/set-the-line-number-and-filename-of-string-evals/#comment-155 by http://sites.google.com/site/shawnhcorey/
sub my_eval {
    my ( $expr ) = @_;
    my ( undef, $file, $line ) = caller;
    my $code = "# line $line \"$file\"\n" . $expr;
    unless(defined wantarray) {
        eval $code; die $@ if $@;
    } elsif(wantarray) {
        my @retval = eval $code; die $@ if $@; return @retval;
    } else {
        my $retval = eval $code; die $@ if $@; return $retval;
    }
}
sub L { # Prepend caller's line number
    my (undef, undef, $line) = caller;
    return ["$line", @_];
} #L
sub L2(&) { # Prepend caller's line number
    my $fn = shift;
    my (undef, undef, $line) = caller;
    return ["$line", &$fn];
} #L2
# List of [line number, item index, expected line number, type]
my @testcases = (
    ([__LINE__,0,32,'LINE']),
    ([__LINE__,1,33,'LINE']),
    (L(2,34,'L()')),
    (L(3,35,'L()')),
    (do { L(4,36,'do {L}') }),
    (do { L(5,37,'do {L}') }),
    (eval { L(6,38,'eval {L}') }),
    (eval { L(7,39,'eval {L}') }),
    (eval "L(8,40,'eval L')"),
    (eval "L(9,41,'eval L')"),
    (my_eval("L(10,42,'my_eval L')")),
    (my_eval("L(11,43,'my_eval L')")),
    (L2{12,44,'L2{}'}),
    (L2{13,45,'L2{}'}),
);
foreach my $idx (0..$#testcases) {
    printf "%2d %-10s line %2d expected %2d %s\n",
        $idx, $testcases[$idx]->[3], $testcases[$idx]->[0],
        $testcases[$idx]->[2],
        ($testcases[$idx]->[0] != $testcases[$idx]->[2]) && '*';
}
Output
With my comments added.
0 LINE line 32 expected 32
1 LINE line 33 expected 33
Using __LINE__ expressly works fine, but I'm looking for an
abbreviation.
2 L() line 45 expected 34 *
3 L() line 45 expected 35 *
L() uses caller to get the line number, and reports a line later
in the file (!).
4 do {L} line 36 expected 36
5 do {L} line 45 expected 37 *
When I wrap the L() call in a do{}, caller returns the correct
line number — but only once (!).
6 eval {L} line 38 expected 38
7 eval {L} line 39 expected 39
Block eval, interestingly, works fine. However, it's no shorter
than __LINE__.
8 eval L line 1 expected 40 *
9 eval L line 1 expected 41 *
String eval gives the line number inside the eval (no surprise)
10 my_eval L line 45 expected 42 *
11 my_eval L line 45 expected 43 *
my_eval() is a string eval plus a #line directive based on
caller. It also gives a line number later in the file (!).
12 L2{} line 45 expected 44 *
13 L2{} line 45 expected 45
L2 is the same as L, but it takes a block that returns a list,
rather than
the list itself. It also uses caller for the line number. And it
is correct once, but not twice (!). (Possibly just because it's the last
item — my_eval reported line 45 also.)
So, what is going on here? I have heard of Deparse and wonder if this is
optimization-related, but I don't know enough about the engine to know
where to start investigating. I also imagine this could be done with source
filters or Devel::Declare, but that is well beyond my
level of experience.
Take 2
@zdim's answer got me started thinking about fluent interfaces, e.g., as in my answer:
$testcases2 # line 26
->add(__LINE__,0,27,'LINE')
->add(__LINE__,1,28,'LINE')
->L(2,29,'L()')
->L(3,30,'L()')
->L(4,31,'L()')
;
However, even those don't work here — I get line 26 for each of the ->L() calls. So it appears that caller sees all of the chained calls as coming from the $testcases2->... line. Oh well. I'm still interested in knowing why, if anyone can enlighten me!
The caller can get only the line numbers of statements, decided at compilation.
When I change the code to
my @testcases;
push @testcases, ([__LINE__,0,32,'LINE']);
push @testcases, ([__LINE__,1,33,'LINE']);
push @testcases, (L(2,34,'L()'));
push @testcases, (L(3,35,'L()'));
...
maintaining line numbers, it works (except for string evals).
So, on the practical side, using caller is fine with separate statements for calls.
Perl internals
The line numbers are baked into the op-tree at compilation and (my emphasis)
At run-time, only the line numbers of statements are available [...]
from ikegami's post on PerlMonks.
We can see this by running perl -MO=Concise script.pl where the line
2 nextstate(main 25 line_nos.pl:45) v:*,&,{,x*,x&,x$,$,67108864 ->3
is for the nextstate op, which sets the line number for caller (and warnings). See this post, and the nextstate example below.
A way around this would be to try to trick the compilation (somehow) or, better of course, to not assemble information in a list like that. One such approach is in the answer by cxw.
See this post for a related case and more detail.
nextstate example
Here's a multi-line function-call chain run through Concise (annotated):
$ perl -MO=Concise -e '$x
->foo()
->bar()
->bat()'
d <@> leave[1 ref] vKP/REFC ->(end)
1 <0> enter ->2
2 <;> nextstate(main 1 -e:1) v:{ ->3 <=== the only nextstate
c <1> entersub[t4] vKRS/TARG ->d
3 <0> pushmark s ->4
a <1> entersub[t3] sKRMS/LVINTRO,TARG,INARGS ->b
4 <0> pushmark s ->5
8 <1> entersub[t2] sKRMS/LVINTRO,TARG,INARGS ->9
5 <0> pushmark s ->6
- <1> ex-rv2sv sKM/1 ->7
6 <#> gvsv[*x] s ->7
7 <.> method_named[PV "foo"] s ->8
9 <.> method_named[PV "bar"] s ->a
b <.> method_named[PV "bat"] ->c
-e syntax OK
Even though successive calls are on separate lines, they are part of the same statement, so are all attached to the same nextstate.
Edit: This answer is now wrapped in a CPAN module (GitHub)!
@zdim's answer got me thinking about fluent interfaces. Below are two hacks that work for my particular use case, but that don't help me understand the behaviour reported in the question. If you can help, please post another answer!
Hack 2 (newer) (the one now on CPAN)
I think this one is very close to minimal. In Perl, you can call a subroutine through a reference with $ref->(), and you can leave out the second and subsequent ->s in a chain of calls. That means, for example, that you can do:
my $foo; $foo=sub { say shift; return $foo; };
$foo->(1)
(2)
(3);
Looks good, right? So here's the MCVE:
#!perl
use strict; use warnings; use 5.010;
package FluentAutoIncList2 {
    sub new { # call as $class->new(__LINE__); each element is one line
        my $class = shift;
        my $self = bless {lnum => shift // 0, arr => []}, $class;
        # Make a loader that adds an item and returns itself --- not $self
        $self->{loader} = sub { $self->L(@_); return $self->{loader} };
        return $self;
    }
    sub size { return scalar @{ shift->{arr} }; }
    sub last { return shift->size-1; } # $#
    sub load { goto &{ shift->{loader} } } # kick off loading
    sub L { # Push a new record with the next line number on the front
        my $self = shift;
        push @{ $self->{arr} }, [++$self->{lnum}, @_];
        return $self;
    } #L
    sub add { # just add it
        my $self = shift;
        ++$self->{lnum}; # keep it consistent
        push @{ $self->{arr} }, [@_];
        return $self;
    } #add
} #FluentAutoIncList2
# List of [line number, item index, expected line number, type]
my $testcases = FluentAutoIncList2->new(__LINE__)   # line 28
    ->add(__LINE__,0,36,'LINE')
    ->add(__LINE__,1,37,'LINE');
$testcases->load(2,38,'load')->     # <== Only need two arrows.
    (3,39,'chain load')             # <== After that, () are enough.
    (4,40,'chain load')
    (5,41,'chain load')
    (6,42,'chain load')
    (7,43,'chain load')
    ;
foreach my $idx (0..$testcases->last) {
    printf "%2d %-10s line %2d expected %2d %s\n",
        $idx, $testcases->{arr}->[$idx]->[3],
        $testcases->{arr}->[$idx]->[0],
        $testcases->{arr}->[$idx]->[2],
        ($testcases->{arr}->[$idx]->[0] !=
         $testcases->{arr}->[$idx]->[2]) && '*';
}
Output:
0 LINE line 36 expected 36
1 LINE line 37 expected 37
2 load line 38 expected 38
3 chain load line 39 expected 39
4 chain load line 40 expected 40
5 chain load line 41 expected 41
6 chain load line 42 expected 42
7 chain load line 43 expected 43
All the chain load lines were loaded with zero extra characters compared to the original [x, y] approach. Some overhead, but not much!
Hack 1
Code:
By starting with __LINE__ and assuming a fixed number of lines per call, a counter will do the trick. This could probably be done more cleanly with a tie.
#!perl
use strict; use warnings; use 5.010;
package FluentAutoIncList {
    sub new { # call as $class->new(__LINE__); each element is one line
        my $class = shift;
        return bless {lnum => shift // 0, arr => []}, $class;
    }
    sub size { return scalar @{ shift->{arr} }; }
    sub last { return shift->size-1; } # $#
    sub L { # Push a new record with the next line number on the front
        my $self = shift;
        push @{ $self->{arr} }, [++$self->{lnum}, @_];
        return $self;
    } #L
    sub add { # just add it
        my $self = shift;
        ++$self->{lnum}; # keep it consistent
        push @{ $self->{arr} }, [@_];
        return $self;
    } #add
} #FluentAutoIncList
# List of [line number, item index, expected line number, type]
my $testcases = FluentAutoIncList->new(__LINE__)   # line 28
    ->add(__LINE__,0,29,'LINE')
    ->add(__LINE__,1,30,'LINE')
    ->L(2,31,'L()')
    ->L(3,32,'L()')
    ->L(4,33,'L()')
    ;
foreach my $idx (0..$testcases->last) {
    printf "%2d %-10s line %2d expected %2d %s\n",
        $idx, $testcases->{arr}->[$idx]->[3],
        $testcases->{arr}->[$idx]->[0],
        $testcases->{arr}->[$idx]->[2],
        ($testcases->{arr}->[$idx]->[0] !=
         $testcases->{arr}->[$idx]->[2]) && '*';
}
Output:
0 LINE line 29 expected 29
1 LINE line 30 expected 30
2 L() line 31 expected 31
3 L() line 32 expected 32
4 L() line 33 expected 33
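The counter trick behind Hack 1 is language-independent: seed an object with the current line number and bump the counter once per add, assuming exactly one call per source line. A minimal Python sketch (class and method names are illustrative, not part of the CPAN module above):

```python
class AutoIncList:
    """Tag each added record with a line number derived from a seed counter.

    Assumes exactly one add() call per physical source line after the seed line,
    mirroring the "fixed number of lines per call" assumption of Hack 1.
    """
    def __init__(self, seed_line):
        self.lnum = seed_line   # line number of the constructor call
        self.arr = []

    def add(self, *fields):
        self.lnum += 1                       # next physical line
        self.arr.append([self.lnum, *fields])
        return self                          # enable fluent chaining

t = (AutoIncList(28)
     .add(0, 29, 'LINE')
     .add(1, 30, 'LINE'))
print(t.arr)  # [[29, 0, 29, 'LINE'], [30, 1, 30, 'LINE']]
```

The same fragility applies as in the Perl version: reformatting the call chain onto different lines silently breaks the recorded numbers.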

Regex pattern to match groups starting with pattern

I am extracting data from a text stream which is structured like this:
/1-<id>/<recType>-<data>..repeat n times../1-<id>/#-<data>..repeat n times..
In the above, the "/1" field precedes the record data which can then have any number of following fields, each with choice of recType from 2 to 9 (also, each field starts with a "/")
For example:
/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE
So, there are three groups of data above
1=XXXX 2=YYYY 9=ZZZZ
1=AAAA 3=BBBB 5=CCCC 8=NNNN 9=DDDD
1=QQQQ 2=WWWW 3=PPPP 7=EEEE
The data is simplified; I know for certain that it only contains [A-Z0-9. ], but values can be variable length (not just 4 chars as in the example).
Now, the following expression sort of works, but it's only capturing the first 2 fields of each group and none of the remaining fields...
/1-(?'fld1'[A-Z]+)/((?'fldNo'[2-9])-(?'fldData'[A-Z0-9\. ]+))
I know I need some sort of quantifier in there somewhere, but I do not know what or where to place it.
You can use a regex to match these blocks using two .NET regex features: 1) capture collections and 2) multiple capturing groups with the same name in one pattern. Then, we'll need some LINQ magic to combine the captured data into a list of lists:
(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*
Details:
(?<fldNo>1) - Group fldNo matching 1
- - a hyphen
(?'fldData'[^/]+) - Group "fldData" capturing 1+ chars other than /
(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))* - zero or more sequences of:
  / - a literal /
  (?<fldNo>[2-9]) - a digit from 2 to 9 (Group "fldNo")
  [-=] - a - or =
  (?'fldData'[^/]+) - 1+ chars other than / (Group "fldData")
See the regex demo. And here is a C# demo:
using System;
using System.Linq;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var str = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
        var res = Regex.Matches(str, @"(?<fldNo>1)-(?'fldData'[^/]+)(?:/(?<fldNo>[2-9])[-=](?'fldData'[^/]+))*")
            .Cast<Match>()
            .Select(p => p.Groups["fldNo"].Captures.Cast<Capture>().Select(m => m.Value)
                .Zip(p.Groups["fldData"].Captures.Cast<Capture>().Select(m => m.Value),
                    (first, second) => first + "=" + second))
            .ToList();
        foreach (var t in res)
            Console.WriteLine(string.Join(" ", t));
    }
}
I would suggest first splitting the string on /1, then using a pattern along these lines:
\/([1-9])[=-]([A-Z]+)
https://regex101.com/r/0nyzzZ/1
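That split-then-match idea can be sketched in Python's re module: split immediately before each /1 field with a lookahead, then pull out the field pairs of each chunk (variable names are illustrative):

```python
import re

stream = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD"

groups = []
for chunk in re.split(r"(?=/1[-=])", stream):  # zero-width split before each /1 field
    if not chunk:
        continue                               # skip the empty leading piece
    fields = re.findall(r"/([1-9])[-=]([A-Z0-9. ]+)", chunk)
    groups.append(dict(fields))

print(groups)
```

Each group comes out as a recType-to-data mapping, which matches the "three groups of data" shape the question describes.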
A single regex isn't the optimal tool for doing this (at least used in this way). The main reason is because your stream has a variable number of entries in it, and using a variable number of capture groups is not supported. I also noticed some of the values had "=" between them as well as the dash, which your current regex doesn't address.
The problem comes when you try and add a quantifier to a capture group - the group will only remember the last thing it captured, so if you add a quantifier, it will end up catching the first and last fields, leaving out all the rest of them. So something like this won't work:
\/1-(?'fld1'[A-Z]+)(?:\/(?'fldNo'[2-9])[-=](?'fldData'[A-Z]+))+
If your streams were all the same length, then a single regex could be used, but there's a way to do it using a foreach loop with a much simpler regex working on each part of your stream (so it verifies your stream as well when it goes along!)
Now I'm not sure what language you're working with when using this, but here is a solution in PHP that I think delivers what you need.
function extractFromStream($str)
{
    /*
     * Get an array of [num]-[letters] with explode. This will make an array that
     * contains [0] => 1-AAAA, [1] => 2-BBBB ... etc
     */
    $arr = explode("/", substr($str, 1));
    $sorted = array();
    $key = 0;
    /*
     * Sort this data into key->values based on numeric ordering.
     * If the next one has a lower or equal starting number than the one before it,
     * a new entry will be created. i.e. 2-aaaa => 1-cccc will cause a new
     * entry to be made, just in case the stream doesn't always start with 1.
     */
    foreach ($arr as $value)
    {
        // This will get the number at the start, and has the added bonus of making sure
        // each bit is in the right format.
        if (preg_match("/^([0-9]+)[=-]([A-Z]+)$/", $value, $matches)) {
            $newKey = (int)$matches[1];
            $match = $matches[2];
        } else
            throw new Exception("This is not a valid data stream!");
        // This bit checks if we've got a lower starting number than last time.
        if (isset($lastKey) && is_int($lastKey) && $newKey <= $lastKey)
            $key += 1;
        // Now sort them..
        $sorted[$key][$newKey] = $match;
        // This will be compared in the next iteration of the loop.
        $lastKey = $newKey;
    }
    return $sorted;
}
Here's how you can use it...
$full = "/1-XXXX/2-YYYY/9-ZZZZ/1-AAAA/3-BBBB/5-CCCC/8=NNNN/9=DDDD/1-QQQQ/2-WWWW/3=PPPP/7-EEEE";
try {
$extracted = extractFromStream($full);
$stream1 = $extracted[0];
$stream2 = $extracted[1];
$stream3 = $extracted[2];
print "<pre>";
echo "Full extraction: \n";
print_r($extracted);
echo "\nFirst Stream:\n";
print_r($stream1);
echo "\nSecond Stream:\n";
print_r($stream2);
echo "\nThird Stream:\n";
print_r($stream3);
print "</pre>";
} catch (Exception $e) {
echo $e->getMessage();
}
This will print
Full extraction:
Array
(
    [0] => Array
        (
            [1] => XXXX
            [2] => YYYY
            [9] => ZZZZ
        )

    [1] => Array
        (
            [1] => AAAA
            [3] => BBBB
            [5] => CCCC
            [8] => NNNN
            [9] => DDDD
        )

    [2] => Array
        (
            [1] => QQQQ
            [2] => WWWW
            [3] => PPPP
            [7] => EEEE
        )
)

First Stream:
Array
(
    [1] => XXXX
    [2] => YYYY
    [9] => ZZZZ
)

Second Stream:
Array
(
    [1] => AAAA
    [3] => BBBB
    [5] => CCCC
    [8] => NNNN
    [9] => DDDD
)

Third Stream:
Array
(
    [1] => QQQQ
    [2] => WWWW
    [3] => PPPP
    [7] => EEEE
)
So you can see you have the numbers as the array keys, and the values they correspond to, which are now readily accessible for further processing. I hope this helps you :)

How to get the value from the list that appears only once?

I saw this question on the internet: get the only number that is present just once in the list, while all other numbers are present twice. The data is large, containing about a million unsorted numbers, possibly including negative numbers, in random order; every number appears twice except one, which appears only once.
my @array = (1,1,2,3,3,4,4);
output :
2
Only 2 is not repeated in the list. I tried these solutions:
my $unique;
$unique ^= $_ for (@array);
say $unique;
It doesn't work on negative numbers, but it's fast.
I tried a hash where the key is the number and the value is how many times it appears in the list. Then reverse the hash and print the value whose count is 1, since all the other numbers have a count of 2 because they appear twice. The hash solution works for negative numbers, but is slow with a large input of a million numbers.
I tried a regex approach of joining the entire list into a single string and then using
my $combined = join " ", @array;
$combined !~ /(\d+).*$1/;
say $1;
but I get only the last number of the list
Is there a fast way to do it? Any idea of using a regex?
Edit: Rephrased the title for better answers
This seems pretty fast:
use v5.10; use strict; use warnings;
sub there_can_be_only_one {
    my @counts;
    $counts[ $_>=0 ? 2*$_ : (-2*$_)-1 ]++ for @{$_[0]};
    $counts[ $_>=0 ? 2*$_ : (-2*$_)-1 ]==1 and return $_ for @{$_[0]};
    return;
}
my @array = (1,1,-4,-4,2,3,-1,3,4,-1,4);
say there_can_be_only_one(\@array);
It's basically a variation of the hash technique, but using an array instead of a hash. Because we need to deal with negative numbers, we can't use them unmodified in the @counts array. Negative indexes do work in Perl of course, but they'd overwrite our data for positive indexes. Fail.
So we use something similar to two's complement. We store positive numbers in the array as 2*$_ and negative numbers as (-2*$_)-1. That is:
Integer: ... -3 -2 -1 0 1 2 3 ...
Stored as: ... 5 3 1 0 2 4 6 ...
Because this solution doesn't rely on sorting the list, and simply does two passes over it (well, on average, one and a half passes), it performs at O(n) in contrast to Schwern's O(n log n) solution. Thus for larger lists (a few million integers) it should be significantly faster. Here's a quick comparison on my (fairly low-powered) netbook:
use v5.10; use strict; use warnings;
use Benchmark qw(timethese);
use Time::Limit '60';
sub tobyink {
    my @counts;
    $counts[ $_>=0 ? 2*$_ : (-2*$_)-1 ]++ for @{$_[0]};
    $counts[ $_>=0 ? 2*$_ : (-2*$_)-1 ]==1 and return $_ for @{$_[0]};
    return;
}
sub schwern {
    my @nums = sort @{$_[0]};
    return $nums[0] if $nums[0] != $nums[1];
    for (1..$#nums-1) {
        my($prev, $this, $next) = @nums[$_-1, $_, $_+1];
        return $this if $prev != $this && $next != $this;
    }
    return $nums[-1] if $nums[-1] != $nums[-2];
}
my @input = (
    1..2_000_000,   # 1_000_001 only appears once
    1..1_000_000, 1_000_002..2_000_000,
);
timethese(1, {
    tobyink => sub { tobyink(\@input) },
    schwern => sub { schwern(\@input) },
});
__END__
Benchmark: timing 1 iterations of schwern, tobyink...
   schwern: 11 wallclock secs ( 8.72 usr +  0.92 sys =  9.64 CPU) @  0.10/s (n=1)
            (warning: too few iterations for a reliable count)
   tobyink:  5 wallclock secs ( 5.01 usr +  0.08 sys =  5.09 CPU) @  0.20/s (n=1)
            (warning: too few iterations for a reliable count)
UPDATE: in my initial answer I missed the detail that no number will appear more than twice. I'd assumed that it was possible for some numbers to appear three or more times. Using this additional detail, we can go even faster:
sub there_can_be_only_one {
    my $tmp;
    $tmp ^= $_>=0 ? 2*$_ : (-2*$_)-1 for @{$_[0]};
    $tmp%2 ? ($tmp+1)/-2 : $tmp/2;
}
say there_can_be_only_one(\@array);
This runs about 30% faster than my initial answer.
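The signed-to-unsigned trick composes cleanly with XOR because the mapping (n >= 0 becomes 2n, n < 0 becomes -2n-1) is a bijection onto the non-negative integers. A Python sketch of the same idea:

```python
def only_once(nums):
    """XOR-fold a list in which every value appears twice except one.

    Negative values are first mapped bijectively onto non-negative codes
    (n >= 0 -> 2n, n < 0 -> -2n-1) so XOR stays well-defined; paired
    values cancel, and the surviving code is mapped back at the end.
    """
    acc = 0
    for n in nums:
        acc ^= 2 * n if n >= 0 else -2 * n - 1
    # Odd codes came from negatives, even codes from non-negatives.
    return -(acc + 1) // 2 if acc % 2 else acc // 2

print(only_once([1, 1, -4, -4, 2, 3, -1, 3, 4, -1, 4]))  # 2
```

Like the Perl version, this is a single O(n) pass with O(1) extra memory, and it relies on the guarantee that every other number appears exactly twice.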
The standard way to deal with this is to throw it all into a hash.
use v5.10;
use strict;
use warnings;
my @nums = (2..500_000, 500_002..1_000_000, 0..1_000_001);
my %count;
for (@nums) {
    $count{$_}++
}
for (keys %count) {
    say $_ if $count{$_} == 1;
}
But yes, it's quite slow.
Then I thought maybe I could avoid having to loop through the hash to find the singles...
my @nums = (2..500_000, 500_002..1_000_000, 0..1_000_001);
my %uniqs;
my %dups;
for (@nums) {
    if( $uniqs{$_} ) {
        delete $uniqs{$_};
        $dups{$_} = 1;
    }
    elsif( !$dups{$_} ) {
        $uniqs{$_} = 1;
    }
}
print join ", ", keys %uniqs;
But that was even slower.
This is the fastest thing I've come up with, takes about half the time as the above.
use v5.10;
use strict;
use warnings;
my @nums = (2..500_000, 500_002..1_000_000, 0..1_000_001);
@nums = sort @nums;
say $nums[0] if $nums[0] != $nums[1];
for (1..$#nums-1) {
    my($prev, $this, $next) = @nums[$_-1, $_, $_+1];
    say $this if $prev != $this && $next != $this;
}
say $nums[-1] if $nums[-1] != $nums[-2];
By sorting the list, you can iterate through it and check if a given entry's neighbors are duplicates. Have to be careful about the first and last elements. I put their checks outside the loop to avoid having to run a special case for every iteration.
Because sort is O(nlogn), as the list of numbers gets larger this solution will eventually be slower than the hash-based one, but you'll probably run out of memory before that happens.
Finally, if this list is large, you should consider storing it on disk in a database. Then you can avoid using up memory and let the database do the work efficiently.
It doesn't work on negative numbers, but it's fast.
Actually, if you want xor to work on negative numbers, you just need to stringify them:
my @array = (-10..-7,-5..10,-10..10);
my $unique;
$unique ^= "$_" for @array;
say $unique;
Outputs
-6
And doing some quick benchmarks:
Benchmark: timing 100 iterations of schwern, there_can_be_only_one, tobyink, xor_string...
   schwern: 323 wallclock secs (312.42 usr +  7.08 sys = 319.51 CPU) @  0.31/s (n=100)
there_can_be_only_one: 114 wallclock secs (113.49 usr +  0.02 sys = 113.51 CPU) @  0.88/s (n=100)
   tobyink: 177 wallclock secs (176.76 usr +  0.14 sys = 176.90 CPU) @  0.57/s (n=100)
xor_string:  98 wallclock secs ( 97.05 usr +  0.00 sys =  97.05 CPU) @  1.03/s (n=100)
Shows that xor-ing the string goes 15% faster than xor-ing the mathematical translation to the positive numbers.
Corollary - What about with a sorted list?
Schwern's solution brings up an interesting corollary. He sorted the list and then did a search for all of the unique elements.
If we use the additional information that there is only one singleton in a crowd of doubletons, we can simplify the search by doing a pairwise comparison, which reduces our comparisons by a factor of 4.
However, we can do even better by doing a binary search. If we separate the list on a barrier between a known matched pair, then whichever of the two remaining lists is odd contains our singleton. I did some benchmarking of this solution, and it's orders of magnitude faster than anything else (of course):
use strict;
use warnings;
use Benchmark qw(timethese);
sub binary_search {
    my $nums = $_[0];
    my $min = 0;
    my $max = $#$nums;
    while ($min < $max) {
        my $half = ($max - $min) / 2; # should always be an integer
        my ($prev, $this, $next) = ($min+$half-1) .. ($min+$half+1);
        if ($nums->[$prev] == $nums->[$this]) {
            if ($half % 2) { # 0 0 1 1 2 2 3 ( half = 3 )
                $min = $next;
            } else {         # 0 1 1 2 2 ( half = 2 )
                $max = $prev - 1;
            }
        } elsif ($nums->[$this] == $nums->[$next]) {
            if ($half % 2) { # 0 1 1 2 2 3 3 ( half = 3 )
                $max = $prev;
            } else {         # 0 0 1 1 2 ( half = 2 )
                $min = $next + 1;
            }
        } else {
            $max = $min = $this;
        }
    }
    return $nums->[$min];
}
sub xor_string {
    my $tmp;
    $tmp ^= "$_" for @{$_[0]};
}
sub brute {
    my $nums = $_[0];
    return $nums->[0] if $nums->[0] != $nums->[1];
    for (1..$#$nums-1) {
        my ($prev, $this, $next) = @$nums[$_-1, $_, $_+1];
        return $this if $prev != $this && $next != $this;
    }
    return $nums->[-1] if $nums->[-1] != $nums->[-2];
}
sub pairwise_search {
    my $nums = $_[0];
    for (my $i = 0; $i <= $#$nums; $i += 2) {
        if ($nums->[$i] != $nums->[$i+1]) {
            return $nums->[$i];
        }
    }
}
# Note: this test data is very specific and is intended to take near the maximum
# number of steps for a binary search while shortcutting halfway for brute force
# and pairwise
my @input = sort {$a <=> $b} (0..500_003, 500_005..1_000_000, 0..1_000_000);
#my @input = sort {$a <=> $b} (0..499_996, 499_998..1_000_000, 0..1_000_000);
timethese(1000, {
    brute      => sub { brute(\@input) },
    pairwise   => sub { pairwise_search(\@input) },
    xor_string => sub { xor_string(\@input) },
    binary     => sub { binary_search(\@input) },
});
Results:
Benchmark: timing 1000 iterations of binary, brute, pairwise, xor_string...
binary: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU) @ 62500.00/s (n=1000)
(warning: too few iterations for a reliable count)
brute: 472 wallclock secs (469.92 usr + 0.05 sys = 469.97 CPU) @ 2.13/s (n=1000)
pairwise: 216 wallclock secs (214.74 usr + 0.00 sys = 214.74 CPU) @ 4.66/s (n=1000)
xor_string: 223 wallclock secs (221.74 usr + 0.06 sys = 221.80 CPU) @ 4.51/s (n=1000)

Perl regex & data extraction/manipulation

I'm not sure where to start with this one... my client gets stock figures from his supplier but they are now being sent in a different format, here is a sample snippet:
[["BLK",[["Black","0F1315"]],[["S","813"],["M","1378"],["L","1119"],["XL","1069"],["XXL","412"],["3XL","171"]]],["BOT",[["Bottle","15451A"]],[["S","226"],["M","425"],["L","772"],["XL","509"],["XXL","163"]]],["BUR",[["Burgundy","73002E"]],[["S","402"],["M","530"],["L","356"],["XL","257"],["XXL","79"]]],["DNA",[["Deep Navy","000F33"]],[["S","699"],["M","1161"],["L","1645"],["XL","1032"],["XXL","350"]]],["EME",[["Emerald","0DAB5E"]],[["S","392"],["M","567"],["L","613"],["XL","431"],["XXL","97"]]],["HEA",[["Heather","C0D4D7"]],[["S","374"],["M","447"],["L","731"],["XL","386"],["XXL","115"],["3XL","26"]]],["KEL",[["Kelly","0FFF00"]],[["S","167"],["M","285"],["L","200"],["XL","98"],["XXL","45"]]],["NAV",[["Navy","002466"]],[["S","451"],["M","1389"],["L","1719"],["XL","1088"],["XXL","378"],["3XL","177"]]],["NPU",[["Purple","560D55"]],[["S","347"],["M","553"],["L","691"],["XL","230"],["XXL","101"]]],["ORA",[["Orange","FF4700"]],[["S","125"],["M","273"],["L","158"],["XL","98"],["XXL","98"]]],["RED",[["Red","FF002E"]],[["S","972"],["M","1186"],["L","1246"],["XL","889"],["XXL","184"]]],["ROY",[["Royal","1500CE"]],[["S","1078"],["M","1346"],["L","1102"],["XL","818"],["XXL","135"]]],["SKY",[["Sky","91E3FF"]],[["S","567"],["M","919"],["L","879"],["XL","498"],["XXL","240"]]],["SUN",[["Sunflower","FFC700"]],[["S","843"],["M","1409"],["L","1032"],["XL","560"],["XXL","53"]]],["WHI",[["White","FFFFFF"]],[["S","631"],["M","2217"],["L","1666"],["XL","847"],["XXL","410"],["3XL","74"]]]]
Firstly, the initial [ and final ] can be removed.
Then it needs to be broken down into segments of colours, i.e.:
["BLK",[["Black","0F1315"]],[["S","813"],["M","1378"],["L","1119"],["XL","1069"],["XXL","412"],["3XL","171"]]]
The BLK is needed here, the next block [["Black","0F1315"]] can be disregarded.
Next I need to take the stock data for each size ["S","813"] etc
Therefore I should have a data such as:
$col = BLK
$size = S
$qty = 813
$col = BLK
$size = M
$qty = 1378
and repeat this segment for every colour segment in the data.
The amount of colour segments in the data will vary, as will the amount of sizing segments within. Also, the amount of sizing segments will vary from colour to colour, i.e. there may be 6 sizes for BLK but only 5 for RED.
The data will be written out while in the loop for these so something like print "$col:$size:$qty" will be fine as this would then be in a format ready to be processed.
Sorry for the long message, I just can't seem to get my head round this today!!
Regards,
Stu
This looks like valid JSON to me; why not use a JSON parser instead of trying to solve this with a regex?
use JSON;
my $json_string = '[["BLK",[["Black","0F1315"]],[["S","813"...<snip>';
my $deserialized = from_json( $json_string );
Then you can iterate over the array and extract the pieces of information you need.
Building on Tim Pietzcker's answer:
...
my $deserialized = from_json( $json_string );
foreach my $group ( @$deserialized ) {
    my ( $color, undef, $sizes ) = @$group;
    print join( ":", $color, @$_ ), "\n" for @$sizes;
}
(And yes, for this particular format, eval should do as well as from_json, although the latter is safer. However, you should really try to find an official spec for the format: is it really JSON or something else?)
Assuming you have your data in $str, then eval(EXPR) (Danger Will Robinson!) and process the resulting data structure:
my $struct = eval $str;
foreach my $cref (@$struct) {
    my ($color, undef, $sizerefs) = @$cref;   # 3 elements in each top level
    foreach my $sizeref (@$sizerefs) {
        my ($size, $qty) = @$sizeref;
        print "$color:$size:$qty\n";
    }
}
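For completeness, here is a self-contained sketch of the JSON-parser approach from the earlier answers, using the core JSON::PP module. The $json_string below is a shortened, illustrative sample of the feed, not the full data:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Shortened sample of the supplier feed, for illustration only.
my $json_string = '[["BLK",[["Black","0F1315"]],[["S","813"],["M","1378"]]],'
                . '["RED",[["Red","FF002E"]],[["S","972"]]]]';

my $deserialized = decode_json($json_string);

my @lines;
for my $group (@$deserialized) {
    # Each group is [ colour_code, [[name, hex]], [[size, qty], ...] ];
    # the middle element (colour name and hex) is discarded.
    my ($color, undef, $sizes) = @$group;
    push @lines, join(":", $color, @$_) for @$sizes;
}
print "$_\n" for @lines;   # BLK:S:813, BLK:M:1378, RED:S:972
```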