I want to split the following string at the pipe character, without splitting at the escaped pipe:
"123|ABC|x\|yz|123" should result in ["123","ABC","x|yz","123"]
Does anyone have such a split regexp for Perl?
You could use a negative lookbehind:
use warnings 'all';
use strict;
use Data::Dumper;
my $str = '123|ABC|x\|yz|123';
my @bits = split /(?<!\\)\|/, $str;
print Dumper(@bits);
Results in:
$VAR1 = '123';
$VAR2 = 'ABC';
$VAR3 = 'x\\|yz';
$VAR4 = '123';
As pointed out by Wiktor, if your string was of the form:
my $str = '123|ABC|x\|yz|123\\|456|123\\345';
The 123\\ would be grouped with 456 (although the last string,
123\\345, would be okay):
$VAR1 = '123';
$VAR2 = 'ABC';
$VAR3 = 'x\\|yz';
$VAR4 = '123\\|456';
$VAR5 = '123\\345';
This is because the negative lookbehind only checks the single character before the pipe, so it cannot tell an escaping backslash from a backslash that is itself escaped.
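One way around this limitation is to match the fields instead of splitting on the delimiter. The following is a minimal sketch, not the approach from the answer above: it assumes that a backslash escapes whatever character follows it and that empty fields can be ignored, and the input string and the name @fields are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: '\\\\' in a single-quoted literal is two real backslashes,
# i.e. an escaped backslash followed by an unescaped (delimiting) pipe.
my $str = '123|ABC|x\|yz|123\\\\|456';

# A field is a run of "backslash + any character" pairs and of characters
# that are neither a backslash nor a pipe. Empty fields are skipped.
my @fields = $str =~ /((?:\\.|[^\\|])+)/gs;

# Optionally drop the escaping backslashes: \| becomes |, \\ becomes \.
s/\\(.)/$1/gs for @fields;

print "$_\n" for @fields;
```

With this input the escaped backslash stays attached to 123, and the pipe after it still acts as a delimiter, so 456 ends up in its own field.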
I have this file
affaire,chose,question
chose,emploi,fonction,service,travail,tâche
cause,chose,matière
chose,point,question,tête
chose,objet,élément
chose,machin,truc
I would like to have an associative array like this:
affaire => chose, question
cause => chose, matière
chose => emploi, fonction, service, travail, tâche, point, question, tête, objet, élément, machin, truc
or even better, whenever I find a new word, save the word as a key and the context (left and/or right) as a value. So, for example:
affaire => chose, question
cause => chose, matière
chose => affaire, question, cause, matière, emploi, fonction, service, travail, tâche, point, question, tête, objet, élément, machin, truc
At present I'm trying to create the associative array in this way:
$in = "test.txt";
$out = "res_test.txt";
open(IN, "<", $in);
open(OUT, ">", $out);
%list = '';
while(defined($l = <IN>)){
if ($l =~ /((\w+),(.*))/){
#2,3
$list{$2} = $3;
}
}
while(my($k,$v) = each(%list)){
print OUT $k." => ".$v."\n";
}
But the result is:
affaire => chose,question
=>
chose => machin,truc
cause => chose,matière
Why doesn't it add new values?
Thank you for your help.
You overwrite the old hash values when you actually want to append to them, so
one solution is to concatenate the strings:
my %list;
while (my $l = <IN>) {
if ($l =~ /((\w+),(.*))/) {
# $list{$2} //= ""; # initialize to empty string
# # add comma in front depending on $list{$2} content
# $list{$2} .= length($list{$2}) ? ",$3" : $3;
if (defined $list{$2}) { $list{$2} .= ",$3" }
else { $list{$2} = $3 }
}
}
or to use the more common hash of arrays for storing the values:
my %list;
while (my $l = <IN>) {
chomp $l; # drop the newline so it does not stick to the last value
my ($k, @vals) = split /,/, $l;
push @{ $list{$k} }, @vals;
}
use Data::Dumper; print Dumper \%list;
Each time a new value comes in, you assign it to the hash key, so the old value is overwritten.
A simple fix:
#!/usr/bin/perl
use strict;
use warnings;
my $in = "in";
my $out = "out";
open IN, "<", $in
or die "$!";
open OUT, ">", $out
or die "$!";
my %list = ();
while (defined(my $l = <IN>)) {
if ($l =~ /(\w+),(.*)/) {
$list{$1} .= exists($list{$1}) ? ",$2" : $2;
}
}
while(my($k,$v) = each(%list)){
print OUT $k." => ".$v."\n";
}
use Data::Dumper;
$in = "test.txt";
$out = "res_test.txt";
open(IN, "<", $in);
open(OUT, ">", $out);
%list = ();
while(defined($l = <IN>)){
chomp($l);
if ($l =~ /((\w+),(.*))/){
#2,3
push @{ $list{$2} }, $3; # push autovivifies the array reference
}
}
foreach $k (sort keys %list) {
my @val = @{$list{$k}};
print join ', ', sort @val;
print ".\n";
}
It works!
In a hash (associative array) the keys must be unique. That is why, in your case, the repeated key chose causes issues.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %hash;
while (my $line = <DATA>) {
chomp $line;
my @values = split /,/, $line;
my $key = shift @values;
if (exists $hash{$key}) {
# key seen before: append its earlier values after the new ones
my $ref_value = $hash{$key};
push @values, @$ref_value;
$hash{$key} = [@values];
}
else {
$hash{$key} = [@values];
}
}
print Dumper \%hash;
__DATA__
affaire,chose,question
chose,emploi,fonction,service,travail,tâche
cause,chose,matière
chose,point,question,tête
chose,objet,élément
chose,machin,truc
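The "even better" variant from the question, where every word becomes a key and the other words on its line become its context, can be sketched along the same lines. This is only an illustration under assumptions: the lines are hard-coded here rather than read from the file, the name %context is made up, and repeated context words are kept rather than de-duplicated.

```perl
use strict;
use warnings;

# Sample lines copied from the question; in practice they would be read from the file.
my @lines = (
    'affaire,chose,question',
    'cause,chose,matière',
    'chose,machin,truc',
);

my %context;
for my $line (@lines) {
    my @words = split /,/, $line;
    for my $i (0 .. $#words) {
        # Every other word on the line is context for $words[$i].
        my @others = @words[ grep { $_ != $i } 0 .. $#words ];
        push @{ $context{ $words[$i] } }, @others;
    }
}

print "$_ => ", join(', ', @{ $context{$_} }), "\n" for sort keys %context;
```

Because push creates the array reference on first use, no existence check is needed before appending.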
I'm assigning a series of regexes to vars. Some of the regex values will be the same, but each var should stay unique and identifiable by the var name itself ($b and $c, for example).
#various regex
$a = "([\d]{1,2})"
$b = "([\d]{3})"
$c = $b #Note this has the same regex as $b
$d = "\s[-]\s"
$e = "[_]"
#select the pattern
$patternNum = 4
I then want to be able to concat the vars in different orders to create a larger regex.
Switch ($patternNum){
#create a pattern
1 { $pattern = ($a, $e, $b) }
2 { $pattern = ($a, $d, $b) }
3 { $pattern = ($a, $d, $a, $e, $b) }
4 { $pattern = ($a, $e, $b, $e, $c) }
}
This creates the expanded regex string I'm hoping for
#so i can use full regex pattern later
$selectedPattern = -join $pattern
But I want to associate each entry in $pattern with the original var name, not with the literal string the var holds (as some strings will be the same).
#find the index of each var and assign to another var so var can be used later to identify position within match
$var1 = [array]::IndexOf($pattern, $a) # [0]
$var2 = [array]::IndexOf($pattern, $b) # [2]
$var3 = [array]::IndexOf($pattern, $c) # [2] but I want it to be [4]
The regex will be used for matching, and each match will be used in different strings and in different positions.
I thought I'd be able to use a scriptblock {} and then convert back to a string, but that doesn't seem to work. Can anybody think of a way to get each var's original name, or think of a better way of doing this?
Using named captures
Use the (?<name>...) syntax to create named captures. Make the names the same as your variable names, e.g.:
$A = '(?<A>\d{3})'
$B = '(?<B>\D{3})'
$string = 'ABC123'
$regex = $B + $A
$string -match $regex
$Matches
Name Value
---- -----
A 123
B ABC
0 ABC123
Now you can correlate the variables to the position they matched in the string like this:
$string.IndexOf($Matches.A)
3
$string.IndexOf($Matches.B)
0
Following your code, I would do it like this, though if you explain your real need someone may be able to suggest another solution:
$a = "([\d]{1,2})"
$b = "([\d]{3})"
$c = $b
$d = "\s[-]\s"
$e = "[_]"
#select the pattern
$patternNum = 4
Switch ($patternNum){
#create a pattern
1 { $pattern = ('$a', '$e', '$b') }
2 { $pattern = ('$a', '$d', '$b') }
3 { $pattern = ('$a', '$d', '$a', '$e', '$b') }
4 { $pattern = ('$a', '$e', '$b', '$e', '$c') }
}
$selectedPattern = -join $pattern
$var1 = [array]::IndexOf($pattern, '$a') # [0]
$var2 = [array]::IndexOf($pattern, '$b') # [2]
$var3 = [array]::IndexOf($pattern, '$c') # [4]
#converting literal to your pattern
$regexpattern = $ExecutionContext.InvokeCommand.ExpandString( -JOIN $pattern )
$regexpattern
([\d]{1,2})[_]([\d]{3})[_]([\d]{3})
This is nuts, I mean pseudocode, but something like this:
/[January, February, March] \d*/
Should match things like January 13 or February 26, and so on...
WHAT I'M DOING:
my $url0 = 'http://www.registrar.ucla.edu/calendar/acadcal13.htm';
my $url1 = 'http://www.registrar.ucla.edu/calendar/acadcal14.htm';
my $url2 = 'http://www.registrar.ucla.edu/calendar/acadcal15.htm';
my $url3 = 'http://www.registrar.ucla.edu/calendar/acadcal16.htm';
my $url4 = 'http://www.registrar.ucla.edu/calendar/acadcal17.htm';
my $url5 = 'http://www.registrar.ucla.edu/calendar/sumcal.htm';
my $document0 = get($url0);
my $document1 = get($url1);
my $document2 = get($url2);
my $document3 = get($url3);
my $document4 = get($url4);
my $document5 = get($url5);
my @dates0 = ($document0 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
my @dates1 = ($document1 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
my @dates2 = ($document2 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
my @dates3 = ($document3 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
my @dates4 = ($document4 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
my @dates5 = ($document5 =~ /(January|February|March|April|May|June|July|August|September|October|November|December) \d+/g);
foreach(@dates0)
{
print "$_\r\n";
}
foreach(@dates1)
{
print "$_\r\n";
}
foreach(@dates2)
{
print "$_\r\n";
}
foreach(@dates3)
{
print "$_\r\n";
}
foreach(@dates4)
{
print "$_\r\n";
}
foreach(@dates5)
{
print "$_\r\n";
}
These print loops give the following result: http://pastebin.com/7z13gBqt
This is not good:
http://tinypic.com/r/nqpapx/8
Yes. You can use an alternation.
/(January|February|March|April|May|June|July|August|September|October|November|December) \d*/
Would do that.
If you already have them in an array, then you can change the variable $LIST_SEPARATOR to string them into an alternation, and then parenthesize the whole thing:
use English qw<$LIST_SEPARATOR>; # In line-noise: $"
my $date_regex
= do { local $LIST_SEPARATOR = '|';
qr/(?:@months) \d*/ # ?: if you don't want the capture
};
This gives you a compiled expression, which you can reuse like so:
my @dates;
while ( my $url = <DATA> ) {
chomp $url;
my $document = get( $url );
push @dates, [ $document =~ /($date_regex)/g ];
}
__DATA__
http://www.registrar.ucla.edu/calendar/acadcal13.htm
http://www.registrar.ucla.edu/calendar/acadcal14.htm
http://www.registrar.ucla.edu/calendar/acadcal15.htm
http://www.registrar.ucla.edu/calendar/acadcal16.htm
http://www.registrar.ucla.edu/calendar/acadcal17.htm
http://www.registrar.ucla.edu/calendar/sumcal.htm
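The alternation idea can be tried without fetching any pages. The following is a small sketch under assumptions: the @months list and the sample text are made up for illustration, and \d+ is used instead of \d* so that a day number is required.

```perl
use strict;
use warnings;

my @months = qw(January February March April May June July
                August September October November December);

# Interpolating an array into a pattern joins its elements with $",
# so a localized '|' turns the list into an alternation.
my $date_regex = do {
    local $" = '|';
    qr/(?:@months) \d+/;
};

# Hypothetical sample text standing in for a downloaded page.
my $text = 'Classes begin January 13 and end March 20; finals start March 23.';

# The pattern has no capture groups, so a list-context /g match
# returns each whole match.
my @dates = $text =~ /$date_regex/g;

print "$_\n" for @dates;
```

The non-capturing group matters here: with a capturing group around the month names only, a list-context match returns just the captured months and drops the day numbers.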
Say, I have lines like:
SOMETHING.AA.AA.DARKSIDE
BLaH.AA.AA.Blah
For each line, I want to find $before = $1 and $after = $2 around every occurrence of $middle = "AA",
such that, for example, for line 1 I get:
$before = "SOMETHING."
$after = ".AA.DARKSIDE"
and also
$before = "SOMETHING.AA"
$after = ".DARKSIDE"
My code looks like this:
$middle = "AA";
foreach (@lines){
$line = $_;
while ($line =~m/^(.+)$middle(.+)$/g){
$before = $1;
$after = $2;
}
}
Is there a simple way to change the regex in my while loop?
PS: $middle will be a variable, so I cannot hardcode it.
Thank you for your help.
Why do you want to use regexes for that?
($before, $after) = split(/$middle\.$middle/, $line);
Then use $before and $after twice each: once as they are, and once with $middle concatenated to the end of $before and to the start of $after, respectively.
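If the goal is one before/after pair per occurrence of $middle, a regex is not strictly needed. The sketch below swaps in index() and substr() rather than the split shown above. It is only an illustration: the name @pairs and the sample line are assumptions, and $middle is treated as a literal string, not a pattern.

```perl
use strict;
use warnings;

my $middle = 'AA';
my $line   = 'SOMETHING.AA.AA.DARKSIDE';

my @pairs;
my $pos = 0;
# index() returns -1 once there is no further occurrence at or after $pos.
while ((my $at = index($line, $middle, $pos)) >= 0) {
    my $before = substr($line, 0, $at);
    my $after  = substr($line, $at + length($middle));
    push @pairs, [$before, $after];
    $pos = $at + 1;   # step one character so overlapping occurrences are found too
}

print "before='$_->[0]' after='$_->[1]'\n" for @pairs;
```

For the sample line this yields two pairs, one for each AA, and it works the same way when $middle contains characters that would be special in a regex.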