I understand how to use regex in Perl in the following way:
$str =~ s/expression/replacement/g;
I understand that if any part of the expression is enclosed in parentheses, it can be used and captured in the replacement part, like this:
$str =~ s/(a)/($1)dosomething/;
But is there a way to capture the ($1) above outside of the regex expression?
I have a full word written as a string of consonants, e.g. bEdmA; its vowelized version, baEodamaA (where a and o are vowels); and its split-up form of two tokens separated by a space, bEd mA. I want to pick out the vowelized form of each token from the full word, like so: baEoda, maA. I'm trying to capture each token within the full word, so I have:
$unvowelizedword = "bEdmA";
$tokens[0] = "bEd", $tokens[1] = "mA";
$vowelizedword = "baEodamaA";
foreach $t (@tokens) {
#find the token within the full word, and capture its vowels
}
I'm trying to do something like this:
$vowelizedword =~ m/($t)/;
This is completely wrong for two reasons: the token $t is not present in exactly its own form, such as bEd, but something like m/b.E.d/ would be more relevant. Also, how do I capture it in a variable outside the regular expression?
The real question is: how can I capture the vowelized sequences baEoda and maA, given the tokens bEd and mA, from the full word baEodamaA?
Edit
I realized from all the answers that I missed out two important details.
Vowels are optional. So if the tokens are "Al" and "ywm", and the fully vowelized word is "Alyawmi", then the output tokens would be "Al" and "yawmi".
I only mentioned two vowels, but there are more, including symbols made up of two characters, like '~a'. The full list (although I don't think I need to mention it here) is:
@vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');
The following seems to do what you want:
#!/usr/bin/env perl
use warnings;
use strict;
my @tokens = ('bEd', 'mA');
my $vowelizedword = "beEodamaA";
my @regex = map { join('.?', split //) . '.?' } @tokens;
my $regex = join('|', @regex);
$regex = qr/($regex)/;
while (my ($matched) = $vowelizedword =~ $regex) {
    $vowelizedword =~ s{$regex}{};
    print "matched $matched\n";
}
Update as per your updated question (vowels are optional). It works from the end of the string so you'll have to gather the tokens into an array and print them in reverse:
#!/usr/bin/env perl
use warnings;
use strict;
my @tokens = ('bEd', 'mA', 'Al', 'ywm');
my $vowelizedword = "beEodamaA Alyawmi"; # Caveat: Without the space it won't work.
my @regex = map { join('.?', split //) . '.?$' } @tokens;
my $regex = join('|', @regex);
$regex = qr/($regex)/;
while (my ($matched) = $vowelizedword =~ $regex) {
    $vowelizedword =~ s{$regex}{};
    print "matched $matched\n";
}
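Gathering the matches and returning them in word order could look like the sketch below. The helper name split_vowelized is made up for illustration, and unshift is used so that no separate reverse step is needed. As in the code above, .? accepts any single character, not only a vowel, which is an assumption carried over from this answer.

```perl
use strict;
use warnings;

# Hypothetical helper: peel tokens off the END of the word, one per pass,
# and collect them in their original left-to-right order.
sub split_vowelized {
    my ($word, @tokens) = @_;

    # One alternative per token: an optional character after each letter,
    # anchored at the end of what is left of the word.
    my $alternatives = join '|',
        map { join('.?', map { quotemeta($_) } split //, $_) . '.?$' } @tokens;
    my $regex = qr/($alternatives)/;

    my @found;
    while ($word =~ $regex) {
        unshift @found, $1;     # matches arrive last-to-first
        $word =~ s/$regex//;    # remove the piece just captured
    }
    return @found;
}

print join(", ", split_vowelized("Alyawmi", 'Al', 'ywm')), "\n";
print join(", ", split_vowelized("beEodamaA", 'bEd', 'mA')), "\n";
```

Because each word is passed in separately, the space between words mentioned in the caveat is not needed here.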
Use the m// operator in so-called "list context", like this:
my @tokens = ($input =~ m/capturing_regex_here/modifiershere);
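As a small illustration of list context, the sketch below hard-codes a pattern for the two tokens from the question. The sub name capture_parts is hypothetical, and .? (any optional character) stands in for "an optional vowel".

```perl
use strict;
use warnings;

# Hypothetical helper: returns whatever the capture groups matched.
sub capture_parts {
    my ($input) = @_;
    # In list context, m// returns the captured substrings ($1, $2, ...).
    my @parts = ($input =~ m/(b.?E.?d.?)(m.?A.?)/);
    return @parts;
}

my @parts = capture_parts("baEodamaA");
print "$_\n" for @parts;

# $1 also stays readable after a successful match in scalar context:
if ("baEodamaA" =~ /(m.?A)/) {
    my $captured = $1;
    print "captured: $captured\n";
}
```

If the match fails, the list is empty, so the result can be tested with a plain if (@parts).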
ETA: From what I understand now, what you were trying to say is that you want to match an optional vowel after each character of the tokens.
With this, you can tweak the $vowels variable to only contain the letters you seek. Optionally, you may also just use . to capture any character.
use strict;
use warnings;
use Data::Dumper;
my @tokens = ("bEd", "mA");
my $full = "baEodamA";
my $vowels = "[aeiouy]";
my @matches;
for my $rx (@tokens) {
    $rx =~ s/.\K/$vowels?/g;
    if ($full =~ /$rx/) {
        push @matches, $full =~ /$rx/g;
    }
}
print Dumper \@matches;
Output:
$VAR1 = [
'baEoda',
'mA'
];
Note that
... $full =~ /$rx/g;
does not require capturing groups in the regex.
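A character class cannot express the two-character symbols from the question's vowel list (such as '~a'). One possible workaround, sketched here, is an alternation with the longer symbols listed first. The sub name vowelized_token is made up, and the sketch assumes at most one vowel symbol follows each letter of a token.

```perl
use strict;
use warnings;

# Vowel list taken from the question; longer symbols are sorted first
# so that '~a' is tried before the bare '~'.
my @vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');
my $vowel_alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @vowels;

# Hypothetical helper: find $token inside $full, allowing one optional
# vowel symbol after each letter, and return the vowelized text.
sub vowelized_token {
    my ($full, $token) = @_;
    my $rx = join '', map { quotemeta($_) . "(?:$vowel_alt)?" } split //, $token;
    return $full =~ /($rx)/ ? $1 : undef;
}

print vowelized_token("Alyawmi", $_), "\n" for 'Al', 'ywm';
```

The sub returns undef when the token does not occur, so callers should check the result before using it.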
I suspect that there is an easier way to do whatever you're trying to accomplish. The trick is not to make the regex generation code so tricky that you forget what it's actually doing.
I can only begin to guess at your task, but from your single example, it looks like you want to check that the two subtokens are in the larger token, ignoring certain characters. I'm going to guess that those sub tokens have to be in order and can't have anything else between them besides those vowel characters.
To match the tokens, I can use the \G anchor with the /g global flag in scalar context. This anchors the match to the character one after the end of the last match for the same scalar. This way allows me to have separate patterns for each sub token. This is much easier to manage since I only need to change the list of values in @subtokens.
Once I go through each of the pairs and find which ones match all the patterns, I can extract the original string from the pair.
use v5.14;
my $vowels = '[ao]*';
my @subtokens = qw(bEd mA);
# prepare the subtoken regular expressions
my @patterns = map {
    my $s = join "$vowels", map quotemeta, (split( // ), '');
    qr/$s/;
} @subtokens;
my @tokens = qw( baEodamA mAabaEod baEoda mAbaEoda );
my @grand_matches;
TOKEN: foreach my $token ( @tokens ) {
    say "-------\nMatching $token..........";
    my @matches;
    PATTERN: foreach my $pattern ( @patterns ) {
        say "Position is ", pos($token) // 0;
        # scalar context /g and \G
        next TOKEN unless $token =~ /\G($pattern)/g;
        push @matches, $1;
        say "Matched with $pattern";
    }
    push @grand_matches, [ $token, \@matches ];
}
# Now report the original
foreach my $tuple ( @grand_matches ) {
    say "$tuple->[0] has both fragments: @{$tuple->[1]}";
}
Now, here's the nice thing about this structure. I've probably guessed wrong about your task. If I have, it's easy to fix without changing the setup. Let's say that the subtokens don't have to be in order. That's an easy change to the pattern I created. I just get rid of the \G anchor and the /g flag:
next TOKEN unless $token =~ /($pattern)/;
Or, suppose that the tokens have to be in order, but other things may be between them. I can insert a .*? to match that stuff, effectively skipping over it:
next TOKEN unless $token =~ /\G.*?($pattern)/g;
It would be much nicer if I could manage all of this from the map where I create the patterns, but the /g flag isn't a pattern flag. It has to go with the operator.
I find it much easier to manage changing requirements when I don't wrap everything in a single regular expression.
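Condensed to the data from the question, the same \G idea might look like the sketch below. The sub name match_in_order is hypothetical, and the vowel class [aiuo] is an assumption that covers only the single-character vowels.

```perl
use strict;
use warnings;

# Hypothetical helper: match each token in order, with nothing but vowels
# allowed after each letter, and return the vowelized pieces.
sub match_in_order {
    my ($word, @tokens) = @_;
    my $vowels = '[aiuo]*';    # assumption: single-character vowels only
    my @found;
    pos($word) = 0;            # start matching at the beginning
    for my $token (@tokens) {
        my $pattern = join '', map { quotemeta($_) . $vowels } split //, $token;
        # /g in scalar context resumes at pos($word); \G pins the match there.
        return unless $word =~ /\G($pattern)/g;
        push @found, $1;
    }
    return @found;
}

print join(", ", match_in_order("Alyawmi", 'Al', 'ywm')), "\n";
print join(", ", match_in_order("baEodamaA", 'bEd', 'mA')), "\n";
```

An empty list means that some token did not match at the position where the previous one ended.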
Assuming the tokens need to appear in order and without anything (other than a vowel) between them:
my @tokens = ( "bEd", "mA" );
my $vowelizedword = "baEodamaA";
my $vowels = '[ao]';
my (@vowelized_sequences) = $vowelizedword =~ ( '^' . join( '', map "(" . join( $vowels, split( //, $_ ) ) . "(?:$vowels)?)", @tokens ) . '\\z' );
print for @vowelized_sequences;
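The one-line pattern above is dense, so it can help to build it in a named sub and print it. This is only a restructured sketch of the same construction; build_pattern is a made-up name, and quotemeta is an addition so that tokens containing regex metacharacters stay literal.

```perl
use strict;
use warnings;

# Hypothetical helper: one capture group per token, with $vowels between
# the letters and an optional $vowels after the last letter.
sub build_pattern {
    my ($vowels, @tokens) = @_;
    my @groups = map {
        my $letters = join $vowels, map { quotemeta($_) } split //, $_;
        "($letters(?:$vowels)?)";
    } @tokens;
    return '^' . join('', @groups) . '\z';
}

my $pattern = build_pattern('[ao]', 'bEd', 'mA');
print "$pattern\n";    # ^(b[ao]E[ao]d(?:[ao])?)(m[ao]A(?:[ao])?)\z

my @vowelized_sequences = "baEodamaA" =~ /$pattern/;
print "$_\n" for @vowelized_sequences;
```

With '[ao]' a vowel is required between the letters of a token. Passing '[ao]?' (or a wider class such as '[aiuo]?') makes the vowels optional, which is what the edited question asks for.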
The following lines of comma-separated values contain several consecutive empty fields:
$rawData =
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n" .
"2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
I want to replace these empty fields with 'N/A' values, which is why I decided to do it via a regex substitution.
I tried this first of all:
$rawData =~ s/,([,\n])/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which returned
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,,N/A,\n
Not what I wanted. The problem occurs when more than two consecutive commas occur. The regex gobbles up two commas at a time, so it starts at the third comma rather than the second when it rescans the string.
I thought this could be something to do with lookahead vs. lookbehind assertions, so I tried the following regex out:
$rawData =~ s/(?<=,)([,\n])|,([,\n])$/,N\/A$1/g; # RELABEL UNAVAILABLE DATA AS 'N/A'
which resulted in:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,N/A,,N/A,,N/A,,N/A\n
That didn't work either. It just shifted the comma-pairings by one.
I know that washing this string through the same regex twice will do it, but that seems crude. Surely, there must be a way to get a single regex substitution to do the job. Any suggestions?
The final string should look like this:
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n
EDIT: Note that you could open a filehandle to the data string and let readline deal with line endings:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
my $str = <<EO_DATA;
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,
EO_DATA
open my $str_h, '<', \$str;
while(my $row = <$str_h>) {
chomp $row;
print join(',',
map { length $_ ? $_ : 'N/A'} split /,/, $row, -1
), "\n";
}
Output:
E:\Home> t.pl
2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A
You can also use:
pos $str -= 1 while $str =~ s{,(,|\n)}{,N/A$1}g;
Explanation: When s/// finds a ,, and replaces it with ,N/A, it has already moved to the character after the last comma. So, it will miss some consecutive commas if you only use
$str =~ s{,(,|\n)}{,N/A$1}g;
Therefore, I used a loop to move pos $str back by a character after each successful substitution.
Now, as @ysth shows:
$str =~ s!,(?=[,\n])!,N/A!g;
would make the while unnecessary.
I couldn't quite make out what you were trying to do in your lookbehind example, but I suspect you are suffering from a precedence error there, and that everything after the lookbehind should be enclosed in a (?: ... ) so that the | cannot bypass the lookbehind.
Starting from scratch, what you are trying to do sounds pretty simple: place N/A after a comma if it is followed by another comma or a newline:
s!,(?=[,\n])!,N/A!g;
Example:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
print Dumper($rawData);
$rawData =~ s!,(?=[,\n])!,N/A!g;
print Dumper($rawData);
Output:
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n"
"2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,N/A,Clear\n2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,N/A,N/A,N/A,N/A\n"
You could search for
(?<=,)(?=,|$)
and replace that with N/A.
This regex matches the (empty) space between two commas or between a comma and end of line.
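In Perl this zero-width approach could be written as in the sketch below. The sub name fill_empty_fields is made up, and the /m modifier is an addition: without it, $ only matches at the very end of the string (or before a final newline), so an empty last field on an earlier line would be missed.

```perl
use strict;
use warnings;

# Hypothetical helper: insert N/A at every empty position that follows a
# comma and precedes either another comma or the end of a line.
sub fill_empty_fields {
    my ($data) = @_;
    # /m lets $ match before every newline, not only at the end of the string.
    $data =~ s/(?<=,)(?=,|$)/N\/A/mg;
    return $data;
}

my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n"
            . "2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
print fill_empty_fields($rawData);
```

Since the match is empty, nothing is consumed and every gap is found in a single pass.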
The quick and dirty hack version:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear
2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
while ($rawData =~ s/,,/,N\/A,/g) {};
print $rawData;
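Not the fastest code, but the shortest. It should need at most two substituting passes. Note that it only fills fields between two commas, so an empty field at the very end of a line (a comma directly before the newline) is left as it is.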
Not a regex, but not too complicated either:
$string = join ",", map{$_ eq "" ? "N/A" : $_} split (/,/, $string,-1);
The ,-1 is needed at the end to force split to include any empty fields at the end of the string.
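A short sketch of the effect of the limit, and of one way to apply the same idea to a string that holds several lines. The sub name fill_line is hypothetical, and splitting on "\n" assumes that every line ends with a newline.

```perl
use strict;
use warnings;

# Hypothetical helper: fill the empty fields of ONE line (no newline).
sub fill_line {
    my ($line) = @_;
    return join ',', map { $_ eq '' ? 'N/A' : $_ } split /,/, $line, -1;
}

# Without the -1 limit, split drops the trailing empty fields:
my @without = split /,/, "16,6,,,,";
my @with    = split /,/, "16,6,,,,", -1;
print scalar(@without), " fields without -1, ", scalar(@with), " with -1\n";

# Applying it line by line to a multi-line string:
my $rawData = "2008-02-06,8:00 AM,14.0,6.0,59,1027,-9999.0,West,6.9,-,N/A,,Clear\n"
            . "2008-02-06,9:00 AM,16,6,40,1028,12,WNW,10.4,,,,\n";
my $fixed = join '', map { fill_line($_) . "\n" } split /\n/, $rawData;
print $fixed;
```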