Identify words in a sentence which are figures through regex [duplicate] - regex

I have string:
$string = 'Five People';
I want to replace all number-words into numbers. So results are:
$string = '5 People';
I have this function to convert single words to int:
function words_to_number($data) {
$data = strtr(
$data,
array(
'zero' => '0',
'a' => '1',
'one' => '1',
'two' => '2',
'three' => '3',
'four' => '4',
'five' => '5',
'six' => '6',
'seven' => '7',
'eight' => '8',
'nine' => '9',
'ten' => '10',
'eleven' => '11',
'twelve' => '12',
'thirteen' => '13',
'fourteen' => '14',
'fifteen' => '15',
'sixteen' => '16',
'seventeen' => '17',
'eighteen' => '18',
'nineteen' => '19',
'twenty' => '20',
'thirty' => '30',
'forty' => '40',
'fourty' => '40', // common misspelling
'fifty' => '50',
'sixty' => '60',
'seventy' => '70',
'eighty' => '80',
'ninety' => '90',
'hundred' => '100',
'thousand' => '1000',
'million' => '1000000',
'billion' => '1000000000',
'and' => '',
)
);
// Coerce all tokens to numbers
$parts = array_map(
function ($val) {
return floatval($val);
},
preg_split('/[\s-]+/', $data)
);
$stack = new SplStack; // Current work stack
$sum = 0; // Running total
$last = null;
foreach ($parts as $part) {
if (!$stack->isEmpty()) {
// We're part way through a phrase
if ($stack->top() > $part) {
// Decreasing step, e.g. from hundreds to ones
if ($last >= 1000) {
// If we drop from more than 1000 then we've finished the phrase
$sum += $stack->pop();
// This is the first element of a new phrase
$stack->push($part);
} else {
// Drop down from less than 1000, just addition
// e.g. "seventy one" -> "70 1" -> "70 + 1"
$stack->push($stack->pop() + $part);
}
} else {
// Increasing step, e.g ones to hundreds
$stack->push($stack->pop() * $part);
}
} else {
// This is the first element of a new phrase
$stack->push($part);
}
// Store the last processed part
$last = $part;
}
return $sum + $stack->pop();
}
// test
$words = 'five';
echo words_to_number($words);
Works great (try it ideone). I need to find a way to determine which words within a string is a word-number and then do a replace of all these matching words and convert them into numbers.
How can this be done? Maybe a regex approach?

I have tried to port a text2num Python library to PHP, mix it with a regex for matching English spelled out numbers, enhanced it to the decillion, and here is a result:
function text2num($s) {
// Enhanced the regex at http://www.rexegg.com/regex-trick-numbers-in-english.html#english-number-regex
$reg = <<<REGEX
(?x) # free-spacing mode
(?(DEFINE)
# Within this DEFINE block, we'll define many subroutines
# They build on each other like lego until we can define
# a "big number"
(?<one_to_9>
# The basic regex:
# one|two|three|four|five|six|seven|eight|nine
# We'll use an optimized version:
# Option 1: four|eight|(?:fiv|(?:ni|o)n)e|t(?:wo|hree)|
# s(?:ix|even)
# Option 2:
(?:f(?:ive|our)|s(?:even|ix)|t(?:hree|wo)|(?:ni|o)ne|eight)
) # end one_to_9 definition
(?<ten_to_19>
# The basic regex:
# ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|
# eighteen|nineteen
# We'll use an optimized version:
# Option 1: twelve|(?:(?:elev|t)e|(?:fif|eigh|nine|(?:thi|fou)r|
# s(?:ix|even))tee)n
# Option 2:
(?:(?:(?:s(?:even|ix)|f(?:our|if)|nine)te|e(?:ighte|lev))en|
t(?:(?:hirte)?en|welve))
) # end ten_to_19 definition
(?<two_digit_prefix>
# The basic regex:
# twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety
# We'll use an optimized version:
# Option 1: (?:fif|six|eigh|nine|(?:tw|sev)en|(?:thi|fo)r)ty
# Option 2:
(?:s(?:even|ix)|t(?:hir|wen)|f(?:if|or)|eigh|nine)ty
) # end two_digit_prefix definition
(?<one_to_99>
(?&two_digit_prefix)(?:[- ](?&one_to_9))?|(?&ten_to_19)|
(?&one_to_9)
) # end one_to_99 definition
(?<one_to_999>
(?&one_to_9)[ ]hundred(?:[ ](?:and[ ])?(?&one_to_99))?|
(?&one_to_99)
) # end one_to_999 definition
(?<one_to_999_999>
(?&one_to_999)[ ]thousand(?:[ ](?&one_to_999))?|
(?&one_to_999)
) # end one_to_999_999 definition
(?<one_to_999_999_999>
(?&one_to_999)[ ]million(?:[ ](?&one_to_999_999))?|
(?&one_to_999_999)
) # end one_to_999_999_999 definition
(?<one_to_999_999_999_999>
(?&one_to_999)[ ]billion(?:[ ](?&one_to_999_999_999))?|
(?&one_to_999_999_999)
) # end one_to_999_999_999_999 definition
(?<one_to_999_999_999_999_999>
(?&one_to_999)[ ]trillion(?:[ ](?&one_to_999_999_999_999))?|
(?&one_to_999_999_999_999)
) # end one_to_999_999_999_999_999 definition
# ==== MORE ====
(?<one_to_quadrillion>
(?&one_to_999)[ ]quadrillion(?:[ ](?&one_to_999_999_999_999_999))?|
(?&one_to_999_999_999_999_999)
) # end one_to_quadrillion definition
(?<one_to_quintillion>
(?&one_to_999)[ ]quintillion(?:[ ](?&one_to_quadrillion))?|
(?&one_to_quadrillion)
) # end one_to_quintillion definition
(?<one_to_sextillion>
(?&one_to_999)[ ]sextillion(?:[ ](?&one_to_quintillion))?|
(?&one_to_quintillion)
) # end one_to_sextillion definition
(?<one_to_septillion>
(?&one_to_999)[ ]septillion(?:[ ](?&one_to_sextillion))?|
(?&one_to_sextillion)
) # end one_to_septillion definition
(?<one_to_octillion>
(?&one_to_999)[ ]octillion(?:[ ](?&one_to_septillion))?|
(?&one_to_septillion)
) # end one_to_octillion definition
(?<one_to_nonillion>
(?&one_to_999)[ ]nonillion(?:[ ](?&one_to_octillion))?|
(?&one_to_octillion)
) # end one_to_nonillion definition
(?<one_to_decillion>
(?&one_to_999)[ ]decillion(?:[ ](?&one_to_nonillion))?|
(?&one_to_nonillion)
) # end one_to_decillion definition
(?<bignumber>
zero|(?&one_to_decillion)
) # end bignumber definition
(?<zero_to_9>
(?&one_to_9)|zero
) # end zero to 9 definition
# (?<decimals>
# point(?:[ ](?&zero_to_9))+
# ) # end decimals definition
) # End DEFINE
####### The Regex Matching Starts Here ########
\b(?:(?&ten_to_19)\s+hundred|(?&bignumber))\b
REGEX;
return preg_replace_callback('~' . trim($reg) . '~i', function ($x) {
return text2num_internal($x[0]);
}, $s);
}
function text2num_internal($s) {
// Port of https://github.com/ghewgill/text2num/blob/master/text2num.py
$Small = [
'zero'=> 0,
'one'=> 1,
'two'=> 2,
'three'=> 3,
'four'=> 4,
'five'=> 5,
'six'=> 6,
'seven'=> 7,
'eight'=> 8,
'nine'=> 9,
'ten'=> 10,
'eleven'=> 11,
'twelve'=> 12,
'thirteen'=> 13,
'fourteen'=> 14,
'fifteen'=> 15,
'sixteen'=> 16,
'seventeen'=> 17,
'eighteen'=> 18,
'nineteen'=> 19,
'twenty'=> 20,
'thirty'=> 30,
'forty'=> 40,
'fifty'=> 50,
'sixty'=> 60,
'seventy'=> 70,
'eighty'=> 80,
'ninety'=> 90
];
$Magnitude = [
'thousand'=> 1000,
'million'=> 1000000,
'billion'=> 1000000000,
'trillion'=> 1000000000000,
'quadrillion'=> 1000000000000000,
'quintillion'=> 1000000000000000000,
'sextillion'=> 1000000000000000000000,
'septillion'=> 1000000000000000000000000,
'octillion'=> 1000000000000000000000000000,
'nonillion'=> 1000000000000000000000000000000,
'decillion'=> 1000000000000000000000000000000000,
];
$a = preg_split("~[\s-]+(?:and[\s-]+)?~u", $s);
$a = array_map('strtolower', $a);
$n = 0;
$g = 0;
foreach ($a as $w) {
if (isset($Small[$w])) {
$g = $g + $Small[$w];
}
else if ($w == "hundred" && $g != 0) {
$g = $g * 100;
}
else {
$x = $Magnitude[$w];
if (strlen($x) > 0) {
$n =$n + $g * $x;
$g = 0;
}
else{
throw new Exception("Unknown number: " . $w);
}
}
}
return $n + $g;
}
echo text2num("one") . "\n"; // 1
echo text2num("twelve") . "\n"; // 12
echo text2num("seventy two") . "\n"; // 72
echo text2num("three hundred") . "\n"; // 300
echo text2num("twelve hundred") . "\n"; // 1200
echo text2num("twelve thousand three hundred four") . "\n"; // 12304
echo text2num("six million") . "\n"; // 6000000
echo text2num("six million four hundred thousand five") . "\n"; // 6400005
echo text2num("one hundred twenty three billion four hundred fifty six million seven hundred eighty nine thousand twelve") . "\n"; # // 123456789012
echo text2num("four decillion") . "\n"; // 4000000000000000000000000000000000
echo text2num("five hundred and thirty-seven") . "\n"; // 537
echo text2num("five hundred and thirty seven") . "\n"; // 537
See the PHP demo.
The regex can actually match either just big numbers or numbers like "eleven hundred", see \b(?:(?&ten_to_19)\s+hundred|(?&bignumber))\b. It can be further enhanced. E.g. word boundaries may be replaced with other boundary types (like (?<!\S) and (?!\S) to match in between whitespaces, etc.).
Decimal part in the regex is commented out since even if we match it, the num2text won't handle them.

You can use this regex:
\b(zero|a|one|tw(elve|enty|o)|th(irt(een|y)|ree)|fi(ft(een|y)|ve)|(four|six|seven|nine)(teen|ty)?|eight(een|y)?|ten|eleven|forty|hundred|thousand|(m|b)illion|and)+\b
By the way, there might be a better regex out there. Until someone posts it, you can use the following implementation
$regex = '/\b(zero|a|one|tw(elve|enty|o)|th(irt(een|y)|ree)|fi(ft(een|y)|ve)|(four|six|seven|nine)(teen|ty)?|eight(een|y)?|ten|eleven|forty|hundred|thousand|(m|b)illion|and)+\b/i';
function word_numbers_to_numbers($string) {
return preg_replace_callback($regex, function($m) {
return words_to_number($m[0]);
},$string);
}

Related

Why do #+ and #{^CAPTURE} differ in length?

I'm trying to understand how the regex variables work, so I can save submatch positions in the payload within embedded code expressions. According to perlvar, the positive indices of the array correspond to $1, $2, $3, etc., but that doesn't seem to be the case?
#!/usr/bin/perl -w
use v5.28;
use Data::Dumper;
"XY" =~ / ( (.*) (.) (?{
say Dumper { match_end => \#+ };
say Dumper { capture => \#{^CAPTURE} }
}) ) (.)/x;
Output:
$VAR1 = {
'match_end' => [
2,
undef,
1,
2,
undef
]
};
$VAR1 = {
'capture' => [
undef,
'X',
'Y'
]
};
$VAR1 = {
'match_end' => [
1,
2,
0,
1,
undef
]
};
$VAR1 = {
'capture' => [
'XY',
'',
'X'
]
};
The #+ array apparently gets allocated, or otherwise prepared, already at compilation
perl -MData::Dump=dd -we'$_=q(abc); / (?{dd #+}) ( (.) )/x'
prints
(0, undef, undef)
(0 for the whole match and an undef for each indicated capture group), while
perl -MData::Dump=dd -we'$_=q(abc); / (?{dd #+}) ( (.) (.) )/x'
prints
(0, undef, undef, undef)
with one more element for one more capture group.
One the other hand, the #{^CAPTURE} is just plain empty until there are actual patterns to capture, as we can see from mob's detailed analysis. This, I'd say, plays well with its name.
After the fact the arrays agree, with that shift of one in indices since #+ also contains (offset for) the whole match, at $+[0].
Another difference is that a trailing failed optional match doesn't get a slot in #{^CAPTURE}
perl -MData::Dump=dd -we'$_=q(abc); /((x)? (.) (x)?)/x; dd #+; dd #{^CAPTURE}'
prints
(1, 1, undef, 1, undef)
("a", undef, "a")
The perlvar docs are unclear about what #{^CAPTURE} look like in the middle of a regexp evaluation, but there is a clear progression that depends where in the regexp you are looking at it.
use 5.026;
use Data::Dumper; $Data::Dumper::Sortkeys = 1; $Data::Dumper::Indent = 0;
sub DEBUG_CAPTURE { say Dumper { a => $_[0], capture => \#{^CAPTURE} }; }
"XY" =~ /
(?{DEBUG_CAPTURE(0)})
(
(?{DEBUG_CAPTURE(1)})
(
(?{DEBUG_CAPTURE(2)})
(.*) (?{DEBUG_CAPTURE(3)})
(.) (?{DEBUG_CAPTURE(4)})
)
(?{DEBUG_CAPTURE(5)}) (.)
(?{DEBUG_CAPTURE(6)})
)
(?{DEBUG_CAPTURE(7)}) /x;
DEBUG_CAPTURE(8);
Output
$VAR1 = {'a' => 0,'capture' => []};
$VAR1 = {'a' => 1,'capture' => []};
$VAR1 = {'a' => 2,'capture' => []};
$VAR1 = {'a' => 3,'capture' => [undef,undef,'XY']};
$VAR1 = {'a' => 3,'capture' => [undef,undef,'X']};
$VAR1 = {'a' => 4,'capture' => [undef,undef,'X','Y']};
$VAR1 = {'a' => 5,'capture' => [undef,'XY','X','Y']};
$VAR1 = {'a' => 3,'capture' => [undef,'XY','','Y']};
$VAR1 = {'a' => 4,'capture' => [undef,'XY','','X']};
$VAR1 = {'a' => 5,'capture' => [undef,'X','','X']};
$VAR1 = {'a' => 6,'capture' => [undef,'X','','X','Y']};
$VAR1 = {'a' => 7,'capture' => ['XY','X','','X','Y']};
$VAR1 = {'a' => 8,'capture' => ['XY','X','','X','Y']};
The docs are correct if you are observing #{^CAPTURE} after a regexp has been completely evaluated. While evaluation is in process, #{^CAPTURE} seems to grow as the number of capture groups encountered increases. But it's not clear how useful it is to look at #{^CAPTURE} at least until you get to the end of the expression.

Perl regex vs. Raku regex, differences in the engine?

I am trying to convert a regex based solution for the knapsack problem from Perl to raku. Details on Perlmonks
The Perl solution creates this regex:
(?<P>(?:vvvvvvvvvv)?)
(?<B>(?:vv)?)
(?<Y>(?:vvvv)?)
(?<G>(?:vv)?)
(?<R>(?:v)?)
0
(?=
(?(?{ $1 })wwww|)
(?(?{ $2 })w|)
(?(?{ $3 })wwwwwwwwwwww|)
(?(?{ $4 })ww|)
(?(?{ $5 })w|)
)
which gets matched against vvvvvvvvvvvvvvvvvvv0wwwwwwwwwwwwwww. After that the match hash %+ contains the items to put in the sack.
My raku conversion is:
$<B> = [ [ vv ]? ]
$<P> = [ [ vvvvvvvvvv ]? ]
$<R> = [ [ v ]? ]
$<Y> = [ [ vvvv ]? ]
$<G> = [ [ vv ]? ]
0
<?before
[ { say "B"; say $/<B>; say $0; say $1; $1 } w || { "" } ]
[ { say "P"; say $/<P>; say $0; say $1; $2 } wwww || { "" } ]
[ { say "R"; say $/<R>; say $0; say $1; $3 } w || { "" } ]
[ { say "Y"; say $/<Y>; say $0; say $1; $4 } wwwwwwwwwwww || { "" } ]
[ { say "G"; say $/<G>; say $0; say $1; $5 } ww || { "" } ]
which also matches vvvvvvvvvvvvvvvvvvv0wwwwwwwwwwwwwww. But the match object, $/ does not contain anything useful. Also, my debug says all say Nil, so at that point the backreference does not seem to work?
Here's my test script:
my $max-weight = 15;
my %items =
'R' => { w => 1, v => 1 },
'B' => { w => 1, v => 2 },
'G' => { w => 2, v => 2 },
'Y' => { w => 12, v => 4 },
'P' => { w => 4, v => 10 }
;
my $str = 'v' x %items.map(*.value<v>).sum ~
'0' ~
'w' x $max-weight;
say $str;
my $i = 0;
my $left = my $right = '';
for %items.keys -> $item-name
{
my $v = 'v' x %items{ $item-name }<v>;
my $w = 'w' x %items{ $item-name }<w>;
$left ~= sprintf( '$<%s> = [ [ %s ]? ] ' ~"\n", $item-name, $v );
$right ~= sprintf( '[ { say "%s"; say $/<%s>; say $0; say $1; $%d } %s || { "" } ]' ~ "\n", $item-name, $item-name, ++$i, $w );
}
use MONKEY-SEE-NO-EVAL;
my $re = sprintf( '%s0' ~ "\n" ~ '<?before ' ~ "\n" ~ '%s>' ~ "\n", $left, $right );
say $re;
dd $/ if $str ~~ m:g/<$re>/;
This answer only covers what's going wrong. It does not address a solution. I have not filed corresponding bugs. I have not yet even searched bug queues to see if I can find reports corresponding to either or both the two issues I've surfaced.
my $lex-var;
sub debug { .say for ++$, :$<rex-var>, :$lex-var }
my $regex = / $<rex-var> = (.) { $lex-var = $<rex-var> } <?before . { debug }> / ;
'xx' ~~ $regex; say $/;
'xx' ~~ / $regex /; say $/;
displays:
1
rex-var => Nil
lex-var => 「x」
「x」
rex-var => 「x」
2
rex-var => Nil
lex-var => 「x」
「x」
Focusing first on the first call of debug (the lines starting with 1 and ending at rex-var => 「x」), we can see that:
Something's gone awry during the call to debug: $<rex-var> is reported as having the value Nil.
When the regex match is complete and we return to the mainline, the say $/ reports a full and correctly populated result that includes the rex-var named match.
To begin to get a sense of what's gone wrong, please consider reading the bulk of my answer to another SO question. You can safely skip the Using ~. Footnotes 1,2, and 6 are also probably completely irrelevant to your scenario.
For the second match, we see that not only is $<rex-var> reported as being Nil during the debug call, the final match variable, as reported back in the mainline with the second say $/, is also missing the rex-var match. And the only difference is that the regex $regex is called from within an outer regex.

Perl - Using regex to match input in hash key or value

First, this is a homework assignment. I am having a tough time with regex, and I'm stuck.
This is the code I have so far, where I have the user designate a filename, and if it exists, populates a hash of the names as keys, and the phone numbers as the values.
#!/usr/bin/perl
use strict;
print "\nEnter Filename: ";
my $file = <STDIN>;
chomp $file;
if(!open(my $fileName, "<", "$file"))
{
print "Sorry, that file doesn't exist!", "\n";
}
else
{
my %phoneNums;
while (my $line=<$fileName>)
{
chomp($line);
(my $name,my $number) = split /:/, $line;
$phoneNums{$name} = $number;
}
print "Read in the file!", "\n\n";
print "Enter search: ";
my $input = <STDIN>;
chomp $input;
#HERE IS WHERE I'M LOST
}
print "\n";
This is the part I am stuck on:
Allow the user to enter a search string.
Look for matches using the same style as the phone. Any individual
character in the search string can match any other character from the
key, meaning a ‘2’ in the search string can match a ‘2’, ‘A’, ‘B’, or ‘C’ in the contact list. Matches can occur in the contact name or the phone number. For a match to occur, each character in the search string must appear, in order, in the contact info, but not necessarily next to each
other. For example, a search string of “86” (essentially the same as a search string of “TM” or “NU”) would match “TOM” but not “MOTHER”.
Characters on each phone keys:
0,
1,
2ABC,
3DEF,
4GHI,
5JKL,
6MNO,
7PQRS,
8TUV,
9WXYZ
I just am stuck on how exactly to make all those character classes, and any help at all is much appreciated.
The way to tackle this is by writing a function that reduces your 'things' to their common components. The best way to do this IMO is use a hash:
my %num_to_letter = (
0 => [],
1 => [],
2 => [ "A", "B", "C" ],
3 => [ "D", "E", "F" ],
4 => [ "G", "H", "I" ],
5 => [ "J", "K", "L" ],
## etc.
);
my %letter_to_num;
foreach my $key ( keys %num_to_letter ) {
foreach my $element ( #{$num_to_letter{$key}} ) {
$letter_to_num{lc($element)} = lc($key);
}
}
print Dumper \%letter_to_num;
This creates a map of which letters or numbers map to their original - a bit like this:
$VAR1 = {
'b' => '2',
'g' => '4',
'e' => '3',
'i' => '4',
'a' => '2',
'j' => '5',
...
Note - you can do this by hand, but I prefer to generate from the top map, because I think it looks neater. Note - we use lc to lower case everything, so this becomes case insensitive. It's probably worth looking at fc - which is a similar tool but handles international characters. (Not relevant in this example though)
You then 'reduce' both search and 'target' to their common values:
sub normalise {
my ( $input ) = #_;
#join with no delimiter.
return join ( '',
#look up $_ (each letter) in $letter_to_num
#if not present, use // operator to return original value.
#this means we get to turn numbers into letters,
#but leave things that are already numbers untouched.
map { $letter_to_num{lc($_)} // $_ }
#split the input line into characters.
split ( //, $input )
);
}
print normalise ("DAD"),"\n"; ## 323
And then compare one against the other:
my $search = "DAD";
my $normalised_search = normalise($search);
print "Searching for: \"$normalised_search\"\n";
my $number_to_match = '00533932388';
my $string_to_match = "daddyo";
print "Matches number\n"
if normalise($number_to_match) =~ m/$normalised_search/;
print "Matches string\n"
if normalise($string_to_match) =~ m/$normalised_search/;
Here's an almost procedural approach that cheats a bit by using Hash::MultiValue:
use Hash::MultiValue; # makes reversing and flattening easier
# build a hash from the phone_keypad array or do it manually!
my #phone_keypad = qw(0 1 2ABC 3DEF 4GHI 5JKL 6MNO 7PQRS 8TUV 9WXYZ);
my %num2let = map { /(\d{1})(\w{3,4})/;
if ($2) { $1 => [ split('',$2) ] } else { 0 => [] , 1 => [] }
} #phone_keypad ;
# Invert the hash using Hash::MultiValue
my $num2let_mv = Hash::MultiValue->from_mixed(\%num2let);
my %let2num = reverse $num2let_mv->flatten ;
# TOM in numbers - 866 in letters
my $letters = "TOM" ;
print join '', $let2num{$_} // $_ for (split('', $letters)), "\n";
my $phone_input = "866" ;
print join '', #{$num2let{$_}}," " for (split('', $phone_input)) , "\n";
Output:
866
TUV MNO MNO
So here "TOM" would overlap with "UNO" ... I like #Sobrique's answer :-)
To search an array/list of contact names using the phone keypad input we can create a hash containing keys and values of the names and their number equivalents and then match the "converted" name value against the input:
use Hash::MultiValue; # makes reversing and flattening easier
my #contacts = <DATA> ;
chomp #contacts;
# build a hash from the phone_keypad array or do it manually!
my #phone_keypad = qw(0 1 2ABC 3DEF 4GHI 5JKL 6MNO 7PQRS 8TUV 9WXYZ);
my %num2let = map { /(\d{1})(\w{3,4})/;
if ($2) { $1 => [ split('',$2) ] } else { 0 => [] , 1 => [] }
} #phone_keypad ;
# Invert the hash using Hasj::MultiValue
my $num2let_mv = Hash::MultiValue->from_mixed(\%num2let);
my %let2num = reverse $num2let_mv->flatten ;
# create key/value pairs for contact database
my %contacts2nums ;
for $contact (#contacts) {
$contacts2nums{$contact} = join "",
map { $let2num{$_} } split('', uc $contact);
}
my $phone_input = "866";
for my $contact (keys %contacts2nums) {
print "The text: \"$contact\" matches the input: \"$phone_input\" \n"
if $phone_input eq $contacts2nums{$contact};
}
__DATA__
Tom
Mother
TIMTOWDI
DAD
Gordon
Output:
The text: "Tom" matches the input: "866"
A more organized approach would wrap the conversion operation in a function.
Addendum:
With a real keypad you could probably come up with a simple algorithm that could be more deterministic regarding the letter you want to associate with the number on the keypad. You could iterate through the array based on number of presses of the key: e.g. two presses on "2" would be equal to "B", etc. You'd just have to figure out how/when to move to the next character with some kind of timeout/wait value. This way you would have a more exact string on which to base your search.

Counting repeated characters around nth character of string

For a part of my University project, i am trying to count base repeats around the 11th character of 21 bp sequences of DNA. I want to look at the 11th character, then if there are repeated identical characters around it, to print them.
For example:
GCTAAAGTAAAAGAAGATGCA
Would give results of:
11th base is A, YES repeated 4 times
I really don't know how to go about this, to get the 11th character i'm sure i can use a regex but after that i'm not sure.
To start with I have playing around using a hash and looking for the number of occurrences of different nucleotide groups in each sequence, as follows:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/hash.txt";
open FILE1, "/Users/edwardtickle/Documents/randomoutput.txt";
open( OUTPUTFILE, ">$outputfile" );
while (<FILE1>) {
if (/^(\S+)/) {
my $dna = $1;
my #repeats = ( $dna =~ /[A]{3}/g );
my #changes = ( $dna =~ /[T]{2}/g );
my %hash = ();
my %hash1 = ();
for my $repeats (#repeats) {
$hash{$repeats}++;
}
for my $changes (#changes) {
$hash1{$changes}++;
}
for my $key ( keys %hash ) {
print OUTPUTFILE $key . " => " . $hash{$key} . "\n";
}
for my $key1 ( keys %hash1 ) {
print OUTPUTFILE $key1 . " => " . $hash1{$key1} . "\n";
}
}
}
FILE1 data:
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
Gives results of:
TT => 2
AAA => 1
TT => 4
AAA => 1
TT => 2
TT => 4
AAA => 1
TT => 2
TT => 5
AAA => 1
TT => 1
AAA => 2
TT => 1
TT => 2
When for this sample data set i would like a cumulative tally of every sequence, rather than number of individual occurrences in each matching string, like this:
AAA => 6
TT => 23
How do i go about changing the output? And how do i prevent a string of TTTTT bases showing up as TT => 2? Then if anyone has any recommendations of how to go about the original problem/if it is even possible, that would be greatly appreciated.
Thanks in advance!
Using a regular expression:
use strict;
use warnings;
my $char = 11; # Looking for the 11th character, or position 10.
while (<DATA>) {
chomp;
if (m{
( (.) \2*+ ) # Look for a repeated character sequence
(?<= .{$char} ) # Must include pos $char - 1
}x) {
printf "%s => %d\n", $2, length($1);
}
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
Output:
G => 1
A => 3
T => 1
T => 2
A => 4
T => 3
A => 4
A => 5
T => 2
G => 3
This code should do what you need. There really isn't a regular expression that will find the longest sequence of a given character at and around a given character position. This code works by splitting the string $seq into an array of characters #seq and then searching forwards and backwards from the centre.
It's practical to do things this way because the sequence is relatively short, and as long as there's an odd numbers of characters in the string it will calculate the centre point for you.
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($base, $length) = find_mid_band($_);
printf "%s => %d\n", $base, $length;
}
sub find_mid_band {
my ($seq) = #_;
my #seq = unpack '(A1)*', $seq;
my $len_seq = #seq;
my $c_offset = ($len_seq - 1) / 2;
my $c_char = $seq[$c_offset];
my ($start, $end) = ($c_offset, $c_offset + 1);
--$start while $start > 0 and $seq[$start-1] eq $c_char;
++$end while $end < $len_seq and $seq[$end] eq $c_char;
return $c_char, $end-$start;
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC
output
G => 1
A => 3
T => 1
T => 2
A => 4
T => 3
A => 4
A => 5
T => 2
G => 3
Update
Here's a better way. It's shorter and faster, and works by all the subsequences of the same character until it finds a sequence that spans the middle character.
The output is identical to that of the above.
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($base, $length) = find_mid_band($_);
printf "%s => %d\n", $base, $length;
}
sub find_mid_band {
my ($seq) = #_;
my $mid_seq = length($seq) / 2;
while ( $seq =~ /(.)\1*/g ) {
if ($-[0] < $mid_seq and $+[0] > $mid_seq) {
return $1, $+[0]-$-[0];
}
}
}
__DATA__
ATTTTTAGCCGAACAAGTACC
TACTTAGTTAAATTGTTACAA
ATAAACCTTGTGCAGGTTTGT
CCTTAATCCTTGTATTTTTAA
TCTTGTTAAAATGTCTACAGG
ATGTTAGTTATTTCATTCTTC
AAGTAACTAAAATTGCTCAAT
ACATTCGACAAAAATGAAAAA
TGTTTCGAATTCACCATATGC
AGTCGCAGCGGGTGCTCCAGC

Parsing YAML-like text file into hash structure

I've got the text file:
country = {
tag = ENG
ai = {
flags = { }
combat = { ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD }
continent = { "Oceania" }
area = { "America" "Maine" "Georgia" "Newfoundland" "Cuba" "Bengal" "Carnatic" "Ceylon" "Tanganyika" "The Mascarenes" "The Cape" "Gold" "St Helena" "Guiana" "Falklands" "Bermuda" "Oregon" }
region = { "North America" "Carribean" "India" }
war = 50
ferocity = no
}
date = { year = 0 month = january day = 0 }
}
What I'm trying to do is to parse this text into perl hash structure, so that the output after data dump looks like this:
$VAR1 = {
'country' => {
'ai' => {
'area' => [
'America',
'Maine',
'Georgia',
'Newfoundland',
'Cuba',
'Bengal',
'Carnatic',
'Ceylon',
'Tanganyika',
'The Mascarenes',
'The Cape',
'Gold',
'St Helena',
'Guiana',
'Falklands',
'Bermuda',
'Oregon'
],
'combat' => [
'ROY',
'WLS',
'PUR',
'SCO',
'EIR',
'FRA',
'DEL',
'USA',
'QUE',
'BGL',
'MAH',
'MOG',
'VIJ',
'MYS',
'DLH',
'GUJ',
'ORI',
'JAI',
'ASS',
'MLC',
'MYA',
'ARK',
'PEG',
'TAU',
'HYD'
],
'continent' => [
'Oceania'
],
'ferocity' => 'no',
'flags' => [],
'region' => [
'North America',
'Carribean',
'India'
],
'war' => 50
},
'date' => {
'day' => 0,
'month' => 'january',
'year' => 0
},
'tag' => 'ENG'
}
};
Hardcoded version might look like this:
#!/usr/bin/perl
use Data::Dumper;
use warnings;
use strict;
my $ret;
$ret->{'country'}->{tag} = 'ENG';
$ret->{'country'}->{ai}->{flags} = [];
my #qw = qw( ROY WLS PUR SCO EIR FRA DEL USA QUE BGL MAH MOG VIJ MYS DLH GUJ ORI JAI ASS MLC MYA ARK PEG TAU HYD );
$ret->{'country'}->{ai}->{combat} = \#qw;
$ret->{'country'}->{ai}->{continent} = ["Oceania"];
$ret->{'country'}->{ai}->{area} = ["America", "Maine", "Georgia", "Newfoundland", "Cuba", "Bengal", "Carnatic", "Ceylon", "Tanganyika", "The Mascarenes", "The Cape", "Gold", "St Helena", "Guiana", "Falklands", "Bermuda", "Oregon"];
$ret->{'country'}->{ai}->{region} = ["North America", "Carribean", "India"];
$ret->{'country'}->{ai}->{war} = 50;
$ret->{'country'}->{ai}->{ferocity} = 'no';
$ret->{'country'}->{date}->{year} = 0;
$ret->{'country'}->{date}->{month} = 'january';
$ret->{'country'}->{date}->{day} = 0;
sub hash_sort {
my ($hash) = #_;
return [ (sort keys %$hash) ];
}
$Data::Dumper::Sortkeys = \hash_sort;
print Dumper($ret);
I have to admit I have a huge problem dealing with nested curly brackets.
I've tried to solve it by using greedy and ungreedy matching, but it seems it didn't do the trick. I've also read about extended patterns (like (?PARNO)) but I have absolutely no clue how to use them in my particular problem. Order of data is irrelevant, since I have the hash_sort subroutine.
I'll apprieciate any help.
I broke it down to some simple assumptions:
An entry would consist of an identifier followed by an equals sign
An entry would be one of three basic types: a level or set or a single value
A set has 3 forms: 1) quoted, space-separated list; 2) key-value pairs, 3) qw-like unquoted list
A set of key-value pairs must contain an indentifier for a key and either nonspaces or a quoted
value for a value
See the interspersed comments.
use strict;
use warnings;
my $simple_value_RE
= qr/^ \s* (\p{Alpha}\w*) \s* = \s* ( [^\s{}]+ | "[^"]*" ) \s* $/x
;
my $set_or_level_RE
= qr/^ \s* (\w+) \s* = \s* [{] (?: ([^}]+) [}] )? \s* $/x
;
my $quoted_set_RE
= qr/^ \s* (?: "[^"]+" \s+ )* "[^"]+" \s* $/x
;
my $associative_RE
= qr/^ \s*
(?: \p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ ) \s+ )*
\p{Alpha}\w* \s* = \s* (?: "[^"]+" | \S+ )
\s* $
/x
;
my $pair_RE = qr/ \b ( \p{Alpha}\w* ) \s* = \s* ( "[^"]+" | \S+ )/x;
sub get_level {
my $handle = shift;
my %level;
while ( <$handle> ) {
# if the first character on the line is a close, then we're done
# at this level
last if m/^\s*[}]/;
my ( $key, $value );
# get simple values
if (( $key, $value ) = m/$simple_value_RE/ ) {
# done.
}
elsif (( $key, my $complete_set ) = m/$set_or_level_RE/ ) {
if ( $complete_set ) {
if ( $complete_set =~ m/$quoted_set_RE/ ) {
# Pull all quoted values with global flag
$value = [ $complete_set =~ m/"([^"]+)"/g ];
}
elsif ( $complete_set =~ m/$associative_RE/ ) {
# going to create a hashref. First, with a global flag
# repeatedly pull all qualified pairs
# then split them to key and value by spliting them at
# the first '='
$value
= { map { split /\s*=\s*/, $_, 2 }
( $complete_set =~ m/$pair_RE/g )
};
}
else {
# qw-like
$value = [ split( ' ', $complete_set ) ];
}
}
else {
$value = get_level( $handle );
}
}
$level{ $key } = $value;
}
return wantarray ? %level : \%level;
}
my %base = get_level( \*DATA );
Well, as David suggested, the easiest way would be to get whatever produced the file to use a standard format. JSON, YAML, or XML would be much easier to parse.
But if you really have to parse this format, I'd write a grammar for it using Regexp::Grammars (if you can require Perl 5.10) or Parse::RecDescent (if you can't). This'll be a little tricky, especially because you seem to be using braces for both hashes & arrays, but it should be doable.
The contents look pretty regular. Why not perform some substitutions on the content and convert it to hash syntax, then eval it. That would be a quick and dirty way to convert it.
You can also write a parser, assuming you know the grammar.