Perl string manipulation and find - regex

I am currently working on a phonebook program for a class and I am having a little trouble with the regex part, both formatting my text and finding what I'm looking for. Firstly, I am having trouble editing my phone number text into the format I want. I am able to find text that has 7 digits in a row (7777777), but I am unable to substitute it with (1-701-777-7777).
if($splitIndex[1] =~ m/^(\d{3}\d{4})/) {
$splitIndex[1] =~ s/([\d{3}][\d{4}])/1-701-[$1]-[$2]/;
print "Updated: $splitIndex[1]";
}
When I run this code the output ends up being this (it won't let me embed an image, so here is the output: https://imgur.com/a/8HtW7xm).
Secondly, I am having trouble with the actual regex for the searching. I save all the possible letter combinations in $letOfSearch and the number order combination in $numOfSearch. Through playing around with regex I have figured out that [$numOfSearch]+[$numOfSearch[-1]...[$numOfSearch[1] gives me the correct find for the numbers, but I am unable to write it properly in my code.
#If user input is only numbers
if($searchValue =~ m/(\D)/) {
#print "Not a number\n";
if($splitIndex[1] =~ m/([$numOfSearch]+)/) {
if($found == 0) {
print "$splitIndex[0]:$splitIndex[1]\n";
$found = 1;
}
}
if($splitIndex[0] =~ m/([$letOfSearch])/i) {
if($found == 0) {
print "$splitIndex[0]:$splitIndex[1]\n";
$found = 1;
}
}
$found = 0;
} else {
#If it is a number, search for that number combo immediately
if($splitIndex[1] =~ m/([$numOfSearch]+)/) {
if($found == 0) {
print "$splitIndex[0]:$splitIndex[1]\n";
$found = 1;
}
}
if($splitIndex[0] =~ m/([$letOfSearch])/i) {
if($found == 0) {
print "$splitIndex[0]:$splitIndex[1]\n";
$found = 1;
}
}
$found = 0;
}
}
}

Instead of:
if($splitIndex[1] =~ m/^(\d{3}\d{4})/) {
$splitIndex[1] =~ s/([\d{3}][\d{4}])/1-701-[$1]-[$2]/;
print "Updated: $splitIndex[1]";
}
try this:
if ($splitIndex[1] =~ s/(\d{3})(\d{4})/1-701-$1-$2/)
{
print "Updated: $splitIndex[1]";
}
In regular expressions, a set of square brackets ([ and ]) matches one and only one character, no matter how much you put between the brackets. So when you write [\d{3}][\d{4}], that matches exactly two characters, because you are using two sets of []. The first character will be one of \d (any digit), {, 3, or }, and the second will be one of \d, {, 4, or }, because that's what you wrote inside the brackets.
The order doesn't matter inside of the square brackets of a regular expression, so [\d{3}] is the same as [}1527349806{3]. As you can see, that's probably not what you wanted.
What you meant to do was capture the \d{3} and \d{4} strings, and you do that with a regular set of capturing parentheses, like this: (\d{3})(\d{4})
Since you had only one set of parentheses (that is, you had ([\d{3}][\d{4}])) and it contained exactly two []s, it was putting exactly two characters into $1, and nothing at all into $2. That's why, when you attempted to use $2 in the second half of your s///, it was complaining about an uninitialized value in $2. You were attempting to use a value ($2) that simply wasn't set.
(Also, you were doing two sets of matches: One for the m//, and one for the s///. I simply removed the m// match and kept the s/// match, using its return value to determine if we need to print() anything.)
The second part of the s/// does not use regular expressions, so any [, ], {, }, (, or ) will show up literally as that character. So if you don't want square brackets in the final phone number, don't use them. That's why I used s/.../1-701-$1-$2/; instead of s/.../1-701-[$1]-[$2]/;.
So when you wrote s/([\d{3}][\d{4}])/1-701-[$1]-[$2]/, the ([\d{3}][\d{4}]) part was putting two characters into $1, and nothing into $2. That's why you got a result that contained [77] (which was $1 surrounded by brackets) and [] (which was $2 (an uninitialized value) surrounded by brackets).
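As a minimal sketch of the corrected substitution, the snippet below wraps it in a small helper. The sub name format_phone and the ^ and $ anchors are assumptions added for illustration; they are not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: prefix a bare 7-digit number with 1-701.
sub format_phone {
    my ($number) = @_;
    # Two capture groups: $1 holds the first three digits, $2 the last four.
    # The ^ and $ anchors (an assumption) leave anything else untouched.
    $number =~ s/^(\d{3})(\d{4})$/1-701-$1-$2/;
    return $number;
}

print format_phone('7777777'), "\n";   # prints 1-701-777-7777
```

Because s/// returns false when nothing matched, the same substitution can still be used directly as the condition of an if, as in the answer above.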
As for the second part of your post, I notice that you use a lot of capturing parentheses in your regular expressions, but you never actually use what you capture. That is, you never use $1 (or $2). For example, you write:
if($searchValue =~ m/(\D)/) {
which has m/(\D)/, yet you never use $1 anywhere in that code. I wonder: What's the point of capturing that non-digit character if you don't use it anywhere in your code?
I've seen programmers get confused and mix up the purpose of parentheses and square brackets. When using regular expressions, square brackets ([ and ]) match (not capture) exactly one character. What they match is not put in $1, $2, or any other $n.
Parentheses, on the other hand, capture whatever they match, by setting $1 (or $2, $3, etc.) to what was matched. In general, you shouldn't use parentheses unless you plan on capturing and using that match later. (The main exception to this rule is if you need to group a set of matches, like this: m/I have a (cat|dog|bird)/.)
Many programmers confuse square brackets and parentheses in regular expressions, and try to use them interchangeably. They'll write something like m/I have a [cat|dog|bird]/ and not realize that it's the same as m/I have a [abcdgiort|]/ (which doesn't capture anything, since there are no parentheses), and wonder why their program complains that $1 is an uninitialized value.
This is a common mistake, so don't feel bad if you didn't know the difference. Now you know, and hopefully you can figure out what needs to be corrected in the second part of your code.
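A short sketch of the difference, using the cat/dog/bird example from above. The sub names are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: parentheses group the alternatives and capture into $1.
sub pet_from_group {
    my ($text) = @_;
    return $text =~ /I have a (cat|dog|bird)/ ? $1 : undef;
}

# Hypothetical helper: square brackets match ONE character out of the set
# a, b, c, d, g, i, o, r, t and | and capture nothing.
sub matches_char_class {
    my ($text) = @_;
    return $text =~ /I have a [cat|dog|bird]/ ? 1 : 0;
}

print pet_from_group('I have a dog'), "\n";        # prints dog
print matches_char_class('I have a goat'), "\n";   # prints 1, because g is in the set
```

The second helper accepts "goat" because only the single character after "I have a " is checked, which is rarely what was intended.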
I hope this helps.

Related

How can I match only integers in Perl?

So I have an array that goes like this:
my @nums = (1,2,12,24,48,120,360);
I want to check whether there is an element that is not an integer inside that array, without using a loop. It goes like this:
if(grep(!/[^0-9]|\^$/,@nums)){
die "Numbers are not in correct format.";
}else{
#Do something
}
Basically, the format should not be like this (Empty string is acceptable):
1A
A2
#A
#
#######
More examples:
1,2,3,A3 = Unacceptable
1,2,###,2 = unacceptable
1,2,3A,4 = Unacceptable
1, ,3,4=Acceptable
1,2,3,360 = acceptable
I know that there is another way, using looks_like_number, but I can't use that for reasons outside of my control (my setup). That's why I used the regex method.
My question is: even though the numbers are not in the correct format (A60, for example), the condition always returns false. Basically, it ignores the incorrect format.
You say in the comments that you don't want to use modules because you can't install them, but there are many core modules that should come with Perl (although some systems screw this up).
zdim's answer in the comments is to look for anything that is not 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. That's the negated character class [^0-9]. A grep in scalar context returns the number of items that match:
my $found_non_ints = grep { /[^0-9]/ } @items;
Instead of that, I'd go back to the non-negated character class and match strings that have only zero or more digits. To do this, anchor the pattern to the absolute start and end of the string:
my $found_non_ints = grep { ! /\A[0-9]*\z/ } @items;
But, this doesn't really match integers. It matches positive whole numbers (and zero). If you want to match negative numbers as well, allow an optional - at the start of the string:
my $found_non_ints = grep { ! /\A-?[0-9]*\z/ } @items;
That - would be a problem in the negated character class.
Also, you don't want the $ anchor here: that allows a possible newline to match at the end, and that's a non-digit (the \Z is the same for the end of the string). Also, the meaning of $ can change based on the setting of the /m flag, which might be set with default regex flags.
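To see the anchor difference in isolation, here is a small sketch. The two sub names are made up for this comparison.

```perl
use strict;
use warnings;

# Hypothetical helpers comparing the two ways of anchoring the end of the string.
sub digits_with_dollar { my ($s) = @_; return $s =~ /^[0-9]*$/   ? 1 : 0 }
sub digits_with_z      { my ($s) = @_; return $s =~ /\A[0-9]*\z/ ? 1 : 0 }

print digits_with_dollar("12\n"), "\n";   # prints 1: $ tolerates the trailing newline
print digits_with_z("12\n"), "\n";        # prints 0: \z demands the true end of the string
```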
Here's a short program with your sample data. Note that you need to decide how to split up the list; does whitespace matter? I decided to remove whitespace around the comma:
#!perl
use v5.10;
while( <DATA> ) {
    chomp;
    my $found_non_ints = grep { ! /\A[0-9]*\z/ } split /\s*,\s*/;
    say "$_ => Found $found_non_ints non-ints";
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,,3,4
1,2,3,360
The solution proposed in the question gets close, except that the logic got reversed and there is an error in a regex pattern. One way for it:
if ( grep { /[^0-9] | ^$/x } @nums ) { say 'not all integers' }
Regex explanation
[] is a character class: it matches any one of the characters listed inside (so [abc] matches either of a, b, or c) -- but when it starts with a ^ it matches any character not listed; so [^abc] matches any char not being either of a, b, or c. The pattern 0-9 inside a character class specifies all digits in that range (and we can also use a-z and A-Z)
So [^0-9] matches any character that is not a digit
Then that is or-ed by | with ^$: ^ matches the beginning of the string and $ the end of it. So ^$ matches a string without anything in it, an empty string! We need to account for that because [^0-9] doesn't, while an array element can be an empty string. (It can also be undef, but from my understanding that is not possible with the actual data, and a regex on undef would draw a warning.)
Note that $ allows for a newline as well, and that ^ and $ may change their meaning if /m modifier is in use, matching on linefeeds inside a string. However, in all these cases we'd be matching a non-digit, which is precisely the point here
/x modifier makes it disregard literal spaces inside so we can space things out for easier reading. (It also allows for newlines and comments with #, so complex patterns can be organized and documented very nicely)
So that's all -- the regex tries to match anything that shouldn't be in an integer (assumed to be strictly positive in OP's data).
If it matches any such, in any one of the array elements, then grep returns a list which isn't empty (but has at least one element) and that is "true" under if. So we caught a non-integer and we go into if's block to deal with that.
A little aside: we can also declare and populate an array right inside the if condition, to catch all those non-integers:
if ( my @non_ints = grep { /[^0-9] | ^$/x } @nums ) {
    say 'Non-integers: ', join ' ', map { "|$_|" } @non_ints;
}
This also reads more nicely, telling by the array name what we're after in that complicated condition: "non_ints." I put || around each item in print to be able to see an empty string.†
Now, when you put an exclamation mark in front of that regex, it reverses the true/false return from the regex and our code goes haywire. So drop that !.
The other error is in escaping the ^ by having \^. This would match a literal ^ character, robbing ^ of its special meaning as a pattern in regex, explained above. So drop that \.
One other way is in using an extremely useful List::Util library, which is "core" (so it is normally installed with Perl, even though that can get messed up).
Among a number of essential functions it gives us any, and with it we have
use List::Util qw(any);
if ( any { /[^0-9]|^$/ } @nums ) { say 'not all integers' }
I like any firstly because the name of the function includes at least a part of the needed logic, making code that much clearer and easier to comprehend: is there any element of @nums for which the code in the block is true? So any element which contains a non-digit? Precisely what is needed here.
Then, another advantage is that any will quit as soon as it finds one match, while grep continues through the whole list. But this efficiency advantage shows only on very large arrays or a lot of repeated checks. Also, on the other hand sometimes we want to count all instances.
I'd also like to point out some of any's siblings: none and notall. These names themselves also capture a good deal of logic, making otherwise possibly convoluted code that much clearer. Browse through this library to get accustomed to what is in there.
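A brief sketch of any next to its sibling none, both from List::Util. The wrapper subs are hypothetical, and the pattern is the simple non-digit class, so empty strings are not flagged here.

```perl
use strict;
use warnings;
use List::Util qw(any none);

# Hypothetical helper: true if at least one element contains a non-digit.
sub has_non_digit {
    my @items = @_;
    return (any { /[^0-9]/ } @items) ? 1 : 0;
}

# Hypothetical helper: the same check phrased positively with none.
sub all_digits {
    my @items = @_;
    return (none { /[^0-9]/ } @items) ? 1 : 0;
}

print has_non_digit(1, 2, 3, 'A3'), "\n";   # prints 1
print all_digits(1, 2, 3, 360), "\n";       # prints 1
```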
† A program with your test data
use warnings;
use strict;
use feature 'say';
while (<DATA>) {
    chomp;
    my @nums = split /\s*,\s*/;
    say "Data: @nums";
    if ( my @non_ints = grep { /[^0-9] | ^$/x } @nums ) {
        say 'Non-ints: ', join ' ', map { "|$_|" } @non_ints;
    }
    say '---';
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,2,3,360

regex: confirm if a optional portion was matched

I have a string that can be of two forms, and it is unknown which form it will be each time:
hello world[0:10]; or hello world;
There may or may not be the brackets with numbers. The two words (hello and world) can vary. If the brackets and numbers are there, the first number is always 0 and the second number (10) varies.
I need to capture the first word (hello) and, if it exists, the second number (10). I also need to know which form of the string it was.
hello world[0:10]; I would capture {hello, 10, form1}, and hello world; I would capture {hello, form2}. I don't really care how the "form" is formatted, I just need to be able to differentiate. It can be a bit (1=form1, 0=form2), structural (form1 puts me in one scope and form2 another), etc.
I currently have the following (now working) regex:
/(\w*) \s \w* (?:\[0:(\d*)\])?;/x
This gives me $1 = hello and potentially $2 = 10. I now just need to know if the bracketed numbers were there or not. This will be repeated many times, so I can't assume $2 = undef going into the regex. $2 could also be the same thing a few times in a row so I can't just look for a change in $2 before and after the regex.
My best solution so far is to run the regex twice, the first time with the brackets and the second time without:
if( /(\w*) \s \w* \[0:(\d*)\];/x ) {...}
elsif( /(\w*) \s \w*;/x ) {...}
This seems very inefficient and inelegant though so I was wondering if there is a better way?
You can use ? to optionally match portions of your regex. Then you can capture the output directly as a return value from the regex.
my $re = qr{ (\w*) \s \w* (?:\[0:(\d+)\])?; }x;
if( my($word, $num) = $line =~ $re ) {
say "Word: $word";
say "Num: $num" if defined $num;
}
else {
say "No match";
}
(?:\[0:(\d+)\])? says there may be a [0:\d+]. (?:) makes the grouping non-capturing so only \d+ is captured.
$1 and $2 are also safe to use: they are reset on each successful match, so $2 is undef whenever the optional group did not take part in the match. Using lexical variables just makes things more explicit.
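Putting this together, here is a sketch that also reports which form was seen. The sub name parse_line, the hash keys, and the use of \w+ rather than \w* are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref describing the line, or nothing if
# the line has neither form.
sub parse_line {
    my ($line) = @_;
    my ($word, $num) = $line =~ / (\w+) \s \w+ (?: \[ 0: (\d+) \] )? ; /x
        or return;
    # $num is defined only when the optional bracket group took part in the match.
    return {
        word => $word,
        num  => $num,
        form => defined $num ? 'form1' : 'form2',
    };
}

print parse_line('hello world[0:10];')->{form}, "\n";   # prints form1
print parse_line('hello world;')->{form}, "\n";         # prints form2
```

Testing definedness of the second capture is enough to tell the forms apart, so the regex only has to run once per line.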

Matching numbers for substitution in Perl

I have this little script:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
s/(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex gives me the string start for $1 on this case.
Try this pattern
foreach (@list) {
s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
print $_ . "\n";
}
Explanations:
I use the branch reset feature here (i.e. (?|...()...|...()...)), which lets capturing groups in different alternatives share the same number ($1 here). That way, you avoid a second substitution to trim a zero from the left of the capture.
To remove everything from the beginning of the string up to the number, I use:
.*? # any characters, zero or more times
    # (the ? makes the * quantifier lazy, so it matches as little as possible)
_?  # an optional underscore
Note that you can ensure that you have only 2 digits adding a lookahead to check if there is not a digit that follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (@list) {
if (/(\d{2})/) {
$_ = "$1.txt";
}
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
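Combined into one piece, the capture and the numeric conversion might look like the following sketch. The sub name short_name is hypothetical, and returning the input unchanged when no two digits are found is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: build "<number>.txt" from the first two consecutive digits.
sub short_name {
    my ($name) = @_;
    # 0 + $1 forces numeric context, which drops a leading zero.
    return $name =~ /(\d{2})/ ? (0 + $1) . '.txt' : $name;
}

print short_name($_), "\n" for 'R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt';
# prints 5.txt, 12.txt and 1.txt on separate lines
```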
I would modify your regular expression. Try using this code:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
s/.*(\d{2}).*\.txt$/$1.txt/;
s/^0+//;
print $_ . "\n";
}
The problem is that the first part of your s/// matches what you think it does, but the second part isn't replacing what you think it should. s/// only replaces the text that was actually matched. Thus, to get rid of something like T3_, you have to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;

How to match a string which starts with either a new line or after a comma?

My string is $tables="newdb1.table1:100,db2.table2:90,db1.table1:90". My search string is db1.table1 and my aim is to extract the value after : (i.e 90 in this case).
I am using:
if ($tables =~ /db1.table1:(\d+)/) { print $1; }
but the problem is it is matching newdb1.table1:100 and printing 100.
Can you please give me a regular expression to match a string which either starts with a newline or has a comma before it?
Use a word boundary (\b) in front of the table name:
if ($tables =~ /\bdb1.table1:(\d+)/) { print $1; }
if ($tables =~ /(^|,)db1.table1:(\d+)/) { print $2; }
To answer your exact question, that is to match just after the start of the string or a comma, you want a positive look-behind assertion. You may be tempted to write a pattern of
/(?<=^|,)db1\.table1:(\d+)/
but that may fail with an error of
Variable length lookbehind not implemented in regex m/(?<=^|,)db1\.table1:(\d+)/ ...
So hold the regex engine's hand a bit by using a fixed-width assertion instead. A negative look-behind for a single non-comma character succeeds both at the very start of the string (where there is no preceding character at all) and right after a comma.
/(?<![^,])db1\.table1:(\d+)/
While we are locking it down, let's be sure to bracket the end with a look-ahead assertion.
while ($tables =~ /(?<![^,])db1\.table1:(\d+)(?=,|$)/g) {
print "[$1]\n";
}
Output:
[90]
You could also use \b for a regex word boundary, which has the same output.
while ($tables =~ /\bdb1\.table1:(\d+)(?=,|$)/g) {
print "[$1]\n";
}
For the most natural solution, follow the rule of thumb proposed by Randal Schwartz, author of Learning Perl. Use capturing when you know what you want to keep and split when you know what you want to throw away. In your case you have a mixture: you want to discard the comma separators, and you want to keep the digits after the colon for a certain table. Write that as
for (split /\s*,\s*/, $tables) {
    if (my($value) = /^db1\.table1:(\d+)$/) {
        print "[$value]\n";
    }
}
Output:
[90]
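The split approach generalizes to any table name. A minimal sketch follows; the sub name table_value is hypothetical, and \Q...\E is used so that the dot in the name is matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number after "name:" for an exact table name.
sub table_value {
    my ($tables, $name) = @_;
    for my $entry (split /\s*,\s*/, $tables) {
        # \Q...\E quotes regex metacharacters (such as the dot) in $name.
        return $1 if $entry =~ /^\Q$name\E:(\d+)$/;
    }
    return;   # no entry with that exact name
}

my $tables = "newdb1.table1:100,db2.table2:90,db1.table1:90";
print table_value($tables, 'db1.table1'), "\n";   # prints 90
```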

Is there a way, using regular expressions, to match a pattern for text outside of quotes?

As stated in the title, is there a way, using regular expressions, to match a text pattern for text that appears outside of quotes. Ideally, given the following examples, I would want to be able to match the comma that is outside of the quotes, but not the one in the quotes.
This is some text, followed by "text, in quotes!"
or
This is some text, followed by "text, in quotes" with more "text, in quotes!"
Additionally, it would be nice if the expression would respect nested quotes as in the following example. However, if this is technically not feasible with regular expressions then it would simply be nice to know if that is the case.
The programmer looked up from his desk, "This can't be good," he exclaimed, "the system is saying 'File not found!'"
I have found some expressions for matching something that would be in the quotes, but nothing quite for something outside of the quotes.
Easiest is matching both commas and quoted strings, and then filtering out the quoted strings.
/"[^"]*"|,/g
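In Perl, the filtering step can be done with a capture group around the comma, so that quoted strings are consumed but ignored. This is only a sketch; the sub name comma_offsets is made up, and it assumes quotes are balanced and never nested.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offsets of commas that sit outside double quotes.
sub comma_offsets {
    my ($text) = @_;
    my @offsets;
    # Each iteration consumes either a whole quoted string or a single comma.
    while ($text =~ /"[^"]*"|(,)/g) {
        # $1 is defined only when the comma alternative matched;
        # $-[1] is the start offset of that capture.
        push @offsets, $-[1] if defined $1;
    }
    return @offsets;
}

print join(' ', comma_offsets('a, "b, c" d, e')), "\n";   # prints 1 11
```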
If you really can't have the quotes matching, you could do something like this:
/,(?=[^"]*(?:"[^"]*"[^"]*)*\Z)/g
This could become slow, because for each comma it has to look at the remaining characters and count the number of quotes. \Z matches at the end of the string (or just before a final newline). It is similar to $, but never matches at line ends inside the string, even in multi-line mode.
If you don't mind an extra capture group, it could be done like this instead:
/\G((?:[^"]*"[^"]*")*?[^"]*?)(,)/g
This will only scan the string once. It counts the quotes from the beginning of the string instead. \G will match the position where last match ended.
The last pattern could need an example.
Input String: 'This is, some text, followed by "text, in quotes!" and more ,-as'
Matches:
1. ['This is', ',']
2. [' some text', ',']
3. [' followed by "text, in quotes!" and more ', ',']
It matches the string leading up to the comma, as well as the comma.
This can be done with modern regexes due to the massive number of hacks to regex engines that exist, but let me be the one to post the "Don't Do This With Regular Expressions" answer.
This is not a job for regular expressions. This is a job for a full-blown parser. As an example of something you can't do with (classical) regular expressions, consider this:
()(())(()())
No (classical) regex can determine if those parentheses are matched properly, but doing so without a regex is trivial:
/* C code */
#include <stdio.h>

int main(void)
{
    char string[] = "()(())(()())";
    int parens = 0;
    /* walk the string until the terminating NUL character */
    for(char *tmp = string; *tmp; tmp++)
    {
        if(*tmp == '(') parens++;
        if(*tmp == ')') parens--;
    }
    if(parens > 0)
    {
        printf("%d too many open parentheses.\n", parens);
    }
    else if(parens < 0)
    {
        printf("%d too many closing parentheses.\n", -parens);
    }
    else
    {
        printf("Parentheses match!\n");
    }
    return 0;
}
# Perl code
my $string = "()(())(()())";
my $parens = 0;
for(split(//, $string)) {
$parens++ if $_ eq "(";
$parens-- if $_ eq ")";
}
die "Too many open parenthesis.\n" if $parens > 0;
die "Too many closing parenthesis.\n" if $parens < 0;
print "Parenthesis match!";
See how simple it was to write some non-regex code to do the job for you?
EDIT: Okay, back from seeing Adventureland. :) Try this (written in Perl, commented to help you understand what I'm doing if you don't know Perl):
# split $string into a list, split on the double quote character
my @temp = split(/"/, $string);
# iterate through the indices of our list
for(0 .. $#temp) {
    # skip odd-numbered elements - only process $temp[0], $temp[2], etc.
    # the reason is that, if we split on "s, every other element is a quoted string
    next if $_ & 1;
    if($temp[$_] =~ /regex/) {
        # do stuff
    }
}
Another way to do it:
my $bool = 0;
my $str = "";
my $match = 0;
# loop through the characters of a string
for(split(//, $string)) {
    if($_ eq '"') {
        $bool = !$bool;
        if($bool) {
            # an opening quote: test the unquoted text collected so far
            $match += $str =~ /regex/;
            $str = "";
        }
        # never add the quote character itself to the test string
        next;
    }
    if(!$bool) {
        # add the current character to our test string
        $str .= $_;
    }
}
# get trailing string match
$match += $str =~ /regex/;
(I give two because, in another language, one solution may be easier to implement than the other, not just because There's More Than One Way To Do It™.)
Of course, as your problems grow in complexity, there will arise certain benefits of constructing a full-blown parser, but that's a different horse. For now, this will suffice.
As mentioned before, a regexp cannot match arbitrarily nested patterns, since nested structures form a context-free language rather than a regular one.
So if you have any nested quotes, you are not going to solve this with a regex.
(Except with the "balancing group" feature of a .Net regex engine - as mentioned by Daniel L in the comments - , but I am not making any assumption of the regex flavor here)
Except if you add further specification, like a quote within a quote must be escaped.
In that case, the following:
text before string "string with \escape quote \" still
within quote" text outside quote "within quote \" still inside" outside "
inside" final outside text
would be matched successfully with:
(?ms)((?:\\(?=")|[^"])+)(?:"((?:[^"]|(?<=\\)")+)(?<!\\)")?
group1: text preceding a quoted text
group2: text within double quotes, even if \" are present in it.
Here is an expression that gets the match, but it isn't perfect, as the first match it gets is the whole string, removing the final ".
[^"].*(,).*[^"]
I have been using my Free RegEx tester to see what works.
Test Results
Group Match Collection # 1
Match # 1
Value: This is some text, followed by "text, in quotes!
Captures: 1
Match # 2
Value: ,
Captures: 1
You would do better to build yourself a simple parser (pseudo-code):
quoted := False
FOR char IN string DO
    IF char = '"'
        quoted := !quoted
    ELSE
        IF char = "," AND !quoted
            // not quoted comma found
        ENDIF
    ENDIF
ENDFOR
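The pseudo-code above translates almost line for line into Perl. This is a sketch only: the sub name is hypothetical, it records the positions of the unquoted commas (the pseudo-code leaves that action open), and it assumes plain double quotes without escapes or nesting.

```perl
use strict;
use warnings;

# Hypothetical helper: positions of commas that are not inside double quotes.
sub unquoted_comma_positions {
    my ($text) = @_;
    my $quoted = 0;      # true while we are between a pair of quotes
    my $index  = 0;
    my @positions;
    for my $char (split //, $text) {
        if ($char eq '"') {
            $quoted = !$quoted;              # toggle on every quote character
        }
        elsif ($char eq ',' && !$quoted) {
            push @positions, $index;         # unquoted comma found
        }
        $index++;
    }
    return @positions;
}

print join(' ', unquoted_comma_positions('a, "b, c" d, e')), "\n";   # prints 1 11
```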
This really depends on if you allow nested quotes or not.
In theory, with nested quotes you cannot do this (regular languages can't count).
In practice, you might manage if you can constrain the depth. It will get increasingly ugly as you add complexity. This is often how people get into grief with regular expressions (trying to match something that isn't actually regular in general).
Note that some "regex" libraries/languages have added non-regular features.
If this sort of thing gets complicated enough, you'll really have to write/generate a parser for it.
You need more in your description. Do you want any set of possible quoted strings and non-quoted strings like this ...
Lorem ipsum "dolor sit" amet, "consectetur adipiscing" elit.
... or simply the pattern you asked for? This is pretty close I think ...
(?<outside>.*?)(?<inside>(?=\"))
It does capture the "'s however.
Maybe you could do it in two steps?
First you replace the quoted text:
("[^"]*")
and then you extract what you want from the remaining string
,(?=(?:[^"]*"[^"]*")*[^"]*\z)
Regexes may not be able to count, but they can determine whether there's an odd or even number of something. After finding a comma, the lookahead asserts that, if there are any quotation marks ahead, there's an even number of them, meaning the comma is not inside a set of quotes.
This can be tweaked to handle escaped quotes if needed, though the original question didn't mention that. Also, if your regex flavor supports them, I would add atomic groups or possessive quantifiers to keep backtracking in check.
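As a usage sketch in Perl, the look-ahead pattern can be handed to split to break a string at the commas outside quotes. The sub name is hypothetical, and the sketch assumes balanced, non-nested, unescaped double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas that are followed by an even number of quotes.
sub split_outside_quotes {
    my ($text) = @_;
    return split /,(?=(?:[^"]*"[^"]*")*[^"]*\z)/, $text;
}

my @parts = split_outside_quotes('This is some text, followed by "text, in quotes!"');
print scalar(@parts), "\n";   # prints 2
print "$parts[0]\n";          # prints This is some text
```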