Using Perl split function to keep (capture) some delimiters and discard others - regex

Let's say I am using Perl's split function to split up the contents of a file.
For example:
This foo file has+ a bunch of; (random) things all over "the" place
So let's say I want to use whitespace and the semicolons as delimiters.
So I would use something like:
split(/([\s+\;])/, $fooString)
I'm having trouble figuring out a syntax (or even if it exists) to capture the semicolon and discard the whitespace.

You seem to ask for something like
my @fields_and_delim = split /\s+|(;)/, $string; # not quite right
but this isn't quite what it may seem. It also returns empty elements (with warnings): when \s+ matches, the () captures nothing, but $1 is still returned as asked, and it's undef. There are yet more spurious elements when your delimiters come together in the string.
So filter
my @fields_and_delim = grep { defined and /\S/ } split /(\s+|;)/, $string;
in which case you can normally capture the delimiter.
This can also be done with a regex
my @fields_and_delim = $string =~ /([^\s;]+|;+)/g;
which in this case allows more control over what and how you pick from the string.
If repeated ; need be captured separately change ;+ to ;
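To compare the two approaches side by side, here is a minimal sketch run against the sample line from the question. The variable names @via_split and @via_match are mine, chosen only for illustration.

```perl
use strict;
use warnings;

# Sample string from the question.
my $string = 'This foo file has+ a bunch of; (random) things all over "the" place';

# Approach 1: capture every delimiter, then drop undefined,
# empty, and whitespace-only pieces.
my @via_split = grep { defined and /\S/ } split /(\s+|;)/, $string;

# Approach 2: match the tokens directly instead of splitting.
my @via_match = $string =~ /([^\s;]+|;+)/g;

print join('|', @via_split), "\n";
print join('|', @via_match), "\n";
```

Both lists should come out the same here, with the semicolon kept as its own element and the whitespace gone.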

I think that what you want is as simple as:
split /\s*;\s*/, $fooString;
That will separate around the ; character that may or may not have any whitespace before or after.
In your example:
>This foo file has+ a bunch of; (random) things all over "the" place<
It would split into:
>This foo file has+ a bunch of<
and:
>(random) things all over "the" place<
By the way, you need to put the result of split into an array; for instance:
my @parts = split /\s*;\s*/, $fooString;
Then $parts[0] and $parts[1] would have the two bits.

I think grep is what you're looking for really, to filter the list for values that aren't all whitespace:
my @all_exc_ws = grep {!/^\s+$/} split(/([\s\;])/, $fooString);
Also I removed the + from your regex since it was inside the [], which changes its meaning.

Related

How can I match only integers in Perl?

So I have an array that goes like this:
my @nums = (1,2,12,24,48,120,360);
I want to check if there is an element that is not an integer inside that array without using a loop. It goes like this:
if(grep(!/[^0-9]|\^$/,@nums)){
    die "Numbers are not in correct format.";
}else{
    #Do something
}
Basically, the format should not be like this (Empty string is acceptable):
1A
A2
#A
#
#######
More examples:
1,2,3,A3 = Unacceptable
1,2,###,2 = unacceptable
1,2,3A,4 = Unacceptable
1, ,3,4=Acceptable
1,2,3,360 = acceptable
I know that there is another way, using looks_like_number, but I can't use that for reasons outside of my control (setup reasons). That's why I used the regex method.
My question is: even though some numbers are not in the correct format (A60, for example), the condition always returns false. Basically, it ignores the incorrect format.
You say in the comments that you don't want to use modules because you can't install them, but there are many core modules that should come with Perl (although some systems screw this up).
zdim's answer in the comments is to look for anything that is not 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. That's the negated character class [^0-9]. A grep in scalar context returns the number of items that match:
my $found_non_ints = grep { /[^0-9]/ } @items;
Instead of that, I'd go back to the non-negated character class and match a string that has only zero or more digits. To do this, anchor the pattern to the absolute start and end of the string:
my $found_non_ints = grep { ! /\A[0-9]*\z/ } @items;
But this doesn't really match integers. It matches positive whole numbers (and zero). If you want to match negative numbers as well, allow an optional - at the start of the string:
my $found_non_ints = grep { ! /\A-?[0-9]*\z/ } @items;
That - would be a problem in the negated character class.
Also, you don't want the $ anchor here: that allows a possible newline to match at the end, and that's a non-digit (the \Z is the same for the end of the string). Also, the meaning of $ can change based on the setting of the /m flag, which might be set with default regex flags.
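The difference between $ and \z is easy to see with a value that still carries its newline. This is a small sketch with made-up variable names ($value, $dollar, $strict), not part of the program below.

```perl
use strict;
use warnings;

my $value = "123\n";    # digits followed by a newline

# $ tolerates one trailing newline, so this pattern still matches.
my $dollar = $value =~ /^[0-9]*$/   ? 1 : 0;

# \z demands the true end of the string, so the newline makes it fail.
my $strict = $value =~ /\A[0-9]*\z/ ? 1 : 0;

print "with \$: $dollar, with \\z: $strict\n";
```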
Here's a short program with your sample data. Note that you need to decide how to split up the list; does whitespace matter? I decided to remove whitespace around the comma:
#!perl
use v5.10;
while( <DATA> ) {
    chomp;
    my $found_non_ints = grep { ! /\A[0-9]*\z/ } split /\s*,\s*/;
    say "$_ => Found $found_non_ints non-ints";
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,,3,4
1,2,3,360
The solution proposed in the question gets close, except that the logic got reversed and there is an error in a regex pattern. One way for it:
if ( grep { /[^0-9] | ^$/x } @nums ) { say 'not all integers' }
Regex explanation
[] is a character class: it matches any one of the characters listed inside (so [abc] matches either of a, b, or c) -- but when it starts with a ^ it matches any character not listed; so [^abc] matches any char not being either of a, b, or c. The pattern 0-9 inside a character class specifies all digits in that range (and we can also use a-z and A-Z)
So [^0-9] matches any character that is not a digit
Then that is or-ed by | with a ^$: ^ matches the beginning of the string and $ the end of it. So ^$ matches a string without anything in it -- an empty string! We need to account for that because [^0-9] doesn't, while an array element can be an empty string. (It can also be undef, but from my understanding that is not possible with actual data, and a regex on undef would draw a warning.)
Note that $ allows for a newline as well, and that ^ and $ may change their meaning if /m modifier is in use, matching on linefeeds inside a string. However, in all these cases we'd be matching a non-digit, which is precisely the point here
/x modifier makes it disregard literal spaces inside so we can space things out for easier reading. (It also allows for newlines and comments with #, so complex patterns can be organized and documented very nicely)
So that's all -- the regex tries to match anything that shouldn't be in an integer (assumed to be strictly positive in OP's data).
If it matches any such, in any one of the array elements, then grep returns a list which isn't empty (but has at least one element) and that is "true" under if. So we caught a non-integer and we go into if's block to deal with that.
A little aside: we can also declare and populate an array right inside the if condition, to catch all those non-integers:
if ( my @non_ints = grep { /[^0-9] | ^$/x } @nums ) {
    say 'Non-integers: ', join ' ', map { "|$_|" } @non_ints;
}
This also reads more nicely, telling by the array name what we're after in that complicated condition: "non_ints." I put || around each item in print to be able to see an empty string.†
Now, when you put an exclamation mark in front of that regex, it reverses the true/false return from the regex and our code goes haywire. So drop that !.
The other error is in escaping the ^ by having \^. This would match a literal ^ character, robbing ^ of its special meaning as a pattern in regex, explained above. So drop that \.
One other way is in using an extremely useful List::Util library, which is "core" (so it is normally installed with Perl, even though that can get messed up).
Among a number of essential functions it gives us any, and with it we have
use List::Util qw(any);
if ( any { /[^0-9]|^$/ } @nums ) { say 'not all integers' }
I like any firstly because the name of the function includes at least a part of the needed logic, making code that much clearer and easier to comprehend: is there any element of @nums for which the code in the block is true? So any element which contains a non-digit? Precisely what is needed here.
Then, another advantage is that any will quit as soon as it finds one match, while grep continues through the whole list. But this efficiency advantage shows only on very large arrays or a lot of repeated checks. Also, on the other hand sometimes we want to count all instances.
I'd also like to point out some of any's siblings: none and notall. These names themselves also capture a good deal of logic, making otherwise possibly convoluted code that much clearer. Browse through this library to get accustomed to what is in there.
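To show how the siblings read next to any, here is a short sketch. It assumes a List::Util recent enough to export any, none and notall (as far as I know they arrived in version 1.33); the sample array and variable names are mine.

```perl
use strict;
use warnings;
use List::Util qw(any none notall);

my @nums = (1, 2, '3A', 4);

# True: '3A' contains a non-digit.
my $has_bad  = any    { /[^0-9]|^$/ } @nums;

# False, for the same reason: one element does match.
my $all_good = none   { /[^0-9]|^$/ } @nums;

# True: not every element consists only of digits.
my $not_all  = notall { /\A[0-9]+\z/ } @nums;

printf "any: %d, none: %d, notall: %d\n",
    map { $_ ? 1 : 0 } $has_bad, $all_good, $not_all;
```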
† A program with your test data
use warnings;
use strict;
use feature 'say';
while (<DATA>) {
    chomp;
    my @nums = split /\s*,\s*/;
    say "Data: @nums";
    if ( my @non_ints = grep { /[^0-9] | ^$/x } @nums ) {
        say 'Non-ints: ', join ' ', map { "|$_|" } @non_ints;
    }
    say '---';
}
__DATA__
1A
A2
#A
#
1,2,3,A3
1,2,###,2
1,2,3A,4
1, ,3,4
1,2,3,360

Why one word breaks all right output in regex (perl)?

I want to understand the situation with regular expression in Perl.
$str = "123-abc 23-rr";
I need to capture both words around each minus sign.
The regular expression is:
@mas=$str=~/(?:([\d\w]+)\-([\d\w]+))/gx;
And it shows the right output: 123, abc, 23, rr.
But if I change string a little and put one word in start:
$str = "word 123-abc 23-rr";
I want to take this first word into account, so I change my regexp:
@mas=$str=~/\w+\s(?:\s*([\d\w]+)\-([\d\w]+))*/gx;
My output should be the same, but I get: 23, rr. If I remove \s* or * the output is 123, abc. But that's still not right. Does anyone know why?
Rather than making an ever more specific regex for an ever more specific string, consider taking advantage of the overall pattern.
Each piece is separated by whitespace.
The first piece is a word.
The rest are pairs separated by dashes.
First split the pieces on whitespace.
my @pieces = split /\s+/, $str;
Then remove the first piece, it doesn't have to be split.
my $word = shift @pieces;
Then split each remaining piece on - into pairs.
my %pairs = map { split /-/, $_ } @pieces;
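Put together, the three steps might look like the sketch below. It assumes every remaining piece contains exactly one dash and that the left-hand sides are unique, since they become hash keys.

```perl
use strict;
use warnings;

my $str = 'word 123-abc 23-rr';

my @pieces = split /\s+/, $str;              # ('word', '123-abc', '23-rr')
my $word   = shift @pieces;                  # 'word'
my %pairs  = map { split /-/, $_ } @pieces;  # (123 => 'abc', 23 => 'rr')

print "$word\n";
print "$_ => $pairs{$_}\n" for sort { $a <=> $b } keys %pairs;
```

If the order of the pairs matters, an array of array references would be a better fit than a hash.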
For each match, each capture is returned.
In the first snippet, the pattern matches twice.
123-abc 23-rr
\_____/ \___/
There are two captures, so four (2*2=4) values are returned.
In the second snippet, the pattern matches once.
word 123-abc 23-rr
\________________/
There are two captures, so two (2*1=2) values are returned.
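A quick way to see this is to run both patterns from the question in list context and look at what comes back. In the second pattern the group repeats inside a single match, so each capture keeps only the value from its last repetition. The array names @first and @second are mine.

```perl
use strict;
use warnings;

# Pattern from the first snippet: it matches twice, two captures each time.
my @first  = '123-abc 23-rr'      =~ /(?:([\d\w]+)\-([\d\w]+))/gx;

# Pattern from the second snippet: one match, captures hold the last repetition.
my @second = 'word 123-abc 23-rr' =~ /\w+\s(?:\s*([\d\w]+)\-([\d\w]+))*/gx;

print "first:  @first\n";
print "second: @second\n";
```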

Regex Remove From x To y

I'm new to regex, I know the basics but only the basics. I need to parse a string and remove everything from each occurrence of one string up to another. For example,
Here is some random text
This wants to stay
foo
This wants to be removed
bar
And this wants to stay
So the desired output would be
Here is some random text
This wants to stay
And this wants to stay
And removed would be
foo
This wants to be removed
bar
It will always follow the pattern of match 'this string' to 'that string' and remove everything in between, including 'this string' and 'that string'.
The file is a text file, for the sake of this question, the pattern will always start with foo and end with bar, removing foo, bar and everything in between.
Foo and Bar ARE part of the file and need removing.
Regexes are probably the wrong tool here. I'd probably use string equality along with the flip-flop operator.
while (<$input_fh>) {
    # $_ must be named explicitly when printing to a lexical filehandle
    print {$output_fh} $_ unless ($_ eq "foo\n" .. $_ eq "bar\n");
}
You could do it with a regex and a match operator.
while (<$input_fh>) {
    print {$output_fh} $_ unless /foo/ .. /bar/;
}
That looks neater, but the regexes will match if the strings appear anywhere on an input line.
Update: Inverted the logic on the tests - so it's now correct.
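Here is a self-contained sketch of the same flip-flop idea that works on an in-memory list instead of filehandles, so it can be run as-is. The arrays @lines, @kept and @removed are names I made up for the demonstration.

```perl
use strict;
use warnings;

# The sample text from the question, one element per line.
my @lines = map { "$_\n" } (
    'Here is some random text', 'This wants to stay',
    'foo', 'This wants to be removed', 'bar',
    'And this wants to stay',
);

my (@kept, @removed);
for (@lines) {
    # In scalar context the range operator is a flip-flop: true from
    # the "foo" line through the "bar" line, inclusive.
    if ( $_ eq "foo\n" .. $_ eq "bar\n" ) { push @removed, $_ }
    else                                  { push @kept,    $_ }
}

print @kept;
```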
Are you looking for something like this?
#!/usr/bin/perl
use strict;
use warnings;
my $start = "foo";
my $end   = "bar";
my $str   = '';
while (<STDIN>) {
    $str .= $_;
}
$str =~ s/(.*)$start\n.*$end\n(.*)/$1$2/s;
print $str;
The only part of real importance to you is the regex, I suppose. I declare the start and end markers, read from standard input, and append each successive line to $str. The substitution then says: whatever was captured by the first parentheses (before foo) goes first, and whatever was captured by the second parentheses (after bar) goes last, using $1 and $2.
My output from a file containing your lines is:
marshall@marshall-desktop:~$ cat blah | ./haha
Here is some random text
This wants to stay
And this wants to stay
That's not what RegEx is there for. RegEx is there to detect patterns. If you want a simple string slice, you should simply iterate over the big string with a simple comparison (or, in languages which include string operations, use something like indexOf("your string here")).
However, simply using the literal string as the pattern would find the matches:
the pattern This wants to be removed matches every occurrence of that specific string, so it would work for you.

Extracting first two words in perl using regex

I want to extract the first two words from a sentence using a Perl function in PostgreSQL. In PostgreSQL, I can do this with:
text = "I am trying to make this work";
Select substring(text from '(^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)');
It would return "I am"
I tried to build a Perl function in Postgresql that does the same thing.
CREATE OR REPLACE FUNCTION extract_first_two (text)
RETURNS text AS
$$
my $my_text = $_[0];
my $temp;
$pattern = '^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)';
my $regex = qr/$pattern/;
if ($my_text=~ $regex) {
$temp = $1;
}
return $temp;
$$ LANGUAGE plperl;
But I receive a syntax error near the regular expression. I am not sure what I am doing wrong.
Extracting words is non-trivial even in English. Take the following contrived example using Locale::CLDR:
use Locale::CLDR;
my $locale = Locale::CLDR->new('en');
my @words = $locale->split_words('adf543. 123.25');
@words now contains
adf543
.
123.25
Note that the full stop after adf543 is split off as a separate word, but the one between 123 and 25 is kept as part of the number 123.25, even though the '.' is the same character.
It gets worse when you look at non-English languages, and much worse when you use non-Latin scripts.
You need to define precisely what you think a word is, otherwise the following French gets split incorrectly.
Je avais dit «Elle a dit «Il a dit «Ni» il ya trois secondes»»
The parentheses are mismatched in your regex pattern. It has three opening parentheses and four closing ones.
Also, you have two single quotes in the middle of a singly-quoted string, so
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)'
is parsed as two separate strings
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'
and
')?(\s+)?\w+)'
But I can't suggest how to fix it as I don't understand your intention.
Did you mean a double quote perhaps? In which case (!|,|\&|")? can be written as [!,&"]?
Update
At a rough guess I think you want this
my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
$temp = $1 if $my_text=~ /($regex)/;
but I can't be sure. If you describe what you're looking for in English then I can help you better. For instance, it's unclear why you don't have question marks, full stops, and semicolons in the list of intervening punctuation.
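For what it's worth, here is how that guess behaves on the sentence from the question. This is only a sketch of my reading of the requirement; the name $first_two is made up.

```perl
use strict;
use warnings;

# A word, optional whitespace and punctuation, then a second word.
my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;

my $my_text = 'I am trying to make this work';

my $first_two;
$first_two = $1 if $my_text =~ /($regex)/;

print defined $first_two ? "$first_two\n" : "no match\n";
```

Because the first \w++ is possessive, a string made of a single word does not match at all, which may or may not be what you want.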

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain.com>\s\d+:
which captures the text before the colon. But I want the part after the email address, up to the colon.
Is there an easy regular expression to do this, or should I use split?
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last@site.domain.com> - Email address in format FirstName.LastName@sub.domain.com
1234, 4657 - database primary keys
: xxxx - Headline
What I have to do is process the above, get the database IDs (in the example, 1234 and 4657 are two separate IDs), and query the tables.
The above is the output (I will get many entries like this) from the tool, which I am calling via my Perl script.
My idea was to use a regular expression to get the database IDs. I guess I could use a regular expression for this.
you can fudge the stuff you don't care about to make the expression easier, say just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
there's only two captured groups, the (1234), and the (1234, 4657), the second one which I can only assume from your pattern to mean: "a digit string, followed by zero or more comma separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\). Also, it is not a good idea to rely too heavily on the * quantifier, because it does match the empty string (it matches zero or more times), and should only be used for things which are truly optional. Use + otherwise, which matches (1 or more times).
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
    my $string = shift;
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        return $1 =~ /\d+/g; # return the extracted digits in a list
        # return $1; # just return the string as-is
    } else { return } # empty list (undef in scalar context) when nothing matches
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
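As a usage sketch, here is the same idea applied to one of the sample lines. I copy $1 into a lexical before matching against it again, which is my own cautious variation, and @ids is a made-up name.

```perl
use strict;
use warnings;

sub extract_nums {
    my $string = shift;
    # Grab the digits, commas and spaces between the <...> part and the colon.
    return unless $string =~ /<[^>]*> *([\d, ]+):/;
    my $nums = $1;           # copy before running another match
    return $nums =~ /\d+/g;  # list of the digit runs
}

my @ids = extract_nums(
    '(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf'
);
print "@ids\n";
```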
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar@domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar@domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*([0-9,\s]+):.+/;
$numbers = $1;
Not tested but you get the idea.