Regular Expression in Perl - regex

I need to extract the 4th field value (128) from the following line using a regular expression.
( '29/11/2010 09:38:05', '41297', '29/11/2010 09:40:30', '128', '17', 'SUCCESS', '30', 'e', '9843171457', '1', '-1')
Please tell me the way to take the 4th value.
Thanks in advance.

Use Text::CSV from CPAN:
use Text::CSV;
my $input = "( '29/11/2010 09:38:05', '41297', '29/11/2010 09:40:30', '128', '17', 'SUCCESS', '30', 'e', '9843171457', '1', '-1')";
# Strip the surrounding parentheses so only the quoted, comma-separated fields remain
$input =~ s/^\s*\(\s*|\s*\)\s*$//g;
my $csv = Text::CSV->new({
quote_char => "'",
always_quote => 1,
allow_whitespace => 1,
});
$csv->parse($input) or die "Cannot parse line";
my @columns = $csv->fields();
print $columns[3], "\n"; # 128

The brute force way:
/'[^']*',\s*'[^']*',\s*'[^']*',\s*'([^']*)'/
This is a quote, followed by any number of non-quotes, then another quote, a comma, and some optional whitespace. All that is repeated four times with () around the fourth value to capture it. This may not work if the values are allowed to have quotes in them.
As Cameron pointed out, you can avoid the repetition using:
/(?:'[^']*',\s*){3}'([^']*)'/
The ?: tells the regexp parser not to capture what is inside those parentheses, so the only capture group left is the one around the fourth value.
It might be easier to split the line on commas and take the fourth element. Of course, if the values can contain commas, that may not work.
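As a cross-check of the two patterns above, here is a minimal sketch in Python rather than Perl (the regex syntax used is the same in both). The names FOURTH and fourth_field are made up for this illustration, and it assumes, as noted above, that values never contain a single quote.

```python
import re

line = ("( '29/11/2010 09:38:05', '41297', '29/11/2010 09:40:30', "
        "'128', '17', 'SUCCESS', '30', 'e', '9843171457', '1', '-1')")

# Three quoted fields, each followed by a comma, are skipped without capturing;
# the fourth quoted field is captured.
FOURTH = re.compile(r"(?:'[^']*',\s*){3}'([^']*)'")

def fourth_field(text):
    """Return the 4th single-quoted field, or None if there are fewer than four."""
    match = FOURTH.search(text)
    return match.group(1) if match else None

def all_fields(text):
    """Alternative: collect every single-quoted value and index into the list."""
    return re.findall(r"'([^']*)'", text)

print(fourth_field(line))
print(all_fields(line)[3])
```

Collecting all fields is often easier to maintain than counting repetitions in the pattern, at the cost of scanning the whole line.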

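This can be done with Perl's split function:
my $str = "('29/11/2010 09:38:05', '41297', '29/11/2010 09:40:30', '128', '17', 'SUCCESS', '30', 'e', '9843171457', '1', '-1')";
# Split on the quote-comma-quote sequence between fields, allowing optional whitespace
my @vars = split /',\s*'/, $str;
print "$vars[3]\n"; # 128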

Related

Perl regular expression to split string by word

I have a string that consists of several words, each starting with a capital letter.
For example:
$string1="TestWater"; # to be split into an array @string1=("Test","Water")
$string2="TodayIsNiceDay"; # as @string2=("Today","Is","Nice","Day")
$string3="EODIsAlwaysGood"; # as @string3=("EOD","Is","Always","Good")
I know that Perl can easily split on a fixed character with the split function, and that a regex match can capture $1, $2 for a fixed number of groups. But how can this be done dynamically? Thanks in advance!
The post "Splitting CamelCase" doesn't answer my question: mine is specifically about regex in Perl, while that one was about Java (and the differences matter here).
Use split to split a string on a regex. What you want is an upper case character not followed by an upper case character as the boundary, which can be expressed by two look-ahead assertions (perlre for details):
#!/usr/bin/perl
use warnings;
use strict;
use Test::More;
sub split_on_capital {
my ($string) = @_;
return [ split /(?=[[:upper:]](?![[:upper:]]))/, $string ]
}
is_deeply split_on_capital('TestWater'), [ 'Test', 'Water' ];
is_deeply split_on_capital('TodayIsNiceDay'), [ 'Today', 'Is', 'Nice', 'Day' ];
is_deeply split_on_capital('EODIsAlwaysGood'), [ 'EOD', 'Is', 'Always', 'Good' ];
done_testing();
You can do this by using m//g in list context, which returns a list of all matches found. (Rule of thumb: Use m//g if you know what you want to extract; use split if you know what you want to throw away.)
Your case is a bit more complicated because you want to split "EODIs" into ("EOD", "Is").
The following code handles this case:
my @words = $string =~ /\p{Lu}(?:\p{Lu}+(?!\p{Ll})|\p{Ll}*)/g;
I.e. every word starts with an uppercase letter (\p{Lu}) and is followed by either
1 or more uppercase letters (but the last one is not followed by a lowercase letter), or
0 or more lowercase letters (\p{Ll})
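The match-all approach described above can be sketched in Python as well. Python's re module does not support \p{Lu} or [[:upper:]], so this sketch substitutes the ASCII classes [A-Z] and [a-z]; that substitution is an assumption of the example, as are the names WORD and split_camel_case.

```python
import re

# An uppercase letter followed by either
#   - more uppercase letters, the last of which is not followed by a lowercase letter, or
#   - zero or more lowercase letters.
WORD = re.compile(r'[A-Z](?:[A-Z]+(?![a-z])|[a-z]*)')

def split_camel_case(text):
    """Return the list of words found in an ASCII CamelCase string."""
    return WORD.findall(text)

for sample in ('TestWater', 'TodayIsNiceDay', 'EODIsAlwaysGood'):
    print(sample, '->', split_camel_case(sample))
```

The negative lookahead is what makes "EODIs" come out as "EOD" and "Is": the run of capitals backs off by one letter when the next character is lowercase.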

Perl split function - use repeating characters as delimiter

I want to split a string using repeating letters as delimiter, for example,
"123aaaa23a3" should be split as ('123', '23a3') while "123abc4" should be left unchanged.
So I tried this:
@s = split /([[:alpha:]])\1+/, '123aaaa23a3';
But this returns '123', 'a', '23a3', which is not what I wanted. I know this is because the 'a' matched inside the parentheses is captured and therefore preserved by split(). But I can't add something like ?: since [[:alpha:]] must be captured for the backreference.
How can I resolve this situation?
Hmm, it's an interesting one. My first thought: the captured delimiters will always land at odd indices of the result, so you can just discard every odd-indexed array element.
Something like this perhaps?:
my %s = (split (/([[:alpha:]])\1+/, '123aaaa23a3'), '' );
print Dumper \%s;
This'll give you:
$VAR1 = {
'23a3' => '',
'123' => 'a'
};
So you can extract your pattern via keys.
Unfortunately my second approach of 'selecting out' the pattern matches via %+ doesn't help particularly (split doesn't populate the capture variables).
But something like this:
my @delims = '123aaaa23a3' =~ m/(?<delim>[[:alpha:]])\g{delim}+/g;
print Dumper \%+;
By using a named capture, we identify that a is from the capture group. Unfortunately, %+ doesn't seem to be populated when you do this via split, which might lead to a two-pass approach.
This is the closest I got:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $str = '123aaaa23a3';
#build a regex out of '2-or-more' characters.
my $regex = join ( "|", map { $_."{2,}"} $str =~ m/([[:alpha:]])\1+/g);
#make the regex non-capturing
$regex = qr/(?:$regex)/;
print "Using: $regex\n";
#split on the regex
my @s = split m/$regex/, $str;
print Dumper \@s;
We first process the string to extract the "2-or-more" character patterns to use as our delimiters. Then we assemble a non-capturing regex out of them, so we can split on it.
One solution would be to use your original split call and throw away every other value. Conveniently, List::Util::pairkeys is a function that keeps the first of every pair of values in its input list:
use List::Util 1.29 qw( pairkeys );
my @vals = pairkeys split /([[:alpha:]])\1+/, '123aaaa23a3';
Gives
Odd number of elements in pairkeys at (eval 6) line 1.
[ '123', '23a3' ]
That warning comes from the fact that pairkeys wants an even-sized list. We can solve that by adding one more value at the end:
my @vals = pairkeys split( /([[:alpha:]])\1+/, '123aaaa23a3' ), undef;
Alternatively, and maybe a little neater, is to add that extra value at the start of the list and use pairvalues instead:
use List::Util 1.29 qw( pairvalues );
my @vals = pairvalues undef, split /([[:alpha:]])\1+/, '123aaaa23a3';
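The "throw away every other value" idea is not specific to Perl. As a rough illustration, here is the same approach in Python, whose re.split also returns captured groups between the fields. The function name split_on_repeats is made up for this sketch, and [A-Za-z] stands in for [[:alpha:]], which Python's re module does not support.

```python
import re

def split_on_repeats(text):
    """Split text wherever the same ASCII letter occurs two or more times in a row."""
    # With one capture group, re.split returns
    # [field, captured_letter, field, captured_letter, ..., field].
    parts = re.split(r'([A-Za-z])\1+', text)
    # The fields sit at the even indices; the captured letters at the odd ones.
    return parts[::2]

print(split_on_repeats('123aaaa23a3'))
print(split_on_repeats('123abc4'))
```

Slicing with a step of 2 plays the role of pairkeys here and does not need an even-length list.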
The 'split' can be made to work directly by using the delayed execution assertion (aka postponed regular subexpression), (??{ code }), in the regular expression:
@s = split /[[:alpha:]](??{"$&+"})/, '123aaaa23a3';
(??{ code }) is documented on the 'perlre' manual page.
Note that, according to the 'perlvar' manual page, the use of $& anywhere in a program imposes a considerable performance penalty on all regular expression matches (perlvar also notes that the copy-on-write mechanism enabled in Perl 5.20 largely removes this penalty). I've never found this to be a problem, but YMMV.

Why does this regular expression not capture arithmetic operators?

I'm trying to capture tokens from a pseudo-programming-language script, but the +-*/, etc are not captured.
I tried this:
[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!
For example, I have this code:
for i = 1 to 10
test_123 = 3.55 + i- -10 * .5
next
msg "this is a ""string"" with quotes in it..."
In this piece of code, the regular expression has to highlight:
valid variablenames,
strings enclosed with quotes,
operators like (),+-*/!
numbers like 0.1 123 .5 10.
the result of the regular expression has to be:
'for',
'i',
'=',
'1',
'to',
'10',
'test_123',
'=',
'3.55',
'+'
etc....
The problem is that the operators are not selected when I use this regular expression...
We don't know your exact requirements, but your regex appears to capture only a few kinds of token (strings without \r or \n, and so on).
try something like this, grouping the tokens you want to capture:
'([a-z_]+)|([\.\d]+)|([\+\-\*\/])|(\=)|([\(\)\[\]\{\}])|(['":,;])'
EDIT: With the new information you wrote in your question, I adjusted the regex to this new one, and tried it with python. I don't know vbscript.
import re
test_string = r'''for i = 1 to 10:
test_123 = 3.55 + i- -10 * .5
next
msg "this is a 'string' with quotes in it..."'''
pattern = r'''([\da-z_^\.]+|[\.\d]+|[\+\-\*\/]|\=|[\(\)\[\]\{\}]|[:,;]|".*[^"]"|'.*[^']')'''
print(re.findall(pattern, test_string, re.MULTILINE))
And this is the list with the matches:
['for', 'i', '=', '1', 'to', '10', ':', 'test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5', 'next', 'msg', '"this is a \'string\' with quotes in it..."']
I think it captures all you need.
This fits my needs, I guess:
"([^"]+|"")*"|[\-+*/&|!()=,]|[a-z_]\w*|(\d*\.)?\d*
But only whitespace must be left over, so I have to find a way to capture everything else that is not whitespace too, if it is not any of the other options in my regular expression.
Characters like "$%µ°" are ignored even when I put "|." after my regular expression :(
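One likely reason operators and stray characters get skipped: both \d*\.?\d* and (\d*\.)?\d* can match the empty string. At a position where no earlier alternative matches, the number branch succeeds with an empty match, so any alternative listed after it (such as \+ or a trailing .) is never tried. The sketch below, in Python, uses a number pattern that must consume at least one digit and a final \S catch-all. The names TOKEN and tokenize, and the exact token classes, are assumptions made for illustration.

```python
import re

TOKEN = re.compile(
    r'''
      "(?:[^"\r\n]|"")*"      # string literal; a doubled quote is an escaped quote
    | [a-z_]\w*               # identifier or keyword
    | \d+(?:\.\d*)?|\.\d+     # numbers like 123, 10., 3.55 and .5; cannot match empty
    | \S                      # any other single non-whitespace character
    ''',
    re.VERBOSE | re.IGNORECASE,
)

def tokenize(source):
    """Return every token in source; only whitespace is left over."""
    return TOKEN.findall(source)

print(tokenize('test_123 = 3.55 + i- -10 * .5'))
print(tokenize('msg "this is a ""string"" with quotes in it..."'))
```

Because no alternative can match the empty string, every non-whitespace character ends up in some token, including ones such as $ or % that have no dedicated branch.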

Avoid repeating regex substitution

I have lines of code (making up a Ruby hash) with the form of:
"some text with spaces" => "some other text",
I wrote the following vim style regex pattern to achieve my goal, which is to replace any spaces in the string to the left of the => with +:
:%s/\(.*\".*\)\ (.*\"\ =>.*\,)/\1+\2
Expected output:
"some+text+with+spaces" => "some other text",
Unfortunately, this only replaces the space nearest to the =>. Is there another pattern that will replace all the spaces in one run?
Rather than writing one large, complex regex, a couple of smaller ones would be easier:
:%s/".\{-}"/\=substitute(submatch(0), ' ', '+', 'g')
For instance, this matches everything in quotes (escaped quotes break it) and then replaces all spaces inside the matched string with pluses. Without the g flag on the outer substitution, only the first quoted string on each line (the one to the left of =>) is affected.
If you want it to work with escaped quotes in the string, you just need to replace ".\{-}" with a slightly more complex regex, "\(\\.\|[^\"]\)*":
:%s/"\(\\.\|[^\"]\)*"/\=substitute(submatch(0), ' ', '+', 'g')
If you want to restrict the lines that this substitute runs on use a global command.
:g/=>/s/"\(\\.\|[^\"]\)*"/\=substitute(submatch(0), ' ', '+', 'g')
So this will only run on lines with =>.
Relevant help topic :h sub-replace-expression
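The same two-step idea (match the quoted string first, then replace the spaces inside that match) can be sketched outside Vim. The Python below is only an illustration of the technique: the function name plus_for_spaces_in_key is made up, and it assumes the key is the first double-quoted string on the line.

```python
import re

# A double-quoted string that may contain backslash-escaped characters.
QUOTED = re.compile(r'"(?:\\.|[^"\\])*"')

def plus_for_spaces_in_key(line):
    """Replace spaces with '+' inside the first quoted string of lines containing =>."""
    if '=>' not in line:
        return line
    # count=1 limits the substitution to the first quoted string (the key);
    # the replacement function rewrites the spaces inside that one match.
    return QUOTED.sub(lambda m: m.group(0).replace(' ', '+'), line, count=1)

print(plus_for_spaces_in_key('"some text with spaces" => "some other text",'))
```

Passing a function as the replacement plays the role of \= with submatch(0) in the Vim command.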
It's really far from perfect, but it does nearly the job:
:%s/\s\ze[^"]*"\s*=>\s*".*"/+/g
But it doesn't handle escaped quotes, so the following line won't be replaced correctly:
"some \"big text\" with many spaces" => "some other text",

Regex: How to "step back"

I am having some trouble cooking up a regex that produces this result:
Mike1, misha1,2, miguel1,2,3,4,5,6,7,18, and Michea2,3
How does one step back in a regex and discard the last match? That is, I need a comma before a space to not match. This is what I came up with...
\d+(,|\r)
Mike1, misha1,2, miguel1,2,3,4,5,6,7,18, and Micheal2,3
The regex feature you're asking about is called a positive lookbehind. But in your case, I don't think you need it. Try this:
\d+(?:,\d+)*
In your example, this will match the comma delimited lists of numbers and exclude the names and trailing commas and whitespace.
Here is a short bit of test code written in PHP that verifies it on your input:
<?php
$input = "Mike1, misha1,2, miguel1,2,3,4,5,6,7,18, and Micheal2,3";
$matches = array();
preg_match_all('/\d+(?:,\d+)*/', $input, $matches);
print_r($matches[0]);
?>
outputs:
Array
(
[0] => 1
[1] => 1,2
[2] => 1,2,3,4,5,6,7,18
[3] => 2,3
)
I believe \d+,(?!\s) will do what you want. The ?! is a negative lookahead, which only matches if what follows the ?! does not appear at this position in the search string.
>>> re.findall(r'\d+,(?!\s)', 'Mike1, misha1,2, miguel1,2,3,4,5,6,7,18, and Michea2,3')
['1,', '1,', '2,', '3,', '4,', '5,', '6,', '7,', '2,']
Or if you want to match the comma-separated list of numbers excluding the final comma use \d+(?:,\d+)*.
>>> re.findall(r'\d+(?:,\d+)*', 'Mike1, misha1,2, miguel1,2,3,4,5,6,7,18, and Michea2,3')
['1', '1,2', '1,2,3,4,5,6,7,18', '2,3']