I want to use a regexp to capture the string between two specific sets of double underscores. The captured string may itself contain single underscores. Here's the test Perl script I've been working with:
#!/usr/bin/env perl
use strict;
my $str = "DFD_20220913_121409_strix1a0__z1_erx_adm__CL1695331__RTL_Dfdsg4__regression__df_umc_nbio_hubs_gfx__220913_150718";
(my $grp) = $str =~ /CL\d+\_\_(\w+)\_\_/;
print "grp = $grp\n";
exit;
This returns...
grp = RTL_Dfdsg4__regression__df_umc_nbio_hubs_gfx
I want...
grp = RTL_Dfdsg4
As you can see, I know something about where the first set of double underscores exists (after the CL\d+). But for some reason, the regexp reads past the next occurrence of the double underscores until it hits the last set.
You need to make the quantifier non-greedy by appending ? to it (+? instead of +).
(my $grp) = $str =~ /CL\d+__(\w+?)__/;
I removed the unnecessary backslashes from before the underscores.
Note that using the non-greedy modifier is fragile and can easily work differently than intended. This is the robust alternative:
my ( $grp ) = $str =~ /
CL \d+
__
( [^\W_]+ (?: _ [^\W_]+ )* ) # `[^\W_]` is `\w` minus `_`
__
/x;
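To compare the three patterns discussed above side by side, here is a minimal sketch that runs each of them against the sample string from the question. The variable names $greedy, $lazy and $strict are only for illustration.

```perl
use strict;
use warnings;

my $str = "DFD_20220913_121409_strix1a0__z1_erx_adm__CL1695331__RTL_Dfdsg4__regression__df_umc_nbio_hubs_gfx__220913_150718";

# Greedy: \w also matches underscores, so \w+ runs on to the last "__".
my $greedy = qr/CL\d+__(\w+)__/;

# Non-greedy: takes as little as possible before the next "__".
my $lazy = qr/CL\d+__(\w+?)__/;

# Explicit: runs of non-underscore word characters joined by single underscores.
my $strict = qr/CL\d+__([^\W_]+(?:_[^\W_]+)*)__/;

for my $case ( [ greedy => $greedy ], [ lazy => $lazy ], [ strict => $strict ] ) {
    my ( $name, $re ) = @$case;
    my ($grp) = $str =~ $re;
    printf "%-6s grp = %s\n", $name, defined $grp ? $grp : '(no match)';
}
```

On this input the non-greedy and explicit patterns capture the same text; they can differ on inputs where the captured part begins or ends with an underscore.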
I'm trying to use this but can't make it work. I want to check the syntax of expressions like this: (1+2)*(3+4)
I have integers, +, * and brackets. That's it, but it can be nested to any depth.
In BNF syntax the expr can be described like this:
expr    ::= <sum>
sum     ::= <product>{+<product>}
product ::= <atom>{*<atom>}
atom    ::= <number>|(<expr>)
number  ::= <digit>{<digit>}
I tried to translate this to Perl like this:
$number = '\d+';
$atom = "($number|\\((?R)\\))";
$product = "$atom(\\*$atom)*";
$sum = "$product(\\+$product)*";
$expr = $sum;
if ('(1+2)*(3+4)' =~ /^$expr$/)
{
print "OK";
}
But it doesn't match! What am I doing wrong?
When you recurse, the ^ at the start of the pattern will fail to match.
Use (?(DEFINE)...) to define the rules instead of using (?R).
'(1+2)*(3+4)' =~ /
^ (?&expr) \z
(?(DEFINE)
# Rules.
(?<expr> (?&sum) )
(?<sum> (?&product) (?: \+ (?&product) )*+ )
(?<product> (?&atom) (?: \* (?&atom) )*+ )
(?<atom> (?&NUMBER) | \( (?&expr) \) )
# Tokens.
(?<NUMBER> \d++ )
)
/x
or die("Doesn't match.\n");
which simplifies to
'(1+2)*(3+4)' =~ /
^ (?&expr) \z
(?(DEFINE)
# Rules.
(?<expr> (?&binary_op) )
(?<binary_op> (?&atom) (?: [+*] (?&atom) )*+ )
(?<atom> (?&NUMBER) | \( (?&expr) \) )
# Tokens.
(?<NUMBER> \d++ )
)
/x
or die("Doesn't match.\n");
That's assuming you're only trying to check for validity rather than trying to parse the string. If you need to parse the string, you can build a parser using Parse::RecDescent or Marpa::R2.
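As a quick check of the validity test, the sketch below wraps the first grammar in a helper and tries it on a few inputs. The helper name is_valid_expr and the sample strings are assumptions made for this example; named groups and (?(DEFINE)...) need Perl 5.10 or later.

```perl
use strict;
use warnings;

# The grammar from above, compiled once.
my $EXPR_RE = qr{
    ^ (?&expr) \z
    (?(DEFINE)
        # Rules.
        (?<expr>    (?&sum) )
        (?<sum>     (?&product) (?: \+ (?&product) )*+ )
        (?<product> (?&atom)    (?: \* (?&atom)    )*+ )
        (?<atom>    (?&NUMBER) | \( (?&expr) \) )
        # Tokens.
        (?<NUMBER>  \d++ )
    )
}x;

# Hypothetical helper: returns 1 if the whole string is a valid expression.
sub is_valid_expr {
    my ($s) = @_;
    return $s =~ $EXPR_RE ? 1 : 0;
}

for my $s ( '(1+2)*(3+4)', '((1))', '1+', 'A(1+2)*(3+4)B' ) {
    printf "%-15s %s\n", $s, is_valid_expr($s) ? 'valid' : 'invalid';
}
```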
ikegami's workaround above with the DEFINE stuff is beautiful, but it doesn't answer the question how to do it my way. A minimal change of my code to make it work? ikegami is right, the cause of no match is the ^ in /^$expr$/ . When the parser reenters the regex recursively it again checks for beginning of string, which fails. So I cannot have ^ and $ in the regex it seems. Without them my string matches. But then some invalid strings match too, like A(1+2)*(3+4)B . In the absence of ^ and $ it doesn't necessarily match the whole string. Problem.
ikegami suggested a solution to this in a comment above. I'll just write it out. I have tested it and it works:
$number = '\d+';
$atom = "($number|\\((?1)\\))";
$product = "$atom(\\*$atom)*";
$sum = "$product(\\+$product)*";
$expr = $sum;
if ('(1+2)*(3+4)' =~ /^($expr)$/)
{
print "OK";
}
Notice that I now have (?1) instead of (?R) and that I have enclosed $expr in brackets. (?1) refers to the first capture group, which is ($expr). So the recursion re-enters this subexpression instead of the whole regex, and ^ is not encountered again. That solves it.
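To make the difference concrete, here is a small sketch that builds the pattern both ways: unanchored with (?R), and anchored with (?1). The helper build_expr and the names $loose and $strict are made up for this example, and \z is used instead of $ as the end anchor.

```perl
use strict;
use warnings;

# Hypothetical helper: build the pattern source with the given recursion construct.
sub build_expr {
    my ($recurse) = @_;    # '(?R)' or '(?1)'
    my $number  = '\d+';
    my $atom    = "($number|\\($recurse\\))";
    my $product = "$atom(\\*$atom)*";
    my $sum     = "$product(\\+$product)*";
    return $sum;
}

my $loose_src  = build_expr('(?R)');
my $strict_src = build_expr('(?1)');

# Unanchored: (?R) recurses into the whole pattern, but substrings can match.
my $loose = qr/$loose_src/;

# Anchored: group 1 wraps the expression, so (?1) never re-enters ^ or \z.
my $strict = qr/^($strict_src)\z/;

for my $s ( '(1+2)*(3+4)', 'A(1+2)*(3+4)B', '(1+2' ) {
    printf "%-15s loose=%s strict=%s\n", $s,
        ( $s =~ $loose  ? 'match' : 'no' ),
        ( $s =~ $strict ? 'match' : 'no' );
}
```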
perl -i -pe 's/(,\h*"[^\n"]*)\n/$1 /g' /opt/data-integration/transfer/events/processing/Master_Events_List.csv
What is going on here? I tried a translator but it's a bit vague. What are some examples of input this would match?
First, don't try and manipulate CSV (or XML or HTML) with regexes. While CSV might seem simple, it can be subtle. Instead use Text::CSV. The exception is if your CSV is malformed and you're fixing it.
Now, for what your regex is doing. First, let's translate it from s// to s{}{}, which is a bit easier on the eyes, and use the /x flag so we can space things out a bit.
s{
# Capture to $1
(
# A comma.
,
# 0 or more `\h` (horizontal whitespace): tabs and spaces
\h*
# A quote.
"
# 0 or more of anything which is not a quote or newline.
[^\n"]*
)
# A newline (not captured)
\n
}
# Put the captured bit in with a space after it.
# The `g` says to do it multiple times over the whole string.
{$1 }gx
It will change foo, "bar\n into foo, "bar followed by a space; that is, the newline is replaced by a space. I'm guessing it's turning text fields in the CSV with newlines in them into ones with just spaces.
foo, "first
field", "second
field"
Will become
foo, "first field", "second field"
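The transformation above can be reproduced with a short sketch. The sub name join_broken_lines is made up for this example, and it applies the substitution to a whole string at once, whereas the one-liner with -p applies it line by line; on this input the result is the same.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the one-liner.
sub join_broken_lines {
    my ($text) = @_;
    # A comma, optional horizontal whitespace, an opening quote, and then
    # non-quote characters up to a newline: replace that newline with a space.
    $text =~ s/(,\h*"[^\n"]*)\n/$1 /g;
    return $text;
}

my $csv = qq{foo, "first\nfield", "second\nfield"\n};
print join_broken_lines($csv);
```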
This is something better handled with Text::CSV. I suspect the purpose of the transform is to help out CSV parsers which cannot handle newlines. Text::CSV can with a little coercing.
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use autodie;
use Text::CSV;
use IO::Scalar;
use Data::Dumper;
# Pretend our scalar is an IO object so we can use `getline`.
my $str = qq[foo, "bar", "this\nthat"\n];
my $io = IO::Scalar->new(\$str);
# Configure Text::CSV
my $csv = Text::CSV->new({
# Embedded newlines normally aren't allowed, this tells Text::CSV to
# treat the content as binary instead.
binary=> 1,
# Allow spaces between the cells.
allow_whitespace => 1
});
# Use Text::CSV->getline() to do the parsing.
while( my $row = $csv->getline($io) ) {
# Dump the contents of the row
say Dumper $row;
}
And it will correctly parse the row and its embedded newlines.
$VAR1 = [
'foo',
'bar',
'this
that'
];
Edited this to second Schwern (also upvoted): regular expressions seem to be a poor fit for manipulating CSV.
As for the regular expression in question, let's dissect it. Starting with the top level:
's/(,\h*"[^\n"]*)\n/$1 /g'
The s/part1/part2/g expression means "substitute the first part with the second part everywhere".
Now let's examine the "first part":
(,\h*"[^\n"]*)\n
The parentheses are enclosing a group. There is only one group, so it becomes group number 1. We'll come back to that in the next step.
Then, check out https://perldoc.perl.org/perlrebackslash.html for explanation of the character classes. \h is a horizontal whitespace and \n is a logical newline character.
The expression inside the group states: "a comma, then any number of horizontal whitespace characters, then a double quote, then any number of characters that are neither a newline nor a quote". After the group there must be a newline. So it is basically a comma followed by the start of a quoted CSV field that is still open at the end of the line.
Lastly, the "second part" reads:
$1
This is just a reference to the group number 1 that was captured earlier followed by a space.
Overall, the expression finds a quoted field whose closing quote has not yet appeared when the line ends, and replaces that line's newline with a space.
The best way to fix newlines in quoted fields that masquerade as End-Of-Record :
First, don't try and manipulate CSV (or XML or HTML) with modules. While CSV might seem tricky, it is extremely simple. Don't use Text::CSV. Instead, use a substitute regex with a callback.
Also, you can use the regex to just correctly parse a csv without replacing
newlines, but you probably want to use Perl to fix it for use in some other language.
Regex (with trim)
/((?:^|,|\r?\n))\s*(?:("[^"\\]*(?:\\[\S\s][^"\\]*)*"[^\S\r\n]*(?=$|,|\r?\n))|([^,\r\n]*(?=$|,|\r?\n)))/
Explained
( # (1 start), Delimiter (comma or newline)
(?: ^ | , | \r? \n )
) # (1 end)
\s* # Leading optional whitespaces ( this is for trim )
# ( if no trim is desired, remove this, add
# [^\S\r\n]* to end of group 1 )
(?:
( # (2 start), Quoted string field
" # Quoted string
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
[^\S\r\n]* # Trailing optional horizontal whitespaces
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (2 end)
| # OR
( # (3 start), Non quoted field
[^,\r\n]* # Not comma or newline
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (3 end)
)
(Note - this requires a script.)
Perl sample
use strict;
use warnings;
$/ = undef;
sub RmvNLs {
my ($delim, $quote, $non_quote) = @_;
if ( defined $non_quote ) {
return $delim . $non_quote;
}
$quote =~ s/\s*\r?\n/ /g;
return $delim . $quote;
}
my $csv = <DATA>;
$csv =~ s/
( # (1 start), Delimiter (comma or newline)
(?: ^ | , | \r? \n )
) # (1 end)
\s* # Leading optional whitespaces ( this is for trim )
# ( if no trim is desired, remove this, add [^\S\r\n]* to end of group 1 )
(?:
( # (2 start), Quoted string field
" # Quoted string
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
[^\S\r\n]* # Trailing optional horizontal whitespaces
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (2 end)
| # OR
( # (3 start), Non quoted field
[^,\r\n]* # Not comma or newline
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (3 end)
)
/RmvNLs($1,$2,$3)/xeg;
print $csv;
__DATA__
497,50,2008-08-02T16:56:53Z,469,4,
"foo bar
foo
bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar
bar"
hello
world
"asdfas"
ID,NAME,TITLE,DESCRIPTION,,
PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING
HARD ON BEING A GOOD
SERVENT."
PRO1235, "KEITH SMITH",ENGINEER,"keith has been working
hard on being a good
servent."
PRO1235,"KENNY SMITH",,"keith has been working
hard on being a good
servent."
PRO1235,"RICK SMITH",,, #
Output
497,50,2008-08-02T16:56:53Z,469,4,"foo bar foo bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar bar"
hello
world
"asdfas"
ID,NAME,TITLE,DESCRIPTION,,PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING HARD ON BEING A GOOD SERVENT."
PRO1235,"KEITH SMITH",ENGINEER,"keith has been working hard on being a good servent."
PRO1235,"KENNY SMITH",,"keith has been working hard on being a good servent."
PRO1235,"RICK SMITH",,,
I use recursive Perl regular expressions to scan for nested single-character brackets:
$RE = qr'(?:[\(]((?:(?>[^\(\)]+)|(??{$RE}))*)[\)])';
This lets me scan C function calls, something like:
"func (a(b()))" =~ /$RE/
matching "(a(b()))" . Now I'd like to parse Pascal style nested [if,if-end] brackets, i.e.:
if (a) then
if (b) then
blaif := 1;
else
blaend := 2;
end if;
end if;
I tried to rewrite $RE from above to:
$RE_if = qr'(?:(?:if)((?:(?>(?!(?:\bif\b|\bend\s+if))+)|(??{$RE_if}))*)\
(?:\bend\s+if))';
But it doesn't quite work. Does somebody have a regex that handles multi-character brackets
like ["if","end if"]?
-- Greetings Konrad
Let's look at the original pattern: (Extraneous escapes removed. Needless surrounding (?:) removed.)
[(] # Prefix.
(
(?: (?> [^()] +) # Some characters containing neither prefix nor suffix.
| (??{ $RE }) # Recursion
)*
)
[)] # Suffix.
(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR, so:
\bif\b
(
(?: (?> (?:(?! \b(?:end\s+)?if\b ).)+ )
| (??{ $RE })
)*
)
\bend\s+if\b
By the way, (?>PAT+) can be written PAT++.
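As a rough, runnable sketch of the multi-character version: the code below uses the name $RE_if from the question for the self-reference, writes the atomic group in its possessive form, and adds the /s flag so that . also matches newlines (an assumption needed for multi-line input). The helper match_if_block and the sample text are made up for this example.

```perl
use strict;
use warnings;

my $RE_if;    # declared first so the pattern can refer to itself
$RE_if = qr{
    \b if \b                                      # opening bracket
    (?:
        (?: (?! \b (?: end \s+ )? if \b ) . )++   # text containing neither bracket
      |
        (??{ $RE_if })                            # a nested block
    )*
    \b end \s+ if \b                              # closing bracket
}xs;

# Hypothetical helper: return the first balanced block, or undef if none.
sub match_if_block {
    my ($text) = @_;
    return $text =~ $RE_if ? $& : undef;
}

my $code = <<'END';
if (a) then
  if (b) then
    blaif := 1;
  else
    blaend := 2;
  end if;
end if;
END

my $block = match_if_block($code);
print defined $block ? "$block\n" : "no balanced block found\n";
```

The word boundaries keep identifiers such as blaif and blaend from being mistaken for brackets.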
Consider the following strings:
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
and so on in the above format. The ID, the text in parentheses, and the trailing * are the parts that change across the strings.
Imagine I have many strings in the format shown above.
I want to pick 3 substrings from each of the above strings:
1st substring containing the alphanumeric value (in the example above, "abc-456-hu5t10")
2nd substring containing the word (in the example above, "High priority")
3rd substring containing * (if * is present at the end of the string; otherwise leave it)
How do I pick these 3 substrings from each string shown above? I know it can be done using regular expressions in Perl... Can you help with this?
You could do something like this:
my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END
foreach (split(/\n/,$data)) {
$_ =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ || next;
my ($id,$word,$star) = ($1,$2,$3);
print "$id $word $star\n";
}
The key thing is the Regular expression:
Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?
Which breaks up as follows.
The fixed String "Scheme ID: ":
Scheme ID:
Followed by one or more of the characters a-z, 0-9 or -. We use the brackets to capture it as $1:
([a-z0-9-]+)
Followed by one or more whitespace characters:
\s+
Followed by an opening bracket (which we escape) followed by any number of characters which aren't a close bracket, and then a closing bracket (escaped). We use unescaped brackets to capture the words as $2:
\(([^)]+)\)
Followed by some spaces and maybe a *, captured as $3:
\s*(\*)?
You could use a regular expression such as the following:
/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/
So for example:
$s = "abc-456-hu5t10 (High priority) *";
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/;
print "$1\n$2\n$3\n";
prints
abc-456-hu5t10
High priority
*
(\S*)\s*\((.*?)\)\s*(\*?)
(\S*) picks up anything which is NOT whitespace
\s* 0 or more whitespace characters
\( a literal open parenthesis
(.*?) anything, non-greedy so stops on first occurrence of...
\) a literal close parenthesis
\s* 0 or more whitespace characters
(\*?) 0 or 1 occurrences of literal *
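A minimal sketch applying the pattern broken down above to the three sample lines from the question. The sub name parse_scheme is made up for this example.

```perl
use strict;
use warnings;

my @lines = (
    '1) Scheme ID: abc-456-hu5t10 (High priority) *',
    '2) Scheme ID: frt-78f-hj542w (Balanced)',
    '3) Scheme ID: 23f-f974-nm54w (super formula run) *',
);

# Hypothetical helper: returns (id, label, star), or an empty list on no match.
sub parse_scheme {
    my ($line) = @_;
    my ( $id, $label, $star ) = $line =~ /(\S*)\s*\((.*?)\)\s*(\*?)/
        or return;
    return ( $id, $label, $star );
}

for my $line (@lines) {
    my @fields = parse_scheme($line) or next;
    print join( '|', @fields ), "\n";
}
```

Because the third group is (\*?) rather than (\*)?, it captures an empty string instead of undef when there is no trailing *.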
Well, a one liner here:
perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt
Expanded to a simple script to explain things a bit better:
#!/usr/bin/perl -ln
#-w : warnings
#-l : print newline after every print
#-n : apply script body to stdin or files listed at commandline, dont print $_
use strict; #always do this.
my $regex = qr{ # precompile regex
Scheme\ ID: # to match beginning of line.
\s+ # 1 or more whitespace
(.*?) # Non greedy match of all characters up to
\s+ # 1 or more whitespace
\( # parenthesis literal
(.*?) # non-greedy match to the next
\) # closing literal parenthesis
\s* # 0 or more whitespace (trailing * is optional)
(\*)? # 0 or 1 literal *s
}x; #x switch allows whitespace in regex to allow documentation.
#values trapped in $1 $2 $3, so do whatever you need to:
#Perl lets you use any characters as delimiters, i like pipes because
#they reduce the amount of escaping when using file paths
m|$regex| && print "$1 : $2 : $3";
#alternatively if(m|$regex|) {doOne($1); doTwo($2) ... }
Though if it were anything other than formatting, I would implement a main loop to handle files and flesh out the body of the script rather than relying on the command-line switches for the looping.
Long time no Perl
while(<STDIN>) {
next unless /:\s*(\S+)\s+\(([^\)]+)\)\s*(\*?)/;
print "|$1|$2|$3|\n";
}
This just requires a small change to my last answer:
my ($guid, $scheme, $star) = $line =~ m{
The [ ] Scheme [ ] GUID: [ ]
([a-zA-Z0-9-]+) #capture the guid
[ ]
\( (.+) \) #capture the scheme
(?:
[ ]
([*]) #capture the star
)? #if it exists
}x;
String 1:
$input =~ /^\S+/;
$s1 = $&;
String 2:
$input =~ /\(.*\)/;
$s2 = $&;
String 3:
$input =~ /\*?$/;
$s3 = $&;