Extract a part of a string using regex

I have a text file with lines in the pattern below.
"s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}"
I am trying to extract the values of cap1, cap2 and CAP3.
I tried the regex "([a-z]|[|])cap1(\{(.*?)\})([a-z]|[|]|[0-9])" but had no luck. Any help is appreciated.

As I understand it, you want to extract the values of cap1, cap2 and CAP3 one by one, so you need three regexes.
For cap1
cap1\{([^\}]*)\}
Explanation
cap1\{ match text cap1{,
([^\}]*) capture any characters except } to group $1,
\} match text }.
For cap2
cap2\[([^\]]*)\]
Explanation
cap2\[ match text cap2[,
([^\]]*) capture any characters except ] to group $1,
\] match text ].
For CAP3
CAP3\{\[([^\]]*)\]\|\[([^\]]*)\]\}
Explanation
CAP3\{ match text CAP3{,
\[([^\]]*)\]\|\[([^\]]*)\] capture any characters except ] to groups $1, $2 respectively,
\} match text }.
Additional: Thanks to a comment from @Borodin: you don't need lookarounds for this task, but they may be necessary if you want to do a search and replace.
For cap1: (?<=cap1\{)([^\}]*)(?=\})
For cap2: (?<=cap2\[)([^\]]*)(?=\])
For CAP3: (?<=CAP3\{)\[([^\]]*)\]\|\[([^\]]*)\](?=\})
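As a minimal sketch, the three capturing patterns above could be applied to the sample line from the question like this. The variable names are illustrative only, and the sketch assumes each label occurs once per line.

```perl
use strict;
use warnings;

# Sample line from the question.
my $s = 's|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}';

# In list context a successful match returns its capture groups.
my ($cap1)          = $s =~ /cap1\{([^\}]*)\}/;
my ($cap2)          = $s =~ /cap2\[([^\]]*)\]/;
my ($cap3a, $cap3b) = $s =~ /CAP3\{\[([^\]]*)\]\|\[([^\]]*)\]\}/;

print "cap1: $cap1\n";              # cap1: capture|these|values
print "cap2: $cap2\n";              # cap2: capture|these|values
print "CAP3: $cap3a and $cap3b\n";  # CAP3: capture|these|values and capture|these|values
```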

Using a pattern such as this should work:
[{\[]+([^}{\]\[]+)[\]}]+
Code:
$searchText =~ m/[{\[]+([^}{\]\[]+)[\]}]+/
Example:
https://regex101.com/r/qI3fI6/1
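A minimal sketch of how this pattern might be used with the /g flag to collect every bracketed run from the sample line. Note that with this pattern the two bracketed parts of CAP3 come back as two separate captures, and the labels (cap1, cap2, CAP3) are not captured at all.

```perl
use strict;
use warnings;

# Sample line from the question.
my $searchText = 's|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}';

# With /g in list context, every occurrence of group 1 is returned.
my @values = $searchText =~ /[{\[]+([^}{\]\[]+)[\]}]+/g;

print scalar(@values), "\n";   # 4
print "$_\n" for @values;      # capture|these|values, four times
```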

Update
I apologise -- I initially mistook your question for something more trivial
Essentially you want to perform a split on pipe | characters, excluding those found inside pairs of brackets or braces [ ... ] or { ... }. As long as you don't need to take account of nesting inside brackets of the same type (i.e. braces will only ever contain brackets, and brackets will only ever contain braces) then it is simply done like this
my @matches = $s =~ m{ \w+ ( \{ [^{}]* \} | \[ [^\[\]]* \] ) }gx;
print "$_\n" for @matches;
output
{capture|these|values}
[capture|these|values]
{[capture|these|values]|[capture|these|values]}
The data you show has no instances of braces containing braces, or brackets containing brackets, but I suspect that there is no theoretical limit to the nesting of your data, in which case some recursion is necessary
The regex pattern in the program below defines the text that can appear inside a pair of matching brackets as a pipe-delimited sequence of
another pair of matching brackets and their content [ ... ]
another pair of matching braces and their content { ... }
a sequence of word characters like capture and values
A pattern matching that is inside the second pair of capturing parentheses. It is a recursive pattern that calls itself using relative numbering (?-1). That could also be absolute numbering (?2), but it would have to be changed if the number of preceding captures changed
The complete pattern looks for and captures a series of word characters immediately before the recursive pattern to account for cap1, cap2 etc. This allows the result of a global search to be assigned directly to a hash, with the result shown below
use strict;
use warnings;
my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
my %captures = $s =~ m{
( (?> \w+ ) )
(
\{ (?-1) (?> \| (?-1) )* \} |
\[ (?-1) (?> \| (?-1) )* \] |
\w+
)
}gx;
use Data::Dump;
dd \%captures;
output
{
cap1 => "{capture|these|values}",
cap2 => "[capture|these|values]",
CAP3 => "{[capture|these|values]|[capture|these|values]}",
}
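To illustrate why the recursion matters, here is a sketch that runs the same pattern over a made-up line with deeper nesting (braces inside brackets inside braces). The input string and the labels deep and flat are hypothetical, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical input: "deep" nests three levels, "flat" nests one.
my $s = 'a|b|deep{[x|{y|z}]|w}|c|flat[p|q]';

# Same pattern as above: group 1 is the label, group 2 the balanced payload.
my %captures = $s =~ m{
    ( (?> \w+ ) )
    (
        \{ (?-1) (?> \| (?-1) )* \} |
        \[ (?-1) (?> \| (?-1) )* \] |
        \w+
    )
}gx;

print "$_ => $captures{$_}\n" for sort keys %captures;
# deep => {[x|{y|z}]|w}
# flat => [p|q]
```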
Original answer
It looks like you want all identifiers that are preceded by a pipe | character and followed by either a square or curly opening bracket [ or {
This program will do that for you
use strict;
use warnings;
use v5.10;
my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
for ( $s ) {
my @captures = /\|(\w+)[\[\{]/g;
say for @captures;
}
output
cap1
cap2
CAP3

Related

Explode string with comma when comma is not inside any brackets

I have string "xyz(text1,(text2,text3)),asd" I want to explode it with , but only condition is that explode should happen only on , which are not inside any brackets (here it is ()).
I saw many such solutions on stackoverflow but it didn't work with my pattern. (example1) (example2)
What is correct regex for my pattern?
In my case xyz(text1,(text2,text3)),asd
result should be
xyz(text1,(text2,text3)) and asd.
You may use a matching approach using a regex with a subroutine:
preg_match_all('~\w+(\((?:[^()]++|(?1))*\))?~', $s, $m)
See the regex demo
Details
\w+ - 1+ word chars
(\((?:[^()]++|(?1))*\))? - an optional capturing group matching
\( - a (
(?:[^()]++|(?1))* - zero or more occurrences of
[^()]++ - 1+ chars other than ( and )
| - or
(?1) - the whole Group 1 pattern
\) - a ).
PHP demo:
$rx = '/\w+(\((?:[^()]++|(?1))*\))?/';
$s = 'xyz(text1,(text2,text3)),asd';
if (preg_match_all($rx, $s, $m)) {
print_r($m[0]);
}
Output:
Array
(
[0] => xyz(text1,(text2,text3))
[1] => asd
)
If the requirement is to split at , but only outside nested parentheses, another idea would be to use preg_split and skip the parenthesized stuff, also by use of a recursive pattern.
$res = preg_split('/(\((?>[^)(]*(?1)?)*\))(*SKIP)(*F)|,/', $str);
See this pattern demo at regex101 or a PHP demo at eval.in
The left side of the pipe character is used to match and skip what is inside the parentheses.
On the right side it will match the remaining commas that are left outside of the parentheses.
The pattern used is a variant of several common patterns for matching nested parentheses.
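For comparison with the Perl threads above, the matching approach can be sketched in Perl (5.10 or later) as well. The helper name split_outside_parens is hypothetical, and the sketch assumes every item starts with word characters, as in the example string.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the top-level items, keeping any
# parenthesised (possibly nested) suffix attached to its word.
sub split_outside_parens {
    my ($str) = @_;
    my @items;
    # Group 1 is the whole item; group 2 is a balanced (...) run,
    # and (?2) recurses into group 2 for nested parentheses.
    while ($str =~ /(\w+(\((?:[^()]++|(?2))*\))?)/g) {
        push @items, $1;
    }
    return @items;
}

print "$_\n" for split_outside_parens('xyz(text1,(text2,text3)),asd');
# xyz(text1,(text2,text3))
# asd
```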

Regular Expression Inception (a match within a match)

I am trying to create a regular expression that captures a named group and then looks within that named group to check if it contains certain qualities.
For example, I have a regular expression that matches a code block, and I can use it to match and capture code blocks:
test.pl:
use strict;
use warnings;
my $text = <<'END_TEXT';
block {
// random stuff
}
block {
dog
}
END_TEXT
my $code_block_rx = qr{(?(DEFINE)
(?<code_block>
block\h\{ (?: [^{}]++ | (?&code_block) )*+ \}
)
)}xms;
while ($text =~ m/(?<match>(?&code_block))$code_block_rx/g) {
print $+{match}."\n";
}
This code will print both code blocks. But what if I only want to capture code blocks that contain the word "dog"?
Is there a way (in a single regular expression) to capture a code block, and then if that is found, look within the code block for the word "dog"?
I've tried modifying the regex to use a look ahead assertion, but it just causes the whole thing to fail: /(?<match>(?=dog)(?&code_block))$code_block_rx/g
What am I missing?
You tried to match dog at the position where the match starts.
Instead, you can check if it's in the block you matched.
while ($text =~ /(
\b block \h*+ ( (?&code_block) )
(?(DEFINE)
(?<code_block> \{ (?&code_block_body) \} )
(?<code_block_body> (?: [^{}]++ | (?&code_block) )*+ )
)
)/xg) {
my $block_stmt = $1;
my $block_stmt_block = $2;
if ($block_stmt_block =~ /\b dog \b/x) {
say $block_stmt;
}
}
It can be done in a single pattern by using (?(?{!( assertion() )})(*FAIL)) to match against something you already captured.
while ($text =~ m{(
\b block \h*+
# A code_block that contains the word 'dog'.
( (?&code_block) ) (?(?{!( "$^N" =~ /\b dog \b/x )})(*FAIL))
(?(DEFINE)
(?<code_block> \{ (?&code_block_body) \} )
(?<code_block_body> (?: [^{}]++ | (?&code_block) )*+ )
)
)}xg) {
say $1;
}

Perl regex - read java file and match entire text of a function in file

I am trying to read a .java file into a perl variable, and I want to match a function, say for instance:
public String example(){
return "hello";
}
What would the regex pattern for this look like?
Current Attempt:
use strict;
use warnings;
open ( FILE, "example.java" ) || die "can't open file!";
my @lines = <FILE>;
close (FILE);
my $line;
foreach $line (@lines) {
if($line =~ /String example(.*)}/s){
print $line;
}
}
Adapted from this answer
Regex:
^\s*([\w\s]+\(.*\)\s*(\{((?>"(?:[^"\\]*+|\\.)*"|'(?:[^'\\]*+|\\.)*'|//.*$|/\*[\s\S]*?\*/(\w+)["']?[^;]+\4;$|[^{}<'"/]++|[^{}]++|(?2))*)}))
Breakdown:
^ \s*
( # (1 start)
[\w\s]+ \( .* \) \s* # How it matches a function definition
( # (2 start)
\{ # Opening curly bracket
( # (3 start)
(?> # Atomic grouping (for its non-capturing purpose only)
"(?: [^"\\]*+ | \\ . )*" # Double quoted strings
| '(?: [^'\\]*+ | \\ . )*' # Single quoted strings
| // .* $ # A comment block starting with //
| /\* [\s\S]*? \*/ # A multi-line comment block /*...*/
( \w+ ) # (4) ^
["']? [^;]+ \4 ; $ # ^
| [^{}<'"/]++ # Force engine to backtrack if it encounters special characters (possessive)
| [^{}]++ # Default matching behavior (possessive)
| (?2) # Recurs 2nd capturing group
)* # Zero to many times of atomic group
) # (3 end)
} # Closing curly bracket
) # (2 end)
) # (1 end)
Revo's regex is the Right Way To Do it (as much as a regex ever can be!).
But sometimes you just need something quick, to manipulate a file you have control over. I find, when using regexes, that it's often important to define "Good enough".
So, it may be "good enough" to assume the indentation is correct. In that case, you can just detect the start of the fn, then read until you find the next closing curly with the same indentation:
( # Capture \1.
^([\t ]+) # Match and capture leading whitespace to \2.
(?:\w+\s+)* # Modifiers and return type, if any.
\w+\s*\( # Name and opening round brace: is a function.
.*? # Need Dot-matches-newline, to match fn body.
\n\2} # Curly brace is as indented as start of fn.
) # End capture of \1.
Should work on clean code that you wrote yourself, code you can pass through an auto-formatter first, etc.
Will work with K&R, Horstmann and Allman indent styles.
Will fail with one-line and in-line functions, and indent styles like GNU, Whitesmiths, Pico and Ratliff - things which Revo's answer handles with no problems at all.
Also fails on lambdas, nested functions, and functions which use generics, but even Revo's doesn't recognize those, and they're not that common.
And neither of our regexes capture the comments preceding a function, which is pretty sinful.
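A rough sketch of the indentation-based idea in Perl, under the same "good enough" assumptions: the code is consistently indented and the closing brace sits at the same indentation as the function's first line. The helper name extract_function and the Java snippet are made up for illustration, and the function name is matched literally.

```perl
use strict;
use warnings;

# Made-up Java source, consistently indented.
my $java = <<'END_JAVA';
public class Example {
    public String example() {
        if (true) {
            return "hello";
        }
        return "bye";
    }

    int other() {
        return 1;
    }
}
END_JAVA

# Hypothetical helper: returns the text of the named function, or undef.
sub extract_function {
    my ($source, $name) = @_;
    my ($fn) = $source =~ m/
        (                   # group 1: the whole function
          ^([\t ]+)         # group 2: leading whitespace
          (?:\w+[ \t]+)*    # modifiers and return type, if any
          \Q$name\E\s*\(    # function name and opening parenthesis
          .*?               # body, as short as possible
          \n\2\}            # closing brace at the same indentation
        )
    /msx;
    return $fn;
}

print extract_function($java, 'example'), "\n";
```

The inner closing brace of the if block is skipped because it is indented more deeply than the function's first line, which is the whole point of the back-reference.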

Perl regex: Extract token_key value from string

I just started learning Perl and trying to do regex to break down a token key.
The token itself has multiple "columns" and I only need the KEY section.
token:
{
"token_key":"C9B3A703ADFEE7A579561799DC019685C75F16E6D4F80E3AA01798CA2B1BD4C396E91C62D73A9604EE90C72BED760AC24D70B072517B06C3D2E1E3102046103E813E2AA59741D2B6543475DEED4EF4A9625BFFF15DAC5417209AEED968016E0671BE1878C8",
"key_type":"xyz",
"expires":1200
}
but I only need this part
C9B3A703ADFEE7A579561799DC019685C75F16E6D4F80E3AA01798CA2B1BD4C396E91C62D73A9604EE90C72BED760AC24D70B072517B06C3D2E1E3102046103E813E2AA59741D2B6543475DEED4EF4A9625BFFF15DAC5417209AEED968016E0671BE1878C8
everything else can be ignored when I output it.
Any suggestions or advice are welcome!
Thank You!
This looks like JSON -- perhaps a proper parser would be better?
use JSON::PP;
my $json = JSON::PP->new->utf8->allow_barekey;
my $token = $json->decode('{' . $str . '}')->{'token'};
print $token->{'token_key'};
In any case, you can extract it (a bit more hackishly) with a regex like so:
$str =~ /['"]token_key['"]:\s*['"]([a-f0-9]+)['"]/i;
print $1;
A slightly less hackish regex would be:
# (["'])\s*token_key\s*\1\s*:\s*(["'])((?:(?!\2)[\S\s])*)\2
( ["'] ) # (1)
\s* token_key \s*
\1
\s* : \s*
( ["'] ) # (2)
( # (3 start)
(?:
(?! \2 )
[\S\s]
)*
) # (3 end)
\2
Assuming the following:
there is no whitespace between "token_key" and its value,
double quotes, not single quotes, are used in the parsed strings,
the string read from the file is in the $_ variable.
In that case
if (/"token_key":"([^"]+)/)
{ print "$1\n" }
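For the parser route suggested earlier in this thread, here is a minimal self-contained sketch. It assumes the braces and their contents (without the leading "token:" label) are already in a string, and it uses only the first 32 characters of the token value from the question to keep the line short.

```perl
use strict;
use warnings;
use JSON::PP;   # ships with Perl since 5.14; exports decode_json

# Assumption: $str holds just the JSON object. The token value is truncated here.
my $str = '{"token_key":"C9B3A703ADFEE7A579561799DC019685","key_type":"xyz","expires":1200}';

my $data = decode_json($str);       # dies if $str is not valid JSON
print $data->{token_key}, "\n";     # C9B3A703ADFEE7A579561799DC019685
```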

How can I extract substrings from a string in Perl?

Consider the following strings:
1) Scheme ID: abc-456-hu5t10 (High priority) *****
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *****
and so on in the above format; the parts in bold change across the strings.
Imagine I have many strings in the format shown above.
I want to pick 3 substrings (As shown in BOLD below) from the each of the above strings.
1st substring containing the alphanumeric value (in eg above it's "abc-456-hu5t10")
2nd substring containing the word (in eg above it's "High priority")
3rd substring containing * (IF * is present at the end of the string ELSE leave it )
How do I pick these 3 substrings from each string shown above? I know it can be done using regular expressions in Perl... Can you help with this?
You could do something like this:
my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END
foreach (split(/\n/,$data)) {
$_ =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ || next;
my ($id,$word,$star) = ($1,$2,$3);
print "$id $word $star\n";
}
The key thing is the Regular expression:
Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?
Which breaks up as follows.
The fixed String "Scheme ID: ":
Scheme ID:
Followed by one or more of the characters a-z, 0-9 or -. We use the brackets to capture it as $1:
([a-z0-9-]+)
Followed by one or more whitespace characters:
\s+
Followed by an opening bracket (which we escape) followed by any number of characters which aren't a close bracket, and then a closing bracket (escaped). We use unescaped brackets to capture the words as $2:
\(([^)]+)\)
Followed by some spaces and maybe a *, captured as $3:
\s*(\*)?
You could use a regular expression such as the following:
/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/
So for example:
$s = "abc-456-hu5t10 (High priority) *";
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/;
print "$1\n$2\n$3\n";
prints
abc-456-hu5t10
High priority
*
(\S*)\s*\((.*?)\)\s*(\*?)
(\S*) picks up anything which is NOT whitespace
\s* 0 or more whitespace characters
\( a literal open parenthesis
(.*?) anything, non-greedy so stops on first occurrence of...
\) a literal close parenthesis
\s* 0 or more whitespace characters
(\*?) 0 or 1 occurrences of literal *
Well, a one liner here:
perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt
Expanded to a simple script to explain things a bit better:
#!/usr/bin/perl -ln
#-w : warnings
#-l : print newline after every print
#-n : apply script body to stdin or files listed at commandline, dont print $_
use strict; #always do this.
my $regex = qr{ # precompile regex
Scheme\ ID: # to match beginning of line.
\s+ # 1 or more whitespace
(.*?) # Non greedy match of all characters up to
\s+ # 1 or more whitespace
\( # parenthesis literal
(.*?) # non-greedy match to the next
\) # closing literal parenthesis
\s* # 0 or more whitespace (trailing * is optional)
(\*)? # 0 or 1 literal *s
}x; #x switch allows whitespace in regex to allow documentation.
#values trapped in $1 $2 $3, so do whatever you need to:
#Perl lets you use any characters as delimiters, i like pipes because
#they reduce the amount of escaping when using file paths
m|$regex| && print "$1 : $2 : $3";
#alternatively if(m|$regex|) {doOne($1); doTwo($2) ... }
Though if it were anything other than formatting, I would implement a main loop to handle files and flesh out the body of the script rather than relying on the command-line switches for the looping.
Long time no Perl
while(<STDIN>) {
next unless /:\s*(\S+)\s+\(([^\)]+)\)\s*(\*?)/;
print "|$1|$2|$3|\n";
}
This just requires a small change to my last answer:
my ($guid, $scheme, $star) = $line =~ m{
The [ ] Scheme [ ] GUID: [ ]
([a-zA-Z0-9-]+) #capture the guid
[ ]
\( (.+) \) #capture the scheme
(?:
[ ]
([*]) #capture the star
)? #if it exists
}x;
String 1:
$input =~ /^\S+/;
$s1 = $&;
String 2:
$input =~ /\(.*\)/;
$s2 = $&;
String 3:
$input =~ /\*?$/;
$s3 = $&;
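Pulling the answers above together, one possible sketch wraps the match in a small helper that copes with the missing star. The name parse_scheme is hypothetical, and it assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (id, label, star) or an empty list on no match.
sub parse_scheme {
    my ($line) = @_;
    my ($id, $label, $star) =
        $line =~ /Scheme ID:\s+([A-Za-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/
        or return;
    # $star is undef when the optional group did not take part in the match.
    return ($id, $label, $star // '');
}

for my $line ('1) Scheme ID: abc-456-hu5t10 (High priority) *',
              '2) Scheme ID: frt-78f-hj542w (Balanced)') {
    my ($id, $label, $star) = parse_scheme($line) or next;
    print "$id|$label|$star\n";
}
# abc-456-hu5t10|High priority|*
# frt-78f-hj542w|Balanced|
```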