Perl hash substitution with special characters in keys - regex

My current script will take an expression, ex:
my $expression = '( a || b || c )';
and go through each boolean combination of inputs using sub/replace, like so:
my $keys = join '|', keys %stimhash;
$expression =~ s/($keys)\b/$stimhash{$1}/g;
So for example expression may hold,
( 0 || 1 || 0 )
This works great.
However, I would like to allow the variables (also in %stimhash) to contain a tag, *.
my $expression = '( a* || b* || c* )';
Also, printing the keys of the stimhash returns:
a*|b*|c*
It is not properly substituting/replacing with the extra special character, *.
It gives this warning:
Use of uninitialized value within %stimhash in substitution iterator
I tried using quotemeta() but did not have good results so far.
It will drop the values. An example after the substitution looks like:
( * || * || * )
Any suggestions are appreciated,
John

Problem 1
You use the pattern a* thinking it will match only a*, but a* means "0 or more a". You can use quotemeta to convert text into a regex pattern that matches that text.
Replace
my $keys = join '|', keys %stimhash;
with
my $keys = join '|', map quotemeta, keys %stimhash;
Problem 2
\b
is basically
(?<!\w)(?=\w)|(?<=\w)(?!\w)
But * (like the space) isn't a word character. The solution might be to replace
s/($keys)\b/$stimhash{$1}/g
with
s/($keys)(?![\w*])/$stimhash{$1}/g
though the following makes more sense to me
s/(?<![\w*])($keys)(?![\w*])/$stimhash{$1}/g
Personally, I'd use
s{([\w*]+)}{ $stimhash{$1} // $1 }eg
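The same two fixes (escape the keys, and replace \b with explicit lookarounds) can be sketched in Python for comparison. This is only an illustration, not the original Perl: the function name substitute and the sample dictionary are made up, and re.escape plays the role of quotemeta. Sorting the keys longest first is an extra precaution, assumed here, so that a key such as a* is tried before a plain a.

```python
import re

def substitute(expression, values):
    """Replace every key of `values` found in `expression` with its value."""
    # Longest keys first, so 'a*' is tried before 'a' in the alternation.
    keys = sorted(values, key=len, reverse=True)
    # re.escape makes '*' a literal character instead of a quantifier.
    alternation = "|".join(re.escape(k) for k in keys)
    # Lookarounds stand in for \b, which does not treat '*' as a word character.
    pattern = re.compile(r"(?<![\w*])(" + alternation + r")(?![\w*])")
    return pattern.sub(lambda m: str(values[m.group(1)]), expression)

stim = {"a*": 0, "b*": 1, "c*": 0}  # hypothetical sample data
print(substitute("( a* || b* || c* )", stim))
```

The lookarounds keep a key from matching inside a longer identifier, which is the job \b did before * was allowed in names.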

Related

Remove substrings between < and > (including the brackets) with no angle brackets inside

I have to modify a html-like text with the sed command. I have to delete substrings starting with one or more < chars, then having 0 or more occurrences of any characters but angle brackets and then any 1 or more > chars.
For example: from
aaa<bbb>ccc I would like to get aaaccc
I am able to do this with
"s/<[^>]\+>//g"
but this command doesn't work if between <> characters is an empty string, or if there is double <<>> in the text.
For example, from
aa<>bb<cc>vv<<gg>>h
I get
aa<>bbvv>h
instead of
aabbvvh
How can I modify it to give me the right result?
The problem is that once you allow the < and > characters to nest, the language stops being "regular" and becomes "context free".
Regular languages are those that can be matched by regular expressions, while context-free languages cannot, in general, be parsed by a regular expression. The unbounded level of nesting is what prevents this: parsing such languages requires a stack-based (pushdown) automaton.
But there is a slightly complicated workaround. If you assume an upper limit on the level of nesting that will occur in your text, you can treat the language as regular, on the premise that the non-regular cases will never occur.
Let's suppose you will never have more than three levels of nesting (this lets you see the pattern and extend it to N levels). You can use the following algorithm to build a regular expression that matches up to three levels of nesting, but no more (you can make a regexp that parses N levels for any fixed N, but never an unbounded number; this is the unbounded/bounded nature of regexps :) ).
Let's construct the expression recursively from the bottom up. At level 0 there is no nesting, so neither < nor > may appear inside (allowing < would allow more nesting levels, which is forbidden at level 0):
{l0} = [^<>]*
a string including no < and > characters.
Your matching text will be of this class of strings, surrounded by a pair of < and > chars:
{l1} = <[^<>]*>
Now, you can build a second level of nesting by alternating {l0}{l1}{l0}{l1}...{l0} (that is, {l0}({l1}{l0})*) and surrounding the whole thing with < and >, to build {l2}:
{l2} = <{l0}({l1}{l0})*> = <[^<>]*(<[^<>]*>[^<>]*)*>
Now, you can build a third level by alternating sequences of {l0} and {l2} inside a pair of brackets (remember that {li} represents a regexp that allows up to i levels of nesting):
{l3} = <{l0}({l2}{l0})*> = <[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
and so on, successively, you form a sequence of
{lN} = <{l0}({l(N-1)}{l0})*>
and stop when you consider there will not be a deeper nesting in your input file.
So your level three regexp is:
<[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
{l3--------------------------------------}
<{l0--}({l2---------------------}{l0--})*>
<{l0--}({l1----}{l0--})*>
<{l0--}>
You can see that the regexp grows as you consider more levels. The good thing is that you can pick a maximum level of three or four and most text will fit in this category.
See demo.
NOTE
Never hesitate to build a regular expression just because it looks complex. Remember that you can build it inside your program using the technique shown above (e.g. a 16-level nesting regexp is a large string that is very hard to write by hand but very easy to generate with a computer):
package com.stackoverflow.q61630608;
import java.util.regex.Pattern;
public class NestingRegex {
public static String build_regexp( char left, char right, int level ) {
return level == 0
? "[^" + left + right + "]*"
: level == 1
? left + build_regexp( left, right, 0 ) + right
: left + build_regexp( left, right, 0 )
+ "(" + build_regexp( left, right, level - 1 )
+ build_regexp( left, right, 0 )
+ ")*" + right;
}
public static void main( String[] args ) {
for ( int i = 0; i < 5; i++ )
System.out.println( "{l" + i + "} = "
+ build_regexp( '<', '>', i ) );
Pattern pat = Pattern.compile( build_regexp( '<', '>', 16 ), 0 );
String s = "aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp";
System.out.println(
String.format( "pat.matcher(\"%s\").replaceAll(\"#\") => %s",
s, pat.matcher( s ).replaceAll( "#" ) ) );
}
}
which, on run gives:
{l0} = [^<>]*
{l1} = <[^<>]*>
{l2} = <[^<>]*(<[^<>]*>[^<>]*)*>
{l3} = <[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>
{l4} = <[^<>]*(<[^<>]*(<[^<>]*(<[^<>]*>[^<>]*)*>[^<>]*)*>[^<>]*)*>
pat.matcher("aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp").replaceAll("#") => aa#bb#vv#h#ppp
The main advantage of a regular expression is that, once written, it is compiled into an internal representation. With a DFA-based engine, matching only has to visit each character of the input once, which gives very efficient matching code (you probably won't do better by writing it yourself). Backtracking engines such as Java's java.util.regex do not give that guarantee.
Sed
For sed, you only need to generate a regexp that is deep enough and use it to process your text file:
sed 's/<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>//g' file1.xml
will give you appropriate results (this handles 6 levels of nesting or fewer; remember that ( and ) must be escaped to act as group delimiters in sed)
Your regexp can be constructed using shell variables with the following approach:
l0="[^<>]*"
l1="<${l0}>"
l2="<${l0}\(${l1}${l0}\)*>"
l3="<${l0}\(${l2}${l0}\)*>"
l4="<${l0}\(${l3}${l0}\)*>"
l5="<${l0}\(${l4}${l0}\)*>"
l6="<${l0}\(${l5}${l0}\)*>"
echo regexp is "${l6}"
regexp is <[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*\(<[^<>]*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>[^<>]*\)*>
sed -e "s/${l6}/#/g" <<EOF
aa<>bb<cc>vv<<gg>>h<iii<jj<>j>k<k>k<<lll>mmm>ooo>ppp
EOF
aa#bb#vv#h#ppp
(I've used # as the replacement string so you can see where in the input the pattern matched.)
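For contrast with the bounded-depth regexp, here is a minimal sketch of the counter-based approach that the "context free" remark points to. It is a different technique from the one in this answer, written in Python rather than sed, and the name strip_bracketed is made up. It assumes the brackets are balanced; a stray closing bracket is simply kept.

```python
def strip_bracketed(text, left="<", right=">"):
    """Drop every bracketed span, at any nesting depth, in one pass."""
    depth = 0   # how many brackets are currently open
    kept = []
    for ch in text:
        if ch == left:
            depth += 1
        elif ch == right:
            if depth > 0:
                depth -= 1
            else:
                kept.append(ch)  # stray closer outside any span: keep it
        elif depth == 0:
            kept.append(ch)      # ordinary character outside all brackets
    return "".join(kept)

print(strip_bracketed("aa<>bb<cc>vv<<gg>>h<iii<jjj>kkk<<lll>mmm>ooo>ppp"))
```

A single counter is enough here because there is only one kind of bracket; it plays the role of the stack in a pushdown automaton.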
You may use
sed 's/<\+[^>]*>\+//g'
sed 's/<\{1,\}[^>]*>\{1,\}//g'
sed -E 's/<+[^>]*>+//g'
The patterns match
<\+ / <\{1,\} - 1 or more occurrences of < char
[^>]* - negated bracket expression that matches 0 or more chars other than >
>\+ / >\{1,\} - 1 or more occurrences of > char
Note that in the last example, which uses POSIX ERE, the unescaped + is a quantifier matching 1 or more occurrences, the same as \+ in the first pattern. \+ in a BRE is a GNU extension; \{1,\} is the portable POSIX BRE form.
See the online sed demo:
s='aa<>bb<cc>vv<<gg>>h'
sed 's/<\+[^>]*>\+//g' <<< "$s"
sed 's/<\{1,\}[^>]*>\{1,\}//g' <<< "$s"
sed -E 's/<+[^>]*>+//g' <<< "$s"
Result of each sed command is aabbvvh.
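The same pattern can be checked outside sed. A minimal sketch in Python, where the syntax matches the ERE variant (unescaped +); strip_tags is a made-up helper name:

```python
import re

def strip_tags(text):
    # One or more '<', then any run of characters other than '>', then one or more '>'.
    return re.sub(r"<+[^>]*>+", "", text)

print(strip_tags("aa<>bb<cc>vv<<gg>>h"))
```

Like the sed commands, this handles the empty <> and the doubled <<gg>> from the question, but it is not a general solution for arbitrarily nested brackets.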

how to limit, characters between a range using regular expression

As far as I know, {} curly braces are used to limit the number of characters in a regular expression, so {3,12} would match a length between 3 and 12.
I am trying to validate a username that might contain either a period . or an underscore _, but not both; placement doesn't matter. For this, the regex below works very well.
(^[a-z0-9]+$)|(^[a-z0-9]*[\.\_][a-z0-9]*$)
But I also need to limit the string length to between 3 and 12. I tried putting {3,12} in the regex, but that doesn't work.
((^[a-z0-9]+$)|(^[a-z0-9]*[\.\_][a-z0-9]*$)){3,12}
See Example: https://regex101.com/r/kN3aO1/1
As hwnd suggested, a simpler solution would be:
^(?=.{3,12}$)[a-z0-9]+(?:[._][a-z0-9]+)?$
Old solution, which is rather complex and convoluted, is left here for reference, but use the one above instead.
^(?!(?:.{13,}|.{1,2})$)(?:([a-z0-9]+)|([a-z0-9]*[\.\_][a-z0-9]*))$
You can add a lookahead for this.
Demo on regex101
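To see how the lookahead carries the length limit while the rest of the pattern checks the shape, here is a small sketch that runs the simpler pattern in Python. The pattern is copied from above; the helper name is_valid_username and the sample names are made up.

```python
import re

# (?=.{3,12}$) checks the total length; the rest allows one optional '.' or '_'.
USERNAME = re.compile(r"^(?=.{3,12}$)[a-z0-9]+(?:[._][a-z0-9]+)?$")

def is_valid_username(name):
    return USERNAME.match(name) is not None

for candidate in ["abc", "ab", "john_doe", "a.b_c", "abcdefghijklm"]:
    print(candidate, is_valid_username(candidate))
```

Note that this pattern, unlike the one in the question, does not accept a name that starts or ends with the separator.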
I would do this in three steps.
Check to see if the string has any '.' in it.
Check to see if the string has any '_' in it.
Check to see if string length is between 3 and 12.
In Perl:
if ( ( ( $name =~ /_/ ) && ( $name =~ /\./ ) ) ||
( length($name) < 3 ) ||
( length($name) > 12 ) )
{
# Handle invalid username
}
If you want to make sure that the username contains only one dot or underscore, you may count them. Again, in Perl:
my $dcnt = $name =~ tr /././;
my $ucnt = $name =~ tr /_/_/;
if ( ( $dcnt > 0 && $ucnt > 0 ) ||
( $dcnt > 1 ) ||
( $ucnt > 1 ) ||
( length($name) < 3 ) ||
( length($name) > 12 ) )
{
# Handle invalid username
}
Why not one monster regular expression that does everything at once? Well, for the sake of maintainability. If you or a colleague looks at this code in a year's time, when requirements have changed, this approach will make it easier to update the code.
Notice also that {3,12} says nothing about lengths. It allows the previous pattern to match three to twelve times.
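The counting version of the check translates almost line for line into other languages. A minimal sketch in Python, with a made-up function name; like the Perl above, it only checks the separator counts and the length, not which other characters are allowed.

```python
def is_valid_name(name):
    """Mirror of the Perl check: count separators, then check the length."""
    dots = name.count(".")
    underscores = name.count("_")
    if dots > 0 and underscores > 0:
        return False          # both kinds of separator present
    if dots > 1 or underscores > 1:
        return False          # the same separator used more than once
    return 3 <= len(name) <= 12

print(is_valid_name("john.doe"))   # one dot, length 8
print(is_valid_name("a_b.c"))      # both separators
```

Each rule sits on its own line, which is the maintainability argument made above.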

Keep track of matches and check against condition

I have $entire_line = "if varC > 0: varB = varC + 2"
I would like my regex to find the following: varC, varB, varC in the $entire_line
These matches then need to be checked to see whether they exist in a HashMap. If so, a $ should be appended to the match.
Hence the output should be:
"if $varC > 0: $varB = $varC + 2"
NOTE: 0 and 2 don't appear in the HashMap.
Currently, I have:
$entire_line =~ s/(\w+)/\$$1/g if (exists($variable_hash{$1}));
However, this does not work as intended as the $1 in exists($variable_hash{$1}) does not refer to the previous regex: $entire_line =~ s/(\w+)/\$$1/g
Is there a proper way to go about this?
Thanks for your help.
Use the /e modifier and put the code into the replacement part:
$entire_line =~ s/(\w+)/exists $variable_hash{$1} ? $variable_hash{$1} : $1/ge;
If I got your question correctly and you don't need to perform variable value substitution (as in @choroba's answer), but only want to prefix known variables with a $ character, and if %variable_hash is not very long, how about concatenating all the keys of %variable_hash with a | character to get a regex matching all known variables?
my %variable_hash = (
varA => 1,
# varB => 1, # commented out to check that it will not be replaced
varC => 1,
);
my $entire_line = "if varC > 0: varB = varC + 2;";
my $key_regex = join('|', map { quotemeta $_; } keys %variable_hash);
# $key_regex will contain "varA|varC"
$entire_line =~ s/\b($key_regex)\b/\$$1/g;
# prefix all matching substrings with $ character
print "$entire_line\n";
Also check my comment to @choroba's answer.
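The idea of deciding per match whether to add the prefix can also be sketched with a replacement callback, which is roughly what /e gives you in Perl. This is a Python illustration only; prefix_known and the known set are hypothetical names standing in for the hash lookup.

```python
import re

def prefix_known(line, known):
    """Put '$' in front of every word that appears in `known`."""
    def replace(match):
        word = match.group(0)
        # Words that are not known variables (keywords, numbers) stay as they are.
        return "$" + word if word in known else word
    return re.sub(r"\w+", replace, line)

known = {"varB", "varC"}  # hypothetical stand-in for the keys of the hash
print(prefix_known("if varC > 0: varB = varC + 2", known))
```

Because the lookup happens inside the callback, each match is tested individually, which is exactly what the statement-modifier form in the question could not do.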

In Perl, how many groups are in the matched regex?

I would like to tell the difference between a number 1 and string '1'.
The reason that I want to do this is because I want to determine the number of capturing parentheses in a regular expression after a successful match. According to the perlop doc, a list (1) is returned when there are no capturing groups in the pattern. So if I get a successful match and a list (1) then I cannot tell if the pattern has no parens or it has one paren and it matched a '1'. I can resolve that ambiguity if there is a difference between number 1 and string '1'.
You can tell how many capturing groups are in the last successful match by using the special @+ array. $#+ is the number of capturing groups. If that's 0, then there were no capturing parentheses.
For example, bitwise operators behave differently for strings and integers:
~1 = 18446744073709551614
~'1' = Î ('1' = 0x31, ~'1' = ~0x31 = 0xce = 'Î')
#!/usr/bin/perl
($b) = ('1' =~ /(1)/);
print isstring($b) ? "string\n" : "int\n";
($b) = ('1' =~ /1/);
print isstring($b) ? "string\n" : "int\n";
sub isstring {
return ($_[0] & ~$_[0]);
}
isstring returns either 0 (the result of a numeric bitwise op), which is false, or "\0" (the result of a bitwise string op, see perldoc perlop), which is true because it is a non-empty string.
If you want to know the number of capture groups a regex matched, just count them. Don't look at the values they return, which appears to be your problem:
You can get the count by looking at the result of the list assignment, which returns the number of items on the right hand side of the list assignment:
my $count = my @array = $string =~ m/.../g;
If you don't need to keep the capture buffers, assign to an empty list:
my $count = () = $string =~ m/.../g;
Or do it in two steps:
my @array = $string =~ m/.../g;
my $count = @array;
You can also use the @+ or @- variables, using some of the tricks I show in the first pages of Mastering Perl. These arrays have the starting and ending positions of each of the capture buffers. The values in index 0 apply to the entire pattern, the values in index 1 are for $1, and so on. The last index, then, is the total number of capture buffers. See perlvar.
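As a point of comparison only (this is Python, not Perl), some regex libraries expose the group count directly, so there is no need to inspect returned values at all. A minimal sketch, assuming Python's re module, whose compiled patterns carry a groups attribute; group_count is a made-up helper name.

```python
import re

def group_count(pattern):
    """Number of capturing groups declared in `pattern`."""
    return re.compile(pattern).groups

print(group_count(r"foo"))          # no capturing parentheses
print(group_count(r"foo(1)?"))      # one group, even if it does not take part in a match
print(group_count(r"(a)(?:b)(c)"))  # non-capturing groups are not counted
```

This plays the same role as $#+ in Perl: the count comes from the pattern, not from what the groups happened to capture.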
Perl converts between strings and numbers automatically as needed. Internally, it tracks the values separately. You can use Devel::Peek to see this in action:
use Devel::Peek;
$x = 1;
$y = '1';
Dump($x);
Dump($y);
The output is:
SV = IV(0x3073f40) at 0x3073f44
REFCNT = 1
FLAGS = (IOK,pIOK)
IV = 1
SV = PV(0x30698cc) at 0x3073484
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x3079bb4 "1"\0
CUR = 1
LEN = 4
Note that the dump of $x has a value for the IV slot, while the dump of $y doesn't but does have a value in the PV slot. Also note that simply using the values in a different context can trigger stringification or numification and populate the other slots. For example, if you did $x . '' or $y + 0 before peeking at the value, you'd get this:
SV = PVIV(0x2b30b74) at 0x3073f44
REFCNT = 1
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 1
PV = 0x3079c5c "1"\0
CUR = 1
LEN = 4
At which point 1 and '1' are no longer distinguishable at all.
Check for the definedness of $1 after a successful match. The logic goes like this:
If the list is empty then the pattern match failed
Else if $1 is defined then the list contains all the captured substrings
Else the match was successful, but there were no captures
Your question doesn't make a lot of sense, but it appears you want to know the difference between:
$a = "foo";
@f = $a =~ /foo/;
and
$a = "foo1";
@f = $a =~ /foo(1)?/;
Since they both return the same thing regardless if a capture was made.
The answer is: Don't try and use the returned array. Check to see if $1 is not equal to ""

Need help converting "sassy" to "$a55y" using a regular expression?

Any s at the beginning of the word should be converted to a $.
Any s inside the word should be converted to a 5.
To match an s at the start of the word, use \b to match word boundaries and \w to match alphanumerics:
/\bs\w/
(as @Matthew points out, the \w is really superfluous:)
/\bs/
Once you've replaced all s at the start of a word, then the only remaining ones are inside the word (I'm assuming that you also want to replace s at the end of a word with 5) so you can simply use
/s/
For completeness, here's how to put it all together (I'm going to assume JavaScript):
function pimpMyEsses(str)
{
return str.replace(/\bs/gi, '$').replace(/s/gi, '5');
}
console.log(pimpMyEsses('slither quantum Sassy. arcades'));
// > "$lither quantum $a55y. arcade5"
Depending on the language it may be possible to capture the substitutions with a single regular expression and replace them procedurally. Here's a PHP example:
<?php
$word = 'sassy';
preg_match_all('/\b(s)|([^s]+)|(s)/', $word, $matches, PREG_SET_ORDER);
/* captures:
* $matches = array(
* array('s','s'),
* array('a','','a'),
* array('s','','','s'),
* array('s','','','s'),
* array('y','','y')
* )
*/
$newword = '';
foreach ($matches as $m){
if ($m[1]) $newword .= '$'; # leading s --> $
elseif ($m[2] !== '') $newword .= $m[2]; # not an s --> as-is (strict test so a chunk like "0" is kept)
else $newword .= '5'; # any other s --> 5
}
echo $newword;
Because I've used \b to match a word-boundary before the "leading s", the string 'sassy socks' becomes '$a55y $ock5'
If you want only the s at the start of "sassy" to become a $, change the regular expression to:
'/^(s)|([^s]+)|(s)/'
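The single-pass idea can be written more compactly in a language whose replace function accepts a callback. A minimal sketch in Python (not the PHP above; convert_esses is a made-up name): the alternation tries "s at a word boundary" first and falls back to "any other s", and the callback picks the replacement from whichever group matched.

```python
import re

def convert_esses(text):
    """Leading 's' of each word becomes '$', every other 's' becomes '5'."""
    def replace(match):
        # Group 1 is set only when the 's' sits right after a word boundary.
        return "$" if match.group(1) else "5"
    return re.sub(r"\b(s)|(s)", replace, text)

print(convert_esses("sassy socks"))
```

Unlike the PHP version, this does not need a branch for the non-s chunks, because re.sub leaves unmatched text in place.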
You can do:
/^(s)/ to select only the first "s";
/(?:[^s])(?:(s)[^s]*)+/ to select all other "s". Note that the first character will be skipped (which is independent of);
Explanation: ignore the first character, then repeat one or more times: capture an "s" and skip any following characters that are not "s".
Next step: you need to determine which language you will use.