Replacing ( string ) ^ 2 to sqrt ( string ) in Perl - regex

A given string (called $bbb!) contains many operands and operators. I want to replace every occurrence of
muth ( math ) ^ 2 mith to muth sqrt( math ) mith. (whitespace can be more than just one).
EDIT:
Assume that, in the entire expression, there is only either one (simple linear expression) ^ 2 or none --if it makes it easier.
Inclusive Example:
1.2 * ( 4.7 * a * ( b - 0.02 ) ^ 2 * ( b - 0.02 + 1 ) / ( b - 0.0430 ) )
should be changed to:
1.2 * ( 4.7 * a * sqrt( b - 0.02 ) * ( c - 0.02 + 1 ) / ( d - 0.0430 ) )

Well... weird problem...
Try it with this a bit advanced expression
(?<math>\((?:[^()]+|(?&math))*\))\s*\^\s*2
Hopefully the graphic illustrates what's going on
Debuggex Demo
The replacement string must than be sqrt $1
The command in perl would look like
$bbb =~ s/(?<math>\((?:[^()]+|(?&math))*\))\s*\^\s*2/sqrt $1/
A running example can be found here: http://regex101.com/r/qU8dV0/3
some words on what the heck, this is
the main structure here is anything\s*\^\s*2, it's matching anything followed by ^2
(?<math>...) builds a pattern named math
\(...\) the pattern math must begin with an opening parenthesis and end with a closing one
within the parenthesisses:
[^()]+ anything except parenthesisses is allowed or
(?&math) another in parenthesis wrapped term with the already defined structure, is allowed, so the outer pattern math is recursively repeated

Related

R regex: how to remove "*" only in between a group of variables

I have a group of variable var:
> var
[1] "a1" "a2" "a3" "a4"
here is what I want to achieve: using regex and change strings such as this:
3*a1 + a1*a2 + 4*a3*a4 + a1*a3
to
3a1 + a1*a2 + 4a3*a4 + a1*a3
Basically, I want to trim "*" that is not in between any values in var. Thank you in advance
Can do find (?<![\da-z])(\d+)\* replace $1
(?<! [\da-z] )
( \d+ ) # (1)
\*
Or, ((?:[^\da-z]|^)\d+)\* for the assertion impaired engines
( # (1 start)
(?: [^\da-z] | ^ )
\d+
) # (1 end)
\*
Leading assertions are bad anyways.
Benchmark
Regex1: (?<![\da-z])(\d+)\*
Options: < none >
Completed iterations: 100 / 100 ( x 1000 )
Matches found per iteration: 2
Elapsed Time: 1.09 s, 1087.84 ms, 1087844 µs
Regex2: ((?:[^\da-z]|^)\d+)\*
Options: < none >
Completed iterations: 100 / 100 ( x 1000 )
Matches found per iteration: 2
Elapsed Time: 0.77 s, 767.04 ms, 767042 µs
You can create a dynamic regex out of the var to match and capture *s that are inside your variables, and reinsert them back with a backreference in gsub, and remove all other asterisks:
var <- c("a1","a2","a3","a4")
s = "3*a1 + a1*a2 + 4*a3*a4 + a1*a3"
block = paste(var, collapse="|")
pat = paste0("\\b((?:", block, ")\\*)(?=\\b(?:", block, ")\\b)|\\*")
gsub(pat, "\\1", s, perl=T)
## "3a1 + a1*a2 + 4a3*a4 + a1*a3"
See the IDEONE demo
Here is the regex:
\b((?:a1|a2|a3|a4)\*)(?=\b(?:a1|a2|a3|a4)\b)|\*
Details:
\b - leading word boundary
((?:a1|a2|a3|a4)\*) - Group 1 matching
(?:a1|a2|a3|a4) - either one of your variables
\* - asterisk
(?=\b(?:a1|a2|a3|a4)\b) - a lookahead check that there must be one of your variables (otherwise, no match is returned, the * is matched with the second branch of the alternation)
| - or
\* - a "wild" literal asterisk to be removed.
Taking the equation as a string, one option is
gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', '3*a1 + a1*a2 + 4*a3*a4 + a1*a3')
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"
which looks for
a captured group of characters, ( ... )
containing a non-capturing group, (?: ... )
containing the beginning of the line ^
or, |
a space (or \\s)
followed by a digit 0-9, \\d.
The capturing group is followed by an asterisk, \\*,
followed by another capturing group ( ... )
containing an alphanumeric character \\w.
It replaces the above with
the first captured group, \\1,
followed by the second captured group, \\2.
Adjust as necessary.
Thank #alistaire for offering a solution with non-capturing group. However, the solution replies on that there exists an space between the coefficient and "+" in front of it. Here's my modified solution based on his suggestion:
> ss <- "3*a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4*a2*a3"
# my modified version
> gsub('((?:^|\\s|\\+|\\-)\\d)\\*(\\w)', '\\1\\2', ss)
[1] "3a1 + a1*a2+4a3*a4 +2a1*a3+ 4a2*a3"
# alistire's
> gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', ss)
[1] "3a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4a2*a3"

Extracting inner groups with regex

I have the following string
([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6')) AND (((SUM_RevisionAnomalia_UltRevision_1M = 1) AND (CANT_ConsumoFact_UltRevision_1M > 1)) OR ((SUM_RevisionNoAnomalia_UltRevision_1M + 1) AND (CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2))) OR (SUM_RevisionNoAnomalia_UltRevision_1M <= 1)
and I am trying to extract all inner groups, so my answer should contain
([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6'))
(SUM_RevisionAnomalia_UltRevision_1M = 1)
(CANT_ConsumoFact_UltRevision_1M > 1)
(SUM_RevisionNoAnomalia_UltRevision_1M + 1)
(CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2)
(SUM_RevisionNoAnomalia_UltRevision_1M <= 1)
It is quite easy to extract this when there is only 1 set of those strings inside parentheses, but when given the example above my regex captures the whole string.
The regex i am using is
/(\([a-zA-Z0-9\[\]:_+=-\s\.\(\),'óáéíúüçãôàäê><]+\))/g
It seems you just want to match what is in-between ( and ) that is not ( and ) unless these are (...) that are preceded with a word character.
You can use
\((?:[^()]|\b\([^()]*\))*\)
See the regex demo
The regex breakdown:
\( - matching a literal (
(?:[^()]|\b\([^()]*\))* - zero or more sequences of:
[^()] - any character other than ( and )
| - or...
\b\([^()]*\) - a word boundary (i.e. before that position, there must be a word character) followed with ( followed with zero or more characters other than ( and )
\) - a closing )
An alternative pattern can be an unrolled one (more efficient with longer inputs):
\([^()]*(?:\b\([^()]*\)[^()]*)*\)
See another demo

What does regex expression doing?

What does this expression mean?
Pattern.compile("^.*(?=.*\\d).*$", Pattern.CASE_INSENSITIVE | Pattern.COMMENTS)
I tried to split each part of the expression, but could not get its meaning. please help me on this.
From regex101.com:
TL;DR:
Matches any String that contains at least a number (characters '0' to '9').
As a side note I'd like to point out that this is a horrendous way to do so, and could be replaced by the following:
Pattern.compile("\\d");
I basically removed all the nonsense greedy fillers and the useless anchors. Use this regex with Matcher#find() method and not Matcher#matches().
There are two parts to this regex.
1. The part up to (but not including) the digit.
2. The part from the digit to the end of the string.
The regex is processed left to right.
The first thing it see's is .*. This tells it to go directly to the
end of the string and start searching backwards to satisfy ->
The next thing it see's, which is (?=.*\d).
In that assertion the .* is ignored because of the previous .*
since its already at the end.
So the search progresses (using the assertion) to the left until it finds a
position where a digit is directly in front of the current position.
Once that is found, it matches that digit and all past it until the end of
the string. This is the part 2. described above.
Visually, it can be seen if you add some capture groups, and test it on some
real input.
^
( .* ) # (1)
(?=
( .* ) # (2)
( \d ) # (3)
)
( .* ) # (4)
$
Output:
** Grp 0 - ( pos 0 , len 15 )
12hh34ddd567uuu
** Grp 1 - ( pos 0 , len 11 )
12hh34ddd56
** Grp 2 - ( pos 11 , len 0 ) EMPTY
** Grp 3 - ( pos 11 , len 1 )
7
** Grp 4 - ( pos 11 , len 4 )
7uuu

PCRE regex for multiple decimal coordinates using [lon,lat] format

I am trying to create a regex for [lon,lat] coordinates.
The code first checks if the input starts with '['.
If it does we check the validity of the coordinates via a regex
/([\[][-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)[\]][\;]?)+/gm
The regex tests for [lon,lat] with 15 decimals [+- 180degrees, +-90degrees]
it should match :
single coordinates :
[120,80];
[120,80]
multiple coordinates
[180,90];[180,67];
[180,90];[180,67]
with newlines
[123,34];[-32,21];
[12,-67]
it should not match:
semicolon separator missing - single
[25,67][76,23];
semicolon separator missing - multiple
[25,67]
[76,23][12,90];
I currently have problems with the ; between coordinates (see 4 & 5)
jsfiddle equivalent here : http://regex101.com/r/vQ4fE0/4
You can try with this (human readable) pattern:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<lon> [+-]?
(?:
180 (?:\.0{1,15})?
|
(?: 1(?:[0-7][0-9]?)? | [2-9][0-9]? | 0 )
(?:\.[0-9]{1,15})?
)
)
(?<lat> [+-]?
(?:
90 (?:\.0{1,15})?
|
(?: [1-8][0-9]? | 9)
(?:\.[0-9]{1,15})?
)
)
)
\A
\[ \g<lon> , \g<lat> ] (?: ; \n? \[ \g<lon> , \g<lat> ] )* ;?
\z
~x
EOD;
explanations:
When you have to deal with a long pattern inside which you have to repeat several time the same subpatterns, you can use several features to make it more readable.
The most well know is to use the free-spacing mode (the x modifier) that allows to indent has you want the pattern (all spaces are ignored) and eventually to add comments.
The second consists to define subpatterns in a definition section (?(DEFINE)...) in which you can define named subpatterns to be used later in the main pattern.
Since I don't want to repeat the large subpatterns that describes the longitude number and the latitude number, I have created in the definition section two named pattern "lon" and "lat". To use them in the main pattern, I only need to write \g<lon> and \g<lat>.
javascript version:
var lon_sp = '(?:[+-]?(?:180(?:\\.0{1,15})?|(?:1(?:[0-7][0-9]?)?|[2-9][0-9]?|0)(?:\\.[0-9]{1,15})?))';
var lat_sp = '(?:[+-]?(?:90(?:\\.0{1,15})?|(?:[1-8][0-9]?|9)(?:\\.[0-9]{1,15})?))';
var coo_sp = '\\[' + lon_sp + ',' + lat_sp + '\\]';
var regex = new RegExp('^' + coo_sp + '(?:;\\n?' + coo_sp + ')*;?$');
var coordinates = new Array('[120,80];',
'[120,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]');
for (var i = 0; i<coordinates.length; i++) {
console.log("\ntest "+(i+1)+": " + regex.test(coordinates[i]));
}
fiddle
Try this out:
^(\[([+-]?(?!(180\.|18[1-9]|19\d{1}))\d{1,3}(\.\d{1,15})?,[+-]?(?!(90\.|9[1-9]))\d{1,2}(\.\d{1,15})?(\];$|\]$|\];\[)){1,})
Demo: http://regex101.com/r/vQ4fE0/7
Explanation
^(\[
Must start with a bracket
[+-]?
May or may not contain +- in front of the number
(?!(180\.|18[1-9]|19\d{1}))
Should not contain 180., 181-189 nor 19x
\d{1,3}(\.\d{1,15})?
Otherwise, any number containing 1 or 3 digits, with or without decimals (up to 15) are allowed
(?!(90\.|9[1-9]))
The 90 check is similar put here we are not allowing 90. nor 91-99
\d{1,2}(\.\d{1,15})?
Otherwise, any number containing 1 or 2 digits, with or without decimals (up to 15) are allowed
(\];$|\]$|\];\[)
The ending of a bracket body must have a ; separating two bracket bodies, otherwise it must be the end of the line.
{1,}
The brackets can exist 1 or multiple times
Hope this was helpful.
This might work. Note that you have a lot of capture groups, none of which
will give you good information because of recursive quantifiers.
# /^(\[[-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)\](?:;\n?|$))+$/
^
( # (1 start)
\[
[-+]?
( # (2 start)
180
( \. 0{1,15} )? # (3)
|
( # (4 start)
( 1 [0-7] \d ) # (5)
|
( [1-9]? \d ) # (6)
) # (4 end)
( \. \d{1,15} )? # (7)
) # (2 end)
,
[-+]?
( # (8 start)
[1-8]? \d
( \. \d{1,15} )? # (9)
|
90
( \. 0{1,15} )? # (10)
) # (8 end)
\]
(?: ; \n? | $ )
)+ # (1 end)
$
Try a function approach, where the function can do some of the splitting for you, as well as delegating the number comparisons away from the regex. I tested it here: http://repl.it/YyG/3
//represents regex necessary to capture one coordinate, which
// looks like 123 or 123.13532
// the decimal part is a non-capture group ?:
var oneCoord = '(-?\\d+(?:\\.\\d+)?)';
//console.log("oneCoord is: "+oneCoord+"\n");
//one coordinate pair is represented by [x,x]
// check start/end with ^, $
var coordPair = '^\\['+oneCoord+','+oneCoord+'\\]$';
//console.log("coordPair is: "+coordPair+"\n");
//the full regex string consists of one or more coordinate pairs,
// but we'll do the splitting in the function
var myRegex = new RegExp(coordPair);
//console.log("my regex is: "+myRegex+"\n");
function isPlusMinus180(x)
{
return -180.0<=x && x<=180.0;
}
function isPlusMinus90(y)
{
return -90.0<=y && y<=90.0;
}
function isValid(s)
{
//if there's a trailing semicolon, remove it
if(s.slice(-1)==';')
{
s = s.slice(0,-1);
}
//remove all newlines and split by semicolon
var all = s.replace(/\n/g,'').split(';');
//console.log(all);
for(var k=0; k<all.length; ++k)
{
var match = myRegex.exec(all[k]);
if(match===null)
return false;
console.log(" match[1]: "+match[1]);
console.log(" match[2]: "+match[2]);
//break out if one pair is bad
if(! (isPlusMinus180(match[1]) && isPlusMinus90(match[2])) )
{
console.log(" one of matches out of bounds");
return false;
}
}
return true;
}
var coords = new Array('[120,80];',
'[120.33,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]',
'[190,33.33]',
'[180.33,33]',
'[179.87,90]',
'[179.87,91]');
var s;
for (var i = 0; i<coords.length; i++) {
s = coords[i];
console.log((i+1)+". ==== testing "+s+" ====");
console.log(" isValid? => " + isValid(s));
}

Processing a Comma Separated List Before Shunting-Yard

So I'm processing some math from XML strings using the Shunting-Yard algorithm. The trick is that I want to allow the generation of random values by using comma separated lists. For example...
( ( 3 + 4 ) * 12 ) * ( 2, 3, 4, 5 ) )
I've already got a basic Shunting-Yard processor working. But I want to pre-process the string to randomly pick one of the values from the list before processing the expression. Such that I might end up with:
( ( 3 + 4 ) * 12 ) * 4 )
The Shunting-Yard setup is already pretty complicated, as far as my understanding is concerned, so I'm hesitant to try to alter it to handle this. Handling that with error checking sounds like a nightmare. As such, I'm assuming it would make sense to look for that pattern beforehand? I was considering using a regular expression, but I'm not one of "those" people... though I wish that I was... and while I've found some examples, I'm not sure how I might modify them to check for the parenthesis first? I'm also not confident that this would be the best solution.
As a side note, if the solution is regex, it should be able to match strings (just characters, no symbols) in the comma list as well, as I'll be processing for specific strings for values in my Shunting-Yard implementation.
Thanks for your thoughts in advance.
This is easily solved using two regexes. The first regex, applied to the overall text, matches each parenthesized list of comma separated values. The second regex, applied to each of the previously matched lists, matches each of the values in the list. Here is a PHP script with a function that, given an input text having multiple lists, replaces each list with one of its values randomly chosen:
<?php // test.php 20110425_0900
function substitute_random_value($text) {
$re = '/
# Match parenthesized list of comma separated words.
\( # Opening delimiter.
\s* # Optional whitespace.
\w+ # required first value.
(?: # Group for additional values.
\s* , \s* # Values separated by a comma, ws
\w+ # Next value.
)+ # One or more additional values.
\s* # Optional whitespace.
\) # Closing delimiter.
/x';
// Match each parenthesized list and replace with one of the values.
$text = preg_replace_callback($re, '_srv_callback', $text);
return $text;
}
function _srv_callback($matches_paren) {
// Grab all word options in parenthesized list into $matches.
$count = preg_match_all('/\w+/', $matches_paren[0], $matches);
// Randomly pick one of the matches and return it.
return $matches[0][rand(0, $count - 1)];
}
// Read input text
$data_in = file_get_contents('testdata.txt');
// Process text multiple times to verify random replacements.
$data_out = "Run 1:\n". substitute_random_value($data_in);
$data_out .= "Run 2:\n". substitute_random_value($data_in);
$data_out .= "Run 3:\n". substitute_random_value($data_in);
// Write output text
file_put_contents('testdata_out.txt', $data_out);
?>
The substitute_random_value() function calls the PHP preg_replace_callback() function, which matches and replaces each list with one of the values in the list. It calls the _srv_callback() function which randomly picks out one of the values and returns it as the replacement value.
Given this input test data (testdata.txt):
( ( 3 + 4 ) * 12 ) * ( 2, 3, 4, 5 ) )
( ( 3 + 4 ) * 12 ) * ( 12, 13) )
( ( 3 + 4 ) * 12 ) * ( 22, 23, 24) )
( ( 3 + 4 ) * 12 ) * ( 32, 33, 34, 35 ) )
Here is the output from one example run of the script:
Run 1:
( ( 3 + 4 ) * 12 ) * 5 )
( ( 3 + 4 ) * 12 ) * 13 )
( ( 3 + 4 ) * 12 ) * 22 )
( ( 3 + 4 ) * 12 ) * 35 )
Run 2:
( ( 3 + 4 ) * 12 ) * 3 )
( ( 3 + 4 ) * 12 ) * 12 )
( ( 3 + 4 ) * 12 ) * 22 )
( ( 3 + 4 ) * 12 ) * 33 )
Run 3:
( ( 3 + 4 ) * 12 ) * 3 )
( ( 3 + 4 ) * 12 ) * 12 )
( ( 3 + 4 ) * 12 ) * 23 )
( ( 3 + 4 ) * 12 ) * 32 )
Note that this solution uses \w+ to match values consisting of "word" characters, i.e. [A-Za-z0-9_]. This can be easily changed if this does not meet your requirements.
Edit: Here is a Javascript version of the substitute_random_value() function:
function substitute_random_value(text) {
// Replace each parenthesized list with one of the values.
return text.replace(/\(\s*\w+(?:\s*,\s*\w+)+\s*\)/g,
function (m0) {
// Capture all word values in parenthesized list into values.
var values = m0.match(/\w+/g);
// Randomly pick one of the matches and return it.
return values[Math.floor(Math.random() * values.length)];
});
}