Processing a Comma Separated List Before Shunting-Yard - regex

So I'm processing some math from XML strings using the Shunting-Yard algorithm. The trick is that I want to allow the generation of random values by using comma separated lists. For example...
( ( 3 + 4 ) * 12 ) * ( 2, 3, 4, 5 ) )
I've already got a basic Shunting-Yard processor working. But I want to pre-process the string to randomly pick one of the values from the list before processing the expression. Such that I might end up with:
( ( 3 + 4 ) * 12 ) * 4 )
The Shunting-Yard setup is already pretty complicated, as far as my understanding is concerned, so I'm hesitant to alter it to handle this; doing that with error checking sounds like a nightmare. As such, I'm assuming it would make sense to look for that pattern beforehand? I was considering using a regular expression, but I'm not one of "those" people... though I wish that I was... and while I've found some examples, I'm not sure how I might modify them to check for the parentheses first. I'm also not confident that this would be the best solution.
As a side note, if the solution is regex, it should be able to match strings (just characters, no symbols) in the comma list as well, as I'll be processing for specific strings for values in my Shunting-Yard implementation.
Thanks for your thoughts in advance.

This is easily solved using two regexes. The first regex, applied to the overall text, matches each parenthesized list of comma separated values. The second regex, applied to each of the previously matched lists, matches each of the values in the list. Here is a PHP script with a function that, given an input text having multiple lists, replaces each list with one of its values randomly chosen:
<?php // test.php 20110425_0900
function substitute_random_value($text) {
    $re = '/
        # Match parenthesized list of comma separated words.
        \(              # Opening delimiter.
        \s*             # Optional whitespace.
        \w+             # required first value.
        (?:             # Group for additional values.
          \s* , \s*     # Values separated by a comma, ws
          \w+           # Next value.
        )+              # One or more additional values.
        \s*             # Optional whitespace.
        \)              # Closing delimiter.
        /x';
    // Match each parenthesized list and replace with one of the values.
    $text = preg_replace_callback($re, '_srv_callback', $text);
    return $text;
}
function _srv_callback($matches_paren) {
    // Grab all word options in parenthesized list into $matches.
    $count = preg_match_all('/\w+/', $matches_paren[0], $matches);
    // Randomly pick one of the matches and return it.
    return $matches[0][rand(0, $count - 1)];
}
// Read input text
$data_in = file_get_contents('testdata.txt');
// Process text multiple times to verify random replacements.
$data_out = "Run 1:\n". substitute_random_value($data_in);
$data_out .= "Run 2:\n". substitute_random_value($data_in);
$data_out .= "Run 3:\n". substitute_random_value($data_in);
// Write output text
file_put_contents('testdata_out.txt', $data_out);
?>
The substitute_random_value() function calls the PHP preg_replace_callback() function, which matches and replaces each list with one of the values in the list. It calls the _srv_callback() function which randomly picks out one of the values and returns it as the replacement value.
Given this input test data (testdata.txt):
( ( 3 + 4 ) * 12 ) * ( 2, 3, 4, 5 ) )
( ( 3 + 4 ) * 12 ) * ( 12, 13) )
( ( 3 + 4 ) * 12 ) * ( 22, 23, 24) )
( ( 3 + 4 ) * 12 ) * ( 32, 33, 34, 35 ) )
Here is the output from one example run of the script:
Run 1:
( ( 3 + 4 ) * 12 ) * 5 )
( ( 3 + 4 ) * 12 ) * 13 )
( ( 3 + 4 ) * 12 ) * 22 )
( ( 3 + 4 ) * 12 ) * 35 )
Run 2:
( ( 3 + 4 ) * 12 ) * 3 )
( ( 3 + 4 ) * 12 ) * 12 )
( ( 3 + 4 ) * 12 ) * 22 )
( ( 3 + 4 ) * 12 ) * 33 )
Run 3:
( ( 3 + 4 ) * 12 ) * 3 )
( ( 3 + 4 ) * 12 ) * 12 )
( ( 3 + 4 ) * 12 ) * 23 )
( ( 3 + 4 ) * 12 ) * 32 )
Note that this solution uses \w+ to match values consisting of "word" characters, i.e. [A-Za-z0-9_]. This can be easily changed if this does not meet your requirements.
Edit: Here is a JavaScript version of the substitute_random_value() function:
function substitute_random_value(text) {
    // Replace each parenthesized list with one of the values.
    return text.replace(/\(\s*\w+(?:\s*,\s*\w+)+\s*\)/g,
        function (m0) {
            // Capture all word values in parenthesized list into values.
            var values = m0.match(/\w+/g);
            // Randomly pick one of the matches and return it.
            return values[Math.floor(Math.random() * values.length)];
        });
}
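For comparison, here is a rough Python sketch of the same two-regex approach. It is not part of the original answer; the optional rng parameter is an assumption added so the random choice can be seeded when testing. Because \w+ also matches letters, a list of plain strings such as ( sin, cos ) is handled the same way as a list of numbers.

```python
import random
import re

# Same pattern as the PHP/JavaScript versions: a parenthesized list of
# two or more comma-separated "word" values.
LIST_RE = re.compile(r"\(\s*\w+(?:\s*,\s*\w+)+\s*\)")


def substitute_random_value(text, rng=random):
    """Replace each parenthesized list with one randomly chosen value.

    rng is a hypothetical hook (any object with a choice() method),
    added here only to make the behaviour reproducible in tests.
    """
    def pick(match):
        # Second regex: pull every value out of the matched list.
        values = re.findall(r"\w+", match.group(0))
        return rng.choice(values)

    return LIST_RE.sub(pick, text)


if __name__ == "__main__":
    print(substitute_random_value("( ( 3 + 4 ) * 12 ) * ( 2, 3, 4, 5 ) )"))
    print(substitute_random_value("2 * ( sin, cos )"))
```

A parenthesized group that contains no comma, such as ( 3 + 4 ), is left untouched because the pattern requires at least one additional value.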

Related

Iterate through captures with boost::regex

I have a regular expression to capture three fields in an HTML tag using boost::regex
"\\/\\/(.{1,3}?)\\.wikipedia\\.[a-z]+\\/wiki\\/(.*?)\\s*>(.*?)<"
So, from
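<a href="//de.wikipedia.org/wiki/Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de">Deutsch</a>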
I get
de
Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de"
Deutsch
But I'd like to have {de, Porky%E2%80%99s, Deutsch} instead.
How can I make my regex stop matching the second field as soon as it finds the first whitespace character?
I tried
"\\/\\/(.{1,3}?)\\.wikipedia\\.[a-z]+\\/wiki\\/(\\S*?)*>(.*?)<"
So the second field matches everything but whitespace but I get this crash report
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >'
what(): Ran out of stack space trying to match the regular expression.
This might work -
"//(.{1,3}?)\\.wikipedia\\.[a-z]+/wiki/([^\\s>\"]*).*?>(.*?)<"
I would use this instead -
"//(.{1,3}?)\\.wikipedia\\.[a-z]+/wiki/([^\\s>\"]*)[^>]*>(.*?)<"
Formatted:
//
( .{1,3}? ) # (1)
\.
wikipedia
\.
[a-z]+
/wiki/
( [^\s>"]* ) # (2)
[^>]*
>
( .*? ) # (3)
<
Output:
** Grp 0 - ( pos 9 , len 98 )
//de.wikipedia.org/wiki/Porky%E2%80%99s" title="Porky’s – German" lang="de" hreflang="de">Deutsch<
** Grp 1 - ( pos 11 , len 2 )
de
** Grp 2 - ( pos 33 , len 15 )
Porky%E2%80%99s
** Grp 3 - ( pos 99 , len 7 )
Deutsch
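As a quick cross-check, the second pattern can be tried in Python's re module, which accepts the same syntax for this expression. This is only a sketch: the input tag is an assumption reconstructed from the match output above, not something taken from the asker's code.

```python
import re

# The recommended pattern, written as a raw string so no double escaping is needed.
LINK_RE = re.compile(r'//(.{1,3}?)\.wikipedia\.[a-z]+/wiki/([^\s>"]*)[^>]*>(.*?)<')

# Assumed input, reconstructed from the "Grp 0" output shown above.
tag = ('<a href="//de.wikipedia.org/wiki/Porky%E2%80%99s" '
       'title="Porky’s – German" lang="de" hreflang="de">Deutsch</a>')

match = LINK_RE.search(tag)
if match:
    language, article, label = match.groups()
    print(language, article, label)
```

The second group stops at the first whitespace, quote or > because of the negated character class, and [^>]* then skips the remaining attributes without any nested quantifier that could exhaust the stack.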

What is this regex doing?

What does this expression mean?
Pattern.compile("^.*(?=.*\\d).*$", Pattern.CASE_INSENSITIVE | Pattern.COMMENTS)
I tried to split the expression into its parts, but could not work out its meaning. Please help me with this.
TL;DR:
Matches any String that contains at least one digit (characters '0' to '9').
As a side note I'd like to point out that this is a horrendous way to do so, and could be replaced by the following:
Pattern.compile("\\d");
I basically removed all the nonsense greedy fillers and the useless anchors. Use this regex with Matcher#find() method and not Matcher#matches().
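To illustrate that the two forms agree, here is a small Python sketch (the original is Java; fullmatch() plays the role of Matcher#matches() and search() the role of Matcher#find()). The helper names and sample strings are made up for the illustration, and re.ASCII is used on the assumption that \d should mean only '0' to '9', as it does in Java by default.

```python
import re

# The original pattern, applied to the whole string.
ORIGINAL = re.compile(r"^.*(?=.*\d).*$", re.ASCII)
# The simplified pattern, applied with a search.
SIMPLE = re.compile(r"\d", re.ASCII)


def has_digit_original(s):
    return ORIGINAL.fullmatch(s) is not None


def has_digit_simple(s):
    return SIMPLE.search(s) is not None


# Sample strings without line breaks (the dot does not match a newline).
samples = ["", "abc", "a1b", "7", "12hh34ddd567uuu", "no digits here"]
for s in samples:
    print(repr(s), has_digit_original(s), has_digit_simple(s))
```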
There are two parts to this regex.
1. The part up to (but not including) the digit.
2. The part from the digit to the end of the string.
The regex is processed left to right.
The first thing it sees is .*. Being greedy, this goes directly to the
end of the string and then backtracks one character at a time to satisfy ->
The next thing it sees, which is (?=.*\d).
The .* inside that assertion is greedy too, but at the position that is
finally accepted it ends up matching nothing.
So the search progresses (using the assertion) to the left until it finds a
position where a digit is the very next character, i.e. just before the last digit.
The assertion consumes nothing, so the final .* then matches that digit and
everything after it until the end of the string. This is the part 2. described above.
Visually, it can be seen if you add some capture groups, and test it on some
real input.
^
( .* ) # (1)
(?=
( .* ) # (2)
( \d ) # (3)
)
( .* ) # (4)
$
Output:
** Grp 0 - ( pos 0 , len 15 )
12hh34ddd567uuu
** Grp 1 - ( pos 0 , len 11 )
12hh34ddd56
** Grp 2 - ( pos 11 , len 0 ) EMPTY
** Grp 3 - ( pos 11 , len 1 )
7
** Grp 4 - ( pos 11 , len 4 )
7uuu
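The same capture-group experiment can be reproduced with Python's re module, which treats this pattern the same way. This is just a sketch for checking the groups; the variable names are arbitrary.

```python
import re

# Same expression with the four capture groups added, in verbose mode.
GROUPED = re.compile(
    r"""
    ^
    ( .* )          # (1) everything before the last digit
    (?=
      ( .* )        # (2) ends up empty at the accepted position
      ( \d )        # (3) the last digit, only looked at, not consumed
    )
    ( .* )          # (4) the last digit and everything after it
    $
    """,
    re.VERBOSE,
)

m = GROUPED.match("12hh34ddd567uuu")
for number in range(5):
    print(number, m.span(number), repr(m.group(number)))
```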

Regex fails to extract a double parameter substring from a string

I am trying to use the Regex library tools to extract double and integer parameters from a text file. Here is a minimal code that captures the 'std::regex_error' message I've been getting:
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string My_String = "delta = -002.050";
std::smatch Match;
std::regex Base("/^[0-9]+(\\.[0-9]+)?$");
std::regex_match(My_String,Match,Base);
std::ssub_match Sub_Match = Match[1];
std::string Sub_String = Sub_Match.str();
std::cout << Sub_String << std::endl;
return 0;
}
I am not much familiar with the Regex library, and couldn't find anything immediately useful. Any idea what causes this error message? To compile my code, I use g++ with -std=c++11 enabled. However, I am sure that the problem is not caused by my g++ compiler as suggested in the answers given to this earlier question (I tried several g++ compilers here).
I expect to get "-002.050" from the string "delta = -002.050", but I get:
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error
Abort
Assuming you have gcc 4.9 or later (older versions do not ship with a libstdc++ version that supports <regex>), then you can get the desired result by changing your regex to
std::regex Base("[0-9]+(\\.[0-9]+)?");
This will capture the fractional part of the floating point number in the input, along with the decimal point.
There are a couple of problems with your original regex. I think the leading / is an error. And then you're trying to match the entire string by enclosing the regular expression in ^...$, which is clearly not what you want.
Finally, since you only want to match part of the input string, and not the entire thing, you need to use regex_search instead of regex_match.
std::regex Base(R"([0-9]+(\.[0-9]+)?)"); // use raw string literals to avoid
                                         // having to escape backslashes
if (std::regex_search(My_String, Match, Base)) {
    std::ssub_match Sub_Match = Match[1];
    std::string Sub_String = Sub_Match.str();
    std::cout << Sub_String << std::endl;
}
I expect to get "-002.050" from the string "delta = -002.050"
To do that, modify the regex in the example above to
std::regex Base(R"(([+-]{0,1}[0-9]+\.[0-9]+))");
The above will match a single, optional, leading + or - sign.
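The difference between matching the whole string and searching inside it is not specific to C++. As a rough analogue, the sketch below uses Python's re module, where fullmatch() behaves like std::regex_match and search() like std::regex_search. It is an illustration of the idea only, not a replacement for the C++ code, and the sign is written as [+-]? instead of [+-]{0,1}, which is equivalent.

```python
import re

# Optional sign, integer part, mandatory fractional part, all in group 1.
NUMBER = re.compile(r"([+-]?[0-9]+\.[0-9]+)")

line = "delta = -002.050"

# Whole-string match: fails, because "delta = " is not part of the pattern.
whole = NUMBER.fullmatch(line)

# Search: finds the number anywhere in the line.
found = NUMBER.search(line)

# Always check for a match before reading a group.
value = float(found.group(1)) if found else None
print(whole, found.group(1) if found else None, value)
```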
The leading forward slash doesn't look right. Also, it looks like you are trying to match an entire line, due to the leading ^ and trailing $, but I'm not really sure that is what you want. Also, your expression isn't matching the negative sign.
Try this:
std::regex Base("-?[0-9]+(\\.[0-9]+)?$");
I think you are getting an error because the contents of the smatch object
are not valid when nothing has matched.
To avoid this you have to check for a match.
Beyond that a general regex is
# "(?<![-.\\d])(?=[-.\\d]*\\d)(-?\\d*)(\\.\\d*)?(?![-.\\d])"
(?<! [-.\d] ) # Lookbehind, not these chars in behind
# This won't match like -'-3.44'
# Remove if not needed
(?= [-.\d]* \d ) # Lookahead, subject has to contain a digit
# Here, all the parts of a valid number are
# in front, now just define an arbitrary form
# to pick them out.
# Note - the form is all optional, let the engine
# choose what to match.
# -----------------
( -? \d* ) # (1), Required group before decimal, can be empty
( \. \d* )? # (2), Optional group, can be null
# change to (\.\d*) if decimal required
(?! [-.\d] ) # Lookahead, not these chars in front
# This won't match like '3.44'.66
# Remove if not needed
Sample output:
** Grp 0 - ( pos 9 , len 8 )
-002.050
** Grp 1 - ( pos 9 , len 4 )
-002
** Grp 2 - ( pos 13 , len 4 )
.050
-----------------
** Grp 0 - ( pos 28 , len 3 )
.65
** Grp 1 - ( pos 28 , len 0 ) EMPTY
** Grp 2 - ( pos 28 , len 3 )
.65
-----------------
** Grp 0 - ( pos 33 , len 4 )
1.00
** Grp 1 - ( pos 33 , len 1 )
1
** Grp 2 - ( pos 34 , len 3 )
.00
-----------------
** Grp 0 - ( pos 39 , len 4 )
9999
** Grp 1 - ( pos 39 , len 4 )
9999
** Grp 2 - NULL
-----------------
** Grp 0 - ( pos 104 , len 4 )
-99.
** Grp 1 - ( pos 104 , len 3 )
-99
** Grp 2 - ( pos 107 , len 1 )
.
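For experimenting with this general expression, here is a sketch using Python's re module, which supports the lookbehind and lookahead used above. The sample text is invented for the illustration and is not the input that produced the output listed above, so the positions differ.

```python
import re

NUMBER = re.compile(
    r"""
    (?<! [-.\d] )        # not preceded by a sign, dot or digit
    (?= [-.\d]* \d )     # the run ahead must contain at least one digit
    ( -? \d* )           # (1) optional sign and integer digits, may be empty
    ( \. \d* )?          # (2) optional fractional part
    (?! [-.\d] )         # not followed by a sign, dot or digit
    """,
    re.VERBOSE,
)

# Hypothetical sample input.
text = "delta = -002.050, x = .65 and 9999"

for m in NUMBER.finditer(text):
    print(m.start(), repr(m.group(0)), repr(m.group(1)), repr(m.group(2)))
```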

Parse Maven Filename

How can I parse a maven filename into the artifactId and version?
The filenames look like this:
test-file-12.2.2-SNAPSHOT.jar
test-lookup-1.0.16.jar
I need to get
test-file
12.2.2-SNAPSHOT
test-lookup
1.0.16
So the artifactId is the text before the first instance of a dash and a number and the version is the text after the first instance of a number up to .jar.
I could probably do it with split and several loops and checks but it feels like there should be a simpler way.
EDIT:
Actually, the regex wasn't as complicated as I thought!
new File("test").eachFile() { file ->
String fileName = file.name[0..file.name.lastIndexOf('.') - 1]
//Split at the first instance of a dash and a number
def split = fileName.split("-[\\d]")
String artifactId = split[0]
String version = fileName.substring(artifactId.length() + 1, fileName.length())
println(artifactId)
println(version)
}
EDIT2: Hmm. It fails on examples such as this:
http://mvnrepository.com/artifact/org.xhtmlrenderer/core-renderer/R8
core-renderer-R8.jar
Basically it's just this: ^(.+?)-(\d.*?)\.jar$
used in multi-line mode if there is more than one line.
^
( .+? )
-
( \d .*? )
\. jar
$
Output:
** Grp 0 - ( pos 0 , len 29 )
test-file-12.2.2-SNAPSHOT.jar
** Grp 1 - ( pos 0 , len 9 )
test-file
** Grp 2 - ( pos 10 , len 15 )
12.2.2-SNAPSHOT
--------------------------
** Grp 0 - ( pos 31 , len 22 )
test-lookup-1.0.16.jar
** Grp 1 - ( pos 31 , len 11 )
test-lookup
** Grp 2 - ( pos 43 , len 6 )
1.0.16
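A minimal sketch of how this expression could be applied, written in Python instead of the asker's Groovy. The function name parse_maven_filename is made up for the example. Note that the pattern insists on a digit right after the dash, so a name like core-renderer-R8.jar from EDIT2 is not matched by it.

```python
import re

# artifactId is everything before the first "-<digit>", version is the rest up to ".jar".
MAVEN_RE = re.compile(r"^(.+?)-(\d.*?)\.jar$")


def parse_maven_filename(filename):
    """Return (artifactId, version), or None if the name does not fit the pattern."""
    m = MAVEN_RE.match(filename)
    return m.groups() if m else None


for name in ("test-file-12.2.2-SNAPSHOT.jar",
             "test-lookup-1.0.16.jar",
             "core-renderer-R8.jar"):
    print(name, "->", parse_maven_filename(name))
```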

Replacing ( string ) ^ 2 to sqrt ( string ) in Perl

A given string (called $bbb!) contains many operands and operators. I want to replace every occurrence of
muth ( math ) ^ 2 mith to muth sqrt( math ) mith. (whitespace can be more than just one).
EDIT:
Assume that, in the entire expression, there is only either one (simple linear expression) ^ 2 or none --if it makes it easier.
Inclusive Example:
1.2 * ( 4.7 * a * ( b - 0.02 ) ^ 2 * ( b - 0.02 + 1 ) / ( b - 0.0430 ) )
should be changed to:
1.2 * ( 4.7 * a * sqrt( b - 0.02 ) * ( b - 0.02 + 1 ) / ( b - 0.0430 ) )
Well... weird problem...
Try it with this slightly advanced expression
(?<math>\((?:[^()]+|(?&math))*\))\s*\^\s*2
The replacement string must then be sqrt $1
The command in perl would look like
$bbb =~ s/(?<math>\((?:[^()]+|(?&math))*\))\s*\^\s*2/sqrt $1/
A running example can be found here: http://regex101.com/r/qU8dV0/3
Some words on what the heck this is:
The main structure here is anything\s*\^\s*2: it matches anything followed by ^ 2.
(?<math>...) builds a pattern named math.
\(...\) the pattern math must begin with an opening parenthesis and end with a closing one.
Within the parentheses:
[^()]+ anything except parentheses is allowed, or
(?&math) another parenthesized term with the already defined structure is allowed, so the outer pattern math is recursively repeated.
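For engines without recursive patterns, the same effect can be had by counting parentheses by hand. The sketch below is in Python, whose standard re module does not support (?&math), so it swaps the recursion for a backward scan. The function name square_to_sqrt is made up, and the code assumes, as in the question's EDIT, at most one ( ... ) ^ 2 per expression.

```python
import re

# A closing parenthesis followed by "^ 2" (and not by a longer number like 25 or 2.5).
SQUARE_RE = re.compile(r"\)\s*\^\s*2(?![\d.])")


def square_to_sqrt(expr):
    """Rewrite the first '( ... ) ^ 2' as 'sqrt ( ... )', like 's/.../sqrt $1/'."""
    m = SQUARE_RE.search(expr)
    if m is None:
        return expr
    close = m.start()          # index of the ')' that precedes '^ 2'
    depth = 0
    # Walk backwards to find the '(' that balances that ')'.
    for i in range(close, -1, -1):
        if expr[i] == ")":
            depth += 1
        elif expr[i] == "(":
            depth -= 1
            if depth == 0:
                return expr[:i] + "sqrt " + expr[i:close + 1] + expr[m.end():]
    return expr                # unbalanced input: leave it unchanged


print(square_to_sqrt(
    "1.2 * ( 4.7 * a * ( b - 0.02 ) ^ 2 * ( b - 0.02 + 1 ) / ( b - 0.0430 ) )"))
```

Like the Perl substitution, this keeps the original parentheses and only drops the ^ 2 part, so nested terms such as ( a + ( b ) ) ^ 2 are wrapped as a whole.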