What would be the Regular Expression to retrieve number in front of % symbol in string? - regex

I'm using a math parser that uses % as the mod symbol. I'd like to use it for the percent symbol instead to allow users to type "25%*10" or "25% of 10" and receive the answer "2.5" (they could type anything).
I'd then use the regex asked for to get the "25%" (could be in any part of the string) and do a simple calculation (25 / 100) and then replace the 25% in the string with the calculated value to pass to the math parser. Basically I'm doing some pre calculations on the number in front of the percent symbol before passing the full string to the math parser.
I've always struggled with regex and was hoping someone could help me.
Thanks

Easiest way would be to let the math parser do the dividing:
s/([0-9.]+)%/($1\/100)/g
Which just replaces "25%" with "(25/100)" ... so if the user typed "25%*10" they'll now have "(25/100)*10" which when evaluated gives them the right answer.
Alternatively, in Perl, you could do:
s{([0-9.]+)%}{$1/100}eg
This has Perl perform the division by 100 itself and pass the result to the math parser.
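For readers not using Perl, the same substitution can be sketched in Python. This is only an illustration of the idea above: the names `PERCENT` and `expand_percent` are made up for this example, and it assumes plain integer or decimal numbers directly in front of the % sign.

```python
import re

# An integer or decimal number immediately followed by '%'.
PERCENT = re.compile(r"(\d+(?:\.\d+)?)%")


def expand_percent(expr: str) -> str:
    """Rewrite each 'N%' as '(N/100)' so the math parser does the division."""
    return PERCENT.sub(r"(\1/100)", expr)


print(expand_percent("25%*10"))  # (25/100)*10
```

Note that this only handles the "25%*10" form. Turning "25% of 10" into a multiplication would need a separate rewrite of the word "of".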

The regular expression to find a number before a '%' sign is /(\d+)%/.
However, mathematical expressions are not a regular language, so a regular expression is likely not the right tool. You should instead tell your parser to interpret '%' as "take whatever comes before this and divide it by 100". "Whatever comes before this" can then be any mathematical expression, and you won't run into operator precedence problems; any that remain can be sorted out in the syntax tree parser.
Mixing text substitution with real language features often violates the principle of least astonishment. When a user sees that '%' divides whatever comes before by 100, he might try to use expressions like (23+42)%, but that will just produce a syntax error. Also, you need a more elaborate regex if you have something like 1.34e-14%, but these things would just sort themselves out when you use the tree parser.
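To give an idea of what "a more elaborate regex" might look like, here is a hedged Python sketch that also accepts a leading-dot decimal and an exponent. The name `NUMBER_PERCENT` is invented for this example, and the pattern is one possible choice rather than a complete number grammar. It still cannot handle a parenthesised expression such as (23+42)%, which is exactly the limitation described above.

```python
import re

# Digits with optional fraction, or a leading-dot fraction,
# followed by an optional exponent, then the '%' sign.
NUMBER_PERCENT = re.compile(r"((?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)%")

match = NUMBER_PERCENT.search("1.34e-14%")
print(match.group(1))  # 1.34e-14

# The closing parenthesis sits between the number and '%', so nothing matches.
print(NUMBER_PERCENT.search("(23+42)%"))  # None
```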

Related

Are long regular expressions worse than short ones?

I was trying to learn about regular expressions for a project where I want to create a TextMate grammar. Regexes seem relatively simple but are really hard for me to read, so I tried to create a utility module that could generate them. It works more or less as intended and generates regular expressions that actually work, all aliased by easy-to-understand names.
for example:
struc_enum = OrGroup("struct", "enum")
whitespace = TAB_SPACE.at_least(1)
results in:
(?:struct|enum)
[ \t]+
in this case, there's not much benefit in using python aliases but then I can do:
valid_name = r"\b" + Group(ALPHA, ALPHANUMERIC.repeated())
struc_enum = OrGroup("struct", "enum")
typed_name = (struc_enum + whitespace).optional() + valid_name + whitespace + valid_name.captured()
and this is what print(typed_name) displays:
(?:(?:(?:struct|enum)[ \t]+)?\b[a-zA-Z][a-zA-Z\d]*[ \t]+(\b[a-zA-Z][a-zA-Z\d]*))
This method can be used to create small snippets and concatenate them to construct more complex patterns, but for each level of concatenation the expression grows exponentially large, such that I could easily get at this point:
(?:(func)[\s]+([a-zA-Z_]+[a-zA-Z\d_]*)[\s]*\([\s]*(?:[a-zA-Z_]+[a-zA-Z\d_]*(?:[\s]*[a-zA-Z_]+[a-zA-Z\d_]*[*]{,2})?(?:[\s]*,[\s]*[a-zA-Z_]+[a-zA-Z\d_]*(?:[\s]*[a-zA-Z_]+[a-zA-Z\d_]*[*]{,2})?)*[\s]*)?\))
In an atom grammar this big regex above can match lines like this, but it doesn't seem to work elsewhere:
func myfunc(asd asd*, asd*, asdasd)
func do_foo01(type arg1, int arg2)
With enough patience, a human might construct an equivalent but probably much shorter expression, which raises the question: are big regular expressions worse or better than the equivalent shorter ones in terms of computational overhead? At which point can we consider regexes too big?
Since the original problem you set out to solve is that long regular expressions are difficult to read, you may wish to consider extended (verbose) regular expressions. Extended regular expressions allow whitespace and comments, which can make a regular expression much easier to read.
Contrast this regular expression:
charref = re.compile("&#(0[0-7]+"
"|[0-9]+"
"|x[0-9a-fA-F]+);")
with the same regular expression, with comments:
charref = re.compile(r"""
&[#] # Start of a numeric entity reference
(
0[0-7]+ # Octal form
| [0-9]+ # Decimal form
| x[0-9a-fA-F]+ # Hexadecimal form
)
; # Trailing semicolon
""", re.VERBOSE)
Example taken from Regular Expression HOWTO
I think this is a fine idea, but you need to be clear with yourself about the scale of the project you're undertaking.
We almost never need to use regular expressions; we could take apart every string and write our own parsing operations using starts_with and ifs etc. But regex syntax is a mature, powerful system that lets us succinctly express certain kinds of logic.
Often regexes are hard to read. There are some tools that can help, but the idea of a less succinct system for doing the stuff we currently do with regexes is sound. The hard part will be replicating the breadth, power, and reliability of existing regex systems.
My guess is you'll be best served by learning to tolerate the density of regular expressions. Possibly we might be better served by you building an easier-to-read system for string parsing, but you'll have about 20 years of catching up to do.
Regarding performance: Regexes are (can be) compiled. Depending on the context, this can have a big performance benefit.
Anyway, as with any sufficiently powerful language, the length of the instruction is a poor indicator of its run-time complexity.

Is there an efficient way to find a string fulfilling given regex?

Let's say I've got such regex (python notation) r'^namespace/(\w+)/([0-9]+)/', is there a way to reverse this regex and find a string fulfilling it?
By reversing I don't mean manual constructing 'namespace/' + 'a_1' + '/' + '1', but systematic way to reverse any regular expression consisting of some special characters. So that for every regex I can generate (any) string fulfilling it.
The only thing that comes to my mind is to parse the given regex with some other regexs, but it does not seem acceptable solution. Although I expect the whole operation to have huge complexity, I still look for at least a bit more sophisticated way to do it.
The only thing that comes to my mind is to parse the given regex with some other regexs, but it does not seem acceptable solution
You don't need to parse the regex with regexes, but yes you will need to parse it. When you have an AST of the regular expression, you can easily traverse that and build a possible match in linear time (for plain regular expression, nothing too fancy like lookaround).
Check Enumerating Regular Languages for an example code and continuative links.
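As a rough illustration of the AST traversal described above, here is a minimal Python sketch. It does not parse regex syntax at all: the node classes (`Lit`, `CharClass`, `Seq`, `Alt`, `Repeat`) and the `sample` function are hypothetical names made up for this example, and the tree for the question's pattern is built by hand. The point is only that once you have a tree, producing one matching string is a simple recursive walk.

```python
import re
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass
class Lit:
    text: str  # literal text that must appear as-is


@dataclass
class CharClass:
    chars: str  # any single one of these characters matches


@dataclass
class Seq:
    parts: Tuple["Node", ...]  # sub-patterns matched one after another


@dataclass
class Alt:
    options: Tuple["Node", ...]  # any one option may match


@dataclass
class Repeat:
    node: "Node"
    min_count: int  # lower bound; the upper bound does not matter here


Node = Union[Lit, CharClass, Seq, Alt, Repeat]


def sample(node: Node) -> str:
    """Return one string matched by the tree, always taking the simplest choice."""
    if isinstance(node, Lit):
        return node.text
    if isinstance(node, CharClass):
        return node.chars[0]
    if isinstance(node, Seq):
        return "".join(sample(part) for part in node.parts)
    if isinstance(node, Alt):
        return sample(node.options[0])
    if isinstance(node, Repeat):
        return sample(node.node) * node.min_count
    raise TypeError(f"unknown node: {node!r}")


# Hand-built tree for r'^namespace/(\w+)/([0-9]+)/' (anchors and groups omitted).
tree = Seq((
    Lit("namespace/"),
    Repeat(CharClass("abcdefghijklmnopqrstuvwxyz_0123456789"), 1),
    Lit("/"),
    Repeat(CharClass("0123456789"), 1),
    Lit("/"),
))

example = sample(tree)
print(example)  # namespace/a/0/
print(bool(re.match(r"^namespace/(\w+)/([0-9]+)/", example)))  # True
```

Lookaround, backreferences and similar features would not fit this simple scheme, which matches the caveat in the answer.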

Regular Expression Vs. String Parsing

At the risk of opening a can of worms and getting negative votes, I find myself needing to ask:
When should I use Regular Expressions and when is it more appropriate to use String Parsing?
And I'm going to need examples and reasoning as to your stance. I'd like you to address things like readability, maintainability, scaling, and probably most of all performance in your answer.
I found another question Here that only had 1 answer that even bothered giving an example. I need more to understand this.
I'm currently playing around in C++ but Regular Expressions are in almost every Higher Level language and I'd like to know how different languages use/ handle regular expressions also but that's more an after thought.
Thanks for the help in understanding it!
Edit: I'm still looking for more examples and talk on this but the response so far has been great. :)
It depends on how complex the language you're dealing with is.
Splitting
This is great when it works, but only works when there are no escaping conventions.
It does not work for CSV for example because commas inside quoted strings are not proper split points.
foo,bar,baz
can be split, but
foo,"bar,baz"
cannot.
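The difference is easy to see in Python. The sketch below compares a naive split with the standard library's `csv` module, which understands the quoting convention; the variable names are only for illustration.

```python
import csv
import io

line = 'foo,"bar,baz"'

# Naive splitting cuts inside the quoted field.
naive = line.split(",")
print(naive)  # ['foo', '"bar', 'baz"']

# A CSV-aware reader treats the quoted comma as part of the field.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['foo', 'bar,baz']
```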
Regular
Regular expressions are great for simple languages that have a "regular grammar". Perl 5 regular expressions are a little more powerful due to back-references but the general rule of thumb is this:
If you need to match brackets ((...), [...]) or other nesting like HTML tags, then regular expressions by themselves are not sufficient.
You can use regular expressions to break a string into a known number of chunks -- for example, pulling out the month/day/year from a date. They are the wrong tool for parsing complicated arithmetic expressions though.
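A short Python sketch of the date case, assuming a month/day/year layout separated by slashes. The names `DATE` and `parse_date` are invented for this example, and it does not check that the values form a real calendar date.

```python
import re

# One or two digits for month and day, four digits for the year.
DATE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")


def parse_date(text):
    """Return (month, day, year) as integers, or None if the shape is wrong."""
    match = DATE.fullmatch(text)
    if match is None:
        return None
    month, day, year = (int(group) for group in match.groups())
    return month, day, year


print(parse_date("12/25/2023"))  # (12, 25, 2023)
print(parse_date("2023-12-25"))  # None
```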
Obviously, if you write a regular expression, walk away for a cup of coffee, come back, and can't easily understand what you just wrote, then you should look for a clearer way to express what you're doing. Email addresses are probably at the limit of what one can correctly & readably handle using regular expressions.
Context free
Parser generators and hand-coded pushdown/PEG parsers are great for dealing with more complicated input where you need to handle nesting so you can build a tree or deal with operator precedence or associativity.
Context free parsers often use regular expressions to first break the input into chunks (spaces, identifiers, punctuation, quoted strings) and then use a grammar to turn that stream of chunks into a tree form.
The rule of thumb for CF grammars is
If regular expressions are insufficient but all words in the language have the same meaning regardless of prior declarations then CF works.
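To make the "regex for chunks, grammar for structure" split concrete, here is a small Python sketch: a regular expression tokenizes the input, and a hand-written recursive descent parser handles nesting and precedence. All names (`TOKEN`, `tokenize`, `evaluate`) are made up for this illustration, and it only supports non-negative numbers, + - * / and parentheses.

```python
import re

# Lexer: numbers and single-character operators or parentheses.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[-+*/()]")


def tokenize(text):
    tokens = TOKEN.findall(text)
    # Reject input containing characters the lexer silently skipped.
    if "".join(tokens) != "".join(text.split()):
        raise SyntaxError("unexpected character in input")
    return tokens


def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():  # expr := term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():  # term := factor (('*' | '/') factor)*
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():  # factor := NUMBER | '(' expr ')'
        token = peek()
        if token == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        if token is None or token in "+-*/)":
            raise SyntaxError("expected a number or '('")
        return float(advance())

    result = expr()
    if peek() is not None:
        raise SyntaxError("unexpected trailing input")
    return result


print(evaluate("2+3*4"))    # 14.0
print(evaluate("(2+3)*4"))  # 20.0
```

The recursion in `factor` is what a plain regular expression cannot express: each opening parenthesis starts a fresh nested expression of arbitrary depth.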
Non context free
If words in your language change meaning depending on context, then you need a more complicated solution. These are almost always hand-coded solutions.
For example, in C,
#ifdef X
typedef int foo;
#endif
foo * bar;
If foo is a type, then foo * bar is the declaration of a foo pointer named bar. Otherwise it is a multiplication of a variable named foo by a variable named bar.
It should be Regular Expression AND String Parsing..
You can use both of them to your advantage! Programmers often try to write a SINGLE regular expression to parse a text and then find it very difficult to maintain. You should use each as and when required.
The regex engine is FAST. A simple match can take less than a microsecond. But it's not recommended for parsing HTML.

How to write the regex for this expression

I want to match strings like the one below. I assume the input contains only the right elements; whether it can actually be evaluated is left to the evaluator!
1+(2-3)*(4/5)
what is the regex for matching this, something like this: ([0-9\+-\*/\(\)]+)? but this seems not working.
If you are only asking for a character validation, you can use
^[0-9+*/()-]*$
Most characters don't need escaping inside a character class (inside square brackets). If you need to include a hyphen, put it at the start or the end of the class (or escape it); otherwise it is treated as the character range operator.
That said, keep in mind this will only guarantee you that you have no other characters. It will NOT validate the structure (regexes are not the right tool for that). However, since you stated an evaluator will then process the input, that might be right for you.
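In Python, the character check above might look like the sketch below. The names `ALLOWED` and `has_only_allowed_chars` are invented for this example. It uses `fullmatch` instead of the ^ and $ anchors, which has the same intent.

```python
import re

# Digits, the four operators and parentheses; the hyphen is last in the class.
ALLOWED = re.compile(r"[0-9+*/()-]*")


def has_only_allowed_chars(text):
    """True if every character is a digit, an operator or a parenthesis."""
    return ALLOWED.fullmatch(text) is not None


print(has_only_allowed_chars("1+(2-3)*(4/5)"))  # True
print(has_only_allowed_chars("1+a"))            # False
print(has_only_allowed_chars("((+"))            # True: structure is not checked
```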
You can't; this is not a regular language. Some regexp implementations do provide additional features to match balanced parentheses, though.
Regular expressions can not match arbitrary arithmetic formulas. Regexps only describe regular languages, while arithmetic formulas use a recursive grammar. See http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory
A regex may be possible if you limit nesting depth, but if you want it all the way, with matching bracket detection, it will probably be very, very complicated.
If you want to match "1+(2-3)*(4/5)", then you can use this regular expression.
/1\+\(2-3\)\*\(4\/5\)/
What's that? That doesn't tell you what you want to know? Well, then what do you want to know? What information are you trying to extract from the string?
You can't just say "strings like this". Your question is not nearly enough clear.
If your question is how to check whether an equation is valid, then you will need a parser to tokenize the expression and then a grammar to evaluate whether the expression is right.
You can't check whether the equation has balanced parentheses using a regex. This is because a regular expression is equivalent to a deterministic finite automaton. Since the automaton is finite, you will never have an automaton big enough to check arbitrarily deep parentheses.
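By contrast, the check is trivial once you allow a single unbounded counter, which is exactly the memory a finite automaton lacks. A minimal Python sketch (the function name `is_balanced` is just for illustration):

```python
def is_balanced(text):
    """Check that every ')' closes an earlier '(' and nothing stays open."""
    depth = 0  # number of currently open parentheses
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


print(is_balanced("1+(2-3)*(4/5)"))  # True
print(is_balanced("(1+2"))           # False
print(is_balanced(")("))             # False
```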

In "aa67bc54c9", is there any way to print "aa" 67 times, "bc" 54 times and so on, using regular expressions?

I was asked this question in an interview for an internship, and the first solution I suggested was to try and use a regular expression (I usually am a little stumped in interviews). Something like this
(?P<str>[a-zA-Z]+)(?P<n>[0-9]+)
I thought it would match the strings and store them in the variable "str" and the numbers in the variable "n". How, I was not sure of.
So it matches strings of type "a1b2c3", but a problem here is that it also matches strings of type "a1b". Could anyone suggest a solution to deal with this problem?
Also, is there any other regular expression that could solve this problem?
Do you know why "regular expressions" are called "regular"? :-)
That would take too long to explain in full, so I'll just outline it. To match a pattern (i.e. decide whether a given string is "valid" or "invalid"), a theoretical computer scientist would use a finite state automaton. That's an abstract machine that has a finite number of states; each tick it reads a char from the input and jumps to another state. The pattern of where to jump from a particular state when a particular character is read is fixed. Some states are marked as "OK", some as "FAIL", so that by examining the state of the machine you can check whether your text is "valid" (e.g. a valid e-mail).
For example, this machine only accepts "nice" as its "valid" word (a pic from Wikipedia):
A set of "valid" words such a machine theoretically can distinguish from invalid is called "regular language". Not every set is a regular language: for example, finite state automata are incapable of checking whether parentheses in string are balanced.
But constructing state machines was a complex task, compared to the complexity of defining what "valid" is. So the mathematicians (mainly S. Kleene) noted that every regular language could be described with a "regular expression". They had *s and |s and were the prototypes of what we know as regexps now.
What does it have to do with the problem? The problem in subject is essentially non-regular. It can't be expressed with anything that works like a finite automaton.
The essence is that it would need a memory cell capable of holding an arbitrary number (the repetition count in your case). Finite automata and classical regular expressions cannot do this.
However, modern regexps are more expressive and are said to be able to check balanced parentheses! But this may serve as a good example that you shouldn't use regexps for tasks they don't suit. Let alone that it contains code snippets; this makes the expression far from being "regular".
Answering the initial question, you can't solve your problem using only something "regular". However, regexps could aid you in solving this problem, as in tster's answer.
Perhaps, I should look closer to tster's answer (do a "+1" there, please!) and show why it's not the "regular expression" solution. One may think that it is, it just contains print statement (not essential) and a loop--and loop concept is compatible with finite state automaton expressive power. But there is one more elusive thing:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1
x # <--- this one
$2;
}
The task of reading a string and a number and printing that string the given number of times, where the number is an arbitrary integer, cannot be done on a finite state machine without additional memory. You use a memory cell to keep that number, decrease it, and check whether it is still greater than zero. But this number may be arbitrarily big, and that contradicts the finite memory available to the finite state machine.
However, there's nothing wrong with classical pattern /([abc]*){5}/ that matches something "regular" repeated fixed number of times. We essentially have states that correspond to "matched pattern once", "matched pattern twice" ... "matched pattern 5 times". There's finite number of them, and that's the gist of the difference.
how about:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1 x $2;
}
Answering your question directly:
No, regular expressions match text and don't print anything, so there is no way to do it solely using regular expressions.
The regular expression you gave will match one string/number pair; you can then print that repeatedly using an appropriate mechanism. The Perl solution from @tster is about as compact as it gets. (It doesn't use the names that you applied in your regex; I'm pretty sure that doesn't matter.)
The remaining details depend on your implementation language.
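As one example of "an appropriate mechanism", here is a hedged Python sketch that reuses the named groups from the question. The names `PAIR`, `WHOLE` and `expand` are invented for this illustration. The extra `WHOLE` pattern is one way to address the "a1b" concern: it rejects input that is not made up entirely of letter/number pairs.

```python
import re

# One letters-then-digits pair, with the group names from the question.
PAIR = re.compile(r"(?P<str>[a-zA-Z]+)(?P<n>[0-9]+)")
# The whole input must consist of such pairs and nothing else.
WHOLE = re.compile(r"(?:[a-zA-Z]+[0-9]+)+")


def expand(text):
    """Repeat each letter run by the number that follows it."""
    if WHOLE.fullmatch(text) is None:
        raise ValueError(f"not a sequence of letter/number pairs: {text!r}")
    # The regex only finds the pairs; the repetition is ordinary Python.
    return "".join(m.group("str") * int(m.group("n")) for m in PAIR.finditer(text))


print(expand("a2bc3"))  # aabcbcbc
```

Calling `expand("a1b")` raises `ValueError` because of the trailing "b" without a count.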
Nope, this is your basic 'trick question' - no matter how you answer it that answer is wrong unless you have exactly the answer the interviewer was trained to parrot. See the workup of the issue given by Pavel Shved - note that all invocations have 'not' as a common condition, the tool just keeps sliding: Even when it changes state there is no counter in that state
I have a rather advanced book by Kenneth C Louden who is a college prof on the matter, in which it is stated that the issue at hand is codified as "Regex's can't count." The obvious answer to the question seems to me at the moment to be using the lookahead feature of Regex's ...
Probably depends on what build of what brand of regex the interviewer is using, which probably depends of flight-dynamics of Golf Balls.
Nice answers so far. Regular expressions alone are generally thought of as a way to match patterns, not generate output in the manner you mentioned.
Having said that, there is a way to use regex as part of the solution. @Jonathan Leffler made a good point in his comment to tster's reply: "... maybe you need a better regex library in your language."
Depending on your language of choice and the library available, it is possible to pull this off. Using C# and .NET, for example, this could be achieved via the Regex.Replace method. However, the solution is not 100% regex since it still relies on other classes and methods (StringBuilder, String.Join, and Enumerable.Repeat) as shown below:
string input = "aa67bc54c9";
string pattern = @"([a-z]+)(\d+)";
string result = Regex.Replace(input, pattern, m =>
// can be achieved using StringBuilder or String.Join/Enumerable.Repeat
// don't use both
//new StringBuilder().Insert(0, m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToString()
String.Join("", Enumerable.Repeat(m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToArray())
+ Environment.NewLine // comment out to prevent line breaks
);
Console.WriteLine(result);
A clearer solution would be to identify the matches, loop over them and insert them using the StringBuilder rather than rely on Regex.Replace. Other languages may have compact idioms to handle the string multiplication that doesn't rely on other library classes.
To answer the interview question, I would reply with, "it's possible, however the solution would not be a stand-alone 100% regex approach and would rely on other language features and/or libraries to handle the generation aspect of the question since the regex alone is helpful in matching patterns, not generating them."
And based on the other responses here you could beef up that answer further if needed.