Calculator which can take words as input

I want to write a calculator which can take words as input.
For example, "two plus five multiply with 7" should give 37 as output.
I won't lie, this is homework, so before starting I wanted to ask whether there is something useful for this kind of problem that I am not aware of.
An approach for how to do this would be fine too, I guess. It has to be written in C++. No other language would be accepted.
Thanks.
[Edit] -- Thanks for the answers. This is an introductory course. So keeping things as simple as possible would be appreciated. I should have mentioned this earlier.
[Edit 2] -- Reached a stage where I can give input in numbers and get correct output with precedence and all. Just want to see how to convert word to number now. Thanks to everybody who is trying to help. This is why SO rocks.

As long as the acceptable input is strict enough, writing a recursive descent parser should be easy. The grammar for this shouldn't differ much from a grammar for a simple calculator that only accepts digits and symbols.
With a std::istream and the extraction operator (>>) you can easily break the input into separate words.
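As a minimal sketch of that word-splitting step (the function name split_words is made up for illustration), std::istringstream with >> reads one whitespace-separated word at a time:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a line into whitespace-separated words.
std::vector<std::string> split_words(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {      // operator>> skips whitespace between words
        words.push_back(word);
    }
    return words;
}
```

The same loop works directly on std::cin if you read from the keyboard instead of a string.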
Here's a lecture that shows how to do this for a regular calculator with digits and symbols.
The main difference from that in your case is parsing the numbers. Instead of the symbols '0-9', your input will have words like "zero", "one", "nine", "hundreds", "thousands", etc. To parse such a stream of words you can do something like this:
Break the words into groups, nested by the "multipliers", like "thousands", "hundreds", "billions", etc; these groups can nest. Tens ("ten", "twenty", "thirty"), teens ("eleven", "twelve", "thirteen", ...), and units ("one", "two", "three", ...) don't nest anything. You turn something like "one hundred and three thousands two hundred and ninety seven" into something like this:

        +--------+---------+-----+
       /         |         |      \
  thousand    hundred   ninety   seven
   /    \        |
hundred three   two
   |
  one

For each group, recursively sum all its components, and then multiply it by its "multiplier":

                103297
        +--------+------+----+
       /         |      |     \
  (* 1000)  + (* 100) + 90 + 7
   /    \        |
(* 100) + 3      2
   |
   1
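Here is a rough sketch of that idea in C++. It does not build the tree explicitly. Instead it keeps a running sum for the current group and multiplies it when a multiplier word arrives, which gives the same result for well-formed input. The function name words_to_number and the word tables are my own assumptions, and it does not validate word order.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert English number words to an integer.
long long words_to_number(const std::string& text) {
    static const std::map<std::string, long long> small = {
        {"zero", 0}, {"one", 1}, {"two", 2}, {"three", 3}, {"four", 4},
        {"five", 5}, {"six", 6}, {"seven", 7}, {"eight", 8}, {"nine", 9},
        {"ten", 10}, {"eleven", 11}, {"twelve", 12}, {"thirteen", 13},
        {"fourteen", 14}, {"fifteen", 15}, {"sixteen", 16},
        {"seventeen", 17}, {"eighteen", 18}, {"nineteen", 19},
        {"twenty", 20}, {"thirty", 30}, {"forty", 40}, {"fifty", 50},
        {"sixty", 60}, {"seventy", 70}, {"eighty", 80}, {"ninety", 90}};
    static const std::map<std::string, long long> big = {
        {"thousand", 1000}, {"million", 1000000}, {"billion", 1000000000}};

    std::istringstream in(text);
    std::string word;
    long long total = 0;  // sum of the finished groups
    long long group = 0;  // value of the group being read
    while (in >> word) {
        if (word == "and") continue;
        // Accept plurals such as "thousands"; no table entry ends in 's'.
        if (word.back() == 's') word.pop_back();
        auto s = small.find(word);
        if (s != small.end()) {
            group += s->second;            // units, teens and tens just add up
        } else if (word == "hundred") {
            group *= 100;                  // multiplies what was read so far
        } else {
            auto b = big.find(word);
            if (b == big.end())
                throw std::invalid_argument("unknown number word: " + word);
            total += group * b->second;    // close the group with its multiplier
            group = 0;
        }
    }
    return total + group;
}
```

For example, words_to_number("one hundred and three thousands two hundred and ninety seven") returns 103297.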

This was some of my favorite stuff in school.
The right way to do this is to implement it as a parser -- write a grammar, tokenize your input, and parse away. Then you can evaluate it recursively.
If none of that sounds familiar, though (i.e. this is an introductory class) then you can hack it together in a much less robust way -- just do it how you'd do it naturally. Convert the sentence to numbers and operations, and then just do the operations.

NOTE: I'm assuming this is an introductory level course, and not a course in compiler theory. If you're being taught about specific things relevant to this (e.g. algorithms other than what I mention), you will almost certainly be expected to apply those concepts.
First, you'll have to understand the individual words. For this purpose, you can likely just hack it together - read one word at a time and try to understand that. Gradually build up a function which can read the set of numbers you need to be able to work with.
If the input is simple enough (only expressions of the basic form you provide, no parentheses or anything), you can simply alternate between reading a number and an operator until the input is fully read (but if this is supposed to be robust, you need to stop and display an error if the last thing you read is an operator), so you can write separate methods for numbers and operators.
To understand how the expression needs to be calculated, use the shunting yard algorithm to parse the expression with proper operator precedence. You can likely combine the initial parsing of the words with this by simply using that to supply the tokens to the algorithm.
To actually calculate the result, you'll need to evaluate the parsed expression. To do that, you can simply use a stack for the output: when you would normally push an operator, you instead pop 2 values, apply the operator to them, and push the result. After the shunting yard algorithm is complete, there will be one value on the stack, containing the final result.
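A minimal sketch of the combined approach, assuming the words have already been turned into whitespace-separated tokens like "2 + 5 * 7" and that there are no parentheses. The names precedence, apply_top and evaluate are made up for this example.

```cpp
#include <sstream>
#include <stack>
#include <stdexcept>
#include <string>

// Precedence of a binary operator; higher binds tighter.
int precedence(char op) { return (op == '*' || op == '/') ? 2 : 1; }

// Pop two values, apply the operator to them, and push the result.
void apply_top(std::stack<double>& values, char op) {
    if (values.size() < 2) throw std::invalid_argument("missing operand");
    double rhs = values.top(); values.pop();
    double lhs = values.top(); values.pop();
    switch (op) {
        case '+': values.push(lhs + rhs); break;
        case '-': values.push(lhs - rhs); break;
        case '*': values.push(lhs * rhs); break;
        case '/': values.push(lhs / rhs); break;
        default: throw std::invalid_argument("unknown operator");
    }
}

// Shunting yard that evaluates instead of emitting output tokens.
double evaluate(const std::string& expr) {
    std::istringstream in(expr);
    std::stack<double> values;
    std::stack<char> ops;
    std::string tok;
    while (in >> tok) {
        bool is_op = tok.size() == 1 &&
                     std::string("+-*/").find(tok[0]) != std::string::npos;
        if (is_op) {
            // Left-associative: first apply pending operators that bind
            // at least as tightly as the new one.
            while (!ops.empty() && precedence(ops.top()) >= precedence(tok[0])) {
                apply_top(values, ops.top());
                ops.pop();
            }
            ops.push(tok[0]);
        } else {
            values.push(std::stod(tok));  // throws on a non-numeric token
        }
    }
    while (!ops.empty()) {
        apply_top(values, ops.top());
        ops.pop();
    }
    if (values.size() != 1) throw std::invalid_argument("malformed expression");
    return values.top();
}
```

evaluate("2 + 5 * 7") gives 37, and an input ending in an operator such as "2 +" throws instead of returning a wrong answer.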

Related

Function to count an expression

Is there any function in C++ (included in some library or header file), that would count an expression from string?
Let's say we have a string which equals 2 + 3 * 8 - 5 (but it's taken from the user's keyboard, so we don't know exactly what expression it will be while writing the code), and we want this function to count it, but of course in the correct order (1. power / root 2. multiply / divide 3. add / subtract).
I've tried to take all the numbers to an array of ints and operators to an array of chars (okay, actually vectors, because I don't know how many numbers and operators it's going to contain), but I'm not sure what to do next.
Note, that I'm asking if there's any function already written for that, if not I'm going to just try it again.

By count, I'm taking that to mean "evaluate the expression".
You'll have to use a parser generator like boost::spirit to do this properly. If you try hand-writing this I guarantee pain and misery for all involved.
Try looking on here for calculator apps, there are several:
http://boost-spirit.com/repository/applications/show_contents.php
There are also some simple calculator-style grammars in the boost::spirit examples.

Quite surprisingly, nobody before me has pointed you to the Shunting Yard algorithm, which converts your string into Reverse Polish Notation (RPN). Computing expressions in RPN is easy; the harder part is implementing Shunting Yard, but that has been done many times before and you can find many tutorials on the subject.
Quick overview of the algorithms:
RPN - RPN is a way to write expressions that eliminates the need for parentheses, which makes them easier to compute. Practically speaking, to compute such an expression you walk the string from left to right while keeping a stack of operands. Whenever you meet an operand, you push it onto the stack. Whenever you meet an operation token, you calculate its result on the last 2 operands (if the operation is binary of course, or on the last one only if it is unary) and push the result onto the stack. Rinse and repeat until the end of the string.
Shunting Yard - this one is much harder to summarize, and if I tried, this answer would end up looking much like the Wikipedia article I linked above, so I'll save us both the trouble.
TL;DR: go read the links in the first sentence.
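To make the RPN half concrete, here is a small sketch of the stack walk described above. It assumes whitespace-separated tokens and binary operators only, and eval_rpn is just a name I picked.

```cpp
#include <sstream>
#include <stack>
#include <stdexcept>
#include <string>

// Evaluate an expression already written in Reverse Polish Notation.
double eval_rpn(const std::string& rpn) {
    std::istringstream in(rpn);
    std::stack<double> operands;
    std::string tok;
    while (in >> tok) {
        if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
            if (operands.size() < 2)
                throw std::invalid_argument("too few operands");
            // The operand pushed last is the right-hand side.
            double rhs = operands.top(); operands.pop();
            double lhs = operands.top(); operands.pop();
            if (tok == "+") operands.push(lhs + rhs);
            else if (tok == "-") operands.push(lhs - rhs);
            else if (tok == "*") operands.push(lhs * rhs);
            else operands.push(lhs / rhs);
        } else {
            operands.push(std::stod(tok));  // operand: push it on the stack
        }
    }
    if (operands.size() != 1)
        throw std::invalid_argument("malformed expression");
    return operands.top();
}
```

The expression 2 + 3 * 8 - 5 from the question is "2 3 8 * + 5 -" in RPN, and eval_rpn("2 3 8 * + 5 -") returns 21.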

You will have to do this yourself, as the standard library has nothing that evaluates infix expressions.
If you do, please include the code for what you are trying and we can help you.

Collecting disjoint like terms in polynomial expressions

This is my first post on here so thanks in advance and bear with me if I do not follow the format guide.
My problem is as follows: I have several polynomial expressions in the variable "s" up to degree 10. Each of the coefficients is then a function of up to 10 other variables. Overall the code for all of the coefficient functions takes up about 800 text-wrapped lines of code with single coefficients having up to 40 lines of code. I am writing an optimization routine in C++ that tries to determine the optimal values for each of the 10 variables that the coefficients depend on.
Profiling my code, I see that I am spending 78% of my time in this one function. To optimize, I would like to search the whole code and find redundant calculations, compute them at the start of the routine and replace all of their occurrences with the previously calculated expression. The problem is that the most frequently occurring expressions may be something like:
a0 = ... + R1*R2*G1*R3 + R1*R2*H1*R3 + ...;
I would like to find a way to search through the lines and sort out the R1*R2*R3 terms to replace them with something like X where X = R1*R2*R3; is declared at the start of the code. These recurring expressions may occur several hundred times throughout the code, so I am sure that this can drastically improve my run time. Additionally, I can only group things separated by multiplication, not addition.
Basically, I need a replace string function that can find disjoint strings whose member terms are separated by other terms and * signs but not + signs. This may be a tall order, or incredibly simple, I am really not sure.
I have Mathematica, MATLAB, and Maple available to me and run Debian, so I can download something if it is open source that may be more helpful. I typically use Emacs for my programming, though I am by no means proficient with all of its functionality. I am open to any suggestions, and greatly appreciate your help.

Using the standard coreutils, you can do the following:
cat filename.cc | tr " +" "\n\n" | grep "*" | sort | uniq -c
In plain English this translates to: read the file and convert all spaces and pluses into newlines. Next, keep only the lines containing a multiplication, sort them, and display the unique occurrences with their frequency.
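Since the pipeline only counts whole product terms, a variation that may get closer to the "disjoint" factors in the question is to count how often each pair of factors is multiplied together. The following is only a sketch under strong assumptions: the input is a plain sum of products with no parentheses, subtraction or division, and the names split and count_factor_pairs are invented for the example.

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split text on sep, removing spaces and dropping empty pieces.
std::vector<std::string> split(const std::string& text, char sep) {
    std::vector<std::string> parts;
    std::istringstream in(text);
    std::string piece;
    while (std::getline(in, piece, sep)) {
        piece.erase(std::remove(piece.begin(), piece.end(), ' '), piece.end());
        if (!piece.empty()) parts.push_back(piece);
    }
    return parts;
}

// For every '+'-separated term, count each pair of '*'-separated factors.
std::map<std::pair<std::string, std::string>, int>
count_factor_pairs(const std::string& expr) {
    std::map<std::pair<std::string, std::string>, int> counts;
    for (const std::string& term : split(expr, '+')) {
        std::vector<std::string> factors = split(term, '*');
        // Sorting makes R1*R2 and R2*R1 count as the same pair.
        std::sort(factors.begin(), factors.end());
        for (std::size_t i = 0; i < factors.size(); ++i)
            for (std::size_t j = i + 1; j < factors.size(); ++j)
                ++counts[{factors[i], factors[j]}];
    }
    return counts;
}
```

For "R1*R2*G1*R3 + R1*R2*H1*R3" the pairs (R1,R2), (R1,R3) and (R2,R3) each get a count of 2, which hints that R1*R2*R3 is worth precomputing.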

Lingua::EN::FindNumber numify adding found english numerics

I was looking for a way to convert English numerics to integers and found a great post here: Scalable Regex for English Numerals which is using perl. My issue with using numify stems from the method "adding" numbers together rather than just outputting them. For example:
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::FindNumber;
print numify("some text and stuff house bill forty three twenty");
produces 63 rather than what I expected, which was 43 20.
I am at a loss, being a Perl newbie, on how to get around this. Is there an override I can use to tell the methods not to do addition? My only guess is that it's concatenating the string and, since the parts are integers, it adds them? Even knowing that sadly still doesn't help me. Thanks to anyone in the know.

I think your problem here has to do with an ambiguous definition of how a number should be interpreted.
If numify merely checks for words that represent numbers in a sequence and adds them, then there's no way you can overcome this. You can try to implement your own grammar, but I don't think it's completely trivial.
You'd have to catch the first word representing a number, then check the following words and try to find a match to your rule. For instance, after "forty", you can have a number from 1 to 9 (one, two, etc.), or "thousand", or "million"... I think you get the idea. In this case, you get "three", so you add them up. The next word is "twenty", which doesn't match any rule above, so you start over with a new number.
Sorry if this seems like I'm just thinking out loud. I don't know if there's a library that can do this for you; it's an ambiguous problem, as usual when you're parsing natural language.
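To illustrate the rule (in C++ rather than Perl, and only for numbers below 100), here is a rough sketch. It is not how Lingua::EN::FindNumber works; find_numbers and its word tables are my own invention.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Extract numbers below 100 from text. A new number starts whenever the
// next number word cannot extend the current one.
std::vector<int> find_numbers(const std::string& text) {
    static const std::map<std::string, int> small = {
        {"one", 1}, {"two", 2}, {"three", 3}, {"four", 4}, {"five", 5},
        {"six", 6}, {"seven", 7}, {"eight", 8}, {"nine", 9}, {"ten", 10},
        {"eleven", 11}, {"twelve", 12}, {"thirteen", 13}, {"fourteen", 14},
        {"fifteen", 15}, {"sixteen", 16}, {"seventeen", 17},
        {"eighteen", 18}, {"nineteen", 19}};
    static const std::map<std::string, int> tens = {
        {"twenty", 20}, {"thirty", 30}, {"forty", 40}, {"fifty", 50},
        {"sixty", 60}, {"seventy", 70}, {"eighty", 80}, {"ninety", 90}};

    std::vector<int> found;
    bool open_tens = false;  // true while the last number is a bare tens word
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        auto s = small.find(word);
        auto t = tens.find(word);
        if (s != small.end()) {
            if (open_tens && s->second < 10)
                found.back() += s->second;   // "forty" followed by "three"
            else
                found.push_back(s->second);  // starts a new number
            open_tens = false;
        } else if (t != tens.end()) {
            found.push_back(t->second);      // a tens word always starts anew
            open_tens = true;
        } else {
            open_tens = false;               // any other word ends the number
        }
    }
    return found;
}
```

With the sentence from the question, "some text and stuff house bill forty three twenty", this yields 43 and 20 as two separate numbers.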
Hope it helps!

I think the parser in Lingua::EN::FindNumber is kind of loose about what it considers a number, so that it treats e.g. "three and twenty", "three twenty" or even "forty three twenty" as valid numbers. For that matter, looking at the source, it also seems to accept "baker's dozen", "eleventy-one" and "billiard" as numbers...

Create regex from samples algorithm

AFAIK no one has implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I gave my algorithm these two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give more than one regex or even no possible regex back, and you might find 1000 reasons why such an algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about regex itself also. Basically what I want is an algorithm that takes samples as the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want to find in a string without having to write any regex or code manually.

I don't have proof but I suspect no such discrete algorithm with a finite output could exist since you are asking for the creation of a regular language which could be "large" in respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.

Flash Fill, a new feature of MS Excel 2013, would do exactly the task you want, but it does not give you the regular expression. It's an NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulation from multiple examples, go to the Flash Fill official website and read a few of the papers. They have pseudo-code and demo movies as well.

There are in fact many such algorithms. This is a research area called "grammatical inference".
I know RPNI, for example (you could also look at the probabilistic branch: ALERGIA, MDI, DEES). These algorithms generate a DFA (deterministic finite automaton). In fact you don't need to enter the full strings from your example at all, only the substrings.
There are also some algorithms that directly generate nondeterministic automata.
Of course, getting the regular expression from a nondeterministic automaton is straightforward.
The main ideas are simple:
Generate a PTA (prefix tree acceptor) from your sample.
Then try to "merge" some states. From these merges, loops will emerge (i.e. * in the regular expression). The whole difficulty lies in choosing the right rule for merging.

Here you go:
re = '|'.join(substrings)
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

Matching unmatched strings based on a unknown pattern

Alright guys, I really hurt my brain over this one and I'm curious if you can give me any pointers towards the direction I should be taking.
The situation is this:
Let's say I have a collection of strings (to be clear, the pattern of these strings is unknown. I do know that the strings contain only characters from the ASCII table, so I don't have to worry about weird Chinese characters).
For this example, I take the following collection of strings (note that the strings don't have to make any human sense so don't try figuring them out :)):
"[001].[FOO].[TEST] - 'foofoo.test'",
"[002].[FOO].[TEST] - 'foofoo.test'",
"[003].[FOO].[TEST] - 'foofoo.test'",
"[001].[FOO].[TEST] - 'foofoo.test.sample'",
"[002].[FOO].[TEST] - 'foofoo.test.sample'",
"-001- BAR.[TEST] - 'bartest.xx1",
"-002- BAR.[TEST] - 'bartest.xx1"
Now, what I need is a way of finding logical groups (and subgroups) in this set of strings. In the above example, just by rational thinking, you can combine the first 3, the 2 after that and the last 2. The groups resulting from the first 5 can also be combined into one main group with 2 subgroups, which should give you something like this:
{
{
"[001].[FOO].[TEST] - 'foofoo.test'",
"[002].[FOO].[TEST] - 'foofoo.test'",
"[003].[FOO].[TEST] - 'foofoo.test'",
}
{
"[001].[FOO].[TEST] - 'foofoo.test.sample'",
"[002].[FOO].[TEST] - 'foofoo.test.sample'",
}
}
{
{
"-001- BAR.[TEST] - 'bartest.xx1",
"-002- BAR.[TEST] - 'bartest.xx1"
}
}
Sorry for the layout above but indenting with 4 spaces doesn't seem to work correctly (or I'm frakk'n it up).
Anyway, I'm not sure how to approach this problem (how to get the result desired as indicated above).
First off, I thought of creating a huge set of regexes which would parse most known patterns, but the number of different patterns is just too huge for this to be realistic.
Another thing I thought of was parsing each individual word within a string (so strip all non-alphabetic or non-numeric characters and split by those), and if X% matches, I can assume the strings belong to the same group (where X will probably be around 80/90). However, I find the room for speculation kind of big. For example, when matching strings of 20 words each, the chance of hitting above 80% is fairly big (4 words can differ), whereas when matching only 8 words, at most 2 words can differ.
My question to you is, what would be a logical approach in the above situation?
As for a real-life example:
Thanks in advance!

Basically I would consider each string as a bag of characters. I would define a kind of distance between two strings, something like "number of characters belonging to both strings" divided by "total number of characters in string 1 + total number of characters in string 2" (well, it's not a distance mathematically speaking; it is really a similarity score), and then I would try to apply some clustering algorithms to your set of strings.
Well, this is just a basic idea but I think it would be a good start to try some experiments...
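A possible reading of that measure in C++, as a sketch: I interpret "characters belonging to both strings" as the multiset intersection of the two bags, which is my assumption, and bag_similarity is a made-up name.

```cpp
#include <algorithm>
#include <array>
#include <string>

// Shared characters (counted with multiplicity) divided by the combined
// length of both strings. Identical strings score 0.5, disjoint ones 0.
double bag_similarity(const std::string& a, const std::string& b) {
    if (a.empty() && b.empty()) return 0.0;  // avoid dividing by zero
    std::array<int, 256> count_a{};
    std::array<int, 256> count_b{};
    for (unsigned char c : a) ++count_a[c];
    for (unsigned char c : b) ++count_b[c];
    int shared = 0;
    for (int i = 0; i < 256; ++i)
        shared += std::min(count_a[i], count_b[i]);
    return static_cast<double>(shared) / (a.size() + b.size());
}
```

Because the score tops out at 0.5, you may want to double it or turn it into a distance (for instance 1 minus twice the score) before feeding it to a clustering routine.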

Building on @PierrOz's answer, you might want to experiment with multiple measures, and do a statistical cluster analysis on those measures.
For example, you could use four measures:
How many letters (upper/lowercase)
How many digits
How many of ([,],.)
How many other characters (probably) not included above
You then have, in this example, four measures for each string, and you could, if you wished, apply a different weight to each measure.
R has a number of functions for cluster analysis. This might be a good starting point.
Afterthought: the measures can be almost anything you invent. Some more examples:
Binary: does the string contain a given character (0 or 1)?
Binary: does the string contain a given substring?
Count: how many times does the given substring appear?
Binary: does the string include all these characters?
Enough for at least a weekend's tinkering...
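The four example measures could be computed along these lines. This is a sketch; measure is a hypothetical name, and I am reading "([,],.)" as the three characters [, ] and .

```cpp
#include <array>
#include <cctype>
#include <string>

// Returns {letters, digits, brackets-and-dots, everything else}.
std::array<int, 4> measure(const std::string& s) {
    std::array<int, 4> m{};
    for (unsigned char c : s) {
        if (std::isalpha(c)) ++m[0];
        else if (std::isdigit(c)) ++m[1];
        else if (c == '[' || c == ']' || c == '.') ++m[2];
        else ++m[3];
    }
    return m;
}
```

Each string then becomes a point in a four-dimensional space, which is the kind of input most clustering routines expect.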

I would recommend using this: http://en.wikipedia.org/wiki/Hamming_distance as the distance.
Also, for files, a good heuristic would be to remove the checksum at the end of the filename before calculating the distance:
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_[35218661].mkv
->
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_.mkv
A check is simple - it's always 10 characters, the first being [, the last -- ], and the rest ALPHA-numeric :)
With the heuristic and the distance max of 4, your stuff will work in the vast majority of the cases.
Good luck!
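A sketch of both pieces, with made-up names strip_checksum and hamming. Note that Hamming distance is only defined for strings of equal length, so this version returns -1 otherwise; for strings of different lengths you would need another measure such as edit distance.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Remove a 10-character "[xxxxxxxx]" checksum sitting right before the
// extension (or at the very end if there is no extension).
std::string strip_checksum(const std::string& name) {
    std::size_t dot = name.rfind('.');
    std::size_t end = (dot == std::string::npos) ? name.size() : dot;
    if (end < 10 || name[end - 10] != '[' || name[end - 1] != ']')
        return name;
    for (std::size_t i = end - 9; i < end - 1; ++i)
        if (!std::isalnum(static_cast<unsigned char>(name[i])))
            return name;  // not a checksum after all
    return name.substr(0, end - 10) + name.substr(end);
}

// Number of positions at which two equal-length strings differ,
// or -1 if the lengths differ.
int hamming(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return -1;
    int distance = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) ++distance;
    return distance;
}
```

Two names would then go in the same group when hamming(strip_checksum(x), strip_checksum(y)) is between 0 and 4.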

Your question is not easy to understand, but I think what you ask is impossible to do in a satisfying way given any group of strings. Take these strings for instance:
[1].[2].[3].[4].[5]
[a].[2].[3].[4].[5]
[a].[b].[3].[4].[5]
[a].[b].[c].[4].[5]
[a].[b].[c].[d].[5]
[a].[b].[c].[d].[e]
Each is close to those listed next to it, so they should all group with their neighbours, but the first and the last are completely different, so it would not make sense to group those together. Given a more "grouping" dataset you might get pretty good results with a method like the one PierrOz describes, but there is no guarantee for meaningful results.
May I enquire what the purpose is? It would allow us all to better understand what errors might be tolerated, or perhaps even come up with a different approach to solving the problem.
Edit: I wonder, would it be OK if one string ends up in multiple different groups? That could make the problem a lot simpler, and more reliably give you useful information, but you would end up with a bigger grouping tree with the same node copied to different branches.

I'd be tempted to tackle this with cluster analysis techniques. Hit Wikipedia for an introduction. And the other answers probably fall within the domain of cluster analysis, but you might find some other useful approaches by reading a bit more widely.
