Collecting disjoint like terms in polynomial expressions - c++

This is my first post on here so thanks in advance and bear with me if I do not follow the format guide.
My problem is as follows: I have several polynomial expressions in the variable "s" up to degree 10. Each coefficient is in turn a function of up to 10 other variables. Overall, the code for all of the coefficient functions takes up about 800 text-wrapped lines, with a single coefficient taking up to 40 lines. I am writing an optimization routine in C++ that tries to determine the optimal values for each of the 10 variables that the coefficients depend on.
Profiling my code, I see that I am spending 78% of my time in this one function. To optimize, I would like to search the whole code for redundant calculations, compute them once at the start of the routine, and replace every occurrence with the precomputed value. The problem is that the most frequently occurring expressions may be something like:
a0 = ... + R1*R2*G1*R3 + R1*R2*H1*R3 + ...;
I would like to find a way to search through the lines and pick out the R1*R2*R3 factors, replacing them with something like X, where X = R1*R2*R3; is declared at the start of the code. Such expressions may occur several hundred times throughout the code, so I am sure this could drastically improve my run time. Note that I can only group factors joined by multiplication, not terms joined by addition.
Basically, I need a find-and-replace function that can locate disjoint strings whose member factors are separated by other factors and * signs, but never by + signs. This may be a tall order or incredibly simple; I am really not sure.
I have Mathematica, MATLAB, and Maple available to me and run Debian, so I can download something if it is open source that may be more helpful. I typically use Emacs for my programming, though I am by no means proficient with all of its functionality. I am open to any suggestions, and greatly appreciate your help.

Using the standard coreutils, you can do the following:
cat filename.cc | tr " +" "\n\n" | grep '\*' | sort | uniq -c
In plain English this translates to: read the file, convert all spaces and plus signs into newlines, keep only the lines containing a multiplication sign, sort them, and display the unique occurrences with their frequency.
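Once the frequent products are known, the payoff is hoisting them by hand. A minimal before/after sketch of the transformation the question describes (the function names and the hoisted variable X are hypothetical; R1, G1, etc. come from the question):

#include <cstdio>

// Hypothetical sketch: hoisting the shared factor R1*R2*R3 out of a sum.
double a0_before(double R1, double R2, double R3, double G1, double H1) {
    return R1 * R2 * G1 * R3 + R1 * R2 * H1 * R3;  // product recomputed per term
}

double a0_after(double R1, double R2, double R3, double G1, double H1) {
    const double X = R1 * R2 * R3;  // computed once at the start of the routine
    return X * G1 + X * H1;         // reused in every term
}

int main() {
    std::printf("%f %f\n", a0_before(1, 2, 3, 4, 5), a0_after(1, 2, 3, 4, 5));
}

An optimizing compiler may already do some of this for you (common subexpression elimination), but hoisting across the whole 800-line routine by hand, as you propose, can still pay off.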

Related

Function to count an expression

Is there any function in C++ (included in some library or header file) that would count an expression from a string?
Let's say we have a string that equals 2 + 3 * 8 - 5 (but it's taken from the user's keyboard, so we don't know what expression exactly it will be while writing the code), and we want this function to count it, but of course in the correct order of operations (1. powers/roots, 2. multiplication/division, 3. addition/subtraction).
I've tried to take all the numbers into an array of ints and the operators into an array of chars (okay, actually vectors, because I don't know how many numbers and operators they're going to contain), but I'm not sure what to do next.
Note that I'm asking if there's any function already written for that; if not, I'm going to just try it again.
By count, I'm taking that to mean "evaluate the expression".
You'll have to use a parser generator like boost::spirit to do this properly. If you try hand-writing this I guarantee pain and misery for all involved.
Try looking on here for calculator apps, there are several:
http://boost-spirit.com/repository/applications/show_contents.php
There are also some simple calculator-style grammars in the boost::spirit examples.
Quite surprisingly, nobody before me has redirected you to the shunting-yard algorithm, which is easy to implement and converts your string into that special thing called Reverse Polish Notation (RPN). Computing expressions in RPN is easy; the hard part is implementing shunting yard, but it has been done many times before and you can find many tutorials on the subject.
Quick overview of the algorithms:
RPN is a way to write expressions that eliminates the need for parentheses, which allows easier computation. Practically speaking, to compute such an expression you walk the string from left to right while keeping a stack of operands. Whenever you meet an operand, you push it onto the stack. Whenever you meet an operation token, you pop the last 2 operands (if the operation is binary, of course; only the last one if it is unary), apply the operation, and push the result onto the stack. Rinse and repeat until the end of the string.
The shunting-yard algorithm is much harder to summarize briefly; if I tried, this answer would end up looking much like the Wikipedia article I linked above, so I'll save us both the trouble.
TL;DR: go read the links in the first sentence.
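For what it's worth, the easy half really is short. A minimal RPN evaluator sketch, assuming a well-formed, space-separated token stream with only the binary operators + - * / and numeric operands:

#include <iostream>
#include <sstream>
#include <stack>
#include <string>

// Evaluate a space-separated RPN expression, e.g. "2 3 8 * + 5 -".
double evalRPN(const std::string& expr) {
    std::istringstream in(expr);
    std::stack<double> operands;
    std::string tok;
    while (in >> tok) {
        if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
            double b = operands.top(); operands.pop();  // right operand
            double a = operands.top(); operands.pop();  // left operand
            if      (tok == "+") operands.push(a + b);
            else if (tok == "-") operands.push(a - b);
            else if (tok == "*") operands.push(a * b);
            else                 operands.push(a / b);
        } else {
            operands.push(std::stod(tok));              // plain number
        }
    }
    return operands.top();  // one value left: the result
}

int main() {
    // "2 + 3 * 8 - 5" from the question, converted to RPN by hand
    std::cout << evalRPN("2 3 8 * + 5 -") << '\n';  // prints 21
}

The shunting-yard step that produces the RPN string is the part worth reading up on; the evaluator above does no error checking at all.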
You will have to do this yourself, as there is no standard library that handles infix expression evaluation.
If you do, please include the code for what you are trying, and we can help you.

Regular expression matching for removing certain uses of the period character

I have some Fortran 77 source files that I'm trying to convert from a non-standard STRUCTURE and RECORD syntax to the standardized Fortran 90 TYPE syntax. One tricky aspect of this is the different way that structure members are addressed.
Non-standard:
s.member = 1
Standard:
s%member = 1
So, I need to trap all uses of periods in these sorts of scenarios and replace them with % characters. Not too bad, except when you think about all of the ways that periods can be used (decimal points in numbers, filenames in include statements, punctuation in comments, Fortran 77 relational operators, maybe others). I've done some preprocessing to fix the relational operators to use the Fortran 90 symbols, and I don't really care about mangling the grammar of comments, but I haven't come up with a good approach for translating the . to % in the cases above. It seems like I should be able to do this with sed, but I'm not sure how to match the instances I need to fix. Here are the rules that I've thought of:
On a line-by-line basis:
If the line begins with <whitespace>include, then we shouldn't do anything to that line; pass it through to the output, so we don't mess up the filename inside the include statement.
The following strings are operators that don't have symbolic equivalents, so they must be left alone: .not. .and. .or. .eqv. .neqv.
Otherwise, if we find a period that is surrounded by 2 non-numeric characters (so it's not a decimal point), then it should be the operator that I'm looking to replace. Change that period to a %.
I'm not a native Fortran speaker myself, so here are some examples:
include 'file.inc' ! We don't want to do anything here. The line can
! begin with some amount of whitespace
if x == 1 .or. y > 2.0 ! In this case, we don't want to touch the periods that
! are part of the logical operator ".or.". We also don't
! want to touch the period that is the decimal point
! in "2.0".
if a.member < 4.0 .and. b.othermember == 1.0 ! We don't want to touch the periods
! inside the numbers, but we need to
! change the "a." and "b." to "a%"
! and "b%".
Any good way of tackling this problem?
Edit: I actually found some additional operators that contain a dot in them that don't have symbolic equivalents. I've updated the rule list above.
You can't do this with a regexp, and it's not that easy.
If I had to do what you have to, I would probably do it by hand, unless the codebase is huge. If doing it by hand is feasible, first replace every occurrence of [a-zA-Z0-9]\.[a-zA-Z] with something very weird that is guaranteed never to compile, something like "#WHATEVER#", then search for all of these entries and replace them by hand after manual inspection.
If the amount of code is huge, then you need to write a parser. I would suggest using Python to tokenize basic Fortran constructs, but remember that Fortran is not an easy language to parse. Work per routine, and try to find all of the variable names used, using them as a filter. If you encounter something like a.whatever, and you know that a is in the list of local or global variables, apply the change.
Unless the codebase is really HUUGE (and do think very hard about whether this is indeed the case), I'd just take an editor like Vim (vertical select & block select are your friends) and set aside an afternoon to do this by hand. My guess is you'll be done with most of it, if not all, in one afternoon. An afternoon is a lot of time; just imagine how many cases you could cover in two hours alone.
Just trying to write a parser for something like this will take you much longer than that.
Of course, the question begs itself: if the code is F77, which all compilers still support, and the code works, why are you so keen on changing it?
I'm not that versed in regexps, so I guess I'd try tackling this one from the other side. If you grep for the STRUCTURE keyword, you get the list of all the structures used in the code. Once you have that, for each structure S you can just replace all instances of S. with S%.
This way you don't have to worry about things like .true., .and., .neqv. and their relatives. The main worry then would be parsing the STRUCTURE declarations themselves.
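As a rough sketch of that idea (hypothetical: it assumes the record variable names have already been collected from the declarations, and it works line by line):

#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Given the record variable names found earlier, replace "name." with
// "name%" on one source line. The \b keeps us from touching identifiers
// that merely end in the same letters (e.g. "data." for variable "a").
std::string fixLine(std::string line, const std::vector<std::string>& vars) {
    for (const auto& v : vars) {
        std::regex memberRef("\\b" + v + "\\.");
        line = std::regex_replace(line, memberRef, v + "%");
    }
    return line;
}

int main() {
    std::vector<std::string> vars = {"a", "b"};  // assumed: collected beforehand
    std::cout << fixLine("if a.member < 4.0 .and. b.othermember == 1.0", vars)
              << '\n';  // prints: if a%member < 4.0 .and. b%othermember == 1.0
}

Lines starting with include would still need to be skipped, per the question's first rule.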
Although the regex below:
(?<!')\b([^.\s]+)(?<!\.(?:not|and|or|eqv|neqv))(?<=\D)\.(?=\D)(?!(?:not|and|or|eqv|neqv)\.)([^.\s]+)\b(?!')
with the replacement $1%$2 works for your examples, I would not recommend using it for your actual task. It will definitely not cover all of your cases. If you only care about 80% coverage or so, you could use it, but you should back up your sources first. Even with the limited set of input cases I had, I am sure there will be cases where the regex replaces something it shouldn't.
Good luck :)
This sed one-liner might be a start:
sed -r '/^\s*include/b;/^\s*! /b;G;:a;s/^(\.(not|and|or|eqv|neqv)\.)(.*\n.*)/\3\1/;ta;s/^\.([^0-9]{2,})(.*\n.*)/\2%\1/;ta;s/^(.)(.*\n.*)/\2\1/;ta;s/\n//'
Based on your examples, I am guessing it would be enough to protect quoted strings, then replace the periods that have alphabetic characters on both sides.
perl -pe '1 while s%(\x27[^\x27]+)\.([^\x27]+\x27)%$1##::##$2%;
s/([a-z])\.([a-z])/$1%$2/g;
s/##::##/./g' file.f
I offer this Perl solution not because sed is not a good enough tool for this, but because it avoids the issue of minor but pesky differences between sed dialects. The ability to use a hex code for the single quotes is a nice bonus.

Calculator which can take words as input

I want to write a calculator which can take words as input.
For example, "two plus five multiply with 7" should give 37 as output.
I won't lie, this is homework, so before doing it I thought I'd ask whether there is something useful for this kind of thing that I am not aware of.
Also, an approach on how to do this would be OK too, I guess. It has to be written in C++; no other language would be accepted.
Thanks.
[Edit] -- Thanks for the answers. This is an introductory course. So keeping things as simple as possible would be appreciated. I should have mentioned this earlier.
[Edit 2] -- I've reached a stage where I can give input in numbers and get correct output, with precedence and all. Now I just want to see how to convert words to numbers. Thanks to everybody who is trying to help. This is why SO rocks.
As long as the acceptable input is strict enough, writing a recursive descent parser should be easy. The grammar for this shouldn't differ much from a grammar for a simple calculator that only accepts digits and symbols.
With an std::istream and the extraction operator (>>), you can easily break the input into separate words.
Here's a lecture that shows how to do this for a regular calculator with digits and symbols.
The main difference from that in your case is parsing the numbers. Instead of the digit symbols '0'-'9', your input will have words like "zero", "one", "nine", "hundred", "thousand", etc. To parse such a stream of words you can do something like this:
break the words into groups, nested by the "multipliers" like "thousand", "hundred", "billion", etc.; these groups can nest. Tens ("ten", "twenty", "thirty"), teens ("eleven", "twelve", "thirteen", ...), and units ("one", "two", "three", ...) don't nest anything. You turn something like "one hundred and three thousand two hundred and ninety seven" into something like this:
+--------+---------+-----+
/ | | \
thousand hundred ninety seven
/ \ |
hundred three two
|
one
for each group, recursively sum all of its components, then multiply the sum by its "multiplier":
103297
+--------+------+----+
/ | | \
(* 1000) + (* 100) + 90 + 7
/ \ |
(* 100) + 3 2
|
1
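An iterative variant of this grouping idea fits in a page. This is a deliberately simplified sketch, not a full English-number grammar: it assumes lowercase, space-separated words, ignores "and", and only handles multipliers up to "million":

#include <iostream>
#include <map>
#include <sstream>
#include <string>

long parseNumber(const std::string& text) {
    static const std::map<std::string, long> small = {
        {"zero",0},{"one",1},{"two",2},{"three",3},{"four",4},{"five",5},
        {"six",6},{"seven",7},{"eight",8},{"nine",9},{"ten",10},
        {"eleven",11},{"twelve",12},{"thirteen",13},{"fourteen",14},
        {"fifteen",15},{"sixteen",16},{"seventeen",17},{"eighteen",18},
        {"nineteen",19},{"twenty",20},{"thirty",30},{"forty",40},
        {"fifty",50},{"sixty",60},{"seventy",70},{"eighty",80},{"ninety",90}};
    std::istringstream in(text);
    std::string word;
    long total = 0, group = 0;   // group = the partial sum being built
    while (in >> word) {
        if (word == "and") continue;                  // filler word
        auto it = small.find(word);
        if (it != small.end())          group += it->second;
        else if (word == "hundred")     group *= 100; // multiplies the group
        else if (word == "thousand")  { total += group * 1000;    group = 0; }
        else if (word == "million")   { total += group * 1000000; group = 0; }
    }
    return total + group;
}

int main() {
    std::cout << parseNumber(
        "one hundred and three thousand two hundred and ninety seven")
              << '\n';  // prints 103297, matching the tree above
}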
This was some of my favorite stuff in school.
The right way to do this is to implement it as a parser -- write a grammar, tokenize your input, and parse away. Then you can evaluate it recursively.
If none of that sounds familiar, though (i.e. this is an introductory class) then you can hack it together in a much less robust way -- just do it how you'd do it naturally. Convert the sentence to numbers and operations, and then just do the operations.
NOTE: I'm assuming this is an introductory-level course, and not a course in compiler theory. If you're being taught specific things relevant to this (e.g. algorithms other than what I mention), you will almost certainly be expected to apply those concepts.
First, you'll have to understand the individual words. For this purpose, you can likely just hack it together - read one word at a time and try to understand that. Gradually build up a function which can read the set of numbers you need to be able to work with.
If the input is simple enough (only expressions of the basic form you provide, no parentheses or anything), you can simply alternate between reading a number and reading an operator until the input is fully read, so you can write separate methods for numbers and operators. (If this is supposed to be robust, you need to stop and display an error if the last thing you read is an operator.)
To understand how the expression needs to be calculated, use the shunting yard algorithm to parse the expression with proper operator precedence. You can likely combine the initial parsing of the words with this by simply using that to supply the tokens to the algorithm.
To actually calculate the result, you'll need to evaluate the parsed expression. To do that, you can simply use a stack for the output: when you would normally push an operator, you instead pop 2 values, apply the operator to them, and push the result. After the shunting yard algorithm is complete, there will be one value on the stack, containing the final result.
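The word-reading step really can be a plain lookup table. A minimal sketch of the tokenizing pass that feeds the shunting yard (the word list and the handling of filler words like "with" are assumptions based on the example input):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Turn "two plus five multiply with 7" into the tokens 2 + 5 * 7.
std::vector<std::string> tokenize(const std::string& sentence) {
    static const std::map<std::string, std::string> words = {
        {"zero","0"},{"one","1"},{"two","2"},{"three","3"},{"four","4"},
        {"five","5"},{"six","6"},{"seven","7"},{"eight","8"},{"nine","9"},
        {"plus","+"},{"minus","-"},{"multiply","*"},{"divide","/"}};
    std::istringstream in(sentence);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word) {
        if (word == "with" || word == "by") continue;  // filler words
        auto it = words.find(word);
        tokens.push_back(it != words.end() ? it->second : word);  // "7" stays
    }
    return tokens;
}

int main() {
    for (const auto& t : tokenize("two plus five multiply with 7"))
        std::cout << t << ' ';   // prints: 2 + 5 * 7
    std::cout << '\n';
}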

efficient algorithm for searching one of several strings in a text?

I need to search incoming, not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). An additional simplification is that none of the strings is contained in any other.
I am currently using boost regex matching with str1 | str2 | .... The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.
As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.
E.g., if the strings are abcx, bcy, and cz, and I've read abc so far, I should be in a combined state that means I'm either 3 chars into string 1, 2 chars into string 2, or 1 char into string 3. Then reading x next will move me to the "string 1 matched" state, etc., and any char other than x, y, or z will move to the initial state, and I will not need to retract back to b.
Any ideas or references are appreciated.
Check out the Aho–Corasick string matching algorithm!
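The automaton is compact enough to sketch here. A rough, single-file illustration (not production code; it assumes plain ASCII input):

#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct AhoCorasick {
    struct Node {
        int next[128];          // ASCII transitions; -1 = not set yet
        int fail = 0;           // fallback state
        std::vector<int> out;   // ids of needles ending in this state
        Node() { for (int& n : next) n = -1; }
    };
    std::vector<Node> nodes{1}; // state 0 is the root

    void add(const std::string& needle, int id) {   // build the trie
        int cur = 0;
        for (unsigned char c : needle) {
            if (nodes[cur].next[c] < 0) {
                nodes[cur].next[c] = (int)nodes.size();
                nodes.emplace_back();
            }
            cur = nodes[cur].next[c];
        }
        nodes[cur].out.push_back(id);
    }

    void build() {              // BFS fills failure links and missing edges
        std::queue<int> q;
        for (int c = 0; c < 128; ++c) {
            int v = nodes[0].next[c];
            if (v < 0) nodes[0].next[c] = 0;        // missing at root -> root
            else { nodes[v].fail = 0; q.push(v); }
        }
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int c = 0; c < 128; ++c) {
                int v = nodes[u].next[c];
                if (v < 0) { nodes[u].next[c] = nodes[nodes[u].fail].next[c]; continue; }
                nodes[v].fail = nodes[nodes[u].fail].next[c];
                const auto& fo = nodes[nodes[v].fail].out;  // inherit matches
                nodes[v].out.insert(nodes[v].out.end(), fo.begin(), fo.end());
                q.push(v);
            }
        }
    }

    void scan(const std::string& text) const {      // one pass, no retracting
        int cur = 0;
        for (size_t i = 0; i < text.size(); ++i) {
            cur = nodes[cur].next[(unsigned char)text[i]];
            for (int id : nodes[cur].out)
                std::cout << "needle " << id << " ends at " << i << '\n';
        }
    }
};

int main() {
    AhoCorasick ac;
    ac.add("abcx", 0); ac.add("bcy", 1); ac.add("cz", 2);  // from the question
    ac.build();
    ac.scan("xx abcx yy bcy zz cz");
}

This gives exactly the combined-state behaviour described in the question: after reading "abc" the automaton is simultaneously 3 characters into abcx, 2 into bcy, and 1 into cz, and it never retracts the cursor.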
Take a look at Suffix Tree.
Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html
The existence of a recursive/non-recursive distinction is a pretty strong hint that Boost.Regex is not necessarily a linear-time discrete finite-state machine. Therefore, there's a good chance you can do better for your particular problem.
The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.
Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.
Knuth-Morris-Pratt builds a DFSM out of the string (or strings) for which you are searching, so that the test and the cursor motion are combined in a single operation. However, it was originally designed for a single needle, so you would need to support backtracking if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)
Another tactic is to slide the cursor more than one character to the right when possible; Boyer-Moore does this. It is normally built for a single needle. Construct a table of every character and the rightmost position at which it appears in the needle (if at all). Now position the cursor at len(needle)-1. The table entry tells you either (a) the leftward offset from the cursor at which the needle might be found, or (b) that you can move the cursor len(needle) characters farther to the right.
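To make the sliding idea concrete, here is a minimal sketch of the Horspool simplification of Boyer-Moore for a single needle (the multi-needle generalization discussed next is more involved):

#include <array>
#include <iostream>
#include <string>

// Horspool: compare the needle right-to-left; on a mismatch, slide the
// cursor by the shift of the haystack character under the needle's end.
size_t search(const std::string& haystack, const std::string& needle) {
    const size_t n = haystack.size(), m = needle.size();
    if (m == 0 || m > n) return std::string::npos;
    std::array<size_t, 256> shift;
    shift.fill(m);                       // absent char: slide by len(needle)
    for (size_t i = 0; i + 1 < m; ++i)   // rightmost occurrence, last char excluded
        shift[(unsigned char)needle[i]] = m - 1 - i;
    size_t pos = 0;
    while (pos + m <= n) {
        size_t i = m;
        while (i > 0 && haystack[pos + i - 1] == needle[i - 1]) --i;
        if (i == 0) return pos;          // full match
        pos += shift[(unsigned char)haystack[pos + m - 1]];
    }
    return std::string::npos;
}

int main() {
    std::cout << search("looking for a needle in a haystack", "needle") << '\n';  // 14
}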
When you have more than one needle, the construction and use of your table grow more complicated, but it may still save you an order of magnitude on probes. You might still want to build a DFSM, but instead of calling a general search method, you would call does_this_DFSM_match_at_this_offset().
Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It then computes a simple hash code to fit the result into 12 bits and looks in a table to see if there's a hit. It keeps four entries for each pattern (offsets of 0, 1, 2, and 3 from the start of the pattern), and this way, despite thousands of patterns in the table, it tests only one or two per 32-bit word in the subject line.
So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.
I've been looking at the answers but none seem quite explicit... and mostly boiled down to a couple of links.
What intrigues me here is the uniqueness of your problem: the solutions offered so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.
I would take a look at KMP/Boyer-Moore, for sure, but I would not apply them blindly (at least if you have some time on your hands), because they are tailored for a single needle, and I am pretty convinced we could capitalize on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).
Of course, it's unlikely to improve the big O (Boyer-Moore runs in 3n comparisons for each string, so it'll be linear anyway), but you could probably gain on the constant factor.
Regex engine initialization is expected to have some overhead, so if there are no real regular expressions involved, plain C memcmp() should do fine.
If you can tell us the file sizes and give some specific use cases, we could build a benchmark (I consider this very interesting).
Interesting: memcmp explorations and timing differences
Regards,
rbo
There is always Boyer-Moore.
Besides the Rabin-Karp algorithm and the Knuth-Morris-Pratt algorithm, my algorithms book suggests a finite state machine for string matching. For every search string you would need to build such a finite state machine.
You can do it with the very popular Lex & Yacc tools, or their open-source counterparts Flex and Bison.
You can use Lex for getting tokens out of the string.
Compare your pre-defined strings with the tokens returned from the lexer.
When a match is found, perform the desired action.
There are many sites that describe Lex and Yacc.
One such site is http://epaperpress.com/lexandyacc/

Matching unmatched strings based on a unknown pattern

Alright guys, I really hurt my brain over this one, and I'm curious whether you can give me any pointers in the right direction.
The situation is this:
Let's say I have a collection of strings (let it be clear that the pattern of these strings is unknown; for a fact, I can say the strings contain only characters from the ASCII table, so I don't have to worry about, say, Chinese characters).
For this example, take the following collection of strings (note that the strings don't have to make any human sense, so don't try figuring them out :)):
"[001].[FOO].[TEST] - 'foofoo.test'",
"[002].[FOO].[TEST] - 'foofoo.test'",
"[003].[FOO].[TEST] - 'foofoo.test'",
"[001].[FOO].[TEST] - 'foofoo.test.sample'",
"[002].[FOO].[TEST] - 'foofoo.test.sample'",
"-001- BAR.[TEST] - 'bartest.xx1",
"-002- BAR.[TEST] - 'bartest.xx1"
Now, what I need is a way of finding logical groups (and subgroups) in this set of strings. In the above example, just by rational thinking, you can combine the first 3, the 2 after that, and the last 2. Also, the resulting groups from the first 5 can be combined into one main group with 2 subgroups, which should give you something like this:
{
    {
        "[001].[FOO].[TEST] - 'foofoo.test'",
        "[002].[FOO].[TEST] - 'foofoo.test'",
        "[003].[FOO].[TEST] - 'foofoo.test'"
    }
    {
        "[001].[FOO].[TEST] - 'foofoo.test.sample'",
        "[002].[FOO].[TEST] - 'foofoo.test.sample'"
    }
}
{
    {
        "-001- BAR.[TEST] - 'bartest.xx1",
        "-002- BAR.[TEST] - 'bartest.xx1"
    }
}
Anyway, I'm not sure how to approach this problem (how to get the desired result indicated above).
First off, I thought of creating a huge set of regexes that would parse most known patterns, but the number of different patterns is just too large for that to be realistic.
Another thing I thought of was parsing each individual word within a string (stripping all non-alphanumeric characters and splitting on those); if X% of the words match, I can assume the strings belong to the same group (where X will probably be around 80 or 90). However, I find the margin for error rather big. For example, when matching strings of 20 words each, the chance of scoring above 80% is fairly large (4 words may differ), whereas when matching only 8 words, at most 2 words may differ.
My question to you is, what would be a logical approach in the above situation?
As for a real-life example:
Thanks in advance!
Basically, I would consider each string as a bag of characters. I would define a kind of distance between two strings, something like the number of characters belonging to both strings divided by the total number of characters in string 1 plus the total number of characters in string 2 (well, it's not a distance, mathematically speaking...), and then I would try to apply some clustering algorithms to your set of strings.
Well, this is just a basic idea, but I think it would be a good starting point for some experiments...
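A minimal sketch of that measure (note it's a similarity, so higher means more alike; multiplying the numerator by 2 would give the classic Sorensen-Dice coefficient):

#include <algorithm>
#include <array>
#include <iostream>
#include <string>

// Shared character count divided by the combined length of both strings.
double similarity(const std::string& a, const std::string& b) {
    std::array<int, 256> countA{}, countB{};
    for (unsigned char c : a) ++countA[c];
    for (unsigned char c : b) ++countB[c];
    int shared = 0;
    for (int c = 0; c < 256; ++c)
        shared += std::min(countA[c], countB[c]);   // chars in both bags
    if (a.empty() && b.empty()) return 1.0;
    return (double)shared / (a.size() + b.size());
}

int main() {
    std::cout << similarity("[001].[FOO].[TEST] - 'foofoo.test'",
                            "[002].[FOO].[TEST] - 'foofoo.test'") << '\n';
}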
Building on PierrOz's answer, you might want to experiment with multiple measures and do a statistical cluster analysis on those measures.
For example, you could use four measures:
How many letters (upper/lowercase)
How many digits
How many of the characters [, ], and .
How many other characters (probably) not included above
You then have, in this example, four measures for each string, and you could, if you wished, apply a different weight to each measure.
R has a number of functions for cluster analysis. This might be a good starting point.
Afterthought: the measures can be almost anything you invent. Some more examples:
Binary: does the string contain a given character (0 or 1)?
Binary: does the string contain a given substring?
Count: how many times does the given substring appear?
Binary: does the string include all these characters?
Enough for at least a weekend's tinkering...
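As a starting point, the four measures above take only a few lines to compute; a sketch (the cluster analysis itself is left to R or whatever tool you prefer):

#include <array>
#include <cctype>
#include <iostream>
#include <string>

// One feature vector per string: letters, digits, "[ ] ." chars, everything else.
std::array<double, 4> features(const std::string& s) {
    std::array<double, 4> f{};
    for (unsigned char c : s) {
        if (std::isalpha(c))                       ++f[0];
        else if (std::isdigit(c))                  ++f[1];
        else if (c == '[' || c == ']' || c == '.') ++f[2];
        else                                       ++f[3];
    }
    return f;
}

int main() {
    auto f = features("[001].[FOO].[TEST] - 'foofoo.test'");
    std::cout << f[0] << ' ' << f[1] << ' ' << f[2] << ' ' << f[3] << '\n';
}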
I would recommend using the Hamming distance (http://en.wikipedia.org/wiki/Hamming_distance) as the distance.
Also, for files, a good heuristic would be to remove the checksum at the end of the filename before calculating the distance:
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_[35218661].mkv
->
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_.mkv
The checksum is simple to detect: it's always 10 characters, the first being [, the last being ], and the rest alphanumeric :)
With this heuristic and a maximum distance of 4, your approach will work in the vast majority of cases.
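A rough sketch of the heuristic plus the distance (the checksum regex follows the 10-character rule above; the shortened filenames are hypothetical, and Hamming distance is only defined for equal-length strings, so unequal lengths are treated as not comparable):

#include <iostream>
#include <regex>
#include <string>

// Strip a "[8 alphanumerics]" checksum (10 characters total) from a name.
std::string stripChecksum(const std::string& name) {
    static const std::regex checksum("\\[[0-9A-Za-z]{8}\\]");
    return std::regex_replace(name, checksum, "");
}

int hamming(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return 1 << 30;   // sentinel: not comparable
    int d = 0;
    for (size_t i = 0; i < a.size(); ++i) d += (a[i] != b[i]);
    return d;
}

int main() {
    std::string x = stripChecksum("[BSS]_Gaiden_-_01_[35218661].mkv");
    std::string y = stripChecksum("[BSS]_Gaiden_-_02_[87654321].mkv");
    std::cout << hamming(x, y) << '\n';  // 1: well under the max of 4
}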
Good luck!
Your question is not easy to understand, but I think what you ask is impossible to do in a satisfying way for an arbitrary group of strings. Take these strings, for instance:
[1].[2].[3].[4].[5]
[a].[2].[3].[4].[5]
[a].[b].[3].[4].[5]
[a].[b].[c].[4].[5]
[a].[b].[c].[d].[5]
[a].[b].[c].[d].[e]
Each is close to those listed next to it, so they should all group with their neighbours, but the first and the last are completely different, so it would not make sense to group them together. Given a more "groupable" dataset you might get pretty good results with a method like the one PierrOz describes, but there is no guarantee of meaningful results.
May I enquire what the purpose is? It would allow us all to better understand what errors might be tolerated, or perhaps even come up with a different approach to solving the problem.
Edit: I wonder, would it be OK if one string ended up in multiple groups? That could make the problem a lot simpler and more reliably give you useful information, but you would end up with a bigger grouping tree, with the same node copied to different branches.
I'd be tempted to tackle this with cluster analysis techniques. Hit Wikipedia for an introduction. And the other answers probably fall within the domain of cluster analysis, but you might find some other useful approaches by reading a bit more widely.