I want to assign a mathematical expression (or a function) to an attribute in python. But so far I only know how to assign hard parameters like Int, Double or Boolean. In the code bellow is some kind of similarity for what I want.
class MyClass:
def __init__(self, mathExp):
self.Expression=mathExp
def evalExp(self,param)
return self.Expression(param)
What I want is mathExp to be something like "1+2x" and when I execute the function evalExp I can enter a parameter like 5 so that the return is something like 1+2*5=11.
Thanks for your help!,
You need to parse your input, which is a string, into something which can be handled by python. That is if you are not importing a library which can do this kind of stuff for you, like sympy.
If you want to do it yourself you need to build your mathematical language out of simpler expression, so that in the end after parsing your example input, you could have a list/tree/... like:
[1, add, [ 2, mul, 'x']]
Which then could be evaluated step by step. Safe to say, this is not the easiest task to take on.
The genuine solution
If you used reverse polish notation, i.e. changing from "1+2x" to "1 2 x * +", this would be somewhat easier, as the list of tokens could look like:
[1, 2, x, mul, add]
And then you could go through it calculating as you read tokens:
1 – Push number onto a stack
2 – Push another number onto the stack
x – Get an input, and push that onto the stack
mul – Get two numbers from stack, multiply, and push back on stack. I.e. for x=5, push 2*5=10 back on stack
add – Get two numbers from stack, i.e. 1, 10, and add them, push back on stack
Now the input token list is empty, and the stack has one value 11. Ready for the next calculation.
The fake and unsafe solution
However, if you set aside safety and security issues, you could go down this path:
expression = "1 + 2*x" # The '*' needs explicitly to be there
x = 5
result = eval(expression)
print result # Will output '11'
In other words, you then search for the presence of letters/words, and ask using raw_input to set these, before sending the entire expression of to eval. However using eval opens up a lot of pitfalls as it can evaluate all code.
Related
I am working with a messy manually maintained "database" that has a column containing a string with name,value pairs. I am trying to parse the entire column with regexp to pull out the values. The column is huge (>100,000 entries). As a proxy for my actual data, let's use this code:
line1={'''thing1'': ''-583'', ''thing2'': ''245'', ''thing3'': ''246'', ''morestuff'':, '''''};
line2={'''thing1'': ''617'', ''thing2'': ''239'', ''morestuff'':, '''''};
line3={'''thing1'': ''unexpected_string(with)parens5'', ''thing2'': 245, ''thing3'':''246'', ''morestuff'':, '''''};
mycell=vertcat(line1,line2,line3);
This captures the general issues encountered in the database. I want to extract what thing1, thing2, and thing3 are in each line using cellfun to output a scalar cell array. They should normally be 3 digit numbers, but sometimes they have an unexpected form. Sometimes thing3 is completely missing, without the name even showing up in the line. Sometimes there are minor formatting inconsistencies, like single quotes missing around the value, spaces missing, or dashes showing up in front of the three digit value. I have managed to handle all of these, except for the case where thing3 is completely missing.
My general approach has been to use expressions like this:
expr1='(?<=thing1''):\s?''?-?([\w\d().]*?)''?,';
expr2='(?<=thing2''):\s?''?-?([\w\d().]*?)''?,';
expr3='(?<=thing3''):\s?''?-?([\w\d().]*?)''?,';
This looks behind for thingX' and then tries to match : followed by zero or one spaces, followed by 0 or 1 single quote, followed by zero or one dash, followed by any combination of letters, numbers, parentheses, or periods (this is defined as the token), using a lazy match, until zero or one single quote is encountered, followed by a comma. I call regexp as regexp(___,'tokens','once') to return the matching token.
The problem is that when there is no match, regexp returns an empty array. This prevents me from using, say,
out=cellfun(#(x) regexp(x,expr3,'tokens','once'),mycell);
unless I call it with 'UniformOutput',false. The problem with that is twofold. First, I need to then manually find the rows where there was no match. For example, I can do this:
emptyout=cellfun(#(x) isempty(x),out);
emptyID=find(emptyout);
backfill=cell(length(emptyID),1);
[backfill{:}]=deal('Unknown');
out(emptyID)=backfill;
In this example, emptyID has a length of 1 so this code is overkill. But I believe this is the correct way to generalize for when it is longer. This code will change every empty cell array in out with the string Unknown. But this leads to the second problem. I've now got a 'messy' cell array of non-scalar values. I cannot, for example, check unique(out) as a result.
Pardon the long-windedness but I wanted to give a clear example of the problem. Now my actual question is in a few parts:
Is there a way to accomplish what I'm trying to do without using 'UniformOutput',false? For example, is there a way to have regexp pass a custom string if there is no match (e.g. pass 'Unknown' if there is no match)? I can think of one 'cheat', which would be to use the | operator in the expression, and if the first token is not matched, look for something that is ALWAYS found. I would then still need to double back through the output and change every instance of that result to 'Unknown'.
If I take the 'UniformOutput',false approach, how can I recover a scalar cell array at the end to easily manipulate it (e.g. pass it through unique)? I will admit I'm not 100% clear on scalar vs nonscalar cell arrays.
If there is some overall different approach that I'm not thinking of, I'm also open to it.
Tangential to the main question, I also tried using a single expression to run regexp using 3 tokens to pull out the values of thing1, thing2, and thing3 in one pass. This seems to require 'UniformOutput',false even when there are no empty results from regexp. I'm not sure how to get a scalar cell array using this approach (e.g. an Nx1 cell array where each cell is a 3x1 cell).
At the end of the day, I want to build a table using these results:
mytable=table(out1,out2,out3);
Edit: Using celldisp sheds some light on the problem:
celldisp(out)
out{1}{1} =
246
out{2} =
Unknown
out{3}{1} =
246
I assume that I need to change the structure of out so that the contents of out{1}{1} and out{3}{1} are instead just out{1} and out{3}. But I'm not sure how to accomplish this if I need 'UniformOutput',false.
Note: I've not used MATLAB and this doesn't answer the "efficient" aspect, but...
How about forcing there to always be a match?
Just thinking about you really wanting a match to skip this problem, how about an empty match?
Looking on the MATLAB help page here I can see a 'emptymatch' option, perhaps this is something to try.
E.g.
the_thing_i_want_to_find|
Match "the_thing_i_want_to_find" or an empty match, note the | character.
In capture group it might look like this:
(the_thing_i_want_to_find|)
As a workaround, I have found that using regexprep can be used to find entries where thing3 is missing. For example:
replace='$1 ''thing3'': ''Unknown'', ''morestuff''';
missingexpr='(?<=thing2'':\s?)(''?-?[\w\d().]*?''?,) ''morestuff''';
regexprep(mycell{2},missingexpr,replace)
ans =
''thing1': '617', 'thing2': '239', 'thing3': 'Unknown', 'morestuff':, '''
Applying it to the entire array:
fixedcell=cellfun(#(x) regexprep(x,missingexpr,replace),mycell);
out=cellfun(#(x) regexp(x,expr3,'tokens','once'),fixedcell,'UniformOutput',false);
This feels a little roundabout, but it works.
cellfun can be replaced with a plain old for loop. Your code will either be equally fast, or maybe even faster. cellfun is implemented with a loop anyway, there is no advantage of using it other than fewer lines of code. In your explicit loop, you can then check the output of regexp, and build your output array any way you like.
I have been at a loss understanding the use of parentheses in django-viewflow flow code.
For example in the code below
start = (
flow.Start(views.StartView)
.Permission('shipment.can_start_request')
.Next(this.split_clerk_warehouse)
)
# clerk
split_clerk_warehouse = (
flow.Split()
.Next(this.shipment_type)
.Next(this.package_goods)
)
from here
It seems as though, a tuple containing functions is assigned to start and to split_clerk_warehouse e.t.c. What does it mean. From my best guess it would seem that the .Next functions accept a tuple as input.
NOTE I do understand the method chaining used here. I am just at a loss to understand the use of braces.
Thanks.
If I understand correctly, you wonder what the use is of the outer brackets.
Let us first write the (first, but applicable to the second) statement without outer brackets:
start = flow.Start(views.StartView).Permission('shipment.can_start_request').Next(this.split_clerk_warehouse)
This is exactly equivalent to code in your sample. But you probably agree that this is quite unreadable. It requires a user to scroll over the code, and furthermore it is a long chain of characters, without any structure. A programmer would have a hard time understanding it, especially if - later - we would also use brackets inside the parameters of calls.
So perhaps it would make sense to write it like:
start = flow.Start(views.StartView).
Permission('shipment.can_start_request').
Next(this.split_clerk_warehouse)
But this will not work: Python is a language that uses spacing as a way to attach semantics on code. As a result it will break: Python will try to parse the separate linkes as separate statements. But then what to do with the tailing dot? As a result the parser would error.
Now Python has some ways to write statements in a multi-line fashion. For example with backslashes:
start = flow.Start(views.StartView). \
Permission('shipment.can_start_request'). \
Next(this.split_clerk_warehouse)
with the backslash we specify that the next line actually belongs to the current one, and thus it is parsed like we wrote this all on a single line.
The disadvantage is that we easily can forget a backslash here, and this would again let the parser error. Furthermore this requires linear work: for every line we have to add one element.
But programming languages actually typically have a feature that programmers constantly use to group (sub)expressions together: brackets. We use it to give precedence (for example 3 * (2 + 5)), but we can use it to simply group one expression over multiple lines as well, like:
start = (
flow.Start(views.StartView)
.Permission('shipment.can_start_request')
.Next(this.split_clerk_warehouse)
)
Everything that is within the brackets belongs to the same expression, so Python will ignore the new lines.
Note that tuple literals also use brackets. For example:
() # empty tuple
(1, ) # singleton tuple (one element)
(1, 'a', 2, 5) # 4-tuple
But here we need to write a comma at the end for a singleton tuple, or multiple elements separated by comma's , (except for the empty tuple).
Here is my function that parses an addition equation.
expr_print({num,X}) -> X;
expr_print({plus,X,Y})->
lists:append("(",expr_print(X),"+",expr_print(Y),")").
Once executed in terminal it should look like this (but it doesn't at the moment):
>math_erlang: expr_print({plus,{num,5},{num,7}}).
>(5+7)
Actually one could do that, but it would not work the way wish in X in {num, X} is a number and not string representation of number.
Strings in Erlang are just lists of numbers. And if those numbers are in wright range they can be printed as string. You should be able to find detail explenation here. So first thing you wold like to do is to make sure that call to expr_print({num, 3}). will return "3" and not 3. You should be able to find solution here.
Second thing is lists:append which takes only one argument, list of list. So your code could look like this
expra_print({num,X}) ->
lists:flatten(io_lib:format("~p", [X]));
expr_print({plus,X,Y})->
lists:append(["(", expr_print(X),"+",expr_print(Y), ")"]).
And this should produce you nice flat string/list.
Another thing is that you might not need flat list. If you planning to writing this to file, or sending over TCP you might want to use iolist, which are much easier to create (you could drop append and flatten calls) and faster.
I am having trouble with Erlang converting listed numbers to characters whenever none of the listed items could not also be representing a character.
I am writing a function to separate all the numbers in a positive integer and put them in a list, for example: digitize(123) should return [1,2,3] and so on.
The following code works fine, except when the list only consist of 8's and/or 9's:
digitize(_N) when _N =:= 0 -> [];
digitize(_N) when _N > 0 -> _H = [_N rem 10], _T = digitize(_N div 10), _T ++ _H.
For example: Instead of digitize(8) returning [8], it gives me the nongraphic character "\b" and digitize(89) returns "\b\t". This is only for numbers 8 and 9 and when they're put alone inside the list. digitize(891) will correctly return [8,9,1] for example.
I am aware of the reason for this but how can I solve it without altering my result? (ex: to contain empty lists inside the result like [[],[8]] for digitize(8)).
If you look at comments you will see that it is more problem of the way shell prints your data, than the data itself. You logic is wright and I would not change it. You could introduce some use of io:format/2 into you code, but I guess that it would make it harder to use this function in other parts of code.
Other way around it is changing the shell settings itself. There is shell:strings/1 functions that disables printing lists as stings, and it should do exactly what you want. Just remember to change it back when you finish with your shell stuff, since it could introduce some confusion when you will start using some "string" returning functions.
As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing from that newline to the next newline, then I do string.find() to see if that English word is somewhere in there. This takes a VERY long time, each word taking about 1/2-1/4 a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (Multithreading), but it still only checks like 10 a second. (I need thousands)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not get the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contained complete garbage, just random letters.
I can't check for impossible compinations of letters, because that string would be thrown out because of the 'tm', in between 'thatmust'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word, and build a search table for it. You need to do it only once. Now your search for individual words will proceed at faster pace, because the "false starts" will be eliminated.
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice store the longer substring.
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.
Idea 2
Expand upon that idea by noting the longest word in the dictionary and get rid of letters from those strings in the arrays that are longer than that distance away.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with 1 and 2. Its an idea that you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (Its a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
| | |
| o -> b -> * y -> *
| |
| d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b(ill(y+))|(o(b|dy)))|(jose)/ (though I might have slipped a paren). This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking for the alternatives and if one matches, more forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note I built one of these some time back by writing a program that wrote the code that ran the algorithm directly instead of having code looking at the binary tree data structure.
Think of each set of vertical bar options being a switch statement against a particular character column and each arrow turning into a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970 is a
space-efficient probabilistic data structure that is used to test
whether an element is a member of a set. False positive matches are
possible, but false negatives are not; i.e. a query returns either
"inside set (may be wrong)" or "definitely not in set". Elements can
be added to the set, but not removed (though this can be addressed
with a "counting" filter). The more elements that are added to the
set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
This code was modified from How to split text without spaces into list of words?:
from math import log
words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
"""Uses dynamic programming to infer the location of spaces in a string
without spaces."""
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
c,k = best_match(i)
cost.append(c)
# Backtrack to recover the minimal-cost string.
costsum = 0
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
costsum += c
i -= k
return costsum
Using the same dictionary of that answer and testing your string outputs
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would split everything to single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary- even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: Training may take a few minutes (assuming you have already compiled gold standard "sentences" and "random scrambled strings" texts). You only need to train once, you can easily save the "trained" model to disk and reuse it for subsequent runs by loading from disk, which may take a few seconds. Making a call on a string would take a trivially small number of floating point multiplications to get a probability, so after you finish training it, it should be very fast.