I'm looking for an algorithm to convert a boolean expression tree into an optimal set of bitwise operations where the boolean inputs are indicated by bits in a word.
eg: suppose a set of bit flags like this:
enum
{
    Apples  = 1 << 0,
    Pears   = 1 << 1,
    Bananas = 1 << 2,
}
and suppose an expression like this:
Apples && (Pears || Bananas)
has been parsed into an expression tree like this:
AND
 |
 +-- Apples
 |
 +-- OR
      |
      +-- Pears
      |
      +-- Bananas
How would one convert that AST to an optimal set of bitmask/test operations?
A general algorithm would be great, but in my use case most of the expressions test each bit as either set or clear, and can be reduced to a bitwise test of the form ((input & mask) == test). eg:
Apples && Pears && !Bananas
can be evaluated as
(input & (Apples|Pears|Bananas)) == (Apples|Pears)
The only operators I need to support are boolean AND, OR and NOT.
I'll probably sit down and figure this out myself but thought it was an interesting problem that I've not seen discussed before. I'm wondering if there are existing solutions or if anyone has suggestions for how to approach it.
I figured this out and even wrote a .NET IL generator to generate code from the optimized execution plans. It's too much to explain here, but I wrote an article about it.
Also, full code can be found here
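For the simple set-or-clear case described in the question, the core reduction can be sketched in a few lines. This is only an illustrative Python sketch of that special case, not the IL generator or the general algorithm from the article; the tuple-based node shapes and the names FLAGS and reduce_conjunction are made up for the example.

# Reduce an AND/NOT-of-flags expression tree to one (mask, test) pair,
# so the whole expression evaluates as (input & mask) == test.
FLAGS = {"Apples": 1 << 0, "Pears": 1 << 1, "Bananas": 1 << 2}

def reduce_conjunction(node, mask=0, test=0):
    """Nodes are ("var", name), ("not", ("var", name)) or ("and", left, right)."""
    kind = node[0]
    if kind == "var":                              # flag must be set
        bit = FLAGS[node[1]]
        return mask | bit, test | bit
    if kind == "not" and node[1][0] == "var":      # flag must be clear
        bit = FLAGS[node[1][1]]
        return mask | bit, test                    # bit in mask, zero in test
    if kind == "and":
        mask, test = reduce_conjunction(node[1], mask, test)
        return reduce_conjunction(node[2], mask, test)
    raise ValueError("not a pure AND/NOT-of-flags expression")

# Apples && Pears && !Bananas
tree = ("and", ("and", ("var", "Apples"), ("var", "Pears")), ("not", ("var", "Bananas")))
mask, test = reduce_conjunction(tree)

def evaluate(input_bits):
    return (input_bits & mask) == test

assert evaluate(FLAGS["Apples"] | FLAGS["Pears"])                          # True
assert not evaluate(FLAGS["Apples"] | FLAGS["Pears"] | FLAGS["Bananas"])   # Bananas set, so False

Handling OR and NOT over whole subtrees needs more machinery; one option (a suggestion, not something taken from the article) is to push NOTs down with De Morgan's laws and represent an OR as a set of alternative (mask, test) pairs, falling back to ordinary short-circuit tests for anything that does not reduce.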
I have data as follows - please see the fiddle here for all data and code below:
INSERT INTO t VALUES
('|0|34| first zero'),
('|45|0| second zero'),
('|0|0| both zeroes');
I want to SELECT from the start of the line
1st character in the line is a pipe (|)
next characters are a valid (possibly negative - one minus sign) INTEGER
after the valid INT, another pipe
then another valid INT
then a pipe
The rest of the line can be anything at all - including sequences with pipe, INT, pipe, INT - but these are not to be SELECTed!
and I'm using a regex to try and SELECT the valid INTEGERs. A single ZERO is also a valid reading - one ZERO and one ZERO only!
The valid integers must be from between the first 3 pipe (|) characters and not elsewhere in the line - i.e.
^|3|3|adfasfadf |555|6666| -- tuple (3, 3) is valid
but
^|--567|-765| adfasdf -- tuple (--567, -765) is invalid - two minus signs!
and
^|This is stuff.... |34|56| -- tuple (34, 56) is invalid - doesn't start pipe, int, pipe, int!
Now, my regexes (so far) are as follows:
SELECT
SUBSTRING(a, '^\|(0{1}|[-+]?[1-9]{1}\d*)\|') AS n1,
SUBSTRING(a, '^\|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|') AS n2,
a
FROM t;
and the results I'm getting for my 3 records of interest are:
n1 n2 a
0 NULL |0|34| first zero -- don't want NULL, want 34
45 0 |45|0| second zero -- OK!
0 NULL |0|0| both zeroes -- don't want NULL, want 0
3 3 |3|3| some stuff here
...
... other data snipped - but working OK!
...
Now, the reason why it works for the middle one is that I have the (0{1}| ... ) alternative, along with the other parts of the regex, in both the upper and the lower regex!
So, that means take 1 and only 1 zero OR... the other parts of the regex. Fine, I've got that much!
However, and this is the crux of my problem, when I try to change:
'^\|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'
to
'^\|0{1}|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'
Notice the 0{1}| bit I've added near the beginning of my regex - so, this should allow one and only one ZERO at the beginning of the second string (preceded by a pipe literal (|)) OR the rest... the pipe at the end of my 5 character snippet above in this case being part of the regex.
But the result I get is unchanged for the first 3 records - shown above, but it now messes up many records further down - one example a record like this:
|--567|-765|A test of bad negatives...
which obviously fails (NULL, NULL) in the first SELECT, but now returns (NULL, -765) in the second. If the first fails, I want the second to fail!
I'm at a loss to understand why adding 0{1}|... should have this effect, and I'm also at a loss to understand why my (0, NULL), (45, 0) and (0, NULL) don't give me (0, 0), (45, 0) and (0, 0) as I would expect?
The 0{1}| snippet appears to work fine in the capturing groups, but not outside - is this the problem? Is there a problem with PostgreSQL's regex implementation?
All I did was add a bit to the regex which said as well as what you've accepted before, please accept one and only one leading ZERO!
I have a feeling there's something about regexes I'm missing - so my question is as follows:
could I please receive an explanation as to what's going on with my regex at the moment?
could I please get a corrected regex that will work for INTEGERs as I've indicated. I know there are alternatives, but I'd like to get to the bottom of the mistake I'm making here and, finally
is there an optimum/best method to achieve what I want using regexes? This one was sort of cobbled together and then added to as further necessary conditions became clearer.
I would want any answer(s) to work with the fiddle I've supplied.
Should you require any further information, please don't hesitate to ask! This is not a simple "please give me a regex for INTs" question - my primary interest is in fixing this one to gain understanding!
The problem with your modified pattern is precedence: alternation (|) binds most loosely, so '^\|0{1}|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|' is read as two independent alternatives, '^\|0{1}' or '[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|'. The anchor and leading pipe belong only to the first branch and the capturing group only to the second, so on '|--567|-765|...' the unanchored second branch matches '567|-765|' in the middle of the line and captures -765. To offer a lone zero as an alternative for one field, it has to sit inside a group, e.g. (0|[-+]?[1-9]\d*). With that, and with n1 also requiring a well-formed second field so that it fails whenever the whole |int|int| prefix is bad, some simplifications could be done to the patterns:
SELECT
SUBSTRING(a, '^\|(0|[+-]?[1-9][0-9]*)\|[+-]?[0-9]+\|') AS n1,
SUBSTRING(a, '^\|[+-]?[0-9]+\|(0|[+-]?[1-9][0-9]*)\|') AS n2,
a
FROM t;
n1 | n2 | a
:--- | :--- | :--------------------------------------------------------------
0 | 34 | |0|34| first zero
45 | 0 | |45|0| second zero
0 | 0 | |0|0| both zeroes
3 | 3 | |3|3| some stuff here
null | null | |SE + 18.5D some other stuff
-567 | -765 | |-567|-765|A test of negatives...
null | null | |--567|-765|A test of bad negatives...
null | null | |000|00|A test of zeroes...
54 | 45 | |54|45| yet more stuff
32 | 23 | |32|23| yet more |78|78| stuff
null | null | |This is more text |11|111|22222||| and stuff |||||||
null | null | |1 1|1 1 1|22222|
null | null | |71253412|ahgsdfhgasfghasf
null | null | |aadfsd|34|Fails if first fails - deliberate - Unix philosophy!
db<>fiddle here
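The precedence point above is easy to reproduce outside SQL, since alternation binds the same way in essentially every regex flavour. A small Python illustration, purely for demonstration (the Postgres patterns above are what belong in the query):

import re

line = "|--567|-765|A test of bad negatives..."

# Top-level alternation: the ^\| anchor belongs only to the first branch, so the
# unanchored second branch is free to match "567|-765|" in the middle of the line.
broken = re.search(r"^\|0{1}|[-+]?[1-9]{1}\d*\|(0{1}|[-+]?[1-9]{1}\d*)\|", line)
print(broken.group(1))   # -765  (the surprising partial match)

# Putting the zero alternative inside a group keeps the whole pattern anchored.
fixed = re.search(r"^\|(?:0|[-+]?[1-9]\d*)\|(0|[-+]?[1-9]\d*)\|", line)
print(fixed)             # None  (the whole |int|int| prefix must be well formed)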
I am working with a string variable response in Stata. This variable stores complete sentences, and many of these sentences have repeated phrases.
For example:
how do you know how do you know what it is?
it was during the during the past thirty days
well well I would hope I would hope that they're doing that
I want to clean these strings by removing all repeated phrases.
In other words, I want to transform this sentence:
how do you know how do you know what it is?
to the one below:
how do you know what it is?
So far, I have tried to fix each case individually, but this is incredibly time-consuming as there are thousands of repeated words/phrases.
I would like to run code that can identify when a phrase is repeated within the same observation / string, and then remove one instance of that phrase (or word).
I imagine regular expressions would help, but I cannot figure out much more than this.
The following works for me:
clear
input str80 string
"Pearly Spencer how do you know how do you know what it is?"
"it was during the during the past thirty days"
"well well I would hope I would hope that they're doing that"
"well well they're doing that I would hope I would hope "
"well well I would hope I would hope that they're doing that but but they don't"
end
clonevar wanted = string
local stop = 0
while `stop' == 0 {
    generate dup = ustrregexs(2) if ustrregexm(wanted, "(\W|^)(.+)\s\2")
    replace wanted = subinstr(wanted, dup, "", 1)
    capture assert dup == ""
    if _rc == 0 local stop = 1
    else drop dup
}
replace wanted = strtrim(stritrim(wanted))
list wanted
+----------------------------------------------------------+
| wanted |
|----------------------------------------------------------|
1. | Pearly Spencer how do you know what it is? |
2. | it was during the past thirty days |
3. | well I would hope that they're doing that |
4. | well they're doing that I would hope |
5. | well I would hope that they're doing that but they don't |
+----------------------------------------------------------+
The above solution uses a regular expression to first identify repeated words / phrases. Then it eliminates this from the string by substituting a space in its place.
Because this particular regular expression does not find all sets in one pass (for example, in the last observation there are three sets: well, I would hope, and but), the process is repeated using a while loop until no repeated elements remain in the string.
In the final step, all unnecessary spaces are deleted to bring the string back to shape.
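For readers who want to prototype the same idea outside Stata, the fixed-point loop can be sketched with any regex engine that supports backreferences. This is only an illustration of the principle (Python's re module, a made-up helper name), not part of the Stata solution above:

import re

def collapse_repeats(text):
    # A phrase followed by whitespace and an identical copy of itself; keep one copy.
    pattern = re.compile(r"(\b.+?)\s+\1\b")
    while True:
        new = pattern.sub(r"\1", text)
        if new == text:
            return re.sub(r"\s+", " ", new).strip()   # tidy up leftover spaces
        text = new

print(collapse_repeats("how do you know how do you know what it is?"))
# how do you know what it is?
print(collapse_repeats("well well I would hope I would hope that they're doing that"))
# well I would hope that they're doing that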
How do I build an H2O word2vec training_frame that distinguishes between different documents/sentences etc.?
As far as I can read from the very limited documentation I have found, you simply supply one long list of words? Such as
'This' 'is' 'the' 'first' 'This' 'is' 'number' 'two'
However it would make sense to be able to distinguish – ideally something like this:
Name | ID
This | 1
is | 1
the | 1
first | 1
This | 2
is | 2
number | 2
two | 2
Is that possible?
word2vec is a type of unsupervised learning: it turns string data into numbers. So to do a classification you need to do a two-step process:
word2vec for strings to numbers
any supervised learning technique for numbers to categories
The documentation contains links to a categorization example in each of R and Python. This tutorial shows the same process on a different data set (and there should be an H2O World 2017 video that goes with that).
By the way, in your original example, you don't just supply the words; the sentences are separated by NA. If you give h2o.tokenize() a vector of sentences, it will make this format for you. So your example would actually be:
'This' 'is' 'the' 'first' NA 'This' 'is' 'number' 'two'
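For reference, a minimal sketch of that flow with the H2O Python client (which exposes tokenize as a frame method) might look like the following. The column name "text", vec_size and the other parameter values are placeholders, and the frame-construction call may need adjusting to your data:

import h2o
from h2o.estimators.word2vec import H2OWord2vecEstimator

h2o.init()

# One row per sentence; the column must have type "string", not enum.
sentences = h2o.H2OFrame({"text": ["This is the first", "This is number two"]},
                         column_types=["string"])

# tokenize() produces one word per row with an NA row between sentences,
# which is the layout word2vec expects.
words = sentences["text"].tokenize("\\s+")

w2v = H2OWord2vecEstimator(vec_size=10, min_word_freq=1, epochs=5)
w2v.train(training_frame=words)

# Per-sentence vectors for a downstream supervised model:
vecs = w2v.transform(words, aggregate_method="AVERAGE")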
Is there any way to code a binary variable dependent on keywords being present in a given string variable? Simple example:
I have a string variable that describes various meals and a dummy variable that denotes if a given meal is breakfast or not. Is there any way to code
breakfast = 1 if meal== [then something saying contains eggs, bacon, etc.]
This is a silly example, but I am more interested in identifying a shortcut to coding binary variables, based on information found in string data.
The inbuilt strpos() will yield a positive value if a string is found inside another. Building on that
gen breakfast = strpos(meal, "bacon") | strpos(meal, "eggs")
and so forth. In practice, working with a string made lower case will often help, or indeed be essential. Also, if you have a long list, you may prefer
gen breakfast = 0
quietly foreach thing in bacon eggs cereal "orange juice" {
    replace breakfast = breakfast | strpos(lower(meal), `"`thing'"')
}
The principle here is using | (or) as a logical operator, yielding 1 (true) if any argument is non-zero. Note that lower() is included to compare with a lower case version of the original.
This technique is naturally not robust to spelling mistakes or small variations in wording.
You can use the incss function of the egenmore package for this.
ssc install egenmore
egen bacon = incss(meal), sub(bacon) insensitive
This gives you a dummy equal to one if for a given observation the string variable "meal" contains the word bacon. It is zero otherwise. The option insensitive tells Stata to not consider case sensitivity (otherwise Bacon is different from bacon). As far as I know you can only search for one sub-string at a time but you can easily write a loop for this:
foreach word in bacon eggs cheese {
    egen `word' = incss(meal), sub(`word') insensitive
}
I was thinking of a fast method to look for a submatrix m in a bigger matrix M.
I also need to identify partial matches.
A couple of approaches I could think of are:
Optimize the normal brute force to process only incremental rows and columns.
Maybe extend the Rabin-Karp algorithm to 2-D, but I'm not sure how to handle partial matches with it.
I believe this is quite a frequently encountered problem in image processing and would appreciate it if someone could share their thoughts or point me to resources/papers on this topic.
EDIT:
Smaller example:
Bigger matrix:
1 2 3 4 5
4 5 6 7 8
9 7 6 5 2
Smaller Matrix:
7 8
5 2
Result: (row: 1 col: 3)
An example of a smaller matrix which qualifies as a partial match at (1, 3):
7 9
5 2
If more than half of the pixels match, then it is taken as a partial match.
Thanks.
I recommend doing an internet search on "2d pattern matching algorithms". You'll get plenty of results. I'll just link the first hit on Google, a paper that presents an algorithm for your problem.
You can also take a look at the citations at the end of the paper to get an idea of other existing algorithms.
The abstract:
An algorithm for searching for a two dimensional m x m pattern in a two dimensional n x n text is presented. It performs on the average less comparisons than the size of the text: n^2/m using m^2 extra space. Basically, it uses multiple string matching on only n/m rows of the text. It runs in at most 2n^2 time and is close to the optimal n^2 time for many patterns. It steadily extends to an alphabet-independent algorithm with a similar worst case. Experimental results are included for a practical version.
There are very fast algorithms for this if you are willing to preprocess the matrix and if you have many queries for the same matrix.
Have a look at the papers on Algebraic Databases by the research group on Multimedia Databases (Prof. Clausen, University of Bonn), for example this one: http://www-mmdb.iai.uni-bonn.de/download/publications/sigir-03.pdf
The basic idea is to generalize inverted lists so that they work with any kind of algebraic transformation, instead of just shifts in one direction as with ordinary inverted lists.
This means that this approach works whenever the modifications you need to apply to the input data can be modelled algebraically. Specifically, queries that are translated in any number of dimensions, rotated, flipped, etc. can all be retrieved.
The paper is mainly showing this for musical data, since this is their main research interest, but you might be able to find others, which show how to adapt this to image data as well (or you can try to adapt it yourself, if you understand the principle it's quite simple).
Edit:
This idea also works with partial matches, if you define them correctly.
There is no way to do this fast if you only ever need to match one small matrix against one big matrix. But if you need to do many small matrices against big matrices, then preprocess the big matrix.
A simple example, exact match, many 3x3 matrices against one giant matrix.
Make a new "match matrix" the same size as the big matrix. For each location (x, y) in the big matrix, compute a hash of the 3x3 block running from (x, y) to (x+3, y+3). Now you just scan the match matrix for matching hashes.
You can achieve partial matches with specialized hash functions that give the same hash to things that have the same partial matching properties. Tricky.
If you want to speed up further and have memory for it, create a hash table for the match matrix, and lookup the hashes in the hash table.
The 3x3 solution will work for any test matrix 3x3 or larger. You don't need a perfect hash method - you just need something that will reject the majority of bad matches, and then do a full comparison on the candidates the hash table turns up.
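A minimal sketch of the exact-match variant of this idea in Python; here the "hash" is simply the 3x3 block itself stored as a tuple key (a stand-in for a cheaper hash function), and the function names are made up for the example:

def build_block_index(big, k=3):
    # Map every k x k block of `big` (stored as a tuple, standing in for its hash)
    # to the list of top-left positions where that block occurs.
    rows, cols = len(big), len(big[0])
    index = {}
    for i in range(rows - k + 1):
        for j in range(cols - k + 1):
            key = tuple(big[i + r][j + c] for r in range(k) for c in range(k))
            index.setdefault(key, []).append((i, j))
    return index

def find_exact(big, small, index, k=3):
    # Look up the top-left k x k block of `small`, then verify the full match.
    rows, cols = len(big), len(big[0])
    m, n = len(small), len(small[0])
    key = tuple(small[r][c] for r in range(k) for c in range(k))
    hits = []
    for (i, j) in index.get(key, []):
        if i + m <= rows and j + n <= cols:
            if all(big[i + r][j + c] == small[r][c] for r in range(m) for c in range(n)):
                hits.append((i, j))
    return hits

big = [[1, 2, 3, 4, 5],
       [4, 5, 6, 7, 8],
       [9, 7, 6, 5, 2]]
index = build_block_index(big)
print(find_exact(big, [[3, 4, 5], [6, 7, 8], [6, 5, 2]], index))   # [(0, 2)]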
I don't think there is an approach that lets you jump straight to where the submatrix is, but you can optimize your searching.
For example, given an M x N matrix A and an m x n submatrix B, you can do something like this:
SearchSubMatrix (Matrix A, Matrix B)
answer = (-1, -1)
Loop1:
for i = 0 ... (M-m)
|
| for j = 0 ... (N-n)
| |
| | bool found = (A[i][j] = B[0][0])
| |
| | if found then
| | |
| | | Loop2:
| | | for r = 0 ... (m-1)
| | | | for s = 0 ... (n-1)
| | | | | if B[r][s] != A[r+i][s+j] then
| | | | | | found = false
| | | | | | break Loop2
| |
| | if found then
| | | answer = (i, j)
| | | break Loop1
|
return answer
Doing this, you reduce the number of starting positions you have to check according to the size of the submatrix.
Matrix       Submatrix    Worst Case:

1 2 3 4      2 4          [1][2][3] 4
4 3 2 1      3 2          [4][3][2] 1
1 3 2 4                   [1][3]{2 4}
4 1 3 2                    4  1 {3 2}

(M-m+1)(N-n+1) = (4-2+1)(4-2+1) = 9
Although this is O(M*N), it will never check all M*N starting positions unless your submatrix is a single element.
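For the partial-match criterion mentioned in the question (more than half of the cells agreeing), the same brute-force scan can count matching cells instead of bailing out at the first mismatch. An illustrative Python sketch; the function name and threshold handling are my own, not from any of the answers:

def find_matches(A, B, threshold=0.5):
    # Return (row, col, fraction) for every placement of B over A where more than
    # `threshold` of the cells agree; fraction == 1.0 means an exact match.
    M, N, m, n = len(A), len(A[0]), len(B), len(B[0])
    results = []
    for i in range(M - m + 1):
        for j in range(N - n + 1):
            same = sum(A[i + r][j + c] == B[r][c] for r in range(m) for c in range(n))
            frac = same / (m * n)
            if frac > threshold:
                results.append((i, j, frac))
    return results

A = [[1, 2, 3, 4, 5],
     [4, 5, 6, 7, 8],
     [9, 7, 6, 5, 2]]

print(find_matches(A, [[7, 8], [5, 2]]))   # [(1, 3, 1.0)]   exact match from the question
print(find_matches(A, [[7, 9], [5, 2]]))   # [(1, 3, 0.75)]  partial match: 3 of 4 cells agree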