How to use environments for lookups - regex

My question builds upon the topic of matching a string against multiple patterns. One solution discussed here is to use sapply(keywords, grepl, strings, ignore.case=TRUE) which yields a two-dimensional matrix.
However, I run into significant speed issues when applying this approach to 5K+ keywords and 60K+ strings (I cancelled the process after 12 hours).
One idea is to use hash tables, i.e. environments in R. However, I don't understand how to "translate/convert" my strings into an environment while keeping the numerical index.
I have strings[1] through strings[60000]:
e <- new.env(hash=TRUE)
for (i in 1:length(strings)) {
assign(x=i, value=strings, envir=e)
}
As x in assign must be a character, I can't use it like this, but I hope you get my idea. I want to be able to index the environment with the same numbers as in my strings[...] vector.
Thanks for your help!

R environments are not used as much as Perl hashes are, I think
just because there are not widely understood 'idioms' for doing
so. In your case the key question is, do you really want the
numerical index? If so it should be the value. The key is your
string, that's the whole point of the exercise.
e <- new.env(hash=TRUE)
strings <- as.character(chickwts$feed) # note! not unique
# key = string, value = its index; a repeated key overwrites the earlier value
for (i in seq_along(strings)) assign(strings[i], i, envir=e)
e$horsebean # returns 10
In this example only the last index associated with each string
is kept, but you can assign anything that might be useful to each
key, such as a vector of indices.
You can then lookup your data in a number of ways. You can regex search
for keys using ls, for example, and retrieve the values using mget():
# find all keys containing 'bean'
ls(e, pattern='bean')
# retrieve bean data
mget(ls(e, pattern='bean'), envir=e)
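The same key/value idea can be sketched outside R as well. The following is a minimal illustration in Python (chosen only for illustration, not the answer's own code) that stores a list of every position per string instead of only the last one, then searches the keys by regex. The sample `strings` list and the names `index`, `bean_keys`, and `bean_hits` are assumptions made up for this sketch.

```python
import re
from collections import defaultdict

# Hypothetical sample data standing in for the `strings` vector.
strings = ["horsebean", "linseed", "soybean", "horsebean", "casein"]

# Key = string, value = every 1-based position at which it occurs.
index = defaultdict(list)
for i, s in enumerate(strings, start=1):
    index[s].append(i)

# Regex search over the (unique) keys only, then fetch the stored positions.
bean_keys = sorted(k for k in index if re.search("bean", k))
bean_hits = {k: index[k] for k in bean_keys}
print(bean_hits)
```

Because the regex only runs over the unique keys, duplicated strings are matched once rather than once per occurrence.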

Related

Using Regex in R with huge amount of strings

I have a very large data set, and the next step is to delete certain strings (i.e. the associated rows) based on patterns. I need to use regex for that. For example, imagine column A as:
A-929.XZT-93002-B-DKE
A-938-XZT-29849-B-DKE
A-938-AXZ-93923-B-DKE
...
...
There are many more columns besides A. Now I want to delete every row that contains the phrase "XZT" preceded by anything other than a letter. In this case that would be row 1 and row 2.
My question is as follows:
Can this be done in R as effectively as for example in VBA? Which package would you recommend to do so, or can it be done just as effectively with the base functions?
I am asking because there are different ways to apply regex in R and I have to do it for more than 20,000 rows numerous times, so I want to do it as fast as possible.
Thanks

Clojure dictionary of words

I want a dictionary of English words available, to pick random English words. I have a dictionary text file that I downloaded from the internet which has almost 1 million words. What's the best way to go about using this list in Clojure, given that most of the time I'll only need 1 randomly selected word?
Edit:
To answer the comments, this is for some tests which I may turn into load tests which is why I want a decent number of random words and I guess access speed is the most important thing. I do not want to use a database for this. I originally thought of a dictionary just because that's the first thing that popped into my mind but I think a random sequence of letters and numbers would be good enough, perhaps I will just use a UUID as a string.
Read all the words into a vector and then call rand-nth, e.g.
(rand-nth all-words)
rand-nth uses the nth function on the underlying data structure, and Clojure vectors have O(log32 N) performance for index-based retrieval.
Edit: This is assuming that it is for a test environment as you described in your question. A more memory efficient method would be to use RandomAccessFile and seek to a random location in the file of words, read until you find the first word delimiter (e.g. comma, EOL) and then read the following bytes until the next delimiter which will give you a random word.
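The seek-based idea can be sketched as follows. This is a rough illustration in Python rather than Clojure, assuming a non-empty, newline-delimited UTF-8 word file with no blank lines; the function name `random_word` is made up for this sketch.

```python
import os
import random

def random_word(path, rng=random):
    """Return one word from a newline-delimited file without loading it all."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        fh.seek(rng.randrange(size))  # jump to a random byte offset
        fh.readline()                 # discard the (probably partial) current line
        line = fh.readline()          # the next complete line
        if not line:                  # ran off the end: wrap to the first word
            fh.seek(0)
            line = fh.readline()
    return line.decode("utf-8").strip()
```

Note that this selection is not uniform: a word that follows a long word is more likely to be picked, because more byte offsets lead to it. For test data that is usually acceptable.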

R: searching within split character strings with apply

Within a large data frame, I have a column containing character strings e.g. "1&27&32" representing a combination of codes. I'd like to split each element in the column, search for a particular code (e.g. "1"), and return the row number if that element does in fact contain the code of interest. I was thinking something along the lines of:
apply(df["MEDS"],2,function(x){x.split<-strsplit(x,"&")if(grep(1,x.split)){return(row(x))}})
But I can't figure out where to go from there since that gives me the error:
Error in apply(df["MEDS"], 2, function(x) { :
dim(X) must have a positive length
Any corrections or suggestions would be greatly appreciated, thanks!
I see a couple of problems here (in addition to the missing semicolon in the function).
df["MEDS"] is more correctly written df[,"MEDS"]. It is a single column. apply() is meant to operate on each column/row of a matrix as if they were vectors. If you want to operate on a single column, you don't need apply()
strsplit() returns a list of vectors. Since you are applying it to a row at a time, the list will have one element (which is a character vector). So you should extract that vector by indexing the list element strsplit(x,"&")[[1]].
You are returning row(x) as if the input to your function were a matrix or knew what row it came from. It does not. apply() will pull each row and pass it to your function as a vector, so row(x) will fail.
There might be other issues as well. I didn't get it fully running.
As I mentioned, you don't need apply() at all. You really only need to look at the 1 column. You don't even need to split it.
OneRows <- which(grepl('(^|&)1(&|$)', df$MEDS))
as Matthew suggested. Or if your intention is to subset the dataframe,
newdf <- df[grepl('(^|&)1(&|$)', df$MEDS),]
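To see why the anchored pattern is needed, here is a small sketch in Python (for illustration only) using the same regular expression. The `meds` list is made-up sample data standing in for the MEDS column.

```python
import re

# Hypothetical stand-in for the MEDS column.
meds = ["1&27&32", "11&27", "27&1", "31", "1"]

# "1" must be delimited by the start/end of the string or by "&",
# so codes such as "11" or "31" do not count as a match.
pattern = re.compile(r"(^|&)1(&|$)")

# 1-based row numbers, to mirror R's which().
one_rows = [i for i, s in enumerate(meds, start=1) if pattern.search(s)]
print(one_rows)  # [1, 3, 5]
```

A plain search for "1" would also have flagged rows 2 and 4.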

SQL Hash table for words

I'm trying to solve the "find all possible words for a set of letters" problem. There are some good answers out there, but I still can't figure it out.
In my first test, I put the whole dictionary in an array and then looped through each letter. This is super fast, but it takes forever to load the dictionary in the array, and requires huge amount of memory.
So I need to store the dictionary (750,000 words) in a SQL database.
I guess there are two solutions to find all the possible words:
Make an advanced query that returns all the possible words
Make a simple query that returns a fraction of the database with words that might be possible, and then quickly loop through that array and validate the words.
The problem?:
It must be super fast. An iPhone 4 needs to be able to get all possible words in under 5-6 seconds so it doesn't hinder the game.
Here's a similar question:
IOS: Sqlite. Find record fast
Sulthan's answer seems like a good idea. Create a hash table, and then:
Bitmask for ASCII letters (ignoring any non-ASCII alphabets). Bit at
position 0 means the word contains "a", at position 1 contains "b"
etc. If we create the same bitmask for our letters, we can select
words such as (wordMask & ~lettersMask) == 0
How do you make the bitmask, hash table, and how do you construct the sql query?
Thanks
SQL is probably not the best option. The traditional data structure for storing a collection of words is called a Trie. I'm sure there are implementations out there you can find. Someone else will have an answer to that.
The algorithm I envision is to permute the letters you are given, and check each permutation to see if it is in the Trie.
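For reference, the bitmask filter quoted in the question can be sketched like this. It is a minimal Python illustration, not a tuned iOS or SQLite solution; the tiny `dictionary` list and the names `letter_mask`, `candidates`, and `playable` are assumptions for this sketch.

```python
from collections import Counter

def letter_mask(text):
    """Set bit 0 for 'a', bit 1 for 'b', ... for each ASCII letter present."""
    mask = 0
    for ch in text.lower():
        if "a" <= ch <= "z":
            mask |= 1 << (ord(ch) - ord("a"))
    return mask

# Hypothetical dictionary; the masks would be precomputed and stored once.
dictionary = ["cat", "act", "tack", "at", "dog"]
word_masks = {w: letter_mask(w) for w in dictionary}

def candidates(letters):
    """Words that use no letter outside `letters` (ignores letter counts)."""
    lm = letter_mask(letters)
    return [w for w, m in word_masks.items() if m & ~lm == 0]

def playable(word, letters):
    """Exact check: every letter is available often enough."""
    return not (Counter(word) - Counter(letters))

print(candidates("tac"))  # ['cat', 'act', 'at']
```

The mask only records which letters occur, not how many times, so it serves as a cheap first pass; a count-based check such as `playable` is still needed on the survivors. In a database, the mask would presumably be stored as an integer column and filtered with the same `(wordMask & ~lettersMask) == 0` condition.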

regex matching multiple values when they might not exist

I am trying to write a preg_match_all to match horse race distances.
My source lists races as:
xmxfxy
I want to match the m value, the f value, the y value. However different races will maybe only have m, or f, or y, or two of them or even all three.
// e.g. $raw = '5f213y';
preg_match_all('/(\d{1,})m|(\d{1,})f|(\d{1,})y/', $raw, $distance);
The above sort of works, but for some reason the matches appear in unpredictable positions in the returned array. I guess it is because it is running the match 3 times, once for each OR. How do I match all three (that may or may not exist) in a single run?
EDIT
A full sample string is:
Hardings Catering Services Handicap (Div I) Cl6 5f213y
If I understand you correctly, you're processing listings (like the one in your question) one at a time. If that's the case, you should be using preg_match, not preg_match_all, and the regex should match the whole "distance" code, not individual components of it. Try this:
preg_match('#\b(?:(?<M>\d+)m|(?<F>\d+)f|(?<Y>\d+)y){1,3}\b#',
$raw, $distance);
The results are now stored in a one-dimensional array, but you don't need to worry about the group numbers anyway; you can access them by name instead (e.g., $distance['M'], $distance['F'], $distance['Y']).
Note that, while this regex matches codes with one, two, or three components, it doesn't require the letters to be unique. There's nothing to stop it from matching something like 1m2m3m (a weakness shared by your own approach, by the way).
You can use "?" to make each group optional:
preg_match_all('/((\d{1,})m)?|((\d{1,})f)?|((\d{1,})y)?/', $raw, $distance);
If I understand what you're asking correctly, you would like to get each number from these values separately? This works for me:
$input = "Hardings Catering Services Handicap (Div I) Cl6 5f213y";
preg_match_all('/((\d+)(m|f|y))/', $input, $matches);
After the preg_match_all() executes, $matches[2] holds an array of the numbers that matched (in this case, $matches[2][0] is 5 and $matches[2][1] is 213).
If all three values exist, m will be in $matches[2][0], f in $matches[2][1], and y in $matches[2][2]. If any values are missing, the next value gets bumped up a spot. It may also come in handy that $matches[3] will hold an array of the corresponding letter matched on, so if you need to check whether it was an m, f, or y, you can.
If this isn't what you're after, please provide an example of the output you would like to see for this or another sample input.
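The number-plus-letter approach above can also key the results by unit, so that a missing component does not shift the others. This is a small sketch in Python (for illustration, not PHP) that assumes the distance code is the last whitespace-separated token of the listing, as in the sample; `parse_distance` is a made-up name.

```python
import re

def parse_distance(listing):
    """Map each unit letter (m, f, y) found in the distance code to its number."""
    code = listing.split()[-1]  # assumption: the code is the final token
    return {unit: int(num) for num, unit in re.findall(r"(\d+)([mfy])", code)}

sample = "Hardings Catering Services Handicap (Div I) Cl6 5f213y"
distance = parse_distance(sample)
print(distance)           # {'f': 5, 'y': 213}
print(distance.get("m"))  # None, because this race has no miles component
```

Like the patterns above, this does not reject a code that repeats a unit (for example 1m2m3m); the later value would simply overwrite the earlier one.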