"The Grid Search" on HackerRank with R language - regex

I've been working on a pretty simple problem on HackerRank for a few days, but I'm stuck with timeout problems and I can't optimize my code any further.
The problem is this:
Given a 2D array of digits (dimensions R * C), try to find the occurrence of a given 2D pattern of digits (dimensions r * c).
Here is a reproducible example of the variables:
pattern <- c("11111", "11111", "11110")
text <- c("111111111111111",
"111111111111111",
"111111011111111",
"111111111111111",
"111111111111111")
R <- 5
C <- 15
r <- 3
c <- 5
It's sort of a regex problem, but in 2D, and this is something that I couldn't find anywhere as a ready-to-use function in R.
There are a few corner cases that I've been able to cope with, trying to avoid the brute force option (the above version is one of those cases, where the usual 'regexp' miserably fails to find the pattern).
Below is my code: it works perfectly fine for 13 out of 15 test cases, but it fails with a timeout when it runs against tests with (e.g.) R*C = 500*500 and r*c = 236*208.
pattern2 <- paste0(pattern, collapse = "")
RW <- rep(NA, (C - c + 1) * (R - r + 1))
for (v in 1:(C - c + 1)) {
  for (y in 1:(R - r + 1)) {
    RW[(C - c + 1) * (y - 1) + v] <- paste0(substr(text[y:(y + r - 1)], v, c + v - 1), collapse = "")
  }
}
result <- if (pattern2 %in% RW) "YES" else "NO"
cat(result, "\n")
Please note that there are up to 5 cases in each test, and this is the reason why my code fails: it would pass if each test were broken into its 5 cases, but it exceeds the time threshold when the cases are combined together with big R*C and r*c dimensions.
Does anybody have an idea on how to improve the code performance?

If you want to keep your approach, my first suggestion would be to convert the strings to numeric matrices, because substr is probably not very fast.
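For instance, here is a minimal sketch of that conversion, using the example variables from the question:
# split each row into single digits and bind the rows into an R x C integer matrix
text_mat <- do.call(rbind, lapply(strsplit(text, ""), as.integer))
pat_mat <- do.call(rbind, lapply(strsplit(pattern, ""), as.integer))
dim(text_mat) # 5 x 15 for the example above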
You can further make use of more sophisticated matching algorithms which shift the position by more than one place, like the Knuth-Morris-Pratt algorithm.
However, for loops will always be quite slow in R, so I feel that the best approach in this situation would indeed be a regular expression. If you concatenate the rows of the big grid into one long string the number of characters between the rows of the pattern is fixed. This means you can do something like this (which, I believe, solves the test case you gave):
grepl(
  paste0(pattern[1], ".{", C - c, ",}",
         pattern[2], ".{", C - c, "}",
         pattern[3]),
  paste0(text, collapse = "")
)
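For what it's worth, the same idea should generalize to a pattern with any number of rows by joining the rows with a fixed gap of C - c characters; a sketch, not validated against HackerRank's full test set:
# the gap between consecutive pattern rows in the collapsed text is always C - c
gap <- paste0(".{", C - c, "}")
# safe to drop the rows straight into the regex because the grid contains only digits
big_pattern <- paste0(pattern, collapse = gap)
result <- if (grepl(big_pattern, paste0(text, collapse = ""))) "YES" else "NO"
cat(result, "\n")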
https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

Related

find number of substrings that become palindromes after you reduce the substring

This is an interview problem.
You are given a string. Example S = "abbba".
If I choose a substring, say "bbba", then this selection gets reduced to "ba" (all continuously repeating characters are dropped).
You need to find the number of odd/even length substring selections that would result in a palindrome after reduction.
I could solve it using 2D DP if it were not for the reduction condition, which makes the problem complicated.
First, reduce your entire string and, for each character of the reduced string, save the length of the run it came from in the original string (this can be done in O(n)). Let the reduced string be x1x2...xk and the respective quantities be q1, q2, ..., qk.
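In R, for example, the reduction step is just rle() (a minimal sketch):
s <- "abbba"
runs <- rle(strsplit(s, "")[[1]]) # run-length encoding of the characters
reduced <- paste(runs$values, collapse = "") # "aba"
quantities <- runs$lengths # 1 3 1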
Calculate the 2D dp you mentioned but for the reduced string (takes O(k^2)).
Now, it becomes a combinatorics problem. It can be solved using simple combinatorics principles, like the additive principle and the multiplicative principle. The total number of substring selections that become palindromes after reduction is:
q1 * dp[1][1] + q2 * (dp[1][2] + dp[2][2]) + ... + qk * (dp[1][k] + dp[2][k] + ... + dp[k][k])
It takes O(k^2) to compute this sum.
Now, how many of those have odd length? How many have even length?
To find out, you need some more simple observations about odd and even numbers and some careful case-by-case analysis. I will let you try it. Let me know if you have any problems.

I need to rework this code so it includes everything that could read "inpatient" - including when misspelt?

I either need to find a way of recoding this to include misspelt versions of "inpatient", or to 'flag' those that haven't been affected by it.
df1$Admission_Type <- as.character(df1$Admission_Type)
df1$Admission_Type[df1$Admission_Type == "Inpatient"]<-"ip"
df1$Admission_Type[df1$Admission_Type == "inpatient"]<-"ip"
df1$Admission_Type[df1$Admission_Type == "INPATIENT"]<-"ip"
It repeats like this.
To deal with case issues, convert all to lower case
df1 <- data.frame(Admission_Type = c("Inpatient", "inpatient", "INPATIENT", "inp", "impatient"), stringsAsFactors = FALSE)
df1$Admission_Type <- tolower(df1$Admission_Type)
Then you can use regular expressions to deal with misspellings. While it's impossible to catch them all, you can use intuition to get close. In my example, I made the (intentional) misspelling "impatient". You can set up a regular expression to detect this possibly common mistake like so:
grep("^i[nm]pat[ie][ei]nt", df1$Admission_Type, ignore.case = TRUE)
where I allowed the second position to be either an 'n' or 'm', or the 'ie' to be switched at positions 6-7. This returns
[1] 1 2 3 5
You can add likely misspelled letters at each position. If you search around, you'll find plenty of tips on how to make this regex more elaborate to allow for missing or extra letters.
Note you can use gsub to do the replacement automatically.
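For example, a rough sketch using the same pattern, anchored with ^ and $ so only values that match in full are replaced:
df1$Admission_Type <- gsub("^i[nm]pat[ie][ei]nt$", "ip", df1$Admission_Type, ignore.case = TRUE)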
df1$Admission_Type[grepl("inpatient", df1$Admission_Type, ignore.case=TRUE)] = "ip" will cover the cases you listed. #JohnSG's answer shows how to include potential misspellings into the regular expression as well. (You'll probably want to create a new column to store your recodings (at least while you're testing out different options) rather than overwriting the original column of data.)
As #alistaire mentioned, you can use agrep for approximate matching. For example:
x = c("inpatient","Inpatient","Impatient","inpateint")
agrep("inpatient", x, max.dist=2, ignore.case=TRUE)
So, in your case, you could do:
df1$Admission_Type[agrep("inpatient", df1$Admission_Type, max.dist=2, ignore.case=TRUE)] = "ip"
agrep returns the indices of the matching values. max.dist controls how different the actual values can be from the target value and still be considered a match. You'll probably need to test and tweak this to capture misspellings while avoiding incorrect matches.
grepl covers the cases you listed in your question, but for future reference, if you ever do need to match on a number of separate values, you can reduce the amount of code needed by using the %in% function. In your case, that would be:
df1$Admission_Type[df1$Admission_Type %in% c("Inpatient","inpatient","INPATIENT")]<-"ip"
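If you also want the 'flagging' you mentioned, one simple option (a sketch; the column name is made up) is a logical column marking whatever the rules above did not recode:
df1$needs_review <- df1$Admission_Type != "ip" # TRUE for values still not recoded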

Extracting sentences using scan() in R

I've been told that I shouldn't use R to scan text (but I have been doing so, anyway, pending the acquisition of other skills) and encountered a problem that has confused me sufficiently to retreat to these fora. Thanks for any help, in advance.
I'm trying to store a large amount of text (e.g., a short story) as a vector of strings, each of which is a separate sentence. I've been doing this using the scan() function, but I am encountering two basic problems: (1) scan() only seems to allow a single separating character, whereas sentences can obviously end in multiple ways. I know how to mark the end of a sentence using regex (e.g. [!?\.]), but I don't know of a function in R that uses regular expressions to split text. (2) scan() seems to automatically regard a new line as a new field, whereas I want it to ignore new lines unless they coincide with the end of a sentence.
download.file("http://www.textfiles.com/stories/3lpigs.txt","threelittlepigs.txt")
threelittlepigs_s <- scan("threelittlepigs.txt", character(0),
                          sep = ".", quote = NULL)
If I don't include the 'quote=NULL' option, scan() throws the warning that an EOF (end of field, I'm guessing) falls within a quoted string. This produces a handful of multi-line elements/fields, but pretty erratically. I can't seem to discern a pattern.
Sorry if this has been asked before. I'm sure there's an easy solution. I would prefer one that helps me make sense of why scan() isn't working the way I would expect, but if there are better tools to read text in R, please do let me know.
R has some really strong text-mining capability, with many good packages - for example tm, rvest, stringi, and others.
But here is a simple example of doing this almost completely in base R. I only use the %>% pipe from magrittr because I think this makes the code a bit more readable.
The specific answer to your question is that you can use regular expressions to search for multiple punctuation marks. In the example below I use "[\\.?!] ", meaning a period, question mark or exclamation mark, followed by a space. You may have to experiment.
Try this:
library("magrittr")
url <- "http://www.textfiles.com/stories/3lpigs.txt"
# paste() here prepends the URL to every line before collapsing them into one
# string; the gsub() then strips those URL copies back out
corpus <- url %>%
  paste(readLines(url), collapse = " ") %>%
  gsub("http://www.textfiles.com/stories/3lpigs.txt", "", .)
head(corpus)

# squeeze repeated spaces, then split on ". ", "? " or "! "
z <- corpus %>%
  gsub(" +", " ", .) %>%
  strsplit(split = "[\\.?!] ")
z[[1]]
The results:
z[[1]]
[1] " THE THREE LITTLE PIGS Once upon a time "
[2] ""
[3] ""
[4] "there were three little pigs, who left their mummy and daddy to see the world"
[5] "All summer long, they roamed through the woods and over the plains,playing games and having fun"
[6] "None were happier than the three little pigs, and they easily made friends with everyone"
[7] "Wherever they went, they were given a warm welcome, but as summer drew to a close, they realized that folk were drifting back to their usual jobs, and preparing for winter"
[8] "Autumn came and it began to rain"
[9] "The three little pigs started to feel they needed a real home"
[10] "Sadly they knew that the fun was over now and they must set to work like the others, or they'd be left in the cold and rain, with no roof over their heads"
...etc

Time complexity of regex and Allowing jitter in pattern finding

To find patterns in a string, I have the following code. In it, find.string finds the substring of maximum length subject to (1) the substring must be repeated consecutively at least th times and (2) the substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  for (k in len:1) {
    pat <- paste0("(.{", k, "})", reps("\\1", th - 1))
    r <- regexpr(pat, string, perl = TRUE)
    if (attr(r, "capture.length") > 0) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length") - 1) else ""
}
An example for the above-mentioned code: for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab".
NOW I am trying to get patterns allowing a jitter of 1 character. Example: for the string "a0cc0vaaaabaaadbaaabbaa00bvw" the pattern should come out to be "aaajb", where "j" can be anything. Can anyone suggest a modification of the above-mentioned code, or any new code for pattern finding, that could allow such jitters?
Also, can anyone throw some light on the TIME COMPLEXITY and INTERNAL ALGORITHM used by the regexpr function?
Thanks! :)
Not very efficient but tada:
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  found <- FALSE
  for (sublen in len:1) {
    for (inlen in 0:sublen) {
      pat <- paste0("((.{", sublen - inlen, "})(.)(.{", inlen, "}))",
                    reps("(\\2.\\4)", th - 1))
      r <- regexpr(pat, string, perl = TRUE)
      if (attr(r, "capture.length")[1] > 0) {
        found <- TRUE
        break
      }
    }
    if (found) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length")[1] - 1) else ""
}

find.string("a0cc0vaaaabaaadbaaabbaa00bvw") # returns "aaaab"
Without any fuzzy matching tool available, I manually check each possibility. I use an inner loop to try different prefix and suffix lengths on either side of the "jitter" character. The prefix is grouped as \2 and the suffix as \4 (the jitter is \3 but I don't use it). Then, the repeated part tries to match \2.\4 - so the prefix, any new jitter character, and the suffix.
I say not efficient because it's evaluating O(len^2) different patterns, versus O(len) patterns in your code. For large len this might become a problem.
Note that I have multiple groups, and only look at the [1] position. The full r variable has more useful information, for example [1] will be the first part, [5] will be the 2nd part, [6] will be the 3rd part, etc. Also [3] will be the "jitter" character in the 1st part.
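For example, a small sketch of pulling those pieces out via the capture attributes (assuming pat and string hold the pattern and input from the loop above):
r <- regexpr(pat, string, perl = TRUE)
starts <- attr(r, "capture.start")
lens <- attr(r, "capture.length")
groups <- substring(string, starts, starts + lens - 1)
groups[1] # the whole first part
groups[3] # the "jitter" character in the first part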
Regarding the complexity of the actual regex: it varies a lot. However, often the construction (setup) of a particular regex is vastly more intensive than the actual matching, which is why a single pattern used repeatedly can produce better results than multiple patterns. In truth, this varies a lot based on the pattern and the engine you're using - see the links at the end for more info about complexity.
Regarding how regex works: just a note, this is going to be a very theoretical overview; it's not meant to indicate how any particular regex engine works.
For a more practical overview, there are plenty of sites that cover just enough to know how to use a regex, but not how to build your own engine; for example http://www.regular-expressions.info/engine.html
Regex is what's known as a state machine, specifically a (non-deterministic) finite state automaton (NFA). A very simple, real-world state machine is a lightbulb: it's either on or off, and different inputs can change the state it's in. A regex is much more complex: (generally) each symbol in the pattern forms a state, and different input can send it to different states. So if you have \d\d\d, there are 3 virtual states, each of which can accept any digit, and any other input goes to a 4th "failure" state. The end result is the end state after all input is 'consumed'.
Perhaps you can imagine: this gets vastly more complicated, with many, many states, when you use any ambiguity, such as wildcards or alternation. So our \d\d\d regex will basically be linear, but a more complicated one will not be. Part of the optimization in a regex engine is converting an NFA to a DFA - a deterministic finite state automaton. Here, the ambiguity is removed, generating many more states, and this is the very computationally complex process referenced above (the construction stage).
This is really just a very theoretical overview of an ideal NFA. In practice, modern regex grammars can do a lot more than this; for example, backreferences are not technically possible in a "proper" regex.
This might be a bit too high-level, but that's the basic idea. If you're curious, there are plenty of good articles about regex, different flavors, and their complexity. For example: http://swtch.com/~rsc/regexp/regexp1.html
There are basically two regex algorithm types: Perl-style (with a lot of complex backtracking) and Thompson-NFA.
http://swtch.com/~rsc/regexp/regexp1.html
To determine which engine R uses, R's SVN repo is here:
root repo: http://svn.r-project.org/R/
regex source: http://svn.r-project.org/R/branches/R-exp-uncmin/src/regex
I poked around in there a bit and found a file called "engine.c". At first glance it doesn't look like a Thompson NFA, but I didn't take long to read it.
At any rate, the first link goes in depth into the complexity question in general and should give you a great idea as to how regex parsing works under the hood to boot.

Similar String algorithm

I'm looking for an algorithm, or at least theory of operation on how you would find similar text in two or more different strings...
Much like the question posed here: Algorithm to find articles with similar text, the difference being that my text strings will only ever be a handful of words.
Like say I have a string:
"Into the clear blue sky"
and I'm doing a compare with the following two strings:
"The color is sky blue" and
"In the blue clear sky"
I'm looking for an algorithm that can be used to match the text in the two, and decide on how close they match. In my case, spelling and punctuation are going to be important. I don't want them to affect the ability to discover the real text. In the above example, if the color reference is stored as "'sky-blue'", I want it to still be able to match. However, the 3rd string listed should be a BETTER match than the second, etc.
I'm sure places like Google probably use something similar with the "Did you mean:" feature...
* EDIT *
In talking with a friend, I learned that he worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...
Here's the link to his paper; I hope it is helpful to those reading this question and interested in similar string algorithms.
Levenshtein distance will not completely work, because you want to allow rearrangements. I think your best bet is going to be to find the best rearrangement, with Levenshtein distance as the cost for each word.
Finding the cost of rearrangement is kinda like the pancake sorting problem. So you can permute every combination of words (filtering out exact matches) against every combination of the other string, trying to minimize a combination of permutation distance and Levenshtein distance on each word pair.
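As a quick illustration of the per-word cost in R (just the distance matrix, not the full rearrangement search), adist() gives the Levenshtein distance between every pair of words:
s1 <- strsplit("Into the clear blue sky", " ")[[1]]
s2 <- strsplit("In the blue clear sky", " ")[[1]]
adist(s1, s2) # rows = words of s1, columns = words of s2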
edit:
Now that I have a second I can post a quick example (all 'best' guesses are on inspection and not actually running the algorithms):
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the c_lear blue sky
The color is sky blue | is__ the colo_r blue sky
R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3
L_dist = (2D+S) + (I+D+S) (Total Substitutions: 2, deletions: 3, insertion: 1)
(notice all the flips include all elements in the range, and I use ranges where Xi - Xj = +/- 1)
Other example
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the clear blue sky
In the blue clear sky | In__ the clear blue sky
R_dist = dist( 1 2 4 3 5 ) --> 1 2 *3 4* 5 = 1
L_dist = (2D) (Total Substitutions: 0, deletions: 2, insertion: 0)
And to show all possible combinations of the three...
The color is sky blue | The colo_r is sky blue
In the blue clear sky | the c_lear in sky blue
R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total Substitutions: 2, deletions: 1, insertion: 1)
However you build the cost function, the second choice will have the lowest cost, which is what you expected!
One way to determine a measure of "overall similarity without respect to order" is to use some kind of compression-based distance. Basically, the way most compression algorithms (e.g. gzip) work is to scan along a string looking for string segments that have appeared earlier -- any time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier segment to use. You can use measures of how well two strings compress to detect similarities between them.
Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t:
len(comp(s)) + len(comp(t)) - len(comp(s . t))
where . is taken to be concatenation. The idea is that you are measuring how much further you can compress t by looking at s first. If s == t, then len(comp(s . t)) will be barely any larger than len(comp(s)) and you'll get a high score, while if they are completely different, len(comp(s . t)) will be very near len(comp(s)) + len(comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.
Actually the following formula is even better as it is symmetric (i.e. the score doesn't change depending on which string is s and which is t):
2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))
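Here is a rough sketch of that symmetric score in R, using gzip via memCompress as the compressor (the helper names are just illustrative). Note that for strings this short the gzip header overhead dominates, so the idea pays off mainly on longer texts:
comp_len <- function(s) length(memCompress(charToRaw(s), type = "gzip")) # compressed size in bytes
sim_score <- function(s, t) {
  2 * (comp_len(s) + comp_len(t)) - comp_len(paste0(s, t)) - comp_len(paste0(t, s))
}
sim_score("Into the clear blue sky", "In the blue clear sky")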
This technique has its roots in information theory.
Advantages: good compression algorithms are already available, so you don't need to do much coding, and they run in linear time (or nearly so) so they're fast. By contrast, solutions involving all permutations of words grow super-exponentially in the number of words (although admittedly that may not be a problem in your case as you say you know there will only be a handful of words).
One way (although this is perhaps better suited to a spellcheck-type algorithm) is the "edit distance", i.e., calculate how many edits it takes to transform one string into another. A common technique is found here:
http://en.wikipedia.org/wiki/Levenshtein_distance
You might want to look into the algorithms used by biologists to compare DNA sequences, since they have to cope with many of the same things (chunks may be missing, or have been inserted, or just moved to a different position in the string).
The Smith-Waterman algorithm would be one example that'd probably work fairly well, although it might be too slow for your uses. Might give you a starting point, though.
I had a similar problem: I needed to get the percentage of characters in a string that were similar. It needed exact sequences, so for example "hello sir" and "sir hello", when compared, needed to give me five characters that are the same; in this case they would be the two "hello"s. It would then take the length of the longest of the two strings and give me a percentage of how similar they were. This is the code that I came up with:
#include <string>
using std::string;

// Finds the longest run of consecutive matching characters between a and b
// and returns it as a percentage of a's length (a is assumed to be the longer string).
int bigger(const string &a, const string &b){
    size_t maxcount = 0, currentcount = 0; // track the longest run of consecutive matches
    for(size_t i = 0; i < a.size(); ++i){ // try each alignment of b against a
        currentcount = 0;
        for(size_t j = 0; j < b.size() && i + j < a.size(); ++j){
            if(a[i + j] == b[j]){
                ++currentcount;
            }
            else{
                if(currentcount > maxcount){
                    maxcount = currentcount;
                }
                currentcount = 0;
            }
        }
        if(currentcount > maxcount){ // the run may end exactly at the end of the inner loop
            maxcount = currentcount;
        }
    }
    return (int)(((float)maxcount / (float)a.size()) * 100);
}

int compare(const string &a, const string &b){ // dispatch so the longer string is always first
    return (a.size() > b.size() ? bigger(a, b) : bigger(b, a));
}
I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But it is worth mentioning j_random_hacker's answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution.
In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein distance algorithm will work. But not mentioned in the question was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well, but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question. I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.
There's another way: pattern recognition using convolution. Image A is run through a Fourier transform. Image B also. Now superimposing F(A) over F(B) and then transforming this back gives you a black image with a few white spots. Those spots indicate where A matches B strongly. The total sum of the spots would indicate an overall similarity. Not sure how you'd run an FFT on strings, but I'm pretty sure it would work.
The difficulty would be to match the strings semantically.
You could generate some kind of value based on the lexical properties of the string, e.g. they both have "blue" and "sky", and they're in the same sentence, etc. etc. But it won't handle cases like "Sky's jean is blue", or some other oddball English construction that uses the same words; for those you'd need to parse the English grammar...
To do anything beyond lexical similarity, you'd need to look at natural language processing, and there isn't going to be one single algorithm that will solve your problem.
Possible approach:
Construct a Dictionary with a string key of "word1|word2" for all combinations of words in the reference string. A single combination may happen multiple times, so the value of the Dictionary should be a list of numbers, each representing the distance between the words in the reference string.
When you do this, there will be duplication: for every "word1|word2" dictionary entry, there will be a "word2|word1" entry with the same list of distance values, but negated.
For each combination of words in the comparison string (words 1 and 2, words 1 and 3, words 2 and 3, etc.), check the two keys (word1|word2 and word2|word1) in the reference string and find the closest value to the distance in the current string. Add the absolute value of the difference between the current distance and the closest distance to a counter.
If the closest reference distance between the words is in the opposite direction (word2|word1) from the comparison string, you may want to weight it less than if the closest value was in the same direction in both strings.
When you are finished, divide the sum by the square of the number of words in the comparison string.
This should provide some decimal value representing how closely each word/phrase matches some word/phrase in the original string.
Of course, if the original string is longer, it won't account for that, so it may be necessary to compute this in both directions (using one as the reference, then the other) and average them.
I have absolutely no code for this, and I probably just re-invented a very crude wheel. YMMV.
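Since there is no code above, here is one possible reading of the idea as a rough sketch in R; it only checks the same-direction key and skips the opposite-direction weighting, and all names are made up for illustration:
# build the "word1|word2" lookup of signed word distances for the reference string
word_pair_distances <- function(s) {
  words <- strsplit(tolower(s), "\\s+")[[1]]
  d <- list()
  n <- length(words)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i == j) next
      key <- paste(words[i], words[j], sep = "|")
      d[[key]] <- c(d[[key]], j - i)   # signed distance; "word2|word1" entries hold the negated values
    }
  }
  d
}

# score a comparison string against the reference lookup
similarity <- function(reference, comparison) {
  ref <- word_pair_distances(reference)
  words <- strsplit(tolower(comparison), "\\s+")[[1]]
  n <- length(words)
  total <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i >= j) next
      key <- paste(words[i], words[j], sep = "|")
      dists <- ref[[key]]              # distances for this pair in the reference string
      if (is.null(dists)) next         # pair absent from the reference: no contribution here
      total <- total + min(abs(dists - (j - i)))
    }
  }
  total / n^2                          # smaller means the word spacings line up more closely
}

similarity("Into the clear blue sky", "In the blue clear sky")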