Fuzzy, but not too fuzzy string matching with agrep - regex

I have a string like this:
text <- c("Car", "Ca-R", "My Car", "I drive cars", "Chars", "CanCan")
I would like to match a pattern so it is only matched once and with max. one substitution/insertion. the result should look like this:
> "Car"
I tried the following to match my pattern only once with max. substitution/insertion etc and get the following:
> agrep("ca?", text, ignore.case = T, max = list(substitutions = 1, insertions = 1, deletions = 1, all = 1), value = T)
[1] "Car" "Ca-R" "My Car" "I drive cars" "CanCan"
Is there a way to exclude the strings which are n-characters longer than my pattern?

An alternative which replaces agrep with adist:
text[which(adist("ca?", text, ignore.case=TRUE) <= 1)]
adist gives the number of insertions/deletions/substitutions required to convert one string to another, so keeping only elements with an adist of equal to or less than one should give you what you want, I think.
This answer is probably less appropriate if you really want to exclude things "n-characters longer" than the pattern (with n being variable), rather than just match whole words (where n is always 1 in your example).

You can use nchar to limit the strings based on their length:
pattern <- "ca?"
matches <- agrep(pattern, text, ignore.case = T, max = list(substitutions = 1, insertions = 1, deletions = 1, all = 1), value = T)
n <- 4
matches[nchar(matches) < n+nchar(pattern)]
# [1] "Car" "Ca-R" "My Car" "CanCan"

Related

Lookbehind in Access VBA

I am trying to search for all occurrences of doubles in a long chunk of text. the text represents the description of multiple defects in a system. The doubles I am looking for are depths that are normally in the text multiple times as "n.nnnW X n.nnnL X n.nnnD". The n.nnn is normally 0.017D (example) but I want to account for 5.567D if that ever comes up.
The problem is that there are also occurrences of the terms within 1.5d of and also a .015dia. Case for the letters in these are also varied some are all caps and some are all lowercase. The text sometimes also has a space between the number and the "d" and sometimes spells out the word "deep" or "depth" like this: 0.017 deep.
I need the values to be extracted as doubles eventually so I can do math on them.
I have the following regexp pattern: [.](?:\d*\.)?\d+(\s?)[dD](?!ia|IA)
This pattern seems to find all the things I need and even eliminates the diameters that are spelled out as n.nnndia or n.nnnDIA. The thing the pattern DOES NOT catch is the within 1.5d of text.
After some light research I noted that in VBA the lookbehind code is NOT supported; and even so, I was never able to get the lookbehind pattern to work anyhow (using Regex101).
Here is my Access VBA code to illustrate how I am doing it. The r EXP(0) value is my pattern above.
Set rs2 = CurrentDb.OpenRecordset("SELECT * FROM [NCR_RawImport-TEXT] WHERE " & skinTermSQL, dbOpenSnapshot)
'rEXP(0) = "[.](?:\d*\.)?\d+(\s?)[dD](?!ia|IA)"
tVAL = (CDbl(Me.cmbGDepth.Value) / 1000)
Do While Not rs2.EOF
found = False
Set regEXP1 = CreateObject("VBScript.RegExp")
regEXP1.IgnoreCase = True
regEXP1.Global = True
For i = 0 To rEXPIndx - 1
regEXP1.Pattern = rEXP(i)
Set Matches = regEXP1.Execute(rs2.NARR_TXT)
For Each Match In Matches
aVAL = CDbl(Trim(Replace(Replace(UCase(Match.Value), " D", ""), "D", ""))) 'convert matched value to a double.
If (aVAL >= tVAL) Then
found = True
End If
Next
Set Matches = Nothing
Next i
Set regEXP1 = Nothing
If (found) Then
strSql = "UPDATE [NCR_FinalData] SET [NCR_FinalData].SRCH = [NCR_FinalData].SRCH & 'G' WHERE [NCR_FinalData].SRCH Not Like '*G*' AND [NCR_FinalData].NC_KEY = '" & rs2.NC_KEY & "';"
Call writeLog("cmdUpdateNCRs: " & strSql)
DoCmd.RunSQL strSql
End If
rs2.MoveNext
Loop
Set rs2 = Nothing

R - Gsub return first match

I want to extract the 12 and the 0 from the test vector. Every time I try it would either give me 120 or 12:0
TestVector <- c("12:0")
gsub("\\b[:numeric:]*",replacement = "\\1", x = TestVector, fixed = F)
What can I use to extract the 12 and the 0. Can we just have one where I just extract the 12 so I can change it to extract the 0. Can we do this exclusively with gsub?
One option, which doesn't involve using explicit regular expressions, would be to use strsplit() and split the timestamp on the colon:
TestVector <- c("12:0")
parts <- unlist(strsplit(TestVector, ":")))
> parts[1]
[1] "12"
> parts[2]
[1] "0"
Try this
gsub("\\b(\\d+):(\\d+)\\b",replacement = "\\1 \\2", x = TestVector, fixed = F)
Regex Breakdown
\\b #Word boundary
(\\d+) #Find all digits before :
: #Match literally colon
(\\d+) #Find all digits after :
\\b #Word boundary
I think there is no named class as [:numeric:] in R till I know, but it has named class [[:digit:]]. You can use it as
gsub("\\b([[:digit:]]+):([[:digit:]]+)\\b",replacement = "\\1 \\2", x = TestVector)
As suggested by rawr, a much simpler and intuitive way to do it would be to just simply replace : with space
gsub(":",replacement = " ", x = TestVector, fixed = F)
This can be done using scan from base R
scan(text=TestVector, sep=":", what=numeric(), quiet=TRUE)
#[1] 12 0
or with str_extract
library(stringr)
str_extract_all(TestVector, "[^:]+")[[1]]

R: Can grep() include more than one pattern?

For instance, in this example, I would like to remove the elements in text that contain http and america.
> text <- c("One word#", "112a httpSentenceamerica", "you and meamerica", "three two one")
Hence, I would use the logical operator, |.
> pattern <- "http|america"
Which works because this is considered to be one pattern.
> grep(pattern, text, invert = TRUE, value = TRUE)
[1] "One word#" "three two one"
What if I have a long list of words that I would like to use in the pattern? How can I do it? I don't think I can keep on using the logical operators a lot of times.
Thank you in advance!
Generally, as #akrun said:
text <- c("One word#", "112a httpSentenceamerica", "you and meamerica", "three two one")
pattern = c("http", "america")
grep(paste(pattern, collapse = "|"), text, invert = TRUE, value = TRUE)
# [1] "One word#" "three two one"
You wrote that your list of words is "long." This solution doesn't scale indefinitely, unsurprisingly:
long_pattern = paste(rep(pattern, 1300), collapse = "|")
nchar(long_pattern)
# [1] 16899
grep(long_pattern, text, invert = TRUE, value = TRUE)
# Error in grep(long_pattern, text, invert = TRUE, value = TRUE) :
But if necessary, you could MapReduce, starting with something along the lines of:
text[Reduce(`&`, Map(function(p) !grepl(p, text), long_pattern))]
# [1] "One word#" "three two one"

Matching specific lengths with regexp in Matlab

String matching question in Matlab.
if i have a matrix
a = ['thehe'];
str = {'the','he'};
match = regexp(a,str);
the output is match =
[1] [1x2 double]
because it found 'he' twice and 'the' once
how can i make it so it looks from left to right of my string a and
only matches 'the' once and 'he' once?
To answer the explicit question, from the documentation for regexp you can specify the once search option:
a = 'thehe';
str = {'the','he'};
match = regexp(a,str, 'once');
Which returns:
match =
[1] [2]
Where match is a 1x2 cell array whose cell value(s) correspond to the first index of the match in a for each cell of str.
I understand from what the ambiguously described details I'v read, that you want the indexes of non-interleaved occurences of the and he, means 1, and 4.
a = ['thehe'];
str = {'the';'[^t]he'};
match = regexp(a,str)
after this print the two results.
a(match{1}:match{1}+2)
ans =
the
and
a(match{2}+1:match{2}+2)
ans =
he
no third occurence !
a(match{3})
??? Index exceeds matrix dimensions.

Pattern matching and replacement in R

I am not familiar at all with regular expressions, and would like to do pattern matching and replacement in R.
I would like to replace the pattern #1, #2 in the vector: original = c("#1", "#2", "#10", "#11") with each value of the vector vec = c(1,2).
The result I am looking for is the following vector: c("1", "2", "#10", "#11")
I am not sure how to do that. I tried doing:
for(i in 1:2) {
pattern = paste("#", i, sep = "")
original = gsub(pattern, vec[i], original, fixed = TRUE)
}
but I get :
#> original
#[1] "1" "2" "10" "11"
instead of: "1" "2" "#10" "#11"
I would appreciate any help I can get! Thank you!
Specify that you are matching the entire string from start (^) to end ($).
Here, I've matched exactly the conditions you are looking at in this example, but I'm guessing you'll need to extend it:
> gsub("^#([1-2])$", "\\1", original)
[1] "1" "2" "#10" "#11"
So, that's basically, "from the start, look for a hash symbol followed by either the exact number one or two. The one or two should be just one digit (that's why we don't use * or + or something) and also ends the string. Oh, and capture that one or two because we want to 'backreference' it."
Another option using gsubfn:
library(gsubfn)
gsubfn("^#([1-2])$", I, original) ## Function substituting
[1] "1" "2" "#10" "#11"
Or if you want to explicitly use the values of your vector , using vec values:
gsubfn("^#[1-2]$", as.list(setNames(vec,c("#1", "#2"))), original)
Or formula notation equivalent to function notation:
gsubfn("^#([1-2])$", ~ x, original) ## formula substituting
Here's a slightly different take that uses zero width negative lookahead assertion (what a mouthful!). This is the (?!...) which matches # at the start of a string as long as it is not followed by whatever is in .... In this case two (or equivalently, more as long as they are contiguous) digits. It replaces them with nothing.
gsub( "^#(?![0-9]{2})" , "" , original , perl = TRUE )
[1] "1" "2" "#10" "#11"