R agrep: how to match with more than 1 substitution

I'm trying to match a string to a vector of strings:
a <- c('abcde', 'abcdf', 'abcdg')
agrep('abcdh', a, max.distance=list(substitutions=1))
# [1] 1 2 3
agrep('abchh', a, max.distance=list(substitutions=2))
# character(0)
I didn't expect the latter result, since substituting two characters in the pattern makes it identical to the vector elements. It does work, however, with all instead of substitutions:
agrep('abchh', a, max.distance=list(all=2))
# [1] 1 2 3
What do I need to change to match with more than 1 substitution allowed? Is substitution just a broken option? Thanks.
Note: this question is essentially the same as this one: https://stat.ethz.ch/pipermail/r-help/2011-June/281731.html, but that was never answered.

I did not realize that the questions were that old. Anyway:
The function needs an appropriate cost. As ping said, you must set the maximum total match cost; in your example:
a <- c('abcde', 'abcdf', 'abcdg')
agrep('abcdh', a, max.distance = list(cost = 1))
[1] 1 2 3
agrep('abchh', a, max.distance = 2)
[1] 1 2 3
Now, if you set only cost, the function can use insertions, deletions and substitutions. If you want to allow only substitutions, then:
agrep('abhhh', a,
      max.distance=list(cost=3, substitutions=3,
                        deletions=0, insertions=0))
[1] 1 2 3
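To make the call above reusable, here is a minimal sketch that wraps it in a helper. The name agrep_subs_only is made up for illustration (it is not part of base R); it simply sets cost and substitutions to the same number and forbids insertions and deletions, as in the call above.

```r
# Hypothetical helper (not part of base R): approximate matching that
# allows at most n substitutions and no insertions or deletions.
agrep_subs_only <- function(pattern, x, n) {
  agrep(pattern, x,
        max.distance = list(cost = n, substitutions = n,
                            insertions = 0, deletions = 0))
}

a <- c('abcde', 'abcdf', 'abcdg')
agrep_subs_only('abchh', a, 2)  # up to two substitutions allowed
agrep_subs_only('abhhh', a, 3)  # up to three substitutions allowed
```

Both calls should return the indices of all three elements, in line with the results shown above.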

Finding the first occurrence of 1-digit number in a list in Raku

I've got a number of lists of various lengths. Each of the lists starts with some numbers which are multiple digits but ends up with a recurring 1-digit number. For instance:
my @d = <751932 512775 64440 59994 9992 3799 423 2 2 2 2>;
my @e = <3750 3177 4536 4545 686 3 3 3>;
I'd like to find the position of the first occurrence of the 1-digit number (7 for @d and 5 for @e) without writing an explicit loop. Ideally a lambda (or any other practical construct) would iterate over the list with a condition such as $_.chars == 1 and, as soon as the condition is fulfilled, stop and return the position. Instead of returning the position, it might as well return the list up to the 1-digit number; changes and improvisations are welcome. How can I do it?
You want the :k modifier on first:
say @d.first( *.chars == 1, :k ) # 7
say @e.first( *.chars == 1, :k ) # 5
See first for more information.
To answer your second part of the question:
say @d[^$_] with @d.first( *.chars == 1, :k );
# (751932 512775 64440 59994 9992 3799 423)
say @e[^$_] with @e.first( *.chars == 1, :k );
# (3750 3177 4536 4545 686)
Make sure to use with, so that the slice is only shown if first actually found an entry.
See with for more information.

Grabbing columns with special characters and upper case letters

I have a data frame and I'm trying to loop through the data frame to identify those columns which contain a special character or which are all capital letters.
I have tried a few things, but nothing where I'm able to catch the column names within the loop.
data = data.frame(one=c(1,3,5,1,3,5,1,3,5,1,3,5), two=c(1,3,5,1,3,5,1,3,5,1,3,5),
                  thr=c("A","B","D","E","F","G","H","I","J","H","I","J"),
                  fou=c("A","B","D","A","B","D","A","B","D","A","B","D"),
                  fiv=c(1,3,5,1,3,5,1,3,5,1,3,5),
                  six=c("A","B","D","E","F","G","H","I","J","H","I","J"),
                  sev=c("A","B","D","A","B","D","A","B","D","A","B","D"),
                  eig=c("A","B","D","A","B","D","A","B","D","A","B","D"),
                  nin=c(1.24,3.52,5.33,1.44,3.11,5.33,1.55,3.66,5.33,1.32,3.54,5.77),
                  ten=c(1:12),
                  ele=rep(1,12),
                  twe=c(1,2,1,2,1,2,1,2,1,2,1,2),
                  thir=c("THiS","THAT34","T(&*(", "!!!","#$#","$Q%J","who","THIS","this","this","this","this"),
                  stringsAsFactors = FALSE)
data
colls <- c()
spec=c("$","%","&")
for( col in names(data) ) {
  if( length(strings[stringr::str_detect(data[,col], spec)]) >= 1 ){
    print("HORRAY")
    colls <- c(collls, col)
  }
  else print ("NOOOOOOOOOO")
}
for( col in names(data) ) {
  if( any(data[,col]) %in% spec ){
    print("HORRAY")
    colls <- c(collls, col)
  }
  else print ("NOOOOOOOOOO")
}
Can anyone shed light on a good way to tackle this problem?
EDIT:
The end goal is a vector of the names of the columns that meet those criteria. Sorry for my poor SO question, but hopefully this helps clarify what I'm trying to do.
I would use grep() to search for the pattern you are interested in. See here.
[:upper:] Matches any upper case letters.
Combining it with anchors (^,$) and match one or more times (+) gives ^[[:upper:]]+$ and should only match entries completely in capitals.
The following would match the special characters in your toy data set (but is not guaranteed to match all special characters in your real data set, e.g. form feeds or carriage returns):
[:punct:] #Matches punctuation - ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
Note that rather than use [:punct:] you could define your special characters manually.
We can try the resultant code on the first row of your data set:
#Using grepl() rather than grep() so that we get a logical vector back.
grepl(x= data[1,], pattern = "^[[:upper:]]+$|[[:punct:]]")
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
This gives us our expected response except for column nine which has the value 1.24. Here the decimal point is being recognised as punctuation and is being flagged as a match.
We can add a "negative lookahead assertion" - (?!\\.) - to remove any periods from consideration, before they are even tested for being punctuation characters. Note we use \\ to escape the period.
grepl(x= data[1,], perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
This returns a better response - it now no longer matches decimal places. NOTE: This might not be what you want as this pattern also won't match any fullstops in character fields. You would need to refine the pattern further.
Rather than using a for loop to repeat this code for every row of your dataframe, I would use apply(), which is more 'R like'.
To do this we wrap the code in a function, which we then call with apply():
myFunction <- function(x){
  matches <- grepl(x= x, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
  #Given a logical vector 'matches', is at least one of the values TRUE? using any()
  return(any(matches))
}
apply(X = data, 1, myFunction)
The 1 above instructs apply() to iterate over rows rather than columns.
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
In your example data set all rows have an entry containing a special character or a string of all capital letters. This is unsurprising as many columns in your example data set are a list of single capital letters.
If you are just interested in which values in the thirteenth column (thir) fit the stated criteria, you can use:
matches <- grepl(x= data$thir, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
matches
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
To subset your dataframe on matching rows:
data[matches,]
  one two thr fou fiv six sev eig  nin ten ele twe  thir
3   5   5   D   D   5   D   D   D 5.33   3   1   1 T(&*(
4   1   1   E   A   1   E   A   A 1.44   4   1   2   !!!
5   3   3   F   B   3   F   B   B 3.11   5   1   1   #$#
6   5   5   G   D   5   G   D   D 5.33   6   1   2  $Q%J
8   3   3   I   B   3   I   B   B 3.66   8   1   2  THIS
To subset your dataframe on non-matching rows:
data[!matches,]
   one two thr fou fiv six sev eig  nin ten ele twe   thir
1    1   1   A   A   1   A   A   A 1.24   1   1   1   THiS
2    3   3   B   B   3   B   B   B 3.52   2   1   2 THAT34
7    1   1   H   A   1   H   A   A 1.55   7   1   1    who
9    5   5   J   D   5   J   D   D 5.33   9   1   1   this
10   1   1   H   A   1   H   A   A 1.32  10   1   2   this
11   3   3   I   B   3   I   B   B 3.54  11   1   1   this
12   5   5   J   D   5   J   D   D 5.77  12   1   2   this
Note that the regular expression used doesn't match THAT34 as it isn't composed wholly of capitalised letters, having the number 34 at the end.
EDIT:
To get the names of the columns that fulfil the criteria in your edit, use myFunction described above with:
colnames(data)[apply(X = data, 2, myFunction)]
[1] "thr" "fou" "six" "sev" "eig" "thir"
The number in apply() changes from 1 to 2 to iterate over columns rather than rows. We pass the output from apply(), a logical vector of matches (TRUE or FALSE), to colnames(data), which returns the matching column names via subsetting.
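As a variation on the column-wise call, here is a minimal sketch that loops over the columns with vapply() instead of apply(), so the data frame is not first converted to a character matrix. The small data frame df and its column names are made up for illustration; the pattern is the one used above.

```r
# Small stand-in for the question's data frame (hypothetical values).
df <- data.frame(
  num  = c(1.24, 3.52, 5.33),
  caps = c("A", "B", "D"),
  mix  = c("THiS", "who", "this"),
  spec = c("T(&*(", "ok", "fine"),
  stringsAsFactors = FALSE
)

# Same pattern as above: all-capitals entries, or any punctuation
# character other than a period.
pattern <- "(?!\\.)(^[[:upper:]]+$|[[:punct:]])"

# One TRUE/FALSE per column: does any value in the column match?
has_match <- vapply(
  df,
  function(col) any(grepl(pattern, as.character(col), perl = TRUE)),
  logical(1)
)

names(df)[has_match]
# [1] "caps" "spec"
```

vapply() also checks that every column yields exactly one logical value, which can catch mistakes in the helper function early.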
I would collapse the data into strings (one string per row):
strings = apply(data, 1, paste, collapse = "")
contains_only_caps = strings == toupper(strings)
strings[contains_only_caps]
# [1] "33BB3BBB3.52 212THAT34" "55DD5DDD5.33 311T(&*(" "11EA1EAA1.44 412!!!" "33FB3FBB3.11 511#$#"
# [5] "55GD5GDD5.33 612$Q%J" "33IB3IBB3.66 812THIS"
# escaping special characters
spec=c("\\$","%","\\&")
contains_spec = stringr::str_detect(strings, pattern = paste(spec, collapse = "|"))
strings[contains_spec]
# [1] "55DD5DDD5.33 311T(&*(" "33FB3FBB3.11 511#$#" "55GD5GDD5.33 612$Q%J"
You could also use which on contains_spec or contains_only_caps to get the corresponding row numbers for the original data frame. I think that using strings rather than row-wise data frame elements will be much faster, as long as you want to search the whole strings rather than certain columns for certain conditions.

pattern matching in R using grepl

I have a dataframe dat like this
  P            pedigree cas
1 M           rs2745406   T
2 M           rs6939431   A
3 M   SNP_DPB1_33156641   G
4 M SNP_DPB1_33156664_G   P
5 M SNP_DPB1_33156664_A   A
6 M SNP_DPB1_33156664_T   A
I want to exclude all rows where the pedigree column starts with SNP_ and ends with either G, C, T, or A (_[GCTA]). In this case, this would be rows 4,5,6.
How can I achieve this in R? I have tried
multisnp <- which(grepl("^SNP_*_[GCTA]$", dat$pedigree)=="TRUE")
new_dat <- dat[-multisnp,]
My multisnp vector is empty, but I can't figure out how to fix it so that it matches the pattern I want. I think it is my wildcard * usage that is wrong.
You can use the following, with .*? (match anything, non-greedily):
multisnp <- which(grepl("^SNP_.*?_[GCTA]$", dat$pedigree))
                              ^^^
You can subset dat like this
new_dat <- dat[!grepl("^SNP_.*_[GCTA]$", dat$pedigree), ]
Regarding the code that you've tried, I'm not sure that grepl("^SNP_*_[GCTA]$") will complete without an error since you aren't passing in an x vector to grepl. See ?grepl for more info.
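For what it's worth, the original pattern finds nothing because * in a regular expression is a quantifier, not a wildcard: _* means "zero or more underscores". A small sketch, with the pedigree values from the question retyped as a plain vector (the variable names are only for illustration), shows the difference:

```r
pedigree <- c("rs2745406", "rs6939431", "SNP_DPB1_33156641",
              "SNP_DPB1_33156664_G", "SNP_DPB1_33156664_A",
              "SNP_DPB1_33156664_T")

# "_*" only allows extra underscores between "SNP" and the final
# "_[GCTA]", so none of the values match.
original <- grepl("^SNP_*_[GCTA]$", pedigree)

# ".*" allows any characters in between.
fixed <- grepl("^SNP_.*_[GCTA]$", pedigree)

original  # FALSE FALSE FALSE FALSE FALSE FALSE
fixed     # FALSE FALSE FALSE  TRUE  TRUE  TRUE

# Keep only the non-matching values
pedigree[!fixed]
```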

How do I count the number of words in a text (string)?

I have this string vector (for example):
str <- c("this is a string current trey",
"feather rtttt",
"tusla",
"laq")
To count the number of words in this vector I used this (as given in "Count the number of words in a string in R?", which is a possible duplicate but has a different issue):
No_words <- sapply(gregexpr("\\W+", str), length) + 1
but it returns
6 2 2 2
The last two strings ("tusla" and "laq") contain only one word each,
so it should return
6 2 1 1
How do I get around this problem?
You can try
sapply(gregexpr("\\S+", str), length)
## [1] 6 2 1 1
Or, as suggested in the comments, you can try
sapply(strsplit(str, "\\s+"), length)
## [1] 6 2 1 1
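As for why the original attempt over-counts: gregexpr() returns a single -1 when there is no match, so a one-word string still has length 1, and adding 1 gives 2. The sketch below illustrates this. The helper count_words is a made-up name, not an existing function; it counts only real match positions, so an empty string gives 0.

```r
str <- c("this is a string current trey", "feather rtttt", "tusla", "laq")

# gregexpr() returns -1 (a vector of length 1) when nothing matches,
# so a string without any separator still has "length" 1.
no_match <- gregexpr("\\W+", "tusla")[[1]]
as.integer(no_match)                        # -1
sapply(gregexpr("\\W+", str), length) + 1   # 6 2 2 2

# Hypothetical helper: count only real matches (positions > 0).
count_words <- function(s) {
  vapply(gregexpr("\\S+", s), function(m) sum(m > 0), integer(1))
}
count_words(str)            # 6 2 1 1
count_words(c("", "  x "))  # 0 1
```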
Use the stringi package and stri_count:
require(stringi)
str <- c(
"this is a string current trey",
"nospaces",
"multiple spaces",
" leadingspaces",
"trailingspaces ",
" leading and trailing ",
"just one space each")
> stri_count(str,regex="\\S+")
[1] 6 1 2 1 1 3 4
Use the wc function from the qdap package.
str <- c("this is a string current trey",
"feather rtttt",
"tusla",
"laq")
library("qdap")
wc(str)
That returns:
[1] 6 2 1 1

Generating a vector of the number of items in each list item

I have a list containing 98 items. But each item contains 0, 1, 2, 3, 4 or 5 character strings.
I know how to get the length of the list and in fact someone has asked the question before and got voted down for presumably asking such an easy question.
But I want a vector that is 98 elements long with each element being an integer from 0 to 5 telling me how many character strings there are in each list item.
I was expecting the following to work but it did not.
lapply(name.of.list,length())
From my question you will see that I do not really know the nomenclature of lists and items. Feel free to straighten me out.
Farrel, I do not exactly follow, as 'item' is not an R type. Maybe you have a list of length 98 where each element is a vector of character strings?
In that case, consider this:
R> fl <- list(A=c("un", "deux"), B=c("one"), C=c("eins", "zwei", "drei"))
R> lapply(fl, function(x) length(x))
$A
[1] 2
$B
[1] 1
$C
[1] 3
R> do.call(rbind, lapply(fl, function(x) length(x)))
[,1]
A 2
B 1
C 3
R>
So there is your vector with one entry per list element, telling you how many strings each element has. Note the final do.call(rbind, someList), needed because lapply gives us a list back.
If, on the other hand, you want to count the length of all the strings at each list position, replace the simple length(x) with a new function counting the characters:
R> lapply(fl, function(x) { sapply(x, function(y) nchar(y)) } )
$A
un deux
2 4
$B
one
3
$C
eins zwei drei
4 4 4
R>
If that is not what you want, maybe you could mock up some example input data?
Edit: In response to your comments, what you wanted is probably:
R> do.call(rbind, lapply(fl, length))
[,1]
A 2
B 1
C 3
R>
Note that I pass in length, the name of the function, and not length(), which would be a call to the function (and an error here, since no argument is supplied). Because that is easy to mix up, I almost always simply wrap an anonymous function around it, as in my first answer.
And yes, this can also be done with just sapply or even some of the **ply functions from the plyr package:
R> sapply(fl, length)
A B C
2 1 3
R> library(plyr)   # laply() comes from the plyr package
R> laply(fl, length)
[1] 2 1 3
R>
All this seems very complicated - there is a function specifically doing what you were asking for:
lengths #note the plural "s"
Using Dirk's sample data:
fl <- list(A=c("un", "deux"), B=c("one"), C=c("eins", "zwei", "drei"))
lengths(fl)
will return a named integer vector:
A B C
2 1 3
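Since the question mentions elements holding zero strings, here is a small sketch showing that lengths() reports 0 for empty or NULL elements too. The list below is made up for illustration; vapply() is shown as an alternative for R versions that predate lengths() (which, as far as I know, arrived in R 3.2.0).

```r
# Made-up list with an empty vector and a NULL element
fl2 <- list(a = character(0), b = c("x", "y"), c = NULL)

lengths(fl2)
# a b c
# 0 2 0

# Type-checked equivalent: every element must yield exactly one integer.
vapply(fl2, length, integer(1))
```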
The code below accepts a list of strings and returns a vector with the number of characters in each:
x = c("vectors", "matrices", "arrays", "factors", "dataframes", "formulas",
      "shingles", "datesandtimes", "connections", "lists")
xl = as.list(x)   # one list element per string
# count the characters of a single string
fnx = function(s){length(unlist(strsplit(s, "")))}
lv = sapply(xl, fnx)