colsplit to split into individual characters - regex

colsplit in package reshape2 can be used to split character data:
colsplit(c("A_1", "A_2", "A_3"), pattern="_", c("Letter", "Number"))
  Letter Number
1      A      1
2      A      2
3      A      3
In his paper "Reshaping Data with the reshape Package", Hadley Wickham gives an example of using colsplit to split data into individual characters. His example should produce the above from the data c("A1", "A2", "A3"), which he does by omitting the pattern argument. But this throws an error.
The documentation for str_split_fixed which colsplit calls says that setting pattern="" will split into individual characters, but this does not work.
Is there any way to use colsplit so that it splits into individual characters?
This is R 3.1.1 and packages are up to date.

The problem is that you are referring to an article about "reshape" but are using "reshape2". The two are not the same and they don't work the same:
library(reshape)
library(reshape2)
reshape:::colsplit(c("A1", "A2", "A3"), "", c("V1", "V2"))
#   V1 V2
# 1  A  1
# 2  A  2
# 3  A  3
reshape2:::colsplit(c("A1", "A2", "A3"), "", c("V1", "V2"))
#   V1 V2
# 1 NA A1
# 2 NA A2
# 3 NA A3
If you don't have to go the colsplit way, there are other options:
do.call(rbind, strsplit(c("A1", "A2", "A3"), "", fixed = TRUE))
#      [,1] [,2]
# [1,] "A"  "1"
# [2,] "A"  "2"
# [3,] "A"  "3"
Or, a more general approach (for example characters followed by numbers, not necessarily one character each):
do.call(rbind, strsplit(c("A1", "A2", "A3"),
                        split = "(?<=[a-zA-Z])(?=[0-9])",
                        perl = TRUE))
#      [,1] [,2]
# [1,] "A"  "1"
# [2,] "A"  "2"
# [3,] "A"  "3"

Using qdap:
library(qdap)
colSplit(c("A1", "A2", "A3"), "")
##   X1 X2
## 1  A  1
## 2  A  2
## 3  A  3

Capturing parts of string using regular expression in R

I have these strings:
myseq <- c("ALM_GSK_LN_06.ID","AS04_LV_06.ID.png","AS04_SP_06.IP.png")
What I want to do is to capture parts of the sequence
ALM_GSK LN ID
AS04 LV ID
AS04 SP IP
I tried this but failed:
library(stringr)
str_match(myseq, "([A-Z]+)_(LN|LV|SP)_06\\.([A-Z]+)")
Which produces:
     [,1]           [,2]  [,3] [,4]
[1,] "GSK_LN_06.ID" "GSK" "LN" "ID"
[2,] NA             NA    NA   NA
[3,] NA             NA    NA   NA
What's the right way to do it?
You are pretty close. Here is a small adjustment:
str_match(myseq, "(.+)_(LN|LV|SP)_06\\.([A-Z]+)")[, -1]
produces:
     [,1]      [,2] [,3]
[1,] "ALM_GSK" "LN" "ID"
[2,] "AS04"    "LV" "ID"
[3,] "AS04"    "SP" "IP"
Yours doesn't work because your first token matches neither numbers nor underscores, which you need for "AS04" (numbers) and "ALM_GSK" (underscore).
Your regular expression matches the prefix incorrectly because [A-Z]+ only matches letters. To fix this, simply change the first group to something more permissive, such as (.+). Here is another solution:
library(gsubfn)
myseq <- c('ALM_GSK_LN_06.ID', 'AS04_LV_06.ID.png', 'AS04_SP_06.IP.png')
strapply(myseq, '(.+)_([A-Z]+)[^.]+\\.([A-Z]+)', c, simplify = rbind)
#      [,1]      [,2] [,3]
# [1,] "ALM_GSK" "LN" "ID"
# [2,] "AS04"    "LV" "ID"
# [3,] "AS04"    "SP" "IP"
Totally stealing @hwnd's regex, but in a tidyr/dplyr approach:
library(dplyr); library(tidyr)
data_frame(myseq) %>%
  extract(myseq, c('A', 'B', 'C'), '(.+)_([A-Z]+)[^.]+\\.([A-Z]+)')
##         A  B  C
## 1 ALM_GSK LN ID
## 2    AS04 LV ID
## 3    AS04 SP IP

Extract location data using regex in R

R newbie here
I have data that looks something like this:
{'id': 19847005, 'profile_sidebar_fill_color': u'http://pbs.foo.com/profile_background', 'profile_text_color': u'333333', 'followers_count': 1105, 'location': u'San Diego, CA', 'profile_background_color': u'9AE4E8', 'listed_count': 43, '009', 'time_zone': u'Pacific Time (US & Canada)', 'protected': False}
I want to extract the location data from this text: San Diego, CA.
I have been trying to use the stringr package to accomplish this, but can't quite get the regex right to capture the city and state. Sometimes the state will be present, other times not.
location_pattern <- "'location':\su'(\w+)'"
rawdata$location <- str_extract(rawdata$user, location_pattern)
You could try
str_extract_all(str1, perl("(?<=location.: u.)[^']+(?=')"))[[1]]
#[1] "San Diego, CA"
Others have given possible solutions, but not explained what likely went wrong with your attempt.
The str_extract function uses POSIX extended regular expressions that do not understand \w and \s; those are specific to Perl regular expressions. You can use the perl function in the stringr package instead, and it will then recognize the shortcuts; or you can use [[:space:]] in place of \s and [[:alnum:]_] in place of \w, though more likely you will want something like [[:alpha:], ] or [^'].
Also, R's string parser gets a shot at the string before it is passed to the matching function, so you will need \\s and \\w if you use the perl function (or any other regular expression function in R). The first \ escapes the second so that a single \ remains in the string to be interpreted as part of the regular expression.
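To illustrate the escaping point with base R (a minimal sketch; str1 below is a shortened stand-in for the record shown above):
library(utils)  # base R only; no extra packages needed
str1 <- "{'id': 19847005, 'location': u'San Diego, CA', 'protected': False}"
# "\\s" survives R's string parser as \s; perl = TRUE lets the engine understand it
m <- regmatches(str1, regexpr("'location':\\su'[^']+'", str1, perl = TRUE))
sub(".*u'([^']+)'", "\\1", m)
# [1] "San Diego, CA"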
It looks like a json string, but if you're not too concerned about that, then perhaps this would help.
library(stringi)
ss <- stri_split_regex(x, "[{}]|u?'|(, '(009')?)|: ", omit=TRUE)[[1]]
(m <- matrix(ss, ncol = 2, byrow = TRUE))
# [,1] [,2]
# [1,] "id" "19847005"
# [2,] "profile_sidebar_fill_color" "http://pbs.foo.com/profile_background"
# [3,] "profile_text_color" "333333"
# [4,] "followers_count" "1105"
# [5,] "location" "San Diego, CA"
# [6,] "profile_background_color" "9AE4E8"
# [7,] "listed_count" "43"
# [8,] "time_zone" "Pacific Time (US & Canada)"
# [9,] "protected" "False"
So now you have the ID names in the left column and the values on the right. It would probably be simple to reassemble the json from this point if need be.
Also, regarding the json-ness, we can coerce m to a data.frame (or leave it as a matrix), and then use jsonlite::toJSON
library(jsonlite)
json <- toJSON(setNames(as.data.frame(m), c("ID", "Value")))
fromJSON(json)
# ID Value
# 1 id 19847005
# 2 profile_sidebar_fill_color http://pbs.foo.com/profile_background
# 3 profile_text_color 333333
# 4 followers_count 1105
# 5 location San Diego, CA
# 6 profile_background_color 9AE4E8
# 7 listed_count 43
# 8 time_zone Pacific Time (US & Canada)
# 9 protected False

Split column label by number of letters/characters in R

I have a large dataset where all column headers are individual IDS, each 8 characters in length. I would like to split those individual IDs into 2 rows, where the first row of IDs contains the first 7 characters, and the second row contains just the last character.
Current dataset:
ID1: Indiv01A Indiv01B Indiv02A Indiv02B Speci03A Speci03B
Intended dataset:
ID1: Indiv01 Indiv01 Indiv02 Indiv02 Speci03 Speci03
ID2: A B A B A B
I've looked through other posts on splitting data, but they all seem to have a unique character to separate the column name on (i.e., there's a comma or a period separating the two components).
This is the code I'm thinking would work best, but I just can't figure out how to code for "7 characters" as the split point, rather than a comma:
sapply(strsplit(as.character(d$ID), ","), ...)
Any help would be appreciated.
Here's a regular expression for a solution with strsplit. It splits the string between the 7th and the 8th character:
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")
(res <- strsplit(ID1, "(?<=.{7})", perl = TRUE))
# [[1]]
# [1] "Indiv01" "A"
#
# [[2]]
# [1] "Indiv01" "B"
#
# [[3]]
# [1] "Indiv02" "A"
#
# [[4]]
# [1] "Indiv02" "B"
#
# [[5]]
# [1] "Speci03" "A"
#
# [[6]]
# [1] "Speci03" "B"
Now, you can use rbind to create two columns:
do.call(rbind, res)
# [,1] [,2]
# [1,] "Indiv01" "A"
# [2,] "Indiv01" "B"
# [3,] "Indiv02" "A"
# [4,] "Indiv02" "B"
# [5,] "Speci03" "A"
# [6,] "Speci03" "B"
Explanation of the regex pattern:
(?<=.{7})
The (?<=) is a (positive) lookbehind. It matches any position that is preceded by the specified pattern. Here, the pattern is .{7}. The dot (.) matches any character. {7} means 7 times. Hence, the regex matches the position that is preceded by exactly 7 characters.
Here is a gsubfn solution:
library(gsubfn)
strapplyc(ID1, "(.*)(.)", simplify = cbind)
which gives this matrix:
     [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
[1,] "Indiv01" "Indiv01" "Indiv02" "Indiv02" "Speci03" "Speci03"
[2,] "A"       "B"       "A"       "B"       "A"       "B"
or use rbind in place of cbind if you want two columns (rather than two rows).
There are a couple of ways you could go about this.
To extract the final character
First, with substr:
new.vec <- sapply(old.vec, function(x) substr(x, nchar(x), nchar(x)))
or, with sub:
new.vec <- sub('.*(.)', '\\1', old.vec)
where old.vec is the vector of strings that you want to split.
For interest, the latter option uses a regular expression that translates to: "capture (indicated by the surrounding parentheses) the single character (.) that follows zero or more other characters (.*), and replace each match with the captured content (\\1)". For more info, see ?gsub.
The above options allow for varying string lengths. However, if you do always want to split after 7 characters, and the second part of the string always has just a single character, then the following should work:
new.vec <- substr(old.vec, 8, 8)
(Edited to include method for extracting the first part of the string.)
To extract all but the final character
The process is similar.
new.vec <- sapply(old.vec, function(x) substr(x, 1, nchar(x) - 1))
new.vec <- sub('(.*).', '\\1', old.vec)
new.vec <- substr(old.vec, 1, 7)
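Putting those pieces together to get the two-row layout from the question (a sketch using the sample IDs above):
# The sample column IDs from the question
ids <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")
# First 7 characters in one row, the final character in the other
rbind(ID1 = substr(ids, 1, 7), ID2 = substr(ids, 8, 8))
#     [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
# ID1 "Indiv01" "Indiv01" "Indiv02" "Indiv02" "Speci03" "Speci03"
# ID2 "A"       "B"       "A"       "B"       "A"       "B"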

R programming: regexpr next occurrence

I have the following code and results:
> x <- c("ABCDE CDEFG FGHIJ")
> x
[1] "ABCDE CDEFG FGHIJ"
> regexpr("D", x)
[1] 4
attr(,"match.length")
[1] 1
regexpr only returns the first occurrence of "D"; how can I get it to return all occurrences of "D"?
You were so close -- just a couple of lines down from regexpr in the help file...
gregexpr("D", x)
# [[1]]
# [1] 4 8
# attr(,"match.length")
# [1] 1 1
# attr(,"useBytes")
# [1] TRUE
You can also make use of strsplit like this:
which(unlist(strsplit(x, split = "")) == "D")
[1] 4 8
This way you also get an exact match for "D".
I'm sure there will be a pure regex approach. However, stringr::str_locate_all will suffice:
library(stringr)
unique(unlist(str_locate_all(x, 'D')))
## [1] 4 8

R grep: Match one string against multiple patterns

In R, grep usually matches a vector of multiple strings against one regexp.
Q: Is there a way to match a single string against multiple regexps, without looping through each single regexp pattern?
Some background:
I have 7000+ keywords as indicators for several categories. I cannot change that keyword dictionary. The dictionary has the following structure (keywords in column 1; the numbers indicate the categories these keywords belong to):
ab 10 37 41
abbrach* 38
abbreche 39
abbrich* 39
abend* 37
abendessen* 60 63
aber 20 23 45
abermals 37
Concatenating so many keywords with "|" is not feasible (and I wouldn't know which of the keywords generated the hit).
Also, just reversing "patterns" and "strings" does not work, as the patterns contain truncations, which wouldn't work the other way round.
[related question, other programming language]
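One sketch for the truncation problem: the dictionary entries look like glob patterns, and base R's glob2rx converts those to anchored regexes, one per keyword, so each hit can be traced back to the keyword that produced it (kw and words below are made-up stand-ins for the real dictionary and data):
kw <- c("abbrach*", "abend*", "aber")  # made-up dictionary excerpt
pats <- glob2rx(kw)                    # "^abbrach" "^abend" "^aber$"
words <- c("abends", "abbrach", "aber")
matches <- sapply(pats, grepl, words)  # one column per keyword pattern
This still loops over patterns internally via sapply, like the answers below, but it keeps the keyword-to-hit mapping intact.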
What about applying the regexpr function over a vector of keywords?
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate by bird.", "Let's get icecream!")
sapply(keywords, regexpr, strings, ignore.case=TRUE)
     dog cat bird
[1,]  15  -1   -1
[2,]  -1   4   15
[3,]  -1  -1   -1
sapply(keywords, regexpr, strings[1], ignore.case=TRUE)
 dog  cat bird
  15   -1   -1
Values returned are the position of the first character in the match, with -1 meaning no match.
If the position of the match is irrelevant, use grepl instead:
sapply(keywords, grepl, strings, ignore.case=TRUE)
       dog   cat  bird
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE  TRUE
[3,] FALSE FALSE FALSE
Update: This runs relatively quickly on my system, even with a large number of keywords:
# Available on most *nix systems
words <- scan("/usr/share/dict/words", what="")
length(words)
[1] 234936
system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))
   user  system elapsed
  7.495   0.155   7.596
dim(matches)
[1] 3 234936
The re2r package can match multiple patterns (in parallel). A minimal example, using the keywords and strings from above:
# compile patterns
re <- re2r::re2(keywords)
# match strings
re2r::re2_detect(strings, re, parallel = TRUE)
To expand on the other answer: to transform the sapply() output into a useful logical vector, you need a further apply() step.
keywords <- c("dog", "cat", "bird")
strings <- c("Do you have a dog?", "My cat ate by bird.", "Let's get icecream!")
(matches <- sapply(keywords, grepl, strings, ignore.case=TRUE))
#        dog   cat  bird
# [1,]  TRUE FALSE FALSE
# [2,] FALSE  TRUE  TRUE
# [3,] FALSE FALSE FALSE
To know which strings contain any of the keywords (patterns):
apply(matches, 1, any)
# [1]  TRUE  TRUE FALSE
To know which keywords (patterns) were matched in the supplied strings:
apply(matches, 2, any)
#  dog  cat bird
# TRUE TRUE TRUE