Convert character to lowerCamelCase in R - regex

I have a character vector which looks like this:
x <- c("cult", "brother sister relationship", "word title")
And I want to convert it to the lowerCamelCase style looking like this:
c("cult", "brotherSisterRelationship", "wordTitle")
I played around with gsub, gregexpr, strsplit, regmatches and many other functions, but couldn't get it to work.
Especially two spaces in a character seem to be difficult to handle.
Maybe someone here has an idea how to do this.

> x <- c("cult", "brother sister relationship", "word title")
> gsub(" ([^ ])", "\\U\\1", x, perl=TRUE)
[1] "cult" "brotherSisterRelationship"
[3] "wordTitle"
Quoting from the help page "Pattern Matching and Replacement" (?gsub):
For perl = TRUE only, it can also contain "\U" or "\L" to convert the
rest of the replacement to upper or lower case and "\E" to end case
conversion.
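As a small sketch of how these escapes behave (the inputs below are made up for illustration and are not from the question): \\U upper-cases whatever follows it in the replacement until \\E, and using " +" in the pattern is one possible way to cope with runs of several spaces.

```r
# Illustrative inputs with repeated spaces (an assumption, not the original data)
x2 <- c("cult", "brother  sister   relationship", "word title")

# " +" swallows one or more spaces; \\U upper-cases the captured character
camel <- gsub(" +([^ ])", "\\U\\1", x2, perl = TRUE)
camel
# [1] "cult"                      "brotherSisterRelationship" "wordTitle"

# \\E ends the case conversion, so only the first word is upper-cased
shout <- gsub("(\\w+) (\\w+)", "\\U\\1\\E \\2", "hello world", perl = TRUE)
shout
# [1] "HELLO world"
```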

A non-base alternative:
library(R.utils)
toCamelCase(x, capitalize = FALSE)
# [1] "cult" "brotherSisterRelationship" "wordTitle"

Related

R: regular expression lookaround(s) to grab whats between two patterns

I have a vector with strings like:
x <-c('kjsdf_class-X1(z)20_sample-318TT1X.3','kjjwer_class-Z3(z)29_sample-318TT2X.4')
I wanted to use regular expressions to get what is between the substrings 'class-' and '_sample' (such as 'X1(z)20' and 'Z3(z)29' in x), and thought the lookaround regex ((?=...), (?!...), and so on) would do it. I cannot get it to work though!
Sorry if this is similar to other SO questions (e.g. here or here).
This is a bit different than what you had in mind, but it will do the job.
gsub("(.*class-)|(.)|(_sample.*)", "\\2", x)
The logic is the following, you have 3 "sets" of strings:
1) characters .* ending in class-
2) characters .
3) characters starting with _sample and any characters afterwards .*
From those you want to keep the second "set" \\2.
Or another maybe easier to understand:
gsub("(.*class-)|(_sample.*)", "", x)
Take any number of characters that end in class-, and the string _sample followed by any number of characters, and substitute both with the empty string "".
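The same idea can also be written with a single capture group, which some may find easier to read. This is only a sketch using the two strings from the question; note that sub() returns a string unchanged when the pattern does not match at all.

```r
x <- c("kjsdf_class-X1(z)20_sample-318TT1X.3",
       "kjjwer_class-Z3(z)29_sample-318TT2X.4")

# Match the whole string, but keep only the part captured between the markers
between <- sub(".*class-(.*)_sample.*", "\\1", x)
between
# [1] "X1(z)20" "Z3(z)29"
```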
We could use str_extract_all from library(stringr)
library(stringr)
unlist(str_extract_all(x, '(?<=class-)[^_]+(?=_sample)'))
#[1] "X1(z)20" "Z3(z)29"
This should also work if there are multiple instances of the pattern within a string
x1 <- paste(x, x)
str_extract_all(x1, '(?<=class-)[^_]+(?=_sample)')
#[[1]]
#[1] "X1(z)20" "X1(z)20"
#[[2]]
#[1] "Z3(z)29" "Z3(z)29"
Basically, we are matching the characters that sit between the two lookarounds ((?<=class-) and (?=_sample)). We extract characters that are not _ (based on the example), preceded by class- and followed by _sample.
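If you would rather not load stringr, the same lookarounds should also work in base R, as long as perl = TRUE is set. A minimal sketch with the question's data:

```r
x <- c("kjsdf_class-X1(z)20_sample-318TT1X.3",
       "kjjwer_class-Z3(z)29_sample-318TT2X.4")

# regexpr() locates the first match per string; regmatches() extracts it
m <- regexpr("(?<=class-)[^_]+(?=_sample)", x, perl = TRUE)
res <- regmatches(x, m)
res
# [1] "X1(z)20" "Z3(z)29"
```

Use gregexpr() instead of regexpr() if a string may contain more than one match.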
gsub('.*-([^-]+)_.*','\\1',x)
[1] "X1(z)20" "Z3(z)29"

Combining regex with a literal string

I have the following code:
library(stringr)
input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
This outputs:
"I2-I3" "I3-I1" "I3-I2" "I2-I1-I3" "I3" "I2-I3"
However, I only want to extract matches to the regex that are following immediately to a specific string, e.g.:
only match the regex when it's preceded by the literal string FA-I2-I2-I2-EX.
This, for example, would be the first match of the regex, while the second match is preceded by FA-I1-I2-TR-I1-I2-FA.
The expected output is roughly the same as in the regex above, but only selecting one of the 6 matches, because it needs to be preceded by a specific literal string.
How can I modify this regex to achieve this purpose? I assume it needs to use a positive lookbehind to first identify the literal string, then execute the regex.
I don't know if I'm fully understanding what you mean, but it seems you could use positive lookbehind.
For instance:
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt
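To see that lookbehind example in R, here is a quick check in base R (perl = TRUE is needed for lookarounds). The three words are the ones mentioned above.

```r
words <- c("cab", "bed", "debt")

# Only "cab" has a b that is directly preceded by an a
has_match <- grepl("(?<=a)b", words, perl = TRUE)
has_match
# [1]  TRUE FALSE FALSE

# The match itself is just the "b"; the "a" is asserted but not consumed
matched <- regmatches(words, regexpr("(?<=a)b", words, perl = TRUE))
matched
# [1] "b"
```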
There may be something more intuitive, but I think this will do the job:
literal <- "FA-I2-I2-I2-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
a <- lapply(strsplit(innovation_patterns, literal)[[1]], str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
b <- lapply(2:length(a), function(x) {
    a[[x]][[1]][1]
})
print(b)
Use (*SKIP)(*F)
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*'))
Syntax would be like,
partIDontWant.*(*SKIP)(*F)|choose from the string which exists before partIDontWant
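To illustrate the skip/fail idea on something smaller, here is a sketch in base R with perl = TRUE (PCRE). The toy string is an assumption made up for this example: digits that belong to a "foo" token are consumed and discarded, and only the remaining digits are returned.

```r
s <- "foo1 bar2 foo3 baz4"  # toy input, not from the question

# Without the verbs, every digit matches
all_digits <- regmatches(s, gregexpr("\\d", s, perl = TRUE))[[1]]
all_digits
# [1] "1" "2" "3" "4"

# "foo\\d" is matched first, then (*SKIP)(*F) throws it away and
# resumes the search after it, so its digit can never match on its own
kept <- regmatches(s, gregexpr("foo\\d(*SKIP)(*F)|\\d", s, perl = TRUE))[[1]]
kept
# [1] "2" "4"
```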
Here is another way you could go about this.
x <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
substr <- 'FA-I2-I2-I2-EX'
regex <- paste0(substr, '-?((?:I\\d-?)*I3(?:-?I\\d)*)')
gsubfn::strapply(gsub('-1-', '-', x), regex, simplify = c)
## [1] "I2-I3"
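If gsubfn is not available, a similar capture-group approach can be sketched in base R with regexec(), assuming R >= 3.3.0 (where regexec() accepts perl = TRUE). The shortened input below is just the beginning of the question's string after the "-1-" substitution.

```r
# Shortened, already-substituted input (illustrative)
x2 <- "1-FA-I2-I2-I2-EX-I2-I3-FA-I1-I2-TR-I1-I2-FA-I3-I1-FA"

prefix  <- "FA-I2-I2-I2-EX"
pattern <- paste0(prefix, "-?((?:I\\d-?)*I3(?:-?I\\d)*)")

# regexec() returns the full match followed by each capture group
m <- regmatches(x2, regexec(pattern, x2, perl = TRUE))
captured <- m[[1]][2]
captured
# [1] "I2-I3"
```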
Here's how to implement it:
lapply(innovation_patterns, str_extract_all, '(?<=FA-I2-I2-I2-EX-?)(?:I\\d-?)*I3(?:-?I\\d)*');
## [[1]]
## [[1]][[1]]
## [1] "I2-I3"

Removing character from regexp class in R

Edit: Changing the whole question to make it clearer.
Can I remove a single character from one of the regular expression classes in R (such as [:alnum:])?
For example, match all punctuation ([:punct:]) except the _ character.
I am trying to replace underscores used in markdown for italicizing, but the italicized substring may contain a single underscore which I would want to keep.
Edit: As another example, I want to capture everything between pairs of underscores (note that one pair contains a single underscore, between 1 and 10, that I want to keep):
This is _a random_ string with _underscores: rate 1_10 please_
You won't believe it, but lazy matching achieved with a mere ? works as expected here:
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
str <- 'This is a _random string with_ a scale of 1_10.'
gsub("_+([[:print:]]+?)_+", "\\1", str)
Result:
[1] "This is a string with some random underscores in it."
[1] "This is a random string with a scale of 1_10."
However, if you want to modify the [[:print:]] class, note that for ASCII it is essentially the [\x20-\x7E] range. The underscore being \x5F, you can easily exclude it from the range and use [\x20-\x5E\x60-\x7E].
str <- 'This is a _string with_ some _random underscores_ in it.'
gsub("_+([\x20-\x5E\x60-\x7E]+)_+", "\\1", str)
Returns
[1] "This is a string with some random underscores in it."
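For the original wording of the question (a POSIX class minus one character), another option is a negative lookahead in front of the class. This is a sketch that assumes perl = TRUE and uses a made-up input string:

```r
s <- "a_b, c.d! e_f?"  # illustrative input

# (?!_) rejects the underscore before [[:punct:]] gets a chance to match it
no_punct <- gsub("(?!_)[[:punct:]]", "", s, perl = TRUE)
no_punct
# [1] "a_b cd e_f"
```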
Similar to @stribizhev:
x <- "This is _a random_ string with _underscores: rate 1_10 please_"
gsub("\\b_(.*?)_\\b", "\\1", x, perl=T)
produces:
[1] "This is a random string with underscores: rate 1_10 please"
Here we use word boundaries and lazy matching. Note that the default regexp engine has issues with lazy repetition and capture groups, so you may want to use perl=T
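To make the difference between greedy and lazy matching concrete, here is a tiny comparison on a made-up string (using perl = TRUE, as suggested):

```r
s <- "_a_ and _b_"  # illustrative input

# Greedy: .* runs to the last underscore, so there is a single big match
greedy <- gsub("_(.*)_", "<\\1>", s, perl = TRUE)
greedy
# [1] "<a_ and _b>"

# Lazy: .*? stops at the first closing underscore, giving two matches
lazy <- gsub("_(.*?)_", "<\\1>", s, perl = TRUE)
lazy
# [1] "<a> and <b>"
```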
gsub('(?<=\\D)\\_(?=\\D|$)','',str,perl=T)

R: lookaround within lookaround

I need to match any 'r' that is preceded by two different vowels. For example, 'our' or 'pear' would match, but 'bar' or 'aar' wouldn't. I managed to match the two different vowels, but I still can't make that the condition (...) of a lookbehind for the ensuing 'r'. Neither (?<=...)r nor ...\\Kr yields any results. Any ideas?
x <- c('([aeiou])(?!\\1)(?=(?1))')
y <- c('our','pear','bar','aar')
y[grepl(paste0(x,collapse=''),y,perl=T)]
## [1] "our" "pear"
These two solutions seem to work:
the why not way:
x <- '(?<=a[eiou]|e[aiou]|i[aeou]|o[aeiu]|u[aeio])r'
y[grepl(x, y, perl=T)]
the \K way:
x <- '([aeiou])(?!\\1)[aeiou]\\Kr'
y[grepl(x, y, perl=T)]
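One way to convince yourself that \K really restricts the match to the 'r' is to replace the match and look at what changes. A small sketch with the question's vector:

```r
y <- c("our", "pear", "bar", "aar")

# Everything before \\K is required but dropped from the match,
# so only the "r" itself gets replaced
marked <- sub("([aeiou])(?!\\1)[aeiou]\\Kr", "R", y, perl = TRUE)
marked
# [1] "ouR"  "peaR" "bar"  "aar"
```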
The why not way variant (may be more efficient because it searches the "r" before):
x <- 'r(?<=a[eiou]r|e[aiou]r|i[aeou]r|o[aeiu]r|u[aeio]r)'
or, to quickly exclude an "r" not preceded by two vowels (without testing the whole alternation):
x <- 'r(?<=[aeiou][aeiou]r)(?<=a[eiou]r|e[aiou]r|i[aeou]r|o[aeiu]r|u[aeio]r)'
As HamZa points out in the comments using skip and fail verbs is one way to do what we want. Basically we tell it to ignore cases where we have two identical vowels followed by "r"
# The following is the beginning of the regex and isn't just R code
# the ([aeiou]) captures the first vowel, the \\1 references what we captured
# so this gives us the same vowel two times in a row
# which we then follow with an "r"
# Then we tell it to skip/fail for this
([aeiou])\\1r(*SKIP)(*FAIL)
Now we told it to skip those cases so now we tell it "or cases where we have two vowels followed by an 'r'" and since we already eliminated the cases where those two vowels are the same this will get us what we want.
|[aeiou]{2}r
Putting it together we end up with
y <- c('our','pear','bar','aar', "aa", "ae", "are", "aeer", "ssseiras")
grep("([aeiou])\\1r(*SKIP)(*FAIL)|[aeiou]{2}r", y, perl = TRUE, value = TRUE)
#[1] "our" "pear" "ssseiras"
Here is a less than elegant solution:
y[grepl("[aeiou]{2}r", y, perl=T) & !grepl("(.)\\1r", y, perl=T)]
Probably has some corner case failures where the first set matches at different location than the second set (will have to think about that), but something to get you started.
Another one through negative lookahead assertion.
> y <- c('our','pear','bar','aar', "aa", "ae", "are", "aeer", "ssseiras")
> grep("(?!(?:aa|ee|ii|oo|uu)r)[aeiou][aeiou]r", y, perl=TRUE, value=TRUE)
[1] "our" "pear" "ssseiras"
> grep("(?!aa|ee|ii|oo|uu)[aeiou][aeiou]r", y, perl=TRUE, value=TRUE)
[1] "our" "pear" "ssseiras"
(?!aa|ee|ii|oo|uu) asserts that the first two characters of the match are not aa, ee, ii, oo or uu. So [aeiou][aeiou] matches any two vowels, but never the same vowel twice, because the lookahead has already ruled that out. That's why the condition comes first. The final r matches the r which follows the vowels.
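The list of doubled vowels can also be replaced by a backreference inside the lookahead. This is only a variant sketch of the same idea, not a different technique: (?!(.)\\1) asserts that the next two characters are not identical.

```r
y <- c("our", "pear", "bar", "aar", "aa", "ae", "are", "aeer", "ssseiras")

# (?!(.)\\1) fails whenever the next two characters are the same
hits <- grep("(?!(.)\\1)[aeiou]{2}r", y, perl = TRUE, value = TRUE)
hits
# [1] "our"      "pear"     "ssseiras"
```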

Separate a sentence into words and endmarks

I want to break a sentence apart into words and end marks (assume all other punctuation has been removed). I've written a working function to break string(s) apart as described but I think the part:
unlist(c(strsplit(x, "[^[:alnum:]'\"]", perl = T), substring(x, nchar(x), nchar(x))))
is a cob job. It could probably be done better without substring, by just splitting on spaces and before the end mark with an "or" (|) of some sort, but I don't know how to achieve this. Any direction would be appreciated.
breaker <- function(string) {
    FUN <- function(x) {
        unlist(c(strsplit(x, "[^[:alnum:]'\"]", perl = T), substring(x,
            nchar(x), nchar(x))))
    }
    lapply(string, FUN)
}
#EXAMPLES
x <- "I'm liking it!"
breaker(x)
y <- c("I'm liking it!", "How much do you like it?", "I'd say it's awesome.")
breaker(y)
Here is a regex pattern that'll do the whole job on its own. It will match (and thus allow strsplit() to split the string) either at a space or right before one of the sentence-ending punctuation marks.
pat <- "[[:space:]]|(?=[.!?])"
The first half of the pattern matches space characters, and any match will cause strsplit() to 'eat up' the matched characters when it splits the string. The second half of the pattern (the part inside of the (?=...)) matches sentence-ending punctuation. It is an example of a 'zero-width positive lookahead assertion' (see ?regexp for details), and as such, will not lead strsplit() to 'eat up' the matching punctuation.
For your example vectors, you don't even need the call to lapply():
breaker <- function(X) {
    strsplit(X, "[[:space:]]|(?=[.!?])", perl=TRUE)
}
x <- "I'm liking it!"
breaker(x)
y <- c("I'm liking it!", "How much do you like it?", "I'd say it's awesome.")
breaker(y)
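An alternative to splitting is to match the tokens directly, which avoids relying on how strsplit() treats zero-width matches. The function name tokenize below is hypothetical, and the sketch assumes (as in the question) that words consist of alphanumerics and quote characters and that end marks are one of . ! ?

```r
# Hypothetical helper: extract words and end marks instead of splitting
tokenize <- function(X) {
    regmatches(X, gregexpr("[[:alnum:]'\"]+|[.!?]", X))
}

y <- c("I'm liking it!", "How much do you like it?", "I'd say it's awesome.")
tokens <- tokenize(y)
tokens[[1]]
# [1] "I'm"    "liking" "it"     "!"
```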
You can also use scan_tokenizer() and MC_tokenizer() from the tm package:
> library(tm)
> ?MC_tokenizer
> MC_tokenizer("what are the number of words in this sentence?")
[1] "what" "are" "the" "number" "of" "words" "in"
[8] "this" "sentence"