Based on the advice here: Find location of character in string, I tried this:
> gregexpr(pattern ='$',"data.frame.name$variable.name")
[[1]]
[1] 30
attr(,"match.length")
[1] 0
attr(,"useBytes")
[1] TRUE
But it didn't work; note:
> nchar("data.frame.name$variable.name")
[1] 29
How do you find the location of $ in this string?
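The problem is that $ is the end-of-string anchor in a regex, so the unescaped pattern matches the empty position after the last character (hence position 30 with match length 0). Escape it instead: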
> gregexpr(pattern ='\\$',"data.frame.name$variable.name")
[[1]]
[1] 16
attr(,"match.length")
[1] 1
attr(,"useBytes")
[1] TRUE
... which gives the right answer - i.e. 16.
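Two other ways to make the $ literal, as a minimal sketch: fixed = TRUE turns off regex interpretation entirely, and a character class such as [$] removes the anchor meaning. regexpr (rather than gregexpr) is used here only because a single position is wanted.

```r
s <- "data.frame.name$variable.name"

# fixed = TRUE treats the pattern as a plain string, not a regex
pos_fixed <- as.integer(regexpr("$", s, fixed = TRUE))

# inside a character class, $ loses its end-of-string meaning
pos_class <- as.integer(regexpr("[$]", s))

pos_fixed  # 16
pos_class  # 16
```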
Here's one solution using strsplit and which:
> which(strsplit("data.frame.name$variable.name", "")[[1]]=="$")
[1] 16
Hi I have the following data.
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2",
"appple+20gfree",
"BELI HG MSWAT ALA +VAT T 100g BAR WR",
"TOOLAIT CASSE+LSST+SSSRE 40g SAC MDC")
In my second step I remove all whitespace in shopping_list.
require(stringr)
shopping_list_trim <- str_replace_all(shopping_list, fixed(" "), "")
print(shopping_list_trim)
[1] "applesx4" "bagofflour" "bagofsugar"
[4] "milkx2" "appple+20gfree" "BELIHGMSWATALA+VATT100gBARWR"
[7] "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
To extract the strings that do not contain a plus sign, I use the following code.
str_extract(shopping_list_trim, "^[^+]+$")
[1] "applesx4" "bagofflour" "bagofsugar" "milkx2" NA NA NA
I would like help extracting the strings that do contain a plus sign.
I would like the output to be the following:
NA NA NA NA "appple+20gfree"
"BELIHGMSWATALA+VATT100gBARWR" "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
Does anybody have an idea how to extract only the strings that contain a plus sign?
This will do the trick
> str_extract(shopping_list_trim, "^(?=.*\\+)(.+)$")
[1] NA
[2] NA
[3] NA
[4] NA
[5] "appple+20gfree"
[6] "BELIHGMSWATALA+VATT100gBARWR"
[7] "TOOLAITCASSE+LSST+SSSRE40gSACMDC"
Regex Breakdown
^(?=.*\\+) #Lookahead to check that there is at least one plus sign
(.+)$ #Capture the string if the above is true
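If a regex is not strictly required, the same result can probably be had in base R by testing for a literal plus sign and keeping NA elsewhere. This is only a sketch on a shortened copy of the data; has_plus and with_plus are names made up for the example.

```r
# shortened, already-trimmed sample (assumption: same shape as shopping_list_trim)
shopping_list_trim <- c("applesx4", "bagofflour", "appple+20gfree")

# fixed = TRUE makes "+" a literal character instead of a quantifier
has_plus <- grepl("+", shopping_list_trim, fixed = TRUE)

# keep the string where a plus was found, NA otherwise
with_plus <- ifelse(has_plus, shopping_list_trim, NA_character_)
with_plus
# [1] NA               NA               "appple+20gfree"
```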
If you can't/don't want to use look-arounds, try
^.*\+.*$
It matches anything followed by a + followed by anything :)
See it work here at regex101.
Regards
I need to create a sparse text matrix (DTM) for classification. To prepare the text, I first need to eliminate (separate) the POS tags from the text. My guess was to do it as shown below. I'm new to R and don't know how to negate a regex (see NOT! below).
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")
My guess how it could work:
(POSs <- regmatches(text, gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text)))
[[1]]
[1] "/KOUS" "/VVFIN" "./$."
[[2]]
[1] "/VVFIN" "/PTKVZ" ";/$."
[[3]]
[1] "-/TRUNC" "/APPR" "/KON"
[[4]]
[1] "/PIS" "/ADJD" "./$."
[[5]]
[1] "/NN" "!!!/NE"
But I don't know how to negate the expression, like this:
# VVV
(texts <- regmatches(text, NOT!(gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text))))
[[1]]
[1] "wenn" "ausläuft"
[[2]]
[1] "Kommt" "vor"
[[3]]
[1] "Durch" "und"
[[4]]
[1] "man" "zügig"
[[5]]
[1] "empfehlung"
One possibility is to eliminate the tags by searching for POS tags and replacing them with '' (i.e. empty text):
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")
(textlist <- strsplit(gsub('[[:punct:]]*/[[:alpha:][:punct:]]*','', text), " "))
[[1]]
[1] "wenn" "ausläuft"
[[2]]
[1] "Kommt" "vor"
[[3]]
[1] "-RRB" "Durch" "und"
[[4]]
[1] "man" "zügig"
[[5]]
[1] "empfehlung"
With the friendly help of rawr
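For the literal "negate" part of the question: regmatches has an invert argument that returns the pieces between the matches instead of the matches themselves. A minimal sketch, assuming a shortened ASCII-only sample rather than the original vector; the pattern is unchanged, so its limitations (such as the leftover "-RRB") carry over.

```r
# shortened sample in the same word/TAG format (assumption, not the original data)
text <- c("wenn/KOUS kommt/VVFIN", "empfehlung/NN !!!/NE")

m <- gregexpr("[[:punct:]]*/[[:alpha:][:punct:]]*", text)

# invert = TRUE returns the text between the tag matches
pieces <- regmatches(text, m, invert = TRUE)

# the pieces still carry spaces and empty strings, so trim and drop those
words <- lapply(pieces, function(p) {
  p <- trimws(p)
  p[nzchar(p)]
})
words
# [[1]]
# [1] "wenn"  "kommt"
# [[2]]
# [1] "empfehlung"
```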
I have some strings and I'd like to convert each string into a number, so I'd like to use a regular expression. Each string can be one of:
["star"]
["near-star"]
["shared"]
["near-shared"]
["complete"]
["near-complete"]
["null"]
["near-null"]
My problem is that both of these statements are true:
> grepl("star", "[\"near-star\"]")
[1] TRUE
> grepl("near-star", "[\"near-star\"]")
[1] TRUE
and this also applies to the other labels. Any advice on how to write code that matches each label exactly is much appreciated.
best regards,
Simone
Trying to answer what I think might be your real problem (convert each string "to" a number)...
Given data:
> strings = c('["star"]', '["near-star"]', '["shared"]', '["near-shared"]')
> data = sample(strings,20,TRUE)
such that:
> head(data)
[1] "[\"near-star\"]" "[\"star\"]" "[\"near-shared\"]"
[4] "[\"near-shared\"]" "[\"shared\"]" "[\"star\"]"
Simply do:
> dataf=factor(data)
> as.numeric(dataf)
[1] 2 4 1 1 3 4 1 2 2 1 2 3 4 4 3 4 4 1 1 4
the mapping being given by:
> levels(dataf)
[1] "[\"near-shared\"]" "[\"near-star\"]" "[\"shared\"]"
[4] "[\"star\"]"
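Note that factor() sorts the levels alphabetically, so the numbers follow that order. If you would rather fix the mapping yourself, match() against a lookup vector is one option. A small sketch, where labels and its ordering are just an assumption for illustration:

```r
# lookup vector: position in this vector is the code (hypothetical ordering)
labels <- c('["star"]', '["near-star"]', '["shared"]', '["near-shared"]')

data <- c('["near-star"]', '["star"]', '["shared"]', '["near-shared"]')

# match() compares whole strings, so "star" cannot hit "near-star"
codes <- match(data, labels)
codes
# [1] 2 1 3 4
```

Anything not present in labels comes back as NA, which makes unexpected values easy to spot.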
Others have mentioned just using factors or the fixed argument (either of which will work fine for your stated question). But in general if you want to match a string or pattern, but only if it is not preceded by a given string then you can use negative look behind, an extension in Perl regular expressions:
> test <- c('star','near-star')
> grepl('(?<!near-)star', test, perl=TRUE )
[1] TRUE FALSE
The regular expression here says to match the string "star", but only if it is not preceded by the string "near-". The help page ?regexp has details (you need to scroll almost all the way to the bottom).
You can include the square brackets and quotes in your pattern. Furthermore, you can use fixed = TRUE for matching the string as is.
> grepl("[\"star\"]", "[\"near-star\"]", fixed = TRUE)
[1] FALSE
> grepl("[\"star\"]", "[\"star\"]", fixed = TRUE)
[1] TRUE
How can I extract phone numbers from a text file?
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
This is a question that's been answered for other languages (see php and general regex) but doesn't seem to have been tackled on SO for R.
I have searched and found what appears to be possible regexes to find phone numbers (In addition to the regexes from other languages above): http://regexlib.com/Search.aspx?k=phone but have not been able to use gsub within R with these to extract all of these numbers in the example.
Ideally, we'd get something like:
[[1]]
[1] "2-613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "1.575.555.5555"
[[5]]
[1] "7164347566"
This is the best I've been able to do. You have a pretty wide range of formats, including some with spaces, so the regex is pretty general. It just says "look for a string of at least 5 characters made up entirely of digits, periods, parentheses, hyphens or spaces", delimited by a space or the start/end of the string:
library(stringr)
str_extract_all(x, "(^| )[0-9.() -]{5,}( |$)")
Output:
[[1]]
[1] " 2-613-213-4567 " " 5555555555 "
[[2]]
[1] " 613 213 4567"
[[3]]
[1] " (613)2134567"
[[4]]
[1] "1.575.555.5555 "
[[5]]
[1] "7164347566"
The leading/trailing spaces could probably be fixed with some additional complexity, or you could just fix it in post.
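For the "fix it in post" route, here is a rough base R sketch: the same pattern run through gregexpr/regmatches instead of stringr, followed by trimws on each result. The input is a shortened stand-in for x, and raw and phones are made-up names.

```r
# shortened stand-in for x (assumption)
x <- c("call 2-613-213-4567 or 5555555555 now", "7164347566")

pattern <- "(^| )[0-9.() -]{5,}( |$)"

# extract all matches per element; they still include the delimiting spaces
raw <- regmatches(x, gregexpr(pattern, x))

# strip leading/trailing whitespace from every match
phones <- lapply(raw, trimws)
phones
# [[1]]
# [1] "2-613-213-4567" "5555555555"
# [[2]]
# [1] "7164347566"
```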
Update: a bit of searching led me to this answer, which I slightly modified to allow periods. It is a bit stricter in terms of requiring a valid (US?) phone number, but seems to cover all your examples:
str_extract_all(x, "\\(?\\d{3}\\)?[.-]? *\\d{3}[.-]? *[.-]?\\d{4}")
Output:
[[1]]
[1] "613-213-4567" "5555555555"
[[2]]
[1] "613 213 4567"
[[3]]
[1] "(613)2134567"
[[4]]
[1] "575.555.5555"
[[5]]
[1] "7164347566"
The monstrosity found here also works once you take out the ^ and $ at either end. Use only if you really need it:
huge_regex = "(?:(?:\\+?1\\s*(?:[.-]\\s*)?)?(?:\\(\\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\\s*\\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\\s*(?:[.-]\\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\\s*(?:[.-]\\s*)?([0-9]{4})(?:\\s*(?:#|x\\.?|ext\\.?|extension)\\s*(\\d+))?"
The qdapRegex package now has ex_phone (the extraction counterpart of rm_phone), designed specifically for this task:
x <- c(" Mr. Bean bought 2 tickets 2-613-213-4567 or 5555555555 call either one",
"43 Butter Rd, Brossard QC K0A 3P0 – 613 213 4567",
"Please contact Mr. Bean (613)2134567",
"1.575.555.5555 is his #1 number",
"7164347566"
)
library(qdapRegex)
ex_phone(x)
## [[1]]
## [1] "613-213-4567" "5555555555"
##
## [[2]]
## [1] "613 213 4567"
##
## [[3]]
## [1] "(613)2134567"
##
## [[4]]
## [1] "1.575.555.5555"
##
## [[5]]
## [1] "7164347566"
You would need a complex regex to cover all the rules for matching phone numbers, but this one covers your examples:
> library(stringi)
> unlist(stri_extract_all_regex(x, '(\\d[.-])?\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}\\b'))
# [1] "2-613-213-4567" "5555555555" "613 213 4567" "(613)2134567"
# [5] "1.575.555.5555" "7164347566"
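Once extracted, the same number can still show up in several formats. If you need to compare or count them, one possible follow-up is to strip everything except digits. This is just a sketch on hand-typed values; nums and digits are names invented for the example.

```r
# a few extracted numbers in different formats (hand-typed for illustration)
nums <- c("2-613-213-4567", "(613)2134567", "613 213 4567", "1.575.555.5555")

# drop every character that is not a digit
digits <- gsub("[^0-9]", "", nums)
digits
# [1] "26132134567" "6132134567"  "6132134567"  "15755555555"

# digit counts show which entries carry a leading country/trunk digit
nchar(digits)
# [1] 11 10 10 11
```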
I have the following code and results:
> x <- c("ABCDE CDEFG FGHIJ")
> x
[1] "ABCDE CDEFG FGHIJ"
> regexpr("D", x)
[1] 4
attr(,"match.length")
[1] 1
regexpr only returns the first occurrence of "D". How can I get it to return all occurrences of "D"?
You were so close -- it's just a couple of lines down from regexpr in the help file...
gregexpr("D", x)
# [[1]]
# [1] 4 8
# attr(,"match.length")
# [1] 1 1
# attr(,"useBytes")
# [1] TRUE
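If you only want the positions without the attributes, you can probably just coerce the first list element to integer. A small sketch; fixed = TRUE is my addition since "D" is a plain character, and pos and none are made-up names.

```r
x <- "ABCDE CDEFG FGHIJ"

# gregexpr returns a list with one element per input string;
# as.integer() drops the match.length/useBytes attributes
pos <- as.integer(gregexpr("D", x, fixed = TRUE)[[1]])
pos
# [1] 4 8

# when nothing matches, the result is -1 rather than an empty vector
none <- as.integer(gregexpr("Z", x, fixed = TRUE)[[1]])
none
# [1] -1
```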
You can also make use of strsplit like this:
which(unlist(strsplit(x,split=""))=="D")
[1] 4 8
This way you also get an exact, literal match for D, with no regex interpretation.
I'm sure there will be a pure regex approach. However stringr::str_locate_all will suffice
library(stringr)
unique(unlist(str_locate_all(x, 'D')))
## [1] 4 8