Length of location sequence (given as a string) in R - regex

I have a list of geolocation sequences. Each element of my list is of the form :
> "[[1.2,2.2],[-1.12,3.45],[12.311,-1.34],[-12.32,33.333]]"
I would like to be able to get the length of the sequence (4 in the example above). Could you please help me? I've tried to use regular expressions but I couldn't succeed.
Thank you in advance!

As a few folks suggested in the comments, you can count the # of times a specific character appears in your sequences. This assumes the data are well formed and consistent. For example:
library(stringr)
x <- "[[1.2,2.2],[-1.12,3.45],[12.311,-1.34],[-12.32,33.333]]"
str_count(x, "\\[") - 1 #subtract 1 since there are two opening [
yields:
> str_count(x, "\\[") - 1
[1] 4

In case you do not want to load a library
str <- "[[1.2,2.2],[-1.12,3.45],[12.311,-1.34],[-12.32,33.333]]"
nchar(str)-nchar(gsub("\\]", "", str))-1

You can use:
\[[^[]
And then count the matches.
Demo
In R:
> x <- "[[1.2,2.2],[-1.12,3.45],[12.311,-1.34],[-12.32,33.333]]"
> length(gregexpr("\\[[^\\[]",x)[[1]])
[1] 4

Related

how do you subset data frame based on grep [duplicate]

I am having trouble subsetting my data. I want the data subsetted on column x, where the first 3 characters begin G45.
My data frame:
x <- c("G448", "G459", "G479", "G406")
y <- c(1:4)
My.Data <- data.frame (x,y)
I have tried:
subset (My.Data, x=="G45*")
But I am unsure how to use wildcards. I have also tried grep() to find the indicies:
grep ("G45*", My.Data$x)
but it returns all 4 rows, rather than just those beginning G45, probably also as I am unsure how to use wildcards.
It's pretty straightforward using [ to extract:
grep will give you the position in which it matched your search pattern (unless you use value = TRUE).
grep("^G45", My.Data$x)
# [1] 2
Since you're searching within the values of a single column, that actually corresponds to the row index. So, use that with [ (where you would use My.Data[rows, cols] to get specific rows and columns).
My.Data[grep("^G45", My.Data$x), ]
# x y
# 2 G459 2
The help-page for subset shows how you can use grep and grepl with subset if you prefer using this function over [. Here's an example.
subset(My.Data, grepl("^G45", My.Data$x))
# x y
# 2 G459 2
As of R 3.3, there's now also the startsWith function, which you can again use with subset (or with any of the other approaches above). According to the help page for the function, it's considerably faster than using substring or grepl.
subset(My.Data, startsWith(as.character(x), "G45"))
# x y
# 2 G459 2
You may also use the stringr package
library(dplyr)
library(stringr)
My.Data %>% filter(str_detect(x, '^G45'))
You may not use '^' (starts with) in this case, to obtain the results you need

Correct wrongly formatted dates

I have some incorrect dates between good formatted dates, looking something like this:
df <- data.frame(col=c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05"))
How can I convert the incorrect format between the existing correctly formatted dates?
I'm able to remove the first dashes, but also the it requires to remove the last 3 characters -01 or -1. So that the corrected values are:
desired <- c("1.1.11","1.1.12","1.1.13","1.1.14","1.10.10","1.10.11","1.10.12","2010-03-31","2010-04-01","2010-04-05"))
What I'm strangling with is the -01 part, since by removing these, would also remove part of the correct formatted dates.
EDIT: The format is mm.dd.yy
Here is a pretty simple solution using sub ...
sub('^-+([^-]+).+', '\\1', df$col)
# [1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
# [6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
Just remove all the non-word characters present at the start or -01 or -1 present at the end which was not preceded by -+ two digits.
> x <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> gsub("^\\W+|(?<!-\\d{2})-0?1$", "", x, perl=T)
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10"
[6] "1.10.11" "1.10.12" "2010-03-31" "2010-04-01" "2010-04-05"
A simple regexp will solve these kinds of problems pretty well:
> df <- c("--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01","---1.10.12-01","2010-03-31","2010-04-01","2010-04-05")
> df
[1] "--1.1.11-01" "--1.11.12-1" "--1.1.13-01" "--1.1.14-01" "--1.10.10-01" "-1.10.11-01" "---1.10.12-01"
[8] "2010-03-31" "2010-04-01" "2010-04-05"
> df <- sub(".*([0-9]{4}\\-[0-9]{2}\\-[0-9]{2}|[0-9]{1,2}\\.[0-9]{1,2}\\.[0-9]{1,2}).*", "\\1", df)
> df
[1] "1.1.11" "1.11.12" "1.1.13" "1.1.14" "1.10.10" "1.10.11" "1.10.12" "2010-03-31" "2010-04-01"
[10] "2010-04-05"
Note that I made it a character vector instead of data.frame.
The solution itself is just matching one pattern or the other pattern and then dropping the rest by replacing it with the subpattern.
I here observe that if the prefix of a date has an entry as -1 or --1 then only there exists a illegal suffix i.e -01.
You could first take all the values in array.
So you will have an array of "--1.1.11-01","--1.11.12-1","--1.1.13-01","--1.1.14-01","--1.10.10-01","-1.10.11-01"
Now you can check for the prefix if is it -1 or --1. if there exists any such thing then you can mark it as to remove the suffix -01 as well .
According to the input pattern above I feel that the above strategy would work.
Please let me know if the strategy works

Replace repeating character with another repeated character

I would like to replace 3 or more consecutive 0s in a string by consecutive 1s. Example: '1001000001' becomes '1001111111'.
In R, I wrote the following code:
gsub("0{3,}","1",reporting_line_string)
but obviously it replaces the 5 0s by a single 1. How can I get 5 1s ?
Thanks,
You can use gsubfn function, which you can supply a replacement function to replace the content matched by the regex.
require(gsubfn)
gsubfn("0{3,}", function (x) paste(replicate(nchar(x), "1"), collapse=""), input)
You can replace paste(replicate(nchar(x), "1"), collapse="") with stri_dup("1", nchar(x)) if you have stringi package installed.
Or a more concise solution, as G. Grothendieck suggested in the comment:
gsubfn("0{3,}", ~ gsub(".", 1, x), input)
Alternatively, you can use the following regex in Perl mode to replace:
gsub("(?!\\A)\\G0|(?=0{3,})0", "1", input, perl=TRUE)
It is extensible to any number of consecutive 0 by changing the 0{3,} part.
I personally don't endorse the use of this solution, though, since it is less maintainable.
Here's an option that builds on your approach, but makes use of gregexpr and regmatches. There's probably a more DRY way to do this, but it's not coming to my mind right now....
x <- c("1001000001", "120000siw22000100")
x
# [1] "1001000001" "120000siw22000100"
a <- regmatches(x, gregexpr("0{3,}", x))
regmatches(x, gregexpr("0{3,}", x)) <- lapply(a, function(x) gsub("0", "1", x))
x
# [1] "1001111111" "121111siw22111100"
For regex ignorants (like me), try some brute force. Split the string into single characters using strsplit, find consecutive runs of "0" using rle, create a vector of relevant indices (run lengths of "0" > 2) using rep, insert a "1" at the indices, paste to a single string.
x2 <- strsplit(x = "1001000001", split = "")[[1]]
r <- rle(x2 == "0")
idx <- rep(x = r$lengths > 2, times = r$lengths)
x2[idx] <- "1"
paste(x2, collapse = "")
# [1] "1001111111"
0(?=00)|(?<=00)0|(?<=0)0(?=0)
You can try this.Replace by 1.See demo.
http://regex101.com/r/dP9rO4/5

strsplit on first instance [duplicate]

This question already has answers here:
Splitting a string on the first space
(7 answers)
Closed 4 years ago.
I would like to write a strsplit command that grabs the first ")" and splits the string.
For example:
f("12)34)56")
"12" "34)56"
I have read over several other related regex SO questions but I am afraid I am not able to make heads or tails of this. Thank you any assistance.
You could get the same list-type result as you would with strsplit if you used regexpr to get the first match, and then the inverted result of regmatches.
x <- "12)34)56"
regmatches(x, regexpr(")", x), invert = TRUE)
# [[1]]
# [1] "12" "34)56"
Need speed? Then go for stringi functions. See timings e.g. here.
library(stringi)
x <- "12)34)56"
stri_split_fixed(str = x, pattern = ")", n = 2)
It might be safer to identify where the character is and then substring either side of it:
x <- "12)34)56"
spl <- regexpr(")",x)
substring(x,c(1,spl+1),c(spl-1,nchar(x)))
#[1] "12" "34)56"
Another option is to use str_split in the package stringr:
library(stringr)
f <- function(string)
{
unlist(str_split(string,"\\)",n=2))
}
> f("12)34)56")
[1] "12" "34)56"
Replace the first ( with the non-printing character "\01" and then strsplit on that. You can use any character you like in place of "\01" as long as it does not appear.
strsplit(sub(")", "\01", "12)34)56"), "\01")

Regular expressions in R to erase all characters after the first space?

I have data in R that can look like this:
USDZAR Curncy
R157 Govt
SPX Index
In other words, one word, in this case a Bloomberg security identifier, followed by another word, which is the security class, separated by a space. I want to strip out the class and the space to get to:
USDZAR
R157
SPX
What's the most efficient way of doing this in R? Is it regular expressions or must I do something as I would in MS Excel using the mid and find commands? eg in Excel I would say:
=MID(#REF, 1, FIND(" ", #REF, 1)-1)
which means return a substring starting at character 1, and ending at the character number of the first space (less 1 to erase the actual space).
Do I need to do something similar in R (in which case, what is the equivalent), or can regular expressions help here? Thanks.
1) Try this where the regular expression matches a space followed by any sequence of characters and sub replaces that with a string having zero characters:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
sub(" .*", "", x)
## [1] "USDZAR" "R157" "SPX"
2) An alternative if you wanted the two words in separate columns in a data frame is as follows. Here as.is = TRUE makes the columns be character rather than factor.
read.table(text = x, as.is = TRUE)
## V1 V2
## 1 USDZAR Curncy
## 2 R157 Govt
## 3 SPX Index
It's pretty easy with stringr:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
library(stringr)
str_split_fixed(x, " ", n = 2)[, 1]
If you're like me, in that regexp's will always remain an inscrutable, frustrating mystery, this clunkier solution also exists:
x <- c("USDZAR Curncy", "R157 Govt", "SPX Index")
unlist(lapply(strsplit(x," ",fixed=TRUE),"[",1))
The fixed=TRUE isn't strictly necessary, just pointing out that you can do this (simple case) w/out really knowing the first thing about regexp's.
Edited to reflect #Wojciech's comment.
The regex would be to search for:
\x20.*
and replace with an empty string.
If you want to know whether it's faster, just time it.