Stata - Extract numbers before characters, create a list

Good morning,
I have a data frame where one of the columns has observations that look like this:
row1: 28316496(15)|28943784(8)|28579919(7)
row2: 29343898(1)
I would like to create a new column that extracts the numbers that are not in parentheses as a list, and then append the lists from all rows into one list containing all of these numbers.
Said differently at the end, I would like to end up with the following list:
28316496;28943784;28579919;29343898
It could also be any other similar object, I am just interested in getting all these numbers and matching them with another dataset.
I have tried using str_extract_all to extract the numbers but I am having trouble understanding the pattern argument. For instance I have tried:
str_extract_all("28316496(15)|28943784(8)", "\d+(\d)")
and
gsub("\s*\(.*", "", "28316496(15)|28943784(8)")
but it is not returning exactly what I want.
Any idea for extracting the numbers outside the brackets and creating one big list out of them?
Thanks a lot!

In base R, we can use gsub to remove the (, followed by the digits and ), and use read.table to read it in a data.frame
read.table(text = gsub("\\(\\d+\\)", "", df1$col1),
header = FALSE, sep = "|", fill = TRUE)
V1 V2 V3
1 28316496 28943784 28579919
2 29343898 NA NA
Or, using str_extract_all, use a regex lookahead
library(stringr)
str_extract_all(df1$col1, "\\d+(?=\\()")
[[1]]
[1] "28316496" "28943784" "28579919"
[[2]]
[1] "29343898"
data
df1 <- structure(list(col1 = c("28316496(15)|28943784(8)|28579919(7)",
"29343898(1)")), class = "data.frame", row.names = c(NA, -2L))

Here is a way.
x <- c("28316496(15)|28943784(8)|28579919(7)", "29343898(1)")
y <- strsplit(x, "\\|")
y <- lapply(y, \(.y) sub("\\([^\\(\\)]+\\)$", "", .y))
y
#> [[1]]
#> [1] "28316496" "28943784" "28579919"
#>
#> [[2]]
#> [1] "29343898"
Created on 2022-09-24 with reprex v2.0.2
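To get from the per-row results above to the single combined list the question asks for, the per-row matches can be flattened with unlist. The following is a minimal base R sketch, assuming every value is a run of digits followed by a parenthesised count; the `other` data frame and its `id` column are hypothetical stand-ins for the second dataset mentioned in the question.

```r
x <- c("28316496(15)|28943784(8)|28579919(7)", "29343898(1)")

# Keep every run of digits that is immediately followed by "(".
per_row <- regmatches(x, gregexpr("[0-9]+(?=\\()", x, perl = TRUE))

# Flatten the per-row list into one character vector.
ids <- unlist(per_row)

# Optional: one semicolon-separated string, as written in the question.
id_string <- paste(ids, collapse = ";")

# Hypothetical second dataset: flag the rows whose id occurs in the list.
other <- data.frame(id = c("28943784", "11111111"), stringsAsFactors = FALSE)
other$matched <- other$id %in% ids
```

Keeping `ids` as a character vector avoids any loss of leading zeros; convert with as.numeric only if the other dataset stores the identifiers as numbers.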

Related

Extract substring in R from string with fixed start position and end point as a character found

I want to do the following extraction in R.
I have a column which has links like these
http://www.imdb.com/title/tt2569314/companycredits
I want to extract the tt2569314 out of this and store it in a new column.
The way I want to do it is, say, take substring of column where start position is LEN(http://www.imdb.com/) and end position is dynamic based on when the first '/' is found after the start position.
I want this to be kind of a mixture of SUBSTR and INSTR in SQL.
Please advise.

You could try this:
a<-"http://www.imdb.com/title/tt2569314/companycredits"
sub("http://www.imdb.com/.+/(.+)/.+","\\1" ,a)
#[1] "tt2569314"

If all the links have a similar path structure, you can use dirname
x <- "http://www.imdb.com/title/tt2569314/companycredits"
sub("(.*)[/]", "", dirname(x))
# [1] "tt2569314"
Or you can paste together a regular expression with the base URL
y <- "http://www.imdb.com"
sub(paste0(y, "[/](.*)[/](.*)[/](.*)"), "\\2", x)
# [1] "tt2569314"
Or you may even be able to get away with this:
basename(dirname(x))
# [1] "tt2569314"
It's a bit more drawn out if you use the substring. But stringr has a couple of helpful functions.
library(stringr)
s1 <- str_locate_all(x, "[/]")[[1]]
s2 <- str_locate(x, "http://www.imdb.com/title")
m <- match(s2[,2]+1, s1[,1])
substr(x, s1[m,1]+1, s1[m+1,1]-1)
# [1] "tt2569314"

You could try:
str1 <- "http://www.imdb.com/title/tt2569314/companycredits"
library(httr)
gsub("^[^/]*\\/|\\/[^/]*", "", parse_url(str1)$path)
#[1] "tt2569314"

You may try this also,
> x <- "http://www.imdb.com/title/tt2569314/companycredits"
> m <- regexpr("^http://www.imdb.com/[^/]*/\\K[^/]+", x, perl=TRUE)
> regmatches(x, m)
[1] "tt2569314"
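The question asks for something closer to SUBSTR plus INSTR in SQL. A rough base R equivalent is sketched below, assuming every link starts with the same fixed prefix; the `prefix` value is an assumption taken from the example URL, not something the question states for all rows.

```r
x <- c("http://www.imdb.com/title/tt2569314/companycredits")
prefix <- "http://www.imdb.com/title/"  # assumed fixed start of every link

# Drop the fixed-length prefix (the SUBSTR part).
rest <- substr(x, nchar(prefix) + 1, nchar(x))

# Position of the first "/" after the prefix (the INSTR part); -1 if none.
slash <- as.integer(regexpr("/", rest, fixed = TRUE))

# Keep everything before that slash, or the whole remainder if there is none.
end <- ifelse(slash > 0, slash - 1, nchar(rest))
id <- substr(rest, 1, end)
```

All of the calls are vectorised, so `x` can be a whole column and `id` can be assigned straight to a new column.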

Replace character string elements by indices efficiently in R

I would like to efficiently replace elements in my character object with other particular elements in particular places (these places are indices which I know as they are results of the gregexpr function).
I would like some foo function that works like:
foo("qwerty", c(1,3,5), c("z", "x", "y"))
giving me:
[1] "zwxryy"
I searched the stringr package's CRAN PDF manual but did not find anything suitable. Thank you in advance for any suggestions.

For example:
xx <- unlist(strsplit("qwerty",""))
xx[c(1,3,5)] <- c("z", "x", "y")
paste0(xx,collapse='')
[1] "zwxryy"

You could also try the one below, if you don't have that many characters to replace
st1 <- "qwerty"
gsub("^.(.).(.).","z\\1x\\2y", st1)
#[1] "zwxryy"

In the stringi package there is the stri_sub function, which works like this:
library(stringi)
a <- "12345"
stri_sub(a, from=c(1,3,5),len=1) <- letters[c(1,3,5)]
a
## [1] "a2345" "12c45" "1234e"
it's almost what you want. Just use this in loop:
a <- "12345"
for(i in c(1,3,5)){
stri_sub(a, from=i,len=1) <- letters[i]
}
a
## [1] "a2c4e"
Be aware that this kind of function is on our TODO list, check:
https://github.com/Rexamine/stringi/issues?state=open
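The strsplit approach above can be wrapped into the kind of foo function the question asks for. This is only a sketch in base R; the name `replace_at` and its argument checks are my own choices, and it assumes a single input string and single-character replacements.

```r
# Replace the characters of `s` at positions `idx` with the entries of `repl`.
replace_at <- function(s, idx, repl) {
  chars <- strsplit(s, "")[[1]]          # split into single characters
  stopifnot(
    length(idx) == length(repl),         # one replacement per position
    all(idx >= 1), all(idx <= length(chars)),
    all(nchar(repl) == 1)                # single-character replacements only
  )
  chars[idx] <- repl                     # vectorised assignment, no loop
  paste(chars, collapse = "")
}

result <- replace_at("qwerty", c(1, 3, 5), c("z", "x", "y"))
```

Because the string is split once and all positions are assigned in one step, the cost does not grow with a per-index loop.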

Subsetting data using regular expressions in R

I want to extract specific information from within a column in a data frame and add it on to a new column in the same data frame. The complication lies in the fact that some rows do not have the information I want to extract (the 6 characters after "UniProt:") at all, while others have multiple occurrences - I want these to be displayed accordingly as this column contains the identifiers in my data frame.
Here's an example; I've copied a few rows of the column Fasta.headers from my data frame:
Row 1:
H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1
Row 2:
C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1
Row 3:
ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1
Row 4:
T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1
I want the output to be:
H2L0A8;Q9TXU2
O44447
O61907;Q76NP4;P34454

Here strapplyc from the gsubfn package extracts the desired strings from x and sapply collapses multiple strings into a single string separated by semicolons:
library(gsubfn)
sapply(strapplyc(x, "UniProt:([^;]*)"), paste, collapse = ";")
giving:
[1] "H2L0A8;Q9TXU2" "O44447" ""
[4] "O61907;Q76NP4;P34454"
where x is:
x <- c("H05C05.1c;CE43771;WBGene00019157;status:Partially_confirmed;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;CE37385;WBGene00019157;status:Partially_confirmed;UniProt:Q9TXU2;protein_id:CCD72188.1",
"C02B10.5;CE16802;WBGene00015330;status:Partially_confirmed;UniProt:O44447;protein_id:CCD61167.1",
"ZK1127.4;CE07643;WBGene00022851;status:Confirmed;protein_id:CCD73716.1",
"T27C4.4a;CE21211;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:O61907;protein_id:CCD74255.1;>T27C4.4b;CE21212;WBGene00003025;locus:lin-40;status:Confirmed;UniProt:Q76NP4;protein_id:CCD74256.1;>T27C4.4d;CE33331;>F54F2.9;CE39158;WBGene00018836;status:Confirmed;UniProt:P34454;protein_id:CCD71243.1")

An alternative using the infrequently used: regmatches<-
regmatches(x,gregexpr("UniProt:.{7}",x),invert=TRUE) <- ""
gsub("UniProt:","",x)
#[1] "H2L0A8;Q9TXU2;"
#[2] "O44447;"
#[3] ""
#[4] "O61907;Q76NP4;P34454;"
You can also get there with lookaheads and lookbehinds specifying perl=TRUE to the regex:
sapply(regmatches(x,gregexpr("(?<=UniProt:).+?(?=;)",x,perl=TRUE)),
paste,collapse=";")
#[1] "H2L0A8;Q9TXU2" "O44447"
#[3] "" "O61907;Q76NP4;P34454"
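Since the question wants the identifiers in a new column of the same data frame, the lookbehind approach can be adapted as sketched below. The headers here are shortened, illustrative versions of the rows in the question, and the column name `UniProt` is an assumption. Using `[^;]+` instead of `.+?(?=;)` also catches an identifier that sits at the very end of a header with no trailing semicolon.

```r
# Shortened, illustrative headers modelled on the rows in the question.
headers <- c(
  "H05C05.1c;UniProt:H2L0A8;protein_id:CCD72193.1;>H05C05.1a;UniProt:Q9TXU2;protein_id:CCD72188.1",
  "ZK1127.4;status:Confirmed;protein_id:CCD73716.1",
  "C02B10.5;status:Partially_confirmed;UniProt:O44447"
)
df <- data.frame(Fasta.headers = headers, stringsAsFactors = FALSE)

# Everything after "UniProt:" up to the next ";" or the end of the string.
hits <- regmatches(
  df$Fasta.headers,
  gregexpr("(?<=UniProt:)[^;]+", df$Fasta.headers, perl = TRUE)
)

# Collapse multiple identifiers per row; use NA when a row has none.
df$UniProt <- vapply(
  hits,
  function(ids) if (length(ids) > 0) paste(ids, collapse = ";") else NA_character_,
  character(1)
)
```

Rows without an identifier get NA rather than an empty string, which makes them easy to drop with is.na if they should not appear in the output.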

stringr split column by alpha and numeric

I can only use stringr / regular expressions, and I am working in R.
I have downloaded a CSV called mpg2 and taken a subset of it containing only Mercedes-Benz makes. What I am trying to do is split the model into its alphabetic and numeric parts so I can plot them. For example, a Mercedes C300 would need to be split into C and 300, or GLS500 into GLS and 550.
So now I have all of the model numbers, and I want to split them between letters and numbers.
I have tried
mercedes<- subset(mpg2, make=="Mercedes-Benz")
str_split(mercedes$model, "[0:9]")
but this doesn't do what I want it to and I have played with n= and that doesn't work either.
then I have
MB$modelnumber<-as.numeric(gsub("([0-9]+).*$", "\\1", mercedes$model))
Which makes a column of only numbers, I can't get the letters to work.
If I need to upload my specific dataset let me know, I just have to figure out how to do that.
But I need to basically split "XYZ123" into its alpha and numeric parts and put them in 2 separate columns.

Something like this:
x <- "XYZ123"
x <- gsub("([0-9]+)",",\\1",x)
strsplit(x,",")
I've replaced the original group of digits with a comma followed by the same digits, so that I can easily split on the comma.

You can use something like this:
SplitMe <- function(string, alphaFirst = TRUE) {
Pattern <- ifelse(isTRUE(alphaFirst), "(?<=[a-zA-Z])(?=[0-9])", "(?<=[0-9])(?=[a-zA-Z])")
strsplit(string, split = Pattern, perl = T)
}
String <- c("C300", "GLS500", "XYZ123")
SplitMe(String)
# [[1]]
# [1] "C" "300"
#
# [[2]]
# [1] "GLS" "500"
#
# [[3]]
# [1] "XYZ" "123"
To get the output as a two column matrix, just use do.call(rbind, ...):
do.call(rbind, SplitMe(String))
# [,1] [,2]
# [1,] "C" "300"
# [2,] "GLS" "500"
# [3,] "XYZ" "123"
The above is just a convenience function that I have saved for the following scenarios:
strsplit(String, split = "(?<=[a-zA-Z])(?=[0-9])", perl = T)
and
strsplit(String, split = "(?<=[0-9])(?=[a-zA-Z])", perl = T)
This function won't change a GLS500 into a GLS550 though.
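To put the two parts into separate columns, as the question asks, capture groups can be used instead of splitting. The sketch below uses base R's regexec and assumes each model is letters followed by digits and nothing else; the column names `letters` and `number` are illustrative.

```r
model <- c("C300", "GLS500", "XYZ123")

# Group 1: leading letters. Group 2: trailing digits.
pattern <- "^([A-Za-z]+)([0-9]+)$"
parts <- regmatches(model, regexec(pattern, model))

# Each match is c(full, letters, digits); pad non-matching models with NA.
parts <- lapply(parts, function(p) {
  if (length(p) == 3) p[2:3] else c(NA_character_, NA_character_)
})
mat <- do.call(rbind, parts)

split_df <- data.frame(
  model   = model,
  letters = mat[, 1],
  number  = as.numeric(mat[, 2]),
  stringsAsFactors = FALSE
)
```

Models that do not fit the pattern end up with NA in both new columns instead of shifting the rows out of alignment.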

Extract info inside all parenthesis in R

I have a character string and want to extract the information inside multiple parentheses. Currently I can extract the information from the last parenthesis with the code below. How can I extract the contents of all the parentheses and return them as a vector?
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"
sub("\\).*", "", sub(".*\\(", "", j))
Current output is:
[1] "Laugh"
Desired output is:
[1] "wonder" "groan" "Laugh"

Here is an example:
> gsub("[\\(\\)]", "", regmatches(j, gregexpr("\\(.*?\\)", j))[[1]])
[1] "wonder" "groan" "Laugh"
I think this should work well:
> regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl=T))[[1]]
[1] "(wonder)" "(groan)" "(Laugh)"
but the result includes the parentheses... why?
This works:
regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl=T))[[1]]
Thanks @MartinMorgan for the comment.
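On the "why?" above: lookarounds are zero-width, so a lookahead placed at the start only checks that the next character is "(" and then lets .*? consume it, while a lookbehind at the end only checks that the previous character was ")", which has already been consumed. Swapping them keeps both parentheses outside the match. A small base R comparison, using the string from the question:

```r
j <- "What kind of cheese isn't your cheese? (wonder) Nacho cheese! (groan) (Laugh)"

# Lookahead first, lookbehind last: both parentheses end up inside the match.
with_parens <- regmatches(j, gregexpr("(?=\\().*?(?<=\\))", j, perl = TRUE))[[1]]

# Lookbehind first, lookahead last: only the text between them is matched.
inside <- regmatches(j, gregexpr("(?<=\\().*?(?=\\))", j, perl = TRUE))[[1]]
```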

Using the stringr package we can reduce this a little bit.
library(stringr)
# Get the parentheses and what is inside
k <- str_extract_all(j, "\\([^()]+\\)")[[1]]
# Remove the parentheses
k <- substring(k, 2, nchar(k)-1)
@kohske uses regmatches, but I'm currently using R 2.13 so I don't have access to that function at the moment. This adds a dependency on stringr, but I think it is a little easier to work with and the code is a little clearer (well... as clear as regular expressions can be...)
Edit: We could also try something like this -
re <- "\\(([^()]+)\\)"
gsub(re, "\\1", str_extract_all(j, re)[[1]])
This one works by defining a marked subexpression inside the regular expression. It extracts everything that matches the regex and then gsub extracts only the portion inside the subexpression.

I think there are basically three easy ways of extracting multiple capture groups in R (without using substitution): str_match_all, str_extract_all, and the regmatches/gregexpr combo.
I like @kohske's regex, which looks behind for an opening parenthesis ?<=\\(, looks ahead for a closing parenthesis ?=\\), and lazily grabs everything in the middle .+?, in other words (?<=\\().+?(?=\\))
Using the same regex:
str_match_all returns the answer as a matrix.
str_match_all(j, "(?<=\\().+?(?=\\))")
[,1]
[1,] "wonder"
[2,] "groan"
[3,] "Laugh"
# Subset the matrix like this....
str_match_all(j, "(?<=\\().+?(?=\\))")[[1]][,1]
[1] "wonder" "groan" "Laugh"
str_extract_all returns the answer as a list.
str_extract_all(j, "(?<=\\().+?(?=\\))")
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
str_extract_all(j, "(?<=\\().+?(?=\\))")[[1]]
[1] "wonder" "groan" "Laugh"
regmatches/gregexpr also returns the answer as a list. Since this is a base R option, some people prefer it. Note the recommended perl = TRUE.
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))
[[1]]
[1] "wonder" "groan" "Laugh"
#Subset the list...
regmatches(j, gregexpr( "(?<=\\().+?(?=\\))", j, perl = T))[[1]]
[1] "wonder" "groan" "Laugh"
Hopefully, the SO community will correct/edit this answer if I've mischaracterized the most popular options.

Using rex may make this type of task a little simpler.
library(rex)
matches <- re_matches(j,
rex(
"(",
capture(name = "text", except_any_of(")")),
")"),
global = TRUE)
matches[[1]]$text
#>[1] "wonder" "groan" "Laugh"
