Use string comparisons to split a column in R - regex

As far as I can tell from searching, this question hasn't been asked before.
I have a dataframe column called Product. This column contains both the company name and the product model in a single field.
product.df <- data.frame("Product" = c("Company1 123M UG", "Company1 234M-I", "Company2 763-87-U","Company2 777-87", "Company3 Name1 87M", "Company3 Name1 O77M", "Company3 Name1 765-U MP"))
I want to split the company name and the product model number from this single column into two columns. I need a function that can find the words shared between rows and classify them as the company name, with the remaining text as the product model number. As far as I can tell, no two rows have the same model number. So for the data above, I would expect this answer:
new.product.df <- data.frame("CompanyName" = c("Company1", "Company1", "Company2","Company2", "Company3 Name1", "Company3 Name1", "Company3 Name1"), "Model" = c("123M UG", "234M-I", "763-87-U", "777-87", "87M", "O77M", "765-U MP"))
In short, I need a function that can compare two strings and return the contiguous letters they share and the letters that differ.

If you're guaranteed that the first word is always the company name, then simply do a fixed split on the first space with a maximum of 2 output pieces:
require(stringi)
stri_split_fixed(product.df[, 1], ' ', n = 2)
or:
apply(product.df, 2, function(...) { stri_split_fixed(..., ' ', n=2) } )
[1] "Company1" "123M UG"
[1] "Company1" "234M-I"
[1] "Company2" "763-87-U"
[1] "Company2" "777-87"
[1] "Company3" "Name1 87M"
[1] "Company3" "Name1 O77M"
[1] "Company3" "Name1 765-U MP"

Try this
new.product.df <- data.frame(
  company = unlist(lapply(strsplit(as.character(product.df$Product), split = " .[0-9]"),
                          function(x) x[1])),
  name = unlist(lapply(strsplit(as.character(product.df$Product), split = "[1|2] "),
                       function(x) x[2]))
)

According to your data, the separator between the company and the product is the first space character, so the first step is to convert that first space to something else, in this example __; I'll explain why below.
This is your actual data:
Product
1 Company1 123M UG
2 Company1 234M-I
3 Company2 763-87-U
4 Company2 777-87
5 Company3 Name1 87M
6 Company3 Name1 O77M
7 Company3 Name1 765-U MP
This code does the conversion:
product.df$Product <- sub(pattern = " ", replacement = "__",
                          x = product.df$Product, perl = TRUE)
The data should now look like this:
Product
1 Company1__123M UG
2 Company1__234M-I
3 Company2__763-87-U
4 Company2__777-87
5 Company3__Name1 87M
6 Company3__Name1 O77M
7 Company3__Name1 765-U MP
Then use the tidyr library to separate this new column into two:
library("tidyr")
new.product.df <- separate(product.df, Product, c("Company", "Model"), sep = "__")
The reason for converting the first space character to __ is that the remaining text may itself contain spaces (e.g. 123M UG and Name1 87M), which would cause an error later when separating the column; converting the first space avoids that.
Of course it would be better to separate on the first occurrence of the space character directly, but I don't know how, because the separator regex is applied to every occurrence by default; any suggestions are welcome.
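As a possible answer to that open question: separate() also has an extra argument, and extra = "merge" folds everything after the first match into the last column, which amounts to splitting on the first space only, so the __ workaround may not be needed. A sketch, assuming the same sample data:
library(tidyr)
# split on the first space only; extra pieces are merged into Model
new.product.df <- separate(product.df, Product, c("Company", "Model"),
                           sep = " ", extra = "merge")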

Related

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any run of numbers that is not bounded by a character (other than whitespace) or a letter. For example, if the string contained t370 or 6-test I would want those to remain; it's only the numbers that stand alone, surrounded by whitespace, that I want removed.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
id string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to find when numbers are wrapped in a word boundary: regexr(string, "\b[0-9]+\b", ""); in addition to manually adding the white space, " [0-9]+", which only replaces when the pattern occurs in the middle of a string, not at the start. If it's easier to do this without regex expressions that's fine; I was just trying to become more familiar with them.
Following up on the loop suggestion from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
    * add in parts that contain at least one letter
    replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
    replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
+----------------------------------------+
| id string string2 |
|----------------------------------------|
1. | 1 9884 7-test 58 - 489 7-test |
2. | 2 67-tty 783 444 67-tty |
3. | 3 j3782 3hty j3782 3hty |
+----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.

Remove everything except period and numbers from string regex in R

I know there are many questions on stack overflow regarding regex but I cannot accomplish this one easy task with the available help I've seen. Here's my data:
a<-c("Los Angeles, CA","New York, NY", "San Jose, CA")
b<-c("c(34.0522, 118.2437)","c(40.7128, 74.0059)","c(37.3382, 121.8863)")
df<-data.frame(a,b)
df
a b
1 Los Angeles, CA c(34.0522, 118.2437)
2 New York, NY c(40.7128, 74.0059)
3 San Jose, CA c(37.3382, 121.8863)
I would like to remove everything but the numbers and the periods (i.e. remove the "c", "(" and ")"). This is what I've tried thus far:
str_replace(df$b,"[^0-9.]","" )
[1] "(34.0522, 118.2437)" "(40.7128, 74.0059)" "(37.3382, 121.8863)"
str_replace(df$b,"[^\\d\\)]+","" )
[1] "34.0522, 118.2437)" "40.7128, 74.0059)" "37.3382, 121.8863)"
Not sure what's left to try. I would like to end up with the following:
[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
Thanks.
If I understand you correctly, this is what you want:
df$b <- gsub("[^[:digit:]., ]", "", df$b)
or:
df$b <- strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")
> df
a b
1 Los Angeles, CA 34.0522, 118.2437
2 New York, NY 40.7128, 74.0059
3 San Jose, CA 37.3382, 121.8863
or if you want all the "numbers" as a numeric vector:
as.numeric(unlist(strsplit(gsub("[^[:digit:]. ]", "", df$b), " +")))
[1] 34.0522 118.2437 40.7128 74.0059 37.3382 121.8863
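For what it's worth, the str_replace() attempts in the question fail only because str_replace() substitutes the first match; the _all variant should behave like the gsub() calls above. An untested sketch:
library(stringr)
# replace every character that is not a digit, period, comma or space
str_replace_all(as.character(df$b), "[^0-9., ]", "")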
Try this
gsub("[\\c|\\(|\\)]", "",df$b)
#[1] "34.0522, 118.2437" "40.7128, 74.0059" "37.3382, 121.8863"
Not a regular expression solution, but a simple one.
The elements of b are R expressions, so loop over each element, parsing it, then creating the string you want.
vapply(
b,
function(bi)
{
toString(eval(parse(text = bi)))
},
character(1)
)
Here is another option with str_extract_all from stringr. Extract the numeric part using str_extract_all into a list, convert to numeric, rbind the list elements, and cbind the result with the first column of 'df':
library(stringr)
cbind(df[1], do.call(rbind,
lapply(str_extract_all(df$b, "[0-9.]+"), as.numeric)))

From Matlab to R: Capture named fields with regular expressions to a dataframe

I want to capture named fields from a list of strings by using a regular expression. In Matlab I did it this way:
strings = {'sn555 ID_O20-5-684_N52_2_Subt2_01.',...
'sn555 ID_O20-5-984_S52_8_Subt10_11.'};
pattern = ['sn(?<serial_number>.*) ID(_)(?<ID>.*)_(?<Class>[NS])'...
'(?<Sector>.*)_(?<Point>(.*))_[Ss]ubt.*\.'];
ParsedData = regexp(strings,pattern,'names');
The result (converted to a dataset) is:
ParsedData =
serial_number ID Class Sector Point
'555' 'O20-5-684' 'N' '52' '2'
'555' 'O20-5-984' 'S' '52' '8'
Now I want to parse these strings in R and convert the result to a dataframe.
I tried this:
strings <- c("sn555 ID_O20-5-684_N52_2_Subt2_01.",
"sn555 ID_O20-5-984_S52_8_Subt10_11.")
pattern <- paste0('sn(?<serial_number>.*) ID(_)(?<ID>.*)_(?<Class>[NS])',
'(?<Sector>.*)_(?<Point>(.*))_[Ss]ubt.*\\.');
ParsedData <- gregexpr(pattern,strings, perl = TRUE);
ParsedData
Unfortunately, I'm new to regular expressions in R and the output (ParsedData) is confusing to me. What are your suggestions how to convert the strings to a dataset?
In the past I wrote a helper function, regcapturedmatches.R, to extract capture groups from regular expression matches.
You can use it with your data like this:
rr <- regcapturedmatches(strings,ParsedData)
rr
# [[1]]
# serial_number X ID Class Sector Point X.1
# [1,] "555" "_" "O20-5-684" "N" "52" "2" "2"
#
# [[2]]
# serial_number X ID Class Sector Point X.1
# [1,] "555" "_" "O20-5-984" "S" "52" "8" "8"
You get back a list of character matrices with column names. You could turn that into a data.frame with:
do.call(rbind.data.frame, rr)
# serial_number X ID Class Sector Point X.1
# 1 555 _ O20-5-684 N 52 2 2
# 2 555 _ O20-5-984 S 52 8 8
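If you would rather stay in base R, utils::strcapture() (available since R 3.4.0) maps capture groups straight onto a data frame. A rough sketch that drops the named groups in favour of a prototype data frame; it should produce the same five columns, but I haven't run it against your full data:
pattern <- "sn(.*) ID_(.*)_([NS])(.*)_(.*)_[Ss]ubt.*\\."
proto <- data.frame(serial_number = character(), ID = character(),
                    Class = character(), Sector = character(),
                    Point = character())
strcapture(pattern, strings, proto, perl = TRUE)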

Truncate words within each element of a character vector in R

I have a data frame where one column is a character vector and every element in the vector is the full text of a document. I want to truncate the words in each element so that the maximum word length is 5 characters.
For example:
a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
"Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)
head(df)
file text
1 1 Words longer than five characters should be truncated
2 2 Words shorter than five characters should not be modified
And this is what I'm trying to get:
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
I've tried using strsplit() and strtrim() to modify each word (based in part on split vectors of words by every n words (vectors are in a list)):
x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than" "five" "chara" "shoul" "be" "trunc" "Words" "short" "than"
[12] "five" "chara" "shoul" "not" "be" "modif"
But I don't know if that's the right direction, because I ultimately need the words in a data frame associated with the correct row, as shown above.
Is there a way to do this using gsub and regex?
If you're looking to utilize gsub to perform this task:
> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=T)
> df
# file text
# 1 1 Words longe than five chara shoul be trunc
# 2 2 Words short than five chara shoul not be modif
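If the \K trick feels opaque, a capture-group version of the same idea is arguably easier to read; a sketch that should be equivalent for this data:
# keep the first five letters of any longer word and drop the rest
df$text <- gsub("(\\pL{5})\\pL+", "\\1", df$text, perl = TRUE)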
You were on the right track. In order for your idea to work, however, you have to do the split/trim/combine for each row separately. Here's a way to do it. I was very verbose on purpose, to make it clear, but you can obviously use fewer lines.
df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- strtrim(str, 5)
  str <- paste(str, collapse = " ")
  str
})
And the output:
> df
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
The short version is
df$text <- sapply(df$text, function(str) {
  paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")
})
Edit:
I just realized you asked if it is possible to do this using gsub and regex. Even though you don't need those for this, it's still possible, but harder to read:
df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
  str <- paste(str, collapse = " ")
  str
})
The regex matches anything that appears after 5 characters and replaces those with nothing. perl = TRUE is necessary to enable the regex lookbehind ((?<=.{5})).

Extract a number from a string of numbers and text

I have a data.frame in R with a column containing character strings of the form {some letters}-{a number}{a letter}, e.g. x <- 'KFKGDLDSKFDSKJJFDI-4567W'. I want, for instance, to get a column with the numbers, e.g. '4567' for that particular example/row. There's only one number per string, but it can be of any reasonable length. How can I extract the number from each row in the data.frame?
Use regular expressions to extract substrings. Use as.numeric to convert the resulting character string to a number:
string = 'KFKGDLDSKFDSKJJFDI-4567W'
as.numeric(regmatches(string, regexpr('\\d+', string)))
# 4567
You can easily use this to create a new column in your data frame:
data <- data.frame(x = rep(string, 10))  # example data so transform() below is runnable
transform(data, y = as.numeric(regmatches(x, regexpr('\\d+', x))))
# x y
# 1 KFKGDLDSKFDSKJJFDI-4567W 4567
# 2 KFKGDLDSKFDSKJJFDI-4567W 4567
# 3 KFKGDLDSKFDSKJJFDI-4567W 4567
# 4 KFKGDLDSKFDSKJJFDI-4567W 4567
…
Try this one:
gsub("[a-zA-Z]+-([0-9]+)[a-zA-Z]","\\1", "KFKGDLDSKFDSKJJFDI-4567W")