Extract a number from a string of numbers and text - regex

I have a data.frame in R with a column containing character strings of the form {some letters}-{a number}{a letter}, e.g. x <- 'KFKGDLDSKFDSKJJFDI-4567W'. I want to get a column with the numbers, e.g. '4567' for that particular example/row. There's only one number per string, but it can be of any reasonable length. How can I extract the number from each row in the data.frame?

Use regular expressions to extract substrings. Use as.numeric to convert the resulting character string to a number:
string = 'KFKGDLDSKFDSKJJFDI-4567W'
as.numeric(regmatches(string, regexpr('\\d+', string)))
# 4567
You can easily use this to create a new column in your data frame:
#data = data.frame(x = rep(string, 10))
transform(data, y = as.numeric(regmatches(x, regexpr('\\d+', x))))
# x y
# 1 KFKGDLDSKFDSKJJFDI-4567W 4567
# 2 KFKGDLDSKFDSKJJFDI-4567W 4567
# 3 KFKGDLDSKFDSKJJFDI-4567W 4567
# 4 KFKGDLDSKFDSKJJFDI-4567W 4567
…

Try this one:
gsub("[a-zA-Z]+-([0-9]+)[a-zA-Z]","\\1", "KFKGDLDSKFDSKJJFDI-4567W")
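One caveat with the regmatches approach above: regmatches drops elements that have no match, so the result can be shorter than the input and no longer line up with the rows of the data frame. Below is a minimal sketch that keeps one value per row and returns NA where no digits are found. The name extract_first_number and the sample data are assumptions for illustration, not part of the original answers.

```r
# Hypothetical helper (the name is an assumption): first run of digits
# per element, NA where the string contains no digits.
extract_first_number <- function(x) {
  x <- as.character(x)                      # also handles factor columns
  m <- regexpr("[0-9]+", x)                 # start of first digit run, -1 if absent
  hit <- !is.na(m) & m != -1
  out <- rep(NA_real_, length(x))
  out[hit] <- as.numeric(regmatches(x, m))  # regmatches returns matched elements only
  out
}

# Made-up sample data for illustration
df <- data.frame(x = c("KFKGDLDSKFDSKJJFDI-4567W", "NODIGITS", "AB-12C"),
                 stringsAsFactors = FALSE)
df$y <- extract_first_number(df$x)
df$y
# 4567, NA, 12
```

Filling a pre-allocated NA vector keeps the output the same length as the input, which is what transform or direct column assignment needs.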

Related

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any pattern of numbers that are not bounded by a character (other than whitespace), or a letter. For example, if the string contained t370 or 6-test I would want those to remain. It's only when I have numbers next to each other.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to find numbers wrapped in word boundaries, e.g. regexr(string, "\b[0-9]+\b", ""), as well as manually adding the white space, " [0-9]+", which only replaces the pattern when it occurs in the middle of a string, not at the start. If it's easier to do this without regular expressions that's fine; I was just trying to become more familiar with them.
Following up on the loop suggestion from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string) // # words in each string
qui sum N_words
global max_words = r(max) // max # words in all strings
split string, gen(part) parse(" ") // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
* add in parts that contain at least one letter
replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
+----------------------------------------+
| id string string2 |
|----------------------------------------|
1. | 1 9884 7-test 58 - 489 7-test |
2. | 2 67-tty 783 444 67-tty |
3. | 3 j3782 3hty j3782 3hty |
+----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
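For readers working in R rather than Stata, the same idea (split on spaces, keep only the words that contain at least one letter, paste them back together) might look like the sketch below. It is offered for comparison only; keep_words_with_letters is a hypothetical name, and the code is not a line-by-line translation of the Stata loop.

```r
# Hypothetical helper (the name is an assumption): keep only the
# space-separated words that contain at least one letter.
keep_words_with_letters <- function(s) {
  words <- strsplit(s, " +")                  # one character vector per string
  vapply(words,
         function(w) paste(w[grepl("[a-zA-Z]", w)], collapse = " "),
         character(1))
}

string <- c("9884 7-test 58 - 489", "67-tty 783 444", "j3782 3hty")
string2 <- keep_words_with_letters(string)
string2
# "7-test", "67-tty", "j3782 3hty"
```

As with the Stata version, the test inside grepl is the part to adjust if "word to keep" means something other than "contains a letter".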

From Matlab to R: Capture named fields with regular expressions to a dataframe

I want to capture name fields from a list of strings by using a regular expression. In Matlab I did it this way:
strings = {'sn555 ID_O20-5-684_N52_2_Subt2_01.',...
'sn555 ID_O20-5-984_S52_8_Subt10_11.'};
pattern = ['sn(?<serial_number>.*) ID(_)(?<ID>.*)_(?<Class>[NS])'...
'(?<Sector>.*)_(?<Point>(.*))_[Ss]ubt.*\.'];
ParsedData = regexp(strings,pattern,'names');
The result (converted to a dataset) is:
ParsedData =
serial_number ID Class Sector Point
'555' 'O20-5-684' 'N' '52' '2'
'555' 'O20-5-984' 'S' '52' '8'
Now I want to parse these strings in R and convert the result to a dataframe.
I tried this:
strings <- c("sn555 ID_O20-5-684_N52_2_Subt2_01.",
"sn555 ID_O20-5-984_S52_8_Subt10_11.")
pattern <- paste0('sn(?<serial_number>.*) ID(_)(?<ID>.*)_(?<Class>[NS])',
'(?<Sector>.*)_(?<Point>(.*))_[Ss]ubt.*\\.');
ParsedData <- gregexpr(pattern,strings, perl = TRUE);
ParsedData
Unfortunately, I'm new to regular expressions in R and the output (ParsedData) is confusing to me. What are your suggestions how to convert the strings to a dataset?
In the past I wrote a helper function to extract capture groups from regular expressions called regcapturedmatches.R.
You can use it with your data like this:
rr <- regcapturedmatches(strings,ParsedData)
rr
# [[1]]
# serial_number X ID Class Sector Point X.1
# [1,] "555" "_" "O20-5-684" "N" "52" "2" "2"
#
# [[2]]
# serial_number X ID Class Sector Point X.1
# [1,] "555" "_" "O20-5-984" "S" "52" "8" "8"
You get a list back with an array with column names. You could turn that into a data.frame with:
do.call(rbind.data.frame, rr)
# serial_number X ID Class Sector Point X.1
# 1 555 _ O20-5-684 N 52 2 2
# 2 555 _ O20-5-984 S 52 8 8
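If you would rather not depend on an external helper, base R already stores the capture-group positions as attributes on the result of regexpr(..., perl = TRUE). Below is a minimal sketch, assuming each string matches at most once. It uses a variant of the pattern with only named groups: the unnamed (_) group and the nested (.*) are dropped so that every column gets a name.

```r
strings <- c("sn555 ID_O20-5-684_N52_2_Subt2_01.",
             "sn555 ID_O20-5-984_S52_8_Subt10_11.")

# Variant of the pattern with named groups only (an assumption of this sketch)
pattern <- paste0("sn(?<serial_number>.*) ID_(?<ID>.*)_(?<Class>[NS])",
                  "(?<Sector>.*)_(?<Point>.*)_[Ss]ubt.*\\.")

m <- regexpr(pattern, strings, perl = TRUE)
starts <- attr(m, "capture.start")    # matrix: one row per string, one column per group
lens   <- attr(m, "capture.length")

# Cut each capture group out of the original strings.
# A string that does not match yields "" in every column.
fields <- lapply(seq_len(ncol(starts)), function(j) {
  substring(strings, starts[, j], starts[, j] + lens[, j] - 1)
})
names(fields) <- attr(m, "capture.names")

ParsedData <- as.data.frame(fields, stringsAsFactors = FALSE)
ParsedData
```

All columns come back as character; convert with as.numeric or as.integer where needed.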

Difficulty Cleaning a Data Frame Using Regexp in R

I am relatively new to R and having difficulty cleaning up a data frame using regex.
One of the columns of that data frame has strings such as:
NUMERO_APPEL
1 NNA
2 VQ-40989
3 41993
4 41993
5 42597
6 VQ-42597
7 DER8
8 40001-2010
I would like to extract the 5 consecutive digits of the strings that have the following format and only the following format, all other strings will be replaced by NAs.
AO-11111
VQ-11111
11111
** Even though Case 8 contains 5 consecutive digits, it should be replaced by NA as well. Furthermore, a number with more or fewer than 5 digits should also be replaced by NA.
Note that the 5 consecutive digits can be any digits [0-9], but the prefixes 'AO-' and 'VQ-' are fixed (i.e. 'AO ' or 'VE-' should be replaced by NA as well).
This is the code that I currently have:
# Declare a function that extracts the last 'n' characters of a string.
RightSubstring <- function(String, n) {
  substr(String, nchar(String) - n + 1, nchar(String))
}
# Declare a function to remove rows with NAs in a specific column.
ColRemNAs <- function(DataFrame, Column) {
  CompleteVector <- complete.cases(DataFrame, Column)
  return(DataFrame[CompleteVector, ])
}
Contrat$NUMERO_APPEL <- RightSubstring(as.character(Contrat$NUMERO_APPEL), 5)
Contrat$NUMERO_APPEL <- gsub("[^0-9]", NA, Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL <- as.numeric(Contrat$NUMERO_APPEL)
# Remove the rows that contain NA elements.
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_COMMANDE)
Contrat <- ColRemNAs(Contrat, Contrat$NO_FOURNISSEUR)
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_APPEL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_INITIAL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_ACTUEL)
Thanks in advance. Hope my explanations were clear!
Here is a base R solution which will match 5 digits occurring only in the following three forms:
AO-11111
VQ-11111
11111
I use this regular expression to match the five digits:
^((AO|VQ)-)?(\\d{5})$
Strings which match begin with an optional AO- or VQ-, followed by exactly 5 consecutive digits, after which the string must terminate.
The following code replaces every matching string with its 5 digits and stores NA for all non-matching strings.
# A logical index is used so that the NA assignment also works when nothing matches
ind <- grepl("^((AO|VQ)-)?(\\d{5})$", Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL <- gsub("^(((AO|VQ)-)?(\\d{5}))$", "\\4", Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL[!ind] <- NA
For more reading see this SO post.
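To check a pattern of this kind against the cases listed in the question, a small self-contained sketch can help. The vector below copies the question's sample values and adds three made-up edge cases ("AO-12345", "VE-12345", "123456"); the variable names are assumptions for illustration.

```r
# Sample values from the question plus three made-up edge cases
numero <- c("NNA", "VQ-40989", "41993", "41993", "42597", "VQ-42597",
            "DER8", "40001-2010", "AO-12345", "VE-12345", "123456")

# Optional AO- or VQ- prefix, then exactly five digits, then end of string
pat <- "^(?:(?:AO|VQ)-)?([0-9]{5})$"

cleaned <- ifelse(grepl(pat, numero, perl = TRUE),
                  sub(pat, "\\1", numero, perl = TRUE),  # keep only the digits
                  NA_character_)
result <- as.numeric(cleaned)
result
# NA 40989 41993 41993 42597 42597 NA NA 12345 NA NA
```

Anchoring with ^ and $ is what rejects "40001-2010" and "123456": both contain five consecutive digits, but neither consists only of an optional prefix plus exactly five digits.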
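# Alternative using dplyr and stringi. Note that this pattern extracts the first run
# of 5 digits anywhere in the string, so values such as "40001-2010" are not set to NA.
library(dplyr)
library(stringi)
df %>%
  mutate(NUMERO_APPEL.fix =
           NUMERO_APPEL %>%
           stri_extract_first_regex("[0-9]{5}") %>%
           as.numeric)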

how to replace nth character of a string in a column in r

My input is
a<-c("aa_bbb_cc_ddd","ee_fff_gg_hhh")
b<-c("a","b")
df<-data.frame(cbind(a,b))
I want my output to be
a<-c("aa_bbb-cc_ddd","ee_fff-gg_hhh")
b<-c("a","b")
df<-data.frame(cbind(a,b))
please help
If things are as consistent as you show and you want to replace the 7th character, then substring may be a good way to go. However, by wrapping with data.frame without stringsAsFactors = FALSE you made the column a factor (the default behavior before R 4.0.0), so you'd need to convert the column to character first:
df$a <- as.character(df$a)
substring(df$a, 7, 7) <- "-"
df
## a b
## 1 aa_bbb-cc_ddd a
## 2 ee_fff-gg_hhh b
You may use sub,
sub("^([^_]*_[^_]*)_", "\\1-",df$a)
Example:
> a<-c("aa_bbb_cc_ddd","ee_fff_gg_hhh")
> b<-c("a","b")
> df<-data.frame(cbind(a,b))
> df
a b
1 aa_bbb_cc_ddd a
2 ee_fff_gg_hhh b
> df$a <- sub("^([^_]*_[^_]*)_", "\\1-",df$a)
> df
a b
1 aa_bbb-cc_ddd a
2 ee_fff-gg_hhh b
Here's a general way to replace the nth occurrence of _ with -.
n <- 2
# create regex pattern based on n
pat <- paste0("^((?:.*?_){", n - 1, "}.*?)_")
# [1] "^((?:.*?_){1}.*?)_"
# replace character
sub(pat, "\\1-", df$a, perl = TRUE)
# [1] "aa_bbb-cc_ddd" "ee_fff-gg_hhh"
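The pattern-building step above can be wrapped in a small function so that n is not hard-coded. This is a sketch under stated assumptions: replace_nth is a hypothetical name, and old is assumed to be a single character with no special meaning in a regular expression (such as _).

```r
# Hypothetical helper (the name is an assumption): replace the n-th
# occurrence of `old` with `new` in each element of x.
# Assumes `old` is a single character that is not a regex metacharacter.
replace_nth <- function(x, n, old = "_", new = "-") {
  # Lazily skip n - 1 occurrences of `old`, then match the n-th one
  pat <- paste0("^((?:.*?", old, "){", n - 1, "}.*?)", old)
  sub(pat, paste0("\\1", new), x, perl = TRUE)
}

a <- c("aa_bbb_cc_ddd", "ee_fff_gg_hhh")
replace_nth(a, 2)
# "aa_bbb-cc_ddd" "ee_fff-gg_hhh"
replace_nth(a, 3)
# "aa_bbb_cc-ddd" "ee_fff_gg-hhh"
```

Strings with fewer than n occurrences of old do not match and are returned unchanged.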

Use string comparisons to split a column in R

To the best of my search this question hasn't been asked before.
I have a dataframe column called Product. This column has the company name as well as product model in just one column.
product.df <- data.frame("Product" = c("Company1 123M UG", "Company1 234M-I", "Company2 763-87-U","Company2 777-87", "Company3 Name1 87M", "Company3 Name1 O77M", "Company3 Name1 765-U MP"))
I want to split the company names and product model numbers from this single column into two columns. I need a function that can find similar words between rows and classify them as company names, and the rest of the letters as the product model number. As far as I can tell, no two rows have the same model number. So in the case above, I would get this answer:
new.product.df <- data.frame("CompanyName" = c("Company1", "Company1", "Company2","Company2", "Company3 Name1", "Company3 Name1", "Company3 Name1"), "Model" = c("123M UG", "234M-I", "763-87-U", "777-87", "87M", "O77M", "765-U MP"))
I need a function that can compare two strings and return me similar continuous letters and dissimilar letters.
If you're guaranteed the first word is always a company name, then simply do a fixed split on the first space, limited to at most 2 output pieces:
require(stringi)
stri_split_fixed(product.df[,1], ' ', n=2)
or:
apply(product.df, 2, function(...) { stri_split_fixed(..., ' ', n=2) } )
[1] "Company1" "123M UG"
[1] "Company1" "234M-I"
[1] "Company2" "763-87-U"
[1] "Company2" "777-87"
[1] "Company3" "Name1 87M"
[1] "Company3" "Name1 O77M"
[1] "Company3" "Name1 765-U MP"
Try this
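new.product.df <- data.frame(
  # company: the text before the first " <any character><digit>"
  company = unlist(lapply(strsplit(as.character(product.df$Product), split = " .[0-9]"), function(x) x[1])),
  # name: the text after the first "1 " or "2 " (this relies on the sample data)
  name = unlist(lapply(strsplit(as.character(product.df$Product), split = "[12] "), function(x) x[2]))
)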
According to your data, the separator between the company and the product is the first space character, so the first step is to convert this first space to something else (in this example, __); the reason for this is explained below.
This is your actual data:
Product
1 Company1 123M UG
2 Company1 234M-I
3 Company2 763-87-U
4 Company2 777-87
5 Company3 Name1 87M
6 Company3 Name1 O77M
7 Company3 Name1 765-U MP
This code performs the conversion (sub replaces only the first match in each string):
product.df$Product <- sub(pattern = " ", replacement = "__", x = product.df$Product, perl = TRUE)
The data should then look like this:
Product
1 Company1__123M UG
2 Company1__234M-I
3 Company2__763-87-U
4 Company2__777-87
5 Company3__Name1 87M
6 Company3__Name1 O77M
7 Company3__Name1 765-U MP
Then use the tidyr library to separate this column of the data frame:
library("tidyr")
new.product.df <- separate( product.df , Product , c("Company" , "Model") , sep = "__")
The reason for converting the first space to __ is that the rest of the string may also contain spaces (for example 123M UG and Name1 87M), which would cause errors later when separating the column on a plain space.
Of course, it would be better to separate directly on the first occurrence of the space character, but I don't know how, because the separator regex matches every occurrence by default. Any suggestions are welcome.
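Splitting on only the first space can also be sketched in base R, without the intermediate __ marker: locate the first space with regexpr and cut the string on either side of it. This assumes every value contains at least one space. Like the answers above, it treats everything before the first space as the company, so "Company3 Name1" is still split after "Company3".

```r
product <- c("Company1 123M UG", "Company1 234M-I", "Company2 763-87-U",
             "Company2 777-87", "Company3 Name1 87M", "Company3 Name1 O77M",
             "Company3 Name1 765-U MP")

# Position of the first space in each string (assumes one exists)
pos <- as.integer(regexpr(" ", product, fixed = TRUE))

new.product.df <- data.frame(
  Company = substr(product, 1, pos - 1),                # text before the first space
  Model   = substr(product, pos + 1, nchar(product)),   # text after the first space
  stringsAsFactors = FALSE
)
new.product.df
```

Because regexpr reports only the first match, later spaces in the model part (as in "123M UG") are left untouched.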