read csv setting fields with spaces to NA - regex

I have a csv file that looks like this:
A, B, C,
1, 2 1, 3,
3, 1, 0,
4, 1, 0 5,
...
is it possible to set the na.string to assign all fields with space to NA (e.g. something like regex function(x){x[grep(patt="\\ ", x)]<-NA;x}), i.e.
A, B, C,
1, NA, 3,
3, 1, 0,
4, 1, NA,

We can loop over the columns and set it to NA by converting to numeric
df1[] <- lapply(df1, as.numeric)
NOTE: Here, I assumed that the columns are character class. If it is factor, do lapply(df1, function(x) as.numeric(as.character(x)))

Variation on #akrun's answer (which I like).
library(dplyr)
read.csv("test.csv", colClasses="character") %>% mutate_each(funs(as.numeric))
This reads the file assuming all columns are character, then converts all to numeric with mutate_each from dplyr.
Using colClasses="numeric" already in the read call didn't work (and I don't know why :( ), since
> as.numeric("2 1")
[1] NA
From How to read data when some numbers contain commas as thousand separator? we learn that we can make a new function to do the conversion.
setAs("character", "numwithspace", function(from) as.numeric(from) )
read.csv("test.csv", colClasses="numwithspace")
which gives
A B C
1 1 NA 3
2 3 1 0
3 4 1 NA

I don't know how this would translate in r, but I would use the following regex to match fields containing spaces :
[^, ]+ [^, ]+
Which is :
some characters other than a comma or a space ([^, ]+)
followed by a space ()
and some more characters other than a comma or a space ([^, ]+)
You can see it in action here.

Related

Regexp in R to match everything in between first and last occurene of some specified character

I'd like to match everything between the first and last underscore. I use R.
What I have until now is this:
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')
sub('[^_]*_(.*)_[^_]*', x = p.subject, replacement = '\\1', perl = T)
Where 'bla' is any character except an underscore...
The result I'd like would be something like this:
c(NA, NA, bla, bla_bla)
I can't figure it out! Why does the first pattern match? It shouldn't because the pattern must have 2 underscores! Or do I have to use some kind of lookahead expression?
Your help is very welcome!
You can use gsub:
vec <- gsub("(^[^_]+)_?|_?([^_]+$)", "", p.subject)
vec <- ifelse(nchar(vec) == 0 , NA, vec)
vec
[1] NA NA "bla" "bla_bla"
Data:
dput(p.subject)
c("bla_bla", "bla", "bla_bla_bla", "bla_bla_bla_bla")
Here is another option using str_extract. We use regex lookarounds to extract the pattern between the first and the last occurrence of a specified character i.e. _.
library(stringr)
str_extract(p.subject, "(?<=[^_]{1,30}_).*(?=_[^_]+)")
#[1] NA NA "bla" "bla_bla"
NOTE: We didn't use any ifelse.
data
p.subject <- c('bla_bla', 'bla', 'bla_bla_bla', 'bla_bla_bla_bla')

Difficulty Cleaning a Data Frame Using Regexp in R

I am relatively new to R and having difficulty cleaning up a data frame using regex.
One of the columns of that data frame has strings such as:
NUMERO_APPEL
1 NNA
2 VQ-40989
3 41993
4 41993
5 42597
6 VQ-42597
7 DER8
8 40001-2010
I would like to extract the 5 consecutive digits of the strings that have the following format and only the following format, all other strings will be replaced by NAs.
AO-11111
VQ-11111
11111
** Even if Case 8 contains 5 consecutive numbers, it will be replaced by NA as well... Furthermore, a more than or less than 5 digits long number would also be replaced by NA.
Note that the 5 consecutive digits could be any number [0-9], but the characters 'AO-' and 'VQ-' are fixed (i.e. 'AO ' or 'VE-' would be replaced to NA as well.)
This is the code that I currently have:
# Declare a Function that Extracts the 1st 'n' Characters Starting from the Right!
RightSubstring <- function(String, n) {
substr(String, nchar(String)-n+1, nchar(String))
}
# Declare Function to Remove NAs in Specific Columns!
ColRemNAs <- function(DataFrame, Column) {
CompleteVector <- complete.cases(DataFrame, Column)
return(DataFrame[CompleteVector, ])
Contrat$NUMERO_APPEL <- RightSubstring(as.character(Contrat$NUMERO_APPEL), 5)
Contrat$NUMERO_APPEL <- gsub("[^0-9]", NA, Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL <- as.numeric(Contrat$NUMERO_APPEL)
# Efface les Lignes avec des éléments NAs.
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_COMMANDE)
Contrat <- ColRemNAs(Contrat, Contrat$NO_FOURNISSEUR)
Contrat <- ColRemNAs(Contrat, Contrat$NUMERO_APPEL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_INITIAL)
Contrat <- ColRemNAs(Contrat, Contrat$MONTANT_ACTUEL)
}
Thanks in advance. Hope my explanations were clear!
Here is a base R solution which will match 5 digits occurring only in the following three forms:
AO-11111
VQ-11111
11111
I use this regular expression to match the five digits:
^((AQ|VQ)-)?(\\d{5})$
Strings which match begin with an optional AQ- or VQ-, and then are followed by 5 consecutive digits, after which the string must terminate.
The following code substitutes all matching patterns with the 5 digits found, and stores NA into all non-matching patterns.
ind <- grep("^((AQ|VQ)-)?(\\d{5})$", Contrat$NUMERO_APPEL, value = FALSE)
Contrat$NUMERO_APPEL <- gsub("^(((AQ|VQ)-)?(\\d{5}))$", "\\4", Contrat$NUMERO_APPEL)
Contrat$NUMERO_APPEL[-ind] <- NA
For more reading see this SO post.
library(dplyr)
library(stringi)
df %>%
mutate(NUMERO_APPEL.fix =
NUMERO_APPEL %>%
stri_extract_first_regex("[0-9]{5}") %>%
as.numeric)

how to replace nth character of a string in a column in r

My input is
a<-c("aa_bbb_cc_ddd","ee_fff_gg_hhh")
b<-c("a","b")
df<-data.frame(cbind(a,b))
I want my output to be
a<-c("aa_bbb-cc_ddd","ee_fff-gg_hhh")
b<-c("a","b")
df<-data.frame(cbind(a,b))
please help
If things are as consistent as you show and you want to replace the 7th character then substring may be a good way to go, but you made the column character by wrapping with data.frame without stringsAsFactors = FALSE. You'd need to make the column character first:
df$a <- as.character(df$a)
substring(df$a, 7, 7) <- "-"
df
## a b
## 1 aa_bbb-cc_ddd a
## 2 ee_fff-gg_hhh b
You may use sub,
sub("^([^_]*_[^_]*)_", "\\1-",df$a)
Example:
> a<-c("aa_bbb_cc_ddd","ee_fff_gg_hhh")
> b<-c("a","b")
> df<-data.frame(cbind(a,b))
> df
a b
1 aa_bbb_cc_ddd a
2 ee_fff_gg_hhh b
> df$a <- sub("^([^_]*_[^_]*)_", "\\1-",df$a)
> df
a b
1 aa_bbb-cc_ddd a
2 ee_fff-gg_hhh b
Here's a general way to replace the nth occurrence of _ with -.
n <- 2
# create regex pattern based on n
pat <- paste0("^((?:.*?_){", n - 1, "}.*?)_")
# [1] "^((?:.*?_){1}.*?)_"
# replace character
sub("^((?:.*?_){1}.*?)_", "\\1-", df$a, perl = TRUE)
# [1] "aa_bbb-cc_ddd" "ee_fff-gg_hhh"

Regex/ Substring

I have a sequence like this in a list "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
I would like to create a substring like wherever a "K" is present it needs to pull out 6 characters before and 6 characters after "K"
Ex : MSGSRRKATPASR , here -6..K..+6
for the whole sequence..I tried the substring function in R but we need to specify the start and end position. Here the positions are unknown
Thanks
.{6}K.{6}
Try this.This will give the desired result.
See demo.
http://regex101.com/r/dM0rS8/4
use this:
\w{7}(?<=K)\w{6}
this uses positive lookbehind to ensure that there are characters present before K.
demo here: http://regex101.com/r/pK3jK1/2
Using rex may make this type of task a little simpler.
x <- "MSGSRRKATPASRTRVGNYEMGRTLGEGSFAKVKYAKNTVTGDQAAIKILDREKVFRHKMVEQLKREISTMKLIKHPNVVEIIEVMASKTKIYIVLELVNGGELFDKIAQQGRLKEDEARRYFQQLINAVDYCHSRGVYHRDLKPENLILDANGVLKVSDFGLSAFSRQVREDGLLHTACGTPNYVAPEVLSDKGYDGAAADVWSCGVILFVLMAGYLPFDEPNLMTLYKRICKAEFSCPPWFSQGAKRVIKRILEPNPITRISIAELLEDEWFKKGYKPPSFDQDDEDITIDDVDAAFSNSKECLVTEKKEKPVSMNAFELISSSSEFSLENLFEKQAQLVKKETRFTSQRSASEIMSKMEETAKPLGFNVRKDNYKIKMKGDKSGRKGQLSVATEVFEVAPSLHVVELRKTGGDTLEFHKVCDSFYKNFSSGLKDVVWNTDAAAEEQKQ"
library(rex)
re_matches(x,
rex(
capture(name = "amino_acids",
n(any, 6),
"K",
n(any, 6)
)
),
global = TRUE)[[1]]
#> amino_acids
#>1 MSGSRRKATPASR
#>2 GEGSFAKVKYAKN
#>3 GDQAAIKILDREK
#>4 KMVEQLKREISTM
#>5 IEVMASKTKIYIV
#>6 GGELFDKIAQQGR
#>7 VYHRDLKPENLIL
#>8 DANGVLKVSDFGL
#>9 PEVLSDKGYDGAA
#>10 NLMTLYKRICKAE
#>11 WFSQGAKRVIKRI
#>12 LEDEWFKKGYKPP
#>13 AAFSNSKECLVTE
#>14 LENLFEKQAQLVK
#>15 ASEIMSKMEETAK
#>16 LGFNVRKDNYKIK
#>17 GDKSGRKGQLSVA
#>18 HVVELRKTGGDTL
#>19 VCDSFYKNFSSGL
However the above is greedy, each K will only appear in one result.
If you want to output an AA for each K
library(rex)
locs <- re_matches(x,
rex(
"K" %if_prev_is% n(any, 6) %if_next_is% n(any, 6)
),
global = TRUE, locations = TRUE)[[1]]
substring(x, locs$start - 6, locs$end + 6)
#> [1] "MSGSRRKATPASR" "GEGSFAKVKYAKN" "GSFAKVKYAKNTV" "AKVKYAKNTVTGD"
#> [5] "GDQAAIKILDREK" "KILDREKVFRHKM" "EKVFRHKMVEQLK" "KMVEQLKREISTM"
#> [9] "REISTMKLIKHPN" "STMKLIKHPNVVE" "IEVMASKTKIYIV" "VMASKTKIYIVLE"
#>[13] "GGELFDKIAQQGR" "AQQGRLKEDEARR" "VYHRDLKPENLIL" "DANGVLKVSDFGL"
#>[17] "PEVLSDKGYDGAA" "NLMTLYKRICKAE" "LYKRICKAEFSCP" "WFSQGAKRVIKRI"
#>[21] "GAKRVIKRILEPN" "LEDEWFKKGYKPP" "EDEWFKKGYKPPS" "WFKKGYKPPSFDQ"
#>[25] "AAFSNSKECLVTE" "ECLVTEKKEKPVS" "CLVTEKKEKPVSM" "VTEKKEKPVSMNA"
#>[29] "LENLFEKQAQLVK" "KQAQLVKKETRFT" "QAQLVKKETRFTS" "ASEIMSKMEETAK"
#>[33] "KMEETAKPLGFNV" "LGFNVRKDNYKIK" "VRKDNYKIKMKGD" "KDNYKIKMKGDKS"
#>[37] "NYKIKMKGDKSGR" "IKMKGDKSGRKGQ" "GDKSGRKGQLSVA" "HVVELRKTGGDTL"
#>[41] "DTLEFHKVCDSFY" "VCDSFYKNFSSGL" "NFSSGLKDVVWNT"

regex for position matching with OR condition

Newbie to regex and looking for help in creating regexp to seek out following:
The data items consists of six character strings as shown in example below
1) "100100"
2) "110011"
3) "010000"
4) "110011"
5) "111111"
6) "000111"
Need to use regexp to find data with say
1 in the 1st position OR 1 in the 4th position: Items 1, 2, 4, 5 and 6 should be matched
1 in 2nd position: Items 2,4 ad 5 should be matched
1 in 5th and 6th position: Items 2, 4, 5 and 6 should be matched
Given your samples, these will work:
1 in the 1st position OR 1 in the 4th position: Items 1, 2, 4, 5 and 6 should be matched
1.....|...1...
1 in 2nd position: Items 2,4 ad 5 should be matched
.1....
1 in 5th and 6th position: Items 2, 4, 5 and 6 should be matched
....11
Or if you want to match any of these rules, combine them with the | (or) operator.
Example:
http://regexpal.com/?flags=g&regex=(1.....%7C...1...%7C.1....%7C....11)&input=100100%0A%0A110011%0A%0A010000%0A%0A110011%0A%0A111111%0A%0A000111
If it is always strings with only 1s and 0s, you should treat them as binary numbers and use logical operators to find the matches.
Try this regex
([1][0-1]{2}[1][0-1]{2})|([0-1][1][0-1]{4})|([0-1]{4}[1]{2})
Find the explanation and demo here http://www.regex101.com/r/vD9jE7
Here's an example. Change dots with zeros if necessary. /^(11..|.1.1)11$/
^ # beginning of string
( # either
11.. # 11 and any 2 char
| # or
.1.1 # any char, 1, any char, 1
)
11
$ # end of string