perform gsub in a data frame with 2 columns - regex

I have a dataset with 2 columns, and I would like to clean it up using gsub like this:
Data_edited_txt2 <- gsub("(RT|via)((?:\\b\\W*#\\w+)+)", "", Data_edited_txt2$text)
Data_edited_txt2 <- gsub("#\\w+", " ", Data_edited_txt2$text)
Data_edited_txt2 <- gsub("[[:punct:]]", "", Data_edited_txt2$text)
I get the error "$ operator is invalid for atomic vectors" on the second gsub, and I noticed that the 2nd column disappears after running the first gsub.
Please advise how to perform all the gsub calls while keeping the 2nd column.
structure(list(text = structure(c(1L, 3L, 7L, 4L, 2L, 5L, 6L), .Label = c("#airasia im searching job",
"#AirAsia no flight warning for cebu outbound?", "#shazzr1 #AirAsia never mind.. now everyone can fly.",
"#TigerAir confirmed as having far nastier policies and uncaring customer service than #airasia who I will now fly every time in preference.",
"#Wingmates Since your taxes is HIGHER than other airlines but your service is really BAD because always change and cancel the flight.",
"hai MASwings #Wingmates . Bilakah tempoh promosi anda? Saya ingin terbang ke Palawan dengan bajet yang agak rendah :3",
"One thing I \"like\" about #AirAsia is, DELAY."), class = "factor"),
created = structure(c(3L, 2L, 1L, 7L, 6L, 4L, 5L), .Label = c("2/2/2014 11:30",
"2/2/2014 11:32", "2/2/2014 12:18", "24/2/2014 4:03", "29/3/2014 8:21",
"30/1/2014 16:02", "31/1/2014 8:13"), class = "factor")), .Names = c("text",
"created"), class = "data.frame", row.names = c(NA, -7L))

You overwrite the whole data frame instead of only one column. Try this:
Data_edited_txt2$text <- gsub("(RT|via)((?:\\b\\W*#\\w+)+)", "", Data_edited_txt2$text)
Data_edited_txt2$text <- gsub("#\\w+", " ", Data_edited_txt2$text)
Data_edited_txt2$text <- gsub("[[:punct:]]", "", Data_edited_txt2$text)
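If you prefer not to repeat the assignment three times, the same cleanup can be written as a loop over the patterns. This is only a sketch of an alternative, using the same data frame and patterns as above:
# Apply each pattern/replacement pair to the text column in turn,
# leaving the rest of the data frame (including `created`) untouched.
patterns <- c("(RT|via)((?:\\b\\W*#\\w+)+)" = "",
              "#\\w+"                        = " ",
              "[[:punct:]]"                  = "")
for (p in names(patterns)) {
  Data_edited_txt2$text <- gsub(p, patterns[[p]], Data_edited_txt2$text)
}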

Related

Combining formattable with kableExtra

I am trying to create a table that combines features from formattable with kableExtra. I have found a number of examples which have helped, but they don't quite do everything I'm trying to achieve.
This is what I've tried so far:
library(kableExtra)
library(formattable)
df <- structure(list(Income_source = c("A", "B", "C", "C"), Jul = c(1777.01,
0.13, 9587.39, 11364.53), Aug = c(0, 0.09, 9908.78, 9908.87),
Sep = c(5374.6, 0.03, 9859.87, 15234.5)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Here is an example of the formattable call I'd like to apply. Note that color_tile is applied specifically to each row:
formattable(df, lapply(1:nrow(df), function(row) {
  area(row, col = 1:nrow(df)) ~ color_tile("transparent", "pink")
}))
The example I found which lets me combine Formattable with KableExtra looks like this:
df %>%
  mutate(Jul = formattable::color_tile("transparent", "pink")(Jul),
         Aug = formattable::color_tile("transparent", "pink")(Aug),
         Sep = formattable::color_tile("transparent", "pink")(Sep)) %>%
  select(Income_source, everything()) %>%   # the id column in this data is Income_source
  kable("html", escape = F, format.args = list(big.mark = ",", scientific = FALSE)) %>%
  kable_classic(full_width = T, html_font = "Cambria") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T)
The problems with this solution are:
1: The color_tile function is applied to columns rather than rows
2: The numeric values drop the commas
The table I'm planning to generate would be updated monthly, so that next month the data for October would be presented, followed by November and so forth. As such, I'm hoping for a solution that doesn't require me to edit the script each time, i.e. mutate the new data. Hopefully that makes sense.
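One possible direction is sketched below. It is only an assumption on my part (it needs dplyr >= 1.0) and it addresses only the hard-coded column names: the tiles are still shaded column-wise, and the comma formatting would still need separate handling.
library(dplyr)
library(knitr)        # kable()
library(kableExtra)
# Apply color_tile to every numeric column without naming the months explicitly,
# so new months appearing in the data do not require edits to the script.
df %>%
  mutate(across(where(is.numeric),
                ~ formattable::color_tile("transparent", "pink")(.x))) %>%
  kable("html", escape = FALSE) %>%
  kable_classic(full_width = TRUE, html_font = "Cambria")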

I want a line graph with mean and SD like this

I want a graph like the one in the picture, but I don't understand how to do this. I actually have different data for the same date; how do I make it into a line?
library(tidyverse)
library(data.table)
fish <- read_csv(file = "FishF.csv", col_types = cols(DD = col_factor()))
fish[is.na(fish)] <- 0   # replace missing values with 0
fish
View(fish)
fd <- fish[, c(1, 2, 3)]
newfish <- melt(setDT(fish), id.vars = "DD",
                measure.vars = patterns("avg", "SE"),
                value.name = c("avg", "SE"))[, variable := lvls_revalue(variable, c("C3", "IgM", "IgT", "KHV", "Lyso"))][]
n <- melt(setDT(fd), id.vars = "DD",
          measure.vars = patterns("avg", "SE"),
          value.name = c("avg", "SE"))[, variable := lvls_revalue(variable, c("C3"))][]
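For the plot itself, here is a minimal ggplot2 sketch of the kind of graph described, assuming the reshaped newfish above ends up with the columns DD, variable, avg and SE (and using SE in place of SD):
library(ggplot2)
# One line per measure, with points at each date and error bars of +/- one SE.
ggplot(newfish, aes(x = DD, y = avg, colour = variable, group = variable)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = avg - SE, ymax = avg + SE), width = 0.2) +
  labs(x = "Date", y = "Mean +/- SE")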

R function for pattern matching

I am doing a text mining project that will analyze some speeches from the three remaining presidential candidates. I have completed POS tagging with OpenNLP and created a two column data frame with the results. I have added a variable, called pair. Here is a sample from the Clinton data frame:
V1 V2 pair
1 c( NN FALSE
2 "thank VBP FALSE
3 you PRP FALSE
4 so RB FALSE
5 much RB FALSE
6 . . FALSE
7 it PRP FALSE
8 is VBZ FALSE
9 wonderful JJ FALSE
10 to TO FALSE
11 be VB FALSE
12 here RB FALSE
13 and CC FALSE
14 see VB FALSE
15 so RB FALSE
16 many JJ FALSE
17 friends NNS FALSE
18 . . FALSE
19 ive JJ FALSE
20 spoken VBN FALSE
What I'm now trying to do is write a function that will iterate through the V2 POS column and evaluate it for specific pattern pairs. (These come from Turney's PMI article.) I'm not yet very knowledgeable when it comes to writing functions, so I'm certain I've done it wrong, but here is what I've got so far.
pairs <- function(x){
  JJ <- "JJ"     #adjectives
  N <- "N[A-Z]"  #any noun form
  R <- "R[A-Z]"  #any adverb form
  V <- "V[A-Z]"  #any verb form
  for(i in 1:(length)(x) {
    if(x == J && x+1 == N) {  #i.e., if the first word = J and the next = N
      pair[i] <- "JJ|NN"      #insert this into the 'pair' variable
    } else if (x == R && x+1 == J && x+2 != N) {
      pair[i] <- "RB|JJ"
    } else if (x == J && x+1 == J && x+2 != N) {
      pair[i] <- "JJ|JJ"
    } else if (x == N && x+1 == J && x+2 != N) {
      pair[i] <- "NN|JJ"
    } else if (x == R && x+1 == V) {
      pair[i] <- "RB|VB"
    } else {
      pair[i] <- "FALSE"
    }
  }
}
# Run the function
cl.df.pairs <- pairs(cl.df$V2)
There are a number of (truly embarrassing) issues. First, when I try to run the function code, I get two Error: unexpected '}' in " }" errors at the end. I can't figure out why, because each one matches an opening "{". I'm assuming it's because R is expecting something else to be there.
Also, and more importantly, this function won't exactly get me what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.
Then I need to figure out how to evaluate the semantic orientation of each word combo by comparing the phrases to the pos/neg lexical data sets that I have, but that's a whole other issue. I have the formula from the article, which I'm hoping will point me in the right direction.
I have looked all over and can't find a comparable function in any of the NLP packages, such as OpenNLP, RTextTools, etc. I HAVE looked at other SO questions/answers, like this one and this one, but they haven't worked for me when I've tried to adapt them. I'm fairly certain I'm missing something obvious here, so would appreciate any advice.
EDIT:
Here is the first 20 lines of the Sanders data frame.
head(sa.POS.df, 20)
V1 V2
1 the DT
2 american JJ
3 people NNS
4 are VBP
5 catching VBG
6 on RB
7 . .
8 they PRP
9 understand VBP
10 that IN
11 something NN
12 is VBZ
13 profoundly RB
14 wrong JJ
15 when WRB
16 , ,
17 in IN
18 our PRP$
19 country NN
20 today NN
And I've written the following function:
pairs <- function(x, y) {
  require(gsubfn)
  J <- "JJ"      #adjectives
  N <- "N[A-Z]"  #any noun form
  R <- "R[A-Z]"  #any adverb form
  V <- "V[A-Z]"  #any verb form
  for(i in 1:(length(x))) {
    ngram <- c(x[[i]], x[[i+1]])
    # the ngram consists of the word on line `i` and the word below line `i`
  }
  strapply(y[i], "(J)\n(N)", FUN = paste(ngram, sep = " "), simplify = TRUE)
  ngrams.df = data.frame(ngrams = ngram)
  return(ngrams.df)
}
So, what is SUPPOSED to happen is that when strapply matches the pattern (in this case, an adjective followed by a noun), it should paste the ngram. And all of the resulting ngrams should populate ngrams.df.
So I've entered the following function call and get an error:
> sa.JN <- pairs(x=sa.POS.df$V1, y=sa.POS.df$V2)
Error in x[[i + 1]] : subscript out of bounds
I'm only just learning the intricacies of regular expressions, so I'm not quite sure how to get my function to pull the actual adjective and noun. Based on the data shown here, it should pull "american" and "people" and paste them into the data frame.
Okay, here we go. Using this data (shared nicely with dput()):
df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L,
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",",
".", "american", "are", "catching", "country", "in", "is", "on",
"our", "people", "profoundly", "something", "that", "the", "they",
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L,
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L,
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN",
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))
I'll use the stringr package because of its consistent syntax, so I don't have to look up the argument order for grep. We'll first detect the adjectives, then the nouns, and figure out where they line up (offsetting by 1). Then paste together the words that correspond to the matches.
library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")
pairs = which(c(FALSE, adj) & c(noun, FALSE))
ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"
Now we can put it in a function. I left the patterns as arguments (with adjective, noun as the defaults) for flexibility.
bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
  pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
                c(str_detect(type, patt2), FALSE))
  return(paste(word[pairs - 1], word[pairs]))
}
Demonstrating use on the original data
with(df, bigram(word = V1, type = V2))
# [1] "american people"
Let's cook up some data with more than one match to make sure it works:
df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad", "bank"),
                 t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
# w t
# 1 american JJ
# 2 people NNS
# 3 hate VBP
# 4 a DT
# 5 big JJ
# 6 bad JJ
# 7 bank NN
with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"
And back to the original to test out a different pattern:
with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are" "something is"
I think the following is the code you wrote, but without throwing errors:
pairs <- function(x) {
  J <- "JJ"      #adjectives
  N <- "N[A-Z]"  #any noun form
  R <- "R[A-Z]"  #any adverb form
  V <- "V[A-Z]"  #any verb form
  pair = rep("FALSE", nrow(x))  # one slot per row of the POS data frame
  for(i in 1:(nrow(x)-2)) {
    this.pos = x[i, 2]
    next.pos = x[i+1, 2]
    next.next.pos = x[i+2, 2]
    if(this.pos == J && next.pos == N) {  #i.e., if the first word = J and the next = N
      pair[i] <- "JJ|NN"                  #insert this into the 'pair' variable
    } else if (this.pos == R && next.pos == J && next.next.pos != N) {
      pair[i] <- "RB|JJ"
    } else if (this.pos == J && next.pos == J && next.next.pos != N) {
      pair[i] <- "JJ|JJ"
    } else if (this.pos == N && next.pos == J && next.next.pos != N) {
      pair[i] <- "NN|JJ"
    } else if (this.pos == R && next.pos == V) {
      pair[i] <- "RB|VB"
    } else {
      pair[i] <- "FALSE"
    }
  }
  ## then deal with the last two elements, for which you can't check what's up next
  return(pair)
}
Not sure what you mean by this, though:
Also, and more importantly, this function won't exactly get me what I
want, which is to extract the word pairs that match a pattern and then
the pattern that they match. I honestly have no idea how to do that.
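If the goal is to get both the word pair and the label of the pattern it matched, one hypothetical extension of the bigram() function above could return a small data frame per pattern. This is an assumption on my part, not part of the answer, and it ignores Turney's "not followed by a noun" conditions:
library(stringr)
bigram_tagged <- function(word, type, patt1 = "JJ", patt2 = "N[A-Z]", label = "JJ|NN") {
  # same shifted str_detect logic as bigram(), but keep a label for the pattern
  hits <- which(c(FALSE, str_detect(type, patt1)) &
                c(str_detect(type, patt2), FALSE))
  data.frame(ngram = paste(word[hits - 1], word[hits]),
             pattern = rep(label, length(hits)),
             stringsAsFactors = FALSE)
}
# e.g. stack the results for each pattern of interest:
# rbind(with(df, bigram_tagged(V1, V2, "JJ",     "N[A-Z]", "JJ|NN")),
#       with(df, bigram_tagged(V1, V2, "R[A-Z]", "JJ",     "RB|JJ")))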

Adding two decimal places

I have a column in a dataset as shown below
Col1
----------
249
250.8
251.3
250.33
648
1249Y4
X569X3
4459120
2502420
What I am trying to do is add two decimal places only to numbers that have three digits before the decimal point, in other words, numbers in the hundreds. For example, 249 should be converted to 249.00, 251.3 to 251.30, and so on, but not 4459120, 2502420, or X569X3. The final output should look like this.
Col1
----------
249.00
250.80
251.30
250.33
648.00
1249Y4
X569X3
4459120
2502420
I have looked at many different functions, but so far none of them work because there are strings mixed in with the numbers (for example X569X3) as well as seven-digit numbers such as 2502420.
Actual dataset
structure(c(5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L,
55L, 56L, 57L, 58L, 59L, 84L, 86L, 87L, 88L, 99L, 100L, 101L,
102L, 103L, 104L, 105L, 106L, 107L, 108L, 110L, 5L, 12L, 14L,
16L, 20L, 24L, 36L, 40L, 44L, 48L, 52L, 56L, 83L, 85L, 75L, 112L,
66L, 68L, 96L, 93L, 77L, 80L, 81L, 70L, 95L, 78L, 109L, 94L,
63L, 67L, 98L, 73L, 79L, 76L, 90L, 111L, 69L, 97L, 64L, 92L,
89L, 82L, 62L, 74L, 60L, 65L, 71L, 91L, 61L, 72L, 4L, 1L, 2L,
3L, 113L), .Label = c("1234X1", "123871", "1249Y4", "146724",
"249", "249.01", "249.1", "249.11", "249.2", "249.21", "249.3",
"249.4", "249.41", "249.5", "249.51", "249.6", "249.61", "249.7",
"249.71", "249.8", "249.81", "249.9", "249.91", "250", "250.01",
"250.02", "250.03", "250.1", "250.11", "250.12", "250.13", "250.22",
"250.23", "250.32", "250.33", "250.4", "250.41", "250.42", "250.43",
"250.5", "250.51", "250.52", "250.53", "250.6", "250.61", "250.62",
"250.63", "250.7", "250.71", "250.72", "250.73", "250.8", "250.81",
"250.82", "250.83", "250.9", "250.91", "250.92", "250.93", "2502110",
"2502111", "2502112", "2502113", "2502114", "2502115", "2502210",
"2502310", "2502410", "2502420", "2502510", "2502610", "2502611",
"2502612", "2502613", "2502614", "2502615", "2506110", "2506120",
"2506130", "2506140", "2506150", "2506160", "251.3", "251.8",
"253.5", "258.1", "275.01", "277.39", "3640140", "3670110", "3670150",
"3748210", "3774410", "3774420", "4459120", "5379670", "5379671",
"6221340", "648", "648.01", "648.02", "648.03", "648.04", "648.8",
"648.81", "648.82", "648.83", "648.84", "7079180", "775.1", "7821120",
"7862120", "X569X3"), class = "factor")
Let's call your vector x:
numbers = !is.na(as.numeric(x))
x.num = x[numbers]
x[numbers] = ifelse(as.numeric(x.num) < 1000,
                    sprintf("%.2f", as.numeric(x.num)),
                    x.num)
x
# [1] "249.00" "250.80" "251.30" "250.33" "648.00"
# [6] "1249Y4" "X569X3" "4459120" "2502420"
Use formatC with a selection of only the values you wish to replace.
x <- c("249", "250.8", "251.3", "250.33", "648", "1249Y4", "X569X3", "4459120", "2502420")
sel <- which(as.numeric(x) < 1000)
replace(x, sel, formatC(as.numeric(x[sel]), digits=2, format="f"))
#[1] "249.00" "250.80" "251.30" "250.33" "648.00" "1249Y4" "X569X3"
#[8] "4459120" "2502420"
First, change your dataset to character:
x <- as.character(x)
Then perform the following:
ifelse(grepl("[[:alpha:]]", x) == FALSE & as.numeric(x) < 1000,
       sprintf("%.2f", as.numeric(x)), x)
Or if your data is in Col1 in a dataframe:
df %>%
  mutate(Col1 = ifelse(grepl("[[:alpha:]]", Col1) == FALSE & as.numeric(as.character(Col1)) < 1000,
                       sprintf("%.2f", as.numeric(as.character(Col1))),
                       as.character(Col1)))

How to separate the variables of a particular column in a CSV file and write to a CSV file in R?

I have a CSV file like
Market,CampaignName,Identity
Wells Fargo,Gary IN MetroChicago IL Metro,56
EMC,Los Angeles CA MetroBoston MA Metro,78
Apple,Cupertino CA Metro,68
Desired output to a CSV file, with the first row as the headers:
Market,City,State,Identity
Wells Fargo,Gary,IN,56
Wells Fargo,Chicago,IL,56
EMC,Los Angeles,CA,78
EMC,Boston,MA,78
Apple,Cupertino,CA,68
res <- gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*', '\\1,\\2:\\3,\\4',
            xx$Market)
How do I modify the above regular expression to get this result in R?
I'm new to R; any help is appreciated.
library(stringr)
xx.to.split <- with(xx, setNames(gsub("Metro", "", as.character(CampaignName)), Market))
do.call(rbind, str_match_all(xx.to.split, "(.+?) ([A-Z]{2}) ?"))[, -1]
Produces:
[,1] [,2]
Wells Fargo "Gary" "IN"
Wells Fargo "Chicago" "IL"
EMC "Los Angeles" "CA"
EMC "Boston" "MA"
Apple "Cupertino" "CA"
This should work even if you have a different number of campaign names in each market. Unfortunately I think base options are annoying to implement because, frustratingly, there isn't a gregexec, although I'd be curious if someone comes up with something comparably compact in base.
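A possible follow-up (an assumption on my part, not from the answer above): continuing from xx.to.split, re-attach Identity by repeating each Market/Identity once per extracted city, then write out the requested CSV (the file name is just a placeholder).
m <- str_match_all(xx.to.split, "(.+?) ([A-Z]{2}) ?")
n <- vapply(m, nrow, integer(1))            # number of cities per market
out <- data.frame(Market   = rep(xx$Market, n),
                  City     = do.call(rbind, m)[, 2],
                  State    = do.call(rbind, m)[, 3],
                  Identity = rep(xx$Identity, n))
write.csv(out, "output.csv", row.names = FALSE, quote = FALSE)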
Here is a solution using base R. Split the CampaignName column on the string Metro, adding sequential numbers as names. stack turns it into a data frame with columns ind and values, which we massage into DF1. Merge that with xx by the sequence numbers of DF1 and the row numbers of xx. Move Market to the front of DF2 and remove ind and CampaignName. Finally, write it out.
xx <- read.csv("Campaign.csv", as.is = TRUE)
s <- strsplit(xx$CampaignName, " Metro")
names(s) <- seq_along(s)
ss <- stack(s)
DF1 <- with(ss, data.frame(ind,
                           City = sub(" ..$", "", values),
                           State = sub(".* ", "", values)))
DF2 <- merge(DF1, xx, by.x = "ind", by.y = 0)
DF <- DF2[ c("Market", setdiff(names(DF2), c("ind", "Market", "CampaignName"))) ]
write.csv(DF, file = "myfile.csv", row.names = FALSE, quote = FALSE)
REVISED to handle extra columns after poster modified the question to include such. Minor improvements.