Adding two decimal places - regex

I have a column in a dataset as shown below
Col1
----------
249
250.8
251.3
250.33
648
1249Y4
X569X3
4459120
2502420
What I am trying to do is add two decimal places only to numbers that have three digits before the decimal point, in other words numbers in the hundreds. For example, 249 should become 249.00 and 251.3 should become 251.30, but not 4459120, 2502420, or X569X3. The final output should look like this.
Col1
----------
249.00
250.80
251.30
250.33
648.00
1249Y4
X569X3
4459120
2502420
I have looked at many different functions, but so far none of them work because there are strings mixed in among the numbers (for example X569X3) as well as seven-digit numbers such as 2502420.
Actual dataset
structure(c(5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L,
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L,
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L,
55L, 56L, 57L, 58L, 59L, 84L, 86L, 87L, 88L, 99L, 100L, 101L,
102L, 103L, 104L, 105L, 106L, 107L, 108L, 110L, 5L, 12L, 14L,
16L, 20L, 24L, 36L, 40L, 44L, 48L, 52L, 56L, 83L, 85L, 75L, 112L,
66L, 68L, 96L, 93L, 77L, 80L, 81L, 70L, 95L, 78L, 109L, 94L,
63L, 67L, 98L, 73L, 79L, 76L, 90L, 111L, 69L, 97L, 64L, 92L,
89L, 82L, 62L, 74L, 60L, 65L, 71L, 91L, 61L, 72L, 4L, 1L, 2L,
3L, 113L), .Label = c("1234X1", "123871", "1249Y4", "146724",
"249", "249.01", "249.1", "249.11", "249.2", "249.21", "249.3",
"249.4", "249.41", "249.5", "249.51", "249.6", "249.61", "249.7",
"249.71", "249.8", "249.81", "249.9", "249.91", "250", "250.01",
"250.02", "250.03", "250.1", "250.11", "250.12", "250.13", "250.22",
"250.23", "250.32", "250.33", "250.4", "250.41", "250.42", "250.43",
"250.5", "250.51", "250.52", "250.53", "250.6", "250.61", "250.62",
"250.63", "250.7", "250.71", "250.72", "250.73", "250.8", "250.81",
"250.82", "250.83", "250.9", "250.91", "250.92", "250.93", "2502110",
"2502111", "2502112", "2502113", "2502114", "2502115", "2502210",
"2502310", "2502410", "2502420", "2502510", "2502610", "2502611",
"2502612", "2502613", "2502614", "2502615", "2506110", "2506120",
"2506130", "2506140", "2506150", "2506160", "251.3", "251.8",
"253.5", "258.1", "275.01", "277.39", "3640140", "3670110", "3670150",
"3748210", "3774410", "3774420", "4459120", "5379670", "5379671",
"6221340", "648", "648.01", "648.02", "648.03", "648.04", "648.8",
"648.81", "648.82", "648.83", "648.84", "7079180", "775.1", "7821120",
"7862120", "X569X3"), class = "factor")

Let's call your vector x:
numbers = !is.na(as.numeric(x))
x.num = x[numbers]
x[numbers] = ifelse(as.numeric(x.num) < 1000,
                    sprintf("%.2f", as.numeric(x.num)),
                    x.num)
x
# [1] "249.00" "250.80" "251.30" "250.33" "648.00"
# [6] "1249Y4" "X569X3" "4459120" "2502420"

Use formatC with a selection of only the values you wish to replace.
x <- c("249", "250.8", "251.3", "250.33", "648", "1249Y4", "X569X3", "4459120", "2502420")
sel <- which(as.numeric(x) < 1000)
replace(x, sel, formatC(as.numeric(x[sel]), digits=2, format="f"))
#[1] "249.00" "250.80" "251.30" "250.33" "648.00" "1249Y4" "X569X3"
#[8] "4459120" "2502420"

First, change your dataset to character:
x <- as.character(x)
Then perform the following:
ifelse(!grepl("[[:alpha:]]", x) & as.numeric(x) < 1000,
       sprintf("%.2f", as.numeric(x)), x)
Or if your data is in Col1 in a data frame:
library(dplyr)
df %>%
  mutate(Col1 = ifelse(!grepl("[[:alpha:]]", Col1) & as.numeric(as.character(Col1)) < 1000,
                       sprintf("%.2f", as.numeric(as.character(Col1))),
                       as.character(Col1)))
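For completeness, a compact base-R variant along the same lines (a sketch; suppressWarnings() just silences the NA-coercion warning for the non-numeric entries, and if Col1 is a factor, convert it with as.character() first):

```r
x <- c("249", "250.8", "251.3", "250.33", "648", "1249Y4", "X569X3", "4459120", "2502420")
num <- suppressWarnings(as.numeric(x))  # non-numbers become NA, silently
sel <- !is.na(num) & num < 1000         # only genuine numbers below 1000
x[sel] <- sprintf("%.2f", num[sel])
x
# "249.00" "250.80" "251.30" "250.33" "648.00" "1249Y4" "X569X3" "4459120" "2502420"
```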

Combined Formattable with KableExtra

I am trying to create a table that combines features from formattable with kableExtra. I have found a number of examples which have helped, but none quite does everything I'm trying to achieve.
This is what I've tried so far:
library(kableExtra)
library(formattable)
df <- structure(list(Income_source = c("A", "B", "C", "C"), Jul = c(1777.01,
0.13, 9587.39, 11364.53), Aug = c(0, 0.09, 9908.78, 9908.87),
Sep = c(5374.6, 0.03, 9859.87, 15234.5)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Example of the formattable function I'd like to apply. Note that color_tile is applied specifically to each row:
formattable(df, lapply(1:nrow(df), function(row) {
area(row, col = 1:nrow(df)) ~ color_tile("transparent", "pink")
}))
The example I found which lets me combine Formattable with KableExtra looks like this:
df %>%
  mutate(Jul = formattable::color_tile("transparent", "pink")(Jul),
         Aug = formattable::color_tile("transparent", "pink")(Aug),
         Sep = formattable::color_tile("transparent", "pink")(Sep)) %>%
  select(Income_source, everything()) %>%
  kable("html", escape = FALSE, format.args = list(big.mark = ",", scientific = FALSE)) %>%
  kable_classic(full_width = TRUE, html_font = "Cambria") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = TRUE)
The problems with this solution are:
1: The color_tile function is applied to columns rather than rows.
2: The numeric values drop the commas.
The table I'm planning on generating would be updated monthly, so that next month the data for October would be presented, followed by November and so forth. As such, I'm hoping for a solution that doesn't require me to edit the script each time (i.e., adding a mutate for each new column). Hopefully that makes sense.
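One direction worth exploring (a sketch, untested against kableExtra's HTML rendering): dplyr's across() applies the same formatter to every numeric column, so a newly added month is picked up without editing the script. Note that this still shades column-wise rather than row-wise, and the comma formatting would still need separate handling once the cells become HTML strings:

```r
library(dplyr)
library(kableExtra)

df %>%
  # across() matches any numeric column, so October etc. are included automatically
  mutate(across(where(is.numeric),
                formattable::color_tile("transparent", "pink"))) %>%
  kable("html", escape = FALSE) %>%
  kable_classic(full_width = TRUE, html_font = "Cambria") %>%
  row_spec(0, bold = TRUE)
```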

Splitting string columns FAST in R

I have a data frame with 107 columns and 745,000 rows (much bigger than in my example).
Some of the character columns contain a type-like suffix at the end of each sequence, and I want to separate these type-ending parts into new columns.
I have made my own solution, but it seems far too slow for iterating through all 745,000 rows 53 times.
So I embed my solution in the following code, with some arbitrary data:
set.seed(1)
code_1 <- paste0(round(runif(5000, 100000, 999999)), "_", round(runif(1000, 1, 15)))
code_2 <- sample(c(paste0(round(runif(10, 100000, 999999)), "_", round(runif(10, 1, 15))), NA), 5000, replace = TRUE)
code_3 <- sample(c(paste0(round(runif(3, 100000, 999999)), "_", round(runif(3, 1, 15))), NA), 5000, replace = TRUE)
code_4 <- sample(c(paste0(round(runif(1, 100000, 999999)), "_", round(runif(1, 1, 15))), NA), 5000, replace = TRUE)
code_type_1 <- rep(NA, 5000)
code_type_2 <- rep(NA, 5000)
code_type_3 <- rep(NA, 5000)
code_type_4 <- rep(NA, 5000)
df <- data.frame(cbind(code_1,
code_2,
code_3,
code_4,
code_type_1,
code_type_2,
code_type_3,
code_type_4),
stringsAsFactors = FALSE)
df_new <- data.frame(code_1 = character(),
code_2 = character(),
code_3 = character(),
code_4 = character(),
code_type_1 = character(),
code_type_2 = character(),
code_type_3 = character(),
code_type_4 = character(),
stringsAsFactors = FALSE)
for (i in 1:4) {
i_t <- i + 4
temp <- strsplit(df[, c(i)], "[_]")
for (j in 1:nrow(df)) {
df_new[c(j), c(i)] <- unlist(temp[j])[1]
df_new[c(j), c(i_t)] <- ifelse(is.na(unlist(temp[j])[1]), NA, unlist(temp[j])[2])
}
print(i)
}
for (i in 1:8) {
df_new[, c(i)] <- factor(df_new[, c(i)])
}
Does anyone have any ideas how to speed things up here?
First we pre-allocate the results data.frame to the desired final length. This is very important; see The R Inferno, Circle 2. Then we vectorize the inner loop. We also use fixed = TRUE and avoid the regex in strsplit.
system.time({
  df_new1 <- data.frame(code_1 = character(nrow(df)),
                        code_2 = character(nrow(df)),
                        code_3 = character(nrow(df)),
                        code_4 = character(nrow(df)),
                        code_type_1 = character(nrow(df)),
                        code_type_2 = character(nrow(df)),
                        code_type_3 = character(nrow(df)),
                        code_type_4 = character(nrow(df)),
                        stringsAsFactors = FALSE)
  for (i in 1:4) {
    i_t <- i + 4
    temp <- do.call(rbind, strsplit(df[, i], "_", fixed = TRUE))
    df_new1[, i] <- temp[, 1]
    df_new1[, i_t] <- ifelse(is.na(temp[, 1]), NA, temp[, 2])
  }
  df_new1[] <- lapply(df_new1, factor)
})
# user system elapsed
# 0.029 0.000 0.029
all.equal(df_new, df_new1)
#[1] TRUE
Of course, there are ways to make this even faster, but this is close to your original approach and should be sufficient.
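One such route, sketched here with data.table (tstrsplit() does the split and transpose in a single vectorised call; treat this as an illustration rather than a benchmarked claim):

```r
library(data.table)

dt <- as.data.table(df)
for (i in 1:4) {
  parts <- tstrsplit(dt[[i]], "_", fixed = TRUE)
  set(dt, j = i,     value = parts[[1]])  # the code part before the underscore
  set(dt, j = i + 4, value = parts[[2]])  # the type part after it
}
```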
Here's another way, using gsub inside a custom function in combination with purrr::dmap() - which is equivalent to lapply, but outputs a data.frame instead of a list.
library(purrr)
# Define function which gets rid of everything after and including "_"
replace01 <- function(df, ptrn = "_.*")
dmap(df[,1:4], gsub, pattern = ptrn, replacement = "")
# Because "pattern" is argument we can change it to get 2nd part, then cbind()
test <- cbind(replace01(df),
replace01(df, ptrn = ".*_"))
Note that the output here is character columns; you can always convert them to factor if you like.
Another possibility (this uses the stringi package):
library(stringi)
setNames(do.call(rbind.data.frame, lapply(1:nrow(df), function(i) {
  x <- stri_split_fixed(df[i, 1:4], "_", 2, simplify = TRUE)
  y <- c(x[, 1], x[, 2])
  y[y == ""] <- NA
  y
})), colnames(df)) -> df_new
or
setNames(do.call(rbind.data.frame, lapply(1:nrow(df), function(i) {
  x <- stri_split_fixed(df[i, 1:4], "_", 2, simplify = TRUE)
  c(x[, 1], x[, 2])
})), colnames(df)) -> df_new
df_new[df_new == ""] <- NA
df_new
which is marginally faster:
Unit: milliseconds
expr min lq mean median uq max neval cld
na_after 669.8357 718.1301 724.8803 723.5521 732.9998 790.1405 10 a
na_inner 719.3362 738.1569 766.4267 762.1594 791.6198 825.0269 10 b

R function for pattern matching

I am doing a text mining project that will analyze some speeches from the three remaining presidential candidates. I have completed POS tagging with OpenNLP and created a two column data frame with the results. I have added a variable, called pair. Here is a sample from the Clinton data frame:
V1 V2 pair
1 c( NN FALSE
2 "thank VBP FALSE
3 you PRP FALSE
4 so RB FALSE
5 much RB FALSE
6 . . FALSE
7 it PRP FALSE
8 is VBZ FALSE
9 wonderful JJ FALSE
10 to TO FALSE
11 be VB FALSE
12 here RB FALSE
13 and CC FALSE
14 see VB FALSE
15 so RB FALSE
16 many JJ FALSE
17 friends NNS FALSE
18 . . FALSE
19 ive JJ FALSE
20 spoken VBN FALSE
What I'm now trying to do is write a function that will iterate through the V2 POS column and evaluate it for specific pattern pairs. (These come from Turney's PMI article.) I'm not yet very knowledgeable when it comes to writing functions, so I'm certain I've done it wrong, but here is what I've got so far.
pairs <- function(x){
JJ <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
for(i in 1:(length)(x) {
if(x == J && x+1 == N) { #i.e., if the first word = J and the next = N
pair[i] <- "JJ|NN" #insert this into the 'pair' variable
} else if (x == R && x+1 == J && x+2 != N) {
pair[i] <- "RB|JJ"
} else if (x == J && x+1 == J && x+2 != N) {
pair[i] <- "JJ|JJ"
} else if (x == N && x+1 == J && x+2 != N) {
pair[i] <- "NN|JJ"
} else if (x == R && x+1 == V) {
pair[i] <- "RB|VB"
} else {
pair[i] <- "FALSE"
}
}
}
# Run the function
cl.df.pairs <- pairs(cl.df$V2)
There are a number of (truly embarrassing) issues. First, when I try to run the function code, I get two Error: unexpected '}' in " }" errors at the end. I can't figure out why, because they match opening "{". I'm assuming it's because R is expecting something else to be there.
Also, and more importantly, this function won't exactly get me what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.
Then I need to figure out how to evaluate the semantic orientation of each word combo by comparing the phrases to the pos/neg lexical data sets that I have, but that's a whole other issue. I have the formula from the article, which I'm hoping will point me in the right direction.
I have looked all over and can't find a comparable function in any of the NLP packages, such as OpenNLP, RTextTools, etc. I HAVE looked at other SO questions/answers, like this one and this one, but they haven't worked for me when I've tried to adapt them. I'm fairly certain I'm missing something obvious here, so would appreciate any advice.
EDIT:
Here is the first 20 lines of the Sanders data frame.
head(sa.POS.df, 20)
V1 V2
1 the DT
2 american JJ
3 people NNS
4 are VBP
5 catching VBG
6 on RB
7 . .
8 they PRP
9 understand VBP
10 that IN
11 something NN
12 is VBZ
13 profoundly RB
14 wrong JJ
15 when WRB
16 , ,
17 in IN
18 our PRP$
19 country NN
20 today NN
And I've written the following function:
pairs <- function(x, y) {
require(gsubfn)
J <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
for(i in 1:(length(x))) {
ngram <- c(x[[i]], x[[i+1]])
# the ngram consists of the word on line `i` and the word below line `i`
}
strapply(y[i], "(J)\n(N)", FUN = paste(ngram, sep = " "), simplify = TRUE)
ngrams.df = data.frame(ngrams=ngram)
return(ngrams.df)
}
So what is SUPPOSED to happen is that when strapply matches the pattern (in this case, an adjective followed by a noun), it should paste the ngram, and all of the resulting ngrams should populate ngrams.df.
So I've entered the following function call and get an error:
> sa.JN <- pairs(x=sa.POS.df$V1, y=sa.POS.df$V2)
Error in x[[i + 1]] : subscript out of bounds
I'm only just learning the intricacies of regular expressions, so I'm not quite sure how to get my function to pull the actual adjective and noun. Based on the data shown here, it should pull "american" and "people" and paste them into the data frame.
Okay, here we go. Using this data (shared nicely with dput()):
df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L,
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",",
".", "american", "are", "catching", "country", "in", "is", "on",
"our", "people", "profoundly", "something", "that", "the", "they",
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L,
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L,
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN",
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))
I'll use the stringr package because of its consistent syntax, so I don't have to look up the argument order for grep. We'll first detect the adjectives, then the nouns, and figure out where they line up (offsetting by 1). Then we paste together the words that correspond to the matches.
library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")
pairs = which(c(FALSE, adj) & c(noun, FALSE))
ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"
Now we can put it in a function. I left the patterns as arguments (with adjective, noun as the defaults) for flexibility.
bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
c(str_detect(type, patt2), FALSE))
return(paste(word[pairs - 1], word[pairs]))
}
Demonstrating use on the original data
with(df, bigram(word = V1, type = V2))
# [1] "american people"
Let's cook up some data with more than one match to make sure it works:
df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad", "bank"),
t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
# w t
# 1 american JJ
# 2 people NNS
# 3 hate VBP
# 4 a DT
# 5 big JJ
# 6 bad JJ
# 7 bank NN
with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"
And back to the original to test out a different pattern:
with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are" "something is"
I think the following is the code you wrote, but without throwing errors:
pairs <- function(x) {
  J <- "JJ"     # adjectives
  N <- "N[A-Z]" # any noun form
  R <- "R[A-Z]" # any adverb form
  V <- "V[A-Z]" # any verb form
  pair <- rep("FALSE", nrow(x))
  for (i in 1:(nrow(x) - 2)) {
    this.pos <- x[i, 2]
    next.pos <- x[i + 1, 2]
    next.next.pos <- x[i + 2, 2]
    if (this.pos == J && next.pos == N) { # i.e., if the first word = J and the next = N
      pair[i] <- "JJ|NN" # insert this into the 'pair' variable
    } else if (this.pos == R && next.pos == J && next.next.pos != N) {
      pair[i] <- "RB|JJ"
    } else if (this.pos == J && next.pos == J && next.next.pos != N) {
      pair[i] <- "JJ|JJ"
    } else if (this.pos == N && next.pos == J && next.next.pos != N) {
      pair[i] <- "NN|JJ"
    } else if (this.pos == R && next.pos == V) {
      pair[i] <- "RB|VB"
    } else {
      pair[i] <- "FALSE"
    }
  }
  ## then deal with the last two elements, for which you can't check what's up next
  return(pair)
}
Not sure what you mean by this, though:
Also, and more importantly, this function won't exactly get me what I
want, which is to extract the word pairs that match a pattern and then
the pattern that they match. I honestly have no idea how to do that.
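For that last part, one possible extension of the bigram() function from the earlier answer (a sketch; bigram_tagged is a hypothetical name) returns both the word pair and the tag pattern it matched:

```r
library(stringr)

bigram_tagged <- function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
  word <- as.character(word)
  type <- as.character(type)
  # same offset trick as bigram(): tag patt1 at position i, patt2 at i + 1
  hits <- which(c(FALSE, str_detect(type, patt1)) &
                c(str_detect(type, patt2), FALSE))
  data.frame(ngram   = paste(word[hits - 1], word[hits]),
             pattern = paste(type[hits - 1], type[hits], sep = "|"),
             stringsAsFactors = FALSE)
}
# with(df, bigram_tagged(V1, V2)) gives "american people" with pattern "JJ|NNS"
```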

How to get a match when only one letter difference is allowed?

I want to look at whether words in my dataset appear in a certain text. When using grepl you only get exact matches; with agrepl it is possible to do partial matching. However, I don't get the desired results with it.
Example data:
dt <- structure(list(id = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L),
words = c("weg", "verte", "spiegelend", "spiegeld", "einde", "spiegel", "spiegelende", "weg", "spiegelend", "asfalt", "fata", "morgana")),
.Names = c("id", "words"), row.names = c(NA, -12L), class = c("data.table", "data.frame"))
With:
dt <- dt[, .(id, words,
match1=mapply(grepl, words,
"hoe komt het dat de weg in de verte soms spiegelend lijkt"),
match2=mapply(agrepl, words,
"hoe komt het dat de weg in de verte soms spiegelend lijkt",
MoreArgs=list(max.distance=1L)))]
I get:
> dt
id words match1 match2
1: 0 weg TRUE TRUE
2: 0 verte TRUE TRUE
3: 0 spiegelend TRUE TRUE
4: 0 spiegeld FALSE TRUE
5: 0 einde FALSE FALSE
6: 0 spiegel TRUE TRUE
7: 0 spiegelende FALSE TRUE
8: 1 weg TRUE TRUE
9: 1 spiegelend TRUE TRUE
10: 1 asfalt FALSE FALSE
11: 1 fata FALSE FALSE
12: 1 morgana FALSE FALSE
As you can see, the results from grepl and agrepl differ on rows 4 and 7. However, I only want a match when there is at maximum one letter difference. The match in row 4 for match2 should therefore be FALSE. Changing parameters like max.distance or costs doesn't lead to the desired result either. Moreover, both matches on row 6 should be FALSE as well.
For example: for the word "spiegelend" from the text, the word "spiegelende" should give a match (only one letter difference), but the word "spiegeld" (two letters difference) and the word "spiegel" (three letters difference) should not give a match.
The conditions are allowed (but not at the same time):
one letter more (e.g.: "spiegelende" should give a match), or
one letter less (e.g.: "spiegelen" should give a match), or
one spelling error (e.g.: "spiehelend" should give a match)
Any ideas on how to solve this problem?
Two ways to solve it, matching the approaches by nongkrong and RHertel:
dt <- cbind(dt[, c("id", "words")],
            match1 = mapply(grepl, dt$words,
                            "hoe komt het dat de weg in de verte soms spiegelend lijkt"),
            match2 = mapply(agrepl, dt$words,
                            "hoe komt het dat de weg in de verte soms spiegelend lijkt",
                            MoreArgs = list(max.distance = 1L)),
            match3 = mapply(agrepl, paste0("\\b", dt$words, "\\b"),
                            "hoe komt het dat de weg in de verte soms spiegelend lijkt",
                            MoreArgs = list(max.distance = 1L, fixed = FALSE)),
            match4 = apply(adist(dt$words, unlist(strsplit("hoe komt het dat de weg in de verte soms spiegelend lijkt", split = " "))),
                           1, function(x) any(x <= 1)))
match3 uses the word boundary \\b, while match4 checks for an edit distance (adist) of <= 1 against the individual words of the text.
I thought about using adist() in this case with the condition < 2. But I'm not sure if it yields the expected output. Does this help?
idx <- which(adist(dt$words,dt2$words) < 2, arr.ind = T)
dt$match <- (dt$words %in% dt2$words[idx[,2]])
#> dt
# id words match
#1 0 weg TRUE
#2 0 verte TRUE
#3 0 spiegelend TRUE
#4 0 spiegeld FALSE
#5 0 einde FALSE
#6 0 spiegel FALSE
#7 0 spiegelende FALSE
#8 1 weg TRUE
#9 1 spiegelend TRUE
#10 1 asfalt FALSE
#11 1 fata FALSE
#12 1 morgana FALSE
data
dt <- structure(list(id = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L),
words = c("weg", "verte", "spiegelend", "spiegeld", "einde", "spiegel", "spiegelende", "weg", "spiegelend", "asfalt", "fata", "morgana")),
.Names = c("id", "words"), row.names = c(NA, -12L), class = c("data.table", "data.frame"))
dt2 <- structure(list(id = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L),
words = c("hoe", "komt", "het", "dat", "de", "weg", "in", "de", "verte", "soms", "spiegelend", "lijkt")),
.Names = c("id", "words"), row.names = c(NA, -12L), class = c("data.table", "data.frame"))

perform gsub in a data frame with 2 columns

I have dataset with 2 columns, I would like to clean up my dataset by using gsub such as
Data_edited_txt2 <- gsub("(RT|via)((?:\\b\\W*#\\w+)+)", "", Data_edited_txt2$text)
Data_edited_txt2 <- gsub("#\\w+", " ", Data_edited_txt2$text)
Data_edited_txt2 <- gsub("[[:punct:]]", "", Data_edited_txt2$text)
I would get an error :" $ operator is invalid for atomic vectors" at the second run of gsub and I noticed the 2nd column will disappear after running the first gsub.
Please advise how to perform all the gsub, but keeping the 2nd column?
structure(list(text = structure(c(1L, 3L, 7L, 4L, 2L, 5L, 6L), .Label = c("#airasia im searching job",
"#AirAsia no flight warning for cebu outbound?", "#shazzr1 #AirAsia never mind.. now everyone can fly.",
"#TigerAir confirmed as having far nastier policies and uncaring customer service than #airasia who I will now fly every time in preference.",
"#Wingmates Since your taxes is HIGHER than other airlines but your service is really BAD because always change and cancel the flight.",
"hai MASwings #Wingmates . Bilakah tempoh promosi anda? Saya ingin terbang ke Palawan dengan bajet yang agak rendah :3",
"One thing I \"like\" about #AirAsia is, DELAY."), class = "factor"),
created = structure(c(3L, 2L, 1L, 7L, 6L, 4L, 5L), .Label = c("2/2/2014 11:30",
"2/2/2014 11:32", "2/2/2014 12:18", "24/2/2014 4:03", "29/3/2014 8:21",
"30/1/2014 16:02", "31/1/2014 8:13"), class = "factor")), .Names = c("text",
"created"), class = "data.frame", row.names = c(NA, -7L))
You are overwriting the whole data frame instead of only one column. Try this:
Data_edited_txt2$text <- gsub("(RT|via)((?:\\b\\W*#\\w+)+)", "", Data_edited_txt2$text)
Data_edited_txt2$text <- gsub("#\\w+", " ", Data_edited_txt2$text)
Data_edited_txt2$text <- gsub("[[:punct:]]", "", Data_edited_txt2$text)
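If the list of substitutions grows, a compact alternative (a sketch) is to keep the pattern/replacement pairs in a list and fold them over the column with Reduce(), so adding a new cleanup step only means extending the list:

```r
patterns <- list(
  c("(RT|via)((?:\\b\\W*#\\w+)+)", ""),
  c("#\\w+", " "),
  c("[[:punct:]]", "")
)
# apply each substitution in turn to the text column only
Data_edited_txt2$text <- Reduce(
  function(txt, p) gsub(p[1], p[2], txt),
  patterns,
  init = as.character(Data_edited_txt2$text)
)
```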