R: Using gsub I lost my data frame format - regex

I have this Data Frame:
Campaña Visitas Compras
1 faceBOOKAds-1 524 2
2 FacebookAds-2 487 24
3 fcebookAds-3 258 4
4 Email1 8 7
And I want this:
Campaña Visitas Compras
1 FBAds1 524 2
2 FBAds2 487 24
3 FBAds3 258 4
4 Email1 8 7
1) I've read that gsub would do the job, so I used this:
DataGoogle2 <- gsub("faceBOOKAds-1", "FBAds", DataGoogle1$Campaña)
But I get this vector (as you can see, I've lost my data.frame format):
[1] "FBAds" "FacebookAds-2" "fcebookAds-3" "Email1" ...
2) Then I tried as.data.frame:
DataGoogle2 <- as.data.frame(gsub("faceBOOKAds-1", "FBAds", DataGoogle1$Campaña))
But I get this (the other columns are gone):
1 FBAds
2 FacebookAds-2
3 fcebookAds-3
4 Email1
How can I get what I need? I know that the replacement method is not great. What I need most is not to lose the data frame format, but any help with the regex part is welcome!

You can use transform (and another regex).
DataGoogle2 <- transform(DataGoogle1,
                         Campaña = sub("(?i)fa?cebook(.*)-(.*)", "FB\\1\\2", Campaña))
# Campaña Visitas Compras
# 1 FBAds1 524 2
# 2 FBAds2 487 24
# 3 FBAds3 258 4
# 4 Email1 8 7
The functions sub and gsub return a character vector, so the information in all the other columns is not present in the output. With transform you can modify columns of an existing data frame and get a new data frame back.
In the regex, (?i) turns on case-insensitive matching. Furthermore, I used sub rather than gsub since I assume there is never more than one match per string.

When you used the following code:
DataGoogle2 <- gsub("faceBOOKAds-1", "FBAds", DataGoogle1$Campaña)
the right-hand side is just a character vector, so DataGoogle2 becomes that vector rather than a data frame, and hence you get that output.
Instead, copy the data frame and then assign to the column:
DataGoogle2 <- DataGoogle1
DataGoogle2$Campaña <- gsub("faceBOOKAds-1", "FBAds", DataGoogle2$Campaña)
This way you are saying that you want the COLUMN, not the whole DATA FRAME, to be replaced. In any code, the LHS of an assignment is as important as the RHS.
Hope this helps.
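Put together, a minimal runnable sketch of the column-assignment approach (the DataGoogle1 frame below is a reconstruction of the question's data, and the regex is borrowed from the transform answer so that all three Facebook rows are handled, not just the first):

```r
# Hypothetical reconstruction of the question's data frame
DataGoogle1 <- data.frame(
  Campaña = c("faceBOOKAds-1", "FacebookAds-2", "fcebookAds-3", "Email1"),
  Visitas = c(524, 487, 258, 8),
  Compras = c(2, 24, 4, 7),
  stringsAsFactors = FALSE
)

DataGoogle2 <- DataGoogle1  # copy the frame so the other columns survive
# assign the modified vector back into the column, not over the whole frame
DataGoogle2$Campaña <- sub("fa?cebook(.*)-(.*)", "FB\\1\\2",
                           DataGoogle2$Campaña, ignore.case = TRUE)
```

After this, DataGoogle2 is still a three-column data frame with Campaña rewritten to FBAds1, FBAds2, FBAds3, Email1.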

You could also do the replacement directly on the first column; operating on that column alone keeps the desired data frame structure intact.
> dat[[1]] <- gsub("f(.*)[-]", "FBAds", dat[[1]], ignore.case = TRUE)
> dat
# Campaña Visitas Compras
# 1 FBAds1 524 2
# 2 FBAds2 487 24
# 3 FBAds3 258 4
# 4 Email1 8 7
...presuming your original data is called dat.

Related

Using ifelse to recode across multiple rows within groups

I need to create a new column based on a pre-existing one and I need that value to be created across all rows of the episode.
episode_id <- c(2,2,56,56,67,67,67)
issue <- c("loose","faulty","broke","faulty","loose","broke","missing")
df <- data.frame(episode_id,issue)
Using ifelse, I can create a new column called "broke" which accurately indicates whether the issue had "bro" in it for each row.
df$broke <- ifelse(grepl("bro",df$issue),1,0)
However, I want it to indicate a 1 for every row with the same episode_id; for example, every row of episodes 56 and 67 should end up with broke = 1.
I tried group_by, but that was not effective.
group_by is the right start; continue with mutate() and any() so that the presence of "broke" anywhere in a group marks every row of that group:
library(dplyr)
df %>%
  group_by(episode_id) %>%
  mutate(broke = as.numeric(any(grepl("bro", issue)))) %>%
  ungroup()
# A tibble: 7 × 3
episode_id issue broke
<dbl> <chr> <dbl>
1 2 loose 0
2 2 faulty 0
3 56 broke 1
4 56 faulty 1
5 67 loose 1
6 67 broke 1
7 67 missing 1
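For comparison, a base-R sketch of the same idea with no packages: ave() applies any() within each episode_id and recycles the group's single result back over all of that group's rows.

```r
episode_id <- c(2, 2, 56, 56, 67, 67, 67)
issue <- c("loose", "faulty", "broke", "faulty", "loose", "broke", "missing")
df <- data.frame(episode_id, issue)

# any() is evaluated once per group; ave() recycles it over the group's rows
df$broke <- as.numeric(ave(grepl("bro", df$issue), df$episode_id, FUN = any))
```

This yields the same 0/1 column as the dplyr version: 0 for episode 2, 1 for every row of episodes 56 and 67.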

how to create combinatorial combination of two files

I did some research but I am having difficulty finding an answer.
I am using Python 2.7 and pandas so far, but I am still learning.
I have two CSVs; say one holds the alphabet A-Z and the second holds the digits 0-100.
I want to merge the two files to get every combination, A0 to A100 up through Z.
For information, the two files contain DNA sequences, so I believe they are strings.
I tried to create arrays with numpy and build a matrix, but to no avail.
here is a preview of the files:
barcode
0 GGAAGAA
1 CCAAGAA
2 GAGAGAA
3 AGGAGAA
4 TCGAGAA
5 CTGAGAA
6 CACAGAA
7 TGCAGAA
8 ACCAGAA
9 GTCAGAA
10 CGTAGAA
11 GCTAGAA
12 GAAGGAA
13 AGAGGAA
14 TCAGGAA
659
barcode
0 CGGAAGAA
1 GCGAAGAA
2 GGCAAGAA
3 GGAGAGAA
4 CCAGAGAA
5 GAGGAGAA
6 ACGGAGAA
7 CTGGAGAA
8 CACGAGAA
9 AGCGAGAA
10 TCCGAGAA
11 GTCGAGAA
12 CGTGAGAA
13 GCTGAGAA
14 CGACAGAA
1995
I am putting here the way I found to do it; there might be a sexier way:
index = pd.MultiIndex.from_product([df8.barcode, df7.barcode], names=["df8", "df7"])
df = pd.DataFrame(index=index).reset_index()

def concat_BC(x):  # concatenate the two sequences into one new column
    return str(x["df8"]) + str(x["df7"])

df["BC"] = df.apply(concat_BC, axis=1)
– Stephane Chiron
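The apply() call above works, but plain + on two string columns is vectorized and avoids the per-row Python function. A self-contained sketch (the small df7/df8 frames are stand-ins for the two barcode files):

```python
import pandas as pd

# Hypothetical stand-ins for the two barcode CSVs
df7 = pd.DataFrame({"barcode": ["GGAAGAA", "CCAAGAA"]})
df8 = pd.DataFrame({"barcode": ["CGGAAGAA", "GCGAAGAA"]})

# Cross product of the two barcode columns, as in the original solution
index = pd.MultiIndex.from_product([df8.barcode, df7.barcode],
                                   names=["df8", "df7"])
df = pd.DataFrame(index=index).reset_index()

# String concatenation is vectorized on object columns; no apply() needed
df["BC"] = df["df8"] + df["df7"]
```

With 659 x 1995 barcodes this produces one row per pair, so expect the result to have len(df8) * len(df7) rows.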

Detect rows in a data frame that are highly similar but not necessarily exact duplicates

I would like to identify rows in a data frame that are highly similar to each other but not necessarily exact duplicates. I have considered merging all the data from each row into one string cell at the end and then using a partial matching function. It would be nice to be able to set/adjust the level of similarity required to qualify as a match (for example, return all rows that match 75% of the characters in another row).
Here is a simple working example.
df<-data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew"), id = c(12334, 12344, 34345, 98974), score = c(90, 90, 83, 95))
In this scenario, I would want row 2 to show up as a duplicate of row 1, but not row 4 (It is too dissimilar). Thanks for any suggestions.
You can use agrep, but first you need to concatenate all the columns so the fuzzy search runs over every column and not just the first one.
xx <- do.call(paste0, df)
df[agrep(xx[1], xx, max.distance = 0.6 * nchar(xx[1])), ]
name id score
1 Andrew 12334 90
2 Andrem 12344 90
4 Pamdrew 98974 95
Note that with 0.7 you get all rows.
Once rows are matched you should extract them from the data frame and repeat the same process for the remaining rows (row 3 here, with the rest of the data)...
You could use agrep (or agrepl) for partial (fuzzy) pattern matching.
> df[agrep("Andrew", df$name), ]
name id score
1 Andrew 12334 90
2 Andrem 12344 90
So this shows that rows 1 and 2 are both found when matching "Andrew". Then you could remove the duplicates (keeping only the first "Andrew" match) with
> a <- agrep("Andrew", df$name)
> df[c(a[1], rownames(df)[-a]), ]
name id score
1 Andrew 12334 90
3 Adam 34345 83
4 Pamdrew 98974 95
You could use some approximate string distance metric for the names such as:
adist(df$name)
[,1] [,2] [,3] [,4]
[1,] 0 1 4 3
[2,] 1 0 3 4
[3,] 4 3 0 6
[4,] 3 4 6 0
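To turn that distance matrix into concrete candidate pairs, one sketch is to threshold the upper triangle (the cutoff of 2 edits below is an arbitrary choice; adjust it to taste):

```r
df <- data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew"),
                 id = c(12334, 12344, 34345, 98974),
                 score = c(90, 90, 83, 95))

d <- adist(df$name)                                   # Levenshtein distances
pairs <- which(d <= 2 & upper.tri(d), arr.ind = TRUE) # row pairs within 2 edits
```

Here pairs holds a single row/col pair, (1, 2), matching the intuition that "Andrem" is a near-duplicate of "Andrew" while "Pamdrew" is not.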
or use a dissimilarity matrix calculation:
require(cluster)
daisy(df[, c("id", "score")])
Dissimilarities :
1 2 3
2 10
3 22011 22001
4 86640 86630 64629
Extending the solution provided by agstudy (see above), I produced the following, which builds a data frame in which the similar rows sit next to each other.
df<-data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew", "Adan"), id = c(12334, 12344, 34345, 98974, 34344), score = c(90, 90, 83, 95, 83))
xx <- do.call(paste0, df)  ## concatenate all columns
df3 <- df[0, ]             ## empty data frame for storing loop results
for (i in 1:nrow(df)) {    ## produce results for each row of the data frame
  ## set the level of similarity required (less than 30% dissimilarity here)
  df2 <- df[agrep(xx[i], xx, max = 0.3 * nchar(xx[i])), ]
  ## rows without matches return only themselves; this eliminates them
  if (nrow(df2) >= 2) df3 <- rbind(df3, df2)
  df3 <- df3[!duplicated(df3), ]  ## drop duplicates from the accumulated matches
}
I am sure there are cleaner ways of producing these results, but this gets the job done.

Removing Percentages from a Data Frame

I have a dataframe that originated from an excel file. It has the usual headers above the columns but some of the columns have % signs in them which I want to remove.
Searching Stack Overflow gives some nice code for removing percentages from matrices (Any way to edit values in a matrix in R?), but it did not work when I tried to apply it to my data frame:
as.numeric(gsub("%", "", my.dataframe))
instead it just returns a string of NAs, with a warning message explaining that they were introduced by coercion. When I applied
gsub("%", "", my.dataframe)
I got the values in "c(...)" form, where ... represents numbers separated by commas, reproduced for every column that I had. No % was in evidence; if I could just put this back together I'd be cooking.
Any help gratefully received, thanks.
Based on @Arun's comment, and imagining what your data.frame looks like:
> DF <- data.frame(X = paste0(1:5,'%'),
Y = paste0(2*(1:5),'%'),
Z = 3*(1:5), stringsAsFactors=FALSE )
> DF # this is how I imagine your data.frame looks
X Y Z
1 1% 2% 3
2 2% 4% 6
3 3% 6% 9
4 4% 8% 12
5 5% 10% 15
> # Using #Arun's suggestion
> (DF2 <- data.frame(sapply(DF, function(x) as.numeric(gsub("%", "", x)))))
X Y Z
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
I added as.numeric in the sapply call so that the resulting columns are numeric; if I don't use as.numeric the result will be factor. Check it using sapply(DF2, class).
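An equivalent sketch that stays a data.frame throughout: empty-bracket assignment with lapply replaces the columns in place, so there is no sapply-to-matrix round trip at all.

```r
DF <- data.frame(X = paste0(1:5, '%'),
                 Y = paste0(2 * (1:5), '%'),
                 Z = 3 * (1:5), stringsAsFactors = FALSE)

# DF[] <- keeps the data.frame shape while replacing every column
DF[] <- lapply(DF, function(x) as.numeric(gsub("%", "", x)))
```

After this, sapply(DF, class) reports numeric for all three columns.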

Working with a list of lists of dataframes with different dimensions

I am working with a list of lists of dataframes, like so:
results <- list()
for (i in 1:4) {
  runData <- data.frame(id = i, t = 1:10, value = runif(10))
  runResult <- data.frame(id = i, avgValue = mean(runData$value))
  results <- c(results, list(list(runResult, runData)))
}
The reason the data looks this way is that it's essentially how my actual data is generated when running simulations via clusterApply from the new parallel package in R 2.14.0, where each simulation returns a list of some summary results (runResult) and the raw data (runData).
I would like to combine the first data frames of the second-level lists together (they share the same structure), and likewise the second data frames. This question seemed to be the answer; however, there all the data frames have the same structure, which is not the case here.
The best method I've found so far is using unlist to make it a list of dataframes, where odd indices and even indices represent dataframes that need to be combined:
results <- unlist(results,recursive=FALSE)
allRunResults <- do.call("rbind", results[seq(1,length(results),2)])
allRunData <- do.call("rbind", results[seq(2,length(results),2)])
I'm certain there's a better way to do this, I just don't see it yet. Can anyone supply one?
Shamelessly stealing a construct from Ben Bolker's excellent answer to this question...
Reduce(function(x,y) mapply("rbind", x,y), results)
[[1]]
id avgValue
1 1 0.3443166
2 2 0.6056410
3 3 0.6765076
4 4 0.4942554
[[2]]
id t value
1 1 1 0.11891086
2 1 2 0.17757710
3 1 3 0.25789284
4 1 4 0.26766182
5 1 5 0.83790204
6 1 6 0.99916116
7 1 7 0.40794841
8 1 8 0.19490817
9 1 9 0.16238479
10 1 10 0.01881849
11 2 1 0.62178443
12 2 2 0.49214165
........
........
........
One option is to extract the given data frame from each piece of the list, then rbind them together:
runData <- do.call(rbind, lapply(results, '[[', 2))
runResult <- do.call(rbind, lapply(results, '[[', 1))
This example gives 2 data frames, but you can recombine them into a single list if you want.