Removing Percentages from a Data Frame - regex

I have a dataframe that originated from an Excel file. It has the usual headers above the columns, but some columns contain % signs that I want to remove.
Searching Stack Overflow gives some nice code for removing percentages from matrices, Any way to edit values in a matrix in R?, but it did not work when I tried to apply it to my dataframe:
as.numeric(gsub("%", "", my.dataframe))
Instead it just returns a vector of NAs with a warning message explaining that they were introduced by coercion. When I applied
gsub("%", "", my.dataframe)
I got the values in "c(...)" form, where the ... represent numbers followed by commas, reproduced for every column that I had. No % was in evidence; if I could just put this back together ... I'd be cooking.
Any help gratefully received, thanks.

Based on @Arun's comment, and imagining what your data.frame looks like:
> DF <- data.frame(X = paste0(1:5, '%'),
+                  Y = paste0(2 * (1:5), '%'),
+                  Z = 3 * (1:5), stringsAsFactors = FALSE)
> DF  # this is how I imagine your data.frame looks
   X   Y  Z
1 1%  2%  3
2 2%  4%  6
3 3%  6%  9
4 4%  8% 12
5 5% 10% 15
> # Using @Arun's suggestion
> (DF2 <- data.frame(sapply(DF, function(x) as.numeric(gsub("%", "", x)))))
  X  Y  Z
1 1  2  3
2 2  4  6
3 3  6  9
4 4  8 12
5 5 10 15
I added as.numeric in the sapply call so that the resulting columns are numeric; if I didn't use as.numeric, the result would be factor columns. Check it with sapply(DF2, class)
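As a side note, a minimal sketch of an alternative (assuming the same DF as above): assigning with DF[] <- lapply(...) replaces the columns in place, which keeps the data.frame structure without the intermediate character matrix that sapply produces.
# replace each column in place; DF[] preserves the data.frame class
DF[] <- lapply(DF, function(x) as.numeric(gsub("%", "", x)))
sapply(DF, class)  # all "numeric" now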

Using ifelse to recode across multiple rows within groups

I need to create a new column based on a pre-existing one, and I need that value to be applied across all rows of an episode.
episode_id <- c(2,2,56,56,67,67,67)
issue <- c("loose","faulty","broke","faulty","loose","broke","missing")
df <- data.frame(episode_id,issue)
Using ifelse, I can create a new column called "broke" which accurately indicates whether the issue had "bro" in it for each row.
df$broke <- ifelse(grepl("bro",df$issue),1,0)
However, I want it to indicate a "1" for every row with the same episode_id.
So I want every row of episodes 56 and 67 to show broke = 1, since each of those episodes contains at least one "broke" issue.
I tried group_by, but that was not effective.
group_by() is the right start, and you can continue with mutate() and any() to turn the presence of "broke" anywhere in a group into a 1 for every row of that group:
library(dplyr)
df %>%
  group_by(episode_id) %>%
  mutate(broke = as.numeric(any(grepl("bro", issue)))) %>%
  ungroup()
# A tibble: 7 × 3
  episode_id issue   broke
       <dbl> <chr>   <dbl>
1          2 loose       0
2          2 faulty      0
3         56 broke       1
4         56 faulty      1
5         67 loose       1
6         67 broke       1
7         67 missing     1
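For reference, a base-R sketch of the same grouped recode (assuming the df from the question): ave() applies a function within each episode_id group and spreads the single result back over the group's rows.
# flag rows containing "bro", then mark every row of any group with a flagged row
df$broke <- ave(grepl("bro", df$issue), df$episode_id,
                FUN = function(v) as.numeric(any(v)))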

Plotting categorical variables using a bar diagram/bar chart

I am trying to plot a bar graph for both the Sept and Oct waves. The id values identify individuals who are surveyed across time. On one graph I need to plot sept in-house, oct in-house, sept out-house and oct out-house, showing only the proportion of people who said yes in each of those four categories; not all the response categories have to be taken into account.
I also have to show whiskers for 95% confidence intervals for each of the respective categories.
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id sept_outhouse sept_inhouse oct_outhouse oct_inhouse)
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 3 3 3
5 4 4 3 3
6 4 4 3 3
7 4 4 4 1
8 1 1 1 1
9 1 1 1 1
10 1 1 1 1
end
label values sept_outhouse codes
label values sept_inhouse codes
label values oct_outhouse codes
label values oct_inhouse codes
label def codes 1 "yes", modify
label def codes 2 "no", modify
label def codes 3 "don't know", modify
label def codes 4 "refused", modify
save tokenexample, replace
rename (*house) (house*)
reshape long house, i(id) j(which) string
replace which = subinstr(proper(which), "_", " ", .)
gen yes = house == 1
label def WHICH 1 "Sept Out" 2 "Sept In" 3 "Oct Out" 4 "Oct In"
encode which, gen(WHICH) label(WHICH)
statsby, by(WHICH) clear: ci proportion yes, jeffreys
set scheme s1color
twoway scatter mean WHICH ///
|| rspike ub lb WHICH, xla(1/4, noticks valuelabel) xsc(r(0.9 4.1)) ///
xtitle("") legend(off) subtitle(Proportion Yes with 95% confidence interval)
This has to be solved backwards.
The means and confidence intervals have to be plotted using twoway, as graph bar is a dead end here because it does not allow whiskers.
The confidence limits have to be put in variables before the graphics. Some graph commands, notably graph bar, will calculate means for you, but as said, that is a dead end, so we need to calculate the means too.
To do that you need an indicator variable for yes.
The best way I know to get those results is to reshape to a different structure and then apply ci proportion under statsby.
As a detail, the jeffreys option is given explicitly as a signal that there are different methods for the confidence interval calculation; you should choose one knowingly.

R: Using "Gsub" a lost my data frame format

I have this data frame:
        Campaña Visitas Compras
1 faceBOOKAds-1     524       2
2 FacebookAds-2     487      24
3  fcebookAds-3     258       4
4        Email1       8       7
And I want this:
  Campaña Visitas Compras
1  FBAds1     524       2
2  FBAds2     487      24
3  FBAds3     258       4
4  Email1       8       7
1) I've read that "GSUB" would do the work so i've used this:
DataGoogle2 <- gsub("faceBOOKAds-1", "FBAds", DataGoogle1$Campaña)
But i get this vector object (As you see i've lost my data.frame format):
[1] "FBAds" "FacebookAds-2" "fcebookAds-3" "Email1" ...
2) Then i try to use: as.data.frame:
DataGoogle2 <- as.data.frame(gsub("faceBOOKAds-1", "FBAds", DataGoogle1$Campaña))
But get this (No data frame format):
1 FBAds
2 fFacebookAds-2
3 fcebookAds-3
4 Email1
How can i get what i need? I know that the replacement method is not so good. What i need the most is not to loose the Data Frame format, but any help with the REGEX part is welcome!
You can use transform (and another regex).
DataGoogle2 <- transform(DataGoogle1, Campaña = sub("(?i)fa?cebook(.*)-(.*)",
                                                    "FB\\1\\2", Campaña))
#   Campaña Visitas Compras
# 1  FBAds1     524       2
# 2  FBAds2     487      24
# 3  FBAds3     258       4
# 4  Email1       8       7
The functions sub and gsub return a vector. Hence, the information of all other columns is not present in the output. With transform you can modify columns of an existing data frame and return a new one.
In the regex, (?i) turns on case-insensitive matching. Furthermore, I used sub rather than gsub, since I assume there is never more than one match per string.
When you used the following code:
DataGoogle2 <- gsub("faceBOOKAds-1", "FBAds", DataGoogle1$Campaña)
R reads that you want DataGoogle2 to consist of the transformed DataGoogle1$Campaña column only, and hence you get that output.
Instead, copy the data frame first and then replace just the column:
DataGoogle2 <- DataGoogle1
DataGoogle2$Campaña <- gsub("faceBOOKAds-1", "FBAds", DataGoogle2$Campaña)
This way you are saying that you want the COLUMN, and not the whole DATA FRAME, to be replaced. In any code, the LHS of an expression is as important as the RHS.
Hope this helps.
You could also do the replacement directly on the first column. Because this operates on the first column only, it replaces the desired parts there and keeps the data frame structure intact.
> dat[[1]] <- gsub("f(.*)[-]", "FBAds", dat[[1]], ignore.case = TRUE)
> dat
#   Campaña Visitas Compras
# 1  FBAds1     524       2
# 2  FBAds2     487      24
# 3  FBAds3     258       4
# 4  Email1       8       7
...presuming your original data is called dat.
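For completeness, a dplyr sketch of the same one-column fix (an alternative, assuming the question's DataGoogle1 and reusing the regex from the transform answer above):
library(dplyr)
DataGoogle2 <- DataGoogle1 %>%
  mutate(Campaña = sub("(?i)fa?cebook(.*)-(.*)", "FB\\1\\2", Campaña))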

Detect rows in a data frame that are highly similar but not necessarily exact duplicates

I would like to identify rows in a data frame that are highly similar to each other but not necessarily exact duplicates. I have considered merging all the data from each row into one string cell at the end and then using a partial matching function. It would be nice to be able to set/adjust the level of similarity required to qualify as a match (for example, return all rows that match 75% of the characters in another row).
Here is a simple working example.
df<-data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew"), id = c(12334, 12344, 34345, 98974), score = c(90, 90, 83, 95))
In this scenario, I would want row 2 to show up as a duplicate of row 1, but not row 4 (It is too dissimilar). Thanks for any suggestions.
You can use agrep. But first you need to concatenate all the columns so that the fuzzy search covers whole rows and not just the first column.
xx <- do.call(paste0,df)
df[agrep(xx[1],xx,max=0.6*nchar(xx[1])),]
     name    id score
1  Andrew 12334    90
2  Andrem 12344    90
4 Pamdrew 98974    95
Note that with 0.7 you get all rows.
Once rows are matched you should extract them from the data.frame and repeat the same process for the remaining rows (here, row 3 against the rest of the data)...
You could use agrep (or agrepl) for partial (fuzzy) pattern matching.
> df[agrep("Andrew", df$name), ]
    name    id score
1 Andrew 12334    90
2 Andrem 12344    90
So this shows that rows 1 and 2 are both found when matching "Andrew". Then you could remove the duplicates (keeping only the first "Andrew" match) with
> a <- agrep("Andrew", df$name)
> df[c(a[1], rownames(df)[-a]), ]
     name    id score
1  Andrew 12334    90
3    Adam 34345    83
4 Pamdrew 98974    95
You could use some approximate string distance metric for the names such as:
adist(df$name)
     [,1] [,2] [,3] [,4]
[1,]    0    1    4    3
[2,]    1    0    3    4
[3,]    4    3    0    6
[4,]    3    4    6    0
or use a dissimilarity matrix calculation:
require(cluster)
daisy(df[, c("id", "score")])
Dissimilarities :
      1     2     3
2    10
3 22011 22001
4 86640 86630 64629
Extending the solution provided by agstudy (see above), I produced the following, which builds a data frame with each group of similar rows placed next to each other.
df <- data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew", "Adan"),
                 id = c(12334, 12344, 34345, 98974, 34344),
                 score = c(90, 90, 83, 95, 83))
xx <- do.call(paste0, df)   ## concatenate all columns
df3 <- df[0, ]              ## empty data frame for storing loop results
for (i in 1:nrow(df)) {     ## produce results for each row of the data frame
  ## set the level of similarity required (less than 30% dissimilarity here)
  df2 <- df[agrep(xx[i], xx, max = 0.3 * nchar(xx[i])), ]
  ## rows without matches return only themselves; this drops them
  if (nrow(df2) >= 2) { df3 <- rbind(df3, df2) }
  df3 <- df3[!duplicated(df3), ]  ## keep only unique rows in df3
}
I am sure there are cleaner ways of producing these results, but this gets the job done.
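Along the same lines as the adist() suggestion above, here is a sketch that flags similar pairs directly from the distance matrix (the 25% cutoff is an arbitrary assumption to tune):
xx  <- do.call(paste0, df)   # concatenate all columns, as before
d   <- adist(xx)             # pairwise edit-distance matrix
cut <- 0.25 * nchar(xx)      # per-row allowance: 25% of that row's length
which(d <= cut & upper.tri(d), arr.ind = TRUE)  # index pairs of similar rows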

Working with a list of lists of dataframes with different dimensions

I am working with a list of lists of dataframes, like so:
results <- list()
for (i in 1:4) {
  runData <- data.frame(id = i, t = 1:10, value = runif(10))
  runResult <- data.frame(id = i, avgValue = mean(runData$value))
  results <- c(results, list(list(runResult, runData)))
}
The reason the data looks this way is that it's essentially how my actual data is generated: the simulations run via clusterApply from the new parallel package in R 2.14.0, and each simulation returns a list holding some summary results (runResult) and the raw data (runData).
I would like to combine the first dataframes of the second-level lists together (they share the same structure), and likewise the second dataframes. This question seemed to be the answer; however, there all the dataframes have the same structure, whereas here the two kinds differ.
The best method I've found so far uses unlist to flatten this into a single list of dataframes, where the odd indices and the even indices are the two sets of dataframes that need to be combined:
results <- unlist(results,recursive=FALSE)
allRunResults <- do.call("rbind", results[seq(1,length(results),2)])
allRunData <- do.call("rbind", results[seq(2,length(results),2)])
I'm certain there's a better way to do this; I just don't see it yet. Can anyone supply one?
Shamelessly stealing a construct from Ben Bolker's excellent answer to this question...
Reduce(function(x,y) mapply("rbind", x,y), results)
[[1]]
  id  avgValue
1  1 0.3443166
2  2 0.6056410
3  3 0.6765076
4  4 0.4942554

[[2]]
   id  t      value
1   1  1 0.11891086
2   1  2 0.17757710
3   1  3 0.25789284
4   1  4 0.26766182
5   1  5 0.83790204
6   1  6 0.99916116
7   1  7 0.40794841
8   1  8 0.19490817
9   1  9 0.16238479
10  1 10 0.01881849
11  2  1 0.62178443
12  2  2 0.49214165
........
One option is to extract the relevant data frame from each piece of the list, then rbind them together:
runData <- do.call(rbind, lapply(results, '[[', 2))
runResult <- do.call(rbind, lapply(results, '[[', 1))
This example gives 2 data frames, but you can recombine them into a single list if you want.
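If you are open to a package, here is a purrr sketch of the same extraction (assuming results as built above; map_dfr() row-binds while it maps, and a bare number extracts by position):
library(purrr)
allRunResults <- map_dfr(results, 1)  # first data frame of each sub-list
allRunData    <- map_dfr(results, 2)  # second data frame of each sub-list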