I have a data frame with a column of strings, each consisting of a numeric id followed by "-" and then a month and year. I am trying to parse the string to get the month and year. As a very first step, I used dplyr::mutate() and
regexpr()
regexpr("-",yearid)[1]
to create a new column that shows the position of the "-" character. But it seems like regexpr() performs very differently inside mutate() than when used separately. It does not seem to update depending on the string, but carries over the position from previous rows. In the example below I expect the positions of the "-" character to be 4, 4, and 5 in the respective yearid values, but I get 4, 4, and 4, so the last 4 is not correct. When I run regexpr() separately I don't see this issue.
Am I missing something, and how can I get the position of "-" dynamically for each value of yearid? Maybe there is an easier way to get January and 1997.
yearid <- c("50 - January 1995","51 - January 1996","100 - January 1997")
data.df <- data.frame(yearid)
data.df <- mutate(data.df, trimpos = regexpr("-",str_trim(yearid))[1],
pos = regexpr("-",yearid)[1])
> data.df
              yearid trimpos pos
1  50 - January 1995       4   4
2  51 - January 1996       4   4
3 100 - January 1997       4   4
On the other hand, using regexpr() directly I get the expected output:
> regexpr("-",yearid[1])[1]
[1] 4
> regexpr("-",yearid[2])[1]
[1] 4
> regexpr("-",yearid[3])[1]
[1] 5
Finally, here is my sessionInfo():
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.0.0 dplyr_0.4.1 readr_0.2.2.9000
loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1 knitr_1.10.5 lazyeval_0.1.10.9000 magrittr_1.5 parallel_3.1.1
[7] Rcpp_0.11.6 stringi_0.4-1 tools_3.1.1
Despite appearances, regexpr() comes from base R, not the stringr library, and it is vectorised: it returns a vector of positions, one per input string, with two additional attributes attached, match.length and useBytes. The [1] in your code keeps only the first element of that vector, and mutate() then recycles that single value down the whole column, which is why every row reported 4. As mentioned in the comments, once the [1] is dropped the full vector can be assigned directly to the data frame, with the mutate function or without.
library(dplyr)
library(stringr)
id_month_year <- c(
"50 - January 1995",
"51 - January 1996",
"100 - January 1997"
)
data <- data.frame(id_month_year, another_column = 1)
## create new column using mutate
data <- data %>% mutate(pos1 = regexpr("-", id_month_year))
## create new column without mutate
data$pos2 <- regexpr("-", data$id_month_year)
print(data)
Here are the new columns:
id_month_year another_column pos1 pos2
1 50 - January 1995 1 4 4
2 51 - January 1996 1 4 4
3 100 - January 1997 1 5 5
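To see the recycling at work outside mutate(), here is a minimal sketch run on the original yearid vector:
regexpr("-", yearid)     ## 4 4 5 -- one position per string (plus attributes)
regexpr("-", yearid)[1]  ## 4 -- a single value that gets recycled down the column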
I would suggest using the separate function from the tidyr library. Here's an example code snippet:
library(dplyr)
library(tidyr)
id_month_year <- c(
"50 - January 1995",
"51 - January 1996",
"100 - January 1997"
)
data <- tbl_df(data.frame(id_month_year, another_column = 1))
clean <- data %>%
separate(
id_month_year,
into = c("id", "month", "year"),
sep = "[- ]+",
convert = TRUE
)
print(clean)
And here's the resulting clean data frame:
Source: local data frame [3 x 4]
id month year another_column
(int) (chr) (int) (dbl)
1 50 January 1995 1
2 51 January 1996 1
3 100 January 1997 1
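If you then need an actual date, one possible follow-up (a sketch; the %B month name is locale-dependent) is to paste the pieces together and parse them:
clean <- clean %>%
  mutate(date = as.Date(paste(1, month, year), format = "%d %B %Y"))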
I need to create a new column based on a pre-existing one, and I need that value to be applied across all rows of the same episode.
episode_id <- c(2,2,56,56,67,67,67)
issue <- c("loose","faulty","broke","faulty","loose","broke","missing")
df <- data.frame(episode_id,issue)
Using ifelse, I can create a new column called "broke" which accurately indicates whether the issue had "bro" in it for each row.
df$broke <- ifelse(grepl("bro",df$issue),1,0)
However, I want it to indicate a 1 for every row that shares an episode_id with a matching row, so episodes 56 and 67 would be flagged on all of their rows.
I tried group_by, but that was not effective.
group_by() is the right start; continue with mutate() and any() so that a single "broke" match anywhere in a group flags every row of that group:
library(dplyr)
df %>%
group_by(episode_id) %>%
mutate(broke = as.numeric(any(grepl("bro", issue)))) %>%
ungroup()
# A tibble: 7 × 3
episode_id issue broke
<dbl> <chr> <dbl>
1 2 loose 0
2 2 faulty 0
3 56 broke 1
4 56 faulty 1
5 67 loose 1
6 67 broke 1
7 67 missing 1
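For comparison, a base R sketch of the same idea uses ave() to broadcast the per-group maximum of the indicator:
df$broke <- ave(as.numeric(grepl("bro", df$issue)), df$episode_id, FUN = max)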
I am trying to plot a bar graph for both the sept and oct waves. As you can see in the image, the id values are the individuals who are surveyed across time. So on one graph I need to plot sept in-house, oct in-house, sept out-house, and oct out-house, showing only the proportion of people who said yes in each of those four categories; the other response categories do not have to be taken into account.
Also, I have to show whiskers for 95% confidence intervals for each of the respective categories.
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id sept_outhouse sept_inhouse oct_outhouse oct_inhouse)
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 3 3 3
5 4 4 3 3
6 4 4 3 3
7 4 4 4 1
8 1 1 1 1
9 1 1 1 1
10 1 1 1 1
end
label values sept_outhouse codes
label values sept_inhouse codes
label values oct_outhouse codes
label values oct_inhouse codes
label def codes 1 "yes", modify
label def codes 2 "no", modify
label def codes 3 "don't know", modify
label def codes 4 "refused", modify
save tokenexample, replace
rename (*house) (house*)
reshape long house, i(id) j(which) string
replace which = subinstr(proper(which), "_", " ", .)
gen yes = house == 1
label def WHICH 1 "Sept Out" 2 "Sept In" 3 "Oct Out" 4 "Oct In"
encode which, gen(WHICH) label(WHICH)
statsby, by(WHICH) clear: ci proportion yes, jeffreys
set scheme s1color
twoway scatter mean WHICH ///
|| rspike ub lb WHICH, xla(1/4, noticks valuelabel) xsc(r(0.9 4.1)) ///
xtitle("") legend(off) subtitle(Proportion Yes with 95% confidence interval)
This has to be solved backwards.
The means and confidence intervals have to be plotted with twoway, as graph bar is a dead end here: it does not allow whiskers.
The confidence limits therefore have to be put into variables before the graphics. Some graph commands, notably graph bar, will calculate means for you, but as said that is a dead end, so we need to calculate the means too.
To do that you need an indicator variable for yes.
The best way I know to get those results is to reshape to a long structure and then apply ci proportion under statsby.
As a detail, the jeffreys option is spelled out as a signal that there are several methods for calculating the confidence interval; you should choose one knowingly.
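For readers who prefer R, here is a rough sketch of the same computation (assuming a data frame df holding the id and the four response columns; the Jeffreys interval is computed from Beta quantiles):
library(dplyr)
library(tidyr)
long <- df %>%
  pivot_longer(-id, names_to = "which", values_to = "code") %>%
  mutate(yes = as.numeric(code == 1))   ## indicator for "yes"
long %>%
  group_by(which) %>%
  summarise(n = n(), x = sum(yes),
            mean = x / n,
            lb = qbeta(0.025, x + 0.5, n - x + 0.5),   ## Jeffreys lower limit
            ub = qbeta(0.975, x + 0.5, n - x + 0.5))   ## Jeffreys upper limit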
I have a data frame with two columns; the third column below shows the desired output format:
DF:
reg           value  o/p
2 for $20     11     20/2
4 for $24     12     24/4
2 for $30     13     30/2
Get $10 Cash  14     14
3 for $30     21     30/3
First, I have to match [\d]+ for [$][\d]+ in the reg column, and then update the value column with the second integer of reg divided by the first integer of reg; if there is no match, the value stays the same.
My code is:
df["value"]=df["reg"].map(lambda x: (int(re.findall("[\d]+",x)[1]))/int(re.findall("[\d]+",x)[0]) if(re.search(r"[\d]+ for [$][\d]+" , x)) else x)
The output is correct for matching rows only; for non-matching rows it returns the reg string instead of the original value.
Try:
df["value"]=df.apply(lambda x: (int(re.findall("[\d]+",x["reg"])[1]))/int(re.findall("[\d]+",x["reg"])[0]) if(re.search(r"[\d]+ for [$][\d]+" , x["reg"])) else x["value"], axis=1)
output:
reg value
0 2 for $20 10.0
1 4 for $24 6.0
2 2 for $30 15.0
3 Get $10 Cash 14.0
4 3 for $30 10.0
In your version you map over the reg column only, so in the else branch x is the reg string, not the original value. Using apply with axis=1 passes the whole row, letting the lambda fall back to x["value"].
I would like to identify rows in a data frame that are highly similar to each other but not necessarily exact duplicates. I have considered merging all the data from each row into one string cell at the end and then using a partial matching function. It would be nice to be able to set/adjust the level of similarity required to qualify as a match (for example, return all rows that match 75% of the characters in another row).
Here is a simple working example.
df<-data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew"), id = c(12334, 12344, 34345, 98974), score = c(90, 90, 83, 95))
In this scenario, I would want row 2 to show up as a duplicate of row 1, but not row 4 (It is too dissimilar). Thanks for any suggestions.
You can use agrep, but first you need to concatenate all columns so the fuzzy search runs over all columns and not just the first one.
xx <- do.call(paste0,df)
df[agrep(xx[1],xx,max=0.6*nchar(xx[1])),]
name id score
1 Andrew 12334 90
2 Andrem 12344 90
4 Pamdrew 98974 95
Note that with 0.7 you get all rows.
Once rows are matched you should extract them from the data.frame and repeat the same process for the remaining rows (row 3 here against the rest of the data)...
You could use agrep (or agrepl) for partial (fuzzy) pattern matching.
> df[agrep("Andrew", df$name), ]
name id score
1 Andrew 12334 90
2 Andrem 12344 90
So this shows that rows 1 and 2 are both found when matching "Andrew". Then you could remove the duplicates (keeping only the first "Andrew" match) with
> a <- agrep("Andrew", df$name)
> df[c(a[1], rownames(df)[-a]), ]
name id score
1 Andrew 12334 90
3 Adam 34345 83
4 Pamdrew 98974 95
You could use an approximate string-distance metric for the names, such as:
adist(df$name)
[,1] [,2] [,3] [,4]
[1,] 0 1 4 3
[2,] 1 0 3 4
[3,] 4 3 0 6
[4,] 3 4 6 0
or use a dissimilarity matrix calculation:
require(cluster)
daisy(df[, c("id", "score")])
Dissimilarities :
1 2 3
2 10
3 22011 22001
4 86640 86630 64629
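To turn that adist() matrix into candidate duplicate pairs, a minimal sketch with a threshold of one edit is:
d <- adist(df$name)
which(d <= 1 & upper.tri(d), arr.ind = TRUE)  ## row 1 ("Andrew") pairs with row 2 ("Andrem")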
Extending the solution provided by agstudy (see above), I produced the following approach, which builds a data frame with similar rows placed next to each other.
df<-data.frame(name = c("Andrew", "Andrem", "Adam", "Pamdrew", "Adan"), id = c(12334, 12344, 34345, 98974, 34344), score = c(90, 90, 83, 95, 83))
xx <- do.call(paste0, df)   ## concatenate all columns
df3 <- df[0, ]              ## empty data frame for storing loop results
for (i in 1:nrow(df)) {     ## produce results for each row of the data frame
  ## set the level of similarity required (less than 30% dissimilarity in this case)
  df2 <- df[agrep(xx[i], xx, max = 0.3 * nchar(xx[i])), ]
  ## rows without matches return only themselves; this eliminates them
  if (nrow(df2) >= 2) { df3 <- rbind(df3, df2) }
  df3 <- df3[!duplicated(df3), ]  ## drop duplicates accumulated across iterations
}
I am sure there are cleaner ways of producing these results, but this gets the job done.
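A cleaner variant (a sketch) computes all pairwise distances on the concatenated rows in one call, avoiding the loop:
xx <- do.call(paste0, df)
d <- adist(xx) / nchar(xx)                        ## distance relative to each row's length
similar <- which(d <= 0.3 & upper.tri(d), arr.ind = TRUE)
df[unique(c(similar)), ]                          ## rows involved in at least one close pair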
I have a dataframe that originated from an excel file. It has the usual headers above the columns but some of the columns have % signs in them which I want to remove.
Searching Stack Overflow gives some nice code for removing percent signs from matrices (Any way to edit values in a matrix in R?), but it did not work when I applied it to my data frame:
as.numeric(gsub("%", "", my.dataframe))
instead it just returns a string of NAs, with a warning message explaining that they were introduced by coercion. When I applied
gsub("%", "", my.dataframe)
I got the values in "c(...)" form, where the ... represent numbers separated by commas, one such string per column: gsub() coerces the data frame to a character vector with one element per column, which is why no % was in evidence. If I could just put this back together I'd be cooking.
Any help gratefully received, thanks.
Based on #Arun's comment, and imagining what your data.frame looks like:
> DF <- data.frame(X = paste0(1:5,'%'),
Y = paste0(2*(1:5),'%'),
Z = 3*(1:5), stringsAsFactors=FALSE )
> DF # this is how I imagine your data.frame looks like
X Y Z
1 1% 2% 3
2 2% 4% 6
3 3% 6% 9
4 4% 8% 12
5 5% 10% 15
> # Using #Arun's suggestion
> (DF2 <- data.frame(sapply(DF, function(x) as.numeric(gsub("%", "", x)))))
X Y Z
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
I added as.numeric to the sapply call so the resulting columns are numeric; without it the result would be factor. Check it with sapply(DF2, class).
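An equivalent that skips sapply's intermediate matrix (a sketch) is to assign with lapply, which keeps the data.frame structure directly:
DF2 <- DF
DF2[] <- lapply(DF, function(x) as.numeric(gsub("%", "", x)))
sapply(DF2, class)  ## "numeric" "numeric" "numeric"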