How to change the name of a variable to be called using string manipulation in R?

I am trying to change the name of a variable to be called in R on the fly.
For example, the data frame trades_long_final has many columns: "prob_choice1", "prob_choice2", ..., "prob_choiceN" and "col1", "col2", ..., "colN".
I want to change the value of each on the fly.
For example,
trades_long_final$"prob_choice1"[1] = 10 and
trades_long_final$"prob_choice2"[1] = 10
works
but not
trades_long_final$gsub("1","2","prob_choice1")[1] = 10
as a way to reach trades_long_final$"prob_choice2"[1] by replacing the 1 in prob_choice1 with a 2. Instead I get the error
Error: attempt to apply non-function
I need this to work because I want to loop over the columns, using something like trades_long_final$gsub("i","2","prob_choicei")[1] for all i.
Thank you so much for your help. There must be a command I don't know how to use...

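The $ operator takes a literal column name and does not evaluate an expression on its right-hand side. Instead of using $, you can use [, which accepts any expression that returns a column name, to change the variable name and assign the value in one line.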
trades_long_final[,gsub("1","2","prob_choice1")][1] <- 10
But, it is not clear why you need to do this. Simply
trades_long_final[1, "prob_choice2"] <- 10
would be easier. From the description, "prob_choice2" is already a column in the dataset, so the need for the substitution is unclear.
data
set.seed(24)
trades_long_final <- data.frame(prob_choice1 = runif(10),
    prob_choice2 = rnorm(10), col1 = rnorm(10, 10), col2 = rnorm(10, 30))

As akrun said, the solution was to index with brackets instead of $. This is how I did it:
# seq_len() avoids a bogus 1:0 loop when numtrades is 0
for (k in seq_len(numtrades)) {
  trades_long_final[[paste("prob_choice", k, sep="")]] =
    {some complex procedure...}
}

Related

How to remove a part of values in R

I have got a data frame like this:
ID A B
1 x5.11 2,34
2 x5.57 5,36
3 x6,13 0,45
I would like to remove the 'x' from all values of column A. How might I best accomplish this in R?
Thanks!
I have found a very easy way:
# df stands for the data frame shown above
df$A <- gsub("x", "", df$A)

How to subtract unknown strings from list in Python

I am trying to write a program to which you say "from now on call me Jason"; it then converts the sentence into a list and subtracts everything but Jason from the list. I managed to make this work, but I want it to also handle words that aren't in the sentence but would need to be subtracted if they were there.
You haven't posted any code, so here is how I would do it.
names = set(['John','Jason','Jim'])
callme = 'Jason'
names.intersection(set([callme]))
Alternatively, with a list comprehension
names = ['John','Jason','Jim']
callme = ['Jason']
[N for N in names if N in callme]
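The question can also be read the other way around: instead of keeping known names, drop known filler words whether or not they appear in the sentence. Here is a minimal sketch of that reading. The FILLER set and the extract_name helper are hypothetical names introduced only for illustration, and the word list would need extending for real input.

```python
# Hypothetical set of filler words to strip; extend it as needed.
FILLER = {"from", "now", "on", "call", "me", "please", "just"}

def extract_name(sentence):
    """Return the words of `sentence` that are not filler words."""
    # Treat commas as separators, then split on whitespace.
    words = sentence.replace(",", " ").split()
    # Words absent from the sentence are simply never matched,
    # so listing extra filler words is harmless.
    return [w for w in words if w.lower() not in FILLER]

print(extract_name("from now on call me Jason"))  # ['Jason']
```

Because membership is tested against a set, filler words that never occur in the input cause no error, which is the behaviour the question asks for.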

Giving a string variable values conditional on another variable

I am using Stata 14. I have US states and their corresponding regions coded as integers.
I want to create a string variable that represents the region for each observation.
Currently my code is
gen div_name = "A"
replace div_name = "New England" if div_no == 1
replace div_name = "Middle Atlantic" if div_no == 2
.
.
replace div_name = "Pacific" if div_no == 9
...so the code gets really long.
I was wondering if there is a shorter way to do this where I can automate assigning values rather than manually hard coding them.
You can define value labels in one line with label define and then use decode to create the string variable. See the help for those commands.
If the correspondence was defined in a separate dataset you could use merge. See e.g. this FAQ
There can't be a short-cut here other than typing all the names at some point or exploiting the fact that someone else typed them earlier into a file.
With nine or so labels, typing them yourself is quickest.
Note that you type one statement more than you need, even doing it the long way, as you could start
gen div_name = "New England" if div_no == 1

Importing unfriendly formatted data in Excel and forcing messy values as column names

I'm trying to import some publicly available life outcomes data using the code below:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE)
Naturally, the imported data frame doesn't look good:
I would like to amend my column names using the code below:
# Clean column names
names(simd.sg.xls) <- make.names(names = as.character(simd.sg.xls[1,]),
unique = TRUE,allow_ = TRUE)
But it produces rather unpleasant results:
> names(simd.sg.xls)
[1] "X1" "X1.1" "X771" "X354" "X229" "X74" "X67" "X33" "X19" "X1.2"
[11] "X6" "X1.3" "X8" "X7" "X7.1" "X6506" "X21" "X1.4" "X6158" "X6506.1"
[21] "X6506.2" "X6506.3" "X6263" "X6506.4" "X6468" "X1010" "X815" "X99" "X58" "X65"
[31] "X60" "X6506.5" "X21.1" "X1.5" "X6173" "X5842" "X6506.6" "X6506.7" "X6263.1" "X6506.8"
[41] "X6481" "X883" "X728" "X112" "X69" "X56" "X54" "X6506.9" "X21.2" "X1.6"
[51] "X6143" "X5651" "X6506.10" "X6506.11" "X6263.2" "X6506.12" "X6480" "X777" "X647" "X434"
[61] "X518" "X246" "X436" "X6506.13" "X21.3" "X1.7" "X6136" "X5677" "X6506.14" "X6506.15"
[71] "X6263.3" "X6506.16" "X660" "X567" "X480" "X557" "X261" "X456"
My question is whether there is a way to neatly force the values from the first row into the column names. As I'm processing a lot of data, I'm looking for a solution that is easily reproducible. I can accept a lot of mangling of the actual strings to get syntactically correct names, but ideally I would avoid faffing around with elaborate regular expressions, as I'm often reading files like the one linked here and don't want to be forced to adjust the rules for every single import.
It looks like the problem is that the header is on the second line, not the first. You could include a skip=1 argument, but a more general way of dealing with this using read.xls seems to be to use the pattern and header arguments, which force the first line that matches the pattern string to be treated as the header. Your code becomes:
require(gdata)
# Source SIMD12 data zone level data
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE,
pattern="DATAZONE", header=TRUE)
UPDATE
I don't get the warning messages you do when I execute the code. The messages refer to an issue with locale. The locale settings on my system are:
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Yours are probably different. Locale data could be OS dependent. I'm using Windows 8.1. Also, I'm using Strawberry Perl; you appear to be using something else. So there are several possible reasons for the discrepancy in warning messages, but I can't be more specific.
On the second question in your comment: to read the entire file and convert a particular row (in this case, row 2) to column names, you could use the following code:
simd.sg.xls <- read.xls(xls = "http://www.gov.scot/Resource/0044/00447385.xls",
sheet = "Quick Lookup", verbose = TRUE,
header=FALSE, stringsAsFactors=FALSE)
names(simd.sg.xls) <- make.names(names = simd.sg.xls[2,],
unique = TRUE,allow_ = TRUE)
simd.sg.xls <- simd.sg.xls[-(1:2),]
All data will be of character type so you'll need to convert to factor and numeric as necessary.

Stata: Efficient way to replace numerical values with string values

I have code that currently looks like this:
replace fname = "JACK" if id==103
replace lname = "MARTIN" if id==103
replace fname = "MICHAEL" if id==104
replace lname = "JOHNSON" if id==104
And it goes on for multiple pages like this, replacing an ID name with a first and last name string. I was wondering if there is a more efficient way to do this en masse, perhaps by using the recode command?
I will echo the other answers that suggest a merge is the best way to do this.
But if you absolutely must code the lines item-wise (again, messy) you can generate a long list ("pages") of replace commands by using MS Excel to "help" you write the code. Here is a picture of your Excel sheet with one example, showing the MS Excel formula:
columns:   A        B       C     D
row 1:     last     first   id    code
row 2:     MARTIN   JACK    103   ="replace fname=^"&B2&"^ if id=="&C2
You type that in, make sure it looks like Stata code when the formula calculates (aside from the carets), and copy the formula in column D down to the end of your list. Then copy the whole block of Stata code in column D generated by the formulas into your do-file, and do a find and replace (be careful here if you are using the caret elsewhere for mathematical uses!!) for all ^ to be replaced with ", which will end up generating proper Stata syntax.
(This is truly a brute force way of doing this, and is less dynamic in the case that there are subsequent changes to your generation list. All--apologies in advance for answering a question here advocating use of Excel :) )
You don't explain where the strings you want to add come from, but what is generally the best technique is explained at
http://www.stata.com/support/faqs/data-management/group-characteristics-for-subsets/index.html
Create an associative array of ids vs Fname,Lname
103 => JACK,MARTIN
104 => MICHAEL,JOHNSON
...
Replace
id => hash{id} ( fname & lname )
The efficiency of doing this will be taken care of by the programming language used.
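As a rough illustration of the associative-array idea above, here is a minimal sketch in Python, where a dict plays the role of the hash. This is not Stata code. The names names_by_id and records are hypothetical stand-ins for the lookup table and the dataset rows, and id 105 is an invented example of an id with no entry.

```python
# Hypothetical lookup table: id -> (first name, last name).
names_by_id = {
    103: ("JACK", "MARTIN"),
    104: ("MICHAEL", "JOHNSON"),
}

# Hypothetical records standing in for the dataset rows.
records = [{"id": 103}, {"id": 104}, {"id": 105}]

for rec in records:
    # Ids missing from the table get empty strings instead of raising KeyError.
    fname, lname = names_by_id.get(rec["id"], ("", ""))
    rec["fname"] = fname
    rec["lname"] = lname

print(records)
```

Each lookup is a single hash access, so the cost grows with the number of rows rather than with the number of rows times the number of ids, which is the efficiency the answer alludes to.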