Changing character variable to numeric variable - sas

Hello i have a table which has numerous columns i.e name,age,country, gender etc, i'm a student who is trying to use SAS for a module. There has purposely been some wrong data placed in columns - For example Gender is either Male or Female but i'm essentially looking to turn male or female into either 1 or 2 i.e 1=Male/2=Female but when i use the code that i think is correct to do this i get the error expecting a format name (error 22-322) as well as (76-322) syntax error, statement will be ignored
More than happy to send any data or examples of additional code if that helps.
If someone could help me this would be massively appreciated! Thanks in advance.

Related

Lookup Functions not giving exact result

I have a little problem getting the result i wanted from the data that i have.
I tried index match match and vlook up match but it doesnt give me the result that i wanted.
Please see the sample data that I have. DATA 1
I wanted to get the # of hours based from this table TABLE
Can someone help me get the correct syntax for this please?
Tried several things to go around the with formulas but ended up either just 1 result for the whole column or an error. Thanks
up for this.

How can I resolve INDEX MATCH errors caused by discrepancies in the spelling of names across multiple data sources?

I've set up a Google Sheets workbook that synthesizes data from a few different sources via manual input, IMPORTHTML and IMPORTRANGE. Once the data is populated, I'm using INDEX MATCH to filter and compare the information and to RANK each data set.
Since I have multiple data inputs, I'm running into a persistent issue of names not being written exactly the same between sources, even though they're the same person. First names are the primary culprit (i.e. Mary Lou vs Marylou vs Mary-Lou vs Mary Louise) but some last names with special symbols (umlauts, accents, tildes) are also causing errors. When Sheets can't recognize a match, the INDEX MATCH and RANK functions both break down.
I'm wondering how to better unify the data automatically so my Sheet understands that each occurrence is actually the same person (or "value").
Since you can't edit the results of an IMPORTHTML directly, I've set up "helper columns" and used functions like TRIM and SPLIT to try and fix instances as I go, but it seems like there must be a simpler path.
It feels like IFS could work but I can't figure how to integrate it. Also thinking this may require a script, which I'm just beginning to study.
Here's a simplified example of what I'm trying to achieve and the corresponding errors: Sample Spreadsheet
The first tab is attempting to pull and RANK data from tabs 2 and 3. Sample formulas from the Summary tab, row 3 (Amelia Rose):
Cell B3: =INDEX('Q1 Sales'!B:B, MATCH(A3,'Q1 Sales'!A:A,0))
Cell C3: =RANK(B3,$B$2:B,1)
Cell D3: =INDEX('Q2 Sales'!B:B, MATCH(A3,'Q2 Sales'!A:A,0))
Cell E3: =RANK(D3,$D$2:D,1)
I'd be grateful for any insight on how to best index 'Q2Sales'!B3 as the correct value for 'Summary'!D3. Thanks in advance - the thoughtful answers on Stack Overflow have gotten me this far!
to counter every possible scenario do it like this:
=ARRAYFORMULA(IFERROR(VLOOKUP(LOWER(REGEXREPLACE(A2:A, "-|\s", )),
{REGEXEXTRACT(LOWER(REGEXREPLACE('Q2 Sales'!A2:A, "-|\s", )),
TEXTJOIN("|", 1, LOWER(REGEXREPLACE(A2:A, "-|\s", )))), 'Q2 Sales'!B2:B}, 2, 0)))

Extract all numeric values in Columns in R

I am currently working with a large data set using R. So, I have a column called "Offers". This column contains text describing 'promotions' that companies offer on their products. I am trying to extract numeric values from these. While, for most cases, I am able to do so well using a combination of regex and functions in R packages, I am unable to deal with a couple of specific cases of text shown below. I would really appreciate any help on these.
"Buying this ensures Savings of $50. Online Credit worth 35$ is also available. So buy soon!"
1a. I want to get both the numeric values out but in 2 different columns. How
do I go about that?
1b. For another problem that I have to solve, I only need to take the value associated with the credit. It is always the case that for texts like above, the second numeric value in the text, if it exists, is the one associated with the credit.
"Get 50% off on your 3 night stay along with 25 credits, offer available on 3 December 2016"
(How should I only take the value associated with credits?)
Note: Efficiency would be important as well because I am dealing with about 14 million rows.
I have tried looking online for a solution but have not found anything very satisfactory.
I am not 100% sure about what you want but this may help you.
A <- "do 50% and whatever 23"
B <- gregexpr("\\d+",A)[[1]]
firstNum <- substr(A,B[1],B[1]+attr(B,"match.length")[1]-1)
secondNum <- substr(A,B[2],B[2]+attr(B,"match.length")[2]-1)
Hope this helps.

Replace cases in one dataset using cases from another file

I have a master data file that contains responses from English, German, and French respondents. The open-ended responses (OER) were sent to translators and they sent us back a file with the original OER and English translation of those. Now I want to replace the "empty" columns reserved for English translation in my master data with the new information.
My approach was:
Create a loop in the translation file:
foreach var of varlist *englishtranslation* {
rename `var' new_`var'
}
Then merge new_`var' into master data using respondent ID.
Replace non-missing cases in blank cols using info in new_`var'.
Drop new_`var'.
However, Stata keeps saying that the new variable names new_`var' are invalid:
You attempted to rename q12_v1_995_oe_englishtranslation to
new_q12_v1_995_oe_englishtranslation. That is an invalid Stata
variable name.
Do you have any recommendation on fixing that error or on another approach?
Many thanks,
EL
Edit: I understand that the variable name length limit is 32 and that variable has exactly 32 characters, hence the error when I tried to rename it. But I need to come up with a systematic way to name these variables because multiple people work on it and I don't want to mess with the agreed organization of the dataset.
Your new name has 36 characters. There's a limit of 32 (with Stata 12 and 13, at least).
An example reproducing your error:
clear
set more off
set obs 1
gen q12_v1_995_oe_englishtranslation = 99
gen new_q12_v1_995_oe_englishtranslation = 10
Solution: make the name shorter.
See help varname for details.
Edit
On your question about renaming:
Try:
rename *englishtranslation *engtrans
See help rename and help rename group for details.

Stata command to add all choices, those made and those not made

UPDATE:
I solved the first part of the problem. I created unique ids for each observation:
gen id=_n
Then, I used
fillin id categ
which essentially created what I was looking for.
However, for the rest of the variables (except id and categ), almost all observations are missing. Now, I need your help to duplicate the rest of the variables instead of having them missing.
Just as an example, each observation is associated with a particular week. I am missing most of them. Or another dummy variable indicates whether a purchase was made at a drug or grocery store. Most of them are missing too.
Thanks!
ORIGINAL MESSAGE:
Need your help in Stata!
Each observation in my database is a 1-unit purchase of a beer product made by a customer. These product purchases are categorized unto 8 general categories such that the variable "categ" has values from 1 to 8 (1=import, 2=craft, 3=premium, 4=light, etc).
For my multinomial logit model, I need to observe all categories purchased or not purchased by the customer in each observation.
Assume, this is my initial dataset:
customer id-------beer category-----units purchased
----------1------------------1--------------------- 1
----------2----------------- 3--------------------- 1
----------3 -----------------2 ---------------------1
This is what I am looking for:
customer id-------beer category-----units purchased
----------1------------------1--------------------- 1
----------1 -----------------2 ---------------------0
----------1----------------- 3--------------------- 0
----------2----------------- 1--------------------- 0
----------2----------------- 3--------------------- 1
----------2 -----------------3--------------------- 0
----------3----------------- 1--------------------- 0
----------3----------------- 2--------------------- 0
----------3 -----------------2 ---------------------1
Currently, my dataset is 600,000 obs. After this procedure, I should have 600,000*8=4,800,000 obs.
When constructing this code, it is necessary that all other variables in the dataset are duplicated according to the associated category of beer.
I assume that "fillin" and less likely "expand" might work.
You help will tremendously help.
Thanks!
This is an old question, but i'll post a possible answer if someone else is having this problem.
In this case, you could generate variables for every option of your "choice variable", and after that, apply the reshape long command:
tab beercategory, gen(b)
reshape long b , i(customerid) j(newvarname)
Greetings