This might seem painfully easy, but say I create a Stata local macro called example:
local example "blah1 blah2 blah3"
I want to get just blah2 using the numerical index, in a way that might look like
example[2]
in another language. How does one do this in Stata?
If you run this script you will see four ways of getting the second word.
local example "blah1 blah2 blah3"
* 1
tokenize "`example'"
di "`2'"
* 2
local example2 : word 2 of `example'
di "`example2'"
* 3
di "`: word 2 of `example''"
* 4
mata : words = tokens(st_local("example"))
mata : words[2]
In Stata (as opposed to Mata) the syntax words[2] is perfectly legal so long as words is a variable, meaning a column in the dataset, but a local macro in Stata is not a variable in that sense.
clear
set obs 3
gen words = "blah" + strofreal(_n)
di words[2]
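If the goal is to step through every word by its index rather than fetch a single one, the same extended macro functions can be combined with a loop. This is only a minimal sketch reusing the example macro from the question; the local names n, i and w are made up for illustration.

```stata
* Loop over the words of a local macro by numeric index
local example "blah1 blah2 blah3"

* word count returns the number of space-separated words
local n : word count `example'

forvalues i = 1/`n' {
    * word # of picks out the i-th word, counting from 1
    local w : word `i' of `example'
    display "word `i' is `w'"
}
```

Because local macros vanish when the do-file or interactive session that defined them ends, the whole block should be run in one go.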
Not sure if this is possible in SAS, although I'm slowly learning that pretty much anything is possible in SAS...
I have a data-set of 600 patients and within that data-set I have a comment variable. The comment variable contains a few sentences each patient stated about his/her care. So for example, the data set looks like this:
ID Comment
1 Today we have great service. everyone was really nice.
2 The customer service team did not know what they were talking about and was rude.
3 Everyone was very helpful 5 stars.
4 Not very helpful at all.
5 Staff was nice.
6 All the people was really nice.
Let's say I identify a number of keywords I'm interested in, for example nice, rude and helpful.
Is there a way to pull the two words that come before each of these keywords and produce a frequency table of the resulting phrases?
WORD Frequency
Was Really Nice 2
And Was Rude 1
Was Very Helpful 1
Not very helpful 1
I already have code that helps me identify the keywords. It counts the frequency of each word in the comment variable.
data PG_2 / view=PG_2;
  length word $20;
  set PG_1;
  do i = 1 by 1 until(missing(word));
    word = upcase(scan(COMMENT, i));
    if not missing(word) then output;
  end;
  keep word;
run;
proc freq data=PG_2 order=freq;
  table word / out=wordfreq(drop=percent);
run;
Have you looked at the Perl regular expression (PRX) functions in SAS? I think they might solve your issue.
You can use regex capture groups to pull out the two words directly before your keyword using prxparse and prxposn. The code below should grab the two words before the first occurrence of the word nice in the comment variable and store them in the firstTwoStrings variable. The match is case-sensitive as written.
data firstTwoStrings;
  length firstTwoStrings $200;
  retain re;
  if _N_ = 1 then
    re = prxparse('/(\w+ \w+) nice/'); /*change 'nice' to your desired keyword*/
  set comments;
  if prxmatch(re, COMMENT) then
    do;
      firstTwoStrings = prxposn(re, 1, COMMENT);
    end;
run;
I have a table that looks similar to this:
A | B
1234|A1B2C
1124|$1n7
1342|*6675
1189|966
I need to create a column C that takes the data from column B, replaces every non-numeric character with a "9", and pads each value to 5 characters by adding 0s to the front. It should come out like this:
91929
09197
96675
00966
Any assistance would be much appreciated. Thank you!
Edit: Sorry, this is my first time posting on any forum like this and I got a bit ahead of myself. I created the table using SQL to pull data from 3 other tables, and I am a bit more familiar with SQL than SAS, which I have only been using for a few weeks. I tried using COMPRESS, but as I read more about it, it seems to only remove the characters. So I tried TRANWRD, but from what I was able to figure out I would need to create an entry for each letter and symbol that could appear, i.e.
data Work.temp;
str = b;
Alpha=tranwrd(str, "a", "9");
Alpha=tranwrd(str, "b", "9");
put Alpha;
run;
So then I researched some more and found the question "SAS replace character in ALL columns".
Based on that I used this code:
data temp;
set work.temp;
array vars [*] _character_;
do i = 1 to dim(vars);
vars[i] = compress(tranwrd(vars[i],"a","9"));
end;
drop i;
run;
That just returns:
|Str|B|Alpha|
|---.|-.|.-------|
(sorry about the bad formatting there, spent 30 min trying to figure out how to make the table look right with spaces but kept coming out wrong. Please imagine the -'s are spaces)
Again, any help would be appreciated. Thank you!
Try this.
data test;
input var1 $5.;
datalines;
A1B2C
$1n7
*6675
966
;
run;
data test1;
  set test;
  length var2 $5;
  /* replace every character that is neither a digit nor whitespace with 9; whitespace is excluded so the trailing blanks that pad var1 are left alone */
  regex = prxparse("s/[^0-9\s]/9/");
  var2 = prxchange(regex, -1, var1); /* -1 substitutes all instances of the pattern */
  var3 = put(input(var2, best5.), z5.); /* use input and put to pad the front of the variable with 0s */
run;
Good luck.
Keeping only the digits is simple. Use the modifiers on the COMPRESS() function.
c=compress(b,,'kd');
There are a number of ways to pad on the left with zeros.
You could convert the digits to a number and then write it back to a string using the Z format.
c=put(input(c,??5.),Z5.);
You could add the zeros yourself, using an IF statement:
if length(c) < 5 then c=repeat('0',5-length(c)-1)||c ;
Or use the SUBSTRN() function.
c=substrn('00000',1,5-length(c))||c;
Or have some fun with the REVERSE() function.
c=reverse(substr(reverse(cats('00000',c)),1,5));
I can't seem to find a way of changing individual values in Stata. Say I have a variable called height which has 20 observations. I can display one of them:
dis height[20] /*displays the 20th observation of height*/
How can I likewise change, say, the 20th observation?
You could use the Data Editor. Otherwise the command line syntax is replace ... in #. See the help for replace. If you keep a log either kind of change will be documented as a replace statement.
clear
set obs 10
gen y = _n
replace y = 42 in 7
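To tie this back to the height variable in the question, here is a minimal sketch. The data are invented purely for illustration; only the display and replace ... in # syntax comes from the posts above.

```stata
* Hypothetical data: 20 observations of a made-up height variable
clear
set obs 20
generate height = 150 + _n

* value of the 20th observation before the change
display height[20]

* change only the 20th observation
replace height = 175 in 20

* value of the 20th observation after the change
display height[20]

* in also accepts a range: set the first three observations to missing
replace height = . in 1/3
list height in 1/5
```

Note that in # refers to the current sort order, so the same command can hit a different row after the data are re-sorted. A condition such as if id == 20 on an identifier variable, where one exists, is more robust.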
I have the following data structure. Within each group, some observations have missing values. I do know that each group has only one distinct non-missing value (10 for group 1 and 11 for group 2 in this case). The location of the missing observations is random within the group (i.e. I can't fill in missing values with the previous / following value).
How can I fill the missing values with the one non-missing value by group?
group value
1 .
1 10
1 .
2 11
2 .
2 11
My current solution is a loop, but I suspect there's some clever bysort that I can use.
levelsof group, local(lm_group)
foreach group in `lm_group' {
    levelsof value if group == `group', local(lm_value)
    replace value = `lm_value' if group == `group'
}
If you know that the non-missing values are constant within group, then you can get there in one with
bysort group (value) : replace value = value[_n-1] if missing(value)
as the missing values are first sorted to the end and then each missing value is replaced by the previous non-missing value. Replacement cascades downwards, but only within each group.
For documentation, see this FAQ
To check that there is at most one distinct non-missing value within each group, you could do this:
bysort group (value) : assert (value == value[1]) | missing(value)
On a more personal note: it's nice to see levelsof in use, as I first wrote it, but the above is better.
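Putting the check and the fill together on the example data from the question gives the following sketch. The input block simply retypes the six rows shown in the question; nothing else is assumed.

```stata
* Rebuild the example data from the question
clear
input group value
1 .
1 10
1 .
2 11
2 .
2 11
end

* Check the assumption first: at most one distinct non-missing value per group
bysort group (value) : assert value == value[1] | missing(value)

* Missing values sort last within each group, so each one copies the value just above it
bysort group (value) : replace value = value[_n-1] if missing(value)

list, sepby(group)
```

Running the assert before the replace means the do-file stops with an error, rather than silently filling in the wrong number, if some group turns out to hold two different non-missing values.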
I think the xfill command is what you are looking for.
To install xfill, copy-paste the following into Stata and follow instructions:
net from http://www.sealedenvelope.com/
After that, the rest is easy:
xfill value, i(group)
You can read up about xfill here
The clever bysort-answer you were looking for was:
bysort group: egen new_value=max(cond(!missing(value), value, .))
The cond() function checks whether its first argument is true; it returns value if it is and . if it is not.
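As a variation on the egen approach, here is a sketch that also verifies the one-value-per-group assumption. It relies on the fact that egen's max() and min() functions skip missing values on their own, so the cond() wrapper is not strictly required. The variable names vmax and vmin are made up for illustration.

```stata
* Rebuild the example data from the question
clear
input group value
1 .
1 10
1 .
2 11
2 .
2 11
end

* egen's max() and min() ignore missing values within each group
bysort group: egen vmax = max(value)
bysort group: egen vmin = min(value)

* If each group has a single distinct non-missing value, the two must agree
assert vmax == vmin

* Fill the gaps and drop the helper variables
replace value = vmax if missing(value)
drop vmax vmin

list, sepby(group)
```

Unlike the sort-based fill, this version leaves the original row order untouched, at the cost of two temporary variables.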
FWIW I could not get Nick's bysort solution to work, no clue why. I followed the suggested syntax from the FAQ he linked instead and got it to work, though. The generic form is:
gsort id -myvar
by id: replace myvar = myvar[_n-1] if myvar == .
EDIT: fixed the errant reference to "time" in the previous iteration of this post (and added the if missing condition). The current code should be a functioning generic solution.
I want to use columns 'A' and 'B' to create column 'Result', which is the content of A repeated B times.
A B Result
z 3 zzz
az 2 azaz
I tried using Result=repeat(A,B), which didn't work out. Is there something I missed while using the REPEAT function?
The REPEAT function returns a character value consisting of the first argument repeated n times. Thus, the first argument appears n + 1 times in the result.
So, you have to subtract 1 from B to get the result that you want. It is also worth wrapping A in trim(), because otherwise the trailing blanks that pad a character variable are repeated along with its content.
Try
Result=repeat(trim(A),int(B)-1);
It is simple in R! Sorry, I did not look at the tag, but here is how R does it.
Try the function makeNstr() from package Hmisc
>require(Hmisc)
>df <- data.frame(A = c("a","az"), B = c(3,2))
>Result <- makeNstr(df$A,df$B)
>df <- cbind(df,Result)
>df
A B Result
1 a 3 aaa
2 az 2 azaz
Hope you find it useful