How to fill in missing values by group? - stata

I have the following data structure. Within each group, some observations have missing values. I do know that each group has only one distinct non-missing value (10 for group 1 and 11 for group 2 in this case). The location of the missing observations is random within the group (i.e. I can't fill in missing values with the previous / following value).
How to fill the missing values with the one non-missing value by group?
group value
1 .
1 10
1 .
2 11
2 .
2 11
My current solution is a loop, but I suspect there's some clever bysort that I can use.
levelsof group, local(lm_group)
foreach group in `lm_group' {
levelsof value if group == `group', local(lm_value)
replace value = `lm_value' if group == `group'
}

If you know that the non-missing values are constant within group, then you can get there in one with
bysort group (value) : replace value = value[_n-1] if missing(value)
as the missing values are first sorted to the end and then each missing value is replaced by the previous non-missing value. Replacement cascades downwards, but only within each group.
For documentation, see this FAQ
To check that there is at most one distinct non-missing value within each group, you could do this:
bysort group (value) : assert (value == value[1]) | missing(value)
More personal note. It's nice to see levelsof in use, as I first wrote it, but the above is better.
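For readers who do not use Stata, the sort-then-cascade idea behind the bysort line can be sketched in plain Python. This is only an illustration of the logic, not a translation of Stata internals: `None` stands in for Stata's missing value, and the function name `fill_by_group` is made up for this example.

```python
# Rows as (group, value); None stands in for Stata's missing value (.).
rows = [(1, None), (1, 10), (1, None), (2, 11), (2, None), (2, 11)]


def fill_by_group(rows):
    """Sort missing values to the end of each group, then cascade downwards."""
    # Sort by group, then non-missing before missing, then by value,
    # mirroring what `bysort group (value)` does.
    ordered = sorted(
        rows,
        key=lambda r: (r[0], r[1] is None, r[1] if r[1] is not None else 0),
    )
    filled = []
    for group, value in ordered:
        # Copy the previous row's value, but only within the same group.
        if value is None and filled and filled[-1][0] == group:
            value = filled[-1][1]
        filled.append((group, value))
    return filled


filled = fill_by_group(rows)
print(filled)
```

As in the Stata version, the original row order is not preserved, and a group whose values are all missing stays missing.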

I think the xfill command is what you are looking for.
To install xfill, copy-paste the following into Stata and follow instructions:
net from http://www.sealedenvelope.com/
After that, the rest is easy:
xfill value, i(group)
You can read up about xfill here

The clever bysort-answer you were looking for was:
bysort group: egen new_value = max(cond(!missing(value), value, .))
The cond function checks whether its first argument is true; it returns value if it is and . if it is not.

FWIW I could not get Nick's bysort solution to work, no clue why. I followed the suggested syntax from the FAQ he linked instead and got it to work, though. The generic form is:
gsort id -myvar
by id: replace myvar = myvar[_n-1] if myvar == .
EDIT: fixed the errant reference to "time" in the previous iteration of this post (and added the if missing condition). The current code should be a functioning generic solution.

Related

Jmeter remove brackets from results with regex

I have this JDBC call result:
[{string_agg=1268#jm_eo_2283,1343#jm_eo_2333}]
so I use this regex to remove the brackets: (?<=[=]).*\d
In theory I should get only
jm_eo_2283,1343#jm_eo_2333
but I get 2022-08-30 13:35:49,291 ERROR o.a.j.e.RegexExtractor: Error in pattern: '(?<=[=]).*\d'
Any idea what is going on? Is there a way to get the result without brackets?
According to the JDBC Request sampler documentation:
If the Variable Names list is provided, then for each row returned by a Select statement, the variables are set up with the value of the corresponding column (if a variable name is provided), and the count of rows is also set up. For example, if the Select statement returns 2 rows of 3 columns, and the variable list is A,,C, then the following variables will be set up:
A_#=2 (number of rows)
A_1=column 1, row 1
A_2=column 1, row 2
C_#=2 (number of rows)
C_1=column 3, row 1
C_2=column 3, row 2
So you should be able to access:
first row as ${eo_id_1}
second row as ${eo_id_2}
etc.
More information: Using JDBC Sampler in JMeter
You can use
=[0-9]*#(.*[0-9])
See the regex demo. Keep the template field set to $1$, it will extract the value captured with the first parenthesized pattern part.
Details:
= - a = sign
[0-9]* - zero or more digits
# - a # char
(.*[0-9]) - Group 1 ($1$): any zero or more chars other than line break chars as many as possible, and then a digit.
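A quick way to sanity-check the pattern outside JMeter is to run it against the sample response. The sketch below uses Python's `re` module, which is an assumption made for illustration only: JMeter uses its own regex engine, so treat this as a check of the pattern's logic rather than of JMeter's behaviour.

```python
import re

# Sample response string copied from the question.
response = "[{string_agg=1268#jm_eo_2283,1343#jm_eo_2333}]"

# Same pattern as above: group 1 holds everything after the first '#'
# up to and including the last digit.
pattern = re.compile(r"=[0-9]*#(.*[0-9])")

match = pattern.search(response)
extracted = match.group(1) if match else None
print(extracted)
```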

How to count a value's frequency across specific columns in SAS

I would like to create a variable for each subject ID: ci_em_ti = COUNT of “Impaired” values among the following variables: bvmdrt_cutoff, craftivmmt_cutoff, craftpimmt_cutoff, craftvdelt_cutoff, craftpdelt_cutoff, nlairt_cutoff, nlsdt_cutoff, nlldt_cutoff
How should I do this in SAS?
I tried
countc(cats(of bvmdrt_cutoff, craftivmmt_cutoff, craftpimmt_cutoff, craftvdelt_cutoff, craftpdelt_cutoff, nlairt_cutoff, nlsdt_cutoff), "Impaired")
but it does not work
The function COUNTC() counts the number of times any of the listed characters appear. By searching for Impaired you are searching for the characters: adeiImpr. So one value of "Missing" will contribute 2 to the count since it has two lowercase i's, and "Normal" will count as 3 because of the letters r, m and a. "Impaired" will count as 8 since all of its characters are in the search list.
The function COUNT() will search for the number of times a substring occurs so you might try that.
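The difference between the two functions can be illustrated with a small Python emulation. This is only a sketch of the semantics described above, not SAS code: the helper names `countc` and `count` are stand-ins written for this example, and the row values are invented.

```python
def countc(text, chars):
    """Count the characters of text that appear anywhere in chars (case-sensitive)."""
    wanted = set(chars)
    return sum(1 for ch in text if ch in wanted)


def count(text, substring):
    """Count non-overlapping occurrences of substring in text."""
    return text.count(substring)


# Hypothetical values for one subject's cutoff variables.
row = ["Impaired", "Normal", "Missing", "Impaired"]
joined = "".join(row)  # roughly what CATS() yields for trimmed values

char_hits = countc(joined, "Impaired")  # counts individual characters
word_hits = count(joined, "Impaired")   # counts whole-substring matches
print(char_hits, word_hits)
```

Here the character-based count is 21 (8 + 3 + 2 + 8), while the substring count is 2, which is the number actually wanted.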
Are you sure your values are character strings? If instead they are numbers with a user defined format attached the CATS() function will not use the formatted values. So you will need to search for the codes instead of the decodes.
PS There is no need to add the OF keyword when there is only one variable in the list. Either remove the OF or remove the commas.
You say you want to count within a column, but your function actually counts across several columns within a single row. Since you haven't provided usable data, I'll use SASHELP.HEART instead.
This shows how to display your values in each column.
proc freq data=sashelp.heart;
table chol_status bp_status weight_status smoking_status;
run;

Changing single values of variables

I can't seem to find a way of changing individual values in Stata. Say I have a variable called height which has 20 observations; I can
dis height[20] /*displays the 20th observation of height*/
How can I likewise change say the 20th observation?
You could use the Data Editor. Otherwise the command line syntax is replace ... in #. See the help for replace. If you keep a log either kind of change will be documented as a replace statement.
clear
set obs 10
gen y = _n
replace y = 42 in 7

R: grepl select first character of a string

I apologize in advance, this might be a repeat question. However, I just spent the two last hours over stackoverflow, and can't seem to find a solution.
I want to use grepl to detect rows that begin with a digit. This is what I tried, but it didn't give me the right answer:
grep.numeric=as.data.frame(grepl("^[:digit:]",df_mod$name))
I guess that the problem is from the regular expression "^[:digit:]", but I couldn't figure it out.
UPDATE
My dataframe looks like this, It's huge, but below is an example:
ID mark name
1 whatever name product
2 whatever 10 product
3 whatever 250 product
4 another_mark other product
I want to detect products whose names begin with a number.
UPDATE 2
Applying grep.numeric=grepl("^[[:digit:]]",df_mod$name) to the example above gives me the right answer, which is:
grep.numeric
[1] FALSE TRUE TRUE FALSE
But what drives me crazy is that when I apply this function to my real dataframe:
grep.numeric=grepl("^[[:digit:]]",df_mod[217,]$nom)
give me this result:
grep.numeric
[1] FALSE
But actually, what I have is this :
df_mod[217,]$nom
[1] 100 lipo 30 gélules
Please help me.
Apparently, some of your values have leading spaces, so you could either modify your regex to (or something similar)
grepl("^\\s*[[:digit:]]", df_mod$name)
Or use the built in trimws function
grepl("^[[:digit:]]", trimws(df_mod$name))
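The effect of a leading space can be reproduced outside R. The sketch below uses Python's `re` module purely as an illustration; the sample names are adapted from the question, and the leading space on the last one is an assumption about what the real data looks like.

```python
import re

# The last value has an assumed leading space, as suspected in the answer.
names = ["name product", "10 product", "250 product", " 100 lipo 30 gélules"]

# Anchored at the start: fails when whitespace precedes the digit.
strict = [bool(re.match(r"[0-9]", n)) for n in names]

# Option 1: allow optional leading whitespace in the pattern.
lenient = [bool(re.match(r"\s*[0-9]", n)) for n in names]

# Option 2: trim the string first, then use the strict pattern.
trimmed = [bool(re.match(r"[0-9]", n.strip())) for n in names]

print(strict, lenient, trimmed)
```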

Regex Optional Groups Captures with Values

I'm having difficulties extracting irregular data using regex. I attempted to use lookaheads; however, when the value doesn't exist the entire match fails. The data set is consistent all the way until I reach the characters starting with RXX. The RXX are unique identifiers (groups), and the numeric values in between each set of Rxx's are what I would like to capture and assign to group names.
The Rxx values are random from R01 to R15 and 1 to all 15 could exist in the string.
The string values could vary from
12*000000000**S304JB01811*8*0*8*4*4*34R0332R152~~~
12*000000000**S304JB01811*9*0*4*3*4*224R023R032R10234R1325~~~
I'm able to extract the values and assign a group name until I reach the Rxx
My attempt at extracting the values is as follows:
S304JB0...(?<Total1>[\d]+).(?<Total2>[\d]+).(?<Total3>[\d]+).(?<Total4>[\d]+).(?<Total5>[\d]+).(?<Total6>[\d]+).(?<Total7>[\d]+)
Which gives me what I want below
Total1 `1`
Total2 `8`
Total3 `0`
Total4 `8`
Total5 `4`
Total6 `4`
Total7 `34`
Capturing the R03 value and assigning it to Row is achieved below but if the value R03 doesn't exist in the string then the entire match returns false
(?<Row3>(R03)[\d]+)
Looking how I can make these regex statements optional allowing me to return the following
Total1 `1`
Total2 `8`
Total3 `0`
Total4 `8`
Total5 `4`
Total6 `4`
Total7 `34`
Row3 `32`
Row15 `2`
S304JB0...(?<Total1>[\d]+).(?<Total2>[\d]+).(?<Total3>[\d]+).(?<Total4>[\d]+).(?<Total5>[\d]+).(?<Total6>[\d]+).(?<Total7>[\d]+)(?<Row3>(R03)[\d]+)(?<Row4>(R04)[\d]+) ------> (?<Row15>(R15)[\d]+)
Thanks for your help
-Edited
Thanks for the quick reply Jorge
The input data will be
12*000000000**S304JB01811*8*0*8*4*4*34R0332R152~~~
The output will be 9 captured groups results
Group | Result
Total1 = 1
Total2 = 8
Total3 = 0
Total4 = 8
Total5 = 4
Total6 = 4
Total7 = 34
Row3 = 32
Row15 = 2
My example is shared below with input and
https://regex101.com/r/wG3aM3/68
Hopefully this helped to clarify things
D.
I'm certain this would be easier parsing char by char and storing each value.
As for the regex question, basically what you want to do is create all the groups, just like you've already tried, but you also want to make them optional, because not all groups might be there.
You can make the group optional with a construct like:
(?:R01(?<Row1>\d+))?
So you should add one of each to get the values in different capture groups. Notice I used the construct (?:non-capturing) which is exactly the same as a group, but it doesn't create a backreference. You can read about it here.
Edit: One more thing. You're using a . to allow any delimiter. However, performance-wise it would be better to use something like \D (anything except digits). In case of failure, it saves the regex engine quite a few backtracking steps.
This would be the whole expression, assuming the Rxx groups are always ordered.
S304JB0...(?<Total1>\d+)\D(?<Total2>\d+)\D(?<Total3>\d+)\D(?<Total4>\d+)\D(?<Total5>\d+)\D(?<Total6>\d+)\D(?<Total7>\d+)(?:R01(?<Row1>\d+))?(?:R02(?<Row2>\d+))?(?:R03(?<Row3>\d+))?(?:R04(?<Row4>\d+))?(?:R05(?<Row5>\d+))?(?:R06(?<Row6>\d+))?(?:R07(?<Row7>\d+))?(?:R08(?<Row8>\d+))?(?:R09(?<Row9>\d+))?(?:R10(?<Row10>\d+))?(?:R11(?<Row11>\d+))?(?:R12(?<Row12>\d+))?(?:R13(?<Row13>\d+))?(?:R14(?<Row14>\d+))?(?:R15(?<Row15>\d+))?
DEMO
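To avoid typing fifteen near-identical optional groups by hand, the expression can also be assembled programmatically. The following is a minimal Python sketch of that idea, not the original poster's setup: Python spells named groups `(?P<name>...)` instead of `(?<name>...)`, the helper name `parse` is hypothetical, and it still assumes the Rxx groups appear in ascending order.

```python
import re

# Seven required totals separated by a single non-digit delimiter.
totals = r"\D".join(rf"(?P<Total{i}>\d+)" for i in range(1, 8))

# Fifteen optional Rxx groups: (?:R01(?P<Row1>\d+))? ... (?:R15(?P<Row15>\d+))?
rows = "".join(rf"(?:R{i:02d}(?P<Row{i}>\d+))?" for i in range(1, 16))

pattern = re.compile(r"S304JB0..." + totals + rows)


def parse(line):
    """Return only the named groups that actually matched."""
    match = pattern.search(line)
    if match is None:
        return {}
    return {name: value for name, value in match.groupdict().items() if value is not None}


sample = "12*000000000**S304JB01811*8*0*8*4*4*34R0332R152~~~"
result = parse(sample)
print(result)
```

Groups for absent Rxx identifiers come back as `None` from `groupdict()`, which is why they are filtered out before returning.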