Grouping values in DB SAS

Please help me solve this problem. I have some ideas, but none of them gives the desired result. The dataset I have:
Site Num Pres Began Start A B C
01 101 yes no yes 1 1 3
01 101 no yes yes 2 1 7
01 102 yes yes no 1 2 1
The dataset I want (as a text file):
Site Num Pres Began Start Quantity
01 101 yes no yes 1
1
3
01 101 no yes yes 2
1
7
01 102 yes yes no 1
2
1
If you have any thoughts on this, I would be very grateful!
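To make the requested reshaping concrete, here is a minimal sketch in Python rather than SAS. It assumes the first five columns identify a row and the remaining columns (A, B, C) are the quantities to stack; the helper name `stack_quantities` is made up for illustration. In SAS the same loop could presumably be written in a DATA step with an array over A, B and C.

```python
# Hypothetical helper: emit one output line per quantity. The identifying
# columns are printed only on the first line that comes from each input row.
def stack_quantities(rows, n_id=5):
    lines = []
    for row in rows:
        ids, quantities = row[:n_id], row[n_id:]
        for k, q in enumerate(quantities):
            if k == 0:
                lines.append(" ".join(ids + [q]))
            else:
                lines.append(q)
    return lines

# The three rows from the question, split into columns.
have = [
    "01 101 yes no yes 1 1 3".split(),
    "01 101 no yes yes 2 1 7".split(),
    "01 102 yes yes no 1 2 1".split(),
]
header = "Site Num Pres Began Start Quantity"
want = [header] + stack_quantities(have)
print("\n".join(want))
```

Each input row becomes three output lines, so the three sample rows give nine lines plus the header.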

Related

Removing entire panel with missing values

I'm working on a panel dataset that has missing values in four variables (at the start, at the end, and in the middle of panels). I would like to remove every panel that has any missing values.
This is the code I have tried so far:
bysort BvD_ID YEAR: drop if sum(!missing(REV_LAY,EMP_LAY,FX_ASSET_LAY,MATCOST_LAY))==0
This code removes every observation that has a missing value in any of the four variables, but it retains the other observations from the same panel, which have non-missing values.
Example data:
Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
In the above sample data, I want to drop panel Firm_ID = 001 completely.
You can do something like:
clear
input Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
001 2001 80 25 120
001 2002 75 . 122
001 2003 82 32 128
002 2001 40 15 45
002 2002 42 18 48
002 2003 45 20 50
end
generate index = _n
bysort Firm_ID (index): generate todrop = sum(missing(REV_LAY, EMP_LAY, FX_ASSET_LAY))
by Firm_ID: drop if todrop[_N]
list Firm_ID Year REV_LAY EMP_LAY FX_ASSET_LAY
+-----------------------------------------------+
| Firm_ID Year REV_LAY EMP_LAY FX_ASS~Y |
|-----------------------------------------------|
1. | 2 2001 40 15 45 |
2. | 2 2002 42 18 48 |
3. | 2 2003 45 20 50 |
+-----------------------------------------------+
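The same idea (flag a panel as soon as any of its rows has a missing value, then drop all rows of flagged panels) can be sketched outside Stata. The Python below is only an illustration: the helper name `drop_incomplete_panels` is made up, and missing values are represented as `None`.

```python
def drop_incomplete_panels(rows, key, fields):
    # First pass: collect the panels that have at least one missing value.
    bad = {r[key] for r in rows if any(r[f] is None for f in fields)}
    # Second pass: keep only the rows whose panel was not flagged.
    return [r for r in rows if r[key] not in bad]

cols = ["Firm_ID", "Year", "REV_LAY", "EMP_LAY", "FX_ASSET_LAY"]
raw = [
    ("001", 2001, 80, 25, 120),
    ("001", 2002, 75, None, 122),  # EMP_LAY is missing here
    ("001", 2003, 82, 32, 128),
    ("002", 2001, 40, 15, 45),
    ("002", 2002, 42, 18, 48),
    ("002", 2003, 45, 20, 50),
]
rows = [dict(zip(cols, r)) for r in raw]
kept = drop_incomplete_panels(rows, "Firm_ID", cols[2:])
for r in kept:
    print(r)
```

Only the three rows of firm 002 remain, which matches the Stata listing above.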

SAS retain Statement to drag values down

Following is an example of the data I have
data testretain;
input SUBJ visit parameter value vistype$ basevalue$;
cards;
01 1 1 152 screen .
01 1 2 22 screen .
01 1 3 1000 screen .
01 2 1 154 random YES
01 2 2 23 random YES
01 2 3 1005 random YES
01 3 1 155 visit .
01 3 2 21 visit .
01 3 3 1003 visit .
;
run;
I want to make sure that the value from the rows where basevalue is YES gets carried over to each later visit.
This is how I want the output to look:
SUBJ visit parameter value vistype basevalue BASE
01 1 1 152 screen .
01 1 2 22 screen .
01 1 3 1000 screen .
01 2 1 154 random YES 154
01 2 2 23 random YES 23
01 2 3 1005 random YES 1005
01 3 1 155 visit . 154
01 3 2 21 visit . 23
01 3 3 1003 visit . 1005
I tried the following code;
data testretain1;
set testretain;
if basevalue='YES' then BASE=value;
retain BASE;
run;
However, it doesn't seem to work: the value 1005 gets carried down to every later observation.
Sort the data so that all the results for the same parameter are together; then you can easily use RETAIN to solve this.
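proc sort data=testretain out=have;
by SUBJ parameter visit;
run;
data want;
set have;
by SUBJ parameter visit;
* reset the baseline at the start of each parameter;
if first.parameter then BASE=.;
if basevalue='YES' then BASE=value;
retain BASE;
run;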
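For comparison, here is a rough Python sketch of the same carry-forward logic. Note that it swaps the technique: instead of sorting by parameter and resetting at each BY group, it keeps one retained value per (SUBJ, parameter) pair in a dictionary, so the rows can stay in visit order. The function name `add_base` is made up for illustration.

```python
def add_base(rows):
    base = {}  # (SUBJ, parameter) -> baseline value retained so far
    out = []
    # Process rows in visit order so a baseline only affects later visits.
    for r in sorted(rows, key=lambda r: (r["SUBJ"], r["visit"], r["parameter"])):
        k = (r["SUBJ"], r["parameter"])
        if r["basevalue"] == "YES":
            base[k] = r["value"]
        out.append({**r, "BASE": base.get(k)})  # None until a baseline is seen
    return out

cols = ["SUBJ", "visit", "parameter", "value", "vistype", "basevalue"]
raw = [
    ("01", 1, 1, 152, "screen", None),
    ("01", 1, 2, 22, "screen", None),
    ("01", 1, 3, 1000, "screen", None),
    ("01", 2, 1, 154, "random", "YES"),
    ("01", 2, 2, 23, "random", "YES"),
    ("01", 2, 3, 1005, "random", "YES"),
    ("01", 3, 1, 155, "visit", None),
    ("01", 3, 2, 21, "visit", None),
    ("01", 3, 3, 1003, "visit", None),
]
result = add_base([dict(zip(cols, r)) for r in raw])
for r in result:
    print(r["visit"], r["parameter"], r["value"], r["BASE"])
```

The screening rows keep an empty BASE, and each visit-3 row picks up the baseline of its own parameter (154, 23, 1005) rather than the last value seen.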

How to read regex2dfa output

I am experimenting with the regex2dfa library (https://github.com/kpdyer/regex2dfa), using the command ./regex2dfa -r "(abc+)+"
This returns
0 2 97 97
1 3 99 99
2 1 98 98
3 2 97 97
3 3 99 99
3
Looking at this
https://lambda.uta.edu/cse5317/spring01/notes/node8.html
and using the DFA generated here for the regex (abc+)+
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
I can't figure out how to get from the diagram to the transition table(?) that the regex2dfa tool outputs.
What am I missing?
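One plausible reading, offered as an assumption rather than a documented fact: the output looks like the AT&T FST text format, where each four-field line is `source-state destination-state input-label output-label`, the labels are byte values (97 = a, 98 = b, 99 = c), and a line with a single number names an accepting state. State 0 is taken as the start state. Under that reading the table is 0 -a-> 2, 2 -b-> 1, 1 -c-> 3, 3 -a-> 2, 3 -c-> 3, which does match (abc+)+. A small Python sketch to check this (the function names are made up):

```python
def parse_fst(text):
    transitions = {}  # (state, symbol) -> next state
    finals = set()
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) == 4:
            # Assumed layout: source, destination, input label, output label.
            src, dst, ilabel, _olabel = map(int, parts)
            transitions[(src, chr(ilabel))] = dst
        elif len(parts) == 1:
            # Assumed: a lone number marks an accepting state.
            finals.add(int(parts[0]))
    return transitions, finals

def accepts(transitions, finals, word, start=0):
    state = start
    for ch in word:
        if (state, ch) not in transitions:
            return False  # no transition defined: reject
        state = transitions[(state, ch)]
    return state in finals

output = """
0 2 97 97
1 3 99 99
2 1 98 98
3 2 97 97
3 3 99 99
3
"""
transitions, finals = parse_fst(output)
for word in ["abc", "abcc", "abcabc", "ab", "abca"]:
    print(word, accepts(transitions, finals, word))
```

The state numbers need not match the ones in a diagram from another tool; only the shape of the graph has to agree.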

excel, vba or regex to copy values downwards based on repeated values

I have the following records:
62
STARTHERE 1.1 vol. 84 no. 1 1996 01.1 A 0 1 1996 04 24 0
STARTHERE 1.2 vol. 84 no. 2 1996 01.2 A 0 1 1996 05 23 0
STARTHERE 1.3 vol. 84 no. 3 1996 01.3 A 1 1 1996 08 13 0
STARTHERE 1.4 vol. 84 no. 4 1996 01.4 A 0 1 1996 10 15 0
STARTHERE 1.5 vol. 84 no. 5 1996 01.5 A 0 1 1997 01 22 0
STARTHERE 1.6 vol. 84 no. 6 1996 01.6 A 0 1 1997 02 10 0
63
STARTHERE 1.1 95:1 Feb 2002 1.1 A 0 1 2002 06 03 0
STARTHERE 1.2 95:2 Apr 2002 1.2 A 0 1 2002 06 17 0
STARTHERE 1.3 95:3 Jun 2002 1.3 A 0 1 2002 07 18 0
STARTHERE 1.4 95:4 Aug 2002 1.4 A 0 1 2003 02 24 0
STARTHERE 1.5 95:5 Oct 2002 1.5 A 0 1 2003 02 24 0
64
65
STARTHERE 1.1 34:1 Mar 1996 1.1 A 0 1 1996 07 16 0
STARTHERE 1.2 34:2 Jun 1996 1.2 A 0 1 1996 09 19 0
STARTHERE 1.3 34:3 Sep 1996 1.3 A 0 1 1996 12 17 0
I don't know whether this is possible with an Excel formula, VBA in Excel, or even a regex. I want to take each numeric value (e.g. 62) and use it to replace "STARTHERE" in the rows below it, up to the next numeric value (63). Right now this is done manually, but I am wondering whether it can be automated with an Excel formula, VBA, or a regex, as these are what I'm familiar with. The result should look like the following; it is fine if the rows that contain only the number (such as the bare 62) are stripped, but I don't mind if they stay:
62
62 1.1 vol. 84 no. 1 1996 01.1 A 0 1 1996 04 24 0
62 1.2 vol. 84 no. 2 1996 01.2 A 0 1 1996 05 23 0
62 1.3 vol. 84 no. 3 1996 01.3 A 1 1 1996 08 13 0
62 1.4 vol. 84 no. 4 1996 01.4 A 0 1 1996 10 15 0
62 1.5 vol. 84 no. 5 1996 01.5 A 0 1 1997 01 22 0
62 1.6 vol. 84 no. 6 1996 01.6 A 0 1 1997 02 10 0
63
63 1.1 95:1 Feb 2002 1.1 A 0 1 2002 06 03 0
63 1.2 95:2 Apr 2002 1.2 A 0 1 2002 06 17 0
63 1.3 95:3 Jun 2002 1.3 A 0 1 2002 07 18 0
63 1.4 95:4 Aug 2002 1.4 A 0 1 2003 02 24 0
63 1.5 95:5 Oct 2002 1.5 A 0 1 2003 02 24 0
64
65
65 1.1 34:1 Mar 1996 1.1 A 0 1 1996 07 16 0
65 1.2 34:2 Jun 1996 1.2 A 0 1 1996 09 19 0
65 1.3 34:3 Sep 1996 1.3 A 0 1 1996 12 17 0
Many thanks!
I assume this data is in an Excel spreadsheet, with both the numerical values and the value "STARTHERE" in the first column (column A) and the other data in columns B, C, etc.
Basically, I loop through the first column from the top row to the bottom row. If the value in the current cell is not a number, it is set to the value of the cell right above it. If it is a number, we skip to the next cell.
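Sub help()
Dim i As Long
' Show the first column as whole numbers
ActiveSheet.Columns(1).NumberFormat = "0"
' Start at row 2 so that Cells(i - 1, 1) always refers to an existing row
For i = 2 To ActiveSheet.UsedRange.Rows.Count
' A non-numeric cell (e.g. "STARTHERE") takes the value of the cell above it
If Not IsNumeric(Cells(i, 1).Value) Then Cells(i, 1).Value = Cells(i - 1, 1).Value
Next i
End Sub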
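If the records are in a plain text file rather than a spreadsheet, the same fill-down can be sketched in Python. This is only an illustration under two assumptions: the group number sits alone at the start of its line, and "STARTHERE" is always the first token of a data line. The function name `fill_down` is made up.

```python
def fill_down(lines, marker="STARTHERE"):
    current = None  # most recent numeric value seen so far
    out = []
    for line in lines:
        first, _, rest = line.partition(" ")
        if first.isdigit():
            current = first  # remember the new group number
        elif first == marker and current is not None:
            line = current + " " + rest  # replace the marker with the number
        out.append(line)
    return out

sample = """62
STARTHERE 1.1 vol. 84 no. 1 1996 01.1 A 0 1 1996 04 24 0
STARTHERE 1.2 vol. 84 no. 2 1996 01.2 A 0 1 1996 05 23 0
63
STARTHERE 1.1 95:1 Feb 2002 1.1 A 0 1 2002 06 03 0
64
65
STARTHERE 1.1 34:1 Mar 1996 1.1 A 0 1 1996 07 16 0"""

filled = fill_down(sample.splitlines())
print("\n".join(filled))
```

The lines that hold only a number are kept as they are; filtering them out afterwards would be a one-line change if they are not wanted.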

Keep unique IDs depending on frequency of occurrence

Dataset description:
I have a highly unbalanced panel dataset, with some unique panelist IDs appearing only once, while others appear as many as 4,900 times. Each observation reflects an alcohol purchase associated with a unique product identifier (UPC). If a panelist purchased two separate brands (hence, two different UPCs) on the same day in the same store, two distinct observations are created. However, since these purchases were made on the same day and in the same store, I can safely assume that it was just one trip. Similarly, another panelist who also has 2 observations associated with the same store BUT different days of purchase (or vice versa) is assumed to have made 2 store visits.
Task:
I would like to explore qualities of those people who purchased alcohol a certain number of times in the whole period. Thus, I need to identify panelists who made only 1) 1 visit, 2) 2 visits, 3) between 5 and 10 visits, 4) between 50 and 100 visits, etc.
I started by trying to identify panelists who made only 1 visit by tagging them by panelist id, day, and store. However, the program also tags the first occurrence of those who appear twice or more.
egen tag = tag(panid day store)
I also tried collapse but realized that it might not be the best solution because I want to keep my observations "as is" without aggregating any variables.
I would appreciate any insight into how to identify such observations.
UPDATE:
panid units dollars iri_key upc day tag
1100560 1 5.989 234140 00-01-18200-00834 47 1
1101253 1 13.99 652159 00-03-71990-09516 251 1
1100685 1 20.99 652159 00-01-18200-53030 18 1
1100685 1 15.99 652159 00-01-83783-37512 18 0
1101162 1 19.99 652159 00-01-34100-15341 206 1
1101162 1 19.99 652159 00-01-34100-15341 235 1
1101758 1 12.99 652159 00-01-18200-43381 30 1
1101758 1 6.989 652159 00-01-18200-16992 114 1
1101758 1 11.99 652159 00-02-72311-23012 121 1
1101758 2 21.98 652159 00-02-72311-23012 128 1
1101758 1 19.99 652159 00-01-18200-96550 223 1
1101758 1 12.99 234140 00-04-87692-29103 247 1
1101758 1 20.99 234140 00-01-18200-96550 296 1
1101758 1 12.99 234140 00-01-87692-11103 296 0
1101758 1 12.99 652159 00-01-87692-11103 317 1
1101758 1 19.99 652159 00-01-18200-96550 324 1
1101758 1 12.99 652159 00-02-87692-68103 352 1
1101758 1 12.99 652159 00-01-87692-32012 354 1
Hi Roberto, thanks for the feedback. This is a small sample of the dataset.
In the first part of this particular example, we can safely assume that all three ids 1100560, 1101253, and 1100685 visited a store only once, i.e. made only one transaction each. The first two panelists obviously have only one record each, and the third panelist purchased 2 different UPCs in the same store, same day, i.e. in the same transaction.
The second part of the example has two panelists - 1101162 and 1101758 - who made more than one transaction: two and eleven, respectively. (Panelist 1101758 has 12 observations, but only 11 distinct trips.)
I would like to identify an exact number of distinct trips (or transactions) panelists of my dataset made:
panid units dollars iri_key upc day tag total#oftrips
1100560 1 5.989 234140 00-01-18200-00834 47 1 1
1101253 1 13.99 652159 00-03-71990-09516 251 1 1
1100685 1 20.99 652159 00-01-18200-53030 18 1 1
1100685 1 15.99 652159 00-01-83783-37512 18 0 1
1101162 1 19.99 652159 00-01-34100-15341 206 1 2
1101162 1 19.99 652159 00-01-34100-15341 235 1 2
1101758 1 12.99 652159 00-01-18200-43381 30 1 11
1101758 1 6.989 652159 00-01-18200-16992 114 1 11
1101758 1 11.99 652159 00-02-72311-23012 121 1 11
1101758 2 21.98 652159 00-02-72311-23012 128 1 11
1101758 1 19.99 652159 00-01-18200-96550 223 1 11
1101758 1 12.99 234140 00-04-87692-29103 247 1 11
1101758 1 20.99 234140 00-01-18200-96550 296 1 11
1101758 1 12.99 234140 00-01-87692-11103 296 0 11
1101758 1 12.99 652159 00-01-87692-11103 317 1 11
1101758 1 19.99 652159 00-01-18200-96550 324 1 11
1101758 1 12.99 652159 00-02-87692-68103 352 1 11
1101758 1 12.99 652159 00-01-87692-32012 354 1 11
The bottom line, I guess, is that as long as panelist, iri_key, and day are the same, this counts as 1 trip. The total number of trips per panelist is then the number of distinct combinations of panelist, iri_key, and day.
I'm not sure I understand exactly what you want, but here's my guess:
clear all
set more off
*----- example data -----
input ///
id code day store
1 1 86 1
1 1 45 1
1 3 45 1
1 3 4 4
2 1 86 1
2 1 45 1
2 3 45 1
end
format day %td
list, sepby(id)
*----- what you want? -----
egen tag = tag(id day store)
bysort id: egen totvis = total(tag)
bysort id store: egen totvis2 = total(tag)
list, sepby(id)
which will result in:
+--------------------------------------------------------+
| id code day store tag totvis totvis2 |
|--------------------------------------------------------|
1. | 1 3 05jan1960 4 1 3 1 |
2. | 1 1 15feb1960 1 1 3 2 |
3. | 1 3 15feb1960 1 0 3 2 |
4. | 1 1 27mar1960 1 1 3 2 |
|--------------------------------------------------------|
5. | 2 1 15feb1960 1 1 2 2 |
6. | 2 3 15feb1960 1 0 2 2 |
7. | 2 1 27mar1960 1 1 2 2 |
+--------------------------------------------------------+
This means person 1 made a total of 3 visits (considering all stores), and of those, 1 was to store 4 and 2 to store 1. Person 2 made 2 visits, both to store 1.
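As a cross-check on the sample from the question, the counting rule (one trip per distinct combination of panelist, store, and day) can also be sketched in Python. This is an illustration only; the function name `trips_per_panelist` is made up, and each row is reduced to (panid, iri_key, day).

```python
from collections import defaultdict

def trips_per_panelist(rows):
    trips = defaultdict(set)  # panid -> set of distinct (store, day) pairs
    for panid, iri_key, day in rows:
        trips[panid].add((iri_key, day))
    return {panid: len(pairs) for panid, pairs in trips.items()}

# (panid, iri_key, day) taken from the sample in the question.
rows = [
    (1100560, 234140, 47),
    (1101253, 652159, 251),
    (1100685, 652159, 18), (1100685, 652159, 18),
    (1101162, 652159, 206), (1101162, 652159, 235),
    (1101758, 652159, 30), (1101758, 652159, 114),
    (1101758, 652159, 121), (1101758, 652159, 128),
    (1101758, 652159, 223), (1101758, 234140, 247),
    (1101758, 234140, 296), (1101758, 234140, 296),
    (1101758, 652159, 317), (1101758, 652159, 324),
    (1101758, 652159, 352), (1101758, 652159, 354),
]
counts = trips_per_panelist(rows)
for panid, n in counts.items():
    print(panid, n)
```

Panelists can then be grouped by filtering on the count, for example keeping those whose count is exactly 1 or falls between 5 and 10.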