Stata: Reshaping dataset – bringing variable labels to variable values - stata

I have a dataset containing different product values generated in each simulation, with the following layout:
+------------+-------+-------+-------+
| simulation | v1 | v2 | v3 |
+------------+-------+-------+-------+
| 1 | 0,500 | 0,400 | 0,300 |
| 2 | 0,900 | 0,800 | 0,800 |
| 3 | 0,100 | 0,200 | 0,300 |
+------------+-------+-------+-------+
The variable names v1, v2, v3 are labelled as product ids and are not displayed at the header of the dataset. I need to reshape this dataset to long format so it would like this:
+------------+----+----------+-------+
| simulation | id | label | value |
+------------+----+----------+-------+
| 1 | v1 | 01020304 | 0,500 |
| 1 | v2 | 01020305 | 0,400 |
| 1 | v3 | 01020306 | 0,300 |
| 2 | v1 | 01020304 | 0,900 |
| 2 | v2 | 01020305 | 0,800 |
| 2 | v3 | 01020306 | 0,800 |
| 3 | v1 | 01020304 | 0,100 |
| 3 | v2 | 01020305 | 0,200 |
| 3 | v3 | 01020306 | 0,300 |
+------------+----+----------+-------+
The standard code reshape long v , i(simulation) j(_count) is not applicable in this case, as I need to reshape the variable labels and keep them in the dataset as variable values. Was wondering if there exists a way to make this kind of transposition with variable labels?

Only one idea seems needed here. If your variable labels would otherwise disappear, save them in a local macro before a reshape and then apply them as value labels afterwards. The FAQ cited earlier gives the flavour.
Sandpit to play in:
input simulation v1 v2 v3
simulat~n v1 v2 v3
1. 1 0.500 0.400 0.300
2. 2 0.900 0.800 0.800
3. 3 0.100 0.200 0.300
4. end
label var v1 "01020304"
label var v2 "01020305"
label var v3 "01020306"
Sample code:
forval j = 1/3 {
local labels `labels' `j' "`: var label v`j''"
}
reshape long v, i(simulation)
(note: j = 1 2 3)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 9
Number of variables 4 -> 3
j variable (3 values) -> _j
xij variables:
v1 v2 v3 -> v
-----------------------------------------------------------------------------
rename v value
label def label `labels'
rename _j label
gen id = label
label val label label
list
+----------------------------------+
| simula~n label value id |
|----------------------------------|
1. | 1 01020304 .5 1 |
2. | 1 01020305 .4 2 |
3. | 1 01020306 .3 3 |
4. | 2 01020304 .9 1 |
5. | 2 01020305 .8 2 |
|----------------------------------|
6. | 2 01020306 .8 3 |
7. | 3 01020304 .1 1 |
8. | 3 01020305 .2 2 |
9. | 3 01020306 .3 3 |
+----------------------------------+

Related

Conditional count Measure

I have data looking like this:
| ID |OpID|
| -- | -- |
| 10 | 1 |
| 10 | 2 |
| 10 | 4 |
| 11 |null|
| 12 | 3 |
| 12 | 4 |
| 13 | 1 |
| 13 | 2 |
| 13 | 3 |
| 14 | 2 |
| 14 | 4 |
Here OpID 4 means 1 and 2.
I would like to count the different occurrences of 1, 2 and 3 in OpID of distinct ID.
If the counts of OpID having 1 would be 4, 2 would be 4, 3 would be 2.
If ID has OpID of 4 but already has data of 1, 2 it wouldn't be counted. But if 4 exists and only 1 (2) is there, count for 2 (1) would be incremented.
The expected output would be:
|OpID|Count|
| 1 | 4 |
| 2 | 4 |
| 3 | 2 |
(Going to be using the results in a column chart)
Hope this makes sense...
edit: there are other columns too and an ID and OpID can be duplicated hence need to do a groupby clause before.

SAS: replicate rows and iterate over columns

I'd like to replicate and modify specific rows in the table.
before:
xyz_id | letter | Col_1 | Col_2|
1 | Z | V1 | W1 |
2 | Z | V2 | W2 |
3 | Z | V3 | W3 |
after:
xyz_id | letter | Col_1 | Col_2|
1 | A | V1.1 | W1.1 |
1 | B | V1.1 | W1.1 |
1 | C | V1.1 | W1.1 |
2 | A | V2.1 | W2.1 |
2 | B | V2.1 | W2.1 |
2 | C | V2.1 | W2.1 |
3 | A | V3.1 | W3.1 |
3 | B | V3.1 | W3.1 |
3 | C | V3.1 | W3.1 |
I've prepared the following code:
data test2;
set test;
array letters {8} $8 _temporary_ ('A', 'B', 'C');
array weights {8} _temporary_ (1,2,3);
array nvars {2} Col_1 Col_2;
do i = 1 to 8;
letter = letters(i);
do j=1 to 2;
nvar{j} = nvar{j} * weights(i);
end;
output;
end;
drop i;
run;
but it doesn't work. Any suggestions?
You reference the nvar array in the j loop, but the array is called nvars.
Col_1 and Col_2 are characters (e.g. 'V1'), so you cannot multiply these by a number and expect a valid result.
Your letters and weights arrays do not have enough values defined for the number of elements (3 vs 8).

Create column to classify rows based on realted tables DAX PowerBI

I have simplified my problem to solve. Lets suppose I have three tables. One containing data and specific codes that identify objects lets say Apples.
+-------------+------------+-----------+
| Data picked | Color code | Size code |
+-------------+------------+-----------+
| 1-8-2018 | 1 | 1 |
| 1-8-2018 | 1 | 3 |
| 1-8-2018 | 2 | 2 |
| 1-8-2018 | 2 | 3 |
| 1-8-2018 | 2 | 2 |
| 1-8-2018 | 3 | 3 |
| 1-8-2018 | 4 | 1 |
| 1-8-2018 | 4 | 1 |
| 1-8-2018 | 5 | 3 |
| 1-8-2018 | 6 | 1 |
| 1-8-2018 | 6 | 2 |
| 1-8-2018 | 6 | 2 |
+-------------+------------+-----------+
And i have two related helping tables to help understand the codes (their relationships are inactive in the model due to ambiguity with other tables in the real case).
+-----------+--------+
| Size code | Size |
+-----------+--------+
| 1 | Small |
| 2 | Medium |
| 3 | Large |
+-----------+--------+
and
+------------+----------------+-------+
| Color code | Color specific | Color |
+------------+----------------+-------+
| 1 | Light green | Green |
| 2 | Green | Green |
| 3 | Semi green | Green |
| 4 | Red | Red |
| 5 | Dark | Red |
| 6 | Pink | Red |
+------------+----------------+-------+
Lets say that I want to create an extra column in the original table to determine which apples are class A and class B given that medium green Apples are class A and large Red apples are class B, the other remain blank as the example below.
+-------------+------------+-----------+-------+
| Data picked | Color code | Size code | Class |
+-------------+------------+-----------+-------+
| 1-8-2018 | 1 | 1 | |
| 1-8-2018 | 1 | 3 | |
| 1-8-2018 | 2 | 2 | A |
| 1-8-2018 | 2 | 3 | |
| 1-8-2018 | 2 | 2 | A |
| 1-8-2018 | 3 | 3 | |
| 1-8-2018 | 4 | 1 | |
| 1-8-2018 | 4 | 1 | |
| 1-8-2018 | 5 | 3 | B |
| 1-8-2018 | 6 | 1 | |
| 1-8-2018 | 6 | 2 | |
| 1-8-2018 | 6 | 2 | |
+-------------+------------+-----------+-------+
What's the proper DAX to use given the relationships are initially inactive. Preferably solvable without creating any further additional columns in any table. I already tried codes like:
CALCULATE (
"A" ;
FILTER ( 'Size Table' ; 'Size Table'[Size] = "Medium");
FILTER ( 'Color Table' ; 'Color Table'[Color] = "Green")
)
And many variations on the same principle
Given that the relationships are inactive, I'd suggest using LOOKUPVALUE to match ID values on the other tables. You should be able to create a calculated column as follows:
Class =
VAR Size = LOOKUPVALUE('Size Table'[Size],
'Size Table'[Size code], 'Data Table'[Size code])
VAR Color = LOOKUPVALUE('Color Table'[Color],
'Color Table'[Color code], 'Data Table'[Color code])
RETURN SWITCH(TRUE(),
(Size = "Medium") && (Color = "Green"), "A",
(Size = "Large") && (Color = "Red"), "B", BLANK())
If your relationships are active, then you don't need the lookups:
Class = SWITCH(TRUE(),
(RELATED('Size Table'[Size]) = "Medium") &&
(RELATED('Color Table'[Color]) = "Green"),
"A",
(RELATED('Size Table'[Size]) = "Large") &&
(RELATED('Color Table'[Color]) = "Red"),
"B",
BLANK())
Or a bit more elegantly written (especially for more classes):
Class =
VAR SizeColor = RELATED('Size Table'[Size]) & " " & RELATED('Color Table'[Color])
RETURN SWITCH(TRUE(),
SizeColor = "Medium Green", "A",
SizeColor = "Large Red", "B",
BLANK())

Pandas: consecutive rows' value change comparison

I have a Dataframe with date as index:
Index | Opp id | Pipeline_Type |Amount
20170104 | 1 | Null | 10
20170104 | 2 | Sou | 20
20170104 | 3 | Inf | 25
20170118 | 1 | Inf | 12
20170118 | 2 | Null | 27
20170118 | 3 | Inf | 25
Now I want to calculate number of records(Opp id) for which Pipeline type has changed or amount has changed (+/-diff). Above no of records will be 2 for pipeline_type as well as for amount.
Please help me frame the solution.

Stata: Cumulative number of new observations

I would like to check if a value has appeared in some previous row of the same column.
At the end I would like to have a cumulative count of the number of distinct observations.
Is there any other solution than concenating all _n rows and using regular expressions? I'm getting there with concatenating the rows, but given the limit of 244 characters for string variables (in Stata <13), this is sometimes not applicable.
Here's what I'm doing right now:
gen tmp=x
replace tmp = tmp[_n-1]+ "," + tmp if _n > 1
gen cumu=0
replace cumu=1 if regexm(tmp[_n-1],x+"|"+x+",|"+","+x+",")==0
replace cumu= sum(cumu)
Example
+-----+
| x |
|-----|
1. | 12 |
2. | 32 |
3. | 12 |
4. | 43 |
5. | 43 |
6. | 3 |
7. | 4 |
8. | 3 |
9. | 3 |
10. | 3 |
+-----+
becomes
+-------------------------------+
| x | tmp |
|-----|--------------------------
1. | 12 | 12 |
2. | 32 | 12,32 |
3. | 12 | 12,32,12 |
4. | 43 | 3,32,12,43 |
5. | 43 | 3,32,12,43,43 |
6. | 3 | 3,32,12,43,43,3 |
7. | 4 | 3,32,12,43,43,3,4 |
8. | 3 | 3,32,12,43,43,3,4,3 |
9. | 3 | 3,32,12,43,43,3,4,3,3 |
10. | 3 | 3,32,12,43,43,3,4,3,3,3|
+--------------------------------+
and finally
+-----------+
| x | cumu|
|-----|------
1. | 12 | 1 |
2. | 32 | 2 |
3. | 12 | 2 |
4. | 43 | 3 |
5. | 43 | 3 |
6. | 3 | 4 |
7. | 4 | 5 |
8. | 3 | 5 |
9. | 3 | 5 |
10. | 3 | 5 |
+-----------+
Any ideas how to avoid the 'middle step' (for me that gets very important when having strings in x instead of numbers).
Thanks!
Regular expressions are great, but here as often elsewhere simple calculations suffice. With your sample data
. input x
x
1. 12
2. 32
3. 12
4. 43
5. 43
6. 3
7. 4
8. 3
9. 3
10. 3
11. end
end of do-file
you can identify first occurrences of each distinct value:
. gen long order = _n
. bysort x (order) : gen first = _n == 1
. sort order
. l
+--------------------+
| x order first |
|--------------------|
1. | 12 1 1 |
2. | 32 2 1 |
3. | 12 3 0 |
4. | 43 4 1 |
5. | 43 5 0 |
|--------------------|
6. | 3 6 1 |
7. | 4 7 1 |
8. | 3 8 0 |
9. | 3 9 0 |
10. | 3 10 0 |
+--------------------+
The number of distinct values seen so far is then just a cumulative sum of first using sum(). This works with string variables too. In fact this problem is one of several discussed within
http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
which is accessible to all as a .pdf. search distinct would have pointed you to this article.
Becoming fluent with what you can do with by:, sort, _n and _N is an important skill in Stata. See also
http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
for another article accessible to all.