Transposing variables - stata

Is there an easy way to transpose my variables in Stata?
From:
-.48685038 -.13912173 -.91550094 -.96246505
-1.4760038 1.2873173 -.22300169 .25329232
-.01091149 -.58777297 .49454963 2.2842488
-.01376025 -.03060045 -.26231077 .32238093
.51557881 -2.1968436 .36612388 -.40590465
To:
-.48685038 -1.4760038 -.01091149 -.01376025 .51557881
-.13912173 1.2873173 -.58777297 -.03060045 -2.1968436
-.91550094 -.22300169 .49454963 -.26231077 .36612388
-.96246505 .25329232 2.2842488 .32238093 -.40590465
My understanding is that I have to create a matrix first:
mkmat *, matrix(data)
matrix data = data'
svmat data

Try xpose:
. webuse xposexmpl, clear
. list
+--------------------------------+
| county year1 year2 year3 |
|--------------------------------|
1. | 1 57.2 11.3 19.5 |
2. | 2 12.5 8.2 28.9 |
3. | 3 18 14.2 33.2 |
+--------------------------------+
. xpose, clear varname
. list
+-------------------------------+
| v1 v2 v3 _varname |
|-------------------------------|
1. | 1 2 3 county |
2. | 57.2 12.5 18 year1 |
3. | 11.3 8.2 14.2 year2 |
4. | 19.5 28.9 33.2 year3 |
+-------------------------------+
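For readers who want to see the row/column swap spelled out, here is a minimal sketch in Python (not Stata) using the values from the listing above. It only illustrates the idea of observations becoming variables; the names `rows`, `varnames` and `transposed` are made up for the illustration and have nothing to do with how xpose is implemented.

```python
# Rows are observations and columns are variables, as in the first listing.
varnames = ["county", "year1", "year2", "year3"]
rows = [
    [1, 57.2, 11.3, 19.5],
    [2, 12.5, 8.2, 28.9],
    [3, 18, 14.2, 33.2],
]

# zip(*rows) groups the i-th entry of every row, i.e. one old column.
transposed = [list(column) for column in zip(*rows)]

# Print each new row next to the old variable name, like the _varname column.
for name, values in zip(varnames, transposed):
    print(values, name)
```

Each old variable becomes one new observation, which is why the transposed listing has four rows and three value columns.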

How to extract all values that contain part of a particular number and then delete them?

How do you extract all values containing part of a particular number and then delete them?
I have data where the IDs have different lengths, and I want to extract all the IDs that contain a particular number. For example, if the ID ends in "-00", "02" or "-01", I want to pull those rows to see the hit rate, and then delete that suffix from the ID. Is there a more efficient way to write this code?
I tried using the substring function to slice the ID by position, but some other IDs also extend into the specified positions.
Code:
Proc sql;
Create table work.data1 AS
SELECT Product, Amount_sold, Price_per_unit,
CASE WHEN Product Contains "Pen" and Length(ID) >= 9 Then SUBSTR(ID,1,9)
WHEN Product Contains "Book" and Length(ID) >= 11 Then SUBSTR(ID,1,11)
WHEN Product Contains "Folder" and Length(ID) >= 12 Then SUBSTR(ID,1,12)
...
END AS ID
FROM A;
Quit;
Have:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229-01 | Book | 20 | 5 |
| ABC134475472 02 | Folder | 29 | 7 |
| AB-1235674467-00 | Pencil | 26 | 1 |
| 69598346-02 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Desired final result:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229 | Book | 20 | 5 |
| ABC134475472 | Folder | 29 | 7 |
| AB-1235674467 | Pencil | 26 | 1 |
| 69598346 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Test whether the string contains an embedded space or hyphen and whether its last word, delimited by a space or hyphen, is 00, 01 or 02. If both hold, chop off the last three characters.
data have;
infile cards dsd dlm='|' truncover ;
input id :$20. product :$20. amount_sold price_per_unit;
cards;
123456789 | Pen | 30 | 2 |
63495837229-01 | Book | 20 | 5 |
ABC134475472 02 | Folder | 29 | 7 |
AB-1235674467-00 | Pencil | 26 | 1 |
69598346-02 | Correction pen | 15 | 1.50 |
6970457688 | Highlighter | 15 | 2 |
584028467 | Color pencil | 15 | 10 |
;
data want;
set have ;
if indexc(trim(id),'- ') and scan(id,-1,'- ') in ('00' '01' '02') then
id = substrn(id,1,length(id)-3)
;
run;
Result
                                             amount_    price_
Obs    id               product              sold       per_unit
 1     123456789        Pen                  30          2.0
 2     63495837229      Book                 20          5.0
 3     ABC134475472     Folder               29          7.0
 4     AB-1235674467    Pencil               26          1.0
 5     69598346         Correction pen       15          1.5
 6     6970457688       Highlighter          15          2.0
 7     584028467        Color pencil         15         10.0
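There may be other solutions, but you will need some string functions. Here I used substr, reverse (which reverses the string) and indexc (which returns the position of the first of the listed characters found in the string). Note that this version strips whatever follows the last hyphen or space, not only 00, 01 or 02: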
data have;
input text $20.;
datalines;
12345678
AB-142353 00
AU-234343-02
132453 02
221344-09
;
run;
data want (drop=reverted pos);
set have;
if countw(text) gt 1
then do;
reverted=strip(reverse(text));
pos=indexc(reverted,'- ')+1;
new=strip(reverse(substr(reverted,pos)));
end;
else new=text;
run;
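The rule described in the first answer (drop the suffix only when the last hyphen- or space-delimited word is 00, 01 or 02) can also be written as a single pattern. The following is a minimal sketch in Python rather than SAS, using the IDs from the question; the names `SUFFIX`, `strip_suffix`, `hits` and `cleaned` are hypothetical helpers introduced for the illustration.

```python
import re

# A hyphen or space followed by 00, 01 or 02 at the very end of the ID.
SUFFIX = re.compile(r"[- ](?:00|01|02)$")

def strip_suffix(raw_id):
    """Return raw_id without a trailing 00/01/02 suffix and its separator."""
    return SUFFIX.sub("", raw_id.strip())

ids = [
    "123456789",
    "63495837229-01",
    "ABC134475472 02",
    "AB-1235674467-00",
    "69598346-02",
    "6970457688",
    "584028467",
]

# "Hit rate": how many IDs carry one of the suffixes before cleaning.
hits = sum(SUFFIX.search(i.strip()) is not None for i in ids)
cleaned = [strip_suffix(i) for i in ids]
print(hits, cleaned)
```

Because the pattern is anchored at the end of the string, an embedded hyphen such as the one in AB-1235674467 is left alone, and a suffix like -09 is not removed.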

Pandas: consecutive rows' value change comparison

I have a Dataframe with date as index:
Index | Opp id | Pipeline_Type | Amount
20170104 | 1 | Null | 10
20170104 | 2 | Sou | 20
20170104 | 3 | Inf | 25
20170118 | 1 | Inf | 12
20170118 | 2 | Null | 27
20170118 | 3 | Inf | 25
Now I want to count the records (Opp id) for which Pipeline_Type has changed or Amount has changed (+/- diff). For the data above, the count is 2 for Pipeline_Type and also 2 for Amount.
Please help me frame the solution.
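One possible way to frame it, as a sketch rather than a definitive answer: sort by date, compare each row with the previous row of the same Opp id, and count the ids that show at least one change. The sketch assumes that "Null" is stored as a literal string (real NaN values would need extra handling, since NaN never compares equal to NaN) and that the index is named Index as in the table.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Opp id": [1, 2, 3, 1, 2, 3],
        "Pipeline_Type": ["Null", "Sou", "Inf", "Inf", "Null", "Inf"],
        "Amount": [10, 20, 25, 12, 27, 25],
    },
    index=pd.Index([20170104] * 3 + [20170118] * 3, name="Index"),
)

cols = ["Pipeline_Type", "Amount"]

# Move the date into a column so every row has a unique positional index.
df = df.reset_index().sort_values(["Opp id", "Index"]).reset_index(drop=True)

# Previous value of each column within the same Opp id.
prev = df.groupby("Opp id")[cols].shift()

# A row counts as changed if it differs from the previous row of its Opp id.
changed = df[cols].ne(prev)

# The first row of each Opp id has nothing to compare against.
first = df.groupby("Opp id").cumcount() == 0
changed.loc[first] = False

# Number of Opp ids with at least one change, per column.
counts = changed.groupby(df["Opp id"]).any().sum()
print(counts)
```

With more than two dates per Opp id, this counts an id once per column no matter how many times it changed.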

Generating a variable only including the top 4 firms with largest sales

My question is closely related to the question below:
Calculate industry concentration in Stata based on four biggest numbers
I want to generate a variable that includes only the top 4 firms with the largest sales and excludes the rest.
In other words, the new variable will hold values only for the 4 firms with the largest sales in a given industry for a given year, and will be missing (.) for the rest.
Consider this:
webuse grunfeld, clear
bysort year (invest) : gen largest4 = cond(_n < _N - 3, ., invest)
sort year invest
list year largest4 if largest4 < . in 1/40, sepby(year)
+-----------------+
| year largest4 |
|-----------------|
7. | 1935 39.68 |
8. | 1935 40.29 |
9. | 1935 209.9 |
10. | 1935 317.6 |
|-----------------|
17. | 1936 50.73 |
18. | 1936 72.76 |
19. | 1936 355.3 |
20. | 1936 391.8 |
|-----------------|
27. | 1937 74.24 |
28. | 1937 77.2 |
29. | 1937 410.6 |
30. | 1937 469.9 |
|-----------------|
37. | 1938 51.6 |
38. | 1938 53.51 |
39. | 1938 257.7 |
40. | 1938 262.3 |
+-----------------+
If you had missing values, they would sort to the end of each block and mess up the results.
So you need one more trick:
generate OK = !missing(invest)
bysort OK year (invest) : gen Largest4 = cond(_n < _N - 3, ., invest) if OK
sort year invest
list year Largest4 if Largest4 < . in 1/40, sepby(year)
With this example, which you can run, there are no missing values and the results are the same.
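The logic of the second version (set missing values aside, then keep the last four of each sorted block) can be sketched outside Stata as well. Below is a minimal Python illustration on made-up numbers; `largest_n_per_group` and `records` are hypothetical names, and None plays the role of Stata's missing value.

```python
from collections import defaultdict

def largest_n_per_group(records, n=4):
    """records is a list of (group, value) pairs; value may be None.

    Returns a list parallel to records that keeps a value only if it is
    among the n largest non-missing values of its group, else None.
    """
    by_group = defaultdict(list)
    for i, (group, value) in enumerate(records):
        if value is not None:  # missing values never compete
            by_group[group].append((value, i))
    keep = set()
    for members in by_group.values():
        members.sort()  # ascending by value, ties broken by position
        keep.update(i for _, i in members[-n:])
    return [value if i in keep else None
            for i, (_, value) in enumerate(records)]

# Made-up data: one year with five values and a missing one, one short year.
records = [(1935, 10), (1935, None), (1935, 40), (1935, 30),
           (1935, 20), (1935, 50), (1936, 5)]
result = largest_n_per_group(records)
print(result)
```

A group with fewer than four non-missing values keeps all of them, which matches what cond(_n < _N - 3, ., invest) does in a short block.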

Stata: Reshaping dataset – bringing variable labels to variable values

I have a dataset containing different product values generated in each simulation, with the following layout:
+------------+-------+-------+-------+
| simulation | v1 | v2 | v3 |
+------------+-------+-------+-------+
| 1 | 0,500 | 0,400 | 0,300 |
| 2 | 0,900 | 0,800 | 0,800 |
| 3 | 0,100 | 0,200 | 0,300 |
+------------+-------+-------+-------+
The variables v1, v2, v3 are labelled with product ids, and these labels are not displayed in the header of the dataset. I need to reshape this dataset to long format so it would look like this:
+------------+----+----------+-------+
| simulation | id | label | value |
+------------+----+----------+-------+
| 1 | v1 | 01020304 | 0,500 |
| 1 | v2 | 01020305 | 0,400 |
| 1 | v3 | 01020306 | 0,300 |
| 2 | v1 | 01020304 | 0,900 |
| 2 | v2 | 01020305 | 0,800 |
| 2 | v3 | 01020306 | 0,800 |
| 3 | v1 | 01020304 | 0,100 |
| 3 | v2 | 01020305 | 0,200 |
| 3 | v3 | 01020306 | 0,300 |
+------------+----+----------+-------+
The standard code reshape long v, i(simulation) j(_count) is not applicable in this case, as I need to carry the variable labels over and keep them in the dataset as variable values. Is there a way to do this kind of transposition with variable labels?
Only one idea seems needed here. If your variable labels would otherwise disappear, save them in a local macro before a reshape and then apply them as value labels afterwards. The FAQ cited earlier gives the flavour.
Sandpit to play in:
input simulation v1 v2 v3
simulat~n v1 v2 v3
1. 1 0.500 0.400 0.300
2. 2 0.900 0.800 0.800
3. 3 0.100 0.200 0.300
4. end
label var v1 "01020304"
label var v2 "01020305"
label var v3 "01020306"
Sample code:
forval j = 1/3 {
local labels `labels' `j' "`: var label v`j''"
}
reshape long v, i(simulation)
(note: j = 1 2 3)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 3 -> 9
Number of variables 4 -> 3
j variable (3 values) -> _j
xij variables:
v1 v2 v3 -> v
-----------------------------------------------------------------------------
rename v value
label def label `labels'
rename _j label
gen id = label
label val label label
list
+----------------------------------+
| simula~n label value id |
|----------------------------------|
1. | 1 01020304 .5 1 |
2. | 1 01020305 .4 2 |
3. | 1 01020306 .3 3 |
4. | 2 01020304 .9 1 |
5. | 2 01020305 .8 2 |
|----------------------------------|
6. | 2 01020306 .8 3 |
7. | 3 01020304 .1 1 |
8. | 3 01020305 .2 2 |
9. | 3 01020306 .3 3 |
+----------------------------------+

PROC TABULATE WITH TOTAL

I am producing reports with proc tabulate, but I am unable to add a total to a report.
Example
+--------+------+----------+--------+---+---+---+
| Shop | Year | Month | Family | A | B | C |
+--------+------+----------+--------+---+---+---+
| raoas | 2006 | january | TA12 | 5 | 6 | 0 |
| taba | 2008 | january | TS01 | 0 | 1 | 1 |
| suptop | 2008 | april | TZ05 | 0 | 0 | 1 |
| taba | 2006 | December | TA12 | 5 | 6 | 0 |
| raoas | 2008 | january | TA15 | 0 | 2 | 0 |
| sup | 2008 | april | TQ05 | 0 | 1 | 1 |
+--------+------+----------+--------+---+---+---+
code
proc tabulate data=REPORTDATA_T6 format=12.;
CLASS YEAR;
var A C;
table (A C)*SUM='',YEAR=''
/box = 'YEAR';
TITLE 'FORECAST SUMMARY';
run;
output
YEAR 2006 2008 2009
A 800 766 813
C 854 832 812
I tried table (A C)*sum, year all. That sums across all the years, but I want the total by year.
I tried every way I could think of, including table (A C)*sum all, year. That gives the number of observations, i.e. N. Thanks, Jon Clements, but I don't want to add a TOTAL variable to the table. This is sample data; the real data has more than 10 variables, and sometimes I need to change them, so I don't want to add a new total variable every time.
I'm not sure it's possible to do what you want in one step using only the original data. The keyword ALL works only for summing over categories of CLASS variables, but you want to sum over two different variables.
It's easy enough with an interim step, though: create a dataset in which the A, B, C variables become categories of one variable:
data REPORTDATA_T6;
input Shop $ Year Month $ Family $ A B C;
datalines;
raoas 2006 january TA12 5 6 0
taba 2008 january TS01 0 1 1
suptop 2008 april TZ05 0 0 1
taba 2006 December TA12 5 6 0
raoas 2008 january TA15 0 2 0
sup 2008 april TQ05 0 1 1
;
run;
proc sort data=REPORTDATA_T6; by Shop Year Month Family; run;
proc transpose data=REPORTDATA_T6 out=REPORTDATA_T6_long;
var A B C;
by Shop Year Month Family;
run;
proc tabulate data=REPORTDATA_T6_long;
class _NAME_ YEAR;
var COL1;
table (_NAME_ all)*COL1=' '*SUM=' ', YEAR=' '
/box = 'YEAR';
TITLE 'FORECAST SUMMARY';
run;
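To check what the transposed table should contain, the same two steps (reshape to long, then sum per variable and per year with an extra "All" row) can be sketched in Python on the sample data. This is only an illustration of the idea, not a replacement for PROC TABULATE; `long_rows` and `table` are hypothetical names.

```python
from collections import defaultdict

rows = [
    # (shop, year, month, family, A, B, C)
    ("raoas", 2006, "january", "TA12", 5, 6, 0),
    ("taba", 2008, "january", "TS01", 0, 1, 1),
    ("suptop", 2008, "april", "TZ05", 0, 0, 1),
    ("taba", 2006, "December", "TA12", 5, 6, 0),
    ("raoas", 2008, "january", "TA15", 0, 2, 0),
    ("sup", 2008, "april", "TQ05", 0, 1, 1),
]

# Long form: one (variable name, year, value) triple per measure,
# analogous to the _NAME_ and COL1 columns after PROC TRANSPOSE.
long_rows = []
for shop, year, month, family, a, b, c in rows:
    for name, value in (("A", a), ("B", b), ("C", c)):
        long_rows.append((name, year, value))

# Sum per (variable, year) and accumulate an "All" row per year.
table = defaultdict(int)
for name, year, value in long_rows:
    table[(name, year)] += value
    table[("All", year)] += value

for key in sorted(table, key=str):
    print(key, table[key])
```

Because the total is built from the long form, adding or dropping a measure only changes the list of names in the inner loop, which is the point of the interim step.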