Reading and halving a sas data set - sas

I have to read a data set of 50 numbers from a text file. The numbers are space-delimited and spread across multiple lines of uneven length. For example:
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15
16 17 18 19 20 21
Etc.
The first 25 numbers belong to group 1, and the 2nd 25 belong to group 2. So I need to make a group variable (binary either 1 or 2), a count number (1 to 25), and a value variable which is holding the value of the number.
I am stuck on how to split the data in half when reading it. I tried to use truncover but it did not work.

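Try something like this, replacing datalines in the infile statement with the quoted path to your file (and dropping the datalines block). The trailing @@ on the input statement holds the current line across iterations of the data step, so each number becomes its own observation no matter how the lines are broken:
data groups;
infile datalines;
format number 8. counter 2. group 1.; * Not mandatory, used here to order variables;
retain group (1);
input number @@;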
counter + 1;
if counter = 26 then do;
group = 2;
counter = 1;
end;
datalines;
192 105 435 448 160 499 184 246 388 190 316
139 146 147 192 231 449 101 216 342 399 352 122 418
280 400 187 352 321 180 425 500 320 179 105
232 105 323 132 106 255 449
186 135 472 174 119 255
308 350
run;
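For comparison, the same bookkeeping can be sketched outside SAS. This is only a minimal Python illustration, not part of the answer above: it assumes the numbers arrive as whitespace-separated text on uneven lines, and the function name assign_groups and its group_size parameter are made up for the example.

```python
def assign_groups(text, group_size=25):
    """Split whitespace-delimited numbers into rows of (group, counter, value)."""
    # split() with no arguments ignores line breaks, so uneven lines do not matter
    values = [int(tok) for tok in text.split()]
    rows = []
    for i, value in enumerate(values):
        group = i // group_size + 1   # 1 for the first 25 values, 2 for the next 25
        counter = i % group_size + 1  # restarts at 1 in each group
        rows.append((group, counter, value))
    return rows


sample = """192 105 435 448 160 499 184 246 388 190 316
139 146 147 192 231 449 101 216 342 399 352 122 418
280 400 187 352 321 180 425 500 320 179 105
232 105 323 132 106 255 449
186 135 472 174 119 255
308 350"""

rows = assign_groups(sample)
print(rows[0], rows[24], rows[25], rows[49])
```

The integer division and remainder do the same job as the retained group variable and the counter reset in the data step.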

SAS transpose data into long form

I have the following dataset and hoping to transpose it into long form:
data have ;
input Height $ Front Middle Rear ;
cards;
Low 125 185 126
Low 143 170 136
Low 150 170 129
Low 138 195 136
Low 149 162 147
Medium 141 176 128
Medium 137 161 133
Medium 145 167 148
Medium 150 165 145
Medium 130 184 141
High 129 157 149
High 141 152 137
High 148 186 138
High 130 164 126
High 137 176 138
;
run;
Here Height is low, medium and high. Location is front, middle and rear. Numerical values are prices by location and height of a book on a bookshelf.
I'm hoping to transpose the dataset into long form with columns:
Height, Location and Price
The following code only allows me to transpose Location into long form. How should I transpose Height at the same time?
data bookp;
set bookp;
dex = _n_;
run;
proc sort data=bookp;
by dex;
run;
proc transpose data=bookp
out=bookpLong (rename=(col1=price _name_= location )drop= _label_ dex);
var front middle rear;
by dex;
run;
I think you just need to include HEIGHT in the BY statement.
First let's convert your example data into a SAS dataset.
data have ;
input Height $ Front Middle Rear ;
cards;
Low 125 185 126
Low 143 170 136
Medium 141 176 128
Medium 137 161 133
High 129 157 149
High 141 152 137
;
Now let's add an identifier to uniquely identify each row. Note that if you are really reading the data with a data step you could do this in the same step that reads the data.
data with_id ;
row_num+1;
set have;
run;
Now we can transpose.
proc transpose data=with_id out=want (rename=(_name_=Location col1=Price));
by row_num height ;
var front middle rear ;
run;
Results:
Obs row_num Height Location Price
1 1 Low Front 125
2 1 Low Middle 185
3 1 Low Rear 126
4 2 Low Front 143
5 2 Low Middle 170
6 2 Low Rear 136
7 3 Medium Front 141
8 3 Medium Middle 176
9 3 Medium Rear 128
10 4 Medium Front 137
11 4 Medium Middle 161
12 4 Medium Rear 133
13 5 High Front 129
14 5 High Middle 157
15 5 High Rear 149
16 6 High Front 141
17 6 High Middle 152
18 6 High Rear 137
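To make the reshaping itself explicit, here is a rough Python sketch of the same wide-to-long step. It is an illustration only, assuming the rows are held as dictionaries; the helper name to_long is hypothetical and has nothing to do with PROC TRANSPOSE internals.

```python
def to_long(rows, id_name, value_names):
    """Turn wide rows (dicts) into long rows of (id value, column name, cell value)."""
    long_rows = []
    for row in rows:
        # each wide row contributes one long row per value column
        for name in value_names:
            long_rows.append((row[id_name], name, row[name]))
    return long_rows


wide = [
    {"Height": "Low", "Front": 125, "Middle": 185, "Rear": 126},
    {"Height": "Medium", "Front": 141, "Middle": 176, "Rear": 128},
    {"Height": "High", "Front": 129, "Middle": 157, "Rear": 149},
]

long_rows = to_long(wide, "Height", ["Front", "Middle", "Rear"])
for height, location, price in long_rows:
    print(height, location, price)
```

Carrying the id value into every output row is what listing HEIGHT in the BY statement achieves.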

does this code generate random numbers?

a = 100
for b in range(10, a):
    c = b % 10
    if c == 0:
        c += 3
    c = c * b
    print(c)
I was trying to make a random generator without using random function and I made this, does it generate random numbers?
Short Answer:
No.
Your code will print
30 11 24 39 56 75 96 119 144 171 60 21 44 69 96 125 156 189 224 261 90 31 64 99 136 175 216 259 304 351 120 41 84 129 176 225 276 329 384 441 150 51 104 159 216 275 336 399 464 531 180 61 124 189 256 325 396 469 544 621 210 71 144 219 296 375 456 539 624 711 240 81 164 249 336 425 516 609 704 801 270 91 184 279 376 475 576 679 784 891
every time.
Computers and programs like these are deterministic. If you sat down with pen and paper, you could tell me exactly which of these numbers would occur and when.
Random number generation is difficult. What I would recommend is using the time to (seemingly) randomize the output.
import time
print(int(time.time() % 10))
This will give you a "random" number between 0 and 9.
time.time() returns the number of seconds since the epoch as a floating-point number, so we cast it to an int to get a whole number. Calls made within the same second therefore return the same digit.
Caveat: This solution is not truly random, but will act in a much more "random" fashion.
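If you want to convince yourself of the determinism, a small check like the following may help. It wraps the loop from the question in a function (the name sequence is just for this sketch) and compares two runs.

```python
def sequence(a=100):
    """Return the values the loop in the question prints, as a list."""
    out = []
    for b in range(10, a):
        c = b % 10
        if c == 0:
            c += 3
        out.append(c * b)
    return out


first = sequence()
second = sequence()
print(first[:5])        # the first five values
print(first == second)  # True: nothing in the loop varies between runs
```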

Trying to get rid of commas in numbers with regex in R gives strange output

I'm new to R. This is my data (using dplyr):
> withCommas
Source: local data frame [326 x 1]
NA
1 16,244,600
2 8,227,103
3 5,959,718
4 3,428,131
5 2,612,878
6 2,471,784
7 2,252,664
8 2,014,775
9 2,014,670
10 1,841,710
.. ...
Classes ‘tbl_df’ and 'data.frame': 326 obs. of 1 variable:
$ : Factor w/ 207 levels ""," 1,008 "," 1,129 ",..: 40 178 143 100 66 63 61 58 57 16 ...
I'm trying to get rid of the commas (so the first row should be 16244600). So I tried the following:
#1st try
noCommas <- gsub("([0-9]+)\\,([0-9])", "\\1\\2", withCommas)
#2nd try
noCommas <- gsub(",", "", withCommas)
In all cases, I got this output:
[1] "c(40 178 143 100 66 63 61 58 57 16 14 11 9 6 4 182 176 174 170 161 148 147 139 137 136 134 118 117 116 114 113 109 107 105 95 93 92 90 89 88 87 84 83 78 75 74 73 72 71 70 56 55 49 47 43 42 39 28 25 24 23 190 188 181 172 165 163 162 160 153 152 151 150 149 146 145 144 138 132 131 130 129 128 127 126 125 124 115 112 111 110 106 98 97 96 94 86 85 82 81 80 77 76 69 68 54 52 51 50 46 45 44 41 \n38 37 36 35 34 33 32 31 30 29 27 26 22 21 20 19 18 17 187 186 185 184 183 179 177 169 168 167 166 159 158 157 156 155 142 141 140 122 121 120 119 104 103 102 101 99 67 65 64 62 60 59 15 13 12 10 8 7 5 3 2 189 180 175 173 173 171 164 154 135 133 108 91 79 53 48 123 1 191 191 191 191 191 1 191 191 191 191 191 191 191 191 191 191 191 191 191 191 191 191 191 191 1 206 1 205 200 202 198 201 196 \n195 204 194 199 193 203 197 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1)"
This seems very strange to me as I don't understand where the numbers are coming from. Any help appreciated.
Edit:
Only the first 225 rows of the variable withCommas have values. After that, the values of the column are empty.
Source: http://data.worldbank.org/data-catalog/GDP-ranking-table
CSV: https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
What about this solution? I think the main problem is that a data frame is a list, while gsub expects a character vector. When you pass the whole data frame, gsub is applied to the list rather than to the elements inside it, which is why you see the factor's integer codes collapsed into one string. That is the reason for the apply call. Or, if there is only one column, you can pass just that column as a vector with ddf$column_with_commas, as in the solutions provided by other users.
as.data.frame(apply(ddf, 2, function(x) as.numeric(gsub(",", "", x))))
NA.
1 16244600
2 8227103
3 5959718
4 3428131
5 2612878
6 2471784
7 2252664
8 2014775
9 2014670
10 1841710
Data
ddf <- structure(list(NA. = structure(c(2L, 10L, 9L, 8L, 7L, 6L, 5L,
4L, 3L, 1L), .Label = c("1,841,710", "16,244,600", "2,014,670",
"2,014,775", "2,252,664", "2,471,784", "2,612,878", "3,428,131",
"5,959,718", "8,227,103"), class = "factor")), .Names = "NA.", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
data <- read.table(header=F, text="1 16,244,600
2 8,227,103
3 5,959,718
4 3,428,131
5 2,612,878
6 2,471,784
7 2,252,664
8 2,014,775
9 2,014,670
10 1,841,710 ")
colnames(data) <- c("a","b")
data$b <- as.numeric(gsub(",", "", data$b))
Output:
a b
1 1 16244600
2 2 8227103
3 3 5959718
4 4 3428131
5 5 2612878
6 6 2471784
7 7 2252664
8 8 2014775
9 9 2014670
10 10 1841710
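The underlying idea, cleaning each value rather than the container as a whole, is language independent. Below is a minimal Python sketch of it; the helper parse_number is hypothetical, and treating blank cells as None is an assumption made here because the question mentions empty rows.

```python
def parse_number(text):
    """Convert a string such as ' 16,244,600 ' to an int; return None if blank."""
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned else None


with_commas = ["16,244,600", "8,227,103", " 1,008 ", ""]
# apply the conversion to every element, not to the list as a whole
no_commas = [parse_number(s) for s in with_commas]
print(no_commas)
```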

Creating statistical data from a table

I have a table with 20 columns of measurements. I would like to 'convert' the table into a table with 20 rows and columns of Avg, Min, Max, StdDev, Count types of information. There is another question like this but it was for the 'R' language. Other question here.
I could do the following for each column (processing the results with C++):
Select Count(Case When [avgZ_l1] <= 0.15 and avgZ_l1 > 0 then 1 end) as countValue1,
Count(case when [avgZ_l1] <= 0.16 and avgZ_l1 > 0.15 then 1 end) as countValue2,
Count(case when [avgZ_l1] <= 0.18 and avgZ_l1 > 0.16 then 1 end) as countValue3,
Count(case when [avgZ_l1] <= 0.28 and avgZ_l1 > 0.18 then 1 end) as countValue4,
Avg(avgwall_l1) as avg1, Min(avgwall_l1) as min1, Max(avgZ_l1) as max1,
STDEV(avgZ_l1) as stddev1, count(*) as totalCount from myProject.dbo.table1
But I do not want to process the 50,000 records 20 times (once for each column). I thought there would be a way to 'pivot' the table onto its side and process the data at the same time. I have seen examples of 'Pivot', but they all seem to pivot on an integer type field, such as a month number or device ID. Once the table is converted I could then fetch each row with C++. Maybe this is really just 'Insert into ... select ... from' statements.
Would the fastest (execution time) approach be to simply create a really long select statement that returns all the information I want for all the columns?
We might end up with 500,000 rows. I am using C++ and SQL 2014.
Any thoughts or comments are welcome. I just don't want to have my naive code used as a shining example of how NOT to do something... ;)...
If your table looks like the one in the R question you linked, the following query should work for you. It computes the statistics you asked for and pivots the result at the same time.
create table #temp(ID int identity(1,1),columnName nvarchar(50));
insert into #temp
SELECT COLUMN_NAME as columnName
FROM myProject.INFORMATION_SCHEMA.COLUMNS -- change myProject to the name of your database, unless myProject is your database
WHERE TABLE_NAME = N'table1'; -- change table1 to the table you're looking at, unless table1 is your table
declare @TableName nvarchar(50) = 'table1'; -- change table1 to your table again
declare @loop int = 1;
declare @query nvarchar(max) = '';
declare @columnName nvarchar(50);
declare @endQuery nvarchar(max)='';
while (@loop <= (select count(*) from #temp))
begin
    set @columnName = (select columnName from #temp where ID = @loop);
    set @query = 'select t.columnName, avg(['+@columnName+']) as Avg ,min(['+@columnName+']) as min ,max(['+@columnName+'])as max ,stdev(['+@columnName+']) as STDEV,count(*) as totalCount from '+@TableName+' join #temp t on t.columnName = '''+@columnName+''' group by t.columnName';
    set @loop += 1;
    set @endQuery += 'union all('+ @query + ')';
end;
set @endQuery = stuff(@endQuery,1,9,''); -- strip the leading 'union all'
Execute(@endQuery);
drop table #temp;
It creates a #temp table which stores your column headings next to an ID. It then uses the ID to loop through your columns, generating one query per column that selects what you want, and unions the queries together. This works for any number of columns, so if you add or remove columns it should still give the correct result.
With this input:
age height_seca1 height_chad1 height_DL weight_alog1
1 19 1800 1797 180 70
2 19 1682 1670 167 69
3 21 1765 1765 178 80
4 21 1829 1833 181 74
5 21 1706 1705 170 103
6 18 1607 1606 160 76
7 19 1578 1576 156 50
8 19 1577 1575 156 61
9 21 1666 1665 166 52
10 17 1710 1716 172 65
11 28 1616 1619 161 66
12 22 1648 1644 165 58
13 19 1569 1570 155 55
14 19 1779 1777 177 55
15 18 1773 1772 179 70
16 18 1816 1809 181 81
17 19 1766 1765 178 77
18 19 1745 1741 174 76
19 18 1716 1714 170 71
20 21 1785 1783 179 64
21 19 1850 1854 185 71
22 31 1875 1880 188 95
23 26 1877 1877 186 106
24 19 1836 1837 185 100
25 18 1825 1823 182 85
26 19 1755 1754 174 79
27 26 1658 1658 165 69
28 20 1816 1818 183 84
29 18 1755 1755 175 67
It will produce this output:
avg min max stdev totalcount
age 20 17 31 3.3 29
height_seca1 1737 1569 1877 91.9 29
height_chad1 1736 1570 1880 92.7 29
height_DL 173 155 188 9.7 29
weight_alog1 73 50 106 14.5 29
Hope this helps and works for you. :)
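For a quick cross-check of the numbers outside the database, the per-column summary can be sketched in a few lines of Python. This is only an illustration on the first four rows and three of the columns of the sample input; the function name column_stats is hypothetical, and statistics.stdev is used because it is the sample standard deviation, like STDEV in T-SQL.

```python
import statistics


def column_stats(rows, names):
    """Return one (name, avg, min, max, stdev, count) tuple per column."""
    columns = list(zip(*rows))  # turn the row-oriented table on its side
    result = []
    for name, values in zip(names, columns):
        result.append((
            name,
            statistics.mean(values),
            min(values),
            max(values),
            statistics.stdev(values),  # sample standard deviation
            len(values),
        ))
    return result


names = ["age", "height_seca1", "weight_alog1"]
rows = [
    (19, 1800, 70),
    (19, 1682, 69),
    (21, 1765, 80),
    (21, 1829, 74),
]

stats = column_stats(rows, names)
for stat in stats:
    print(stat)
```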

How to generate random even numbers between a given interval?

I want to get a set of random even numbers between 50 and 100, and this is what I wrote:
int x;
x=(2*(50+rand()%(100-50+1)));
when I output this, I get
186
166
112
190
150
160
146
104
194
168
194
178
102
200
192
130
168
134
146
184
136
which are not in between 50 and 100...why?
thanks for helping me!
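Your computation is wrong: it doubles a number between 50 and 100, so the result lies between 100 and 200.
Go with
x = 2 * ( rand() % 26 ) + 50;
Here rand() % 26 yields 0 to 25, so x is an even number from 50 to 100 inclusive.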
int x;
x=50+(2*(rand()%(26)));
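The general formula for a random integer in [lower, upper] is
x = rand()%(upper-lower+1) + lower
In your case: x = rand()%51 + 50
Note that this gives every integer from 50 to 100, odd ones included. To get only even values, pick a random index and double it, as in x=50+(2*(rand()%(26))).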
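As a sanity check on the arithmetic, the "random index times two plus offset" idea can be sketched in Python. This is an illustration under assumptions, not C++ code: random.randrange stands in for rand() % n, the function name random_even is made up, and lower is assumed to be even.

```python
import random


def random_even(lower=50, upper=100):
    """Return a random even integer in [lower, upper]; lower is assumed to be even."""
    steps = (upper - lower) // 2 + 1            # 26 even values from 50 to 100
    return lower + 2 * random.randrange(steps)  # randrange(26) plays the role of rand() % 26


samples = [random_even() for _ in range(1000)]
print(min(samples), max(samples))
```

Every sample stays even and inside the interval because the random part only ever adds a multiple of 2 between 0 and 50.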