I need advice on how to split up a dataset efficiently (around 7 million rows and 280 columns).
My dataset contains the columns 'department' and 'classid', which are not unique.
I would like to split my dataset by the department variable, subject to a maximum number of observations per output dataset (100k). Another constraint is illustrated by the following examples:
Ex. 1:
math_1 - 10k rows
math_2 - 80k rows
math_3 - 20k rows
Result 1:
math_1 + math_2 -> 90k rows -> OK
math_3 -> 20k rows -> OK
Ex. 2:
math_1 - 90k rows
math_2 - 80k rows
math_3 - 10k rows
Result 2.1:
math_1 + math_2 -> 100k rows (90k from math_1, 10k from math_2) -> not OK
math_2 + math_3 -> 80k rows (70k from math_2, 10k from math_3) -> not OK
math_2 is split across 2 tables although it would fit into one, so it should be split like this:
Result 2.2:
math_1 -> 90k rows -> OK
math_2 + math_3 -> 90k rows -> OK
Even if math_2 did not fit into one table, I would not want its rows mixed with rows from another original table.
I tried to solve it with hash tables but am simply running out of memory because of the huge number of columns.
Not sure what hash tables have to do with this.
I would first summarize the data by Department and ClassID and put the counts into a table. Then you can go down that table and create a new variable called group: if the running total exceeds your limit (X), increment group and reset the total; otherwise keep the same group. This gives you a variable that describes your target file structure.
Then use that data set with the groups to build your table split. I would recommend CALL EXECUTE or DOSUBL to split the data into the subsets.
7 million rows at a maximum of 90K-100K per table works out to roughly 70-80 data sets...but it'll be a nightmare to work with, because it's not designed logically and you'll struggle to know where to go to get your data. So you'll always need to reference this grouping table anyway.
data have;
input department $ classID $ num_records;
cards;
A math1 500
A math2 500
A math3 200
A math4 100
;
run;
data groups;
    set have;
    retain running_total;
    running_total = sum(running_total, num_records);
    /* once the running total reaches the limit (500 in this toy example),
       start a new group and restart the count with the current class */
    if running_total >= 500 then do;
        group + 1;
        running_total = num_records;
    end;
run;
Use this grouping data set with CALL EXECUTE or DOSUBL, as mentioned above, to create the subsets if really, really desired.
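A minimal CALL EXECUTE sketch along those lines might look like the step below. It assumes a detail data set named original whose classID values match the summary table, and the output names split_1, split_2, ... are purely illustrative.
data _null_;
    set groups;
    by group;
    /* open a new DATA step at the start of each group */
    if first.group then
        call execute(cats('data split_', group, '; set original; where classID in ('));
    /* append this class to the WHERE list */
    call execute(' ' || quote(strip(classID)));
    /* close and submit the generated step at the end of the group */
    if last.group then
        call execute('); run;');
run;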
Create a test dataset to play with:
data test; set original(keep=department classid); run;
Use PROC TABULATE to get an overview of departments and classids.
Use PROC SORT; BY department classid; to sort your data.
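A minimal sketch of those two steps, using the test and original data set names from above:
proc tabulate data=test;
    class department classid;
    table department*classid, n; /* row count per department/classid combination */
run;
proc sort data=original;
    by department classid;
run;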
Write SAS code that writes SAS code to split your data:
data _null_;
put 'data classid1; set original; if classid="math_1"; run;';
So the code for splitting looks like this:
data classid1;
set original;
if classid="math_1";
run;
data classid2;
set original;
if classid="math_2";
run;
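To make this dynamic rather than hand-written, one option (a sketch, assuming original is sorted by department and classid and that the concatenated department/classid values form valid data set names) is to write the generated steps to a temporary file and %INCLUDE it:
filename gencode temp;
data _null_;
    set original;
    by department classid;
    file gencode;
    /* write one splitting DATA step per department/classid combination */
    if first.classid then do;
        put 'data ' department +(-1) '_' classid +(-1) '; set original;';
        put '  if department="' department +(-1) '" and classid="' classid +(-1) '";';
        put 'run;';
    end;
run;
%include gencode;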
Suppose I have the following database:
DATA have;
INPUT id date gain;
CARDS;
1 201405 100
2 201504 20
2 201504 30
2 201505 30
2 201505 50
3 201508 200
3 201509 200
3 201509 300
;
RUN;
I want to create a new table want where the average of the variable gain is grouped by id and by date. The final database should look like this:
DATA want;
INPUT id date average_gain;
CARDS;
1 201405 100
2 201504 25
2 201505 40
3 201508 200
3 201509 250
;
RUN;
I tried to obtain the desired result using the code below but it didn't work:
PROC sql;
CREATE TABLE want as
SELECT *,
mean(gain) as average_gain
FROM have
GROUP BY id, date
ORDER BY id, date
;
QUIT;
It's the asterisk that's causing the issue. That will resolve to id, date, gain, which is not what you want. ANSI SQL would not allow this type of functionality, so it's one way in which SAS differs from other SQL implementations.
There should be a note in the log about remerging with the original data, which is essentially what's happening. The summary values are remerged to every line.
To avoid this, list your group by fields in your query and it will work as expected.
PROC sql;
CREATE TABLE want as
SELECT id, date,
mean(gain) as average_gain
FROM have
GROUP BY id, date
ORDER BY id, date
;
QUIT;
I will say, in general, PROC MEANS is usually a better option (see the sketch after this list) because:
it can calculate multiple variables & statistics without the need to list them all out multiple times
it can get results at multiple levels, for example totals at the grand total, id and group level
not all statistics can be calculated within PROC SQL
it supports variable lists, so you can shortcut-reference long lists of variables without any issues
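For this example, a minimal PROC MEANS sketch of the same id/date summary (the output name want_means is just illustrative):
proc means data=have noprint nway;
    class id date;
    var gain;
    /* one row per id/date combination; drop the automatic _TYPE_ and _FREQ_ variables */
    output out=want_means (drop=_type_ _freq_) mean=average_gain;
run;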
I have a table of sales from multiple stores with the value of sales in dollars and the date and the corresponding store.
In another table I have the store name and the expected sales amount of each store.
I want to create a column in the first (main) table that evaluates the efficiency of sales based on the other table.
In other words, if store B made 500 sales today, I want to check the lookup table for its target, divide by it to obtain the efficiency, and then graph the efficiency of each store.
Thanks.
I tried creating some measures and columns but got stuck with circular dependencies.
I expect to add one column to my main table, an integer from 0 to 100, showing the efficiency.
You can merge the two tables. In the query editor, go to Merge Queries > Merge Query As New. Choose your relationship (match it by the StoreName column) and merge the two tables. You will get something like this (using just a few rows of your sample data):
StoreName ActualSaleAmount ExpectedAmount
a 500 3000
a 450 3000
b 370 3500
c 400 5000
Now you can add a calculated column with your efficiency:
StoreName ActualSaleAmount ExpectedAmount Efficiency
a 500 3000 500/3000
a 450 3000 450/3000
b 370 3500 370/3500
c 400 5000 400/5000
This would be:
Efficiency = [ActualSaleAmount] / [ExpectedAmount]
I have a SAS dataset with the columns shiyas1, shiyas2, shiyas3 in it. That dataset has some other columns as well. I want to combine all the columns whose header contains shiyas.
We can't use cats(shiyas1,shiyas2,shiyas3) because similar datasets have columns up to shiyas10. As I am generating general SAS code, we cannot hard-code cats(shiyas1,shiyas2 .... shiyas10).
So how can we do this?
When I tried to use cats(shiyas1,shiyas2 .... shiyas10), even though my dataset only has columns up to shiyas3, it created columns shiyas4 to shiyas10 filled with missing values (.).
So one solution is to concatenate only the shiyas columns the dataset actually has, or to delete the unnecessary shiyas columns afterwards...
Please help me.
Use a variable list.
data have;
input (shiyas1-shiyas3) (:$1.);
cards;
1 2 3
;
data want;
set have;
length cat_shiyas $ 100; /* large enough to hold the concatenated content */
cat_shiyas=cats(of shiyas:);
run;
Use the OF keyword (which lets you reference a list of variables across a row, similar to arrays) together with the : wildcard. This will concatenate all columns whose names begin with 'shiyas':
cats(of shiyas:)
I have a large dataset in SAS which I know is almost sorted; I know the first and second levels are sorted, but the third level is not. Furthermore, the first and second levels contain a large number of distinct values and so it is even less desirable to sort the first two columns again when I know it is already in the correct order. An example of the data is shown below:
ID Label Frequency
1 Jon 20
1 John 5
2 Mathieu 2
2 Mathhew 7
2 Matt 5
3 Nat 1
3 Natalie 4
Using the "presorted" option on a proc sort seems to only check if the data is sorted on every key, otherwise it does a full sort of the data. Is there any way to tell SAS that the first two columns are already sorted?
If you've previously sorted the dataset by the first 2 variables, then regardless of the sortedby information on the dataset, SAS will take less CPU time to sort it *. This is a natural property of most decent sorting algorithms - it's much less work to sort something that's already nearly sorted.
* As long as you don't use the force option in the proc sort statement, which forces it to do redundant sorting.
Here's a little test I ran:
option fullstimer;
/*Make sure we have plenty of rows with the same 1 + 2 values, so that sorting by 1 + 2 doesn't imply that the dataset is already sorted by 1 + 2 + 3*/
data test;
do _n_ = 1 to 10000000;
var1 = round(rand('uniform'),0.0001);
var2 = round(rand('uniform'),0.0001);
var3 = round(rand('uniform'),0.0001);
output;
end;
run;
/*Sort by all 3 vars at once*/
proc sort data = test out = sort_all;
by var1 var2 var3;
run;
/*Create a baseline dataset already sorted by 2/3 vars*/
/*N.B. proc sort adds sortedby information to the output dataset*/
proc sort data = test out = baseline;
by var1 var2;
run;
/*Sort baseline by all 3 vars*/
proc sort data = baseline out = sort_3a;
by var1 var2 var3;
run;
/*Remove sort information from baseline dataset (leaving the order of observations unchanged)*/
proc datasets lib = work nolist nodetails;
modify baseline (sortedby = _NULL_);
run;
quit;
/*Sort baseline dataset again*/
proc sort data = baseline out = sort_3b;
by var1 var2 var3;
run;
The relevant results I got were as follows:
SAS took 8 seconds to sort the original completely unsorted dataset by all 3 variables.
SAS took 4 seconds to sort by 3/3 starting from the baseline dataset already sorted by 2/3 variables.
SAS took 4 seconds to sort by 3/3 starting from the same baseline dataset after removing the sort information from it.
The relevant metric from the log output is the amount of user CPU time.
Of course, if the almost-sorted dataset is very large and contains lots of other variables, you may wish to avoid the sort due to the write overhead when replacing it. Another approach you could take would be to create a composite index - this would allow you to do things involving by group processing, for example.
/*Alternative option - index the 2/3 sorted dataset on all 3 vars rather than sorting it*/
proc datasets lib = work nolist nodetails;
/*Replace the sort information*/
modify baseline(sortedby = var1 var2);
run;
/*Create composite index*/
modify baseline;
index create index1 = (var1 var2 var3);
run;
quit;
Creating an index requires reading the whole dataset, as the sort does, but involves only a fraction of the work of writing it all out again, so it might be faster than a 2/3-sorted to 3/3-sorted sort in some situations.
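As an illustration (a sketch reusing the indexed baseline data set from above), a BY statement can now be satisfied by the composite index without a physical sort:
data first_obs;
    set baseline;
    by var1 var2 var3; /* SAS uses the composite index to honour the BY order */
    if first.var2;     /* e.g. keep the first row of each var1/var2 group */
run;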
I have a file with 25 rows like:
Model Cena (zl) Nagrywanie fimow HD Optyka - krotnosc zoomu swiatlo obiektywu przy najkrotszej ogniskowej Wielkosc LCD (cale)
Lumix DMC-LX3 1699 tak 2.5 2 3
Lumix DMC-GH1 + LUMIX G VARIO HD 14-140mm/F4.0-5.8 ASPH./MEGA O.I.S 5199 tak 10 4 3
And I wrote:
DATA lab_1;
INFILE 'X:\aparaty.txt' delimiter='09'X;
INPUT Model $ Cena Nagrywanie $ Optyka Wielkosc_LCD Nagr_film;
f_skal = MAX(Cena - 1500, Optyka - 10, Wielkosc_LCD - 1, Nagr_film - 1) + 1/1000*(Cena - 1500 + Optyka - 10 + Wielkosc_LCD - 1 + Nagr_film - 1);
*rozw = MIN(f_skal);
*rozw = f_skal[,<:>];
PROC SORT;
BY DESCENDING f_skal;
PROC PRINT DATA = lab_1;
data _null_;
set lab_1;
FILE 'X:\aparatyNOWE.txt' DLM='09'x;
PUT Model= Cena Nagrywanie Optyka Wielkosc_LCD Nagr_film f_skal;
RUN;
I need to find the lowest value of f_skal and I don't know how because min(f_skal) doesn't work.
In a data step, the min function only looks at one row at a time - if you feed it several variables, it gives you the minimum of those variables for that row, but it cannot look at values across multiple rows (unless you first get data from multiple rows into one row, e.g. via retain or lag).
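If you do want to stay inside a data step, a retained running minimum is one option (a minimal sketch, assuming the lab_1 data set from the question):
data min_row;
    set lab_1 end=last;
    retain min_f_skal;
    min_f_skal = min(min_f_skal, f_skal); /* running minimum across rows; MIN ignores missing values */
    if last then output;                  /* keep only the final row holding the overall minimum */
    keep min_f_skal;
run;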
One way of calculating statistics in SAS across a whole dataset is to use proc means / proc summary, e.g.:
proc summary data = lab_1;
var f_skal;
output out = min_val min=;
run;
This will create a dataset called min_val in your work library, and the value of f_skal in that dataset will be the minimum from anywhere in the dataset lab_1.
If you would rather create a macro variable containing the minimum value, so that you can use it in subsequent code, one way of doing that is to use proc sql instead:
proc sql noprint;
select min(f_skal) into :min_value from lab_1;
quit;
%put Minimum value = &min_value;
In proc sql the behaviour of min is different - here it compares values across rows, the way you were trying to use it.