I am trying to merge two tables. Table A has an ID column, a date column, and an amount value for every date in a period.
Table B has both ID and date, but also other columns with details. However, it has an entry only when the details change, so I do not know how to merge the two with normal joins. For every entry in A, I want the details populated as of the latest date available in B for that ID on or before the date in A.
Table A
| ID | date | amount |
| 1 | 01JAN| 56 |
| 1 | 02JAN| 54 |
| 1 | 03JAN| 23 |
| 1 | 04JAN| 43 |
Table B
| ID | date | details|
| 1 | 01JAN| x |
| 1 | 03JAN| y |
Wanted Output
Table A
| ID | date | amount | details |
| 1 | 01JAN| 56 | x |
| 1 | 02JAN| 54 | x |
| 1 | 03JAN| 23 | y |
| 1 | 04JAN| 43 | y |
For the 02JAN entry, the latest available details as of that date are 'x'; for 03JAN they are 'y'.
Thank you in advance for any guidance you could provide.
This will work for the question you have asked literally:
data want;
retain details_last;
merge table1 table2;
by ID date;
if not missing(details) then details_last = details;
else details = details_last;
drop details_last;
run;
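But this only works if your data meets the conditions you have presented: the dates in table B should always fall within the date range of table A and not outside it (i.e. only interpolation, no extrapolation). Note also that details_last is retained across BY groups, so if an ID's first row in table A comes before its first row in table B, that row would pick up the details of the previous ID.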
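As a hedged variant of the step above, the following sketch resets the carried-forward value at the start of each ID and keeps only rows that exist in table A. The dataset names tableA and tableB, the year 2020, the second ID, and the $8 length are assumptions made for illustration; they are not part of the original question.

```sas
/* Hypothetical sample data: the year and ID 2 are invented for the demo */
data tableA;
  input id date :date9. amount;
  format date date9.;
  datalines;
1 01JAN2020 56
1 02JAN2020 54
1 03JAN2020 23
1 04JAN2020 43
2 01JAN2020 10
;

data tableB;
  input id date :date9. details $;
  format date date9.;
  datalines;
1 01JAN2020 x
1 03JAN2020 y
2 02JAN2020 z
;

data want;
  merge tableA(in=inA) tableB;
  by id date;
  length details_last $8;                 /* assumed length, match it to details */
  retain details_last;
  if first.id then details_last = ' ';    /* do not carry details across IDs */
  if not missing(details) then details_last = details;
  else details = details_last;            /* fill from the latest earlier B row */
  if inA;                                 /* drop rows that exist only in table B */
  drop details_last;
run;
```

With this data, the row for ID 2 on 01JAN2020 keeps a blank details value instead of inheriting 'y' from ID 1, and the B-only row for 02JAN2020 is not written to the output. Both inputs must be sorted by id and date for the BY statement to work.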
How do you extract all values containing part of a particular number and then delete them?
I have data where the IDs have different lengths, and I want to extract all the IDs that end in a particular number. For example, if the ID ends in "-00", "02", or "-01", I want to pull those IDs to see the hit rate that includes them, and then delete that suffix from the ID. Is there a more efficient way to write this code?
I tried to use the substring function to slice the ID and get the result, but some IDs do not line up with the specified positions.
Code:
Proc sql;
Create table work.data1 AS
SELECT Product, Amount_sold, Price_per_unit,
CASE WHEN Product Contains "Pen" and Length(ID) >= 9 Then SUBSTR(ID,1,9)
WHEN Product Contains "Book" and Length(ID) >= 11 Then SUBSTR(ID,1,11)
WHEN Product Contains "Folder" and Length(ID) >= 12 Then SUBSTR(ID,1,12)
...
END AS ID
FROM A;
Quit;
Have:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229-01 | Book | 20 | 5 |
| ABC134475472 02 | Folder | 29 | 7 |
| AB-1235674467-00 | Pencil | 26 | 1 |
| 69598346-02 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Wanted the final result:
+------------------+-----------------+-------------+----------------+
| ID | Product | Amount_sold | Price_per_unit |
+------------------+-----------------+-------------+----------------+
| 123456789 | Pen | 30 | 2 |
| 63495837229 | Book | 20 | 5 |
| ABC134475472 | Folder | 29 | 7 |
| AB-1235674467 | Pencil | 26 | 1 |
| 69598346 | Correction pen | 15 | 1.50 |
| 6970457688 | Highlighter | 15 | 2 |
| 584028467 | Color pencil | 15 | 10 |
+------------------+-----------------+-------------+----------------+
Just test whether the string has any embedded spaces or hyphens and whether the last word, when delimited by space or hyphen, is 00, 01, or 02. If both are true, chop off the last three characters.
data have;
infile cards dsd dlm='|' truncover ;
input id :$20. product :$20. amount_sold price_per_unit;
cards;
123456789 | Pen | 30 | 2 |
63495837229-01 | Book | 20 | 5 |
ABC134475472 02 | Folder | 29 | 7 |
AB-1235674467-00 | Pencil | 26 | 1 |
69598346-02 | Correction pen | 15 | 1.50 |
6970457688 | Highlighter | 15 | 2 |
584028467 | Color pencil | 15 | 10 |
;
data want;
set have ;
if indexc(trim(id),'- ') and scan(id,-1,'- ') in ('00' '01' '02') then
id = substrn(id,1,length(id)-3)
;
run;
Result
amount_ price_
Obs id product sold per_unit
1 123456789 Pen 30 2.0
2 63495837229 Book 20 5.0
3 ABC134475472 Folder 29 7.0
4 AB-1235674467 Pencil 26 1.0
5 69598346 Correction pen 15 1.5
6 6970457688 Highlighter 15 2.0
7 584028467 Color pencil 15 10.0
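The same test-and-chop logic can also be sketched as a single regular-expression substitution. This is an alternative illustration rather than the method used above; the dataset names ids_demo and ids_clean are hypothetical.

```sas
/* Hypothetical sample of IDs, some with a trailing -00 / -01 / 02 suffix */
data ids_demo;
  infile datalines truncover;
  input id $20.;
  datalines;
123456789
63495837229-01
ABC134475472 02
AB-1235674467-00
69598346-02
;

data ids_clean;
  set ids_demo;
  /* Remove one hyphen or space followed by 00, 01, or 02 at the end of */
  /* the value; \s* absorbs the trailing blanks of the character field. */
  id = prxchange('s/[- ](00|01|02)\s*$//', 1, id);
run;
```

IDs without such a suffix are left unchanged because the pattern does not match them.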
There may be other solutions, but you have to use some string functions. Here I used the functions substr, reverse (reversing the string), and indexc (position of the first occurrence of any of the given characters in the string):
data have;
input text $20.;
datalines;
12345678
AB-142353 00
AU-234343-02
132453 02
221344-09
;
run;
data want (drop=reverted pos);
set have;
if countw(text) gt 1
then do;
reverted=strip(reverse(text));
pos=indexc(reverted,'- ')+1;
new=strip(reverse(substr(reverted,pos)));
end;
else new=text;
run;
Given the following table have, I would like to delete the records that satisfy the conditions based on the to_delete table.
data have;
infile datalines delimiter="|";
input id :8. item :$8. datetime : datetime18.;
format datetime datetime18.;
datalines;
111|Basket|30SEP20:00:00:00
111|Basket|30SEP21:00:00:00
111|Basket|31DEC20:00:00:00
111|Backpack|31MAY22:00:00:00
222|Basket|31DEC20:00:00:00
222|Basket|30JUN20:00:00:00
;
+-----+----------+------------------+
| id | item | datetime |
+-----+----------+------------------+
| 111 | Basket | 30SEP20:00:00:00 |
| 111 | Basket | 30SEP21:00:00:00 |
| 111 | Basket | 31DEC20:00:00:00 |
| 111 | Backpack | 31MAY22:00:00:00 |
| 222 | Basket | 31DEC20:00:00:00 |
| 222 | Basket | 30JUN20:00:00:00 |
+-----+----------+------------------+
data to_delete;
infile datalines delimiter="|";
input id :8. item :$8. datetime : datetime18.;
format datetime datetime18.;
datalines;
111|Basket|30SEP20:00:00:00
111|Backpack|31MAY22:00:00:00
222|Basket|30JUN20:00:00:00
;
+-----+----------+------------------+
| id | item | datetime |
+-----+----------+------------------+
| 111 | Basket | 30SEP20:00:00:00 |
| 111 | Backpack | 31MAY22:00:00:00 |
| 222 | Basket | 30JUN20:00:00:00 |
+-----+----------+------------------+
In the past, I used the catx() function to concatenate the conditions in a where clause, but I wonder if there is a better way of doing this:
proc sql;
delete from have
where catx('|',id,item,datetime) in
(select catx('|',id,item,datetime) from to_delete);
quit;
+-----+--------+------------------+
| id | item | datetime |
+-----+--------+------------------+
| 111 | Basket | 30SEP21:00:00:00 |
| 111 | Basket | 31DEC20:00:00:00 |
| 222 | Basket | 31DEC20:00:00:00 |
+-----+--------+------------------+
Please note that it should allow the have table to have more columns than the table to_delete.
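You can use the EXCEPT operator to compute the set difference of the two tables. This assumes both tables have the same columns, and EXCEPT also removes duplicate rows from the result: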
proc sql;
create table want as
select * from have except select * from to_delete
;
quit;
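When have carries more columns than to_delete, one possible alternative is a correlated EXISTS subquery that compares only the key columns. This is a sketch under the assumption that id, item, and datetime together identify the rows to remove, as in the sample data.

```sas
proc sql;
  /* Delete every row of have whose key columns match a row in to_delete. */
  /* Extra columns in have are ignored because only the keys are compared. */
  delete from have
  where exists (
    select 1
    from to_delete
    where to_delete.id       = have.id
      and to_delete.item     = have.item
      and to_delete.datetime = have.datetime
  );
quit;
```

Unlike the catx() approach, this compares each column in its own type, so no delimiter can collide with the data. Note that it modifies have in place rather than creating a new table.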
I have a dataset where I wish to reflect the totals from a custom SQL query I performed in Tableau. Here is some sample data:
1. I first performed a custom query that was a join, unpivot and placed my data into groups
Size Tb Val type Group Sum_AVG SKU Last_Refreshed
270 90.5 Free_Space_TB Group2 90.5 Excel 9/1/2020
270 179.5 Used Group2 179.5 Excel 9/1/2020
814 701 Free_Space_TB Group1 701 Gris 8/1/2020
814 112 Used Group1 112 Gris 8/1/2020
2. Then I aggregated the data by taking the sum of one group and the average of the other group (and finally summed these group values)
The data is being aggregated like this: (SUM_AVG)
zn(sum(if [Group]= 'Group1' then [Val] end))
+
zn(avg(if [Group] = 'Group2' then [Val] end))
The view looks like this
Here is the custom query output
Here is my view
The avail and used appear when I hover over, but how would I include the total?
This is the calculation I am using (thanks to help from a SO member):
{SUM({Fixed [type]: ZN(sum(if [Group]= 'Group1' then [Val] end))})
+
sum({Fixed [type]: zn(avg(if [Group] = 'Group2' then [Val] end))})}
I am doing something wrong, because it is totaling across all the columns (I have more columns in the full dataset) when I just want the total for each column.
(Used was created from using a custom query)
Any assistance is appreciated.
In my opinion, you can do this without changing the underlying view. WINDOW_SUM is a table calculation and always depends on the view/context generated. Therefore, I prefer LOD calculations, which do not depend on context.
I think you should proceed like this. As always, I have changed the sample data to include sufficient detail.
Data used
| Id | Avail | group | used | Date |
|----|-------|--------|------|------------|
| A | 5 | Group1 | 5 | 20-01-2020 |
| A | 20 | Group1 | 20 | 20-01-2020 |
| B | 10 | Group2 | 10 | 20-01-2020 |
| B | 5 | Group2 | 5 | 20-01-2020 |
| B | 5 | Group2 | 5 | 20-01-2020 |
| A | 10 | Group1 | 10 | 20-01-2020 |
| A | 10 | Group1 | 10 | 20-01-2020 |
| B | 5 | Group2 | 5 | 20-01-2020 |
| B | 5 | Group2 | 5 | 20-01-2020 |
| A | 5 | Group1 | 5 | 20-02-2019 |
| A | 20 | Group1 | 20 | 20-02-2019 |
| B | 10 | Group2 | 10 | 20-02-2019 |
| B | 5 | Group2 | 5 | 20-02-2019 |
| B | 5 | Group2 | 5 | 20-02-2019 |
| A | 10 | Group1 | 10 | 20-02-2019 |
| A | 10 | Group1 | 10 | 20-02-2019 |
| B | 5 | Group2 | 5 | 20-02-2019 |
| B | 5 | Group2 | 5 | 20-02-2019 |
Step-1 Pivot generated in Tableau as earlier.
Step-2 Calculated field sum-avg also generated as discussed.
Step-3 View generated
Step-4 Add another field total
{FIXED [Date], [Group]: sum(
{FIXED [Date], [Group], [type]: zn(sum(if [Group]= 'Group1' then [val] end))}
+
{Fixed [Date], [Group], [type]: zn(avg(if [Group] = 'Group2' then [val] end))}
)}
Step-5 Add this field to Detail on the Marks card.
The code used in the tooltip is shown below. Obviously, you can tweak it to taste.
Under the <Group> , <AGG(Sum_Avg)> was <type> out of total <SUM(Total)> SKU on <YEAR(Date)>
This solution works:
1. Create a calculated field:
WINDOW_SUM([SUM_AVG])
2. Drag the newly computed field to the view
3. Right click ‘Edit Table Calculation’
4. Specify and compute using [Last_Refreshed] and [type]
This will allow you to compute across cells, giving you your desired result.
I have a database with 3 columns. ID, Date and amount. It is ordered by ID and Date. All I want to do is to add a row after the latest occurrence of every ID with the same ID, Date = Date + 1 Month and Amount = 0.
As an Illustration I want to go from this:
id | Date |amount |
A | 01JAN| 1 |
A | 01FEB| 1 |
B | 01FEB| 0 |
B | 01MAR| 1 |
to this:
id | Date |amount |
A | 01JAN| 1 |
A | 01FEB| 1 |
A | 01MAR| 0 | <- ADD THIS ROW
B | 01FEB| 0 |
B | 01MAR| 1 |
B | 01APR| 0 |<- ADD THIS ROW
I know I should use intnx, but beyond that I don't really know what to do. I appreciate any input.
Assuming that the DATE variable has actual date values in it, you just need to output twice on the last observation in each group.
data want;
set have;
by id;
output;
if last.id then do;
date=intnx('month',date,1,'b');
amount=0;
output;
end;
run;
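The fourth argument of INTNX controls where the result lands inside the target interval. As a small illustration with an arbitrarily chosen date, 'b' aligns to the beginning of the month while 's' keeps the same day of the month:

```sas
data _null_;
  d = '15JAN2020'd;                       /* arbitrary example date */
  next_b = intnx('month', d, 1, 'b');     /* beginning of next month */
  next_s = intnx('month', d, 1, 's');     /* same day, next month */
  put next_b= date9. next_s= date9.;      /* next_b=01FEB2020 next_s=15FEB2020 */
run;
```

For data where every date is already the first of the month, as in the question, both alignments give the same result.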
I'm using a dataset which is something like :
+----------+--------+-------+
| Variable | Level | Value |
+----------+--------+-------+
| sexe | men | 10 |
| | female | 20 |
| age | 0-20 | 5 |
| | 20-40 | 5 |
| | 40-60 | 10 |
| | >60 | 10 |
+----------+--------+-------+
And I would like to fill the "blank" cells using the previous non-blank cell to obtain something like this.
+----------+--------+-------+
| Variable | Level | Value |
+----------+--------+-------+
| sexe | men | 10 |
| sexe | female | 20 |
| age | 0-20 | 5 |
| age | 20-40 | 5 |
| age | 40-60 | 10 |
| age | >60 | 10 |
+----------+--------+-------+
I tried various possibilities in DATA step mostly with the LAG() function. The idea was to read the previous row when the cell was empty and fill with that.
DATA test;
SET test;
IF variable = . THEN DO;
variable = LAG1(variable);
END;
RUN;
And I obtained
+----------+--------+-------+
| Variable | Level | Value |
+----------+--------+-------+
| | men | 10 |
| sexe | female | 20 |
| | 0-20 | 5 |
| age | 20-40 | 5 |
| | 40-60 | 10 |
| | >60 | 10 |
+----------+--------+-------+
The problem is that the right string is not always just one row above. But I also don't understand why SAS put blanks in the first and third lines. It should not have modified those lines, because I said "If variable = .".
I know how to do this in Python or in R with a for loop, but I didn't find a good solution in SAS.
I tried to put the string inside a variable with "CALL SYMPUT" and also with "RETAIN", but that didn't work either.
There must be a simple and elegant way to do this. Any idea?
You can't use LAG inside an IF and get that result - LAG doesn't actually work the way you think. RETAIN is the correct way I'd say:
DATA test;
SET test;
retain _variable;
if not missing(variable) then _variable=variable;
else variable=_variable;
drop _variable;
RUN;
LAG doesn't actually go to the previous record and get its value; what it does is set up a queue, and each time LAG is called it takes a value off the front and adds the current value to the back. This means that if LAG is inside a conditional block, it won't execute when the condition is false, so the queue is not updated on those rows and the value you get back is the one stored the last time LAG actually ran. You can use the IFN and IFC functions, which evaluate both the true and false arguments regardless of the boolean, but in this case RETAIN is probably easier.
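A minimal sketch of the queue behavior described above, using a made-up four-row dataset (the names lag_demo, lag_result, prev, and cond_prev are hypothetical):

```sas
data lag_demo;
  input x $;          /* a lone period is read as a blank character value */
  datalines;
a
.
c
.
;

data lag_result;
  set lag_demo;
  /* Called on every row: the queue is updated each time, */
  /* so prev really is the value of x from the previous row. */
  prev = lag(x);
  /* Called only when x is blank: this LAG has its own queue, */
  /* which only ever receives blank values, so cond_prev stays blank. */
  if missing(x) then cond_prev = lag(x);
run;
```

Here prev comes out as blank, a, blank, c, while cond_prev is blank on every row. Each LAG call in the code keeps a separate queue, which is why the unconditional call is not affected by the conditional one.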