Hash Join not behaving as required (SAS)

I am using a hash join on some sample data to join a small table on a larger one. In this example '_1080544_27_08_2016' is the larger table and '_2015_2016_playerlistlookup' the smaller one. Here is my code:
data both(drop=rc);
   declare Hash Plan
      (dataset: 'work._2015_2016_playerlistlookup'); /* declare the name Plan for hash */
   rc = plan.DefineKey ('Player_ID');                /* identify fields to use as keys */
   rc = plan.DefineData ('Player_Full_Name',
        'Player_First_Name', 'Player_Last_Name',
        'Player_ID2');                               /* identify fields to use as data */
   rc = plan.DefineDone ();                          /* complete hash table definition */
   do until (eof1);                                  /* loop to read records from _1080544_27_08_2016 */
      set _1080544_27_08_2016 end = eof1;
      rc = plan.add ();                              /* add each record to the hash table */
   end;
   do until (eof2);                                  /* loop to read records from _2015_2016_playerlistlookup */
      set _2015_2016_playerlistlookup end = eof2;
      call missing(Player_Full_Name,
           Player_First_Name, Player_Last_Name);     /* initialize the variables we intend to fill */
      rc = plan.find ();                             /* look up each Player_ID in hash Plan */
      output;                                        /* write record to Both */
   end;
   stop;
run;
This is producing a table that has the same number of rows as the smaller, lookup table. What I would like to see is a table the same size as the larger one, with the additional fields from the lookup table joined on via the primary key.
The larger table has repeating primary keys. That is to say the primary key is not unique (based on row number for example).
Can someone please tell me what I need to amend in the code?
Thanks

You are loading both datasets into your hash object - the small one when you declare it, and then the large one as well in your first do-loop. This makes no sense to me, unless you have lookup values already populated for some but not all of the rows in your large dataset, and you are trying to carry them over between rows.
You are then looping through the lookup dataset and producing 1 output row for each row of that dataset.
It is unclear exactly what you are trying to do here, as this is not a standard use case for hash objects.
Here's my best guess - if this isn't what you're trying to do, please post sample input and intended output datasets.
data want;
   set _1080544_27_08_2016;                     /* read the large table one row at a time */
   if 0 then set _2015_2016_playerlistlookup;   /* never executes, but adds the lookup variables to the PDV */
   if _n_ = 1 then do;                          /* load the hash once, on the first iteration */
      declare Hash Plan(dataset: 'work._2015_2016_playerlistlookup');
      rc = plan.DefineKey ('Player_ID');
      rc = plan.DefineData ('Player_Full_Name', 'Player_First_Name', 'Player_Last_Name', 'Player_ID2');
      rc = plan.DefineDone ();
   end;
   call missing(Player_Full_Name, Player_First_Name, Player_Last_Name);
   rc = plan.find();                            /* copy the lookup fields into the row when Player_ID matches */
   drop rc;
run;
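This outputs every row of the large table, with the lookup fields left missing when there is no match. If you instead wanted to keep only matched rows, a minimal variation (a sketch, assuming the output name want_matched) is to test the find() return code with a subsetting IF:

data want_matched;
   set _1080544_27_08_2016;
   if 0 then set _2015_2016_playerlistlookup;
   if _n_ = 1 then do;
      declare Hash Plan(dataset: 'work._2015_2016_playerlistlookup');
      rc = plan.DefineKey ('Player_ID');
      rc = plan.DefineData ('Player_Full_Name', 'Player_First_Name', 'Player_Last_Name', 'Player_ID2');
      rc = plan.DefineDone ();
   end;
   call missing(Player_Full_Name, Player_First_Name, Player_Last_Name);
   if plan.find() = 0;   /* subsetting IF: keep only rows whose Player_ID is in the lookup table */
   drop rc;
run;

Repeating keys in the large table are fine here, since each row is looked up independently; it is duplicate keys in the lookup table that would require the hash to be declared with multidata: 'y'.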

Related

Is there a SAS function to delete negative and missing values from a variable in a dataset?

Variable name is PRC. This is what I have so far: the first block deletes negative values, and the second block deletes missing values.
data work.crspselected;
   set work.crspraw;
   where crspyear=2016;
   if (PRC < 0)
      then delete;
   where ticker = 'SKYW';
run;

data work.crspselected;
   set work.crspraw;
   where ticker = 'SKYW';
   where crspyear=2016;
   where=(PRC ne .) ;
run;
Instead of using a function to remove negative and missing values, you can do it more simply with a WHERE= dataset option when reading or writing the data, and in a single data step:
data work.crspselected;
   set work.crspraw(where = (PRC >= 0 & PRC ^= . & crspyear = 2016 & ticker = 'SKYW'));
   /* keep only non-negative, non-missing prices for SKYW in 2016;
      all conditions go in the WHERE= option, because a separate WHERE
      statement is ignored for a dataset that already has a WHERE= option */
run;
The section that does it is:
(where = (PRC >= 0 & PRC ^= .))
This can be applied to either the input dataset (work.crspraw) or the output dataset (work.crspselected).
If you must use a function, then the function missing() includes only missing values as per this answer. Hence ^missing() would do the opposite and include only non-missing values. There is no function for non-negative values, but I think it's easier and quicker to handle both conditions at once without a function.
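For illustration, a sketch of the function-based filter (the same step as above, only with ^missing() standing in for the PRC ^= . test):

data work.crspselected;
   set work.crspraw(where = (^missing(PRC) & PRC >= 0 & crspyear = 2016 & ticker = 'SKYW'));
run;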
You don't need more than your first test to remove negative and missing values. SAS treats all 28 missing values (., ._, .A ... .Z) as less than any actual number.

WSO2 DAS SiddhiQL: Initialize an event table and update it after

I have some related questions:
First, I want to initialize an event table with default values and 100 rows, like in this picture:
Second, once the initialization is done, I would like to update this table. How can I execute a query2 in the same execution plan once the query1 execution has finished?
Finally, I have an event with an 'altitude' attribute. In my execution plan, for each event, I want to increment count1 in every row of my event table where the num column is smaller than the altitude. I tried this, but it doesn't increment the count of all rows:
FROM inputStream JOIN counterTable
SELECT count1+1 as count1, altitude as tempNum
update counterTable on counterTable.count1 < tempNum;
FROM inputStream JOIN counterTable
SELECT counterTable.num as theAltitude, counterTable.count1 as countAltitude
INSERT INTO outputStream;
If you want to initialize each time the execution plan gets deployed, you should use an in-memory event table (as shown below). Otherwise, you can simply use an RDBMS event table, where it's already been initialized.
Queries will run in the order they are defined, but that process occurs per event, not as a batch. That is, if there are two queries which consume from inputStream, when event 1 comes into inputStream it goes to query 1, then to query 2, and only then will event 2 get consumed.
Refer to the snippet below:
/* Define the trigger to be used with initialization */
define trigger triggerStream at 'start';
/* Define streams */
define stream inputStream (altitude int);
define stream outputStream (theAltitude long, countAltitude int);
/* Define table */
define table counterTable (num long, count1 int, count2 int, tempNum int);
/* Iterate and generate 100 events */
from triggerStream[count() < 100]
insert into triggerStream;
/* Using above 100 events, initialize the event table */
from triggerStream
select count() as num, 0 as count1, 0 as count2, 0 as tempNum
insert into counterTable;
/* Perform the update logic here */
from inputStream as i join counterTable as c
on c.count1 < i.altitude
select c.num, (c.count1 + 1) as count1, c.count2, altitude as tempNum
insert into updateStream;
from updateStream
insert overwrite counterTable
on counterTable.num == num;
/* Join the table and get the updated results */
from inputStream join counterTable as c
select c.num as theAltitude, c.count1 as countAltitude
insert into outputStream;
Table values can be initialized as follows.
#info(name='initialize table values')
from inputStream[count()==1]
select 1 as id, 0 as counter1, 0 as counter2, 0 as counter3
insert into counterTable;

DynamoDBMapper: Query on Multiple Conditions

Assume my data structure has several data members
class Data {
    @DynamoDBHashKey
    @DynamoDBAutoGeneratedKey
    String id;
    String id1;
    String id2;
    ....
}
Now I want to query based on id1 and id2, like SQL's where id1 = "id1" and id2 = "id2". I know I should make them global secondary indexes, in either of these two ways:
Make id1 and id2 different indexes, like this:
@DynamoDBIndexHashKey(globalSecondaryIndexName = "INDEX_ID1")
String id1;
@DynamoDBIndexHashKey(globalSecondaryIndexName = "INDEX_ID2")
String id2;

// And then query by either index name:
DynamoDBQueryExpression<Data> query = new DynamoDBQueryExpression<Data>()
    .withIndexName("INDEX_ID1") // or INDEX_ID2 here
    .withConsistentRead(false)  // GSIs do not support consistent reads
    .withKeyConditionExpression("id1 = :id1 AND id2 = :id2")
    .withExpressionAttributeValues(ImmutableMap.of(":id1", new AttributeValue(id1), ":id2", new AttributeValue(id2)));
Make id1 and id2 the partition key and sort key under the same index name:

@DynamoDBIndexHashKey(globalSecondaryIndexName = "INDEX_ID1_ID2")
String id1;
@DynamoDBIndexRangeKey(globalSecondaryIndexName = "INDEX_ID1_ID2")
String id2;

// And then query by the single index name:
DynamoDBQueryExpression<Data> query = new DynamoDBQueryExpression<Data>()
    .withIndexName("INDEX_ID1_ID2")
    .withConsistentRead(false)  // GSIs do not support consistent reads
    .withKeyConditionExpression("id1 = :id1 AND id2 = :id2")
    .withExpressionAttributeValues(ImmutableMap.of(":id1", new AttributeValue(id1), ":id2", new AttributeValue(id2)));
Which way is right or better?
Besides, if I want to query based on more than two conditions (say there is an id3), how can I do that?
Make id1 and id2 different indexes
With this approach you can't query on both id1 and id2 at once; you can only query on one, because DynamoDB doesn't allow you to use two different indexes simultaneously.
Make id1 and id2 the partition key and sort key under the same index name
This will work for SQL-like where id1 = "id1" and id2 = "id2" queries. But the approach won't scale if you add another id, because an index can only have one partition key and one sort key.
As mentioned in a comment above, you can Scan with additional filters and conditions. But if you think you'll need a lot more complexity in the future, DynamoDB might not be the right tool for you.
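For a third condition, one option (a sketch, not from the answer above; id3 is assumed to be a plain attribute on the item) is to keep the composite-key query on (id1, id2) and apply id3 as a filter expression. Note that filters run after items are read from the index, so they reduce the result set but not the read cost:

// Query the GSI on (id1, id2), then filter the matches by id3.
DynamoDBQueryExpression<Data> query = new DynamoDBQueryExpression<Data>()
    .withIndexName("INDEX_ID1_ID2")
    .withConsistentRead(false)           // GSIs do not support consistent reads
    .withKeyConditionExpression("id1 = :id1 AND id2 = :id2")
    .withFilterExpression("id3 = :id3")  // evaluated after the key lookup
    .withExpressionAttributeValues(ImmutableMap.of(
        ":id1", new AttributeValue(id1),
        ":id2", new AttributeValue(id2),
        ":id3", new AttributeValue(id3)));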

Using RecordRef to Work with Multiple Tables

I have a group of tables that I need to get the integer key from, and I would like to be able to pass any of them into a single function and get the next value for that key.
I believe that RecordRef is the way to do this, but the code so far doesn't seem quite right.
I am trying to build a function that will take a table record and then return an integer value; that integer value will be the next key for the table. I.e., if the last record's key is 62825, the function will return 62826.
FunctionA
BEGIN
  Id := GetNextId(SalesRecord); // Assignment not allowed
END;

FunctionB
BEGIN
  Id := GetNextId(CreditMemoRecord); // Assignment not allowed
END;

GetNextId(pTableReference : RecordRef) rNextId : Integer
BEGIN
  CASE pTableReference.NUMBER OF
    DATABASE::SalesRecord:
      BEGIN
        // Find last record
        pTableReference.FINDLAST;
        lFieldRef := pTableReference.FIELD(1); // Set to the PK field
      END;
    DATABASE::CreditMemoRecord:
      BEGIN
        // Find last record
        pTableReference.FINDLAST;
        lFieldRef := pTableReference.FIELD(10); // Set to the PK field
      END;
    ... // do more here
  END; // CASE
  EVALUATE(rNextId, FORMAT(lFieldRef.VALUE)); // Get the integer value from the FieldRef
  rNextId := rNextId + 1;                     // Add one for the next value
  EXIT(rNextId);                              // Return the value
END;
With this code I am getting the error "Assignment is not allowed for this variable." on the Function Call to GetNextId
Idea of the table structure:
Table - SalesRecord
FieldId  Fieldname  Type      Description
1        id         integer   PK
2        text1      text(30)
3        text2      text(30)
4        dec1       decimal
5        dec2       decimal

Table - CreditMemoRecord
FieldId  Fieldname  Type      Description
10       id         integer   PK
20       text1      text(30)
30       text2      text(30)
40       dec1       decimal
50       dec2       decimal
Just put a function like this in both tables:

GetNextId() rNextId : Integer
BEGIN
  RESET;
  FINDLAST;
  EXIT(id + 1);
END;

and then call it from a record variable:

FunctionA
BEGIN
  Id := SalesRecord.GetNextId();
END;

FunctionB
BEGIN
  Id := CreditMemoRecord.GetNextId();
END;
This is common practice I believe.
You mean "GetNextValue" get next record? I don't quite understand your use-case.
If you want to pass in a generic record, then you'll want to use the VARIANT data type. This is a wildcard type that will accept Records from any table, and allow you to return records from any table.
This is untested, but hopefully gives you an idea of how it could work:

LOCAL NextRecord(VAR RecVariant : Variant)
BEGIN
  // RecRef is a local variable of type RecordRef
  IF RecVariant.ISRECORD THEN BEGIN
    RecRef.GETTABLE(RecVariant);
    // RecRef.NUMBER is useful for Database::"Customer" style comparisons
    RecRef.NEXT;
    RecRef.SETTABLE(RecVariant); // Might not be necessary
  END;
END;
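A usage sketch (my assumption of how the call site would look, not part of the original answer):

// SalesRecord is a Record variable, RecVariant a Variant
RecVariant := SalesRecord; // a Record assigns directly into a Variant
NextRecord(RecVariant);    // advances the underlying record via RecordRef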

Error surrounding use of scan(&varlist) + Comparison of macro variables

As a follow-up to this question, for which my existing answer appears to be the best:
Extracting sub-data from a SAS dataset & applying to a different dataset
Given a dataset dsn_in, I currently have a set of macro variables max_1 - max_N that contain numeric data. I also have a macro variable varlist containing a list of variables. The two sets of macros are related such that max_1 is associated with scan(&varlist, 1), etc. I am trying to compare the data values within dsn_in for each variable in varlist to the associated comparison values max_1 - max_N, and I would like to output the updated data to dsn_out. Here is what I have so far:
data dsn_out;
   set dsn_in;
   /* scan list of variables and compare to decision criteria.
      if > decision criteria, cap variable */
   do i = 1 by 1 while(scan(&varlist, i) ~= '');
      if scan("&varlist.", i) > input(symget('max_' || left(put(i, 2.))), best12.) then
         scan("&varlist.", i) = input(symget('max_' || left(put(i, 2.))), best12.);
   end;
run;
However, I'm getting the following error, which I don't understand (log shown with options mprint;). SAS appears to be interpreting scan as both an array and a variable, even though it's a SAS function.
ERROR: Undeclared array referenced: scan.
MPRINT(OUTLIERS_MAX): if scan("var1 var2 var3 ... varN", i) > input(symget('max_'
|| left(put(i, 2.))), best12.) then scan("var1 var2 var3 ... varN", i) =
input(symget('max_' || left(put(i, 2.))), best12.);
ERROR: Variable scan has not been declared as an array.
MPRINT(OUTLIERS_MAX): end;
MPRINT(OUTLIERS_MAX): run;
Any help you can provide would be greatly appreciated.
The specific issue you have here is that you place SCAN on the left side of an equal sign. That is not allowed; SUBSTR is allowed to be used in this fashion, but not SCAN.
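A working alternative (a sketch of my assumption about the intent, swapping SCAN for an ARRAY so each variable can legally appear on the left side of the assignment):

data dsn_out;
   set dsn_in;
   /* reference the variables through an array rather than SCAN */
   array vars &varlist.;
   do i = 1 to dim(vars);
      if vars{i} > input(symget('max_' || left(put(i, 2.))), best12.) then
         vars{i} = input(symget('max_' || left(put(i, 2.))), best12.);
   end;
   drop i;
run;

This also compares the variables' numeric values directly; the original scan("&varlist.", i) > ... test compared the variable's name (a character string) to a number, which is not what was intended.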