Can I set up relations between data set in SAS? - sas

I am moving from Relational Database and new to SAS. Now I need to import several CSV files into SAS, and there are relationships between those files. Just like the relationships between tables in database. I am wondering, does the same concept exist in SAS such as the foreign key which I need to set up, or should I just import those files directly regardless of the relationships because no such things in SAS?

Since the concept of a foreign key exists in your head, it also exists in SAS. But it is (generally) not "a supported attribute" that you'd actually use to tag data fields. SAS is low overhead in terms of having to do much upfront data definition, especially for ad hoc work.
Just import your files as they are.
And coming from relational databases, you should probably look at "proc SQL" as the fastest & best way of using your data manipulation skills in SAS.

In SAS, you have the concept of Integrity Constraint, as mjsqu referred to in comments. It is how you enforce relations between datasets. It is quite simple to use, and the syntax should be relatively familiar to someone coming from a strong SQL background.
Integrity constraints include:
Check (list of valid values written into the dataset)
Not Null (may not have a missing value)
Unique (may not have duplicate values)
Primary Key
Foreign Key (also known as a "referential constraint", as it is the only one that checks another table)
Here's an example of some of SAS's Integrity Constraints in action:
proc sql;
create table employees (
employee_id num primary key,
name char(16),
gender char(1),
dob num format=date9.,
division num not null);
create table division_table (
division num primary key,
division_name char(16)
);
alter table employees
add constraint gender check(gender in ('M','F'))
add constraint division foreign key(division) references division_table
on delete restrict on update restrict;
*this insert fails because of Division not containing a value for 1;
insert into employees (employee_id,name,gender,dob,division) values (1,'Johnny','M','06JAN1945'd,1);
insert into division_table (division,division_name) values (1,'Division 1');
*Now it succeeds;
insert into employees (employee_id,name,gender,dob,division) values (1,'Johnny','M','06JAN1945'd,1);
quit;
You can also add constraints in PROC DATASETS, which will be more familiar for SAS users (and is probably slightly easier syntax for those unfamiliar with SQL).

Related

What is best practice to deal with missing data according to Kimball?

I have a data base with the following tables:
Customers, Invoices, Salesman, Target.
The ones concerned about my question are Customers, Invoices.
There are customersIDs used in the Invoices but doesn't exist in the Customers table.
If I used only the customers from Customers Table, my customer dimension would be incomplete.
My solution is to append these IDs from Invoices to Customers and fill other columns in the Customers table with nulls.
I don't know if this is the best approche according to Kimball?
also, if it is a good solution, how can I add accomplish it with Power bi desktop?
Customers table: "generated Data"
Invoice table:
..... just a sample the table is thousands of rows.
There's two points here:
Firstly, (in import mode at least) PBI already creates the "blank row" for items present in your fact table but missing from your dimension table for precisely this scenario. If you don't need the granularity of each individual missing customer id, then you don't need to do anything.
Secondly, if you need to to retain that granularity then your approach is the correct one. The way to do this in Power Query is as follows:
Create a new query which takes your customer dimension table and does a left outer join on customer id with your invoice fact table.
Expand the newly joined table but retain only the new customer id column.
Remove all columns apart from the new customer id column.
Remove duplicates
You now have a list of missing customer ids. Ensure the column name is the same as the column name of you customer id in the customer dimension table. Append this to the original customer dimension query and the nulls will be filled in automatically for the missing columns.
Please keep in mind that It is Kimball, not Kimble.
There are 4 steps of DWH Methodology:
1) Understand Business Process (What your process is actually measuring?)
2) Deciding the grain (It means what every row in your fact table actually represents?)
3) Deciding Dimensions (Ask Where-What-Who-Where-How-HowMany-HowMuch to your grain declaration formed together with business processing)
4) Define Facts (Metrics)
According to this order, You define Dimension tables before building your fact tables: If your dimension table , Customer table in this case, is missing in terms of customers available in your fact table, My biggest biggest advice according to the DWH Dimensional Modeling is to set your customer table right!!! Define every piece of customer in your dimension table!!!! Then populate your Fact table with records:
[Customer ID] in Customer Table : PRIMARY KEY
[CustomerID] in Invoice Table : FOREIGN KEY
SQL and Power BI reacts very differently in your problem:
1) Power BI has no referential integrity concept: It adds a blank row to your dimension table in such a case.
2) SQL gives referential integrity error, and you can't even add rows to your fact table. I support SQL in this case personally!!!!
Finally: Use some ETL tool(SSIS, Talend, ODI or even Power Query) to make your dimension table as accurate as possible:
For example:
Do not leave any column value as null!
If an unknown date exists, put a default date value like '1900-12-31'
If an unknown textual property, put in keywords 'unknown','not available' etc..
Because dimensional table are sources of querying in SQL statements; and different SQL Vendors (SQL Server, Oracle, MySQL) has to deal with NULL values in a different way, and this cause problems in terms of performance wise!!

Compound Sort Key vs. Sort Key

Let me ask other question about redshift sortkey.
We're planning to set the sortkey with the columns frequently used in WHERE statement.
So far, the best combination for our system seems to be:
DISTSTYLE EVEN + COMPOUND SORTKEY + COMPRESSED Column (except for First SortKey column)
Just wondering which can be more better, simple SORTKEY or COMPOUND SORTKEY for our BI tables which can have diversified queries according to users' analysis.
For example, we set the compound sortkey according to frequency in several queries' WHERE statement as follows.
COMPOUND SORTKEY
(
PURCHASE_DATE <-- set as first sort key since it's date column.
STORE_ID,
CUTOMER_ID,
PRODUCT_ID
)
But sometimes it can be queried only 'PRODUCT ID' in actual queries, not with other listed sort keys, nor queried different from COMPOUND KEY order.
In that case, may I ask 'COMPOUND SORTKEY' can be useless or simple SORT KEY can be more effective ...?
I'd be so grateful if you would tell me about your idea and experiences.
The simple rules for Amazon Redshift are:
Use DISTKEY on the column that is most frequently used with JOIN
Use SORTKEY on the column(s) that is most frequently used with WHERE
You are correct that the above compound sort key would only be used if PURCHASE_DATE is included in the WHERE.
An alternative is to use Interleaved Sort Keys, which give equal weighting to many columns and can be used where different fields are often used in the WHERE. However, Interleaved Sort Keys are much slower to VACUUM and are rarely worth using.
So, aim to use SORTKEY on most of your queries, but don't worry too much about the other queries unless you are having some particular performance problems.
See: Redshift Sort Keys - Choosing Best Sort Style | Hevo Blog
Your compound sort key looks sensible to me. It's important to understand that Redshift sort keys are not an index which is used or not used. The sort key is used to physically arrange the data on disk.
The query optimizer "uses" the sort key by looking at the "zone map" (min and max values) for each block during query execution. This happens for all columns regardless of whether they are in the sort key.
Secondary columns in a compound sort key can still be very effective at reducing the data that has to be scanned from disk, especially when the column values are low cardinality.
See this previous example for a query to check on sort key effectiveness: Is my sort key being used?
Please review our guide for designing tables effectively: "Amazon Redshift Engineering’s Advanced Table Design Playbook". The guide discusses the correct use of Interleaved sort keys but note that they should only be used in very specific circumstances.

Reading (even joining) a very large (1.1bn row) table in Enterprise Guide from Teradata

Hopefully you guys can help with what I'm hoping is quite a simple question for those in the know!
I live (well, work) in SAS Enterprise Guide and am trying to perform a simple left join against a table in Teradata.
The table is extremely large (700+ columns, 1.1bn rows) and so far I have been connecting via a LIBNAME statement at the top of my program, followed by the usual PROC SQL to read the data.
The issue I am having is its is extremely slow. I performed the join successfully using 90 rows on the left table and it took 3 hours to complete. The real table I want to use has something like 15,000 rows.
I have tried to connect via the SQL Pass-Through method, but this throws a hosts file error, which I can't fix due to corporate security limitations.
Has anyone had any experience performing this kind of task?
I should mention that I can run a simple select * query in Teradata SQL Assistant is just over 1 minute (16,666,666 obs/s!) so the limitation must be somewhere between SAS/Teradata, or even SAS itself.
I'm sorry I haven't posted actual code snippets as they're on my work machine but this has been bugging me for ages so thought I'd see if I'm missing any tricks.
Thanks in advance for your help.
So you're joining a SAS data set to a Teradata table and want to return the matching records. You'll want to use SAS's DBMASTER= data set option. It designates which of the tables is larger. By telling SAS this, it knows which table to move.
Here I assume librefs have already been assigned and that the Teradata table is larger--more obs--than the SAS data set.
proc sql threads; select tdTable.* from sastables.sasTable1, td.tdTable(dbmaster=yes)
where tdTable.idNum=sasTable1.idNum; quit;
If by chance your SAS data set is larger, you'll want to use the MULTI_DATASRC_OPT= option. Either google these terms or look in the SAS/Access to Relational Databases manual. It's pretty good.
Good luck.
Have you considered creating a volatile table in Teradata? Since this is created in your spool allocation you shouldn't need explicit permissions to create the table. Once created you can load the SAS data set into the Volatile table and collect statistics on the table's join columns and filter columns. This will help the optimizer understand the demographics about your "small" table. The volatile table will only persist for the duration of your session and is not accessible across multiple sessions.
Then rewrite your SAS code to push-down the SQL to Teradata joining the large table to your volatile table. The results can be returned to SAS and loaded into another data set.
CREATE VOLATILE TABLE MyTable, NO FALLBACK
( ColA SMALLINT NOT NULL,
ColB VARCHAR(10) NOT NULL
) PRIMARY INDEX (ColA)
ON COMMIT PRESERVE ROWS /* This is important */
;
The primary index is how Teradata distributes the data and accesses the data. Tables distributed on the same column will join "AMP local" and will not require a redistribution. This is not always possible, as your primary index selection has to consider even distribution as well as access path. The primary index does not have to be unique, but can be.
Hope this helps.

insert data in sqlite3 when array could be of different lengths

Coming off a NLTK NER problem, I have PERSONS and ORGANIZATIONS, which I need to store in a sqlite3 db. The obtained wisdom is that I need to create separate TABLEs to hold these sets. How can i create a TABLE when len(PERSONs) could vary for each id. It can even be zero. The normal use of:
insert into table_name values (?),(t[0]) will return a fail.
Thanks to CL.'s comment, I figured out the best way is to think rows in a two-column table, where the first column is id INT, and the second column contains person_names. This way, there will be no issue with varying lengths of PERSONS list. of course, to link the main table with the persons table, the id field has to REFERENCE (foreign keys) to the story_id (main table).

Select for Teradata Primary/Foreign key relationship

Im in the process of learning to properly pull appropriate metadata from a Teradata database and a large part of what I need is to pull all existing primary/foreign keys within a database. I am still very much a beginner with Teradata as well as big data in general, so a simplified explanation would be nice.
A simplified version of a select statement would also be incredibly helpful. Thanks in advance.
Foreign Keys: dbc.All_RI_ParentsV[X]
PK/Unique: dbc.IndicesV[X]. Unique Indexes got a UniqueFlag Y, if it was defined as a PK in the Create Table IndexType will be P. Multi-column indexes got one row per column all sharing the same IndexNumber, 1 is always the PI.
But as Teradata is a DWH you might have tables without defined PK and you will hardly find any defined FKs.