Please help me interpret this SAS code (I am pretty new to SAS and am trying to rewrite the code from SAS to Stata).
PROC SQL;
CREATE TABLE sample AS
SELECT a.*, LOG(1 + b.pcyld) as u
FROM sample AS a LEFT JOIN bondterm AS b
ON a.compdate = b.compdate AND
a.t_debt_round = b.roundmat;
QUIT;
Does it mean: from the dataset "bondterm", generate r_debt = LOG(1 + b.pcyld), and then merge this onto sample by compdate and by t_debt_round (sample dataset) = roundmat (bondterm)? Am I right?
The source tables are:
sample
bondterm
The data is left joined, so all records from sample are kept (with u missing where there is no match), but a record that is in bondterm and not in sample is not included.
Join keys:
compdate (in both)
t_debt_round from sample joins to roundmat from bondterm
Fields selected:
all columns from sample
a new calculated field u = log(1 + pcyld)
Output table:
sample (the same name as the input, which is usually not recommended)
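Since the goal is a Stata rewrite, here is a minimal Stata sketch of the equivalent logic (my own illustration, not from the original post). It assumes sample and bondterm are saved as Stata datasets and that bondterm has at most one row per compdate/roundmat combination, so the left join can be written as an m:1 merge; if bondterm has duplicate key combinations, the SAS left join would multiply the matching sample rows and a plain m:1 merge would not reproduce that.
* build the lookup: rename the key and compute u = log(1 + pcyld)
use bondterm, clear
rename roundmat t_debt_round
generate u = log(1 + pcyld)
keep compdate t_debt_round u
tempfile bt
save `bt'
* left join onto sample: unmatched sample rows keep u as missing
use sample, clear
merge m:1 compdate t_debt_round using `bt', keep(master match) nogenerate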
I'm trying to save the results from a forecast dataset into a historical dataset on SAS servers.
I already have the path of the historical dataset; what I'm trying to do here is to append the results if they don't exist in the historical dataset, or replace them if they already exist.
Below is how the table that I want to append/replace looks:
Agency | Forecast_Week | Date Fc  | SubAgency | Value
New    | 12/26/22      | 12/27/22 | One       | 3243262
New    | 12/26/22      | 12/28/22 | One       | 3242355
New    | 12/26/22      | 12/29/22 | Two       | 3225142
New    | 12/26/22      | 12/30/22 | Two       | 3234235
So, if the records for the Agency, Forecast Week, Date, SubAgency already exist, I want to replace them with the new values, but if they don't exist in the historical dataset I want to append them.
Do you know how I can do this?
I did something very similar not that long ago:
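/* Keep only the old rows whose forecast_week is not in the new data */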
proc sql;
  create table temp as
  select *
  from table_old
  where forecast_week NOT IN (select forecast_week from table_new);
quit;
/* Append the updated and new values */
data table_old;
set temp
table_new
;
run;
I hope this helps
I asked a similar question in this post; maybe you can get some inspiration there:
SAS EG append new data and overwrite already existing rows
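Note that the code above keys only on forecast_week, while the question asks to match on Agency, Forecast Week, Date, and SubAgency together. Here is a hedged variant of the same delete-then-append idea (a sketch, assuming the variables are named agency, forecast_week, date_fc, and subagency, and that those four together identify a row):
/* Remove the old rows that are about to be replaced, keyed on all four columns */
proc sql;
    delete from table_old as o
    where exists (select 1 from table_new as n
                  where n.agency = o.agency
                    and n.forecast_week = o.forecast_week
                    and n.date_fc = o.date_fc
                    and n.subagency = o.subagency);
quit;
/* Append the new and updated rows */
proc append base=table_old data=table_new;
run;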
So I'd like to make a query that shows all the datasets in a project, and the number of tables in each one. My problem is with the number of tables.
Here is where I'm stuck:
SELECT
smt.catalog_name as `Project`,
smt.schema_name as `DataSet`,
( SELECT
COUNT(*)
FROM ***DataSet***.INFORMATION_SCHEMA.TABLES
) as `nbTable`,
smt.creation_time,
smt.location
FROM
INFORMATION_SCHEMA.SCHEMATA smt
ORDER BY DataSet
The view INFORMATION_SCHEMA.SCHEMATA lists all the datasets of the project the query is executed in, and the view INFORMATION_SCHEMA.TABLES lists all the tables of a given dataset.
The thing is that the view INFORMATION_SCHEMA.TABLES needs to have the dataset specified like this to give the table information: dataset.INFORMATION_SCHEMA.TABLES
So what I need is to replace the ***DataSet*** placeholder with the value I get from the query itself (smt.schema_name).
I am not sure whether I can do it with a subquery, and I don't really know how to manage it.
I hope I'm clear enough; thanks in advance if you can help.
You can do this using some procedural language as follows:
CREATE TEMP TABLE table_counts (dataset_id STRING, table_count INT64);
FOR record IN
(
SELECT
catalog_name as project_id,
schema_name as dataset_id
FROM `elzagales.INFORMATION_SCHEMA.SCHEMATA`
)
DO
EXECUTE IMMEDIATE
CONCAT("INSERT table_counts (dataset_id, table_count) SELECT table_schema as dataset_id, count(table_name) from ", record.dataset_id,".INFORMATION_SCHEMA.TABLES GROUP BY dataset_id");
END FOR;
SELECT * FROM table_counts;
This will return one row per dataset in table_counts, each with its table count.
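If you also want the other SCHEMATA columns from the original attempt (creation_time, location), one hedged option (a sketch, reusing the elzagales project from the script above; replace it with your own) is to add a final SELECT in the same script that joins the temp table back:
SELECT
  smt.catalog_name AS project_id,
  smt.schema_name  AS dataset_id,
  tc.table_count,
  smt.creation_time,
  smt.location
FROM `elzagales.INFORMATION_SCHEMA.SCHEMATA` AS smt
LEFT JOIN table_counts AS tc
  ON tc.dataset_id = smt.schema_name
ORDER BY smt.schema_name;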
From https://cloud.google.com/bigquery/docs/partitioned-tables:
you can shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD
This enables me to do:
SELECT count(*) FROM `xxx.xxx.xxx_*`
and query across all the shards. Is there a special notation that queries only the latest shard? For example say I had:
xxx_20180726
xxx_20180801
could I do something along the lines of
SELECT count(*) FROM `xxx.xxx.xxx_{{ latest }}`
to query xxx_20180801?
SINGLE QUERY INSPIRED BY Mikhail Berlyant:
SELECT count(*) as c
FROM `XXX.PREFIX_*`
WHERE _TABLE_SUFFIX IN (
  SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX_') + 2)
  FROM `XXX.__TABLES_SUMMARY__`
  WHERE table_id LIKE 'PREFIX_%')
If you do care about cost (meaning how many tables will be scanned by your query), the only way is to do it in two steps, like below.
First query
#standardSQL
SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX') + 1)
FROM `xxx.xxx.__TABLES_SUMMARY__`
WHERE table_id LIKE 'PREFIX%'
Second Query
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = '<result of first query>'
So, if the result of the first query is 20180801, the second query will obviously look like below:
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = '20180801'
If you don't care about cost but just need the result, you can easily combine the above two queries into one. But again, remember: even though the result will come only from the last table, the cost will be as if you had queried all tables that match xxx.xxx.PREFIX_*.
Forgot to mention (even though it should be obvious): of course, when you have only COUNT(1) in your SELECT, the cost will be zero for both options, but in reality you will most likely have something more valuable than just count(1).
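For reference, mechanically combining the two steps gives essentially the single query posted above (a sketch; as noted, the scan still covers every table matching PREFIX_* because _TABLE_SUFFIX is compared to a subquery rather than a constant):
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = (
  SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX') + 1)
  FROM `xxx.xxx.__TABLES_SUMMARY__`
  WHERE table_id LIKE 'PREFIX%'
)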
I know this is kind of an old thread, but I was surprised that no one offered an answer using variables.
Héctor Neri already mentioned this in the comments, but I thought it might be better to have an actual answer with sample code posted.
#standardSQL
DECLARE SHARD_DATE STRING;
SET SHARD_DATE=(
SELECT MAX(REPLACE(table_name,'{TABLE}_',''))
FROM `{PRJ}.{DATASET}.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE '{TABLE}_20%'
);
SELECT * FROM `{PRJ}.{DATASET}.{TABLE}_*`
WHERE _TABLE_SUFFIX = SHARD_DATE
Make sure to replace {PRJ}, {DATASET}, and {TABLE} values with your table location.
If you run this in the BigQuery web UI, you will see this message:
WARNING: Could not compute bytes processed estimate for script.
But after running the script you can see that the variable properly reduces the table scan to the latest shard and does not cause any extra cost.
I have two tables with exactly the same column headers and one row each. I have the code to concatenate them, which works fine.
data concatenation;
set CURR_CURR CURR_30;
run;
However, there is no index in the output to say which row corresponds to which table.
I've already tried 'create index' and 'index create', but they don't work syntactically. Put simply, I just want to add a column of strings and move it to the front of all the other columns in the data set.
Use the INDSNAME option on the SET statement plus a variable to store the information.
If you place the LENGTH statement ahead of your SET statement, the new variable will be created as the first column.
Just a note that this isn't the same as an 'index': an index in SAS has a different meaning, which isn't what you're trying to create here.
data concatenation;
length dset source $50.;
set CURR_CURR CURR_30 indsname=source;
dset=source;
run;
Reeza's answer is very similar to something I figured out that worked as well. Here's my version as an alternative.
data concatenation;
length id $ 10;
set CURR_CURR (in=a) CURR_30 (in=b);
if a then id = 'curr_curr';
else if b then id = 'curr_30';
run;
I’m wondering if someone can help with a coding problem I have.
Background: I have a project that imports some files and uses the data in those files to perform projections. The contents of the files determine some aspects of the size of the output that follows. Simply put, the values in the loaded data drive the size and shape of the tables that follow, and this can vary.
The following code is an example of the problem.
The loaded data will have a variable start year (note wf2009: 2009 is the first year) and a variable range (this example goes from 2009 to 2030, but this will vary too).
proc summary data= labeled_proj_data_hc;
class jurisdiction specialty measure;
types jurisdiction*specialty*measure;
VAR wf2009--wf2030;
output out= sum_labeled_proj_data_hc
sum(wf2009) = y2009
sum(wf2010) = y2010
sum(wf2011) = y2011
sum(wf2012) = y2012;
run;
Where I’m not sure how to proceed is:
sum(wf2009) = y2009
sum(wf2010) = y2010
sum(wf2011) = y2011
sum(wf2012) = y2012;
In the sequence of lines calling for the sum of their respective columns, how can I make this dynamic, so that the start year is populated from a variable and it increments yearly until the last year, which is also variable?
Has anyone solved a similar problem?
Cheers,
Is renaming the variables necessary? If not, then you can use the : wildcard operator to access all variables that begin with 'wf', and just put SUM= in the OUTPUT statement, which will preserve the original names.
So your PROC SUMMARY would look like this:
proc summary data= labeled_proj_data_hc;
class jurisdiction specialty measure;
types jurisdiction*specialty*measure;
VAR wf: ;
output out= sum_labeled_proj_data_hc
sum=;
run;
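If the y-prefixed renames are needed, a macro loop can generate the sum()= pairs. Below is a hedged sketch (my own illustration, not part of the answer above); start_yr and end_yr are assumed macro variables holding the first and last year, which in practice you would derive from the loaded data (for example with PROC SQL and SELECT ... INTO :).
%macro sum_years(start_yr, end_yr);
  %local yr;
  %do yr = &start_yr %to &end_yr;
    sum(wf&yr) = y&yr
  %end;
%mend sum_years;

%let start_yr = 2009;  /* assumed here; derive from the data in practice */
%let end_yr   = 2030;

proc summary data=labeled_proj_data_hc;
  class jurisdiction specialty measure;
  types jurisdiction*specialty*measure;
  var wf&start_yr-wf&end_yr;  /* resolves to the numbered range list wf2009-wf2030 */
  output out=sum_labeled_proj_data_hc
    %sum_years(&start_yr, &end_yr);
run;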