In SAS Data Integration Studio, under the 'Action' menu, there is an option to 'Propagate Columns'. What is the use of this?
This reads the metadata from the input table and pushes the column definitions into the downstream nodes.
To 'propagate columns' means to copy column definitions from source to target, or from target to source, in the mapping. You can use Action -> Propagate Columns only when you have selected a transformation in the job; you can then propagate columns from source to target for that SAS DI transformation.
I have two datasets: a Benchmark dataset and an Independent dataset. I tested the Benchmark dataset using 10-fold cross-validation in Weka. Can I also test the Independent dataset in Weka?
If you are using the Weka GUI:
In the Weka Explorer, first load your benchmark dataset on the Preprocess tab (use the Open file... button). Then go to the Classify tab and, under Test options, select Supplied test set, press Set..., then Open file... to choose the independent dataset, and close the dialog.
Finally, select the algorithm you want and press Start.
Notice that the train and test files must have the same structure.
I need to use an Append object after a series of joins that have a conditional run. A join step may not execute if its condition is not met, and in that case its physical work dataset is never created.
The problem is that the Append step raises an error if one or more of its input physical datasets do not exist.
Is there a smart way to create an empty physical table from the metadata structure of the joins' work tables, or to use the Append with datasets that were never created?
Creating each table with an explicit list of all the fields is not a realistic solution, because I would have to replicate it for 8 different joins and then replicate the job 10 times...
Thanks to all
Roberto
Thank you for your comments.
What you should do:
1. Amend your conditional node so that, on the positive condition, it creates a global macro variable with the value MAX and, on the negative condition, it creates the same variable with the value 0.
2. Replace the offending SQL step with a "CREATE TABLE" node.
3. In the options for "CREATE TABLE", specify the macro variable for "MAXIMUM OUTPUT ROWS (OUTOBS)". See the picture below for an example of those options.
So now, when your condition is not met, you will always end up with an empty table. When the condition is met, the step executes normally.
I must say my version of DI Studio is a bit old. In my version the SQL node doesn't allow passing macro variables to SQL options; only integers can be typed in. Check whether your version allows it, because if it does you can amend the existing SQL step and avoid replacing it with another node.
One more thing: you will get a warning whenever the OUTOBS value is smaller than the number of rows the query would otherwise produce.
Let me know if you have any questions.
See the picture for create table options:
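For reference, the generated step boils down to roughly the following sketch (the table names and the macro variable are hypothetical, and the exact code your DI Studio version emits may differ):
/* Set by the conditional node: 0 on the negative branch, */
/* MAX (no effective row limit) on the positive branch    */
%let outobs_limit = 0;

proc sql outobs=&outobs_limit;
    create table work.join_result as
        select * from work.source_table;
quit;
/* With outobs=0 the table is created with all columns but no rows */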
In the end I created another step that extracts 0 rows from the source table by putting the condition 1=0 on the Where tab. This gives me an empty table that I can use with a DATA/SET step in the post-SQL of the conditional run whenever the join's work table does not exist.
This is not a real solution, but it is a valid workaround.
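For anyone wanting the same workaround in plain code, the extraction and the post-SQL fallback look roughly like this (table names are hypothetical):
proc sql;
    create table work.join_result_empty as
        select * from work.source_table
        where 1=0; /* zero rows, but the full column structure of the source */
quit;

/* In the post-SQL of the conditional run, when the join's work table was not created */
data work.join_result;
    set work.join_result_empty;
run;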
I'm using Pentaho DI (Kettle) and I'm not sure what's the best way to do the following:
From a downloaded CSV file, check whether a column exists, and based on that select the right next step.
There are 3 possible options.
Thanks,
Isaac
You did not mention the possible options, so I'll just provide a sketch showing how to check whether a column exists in a file.
For this you will need a CSV file input step and a Metadata structure of stream step, which reads the metadata of the incoming stream.
For a sample CSV file with three columns named col1, col2 and col3, the Metadata structure of stream step outputs one row per column, with the column's name as the value of the Fieldname field.
Then, depending on your needs, you can use for example a Filter Rows or a Switch / Case step for further processing.
I am using the SAS Enterprise Miner 13.2.
I have a SAS table as a data source. In this table I have a binary variable D_TYP (values "I" and "P") and other categorical variables.
I want to split the data by D_TYP so that I get two tables, one with all the "I" observations and the other with the "P" observations. The problem is that I don't know how.
I have been looking through the toolbar and I tried Filter and Data Partition. I could probably use SAS code to split the data, but I think there is another way using the available tasks.
You could use two Filter nodes to do the job, one filtering out "I" and the other filtering out "P". Each resulting dataset will then contain only one level of the binary variable. If you are not familiar with the Filter node: click the Class Variables option in the properties panel and apply a user-specified filter. You select the group to exclude manually by clicking on its corresponding bar.
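If you do fall back to a SAS Code node, the split itself is a single DATA step; a minimal sketch with hypothetical table names:
/* Write each level of D_TYP to its own table in one pass */
data work.data_typ_i work.data_typ_p;
    set work.source_table;
    if D_TYP = "I" then output work.data_typ_i;
    else if D_TYP = "P" then output work.data_typ_p;
run;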
One of our critical SAS datasets has been left open in SAS Enterprise Guide by our offshore associate. Many of our jobs depend on that dataset for updates. I have searched various sites for an option to unlock the dataset, but to no avail. Kindly provide any suggestions. Thanks.
Depending on the specifics of your situation, another option is to prevent anyone from locking it in the first place, using a PW= dataset option like:
data myImportantTable(PW=pass123);
x=1;output;
run;
Then you could create a view that allows EG users to click and see the underlying data, but does not LOCK the original dataset:
proc sql;
CREATE VIEW myImportantTable_view AS
SELECT * FROM myImportantTable(read=pass123)
;quit;
Now INSERTS, UPDATES etc will work even if the view is opened by a user in EG:
*This will work even if view is opened in EG;
proc sql;
INSERT INTO myImportantTable(PW=pass123) VALUES(101)
;quit;
Note that this is not a good option if you've got a lot of different INSERT/UPDATE statements spread throughout your program - each of them would need the (PW=...) dataset option added in order to work.
Use the SYSTASK statement to execute a UNIX mv (move) or cp (copy) command to replace the existing data set. If you need to move or copy more than one data set at a time, you can use the * wildcard, but you must also use the SHELL option.
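For example, a sketch with hypothetical paths (WAIT makes the SAS session wait for the copy to finish; SHELL is required because of the * wildcard):
/* Copy the refreshed data sets over the locked members at the file-system level */
systask command "cp /staging/mylib/*.sas7bdat /prod/mylib/" wait shell;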
There is an option in SAS Enterprise Guide, under Tools --> Options --> Data --> Performance: the check box "Close Data Grid after period of inactivity (in minutes)". This way, even if the data grid is left open, after 'n' minutes of inactivity the dataset will be released and available for others to update.