Simple data update with Talend - sql-update

I have a job in Talend which migrates some data from one database to another. At the end of the data migration, I need to update the date of last extraction in the source database with SYSDATE so it can be used as the criterion for the next extraction. The SQL query would be something like:
UPDATE MIGR_FOLLOWUP SET LAST_EXTR = SYSDATE WHERE SYSTEM = 'TARGET3'
I'd like to do that update in Talend, and I guess it should be a component triggered by OnSubjobOk, but I just can't seem to figure out how to do this in a simple manner. The only way I can think of is using both tOracleInput and tOracleOutput components, in order to first extract the wanted row and then update it, but that really doesn't seem like a good way to do it.
Can anyone point me in the right direction?
Thanks!!

You can run arbitrary SQL by using the database row components, such as tOracleRow.
If you link this with an OnSubjobOk trigger from your main migration, then once the main migration completes successfully it will update the LAST_EXTR field with the current time.
Alternatively, you could do the update with a tOracleOutput component, but you would need to have Talend define the date-time stamp using something like TalendDate.getCurrentDate() in a tMap or tFixedFlowInput.
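For the tOracleRow approach, a minimal sketch of the SQL involved (MIGR_FOLLOWUP and its columns come from the question; the SELECT that consumes LAST_EXTR on the next run is only an assumption about what the extraction criterion might look like, with hypothetical source table and column names):

-- Query field of a tOracleRow, triggered via OnSubjobOk after the migration subjob
UPDATE MIGR_FOLLOWUP SET LAST_EXTR = SYSDATE WHERE SYSTEM = 'TARGET3'

-- On the next run, the stored date can drive the extraction, e.g.:
SELECT *
FROM SOURCE_TABLE        -- hypothetical source table
WHERE LAST_MODIFIED >    -- hypothetical change-date column
      (SELECT LAST_EXTR FROM MIGR_FOLLOWUP WHERE SYSTEM = 'TARGET3')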

Related

Great Expectations - Run Validation over specific subset of a PostgreSQL table

I am fairly new to Great Expectations and have a question. Essentially I have a PostgreSQL database, and every time I run my data pipeline, I want to validate a specific subset of the PostgreSQL table based on some key. E.g.: if the data pipeline is run every day, there would be a field called current_batch, and the validation would occur for the below query:
SELECT * FROM jobs WHERE current_batch = <input_batch>.
I am unsure of the best way to accomplish this. I am using the v3 API of Great Expectations and am a bit confused as to whether to use a checkpoint or a validator. I assume I want to use a checkpoint, but I can't seem to figure out how to create a checkpoint that only validates a specific subset of the PostgreSQL datasource.
Any help or guidance would be much appreciated.
Thanks,
I completely understand your confusion because I am working with GE too and the documentation is not really clear.
First of all "Validators" are now called "Checkpoints", so they are not a different entity, as you can read here.
I am working on an Oracle database and the only way I found to apply a query before testing my data with expectations is to put the query inside the checkpoint.
To create a checkpoint you should run the great_expectations checkpoint new command from your terminal. After creating it, you should add the "query" field inside the .yml file that is your checkpoint.
Below you can see a snippet of a checkpoint I am working with. When I want to validate my data, I run the command great_expectations checkpoint run check1
name: check1
module_name: great_expectations.checkpoint
class_name: LegacyCheckpoint
batches:
  - batch_kwargs:
      table: pso
      schema: test
      query: SELECT p AS c,
        [ ... ]
        AND lsr = c)
      datasource: my_database
      data_asset_name: test.pso
    expectation_suite_names:
      - exp_suite1
Hope this helps! Feel free to ask if you have any doubts :)
I managed this using views (in Postgres). Before running GE, I create (or replace) the view with a query containing all the necessary joins, filtering, aggregations, etc., and then specify the name of this view in the GE checkpoint.
Yes, it is not the ideal solution. I would rather use a query in checkpoints too. But as a workaround, it covers all my cases.
Let's have a view like this:
CREATE OR REPLACE VIEW table_to_check_1_today AS
SELECT * FROM initial_table
WHERE dt = current_date;
And the checkpoint configured something like this:
name: my_task.my_check
config_version: 1.0
validations:
  - expectation_suite_name: my_task.my_suite
    batch_request:
      datasource_name: my_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: table_to_check_1_today
Yes, a view can be created using current_date, and the checkpoint can simply run against that view. However, this means the variable (current_date) is effectively stored in the database, which may not be desirable; you might want to run the query in the checkpoint for a different date, which could come from an environment variable or elsewhere, passed to the CLI or to Python/a notebook.
I have yet to find a solution where we can substitute a string into the checkpoint query; using a config variable from the file is very static, since there may be different checkpoints running for different dates.

PDI - Update field value in Logging tables

I'm trying to create a transformation that can change a field value in the DB (PostgreSQL is what I use).
Case:
In the Postgres DB I have a table called Monitoring, and it has several fields like id, date, starttime, endtime, duration, transformation name, status, desc. All those values come from Transformation Logging.
So, when I run the transformation it inserts into the Monitoring table and sets the status field to Running, and when it is done it updates the status to Finish. What I'm trying to do is define the value of the table field myself, rather than take it from Transformation Logging, so I can customize the value the way I want.
The goal is to update the transformation status value from 'running' to 'finish/error/abort etc.' in my DB using Pentaho, and display that status in a web app.
I have been thinking of using the Modified Java Script step to do it, but is there any other way, maybe a better one? (Just need an opinion about this.)
Apart from my remark, did you try the Value Mapper?
Modified Java Script is not a good idea to use; ideally it should be avoided due to performance issues. You can use the "Add constants" step or a "User Defined Java Class" as an alternative.
You cannot change the values of the built-in logging tables, for the simple reason that they are reserved for PDI usage. This causes a known issue in case of hard errors: for example, the status is not set to Finish when the database server crashes, or when a NullPointerException is not caught by the PDI code.
There are some workarounds.
The simplest, the one used in the ETL-Pilot, is to test (Status = Finish OR LogDate < 15 minutes ago) in the web app.
You can update the table when the transformation is not running. For example, put an hourly (or more frequent) crontab that sets to Finish the status of any transformation whose LogDate is older than 15 minutes (see the SQL sketch below). This crontab may run a simple SQL statement, or a transformation that also checks the table sizes and/or sends an email in case of a potential error.
You can copy the table (if that is a non-locking operation in your DB system), modify the Status column, and use this copy for your web app.
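A minimal sketch of that cron-driven clean-up, assuming the Monitoring table from the question with status and logdate columns (the column names here are assumptions, not the PDI defaults):

-- Any run still flagged as Running after 15 minutes is considered finished/stale
UPDATE monitoring
SET status = 'Finish'
WHERE status = 'Running'
  AND logdate < now() - interval '15 minutes';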

Changing Length of Siebel Column

Suppose we have an existing Siebel column, and this column also has a corresponding mapped EIM column. If I change the length of this Siebel base table's column from varchar(100) to varchar(200) by running an ALTER query from the backend, how will it impact the EIM process? Will the import process be successful?
Regards,
Robin
If you are interested in knowing conceptually, here are the implications that I can foresee.
a) A table column added using ALTER TABLE is virtually useless, as the application won't be able to use it because its definition is missing from the Siebel Repository.
b) If you change the length of an existing column, the application will still use the length defined in the Siebel Repository.
c) The EIM process will ignore your new column length, as it loads the data dictionary before running the job.
d) And finally, during code migration you will have to run the ALTER TABLE every time, since the DDLSync process cannot take care of your scenario.
I would advise you not to alter the length of an existing vanilla table column, and instead extend the database table by adding a new column. Just as the other poster mentioned, you should do this using Siebel Tools. You will then also need to add a reference for this new field to the EIM components (this you also do using Siebel Tools).
This is a best practice. If your client ever had a Siebel code review done by Oracle, you would be told to do what I described above (not what you were considering doing).
Changing the column length using the ALTER TABLE command will only change it in the database layer, which has no effect from a Siebel standpoint. The EIM tables will still be valid, as they will use the column length defined in the repository and applied by Tools. If you don't change it in Tools and apply the table, I don't think the changes will work.
I would not recommend that you do this. In this case, probably nothing will go wrong: EIM columns will load data that are up to 100 characters long, but from the GUI you could insert up to 200 characters. Still, something unexpected can go wrong; we would need to know your application better to answer this question.

How can I get all data exported by communication node for post processing in MA Tool of CI Studio?

I am trying to do post-processing on data exported by the communication node. One option I have is to export the data as a SAS dataset and import it in the process node, but if I could get it directly from a macro variable like &intable or something similar, it would be easier for me. I have already tried &intable and &intable1; they only have subject_id-level data and not all the data being exported by the communication node. Is this possible? If so, how?
The communication node will be creating a SAS dataset with the same structure as the export definition. That SAS dataset can be used in the post-processing node.
Try an inner join to your exported dataset from &intable.
Something like this (the SET column names are placeholders, since the actual columns depend on your export definition):
UPDATE e
SET e.some_column = i.some_column
FROM tempdb.exporteddata AS e
INNER JOIN &intable AS i
  ON e.subject_id = i.subject_id
&intable is strictly subject IDs. Other attributes are not editable on that variable. Nor can you directly edit a field in the information map. But process nodes allow you to work with data that SAS has access to.
Worst case, you can try writing the data to its own dataset and running a pass-through query against the database engine to perform the manipulation.

How do I update a value in a row in MySQL using Connector/C++

I have a simple database and want to update an int value. I initially do a query and get back a ResultSet (sql::ResultSet). For each of the entries in the result set I want to modify a value that is in one particular column of a table, then write it back out to the database/update that entry in that row.
It is not clear to me from the documentation how to do that. I keep seeing "insert" statements along with updates, but I don't think that is what I want; I want to keep most of the row of data intact and just update one column.
Can someone point me to some sample code or other clear reference/resource?
EDIT:
Alternatively, is there a way to tell the database to update a particular field (row/col) to increment an int value by some value?
EDIT:
So what is the typical way that people use MySQL from C++? The C API, or MySQL++? I guess I chose the wrong API...
From a quick scan of the docs it appears Connector/C++ is a partial implementation of the Java JDBC API for C++. I didn't find any reference to updateable result sets so this might not be possible. In Java JDBC the ResultSet interface includes support for updating the current row if the statement was created with ResultSet.CONCUR_UPDATABLE concurrency.
You should investigate whether Connector/C++ supports updateable resultsets.
EDIT: To update a row you will need to use a PreparedStatement containing an SQL UPDATE, and then call the statement's executeUpdate() method. With this approach you must identify the record to be updated with a WHERE clause. For example:
update users set userName='John Doe' where userID=?
Then you would create a PreparedStatement, set the parameter value, and then call executeUpdate().
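For the second EDIT (incrementing an int directly in the database), the UPDATE itself can do the arithmetic, so there is no need to read the value back first. A sketch reusing the users/userID names from the example above, with loginCount as a hypothetical counter column (the ? is bound through the prepared statement's setInt()):

-- Increment the counter in place; no SELECT needed beforehand
UPDATE users SET loginCount = loginCount + 1 WHERE userID = ?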