Great Expectations - Run Validation over specific subset of a PostgreSQL table

I am fairly new to Great Expectations and have a question. Essentially, I have a PostgreSQL database, and every time I run my data pipeline, I want to validate a specific subset of a PostgreSQL table based on some key. E.g. if the data pipeline is run every day, there would be a field called current_batch, and the validation would occur for the query below:
SELECT * FROM jobs WHERE current_batch = <input_batch>.
I am unsure of the best way to accomplish this. I am using the v3 API of Great Expectations and am a bit confused as to whether to use a checkpoint or a validator. I assume I want to use a checkpoint, but I can't figure out how to create one that only validates a specific subset of the PostgreSQL datasource.
Any help or guidance would be much appreciated.
Thanks,

I completely understand your confusion because I am working with GE too and the documentation is not really clear.
First of all "Validators" are now called "Checkpoints", so they are not a different entity, as you can read here.
I am working on an Oracle database and the only way I found to apply a query before testing my data with expectations is to put the query inside the checkpoint.
To create a checkpoint you should run the great_expectations checkpoint new command from your terminal. After creating it, you should add the "query" field inside the .yml file that is your checkpoint.
Below you can see a snippet of a checkpoint I am working with. When I want to validate my data, I run the command great_expectations checkpoint run check1
name: check1
module_name: great_expectations.checkpoint
class_name: LegacyCheckpoint
batches:
  - batch_kwargs:
      table: pso
      schema: test
      query: SELECT p AS c,
        [ ... ]
        AND lsr = c)
      datasource: my_database
      data_asset_name: test.pso
    expectation_suite_names:
      - exp_suite1
Hope this helps! Feel free to ask if you have any doubts :)

I managed this using views (in Postgres). Before running GE, I create (or replace) a view as a query with all the necessary joins, filtering, aggregations, etc., and then specify the name of this view in the GE checkpoints.
Yes, it is not the ideal solution. I would rather use a query in checkpoints too. But as a workaround, it covers all my cases.
Let's have a view like this:
CREATE OR REPLACE VIEW table_to_check_1_today AS
SELECT * FROM initial_table
WHERE dt = current_date;
And the checkpoint configured something like this:
name: my_task.my_check
config_version: 1.0
validations:
  - expectation_suite_name: my_task.my_suite
    batch_request:
      datasource_name: my_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: table_to_check_1_today
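If you prefer to trigger that checkpoint from Python (for example from the data pipeline itself) rather than the CLI, a minimal sketch looks like this, assuming the checkpoint above was saved under the name my_task.my_check:

from great_expectations.data_context import DataContext

context = DataContext()  # picks up the great_expectations/ project in the working directory
result = context.run_checkpoint(checkpoint_name="my_task.my_check")
print(result.success)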

Yes, a view can be created using current_date, and the checkpoint can simply run against the view. However, this means the variable (current_date) is stored in the database, which may not be desirable; you might want to run the query in the checkpoint for a different date, which could come from an environment variable or elsewhere and be passed to the CLI or a Python notebook.
I have yet to find a solution where we can substitute a string into the checkpoint query; using a config variable from the file is very static, since there may be different checkpoints running for different dates.
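One option worth trying with the v3 API is a RuntimeBatchRequest, which lets you inject the query (and therefore the date or batch key) at run time from Python instead of hard-coding it in a checkpoint or a view. A rough sketch, assuming your Postgres datasource has a RuntimeDataConnector configured (called default_runtime_data_connector_name below, with batch_id as a batch identifier) and an existing suite called my_suite:

import os

import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()

# The batch key can come from an environment variable, a CLI argument, etc.
input_batch = os.environ.get("INPUT_BATCH", "2021-01-01")

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="jobs_current_batch",  # any label for this slice of data
    runtime_parameters={
        "query": f"SELECT * FROM jobs WHERE current_batch = '{input_batch}'"
    },
    batch_identifiers={"batch_id": input_batch},
)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite",
)
print(validator.validate())

The same batch_request can also be passed to a checkpoint at run time through the validations argument of context.run_checkpoint, so the checkpoint itself stays generic.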

Related

GroupBy existing attribute present in JSON string line in Apache Beam Java

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest among them for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply DoFn to compare timestamp of records in each group and have only latest one from them
Flatten it, convert to table rows, and insert into BQ.
But I am unable to proceed with step 1. I see GroupByKey.create() but am unable to make it use the customer id as the key.
I am implementing this using Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);

// Once we have the data in JsonObjects, we key by customer ID:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.

SQL Server to BigQuery DDL

I am new to BigQuery and am looking for ways to migrate all the table structures from SQL Server to BigQuery. There are close to 300 tables that need to be created. Is there any way to do this automatically or in less time? Kindly throw some light on it, as I presume many would have done this task.
Thanks in advance,
Venkat.
I recently wrote a blog post about the subject:
https://ael-computas.medium.com/copy-sql-server-data-to-bigquery-without-cdc-c520b408bddf
I also released a package on PyPI that you can use as a base for your integrations:
from database_to_bigquery.sql_server import SqlServerToCsv, SqlServerToBigquery

sql_server_to_csv = SqlServerToCsv(username="scott",
                                   password="t1ger",
                                   host="127.0.0.1//optionalinstance_name",
                                   database="thedb",
                                   destination=f"gs://gcsbucketname")

bigquery = SqlServerToBigquery(sql_server_to_csv=sql_server_to_csv)

result = bigquery.ingest_table(sql_server_table="table_to_read",
                               sql_server_schema="dbo",
                               bigquery_destination_project="bigqueryproject",
                               bigquery_destination_dataset="bigquerydataset")
print(result.full_str())
Just getting the schema could be something like this.
columns, pks = sql_server_to_csv.get_columns("table", "dbo")
bigquery.write_bigquery_schema(columns, "/path/to/file")
# /path/to/file can be in cloud storage, ie - gs://foo/bar.json
It does make some assumptions about datatypes that you might want to override, and that is entirely possible.
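If you would rather not depend on a package, a rough do-it-yourself sketch of the same idea (read INFORMATION_SCHEMA from SQL Server, map the types, and emit one BigQuery JSON schema file per table) could look like the following; the connection string and the deliberately small type map are illustrative only and will need adjusting:

import json
import pyodbc

# Minimal SQL Server -> BigQuery type map; extend as needed.
TYPE_MAP = {
    "int": "INT64", "bigint": "INT64", "smallint": "INT64", "bit": "BOOL",
    "decimal": "NUMERIC", "numeric": "NUMERIC", "float": "FLOAT64",
    "date": "DATE", "datetime": "DATETIME", "datetime2": "DATETIME",
    "char": "STRING", "varchar": "STRING", "nvarchar": "STRING", "text": "STRING",
}

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=127.0.0.1;DATABASE=thedb;UID=scott;PWD=t1ger"
)
cursor = conn.cursor()
cursor.execute(
    "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE "
    "FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = 'dbo' "
    "ORDER BY TABLE_NAME, ORDINAL_POSITION"
)

# Group the column definitions per table.
schemas = {}
for table, column, data_type, is_nullable in cursor.fetchall():
    schemas.setdefault(table, []).append({
        "name": column,
        "type": TYPE_MAP.get(data_type.lower(), "STRING"),
        "mode": "NULLABLE" if is_nullable == "YES" else "REQUIRED",
    })

# One JSON schema file per table, loadable with: bq mk --table dataset.table_name table_name.json
for table, fields in schemas.items():
    with open(f"{table}.json", "w") as f:
        json.dump(fields, f, indent=2)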

Maintain an audit table through a reusable framework

I was asked to create a control table with Informatica. I am a newbie and do not have much knowledge about it. I saw the same kind of thing in my previous project but don't know how to create a mapplet for it. The requirement is that I have to create a mapplet which has the following columns:
- mapping_name
- session_name
- last_run_date
- source count
- target count
- status
So what happens is
Example: We executed a workflow with a particular mapping last week.
Now after 1 week we are executing the same mapping.
The requirement is that we should be fetching only those records which fall in this particular time frame (i.e. from the previous run to the current run). This is something I do not know how to do.
Can you please help me out? I can provide further details if required.
There is a solution provided in the link below, but it doesn't use a mapplet.
Note that if you want to use a mapplet, you won't get the 'status' attribute, and the mapplet approach can be difficult to implement for all mappings.
You can use this link to gather statistics as well.
http://powercenternotes.blogspot.com/2014/01/an-etl-framework-for-operational.html
Now, regarding your other requirement, it seems to me to be an issue with incremental extraction. So, you need to store the date parameter from when you last ran your flow - in a DB table or flat file.
Use that as reference and pull anything greater than that date.
Mapplet - We used this approach earlier to gather statistics. But this is difficult because you need to add this mapplet + a reusable generic target to capture stats.
Input -
  Type_of_data - (this can be source, target)
  unique_key - (unique key of the mapping)
  MappingName - $PMMappingName
  SessionName - $PMSessionName

Aggregator -
  i/p -
    Type_of_data
    unique_key
    MappingName (group by)
    SessionName (group by)
  o/p -
    count_row = COUNT(*)

Output -
  Type_of_data
  MappingName
  SessionName
  count_row
Use a reusable generic target to capture all the rows. You need to add one set after each source and one set before each target. The approach in the link is better, I think.

PDI - Update field value in Logging tables

I'm trying to create a transformation that can change a field value in the DB (PostgreSQL is what I use).
Case:
In the Postgres DB I have a table called Monitoring, which has several fields like id, date, starttime, endtime, duration, transformation name, status, desc. All those values I get from Transformation Logging.
So, when I run the transformation, it inserts into the Monitoring table and sets the status field to Running. When it is done, it updates the status to Finish. What I'm trying to do is define the value of the table field myself rather than take it from Transformation Logging, so I can customize the value the way I want.
The goal is to update the transformation status value from 'running' to 'finish/error/abort etc.' in my DB using Pentaho and display that status in a web app.
I have been thinking of using the Modified Java Script step to do it, but is there another, better way? (Just need an opinion about this.)
Apart from my remark, did you try the Value Mapper?
Modified Java Script is not a good idea to use; ideally, it shouldn't be used due to performance issues. You can use the "Add constants" step or a "User Defined Java Class" as an alternative.
You cannot change the values of the built-in Logging tables, for the simple reason that they are reserved for PDI usage. This causes a known issue in case of a hard error: for example, the status is not set to finish when the database server crashes, or when a NullException is not caught by the PDI code.
You have some workarounds.
The simplest, the one used in the ETL-Pilot, is to test (Status = Finish OR LogDate < 15 minutes ago) in the web app.
You can update the table when the transformation is not running. For example, put an hourly (or more frequent) crontab that changes the status to Finish for any transformation whose LogDate is older than 15 minutes. This crontab may be a simple SQL statement or be included in a transformation that also checks the table sizes and/or sends an email in case of potential error; a rough sketch of such a cleanup job follows these workarounds.
You can copy the table (if it is a non-locking operation in your DB system), modify the Status column and use this table for your web app.
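For the crontab workaround, a minimal Python sketch with psycopg2 could look like this; the table and column names (monitoring, status, logdate) and the 'running'/'finish' values are assumptions based on the question and will need to match your actual logging table:

import psycopg2

# Connection details are illustrative.
conn = psycopg2.connect("dbname=mydb user=monitor password=secret host=localhost")
try:
    # Using the connection as a context manager commits on success.
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE monitoring
               SET status = 'finish'
             WHERE status = 'running'
               AND logdate < now() - interval '15 minutes'
            """
        )
        print(f"Rows updated: {cur.rowcount}")
finally:
    conn.close()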

Simple data update with Talend

I have a job in Talend which migrates some data from one database to another. At the end of the data migration, I should update the date of last extraction in the source database with SYSDATE so it can be used as a criterion for the next extraction. The SQL query would be something like:
UPDATE MIGR_FOLLOWUP SET LAST_EXTR = SYSDATE WHERE SYSTEM = 'TARGET3'
I'd like to do that update in Talend, and I guess it should be a component triggered by OnSubjobOk, but I just can't figure out how to do this in a simple manner. The only way I can think of is using both tOracleInput and tOracleOutput components, in order to first extract the wanted row and then update it, but that really doesn't sound like a good way to do this.
Can anyone point me out on how to do this?
Thanks!!
You can run arbitrary SQL by using the database row components such as tOracleRow.
If you link this with an OnSubjobOk link from your main migration, then once the main migration completes successfully it will update the LAST_EXTR field with the current time.
Alternatively, you could update this using a tOracleOutput component, but you would need to have Talend define the timestamp using something like TalendDate.getCurrentDate() in a tMap or tFixedFlowInput.