Informatica Developer (IDQ) stats

How can we capture mapping statistics such as the mapping name, source rows, target rows, start time, and end time in the Informatica Developer (IDQ) tool and store them in a table?

You can see the mapping statistics under Admin Console > Monitor > Execution Statistics > DIS name > Ad Hoc Jobs/Specific application.
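If the goal is to land those statistics in a relational table, the target table itself could look something like the sketch below; the column names and types are assumptions taken from the question, and the values would still have to be collected (for example from the Monitoring view above or by an audit step in the mapping itself).
-- Hypothetical target table for the run statistics (adjust types for your database)
CREATE TABLE mapping_run_stats (
  mapping_name VARCHAR(255),
  source_rows  BIGINT,
  target_rows  BIGINT,
  start_time   TIMESTAMP,
  end_time     TIMESTAMP
);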

Related

AWS Kinesis Analytics Studio Zeppelin Flink Hive error

I created a Studio notebook in Kinesis Analytics, and I can see data coming in via MQTT in the SQL (legacy) Analytics application, so I am receiving data:
When I go to "open in Apache Zeppelin", I create the table:
%flink.ssql
CREATE TABLE `ppgsignal0903` (
  `timestamp` BIGINT,
  `[Heart Rate Measurement]` DOUBLE,
  `[Energy Expended]` DOUBLE,
  `RR-Interval` DOUBLE,
  `iso_time` AS TO_TIMESTAMP(FROM_UNIXTIME(`timestamp`))
)
WITH (
  'connector' = 'kinesis',
  'stream' = 'PPG_PW',
  'aws.region' = 'eu-central-1',
  'scan.stream.initpos' = 'LATEST',
  'format' = 'json'
)
Data is coming in all the time. When I go to look at my table:
%flink.ssql(type=update)
SELECT * FROM ppgsignal0903;
I get the following error:
Fail to run sql command: SELECT * FROM ppgsignal0903
Unable to create a source for reading table
'hive.ppgdatabase.ppgsignal0903'.
Table options are:
'aws.region'='eu-central-1'
'connector'='kinesis'
'format'='json'
'scan.stream.initpos'='LATEST'
'stream'='PPG_PW'
Does anyone have a tip?
I need to do some analytics and manipulate the data to show real-time charts (for example, heart beats per second, or the time between diastolic and systolic blood pressure over the last 10 minutes), so I need several paragraphs that I can run separately against the real-time data, as in the sketch below.
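Once the source table can actually be read, a per-second aggregation in its own Zeppelin paragraph could look roughly like the sketch below. This is only a sketch: it assumes the DDL above is extended with a processing-time attribute (for example `proc_time AS PROCTIME()`), since windowed aggregation needs a time attribute.
%flink.ssql(type=update)
-- Hypothetical sketch: one-second tumbling window over the PPG stream.
-- Assumes a processing-time column `proc_time AS PROCTIME()` was added to the DDL above.
SELECT
  TUMBLE_START(proc_time, INTERVAL '1' SECOND) AS window_start,
  COUNT(*) AS samples,
  AVG(`[Heart Rate Measurement]`) AS avg_heart_rate
FROM ppgsignal0903
GROUP BY TUMBLE(proc_time, INTERVAL '1' SECOND)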

Optimal ETL process and platform

I am faced with the following problem, and I am a newbie to cloud computing and databases. I want to set up a simple dashboard for an application. Basically, I want to replicate this site, which shows data about air pollution: https://airtube.info/
As I see it, what I need to do is the following:
Download data from API: https://github.com/opendata-stuttgart/meta/wiki/EN-APIs and I have this link in mind in particular "https://data.sensor.community/static/v2/data.1h.json - average of all measurements per sensor of the last hour." (Technology: Python bot)
Set up a bot to transform the data a little bit to tailor them for our needs. (Technology: Python)
Upload the data to a database. (Technology: Google Big-Query or AWS)
Connect the database to a visualization tool so everyone can see it on our webpage. (Technology: Probably Dash in Python)
My questions are the following.
1. Do you agree with my thought process, or would you change some element to make it more efficient?
2. What do you think about running a python script to transform the data? Is there any simpler idea?
3. Which technology would you suggest to set up the database?
Thank you for the comments!
Best regards,
Bartek
If you want to do some analysis on your data, I recommend uploading it to BigQuery; once that is done, you can create new queries there and get the results you want to analyze. I was checking the dataset "data.1h.json", and I would create a table in BigQuery using a schema like this one:
CREATE TABLE dataset.pollution
(
  id NUMERIC,
  sampling_rate STRING,
  timestamp TIMESTAMP,
  location STRUCT<
    id NUMERIC,
    latitude FLOAT64,
    longitude FLOAT64,
    altitude FLOAT64,
    country STRING,
    exact_location INT64,
    indoor INT64
  >,
  sensor STRUCT<
    id NUMERIC,
    pin STRING,
    sensor_type STRUCT<
      id INT64,
      name STRING,
      manufacturer STRING
    >
  >,
  sensordatavalues ARRAY<STRUCT<
    id NUMERIC,
    value FLOAT64,
    value_type STRING
  >>
)
OK, we have already created our table, so now we need to insert all the data from the JSON file into it. To do that, and since you want to use Python, I would use the BigQuery Python client library [1] to read the data from a bucket in Google Cloud Storage [2], where the file has to be stored, and transform it to upload it to the BigQuery table.
The code would be something like this:
from google.cloud import storage
from google.cloud import bigquery
import json

client = bigquery.Client()
table_id = "project.dataset.pollution"

# Instantiate a Google Cloud Storage client and point it at the bucket and
# file where the JSON dump is stored
storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket')
blob = bucket.blob('folder/data.1h.json')
table = client.get_table(table_id)

# Download the contents of the blob as a string and then parse it using the
# json.loads() method
data = json.loads(blob.download_as_string(client=None))

# Partition the request in order to avoid hitting the request-size quota
partition = len(data) // 4
cont = 0
data_aux = []
for part in data:
    if cont >= partition:
        errors = client.insert_rows(table, data_aux)  # Make an API request.
        if errors == []:
            print("New rows have been added.")
        else:
            print(errors)
        cont = 0
        data_aux = []
    # Avoid empty values (clean data)
    if part['location']['altitude'] == "":
        part['location']['altitude'] = 0
    if part['location']['latitude'] == "":
        part['location']['latitude'] = 0
    if part['location']['longitude'] == "":
        part['location']['longitude'] = 0
    data_aux.append(part)
    cont += 1

# Insert whatever is left over in the final, smaller batch
if data_aux:
    errors = client.insert_rows(table, data_aux)
    if errors:
        print(errors)
As you can see above, I had to partition the data in order to avoid hitting the quota on the size of a single request; you can see the relevant quotas here [3].
Also, some data in the location field has empty values, so it is necessary to handle them to avoid errors.
And since you already have your data stored in BigQuery, to create a new dashboard I would use the Data Studio tool [4] to visualize your BigQuery data and create queries over the columns you want to display, for example the query sketched below.
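For example, a query along the lines of the sketch below could feed a chart of average readings per country. Note that the value_type 'P1' (PM10) and the fully qualified table name are assumptions based on the schema above, so adjust them to your own project and measurements.
-- Hypothetical example query for the dashboard (names are placeholders)
SELECT
  location.country,
  AVG(v.value) AS avg_p1
FROM `project.dataset.pollution`,
  UNNEST(sensordatavalues) AS v
WHERE v.value_type = 'P1'
  AND `timestamp` > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY location.country
ORDER BY avg_p1 DESC;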
[1] https://cloud.google.com/bigquery/docs/reference/libraries#using_the_client_library
[2] https://cloud.google.com/storage
[3] https://cloud.google.com/bigquery/quotas
[4] https://cloud.google.com/bigquery/docs/visualize-data-studio

Getting Info from GCP Data Catalog

I notice that when you query the Data Catalog in Google Cloud Platform, it retrieves stats for the number of times a table has been queried:
Queried (Past 30 days): 5332
This is extremely useful information and I was wondering where this is actually stored and if it can be retrieved for all the tables in a project or a dataset.
I have trawled the Data Catalog tutorials and written some Python scripts, but these just return entry names for tables in an iterator, which is not what I am looking for.
Likewise, I also cannot see this data in the INFORMATION_SCHEMA metadata.
You can retrieve the number of completed queries for any table/dataset by exporting log entries to BigQuery. Every query generates some logging on Stackdriver, so you can use advanced filters to select the logs you are interested in and store them as a new table in BigQuery.
However, the retention period for the data access logs in GCP is 30 days, so you can only export logs from the past 30 days.
For instance, use the following advanced filter to get the logs corresponding to all the completed jobs on a specific table:
resource.type="bigquery_resource" AND
log_name="projects/<project_name>/logs/cloudaudit.googleapis.com%2Fdata_access" AND
proto_payload.method_name="jobservice.jobcompleted"
"<table_name>"
Then select BigQuery as the sink service and specify a name for your sink and the dataset where the exported table will be stored.
All the completed jobs on this table performed after the sink is established will appear as a new table in BigQuery. You can query this table to get information about the logs (for instance, a COUNT over any column gives the total number of successful jobs, as in the sketch below).
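For example, since every row exported by this sink corresponds to a completed job on the filtered table, a simple count over the exported table gives the total. The project, dataset, and table names below are placeholders that depend on how you configured the sink.
-- Hypothetical sketch: total completed jobs captured by the sink
SELECT COUNT(*) AS completed_jobs
FROM `my_project.my_sink_dataset.my_exported_logs_table`;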
This information is available through the projects.locations.entryGroups.entries/get API. It is available as a UsageSignal and contains usage information for the last 24 hours, 7 days, and 30 days.
Sample output:
"usageSignal": {
"updateTime": "2021-05-23T06:59:59.971Z",
"usageWithinTimeRange": {
"30D": {
"totalCompletions": 156890,
"totalFailures": 3,
"totalCancellations": 1,
"totalExecutionTimeForCompletionsMillis": 6.973312e+08
},
"7D": {
"totalCompletions": 44318,
"totalFailures": 1,
"totalExecutionTimeForCompletionsMillis": 2.0592365e+08
},
"24H": {
"totalCompletions": 6302,
"totalExecutionTimeForCompletionsMillis": 25763162
}
}
}
Reference:
https://cloud.google.com/data-catalog/docs/reference/rest/v1/projects.locations.entryGroups.entries/get
https://cloud.google.com/data-catalog/docs/reference/rest/v1/projects.locations.entryGroups.entries#UsageSignal
With the Python Data Catalog client, you first need to search the Data Catalog, and you will receive a linked_resource in the response.
Pass this linked_resource in a request to lookup_entry and you will fetch the "Queried (past 30 days)" count:
from google.cloud import datacatalog_v1

dc_client = datacatalog_v1.DataCatalogClient()

# The search scope and query below are placeholders; restrict them to your own project.
scope = datacatalog_v1.SearchCatalogRequest.Scope(include_project_ids=["my-project"])
request = datacatalog_v1.SearchCatalogRequest(scope=scope, query="system=bigquery type=table")

results = dc_client.search_catalog(request=request, timeout=120.0)
for result in results:
    linked_resource = result.linked_resource
    # Look up the entry and get the number of times the table was queried in the last 30 days
    table_entry = dc_client.lookup_entry(request={"linked_resource": linked_resource})
    queried_past_30_days = table_entry.usage_signal.usage_within_time_range.get("30D")
    if queried_past_30_days is not None:
        dc_num_queried_past_30_days = int(queried_past_30_days.total_completions)
    else:
        dc_num_queried_past_30_days = 0

Why does this work as a WQL query?

I'm writing a PowerShell script and am querying the local DNS resource for both CNAME and A records matching specific criteria (specifically against HP servers). Running WQL with full administrator rights against the root\MicrosoftDNS provider, I was presented with 0 records for the following:
select * from MicrosoftDNS_ResourceRecord where TextRepresentation like '%sql%'
However, negating a negation works.
select * from MicrosoftDNS_ResourceRecord where NOT(NOT(TextRepresentation like '%sql%'))
Why? Am I going insane?
The full query is:
select * from MicrosoftDNS_ResourceRecord where NOT (ContainerName like '..%' OR OwnerName like '%ilo%') AND (__CLASS = 'MicrosoftDNS_AType' OR __CLASS = 'MicrosoftDNS_CNAMEType') AND NOT(NOT(TextRepresentation like '%sql%'))

WSO2 - Table created using Analytic Script Invisible in Gadget Generation Tool

My use case: push data from a stream configured in the ESB to BAM and create a report using the "Gadget Generation Tool".
Publishing the stream from ESB to BAM after adding an agent to the proxy service worked fine.
From the stream I created a table using the Analytics->Add screen, and the table seems to persist, as I am able to run a SELECT and see results from the same screen.
Now I am trying to generate a dashboard using the Gadget Generation Tool, but the table is not available; the JDBC connection works fine, yet the table is nowhere to be found:
Script for the analytic table, run from the Analytics->Add screen:
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLE(creditkey STRING, creditFlag STRING, version STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1" ,
"cassandra.port" = "9163" , "cassandra.ks.name" = "EVENT_KS" ,
"cassandra.ks.username" = "admin" ,
"cassandra.ks.password" = "admin" ,
"cassandra.cf.name" = "firstStream" ,
"cassandra.columns.mapping" = ":key,payload_k1-constant, Version" );
I tried looking for the table in the following databases:
jdbc:h2:repository/database/WSO2CARBON_DB;AUTO_SERVER=TRUE
jdbc:h2:repository/database/metastore_db;AUTO_SERVER=TRUE
jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE
Have not done any custom db configurations.
Did you try jdbc:h2:repository/database/samples/WSO2CARBON_DB;AUTO_SERVER=TRUE? Also, what you have pasted is the Cassandra storage definition, probably used for getting the input, not for persisting the output. If you give the full Hive query, that would help in figuring out the problem.
Why did I not see the table in the Gadget Generation tool?
The table I created using the Hive script is a Cassandra distributed database table, whereas the references I gave in the Gadget Generation tool while looking up the table were to H2 RDBMS database tables.
Below are the references to the H2 RDBMS databases which come out of the box with WSO2:
jdbc:h2:repository/database/WSO2CARBON_DB;AUTO_SERVER=TRUE
jdbc:h2:repository/database/metastore_db;AUTO_SERVER=TRUE
jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE
Resolution: how to get tables listed in the Gadget Generation tool
To get tables listed in the Gadget Generation tool, you have to make extensive use of Hive scripts to complete the following three steps:
Create a Hive table reference for the Cassandra data stream to which data is pushed (from the ESB, in my case):
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLE(
payload_creditkey STRING, payload_creditFlag STRING, payload_version STRING) STORED BY
'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1" ,
"cassandra.port" = "9163" , "cassandra.ks.name" = "EVENT_KS" , "cassandra.ks.username" = "admin" , "cassandra.ks.password" = "admin" ,
"cassandra.cf.name" = "firstStream" , "cassandra.columns.mapping" = ":key,payload_k1-constant, Version" );
Using a Hive script, create an H2 RDBMS table reference into which the data from the Cassandra stream will be copied:
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLEh2summary(
creditFlg STRING,
verSion STRING
)
STORED BY
'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
'mapred.jdbc.driver.class' = 'org.h2.Driver' ,
'mapred.jdbc.url' = 'jdbc:h2:C:/wso2bam-2.2.0/repository/samples/database/BAM_STATS_DB' ,
'mapred.jdbc.username' = 'wso2carbon' ,
'mapred.jdbc.password' = 'wso2carbon' ,
'hive.jdbc.update.on.duplicate' = 'true' ,
'hive.jdbc.primary.key.fields' = 'creditFlg' ,
'hive.jdbc.table.create.query' = 'CREATE TABLE CREDITTABLE_newh2(creditFlg VARCHAR(100), version VARCHAR(100))' );
Write a Hive query that copies the data from Cassandra to H2 (RDBMS):
insert overwrite table CREDITTABLEh2summary select a.payload_creditFlag,a.payload_version from CREDITTABLE a;
After doing this I was able to see the table in the Gadget Generation tool; however, I also had to change the reference to the H2 database to an absolute path in the JDBC URL value that I passed.
Observation:
I was wondering whether the Gadget Generation tool can point directly to the Cassandra stream without having to copy the tables to an RDBMS database.