Informatica - SQL query to extract data from JSON format value - informatica

My source is the Oracle table 'SAMPLE', which contains a JSON value in the column REF_VALUE and is used as the source in an Informatica mapping. The data looks like this:
PROPERTY   | REF_VALUE
CTMappings | {"CTMappings": [
               {"CTId":"ABCDEFGHI_T2","EG":"ILTS","Type":"K1J10200"},
               {"CTId":"JKLMNOPQR_T1","EG":"LAM","Type":"K1J10200"}
             ]}
I have the following SQL query to explode the JSON data into rows:
select
  substr(CTId, 1, 9) as ID,
  substr(CTId, 11)   as VERSION,
  type               as INTR_TYP,
  eg                 as ENTY_ID
from
(
  select jt.*
  from SAMPLE,
       JSON_TABLE (ref_value, '$.CTMappings[*]'
         columns (
           CTId varchar2(50) path '$.CTId',
           eg   varchar2(50) path '$.EG',
           type varchar2(50) path '$.Type'
         )) jt
  where trim(property) = 'CTMappings')
The result of the query is as below, which is the required output (JSON into rows):
ID         VERSION  INTR_TYP  ENTY_ID
=========  =======  ========  =======
ABCDEFGHI  T2       K1J10200  ILTS
JKLMNOPQR  T1       K1J10200  LAM
I'm new to Informatica. I have used 'SAMPLE' as the source table, and I now want to use this query inside Informatica to extract the data into row format, but I don't know how to proceed.
Any quick help would be much appreciated.

Declare the fields in Source Qualifier and use SQL Override property to place your query - that should do the trick.
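For example, if the Source Qualifier is defined with ports ID, VERSION, INTR_TYP and ENTY_ID (names assumed here to match the aliases in the query), the SQL Override could be a minimal variant of the query above, returning the columns in the same order as the ports:
SELECT
  substr(jt.CTId, 1, 9) AS ID,
  substr(jt.CTId, 11)   AS VERSION,
  jt.type               AS INTR_TYP,
  jt.eg                 AS ENTY_ID
FROM SAMPLE,
     JSON_TABLE (ref_value, '$.CTMappings[*]'
       columns (
         CTId varchar2(50) path '$.CTId',
         eg   varchar2(50) path '$.EG',
         type varchar2(50) path '$.Type'
       )) jt
WHERE trim(property) = 'CTMappings'
The override replaces the default query the Source Qualifier would generate, so the number, order and datatypes of the projected columns must match the ports.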

Related

Snowflake table is not accepting null values in date field

I have a table in Snowflake into which I am performing a bulk load.
One of the columns in the table is a date, but the source table, which is on SQL Server, has null values in that date column.
The flow of data is:
sql_server --> S3 buckets --> snowflake_table
I am able to run the Sqoop job in EMR, but I am not able to load the data into the Snowflake table, as it does not accept the null values in the date column.
The error is :
Date '' is not recognized File 'schema_name/table_name/file1', line 2, character 18 Row 2,
column "table_name"["column_name":5] If you would like to continue loading when an error is
encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
Can anyone help me figure out what I am missing?
Using the command below you can see the values in the staged file:
select t.$1, t.$2 from @mystage1 (file_format => 'myformat') t;
Based on the data, you can change your COPY command as below:
COPY INTO my_table(col1, col2, col3) from (select $1, $2, try_to_date($3) from @mystage1)
file_format=(type = csv FIELD_DELIMITER = '\u00EA' SKIP_HEADER = 1 NULL_IF = ('') ERROR_ON_COLUMN_COUNT_MISMATCH = false EMPTY_FIELD_AS_NULL = TRUE)
on_error='continue'
The error shows that the dates are not arriving as nulls. Rather, they're arriving as blank strings. You can address this a few different ways.
The cleanest way is to use the TRY_TO_DATE function on your COPY INTO statement for that column. This function will return database null when trying to convert a blank string into a date:
https://docs.snowflake.com/en/sql-reference/functions/try_to_date.html#try-to-date
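As a quick illustration (a minimal sketch, independent of the table above), TRY_TO_DATE returns NULL where TO_DATE would raise an error:
SELECT
    TO_DATE('2021-01-15') AS parsed_date,        -- parses normally
    TRY_TO_DATE('')       AS blank_becomes_null; -- returns NULL instead of failing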

Regex for Parsing vertical CSV in Athena

So, I've been trying to load CSVs from an S3 bucket into Athena. However, the way the CSVs are designed looks like the following:
ns=2;s=A_EREG.A_EREG.A_PHASE_PRESSURE,102.19468,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE,50.0,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_CURRENT,15.919731,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE,0.22070877,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.ACTIVE_PWR,0.0,12/12/19 00:00:01.2144275 GMT
The CSV is just one record. Each column of the record has a value associated with it, which sits between the two commas (between the name and the timestamp), and that value is what I am trying to capture.
I've been trying to parse it using the Regex SerDe, and I got to this regular expression:
((?<=\,).*?(?=\,))
demo
I want the output of the above to be:
col_a col_b col_c col_d col_e
102.19468 50.0 15.919731 0.22070877 0.0
My DDL query looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS
(...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '((?<=\,).*?(?=\,))'
) LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false');
The table creation query above works successfully, but when I try to preview my table I get the following error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
I am fairly new to Hive and regex, so I don't know what is going on. Can someone help me out here?
Thanks in advance,
BR
One column in a Hive table corresponds to one capturing group in the regex. If you want to select a single column containing everything between the commas, then this will work:
'.*,(.*),.*'
Athena serdes require that each record in the input is a single line. Multiline records are not supported.
What you can do instead is to create a table which maps each line in your data to a row in a table, and use a view to pivot the rows that belong together into a single row.
I'm going to assume that the ns field at the start of the lines is an ID; if not, I assume there is some other identifier you can use to tell which lines belong together.
I used your demo to create a regex that matched all the fields of each line and came up with ns=(\d);s=([^,]+),([^,]+),(.+) (see https://regex101.com/r/HnjnxK/5).
CREATE EXTERNAL TABLE my_data (
ns string,
s string,
v double,
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "ns=(\\d);s=([^,]+),([^,]+),(.+)"
)
LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false')
Apologies if the regex isn't correctly escaped, I'm just typing this into Stack Overflow.
This table has four columns, corresponding to the four fields in each line. I've named them ns and s after the data, v for the numerical value, and dt for the date. The date needs to be typed as a string since it's not in a format Athena natively understands.
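Before building the view, a quick sanity check (a sketch, assuming the my_data table from the DDL above) confirms that the regex captured the four fields:
SELECT ns, s, v, dt
FROM my_data
LIMIT 10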
Assuming that ns is a record identifier, you can then create a view that pivots rows with different values of s into columns. You will have to adapt this to what you actually want; the following is of course just a demonstration:
CREATE VIEW my_pivoted_data AS
WITH data_aggregated_by_ns AS (
SELECT
ns,
map_agg(s, v) AS s_and_v
FROM my_data
GROUP BY ns
)
SELECT
ns,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_PRESSURE') AS phase_pressure,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE') AS phase_ref_signal_to_valve,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_CURRENT') AS phase_sec_current,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE') AS phase_sec_voltage,
element_at(s_and_v, 'A_EREG.A_EREG.ACTIVE_PWR') AS active_pwr
FROM data_aggregated_by_ns
Apologies if there are syntax errors in the SQL above.
This creates a view (but start by trying it out as a query, using everything from WITH onwards), which has two parts to it.
The first part, the first SELECT, produces rows that aggregate all the s and v values for each value of ns into a map. Try running this query by itself to see how the result looks, as shown below.
The second part, the second SELECT, uses the result of the first part and, via the aggregated map, picks out the v values for the values of s that I chose from your question.
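For reference, the first part run on its own would be (a sketch against the my_data table defined above):
SELECT
  ns,
  map_agg(s, v) AS s_and_v
FROM my_data
GROUP BY ns
Each resulting row carries one map per ns value, keyed by the s names, which the outer SELECT then reads with element_at.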

Issue querying Athena with select having special characters

Below is the select query I am trying:
SELECT * from test WHERE doc = '/folder1/folder2-path/testfile.txt';
This query returns zero results.
If I change the query to use LIKE and omit the special characters (/, -, .), it works:
SELECT * from test WHERE doc LIKE '%folder1%folder2%path%testfile%txt';
This works.
How can I fix the first query to use the equality or IN operator, as I want to run a batch select?
To test your situation, I created a text file containing:
hello
there
/folder1/folder2-path/testfile.txt
this/that
here.there
I uploaded the file to a directory on S3, then created an external table in Athena:
CREATE EXTERNAL TABLE stack (doc string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://my-bucket/my-folder/'
I then ran the command:
select * from stack WHERE doc = '/folder1/folder2-path/testfile.txt'
It returned:
1 /folder1/folder2-path/testfile.txt
So, it worked for me. Therefore, your problem would either be a result of the contents of the file, or the way that the external table is defined (eg using a different Serde).
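For the batch case you mention, once the equality comparison matches, an IN list works the same way (a sketch against the stack table above):
SELECT *
FROM stack
WHERE doc IN (
    '/folder1/folder2-path/testfile.txt',
    'here.there'
)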

Reading Json data from Athena

I have created a table by mapping the JSON data; unfortunately, I am not able to read the nested array within the JSON.
{
"total":10,
"count":100,
"values":{
"source":[{"sourceid":"10001","source":"ABC"},
{"sourceid":"10002","source":"XYZ"}
]}
}
Athena table:
CREATE EXTERNAL TABLE source_master_data(
total bigint,
count bigint,
values struct<source: array<struct<sourceid: string, source: string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://sourcemaster/'
I am trying to read the sourceid and source but with no luck. Can anyone help me out? This is the query I tried:
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data.Values) AS t1
The UNNEST needs to be placed on the array type. In your query, you are trying to unnest the struct, which is not possible in Athena.
The second issue is the use of values without quotes. This also fails because values is a reserved word in Athena.
The overall query would look something like this:
select t1.source.sourceid
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)
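If the struct in the DDL also declares the source field (as in the table definition above), the same pattern should return both values; this is a sketch along the same lines, not something I have run:
select t1.source.sourceid, t1.source.source
from source_master_data
cross join UNNEST(source_master_data."values".source) AS t1 (source)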

Can kettle export BLOB data from a oracle table?

I have an Oracle table with columns like Document (type BLOB), Extension (VARCHAR2(10), with values like .pdf, .doc) and Document Description (VARCHAR2(100)). I want to export this data and provide it to my customer.
Can this be done in Kettle?
Thanks
I have an MSSQL database that stores images in a BLOB column, and I found a way to export these to disk using a dynamic SQL step.
First, select only the columns necessary to build a file name and a SQL statement (id, username, record date, etc.). Then I use a Modified JavaScript Value step to create both the output filename (minus the file extension):
var outputPath = '/var/output/';
var filename = outputPath + username + '_' + record_date;
// --> '/var/output/joe_20181121'
and the dynamic SQL statement:
var blob_query = "SELECT blob_column FROM dbo.table WHERE id = '" + id + "'";
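For the Oracle table in the question, the per-row statement built this way would look something like the following sketch (the table name DOC_TABLE and key column DOC_ID are assumptions, since the question doesn't name them):
SELECT DOCUMENT          -- the BLOB column to stream to disk
FROM DOC_TABLE           -- assumed table name
WHERE DOC_ID = '12345'   -- assumed key column and sample value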
Then, after using a select to reduce the field count to just the filename and blob_query, I use a Dynamic SQL row step (with "Outer Join" selected) to retrieve the blob from the database.
The last step is to output to a file using Text file output step. It allows you to supply a file name from a field and give it a file extension to append. On the Content tab, all boxes are unchecked, the Format is "no new-line term" and the Compression is "None". The only field exported is the "blob_column" returned from the dynamic SQL step, and the type should be "binary".
Obviously, this is MUCH slower than other table/SQL operations due to the dynamic SQL step making individual database connections for each row... but it works.
Good luck!