Pentaho DI (Kettle): best way to select flow based on a CSV file header?

I'm using Pentaho DI (kettle) and not sure what's the best way to do the following:
From a downloaded csv file, check if a column exists, and based on that select the right next step.
There are 3 possible options.
Thanks,
Isaac

You did not mention the possible options, so I'll just provide a sketch showing how to check whether a column exists in a file.
For this you will need a CSV file input step and a Metadata structure of stream step, which reads the metadata of the incoming stream.
For a sample CSV file with 3 columns named col1, col2 and col3, the Metadata structure of stream step gives you one row per column, with the column's name as the value of the Fieldname field.
Then, depending on your needs, you could use for example a Filter Rows or Switch / Case step for further processing.
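Outside Kettle, the underlying check is just a header lookup. Here is a minimal Python sketch of the same idea; the file name, column names and the three branches are placeholders, since the actual options were not specified.
import csv

CSV_PATH = "downloaded.csv"          # hypothetical file name

with open(CSV_PATH, newline="") as f:
    header = next(csv.reader(f))     # first row holds the column names

# Three hypothetical branches, mirroring the "3 possible options" in the question.
if "col_a" in header:
    print("column col_a present -> take option 1")
elif "col_b" in header:
    print("column col_b present -> take option 2")
else:
    print("neither column present -> take option 3")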

Related

Informatica PowerCenter can't read the file properly: all data always appears in one column

I have a problem with Informatica PowerCenter. When I want to import data from a flat CSV file, all the data always appears in one column. I need to edit the file first and define names in Excel, and only then can Informatica read the data properly. How can I read the data properly in PowerCenter without defining names in Excel first?
Thank you
You need to ensure that:
You're reading the file definition as delimited. There is a file wizard where you can define it as delimited.
While reading, set it so it reads the column names from the first row.
And then read the data from the second row onwards.
You can check this image:
https://2.bp.blogspot.com/-enDSMKLYyRY/UXADBtNE8WI/AAAAAAAAAu8/oVfr6IsAl8Y/s1600/8.jpg
If you set up the above properties, Informatica should be able to read the definition properly and you don't have to set column names or data types manually.

Ordering the columns in the output of a mapping task in Informatica Cloud

I'm creating a mapping task in Informatica Cloud to union and join 5 flat files and apply some transformation logic on top of them. I'm producing the output in .txt / .csv format for downstream processing and loading it into a data warehouse in a certain column order.
I have to generate the output file at runtime, because the Liaison connection automatically cuts the output file I drop and pastes it into the data warehouse. (So I cannot use metadata and field mapping.)
Is there any tool in the design which I can use to set the column order of the output (e.g. column A should be the first column, column C the second, column B the third)?
If there is no tool / object readily available inside the design pane of the mapping task, is there any workaround to do the same?

Extract data from JSON field with Power BI desktop

I'm using Power BI desktop to connect to a MySQL database.
One of the fields contains data with the following structure:
a:1:{s:3:"IVA";O:8:"stdClass":3:{s:11:"tax_namekey";s:3:"IVA";s:8:"tax_rate";s:7:"0.23000";s:10:"tax_amount";d:25.07000000000000028421709430404007434844970703125;}}
I need to transform the data in a way that allows the extraction of the value of the tax amount. That is, I need to transform this column to: 25.07.
How can I do this? I tried splitting the column by semicolon, but since not all the columns have the same number of semicolons it didn't work.
Thanks in advance!
Use this function.
It works only for your task: parsing the number 25.07 out of the source string.
(src) => Splitter.SplitTextByEachDelimiter({";d:",";"})(src){1}
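For reference, the same extraction can be done outside Power Query with a regular expression. A minimal Python sketch, assuming the serialized value always contains a ";d:<number>;" segment as in the example above:
import re

src = 'a:1:{s:3:"IVA";O:8:"stdClass":3:{s:11:"tax_namekey";s:3:"IVA";s:8:"tax_rate";s:7:"0.23000";s:10:"tax_amount";d:25.07000000000000028421709430404007434844970703125;}}'

# Grab the number that follows ';d:' (the serialized tax_amount) and round it.
match = re.search(r';d:([\d.]+);', src)
tax_amount = round(float(match.group(1)), 2) if match else None
print(tax_amount)   # 25.07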
The value in the column is not actually JSON. There is an option in Power BI itself to split a JSON column, but it has to be valid JSON. To check whether a value is valid JSON or not, try using this link:
https://jsonformatter.curiousconcept.com/
After that, go to Edit Queries, right-click on the JSON column and choose Transform -> JSON.
It will transform your JSON into columns.

Parquet: read particular columns into memory

I have exported a MySQL table to a Parquet file (Avro based). Now I want to read particular columns from that file. How can I read particular columns completely? I am looking for Java code examples.
Is there an API where I can pass the columns I need and get back a 2D array of the table?
If you can use Hive, creating a Hive table and issuing a simple select query would be by far the easiest option.
create external table tbl1(<columns>) stored as parquet location '<file_path>';
select col1, col2 from tbl1;
-- this works in Hive 0.14
You can use the JDBC driver to do that from a Java program as well.
Otherwise, if you want to stay completely in Java, you need to modify the Avro schema by excluding all the fields but the ones you want to fetch. Then, when you read the file, supply the modified schema as the reader schema and it will only read the included columns. But you will get your original Avro record back with the excluded fields nullified, not a 2D array.
To modify the schema, look at org.apache.avro.Schema and org.apache.avro.SchemaBuilder. Make sure that the modified schema is compatible with the original schema.
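The question asks for Java, but if a quick prototype outside the Avro reader is acceptable, column projection is also easy to check from Python with pyarrow. A minimal sketch under that assumption; the file and column names are placeholders:
import pyarrow.parquet as pq

# Read only the requested columns; the other columns are never loaded into memory.
table = pq.read_table("export.parquet", columns=["col1", "col2"])

# A list of row lists is close to the "2D array of table" the question mentions.
rows = [list(r) for r in zip(*(table.column(c).to_pylist() for c in table.column_names))]
print(rows[:5])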
Options:
Use a Hive table created with all columns and storage format Parquet, and read the required columns by specifying the column names in the query.
Create a Thrift definition for the table and use the Thrift fields to read the data from code (Java or Scala).
You can also use Apache Drill, which natively parses Parquet files.

Postgres Copy select rows from CSV table

This is my first post to Stack Overflow. Your forum has been SO very helpful while I've been learning Python and Postgres on the fly for the last 6 months that I haven't needed to post yet. But this task is tripping me up and I figure I need to start earning reputation points:
I am creating a Python script for backing up data into an SQL database daily. I have a CSV file with an entire month's worth of hourly data, but I only want to select a single day of data from the file and copy those selected rows into my database. Am I able to query the CSV table and append the query results into my database? For example:
sys.stdin = open('file.csv', 'r')
cur.copy_expert("COPY table FROM STDIN
SELECT 'yyyymmddpst LIKE 20140131'
WITH DELIMITER ',' CSV HEADER", sys.stdin)
This code and other variations aren't working out - I keep getting syntax errors. Can anyone help me out with this task? Thanks!!
You need to create a temporary table first:
cur.execute('CREATE TEMPORARY TABLE "temp_table" (LIKE "your_table") WITH OIDS')
Then copy the data from the CSV:
cur.execute("COPY temp_table FROM '/full/path/to/file.csv' WITH CSV HEADER DELIMITER ','")
Insert the necessary records:
cur.execute("INSERT INTO your_table SELECT * FROM temp_table WHERE yyyymmddpst LIKE '20140131'")
And don't forget to call conn.commit().
The temp table will be destroyed after cur.close().
You can COPY (SELECT ...) TO an external file, because PostgreSQL just has to read the rows from the query and send them to the client.
The reverse is not true. You can't COPY (SELECT ....) FROM ... . If it were a simple SELECT PostgreSQL could try to pretend it was a view, but really it doesn't make much sense, and in any case it'd apply to the target table, not the source rows. So the code you wrote wouldn't do what you think it does, even if it worked.
In this case you can create an unlogged or temporary table, copy the full CSV to it, and then use SQL to extract just the rows you want, as pointed out by Dmitry.
An alternative is to use the file_fdw to map the CSV file as a table. The CSV isn't copied, it's just read on demand. This lets you skip the temporary table step.
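A minimal sketch of the file_fdw approach, driven from psycopg2 like the rest of the script; the server, table, column and path names are placeholders based on the question:
import psycopg2

conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string
cur = conn.cursor()

# Map the CSV as a foreign table; the file is read on demand, not copied.
cur.execute("CREATE EXTENSION IF NOT EXISTS file_fdw")
cur.execute("CREATE SERVER IF NOT EXISTS csv_server FOREIGN DATA WRAPPER file_fdw")
cur.execute("""
    CREATE FOREIGN TABLE IF NOT EXISTS monthly_csv (yyyymmddpst text, value numeric)
    SERVER csv_server
    OPTIONS (filename '/full/path/to/file.csv', format 'csv', header 'true')
""")

# Pull just the one day into the real table.
cur.execute("INSERT INTO your_table SELECT * FROM monthly_csv WHERE yyyymmddpst LIKE '20140131'")
conn.commit()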
From PostgreSQL 12 you can add a WHERE clause to your COPY statement and you will get only the rows that match the condition.
So your COPY statement could look like:
COPY table
FROM '/full/path/to/file.csv'
WITH (FORMAT CSV, HEADER, DELIMITER ',')
WHERE yyyymmddpst LIKE '20140131'
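Tying that back to the original Python attempt, the PostgreSQL 12+ form can be run through psycopg2's copy_expert. A sketch under the assumption that the server is version 12 or later; the table and column names are placeholders:
import psycopg2

conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string
cur = conn.cursor()

# Stream the CSV from the client and let the server filter rows during COPY.
with open('file.csv', 'r') as f:
    cur.copy_expert(
        "COPY your_table FROM STDIN "
        "WITH (FORMAT csv, HEADER, DELIMITER ',') "
        "WHERE yyyymmddpst LIKE '20140131'",
        f,
    )
conn.commit()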