Ordering the columns in the output of a mapping task in Informatica Cloud

I'm creating a mapping task in Informatica Cloud to union and join 5 flat files and apply some transformation logic on top of them. I'm writing the output as a .txt/.csv file for downstream processing, and it is loaded into a data warehouse in a specific column order.
I have to generate the output file at runtime because the Liaison connection automatically cuts the output file I drop and pastes it into the data warehouse. (So I cannot use static metadata and field mapping.)
Is there any tool in the designer that I can use to control the column sequence of the output (e.g. Column A should be the first column, Column C the second, Column B the third)?
If there is no tool/object readily available inside the design pane of the mapping task, is there any workaround to achieve the same?
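For what it's worth, if nothing in the design pane can force the order, one generic workaround outside of Informatica itself is to reorder the columns of the emitted file with a small post-processing step before the Liaison pickup. This is only a minimal sketch; the file names and the desired column order below are hypothetical:

import csv

# Generic post-processing sketch, not an Informatica Cloud feature.
# File names and the desired column order are hypothetical placeholders.
DESIRED_ORDER = ["ColumnA", "ColumnC", "ColumnB"]

with open("mapping_output.csv", newline="") as src, \
     open("reordered_output.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    # Columns not listed explicitly keep their original relative order at the end.
    remaining = [c for c in reader.fieldnames if c not in DESIRED_ORDER]
    writer = csv.DictWriter(dst, fieldnames=DESIRED_ORDER + remaining)
    writer.writeheader()
    for row in reader:
        writer.writerow(row)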

Related

How to determine if my AWS Glue Custom CSV Classifier is working?

I am using AWS Glue to catalog (and hopefully eventually transform) data. I am trying to create a Custom CSV Classifier for the crawler so that I can provide a known set of column headers to the table. The data is in TSV (tab separated value) format, and the files themselves have no header row. There is no quote character in the data, but there are 1 or 2 columns which use a double quote in the data, so I've indicated in the Classifier that it should use a single quote (').
To ensure I start clean, I delete the AWS Glue Catalog Table and then run the Crawler with the Classifier attached. When I subsequently check the created table, it lists csv as the classification, and the column names specified in the Classifier are not associated with the table (they are instead labelled col0, col1, col2, col3, etc.). Further, when inspecting a few rows in the table, it appears as though the data associated with the columns does not use the same column ordering as in the raw data itself, which I can confirm because I have a copy of the raw data open locally on my computer.
AWS Glue Classifier documentation indicates that a crawler will attempt to use the Custom Classifiers associated with a Crawler in the order they are specified in the Crawler definition, and if no match is found with certainty 1.0, it will use Built-in Classifiers. In the event a match with certainty 1.0 is still not found, the Classifier with the highest certainty will be used.
My questions are these:
How do I determine if my Custom CSV Classifier (which, for the sake of argument, I have named customClassifier) is actually being used, or whether it is defaulting to the built-in CSV Classifier?
More importantly, given the situation above (having the columns known but separate from the data, and having double quotes used in the actual data but no quoted values), how do I get the Crawler to use the specified column names for the table schema?
Why does it appear as though my data in the Catalog is not using the column order specified in the file (even with the generic column names)?
If it is even possible, how could I use an ApplyMapping transform to rename the columns for the workflow (which would be sufficient for my case)? I need to do so without enabling script-only mode (by modifying an AWS Glue Studio Workflow), and without manually entering over 200 columns.
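A practical signal for the first question is to compare the column names and classification the crawler actually wrote to the Data Catalog against what customClassifier is configured with; generic col0/col1/... names mean the custom headers were not applied. A minimal check, assuming boto3 and hypothetical database/table names:

import boto3

glue = boto3.client("glue")

# Hypothetical names; use the database/table the crawler writes to.
table = glue.get_table(DatabaseName="my_database", Name="my_table")["Table"]

# The classification the crawler recorded for the table.
print("classification:", table.get("Parameters", {}).get("classification"))

# The column names as stored in the catalog (col0, col1, ... means the
# classifier's header list was not picked up).
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])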
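On the last question, if a script-based step ever becomes acceptable, the ApplyMapping renaming would not have to be typed out column by column; the mapping list can be generated from the known header list. This is only a sketch under that assumption, with hypothetical catalog and column names:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping

glueContext = GlueContext(SparkContext.getOrCreate())

# The crawler-produced table exposes generic col0, col1, ... names.
raw_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# Hypothetical: the known header list, e.g. loaded from a config file,
# so 200+ names never have to be entered by hand.
known_headers = ["customer_id", "order_date", "amount"]  # ...

mappings = [(f"col{i}", "string", name, "string")
            for i, name in enumerate(known_headers)]

renamed_dyf = ApplyMapping.apply(frame=raw_dyf, mappings=mappings,
                                 transformation_ctx="renamed_dyf")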

Transform two source DynamoDB tables into a new DynamoDB table using AWS

So I have two source tables, let's call them table1 and table2, and a destination table, table3. Inside these tables there is information that needs to be extracted from columns of one table and columns of the other, and then combined to populate columns of the new table.
Think of it as a complex transformation; for example:
partial text extracted from column1 of table1 and the complete text in column1 of table2 are combined into 4 rows of column1 in the new transformed table (depending on the JSON in column1 of table1).
So it's not a 1-to-1 mapping between one table and another, but a 1-to-many mapping, where one source row is built from a mix of one row from each of the two source tables and translates to many rows in the new destination table.
Is this something that Glue jobs can accomplish, or am I better off just writing a throwaway Python script? You can assume that the size of the tables is not a concern.
Provided you plan to run this process at some frequency, this is a perfect use case for Glue. If this is just a one-off, Glue is also a fine choice, but Glue is primarily designed for repeated use.
In your Glue script I expect you will end up joining the two tables, and then deriving the new result columns and rows by combining your existing columns. Typically the pattern to follow is to convert the dynamic frames (created by Glue) into PySpark data frames, work with PySpark from there, and convert back to a dynamic frame before writing out to the database.
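As a rough illustration of that pattern only (the catalog names, join key, and the JSON-to-rows step below are hypothetical placeholders, not your actual schema), and assuming the Glue DynamoDB writer is available in your Glue version:

from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the two source tables (hypothetical catalog names).
df1 = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="table1").toDF()
df2 = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="table2").toDF()

# Join on a shared key, then turn the JSON array in table1's column1 into
# multiple output rows: one row per element of the parsed array.
joined = df1.alias("t1").join(df2.alias("t2"), on="id")
result = (
    joined
    .withColumn("parts", F.from_json(F.col("t1.column1"), ArrayType(StringType())))
    .withColumn("part", F.explode("parts"))
    .select(
        F.col("id"),
        F.concat_ws(" ", F.col("part"), F.col("t2.column1")).alias("column1"),
    )
)

# Back to a DynamicFrame and out to the destination table.
out_dyf = DynamicFrame.fromDF(result, glueContext, "out_dyf")
glueContext.write_dynamic_frame.from_options(
    frame=out_dyf,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "table3"},
)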
Note that depending on your design you may not need to add rows; it of course depends on the outcome you are seeking, but DynamoDB does have support for some nifty hierarchical approaches that may remove your need for multiple rows.
If you have more specific examples of schema and the outcomes you are seeking, I could show you a bit of example code.

Informatica Cloud Target file column order

I am using Informatica Cloud to create a target output by combining three mappings. I am facing an issue when writing the records to a target file. The target file is created at run time. There are around 350 columns in total; of these, around 20 columns are static and the others are dynamic. But the columns in the target file are not written in the proper order. In the "Field Mapping" the columns appear in one order, but in the output file they appear in a different order.
Is there a way to write the target file's output columns in a specific order, or at least keep certain static columns in a specific order? I know we can attain this by using a template file, but I cannot create a template file because certain columns are dynamic. Any help will be appreciated.

How to create a Fact table from multiple different tables in Pentaho

I have been following a tutorial on creating a data warehouse using Pentaho Data Integration/Kettle.
The tutorial is based on a CSV file, but I am practicing with the Northwind database and PostgreSQL. I am trying to figure out how to select values from more than one table and then output them into a single table.
My ETL process goes like this: I have several stages for each table; values are selected from each table and stored in a stage table for each table in the database. From there I have my dimensions table set up, but I am trying to figure out the step between the stages and the dimensions, which is where I am trying to select the values to update the dimensions table.
I have several stages set up for each of my tables; at this point I am not sure whether I should create a separate values table for each table or a single values table. Any help would be greatly appreciated. Thanks.
When I try to select values from multiple tables I get an error that says "we detected rows with varying number of fields". It seems I would need to create separate tables with ...
In Kettle, the metadata structure of the data stream cannot change. As such, if row 1 has 3 columns (one integer and two strings, for example), all rows must have the same structure.
If you're combining rows coming from different sources, you must ensure the structure is the same. That error is telling you that some of the incoming streams of data have a different number of fields.

Parquet: read particular columns into memory

I have exported a MySQL table to a Parquet file (Avro based). Now I want to read particular columns from that file. How can I read particular columns completely? I am looking for Java code examples.
Is there an API where I can pass the columns I need and get back a 2D array of the table?
If you can use Hive, creating a Hive table and issuing a simple select query would be by far the easiest option.
-- Works in Hive 0.14+: define an external table over the Parquet data,
-- then project only the columns you need.
create external table tbl1(<columns>) stored as parquet location '<file_path>';
select col1, col2 from tbl1;
You can use the JDBC driver to do that from a Java program as well.
Otherwise, if you want to stay completely in Java, you need to modify the Avro schema by excluding all the fields except the ones you want to fetch. Then, when you read the file, supply the modified schema as the reader schema and it will only read the included columns. But you will get your original Avro record back with the excluded fields nullified, not a 2D array.
To modify the schema, look at org.apache.avro.Schema and org.apache.avro.SchemaBuilder. Make sure that the modified schema is compatible with the original schema.
Options:
Use a Hive table created with all columns and Parquet as the storage format, and read the required columns by specifying the column names
Create a Thrift definition for the table and use the Thrift fields to read the data from code (Java or Scala)
You can also use Apache Drill, which natively parses Parquet files.
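Separately, if staying on the JVM is not a hard requirement, note that column projection is built into other Parquet readers as well. This is not the Java/Avro route described above, just a minimal Python sketch with pyarrow and hypothetical file/column names:

import pyarrow.parquet as pq

# Read only the listed columns; Parquet's columnar layout makes this cheap.
table = pq.read_table("exported_table.parquet", columns=["col1", "col2"])

# Dict of column name -> list of values (close to the "2D array" idea).
data = table.to_pydict()
print(data["col1"][:5])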