How to Generalize an Informatica Mapping & Workflow? - informatica

I need to read a text file (JSON format) and load it into a database table in Informatica Developer. For only one text file and one database table, that is easy.
But now I have N different text files, hence N different database tables and their corresponding data processor transformations. The transformation logic inside the mappings is the same. Besides creating N sets of mappings and workflows, one for each text file, is it possible to create just one generalized mapping and workflow to cater for all of the text files? I would appreciate it if anyone could point me in a general direction to explore further.

Related

Can I add a new column without rewriting an entire file?

I've been experimenting with Apache Arrow. I have used column-oriented, memory-mapped files for many years. In the past, I've used a separate file for each column. Arrow seems to like to store everything in one file. Is there a way to add a new column without rewriting the entire file?
The short answer is probably no.
Arrow's in-memory format & libraries support this. You can add a chunked array to a table by just creating a new table (this should be zero-copy).
However, it appears you are talking about storing tables in files. None of the common file formats in use (parquet, csv, feather) support partitioning a table in this way.
Keep in mind, if you are reading a parquet file, you can specify which column(s) you want to read and it will only read the necessary data. So if your goal is only to support individual column retrieval/query then you can just build one large table with all your columns.
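To make that concrete, here is a minimal sketch with pyarrow (the file name and column values are placeholders, not from the question): appending a column to an in-memory Table is cheap, but persisting the result still means rewriting the whole Parquet file, whereas reads can be restricted to just the columns you need.

import pyarrow as pa
import pyarrow.parquet as pq

# Read an existing Parquet file into an in-memory Table.
table = pq.read_table("data.parquet")

# Appending a column to the in-memory Table is cheap...
new_col = pa.array(range(table.num_rows))
table = table.append_column("new_metric", new_col)

# ...but writing it back out rewrites the entire file on disk.
pq.write_table(table, "data.parquet")

# Column-selective read: only the requested columns are actually loaded.
subset = pq.read_table("data.parquet", columns=["new_metric"])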

AWS Glue not detecting header in CSV

Hi, I have a bunch of CSVs located in S3 and a crawler set up via AWS Glue. The crawler builds about 10 tables as it scans 10 folders, and in only 1 of them are the headers not being detected. The structure of that CSV is the same as all the others. Advice please?
AWS Glue's crawler interprets the header based on multiple rules. If the first line in your file doesn't satisfy those rules, the crawler won't detect the first line as a header, and you will need to set it manually. It's a very common problem, and we integrated a fix for this into our code as part of our data pipeline.
Excerpt from the AWS documentation:
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
You can create the table yourself and, instead of having the crawler point at an S3 path, crawl based on the existing table. This is the approach to use when a crawler is not detecting the schema correctly, especially the column headings.
Also check whether skip.header.line.count=1 is being added automatically; if not, you can add it manually, and that will update the schema to the one you require. On subsequent crawler runs, you can change the crawler properties so that it ignores schema updates and only performs partition updates to your table.
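As a rough illustration (the database and table names are placeholders, and this assumes the crawler has already created the table), skip.header.line.count can be added to the table's parameters through boto3 rather than the console:

import boto3

glue = boto3.client("glue")

# Fetch the current table definition created by the crawler.
table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]

# Add the header-skip property so the first row is treated as a header, not data.
table.setdefault("Parameters", {})["skip.header.line.count"] = "1"

# update_table only accepts TableInput fields, so drop the read-only keys
# that get_table returns before sending the definition back.
allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}
table_input = {k: v for k, v in table.items() if k in allowed}

glue.update_table(DatabaseName="my_db", TableInput=table_input)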
You could use a custom classifier on your crawler to solve this problem: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html
Normally, choosing Has headings in the classifier's Column headings section will do the trick; if not, it may be necessary to enter a list of headings in the text box provided for that purpose.
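If you prefer to script it, something along these lines should work with boto3 (the classifier name, crawler name, and header list are placeholders; adjust them to your own setup):

import boto3

glue = boto3.client("glue")

# Create a custom CSV classifier that forces header detection.
glue.create_classifier(
    CsvClassifier={
        "Name": "my-csv-classifier",
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",        # equivalent to "Has headings"
        "Header": ["id", "name", "value"],  # explicit headings, if needed
    }
)

# Attach the classifier to the crawler so it runs before the built-in one.
glue.update_crawler(Name="my-crawler", Classifiers=["my-csv-classifier"])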
Because your columns are all classified as strings, it's likely that they violate the rules above. In my case, I had a column name that was longer than 150 characters, so Glue read the first row as data rather than a header and then assumed all columns were strings.

Alternatives to dynamically creating model fields

I'm trying to build a web application where users can upload a file (specifically the MDF file format) and view the data in the form of various charts. The files can contain any number of time-based signals (of various numeric data types), and users may name the signals arbitrarily.
My thought on saving the data involves 2 steps:
Maintain a master table as an index, to save such meta information as file names, who uploaded it, when, etc. Records (rows) are added each time a new file is uploaded.
Create a new table (I'll refer to these as data tables) for each file uploaded; within the table, each column will be one signal (the first column being timestamps).
This brings the problem that I can't pre-define the Model for the data tables because the number, name, and datatype of the fields will differ among virtually all uploaded files.
I'm aware of some libs that help to build runtime dynamic models but they're all dated and questions about them on SO basically get zero answers. So despite the effort to make it work, I'm not even sure my approach is the optimal way to do what I want to do.
I also came across this Postgres-specific model field which can take nested arrays (which I believe fits the 2-D, time-based signal lists). In theory, I could parse the raw uploaded file, construct such an array, and basically save all the data in one field. Not knowing the limit on the size of the data, this could also be a nightmare for queries later on, since creating the charts usually takes only a few columns of signals at a time, out of a total of up to hundreds of signals.
So my question is:
Is there a better way to organize the storage of data? And how?
Any insight is greatly appreciated!
If the name, number, and datatypes of the fields will differ for each user, then you do not need an ORM. What you need is a query builder or SQL string composition like Psycopg. You will be programmatically creating a table for each combination of user and uploaded file (if they are different) and programmatically inserting the records.
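As a minimal sketch of that approach (the table name, signal names, and types below are made up; in practice they would be parsed from the uploaded MDF file, and the types should come from a fixed whitelist), psycopg2's sql module lets you compose the CREATE TABLE statement safely:

import psycopg2
from psycopg2 import sql

conn = psycopg2.connect("dbname=signals user=app")

def create_data_table(table_name, columns):
    # columns: list of (signal_name, pg_type) tuples; pg_type should be taken
    # from a fixed whitelist of types you control, not raw user input.
    cols = [
        sql.SQL("{} {}").format(sql.Identifier(name), sql.SQL(pg_type))
        for name, pg_type in columns
    ]
    query = sql.SQL("CREATE TABLE {} (ts double precision, {})").format(
        sql.Identifier(table_name),
        sql.SQL(", ").join(cols),
    )
    with conn.cursor() as cur:
        cur.execute(query)
    conn.commit()

# One data table per uploaded file, columns taken from the file's signals.
create_data_table("file_42_data", [("engine_rpm", "real"), ("speed", "real")])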
Using PostgreSQL might be a good choice; you might also create a GIN index on the arrays to speed up queries.
However, if you are primarily working with time-series data, then using a time-series database like InfluxDB or Prometheus makes more sense.

How do I ensure that the AWS Glue crawler I've written is using the OpenCSV SerDe instead of the LazySimpleSerDe?

For context: I skimmed this previous question but was dissatisfied with the answer for two reasons:
I'm not writing anything in Python; in fact, I'm not writing any custom scripts for this at all as I'm relying on a crawler and not a Glue script.
The answer is not as complete as I require since it's just a link to some library.
I'm looking to leverage AWS Glue to accept some CSVs into a schema, and using Athena, convert that CSV table into multiple Parquet-formatted tables for ETL purposes. The data I'm working with has quotes embedded in it, which would be okay save for the fact that one record I have has a value of:
"blablabla","1","Freeman,Morgan","bla bla bla"
It seems that Glue is tripping over itself when it encounters the "Freeman,Morgan" piece of data.
If I use the standard Glue crawler, I get a table created with the LazySimpleSerDe, which truncates the record above in its column to:
"Freeman,
...which is obviously not desirable.
How do I force the crawler to output the file with the correct SerDe?
[Unpleasant] Constraints:
Looking to not accomplish this with a Glue script, since for that to work I believe I have to have a table beforehand, whereas the crawler will create the table on my behalf.
If I have to do this all through Amazon Athena, I'd feel like that would largely defeat the purpose but it's a tenable solution.
This is going to turn into a very dull answer, but apparently AWS provides its own set of rules for classifying if a file is a CSV.
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
I believed that I had met all of these requirements, given that the column names are wildly divergent from the actual data in the CSV, and ideally there shouldn't be much of an issue there.
However, in spite of my belief that it would satisfy the AWS Glue regex (which I can't find a definition for anywhere), I elected to move away from commas and to pipes instead. The data now loads as I expect it to.
Use glueContext.create_dynamic_frame_from_options() to convert the CSV to Parquet, and then run the crawler over the Parquet data.
df = glueContext.create_dynamic_frame_from_options("s3", {"paths": [src]}, format="csv")
Default separator is ,
Default quoteChar is "
If you wish to change them, check https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html
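A rough sketch of that conversion, assuming glueContext and src are already defined as above and dst is a placeholder output path (the format_options shown are the documented CSV options, set explicitly here for clarity):

# Read the CSV with explicit quoting so "Freeman,Morgan" stays in one field.
df = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [src]},
    format="csv",
    format_options={
        "withHeader": True,   # treat the first row as column names
        "separator": ",",
        "quoteChar": '"',
    },
)

# Write the result out as Parquet, then point the crawler at dst instead.
glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": dst},
    format="parquet",
)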

How are documents retrieved after reduce produces the output?

So, after reduce completes its job, we have data stored in files, something like this:
But what happens when the user types something? How is search performed when the data is stored just in files?
MapReduce is for processing. Once you have processed the data and generated your aggregate information, which sits on HDFS, you will either have to read the files in some program to display them to the user, or use one of several alternative options for reading the data from HDFS:
You could use Hive and create a table on top of this data and read the data using SQL-like queries. A simple web application can connect to it using the Thrift server, which provides a JDBC interface to Hive.
Other options include loading the data into HBase, Shark, etc. It all depends on what your use case is in terms of the size of the aggregated data and performance requirements.
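As a small illustration of the Hive option (not from the answer; the host, table, and column names are placeholders, and it assumes PyHive is installed and a Hive table has been defined over the MapReduce output), a web application could query it through the Thrift server like this:

from pyhive import hive

# Connect to HiveServer2 via Thrift.
conn = hive.connect(host="hive-server", port=10000)
cursor = conn.cursor()

# Query the table defined on top of the MapReduce output on HDFS.
cursor.execute("SELECT word, doc_ids FROM inverted_index LIMIT 10")
for row in cursor.fetchall():
    print(row)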
What you have constructed after MapReduce is an inverted index, a nice little data structure. Now you have to use it.
For example, in the case of Google, this inverted index is sharded across many servers, each of which stores the entire posting list for the terms assigned to it. So, for example, server 500 has the list for be, and another has the list for to. These are implementation details; you could theoretically store it on one box in a large hash if you could hold the index in memory.
When the customer types words into the engine, it retrieves the entire list for each word. If there are multiple words, it intersects those lists to show you the documents that contain all of them.
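A toy illustration of that lookup (this is not Google's implementation, and the index contents are invented): each term maps to a posting list of document IDs, and a multi-word query intersects the lists.

# term -> set of document IDs containing that term
index = {
    "be": {1, 4, 7, 9},
    "to": {2, 4, 7, 8},
}

def search(query, index):
    # Return the documents that contain every word in the query.
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

print(search("to be", index))  # -> {4, 7}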
Here is the source for the full paper on how they did it: http://infolab.stanford.edu/~backrub/google.html
See "Figure 4. Google Query Evaluation"