I have a mapping which reads data, filters it, and writes to a result table in another database.
Read DBR1 -> Filter -> Write DBW
I want same mapping to be run on three different databases - DBR1, DBR2, DBR3 and Write the result in DBW database.
DBR1 -> Filter -> Write DBW
DBR2 -> Filter -> Write DBW
DBR3 -> Filter -> Write DBW
The database structure is the same in all 3 databases.
Is there any economical and easier way to do it other than duplicating (or triplicating) the mapping?
Yes. Consider parameterizing the source/target connections so that you can run the same mapping with different source and target systems through the infacmd command.
Please refer to the Developer Guide for parameterization techniques.
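As a rough illustration only: once the mapping is deployed to an application, you could keep one parameter file per source database and run the deployed mapping three times. The domain, service, application, mapping, and parameter file names below are placeholders, and the exact options can vary by version (check the infacmd ms RunMapping reference for your release):

# one parameter file per source database; each one points the source connection parameter at DBR1, DBR2, or DBR3
infacmd ms RunMapping -dn MyDomain -sn MyDIS -un admin -pd '<password>' -a MyApp -m m_filter_to_dbw -pf /params/DBR1_params.xml
infacmd ms RunMapping -dn MyDomain -sn MyDIS -un admin -pd '<password>' -a MyApp -m m_filter_to_dbw -pf /params/DBR2_params.xml
infacmd ms RunMapping -dn MyDomain -sn MyDIS -un admin -pd '<password>' -a MyApp -m m_filter_to_dbw -pf /params/DBR3_params.xml

In all three runs the target connection parameter would point at DBW, so only the parameter file changes between runs.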
I have a file structure such as:
gs://BUCKET/Name/YYYY/MM/DD/Filename.csv
Every day my cloud functions create another path with another file in it corresponding to the date of the day (so for today, the 5th of August, we would have gs://BUCKET/Name/2022/08/05/Filename.csv).
I need to find a way to query this data in BigQuery automatically, so that if I want to query it for 'manual inspection' I can select, for example, data from all 3 months in one query, doing CREATE TABLE with gs://BUCKET/Name/2022/{06,07,08}/*/*.csv.
How can I replicate this? I know that BigQuery does not support more than 1 wildcard, but maybe there is a way to do so.
To query data inside GCS from BigQuery you can use an external table.
The problem is that the following will fail, because you cannot have a comma (,) as part of the URI list:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket/2022/{1,2,3}/data.csv']
)
You have to specify the 3 CSV file locations like this:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
  format = 'CSV',
  uris = [
    'gs://inigo-test1/2022/1/data.csv',
    'gs://inigo-test1/2022/2/data.csv',
    'gs://inigo-test1/2022/3/data.csv'
  ]
)
Since you're using this sporadically, it probably makes more sense to create a temporary external table.
I found a solution that works, at least for my use case, without using an external table.
During the creation of the table in the BigQuery dataset, use 'Create table from: Google Cloud Storage', and for the URI pattern use gs://BUCKET/Name/2022/*. As long as the filename is the same in each subfolder and the schema is identical, BQ will load everything, and you can then perform date operations directly in BQ (I have a column with the ingestion date).
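For reference, the same single trailing wildcard also works if you prefer an external table over a loaded one; a minimal sketch, using the bucket layout from the question with placeholder project/dataset/table names (skip_leading_rows assumes the CSVs have a header row):

CREATE EXTERNAL TABLE `my_project.my_dataset.name_2022`
OPTIONS (
  format = 'CSV',
  skip_leading_rows = 1,
  -- a single trailing wildcard is allowed, and it matches across the MM/DD subfolders
  uris = ['gs://BUCKET/Name/2022/*']
)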
I need to read a text file (JSON format) and load it into a database table in Informatica Developer. For only one text file and one database table, that is easy.
But now I have N different text files, hence N different database tables and their corresponding data processor transformations. The transformation logic inside the mappings is the same. Besides creating N sets of mappings and workflows for each set of text files, is it possible to create just one generalized mapping and workflow to cater for all text files? I would appreciate it if any one of you could give me a general direction for me to explore further.
I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest among them for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one
Flatten it, convert to table rows, and insert into BQ.
But I am unable to proceed with step 1. I see GroupByKey.create(), but I am unable to make it use the customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);  // read/parse placeholders
// Once we have the data in JsonObjects, we key by customer ID and group:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via((JsonObject elm) -> KV.of(elm.get("customerId").getAsString(), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.
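To illustrate that last step, here is a rough sketch that continues from groupedData above and keeps only the newest record per customer; the "timestamp" field name and its numeric (epoch) encoding are assumptions about your JSON:

// For each customer group, keep only the record with the greatest timestamp.
// Assumes each JsonObject carries a numeric "timestamp" field (adjust to your schema).
PCollection<JsonObject> latestPerCustomer =
    groupedData.apply("PickLatest",
        MapElements.via(
            new SimpleFunction<KV<String, Iterable<JsonObject>>, JsonObject>() {
              @Override
              public JsonObject apply(KV<String, Iterable<JsonObject>> group) {
                JsonObject latest = null;
                for (JsonObject record : group.getValue()) {
                  if (latest == null
                      || record.get("timestamp").getAsLong() > latest.get("timestamp").getAsLong()) {
                    latest = record;
                  }
                }
                return latest;
              }
            }));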
I have started working with NiFi. I am working on a use case to load data into Hive. I get a CSV file and then I use SplitText to split the incoming flow-file into multiple flow-files (split record by record). Then I use ConvertToAvro to convert the split CSV file into an Avro file. After that, I put the Avro files into a directory in HDFS and I trigger the "LOAD DATA" command using the ReplaceText + PutHiveQL processors.
I'm splitting the file record by record because I need to get the partition value (since LOAD DATA doesn't support dynamic partitioning). The flow looks like this:
GetFile (CSV) --- SplitText (split line count :1 and header line count : 1) --- ExtractText (Use RegEx to get partition fields' values and assign to attribute) --- ConvertToAvro (Specifying the Schema) --- PutHDFS (Writing to a HDFS location) --- ReplaceText (LOAD DATA cmd with partition info) --- PutHiveQL
The thing is, since I'm splitting the CSV file one record at a time, it generates too many Avro files. For example, if the CSV file has 100 records, it creates 100 Avro files. Since I want to get the partition values, I have to split it one record at a time. I want to know whether there is any way to achieve this without splitting record by record, i.e. by batching it. I'm quite new to this, so I am unable to crack this yet. Help me with this.
PS: Please suggest if there is any alternate approach to achieve this use case.
Are you looking to group the Avro records based on the partitions' values, one Avro file per unique value? Or do you only need the partitions' values for some number of LOAD DATA commands (and use a single Avro file with all the records)?
If the former, then you'd likely need a custom processor or ExecuteScript, since you'd need to parse, group/aggregate, and convert all in one step (i.e. for one CSV document). If the latter, then you can rearrange your flow into:
GetFile -> ConvertCSVToAvro -> PutHDFS -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> ReplaceText -> PutHiveQL
This flow puts the entire CSV file (as a single Avro file) into HDFS, and afterwards does the split (after converting to JSON, since we don't have an EvaluateAvroPath processor), gets the partition value(s), and generates the Hive LOAD DATA statements.
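For reference, the statement that ReplaceText would build (and PutHiveQL would execute) looks roughly like this; the path, table, and partition column names are placeholders:

LOAD DATA INPATH '/data/nifi/staging/input_file.avro'
INTO TABLE mydb.my_table
PARTITION (ingest_date = '2022-08-05');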
If you've placed the file (using the PutHDFS processor) at the location the Hive table reads its data from, then you don't need to call the PutHiveQL processor. I am also new to this, but I think you should leverage the schema-on-read capability of Hive.
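A rough sketch of that schema-on-read idea, with hypothetical column names and an HDFS location matching wherever PutHDFS writes the Avro files:

-- External table over the directory that PutHDFS writes to; Hive reads the
-- Avro files in place (schema on read), so no LOAD DATA step is required.
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.my_table (
  id     STRING,
  name   STRING,
  amount DOUBLE
)
STORED AS AVRO
LOCATION '/data/nifi/my_table';
-- If the table is partitioned, new partition directories still need to be
-- registered (e.g. ALTER TABLE ... ADD PARTITION or MSCK REPAIR TABLE).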
I am working with the Informatica Persistent Data Masking tool and I have to mask repeatable values in different tables and schemas with the same masking pattern.
For example: if some name, say 'Sonal', repeats in different tables, I want to mask 'Sonal' in all tables with the same masked value.
How can I do that, and which masking should I use? I have tried Key masking and Similar Value Columns.
Thanks.
Here are the steps:
1. Create individual connections (under the 'Administrator' tab of the product) for the databases that contain the tables with the data 'Sonal'.
2. Navigate to the 'Projects' tab, create a new Project, and import the metadata from all the connections you created (make sure you pick the required tables from each of those connections).
3. Under the 'Policies' page, create a "New Masking Rule", choosing the masking type 'Substitution' (from the 'Standard' drop-down menu). Click 'Next'.
4. In the 2nd step of the rule creation wizard, select 'Repeatable' and enter any value between 1 and 1000 as the 'Seed'.
5. Select any valid dictionary (a flat-file or relational dictionary, which should already have been created/added under the 'Administrator -> Dictionaries' page) that has a serial number and some valid names which can be used to mask the names in the original/source tables.
6. Pick the correct columns for the 'Masked Value' and 'Serial Number Column' fields, and save the rule.
7. Add this rule to your Project (the one with the metadata of all the required tables) under the "Project -> Overview -> Policies" page.
8. Navigate to the 'Define -> Data Masking' page, select all the required columns (in this example, whichever columns contain the name 'Sonal'), and mark them as Similar Value Columns.
9. Assign the Substitution masking rule (the one created with the 'Repeatable' option in step 4) to the column. Save the changes.
10. Create a Plan by navigating to the 'Execute' page. Make sure you select the correct masking rule. After saving the Plan, select 'Generate and Execute' from the Actions menu.
Once this Plan executes successfully, you will see the masked values (a consistent value for all occurrences of 'Sonal') in the target database(s).