I've been trying to use a MoveHDFS processor to move Parquet files from a /working/partition/ directory in HDFS to a /success/partition/ directory. The partition value is set by an ExecuteSparkJob processor earlier in the flow. After finding my Parquet files in the root / directory, I found the following in the processor description for Output Directory:
The HDFS directory where the files will be moved to.
Supports Expression Language: true (will be evaluated using variable registry only)
It turns out the processor was sending the files to / instead of ${dir}/.
Since my attributes are set on the fly based on the Spark processing result, I can't simply add them to the variable registry and restart nodes for each flowfile (which, from my limited understanding, is what using the variable registry would require). One option is to use an ExecuteStreamCommand processor with a custom script to accomplish this. Is that my only option here, or is there a built-in way to move HDFS files to attribute-defined directories?
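If you do go the ExecuteStreamCommand route, a minimal sketch of such a script might look like this, assuming the hdfs command-line client is available to NiFi and that the flowfile's source path and partition value are passed in via ExecuteStreamCommand's Command Arguments (the attribute names and the /success/ layout are placeholders, not anything from your flow):

    #!/usr/bin/env python3
    # Minimal sketch for an ExecuteStreamCommand script:
    #   argv[1] = current HDFS path of the file (from a flowfile attribute)
    #   argv[2] = partition value (from a flowfile attribute)
    import subprocess
    import sys

    def main():
        source_path = sys.argv[1]
        partition = sys.argv[2]
        dest_dir = "/success/{}/".format(partition)  # placeholder target layout
        # Create the destination directory if needed, then move the file.
        subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", dest_dir])
        subprocess.check_call(["hdfs", "dfs", "-mv", source_path, dest_dir])

    if __name__ == "__main__":
        main()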
You can try this approach:
Step 1: Use MoveHDFS to move your file to a temporary location, say path X. The Input Directory property of the MoveHDFS processor can accept a flowfile attribute.
Step 2: Connect the success relationship to a FetchHDFS processor.
Step 3: In the FetchHDFS processor, set the HDFS Filename property to the Expression Language value ${absolute.hdfs.path}/${filename}. This will fetch the file data from path X into the flowfile content.
Step 4: Connect the success relationship from FetchHDFS to a PutHDFS processor.
Step 5: Configure the PutHDFS Directory property to use the flowfile attribute carrying the partition value, so the destination is resolved on the fly (see the property sketch below).
Cons:
One drawback of this approach is the duplicate copy that MoveHDFS creates to store the data temporarily before it is sent to the actual location. You might have to develop a separate flow to delete the duplicate copy if it is not required.
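Roughly, the key property settings for the three processors could look like the sketch below; the staging path, the ${dir} and ${partition} attribute names, and the /success/ layout are placeholders to adapt to your own flow:

    MoveHDFS   Input Directory  = ${dir}                             # attribute-driven source path
               Output Directory = /tmp/nifi-staging                  # fixed "path X" (variable registry only)
    FetchHDFS  HDFS Filename    = ${absolute.hdfs.path}/${filename}
    PutHDFS    Directory        = /success/${partition}              # resolved per flowfile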
I want to read flat files as a source without specifying any file structure, using IICS (Informatica Intelligent Cloud Services). The flat file names can be anything, and the structure can also change at any time. I also need to create a table dynamically based on the flat file and insert the data into it.
There are a number of options here. You can use a fully parameterized mapping inside a taskflow that starts on a file listener, prepares the parameters, and builds the statements to be executed as part of the pre-SQL on your Target.
Inside the mapping you define the Source and Target as parameterized - and that's briefly it!
I have a Dataprep flow configured. The dataset is a GCS folder (all files from it). The target is a BigQuery table.
Since the data is coming from multiple files, I want to have the filename as one of the columns in the resulting data.
Is that possible?
UPDATE: There's now a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps. (If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface)
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
Original Answer
This is not currently possible out of the box. If you're manually merging datasets with UNION, you could first process them to add a column with the source so that it's then present in the combined output.
If you're bulk-ingesting files, that doesn't help, but there is an open feature request that you can comment on and/or follow for updates:
https://issuetracker.google.com/issues/74386476
I have started working with NiFi. I am working on a use case to load data into Hive. I get a CSV file and then I use SplitText to split the incoming flowfile into multiple flowfiles (split record by record). Then I use ConvertToAvro to convert each split CSV file into an Avro file. After that, I put the Avro files into a directory in HDFS and I trigger the "LOAD DATA" command using the ReplaceText + PutHiveQL processors.
I'm splitting the file record by record to get the partition value (since LOAD DATA doesn't support dynamic partitioning). The flow looks like this:
GetFile (CSV) --- SplitText (split line count :1 and header line count : 1) --- ExtractText (Use RegEx to get partition fields' values and assign to attribute) --- ConvertToAvro (Specifying the Schema) --- PutHDFS (Writing to a HDFS location) --- ReplaceText (LOAD DATA cmd with partition info) --- PutHiveQL
The thing is, since I'm splitting the CSV file one record at a time, it generates too many Avro files. For example, if the CSV file has 100 records, it creates 100 Avro files. Since I want to get the partition values, I have to split it one record at a time. Is there any way to achieve this without splitting record by record, i.e. by batching it? I'm quite new to this, so I haven't been able to crack it yet. Help me with this.
PS: Please suggest if there is any alternative approach to achieve this use case.
Are you looking to group the Avro records based on the partitions' values, one Avro file per unique value? Or do you only need the partitions' values for some number of LOAD DATA commands (and use a single Avro file with all the records)?
If the former, then you'd likely need a custom processor or ExecuteScript, since you'd need to parse, group/aggregate, and convert all in one step (i.e. for one CSV document). If the latter, then you can rearrange your flow into:
GetFile -> ConvertCSVToAvro -> PutHDFS -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> ReplaceText -> PutHiveQL
This flow puts the entire CSV file (as a single Avro file) into HDFS, then does the split afterwards (after converting to JSON, since we don't have an EvaluateAvroPath processor), gets the partition value(s), and generates the Hive LOAD DATA statements.
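For reference, the ReplaceText replacement value could be a template along these lines; the table name and the dt attribute (as extracted by EvaluateJsonPath) are placeholders, while absolute.hdfs.path and filename are attributes written earlier by PutHDFS, assuming they survive the split:

    LOAD DATA INPATH '${absolute.hdfs.path}/${filename}'
    INTO TABLE my_table PARTITION (dt = '${dt}')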
If you've placed the file (using the PutHDFS processor) at the location the Hive table reads its data from, then you don't need to call the PutHiveQL processor at all. I am also new to this, but I think you should leverage the schema-on-read capability of Hive.
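For instance, an external table defined over the PutHDFS output directory lets Hive read the Avro files in place, with no LOAD DATA step; the table name, columns, and location below are placeholders:

    CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
      id   INT,
      name STRING
    )
    STORED AS AVRO
    LOCATION '/data/my_table/';

If the table is partitioned, you would still need to register new partitions as their directories appear (for example with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION).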
I have a workflow which writes data from a table into a flat file. It works just fine, but I want to insert a blank line between each pair of records. How can this be achieved? Any pointers?
Here you can create two target instances: one with the proper data, and in the other instance pass a blank line. Set the Merge Type to "Concurrent Merge" in the session properties.
There are multiple possibilities:
You can prepare the appropriate dataset in a relational table and afterwards dump the data from it into a flat file. While preparing that dataset, you can insert blank rows into the relational target.
Send a blank line to a separate target file (based on some business condition, using a Router or something similar); after that you can use the merge files option (in the session config) to get the data into a single file.
I want to copy the data into one Excel file while the mapping is running in Informatica.
I used Informatica last year; now I'm on SSIS. If I remember correctly, you can set up a separate connection for an Excel target destination. Thereafter you pretty much drag all the fields from your source to the target destination (in this case Excel).
As usual, develop a mapping with a source and target.
But make the target a Flat File while creating it.
Then run the mapping and you will get the output in the flat file.