Dynamic parameter for GetFile/ListFile processor - NiFi - hdfs

My workflow is as follows:
ListenHTTP (I receive a directory name here) --> SplitText --> ExtractText (the directory name is added as an attribute)
After this, I need to use that directory-name attribute to pick up all the files in that local directory and put them into HDFS. I understand GetFile/ListFile could do this, but how do I provide a dynamic directory name to those processors?

Unfortunately, both GetFile and ListFile are source processors, which means they do not accept an incoming flowfile. The general pattern is to configure these processors with a static Input Directory value and allow them to read from it and manage their state.
In this case, I believe you need to use FetchFile, which accepts an incoming flowfile and reads the file path provided. By default, the File to Fetch property is set to ${absolute.path}/${filename}, which means it uses Apache NiFi Expression Language to resolve the value of those two attributes on the incoming flowfile. You could pass that flowfile to an ExecuteStreamCommand processor first and perform an ls on the directory, then split the results into individual flowfiles with one filename per line, and process each of these through the FetchFile.
I understand this isn't the most concise way to perform the task. Two other suggestions would be:
Open a Jira to request a processor which retrieves all of the files in a directory (at the time of incoming flowfile receipt) and requires an incoming flowfile to determine the directory.
Use an ExecuteScript processor. The processor would simply extract the attribute from the incoming flowfile and use Groovy/Ruby/Python/etc. facilities to retrieve the files from the directory, or perform the directory listing and pass individual flowfiles downstream to a FetchFile processor.
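As a rough illustration of the directory-listing step in the ExecuteScript suggestion, here is a minimal sketch in plain Python. The function name list_files_for_fetch is hypothetical, and the NiFi session calls needed to read the attribute and emit one flowfile per entry are deliberately left out; the sketch only shows how each file could be turned into the two attributes that FetchFile resolves by default.

```python
import os


def list_files_for_fetch(directory):
    """Return one attribute dict per regular file directly inside `directory`.

    The keys mirror the attributes used by FetchFile's default
    File to Fetch value, ${absolute.path}/${filename}.
    Subdirectories are skipped; the listing is not recursive.
    """
    base = os.path.abspath(directory)
    entries = []
    for name in sorted(os.listdir(base)):
        if os.path.isfile(os.path.join(base, name)):
            entries.append({"absolute.path": base, "filename": name})
    return entries
```

Inside ExecuteScript, each returned dict would become the attributes of a new flowfile routed to FetchFile. The listing reflects the directory contents at the moment the incoming flowfile is processed.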

I wonder if I can perform data-pipeline by directory of a specific name with DataFusion

I'm using Google Cloud Data Fusion.
Assume the bucket's path is as follows:
test_buk/...
In the test_buk bucket there are four files:
20190901, 20190902
20191001, 20191002
Let's say there is a directory inside test_buk called dir.
I have a prefix-based bundle for 201909 (e.g., 20190901, 20190902).
I also have a prefix-based bundle for 201910 (e.g., 20191001, 20191002).
I'd like to run the data pipeline for the 201909 and 201910 bundles.
Here's what I've tried:
running the data pipeline with the regex path filter
gs://test_buk/dir//2019
When the regex path filter is set, no input is read, and likewise there is no output.
When I want to build a data pipeline for a specific bundle within a directory, how do I handle it in Data Fusion?
If you put the raw path (gs://test_buk/dir/) directly in the filter, you may be hitting a problem with escaping special characters in the regex. That could be why no input file matching your filter reaches the pipeline.
I suggest instead that you use ".*" to match the initial part (since you are also specifying the path, no files in other folders will match the filter).
Therefore, I would use the following expressions, depending on the group of files you want (feel free to change the file extension):
path = gs://test_buk/dir/
regex path filter = .*201909.*\.csv or .*201910.*\.csv
If you would like to know more about the regex used, you can take a look at (1)
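To check which objects each filter would pick up, the patterns can be tried locally. The sketch below uses Python's re module as a stand-in; it assumes the filter has to match the whole path (which is why the leading ".*" is needed) and that the files carry a .csv extension as in the example above. The object names in PATHS and the helper select are hypothetical.

```python
import re

# Hypothetical object paths; the .csv extension follows the answer's example.
PATHS = [
    "gs://test_buk/dir/20190901.csv",
    "gs://test_buk/dir/20190902.csv",
    "gs://test_buk/dir/20191001.csv",
    "gs://test_buk/dir/20191002.csv",
]


def select(paths, pattern):
    """Return the paths that the pattern matches in full."""
    regex = re.compile(pattern)
    return [p for p in paths if regex.fullmatch(p)]


september = select(PATHS, r".*201909.*\.csv")
october = select(PATHS, r".*201910.*\.csv")
```

With these inputs, september holds the two 201909 paths and october the two 201910 paths, so one pipeline run per filter covers each bundle.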

Rename file after putHDFS

I have an Apache NiFi job where I get a file from the local system using GetFile and then use PutHDFS. How can I rename the file in HDFS after putting it into Hadoop?
I tried to use the ExecuteScript processor but can't get it to work:
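# Jython body for ExecuteScript; `session` and `REL_SUCCESS` are provided by the processor.
flowFile = session.get()
if flowFile is not None:
    # Strip the temporary suffix from the filename attribute.
    tempFileName = flowFile.getAttribute("filename")
    fileName = tempFileName.replace('._COPYING_', '')
    flowFile = session.putAttribute(flowFile, 'filename', fileName)
    session.transfer(flowFile, REL_SUCCESS)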
The answer by Shu is correct for how to manipulate the filename attribute in NiFi. However, if you have already written a file to HDFS and then use UpdateAttribute, the name of the file in HDFS will not change; only the value of the filename attribute in NiFi changes.
You could use the UpdateAttribute approach to create a new attribute called "final.filename" and then use MoveHDFS to move the original file to the final file.
Also of note, the PutHDFS processor already writes a temp file and moves it to the final file, so I'm not sure it is necessary for you to use the "._COPYING_" name. For example, if you send a flow file to PutHDFS with a filename of "foo", it will first write ".foo" to the directory and, when done, move it to "foo".
The only case where you need to use MoveHDFS is if some other process is monitoring the directory and can't ignore the dot files, then you write it somewhere else and use MoveHDFS once it is complete.
Instead of using the ExecuteScript processor (extra overhead), use an UpdateAttribute processor and feed it the success relationship from PutHDFS.
Add a new property to the UpdateAttribute processor, using the replaceAll function from NiFi Expression Language:
filename
${filename:replaceAll('<regex_expression>','<replacement_value>')}
(or)
using the replace function:
filename
${filename:replace('<search_string>','<replacement_value>')}
NiFi Expression Language offers various functions for manipulating strings; refer to the Expression Language documentation for details.
I tried the exact script from the question in an ExecuteScript processor with the Script Engine set to Python, and everything works as expected.
Because you are using the .replace function and replacing with '', the filename fn._COPYING_ is changed to fn.
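The difference between a literal replacement and a regex replacement can be tried outside NiFi. In the sketch below, Python's str.replace and re.sub are only stand-ins for the literal and regex styles of replacement discussed above, not the Expression Language functions themselves, and the helper names are hypothetical.

```python
import re

COPYING_SUFFIX = "._COPYING_"


def strip_literal(name):
    """Literal replacement: removes every occurrence of the suffix text."""
    return name.replace(COPYING_SUFFIX, "")


def strip_regex(name):
    """Regex replacement: removes the suffix only at the end of the name.

    The dot is escaped so it matches a literal '.', and '$' anchors the
    match to the end of the string.
    """
    return re.sub(r"\._COPYING_$", "", name)
```

Both helpers turn fn._COPYING_ into fn. They differ only when the marker appears in the middle of a name, which the anchored regex leaves untouched.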

i have headers separately, how to import it to informatica target

I have a source and a target in Informatica PowerCenter Developer. I need different header names to be written to the target file automatically, without any manual entry. How can I import customized headers into an Informatica target?
What have you tried?
You can use a header command in the session configuration for the target. I haven't used it and couldn't find any documentation on it (i.e. what is possible and how, whether parameters can be used, etc.). I did test (on Windows) an ECHO command to output its text to the header row, but it didn't seem to recognize parameters.
Or you can try to include the header as the first data output row. That means your output will have to be all string types and length restrictions may compound the issue.
Or you can try using two mappings, one that truncates the files and writes the header and one which outputs the data specifying append in the session. You may need two target definitions pointing to the same files. I don't know if the second mapping would attempt to load the existing data (i.e. typecheck), in which case it might throw an error if it didn't match.
Other options may be possible, we don't do much with flat files.
The logic is as follows:
In the session, there is an option called user defined headers. Type echo followed by the column names, separated by commas:
echo A, B, C

Specifying Source file name using parameter variables in informatica 9?

I have a mapping like
SA-->SQ--->EXPR--->TGT
The source will always have the same structure, and so will the target.
There are a bunch of files (with the same structure) which will go through this mapping.
So I want to use a parameter file through which I will supply the file names manually for every run.
How do I use the parameter file in the session for the Source filename attribute?
Please suggest.
You could use the indirect source type, wherein your source file is basically a list of files, and the session in turn reads each of the files one by one.
The parameter file could reference a source file name (the list) as
$InputFile_myName=/a/b/c.list
In line with what Raghav says, enter the name of a file that will hold a list of input files in the 'Source filename' property box for the SQ in question in the Mapping tab, and set the 'Source filetype' to 'Indirect' in the Session Properties. If you already know the names of the input files ahead of time, you can list them in that file and deploy it with the workflow to the location you indicate in the 'Source file directory' property box.
However, if you won't know the names of the input files until run-time but do know their naming standard (e.g. "Input_files_name_ABC_*", where "*" represents variable text, such as a numeric value incremented per input file generated by some other process), one way to deal with that is to use a Pre-Session Command, which you can specify in the 'Components' tab of the Session. Create one that builds a new file, at the location and with the name specified for the Indirect input file referenced above, by using the Unix shell (or, if running on Windows, the cmd shell) to list the files conforming to the naming standard and redirect the listing output to that file.
The tricky thing is that the Indirect input file must list at least one file (even if that file is empty), and every file it lists must exist. The workflow fails (abends) if the indirect file reader gets no files to read from or if a listed file is not present on the server. One way to get around this is to make sure an empty file that conforms to the naming standard is present at all times. This can be assured by creating a "touchfile" before executing the listing command that builds the Indirect listing file. In Unix, you'd use the 'touch {path}/{filename}' command ({filename} could be, for example, "Input_files_name_ABC_TOUCHFILE"); on Windows you'd redirect an empty string to a likewise-named file via a cmd shell process. Either way, that will help you avoid an abend. Cleaning up is easy: a Post-Session command can delete the empty touchfile, and the same can be done for the Indirect listing file if desired.
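The touchfile-then-list sequence could also be done by a small script called from the Pre-Session Command instead of raw shell commands. The following is only a sketch under that assumption: the function name build_indirect_list and its parameters are hypothetical, and the file prefix follows the "Input_files_name_ABC_" example used in this answer.

```python
import glob
import os


def build_indirect_list(source_dir, prefix, list_path):
    """Write the paths of all files named `prefix*` in `source_dir` to `list_path`.

    An empty touchfile is created first so that the listing always
    contains at least one existing file.
    """
    touch_path = os.path.join(source_dir, prefix + "TOUCHFILE")
    # Append mode creates the file if missing and leaves existing content alone.
    with open(touch_path, "a"):
        pass

    # Escape the directory and prefix so only the trailing '*' acts as a wildcard.
    pattern = os.path.join(glob.escape(source_dir), glob.escape(prefix) + "*")
    matches = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))

    # One path per line, which is the layout of a file list.
    with open(list_path, "w") as handle:
        for path in matches:
            handle.write(path + "\n")
    return matches
```

The list file should live outside the source directory, or at least not share the prefix, so it never lists itself. A Post-Session command can then delete the touchfile as described above.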

Find context name in Railo/Jetty application code?

I create my contexts by dropping [myname].xml files in the contexts/ directory, but in my CFML code I want to dynamically find the value of [myname], i.e. the name of the context/webapp (or, failing that, the filename of the XML file or the original value of the resourceBase property before path translation occurs).
I can get data about the context (like the array of virtualhosts) using the object returned from getPageContext().getConfig().getServletContext().getContextHandler().getCurrentWebAppContext() but if the context name is in there I haven't worked out how to get at it.
Use getDisplayName on that object you have?
It defaults to null (would be useful if it was the filename), but you can specify it in the context XML file with <Set name="DisplayName">bob</Set>
(If you have lots of XML files to deal with, do a script to loop through each file and plonk that with the filename inside the Configure tag.)
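A rough sketch of such a script, in Python. It assumes each context file has a single opening Configure tag and inserts the Set element as text right after it, so the rest of the file (including any DOCTYPE line) is left untouched. The function names add_display_name and tag_contexts are hypothetical.

```python
import os
import re
from xml.sax.saxutils import escape

# Matches the opening <Configure ...> tag of a context file.
CONFIGURE_OPEN = re.compile(r"<Configure\b[^>]*>")


def add_display_name(xml_text, name):
    """Insert a DisplayName setter right after the opening Configure tag."""
    if 'name="DisplayName"' in xml_text:
        return xml_text  # already set, leave the text alone
    setter = '\n  <Set name="DisplayName">%s</Set>' % escape(name)
    # A function replacement avoids backslash processing in the inserted text.
    return CONFIGURE_OPEN.sub(lambda m: m.group(0) + setter, xml_text, count=1)


def tag_contexts(contexts_dir):
    """Set DisplayName to the file's base name in every .xml file of the directory."""
    for filename in sorted(os.listdir(contexts_dir)):
        if not filename.endswith(".xml"):
            continue
        path = os.path.join(contexts_dir, filename)
        with open(path) as handle:
            original = handle.read()
        updated = add_display_name(original, filename[: -len(".xml")])
        if updated != original:
            with open(path, "w") as handle:
                handle.write(updated)
```

Running tag_contexts on the contexts/ directory twice is harmless, since files that already set DisplayName are skipped. It is worth trying on a copy of the directory first.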