I have an Apache NiFi job where I get a file from the filesystem using GetFile and then use PutHDFS. How can I rename the file in HDFS after putting it into Hadoop?
I tried to use the ExecuteScript processor but can't get it to work:
flowFile = session.get()
if flowFile is not None:
    tempFileName = flowFile.getAttribute("filename")
    fileName = tempFileName.replace('._COPYING_', '')
    flowFile = session.putAttribute(flowFile, 'filename', fileName)
    session.transfer(flowFile, REL_SUCCESS)
The answer above by Shu is correct for how to manipulate the filename attribute in NiFi. However, if you have already written a file to HDFS and then use UpdateAttribute, it is not going to change the name of the file in HDFS; it will only change the value of the filename attribute in NiFi.
You could use the UpdateAttribute approach to create a new attribute called "final.filename" and then use MoveHDFS to move the original file to the final file.
Also of note, the PutHDFS processor already writes a temp file and moves it to the final file, so I'm not sure it is necessary for you to name it "._COPYING_". For example, if you send a flow file to PutHDFS with a filename of "foo", it will first write ".foo" to the directory and, when done, move it to "foo".
The only case where you need to use MoveHDFS is if some other process is monitoring the directory and can't ignore the dot files; in that case you write the file somewhere else and use MoveHDFS to move it into place once it is complete.
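For the rename-after-write approach, a sketch of that UpdateAttribute property (the attribute name final.filename is just the example from above, and the expression assumes the final name is simply the original with the temporary suffix stripped):
final.filename
${filename:replace('._COPYING_','')}
MoveHDFS, or any downstream processor whose properties support Expression Language, can then reference ${final.filename} when building the destination path.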
Instead of using an ExecuteScript processor (extra overhead), use an UpdateAttribute processor and feed it the success relationship from PutHDFS.
Add a new property in the UpdateAttribute processor as:
filename
${filename:replaceAll('<regex_expression>','<replacement_value>')}
Use the replaceAll function from NiFi Expression Language.
(or)
Using the replace function:
filename
${filename:replace('<search_string>','<replacement_value>')}
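For the specific case in the question, stripping the ._COPYING_ suffix, the value would be:
filename
${filename:replace('._COPYING_','')}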
NiFi Expression Language offers many functions for manipulating strings; refer to the Expression Language documentation for more details.
I have tried the exact same script as in the question with an ExecuteScript processor (Script Engine set to Python) and everything works as expected.
Since you are using the .replace function and replacing '._COPYING_' with '':
Output: the filename fn._COPYING_ is changed to fn.
I have a use case where I want to read the filename from a metadata table. I have written a pipeline function to read the metadata table, but I am not sure how I can pass this information to ReadFromText, as it only takes a string as input. Is it possible to assign this value to ReadFromText()? Please suggest some workarounds or ideas for how to achieve this. Thanks!
code:
pipeline | 'Read from a File' >> ReadFromText(<I want to pass the file path here?>,
                                              skip_header_lines=1)
Note: there will be various folders and files in storage (the files are in CSV format), but in my use case I can't directly pass the storage location or filename to the file path in ReadFromText. I want to read it from the metadata and pass the value. I hope that is clear. Thanks!
I don't understand why you need to read the metadata. If you want to read all the files inside a folder, you can just provide a glob. This solution works in Python; I am not sure about Java.
p | ReadFromText("./folder/*.csv")
"*" is the glob here, which lets the pipeline read every file matching the .csv pattern. You can also add something before it.
What you want is textio.ReadAllFromText, which reads from a PCollection instead of taking a string directly.
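A minimal sketch of that approach, assuming your metadata lookup yields file path strings (read_paths_from_metadata below is a hypothetical stand-in for your own metadata-table read, and the paths are made up):

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText

def read_paths_from_metadata():
    # Hypothetical stand-in for the metadata-table lookup; it just
    # needs to produce an iterable of file path strings.
    return ['gs://my-bucket/folder/file1.csv', 'gs://my-bucket/folder/file2.csv']

with beam.Pipeline() as p:
    (
        p
        | 'File paths' >> beam.Create(read_paths_from_metadata())
        | 'Read each file' >> ReadAllFromText(skip_header_lines=1)
        | 'Show lines' >> beam.Map(print)
    )

If the metadata table is itself readable from the pipeline, you can instead build the PCollection of paths with a regular read transform rather than beam.Create.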
I'm using Google Cloud Platform Data Fusion.
Assuming that the bucket's path is as follows:
test_buk/...
In the test_buk bucket there are four files:
20190901, 20190902
20191001, 20191002
Let's say there is a directory inside test_buk called dir.
I have a prefix-based bundle based on 201909 (e.g., 20190901, 20190902),
and likewise a prefix-based bundle based on 201910 (e.g., 20191001, 20191002).
I'd like to complete the data pipeline for the 201909 and 201910 bundles.
Here's what I've tried:
setting the regex path filter to
gs://test_buk/dir/*/2019*
to run the data pipeline.
With this regex path filter in place, no Input value is read, and likewise there is no Output value.
When I want to create a data pipeline over a specific directory in a bundle, how do I handle it in Data Fusion?
If you use the raw path directly (gs://test_buk/dir/*/2019*), you might be getting an error because the special characters in the regex need escaping. That might be the reason why no input file matching your filter gets into the pipeline.
I suggest instead that you use ".*" to match the initial part (given that you are also specifying the path, no additional files in other folders will match the filter).
Therefore, I would use the following expressions depending on the group of files you want to use (feel free to change the extension of the files):
path = gs://test_buk/dir/
regex path filter = .*201909.*\.csv or .*201910.*\.csv
If you would like to know more about the regex used, you can take a look at (1).
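As a quick sanity check, you can test those filters against the example object names with Python's re module (the .csv extensions here are an assumption carried over from the filters above):

import re

# Example object paths modeled on the question; the .csv extension is assumed.
paths = [
    'gs://test_buk/dir/20190901.csv',
    'gs://test_buk/dir/20190902.csv',
    'gs://test_buk/dir/20191001.csv',
    'gs://test_buk/dir/20191002.csv',
]

pattern = re.compile(r'.*201909.*\.csv')
for path in paths:
    # fullmatch: the whole path must satisfy the regex, as a path filter requires
    print(path, bool(pattern.fullmatch(path)))

Only the two 201909 files match; swapping in .*201910.*\.csv selects the other pair.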
My workflow is as follows:
ListenHTTP (I get a directory name here) --> SplitText --> ExtractText (directory name added as attribute)
Now, after this, I have to use that directory-name attribute, extract all the files in that local directory, and put them into HDFS. I understand GetFile/ListFile could do this, but how do we provide a dynamic directory name to those processors?
Unfortunately, both GetFile and ListFile are source processors, which means they do not accept an incoming flowfile. The general pattern is to configure these processors with a static Input Directory value and allow them to read from it and manage their state.
In this case, I believe you need to use FetchFile, which accepts an incoming flowfile and reads the file path provided. By default, the File to Fetch property is set to ${absolute.path}/${filename}, which means it uses Apache NiFi Expression Language to resolve the value of those two attributes on the incoming flowfile. You could pass that flowfile to an ExecuteStreamCommand processor first and perform an ls on the directory, then split the results into individual flowfiles with one filename per line, and process each of these through the FetchFile.
I understand this isn't the most concise way to perform the task. Two other suggestions would be:
Open a Jira to request a processor which retrieves all of the files in a directory (at the time of incoming flowfile receipt) and requires an incoming flowfile to determine the directory.
Use an ExecuteScript processor. The processor would simply extract the attribute from the incoming flowfile and use Groovy/Ruby/Python/etc. facilities to retrieve the files from the directory, or perform the directory listing and pass individual flowfiles downstream to a FetchFile processor.
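A minimal Jython sketch of that ExecuteScript variant, assuming the directory path arrives in an attribute named directory.name (the attribute name is hypothetical; use whatever your ExtractText produces):

import os

flowFile = session.get()
if flowFile is not None:
    directory = flowFile.getAttribute('directory.name')  # hypothetical attribute name
    for name in os.listdir(directory):
        if os.path.isfile(os.path.join(directory, name)):
            # One child flowfile per file, carrying the attributes that
            # FetchFile's default File to Fetch (${absolute.path}/${filename}) expects.
            child = session.create(flowFile)
            child = session.putAttribute(child, 'absolute.path', directory)
            child = session.putAttribute(child, 'filename', name)
            session.transfer(child, REL_SUCCESS)
    session.remove(flowFile)

Each child flowfile can then be routed to a FetchFile and on to PutHDFS.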
I have a mapping like
SA-->SQ--->EXPR--->TGT
The source and the target have the same structure.
There are a bunch of files (all with the same structure) that will go through this mapping.
So I want to use a parameter file through which I will supply the file names for every run manually.
How do I use the parameter file in the session for the Source filename attribute?
Please suggest.
You could use the indirect source type, wherein your source file is basically a list of files, and the session in turn reads each of the listed files one by one.
The parameter file could reference the source file name (the list) as:
$InputFile_myName=/a/b/c.list
In line with what Raghav says, indicate the name of a file that will hold the list of input files in the 'Source filename' property box for the SQ in question in the Mapping tab, and make the 'Source filetype' be 'Indirect', specified in the Session Properties. If you already know the names of the input files ahead of time, you can specify them in that file and deploy it with the workflow to the location you indicate in the 'Source file directory' property box. However, if you won't know the names of the input files until run-time but do know the files' naming standard (e.g. "Input_files_name_ABC_*", where "*" represents variable text, such as a numeric value incremented per input file generated by some other process), then one way to deal with that is to use a Pre-Session Command, specifiable in the 'Components' tab of the Session. Create one that builds a new file, at the location and with the name specified for the Indirect input file referenced above, by using the Unix shell (or, if running on Windows, the cmd shell) to list the files conforming to the naming standard and redirect the listing output to that file.
The tricky thing is that there must be one or more files listed in that Indirect type of input file. If that file is empty, the workflow will fail (abend). An Indirect file type must list at least one file (even if that file is empty), and that file must exist: the workflow fails if the indirect file reader gets no files to read or if a file listed in it is not present on the server. One way to get around this is to make sure an empty file conforming to the naming standard is present at all times. This can be assured by creating a "touchfile" before executing the listing command that builds the Indirect listing file, as in the sketch below. In Unix you'd use the 'touch {path}/{filename}' command ({filename} could be, for example, "Input_files_name_ABC_TOUCHFILE"); on Windows you'd redirect an empty string to a likewise-named file via the cmd shell. Either way, that will help you avoid an abend. Cleaning up that file is easy: a Post-Session Command can delete the empty touchfile, and likewise the Indirect file itself if desired.
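A minimal Unix sketch of such a Pre-Session Command, reusing the illustrative names from above (the paths and naming standard are assumptions):

# Ensure at least one matching (empty) file always exists, so the indirect read never abends
touch /a/b/Input_files_name_ABC_TOUCHFILE
# Build the indirect list file from everything matching the naming standard
ls /a/b/Input_files_name_ABC_* > /a/b/c.list

A matching Post-Session Command could then simply 'rm /a/b/Input_files_name_ABC_TOUCHFILE' (and '/a/b/c.list' if desired).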
I have a dataset containing thousands of tweets. Some of them contain URLs, but most are in the classic shortened forms used on Twitter. I need something that resolves the full URLs so that I can check for the presence of some particular websites. I have solved the problem in Python like this:
import urllib2

url_filename = r'C:\Users\Monica\Documents\Pythonfiles\urlstrial.txt'
url_filename2 = r'C:\Users\Monica\Documents\Pythonfiles\output_file.txt'
url_file = open(url_filename, 'r')
out = open(url_filename2, 'w')
for line in url_file:
    tco_url = line.strip('\n')
    req = urllib2.urlopen(tco_url)
    print >>out, req.url
url_file.close()
out.close()
This works, but it requires me to export my URLs from Stata to a .txt file and then reimport the full URLs. Is there some version of my Python script that would allow me to integrate the task into Stata using the shell command? I have quite a lot of different .dta files and would ideally like to avoid appending them all just to execute this task.
Thanks in advance for any answer!
Sure, this is possible without leaving Stata. I am using a Mac running OS X. The details might differ on your operating system, which I am guessing is Windows.
Python and Stata Method
Say we have the following trivial Python program, called hello.py:
#!/usr/bin/env python
import csv

data = [['name', 'message'], ['Monica', 'Hello World!']]
with open('data.csv', 'w') as wsock:
    wtr = csv.writer(wsock)
    for i in data:
        wtr.writerow(i)
# no explicit close needed; the with block closes the file
This "program" just writes some fake data to a file called data.csv in the script's working directory. Now make sure the script is executable: chmod 755 hello.py.
From within Stata, you can do the following:
! ./hello.py
* The above line called the Python program, which created a data.csv file.
insheet using data.csv, comma clear names case
list
+-----------------------+
| name message |
|-----------------------|
1. | Monica Hello World! |
+-----------------------+
This is a simple example. The full process for your case will be:
Write file to disk with the URLs, using outsheet or some other command
Use ! to call the Python script
Read the output into Stata using insheet or infile or some other command
Cleanup by deleting files with capture erase my_file_on_disk.csv
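Putting those steps together, a hypothetical sketch (the file names and the expand_urls.py script are illustrative, not real):

outsheet url using urls_to_expand.txt, nonames noquote replace
! ./expand_urls.py
insheet using expanded_urls.txt, clear
capture erase urls_to_expand.txt
capture erase expanded_urls.txt

Here expand_urls.py would be your URL-resolving script from above, pointed at urls_to_expand.txt and writing expanded_urls.txt.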
Let me know if that is not clear. It works fine on *nix; as I said, Windows might be a little different. If I had a Windows box I would test it.
Pure Stata Solution (kind of a hack)
Also, I think what you want to accomplish can be done completely in Stata, but it's a hack. Here are two programs. The first simply opens a log file and makes a request for the url (which is the first argument). The second reads that log file and uses regular expressions to find the url that Stata was redirected to.
capture program drop geturl
program define geturl
    * pass short url as first argument (e.g. http://bit.ly/162VWRZ)
    capture erase temp_log.txt
    log using temp_log.txt
    copy `1' temp_web_file
end
The above program will not finish because the copy command will fail (intentionally). It also doesn't clean up after itself (intentionally). So I created the next program to read what happened (and get the URL redirect).
capture program drop longurl
program define longurl, rclass
    * find the url in the log file created by geturl
    capture log close
    loc long_url = ""
    file open urlfile using temp_log.txt, read
    file read urlfile line
    while r(eof) == 0 {
        if regexm("`line'", "server says file permanently redirected to (.+)") == 1 {
            loc long_url = regexs(1)
        }
        file read urlfile line
    }
    file close urlfile
    return local url "`long_url'"
end
You can use it like this:
geturl http://bit.ly/162VWRZ
longurl
di "The long url is: `r(url)'"
* The long url is: http://www.ciwati.it/2013/06/10/wdays/?utm_source=twitterfeed&
* > utm_medium=twitter
You should run them one after the other. Things might get ugly using this solution, but it does find the URL you are looking for. May I suggest that another approach is to contact the shortening service and ask nicely for some data?
If someone at Stata is reading this, it would be nice to have copy return HTTP response header information. Doing this entirely in Stata is a little out there. Personally, I would use Python for this sort of thing and use Stata for the analysis of the data once I had everything I needed.