How to access HDFS extended attributes in Java code - hdfs

How can I access extended attributes of an HDFS file or directory in my Java code? Any pointers will help.
Thanks and Regards,
Abhay Dandekar

You can use the FileSystem class to access and modify them.
There is a method called getXAttr; here is the relevant piece of documentation taken from the API:
public byte[] getXAttr(Path path, String name) throws IOException
Get an xattr name and value for a file or directory. The name must be
prefixed with the namespace followed by ".". For example, "user.attr".
Refer to the HDFS extended attributes user documentation for details.
Parameters:
path - Path to get extended attribute
name - xattr name.
Returns:
byte[] xattr value.
You can check the whole API in the Hadoop FileSystem Javadoc.
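As a rough sketch, reading and writing an xattr might look like the following (the path, attribute value, and class name are placeholders; setXAttr is the companion write method):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class XAttrExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/example.txt");   // placeholder path

        // Write an extended attribute in the "user" namespace
        fs.setXAttr(file, "user.attr", "some-value".getBytes(StandardCharsets.UTF_8));

        // Read it back; the name must include the namespace prefix
        byte[] value = fs.getXAttr(file, "user.attr");
        System.out.println(new String(value, StandardCharsets.UTF_8));

        fs.close();
    }
}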

Related

Google Cloud Storage JSON REST API - Insert and List objects in a sub-directory in a bucket

I'm trying to figure out how to:
a) Store / insert an object on Google Cloud Storage within a sub-directory
b) List a given sub-directory's contents
I managed to resolve how to get an object here: Google Cloud Storage JSON REST API - Get object held in a sub-directory
However, the same logic doesn't seem to apply to these other types of call.
For store, this works:
https://www.googleapis.com/upload/storage/v1/b/bucket-name/o?uploadType=media&name=foldername%2objectname
but it then stores the file name on GCS as foldername_filename, which doesn't change functionality but isn't really ideal.
For listing objects in a bucket, I'm not sure where the syntax for a nested directory should go in here: storage.googleapis.com/storage/v1/b/bucketname/o.
Any insight much appreciated.
The first thing to know is that GCS does not have directories or folders. It just has objects that share a common prefix. There are features in both gsutil and the UI that create the illusion that folders do exist.
https://cloud.google.com/storage/docs/gsutil/addlhelp/HowSubdirectoriesWork
With that out of the way: to create an object with a prefix you need to URL-encode the object name; as I recall, %2F is the encoding for /:
https://cloud.google.com/storage/docs/request-endpoints#encoding
Finally, to list only the objects with a common prefix, you would use the prefix parameter:
https://cloud.google.com/storage/docs/json_api/v1/objects/list#parameters
Using prefix=foo/bar/baz (after encoding) would list all the objects in the foo/bar/baz "folder". Note that this is recursive: it will include foo/bar/baz/quux/object-name in the results. To stop at one level, you will want to read about the delimiter parameter too.
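Putting the two together for the question's example (bucket and object names are placeholders), the upload request with the slash encoded would look something like
https://www.googleapis.com/upload/storage/v1/b/bucket-name/o?uploadType=media&name=foldername%2Fobjectname
and listing only the objects under that "folder" would look something like
https://storage.googleapis.com/storage/v1/b/bucket-name/o?prefix=foldername%2F&delimiter=%2F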

I wonder if I can run a data pipeline on a directory of a specific name with Data Fusion

I'm using Google Cloud Platform Data Fusion.
Assuming that the bucket's path is as follows:
test_buk/...
In the test_buk bucket there are four files:
20190901, 20190902
20191001, 20191002
Let's say there is a directory inside test_buk called dir.
I have a prefix-based bundle based on 201909 (e.g., 20190901, 20190902),
and I also have a prefix-based bundle based on 201910 (e.g., 20191001, 20191002).
I'd like to complete the data-pipeline for 201909 and 201910 bundles.
Here's what I've tried: a regex path filter of
gs://test_buk/dir//2019
to run the data pipeline. If this regex path filter is set, no Input value is read, and likewise there is no Output value.
When I want to create a data pipeline with a specific directory in a bundle, how do I handle it in Data Fusion?
If you use the raw path directly (gs://test_buk/dir/), you might be hitting an issue with escaping special characters in the regex. That might be the reason why no input files matching your filter are read into the pipeline.
I suggest instead that you use ".*" to match the initial part (given that you are also specifying the path, no additional files in other folders will match the filter).
Therefore, I would use the following expressions depending on the group of files you want to use (feel free to change the extension of the files):
path = gs://test_buk/dir/
regex path filter = .*201909.*\.csv or .*201910.*\.csv
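For instance, with the files listed in the question (which have no extension, so you would drop the \.csv part), .*201909.* would match gs://test_buk/dir/20190901 and gs://test_buk/dir/20190902, while .*201910.* would match the other two files.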
If you would like to know more about the regex used, you can take a look at (1)

Rename file after PutHDFS

I have an Apache NiFi job where I get a file from the filesystem using GetFile and then put it into HDFS with PutHDFS. How can I rename the file in HDFS after putting it in Hadoop?
I tried to use the ExecuteScript processor but can't get it to work:
flowFile = session.get()
if flowFile != None:
    tempFileName = flowFile.getAttribute("filename")
    fileName = tempFileName.replace('._COPYING_', '')
    flowFile = session.putAttribute(flowFile, 'filename', fileName)
    session.transfer(flowFile, REL_SUCCESS)
Shu's answer is correct for how to manipulate the filename attribute in NiFi, but if you have already written a file to HDFS and then use UpdateAttribute, it is not going to change the name of the file in HDFS; it will only change the value of the filename attribute in NiFi.
You could use the UpdateAttribute approach to create a new attribute called "final.filename" and then use MoveHDFS to move the original file to the final file.
Also of note: the PutHDFS processor already writes a temp file and moves it to the final file, so I'm not sure it is necessary for you to deal with the "._COPYING_" name at all. For example, if you send a flow file to PutHDFS with a filename of "foo", it will first write ".foo" to the directory and, when done, move it to "foo".
The only case where you need MoveHDFS is if some other process is monitoring the directory and can't ignore the dot files; in that case you write the file somewhere else and use MoveHDFS once it is complete.
Instead of using the ExecuteScript processor (extra overhead), use the UpdateAttribute processor and feed it the success relationship from PutHDFS.
Add a new property in the UpdateAttribute processor:
filename
${filename:replaceAll('<regex_expression>','<replacement_value>')}
This uses the replaceAll function from the NiFi Expression Language.
(or)
Using the replace function:
filename
${filename:replace('<search_string>','<replacement_value>')}
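For this question's case (stripping the ._COPYING_ suffix), the property value would be something like ${filename:replace('._COPYING_','')}.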
The NiFi Expression Language offers different functions to manipulate strings; refer to the Expression Language Guide for more documentation.
I have tried the same exact script from the question with the ExecuteScript processor (Script Engine set to Python) and everything works as expected, as you are using the .replace function and replacing the match with ''.
Output: the filename fn._COPYING_ got changed to fn.

Rails 4 upload file extension error ActionDispatch::Http::UploadedFile

Guys,
Now I want to upload a file with Rails 4.
My problem is that I can't check the file extension before uploading it.
Note: I can upload the file fine, but I want to get the file type before uploading it,
because I need the extension in another step in my app.
I tried to use the command
File.extname(params[:Upload])
but I always got the error
can't convert ActionDispatch::Http::UploadedFile into String
Also, how can I get the file base name before uploading it?
When I tried to use
File.basename(params[:Upload])
I got the same error:
can't convert ActionDispatch::Http::UploadedFile into String
Also, when I tried to convert the name to String, I didn't get anything.
That's because File.extname expects a string file name, but your uploaded file (params[:Upload]) is an object: an instance of the ActionDispatch::Http::UploadedFile class (a kind of wrapper around a temporary file).
To fix the problem, you need to call the path method on your params[:Upload] object, kind of like this:
File.extname(params[:Upload].path)
Btw, if you're trying to get the type of the uploaded file, I'd encourage you to check params[:Upload].content_type instead; it's harder to spoof.
You can use this:
params[:Upload].original_filename.split('.').last
The original_filename contains the full filename with its extension,
so you split it on '.' and the last element will be the file extension.
For Example:
"my_file.doc.pdf".split('.').last # => 'pdf'
You can check the ActionDispatch::Http::UploadedFile documentation for more info.

How to download a file from AWS S3 using Python without using the key

I need to download an XML file from AWS S3.
I tried using get_contents_to_filename(fname), and it worked.
But I need to download the file without specifying fname, because if I specify fname, the downloaded file gets saved as fname.
I want to save the file as it is, with its original name.
This is my current code:
k = Key(bucket)
k.set_contents_from_filename(fname)
Can someone please help me download and fetch the file without specifying the filename?
Thanks in advance!
I'm not sure which library you're using, but if k is the AWS key you want to download, then k.name is probably the key name, so k.get_contents_to_filename(k.name) would probably do more or less what you want.
The one problem is that the key name might not be a legal file name, or it may have file path separators. So if the key name were something like '../../../../somepath/somename' the file would be saved somewhere you don't expect. So copy k.name to a string and either sanitize it by changing all dangerous characters to safe ones, or just extract the part of the key name you want to use for the file name.
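For example (assuming the boto Key object shown in the question), running the key name through Python's os.path.basename before passing it to get_contents_to_filename, so that '../../../../somepath/somename' becomes just 'somename', would be one simple way to do that.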