Apache Beam/Dataflow- passing file path to ReadFromText

I have a use case where I need to read a filename from a metadata table. I have written a pipeline function that reads the metadata table, but I am not sure how to pass the result to ReadFromText, since it only takes a string as input. Is it possible to pass this value to ReadFromText()? Please suggest workarounds or ideas for achieving this. Thanks.
Code:
pipeline | 'Read from a File' >> ReadFromText(
    file_path,  # I want to pass the file path read from the metadata table here
    skip_header_lines=1)
Note: There will be various folders and files in storage, and the files are in CSV format, but in my use case I can't pass the storage location or filename directly as the file path in ReadFromText. I want to read it from the metadata and pass that value. I hope that is clear. Thanks.

I don't understand why you need to read the metadata. If you want to read all the files inside a folder, you can just provide a glob. This works in Python; I'm not sure about Java.
p | ReadFromText("./folder/*.csv")
The "*" is the glob wildcard here, which lets the pipeline read every file whose name matches *.csv. You can also put a prefix in front of the wildcard.

What you want is textio.ReadAllFromText which reads from a PCollection instead of taking a string directly.

Related

Error: Cannot find the jsonl: gs://{bucket_name}/Frist_test.jsonl in request

I am exploring Google Cloud Platform Natural Language for entity extraction. I am just setting up a playground to get the hang of things, and I can't seem to get past square one. I have created a new Cloud Storage bucket to hold my project file.
I made a simple CSV file that points to a one-line JSONL file, but I am missing something in the address of the file stored in my bucket.
My csv looks like this:
Train, gs://new_wc_training/Frist_test.jsonl
And my jsonl file looks like this:
{"text_snippet":{"content": "This is a first test of my json file."}}
When I import my csv file I get the error:
Error: Cannot find the jsonl: gs://new_wc_training/Frist_test.jsonl in request.
I am sure I am just missing something in the structure of the address of the JSONL file in the bucket, but I am at a loss to find it.
Thank you for looking over my issue. If any additional information is needed, do not hesitate to ask.

As discussed with #fred_rogers, the problem is that the JSONL filename declared in the CSV file does not match the actual filename in the bucket.
The fix is to make the JSONL filename in the CSV file match the one in the bucket.

Rename file after putHDFS

I have an Apache NiFi job where I get a file from the system using GetFile and then use PutHDFS. How can I rename the file in HDFS after putting it in Hadoop?
I tried to use the ExecuteScript processor but can't get it to work:
flowFile = session.get()
if flowFile is not None:
    # Read the current filename attribute and strip the temporary suffix
    tempFileName = flowFile.getAttribute("filename")
    fileName = tempFileName.replace('._COPYING_', '')
    # putAttribute returns the updated flow file, so keep the new reference
    flowFile = session.putAttribute(flowFile, 'filename', fileName)
    session.transfer(flowFile, REL_SUCCESS)

The answer by Shu is correct for how to manipulate the filename attribute in NiFi. However, if you have already written a file to HDFS and then use UpdateAttribute, it will not change the name of the file in HDFS; it will only change the value of the filename attribute in NiFi.
You could use the UpdateAttribute approach to create a new attribute called "final.filename" and then use MoveHDFS to move the original file to the final file.
Also of note, the PutHDFS processor already writes a temp file and moves it to the final file, so I'm not sure it is necessary for you to name the file "._COPYING_". For example, if you send a flow file to PutHDFS with a filename of "foo", it will first write ".foo" to the directory and, when done, move it to "foo".
The only case where you need to use MoveHDFS is if some other process is monitoring the directory and can't ignore the dot files, then you write it somewhere else and use MoveHDFS once it is complete.

Instead of using the ExecuteScript processor (extra overhead), use an UpdateAttribute processor and feed it the success relationship from PutHDFS.
Add a new property to the UpdateAttribute processor:
filename
${filename:replaceAll('<regex_expression>','<replacement_value>')}
This uses the replaceAll function from the NiFi Expression Language.
(or)
Using the replace function:
filename
${filename:replace('<search_string>','<replacement_value>')}
The NiFi Expression Language offers various functions for manipulating strings; refer to the Expression Language documentation for details.
I tried the exact same script as in the question in an ExecuteScript processor with the Script Engine set to Python, and everything works as expected.
Since you are using the .replace function and replacing with '', the filename fn._COPYING_ was changed to fn.
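To illustrate the difference between a literal replacement and a regex-based one, here is a small sketch in plain Python (not NiFi Expression Language). Python's str.replace and re.sub are used as rough stand-ins for a literal and a regex replacement, and the helper names are made up for this example.

```python
import re

COPY_SUFFIX = "._COPYING_"  # the temporary suffix from the question


def strip_literal(filename, suffix=COPY_SUFFIX):
    """Remove every literal occurrence of suffix, wherever it appears."""
    return filename.replace(suffix, "")


def strip_regex(filename, suffix=COPY_SUFFIX):
    """Remove suffix only when it ends the name, using a regex.

    re.escape matters here: unescaped, the leading "." would match any
    character instead of a literal dot.
    """
    return re.sub(re.escape(suffix) + r"$", "", filename)


print(strip_literal("fn._COPYING_"))  # fn
print(strip_regex("fn._COPYING_"))    # fn
```

The regex form needs escaping and anchoring to be safe, which is one reason a literal replace is often the simpler choice for a fixed suffix.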

Is it possible to write back to data file in postman?

While working with postman, data.someVariable returns data from within a csv file that can also be used as {{someVariable}} in uri/json.
This gives us the data for that variable from that row/iteration.
Is there a mechanism to write back to the data file by doing something like postman.setData('responseCode') = responseCode?
This would be really helpful for storing the response code in the data file and recording per-call details in the CSV, in the same format as the input.

The only solution I figured out is:
- to populate JSON objects in the environment with information about the data file name and the structure/values of the information to be added;
- to create a separate web service (maybe in Node.js) that exposes an HTTP call to write to a file, takes as a parameter a JSON input like the one created in the environment above, and writes it to a file / the original data file (or a copy of it) in the desired format;
- to call that web service at the end of each run, or after the desired REST call, to generate a step-by-step information/debug report.
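The helper-service idea above could look roughly like the following sketch. It is written in Python rather than Node.js, uses only the standard library, and everything in it (function names, port, column names, the 204/400 status codes) is an assumption made for illustration, not something Postman provides.

```python
import csv
import json
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

_write_lock = threading.Lock()


def append_row(csv_path, fieldnames, row):
    """Append one dict as a CSV row, writing the header if the file is new."""
    with _write_lock:
        is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
        with open(csv_path, "a", newline="") as f:
            # Keys not listed in fieldnames are ignored; missing ones stay empty.
            writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
            if is_new:
                writer.writeheader()
            writer.writerow(row)


def make_handler(csv_path, fieldnames):
    """Build a handler class that stores each POSTed JSON object as a row."""

    class AppendRowHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                row = json.loads(self.rfile.read(length))
            except ValueError:
                row = None
            if not isinstance(row, dict):
                self.send_response(400)  # body was not a JSON object
                self.end_headers()
                return
            append_row(csv_path, fieldnames, row)
            self.send_response(204)  # stored, nothing to return
            self.end_headers()

    return AppendRowHandler


def serve(csv_path, fieldnames, port=8099):
    """Run the service on localhost until interrupted (port is arbitrary)."""
    handler = make_handler(csv_path, fieldnames)
    HTTPServer(("127.0.0.1", port), handler).serve_forever()
```

A Postman test script would then send the values it collected to this endpoint, for example after each request, and the service would accumulate them in the CSV file.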

There is currently no way to write back to the data file in Postman.
However, you can store the value in your environment at run time using
pm.environment.set("varname", value)
Name the variable so that it is clear this is the value you wanted to write back to the data file.

Use TfileUnarchive on Amazon S3

I have a simple Talend job like this:
tS3Connection -> tS3Get -> tFileInputDelimited -> tMap -> tAmazonMysqlOutput
Sometimes I get the file in .txt format and sometimes in a zip file.
So I want to use tFileUnarchive to unzip the file if it is zipped, or bypass the tFileUnarchive component and process the file directly if it is already unzipped, i.e. in plain .txt format.
Any help on this is greatly appreciated.

The trick here is to break the file retrieval and potential unzipping into one sub job and then the processing of the files into another sub job afterwards.
Here's a simple example job:
As normal, you connect to S3 and then you might list all the relevant objects in the bucket using the tS3List and then pass this to tS3Get. Alternatively you might have another way of passing the relevant object key that you want to download to tS3Get.
In this job I set tS3Get up to fetch every object that the tS3List component iterates over by setting the key as:
((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and then downloading it to:
"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
The extra bit I've added starts with a Run If conditional link from the tS3Get which links the tFileUnarchive with the condition:
((String)globalMap.get("tS3List_1_CURRENT_KEY")).endsWith(".zip")
Which checks to see if the file being downloaded from S3 is a .zip file.
The tFileUnarchive component then just needs to be told what to unzip, which will be the file we've just downloaded:
"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and where to extract it to:
"C:/Talend/5.6.1/studio/workspace/S3_downloads"
This then puts any extracted files in the same place as the ones that didn't need extracting.
From here we can iterate through the downloads folder looking for the file types we want by setting the directory to "C:/Talend/5.6.1/studio/workspace/S3_downloads" and the glob expression to "*.csv". In my case I wanted to read in only the CSV files (including the zipped ones) that I had in S3.
Finally, we then read the delimited files by setting the file to be read by the tFileInputDelimited component as:
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
In my case I simply printed this to the console, but you would obviously want to perform some transformation before uploading to your AWS RDS instance.
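The same control flow (unzip only when the download ends with .zip, then pick up every matching file from one folder) can be sketched outside Talend in plain Python. This is only an illustration of the logic, not a Talend job; the function names are made up, and the S3 download step is assumed to have already put the files into download_dir.

```python
import glob
import os
import zipfile


def unpack_if_zipped(path, extract_dir):
    """Extract path into extract_dir if its name ends with .zip.

    Mirrors the Run If condition: files that are not zipped are left alone.
    Returns True when something was extracted.
    """
    if path.endswith(".zip"):
        with zipfile.ZipFile(path) as archive:
            archive.extractall(extract_dir)
        return True
    return False


def collect_delimited_files(download_dir, pattern="*.csv"):
    """Unzip any archives in download_dir, then list files matching pattern."""
    for name in sorted(os.listdir(download_dir)):
        unpack_if_zipped(os.path.join(download_dir, name), download_dir)
    # Extracted files land next to the ones that never needed extracting,
    # so a single glob over the folder finds both kinds.
    return sorted(glob.glob(os.path.join(download_dir, pattern)))
```

For the asker's case the pattern would be "*.txt" rather than "*.csv".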

How to download file from aws s3 using python without using key

I need to download an xml file from AWS-S3.
I tried using get_contents_to_filename(fname), and it worked.
But I need to download the file without specifying fname, because if I specify fname, the downloaded file gets saved to fname.
I want to save the file as it is, under its own name.
This is my current code:
k = Key(bucket)
k.set_contents_from_filename(fname)
Can someone please help me download and fetch the file without using a key?
Thanks in advance!

I'm not sure which library you're using, but if k is the AWS key you want to download, then k.name is probably the key name, so k.get_contents_to_filename(k.key) would probably do more or less what you want.
The one problem is that the key name might not be a legal file name, or it may have file path separators. So if the key name were something like '../../../../somepath/somename' the file would be saved somewhere you don't expect. So copy k.name to a string and either sanitize it by changing all dangerous characters to safe ones, or just extract the part of the key name you want to use for the file name.
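One possible way to do that sanitizing, as a minimal sketch using only the standard library. The function name safe_local_name, the character whitelist, and the fallback name are choices made for this example, not part of any S3 library.

```python
import posixpath
import re


def safe_local_name(key_name, default="download"):
    """Turn an S3 key name into a file name that stays in the current directory."""
    # S3 keys use "/" as a separator; keep only the last path component
    # (backslashes are normalised first so they cannot act as separators).
    base = posixpath.basename(key_name.replace("\\", "/"))
    # Replace anything outside a conservative set of characters.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    # Reject empty names and dot-only names such as "." or "..".
    if not cleaned.strip("."):
        return default
    return cleaned


print(safe_local_name("../../../../somepath/somename"))  # somename
```

With the key object from the answer above, the download call would then be along the lines of k.get_contents_to_filename(safe_local_name(k.name)).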
