I am trying to use the Storm HDFS spout. I have Apache NiFi moving files to the HDFS directory where Storm is listening, but as soon as NiFi starts to move a file, Storm senses that and starts processing, even though the file hasn't been completely moved.
I have tried to use conf.put(Configs.IGNORE_SUFFIX, ignoreSuffix) on the Storm side
and Apache NiFi's UpdateAttribute to rename the file to .ignore.
I need to rename the file back after it has been completely moved. How can I achieve that, or is there another way?
When writing to HDFS, NiFi writes the file with a filename that starts with a dot, like ".foo.txt", and when the write operation is complete it renames it to the name without the dot, like "foo.txt". So if Storm has a way to ignore a prefix, then you should be able to ignore anything starting with a dot.
Reading CSV or Parquet files from the local filesystem is very easy, but it seems that Arrow does not support reading files from a remote server given its IP. Is there a way to achieve this? E.g. read a subset of columns of a Parquet file from a remote server (the path is like "ip://path/to/remote/file"). Thanks.
There is an open issue for this if you would like to contribute or follow development: https://issues.apache.org/jira/browse/ARROW-7594
(By 'remote server' I assume you mean over HTTP(s) or similar. If you're looking for a custom client-server protocol, check out Arrow Flight.)
pyarrow.dataset.dataset() has a filesystem argument through which it supports many remote file systems.
See the Arrow documentation for file systems. An fsspec file system can also be passed in, of which there are very many.
For example, if your Parquet file is sitting on a web server, you could use the fsspec HTTP file system:
import pyarrow.dataset as ds
import fsspec.implementations.http
http = fsspec.implementations.http.HTTPFileSystem()
d = ds.dataset('http://localhost:8000/test.parquet', filesystem=http)
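Once the dataset is created, you can read just a subset of the columns (the column names below are made up):
# read only the columns you need; column names are hypothetical
table = d.to_table(columns=['col_a', 'col_b'])
df = table.to_pandas()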
I have a NiFi flow where I am getting all the data from S3 and putting it in the destination folder. Now, the requirement is that if there is any new data, only the latest data should be transferred. I have data files in S3 like below:
20201130-011101493.parquet
20201129-011101493.parquet
And the regex I tried:
\d[0-9]{8}.parquet
The problem is that it is not picking the first file, which is the latest data, i.e. 30/11/2020.
How can I modify my regex so that it picks the latest file only, given that the job runs once per day? I also referred to this SO post but I guess I am not able to get my regex correct.
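One hedged sketch, assuming the filtering happens on the filename attribute in a RouteOnAttribute (or similar) processor, which is an assumption about your flow: a date-agnostic regex such as \d{8}-\d{9}\.parquet matches both files, so instead anchor the match to the current date with NiFi Expression Language:
${filename:startsWith(${now():format('yyyyMMdd')}):and(${filename:endsWith('.parquet')})}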
I am curious if I will need any extra processors for a GETHDFS -> COUNT -> PUTEMAIL flow using Apache NiFi.
I will be reading a CSV file from an HDFS location and I want to email the contents of the directory using PutEmail.
If you want to put the contents of the FlowFile as the email message body, then you need to extract the contents to an attribute which can be done using the ExtractText processor.
Otherwise, you don't need any other processor. You can just attach the FlowFile as a file to the email and you're done.
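A minimal sketch of the ExtractText option (the attribute name message.body is made up; you may also need to raise ExtractText's Maximum Buffer Size for larger files):
ExtractText - add a dynamic property:
    message.body = (?s)(^.*$)
PutEmail:
    Message = ${message.body}
    Attach File = true (use this instead of Message if you only want the attachment)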
I am looking into NiFi to write files to HDFS.
What I would like is to have the files written in a directory structure based on the file name / year / month / day / hour.
So, for example, a file called "datasetX_xxxx" received on 10 August 2019 at 11 AM would end up in the directory /datasetX/2019/08/10/11/datasetX_xxxx.
1) Is this possible?
2) How would I set this up?
Thanks in advance.
K
Yes, this is certainly possible!
First you need to extract/derive your directory structure from the filename, then you can put the files into HDFS. NiFi has different processors to accomplish this. While putting files into HDFS, you set a processor property to 'TRUE' to create the desired directory structure in HDFS if it does not exist. Kindly refer to the guides below:
CSV to HDFS
Extract directory names from filename
Skim through Apache Nifi Processors
Configure the PutHDFS processor's Directory property as below:
1. Using the current timestamp (now()) to create the directories:
/datasetX/${now():format('yyyy')}/${now():format('MM')}/${now():format('dd')}/${now():format('HH')}/
(or)
2. If your flowfile in NiFi has a timestamp in its filename, or as an attribute of the flowfile, then use NiFi Expression Language string functions etc. to get the value and create the directories in HDFS (a sketch follows the links below).
Refer to the NiFi Expression Language guide, and this and this links, for more built-in functions that we can use in NiFi.
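As a sketch of the second option, assuming (purely as an example) that the incoming filename embeds a timestamp like datasetX_20190810_1100.csv, the PutHDFS Directory property could be derived entirely from the filename:
/${filename:substringBefore('_')}/${filename:substringAfter('_'):substringBefore('.'):toDate('yyyyMMdd_HHmm'):format('yyyy/MM/dd/HH')}
Since PutHDFS appends the filename itself, that file would land in /datasetX/2019/08/10/11/datasetX_20190810_1100.csv.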
I am looking for an approach in Apache Samza to read a file from the local filesystem or HDFS,
then apply filters, aggregations, where conditions, order by, and group by to a batch of data.
Please provide some help.
You should create a system for each source of data you want to use. For example, to read from a file, you should create a system with the FileReaderSystemFactory -- for HDFS, create a system with the HdfsSystemFactory. Then, you can use the regular process callback or windowing to process your data.
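A minimal config sketch for registering those systems in the job's .properties file (the system names "localfile" and "hdfs" are made up; the factory classes are assumed to live in their usual packages):
# made-up system names; factory classes as named in the answer above
systems.localfile.samza.factory=org.apache.samza.system.filereader.FileReaderSystemFactory
systems.hdfs.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory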
You can feed your Samza job using a standard Kafka producer. To make it easy for you, you can use Logstash; you need to create a Logstash script where you specify:
input as a local file or HDFS
filters (optional): here you can do basic filtering, aggregation, etc.
a Kafka output with the specific topic you want to feed
I was using this approach to feed my Samza job from a local file.
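A minimal Logstash config sketch along those lines (the path, broker address, and topic name are all placeholders):
input {
  file { path => "/path/to/local/input/*.log" }   # placeholder local path
}
filter {
  # optional: basic filtering, parsing, aggregation
}
output {
  kafka {
    bootstrap_servers => "localhost:9092"   # placeholder Kafka broker
    topic_id => "samza-input"               # placeholder topic your Samza job consumes
  }
}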
Another approach could be using Kafka Connect
http://docs.confluent.io/2.0.0/connect/
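For the Kafka Connect route, a standalone FileStreamSource sketch (the file path and topic are placeholders) could look like:
# connect-file-source.properties (placeholder values)
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/path/to/input.txt
topic=samza-input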