I am looking into NiFi to write files to HDFS.
What I would like to do is have the files written in a directory structure based on the file name / year / month / day / hour.
So, for example, a file called "datasetX_xxxx" received on 10 August 2019 at 11 AM would end up in the directory /datasetX/2019/08/10/11/datasetX_xxxx.
1) Is this possible?
2) How would I set this up?
Thanks in advance.
K
Yes, this is certainly possible!
First you need to extract/derive your directory structure from the filename; then you can put the files into HDFS. NiFi has different processors to accomplish this. While putting files into HDFS, you can set the processor property to 'TRUE' so that the desired directory structure is created in HDFS if it does not exist. Kindly refer to the guides below -
CSV to HDFS
Extract directory names from filename
Skim through Apache Nifi Processors
Configure the PutHDFS processor's Directory property as below.
1. Using the current timestamp (now()) to create directories:
/datasetX/${now():format('yyyy')}/${now():format('MM')}/${now():format('dd')}/${now():format('HH')}/
(or)
2. If your flowfile in NiFi has a timestamp in its filename, or as an attribute of the flowfile, then use NiFi Expression Language string functions etc. to get the value and create the directories in HDFS.
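For instance, combining the two approaches (just a sketch, assuming the dataset name is everything before the first underscore of the filename, as in "datasetX_xxxx" above), the Directory property could be set to:
/${filename:substringBefore('_')}/${now():format('yyyy/MM/dd/HH')}/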
Refer to the NiFi Expression Language guide and the linked examples for more built-in functions that we can use in NiFi.
Below is the S3 path where I have stored the files obtained at the end of a process. The path is dynamic; that is, the values of the placeholder fields will vary - partner_name, customer_name, product_name.
s3://bucket/{val1}/data/{val2}/output/intermediate_results
I am trying to create Athena tables for each output file present under the output/ as well as the intermediate_results/ directories, for each val1-val2 combination.
Each file is a CSV.
But I am not very familiar with AWS Athena, so I'm unable to figure out how to implement this. I would really appreciate any kind of help. Thanks!
Use CREATE TABLE - Amazon Athena. You will need to specify the LOCATION of the data in Amazon S3 by providing a path.
Amazon Athena will automatically use all files in that path, including subdirectories. This means that a table created with a Location of output/ will include all subdirectories, including intermediate_results. Therefore, your data storage format is not compatible with your desired use for Amazon Athena. You would need to put the data into separate paths for each table.
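For illustration, here is a minimal sketch with boto3 of creating one such table. The database, table, and column names are hypothetical, partner1/product1 stand in for the {val1}/{val2} values, and it assumes intermediate_results/ has been moved out from under output/ so the table only picks up the output files:

import boto3

athena = boto3.client("athena")

# Athena treats LOCATION as a prefix: every object underneath it (including
# subdirectories) becomes part of the table, which is why output/ and
# intermediate_results/ need separate prefixes.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS reports_db.partner1_product1_output (
    col1 string,
    col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://bucket/partner1/data/product1/output/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "reports_db"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
)

You would repeat this (or script it over a listing of the val1/val2 prefixes) for each table, and the same pattern works for the intermediate_results tables once they have their own prefix.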
I am curious if I will need any extra processors for a GETHDFS -> COUNT -> PUTEMAIL flow using Apache NiFi.
I will be reading a CSV file from an HDFS location, and I want to email the contents of the directory using PUTEMAIL.
If you want to put the contents of the FlowFile into the email message body, then you need to extract the contents to an attribute, which can be done using the ExtractText processor.
Otherwise, you don't need any other processor. You can just use the FlowFile as an attached file and you're done.
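If you do go the message-body route, a rough sketch of that ExtractText step (the property name email.body is just an example): add a dynamic property
email.body = (?s)(.*)
enable ExtractText's DOTALL mode and raise its buffer / capture-group size limits enough to hold the whole CSV, and then reference ${email.body} in PutEmail's Message property.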
I have reports delivered to S3 in the following structure:
s3://chum-bucket/YYYY/MM/DD/UsageReportYYYYMMDD.zip
s3://chum-bucket/YYYY/MM/DD/SearchReportYYYYMMDD.zip
s3://chum-bucket/YYYY/MM/DD/TimingReportYYYYMMDD.zip
The YYYY MM DD vary per day. The YYYYMMDD in the filename is there because the files all go into one directory on a server before they are moved to S3.
I want to have 1 or 3 crawlers that deliver 3 tables to the catalog, one for each type of report. Is this possible? I can't seem to specify
s3://chum-bucket/**/UsageReport*.zip
s3://chum-bucket/**/SearchReport*.zip
s3://chum-bucket/**/TimingReport*.zip
I can write one crawler that excludes SearchReport and TimingReport, and therefore crawls the UsageReport only. Is that the best way?
Or do I have to completely re-do the bucket / folder / file name design?
Amazon Redshift loads all files in a given path, regardless of filename.
Redshift will not take advantage of partitions (Redshift Spectrum will, but not a normal Redshift COPY statement), but it will read files from any subdirectories within the given path.
Therefore, if you want to load the data into separate tables (UsageReport, SearchReport, TimingReport), then they need to be in separate paths (directories). All files within the designated directory hierarchy must be in the same format and will be loaded into the same table via the COPY command.
An alternative is to point at specific files using manifest files, but this can get messy.
Bottom line: Move the files to separate directories.
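To illustrate the point about paths, a rough sketch using the Redshift Data API (the cluster, database, user, role ARN and the reorganised usage/ prefix are all placeholders); note that COPY's FROM clause names a prefix, not a single file:

import boto3

redshift_data = boto3.client("redshift-data")

# COPY reads every file under the given prefix, so once the reports are split
# into separate prefixes, each report type gets its own table and its own COPY.
# Format/compression options depend on what the files actually contain.
copy_sql = """
COPY usage_report
FROM 's3://chum-bucket/usage/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
FORMAT AS CSV
"""

redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="analytics",
    DbUser="loader",
    Sql=copy_sql,
)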
I am trying to use the Storm HDFS spout. I have Apache NiFi moving files to the HDFS directory where Storm is listening, but as soon as Apache NiFi starts to move a file, Storm senses that and starts processing, even though the file hasn't been completely moved.
I have tried to use conf.put(Configs.IGNORE_SUFFIX, ignoreSuffix) on the Storm side
and Apache NiFi's UpdateAttribute processor to rename the file so it ends with .ignore.
I need to rename the file again after it has been completely moved. How can I achieve that? Or is there another way?
When writing to HDFS, NiFi writes the file with a filename that starts with a dot, like ".foo.txt", and when the write operation is complete it renames the file to the name without the dot, like "foo.txt". So if Storm has a way to ignore a prefix, then you should be able to ignore anything starting with a dot.
We have one AWS S3 bucket in which we get new CSV files at a 10-minute interval. The goal is to ingest these files into Hive.
So the obvious way for me is to use Apache Flume and its Spooling Directory source, which will keep looking for new files in the landing directory and ingest them into Hive.
We have read-only permissions for the S3 bucket and for the landing directory into which the files will be copied, and Flume suffixes ingested files with a .COMPLETED suffix. So in our case Flume won't be able to mark completed files because of the permission issue.
Now the questions are:
1. What will happen if Flume is not able to add the suffix to completed files? Will it give an error or will it fail silently? (I am actually testing this, but if anyone has already tried it then I don't have to reinvent the wheel.)
2. Will Flume be able to ingest files without marking them with .COMPLETED?
3. Is there any other Big Data tool/technology better suited for this use case?
The Flume Spooling Directory source needs write permission in order to either rename or delete the processed/read file.
Check the 'fileSuffix' and 'deletePolicy' settings.
If it cannot rename/delete the completed files, it can't figure out which files have already been processed.
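For reference, a rough sketch of the relevant spooling-directory settings in a Flume agent's properties file (the agent/source names and the path are placeholders):
a1.sources = src
# Spooling Directory source; spoolDir must be writable by the Flume user
a1.sources.src.type = spooldir
a1.sources.src.spoolDir = /data/staging
# Suffix appended once a file has been fully ingested (default is .COMPLETED)
a1.sources.src.fileSuffix = .COMPLETED
# 'never' renames completed files with fileSuffix; 'immediate' deletes them instead
a1.sources.src.deletePolicy = never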
You might want to write a script that copies files from the read-only S3 bucket to a 'staging' folder with write permissions, and provide this staging folder as the source directory to Flume.
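A rough sketch of such a script with boto3 (the bucket name, prefix and local paths are placeholders): it downloads new objects from the read-only bucket into a temporary directory, moves them into the directory watched by Flume's spooling-directory source so Flume only ever sees complete files, and keeps a small list of keys that have already been copied.

import os
import boto3

BUCKET = "read-only-bucket"      # the S3 bucket you can only read from
PREFIX = "incoming/"             # where the CSV files land in that bucket
TMP_DIR = "/data/staging-tmp"    # downloads happen here first
SPOOL_DIR = "/data/staging"      # Flume's spoolDir (same filesystem as TMP_DIR)
SEEN_FILE = "/data/staging-seen.txt"

os.makedirs(TMP_DIR, exist_ok=True)
os.makedirs(SPOOL_DIR, exist_ok=True)

s3 = boto3.client("s3")

# Keys staged by previous runs, so nothing is downloaded twice
seen = set()
if os.path.exists(SEEN_FILE):
    with open(SEEN_FILE) as f:
        seen = {line.strip() for line in f}

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/") or key in seen:
            continue
        name = os.path.basename(key)
        tmp_path = os.path.join(TMP_DIR, name)
        s3.download_file(BUCKET, key, tmp_path)
        # Move into the spooling directory only after the download has finished
        os.rename(tmp_path, os.path.join(SPOOL_DIR, name))
        with open(SEEN_FILE, "a") as f:
            f.write(key + "\n")

Scheduling this from cron at roughly the same 10-minute cadence at which new files arrive would keep the staging folder in sync, and Flume is then free to rename or delete files in its spoolDir as it pleases.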