I'm looking for advice on how to read Parquet files from an HDFS cluster using Apache NiFi. In the cluster there are multiple files under a single directory, and I want to read them all in one flow. Does NiFi provide a built-in component to read the files in an HDFS directory (Parquet in this case)?
Example: 3 files present in the directory:
hdfs://app/data/customer/file1.parquet
hdfs://app/data/customer/file2.parquet
hdfs://app/data/customer/file3.parquet
Thanks!
You can use the FetchParquet processor in combination with the ListHDFS/GetHDFS (etc.) processors.
This processor was added in NiFi 1.2.0; Jira NIFI-3724 tracks this improvement.
ListHDFS // stores state and runs incrementally.
GetHDFS // doesn't store state and gets all the files from the configured directory (set the Keep Source File property to true in case you don't want the source files deleted).
You can also add the fully qualified filename as an attribute on the flowfile in other ways (UpdateAttribute, etc.), then feed the connection to the FetchParquet processor, which fetches those Parquet files.
FetchParquet reads the Parquet files and writes them out in the format specified by the configured RecordWriter.
Flow:
ListHDFS/GetHDFS -> FetchParquet -> other processors
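As a rough configuration sketch (not taken from the original answer), assuming the path and filename attributes that ListHDFS writes and the property names as of NiFi 1.2+, with the record writer name being just an example controller service:

FetchParquet
    Filename        ${path}/${filename}     (the fully qualified HDFS path rebuilt from the listing attributes)
    Record Writer   CSVRecordSetWriter      (or JsonRecordSetWriter, AvroRecordSetWriter, etc., depending on the output format you want)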
If your requirement is to read the files from HDFS, you can use the HDFS processors available in the nifi-hadoop-bundle. You can use either of the two approaches:
A combination of ListHDFS and FetchHDFS
GetHDFS
The difference between the two approaches is that GetHDFS keeps listing the contents of the configured directories on every run, so it will produce duplicates. The former approach, however, keeps track of state, so only new additions and/or modifications are returned on each subsequent run.
I am creating a very big file that cannot fit in memory. So I have created a bunch of small files in S3 and am writing a script that can read these files and merge them. I am using AWS Data Wrangler to do this.
My code is as follows:
import awswrangler as wr  # assumes input_folder, target_path and logger are defined elsewhere

try:
    # stream the small parquet files in chunks so they never all have to fit in memory at once
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        # append each chunk to the target dataset
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file, and I can't remove chunked=True because otherwise my program fails with an OOM error.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
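A minimal sketch of what this answer suggests, assuming a hypothetical bucket/key and a DataFrame that fits in memory:

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})
# with dataset left at its default (False) and a full object path,
# wrangler writes exactly one parquet file
wr.s3.to_parquet(df=df, path="s3://my-bucket/merged/output.parquet")

Note the caveat in the next answer: without dataset=True you lose mode="append", so this only helps if the whole DataFrame can be held in memory at once.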
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for Parquet in general. As per this SO post it's not possible in a local folder, let alone S3. To add to this, Parquet is compressed, and I don't think it would be easy to add a line to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue, and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several DataFrames together and then rewrite those (roughly as in the sketch below), but you won't be able to get back to one Parquet file unless you get a machine with enough RAM.
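A rough sketch of that workaround, assuming hypothetical input_folder/target_path locations and an arbitrary one-million-row batch size:

import awswrangler as wr
import pandas as pd

input_folder = "s3://my-bucket/small-files/"   # placeholder
target_path = "s3://my-bucket/merged/"         # placeholder

buffer, part = [], 0
for chunk in wr.s3.read_parquet(path=input_folder, path_suffix=[".parquet"],
                                chunked=True, use_threads=True):
    buffer.append(chunk)
    if sum(len(df) for df in buffer) >= 1_000_000:
        # flush the accumulated chunks as one larger parquet file
        wr.s3.to_parquet(df=pd.concat(buffer),
                         path=f"{target_path}part-{part:05d}.parquet")
        buffer, part = [], part + 1

if buffer:
    # write whatever is left over
    wr.s3.to_parquet(df=pd.concat(buffer),
                     path=f"{target_path}part-{part:05d}.parquet")

You still end up with several files, just far fewer and much larger ones.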
One of the Google Dataflow utility templates allows us to do compression for files in GCS (Bulk Compress Cloud Storage files).
While it is possible for the input parameter to cover different folders (e.g. inputFilePattern=gs://YOUR_BUCKET_NAME/uncompressed/**.csv), is it actually possible to store the 'compressed'/processed files in the same folder where they were stored initially?
If you have a look at the documentation:
The extensions appended will be one of: .bzip2, .deflate, .gz.
Therefore, the new compressed files won't match the provided pattern (*.csv), and thus you can store them in the same folder without conflict.
In addition, this is a batch process. If you look deeper into the Dataflow IO components, especially the pattern-based reads from GCS, you'll see that the list of files to compress is read at the beginning of the job and doesn't evolve during the job.
Therefore, if new files that match the pattern arrive while a job is running, they won't be taken into account by the current job. You will have to run another job to pick up these new files.
Finally, one last thing: the existing uncompressed files aren't replaced by the compressed ones. That means you will end up with each file twice: a compressed and an uncompressed version. To save space (and money), I recommend you delete one of the two versions.
What is the suggested way of loading data from GCS? The sample code shows copying the data from GCS to the /tmp/ directory. If this is the suggested approach, how much data may be copied to /tmp/?
While you have that option, you shouldn't need to copy the data over to local disk. You should be able to reference training and evaluation data directly from GCS, by referencing your files/objects using their GCS URI -- e.g. gs://bucket/path/to/file. You can use these paths wherever you'd normally use local file system paths in TensorFlow APIs that accept file paths. TensorFlow supports the ability to access data in (and write to) GCS.
You should also be able to use a prefix to reference a set of matching files, rather than referencing each file individually.
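For example, a small sketch with a recent TensorFlow release, using a hypothetical bucket and file pattern:

import tensorflow as tf

# expand a prefix/glob instead of listing every object by hand
files = tf.io.gfile.glob("gs://my-bucket/training-data/*.csv")

# TensorFlow file APIs accept gs:// URIs wherever a local path would work
dataset = tf.data.TextLineDataset(files)
for line in dataset.take(3):
    print(line.numpy())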
Followup note -- you'll want to check out https://cloud.google.com/ml/docs/how-tos/using-external-buckets in case you need to appropriately ACL your data for being accessible to training.
Hope that helps.
I'm writing a number of CSV files from my local file system to HDFS using Flume.
I want to know what the best configuration for the Flume HDFS sink would be such that each file on the local system is copied exactly into HDFS as CSV. I want each CSV file processed by Flume to be a single event, flushed and written as a single file. As much as possible, I want the files to be exactly the same, without header stuff, etc.
What do I need to put on these values to simulate the behavior that I want?
hdfs.batchSize = x
hdfs.rollSize = x
hdfs.rollInterval = x
hdfs.rollCount = x
Kindly provide if there are other Flume agent config variables I need to change as well.
If this will not work using existing configuration, do I need to use custom sink then to achieve what I want?
Thanks for your input.
P.S. I know hadoop fs -put or -copyFromLocal would be more suited for this job, but since this is a proof of concept (showing that we can use Flume for data ingestion), that's why I need to use Flume.
You will have to disable all roll* properties by setting their values to 0. That will effectively prevent Flume from rolling over files. As you might have noticed, Flume operates on a per-event basis; in most cases an event is a single line in a file. To also preserve the file structure itself, you will need to use the spooling directory source and activate fileHeader:
fileHeader false Whether to add a header storing the absolute path filename.
Set that to true. It will provide a %{file} header which you can reference in your HDFS sink path specification.
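As a rough sketch (agent, channel and directory names here are placeholders, not from the original answer), the relevant pieces of the agent config could look like this:

a1.sources = src1
a1.channels = ch1
a1.sinks = snk1

a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /data/csv-in
a1.sources.src1.fileHeader = true
a1.sources.src1.channels = ch1

a1.channels.ch1.type = file

a1.sinks.snk1.type = hdfs
a1.sinks.snk1.channel = ch1
# %{file} is the absolute source path captured by fileHeader, so events from
# the same source file land under the same HDFS path
a1.sinks.snk1.hdfs.path = hdfs://namenode/flume/%{file}
a1.sinks.snk1.hdfs.fileType = DataStream
a1.sinks.snk1.hdfs.writeFormat = Text
# disable size/count/time based rolling
a1.sinks.snk1.hdfs.rollSize = 0
a1.sinks.snk1.hdfs.rollCount = 0
a1.sinks.snk1.hdfs.rollInterval = 0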
I have a large set of text files in an S3 directory. For each text file, I want to apply a function (an executable loaded through bootstrapping) and then write the results to another text file with the same name in an output directory in S3. So there's no obvious reducer step in my MapReduce job.
I have tried using NONE as my reducer, but the output directory fills with files like part-00000, part-00001, etc. And there are more of these than there are files in my input directory; each part- file represents only a processed fragment.
Any advice is appreciated.
Hadoop provides a reducer called the Identity Reducer.
The Identity Reducer literally just outputs whatever it took in (it is the identity relation). This is what you want to do, and if you don't specify a reducer the Hadoop system will automatically use this reducer for your jobs. The same is true for Hadoop streaming. This reducer is used for exactly what you described you're doing.
I've never run a job that doesn't output the files as part-####. I did some research and found that you can do what you want by subclassing the OutputFormat class. You can see what I found here: http://wiki.apache.org/hadoop/FAQ#A27. Sorry I don't have an example.
To cite my sources, I learned most of this from Tom White's book: http://www.hadoopbook.com/.
It seems, from what I've read about Hadoop, that you need a reducer even if it doesn't change the mapper's output, just to merge the mappers' outputs.
You do not need to have a reducer. You can set the number of reducers to 0 in the job configuration stage, eg
job.setNumReduceTasks(0);
Also, to ensure that each mapper processes one complete input file, you can tell Hadoop that the input files are not splittable. The FileInputFormat class has a method
protected boolean isSplitable(JobContext context, Path filename)
that can be used to mark a file as not splittable, which means it will be processed by a single mapper. See here for documentation.
I just re-read your question, and realised that your input is probably a file with a list of filenames in it, so you most likely want to split it or it will only be run by one mapper.
What I would do in your situation is have an input which is a list of file names in S3. The mapper input is then a file name, which it downloads and runs your exe against. The output of this exe run is then uploaded to S3, and the mapper moves on to the next file. The mapper then does not need to output anything, though it might be a good idea to output the name of each processed file so you can check it against the input afterwards. Using the method I just outlined, you would not need to use the isSplitable method.
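For instance, a rough Hadoop Streaming-style mapper along those lines; the bucket names, key layout and the process_file executable are all hypothetical, and boto3 is assumed to be available on the task nodes:

#!/usr/bin/env python3
import subprocess
import sys

import boto3

s3 = boto3.client("s3")
IN_BUCKET = "my-input-bucket"     # placeholder
OUT_BUCKET = "my-output-bucket"   # placeholder

# each input line is one S3 key to process
for line in sys.stdin:
    key = line.strip()
    if not key:
        continue
    local_in = "/tmp/" + key.split("/")[-1]
    local_out = local_in + ".out"
    s3.download_file(IN_BUCKET, key, local_in)
    # run the bootstrapped executable against the downloaded file
    subprocess.run(["./process_file", local_in, local_out], check=True)
    s3.upload_file(local_out, OUT_BUCKET, key)
    # emit the processed key so the job output can be checked against the input
    print(key)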