Any risks/notes when writing output to different folders using MapReduce?

The goal is to write output to different folders (different paths) using one reducer.
I use the old MapReduce API, and I made a small modification to MultipleOutputs (loosening its restriction), and it works.
But the OutputFormat I use extends FileOutputFormat, which relies on FileOutputCommitter.
I find that a _SUCCESS file appears in only one of the folders. Will that be a problem?
There is also an empty part-00000 file; I don't know why it is generated.

_SUCCESS is written only once, after the job completes. It is useful for checking whether the job is complete, and I don't think there is any risk with that. Just be aware that it is created only after the job completes, and that you need to know where to look for the file if you are using it.
Regarding the part- files, take a look at
map reduce output files: part-r-* and part-*
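If the lone marker file and the empty default output are a concern, two standard Hadoop knobs may help. Below is a minimal sketch against the old (org.apache.hadoop.mapred) API; MyDriver and MyOutputFormat are placeholders for your own classes, and the configuration key is the one used by recent Hadoop releases:

JobConf conf = new JobConf(MyDriver.class);
// Skip writing the _SUCCESS marker entirely, if the single marker file is a problem.
conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);
// Create the default part-00000 only if the reducer actually writes to it;
// files written through MultipleOutputs are unaffected.
org.apache.hadoop.mapred.lib.LazyOutputFormat.setOutputFormatClass(conf, MyOutputFormat.class);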

Related

Using ListSFTP and FetchSFTP to process a file with a condition

I am a beginner and I have to list two files (a.xlsx and mark.txt) on an SFTP server, fetch them, and only process them when both files are present.
This is the logic:
If I have "mark.txt", I process a.xlsx and I delete "mark.txt".
On the next run, when I don't have "mark.txt", I don't process anything.
If I have "mark.txt" again, I process a.xlsx and I delete "mark.txt".
Repeat.
I've tried ListSFTP, then FetchSFTP, and then a RouteOnAttribute, but I don't know how to solve it.
Thank you in advance for your help.
What you could do is look for the file a.xlsx and then process it if found. When NiFi picks up this file, it can delete it so the next time it looks for the xlsx file, it will be a new one. Therefore, if the file isn’t found, then it won’t do anything. Looking for the .txt and then pulling the .xlsx isn’t the best way to do this, just pull the XLSX directly.
One way to do what you’re asking is to look for mark.txt and if found, then you can write a script using a language like Python to get the file, instead of having to write a NiFi processor. This would be something like List File -> ExecuteStreamCommand where the ExecuteStreamCommand would be a Python script.

awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory. So I have created a bunch of small files in S3 and am writing a script that can read these files and merge them. I am using AWS Data Wrangler to do this.
My code is as follows:
try:
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file, and I can't remove chunked=True because otherwise my program fails with OOM.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset=True or it will throw an error. I believe that in this case, "append" has more to do with appending the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for Parquet in general. As per this SO post it's not possible in a local folder, let alone S3. To add to this, Parquet is compressed and I don't think it would be easy to add a line to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one Parquet file unless you get a machine with enough RAM.

When a file changes, I'd like to modify one or more different files

I've been scouring the web for hours looking for an approach to solving this problem, and I just can't find one. Hopefully someone can fast-track me. I'd like to cause the following behaviour:
When running ember s, if a file with a certain extension is changed, I'd like to analyze the contents of that file and write to several other files in the same directory.
To give a specific example, let's assume I have a file called app/dashboard/dashboard.ember. dashboard.ember consists of 3 concatenated files: app/dashboard/controller.js, .../route.js, and .../template.hbs, with a reasonable delimiter between them. When dashboard.ember is saved, I'd like to call a function (inside an addon, I assume) that reads the file, splits it at the delimiters, and writes the corresponding split files. ember-cli should then pick up the changed source (.js, .hbs, etc.) files that it knows how to handle, ignoring the .ember file.
I could write this as a standalone application, of course, but I feel it should be integrated with the ember-cli build environment; I just can't figure out what concoction of hooks and tools I should use to achieve this.

Possible to use Parquet files and Text (csv) files as input to same M/R Job?

I tried researching this but found no useful information. I have an M/R job already reading from parquet (not partitioned, using a thrift schema). I need to add another set of input files to the process that are not in parquet format, they're just regular csv files.
Anyone know if this is possible or how it could be done?
Never mind, I think I found what I needed in another post unrelated to Parquet.
Using multiple InputFormat classes while configuring MapReduce job
Here is the information I took from the answer I linked to and adapted to my own solution:
MultipleInputs.addInputPath(job, new Path("/path/to/parquet"), ParquetInputFormat.class, ParquetMapper.class);
MultipleInputs.addInputPath(job, new Path("/path/to/txt"), TextInputFormat.class, TextMapper.class);
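For context, here is a rough sketch of how those two calls can sit in a driver using the new (org.apache.hadoop.mapreduce) API. ParquetMapper, TextMapper, MyReducer, MyDriver and the paths are placeholders, and ParquetInputFormat will still need its read support/schema configured for your Thrift classes:

Job job = Job.getInstance(new Configuration(), "parquet-plus-csv");
job.setJarByClass(MyDriver.class);

// Each input directory gets its own InputFormat and Mapper...
MultipleInputs.addInputPath(job, new Path("/path/to/parquet"), ParquetInputFormat.class, ParquetMapper.class);
MultipleInputs.addInputPath(job, new Path("/path/to/txt"), TextInputFormat.class, TextMapper.class);

// ...but both mappers must emit the same intermediate types for the shared reducer.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setReducerClass(MyReducer.class);

FileOutputFormat.setOutputPath(job, new Path("/path/to/output"));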

File Processing with Elastic MapReduce - No Reducer Step?

I have a large set of text files in an S3 directory. For each text file, I want to apply a function (an executable loaded through bootstrapping) and then write the results to another text file with the same name in an output directory in S3. So there's no obvious reducer step in my MapReduce job.
I have tried using NONE as my reducer, but the output directory fills with files like part-00000, part-00001, etc. And there are more of these than there are files in my input directory; each part- file represents only a processed fragment.
Any advice is appreciated.
Hadoop provides a reducer called the Identity Reducer.
The Identity Reducer literally just outputs whatever it took in (it is the identity relation). This is what you want to do, and if you don't specify a reducer the Hadoop system will automatically use this reducer for your jobs. The same is true for Hadoop streaming. This reducer is used for exactly what you described you're doing.
I've never run a job that doesn't output the files as part-####. I did some research and found that you can do what you want by subclassing the OutputFormat class. You can see what I found here: http://wiki.apache.org/hadoop/FAQ#A27. Sorry I don't have an example.
To cite my sources, I learned most of this from Tom White's book: http://www.hadoopbook.com/.
From what I've read about Hadoop, it seems you need a reducer even if it doesn't change the mappers' output, just to merge the mappers' outputs.
You do not need to have a reducer. You can set the number of reducers to 0 in the job configuration stage, e.g.
job.setNumReduceTasks(0);
Also, to ensure that each mapper processes one complete input file, you can tell Hadoop that the input files are not splittable. FileInputFormat has a method
protected boolean isSplitable(JobContext context, Path filename)
that can be overridden to mark a file as not splittable, which means it will be processed by a single mapper. See here for documentation.
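For example, a minimal sketch assuming the new (org.apache.hadoop.mapreduce) API and plain text input; the class name is just illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each input file is handled from start to finish by a single mapper.
        return false;
    }
}

// In the driver: job.setInputFormatClass(NonSplittableTextInputFormat.class);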
I just re-read your question and realised that your input is probably a file with a list of filenames in it, so you most likely want it to be split, or it will only be run by one mapper.
What I would do in your situation is have an input which is a list of file names in S3. The mapper input is then a file name, which it downloads and runs your exe against. The output of this exe run is then uploaded to S3, and the mapper moves on to the next file. The mapper then does not need to output anything, though it might be a good idea to output the name of each processed file so you can check it against the input afterwards. Using the method I just outlined, you would not need to use the isSplitable method.
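To illustrate that approach, here is a rough sketch of such a mapper, assuming the new (org.apache.hadoop.mapreduce) API, that each input line is an S3 path readable through a Hadoop FileSystem (e.g. s3a), and that /opt/myapp/process is a placeholder for the bootstrapped executable. Uploading the result back to S3 is left as a comment because it depends on what the executable produces:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FileNameMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String fileName = value.toString().trim();   // one S3 path per input line
        Path src = new Path(fileName);
        FileSystem fs = src.getFileSystem(context.getConfiguration());

        // Pull the file down to the task's local working directory.
        Path local = new Path(src.getName());
        fs.copyToLocalFile(src, local);

        // Run the bootstrapped executable on the local copy (path is a placeholder).
        Process p = new ProcessBuilder("/opt/myapp/process", local.toString()).inheritIO().start();
        p.waitFor();

        // Uploading the result back to S3 (e.g. with copyFromLocalFile) would go here,
        // using whatever naming convention the executable produces.

        // Emit the processed file name so the output can be checked against the input.
        context.write(new Text(fileName), NullWritable.get());
    }
}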