I'm wondering if the following is possible:
I want to edit a parameters.dat file that the executable a.out will read and perform calculations on. If I submit a job with qsub, can I modify this same parameters.dat file and submit a different job even if the first job is still sitting in the queue? What can I expect the outcome to be?
Thank you so much in advance.
No, the parameter file will be read at runtime, not at submission time. So if you change it while the first job is still queued, that job will read the modified file and you won't get the results you expect.
One solution is to have a different input file for each job you submit. The easiest approach is probably to create one directory per job, each containing its own parameters.dat, and submit each job to run inside its directory.
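For example, a submission loop along these lines (the parameter values, directory names, and job script name are made up for illustration):

```shell
# Sketch: one directory per parameter set, each with its own
# parameters.dat. The qsub call is commented out since the job script
# name and submission flags are site-specific.
for p in 0.1 0.2 0.5; do
    dir="job_$p"
    mkdir -p "$dir"
    printf '%s\n' "$p" > "$dir/parameters.dat"
    # (cd "$dir" && qsub ../job.sh)   # submit from inside the directory
done
```

Each queued job then reads the snapshot in its own directory, so later edits can't affect it.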
I am a beginner, and I have to list two files (a.xlsx and mark.txt) on an SFTP server, fetch them, and only process them when both files are present.
This is the logic:
If I have mark.txt, I process a.xlsx and delete mark.txt.
On the next start, when mark.txt is absent, I don't process anything.
If mark.txt appears again, I process a.xlsx and delete mark.txt.
Repeat.
I've tried ListSFTP, then FetchSFTP, and then a RouteOnAttribute, but I don't know how to solve it.
Thank you in advance for your help
What you could do is look for the file a.xlsx and process it if found. When NiFi picks up this file, it can delete it, so the next time it looks for an xlsx file it will be a new one; if the file isn't found, nothing happens. Looking for the .txt and then pulling the .xlsx isn't the best way to do this: just pull the XLSX directly.
One way to do what you're asking is to look for mark.txt and, if found, run a script written in a language like Python to get the file, instead of having to write a NiFi processor. This would be something like ListFile -> ExecuteStreamCommand, where ExecuteStreamCommand runs a Python script.
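If you go the script route, a minimal sketch of the marker-file logic might look like this (the directory layout and the processing step are placeholders, not NiFi API calls):

```python
# Hypothetical helper: process a.xlsx only when mark.txt is present,
# then delete the marker so the next run waits for a fresh drop.
import os

def process_if_marked(directory):
    marker = os.path.join(directory, "mark.txt")
    xlsx = os.path.join(directory, "a.xlsx")
    if not os.path.exists(marker):
        return False                # no marker yet: do nothing this cycle
    if os.path.exists(xlsx):
        pass                        # ... process a.xlsx here ...
    os.remove(marker)               # consume the marker
    return True
```

Calling it twice in a row processes once and then idles, which matches the logic in the question.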
I am preparing some C++ code to be run on a cluster, managed by SLURM. The cluster takes one compiled file: a.out. It will then execute it on 500 different nodes via the JOB_ARRAY. When executing each copy of the file, it will need to read one input parameter, say double parameter. My idea was to prepare a .txt file which would hold one value of the parameter at each line. What is the best strategy for implementing the reading of these parameter values?
1. a.out reads the value from the first line and immediately deletes it. If this is the right strategy, how do I ensure that two copies of a.out aren't doing the same thing at the same time?
2. a.out reads the value from the n-th line. How does each copy of a.out know which n it is working with?
Is there any better implementation strategy than the two above? If so, how do I do it? Is C++ fstream the way to go, or should I try something completely different?
Thank you for any ideas.
I would appreciate it if you also left some very simple code showing what a.out should look like.
Option two is the best way to go: You can use $SLURM_ARRAY_TASK_ID to get the specific line, so the call in your jobscript is simply:
a.out $(head -n $SLURM_ARRAY_TASK_ID parameter.txt | tail -1)
This should get the line corresponding to the task array ID.
You could deploy a param file containing a single line with each executable; that's the simplest solution, since you know in advance how many nodes you're deploying to.
Or you could have one node play the role of a service register. The service register would then distribute the params to the nodes (for example over the network). It could hold the list of params, and every client (identified, e.g., by IP) would get the next line from the file via this service.
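As a toy illustration of that service-register idea (everything here is hypothetical, a sketch rather than production code): a tiny TCP service that hands out one parameter line per connection, so each node gets a distinct value.

```python
# Toy "service register": a server thread hands out successive
# parameter lines, one per client connection.
import socket
import threading

def serve_params(params, host="127.0.0.1"):
    srv = socket.create_server((host, 0))   # port 0: pick a free port
    port = srv.getsockname()[1]
    it = iter(params)
    lock = threading.Lock()                 # serialize access to the iterator

    def loop():
        while True:
            conn, _ = srv.accept()
            with lock, conn:
                try:
                    line = next(it)
                except StopIteration:
                    line = ""               # empty reply: no params left
                conn.sendall(line.encode())

    threading.Thread(target=loop, daemon=True).start()
    return port

def fetch_param(port, host="127.0.0.1"):
    # What each worker node would do to claim its parameter.
    with socket.create_connection((host, port)) as conn:
        return conn.recv(1024).decode()
```

A real cluster deployment would also need error handling and a well-known host/port, but the claim-next-line protocol is the whole idea.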
I have a .py file that reads integers from a separate CSV, and I just can't launch it from the Windows Task Scheduler; after two days and much frustration I'm posting here. A lot of similar questions have been posted, but none are answered adequately for my case.
I have no problems launching other Python files or exes; the problem arises when the Python file needs to read a CSV. I have turned the file into a batch file, and I have also gone through every possible permutation of administration and permission options, but still no cigar. The problem stems solely from the fact that the Python script needs to read an external CSV. Has anyone got an opinion, or a work-around?
Thanks.
Assuming that you are trying this under the Windows 7 Task Scheduler...
You may try the following:
In the security options of your task (first page), ensure that you have selected the SYSTEM account, and tick the "run with highest privileges" check box near the bottom of the dialog (I guess you already did that).
Check whether the file can be accessed (write into it with Notepad).
Try calling the Python interpreter directly with your script file as an argument (maybe something went wrong with the inheritance of access rights when Windows invokes the interpreter; I'm assuming you linked the .py file itself in the Task Scheduler).
Check the execution profile of the Python interpreter and compare it to the ownership of the CSV file (does the CSV reside in a user-account folder and therefore have access requirements the Python process cannot meet? Example: CSV owned by user X, task run as user Y).
You may also try creating a new, empty text file somewhere else (e.g. on C:\) and copying the content in from the CSV.
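One more thing worth checking, since it causes exactly this symptom: the Task Scheduler typically starts programs with C:\Windows\System32 as the working directory, so a relative CSV path that works from a console will fail there. Resolving the path against the script's own location sidesteps that (the file name below is a placeholder):

```python
# Resolve the CSV relative to this script's folder instead of the
# current working directory, which under Task Scheduler is usually not
# the script's folder. "data.csv" is a placeholder name.
import csv
import os

def read_integers(filename="data.csv"):
    if not os.path.isabs(filename):
        here = os.path.dirname(os.path.abspath(__file__))
        filename = os.path.join(here, filename)
    with open(filename, newline="") as f:
        return [int(cell) for row in csv.reader(f) for cell in row if cell]
```

If this fixes it, the problem was the working directory, not permissions.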
greetings :)
The goal is to write output to different folders (different paths) using one reducer.
I use the old MapReduce API, and I made a small modification to MultipleOutputs (loosening the restriction), and it works.
But the output format I use extends FileOutputFormat, which refers to FileOutputCommitter.
And I find there will be a _SUCCESS file in only one folder. Will that be a problem?
And there is still an empty part-00000 file; I don't know why it is generated.
_SUCCESS is written only once, after the job is complete. It is useful for checking whether the job has finished. I don't think there is any risk with that; just be aware that it is created only after the job completes, and know where to look for it if you rely on it.
Regarding the part- files, take a look at
map reduce output files: part-r-* and part-*
I have a large set of text files in an S3 directory. For each text file, I want to apply a function (an executable loaded through bootstrapping) and then write the results to another text file with the same name in an output directory in S3. So there's no obvious reducer step in my MapReduce job.
I have tried using NONE as my reducer, but the output directory fills with files like part-00000, part-00001, etc., and there are more of these than there are files in my input directory; each part- file represents only a processed fragment.
Any advice is appreciated.
Hadoop provides a reducer called the Identity Reducer.
The Identity Reducer literally just outputs whatever it took in (it is the identity relation). This is what you want to do, and if you don't specify a reducer the Hadoop system will automatically use this reducer for your jobs. The same is true for Hadoop streaming. This reducer is used for exactly what you described you're doing.
I've never run a job that doesn't output the files as part-####. I did some research and found that you can do what you want by subclassing the OutputFormat class. You can see what I found here: http://wiki.apache.org/hadoop/FAQ#A27. Sorry I don't have an example.
To cite my sources, I learned most of this from Tom White's book: http://www.hadoopbook.com/.
From what I've read about Hadoop, it seems that you need a reducer even if it doesn't change the mappers' output, just to merge the mapper outputs.
You do not need a reducer. You can set the number of reducers to 0 in the job configuration stage, e.g.
job.setNumReduceTasks(0);
Also, to ensure that each mapper processes one complete input file, you can tell Hadoop that the input files are not splittable. FileInputFormat has a method
protected boolean isSplitable(JobContext context, Path filename)
that can be used to mark a file as not splittable, which means it will be processed by a single mapper. See here for documentation.
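For example, a sketch of such an input format, assuming the new mapreduce API (the class name here is made up; the overridden method is the one quoted above):

```java
// Input format that marks every file as non-splittable, so each input
// file is handled by exactly one mapper.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split: one mapper per input file
    }
}
```

You would then register it with job.setInputFormatClass(WholeFileTextInputFormat.class).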
I just re-read your question and realised that your input is probably a file with a list of filenames in it, so you most likely want it to be split, or it will only be run by one mapper.
What I would do in your situation is have an input that is a list of file names in S3. The mapper input is then a file name, which it downloads and runs your exe against. The output of this exe run is then uploaded to S3, and the mapper moves on to the next file. The mapper then does not need to output anything, though it might be a good idea to output the name of each processed file so you can check against the input afterwards. Using the method I just outlined, you would not need the isSplitable method.