This is how I am writing to a file (Scala code):
import java.io.FileWriter
val fw = new FileWriter("my_output_filename.txt", true)
fw.write("something to write into output file")
fw.close()
This is part of a Spark job I'm running on AWS EMR. The job runs and finishes successfully. The problem is that I'm unable to locate my_output_filename.txt anywhere once it's done.
A bit more context:
What I'm trying to do: do some processing on each row of a dataframe and write it into a file, so it looks something like this:
myDF.collect().foreach( row => {
import java.io.FileWriter
val fw = new FileWriter("my_output_filename.txt", true)
fw.write("row data to be written into file")
fw.close()
})
How I checked:
When I ran it locally, I found the newly created file in the same directory as the code, but I couldn't find it on the remote node.
I ran find / -name "my_output_filename.txt".
I also checked in HDFS: hdfs dfs -find / -name "my_output_filename.txt"
Where can I find the output file?
Is there a better way to do this?
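Since the question asks for a better way: a FileWriter with a relative path writes into the local working directory of the JVM running the code, which on EMR is typically a YARN container directory that is cleaned up after the job, so the usual approach is to let Spark write the output itself. Below is a minimal sketch of that idea in PySpark (the Scala RDD API has the same map/saveAsTextFile calls); myDF is the DataFrame from the question, while process_row and the S3 output path are made-up placeholders, not part of the original question.
# Sketch: run the per-row processing on the executors and let Spark write the result,
# instead of collect() + FileWriter on a single machine.
def process_row(row):
    return str(row)  # stand-in for the real per-row processing

(myDF.rdd
     .map(process_row)                              # processing happens on the executors
     .saveAsTextFile("s3://my-bucket/my_output/"))  # hypothetical path; writes one part-* file per partition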
Python version: Python 2.7.13
I am trying to write a script that is able to go through a *.txt file and launch a batch file to execute a particular test.
The code below goes through the input file and changes the string from 'N' to 'Y', which allows the particular test to be executed. I am in the process of creating a for loop to go through all the lines within the *.txt file and execute all the tests in sequence. However, my problem is that I do not want to execute the tests at the same time (which is what would happen if I just wrote the launch code for each test).
Is there a way to wait until the initial test is finished before launching the next one?
Here is what I have so far:
from subprocess import Popen
import os, glob
path = r'C:/Users/user1/Desktop/MAT'
for fname in os.listdir(path):
    if fname.startswith("fort"):
        os.remove(os.path.join(path, fname))

with open('RUN_STUDY_CHECKLIST.txt', 'r') as file:
    data = file.readlines()

ln = 4
ch = list(data[ln])
ch[48] = 'Y'
data[ln] = "".join(ch)

with open('RUN_STUDY_CHECKLIST.txt', 'w') as file:
    file.writelines(data)

matexe = Popen('run.bat', cwd=r"C:/Users/user1/Desktop/MAT")
stdout, stderr = matexe.communicate()
In this particular instance I am changing the 'N' in line 2 of the *.txt file to a 'Y', which will be used as input for another Python script.
I should mention that I would like to do this without having to interact with any prompt: I want to execute the script and leave it running (since it would take a long time to go through all the tests).
Best regards,
Jorge
After looking further through several websites, I managed to find a solution to my question.
I used:
import subprocess

exe1 = subprocess.Popen(['python', 'script.py'])
exe1.wait()  # block until script.py finishes
I wanted to post the answer just in case this is helpful to anyone.
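To fold that wait into the loop from the question, here is a minimal sketch; the flag column and run.bat path are taken from the snippets above, while the TEST_LINES list is a made-up example of which checklist lines hold the N/Y flags:
from subprocess import Popen

CHECKLIST = 'RUN_STUDY_CHECKLIST.txt'
FLAG_COLUMN = 48                  # same column the question's snippet edits
TEST_LINES = [4, 5, 6]            # assumption: checklist lines that enable each test

for ln in TEST_LINES:
    with open(CHECKLIST, 'r') as f:
        data = f.readlines()
    ch = list(data[ln])
    ch[FLAG_COLUMN] = 'Y'         # enable this test
    data[ln] = ''.join(ch)
    with open(CHECKLIST, 'w') as f:
        f.writelines(data)
    exe = Popen('run.bat', cwd=r'C:/Users/user1/Desktop/MAT')
    exe.wait()                    # block until run.bat exits before starting the next test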
I am running into an issue with my EMR jobs where too many input files cause out-of-memory errors. Doing some research, I think changing the HADOOP_HEAPSIZE config parameter is the solution. Old Amazon forum posts from 2010 say it cannot be done.
Can we do that now, in 2018?
I run my jobs using the C# API for EMR, and normally I set configurations using statements like the ones below. Can I set HADOOP_HEAPSIZE using similar commands?
config.Args.Insert(2, "-D");
config.Args.Insert(3, "mapreduce.output.fileoutputformat.compress=true");
config.Args.Insert(4, "-D");
config.Args.Insert(5, "mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec");
config.Args.Insert(6, "-D");
config.Args.Insert(7, "mapreduce.map.output.compress=true");
config.Args.Insert(8, "-D");
config.Args.Insert(9, "mapreduce.task.timeout=18000000");
If I need to bootstrap using a file, I can do that too. Could someone show me the contents of the file for the config change?
Thanks
I figured it out...
I created a shell script to increase the memory size on the master machine (code at the end).
I run it as a bootstrap action like this:
ScriptBootstrapActionConfig bootstrapActionScriptForHeapSizeIncrease = new ScriptBootstrapActionConfig
{
    Path = "s3://elasticmapreduce/bootstrap-actions/run-if",
    Args = new List<string> { "instance.isMaster=true", "<s3 path to my shell script>" },
};
The shell script code is this:
#!/bin/bash
# Default heap size in MB; can be overridden by passing a value as the first argument
SIZE=8192
if ! [ -z "$1" ] ; then
  SIZE=$1
fi
echo "HADOOP_HEAPSIZE=${SIZE}" >> /home/hadoop/conf/hadoop-user-env.sh
Now I am able to run an EMR job with the master machine type set to r3.xlarge and process 31 million input files.
I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect()
I also know I can read a parquet file using pyarrow.parquet's read_table()
However, read_table() accepts a filepath, whereas hdfs.connect() gives me a HadoopFileSystem instance.
Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get hold of a Parquet file/folder residing in an HDFS cluster? What I want to get to is the to_pydict() function, so I can then pass the data along.
Try
fs = pa.hdfs.connect(...)
fs.read_parquet('/path/to/hdfs-file', **other_options)
or
import pyarrow.parquet as pq
with fs.open(path) as f:
    pq.read_table(f, **read_options)
I opened https://issues.apache.org/jira/browse/ARROW-1848 about adding some more explicit documentation about this.
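Either call returns a pyarrow.Table, so getting to the to_pydict() data the question asks about is one more line; a small sketch reusing the fs and path from above:
table = fs.read_parquet('/path/to/hdfs-file')   # or pq.read_table(f) inside the fs.open(path) block
data = table.to_pydict()                        # dict mapping column name -> list of values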
I tried the same via the Pydoop library with engine='pyarrow', and it worked perfectly for me. Here is the generalized method:
!pip install pydoop pyarrow

import logging
import pandas as pd
import pydoop.hdfs as hd

logger = logging.getLogger(__name__)

# read files via Pydoop and return a pandas DataFrame
def readParquetFilesPydoop(path):
    with hd.open(path) as f:
        df = pd.read_parquet(f, engine='pyarrow')
        logger.info('file: ' + path + ' : ' + str(df.shape))
        return df
In my Python code I execute
train_writer = tf.summary.FileWriter(TBOARD_LOGS_DIR)
train_writer.add_graph(sess.graph)
I can see a 1.6 MB file created in E:\progs\tensorboard_logs (and no other file),
but then when I execute
tensorboard --logdir=E:\progs\tensorboard_logs
it loads, but says: "No graph definition files were found." when I click on Graph.
Additionally, running tensorboard --inspect --logdir=E:\progs\tensorboard_logs
displays
Found event files in:
E:\progs\tensorboard_logs
These tags are in E:\progs\tensorboard_logs:
audio -
histograms -
images -
scalars -
Event statistics for E:\progs\tensorboard_logs:
audio -
graph
first_step 0
last_step 0
max_step 0
min_step 0
num_steps 1
outoforder_steps []
histograms -
images -
scalars -
sessionlog:checkpoint -
sessionlog:start -
sessionlog:stop -
This is TF 1.01 or so, on Windows 10.
I had a similar issue. It occurred when I specified the logdir folder inside single quotes instead of double quotes. Hope this is helpful to you.
e.g.: tensorboard --logdir='my_graph' -> TensorBoard didn't detect the graph
tensorboard --logdir="my_graph" -> TensorBoard detected the graph
In TensorFlow, dealing with graphs has three parts:
1) Creating the graph
2) Writing the graph to an event file
3) Visualizing the graph in TensorBoard
Example: creating a graph in TensorFlow
import tensorflow as tf

a = tf.constant(5, name="input_a")
b = tf.constant(3, name="input_b")
c = tf.multiply(a, b, name="mul_c")
d = tf.add(a, b, name="add_d")
e = tf.add(c, d, name="add_e")
sess = tf.Session()
sess.run(c)  # check: value should be 15
sess.run(d)  # check: value should be 8
sess.run(e)  # check: value should be 23
Writing the graph to an event file
writer = tf.summary.FileWriter('./tensorflow_examples', sess.graph)
It is very important to specify a directory (in this case, tensorflow_examples) where the event file will be written.
writer = tf.summary.FileWriter('./', sess.graph) didn't work for me, because the shell command tensorboard --logdir expects a directory name.
After executing this step, verify that the event file has been created in the specified directory.
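For example, a quick check from Python (the directory name is the one used above; TensorBoard event files are named events.out.tfevents.*):
import os
print(os.listdir('./tensorflow_examples'))  # expect something like ['events.out.tfevents.<timestamp>.<hostname>']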
Visualizing the graph in TensorBoard
Open a terminal (bash) and, in the working directory, type:
tensorboard --logdir='tensorflow_examples' --host=127.0.0.1
Then open a browser at http://127.0.0.1:6006/ or http://localhost:6006/, and TensorBoard now shows the graph successfully.
The problem might be the --logdir parameter; make sure you have typed it correctly.
Example: in the code:
writer = tf.summary.FileWriter('./log/', s.graph)
Open PowerShell, cd to your working directory, and type:
tensorboard --logdir=log
You can also use --debug to see if there is a problem finding the log file. If you see
TensorBoard path_to_run is: {'C:\\Users\\example\\log': None}, that means it cannot find the file.
You may need to change the PowerShell directory to the one containing your log file. Also, the logdir does not need single quotation marks (double quotation marks or no quotes are both OK).
I've been struggling for some time now with setting up a "simple" BerkeleyDB replication using the db_replicate utility.
However, I've had no luck making it actually work, and I can't find any concrete example of how things should be set up.
Here is the setup I have so far. The environment is Debian Wheezy with BDB 5.1.29.
Database generation
A simple Python script reads "CSV" files and inserts each line into the BDB file:
from glob import glob
from bsddb.db import DBEnv, DB
from bsddb.db import DB_CREATE, DB_PRIVATE, DB_INIT_MPOOL, DB_BTREE, DB_HASH, DB_INIT_LOCK, DB_INIT_LOG, DB_INIT_TXN, DB_INIT_REP, DB_THREAD
env = DBEnv()
env.set_cachesize(0, 1024 * 1024 * 32)
env.open('./db/', DB_INIT_MPOOL | DB_INIT_LOCK | DB_INIT_LOG |
DB_INIT_TXN | DB_CREATE | DB_INIT_REP | DB_THREAD)
db = DB(env)
db.open('apd.db', dbname='stuff', flags=DB_CREATE, dbtype=DB_BTREE)
for csvfile in glob('Stuff/*.csv'):
    for line in open(csvfile):
        db.put(line.strip(), None)
db.close()
env.close()
DB Configuration
In the DB_CONFIG file (this is where I'm missing the most important part, I guess):
repmgr_set_local_site localhost 6000
Actual replication attempt
# Copy the database file to begin with
db5.1_hotbackup -h ./db/ -b ./other-place
# Start replication master
db5.1_replicate -M -h db
# Then try to connect to it
db5.1_replicate -h ./other-place
The only thing I currently get from the replicate tool is:
db5.1_replicate(20648): DB_ENV->open: No such file or directory
Edit: after stracing the process I found out it was trying to access __db.001, so I copied those files manually. The current output is:
db5.1_replicate(22295): repmgr is already started
db5.1_replicate(22295): repmgr is already started
db5.1_replicate(22295): repmgr_start: Invalid argument
I suppose I'm missing the actual configuration value for the client to connect to the server, but so far no luck, as all the settings I tried yielded unrecognized name-value pair errors.
Does anyone know how this setup might be completed? Or maybe I'm not even headed in the right direction and this should be done in a completely different way?