Disk full issue in Hadoop YARN on Amazon Web Services

I am using EMR and running a Spark streaming job with YARN as the resource manager on Hadoop 2.7.3-amzn-0, and I am facing a disk full issue. I set two properties in yarn-site.xml: yarn.nodemanager.localizer.cache.cleanup.interval-ms to 600000 and yarn.nodemanager.localizer.cache.target-size-mb to 1024. Even so, the disk still fills up after some time and crosses the limit I configured. It seems my configured properties are not working, and I have to clean the disk manually with this command:
rm -rf filecache/ usercache/
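For reference, this is roughly how those two entries would look in yarn-site.xml (property names and values taken from the description above):
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>1024</value>
</property>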
Every Spark job creates a directory in filecache, such as /mnt/yarn/usercache/hadoop/filecache/5631/__spark_libs__8149715201399895593.zip, containing all the .jar files.
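A quick way to see how much space those caches are actually using (path taken from the example above):
sudo du -sh /mnt/yarn/usercache/*/filecache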
What should I do to clean the filecache and usercache automatically? Which file do I have to change, and in which location?
Can anyone please help?

Related

Google Cloud Ops Agent

I am having an issue where Google Cloud Ops Agent logging gathers a lot of data and fills up my entire Debian server hard drive in about 3 weeks due to the ever-increasing size of the log file.
I do not want to increase the size of my server hard drive.
Does anyone know how to configure Google Cloud Ops Agent so that it only retains log data for the previous 7 days?
EDIT: the Google Cloud Ops Agent log file is stored in the directory below:
/var/log/google-cloud-ops-agent/subagents/logging-module.log
I faced the same issue recently while using agent 2.11.0. And it's not just an enormous log file, it's also ridiculous CPU usage! Check it out in htop.
If you open the log file you'll see it spamming errors about buffer chunks. Apparently they got corrupted somehow, so the agent can't read them and send them on. Hence the high I/O and CPU usage.
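A quick way to confirm this, using the log path from the question:
sudo grep -c -i 'chunk' /var/log/google-cloud-ops-agent/subagents/logging-module.log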
The solution is to stop the service:
sudo service google-cloud-ops-agent stop
Then clear all buffer chunks:
sudo rm -rf /var/lib/google-cloud-ops-agent/fluent-bit/buffers/
And delete log file if you want:
sudo rm -f /var/log/google-cloud-ops-agent/subagents/logging-module.log
Then start the agent:
sudo service google-cloud-ops-agent start
This helped me out.
By the way, this issue is described here, and it seems that Google "fixed" it as of 2.7.0-1. Whatever they mean by that, since we still faced it...

how to run a one-off task on cloudfoundry to upload data before starting Python app

Hi, this is my first experience trying to deploy a Python app to the cloud using CF. I am having issues deploying my app; I sincerely appreciate it if anyone can help me or point me in the right direction to solve the issue.
The main problem is that the app I am trying to deploy is large due to a lot of Python dependencies. The size of my app directory itself is 200 KB. The first error I observed was that staging fails with "Failed to upload payload for droplet". I think the reason is that once all Python dependencies are downloaded from the requirements.txt file and the droplet is finally created, it is too large to upload. The droplet size is 982.3 MB.
The first solution I tried was vendoring the app, where I created a vendor directory containing all the Python dependencies, but the size of the vendor directory was greater than 1 GB, which causes the upload to exceed the 1 GB limit and leads to a failure uploading the app files.
The second solution I am working on is to upload all the installed Python libraries to an object store (in my case an S3 bucket bound to my app) and then download the dependencies folder, called Pypackages, to the app's root directory /home/vcap/app, so that /home/vcap/app/Pypackages exists before my app starts on the cloud. But I couldn't do it successfully yet. I have included a Python script in my app directory which downloads the files from the S3 bucket successfully (I have put the correct absolute path for the download in the downloadS3.py script, i.e. /home/vcap/app/Pypackages). I want to run this script with "python downloadS3.py" as a one-off task. First I tried the solution here: Can I have multiple commands run in a manifest.yml file?
and although I can see that the status of the task is SUCCEEDED via 'cf tasks my-app-name', /home/vcap/app/Pypackages does not exist.
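For reference, a one-off task like this is typically launched and then checked with something like the commands below (app and script names simply mirror the question; exact flags vary between cf CLI versions):
cf run-task my-app-name "python downloadS3.py"
cf tasks my-app-name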
I also tried to run the one-off task with the steps below:
1-
$ cf push -c 'python downloadS3.py && sleep infinity' -i 1 --no-route
2-
$ cf push -c 'null'
I have printed the contents of /home/vcap/app from my app, i.e. when the app is started and I enter the URL in my browser (I don't know the right way to see the contents of the root directory). Anyway, the problem is that Pypackages is not downloaded to the correct root directory. I am not sure whether I am running the one-off task in the wrong way or whether there is a better solution to make my app work.
I appreciate any help!
Diego cells stage apps and upload the droplet to the blobstore via the cloud controller. The maximum file size that can be uploaded is configurable at Ops Manager > TAS for VMs > Application Developer Control > Maximum File Upload Size (MB); the default is 1024 MB. This restriction seems to be what you are hitting, so see if you can get it increased with your admin's help...
Tasks run in their own containers, so that is possibly not an option. I think the Python buildpack collects and installs the packages before creating the droplet, so I don't think copying packages directly to the /app directory will be of much help.
If you have data files then you can use a .profile file and do some scripting to copy them from S3 or a server/NFS location into the /app directory. Something like:
wget http://s3.location.com/data_files
cp data_files /home/vcap/app/
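If the files sit in a private S3 bucket, a slightly fuller sketch of the same .profile idea could be the following (bucket and archive names are placeholders, and it assumes the AWS CLI is available in the droplet, e.g. installed via requirements.txt, with credentials provided to the app):
# .profile runs from /home/vcap/app before the start command
aws s3 cp s3://my-bucket/Pypackages.tar.gz /home/vcap/app/
tar -xzf /home/vcap/app/Pypackages.tar.gz -C /home/vcap/app/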
But if all of these are packages and increasing the size limit is not feasible, then you may need to look at breaking up the app.

Copy files to Container-Optimised OS from a GCP Storage bucket

How can one download files from a GCP Storage bucket to a Container-Optimised OS (COS) on instance startup?
I know of the following solutions:
gcloud compute copy-files
SSH through console
SCP
Yet all of these have to be done manually and externally after an instance is started.
There is also cloud-init, yet I can't find any info on how to copy files from a Storage bucket. Examples seem to suggest that it's better to include the content of files directly in the cloud-init file, which is not something I want to do because of security. Is it possible to download files from a Storage bucket using cloud-init?
I considered using a startup script, yet COS lacks CLI tools such as gcloud or gsutil to be able to run any such commands in a startup script.
I know I could copy the files manually and then save the image as a boot disk, but I'm hoping there are solutions that avoid having to do so.
Most of all, I'm assuming I'm not asking for something impossible, given that COS instance setup allows me to specify Docker volumes that I could mount onto the starting container. This seems to suggest I should be able to have some private files on the instance the moment COS will attempt to run my image on startup. But how?
Trying to execute a startup-script with a cloud-sdk image and copying files there as suggested by Guillaume didn't work for me for a while, showing this log. Eventually I realised that the cloud-sdk image is 2.41GB when uncompressed and takes over 2 minutes to complete pulling. I tried again with an empty COS instance and the startup script completed successfully, downloading the data from a Storage bucket.
However, a 2.41GB image and over 2 minutes of boot time sound like a bit of an overkill to download a 2KB file. Don't they?
I'm glad to see a working solution to my question (thanks Guillaume!) although I'm still wondering: isn't there a nicer way to do this? I feel that this method is even less tidy than manually putting the files on the COS instance and then creating a machine image to use in the future.
Based on Guillaume's answer I created and published a gsutil wrapper image, available as voyz/gsutil_wrap. This way I am able to run a startup-script with the following command:
docker run -v /host/path:/container/path \
--entrypoint gsutil voyz/gsutil_wrap \
cp gs://bucket/path /container/path
It's essentially a copy of what Guillaume suggested, except it is using an image containing only a minimum setup required to run gsutil. As a result it weighs 0.22GB and pulls within 10-20 seconds on average - as opposed to 2.41GB and over 2 minutes respectively for the google/cloud-sdk image suggested by Guillaume.
Also, credit to this incredibly useful StackOverflow answer that allows gsutil to use the default service account for authentication.
The startup-script is the correct location to do this. And yes, COS lacks some useful libraries.
But you can run containers! For example, the Google Cloud SDK container!
So, add this startup-script in the VM metadata:
key -> startup-script
value ->
docker run -v /local/path/to/copy/files:/dummy/container/path \
--entrypoint gsutil google/cloud-sdk \
cp gs://your_bucket/path/to/file /dummy/container/path
Note: the startup script is run as root. Perform a chmod/chown in your startup script if you need to change the file access mode.
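For example, something like this, reusing the host path from the command above:
chmod -R a+r /local/path/to/copy/files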
Let me know if you need more explanation of this command line.
Of course, with a fresh COS image, the startup time is quite long (pull the container image and extract it).
To reduce the startup time, you can "bake" your image. I mean, start with a COS image, download/install what you want on it (or only perform a docker pull of the google/cloud-sdk container), and create a custom image from this.
This way, all the required dependencies will be present in the image and the boot will be quicker.
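A rough sketch of that baking step (instance, zone, and image names are placeholders; the boot disk usually shares the instance name):
# on the running COS VM: pre-pull the image so future boots skip the download
docker pull google/cloud-sdk
# then, from your workstation, create a custom image from the stopped VM's boot disk
gcloud compute instances stop my-cos-vm --zone=us-central1-a
gcloud compute images create cos-with-cloud-sdk --source-disk=my-cos-vm --source-disk-zone=us-central1-a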

S3-Dist-Cp Failing on EMR5

I am facing issues with the s3-dist-cp command in emr-5.0.0. In my application, I need to push some files from HDFS to S3, and I am using the s3-dist-cp command to achieve this. It was working fine in emr-4.2.0, but it's not working in emr-5.0.0. If I run the command manually it works fine, but it fails in my application. I didn't make any changes to my application to run it on emr-5.
Do I need to make any changes in order to use emr-5? Has there been any change in the way the s3-dist-cp command is used in emr-5?
I am using following command:
s3-dist-cp --src /user/hive/warehouse/abc.text --dest s3n://bucket/abc.text
s3-dist-cp is only available on the master node (s3-dist-cp.jar).
The following is the location of the application.
/usr/share/aws/emr/s3-dist-cp/
The s3-dist-cp.jar is not available on the slave nodes.
You can log in to a slave machine and verify this.
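For example, running this on a core/task node should come back empty or fail, while it succeeds on the master (path from above):
ls /usr/share/aws/emr/s3-dist-cp/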
So the reason for your application's failure might be that on the new EMR you are using some workflow management tool which deploys the application on the slaves and starts it from there. Since s3-dist-cp is not available there, it fails.
Workaround
First Option
Bundle the jar and use the following command:
hadoop jar s3-dist-cp.jar --src location --dest location
Second Option
Bootstrap the s3-dist-cp.jar onto the cluster.
You can even run it as a Java program.
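A hedged sketch of the bootstrap option (the bucket is a placeholder; it assumes you have first copied the jar from /usr/share/aws/emr/s3-dist-cp/ on the master into your own bucket):
#!/bin/bash
# bootstrap action: put s3-dist-cp.jar on every node at cluster start
aws s3 cp s3://my-bucket/bootstrap/s3-dist-cp.jar /home/hadoop/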
First thing: s3n:// is now deprecated, so start using s3:// for S3 paths.
Secondly, if you're merely copying a file into S3 from a local file on your cluster, you can use aws s3 cp:
aws s3 cp /user/hive/warehouse/abc.text s3://bucket/abc.text
The syntax that you have used for s3-dist-cp is incorrect. Please try again with the command below.
s3-dist-cp --src hdfs:///user/hive/warehouse/abc.text --dest s3n://bucket/abc.text
Let me know if this solves your problem.

HDFS directory on a MAPR cluster

I need to save my Spark Streaming checkpoint files to an HDFS directory. I can access a remote cluster which has MapR installed on it.
But I am not sure which path on MapR corresponds to an HDFS directory.
Is it opt/mapr/..?
When you are connected to your MapR cluster you can run the following command:
hadoop fs -ls /
This will list the files/folders just like in any HDFS cluster, so you will not see anything special here.
So if your Spark job is running on the MapR cluster you just have to point to the folder you want, for example:
yourRdd.saveAsTextFile("/apps/output");
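Since the question is about Spark Streaming checkpoints, the equivalent call would be something along these lines (directory name is illustrative):
yourStreamingContext.checkpoint("/apps/checkpoints");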
You can do exactly the same from your development environment, but you have to install and configure the MapR-Client
Note that you can also access the MapR File System (MapR-FS) using NFS, which should be running on your cluster; by default the mount point is /mapr.
So you can see the content of your FS using:
cd /mapr/you-cluster-name/apps/output
/opt/mapr is the folder that contains the installed MapR product.
So if you look at it from a pure Spark point of view, nothing changes: you just save/read data from a folder, and if you are running on MapR this will be done in MapR-FS.