error running job using xavier initialization on google cloud machine learning - google-cloud-ml

I'm using the xavier conv2d method for initializing my variables like this:
initializer = tf.contrib.layers.xavier_initializer_conv2d()
variable = tf.get_variable(name=name,shape=shape,initializer=initializer)
The training runs locally using gcloud ml-engine local train; however, it crashes when I send it as a job to the cloud.
The crash log: "Module raised an exception <type 'exceptions.SystemExit'>:-15."
If I replace the xavier initializer by a random uniform initializer, the training works both on my local machine and on the cloud:
initializer = tf.random_uniform_initializer(-0.25,0.25)
I'm running GPU-enabled TensorFlow 1.0.1 on my local machine with Python 2.7.13.
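For reference, here are the two variants side by side as a minimal sketch (the conv_weight helper and its use_xavier flag are my own illustration, not part of the original code):
import tensorflow as tf

def conv_weight(name, shape, use_xavier=True):
    # Hypothetical helper: pick between the Xavier conv2d initializer
    # (which fails on the cloud job) and the uniform one (which works
    # both locally and on the cloud).
    if use_xavier:
        initializer = tf.contrib.layers.xavier_initializer_conv2d()
    else:
        initializer = tf.random_uniform_initializer(-0.25, 0.25)
    return tf.get_variable(name=name, shape=shape, initializer=initializer)

weights = conv_weight("conv1_w", [3, 3, 1, 32], use_xavier=False)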

Related

AWS EMR pyspark notebook fails with `Failed to run command /usr/bin/virtualenv (...)`

I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).
The problem arises once I perform operations that actually hit the spark machinery. For example:
sc.parallelize(list(range(10))).map(lambda x: x**2).collect()
I am no spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also some python components. I believe the central part of the stack trace is the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0
The full stack trace is here.
A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:
The path python3 (from --python=python3) does not exist
I tried running the /usr/bin/virtualenv command on the master node manually (after logging in through ssh), and that worked fine, but the error is of course still present after I did that.
While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.
Technical information on the cluster setup:
emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them as master), all r5a.2xlarge.
Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.
Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.
I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
The issue for me was that the virtualenv was being made on the executors with a python path that didn't exist. Pointing the executors to the right one did the job for me:
"spark.pyspark.python": "/usr/bin/python3.7"
Here is how I reconfigured the Spark app at the beginning of the notebook:
{"conf":{"spark.pyspark.python": "/usr/bin/python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"}
}
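With that configuration in place, a quick sanity check (a sketch of mine, not from the original post; it relies on the notebook's sc SparkContext) is to confirm which interpreter the executors actually use:
import sys

# Interpreter used on the driver:
print(sys.executable)

# Interpreters used on the executors; after the fix this should report /usr/bin/python3.7:
print(sc.range(4).map(lambda _: sys.executable).distinct().collect())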

AWS Device Farm - Schedule Run - Errors

I am hoping someone here has come across this issue and has an answer for me.
I have set up a project in Device Farm and have written automation tests in Appium using JS.
When I create a run manually using the console, the runs succeed without any issues and my tests get executed.
However, when I try to schedule a run via the CLI with the following command, it fails with an error:
aws devicefarm schedule-run --project-arn projectArn --app-arn appArn --device-pool-arn dpARN --name myTestRun --test type=APPIUM_NODE,testPackageArn="testPkgArn"
Error : An error occurred (ArgumentException) when calling the ScheduleRun operation: Standard Test environment is not supported for testType: APPIUM_NODE
CLI versions: aws-cli/1.17.0 Python/3.8.1 Darwin/19.2.0 botocore/1.14.0
That is currently expected for the standard environment. The command needs to use the custom environment, which the CLI can do by setting the testSpecArn value.
This ARN refers to an upload in Device Farm consisting of a .yaml file that defines how the tests are executed.
This process is discussed here
https://docs.aws.amazon.com/devicefarm/latest/developerguide/how-to-create-test-run.html#how-to-create-test-run-cli-step6
The error in this case is caused by the fact that the APPIUM_NODE test type can only be used with the custom environment currently.
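Concretely, that means uploading the .yaml test spec first and then passing its ARN in the --test shorthand, roughly like this (testSpecArn below is a placeholder for the ARN of that upload):
aws devicefarm schedule-run --project-arn projectArn --app-arn appArn --device-pool-arn dpARN --name myTestRun --test type=APPIUM_NODE,testPackageArn="testPkgArn",testSpecArn="testSpecArn"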

AWS Glue Python shell jobs: No space left on device

I'm running a Glue job with the following configurations
Type: Python Shell
Python version: Python 2 (1.0) [I'd created these jobs without a version and, as per the documentation, they should default to 0.9]
The job fails to initialize with the following error:
No space left on device: 'build/bdist.linux-x86_64/egg/pycparser
Has anyone else encountered the same error or have any potential resolutions?

When trying to deploy to aws-lambda Error: operation not permitted occurs

I am trying to deploy a function to AWS that takes a screenshot of the given URL and tweets it. I am using puppeteer-core, @serverless-chrome/lambda and serverless-plugin-chrome to take the screenshot, following these articles (but instead of uploading to AWS I tweet the image): https://swizec.com/blog/serverless-chrome-on-aws-lambda-the-guide-works-in-2019-beyond/swizec/9024 and https://nadeesha.github.io/headless-chrome-puppeteer-lambda-servelerless/.
It works fine when invoked locally and does everything, but when I try to deploy it shows an 'operation not permitted' error. Below is the console log from when I try to deploy.
Serverless: Injecting Headless Chrome...
Error --------------------------------------------------
EPERM: operation not permitted, symlink 'C:\Users\xx\yy\zz\node_modules' -> 'C:\Users\xx\yy\zz\.build\node_modules'
For debugging logs, run again after setting the "SLS_DEBUG=*" environment variable.
Get Support --------------------------------------------
Docs: docs.serverless.com
Bugs: github.com/serverless/serverless/issues
Issues: forum.serverless.com
Your Environment Information ---------------------------
OS: win32
Node Version: 8.10.0
Serverless Version: 1.45.1
I initially tried using just puppeteer, but the package size was too big, so I decided to go with this serverless-chrome approach. Here is a relevant link, but I haven't been able to solve it: https://github.com/adieuadieu/serverless-chrome/issues/155
Try removing your .build folder before deploying.
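On Windows (the environment above reports win32), that could look like this sketch, run from the project root:
rmdir /s /q .build
serverless deploy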

AWS: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable. SparkContext can only be used on the driver

While trying to run my program on an AWS Amazon cluster with:
[hadoop@ip-172-31-5-232 ~]$ spark-submit 6.py
I got the following error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Here is a sample of my code where the error appears:
result = l.map(lambda x: (x[0], list(x[1]))).collect()
if NbrVertex > (2 * (len(filteredResults.collect()) + ExtSimilarity)):
    Successor = filteredResults3.map(lambda j: matchedSuccessor(j, result))
    print(Successor.collect())
collect() causes the data to come back to the driver.
The Successor line then references that driver-side state from the workers, via the .map. Not allowed.
The error message confirms exactly that; it is the Spark paradigm.
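One common way to restructure the snippet so that only plain driver-side data reaches the closure is to broadcast the collected list; this is a sketch, and it assumes matchedSuccessor itself does not touch any RDD or the SparkContext:
# result is an ordinary Python list once collect() has run on the driver.
result = l.map(lambda x: (x[0], list(x[1]))).collect()
result_bc = sc.broadcast(result)  # ship the list to the workers explicitly

if NbrVertex > (2 * (len(filteredResults.collect()) + ExtSimilarity)):
    # Only the broadcast value is read inside the map; no SparkContext or RDDs.
    Successor = filteredResults3.map(lambda j: matchedSuccessor(j, result_bc.value))
    print(Successor.collect())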