I am getting an error when trying to automate AWS DataSource creation from S3.
I am running this shell script:
#!/bin/bash
for k in 1 2 3 4 5
do
aws machinelearning create-data-source-from-s3 --cli-input-json file://data/cfg/dsrc_training_00$k.json
aws machinelearning create-data-source-from-s3 --cli-input-json file://data/cfg/dsrc_validate_00$k.json
done
and here is an example of the json file it references:
{
"DataSourceId": "Iris_training_00{k}",
"DataSourceName": "[DS Iris] training 00{k}",
"DataSpec": {
"DataLocationS3": "s3://ml-test-predicto-bucket/shuffled_{k}.csv",
"DataSchemaLocationS3": "s3://ml-test-predicto-bucket/dsrc_iris.csv.schema",
"DataRearrangement": {"splitting":{"percentBegin" : 0, "percentEnd" : 70}}
},
"ComputeStatistics": true
}
But when I run my script from the command line I get the error:
Parameter validation failed:
Invalid type for parameter DataSpec.DataRearrangement, value: {u'splitting': {u'percentEnd': u'100', u'percentBegin': u'70'}}, type: <type 'dict'>, valid types: <type 'basestring'>
Can someone please help? I have looked at the AWS ML API documentation and I think I am doing everything right, but I can't seem to solve this error. Many thanks!
The DataRearrangement element expects a JSON String object. You are passing a dictionary object.
Change:
"DataRearrangement": {"splitting":{"percentBegin" : 0, "percentEnd" : 70}}
[to]
"DataRearrangement": "{\"splitting\":{\"percentBegin\":0,\"percentEnd\":70}}"
Related
Using JMESPath and given the JSON below, how would I filter so that only JobNames starting with "analytics" are returned?
For more context, the JSON was returned by the AWS CLI command aws glue list-jobs.
{
"JobNames": [
"analytics-job1",
"analytics-job2",
"team2-job"
]
}
I tried this:
JobNames[?starts_with(JobNames, `analytics`)]
but it failed with
In function starts_with(), invalid type for value: None, expected one
of: ['string'], received: "null"
Above I extracted the JMESPath bit; here is the entire AWS CLI command that I tried and that failed:
aws glue list-jobs --query '{"as_string": to_string(JobNames[?starts_with(JobNames, `analytics`)])}'
I couldn't test it on list-jobs, but the query part works on list-crawlers; I just replaced JobNames with CrawlerNames.
aws glue list-jobs --query 'JobNames[?starts_with(@, `analytics`) == `true`]'
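Applied to the JobNames in the question, that query should return only the two matching names. Note that starts_with already yields a boolean, so the == `true` comparison is optional; a sketch of the expected result:
aws glue list-jobs --query 'JobNames[?starts_with(@, `analytics`)]'
[
    "analytics-job1",
    "analytics-job2"
]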
I am trying to load S3 data into Redshift using the COPY command with a JSONPaths file. My source data looks like this:
{
    "_meta-id": 1,
    "payload": {..}
}
In my Redshift table, I want to store the entire JSON document as my second column, so my JSONPaths file is:
{
"jsonpaths": [
"$['_meta-id']",
"$"
]
}
This gives the error:
Invalid JSONPath format. Supported notations are 'dot-notation' and 'bracket-notation': $
Query:
copy table_name
from 's3://abc/2018/12/15/1'
json 's3://xyz/jsonPaths';
[Amazon](500310) Invalid operation: Invalid JSONPath format. Supported notations are 'dot-notation' and 'bracket-notation': $..
Details:
-----------------------------------------------
error: Invalid JSONPath format. Supported notations are 'dot-notation' and 'bracket-notation': $
code: 8001
context:
query: 21889
location: s3_utility.cpp:672
process: padbmaster [pid=11925]
-----------------------------------------------;
1 statement failed.
Can someone help?
I am trying to read data from Elasticsearch from PySpark, using the elasticsearch-hadoop API in Spark. The ES cluster sits on AWS EMR, which requires credentials to sign in. My script is as below:
from pyspark import SparkContext, SparkConf

sc.stop()
conf = SparkConf().setAppName("ESTest")
sc = SparkContext(conf=conf)

es_read_conf = {
    "es.host": "vhost",
    "es.nodes": "node",
    "es.port": "443",
    "es.query": '{ "query": { "match_all": {} } }',
    "es.input.json": "true",
    "es.net.https.auth.user": "aws_access_key",
    "es.net.https.auth.pass": "aws_secret_key",
    "es.net.ssl": "true",
    "es.resource": "index/type",
    "es.nodes.wan.only": "true"
}

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf
)
PySpark keeps throwing the error:
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [HEAD] on
[index] failed; server[node:443] returned [403|Forbidden:]
I checked everything, and it all made sense except for the user and pass entries: would the AWS access key and secret key work here? We don't want to use the console user and password here for security purposes. Is there a different way to do the same thing?
I created a bucket in Amazon S3, uploaded my FreePBX.ova, and created permissions, etc. When I run this command:
aws ec2 import-image --cli-input-json "{\"Description\":\"freepbx\", \"DiskContainers\":[{\"Description\":\"freepbx\",\"UserBucket\":{\"S3Bucket\":\"itbucket\",\"S3Key\":\"FreePBX.ova\"}}]}"
I get:
Error parsing parameter 'cli-input-json': Invalid JSON: Extra data: line 1 column 135 - line 1 column 136 (char 134 - 135)
JSON received: {"Description":"freepbx", "DiskContainers":[{"Description":"freepbx","UserBucket":{"S3Bucket":"itbucket","S3Key":"FreePBX.ova"}}]}?
And I can't continue the process. I tried to Google it, with no results.
What is wrong with this command? How can I solve it?
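One way to take shell quoting out of the picture is to pass the JSON via a file, as the first question above does; this is only a sketch (import.json is a hypothetical file name) and does not by itself explain the stray trailing character reported in the error:
aws ec2 import-image --cli-input-json file://import.json
where import.json contains the same document:
{
    "Description": "freepbx",
    "DiskContainers": [
        {
            "Description": "freepbx",
            "UserBucket": {
                "S3Bucket": "itbucket",
                "S3Key": "FreePBX.ova"
            }
        }
    ]
}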
Not sure what I'm getting wrong with my JSON format. I'm just trying to test out the AWS CLI and run aws s3api list-objects --cli-input-json <json_file>.json --profile <profile_name>, where <my_json> is below, but I'm getting:
Error parsing parameter 'cli-input-json': Invalid JSON: Expecting value: line 1 column 1 (char 0)
JSON received: <my_json.json>
{"Bucket": "<bucket_name>","Delimiter": "","EncodingType": "","Marker": "","MaxKeys": 0,"Prefix": "<prefix_name>"}
Instead of:
my_json.json
you have to use file:// before the JSON file name:
file://my_json.json
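For example, a minimal sketch assuming my_json.json sits in the current working directory:
aws s3api list-objects --cli-input-json file://my_json.json --profile <profile_name>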
In my case, I think it needed ASCII encoding (Unicode being the default); adding -Encoding ASCII -NoNewline to my Out-File call made it work.
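For example, a sketch of the Out-File call being described, with $json standing in for whatever variable holds the request JSON:
$json | Out-File -FilePath my_json.json -Encoding ASCII -NoNewline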