I have this code in an AWS Kinesis application:
CREATE OR REPLACE STREAM "OUT_FILE" (
"fechaTS" timestamp,
"celda" varchar(25),
"Field1" DOUBLE,
"Field2" DOUBLE,
"ANOMALY_SCORE" DOUBLE,
"ANOMALY_EXPLANATION" varchar(1024)
);
CREATE OR REPLACE PUMP "PMP_OUT" AS
INSERT INTO "OUT_FILE"
SELECT STREAM
"fechaTS",
"celda",
"Field1",
"Field2",
"ANOMALY_SCORE",
"ANOMALY_EXPLANATION"
FROM TABLE(RANDOM_CUT_FOREST_WITH_EXPLANATION(
CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001"), 300, 512, 8064, 4, true))
WHERE "celda" = 'CELLNUMBER'
;
I just expect the usual anomaly-score output for each input record.
Instead, I get this error message:
Number of numeric attributes should be less than or equal to 30 (Please check the documentation to know the supported numeric SQL types)
The number of numeric attributes I am feeding into the model is just 2. On the other hand, according to the documentation, the supported SQL numeric types are DOUBLE, INTEGER, FLOAT, TINYINT, SMALLINT, REAL, and BIGINT. (I have also tried FLOAT.)
What am I doing wrong?
The solution is to define the variables as DOUBLE (or another accepted type) at the level of the input schema: defining them as DOUBLE in the SQL alone is not enough.
I tried a JSON like this and it worked:
{"ApplicationName": "<myAppName>",
"Inputs": [{
"InputSchema": {
"RecordColumns": [{"Mapping": "fechaTS", "Name": "fechaTS", "SqlType": "timestamp"},
{"Mapping": "celda","Name": "celda","SqlType": "varchar(25)"},
{"Mapping": "Field1","Name": "Field1","SqlType": "DOUBLE"},
{"Mapping": "Field2","Name": "Field2","SqlType": "DOUBLE"},
{"Mapping": "Field3","Name": "Field3","SqlType": "DOUBLE"}],
"RecordFormat": {"MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
"RecordFormatType": "JSON"}
},
"KinesisStreamsInput": {"ResourceARN": "<myInputARN>", "RoleARN": "<myRoleARN>"},
"NamePrefix": "<myNamePrefix>"
}]
}
Additional information: if you save this JSON as myJson.json, you can then issue this command:
aws kinesisanalytics create-application --cli-input-json file://myJson.json
The AWS Command Line Interface (CLI) must be installed and configured beforehand.
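If you prefer to do the same thing from Python, a roughly equivalent boto3 call is sketched below; the ARNs and names are the same placeholders used in the JSON above, not real values.

# Sketch: create the Kinesis Analytics application with boto3 instead of the CLI.
# All <...> values are placeholders mirroring the JSON file above.
import boto3

client = boto3.client("kinesisanalytics")

client.create_application(
    ApplicationName="<myAppName>",
    Inputs=[{
        "NamePrefix": "<myNamePrefix>",
        "KinesisStreamsInput": {"ResourceARN": "<myInputARN>", "RoleARN": "<myRoleARN>"},
        "InputSchema": {
            "RecordFormat": {
                "RecordFormatType": "JSON",
                "MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
            },
            "RecordColumns": [
                {"Mapping": "fechaTS", "Name": "fechaTS", "SqlType": "timestamp"},
                {"Mapping": "celda", "Name": "celda", "SqlType": "varchar(25)"},
                {"Mapping": "Field1", "Name": "Field1", "SqlType": "DOUBLE"},
                {"Mapping": "Field2", "Name": "Field2", "SqlType": "DOUBLE"},
                {"Mapping": "Field3", "Name": "Field3", "SqlType": "DOUBLE"},
            ],
        },
    }],
)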
I'm trying to make a database in AWS Athena.
In S3, I have a CSV file whose contents look like this:
sequence,AccelX,AccelY,AccelZ,GyroX,GyroY,GyroZ,MagX,MagY,MagZ,Time
13, -2012.00, -2041.00, 146.00, -134.00, -696.00, 28163.00,1298.00, -1054.00, -1497.00, 2
14, -1979.00, -2077.00, 251.00, 52.00, -749.00, 30178.00,1286.00, -1036.00, -1502.00, 2
...
and I created this table:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.test1(
sequence bigint,
AccelX float,
AccelY float,
AccelZ float,
GyroX float,
GyroY float,
GyroZ float,
MagX float,
MagY float,
MagZ float,
Time bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/210303/'
TBLPROPERTIES ('has_encrypted_data'='false',
'skip.header.line.count'='1');
Then I query the data:
SELECT * FROM mydb.test1 LIMIT 10
but I get all the data except the last column.
I think the last column (Time) is bigint data, but the SELECT doesn't show what I want.
However, when I change the Time column's data type to string or to float, the data shows up properly.
This problem looks simple, but I don't know why it happens.
Does anyone know about this issue?
Answering my own question:
The problem is the space after each comma. The float columns apparently tolerate the leading space when parsed, but the bigint column does not, so Time comes back empty. Removing the spaces (or trimming the fields) fixes it.
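If you can't regenerate the source file without the spaces, one workaround is to trim every field before re-uploading the CSV. A minimal sketch in Python; the file names are just examples, not from the question:

# Strip leading/trailing whitespace from every field so the bigint column parses.
# "raw.csv" and "clean.csv" are example file names.
with open("raw.csv") as src, open("clean.csv", "w") as dst:
    for line in src:
        fields = [field.strip() for field in line.rstrip("\n").split(",")]
        dst.write(",".join(fields) + "\n")

After uploading the cleaned file back to s3://mybucket/210303/, the Time column should then parse as bigint without changing the table definition.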
I'm trying to copy gz files from my S3 directory to Snowflake.
I created a table in Snowflake (notice that the 'extra' field is defined as VARIANT):
CREATE TABLE accesslog
(
loghash VARCHAR(32) NOT NULL,
logdatetime TIMESTAMP,
ip VARCHAR(15),
country VARCHAR(2),
querystring VARCHAR(2000),
version VARCHAR(15),
partner INTEGER,
name VARCHAR(100),
countervalue DOUBLE PRECISION,
username VARCHAR(50),
gamesessionid VARCHAR(36),
gameid INTEGER,
ingameid INTEGER,
machineuid VARCHAR(36),
extra variant,
ingame_window_name VARCHAR(2000),
extension_id VARCHAR(50)
);
I used this COPY command in Snowflake:
copy INTO accesslog
FROM 's3://XXX'
pattern='.*cds_201911.*'
CREDENTIALS = (
aws_key_id='XXX',
aws_secret_key='XXX')
FILE_FORMAT=(
error_on_column_count_mismatch=false
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
TYPE = CSV
COMPRESSION = GZIP
FIELD_DELIMITER = '\t'
)
ON_ERROR = CONTINUE
I ran it and got this result (I got many error lines; this is an example of one):
(screenshots of the Snowflake error result omitted)
a17589e44ae66ffb0a12360beab5ac12 2019-11-01 00:08:39 155.4.208.0 SE 0.136.0 3337 game_process_detected 0 OW_287d4ea0-4892-4814-b2a8-3a5703ae68f3 e9464ba4c9374275991f15e5ed7add13 765 19f030d4-f85f-4b85-9f12-6db9360d7fcc [{"Name":"file","Value":"wowvoiceproxy.exe"},{"Name":"folder","Value":"C:\\Program Files (x86)\\World of Warcraft\\_retail_\\Utils\\WowVoiceProxy.exe"}]
Can you please tell me what causes this error?
thanks!
I'm guessing:
The 'Error parsing JSON' is certainly related to the extra variant field.
The JSON looks fine, but there are potential problems with the backslashes \.
If you look at the successfully loaded lines, have the backslashes been removed?
This can (maybe) happen if you have STAGE settings involving escape characters.
The \\Utils substring in the Windows path value can then trigger a Unicode decode error, e.g.:
Error parsing JSON: hex digit is expected in \U???????? escape sequence, pos 123
UPDATE:
It turns out you have to turn off escape-character processing by adding the following to the FILE_FORMAT:
ESCAPE_UNENCLOSED_FIELD = NONE
The alternative is to double-quote the fields or to doubly escape the backslashes, e.g. C:\\\\Program Files.
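For reference, here is a minimal sketch of the corrected COPY run through the Python connector; the connection parameters are placeholders, and the only change to the original command is the added ESCAPE_UNENCLOSED_FIELD = NONE:

# Sketch: re-run the COPY with escape-character processing turned off.
# Account, credentials, and the S3 path are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)

copy_sql = """
copy INTO accesslog
FROM 's3://XXX'
pattern='.*cds_201911.*'
CREDENTIALS = (aws_key_id='XXX', aws_secret_key='XXX')
FILE_FORMAT = (
    error_on_column_count_mismatch=false
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    TYPE = CSV
    COMPRESSION = GZIP
    FIELD_DELIMITER = '\\t'
    ESCAPE_UNENCLOSED_FIELD = NONE
)
ON_ERROR = CONTINUE
"""

cur = conn.cursor()
cur.execute(copy_sql)
cur.close()
conn.close()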
I have a simple Datastore kind having the following properties:
id (long)
createdAt (timestamp)
userId (string)
metrics (array of complex objects), each with:
    type of metric
    value of metric
Each stored row in the Datastore might have a different number of metrics as well as different types of metrics.
I have a very specific requirement to query the latest metrics of a user. The problem here is that different rows have different metrics, so I can't just take the most recent row; I need to look into the metrics array to retrieve all the data.
I decided to use projection queries. My idea was to create a projection based on the properties metrics.type and metrics.value, use distinct on metrics.type, and order by createdAt desc.
For a better explanation, a simple example of rows from the Datastore:
1. { "id": 111, "createdAt": "2019-01-01 00:00", "userId" : "user-123", [{ "type" : "metric1", "value" : 123 }, { "type" : "metric2", "value" : 345 }] }
2. { "id": 222, "createdAt": "2019-01-02 00:00", "userId" : "user-123", [{ "type" : "metric3", "value" : 567 }, { "type" : "metric4", "value" : 789 }] }
I expected a projection query with distinct on metrics.type filter to return the following results:
1. "metric1", 123
2. "metric2", 345
3. "metric3", 567
4. "metric4", 789
but actually what query returns is:
1. "metric1", 123
2. "metric2", 123
3. "metric3", 123
4. "metric4", 123
So all metrics have the same value (which is incorrect). Basically, it happens because of an exploded index: Datastore treats metrics.type and metrics.value as two independent arrays, when in fact they form a single array of pairs.
Is there any way to make the projection query return what I expect instead of exploding the index? If not, how can I restructure my data so that it meets my requirements?
The Cloud Datastore documentation specifically warns about your exact issue:
https://cloud.google.com/datastore/docs/concepts/queries#projections_and_array-valued_properties
One option to solve this is to combine both the type and value. So, have a property called "metric" that will have values like "metric1:123", "metric2:345". Then you will be projecting a single array-valued property.
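A minimal sketch of that approach with the Python client library (google-cloud-datastore); the kind name "UserMetrics" and the combined "metric" property are illustrative names, not taken from the question:

# Sketch: store type and value as one combined array property ("metric"),
# so a projection with distinct_on touches a single array-valued property.
import datetime
from google.cloud import datastore

client = datastore.Client()

# Write: one repeated property instead of two parallel arrays.
entity = datastore.Entity(client.key("UserMetrics"))
entity.update({
    "userId": "user-123",
    "createdAt": datetime.datetime(2019, 1, 1, tzinfo=datetime.timezone.utc),
    "metric": ["metric1:123", "metric2:345"],
})
client.put(entity)

# Read: project and distinct on the single combined property.
query = client.query(
    kind="UserMetrics",
    projection=["metric"],
    distinct_on=["metric"],
    order=["metric"],
)
for result in query.fetch():
    metric_type, metric_value = result["metric"].split(":", 1)
    print(metric_type, metric_value)

The trade-off of the concatenated value is that the metric value loses its numeric type, so you have to parse it back out of the string on the client side.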
I'm having a bit of a frustrating issue with a Glue job.
I have a table which I created from a crawler. It has gone through some CSV data and created a schema. Some elements of the schema need to be modified, e.g. converting numbers to strings and applying a header.
I seem to be running into some problems here: the schema for some fields appears to have been picked up as a double. When I try to convert this into a string, which is what I require, it includes some empty precision, e.g. 1234 --> 1234.0.
The mapping code I have is something like:
applymapping1 = ApplyMapping.apply(
frame = datasource0,
mappings = [
("col1","double","first_column_name","string"),
("col2","double","second_column_name","string")
],
transformation_ctx = "applymapping1"
)
And the resulting table I get after I've crawled the data is something like:
first_column_name second_column_name
1234.0 4321.0
5678.0 8765.0
as opposed to
first_column_name second_column_name
1234 4321
5678 8765
Is there a good way to work around this? I've tried changing the schema in the table that is initially created by the crawler to a bigint as opposed to a double, but when I update the mapping code to ("col1","bigint","first_column_name","string") the table just ends up being null.
Just a little correction to botchniaque's answer: you actually have to do BOTH ResolveChoice and then ApplyMapping to ensure the correct type conversion.
ResolveChoice will make sure you have just one type in your column. If you skip this step and the ambiguity is not resolved, the column will become a struct and Redshift will end up showing it as null.
So apply ResolveChoice first to make sure all your data is of one type (int, for instance):
df2 = ResolveChoice.apply(datasource0, specs = [("col1", "cast:int"), ("col2", "cast:int")])
Finally, use ApplyMapping to change the type to what you want:
df3 = ApplyMapping.apply(
frame = df2,
mappings = [
("col1","int","first_column_name","string"),
("col2","int","second_column_name","string")
],
transformation_ctx = "applymapping1")
Hope this helps (:
Maybe your data really is of type double (some values may have fractions), and that's why changing the type results in the data being turned to null. It's also no wonder that when you change the type of a double field to string it gets serialized with a decimal component: it's still a double, just printed.
Have you tried explicitly casting the values to integer?
df2 = ResolveChoice.apply(datasource0, specs = [("col1", "cast:int"), ("col2", "cast:int")])
And then cast to string:
df3 = ResolveChoice.apply(df2, specs = [("col1", "cast:string"), ("col2", "cast:string")])
Or use ApplyMapping to change the type and rename the columns, as you did above:
df3 = ApplyMapping.apply(
frame = df2,
mappings = [
("col1","int","first_column_name","string"),
("col2","int","second_column_name","string")
],
transformation_ctx = "applymapping1"
)
I have the inference graph constructed in my trained model and would like to use batch prediction to predict many records. How can I specify the inputs in the input file(s)?
Cloud ML supports three data formats so far. One is a text file, each line of which is a record you want to predict over. The second and third formats are TFRecord files; both uncompressed and gzip-compressed files are supported. A TFRecord file is a container for storing bytes, typically binary data such as serialized Example protos. These bytes get fed directly into the prediction graph. You must specify the format in the data_format field (TEXT, TF_RECORD, TF_RECORD_GZIP) of the request.
For the text format, each line is either a JSON object or a UTF-8 string. In the case of the former, the keys are input tensor names and the values are the data that will be fed into the inference graph. If your graph has only one input tensor, you can skip the JSON and just save newline-delimited strings.
Here are some examples:
You have four input tensors, namely index, height, name, and image
{"index": 100, "height": 5.5, "name": "Alice", "image": [0.0, 0.0, 0.123, 0.17, 0, 0]}
{"index": 101, "height": 5.8, "name": "John", "image": [0.0, 0.21, 0.09, 0.5, 0, 0]}
...
You have one string input tensor. No need to specify the name.
"This is a string input"
"That is another string input"
...
You have one tensor with scalar type. No need to specify the name.
1445
425
3412
...
You have one input tensor, which is a numpy array. No need to specify the name.
[0, 3.14, 2.718, 0.0, 1.414]
[1.618, 299.7, 8.314, 0.0, 0.0]
...
Note that the names in the multiple-tensor input case must match the aliases defined in the inputs collection of the inference graph.
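For example, a text-format input file for the four-tensor case above could be generated like this (a sketch; the aliases index, height, name, and image are the ones assumed in the example):

# Sketch: write a newline-delimited JSON input file for batch prediction.
# The keys must match the input tensor aliases defined in the inference graph.
import json

records = [
    {"index": 100, "height": 5.5, "name": "Alice", "image": [0.0, 0.0, 0.123, 0.17, 0, 0]},
    {"index": 101, "height": 5.8, "name": "John", "image": [0.0, 0.21, 0.09, 0.5, 0, 0]},
]

with open("prediction_input.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")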