AWS Athena query int column, but response empty - amazon-athena

I'm trying to make a database in AWS Athena.
In S3 I have a CSV file whose contents look like this:
sequence,AccelX,AccelY,AccelZ,GyroX,GyroY,GyroZ,MagX,MagY,MagZ,Time
13, -2012.00, -2041.00, 146.00, -134.00, -696.00, 28163.00,1298.00, -1054.00, -1497.00, 2
14, -1979.00, -2077.00, 251.00, 52.00, -749.00, 30178.00,1286.00, -1036.00, -1502.00, 2
...
and I created this table:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.test1(
sequence bigint,
AccelX float,
AccelY float,
AccelZ float,
GyroX float,
GyroY float,
GyroZ float,
MagX float,
MagY float,
MagZ float,
Time bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/210303/'
TBLPROPERTIES ('has_encrypted_data'='false',
'skip.header.line.count'='1');
Then I queried the data:
SELECT * FROM mydb.test1 LIMIT 10
but I get all the data except the last column.
I think the last column (Time) should be bigint, but the SELECT doesn't return its values.
However, if I change the Time column's data type to string or to float, the data shows up properly.
This problem looks simple, but I don't know why it happens.
Does anyone know about this issue?

Answering myself: the problem is the space after each comma.
With that leading space, a value like " 2" cannot be parsed as bigint, so the column comes back empty; removing the spaces from the CSV fixes it.
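A minimal sketch of that fix in Python, assuming the CSV is rewritten locally and re-uploaded to S3 (the file names are hypothetical):

import csv

# Rewrite the CSV so the space that follows each comma is dropped,
# letting Athena parse the numeric columns again.
with open("sensor_raw.csv", newline="") as src, \
     open("sensor_clean.csv", "w", newline="") as dst:
    reader = csv.reader(src, skipinitialspace=True)  # ignore whitespace after ','
    writer = csv.writer(dst)
    writer.writerows(reader)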

Related

Issue with AWS Kinesis SQL - Random Cut Forest algorithm

I have this code in an AWS Kinesis application:
CREATE OR REPLACE STREAM "OUT_FILE" (
"fechaTS" timestamp,
"celda" varchar(25),
"Field1" DOUBLE,
"Field2" DOUBLE,
"ANOMALY_SCORE" DOUBLE,
"ANOMALY_EXPLANATION" varchar(1024)
);
CREATE OR REPLACE PUMP "PMP_OUT" AS
INSERT INTO "OUT_FILE"
SELECT STREAM
"fechaTS",
"celda",
"Field1",
"Field2",
"ANOMALY_SCORE",
"ANOMALY_EXPLANATION"
FROM TABLE(RANDOM_CUT_FOREST_WITH_EXPLANATION(
CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001"), 300, 512, 8064, 4, true))
WHERE "celda" = 'CELLNUMBER'
;
I just expect the usual anomaly-score output for each input record.
Instead, I get this error message:
Number of numeric attributes should be less than or equal to 30 (Please check the documentation to know the supported numeric SQL types)
The number of numeric attributes I am feeding into the model is just 2. On the other hand, the supported SQL numeric types, according to the documentation, are DOUBLE, INTEGER, FLOAT, TINYINT, SMALLINT, REAL, and BIGINT. (I have also tried FLOAT.)
What am I doing wrong?
The solution is to define the variables as DOUBLE (or another accepted type) at the level of the input schema: defining them as DOUBLE in the SQL alone is not enough.
I tried a JSON like this and it worked:
{"ApplicationName": "<myAppName>",
"Inputs": [{
"InputSchema": {
"RecordColumns": [{"Mapping": "fechaTS", "Name": "fechaTS", "SqlType": "timestamp"},
{"Mapping": "celda","Name": "celda","SqlType": "varchar(25)"},
{"Mapping": "Field1","Name": "Field1","SqlType": "DOUBLE"},
{"Mapping": "Field2","Name": "Field2","SqlType": "DOUBLE"},
{"Mapping": "Field3","Name": "Field3","SqlType": "DOUBLE"}],
"RecordFormat": {"MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
"RecordFormatType": "JSON"}
},
"KinesisStreamsInput": {"ResourceARN": "<myInputARN>", "RoleARN": "<myRoleARN>"},
"NamePrefix": "<myNamePrefix>"
}]
}
Additional information: if you save this JSON in myJson.json, then issue this command:
aws kinesisanalytics create-application --cli-input-json file://myJson.json
The AWS Command Line Interface (CLI) must already be installed and configured.
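If you prefer Python over the CLI, here is a hedged boto3 sketch that reuses the same JSON file (myJson.json is the file saved above; the placeholders inside it are still assumptions):

import json
import boto3

# create-application via the API; the JSON saved for --cli-input-json uses
# the same field names the boto3 call expects.
kinesisanalytics = boto3.client("kinesisanalytics")

with open("myJson.json") as f:
    payload = json.load(f)

kinesisanalytics.create_application(
    ApplicationName=payload["ApplicationName"],
    Inputs=payload["Inputs"],
)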

Convert a number column into a time format in Power BI

I'm looking for a way to convert a decimal number into a valid HH:mm:ss format.
I'm importing data from an SQL database.
One of the columns in my database is labelled Actual Start Time.
The values in my database are stored in the following decimal format:
73758 // which translates to 07:27:58
114436 // which translates to 11:44:36
I cannot simply convert this Actual Start Time column into a Time format in my Power BI import as it returns errors for some values, saying it doesn't recognise 73758 as a valid 'time'. It needs to have a leading zero for cases such as 73758.
To combat this, I created a new Text column with the following code to append a leading zero:
Column = FORMAT([Actual Start Time], "000000")
This returns the following results:
073758
114436
-- which is perfect. Exactly what I needed.
I now want to convert these values into a Time.
Simply changing the data type field to Time doesn't do anything, returning:
Cannot convert value '073758' of type Text to type Date.
So I created another column with the following code:
Column 2 = FORMAT(TIME(LEFT([Column], 2), MID([Column], 3, 2), RIGHT([Column], 2)), "HH:mm:ss")
To pass the values 07, 37 and 58 into a TIME format.
This returns the following:
| Actual Start Time | Column | Column 2 |
|-------------------|--------|----------|
| 73758             | 073758 | 07:37:58 |
| 114436            | 114436 | 11:44:36 |
Which is what I wanted, but is there any other way of doing this? Ideally I want to do it in one step, without creating additional columns.
You could use a variable as suggested by Aldert, or you can replace the reference to [Column] with the FORMAT function directly:
Time Format = FORMAT(
    TIME(
        LEFT(FORMAT([Actual Start Time], "000000"), 2),
        MID(FORMAT([Actual Start Time], "000000"), 3, 2),
        RIGHT([Actual Start Time], 2)),
    "hh:mm:ss")
Edit:
If you want to do this in Power Query, you can create a custom column with the following calculation:
Time.FromText(
    if Text.Length([Actual Start Time]) = 5
    then Text.PadStart([Actual Start Time], 6, "0")
    else [Actual Start Time])
Once this column is created you can drop the old column, so that you only have one time column in the data. Hope this helps.
I show the variable approach on purpose, so you can reuse the concept for more complex measures in the future.
TimeC =
VAR timeStr = FORMAT([Actual Start Time], "000000")
RETURN FORMAT(TIME(LEFT(timeStr, 2), MID(timeStr, 3, 2), RIGHT(timeStr, 2)), "HH:mm:ss")
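Purely as an illustration of the same pad-and-slice logic outside Power BI, here is a small Python sketch:

from datetime import datetime

def to_time(raw: int) -> str:
    # 73758 -> "073758" -> "07:37:58"
    padded = str(raw).zfill(6)
    return datetime.strptime(padded, "%H%M%S").strftime("%H:%M:%S")

print(to_time(73758))   # 07:37:58
print(to_time(114436))  # 11:44:36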

Snowflake - getting 'Error parsing JSON' while using the Copy command from S3 to snowflake

I'm trying to copy .gz files from my S3 directory to Snowflake.
I created a table in Snowflake (notice that the extra field is defined as VARIANT):
CREATE TABLE accesslog
(
loghash VARCHAR(32) NOT NULL,
logdatetime TIMESTAMP,
ip VARCHAR(15),
country VARCHAR(2),
querystring VARCHAR(2000),
version VARCHAR(15),
partner INTEGER,
name VARCHAR(100),
countervalue DOUBLE PRECISION,
username VARCHAR(50),
gamesessionid VARCHAR(36),
gameid INTEGER,
ingameid INTEGER,
machineuid VARCHAR(36),
extra variant,
ingame_window_name VARCHAR(2000),
extension_id VARCHAR(50)
);
I used this COPY command in Snowflake:
copy INTO accesslog
FROM s3://XXX
pattern='.*cds_201911.*'
CREDENTIALS = (
aws_key_id='XXX',
aws_secret_key='XXX')
FILE_FORMAT=(
error_on_column_count_mismatch=false
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
TYPE = CSV
COMPRESSION = GZIP
FIELD_DELIMITER = '\t'
)
ON_ERROR = CONTINUE
I ran it and got this result (I got many error lines; this is one example):
a17589e44ae66ffb0a12360beab5ac12 2019-11-01 00:08:39 155.4.208.0 SE 0.136.0 3337 game_process_detected 0 OW_287d4ea0-4892-4814-b2a8-3a5703ae68f3 e9464ba4c9374275991f15e5ed7add13 765 19f030d4-f85f-4b85-9f12-6db9360d7fcc [{"Name":"file","Value":"wowvoiceproxy.exe"},{"Name":"folder","Value":"C:\\Program Files (x86)\\World of Warcraft\\_retail_\\Utils\\WowVoiceProxy.exe"}]
Can you please tell me what causes this error?
Thanks!
I'm guessing here:
The 'Error parsing JSON' is certainly related to the extra VARIANT field.
The JSON itself looks fine, but the backslashes (\) are a likely problem.
If you look at the successfully loaded lines, have the backslashes been removed?
This can (maybe) happen if you have STAGE settings involving escape characters.
The \\Utils substring in the Windows path value can then trigger a Unicode decode error, eg.
Error parsing JSON: hex digit is expected in \U???????? escape sequence, pos 123
UPDATE:
It turns out you have to turn off escape-character processing by adding the following to the FILE_FORMAT:
ESCAPE_UNENCLOSED_FIELD = NONE
The alternative is to double-quote the fields or to double the backslash escapes, e.g. C:\\\\Program Files.
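For completeness, here is a hedged sketch of the corrected COPY, run here through the Snowflake Python connector (the connection details, bucket, and credentials are placeholders; the only change to the question's command is the added ESCAPE_UNENCLOSED_FIELD = NONE):

import snowflake.connector

# Connect with placeholder credentials; fill in your own account details.
conn = snowflake.connector.connect(user="XXX", password="XXX", account="XXX",
                                   database="XXX", schema="PUBLIC")

# Same COPY as in the question, with unenclosed-field escape processing
# turned off so backslashes inside the JSON column survive parsing.
copy_sql = r"""
copy INTO accesslog
FROM s3://XXX
pattern='.*cds_201911.*'
CREDENTIALS = (aws_key_id='XXX', aws_secret_key='XXX')
FILE_FORMAT = (
    error_on_column_count_mismatch = false
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    TYPE = CSV
    COMPRESSION = GZIP
    FIELD_DELIMITER = '\t'
    ESCAPE_UNENCLOSED_FIELD = NONE
)
ON_ERROR = CONTINUE
"""

conn.cursor().execute(copy_sql)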

AWS Glue pushdown predicate not working properly

I'm trying to optimize my Glue/PySpark job by using push down predicates.
start = date(2019, 2, 13)
end = date(2019, 2, 27)
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")
relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
querydatetime BETWEEN '%s' AND '%s'
AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))
However, it appears that Glue still attempts to read data outside the specified date range:
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Notice that querydatetime=2019-03-01 and querydatetime=2019-03-10 are outside the specified range of 2019-02-13 to 2019-02-27. Is that why the next line says "aborting HTTP connection"? It goes on to say "This is likely an error and may result in sub-optimal behavior", so is something wrong?
I wonder if the problem is that BETWEEN or IN is not supported inside the predicate?
The table's CREATE DDL:
CREATE EXTERNAL TABLE `flights`(
`id` string,
`querytaskid` string,
`queryoriginplace` string,
`queryoutbounddate` string,
`queryinbounddate` string,
`querycabinclass` string,
`querycurrency` string,
`agent` string,
`quoteageinminutes` string,
`price` string,
`outboundlegid` string,
`inboundlegid` string,
`outdeparture` string,
`outarrival` string,
`outduration` string,
`outjourneymode` string,
`outstops` string,
`outcarriers` string,
`outoperatingcarriers` string,
`numberoutstops` string,
`numberoutcarriers` string,
`numberoutoperatingcarriers` string,
`indeparture` string,
`inarrival` string,
`induration` string,
`injourneymode` string,
`instops` string,
`incarriers` string,
`inoperatingcarriers` string,
`numberinstops` string,
`numberincarriers` string,
`numberinoperatingcarriers` string)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://pinfare-glue/flights/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-parquet',
'averageRecordSize'='19',
'classification'='parquet',
'compressionType'='none',
'objectCount'='623609',
'recordCount'='4368434222',
'sizeKey'='86509997099',
'typeOfData'='file')
One of the issues I can see with the code is that you are using "today" instead of "end" in the BETWEEN clause. Though I don't see the today variable declared anywhere in your code, I am assuming it has been initialized with today's date.
In that case the range is different from what you intend, and the partitions being read by Glue/Spark are actually correct.
In order to push down your condition, you need to change the order of columns in the PARTITIONED BY clause of the table definition.
A condition with an IN predicate on the first partition column cannot be pushed down the way you are expecting.
Let me know if it helps.
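A minimal sketch of the first fix, reusing the question's variables (start, end, arr, glueContext) and simply substituting end for today:

# Build the predicate from the intended end date rather than today's date.
push_down_predicate = (
    "querydatetime BETWEEN '%s' AND '%s' AND querydestinationplace IN (%s)"
    % (
        start.strftime("%Y-%m-%d"),
        end.strftime("%Y-%m-%d"),  # was today.strftime("%Y-%m-%d")
        ",".join(str(s) for s in arr),
    )
)

flightsGDF = glueContext.create_dynamic_frame.from_catalog(
    database="xxx",
    table_name="flights",
    transformation_ctx="flights",
    push_down_predicate=push_down_predicate,
)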
Pushdown predicates on a Glue DynamicFrame work fine with BETWEEN as well as with an IN clause,
as long as you have the correct sequence of partition columns defined in the table definition and in the query.
I have a table with three levels of partitions:
s3://bucket/flights/year=2018/month=01/day=01 -> 50 records
s3://bucket/flights/year=2018/month=02/day=02 -> 40 records
s3://bucket/flights/year=2018/month=03/day=03 -> 30 records
Read the data into a DynamicFrame:
ds = glueContext.create_dynamic_frame.from_catalog(
database = "abc",table_name = "pqr", transformation_ctx = "flights",
push_down_predicate = "(year == '2018' and month between '02' and '03' and day in ('03'))"
)
ds.count()
Output:
30 records
So you will get the correct results if the sequence of columns is correctly specified. Also note that you need to quote the values in the IN clause, i.e. IN ('%s').
Partition columns in table:
querydestinationplace string,
querydatetime string
Data read in DynamicFrame:
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights",
push_down_predicate=
"""querydestinationplace IN ('%s') AND
querydatetime BETWEEN '%s' AND '%s'
"""
%
( ",".join(map(lambda s: str(s), arr)),
start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")))
Try defining the end like this:
start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))
# Set your push_down_predicate variable
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
#pd_predicate = "querydatetime between '" + start + "' AND '" + end + "'" # Or this one?
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
database = "xxx"
, table_name = "flights"
, transformation_ctx="flights"
, push_down_predicate=pd_predicate)
The pd_predicate will be a string that will work as a push_down_predicate.
Here is a nice read about it if you like.
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

AWS Glue ApplyMapping from double to string

I'm having a bit of a frustrating issue with a Glue job.
I have a table which I created from a crawler. It has gone through some CSV data and created a schema. Some elements of the schema need to be modified, e.g. converting numbers to strings and applying a header.
I seem to be running into some problems here: the schema for some fields appears to have been picked up as a double. When I try to convert this into a string, which is what I require, it includes an empty decimal part, e.g. 1234 --> 1234.0.
The mapping code I have is something like:
applymapping1 = ApplyMapping.apply(
frame = datasource0,
mappings = [
("col1","double","first_column_name","string"),
("col2","double","second_column_name","string")
],
transformation_ctx = "applymapping1"
)
And the resulting table I get after I've crawled the data is something like:
first_column_name second_column_name
1234.0 4321.0
5678.0 8765.0
as opposed to
first_column_name second_column_name
1234 4321
5678 8765
Is there a good way to work around this? I've tried changing the schema in the table that is initially created by the crawler to a bigint as opposed to a double, but when I update the mapping code to ("col1","bigint","first_column_name","string") the table just ends up being null.
Just a little correction to botchniaque's answer: you actually have to do BOTH ResolveChoice and then ApplyMapping to ensure the correct type conversion.
ResolveChoice will make sure you have just one type in your column. If you do not take this step and the ambiguity is not resolved, the column will become a struct and Redshift will show it as null in the end.
So apply ResolveChoice to make sure all your data is one type (int, for instance):
df2 = ResolveChoice.apply(datasource0, specs = [("col1", "cast:int"), ("col2", "cast:int")])
Finally, use ApplyMapping to change the type to what you want:
df3 = ApplyMapping.apply(
frame = df2,
mappings = [
("col1","int","first_column_name","string"),
("col2","int","second_column_name","string")
],
transformation_ctx = "applymapping1")
Hope this helps (:
Maybe your data really is of type double (some values may have fractions), and that's why changing the type results in the data being turned to null. It's also no wonder that when you change the type of a double field to string it gets serialized with a decimal component: it's still a double, just printed.
Have you tried explicitly casting the values to integer?
df2 = ResolveChoice.apply(datasource0, specs = [("col1", "cast:int"), ("col2", "cast:int")])
And then cast to string:
df3 = ResolveChoice.apply(df2, specs = [("col1", "cast:string"), ("col2", "cast:string")])
Or use ApplyMapping to change the type and rename, as you did above:
df3 = ApplyMapping.apply(
frame = df2,
mappings = [
("col1","int","first_column_name","string"),
("col2","int","second_column_name","string")
],
transformation_ctx = "applymapping1"
)