Column names containing dots in Spectrum - amazon-web-services

I created a customers table with columns named account_id.cust_id, account_id.ord_id, and so on.
My create external table query was as follows:
CREATE EXTERNAL TABLE spectrum.customers
(
"account_id.cust_id" numeric,
"account_id.ord_id" numeric
)
row format delimited
fields terminated by '^'
stored as textfile
location 's3://awsbucketname/test/';
SELECT "account_id.cust_id" FROM spectrum.customers limit 100
and I get the following error:
Invalid Operation: column account_id.cust_id does not exists in
customers.
Is there any way or syntax to write column names like account_id.cust_id (text.text) while creating the table or while writing the select query?
Please help.
PS: Single quotes and backticks don't work either.

Related

Column does not exist AWS Timestream Query error

I am trying to apply a WHERE clause on a DIMENSION of the AWS Timestream records. However, I get the error: Column does not exist.
Here is my table schema:
[screenshots: the table schema and the table measures]
First, here is all the sample data I put in the table:
SELECT username, time, manual_usage
FROM "meter-reading"."meter-metrics"
ORDER BY time DESC
LIMIT 4
The result: [screenshot]
What I want to do is query and filter the records by the Dimension ("username" specifically).
SELECT *
FROM "meter-reading"."meter-metrics"
WHERE measure_name = "OnceADay"
ORDER BY time DESC LIMIT 10
Then I got the Error: Column 'OnceADay' does not exist
I searched the quotas for Dimension names and checked my schema for errors:
https://docs.aws.amazon.com/timestream/latest/developerguide/ts-limits.html#limits.naming
https://docs.aws.amazon.com/timestream/latest/developerguide/ts-limits.html#limits.system_identifier
But I didn't find that my "username" dimension violates any of the above rules.
I also checked some other example queries from an AWS blog post, where the author uses a WHERE clause on a Dimension filter normally:
https://aws.amazon.com/blogs/database/effective-queries-for-common-query-patterns-in-amazon-timestream/
I figured it out after trying the sample code. It turns out it was a silly mistake: using single quotes (') instead of double quotes (") around the value solved my problem. In Timestream's SQL, double quotes denote identifiers such as column names, while single quotes denote string literals, so "OnceADay" was being treated as a column reference.
SELECT *
FROM "meter-reading"."meter-metrics"
WHERE username = 'OnceADay'
ORDER BY time DESC LIMIT 10

How to RENAME struct/array nested columns using ALTER TABLE in BigQuery?

Suppose we have the following table in BigQuery:
CREATE TABLE sample_dataset.sample_table (
  id INT,
  struct_geo STRUCT<
    country STRING,
    state STRING,
    city STRING
  >,
  array_info ARRAY<
    STRUCT<
      key STRING,
      value STRING
    >
  >
);
I want to rename the columns inside the STRUCT and the ARRAY using an ALTER TABLE command. It's possible to follow the Google documentation available here for normal ("non-nested") columns:
ALTER TABLE sample_dataset.sample_table
RENAME COLUMN id TO str_id
But when I try to run the same command for nested columns I got errors from BigQuery.
Running the command for a column inside a STRUCT gives me the following message:
ALTER TABLE sample_dataset.sample_table
RENAME COLUMN `struct_geo.country` TO `struct_geo.str_country`
Error: ALTER TABLE RENAME COLUMN not found: struct_geo.country.
The exact same message appears when I run the same statement, but targeting a column inside an ARRAY:
ALTER TABLE sample_dataset.sample_table
RENAME COLUMN `array_info.str_key` TO `array_info.str_key`
Error: ALTER TABLE RENAME COLUMN not found: array_info.str_key
I got stuck since the BigQuery documentation about nested columns (available here) lacks examples of ALTER TABLE statements and refers directly to the default documentation for non-nested columns.
I understand that I can rename the columns by simply creating a new table using a CREATE TABLE new_table AS SELECT ... and then passing the new column names as aliases, but this would run a query over the whole table, which I'd rather avoid since my original table weighs way over 10TB...
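For clarity, a minimal sketch of that CTAS workaround (the new table name and the extra renamed field names are just illustrative) would be something like:
CREATE TABLE sample_dataset.sample_table_renamed AS
SELECT
  id AS str_id,
  STRUCT(
    struct_geo.country AS str_country,
    struct_geo.state AS str_state,
    struct_geo.city AS str_city
  ) AS struct_geo,
  ARRAY(
    SELECT AS STRUCT
      info.key AS str_key,
      info.value AS str_value
    FROM UNNEST(array_info) AS info
  ) AS array_info
FROM sample_dataset.sample_table;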
Thanks in advance for any tips or solutions!

Regex for Parsing vertical CSV in Athena

So, I've been trying to load CSVs from an S3 bucket into Athena. However, the way the CSVs are designed looks like the following:
ns=2;s=A_EREG.A_EREG.A_PHASE_PRESSURE,102.19468,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE,50.0,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_CURRENT,15.919731,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE,0.22070877,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.ACTIVE_PWR,0.0,12/12/19 00:00:01.2144275 GMT
The CSV is just one record. Each column of the record has a value associated with it, which sits between the two commas (between the name and the timestamp); that value is what I am trying to capture.
I've been trying to parse it using the Regex SerDe, and I got to this regular expression:
((?<=\,).*?(?=\,))
demo
I want the output of the above to be:
col_a col_b col_c col_d col_e
102.19468 50.0 15.919731 0.22070877 0.0
My DDL query looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS
(...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "\(?<=\,).*?(?=\,)"
) LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false');
The table creation query above works successfully, but when I try to preview my table I get the following error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
I am fairly new to Hive and Regex so I don't know what is going on. Can someone help me out here?
Thanks in advance,
BR
One column in a Hive table corresponds to one capturing group in the regex. If you want to select a single column containing everything between the commas, then this will work:
'.*,(.*),.*'
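A minimal sketch of the corresponding DDL, assuming a hypothetical table name (plc_readings) and the S3 location from the question, with that single capturing group mapped to one column:
CREATE EXTERNAL TABLE plc_readings (
  v double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'input.regex' = '.*,(.*),.*'
)
LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false');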
Athena serdes require that each record in the input is a single line. Multiline records are not supported.
What you can do instead is to create a table which maps each line in your data to a row in a table, and use a view to pivot the rows that belong together into a single row.
I'm going to assume that the ns field at the start of the lines is an ID; if not, I assume there is something else that identifies which lines belong together that you can use.
I used your demo to create a regex that matched all the fields of each line and came up with ns=(\d);s=([^,]+),([^,]+),(.+) (see https://regex101.com/r/HnjnxK/5).
CREATE EXTERNAL TABLE my_data (
ns string,
s string,
v double,
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "ns=(\\d);s=([^,]+),([^,]+),(.+)"
)
LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false')
Apologies if the regex isn't correctly escaped, I'm just typing this into Stack Overflow.
This table has four columns, corresponding to the four fields in each line. I've named them ns and s after the data, v for the numerical value, and dt for the date. The date needs to be typed as a string since it's not in a format Athena natively understands.
Assuming that ns is a record identifier, you can then create a view that pivots rows with different values of s into columns. You will have to adapt this to the shape you want; the following is of course just a demonstration:
CREATE VIEW my_pivoted_data AS
WITH data_aggregated_by_ns AS (
SELECT
ns,
map_agg(s, v) AS s_and_v
FROM my_data
GROUP BY ns
)
SELECT
ns,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_PRESSURE') AS phase_pressure,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE') AS phase_ref_signal_to_valve,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_CURRENT') AS phase_sec_current,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE') AS phase_sec_voltage,
element_at(s_and_v, 'A_EREG.A_EREG.ACTIVE_PWR') AS active_pwr
FROM data_aggregated_by_ns
Apologies if there are syntax errors in the SQL above.
What this does is create a view (but start by trying it out as a query, using everything from WITH onwards) which has two parts to it.
The first part, the first SELECT, produces rows that aggregate all the s and v values for each value of ns into a map. Try running this query by itself to see how the result looks.
The second part, the second SELECT, uses the result of the first part and just picks out the different v values for the values of s that I chose from your question, using the aggregated map.
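As a purely illustrative usage sketch, the view can then be queried like any table (column names as defined above):
SELECT ns, phase_pressure, active_pwr
FROM my_pivoted_data
WHERE active_pwr > 0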

Does AWS Athena support SequenceFiles?

Has anyone tried creating an AWS Athena table on top of SequenceFiles? As per the documentation it looks like it is possible. I was able to execute the CREATE TABLE statement below:
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
STORED AS sequencefile
location 's3://bucket/sequencefile/';
The statement executed successfully, but when I try to read data from the table it throws the error below:
Your query has the following error(s):
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://viewershipforneo4j/2017-09-26/000030_0 (offset=372128055, length=62021342) using org.apache.hadoop.mapred.SequenceFileInputFormat: s3://viewershipforneo4j/2017-09-26/000030_0 not a SequenceFile
This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 9f0983b0-33da-4686-84a3-91b14a39cd09.
The SequenceFiles themselves are valid; the issue is that no delimiter is defined, i.e. the row format delimited fields terminated by clause is missing.
If, in your case, tab is the column delimiter and each record is on its own line, it would be:
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
row format delimited fields terminated by '\t'
STORED AS sequencefile
location 's3://bucket/sequencefile/';

Redshift: create external table returns 0 rows

I have a text file test.txt located at s3://myBucket/ (see the sample below), from which I want to create an external table in Redshift.
When I select from the table, it returns 0 rows.
1,One
2,Two
3,Three
create external table spectrum_schema.test(
Id integer,
Name varchar(255))
row format delimited
fields terminated by ','
stored as textfile
location 's3://myBucket/';
select * from spectrum_schema.test //returns 0 rows
Any suggestions on how I can fix this?
I fixed this by moving the file to s3://myBucket/test
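Presumably the external table's location then also has to point at that prefix rather than the bucket root; a sketch of the adjusted DDL under that assumption:
create external table spectrum_schema.test(
Id integer,
Name varchar(255))
row format delimited
fields terminated by ','
stored as textfile
location 's3://myBucket/test/';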