How to use column names with spaces with AWS Athena Iceberg tables?

I have created an iceberg table like this:
CREATE TABLE IF NOT EXISTS my_table(`Date` date, `Table Name` string, `My Count` bigint)
LOCATION 's3://somepath/my_table'
TBLPROPERTIES ('table_type'= 'ICEBERG')
which worked fine. Then I attempted a couple of INSERT statements like this:
insert into my_table ("Date", "Table Name", "My Count") values (DATE'2022-02-09', 'address', 1000)
insert into my_table values (DATE'2022-02-09', 'address', 1000)
and both failed with the same error:
ICEBERG_WRITER_OPEN_ERROR: Error creating Parquet file. If a data
manifest file was generated at
's3://my_path/output/12875dbd-2e57-409f-9788-c223a0848d58-manifest.csv',
you may need to manually clean the data from locations specified in
the manifest. Athena will not delete data in your account. This query
ran against the "my_db" database, unless qualified by the
query. Please post the error message on our forum or contact customer
support with Query Id: 12875dbd-2e57-409f-9788-c223a0848d58
Also, if I look at the metadata.json file associated with the table, I can see:
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"fields" : [ {
"id" : 1,
"name" : "date",
"required" : false,
"type" : "date"
}, {
"id" : 2,
"name" : "table name",
"required" : false,
"type" : "string"
}, {
"id" : 3,
"name" : "my count",
"required" : false,
"type" : "long"
} ]
} ],
This suggests that the capitalisation is not fully supported. Any ideas?
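One hedged workaround (a sketch, not a confirmed fix; the table name my_table_fixed is hypothetical) is to keep snake_case names in the Iceberg schema and restore the display names with aliases at query time:
-- Hypothetical workaround: snake_case column names in the Iceberg schema,
-- display names restored via aliases when querying.
CREATE TABLE IF NOT EXISTS my_table_fixed (`date` date, `table_name` string, `my_count` bigint)
LOCATION 's3://somepath/my_table_fixed'
TBLPROPERTIES ('table_type' = 'ICEBERG');

INSERT INTO my_table_fixed VALUES (DATE '2022-02-09', 'address', 1000);

SELECT "date" AS "Date", table_name AS "Table Name", my_count AS "My Count"
FROM my_table_fixed;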

Related

GCP - BigTable to BigQuery

I am trying to query Bigtable data in BigQuery using the external table configuration. I have the following SQL command that I am working with. However, I get an error stating invalid bigtable_options for format CLOUD_BIGTABLE.
The code works when I remove the columns field. For context, the raw data looks like this (running the query without the columns field):
rowkey | aAA.column.name | aAA.column.cell.value
4271   | xxx             | 30
       | yyy             | 25
But I would like the table to look like this:
rowkey | xxx
4271   | 30
CREATE EXTERNAL TABLE dev_test.telem_test
OPTIONS (
  format = 'CLOUD_BIGTABLE',
  uris = ['https://googleapis.com/bigtable/projects/telem/instances/dbb-bigtable/tables/db1'],
  bigtable_options =
  """
  {
    bigtableColumnFamilies: [
      {
        "familyId": "aAA",
        "type": "string",
        "encoding": "string",
        "columns": [
          {
            "qualifierEncoded": string,
            "qualifierString": string,
            "fieldName": "xxx",
            "type": string,
            "encoding": string,
            "onlyReadLatest": false
          }
        ]
      }
    ],
    readRowkeyAsString: true
  }
  """
);
I think you left the default value in place for each column attribute: string is the type of the value you need to provide, not the raw value itself, so it is not valid JSON here. Try adding double quotes like this:
CREATE EXTERNAL TABLE dev_test.telem_test
OPTIONS (
  format = 'CLOUD_BIGTABLE',
  uris = ['https://googleapis.com/bigtable/projects/telem/instances/dbb-bigtable/tables/db1'],
  bigtable_options =
  """
  {
    bigtableColumnFamilies: [
      {
        "familyId": "aAA",
        "type": "string",
        "encoding": "string",
        "columns": [
          {
            "qualifierEncoded": "string",
            "qualifierString": "string",
            "fieldName": "xxx",
            "type": "string",
            "encoding": "string",
            "onlyReadLatest": false
          }
        ]
      }
    ],
    readRowkeyAsString: true
  }
  """
);
The false is correct because that attribute is a boolean. The encoding "string" will also be wrong; use a real encoding type.
The error here is in this part:
bigtableColumnFamilies: [
It should be:
"columnFamilies": [
As for adding columns, for a string you only need to add:
"columns": [{
  "qualifierString": "name_of_column_from_bt",
  "fieldName": "if_i_want_rename"
}],
fieldName is not required.
However, to access your field value you will still have to use SQL code such as:
SELECT
  aAA.xxx.cell.value as xxx
FROM dev_test.telem_test
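Putting those corrections together, a sketch of the full statement could look like this (the STRING type, TEXT encoding and the qualifier name name_of_column_from_bt are assumptions for illustration):
CREATE EXTERNAL TABLE dev_test.telem_test
OPTIONS (
  format = 'CLOUD_BIGTABLE',
  uris = ['https://googleapis.com/bigtable/projects/telem/instances/dbb-bigtable/tables/db1'],
  bigtable_options =
  """
  {
    "columnFamilies": [
      {
        "familyId": "aAA",
        "type": "STRING",
        "encoding": "TEXT",
        "columns": [
          {
            "qualifierString": "name_of_column_from_bt",
            "fieldName": "xxx",
            "onlyReadLatest": false
          }
        ]
      }
    ],
    "readRowkeyAsString": true
  }
  """
);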

Django differ JSONField values between lists and objects

I am using django 3.2 with Postgres as DB.
I have a model with JSONField:
class MyModel(models.Model):
    data = models.JSONField(default=dict, blank=True)
In the database there are a lot of records in this table, and some rows have JSON values that are objects while others are lists:
{
  "0:00:00 A": "text",
  "0:01:00 B": "text",
  "0:02:00 C": "text"
}
[
  {"time": "0:00:00", "type": "A", "description": "text"},
  {"time": "0:01:00", "type": "B", "description": "text"},
  {"time": "0:02:00", "type": "C", "description": "text"}
]
I need to filter all the records which have JSON values that are objects.
What I tried is to use has_key with the time frame "0:00:00":
result = MyModel.objects.filter(data__has_key="0:00:00 A")
But I can't really use it, because I don't know exactly what the complete key with the time frame looks like.
Any ideas how to filter JSONField values by their structure (object vs. list)?
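One hedged approach, assuming Postgres (where JSONField is stored as jsonb) and Django's default table name myapp_mymodel, is to filter on the JSON type at the database level with jsonb_typeof:
-- Hedged sketch: jsonb_typeof() returns 'object' for JSON objects and 'array'
-- for lists, so object-valued rows can be selected directly in Postgres.
SELECT id, data
FROM myapp_mymodel
WHERE jsonb_typeof(data) = 'object';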

AWS Datapipeline : moving data from RDS database(postgres) to Redshift using pipeline

Basically I am trying to transfer data from Postgres to Redshift using AWS Data Pipeline, and the process I am following is:
Write a pipeline (CopyActivity) that moves data from Postgres to S3
Write a pipeline (RedshiftCopyActivity) that moves data from S3 to Redshift
Both pipelines are working perfectly, but the problem is that the data is duplicated in the Redshift database.
For example, below is the data from the Postgres database in a table called company.
After a successful run of the S3-to-Redshift (RedshiftCopyActivity) pipeline the data was copied, but it was duplicated, as below.
Below is part of the definition of the RedshiftCopyActivity (S3 to Redshift) pipeline:
pipeline_definition = [{
    "id": "redshift_database_instance_output",
    "name": "redshift_database_instance_output",
    "fields": [
        {
            "key": "database",
            "refValue": "RedshiftDatabaseId_S34X5",
        },
        {
            "key": "primaryKeys",
            "stringValue": "id",
        },
        {
            "key": "type",
            "stringValue": "RedshiftDataNode",
        },
        {
            "key": "tableName",
            "stringValue": "company",
        },
        {
            "key": "schedule",
            "refValue": "DefaultScheduleTime",
        },
        {
            "key": "schemaName",
            "stringValue": RedShiftSchemaName,
        },
    ]
},
{
    "id": "CopyS3ToRedshift",
    "name": "CopyS3ToRedshift",
    "fields": [
        {
            "key": "output",
            "refValue": "redshift_database_instance_output",
        },
        {
            "key": "input",
            "refValue": "s3_input_data",
        },
        {
            "key": "runsOn",
            "refValue": "ResourceId_z9RNH",
        },
        {
            "key": "type",
            "stringValue": "RedshiftCopyActivity",
        },
        {
            "key": "insertMode",
            "stringValue": "KEEP_EXISTING",
        },
        {
            "key": "schedule",
            "refValue": "DefaultScheduleTime",
        },
    ]
},]
So according to the docs of RedshiftCopyActivity, we need to use insertMode to describe how the data should behave (inserted/updated/deleted) when copying to a database table:
insertMode : Determines what AWS Data Pipeline does with pre-existing data in the target table that overlaps with rows in the data to be loaded. Valid values are KEEP_EXISTING, OVERWRITE_EXISTING, TRUNCATE and APPEND. KEEP_EXISTING adds new rows to the table, while leaving any existing rows unmodified. KEEP_EXISTING and OVERWRITE_EXISTING use the primary key, sort, and distribution keys to identify which incoming rows to match with existing rows, according to the information provided in Updating and inserting new data in the Amazon Redshift Database Developer Guide. TRUNCATE deletes all the data in the destination table before writing the new data. APPEND will add all records to the end of the Redshift table. APPEND does not require a primary, distribution key, or sort key so items that may be potential duplicates may be appended.
My requirements are:
When copying from Postgres (in fact the data is in S3 now) to the Redshift database, if a row already exists then just update it
If there are new records in S3, then create new records in Redshift
But even though I have used KEEP_EXISTING or OVERWRITE_EXISTING, the data just keeps repeating, as shown in the Redshift database picture above.
So how can I achieve these requirements? Are there still any tweaks or settings to add to my configuration?
Edit
Table (company) definition from Redshift
If you want to avoid duplication, you must define a primary key in Redshift and also set insertMode to "OVERWRITE_EXISTING".
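A minimal sketch of what that could look like on the Redshift side, assuming a hypothetical schema name and column list (only id appears in the pipeline definition):
-- Hypothetical table definition: declaring the primary key is what lets
-- OVERWRITE_EXISTING match incoming rows against existing ones.
CREATE TABLE my_schema.company (
    id   BIGINT NOT NULL,
    name VARCHAR(256),
    PRIMARY KEY (id)
);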

Analytics in WSO2DAS

I'm getting a Table Not Found error while running a select query on the Spark console of WSO2 DAS. I've kept all the default configurations intact after the installation. I'm unable to fetch the data from the event stream even though it is shown under the table dropdown of the Data Explorer.
When data is first moved into WSO2 DAS, it is persisted in the data store you specify.
However, these are not the tables created in Spark. You need to write a Spark query that creates a temporary table in Spark referencing the table you have persisted.
For example, if your stream is:
{
  "name": "sample",
  "version": "1.0.0",
  "nickName": "",
  "description": "",
  "payloadData": [
    {
      "name": "ID",
      "type": "INT"
    },
    {
      "name": "NAME",
      "type": "STRING"
    }
  ]
}
you need to run the following Spark query in the Spark console:
CREATE TEMPORARY TABLE sample_temp USING CarbonAnalytics OPTIONS (tableName "sample", schema "ID INT, NAME STRING");
After executing the above script, try the following:
select * from sample_temp;
This should fetch the data you have pushed into WSO2DAS.
Happy learning!! :)

Elasticsearch Filtering on fields within a list

I am having trouble filtering on fields within a list in Elasticsearch. I am indexing simple JSON objects for search and filtering.
An example object that is being indexed would be:
{
  "id" : 1,
  "name" : "My Inventory",
  "description" : "This is a piece of inventory.",
  "sizes" : [ "big", "small" ],
  "geos" : [ { "country" : "US", "fullName" : "United States" } ]
}
I am able to filter by id, name, description, and size pretty easily, but when trying to filter on geo, I am hitting a brick wall. Below is the filter I am trying to use. I would be grateful for any kind of pointer to get me going in the right direction. Thanks!
curl -XPOST 'localhost:9200/stuff/inventory/_search?pretty=true' -d '
{
  "fields" : [ "name" ],
  "filter" : {
    "terms" : { "geos.country" : [ "US" ] }
  }
}
'
By default, string fields are analyzed ("broken down into searchable terms" is how the Elasticsearch docs describe it). This doesn't work with a term filter, since a term filter does exact matching but the field will contain a possibly altered version.
You can tell Elasticsearch not to do that by setting "index" : "not_analyzed" in your mapping. Your mapping for the geos field should look like
"geos" : {
"type" : "object",
"properties" : {
"country" : { "type" : "string", "index" : "not_analyzed" },
"fullName" : { "type" : "string" }
}
}
You may want to set fullName to not_analyzed also, depending on how you're going to use it. If you want to use it with term queries or filters, it should not be analyzed; if you want to do fuzzy matching, spelling correction, etc, it should be analyzed.