I'm trying to deploy new data publisher car. I looked at tthe APIM_LAST_ACCESS_TIME_SCRIPT.xml spark script (used by api manager) and didn't understand the difference between the two temporaries tables created: API_LAST_ACCESS_TIME_SUMMARY_FINAL and APILastAccessSummaryData
The two Spark temporary tables represent different JDBC tables (possibly in different datasources), where one of them acts as the source for Spark and the other acts as the destination.
To illustrate this better, have a look at the simplified script in question:
create temporary table APILastAccessSummaryData using CarbonJDBC options (dataSource "WSO2AM_STATS_DB", tableName "API_LAST_ACCESS_TIME_SUMMARY", ... );
CREATE TEMPORARY TABLE API_LAST_ACCESS_TIME_SUMMARY_FINAL USING CarbonAnalytics OPTIONS (tableName "API_LAST_ACCESS_TIME_SUMMARY", ... );
INSERT INTO TABLE APILastAccessSummaryData select ... from API_LAST_ACCESS_TIME_SUMMARY_FINAL;
As you can see, we're first creating a temporary table in Spark with the name APILastAccessSummaryData, which represents an actual relational DB table with the name API_LAST_ACCESS_TIME_SUMMARY in the WSO2AM_STATS_DB datasource. Note the using CarbonJDBC keyword, which can be used to directly map JDBC tables within Spark. Such tables (and their rows) are not encoded, and can be read by the user.
Second, we're creating another Spark temporary table with the name API_LAST_ACCESS_TIME_SUMMARY_FINAL. Here however, we're using the CarbonAnalytics analytics provider, which will mean that this table will not be a vanilla JDBC table, but an encoded table similar to the one from your previous question.
Now, from the third statement, you can see that we're reading (SELECT) a number of fields from the second table API_LAST_ACCESS_TIME_SUMMARY_FINAL and inserting them (INSERT INTO) into the first, which is APILastAccessSummaryData. This represents the Spark summarisation process.
For more details on the differences between the CarbonAnalytics and CarbonJDBC analytics providers or on how Spark handles such tables in general, have a look at the documentation page for Spark Query Language.
Related
I have an existing Athena table (w/ hive-style partitions) that's using the Avro SerDe. When I first created the table, I declared the Athena schema as well as the Athena avro.schema.literal schema per AWS instructions. Everything has been working great.
I now wish to add new columns that will apply going forward but not be present on the old partitions. I tried a basic ADD COLUMNS command that claims to succeed but has no impact on SHOW CREATE TABLE. I then wondered if I needed to change the Avro schema declaration as well, which I attempted to do but discovered that ALTER TABLE SET SERDEPROPERTIES DDL is not supported in Athena.
AWS claims I should be able to add columns when using Avro, but at this point I'm unsure how to do it. Even if I'm willing to drop the table metadata and redeclare all of the partitions, I'm not sure how to do it right since the schema is different on the historical partitions.
Looking for high-level guidance on the steps to be taken. Documentation is scant and Athena seems to be lacking support for commands that are referenced in this same scenario in vanilla Hive world. Thanks for any insights.
I am building an ETL pipeline using primarily state machines, Athena, S3, and the Glue catalog. In general things work in the following way:
A table, partitioned by "version", exists in the Glue Catalog. The table represents the output destination of some ETL process.
A step function (managed by some other process) executes "INSERT INTO" athena queries. The step function supplies a "version" that is used as part of the "INSERT INTO" query so that new data can be appended into the table defined in (1). The table contains all "versions" - it's a historical table that grows over time.
My question is: What is a good way of exposing a view/table that allows someone (or something) to query only the latest "version" partition for a given historically partitioned table?
I've looked into other table types AWS offers, including Governed tables and Iceberg tables. Each seems to have some incompatibility with our existing or planned future architecture:
Governed tables do not support writes via athena insert queries. Only Glue ETL/Spark seems to be supported at the moment.
Iceberg tables do not support Lake Formation data filters (which we'd like to use in the future to control data access)
Iceberg tables also seem to have poor performance. Anecdotally, it can take several seconds to insert a very small handful of rows to a given iceberg table. I'd worry about future performance when we want to insert a million rows.
everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those filed will be generated from a PostgreSQL database (RDS). I'll run a query and export data to S3 using Python's Pyarrow.
My question is: since Athena is schema-on-read, add or delete of columns on database will not be a problem...but what will happen when I get a column renamed on database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way that Athena knows about these schema evolution or I would have to run a script to iterate through all my files on S3, rename columns and update table schema on Athena from 'col_a' to 'col_beta'?
Would AWS Glue Data Catalog help in any way to solve this?
I'll love to discuss more about this!
I recommend reading more about handling schema updates with Athena here. Generally Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, using Parquet, columns will be read by name, but you can change that to reading by index as well. Each way has its own advantages / disadvantages dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
You can set a granularity based on 'On Demand' or 'Time Based' for the AWS Glue crawler, so every time your data on the S3 updates a new schema will be generated (you can edit the schema on the data types for the attributes). This way your columns will stay updated and you can query on the new field.
Since AWS Athena reads data in CSV and TSV in the "order of the columns" in the schema and returns them in the same order. It does not use column names for mapping data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.
I think this is a recurrent question in the Internet, but unfortunately I'm still unable to find a successful answer.
I'm using Ruby on Rails 4 and I would like to create a model that interfaces with a SQL query, not with an actual table in the database. For example, let's suppose I have two tables in my database: Questions and Answers. I want to make a report that contains statistics of both tables. For such purpose, I have a complex SQL statement that takes data from these tables to build up the statistics. However the SELECT used in the SQL statement does not directly take values from neither Answers nor Questions tables, but from nested SELECTs.
So far I've been able to create the StatItem model, without any migration, but when I try StatItem.find_by_sql("...nested selects...") the system complains about unexisting table stat_items in the database.
How can I create a model whose instance's data is retrieved from a complex query and not from a table? If it's not possible, I could create a temporary table to store the data in there. In such case, how can I tell the migration file to not create such table (it would be created by the query)?
How about creating a materialized view from your complex query and following this tutorial:
ActiveRecord + PostgreSQL Materialized Views
Michael Kohl and his proposal of materialized views has given me an idea, which I initially discarded because I wrongly thought that a single database connection could be shared by two processes, but after reading about how Rails processes requests, I think my solution is fine.
STEP 1 - Create the model without migration
rails g model StatItem --migration=false
STEP 2 - Create a temporary table called stat_items
#First, drop any existing table created by older requests (database connections are kept open by the server process(es).
ActiveRecord::Base.connection.execute('DROP TABLE IF EXISTS stat_items')
#Second, create the temporary table with the desired columns (notice: a dummy column called 'id:integer' should exist in the table)
ActiveRecord::Base.connection.execute('CREATE TEMP TABLE stat_items (id integer, ...)')
STEP 3 - Execute an SQL statement that inserts rows in stat_items
STEP 4 - Access the table using the model, as usual
For example:
StatItem.find_by_...
Any comments/improvements are highly appreciated.
I have read several other posts about this and in particular this question with an answer by greg about how to do it in Hive. I would like to know how to account for DynamoDB tables with variable amounts of columns though?
That is, the original DynamoDB table has rows that were added dynamically with different columns. I have tried to view the exportDynamoDBToS3 script that Amazon uses in their DataPipeLine service but it has code like the following which does not seem to map the columns:
-- Map DynamoDB Table
CREATE EXTERNAL TABLE dynamodb_table (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "MyTable");
(As an aside, I have also tried using the Datapipe system but found it rather frustrating as I could not figure out from the documentation how to perform simple tasks like run a shell script without everything failing.)
It turns out that the Hive script that I posted in the original question works just fine but only if you are using the correct version of Hive. It seems that even with the install-hive command set to install the latest version, the version used is actually dependent on the AMI Version.
After doing a fair bit of searching I managed to find the following in Amazon's docs (emphasis mine):
Create a Hive table that references data stored in Amazon DynamoDB. This is similar to
the preceding example, except that you are not specifying a column mapping. The table
must have exactly one column of type map. If you then create an EXTERNAL
table in Amazon S3 you can call the INSERT OVERWRITE command to write the data from
Amazon DynamoDB to Amazon S3. You can use this to create an archive of your Amazon
DynamoDB data in Amazon S3. Because there is no column mapping, you cannot query tables
that are exported this way. Exporting data without specifying a column mapping is
available in Hive 0.8.1.5 or later, which is supported on Amazon EMR AMI 2.2.3 and later.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html