Flink - Integration Testing Table API? - unit-testing

I have built a very small and straightforward Flink app which consumes events from Kafka (JSON), deserializes them to a Java object, creates two Tables, uses the Table API to do some simple operations, and finally joins the two tables and writes the result back to Kafka.
What are the best practices for testing such code? How do I go about writing integration tests that verify that code written with the Table API produces the right result?
(Using Flink 1.8.3)

We added an integration test for the Kafka SQL connector in 1.10, in KafkaTableITCase. It creates a Kafka table, writes some data into it (using the JSON format), reads it back, applies a window aggregation, and finally checks the window results using a TestingSinkFunction. You can check the code here:
https://github.com/apache/flink/blob/release-1.10/flink-connectors/flink-connector-kafka-base/src/test/java/org/apache/flink/streaming/connectors/kafka/KafkaTableTestBase.java
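For older versions such as 1.8, the same idea can be applied without Kafka: replace the Kafka source with a bounded in-memory source, run the Table API logic, and collect the results with a test sink. The following is only a minimal sketch; the `Event` POJO, the field names, and the exact `StreamTableEnvironment` factory method are assumptions (factory method names vary slightly between Flink versions):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class TableApiIT {

    // Simple POJO standing in for your deserialized Kafka event.
    public static class Event {
        public String id;
        public long amount;
        public Event() {}
        public Event(String id, long amount) { this.id = id; this.amount = amount; }
    }

    // Collecting sink: Flink serializes sink instances, so results go into a static list.
    public static class CollectSink implements SinkFunction<Row> {
        public static final List<Row> VALUES = Collections.synchronizedList(new ArrayList<>());
        @Override
        public void invoke(Row value) { VALUES.add(value); }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Bounded in-memory source instead of the Kafka consumer.
        DataStream<Event> events = env.fromElements(new Event("a", 1), new Event("b", 2));

        Table table = tEnv.fromDataStream(events, "id, amount");
        Table result = table.filter("amount > 1"); // the Table API logic under test

        tEnv.toAppendStream(result, Row.class).addSink(new CollectSink());
        env.execute("table-api-test");

        // Assert on CollectSink.VALUES with your test framework of choice.
    }
}
```

The key point is to keep the Table API logic in a method that accepts a `Table` and returns a `Table`, so the same code path runs in production (Kafka) and in the test (in-memory source).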

Issues while working with Amazon Aurora Database

My requirements:
I want to store real-time events data coming from e-commerce websites into a database
In parallel to storing the data, I want to access the events data from the database
I want to perform some sort of ad-hoc analysis (SQL)
Using some sort of built-in methods (either from Boto3 or the Java SDK), I want to access the events data
I want to create some sort of custom APIs to access the events data stored in the database
I recently came across the Amazon Aurora (MySQL) database.
I thought Aurora was a good fit for my requirements. But when I dug into Amazon Aurora (MySQL), I noticed that we can create a database using the AWS CDK,
BUT
1. No equivalent methods to create tables using AWS-CDK/BOTO3
2. No equivalent methods in BOTO3 or JAVA SDK to store/access the database data
Can anyone tell me how I can create a table using IaC in the Aurora DB?
Can anyone tell me how I can store real-time data in Aurora?
Can anyone tell me how I can access real-time data stored in Aurora?
No equivalent methods to create tables using AWS-CDK/BOTO3
This is because only Aurora Serverless can be accessed using the Data API, not a regular database.
You have to use regular MySQL tools (e.g., the mysql CLI, phpMyAdmin, MySQL Workbench, etc.) to create tables and populate them.
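For example, after connecting with the mysql CLI to the cluster writer endpoint, table creation is ordinary DDL (the database and column names below are made up for illustration):

```sql
-- connect first with: mysql -h <cluster-writer-endpoint> -u admin -p
CREATE DATABASE IF NOT EXISTS productcatalogueinfo;
USE productcatalogueinfo;

CREATE TABLE events (
  event_id   VARCHAR(64) PRIMARY KEY,
  payload    JSON,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```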
No equivalent methods in BOTO3 or JAVA SDK to store/access the database data
Same reason and solution as for point 1.
Can anyone tell me how I can create a table using IaC in the Aurora DB?
Terraform has a MySQL provider, but it is not for tables; it manages users and databases.
Can anyone tell me how I can store real-time data in Aurora?
There is no out-of-the-box solution for that, so you need a custom solution. Maybe stream the data to Kinesis Data Streams or Firehose, then to a Lambda, and the Lambda will populate your DB? That seems easiest to implement.
Can anyone tell me how I can access real-time data stored in Aurora?
If you stream the data to a Kinesis stream first, you can use Kinesis Analytics to analyze it in real time.
Since many of the above require custom solutions, other architectures are possible.
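The Kinesis-to-Lambda leg of that suggestion can be sketched as below, assuming the AWS Lambda Java libraries and the MySQL JDBC driver are on the classpath; the endpoint, credentials, table name, and record layout are all placeholders:

```java
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;

public class EventsToAuroraHandler implements RequestHandler<KinesisEvent, Void> {

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        // Credentials are hard-coded here only for brevity; in practice read them
        // from environment variables or Secrets Manager.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://my-aurora-endpoint:3306/events_db", "app_user", "app_password");
             PreparedStatement ps = con.prepareStatement(
                "INSERT INTO events (payload) VALUES (?)")) {

            for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
                // Kinesis delivers each payload as a ByteBuffer.
                String payload = StandardCharsets.UTF_8
                        .decode(record.getKinesis().getData()).toString();
                ps.setString(1, payload);
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip per Lambda invocation
        } catch (Exception e) {
            throw new RuntimeException("Failed to write batch to Aurora", e);
        }
        return null;
    }
}
```

Batching the inserts per invocation keeps the number of round trips to Aurora proportional to the number of Lambda invocations rather than the number of records.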
Create the connection as

val con: Connection = DriverManager.getConnection(
  "jdbc:mysql://localhost:3306/$dbName", // replace with your endpoint & database name
  "root",
  "admin123"
)

then

val stmt: Statement = con.createStatement()
stmt.executeQuery("use productcatalogueinfo;")

Whenever your Lambda is triggered, it performs this connection and the DDL operations too.

Translation of a text column in BigQuery

I have a table in BigQuery containing consumers' reviews; some of them are in local languages and I need to use a translation API to translate them and create a new column in the existing table incorporating the translated reviews. I was wondering whether I can automate this task, e.g. using the Google Translate API in BigQuery.
An alternative solution, if the customer reviews contain only a limited set of comment values, is to create a BigQuery function that replaces the values.
Sample code is available in a GitHub repository.
If you want to use an external API in BigQuery, like a Language Translation API, you can use Remote Functions (a recent release).
In this GitHub repo you can see how to wrap the Azure Translator API (the same way you can use the Google Translate API) into a SQL function and use it in your queries.
Once you have created the translation SQL function, you can write an UPDATE statement (and run it periodically using scheduled queries) to achieve what you want.
UPDATE mytable SET translated_review_text=translation_function(review_text) WHERE translated_review_text IS NULL
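For reference, such a remote function is declared in BigQuery roughly like this; the dataset, connection, and endpoint names below are placeholders, and the connection must already exist and point at a Cloud Function or Cloud Run service that calls the translation API:

```sql
CREATE OR REPLACE FUNCTION mydataset.translation_function(review_text STRING)
RETURNS STRING
REMOTE WITH CONNECTION `my-project.us.my-translation-connection`
OPTIONS (endpoint = 'https://my-region-my-project.cloudfunctions.net/translate');
```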

AWS Athena JavaScript SDK - Create table from query result (CTAS) - Specify output format

I am trying to use the AWS SDK for JavaScript (Node.js) to make a query using AWS Athena and store the results in a table in AWS Glue in Parquet format (not just a CSV file).
If I am using the console, it is pretty simple with a CTAS query:
CREATE TABLE tablename
WITH (
external_location = 's3://bucket/tablename/',
FORMAT = 'parquet')
AS
SELECT *
FROM source
But with the AWS Athena JavaScript SDK I am only able to set an output file destination using the Workgroup or Output parameters and make a basic SELECT query; the results are output to a CSV file and are not indexed properly in AWS Glue, so it breaks the bigger process it is part of. If I try to call that query using the JavaScript SDK I get:
Table properties [FORMAT] are not supported.
I would be able to call that DDL statement using the Java SDK JDBC driver connection option.
Is anyone familiar with a solution or workaround with the Javascript SDK for Node.JS?
There is no difference between running the SQL you posted in the Athena web console, the AWS SDK for JavaScript, the AWS SDK for Java, or the JDBC driver; none of these process the SQL themselves, so if the SQL works in one of them it will work in all of them. It's only the Athena service that reads the SQL.
Check your SQL and make sure you really use the same in your code as you have tried in the web console. If they are indeed the same, the error is somewhere else in your code, so post that too.
Update: the problem is the upper-case FORMAT. If you paste the code you posted into the Athena web console, it bugs out and doesn't run the query, but if you run it with the CLI or an SDK you get the error you posted. You did not run the same SQL in the console as in the SDK; if you had, you would have gotten the same error in both.
Use lower-case format and it will work.
This is definitely a bug in Athena, these properties should not be case sensitive.
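So the working version of the statement from the question, with the property name in lower case, is:

```sql
CREATE TABLE tablename
WITH (
  external_location = 's3://bucket/tablename/',
  format = 'parquet')
AS
SELECT *
FROM source
```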

AWS IoT Analytics

I am trying to fetch data from AWS IoT Analytics from my Java SDK. I have created channels and a pipeline, and the data is in the datasets.
Does anyone have an idea about the AWS IoT Analytics data fetch mechanism?
AWS IoT Analytics distinguishes between raw data stored in channels, processed data stored in datastores, and queried data stored in datasets.
As part of creating the dataset content with CreateDatasetContent [1], you'll write your SQL query, which runs against your datastore and produces the result set stored in your dataset. The query can either be run ad hoc or periodically every x hours. After you have created the dataset content successfully, you can get the query result via the GetDatasetContent API [2].
Please note that the CreateDatasetContent API is async, meaning you'll need to wait until the query has run successfully. By default, GetDatasetContent will always return the latest successful result, which might be empty directly after creating the dataset content since the query hasn't finished yet. In order to get the current state of your query, you can pass the optional version=$LATEST parameter to the GetDatasetContent call. This will give you more information about the currently running query, or whether it failed to execute.
Hope this helps
[1] https://docs.aws.amazon.com/iotanalytics/latest/APIReference/API_CreateDatasetContent.html
[2] https://docs.aws.amazon.com/iotanalytics/latest/APIReference/API_GetDatasetContent.html
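Put together with the AWS SDK for Java (v1), the flow looks roughly like the sketch below; the dataset name is a placeholder, class and method names are from memory so check them against the SDK javadoc, and real code should poll for SUCCEEDED status instead of reading the result immediately:

```java
import com.amazonaws.services.iotanalytics.AWSIoTAnalytics;
import com.amazonaws.services.iotanalytics.AWSIoTAnalyticsClientBuilder;
import com.amazonaws.services.iotanalytics.model.CreateDatasetContentRequest;
import com.amazonaws.services.iotanalytics.model.DatasetEntry;
import com.amazonaws.services.iotanalytics.model.GetDatasetContentRequest;
import com.amazonaws.services.iotanalytics.model.GetDatasetContentResult;

public class DatasetFetch {
    public static void main(String[] args) {
        AWSIoTAnalytics client = AWSIoTAnalyticsClientBuilder.defaultClient();

        // Kick off the (asynchronous) SQL query that fills the dataset.
        client.createDatasetContent(
            new CreateDatasetContentRequest().withDatasetName("my_dataset"));

        // "$LATEST" reports on the most recent run, which may still be running;
        // poll until the content status is SUCCEEDED before reading entries.
        GetDatasetContentResult content = client.getDatasetContent(
            new GetDatasetContentRequest()
                .withDatasetName("my_dataset")
                .withVersionId("$LATEST"));

        // Each entry carries a pre-signed URI from which the result file can be downloaded.
        for (DatasetEntry entry : content.getEntries()) {
            System.out.println(entry.getDataURI());
        }
    }
}
```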

Problem regarding integration of various datasources

We have 4 datasources. 2 datasources are internal and we can directly connect to their databases. For the 3rd datasource we get a flat file (.csv) and have to pull in the data. The 4th datasource is external and we cannot access it directly.
We need to pull data from all 4 datasources, run business rules on them, and store the results in our database. We have a web application that runs on top of this database. Also, every month we have to pull the data and do any updates/deletes/adds etc. to the existing data.
I am pretty much ignorant about this process. Also, can you please point me to some good books on this topic?
These are the current approaches I was thinking of.
1. Write an internal web service that talks to the internal datasources and pulls data. Create a handler to the external datasource using middleware (MQSeries is already set up for this in some other existing project; I plan to reuse that). Pull data from the CSV file, again using Java. On this data, run some business rules from Java, then use the data. This approach might run on my dev box, but I am not sure what problems could occur in prod (especially due to synchronization).
2. Pull data from the internal sources using a plain Java JDBC connection. For the remaining 2, get flat files and dump the data using SQL*Loader. All the data goes to temporary tables first. Run the business rules through PL/SQL and use the result.
3. Use some ETL tool like Informatica to pull data. Write business rules in Perl (invoked by Informatica).
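The CSV-to-staging-table step of the second approach can be sketched with plain JDBC as below; the connection string, table, and column layout are made up for illustration, and SQL*Loader would replace this for large files:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvToStaging {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "etl_user", "etl_password");
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO staging_events (event_id, event_type, amount) VALUES (?, ?, ?)");
             BufferedReader reader = Files.newBufferedReader(Paths.get("events.csv"))) {

            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");  // naive split; a real loader handles quoting
                ps.setString(1, cols[0]);
                ps.setString(2, cols[1]);
                ps.setBigDecimal(3, new java.math.BigDecimal(cols[2]));
                ps.addBatch();
            }
            ps.executeBatch(); // business rules then run in PL/SQL against staging_events
        }
    }
}
```

Loading into a staging table first means the PL/SQL business rules see a consistent snapshot, and a failed load can be rolled back without touching the live tables.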
Thanks.
A book like "The Data Warehouse ETL Toolkit" by Ralph Kimball is a good resource for learning techniques/architectures to bring data from different sources into one place.