I am working on a project that uses an Event Hub -> Stream Analytics job -> Table Storage / Blob structure, and I want to write a couple of unit tests for it.
I can test the Event Hub sender status to see if queries have the expected behavior, but how can I check whether the data is actually being written to Table Storage? The whole process doesn't happen instantly, and there is a pretty long delay between the moment I hit the Event Hub and the moment the data is saved in Storage.
First create a new Azure Storage account and then create a new Azure table within that account. In your Stream Analytics job, add a new output for Table storage. When you set up the output details, you will need to specify the storage account, account key, table name, and which column names in the event will represent the Azure Table partition and row keys. As an example, I set mine up like this:
After the output is set up, you can create a simple Stream Analytics query that maps input events from the Event Hub to the Azure Table output. I also have an Event Hub input named 'eventhub' with Send/Listen permissions. My query looks like this:
SELECT
*
INTO
tableoutput
FROM
eventhub
At this point, hit the 'Start' button in the Azure portal to run the Stream Analytics job. To generate the events, you can follow the instructions here, but change the event message to this:
string guid = Guid.NewGuid().ToString();
var message = "pk,rk,value\n" + guid + ",1,hello";
Console.WriteLine("{0} > Sending message: {1}", DateTime.Now, message);
eventHubClient.Send(new EventData(Encoding.UTF8.GetBytes(message)));
To eyeball the Azure Table results, download a tool like TableXplorer and enter the storage account details. Double-click your Azure Table; keep in mind you may need to hit F5 on your TableXplorer query periodically for 10-60 seconds until the data gets pushed through. When it shows up, it will look like the following:
For programmatic unit testing, you will need to push the partition key / row key values generated in your Event Hub code into a data structure and have a worker poll the Azure Table using point queries. A good overview of Azure Table usage is here.
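For the polling side, here is a minimal sketch using the azure-data-tables Python package (the sender above is C#, so treat this as illustrative; the connection string, table name, and timeout values are assumptions):

import time
from azure.core.exceptions import ResourceNotFoundError
from azure.data.tables import TableClient

def wait_for_entity(connection_string, table_name, partition_key, row_key,
                    timeout_seconds=60, poll_interval_seconds=5):
    # Point query: look the entity up directly by partition key and row key,
    # retrying until Stream Analytics has pushed the event through.
    client = TableClient.from_connection_string(connection_string, table_name=table_name)
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            return client.get_entity(partition_key=partition_key, row_key=row_key)
        except ResourceNotFoundError:
            time.sleep(poll_interval_seconds)
    raise TimeoutError(f"Entity {partition_key}/{row_key} did not appear within {timeout_seconds}s")

A test would send the event, then assert that wait_for_entity returns the expected values instead of timing out.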
I was trying to auto-tag InfoTypes like PhoneNumber and EmailId on the data in a GCS bucket and BigQuery external tables using the Data Loss Prevention (DLP) tool in GCP, so that I can have those tags in Data Catalog and subsequently in Dataplex. Now the problems are:
If I select any source other than a BigQuery table (GCS, Datastore, etc.), the option to publish GCP DLP inspection results to Data Catalog is disabled.
If I select a BigQuery table, the Data Catalog publish option is enabled, but when I try to run the inspection job it errors out saying, "External tables are not supported for inspection". Surprisingly, it supports only internal BigQuery tables.
The question is: is my understanding correct that the GCP DLP - Data Catalog integration works only for internal BigQuery tables? Am I doing something wrong here? The GCP documentation doesn't mention these things either!
Also, while configuring the inspection job from the DLP UI console, I had to provide a BigQuery table ID mandatorily. Is there a way I can run a DLP inspection job against a BQ dataset or a bunch of tables?
Regarding Data Loss Prevention services in Google Cloud, your understanding is correct: data cannot be exfiltrated by copying to services outside the perimeter, e.g., a public Google Cloud Storage (GCS) bucket or an external BigQuery table. Visit this URL for more reference.
Now, about how to run a DLP inspection job against a set of BigQuery tables, there are two ways to do it:
Programmatically fetch the BigQuery tables, query each table, and call the DLP Streaming Content API. This operates in real time, but it is expensive. Here is the concept in a Java example:
// Build the JDBC connection string for the Simba BigQuery driver.
String url = String.format(
    "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType=3;ProjectId=%s;",
    projectId);

DataSource ds = new com.simba.googlebigquery.jdbc42.DataSource();
ds.setURL(url);
Connection conn = ds.getConnection();

// Enumerate every table in the project, then query each one and stream its
// content to the DLP API.
DatabaseMetaData databaseMetadata = conn.getMetaData();
ResultSet tablesResultSet =
    databaseMetadata.getTables(conn.getCatalog(), null, "%", new String[]{"TABLE"});
while (tablesResultSet.next()) {
    // Query your table data and call the DLP Streaming Content API here.
}
Here is a tutorial for this method.
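To give an idea of what happens inside that loop, here is a rough sketch of the same Streaming Content API call (inspect_content) using the google-cloud-dlp Python client; project_id, the sample row value, and the info types are placeholders:

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

# Inspect one row's content (already rendered as a string) in real time.
response = dlp.inspect_content(
    request={
        "parent": f"projects/{project_id}",
        "inspect_config": {"info_types": [{"name": "PHONE_NUMBER"}, {"name": "EMAIL_ADDRESS"}]},
        "item": {"value": "John Doe, john@example.com, 555-0100"},
    }
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)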
Programmatically fetch the BigQuery tables, and then trigger one inspect job for each table. This is the cheapest method, but you need to consider that it's a batch operation, so it doesn't execute in real time. Here is the concept in a Python example:
from google.cloud import bigquery

client = bigquery.Client()
datasets = list(client.list_datasets(project=project_id))
if datasets:
    for dataset in datasets:
        tables = client.list_tables(dataset.dataset_id)
        for table in tables:
            # Create an inspect job for table.table_id here.
            pass
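The placeholder in that loop could be filled with something like the following sketch using the google-cloud-dlp client; the info types and the Data Catalog publishing action are assumptions based on the goal described in the question:

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

def create_inspect_job(project_id, dataset_id, table_id):
    # One batch inspect job per table, publishing findings to Data Catalog.
    inspect_job = {
        "storage_config": {
            "big_query_options": {
                "table_reference": {
                    "project_id": project_id,
                    "dataset_id": dataset_id,
                    "table_id": table_id,
                }
            }
        },
        "inspect_config": {
            "info_types": [{"name": "PHONE_NUMBER"}, {"name": "EMAIL_ADDRESS"}],
        },
        "actions": [{"publish_findings_to_cloud_data_catalog": {}}],
    }
    return dlp.create_dlp_job(
        request={"parent": f"projects/{project_id}", "inspect_job": inspect_job}
    )

Keep in mind the external-table limitation from the question still applies; this only works for native BigQuery tables.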
Use this thread for more reference on running a DLP inspection job against a set of BigQuery tables.
I wanted to view the BigQuery costs in my project. I am exporting logs to a table using the following filter:
resource.type = "bigquery_resource"
protoPayload.methodName = "jobservice.jobcompleted"
However, when I view the data, the information about refreshing the table in Data Studio is not reflected there. That data only appears with this filter:
protoPayload.serviceName = "bigquerybiengine.googleapis.com"
However, that is of no use at this point, as it only contains information about access to the data range. How can I read the data consumption when reports are refreshed in Data Studio?
To analyze Data Studio report and query costs, you can use Cloud Audit Logs for BigQuery: export the audit log event data to BigQuery and analyze it there.
Create a sink in Cloud Logging; this will export all BigQuery query_job_completed log events from the Cloud Audit Logging service into your BigQuery table.
When you have the BigQuery event data flowing into your dataset, you can create a view and query it. You will get totalBilledBytes per query, which can be used to calculate the cost of the queries.
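As a rough sketch, assuming the sink exports into a dataset named audit_logs using the legacy AuditData (jobCompletedEvent) schema, a per-user query run from the Python client might look like this; the project, dataset, and table names depend on how your sink is configured, and the 5 USD per TiB on-demand price is an assumption:

from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS user_email,
  SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes) AS total_billed_bytes
FROM `my_project.audit_logs.cloudaudit_googleapis_com_data_access_*`
GROUP BY user_email
ORDER BY total_billed_bytes DESC
"""

for row in client.query(query):
    # Approximate on-demand cost: billed bytes / 1 TiB * 5 USD.
    print(row.user_email, row.total_billed_bytes, row.total_billed_bytes / 2**40 * 5.0)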
You can refer to this documentation for further information.
I have an application that needs to update the UI with the results of an Amplify Datastore query. I am making the query as soon as the component mounts/renders, but the results of the query are empty even though I know there is available data. If I add a timeout of 1 second or greater before making the query, then the query returns the expected data. My hunch is that this is because the query is returning an empty set of data before the response from the delta sync table, which shows there is data to be fetched, is returned.
Is there any type of event provided by Datastore that would allow me to wait until the data store is initialized or has data to query before making the query?
I understand that I could use the .observe functionality of datastore for a similar effect, but this is currently not an option.
First, if you do not use the DataStore start method, then sync from the backend starts when the first query is submitted. Queries are run against the local store, so the data won't be there yet.
Second, DataStore publishes events on the Amplify Hub so that you can monitor changes, such as a set of data being synced, DataStore being ready, and even DataStore being ready with all data synced locally.
See the documentation on Datastore.start
and the documentation for Datastore events for more information.
Currently we have a problem with loading data when refreshing the report data from the DB, since it has too many records and it takes forever to load everything. The issue is: how can I load only the data from the last year, to avoid taking so long to load it all? As far as I can see, connecting to Cosmos DB in the dialog box allows me to enter an SQL query, but I don't know how to write one for this type of non-relational database.
Power BI has an incremental refresh feature. You should be able to refresh the current year only.
If that still doesn't meet expectations, I would look at a preview feature called Azure Synapse Link, which automatically replicates Cosmos DB updates into analytical storage that can be queried much faster from Azure Synapse Analytics, in order to refresh Power BI faster.
Depending on the volume of the data, you will hit a number of issues. The first is that you may exceed your RU limit, slowing down the extraction of the data from Cosmos DB. The second is transforming the data from JSON into a structured format.
I would try to write a query that specifies only the fields and items that you need. That will reduce the processing time and the amount of data retrieved.
For SQL queries, it will be something like:
SELECT * FROM c WHERE c.partitionEntity = 'guid'
For more information on the Cosmos DB SQL API syntax, please see here to get you started.
You can use the query window in Azure to run the SQL commands, or Azure Storage Explorer to test the query, then move it to Power BI.
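If you want to try the same query outside the portal first, here is a minimal sketch with the azure-cosmos Python SDK; the endpoint, key, database, container, and partitionEntity value are placeholders:

from azure.cosmos import CosmosClient

# Placeholder account details.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
container = client.get_database_client("mydb").get_container_client("mycontainer")

items = container.query_items(
    query="SELECT * FROM c WHERE c.partitionEntity = @pe",
    parameters=[{"name": "@pe", "value": "guid"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item["id"])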
What is highly recommended is to extract the data into a place where it can be transformed into a structured format like a table or CSV file.
For example, use Azure Databricks to extract the data, then turn the JSON into a table-formatted object.
You have the option of running Databricks notebook queries against Cosmos DB, or using Azure Databricks in its own instance. Another option would be to use the change feed with an Azure Function to shred the data into Blob Storage and query it from there, using Power BI, Databricks, Azure SQL Database, etc.
In the Source step of your query, you can make a SELECT based on the Cosmos DB _ts system property, like:
Query ="SELECT * FROM XYZ AS t WHERE t._ts > 1609455599"
In this case, 1609455599 is the Unix timestamp (in seconds) corresponding to 31.12.2020, 23:59:59, so only data from 2021 onwards will be selected.
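Since _ts is stored as Unix epoch seconds, a quick way to compute the cutoff for whatever date you want (here assuming a UTC midnight boundary) is:

from datetime import datetime, timezone

# Everything from 2021 onwards, using a UTC cutoff.
cutoff = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(int(cutoff.timestamp()))  # 1609459200 -> WHERE t._ts >= 1609459200

The slight difference from 1609455599 above comes from the time zone used for the cutoff.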
We are trying to insert data into BigQuery (streaming) using Dataflow. Is there a way we can keep a check on the number of records inserted into BigQuery? We need this data for reconciliation purposes.
Add a step to your Dataflow pipeline that calls the BigQuery Tables.get API, or run this query before and after the flow (both are equally good):
select row_count, table_id from `dataset.__TABLES__` where table_id = 'audit'
As an example, the query returns this
You may also be able to examine the "Elements added" counter by clicking on the step that writes to BigQuery in the Dataflow UI.
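For the Tables.get approach, here is a minimal Python sketch with the google-cloud-bigquery client (the project, dataset, and table names are placeholders); running it before and after the pipeline and diffing the counts gives the number of records written:

from google.cloud import bigquery

client = bigquery.Client()

# Equivalent of Tables.get: fetch the table metadata and read its row count.
table = client.get_table("my_project.dataset.audit")
print(table.num_rows)

# Or run the __TABLES__ metadata query shown above.
query = "SELECT row_count, table_id FROM `dataset.__TABLES__` WHERE table_id = 'audit'"
for row in client.query(query):
    print(row.table_id, row.row_count)

Note that rows still sitting in the streaming buffer may not be reflected in these metadata counts immediately.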