Avoid duplicate data on Time Series Insight / EventHub - azure-eventhub

I'm using Time Series Insights, which consumes events from Event Hub. I just noticed that we have some duplicated data. Is there a way to avoid storing duplicate data in TSI, or even in Event Hub?

Related

Performant way to handle arrays in Athena/Quicksight

I currently have a large set of JSON data that I'd like to import into Amazon Athena for visualization in Amazon QuickSight. Each JSON document has two fields of interest: one is a comma-separated string of ids (orderlist), and the other is an array of strings (locations). Because QuickSight doesn't support array searching, I'm currently resorting to creating a view where I generate cross joins across the two arrays:
SELECT id,
       TRY_CAST(orderid AS bigint) AS orderid_targeting,
       location
FROM advertising_json
CROSS JOIN UNNEST(split(orderlist, ',')) AS x (orderid)
CROSS JOIN UNNEST(locations) AS t (location)
With two cross joins, this can explode out the data to 20x-30x the original size.
If I were working on individual queries on Athena, I could use Presto array functions to search through the arrays. Is there a better way to make these fields accessible for filtering on Quicksight?
You have two options: keep doing what you're doing or implement an ETL workflow where you periodically materialise the view, for example using CTAS. The latter has the added benefit that you can produce Parquet files, which could help speed up your queries.
On the other hand, it's not as simple as it sounds. If you're lucky you can use INSERT INTO to transform partitions from your current table into an optimised table after a point in time when they will no longer change – but in my experience the most recent data usually gets updated during some window of time, and you still want to be able to query it during that window. In that situation the ETL process becomes much more complicated, since you need to remove data from the optimised table to avoid ending up with duplicates. It's not hard, it's just a lot of code and juggling of S3 and Glue Data Catalog operations so that you never have tables with duplicate data, nor with too little data.
Unless you feel like your current setup with the view is too slow, don't go implementing something big and complicated. Remember that you pay for bytes scanned in Athena, not for the time Athena spends crunching your query. You get quite a lot of compute power running your queries, and in my experience there's rarely any point in micro-optimising them: the gains are orders of magnitude lower than what you get from minimising the amount of data you process, either through clever partitioning or by moving to columnar file formats. Most of the time the gains from small optimisations are not measurable because of the error bars caused by Athena's query queue and S3 operations. You may get your query to run 50ms faster, but sometimes it gets queued for 500ms and spends another 2000ms doing list operations on S3, so how can you tell?
If you decide to go down the materialisation route, first do it once using CTAS and run your QuickSight visualisation against the results. Don't implement the whole ETL workflow before you've checked that you get something that is significantly more performant.
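For reference, a one-off materialisation along those lines could look roughly like the sketch below: it runs a CTAS statement through boto3 that writes the unnested view out as Parquet. The table, database and bucket names are placeholders for whatever you actually use.
import boto3

athena = boto3.client("athena")

# Hypothetical one-off CTAS materialisation of the unnested view (all names are placeholders).
ctas_sql = """
CREATE TABLE advertising_unnested
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/advertising_unnested/'
) AS
SELECT id,
       TRY_CAST(orderid AS bigint) AS orderid_targeting,
       location
FROM advertising_json
CROSS JOIN UNNEST(split(orderlist, ',')) AS x (orderid)
CROSS JOIN UNNEST(locations) AS t (location)
"""

athena.start_query_execution(
    QueryString=ctas_sql,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
Point QuickSight at the resulting table and compare before committing to a full ETL workflow.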
If all you are worried about is that it's less performant to apply filters after the unnesting of your arrays than using array functions, write the two versions of the query and benchmark them against each other. I suspect array functions are going to be slightly faster – but for the same reasons I mentioned above, the gains may drown in the error bars caused by Athena's queuing and other operations.
Make sure to benchmark at different points during the day, and be especially conscious of the fact that top-of-the-hour behaviour in Athena is extremely different from other times (run queries at 10:00 and then at 10:10 – your total execution times will be very different because everyone's cron jobs run at the top of the hour).
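If you want numbers rather than a gut feeling for those comparisons, Athena reports per-query statistics that you can pull with boto3 – the sketch below assumes your workgroup already has a result location configured.
import time
import boto3

athena = boto3.client("athena")

# Hypothetical benchmark helper: run a query and return Athena's own statistics.
def run_and_measure(sql, database):
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
    )["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    stats = execution["Statistics"]
    # Comparing engine time with total time shows how much was queueing and S3 work.
    return (stats["EngineExecutionTimeInMillis"],
            stats["TotalExecutionTimeInMillis"],
            stats["DataScannedInBytes"])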

DynamoDB: better way to update a table besides querying 1 by 1?

I want to do a weekly update to my table (500 items)
When I receive the updated data, should I take each item, query the corresponding item in DDB, then compare and update if necessary?
or should I try to scan the whole table into memory and compare in memory?
DynamoDB is meant for transactional data and performs best when writes are distributed, updating items one by one. Updating sequentially will keep your read cost down, and while DynamoDB can absorb burst reads and writes, you need robust error handling that retries a failed update after a delay.
Scanning the whole table incurs the same read cost; distributing the work helps with performance and scalability.
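For a table of only ~500 items, something along these lines is usually enough: scan once, diff in memory, and write back only the items that changed. This is just a sketch assuming boto3 and a partition key named id – adjust it to your schema.
import boto3

table = boto3.resource("dynamodb").Table("my_table")  # placeholder table name

# Hypothetical weekly sync: scan the table, compare against the incoming data,
# and only write the items that actually changed.
def weekly_sync(new_items):  # new_items: dict mapping id -> full item
    existing = {}
    response = table.scan()
    existing.update({item["id"]: item for item in response["Items"]})
    while "LastEvaluatedKey" in response:  # 500 items fit in one page, but paginate anyway
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        existing.update({item["id"]: item for item in response["Items"]})

    changed = [item for item_id, item in new_items.items()
               if existing.get(item_id) != item]

    # batch_writer batches the puts and retries unprocessed items for us.
    with table.batch_writer() as batch:
        for item in changed:
            batch.put_item(Item=item)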
More documentation on distributed writes and performance,
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-data-upload.html
Hope it helps.

Delays when streaming data from the Google Search console API to BigQuery

So I've been trying to stream data from Google Search Console API to BigQuery in real time.
The data are retrieved from the GSC API and streamed into the BigQuery streaming buffer. However, I experience high latency before the streaming buffer is flushed (up to 2 hours or more), so the data stays in the streaming buffer but is not in the table.
The data are also not visible in the preview, and the table size shows 0 B with 0 rows (actually, after waiting for more than a day I still see 0 B even though there are more than 0 rows).
Another issue is that, some time after the data is stored in the table (table size and number of rows are correct), it simply disappears from it and appears in the streaming buffer (I only saw this once). -> This was explained in the second paragraph of shollyman's answer.
What I want is to have the data in the table in real time. According to the documentation this seems possible but doesn't work in my case (2h of delay as stated above).
Here's the code responsible for that part:
import uuid
from googleapiclient.discovery import build

# Build the BigQuery service once, outside the loop.
bigquery = build('bigquery', 'v2', cache_discovery=False)

for row in response['rows']:
    keys = ','.join(row['keys'])
    # Stream one row at a time to BigQuery via the tabledata.insertAll API
    row_to_stream = {'keys': keys, 'f1': row['f1'], 'f2': row['f2'],
                     'ctr': row['ctr'], 'position': row['position']}
    insert_all_data = {
        "kind": "bigquery#tableDataInsertAllRequest",
        "skipInvalidRows": True,
        "ignoreUnknownValues": True,
        'rows': [{
            'insertId': str(uuid.uuid4()),
            'json': row_to_stream,
        }]
    }
    bigquery.tabledata().insertAll(
        projectId=projectid,
        datasetId=dataset_id,
        tableId=tableid,
        body=insert_all_data).execute(num_retries=5)
I've seen questions that seem very similar to mine on here but I haven't really found an answer. I therefore have 2 questions.
1. What could cause this issue?
Also, I'm new to GCP and I've seen other options (at least they seemed like options to me) for real time streaming of data to BigQuery (e.g., using PubSub and a few projects around real time Twitter data analysis).
2. How do you pick the best option for a particular task?
By default, the BigQuery web UI doesn't automatically refresh the state of a table. There is a Refresh button when you click into the details of a table, that should show you the updated size information for both managed storage and the streaming buffer (displayed below the main table details). Rows in the buffer are available to queries, but the preview button may not show results until some data is extracted from the streaming buffer to managed storage.
I suspect the case where you observed data disappearing from managed storage and appearing back in the streaming buffer may have been a case where the table was deleted and recreated with the same name, or was truncated in some fashion and streaming restarted. Data doesn't transition from managed storage back to the buffer.
Deciding what technology to use for streaming depends on your needs. Pub/Sub is a great choice when you have multiple consumers of the information (multiple pub/sub subscribers consuming the same stream of messages independently), or you need to apply additional transformations of the data between the producer and consumer. To get the data from pub/sub to BigQuery, you'll still need a subscriber to write the messages into BigQuery, as the two have no direct integration.
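For illustration, a minimal subscriber that forwards Pub/Sub messages into BigQuery could look like the sketch below, assuming the google-cloud-pubsub and google-cloud-bigquery client libraries and JSON messages whose fields match the table schema (project, subscription and table names are placeholders).
import json
from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "my-subscription")  # placeholder

# Each message is parsed and streamed into BigQuery; failures are nacked so
# Pub/Sub redelivers them.
def callback(message):
    row = json.loads(message.data.decode("utf-8"))
    errors = bq.insert_rows_json(table_id, [row])
    if errors:
        message.nack()
    else:
        message.ack()

streaming_pull = subscriber.subscribe(subscription, callback=callback)
streaming_pull.result()  # block and keep pulling messages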

Why is loading dashDB analytics by trickle feed a bad idea?

I have a use case where I continuously need to trickle feed data into dashDB, however I have been informed that this is not optimal for dashDB.
Why is this not optimal? Is there a workaround?
Columnar warehouses are great for reads, but if you insert a single row into an N column table then the system has to cut the row into pieces and do N separate writes to disk. This makes small inserts relatively inefficient and things can slow down as a result.
You may want to do an initial batch load of data. Currently the compression dictionary is built only for bulk loads, so if you start with a new table and populate it only using inserts then the data doesn't get compressed at all.
Try to structure the loading into microbatches with a 2-5 minute load cycle.
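A minimal sketch of that pattern, assuming a Python feeder using the ibm_db_dbi driver (the connection string, table and columns are placeholders): buffer incoming rows and flush them as one multi-row insert every few minutes.
import time
import ibm_db_dbi

conn = ibm_db_dbi.connect("DATABASE=BLUDB;HOSTNAME=<host>;PORT=50000;UID=<user>;PWD=<password>", "", "")

buffer = []
FLUSH_INTERVAL = 300  # seconds, i.e. a 5-minute load cycle
last_flush = time.time()

# Flush the buffered rows as a single multi-row INSERT instead of one INSERT per event.
def flush():
    global last_flush
    if buffer:
        cursor = conn.cursor()
        cursor.executemany("INSERT INTO metrics (ts, metric, value) VALUES (?, ?, ?)", buffer)
        conn.commit()
        buffer.clear()
    last_flush = time.time()

def on_event(ts, metric, value):
    buffer.append((ts, metric, value))
    if time.time() - last_flush >= FLUSH_INTERVAL:
        flush()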
What is the use case here? Check whether dashDB Transactional can meet your need. dashDB Transactional is tuned for OLTP and point-of-sale transactions, which sounds like what you are trying to feed.

What is the most efficient way to store time series in Riak with heavy reads

My current approach:
I have one domain class - Application
Each application in my system is stored in "applications" bucket under APPLICATION_KEY key
Apart from the application metadata stored in this bucket, each application has its own bucket called "time_metrics/APPLICATION_KEY" where I store the time series in the following way:
KEY - timestamp / VALUE - some attributes
My concern is the efficiency of queries made over a specific time window for a given application. Currently, to get the time series for a specific time window and eventually make some reductions, I have to run a map/reduce over the whole "time_metrics/APPLICATION_KEY" bucket, which, from what I have found, is not the recommended use case for Riak MapReduce.
My question: what would be the best DB structure for this kind of system, and how do I query it efficiently?
Adding onto #macintux's answer.
Basho has had a few customers that have used Riak for time series metrics.
Boundary has a nice tech talk about how they use Riak with their network monitoring software. They rollup data into different chunks of time (1m, 5m, 15m) for analysis.
They also have a series of blog posts about lessons learned while implementing this system.
Kivra also has a good slide deck about how they use time series data with Riak.
You could roll up your data into some sort of arbitrary time length, then read the range you need by issuing regular K/V gets, and then reconstruct the larger picture / reduce in your application.
If you have spare computing power and you know in advance what keys you need, you certainly can use Riak's MapReduce, but often retrieving the keys and running your processing on the client will be as fast (and won't strain your cluster).
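As a rough illustration of that approach with the Python Riak client (bucket name and key scheme are just placeholders): derive each key from the application key plus the 5-minute block a timestamp falls into, so that any time window turns into a list of keys you can fetch directly.
import riak

BLOCK_SECONDS = 300  # 5-minute blocks
client = riak.RiakClient()
bucket = client.bucket("time_metrics")  # placeholder bucket name

# Keys look like "<application_key>/<block_start>", so a time window maps to plain K/V gets.
def block_key(application_key, timestamp):
    block_start = int(timestamp) - int(timestamp) % BLOCK_SECONDS
    return "%s/%d" % (application_key, block_start)

def append_sample(application_key, timestamp, attributes):
    obj = bucket.get(block_key(application_key, timestamp))
    samples = obj.data or []
    samples.append({"ts": timestamp, "attrs": attributes})
    obj.data = samples
    obj.store()

def read_window(application_key, start, end):
    first_block = int(start) - int(start) % BLOCK_SECONDS
    result = []
    for block_start in range(first_block, int(end) + 1, BLOCK_SECONDS):
        obj = bucket.get("%s/%d" % (application_key, block_start))
        if obj.data:
            result.extend(s for s in obj.data if start <= s["ts"] <= end)
    return result
In practice you would buffer samples in your application and write each block once, rather than doing a read-modify-write per sample – that is the whole point of rolling up.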
Some general ideas:
Roll up your data into larger blocks
If you're concerned about losing data if your client crashes while buffering it, you can always store the data as it arrives
Similar idea: store the data as it arrives, then retrieve it and roll it up at certain intervals
You can automatically expire data once you're confident it is being reliably stored in larger blocks, using either the Bitcask or Memory backends
Memory backend is quite useful (RAM permitting) for any data that only needs to be stored for a limited period of time
Related: don't be afraid to store multiple copies of your data to make reading/reporting easier later
Multiple chunks of time (5- and 15-minute blocks, for example)
Multiple report formats
Having said all that, if you're doing straight key/value requests (it's ideal to always be able to compute the keys you need, rather than doing indexing or searching), Riak can support very heavy traffic loads, so I wouldn't recommend spending too much time creating alternative storage mechanisms unless you know you're going to face latency problems.