What does "interactive query" or "interactive analytics" mean in Presto SQL? - amazon-web-services

The Presto website (and other docs) talk about "interactive queries" on Presto. What is an "interactive query"? From the Presto Website: "Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse."

An interactive query system is basically a user interface that translates the input from the user into SQL queries. These are then sent to Presto, which processes the queries and gets the data and sends it back to the user interface.
The UI then renders the output, which is typically NOT just a simple table of numbers and text, but rather a complex chart, a diagram or some other powerful visualization.
The users expects to be able to e.g. update one criteria and get the updated chart or visualization in near real time, just like you expect on any application typically. Even if the creation of this analysis involves LOTS of data to be processed.
Presto can do that since it can query massive distributed object storage systems like HDFS and many other cloud storage systems, as well as RDBMSs and so on. And it can be set up to have a huge cluster of workers that query the source in parallel and therefore process massive amounts of data for analysis, and still be fast enough for the user expectations.
A typical application to use for the visualization is Apache Superset. You just hook up Presto to it via the JDBC driver. Presto has to be configured to point at the underlying data sources and you are ready to go.

Related

Power BI Embedded Approach for 100s of SQL Targets

I'm trying to find the best approach to delivering a BI solution to 400+ customers which each have their own database.
I've got PowerBI Embedded working using service principal licensing and I have the PowerBI service connected to my data through the On Premise Data Gateway.
I've build my first report pointing to 1 of the customer databases. Which works lovely.
What I want to do next, when embedding the report, is to tell PowerBI, for this session, to get the database from a different database.
I'm struggling to find somewhere where this is explained, or to understand if this is even possible.
I'm trying to avoid creating 400+ WorkSpaces or 400+ Data Sets.
If someone could point me in the right direction, it would be appreciated.
You can configure the report to use parameters and these parameters can be used to configure the source for your dataset:
https://www.phdata.io/blog/how-to-parameterize-data-sources-power-bi/
These parameters can be set by the app hosting the embedded report:
https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/update-parameters-in-group
Because the app is setting the parameter, each user will only see their own data. Since this will be a live connection, you would need to think about how the underlying server can support the workload.
An alternative solution would be to consolidate the customer databases into a single database (just the relevant tables) and use row level security to restrict access for each customer. The advantage to this design is that you take the burden off of the underlying SQL instance and push it into a PBI dataset that is made to handle huge datasets with sub-second response times.
More on that here: https://learn.microsoft.com/en-us/power-bi/enterprise/service-admin-rls

ODBC Equivalent of DBMS_ALERT in Oracle

Is there anything (system procedure,function or other) in SQL Server that will provide the functionality of DBMS_ALERT package of ORACLE (and DBMS_PIPE respectively)?
I work in a plant and I'm using an extension-product of SQL-Server called InSQL Server by Wonderware which is specialized in gothering data from plant controllers and HumanMachineInterface(SCADA) software.
This system can record events happening in the plant (like a high-temperature alarm, for example). It stores sensor values in extension tables of SQL Sever, and other less dense information in normal SQL Server tables.
I want to be able to alert some applications running on operator PCs that an event has been recorded in the database.
An after insert trigger in the events table seems to be a good place to put something equivalent to DBMS_ALERT (if it exists), to wake up other applications that are waiting for the specific alert and have the operators type in some data.
In other words - I want to be able to notify other processes (that have connection to SQL Server) that something has happened in the database.
All Wonderware (InSQL but now called Aveva) Historian data is stored in the history blocks EXCEPT for the actual tag storage configuration and dedicated event data. The time series data for analog, discrete and strings is NOT in SQL tables at all - unless someone is doing custom configuration to. create tables of their own.
Where are you wanting these notifications to come up? Even though the historical data is NOT stored in SQL tables, Wonderware has extensive documentation on how to use SQL queries to appropriately retrieve data (check for whatever condition you are looking for)
You can easily build a stored procedure and configure it for a maintenance plan.
But are you just trying to alarm (provide notification) on the scada itself?
Or are you truly utilizing historical data (looking for a data trend - average, etc.)?
Or trying to send the notification to non-scada interfaces?
Depending on your specific answer, the scada itself should probably be able to do it.
But there is software that already does this type of thing Win-911, SeQent, Scadatec are a couple in the OT space. But also things like Hip Link or even DeskAlert which can connect to any SQL via it's own API.
So where does the info need to go (email, text, phone, desktop app...) and what is the real source of the data>

PostgreSQL: How to store and fetch historical SQL data from Azure Data Lakes (ADLS)

I have one single Django web application deployed on Azure with a transactional SQL DB i.e. PostgreSQL.
Within the Django application, every day this historical data needs to be accessed (eg: to show the pattern over a period of years, months etc.) from the ADLS.
However, the ADLS will only return a single/multiple Files, and my application needs an intermediate such as Azure Synapse to convert this unstructured data into Structured DB in order to perform Queries on this historical data to show it within the web application.
Question. A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative.
Question. B) Since Django is inherently tied to ORM (Object Relation Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (i.e. ArrayField, JSONField etc.)
This entire exercise is being undertaken in order to store older historical data in a large repository and also access/query data from that ADLS repository whenever required.
Please guide what Azure alternatives may work in this case.
You need to break down your problem. For each piece you have multiple choices with different cost implications and complexity of implementation and amount of control/flexibility you get.
Question. A) Would Azure Synapse fulfil this 'unstructured to structured conversion' requirement, or is there another Azure alternative.
Synapse Serverless SQL Pool lets you query JSON files from Datalake without a physical DB. It's only compute no storage.
This is for infrequent access to large datasets, because every query goes and parses the data in Datalake.
If you want you can also COPY INTO some_table all the data from files and then perform queries more efficiently on some_table (which is stored in DB, with indices, partitions, ...) using a dedicated Synapse SQL Pool.
E.g. following JSON
{
"_id":"ahokw88",
"type":"Book",
"title":"The AWK Programming Language",
"year":"1988",
"publisher":"Addison-Wesley",
"authors":[
"Alfred V. Aho",
"Brian W. Kernighan",
"Peter J. Weinberger"
],
"source":"DBLP"
}
Can be queried with following SQL:
SELECT
JSON_VALUE(jsonContent, '$.title') AS title
, JSON_VALUE(jsonContent, '$.publisher') as publisher
, jsonContent
FROM OPENROWSET
(
BULK 'json/books/*.json',
DATA_SOURCE = 'SqlOnDemandDemo'
, FORMAT='CSV'
, FIELDTERMINATOR ='0x0b'
, FIELDQUOTE = '0x0b'
, ROWTERMINATOR = '0x0b'
)
WITH
( jsonContent varchar(8000) ) AS [r]
WHERE
JSON_VALUE(jsonContent, '$.title') = 'Probabilistic and Statistical Methods in Cryptology, An Introduction by Selected Topics'
Question. B) Since Django is inherently tied to ORM (Object Relation Mapping), would there be any compatibility issues between the web app's PostgreSQL and Azure Synapse (i.e. ArrayField, JSONField etc.)
Synapse offers good old JDBC drivers, so as long as your ORM layer can use a JDBC source you should be good to go. Remember that underlying data source (Synapse) is meant for MPP and not transactional processing. So inserting 1000 rows in a for loop using INSERT INTO... would take 1000 seconds, but querying 10 million rows using a SELECT ... statement would probably take less than 100. So know what you do with it.
Does Synapse have to be configured with both the App DB and ADLS in a pipeline system through Azure Data Factory? And is this achievable for a PostgreSQL DB? Since I could not Azure docs that talk specifically about PostgreSQL DB <---> ADLS connections. – Simran 14 hours ago
You're mixing things here. You can NOT use Synapse to give a single view of data across two data sources: 1) PostgreSQL, 2) ADLS.
Only source for Serverless is ADLS.
You can do this using Data Factory, which would allow you to create two data sources (ADLS and PostgreSQL), read from them, merge them to produce a new data set, write the output to some output data sink like PostgreSQL. Your Django code then would be able to read this from PostgreSQL as usual.
Understand the cost and performance implications of each piece before you make a decision:
Serverless SQL Pool
Dedicated SQL pool
Data Factory

Logstash and looking up additional data from a relational table?

I have mobile app log data being posted daily (eventually it will be a data stream). I am looking at different solutions for processing this log data and providing analytics. I am considering using logstash/elasticsearch/kibana combination, but we have additional data on our users stored in a redshift database. So in addition to the mobile data, I would like to pull in additional data from redshift about the user at the time of interaction with mobile app.
However, I've read in some places that doing an actual database query through logstash isn't feasible, but you can use a dictionary file to do a lookup of each user.
I have two questions regarding this approach
Is there a limit to have large this lookup file can be? Mine would be < 500K records so I'd imagine it would be fine?
Can the process of making the the lookup file from redshift tables be fully automated (ideally though aws services) - i.e. each night the lookup table is refreshed and posted to logstash, and then used for breakouts in Kibana
The way we're currently doing it is processing a daily jason file with a lambda function, posting it to s3 and then reading it into a redshift table. This data is then processed into sessions and joined up with other tables to generate the final dataset to be used for visualization. This is currently done in Tableau but we are exploring other options (such as quicksight, or possibly the ELK stack)
Just trying to figure out what solution is going to be scalable to clickstream data and will be the most useful down the line.
Thanks!
logstash 7 has a jdbc_streaming filter plugin for dynamically adding stuff to your events, as well as the jdbc_static filter for static stuff.
As you found, you can also use the translate filter. The man page says they've tested "very large" datasets up to 100,000 entries, so your dataset may require some testing. The good part about this filter is that it will reload the data when it detects a change, so you can publish the data on your own schedule (e.g. cron) without restarting logstash. Be on the lookout for events that don't get the translated value, which might be a sign that your publishing frequency should be updated.

Suitable storage method for huge amount of data

what kind of storage do you recommend for very huge amount of data? (≈ 50 milions records per day). Is this proper situation for systems like Hadoop or RDBMS is still sufficient for this purpose?
With the amount of data you are describing, you might indeed be pushing into the Big Data terrirtory. Based on the amount of the details you provided, I would suggest loading raw data into Hadoop cluster, running map/reduce jobs to parse it and to load into date-based directories. You can then define an external Hive table partitioned by date (daily? weekly?) mapped to the results of your map/reduce jobs.
Next step would depend on the complexity of your reports and needed response time. If you can easily express them in SQL, you can just run queries on your Hive table. If they are more elaborated, you might have to write custom map/reduce jobs. Many suggest Pig for it, but I am personally more comforatble with the straight Java.
If you don't care about the response time of the reports, you can run them on-demand. If you care, but open to wait for the results for, say, tens of seconds or a few minutes, you can store report results also in Hive. If you want your reports to show up fast, say, in web-based or mobile UI, you might want to store the report data in a relational database.