What is LATEST_ON syntax in QuestDB? - questdb

I'm using QuestDB and SQL for the first time, and I stumbled upon the LATEST_ON syntax used in QuestDB. Can someone explain it's usage and where to use it?

Quoted from the docs:
For scenarios where multiple time series are stored in the same table, it is relatively difficult to identify the latest items of these time series with standard SQL syntax. QuestDB introduces LATEST ON clause for a SELECT statement to remove boilerplate clutter and splice the table with relative ease.
For more information visit the official documentation

LATEST ON is to find the latest record for each unique time series in a table. See this page for some examples: https://questdb.io/docs/reference/sql/latest-on/

It gives you the latest available record for each combination of the PARTITION BY values, according to the ON timestamp
Maybe easier to understand with an example. If you go to https://demo.questdb.io you can execute this query
select * from trades latest on timestamp
partition by symbol, side
It will then show you the latest existing row for each combination of Symbol and Side. If you wanted to do this using standard SQL, you would probably have to use a window function, something like this
select * from
(select *
,ROW_NUMBER() over (partition by Symbol, Side
order by timestamp DESC) AS RowNumber
from trades where timestamp > '2022-10-01') t
where t.RowNumber = 0

Latest on retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple time series are stored in the same table.
Check this link for some examples: https://questdb.io/docs/reference/sql/latest-on/

Related

Detecting delta records for nightly capture?

I have an existing HANA warehouse which was built without create/update timestamps. I need to generate a number of nightly batch delta files to send to another platform. My problem is how to detect which records are new or changed so that I can capture those records within the replication process.
Is there a way to use HANA's built-in features to detect new/changed records?
SAP HANA does not provide a general change data capture interface for tables (up to current version HANA 2 SPS 02).
That means, to detect "changed records since a given point in time" some other approach has to be taken.
Depending on the information in the tables different options can be used:
if a table explicitly contains a reference to the last change time, this can be used
if a table has guaranteed update characteristics (e.g. no in-place update and monotone ID values), this could be used. E.g.
read all records where ID is larger than the last processed ID
if the table does not provide intrinsic information about change time then one could maintain a copy of the table that contains
only the records processed so far. This copy can then be used to
compare the current table and compute the difference. SAP HANA's
Smart Data Integration (SDI) flowgraphs support this approach.
In my experience, efforts to try "save time and money" on this seemingly simple problem of a delta load usually turn out to be more complex, time-consuming and expensive than using the corresponding features of ETL tools.
It is possible to create a Log table and organize columns according to your needs so that by creating a trigger on your database tables you can create a log record with timestamp values. Then you can query your log table to determine which records are inserted, updated or deleted from your source tables.
For example, following is from one of my test trigger codes
CREATE TRIGGER "A00077387"."SALARY_A_UPD" AFTER UPDATE ON "A00077387"."SALARY" REFERENCING OLD ROW MYOLDROW,
NEW ROW MYNEWROW FOR EACH ROW
begin INSERT
INTO SalaryLog ( Employee,
Salary,
Operation,
DateTime ) VALUES ( :mynewrow.Employee,
:mynewrow.Salary,
'U',
CURRENT_DATE )
;
end
;
You can create AFTER INSERT and AFTER DELETE triggers as well similar to AFTER UPDATE
You can organize your Log table so that so can track more than one table if you wish just by keeping table name, PK fields and values, operation type, timestamp values, etc.
But it is better and easier to use seperate Log tables for each table.

Pentaho PDI get SQL SUM() with conditions

I'm using Pentaho PDI 7.1. I'm trying to convert data from Mysql to Mysql changing the structure of data.
I'm reading the source table (customers) and for each row I've to run another query to calculate the balance.
I was trying to use Database value lookup to accomplish it but maybe is not the best way.
I've to run a query like this to get the balance:
SELECT
SUM(
CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END
)
FROM Movimento WHERE contoFidelizzato_id = ?
I should set the parameter taking it from the previous step. Some advice?
The Database lookup value may be a good idea, especially if you are used to database reasoning, but it may result in many queries which may not be the most efficient.
A more PDI-ish style would be to make the query like:
SELECT contoFidelizzato_id
, SUM(CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END)
FROM Movimento
GROUP BY contoFidelizzato_id
and use it as the info source of a Lookup Stream Step, like this:
An even more PDI-ish style would be to divert the source table (customer) in two flows : one in which you keep the source rows, and one that you group by contoFidelizzato_id. Of course, you need a formula, or a Javascript, or to put a formula in the SQL of the Table input to change the sign when needed.
Test to know which strategy is better in your case. You'll soon discover that the PDI is very good at handling large data.

Informatica : taking very long time when doing insert

i have one mapping which just includes one source table and one target table. The source table has 100 columns and around 33xxxx records, i need to use this tool to insert to the target table and the logic is insert only. The version of informatica is 9.6.1 version and Database is SQL Server 2012.
After i run the workflow, it takes 5x/s to insert. the speed is too slow. I think it may be related to the number of columns
Can anyone help me how to increase the speed?
Thanks a lot
I think i know the reason why it happened. It is there are two fields which are ntext field in this table. That's why it takes very long time.
You can try the below options
1) Use bulk option for 'Target Load type' attribute in session if the target table doesn't have any indexes or keys on it
2) If there is any SQL override in the SOURCE QUALIFIER try to tune the query
3) Find for 'BUSY' in the session log and note down the busy percentages of each thread. Based on the thread percentages you will be able to identify the exact thread which is taking more time (Reader, Transformation, Writer)
4) Try to use informatica partitions through which you can achieve parallel processing.
Thanks and Regards,
Raj
Consider following points to increase the performance:
Increase the "commit interval" size in the session level properties.
Use the "bulk load" in session level properties.
You can also use the "partitioning" in session level, to do this you need partitioning license.
If your source is a database and you are doing sql override in source qualifier transformation , then you can also use the "Hints" for increasing the performan

nosql/dynamodb hash and range use case

It's my first time using a NoSQL database so I'm really confused. I'd really appreciate any help I can get.
I want to store data comprising announcements in my table. Essentially, each announcement has an ID, a date, and a text.
So for example, an announcement might have ID of 1, date of 2014/02/26, and text of "This is a sample announcement". Newer announcements always have a greater ID value than older announcements, since they are added to the table later.
There are two types of queries I want to run on this table:
I want to retrieve the text of the announcements sorted in order of date.
I want to retrieve the text and dates of the x most recent announcements (say, the 3 most recent announcements).
So I've set up the table with the following attributes:
ID (number) as primary key, and
date (string) as range
Is this appropriate for what my use cases? And if so, what kind of query/reads/requests/scans/whatever (I'm really confused about the terminology here too) should I be running to accomplish the two types of queries I want to make?
Any help will be very much appreciated. Thanks!
You are on the right track.
As far as sorting, DynamoDB will sort by the range key, so date will work but I'd recommend storing it as a number, perhaps milliseconds since the Unix epoch, rather than a String. This will make it trivial to get the announcements in ascending or descending order based on their created date.
See this answer for an overview of local vs global secondary indexes and what capabilities they provide: Optional secondary indexes in DynamoDB
As far as retrieving all items, you would need to perform a scan. Scans are not as efficient as queries, but since all of Dynamo is on SSD's they're still relatively quick. You don't get the single digit millisecond performance with a scan that you get with a query, so if there's a way to associate announcements with a user ID, you might get better performance than with a scan.
Note that you cannot modify the table schema (hash key, range key, and indexes) after you create the table. There are ways to manually migrate a table or import/export it, but the point is that you should think hard about current and future query requirements up front and design the table to support them. It's very easy to add or stop storing non-key or non-item attributes though, which provides nice flexibility.
Finally, try to avoid thinking of Dynamo as relational. With Dynamo, in a lot of cases you may well be better off de normalizing or duplicating some of the data in exchange for fast query performance.

borland builder c++ oracle question

I have a Borland builder c++ 6 application calling Oracle 10g database. Operating over a LAN. When the application in question makes a simple db select e.g.
select table_name from element_tablenames where element_id = 10023842
the following is recorded as happening in Oracle (from the performance logs)
select table_name
from element_tablenames
where element_id = 10023842
then immediately (and not from C++ source code but perhaps deeper)
select table_name, element_tablenames.ROWID
from element_tablenames
where element_id = 10023842
The select statement is only called once in the TADODbQuery object, yet two queries are being performed - one to parse and the other adds the ROWID for executon.
Over a WAN and many, many queries this is obviously a problem to the user.
Does anyone know why this might be happening, can someone suggest a solution?
Agree with Robert.
The ROWID uniquely identifies a row in a table so that the returned record can be applied back to the database with any changes (or as a DELETE).
Is there a way to identify a particular column (or set of columns) as a primary key so that it can be used to identify a row without using a ROWID.
I don't know exactly where the RowID is coming from, it could be either the TAdoQuery implementation or the Oracle Driver. But I am sure I found the reason.
From the Oracle docs:
If the database table does not contain a primary key, the ROWID must be selected explicitly when populating DataTable.
So I suspect your Table does not have a primary key, either add one or add the rowid.
Either way this will solve the duplicate query problem.
Since you are concerned about performance. In general
Using TAdoQuery you can set the CursorType to optimize different behaviors for performance. This article covers this from a TAdoQuery perspective. MSDN also has an article that covers it from from a general ADO Perspective. Finally the specifications from the Oracle Driver can be useful.
I would recommend setting the Cursor to either as they are the only supported by Oracle
ctStatic - Bi-directional query produced.
ctOpenForwardOnly - Unidirectional query produced, fastest but can't call Prior
You can also play with CursorLocation to see how it effects your speed.