
Update operations in SQL Server occur in clustered index order?
I want to set a sequential position in a table, but SQL Server doesn't let me use ORDER BY in an UPDATE operation.
I tested it, and the updates occur in clustered index order (the clustered index is on the Position column), so everything looks fine, but can I trust that it will always work like that?
-- "quirky update": assigns the variable and the column in a single statement
DECLARE @Position BIGINT = 0

UPDATE
    Paginations
SET
    @Position = Position = @Position + 1
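For comparison, a deterministic alternative that doesn't rely on physical update order is to compute the positions with ROW_NUMBER() in an updatable CTE. This is only a sketch and assumes a hypothetical Id column that defines the desired order:

-- Sketch: assign positions explicitly instead of relying on update order.
-- Assumes a hypothetical unique Id column that defines the desired ordering.
;WITH Ordered AS (
    SELECT Position,
           ROW_NUMBER() OVER (ORDER BY Id) AS NewPosition
    FROM Paginations
)
UPDATE Ordered
SET Position = NewPosition;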


Django jumping model id number (AutoField) [duplicate]

This question already has answers here:
Identity increment is jumping in SQL Server database
(6 answers)
Closed 7 years ago.
I have a strange scenario in which the auto identity int column in my SQL Server 2012 database is not incrementing properly.
Say I have a table which uses an int auto identity as a primary key; it is sporadically skipping increments, for example:
1,
2,
3,
4,
5,
1004,
1005
This is happening on a random number of tables at very random times; I cannot replicate it to find any trends.
How is this happening?
Is there a way to make it stop?
This is all perfectly normal. Microsoft added sequences in SQL Server 2012 (finally, I might add) and changed the way identity values are generated. Have a look here for some explanation.
If you want to have the old behaviour, you can:
use trace flag 272 - this will cause a log record to be generated for each generated identity value. The performance of identity generation may be impacted by turning on this trace flag.
use a sequence generator with the NO CACHE setting (http://msdn.microsoft.com/en-us/library/ff878091.aspx); see the sketch after this list
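For reference, a minimal sketch of the sequence approach (object names are illustrative, not from the question):

-- Sketch: a sequence created with NO CACHE does not pre-allocate values,
-- so a restart or failover cannot "lose" a cached block and create a gap.
CREATE SEQUENCE dbo.seqOrderId
    AS INT
    START WITH 1
    INCREMENT BY 1
    NO CACHE;

-- use it as the default for the key column of a hypothetical table
CREATE TABLE dbo.Orders (
    OrderId INT NOT NULL
        CONSTRAINT DF_Orders_OrderId DEFAULT (NEXT VALUE FOR dbo.seqOrderId)
        CONSTRAINT PK_Orders PRIMARY KEY,
    OrderDate DATETIME2 NOT NULL
);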
I got the same problem and found the following bug report for SQL Server 2012:
Failover or Restart Results in Reseed of Identity
If it is still relevant, see the conditions that cause the issue - there are some workarounds there as well (I didn't try them, though).
While trace flag 272 may work for many, it definitely won't work for hosted SQL Server Express installations. So I created an identity table and use it through an INSTEAD OF trigger. I'm hoping this helps someone else, and/or gives others an opportunity to improve my solution. The last line of the trigger returns the last identity value assigned. Since I typically use this to add a single row, this works to return the identity of a single inserted row.
The identity table:
CREATE TABLE [dbo].[tblsysIdentities](
    [intTableId] [int] NOT NULL,
    [intIdentityLast] [int] NOT NULL,
    [strTable] [varchar](100) NOT NULL,
    [tsConcurrency] [timestamp] NULL,
    CONSTRAINT [PK_tblsysIdentities] PRIMARY KEY CLUSTERED
    (
        [intTableId] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
and the insert trigger:
-- INSERT --
IF OBJECT_ID ('dbo.trgtblsysTrackerMessagesIdentity', 'TR') IS NOT NULL
    DROP TRIGGER dbo.trgtblsysTrackerMessagesIdentity;
GO

CREATE TRIGGER trgtblsysTrackerMessagesIdentity
ON dbo.tblsysTrackerMessages
INSTEAD OF INSERT AS
BEGIN
    DECLARE @intTrackerMessageId INT
    DECLARE @intRowCount INT

    -- reserve a block of ids for the inserted rows
    SET @intRowCount = (SELECT COUNT(*) FROM INSERTED)
    SET @intTrackerMessageId = (SELECT intIdentityLast FROM tblsysIdentities WHERE intTableId = 1)
    UPDATE tblsysIdentities SET intIdentityLast = @intTrackerMessageId + @intRowCount WHERE intTableId = 1

    -- number the inserted rows within the reserved block
    INSERT INTO tblsysTrackerMessages(
        [intTrackerMessageId],
        [intTrackerId],
        [strMessage],
        [intTrackerMessageTypeId],
        [datCreated],
        [strCreatedBy])
    SELECT @intTrackerMessageId + ROW_NUMBER() OVER (ORDER BY [datCreated]) AS [intTrackerMessageId],
           [intTrackerId],
           [strMessage],
           [intTrackerMessageTypeId],
           [datCreated],
           [strCreatedBy]
    FROM INSERTED;

    -- return the last identity value assigned (intended for single-row inserts)
    SELECT TOP 1 @intTrackerMessageId + @intRowCount FROM INSERTED;
END
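A quick usage sketch (column names come from the trigger above; the values and the remaining column types are illustrative): seed the identity table once, then insert through the trigger and let it supply intTrackerMessageId:

-- Sketch: seed the identity row for this table once.
INSERT INTO dbo.tblsysIdentities (intTableId, intIdentityLast, strTable)
VALUES (1, 0, 'tblsysTrackerMessages');

-- Insert without intTrackerMessageId; the INSTEAD OF trigger assigns it
-- and its final SELECT returns the value.
INSERT INTO dbo.tblsysTrackerMessages
    (intTrackerId, strMessage, intTrackerMessageTypeId, datCreated, strCreatedBy)
VALUES
    (42, 'Hello', 1, GETDATE(), 'me');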

Why Query takes too long to fetch data

This is my query in Workbench:
select t1.COL1 from ex1 t1 left outer join ch1 t2 on t1.COL1 = t2.COL1;
Why is this taking so long to fetch the data?
Outer joins can be slow since all of t1's records are returned. Since you're joining on id columns, it should be easy to index them. Without an index, when you join to t2, you are evaluating each of the 142,000 records to search for matching ids. With an index, you are setting aside memory to "remember" the location of each id in sequence. It's like using a bookmark instead of flipping through each page to find the page you want.
I don't know what database management system you're using, but here's a guide on creating clustered and nonclustered indexes on SQL Server:
https://learn.microsoft.com/en-us/sql/t-sql/statements/create-index-transact-sql
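As a rough sketch (assuming the join column is named COL1; the exact index syntax varies slightly by DBMS):

-- Sketch: index the join columns on both tables so the join can seek instead of scan.
CREATE INDEX idx_ex1_col1 ON ex1 (COL1);
CREATE INDEX idx_ch1_col1 ON ch1 (COL1);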

What is the best practice for loading data into BigQuery table?

Currently I'm loading data from Google Storage into stage_table_orders using WRITE_APPEND. Since this loads both new and existing orders, there can be a case where the same order has more than one version; the etl_timestamp field tells which row is the most recent one.
Then I WRITE_TRUNCATE my production_table_orders with a query like:
select ...
from (
SELECT * , ROW_NUMBER() OVER
(PARTITION BY date_purchased, orderid order by etl_timestamp DESC) as rn
FROM `warehouse.stage_table_orders` )
where rn=1
Then the production_table_orders always contains the most updated version of each order.
This process is supposed to run every 3 minutes.
I'm wondering if this is the best practice.
I have around 20M rows, and it doesn't seem smart to WRITE_TRUNCATE 20M rows every 3 minutes.
Suggestion?
We are doing the same. To help improve performance, though, try to partition the table by date_purchased and cluster by orderid.
Use a CTAS statement (over the table itself), as you cannot add partitioning after the fact; a sketch follows below.
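For example, a minimal sketch of that CTAS (assuming date_purchased is a TIMESTAMP; adjust PARTITION BY if it is already a DATE):

-- Sketch: rebuild the staging table partitioned and clustered in place.
CREATE OR REPLACE TABLE `warehouse.stage_table_orders`
PARTITION BY DATE(date_purchased)
CLUSTER BY orderid
AS
SELECT * FROM `warehouse.stage_table_orders`;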
EDIT: use 2 tables and MERGE
Depending on your particular use case (i.e. the number of fields that could change between the old and new versions), you could use 2 tables, e.g. stage_table_orders for the imported records and final_table_orders as the destination table, and do a MERGE like so:
MERGE final_table_orders F
USING stage_table_orders S
ON F.orderid = S.orderid AND
F.date_purchased = S.date_purchased
WHEN MATCHED THEN
UPDATE SET field_that_change = S.field_that_change
WHEN NOT MATCHED THEN
INSERT (field1, field2, ...) VALUES(S.field1, S.field2, ...)
Pro: efficient if few rows are "upserted", not millions (although not tested) + pruning partitions should work.
Con: you have to explicitly list the fields in the update and insert clauses. A one-time effort if schema is pretty much fixed.
There are many ways to de-duplicate and there is no one-size-fits-all. Search on SO for similar requests using ARRAY_AGG, EXISTS with DELETE, UNION ALL, etc. Try them out and see which performs better for YOUR dataset.
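As one illustration (a sketch only, not tested against your data), the ARRAY_AGG way of keeping the latest version of each order looks like this:

-- Sketch: keep the newest row per (date_purchased, orderid) using ARRAY_AGG.
SELECT latest.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY t.etl_timestamp DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM `warehouse.stage_table_orders` AS t
  GROUP BY date_purchased, orderid
);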

Cassandra: How to query the complete data set?

My table has 77k entries (the number of entries keeps increasing at a high rate), and I need to make a select query in CQL 3. When I do select count(*) ... where (some_conditions) allow filtering I get:
count
-------
10000
(1 rows)
Default LIMIT of 10000 was used. Specify your own LIMIT clause to get more results.
Let's say 23k rows satisfy some_condition. The 10000 count above is of the first 10k of these 23k rows, right? But how do I get the actual count?
More importantly, how do I get access to all of these 23k rows, so that my Python API can perform some in-memory operations on the data in some columns of the rows? Are there some sort of pagination principles in Cassandra CQL 3?
I know I can just increase the limit to a very large number, but that's not efficient.
Working Hard is right, and LIMIT is probably what you want. But if you want to "page" through your results at a more detailed level, read through this DataStax document titled: Paging through unordered partitioner results.
This will involve using the token function on your partitioning key; a rough sketch follows below. If you want more detailed help than that, you'll have to post your schema.
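Roughly, token-based paging looks like this (table name, column names, and key type are hypothetical, since the schema wasn't posted):

-- Sketch: page through a hypothetical table whose partition key is "id" (text).
SELECT id, col1 FROM my_table LIMIT 1000;
-- Next page: feed the last id returned by the previous page into token().
SELECT id, col1 FROM my_table WHERE token(id) > token('abc123') LIMIT 1000;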
While I cannot see your complete table schema, by virtue of the fact that you are using ALLOW FILTERING I can tell that you are doing something wrong. Cassandra was not designed to serve data based on multiple secondary indexes. That approach may work with a RDBMS, but over time that query will get really slow. You should really design a column family (table) to suit each query you intend to use frequently. ALLOW FILTERING is not a long-term solution, and should never be used in a production system.
You just have to specify a limit with your query.
Let's assume your table contains under 100,000 (one lakh) records; if you execute the query below, it will give you the actual count of the records in the table.
select count(*) ... where (some_conditions) allow filtering limit 100000;
Another way is to write Python code (cqlsh itself is a Python script), for example:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# contact points and keyspace name are placeholders
session = Cluster(['127.0.0.1']).connect('my_keyspace')
# selecting the rows themselves (rather than count(*)) lets the driver page through them
statement = SimpleStatement("SELECT some_column FROM SOME_TABLE", fetch_size=1000)
future = session.execute_async(statement)
rows = future.result()
count = 0
for row in rows:
    count = count + 1
The above uses the Cassandra Python driver's automatic paging (fetch size) feature.

Size of data obtained from SQL query via ODBC API

Does anybody know how I can get the number of the elements (rows*cols) returned after I do an SQL query? If that can't be done, then is there something that's going to be relatively representative of the size of data I get back?
I'm trying to make a status bar that indicates how much of the returned data I have processed, so I want to be somewhere relatively close. Any ideas?
Please note that SQLRowCount only returns the number of rows affected by an UPDATE, INSERT, or DELETE statement, not the number of rows returned from a SELECT statement (as far as I can tell). So I can't just multiply that by the SQLColCount result.
My last option is to have a status bar that goes back and forth, indicating that data is being processed.
That is frequently a problem when you want to reserve dynamic memory to hold the entire result set.
One technique is to return the count as part of the result set.
WITH
data AS
(
    SELECT interesting-data
    FROM interesting-table
    WHERE some-condition
)
SELECT COUNT(*) OVER () AS total_rows, data.*
FROM data
If you don't know beforehand which columns you are selecting,
or you use a *, as in the example above,
then the number of columns can be selected from the USER_TAB_COLS table:
SELECT COUNT(*)
FROM USER_TAB_COLS
WHERE TABLE_NAME = 'interesting-table'
SQLRowCount can return the number of rows for SELECT queries if the driver supports it. Many drivers don't, however, because it can be expensive for the server to compute this. If you want to guarantee that you always have a count, you must use COUNT(*), thus forcing the server into doing the potentially time-consuming calculation (or causing it to delay returning any results until the entire result is known).
My suggestion would be to attempt SQLRowCount first, so that the server or driver can decide if the number of rows is easily computable. If it returns a value, then multiply it by the result from SQLNumResultCols. Otherwise, if it returns -1, use the back-and-forth status bar. Sometimes this is better because you appear more responsive to the user.