How to update all records in a table at the same time (without updating records one by one) using stored procedure - sql-update

I have a table named Emp wit (Empname,Details). There are 4 records in the table. I want to update all records with a single update statement, without updating records one by one, using a stored procedure.

UPDATE [tableName] SET [columnName] = [value] WHERE [condition]

Related

How to backfill partitioned data in BigQuery?

I am trying to backfill data from GCP billing export table to another table say T1.
Both tables are partitioned.
Below scheduled query runs everyday to get yesterday’s data.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
Now I need to backfill the data , say for 15th May - how do I do that ?
I tried the backfill feature with the below query - expecting the backfill utility to take the past date i.e. May 15th as a param for the #run_date but that didn’t help.
SELECT * FROM gcp_billing_export_v1 WHERE DATE(_PARTITIONTIME) = #run_date
The data is pulled for 15th May from the source table(gcp_billing_export_v1) but is populated against current date in the destination table i.e May 15th May data is populated against June 22nd in the destination table T1. Where am I going wrong ?
Any guidance ?
Looks like you're using ingestion partitioning.
You would need to create a new table with the partitioning you want ie EventDate and populate that new table with historical and new daily data - as you can't overwrite an existing partition.
Link here: https://cloud.google.com/bigquery/docs/querying-partitioned-tables#query_an_ingestion-time_partitioned_table
As #Lemon already pointed out that you're using Ingestion time partitioned tables(both source and dest), you need to understand how it works. Ingestion time partitioned tables are different from the Regular partitioned tables.
From the Documentation-
When you create a table partitioned by ingestion time,BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data.
This type of table has a pseudo-column named _PARTITIONTIME.The value of this column is the ingestion time for each row.
Since you are using the SELECT * FROM gcp_billing_export_v1 you are getting all the data but without any _PARTITIONTIME column. And when you are saving the same result into the destination table , it is updating the _PARTITIONTIME column as per destintaion table's data ingestion-time.
Thus you have old data with the current date in _PARTITIONTIME
To avoid this your destination table needs to be either a normal table or regular partitioned table.
Also you need to have an extra column to hold the Datetime value from the source's_PARTITIONTIME column.You can create a regular partition on this new column.
Then to get _PARTITIONTIME in your result set , in your quer you need to mention the column name specifically in your query.
SELECT *,_PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = #run_date
The above query will return all the data from the gcp_billing_export_v1 table with 1 extra column ingestionTime.
Now you can backfill the data for 15th,May and save it to the new table.
You can also tweak around this below query to achieve the same
SELECT *,_PARTITIONTIME AS ingestionTime
FROM gcp_billing_export_v1
WHERE DATE(_PARTITIONTIME) = DATE_ADD(#run_date, INTERVAL -1 DAY)
It will run daily as per your need .Now if you want to pull data for 15th,May then you have to schedule the backfill for 16th,May(as per the where clause)

How to create table based on minimum date from other table in DAX?

I want to create a second table from the first table using filters with dates and other variables as follows. How can I create this?
Following is the expected table and original table,
Go to Edit Queries. Lets say our base table is named RawData. Add a blank query and use this expression to copy your RawData table:
=RawData
The new table will be RawDataGrouped. Now select the new table and go to Home > Group By and use the following settings:
The result will be the following table. Note that I didnt use the exactly values you used to keep this sample at a miminum effort:
You also can now create a relationship between this two tables (by the Index column) to use cross filtering between them.
You could show the grouped data and use the relationship to display the RawDate in a subreport (or custom tooltip) for example.
I assume you are looking for a calculated table. Below is the workaround for the same,
In Query Editor you can create a duplicate table of the existing (Original) table and select the Date Filter -> Is Earliest option by clicking right corner of the Date column in new duplicate table. Now your table should contain only the rows which are having minimum date for the column.
Note: This table is dynamic and will give subsequent results based on data changes in the original table, but you to have refresh both the table.
Original Table:
Desired Table:
When I have added new column into it, post to refreshing dataset I have got below result (This implies, it is doing recalculation based on each data change in the original source)
New data entry:
Output:

Prevent duplicates on insert with billions of rows in SQL Data Warehouse?

I'm trying to determine if there's a practical way to prevent duplicate rows from being inserted into a table using Azure SQL DW when the table already holds billions of rows (say 20 billion).
The root cause of needing this is that the source of the data is a third party that sends over supposedly unique data, but sometimes sends duplicates which have no identifying key. I unfortunately have no idea if we've already received the data they're sending.
What I've tried is to create a table that contains a row hash column (pre-calculated from several other columns) and distribute the data based on that row hash. For example:
CREATE TABLE [SomeFact]
(
Row_key BIGINT NOT NULL IDENTITY,
EventDate DATETIME NOT NULL,
EmailAddress NVARCHAR(200) NOT NULL,
-- other rows
RowHash BINARY(16) NOT NULL
)
WITH
(
DISTRIBUTION = HASH(RowHash)
)
The insert SQL is approximately:
INSERT INTO [SomeFact]
(
EmailAddress,
EventDate,
-- Other rows
RowHash
)
SELECT
temp.EmailAddress,
temp.EventDate,
-- Other rows
temp.RowHash
FROM #StagingTable temp
WHERE NOT EXISTS (SELECT 1 FROM [SomeFact] f WHERE f.RowHash = temp.RowHash);
Unfortunately, this is just too slow. I added some statistics and even created a secondary index on RowHash and inserts of any real size (10 million rows, for example) won't run successfully without erroring due to transaction sizes. I've also tried batches of 50,000 and those too are simply too slow.
Two things I can think of that wouldn't have the singleton records you have in your query would be to
Outer join your staging table with the fact table and filter on some NULL values. Assuming You're using Clustered Column Store in your fact table this should be a lot more inexpensive than the above.
Do a CTAS with a Select Distinct from the existing fact table, and a Select Distinct from the staging table joined together with a UNION.
My gut says the first option will be faster, but you'll probably want to look at the query plan and test both approaches.
Can you partition the 'main' table by EventDate and, assuming new data has a recent EventDate, CTAS out only the partitions that include the EventDate's of the new data, then 'Merge' the data with CTAS / UNION of the 'old' and 'new' data into a table with the same partition schema (UNION will remove the duplicates) or use the INSERT method you developed against the smaller table, then swap the partition(s) back into the 'main' table.
Note - There is a new option on the partition swap command that allows you to directly 'swap in' a partition in one step: "WITH (TRUNCATE_TARGET = ON)".

Is there any other approach for updating a row in Big Query apart from overwriting the table?

I have a package data with some of its fields as following:
packageid-->string
status--->string
status_type--->string
scans--->record(repeated)
scanid--->string
status--->string
scannedby--->string
Per day, I have a data of 100 000 packages. Total package data size per day becomes 100 MB(approx) and for 1 month it becomes 3GB. For each package, 3-4 updates can come. So do I have to overwrite the package table, every time a package update (e.g. just a change in status field) comes?
Suppose I have data of 3 packages in the table and now the update for 2nd package comes, do I have to overwrite the whole table (deleting and adding the whole data takes 2 transaction per package update)? For 100 000 packages, total transactions will be 10^5 * 10^5 * 2/2.
Is there any other approach for atomic updates without overwriting the table? (as if the table contains 1 million entries and then a package update comes, then overwriting the whole table will be an overhead.)
Currently there is no way to update individual rows. We do see this use case somewhat often, and we recommend something similar to what Mikhail suggested. Basically, if you have some unique ID for a logical row, and a timestamp of the update time to the row data, you can simply add every update as a new row, and apply a view over the table to give you the desired rows.
Your view would look something like this:
SELECT *
FROM (
SELECT
*,
MAX(<timestamp_column>)
OVER (PARTITION BY <id_column>)
AS max_timestamp,
FROM <table>
)
WHERE <timestamp_column> = max_timestamp
(cribbed from here Return only the newest rows from a BigQuery table with a duplicate items)
If your table is partitioned into daily tables (or becomes static after some period), you can then replace the view with the result of the view query after the table stabilizes, and improve your query efficiency.
e.g.
Add Data to TABLE_RAW.
Create view TABLE that performs the above query over TABLE_RAW
At some point after TABLE_RAW is stable, query TABLE with a destination table of TABLE, with write disposition WRITE_TRUNCATE.
Unfortunately, this does add a bit of overhead. That said, for your use case you may be able to just leave the view in place indefinitely, which would simplify things a bit.
You cannot update row in BigQuery table. You can only add one
Overwriting table on each and every transaction - kind of doesn't make sense at all from any prospective
I would suggest just adding each and every transaction as new row.
Meantime, if for any reason (storage cost, query cost, query performance etc.) you want to dedup - you can do batch dedup periodically - let's say daily. In this case, having original data partitioned in daily tables will be beneficial. As at each moment you will need only latest Deduped Table and recent Daily table to query latest transaction. And previous days daily table can be deleted if you worry of storage cost
Biquery supports updates now here, and supports transactions also.

First time Updating a table

I was recently given permissions to update a single table in our database but this is not something I have done before and I do not what to mess anything up. I have tried searching for something online that was similar to what I am wanting to do with no success.
The table name is dbo.Player_Miles and it only has two columns of data Player_ID and Miles both of which are set as (int,null).
Currently there are about 300K records in this table and I have a csv file I need to use to update this table. In the file there are 500k Records so I need to be able to:
INSERT the new records ~250k records
UPDATE the records with that have new information ~200K records
Leave untouched and record that has the same information(although updating those to the same thing would not hurt the database would be a resource hog I would guess) ~50K records
Also leave untouched any records in the table currently that are not in the updated file. ~50k records
I am using SSMS 2008 but the Server is 2000.
You could approach this in stages...
1) Backup the database
2) Create a temporary SQL table to hold your update records
create table Player_Miles_Updates (
PlayerId int not null,
Miles int null)
3) Load the records from your text file into your temporary table
bulk insert Player_Miles_Updates
from 'c:\temp\myTextRecords.csv'
with
(
FIELDTERMINATOR =' ,',
ROWTERMINATOR = '\n'
)
4) Begin a transaction
begin transaction
5) Insert your new data
insert into Player_Miles
select PlayerId, Miles
from Player_Miles_Updates
where PlayerId not in (select PlayerId from Player_Miles)
6) Update your existing data
update Player_Miles
set Player_Miles.Miles = pmu.Miles
from Player_Miles pm join Player_Miles_Updates pmu on pm.Player_Id = pmu.Player_Id
7) Select a few rows to make sure what you wanted to happen, happened
select *
from Player_Miles
where Player_Id in (1,45,86,14,83) -- use id's that you have seen in the csv file
8a) If all went well
commit transaction
8b) If all didn't go well
rollback transaction
9) Delete the temporary table
drop table Player_Miles_Updates
You should use SSIS (or DTS, which was replaced by SSIS in SQL Server 2005).
Use the CSV as your source and "upsert" the data to your destination table.
In SSIS there are different ways to get this task done.
An easy way would be to use a lookup task on Player_ID.
If there's a match update the value and if there's no match just insert the new value.
See this link for more informations on lookup-pattern-upsert