MariaDB: multiple table update does not update a single row multiple times? Why? - sql-update

Today I was just bitten in the rear end by something I didn't expect. Here's a little script to reproduce the issue:
create temporary table aaa_state(id int, amount int);
create temporary table aaa_changes(id int, delta int);
insert into aaa_state(id, amount) values (1, 0);
insert into aaa_changes(id, delta) values (1, 5), (1, 7);
update aaa_changes c join aaa_state s on (c.id=s.id) set s.amount=s.amount+c.delta;
select * from aaa_state;
The final result in the aaa_state table is:
ID
Amount
1
5
Whereas I would expect it to be:
ID
Amount
1
12
What gives? I checked the docs but cannot find anything that would hint at this behavior. Is this a bug that I should report, or is this by design?

The behavior you are seeing is consistent with two updates happening on the aaa_state table. One update is assigning the amount to 7, and then this amount is being clobbered by the second update, which sets to 5. This could be explained by MySQL using a snapshot of the aaa_state table to fetch the amount for each step of the update. If true, the actual steps would look something like this:
1. join the two tables
2. update the amount using the "first" row from the changes table.
now the cached result for the amount is 7, but this value will not actually
be written out to the underlying table until AFTER the entire update
3. update the amount using the "second" row from the changes table.
now the cached amount is 5
5. the update is over, write 5 out for the actual amount
Your syntax is not really correct for what you want to do. You should be using something like the following:
UPDATE aaa_state as
INNER JOIN
(
SELECT id, SUM(delta) AS delta_sum
FROM aaa_changes
GROUP BY id
) ac
ON ac.id = as.id
SET
as.amount = as.amount + ac.delta_sum;
Here we are doing a proper aggregation of the delta values for each id in a separate bona-fide subquery. This means that the delta sums will be properly computed and materialized in the subquery before MySQL does the join, to update the first table.

Related

BigQuery MERGE statement billing more bytes than editor shows

I have a very large (3.5B records) table that I want to update/insert (upsert) using the MERGE statement in BigQuery. The source table is a staging table that contains only the new data, and I need to check if the record with a corresponding ID is in the target table, updating the row if so or inserting if not.
The target table is partitioned by an integer field called IdParent, and the matching is done on IdParent and another integer field called IdChild. My merge statement/script looks like this:
declare parentList array<int64>;
set parentList = array(select distinct IdParent from dataset.Staging);
merge into dataset.Target t
using dataset.Staging s
on
-- target is partitioned by IdParent, do this for partition pruning
t.IdParent in unnest(parentList)
and t.IdParent = s.IdParent
and t.IdChild = s.IdChild
when matched and t.IdParent in unnest(parentList) then
update
set t.Column1 = s.Column1,
t.Column2 = s.Column2,
...<more columns>
when not matched and IdParent in unnest(parentList) then
insert (<all the fields>)
values (<all the fields)
;
So I:
Pull the IdParent list from the staging table to know which partitions to prune
limit the partitions of the target table in the join predicate
also limit the partitions of the target table in the match/not matched conditions
The total size of dataset.Target is ~250GB. If I put this script in my BQ editor and remove all the IdParent in unnest(parentList) then it shows ~250GB to bill in the editor (as expected since there's no partition pruning). If I add the IdParent in unnest(parentList) back in so the script is exactly like you see it above i.e. attempting to partition prune, the editor shows ~97MB to bill. However, when I look at the query results, I see that it actually billed ~180GB:
The target table is also clustered on the two fields being matched, and I'm aware that the benefits of clustering are typically not shown in the editor's estimate. However, my understanding is that that should only make the bytes billed smaller... I can't think of any reason why this would happen.
Is this a BQ bug, or am I just missing something? BigQuery doesn't even say "the script is estimated to process XX MB", it says "This will process XX MB" and then it processes way more.
That's very interesting. What you did seems totally correct.
It seems BQ query planner could interpret your SQL correctly and know the partition pruning is provided, but when it executes. it failed to do so.
try removing t.IdParent in unnest(parentList) from both when matched clauses to see if the issue still happens, that is,
declare parentList array<int64>;
set parentList = array(select distinct IdParent from dataset.Staging);
merge into dataset.Target t
using dataset.Staging s
on
-- target is partitioned by IdParent, do this for partition pruning
t.IdParent in unnest(parentList)
and t.IdParent = s.IdParent
and t.IdChild = s.IdChild
when matched then
update
set t.Column1 = s.Column1,
t.Column2 = s.Column2,
...<more columns>
when not matched then
insert (<all the fields>)
values (<all the fields)
;
It would be a good idea to submit a bug to BigQuery if it couldn't be resolved.

Can we create Dynamic Date table in mapping Data Flow?

I have a query in Power BI that takes two parameter: Start Date and End Date.
Whenever I pass these Dates it return a table of Date that contain few columns created according to this range of date such as Date, QuarterofYear, Year, MonthName......etc.
Can we create a mapping data flow in ADF that takes two parameter as input and return a calculated table according to provided dates?
Is there any function that return the range of dates?
For your request: "I want that I pass two date Start Date and End Date in ADF Mapping Data Flow , and Data flow will Create a column such as "Date" that contain that number of Date rows. Is there any function for this? Exam. Start Date=20-01-2019, End Date=20-01-2020 Then Date Column Values should be: 20-01-2019 21-01-2019 ......... ......... 20-02-2020", according the Data Factory documents and my experience, the answer is no, we can't achieve it in Data Flow.
There is a solution to this, but it is a bit tricky.
TL;DR
The general data flow looks like this:
We need a dummy source with exactly one row which contains whatever.
Then we derive a column where we use the mapLoop() expression to create an array of all the dates we want to get rows for.
Finally, we need to flatten the array column which will result in one row per array entry and thus one row per date.
Walkthrough
Source dummy
Each dataflow needs a source and we need exactly one row to make our dataflow work. To achieve this I've created a dataset called empty of type CSV in my data lake which has this content:
empty
""
This is our source definition:
And its result looks like this:
Derived column days
This is where the magic happens!
We create a new column dates which is an array of all the dates we want to have in our date table:
In this scenario we want a date table starting on 2019-01-01 and reaching one year into the future. The full expression looks like this:
mapLoop(
addDays(currentDate(), 365) - toDate(2019-01-01),
addDays(toDate(2019-01-01), #index)
)
This is what happens here:
the mapLoop() function builds an array of elements. You specify the number of elements you want to have and the lambda expression to calculate each of the elements. For example, mapIndex([1, 2, 3, 4], #item + 2 + #index) results in [4, 6, 8, 10]
addDays(currentDate(), 365) - toDate('2019-01-01') is the number of days between our start (2019-01-01) and end date (1 year in the future from now) and thus the number of dates we want to have in our resulting array.
addDays(toDate(2019-01-01), #index) calculates each array item by adding #index days to our start date. This is executed for the number of days we've calculated before and #index is the array position. Thus, the first element of the array will be 2019-01-01 + 1, the second 2019-01-01 + 2 and so on.
Our stream now has these columns:
Flatten
Finally, you need a flatten transformation which will expand each item in your array to its dedicated row. We can also dismiss the useless empty column in this step:
And this finally results in what we wanted to achieve:
References
Data transformation expressions in mapping data flow

DynamoDB QuerySpec {MaxResultSize + filter expression}

From the DynamoDB documentation
The Query operation allows you to limit the number of items that it
returns in the result. To do this, set the Limit parameter to the
maximum number of items that you want.
For example, suppose you Query a table, with a Limit value of 6, and
without a filter expression. The Query result will contain the first
six items from the table that match the key condition expression from
the request.
Now suppose you add a filter expression to the Query. In this case,
DynamoDB will apply the filter expression to the six items that were
returned, discarding those that do not match. The final Query result
will contain 6 items or fewer, depending on the number of items that
were filtered.
Looks like the following query should return (at least sometimes) 0 records.
In summary, I have a UserLogins table. A simplified version is:
1. UserId - HashKey
2. DeviceId - RangeKey
3. ActiveLogin - Boolean
4. TimeToLive - ...
Now, let's say UserId = X has 10,000 inactive logins in different DeviceIds and 1 active login.
However, when I run this query against my DynamoDB table:
QuerySpec{
hashKey: null,
rangeKeyCondition: null,
queryFilters: null,
nameMap: {"#0" -> "UserId"}, {"#1" -> "ActiveLogin"}
valueMap: {":0" -> "X"}, {":1" -> "true"}
exclusiveStartKey: null,
maxPageSize: null,
maxResultSize: 10,
req: {TableName: UserLogins,ConsistentRead: true,ReturnConsumedCapacity: TOTAL,FilterExpression: #1 = :1,KeyConditionExpression: #0 = :0,ExpressionAttributeNames: {#0=UserId, #1=ActiveLogin},ExpressionAttributeValues: {:0={S: X,}, :1={BOOL: true}}}
I always get 1 row. The 1 active login for UserId=X. And it's not happening just for 1 user, it's happening for multiple users in a similar situation.
Are my results contradicting the DynamoDB documentation?
It looks like a contradiction because if maxResultSize=10, means that DynamoDB will only read the first 10 items (out of 10,001) and then it will apply the filter active=true only (which might return 0 results). It seems very unlikely that the record with active=true happened to be in the first 10 records that DynamoDB read.
This is happening to hundreds of customers that are running similar queries. It works great, when according to the documentation it shouldn't be working.
I can't see any obvious problem with the Query. Are you sure about your premise that users have 10,000 items each?
Your keys are UserId and DeviceId. That seems to mean that if your user logs in with the same device it would overwrite the existing item. Or put another way, I think you are saying your users having 10,000 different devices each (unless the DeviceId rotates in some way).
In your shoes I would just remove the filterexpression and print the results to the log to see what you're getting in your 10 results. Then remove the limit too and see what results you get with that.

PowerBI/DAX: Unable to correctly compare two dates

I have this custom date that I created as a measure:
Start Date = DATE(YEAR(MAX(Loss[dte_month_end]))-1,12,31)
So this part looks fine in PowerBI and seems to be the right format.
So now I created a new column where I'm going through my data to check whether a record is equal to my "Start Date" as defined above.
IsStart = IF(Loss[dte_month_end]=[Start Date], TRUE, FALSE)
but the weird thing is that all records are evaluated to false.
I know this is actually not the case in my actual data, and I could find actual records with dte_month_end = 12/31/2017 as shown above.
Can someone help me understand why the IF statement would not be able to evaluate this correctly? I initially thought that this may be a case of the DATETIME format being inconsistent - but I purposefully changed both formats to be the same to no avail.
Thanks.
Edit1----------- FYI:
This is the format that my dte_month_end field has:
Edit2 --
I tried changing the dte_month_end format to Date instead of DateTime, and it still doesn't seem to work:
This is happening because you are using a measure inside of a calculated column. When you do this, the filter context for the measure is the row context in the table.
To fix this, you need to modify the filter context for your measure. For example:
Start Date = DATE(YEAR(CALCULATE(MAX(Loss[dte_month_end]), ALL(Loss))) - 1, 12, 31)
or
Start Date = DATE(YEAR(MAXX(ALL(Loss), Loss[dte_month_end])) - 1, 12, 31)
If you don't do this, the MAX only looks at the current row, rather than all the rows in the table.

How do I change the individual value to be summed during annotation in Django?

I am front end developer new to django. There is a certain column(server_reach) in our postgres DB which has values of (1,2). But I need to write a query which tells me if at least one of the filtered rows has a row with reachable values( 1= not reachable, 2 = reachable).
I was initially told that the values of the column would be (0,1) based on which I wrote this:
ServerAgent.objects.values('server').filter(
app_uuid_url=app.uuid_url,
trash=False
).annotate(serverreach=Sum('server_reach'))
The logic is simple that I fetch all the filtered rows and annotate them with the sum of the server_reaches. If this is more than zero then at least one entry is non-zero.
But the issue is that the actual DB has values (1,2). And this logic will not work anymore. I want to subtract the server_reach of each row by '1' before summing. I have tried F expressions as below
ServerAgent.objects.values('server').filter(
app_uuid_url=app.uuid_url,
trash=False
).annotate(serverreach=Sum(F('server_reach')-1))
But it throws the following error. Please help me getting this to work.
AttributeError: 'ExpressionNode' object has no attribute 'split'
Use Avg instead of Sum. If average value is greater than 1 then at least one row contains value of 2.