I am running the query below to create a dimension table; however, the identity column is not yielding sequential values - the values appear to be random. Is there a reason for this?
I have tried a stored procedure and also a manual insert, but the result is the same.
CREATE TABLE Dim.CDIM_State
(
intStateDimKey int IDENTITY(1,1) NOT NULL,
txtState nvarchar(250),
dtCreatedOn datetime,
dtModifiedOn datetime
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
)
The output is something like this; I expect sequential values, i.e. 1, 2, 3, 4, 5.
That is correct. There are separate identity counters for each distribution. It won't affect your dimension; the values will always be unique.
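If you ever need gap-free sequential surrogate keys, a common workaround is to compute them at load time with ROW_NUMBER() instead of relying on IDENTITY. A minimal sketch, assuming a hypothetical staging table Stg.State that holds the incoming state names:
SELECT
    ROW_NUMBER() OVER (ORDER BY txtState) AS intStateDimKey,  -- densely sequential key
    txtState,
    GETDATE() AS dtCreatedOn,
    GETDATE() AS dtModifiedOn
FROM (SELECT DISTINCT txtState FROM Stg.State) AS s;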
Please have a look at the following data example:
In this table, I have multiple columns and no PRIMARY KEY. As per the image I attached, there are a few duplicates in STK_CODE, and I want to remove the duplicate rows based on the (min) column.
According to the image, one stk_code has three different rows, and each of these duplicates has a different value in the (min) column. I want to keep the row that has the minimum value in the (min) column.
I am very new to SQLite and I am using it from C++ (linking with -lsqlite3).
Is there any way to do this?
Your table has rowid as primary key.
Use it to get the rowids that you don't want to delete:
DELETE FROM comparison
WHERE rowid NOT IN (
SELECT rowid
FROM comparison
GROUP BY STK_CODE
HAVING (COUNT(*) = 1 OR MIN(CASE WHEN min > 0 THEN min END))
)
This query uses rowid as a bare column and relies on a documented SQLite feature: when a query uses the MIN() or MAX() aggregate function, the bare columns take their values from the row that contains the minimum or maximum value.
See a simplified demo.
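To see the bare-column behaviour in isolation, here is a tiny sketch with invented sample data (the column names only mirror the question):
CREATE TABLE comparison (stk_code TEXT, "min" INTEGER);
INSERT INTO comparison VALUES ('A', 3), ('A', 1), ('B', 5);
-- rowid is a bare column, so each group returns the rowid of the row
-- holding the smallest "min" value: here the rows ('A', 1) and ('B', 5)
SELECT rowid, stk_code, MIN("min") FROM comparison GROUP BY stk_code;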
Is there any way to tell whether the zone map is used by a specific query?
Is there a way to list the blocks a query read?
My query is taking more time than expected; I just want to make sure the query is using the zone map to filter out blocks.
The table stl_scan contains this information.
is_rrscan indicates whether the scan used range restriction (zone maps).
rows_pre_user_filter is the row count before zone map restrictions.
rows_pre_filter is the row count after zone map restrictions.
rows is the row count after all predicates were evaluated.
SELECT query, segment
, tbl, perm_table_name
, is_rrscan
, SUM( rows_pre_user_filter ) rows_on_table
, SUM( rows_pre_filter ) rows_scanned
, SUM( rows ) rows_returned
FROM stl_scan
WHERE query = 999999
GROUP BY 1,2,3,4,5
ORDER BY 1,2,3,4,5
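If you don't know the query id yet, you can look it up first; a sketch (the ILIKE filter is just a placeholder for however you identify your statement):
SELECT query, starttime, TRIM(querytxt) AS sql_text
FROM stl_query
WHERE querytxt ILIKE '%your_table_name%'
ORDER BY starttime DESC
LIMIT 20;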
I'm using the histogram() function https://prestodb.github.io/docs/current/functions/aggregate.html
It "Returns a map containing the count of the number of times each input value occurs."
The result may look something like this:
{ORANGES=1, APPLES=165, BANANAS=1}
Is there a function that will return APPLES given the above input?
XY Problem?
The astute reader may notice the end-result of histogram() combined with what I'm trying to do, would be equivalent to the mythical Mode Function, which exists in textbooks but not in real-world database engines.
Here's my complete query at this point. I'm looking for the most frequently occurring value of upper(cmplx) for each upper(address),zip tuple:
select * from (select upper(address) as address, zip,
(SELECT max_by(key, value)
FROM unnest(histogram(upper(cmplx))) as t(key, value)),
count(*) as N
from apartments
group by upper(address), zip) t1
where N > 3
order by N desc;
And the error...
SYNTAX_ERROR: line 2:55: Constant expression cannot contain column references
Here's what I use to get the key that corresponds to the max value from an arbitrary map:
MAP_KEYS(mapname)[
ARRAY_POSITION(
MAP_VALUES(mapname),
ARRAY_MAX(MAP_VALUES(mapname))
)
]
Substitute your histogram map for 'mapname'.
Not sure how this solution compares computationally to the other answer, but I do find it easier to read.
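For example, against the same sample table the other answer uses (a sketch, assuming the tpch connector is available):
SELECT
  map_keys(h)[ array_position(map_values(h), array_max(map_values(h))) ] AS most_frequent_clerk
FROM (
  SELECT histogram(clerk) AS h
  FROM tpch.tiny.orders
);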
You can convert the map you get from histogram to an array with map_entries. Then you can UNNEST that array into a relation and call max_by on it. Please see the example below:
SELECT max_by(key, value) FROM (
SELECT map_entries(histogram(clerk)) as entries from tpch.tiny.orders
)
CROSS JOIN UNNEST (entries) t(key, value);
EDIT:
As noted by @Alex R, you can also pass the histogram result directly to UNNEST:
SELECT max_by(key, value) FROM (
SELECT histogram(clerk) as histogram from tpch.tiny.orders )
CROSS JOIN UNNEST (histogram) t(key, value);
In your question, the part (SELECT max_by(key, value) FROM unnest(histogram(upper(cmplx))) as t(key, value)) is a correlated subquery, which is not yet supported. However, the error you are seeing is misleading. IIRC, Athena uses Presto 0.172, and this error reporting was fixed in 0.183 (see https://docs.starburstdata.com/latest/release/release-0.183.html - that was in July 2017; btw, map_entries was also added in 0.183).
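Putting that together with the grouping from the question, one way to sidestep the correlated subquery is to build the histogram per group first and UNNEST it afterwards. A sketch, untested against the apartments table from the question:
SELECT address, zip, max_by(key, value) AS most_common_cmplx, N
FROM (
  SELECT upper(address) AS address, zip,
         histogram(upper(cmplx)) AS h,
         count(*) AS N
  FROM apartments
  GROUP BY upper(address), zip
)
CROSS JOIN UNNEST (h) AS t(key, value)
WHERE N > 3
GROUP BY address, zip, N
ORDER BY N DESC;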
I have 2 columns: ID CODE and value.
The Remove Duplicates function removes the entries with the higher value and leaves the lower one. Is there any way to remove the lower ones instead? The result I expect looks like this.
I've tried the Buffer Table function before, but it doesn't work. It seems Buffer Table only works with date-related data (newest/latest).
You could use SUMMARIZE, which works much like a SQL query that takes the MIN value of a column grouped by some other column.
In the example below, MIN([value]) is taken and given the new column name "MinValue", grouped by IDCode. This returns the minimum value for each IDCode.
NewCalculatedTable =
SUMMARIZE(yourTablename, yourTablename[IDCode], "MinValue", MIN(yourTablename[value]) )
Alternatively, if you want the higher values, just replace the MIN function with MAX.
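For reference, the MAX variant would look like this (same assumed table and column names):
NewCalculatedTable =
SUMMARIZE(yourTablename, yourTablename[IDCode], "MaxValue", MAX(yourTablename[value]) )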
I have a table with a boolean field, IsNew, that indicates whether or not the corresponding entity is new. I want to periodically query for all entities in a particular state. What are the implications of having an index on a boolean (or enum) column? Will it create a hotspot? Are there any limitations on QPS?
A secondary index is implemented internally as a table that has a primary key based on the declared secondary index key, plus whatever indexed table keys weren't mentioned in the secondary index explicitly. So, say you have a table like this:
CREATE TABLE UserThings (
UserId INT64 NOT NULL,
ThingId INT64 NOT NULL,
...
IsNew BOOL NOT NULL,
...
) PRIMARY KEY(UserId, ThingId), ...
And you create an index like this:
CREATE INDEX UserThingsByIsNew ON UserThings(IsNew, ThingId)
That'll create an internal table that looks something like this:
CREATE TABLE UserThingsByIsNew_Index (
IsNew BOOL,
ThingId INT64 NOT NULL,
UserId INT64 NOT NULL
) PRIMARY KEY(IsNew, ThingId, UserId), ...
So, when you update rows of UserThings to change the value of the IsNew column, Cloud Spanner will delete the old row in UserThingsByIsNew_Index and insert a replacement row. This will tend to create a lot of churn in the index if the IsNew value of rows changes at a high frequency. That might not be a problem at all, but you will only really know by testing your scenario under a real-world workload for a sustained period.
If you don't update the IsNew field of entities too frequently, then you probably won't have any hot-spotting problems. That's why I mentioned earlier that Cloud Spanner also appends the original table keys to the keys of the index: assuming that your original table rows are well-distributed by the table's keys, then the portion of the index for IsNew=true and IsNew=false, respectively, will have a similar distribution, and shouldn't cause a hotspot.
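As a usage sketch (table and index names follow the example above; the FORCE_INDEX hint is optional but makes the intent explicit), a periodic scan for new entities would read the IsNew=true portion of the index:
SELECT UserId, ThingId
FROM UserThings@{FORCE_INDEX=UserThingsByIsNew}
WHERE IsNew = TRUE;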