Sorting in Redshift table - amazon-web-services

I created a test table with two columns, id and name, and ran the following commands.
CREATE TABLE "public"."test1"(id integer, name character varying(256) encode lzo)distkey(id) compound sortkey(id);
INSERT INTO test1 VALUES (1,'First'),(2,'Second'),(4,'Fourth'),(3,'Third');
VACUUM test1;
However, on running SELECT * FROM test1; I receive the following data:
Shouldn't the returned data be sorted according to id? If not, how can I make sure that a SELECT query without an ORDER BY clause returns the data sorted according to the sort key, id?

You have to use ORDER BY. From the docs:
When a query doesn't contain an ORDER BY clause, the system returns result sets with no predictable ordering of the rows. The same query run twice might return the result set in a different order.
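If you need the rows to come back in key order, say so explicitly, for example:
SELECT * FROM test1 ORDER BY id;
The sort key controls how the rows are stored on disk; it does not guarantee the order of a result set.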


AWS Athena partition projection

I seemingly cannot get Athena partition projection to work.
When I add partitions the "old fashioned" way and then run MSCK REPAIR TABLE testparts; I can query the data.
When I drop the table and recreate it with the partition projection properties below, it fails to query any data at all. The queries that I do get to run take a very long time with no results, or they time out like the query below.
For the sake of argument, I followed the AWS documentation. When I run:
select distinct year from testparts;
I get:
HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'mydb.testparts' can potentially read more than 1000000 partitions.
I have ~7500 files in there at the moment in the file structures indicated in the table setups below.
I have tried entering the date parts as a date type, providing the format "yyyy-MM-dd", and it still did not work (including deleting and changing my S3 structures as well). I then tried splitting the dates into separate folders and setting them as integers (which you see below), and it still did not work.
Given that I can get it to operate "manually" after repairing the table and then successfully query my structures, I must be doing something wrong at a fundamental level with partition projection.
I have also changed user from the injected type to an enum (not ideal given it's a plain old string, but I did it for the purpose of testing).
Table creation:
CREATE EXTERNAL TABLE `testparts`(
`thisdata` array<struct<thistype:string,amount:float,identifiers:map<string,struct<id:string,type:string>>,selections:map<int,array<int>>>> COMMENT 'from deserializer')
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`user` string,
`thisid` int,
`account` int)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://testoutputs/partitiontests/responses'
TBLPROPERTIES (
'classification'='json',
'projection.day.digits'='2',
'projection.day.range'='1,31',
'projection.day.type'='integer',
'projection.enabled'='true',
'projection.account.range'='1,50',
'projection.account.type'='integer',
'projection.month.digits'='2',
'projection.month.range'='1,12',
'projection.month.type'='integer',
'projection.thisid.range'='1,15',
'projection.thisid.type'='integer',
'projection.user.type'='enum',
'projection.user.values'='usera,userb',
'projection.year.range'='2017,2027',
'projection.year.type'='integer',
'storage.location.template'='s3://testoutputs/partitiontests/responses/year=${year}/month=${month}/day=${day}/user=${user}/thisid=${thisid}/account=${account}/',
'transient_lastDdlTime'='1653445805')
If you run a query like SELECT * FROM testparts, Athena will generate all permutations of possible values for the partition keys and list the corresponding locations on S3. For your table this means more than six million listings (11 years × 12 months × 31 days × 2 users × 15 thisid values × 50 accounts = 6,138,000).
I don't believe there is any optimization for SELECT DISTINCT year FROM testparts that would skip building the list of partition key values, so something similar would happen with that query too. Similarly, if you use "Preview table" to run SELECT * FROM testparts LIMIT 10, there is no optimization that skips building the list of partitions or skips listing the locations on S3.
Try running a query that doesn't wildcard any of the partition keys to validate that your config is correct.
Partition projection works differently from adding partitions to the catalog, and some care needs to be taken with wildcards. When partitions are in the catalog non-existent partitions can be eliminated cheaply, but with partition projection S3 has to be listed for every permutation of partition keys after predicates have been applied.
Partition projection works best when there are never wildcards on partition keys, to minimize the number of S3 listings that need to happen.
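For example, a query that pins every partition key to a single value only has to list one S3 prefix (the literal values here are just placeholders):
SELECT *
FROM testparts
WHERE year = 2022
  AND month = 5
  AND day = 25
  AND "user" = 'usera'
  AND thisid = 1
  AND account = 1;
If a query like this returns data quickly, the projection properties and the storage.location.template are configured correctly, and the slowness you saw comes purely from the number of permutations being listed.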

DynamoDB Query distinct attribute values

I'm trying to query DynamoDB and get a result similar to select distinct(address) from ... in SQL.
I know DynamoDB is a document-oriented DB and maybe I need to change the data structure.
I'm trying to avoid getting all the data first and filtering later.
My data looks like this:
Attribute   Datatype
ID          String
Var1        Map
VarN        Map
Address     String
So I want to get the distinct addresses in the entire table.
What's the best way to do it?
Unfortunately, no. You'll need to Scan the entire table (you can use the ProjectionExpression or AttributesToGet options to ask for just the "Address" attribute, but you'll still pay for scanning the entire contents of the table).
If you need to do this scan often, you can add a secondary index which projects only the keys and the "Address" attribute, to make it cheaper to scan. Unfortunately, using a GSI whose partition key is "Address" does not give you the ability to eliminate duplicates: each partition will still contain a list of duplicate items, and there is no way to list just the distinct partition keys of an index. Scanning the index will return the same partition key multiple times, as many times as there are items in that partition.

Fastest way to select a lot of rows based on their ID in PostgreSQL?

I am using postgres with libpqxx, and I have a table that we will simplify down to
CREATE TABLE data_table (
  id   bytea PRIMARY KEY,
  size bigint
);
If I have a set of IDs in C++, e.g. std::unordered_set<ObjectId> Ids, what is the best way to get the id and size values out of data_table?
I have so far used a prepared statement:
constexpr const char* preparedStatement = "SELECT size FROM data_table WHERE id = $1";
Then, in a transaction, I call that prepared statement for every entry in the set and retrieve the result for each one:
pqxx::work transaction(SomeExistingPqxxConnection);
std::unordered_map<ObjectId, uint32_t> result;
for (const auto& id : Ids)
{
auto transactionResult = transaction.exec_prepared(preparedStatement, ToPqxxBinaryString(id));
result.emplace(id, transactionResult[0][0].as<uint32_t>());
}
return result;
Because the set can contain tens of thousands of objects, and the table can contain millions, this can take quite some time to process, and I don't think it is a particularly efficient use of Postgres.
I am pretty much brand new to SQL, so I don't really know if what I am doing is the right way to go about this, or if there is a much more efficient way.
Edit: For what it's worth, the ObjectId class is basically a type wrapper over std::array<uint8_t, 32>, i.e. a 256-bit cryptographic hash.
The task as I understand it:
Get id (PK) and size (bigint) for "tens of thousands of objects" from a table with millions of rows and presumably several more columns ("simplified down").
The fastest way of retrieval is index-only scans. The cheapest way to get that in your particular case would be a "covering index" for your query by "including" the size column in the PK index like this (requires Postgres 11 or later):
CREATE TEMP TABLE data_table (
id bytea
, size bigint
, PRIMARY KEY (id) INCLUDE (size) -- !
);
About covering indexes:
Do covering indexes in PostgreSQL help JOIN columns?
Then retrieve all rows in a single query (or a few queries) for many IDs at once, like:
SELECT id, size
FROM data_table
JOIN (
VALUES ('id1'::bytea), ('id2') -- many more
) t(id) USING (id);
Or one of the other methods laid out here:
Query table by indexes from integer array
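One of those methods, for instance, passes all IDs as a single array parameter and filters with = ANY (a sketch; the $1 placeholder and the bytea[] cast reflect this question's id type):
SELECT id, size
FROM data_table
WHERE id = ANY ($1::bytea[]);
That keeps it to a single round trip from libpqxx instead of tens of thousands of single-row queries.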
Or create a temporary table and join to it.
But do not "insert all those IDs one by one into it". Use the much faster COPY (or the meta-command \copy in psql) to fill the temp table. See:
How to update selected rows with values from a CSV file in Postgres?
And you do not need an index on the temporary table, as it will be read in a sequential scan anyway. You only need the covering PK index I outlined.
You may want to ANALYZE the temporary table after filling it, to give Postgres some column statistics to work with. But as long as you get the index-only scans I am aiming for, you can skip that, too. The query plan won't get any better than that.
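Put together, the temp-table route might look like this sketch (tmp_ids and the COPY source are made-up names):
CREATE TEMP TABLE tmp_ids (id bytea);
COPY tmp_ids (id) FROM STDIN;  -- stream the IDs from the client, or \copy from a file in psql
ANALYZE tmp_ids;               -- optional, as noted above
SELECT d.id, d.size
FROM data_table d
JOIN tmp_ids t USING (id);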
The id is a primary key and so is indexed, so my first concern would be query setup time. A stored procedure is precompiled, for instance. A second tack is to put your set in a temp table, possibly also keyed on the id, so the two tables/indexes can be joined in one SELECT. The indexes for this should be ordered (a tree rather than a hash) so they can be merged.

Kettle PDI how to define parameters before Table input

I'm converting data from one database to another with a slightly different structure.
In my flow at some point I need to read data from the first database filtering on the id coming from previous steps.
This is the image of my flow
The last step is where I need to filter data. The query is:
SELECT e.*,UNIX_TIMESTAMP(v.dataInserimento)*1000 as timestamp
FROM verbale_evento ve JOIN evento e ON ve.eventi_id=e.id
WHERE ve.Verbale_id=? AND e.titolo='Note verbale'
Unfortunately ve.Verbale_id is a column of the first table (first step). How can I set up the step to filter by that field?
Right now I get this error:
2017/12/22 15:01:00 - Error setting value #2 [Boolean] on prepared statement
2017/12/22 15:01:00 - Parameter index out of range (2 > number of parameters, which is 1).
I need to do this query at the end of the entire transformation.
You can pass previous rows of data as parameters.
However, the number of parameter placeholders in the Table input query must match the number of fields of the incoming data stream. Also, order matters.
Try trimming the data stream to only the field you want to pass using a Select Values step, and then choose that step in the "get data from" box near the bottom of the Table input step. Also, check the "execute for each input row" option.

Fastest way to select several inserted rows

I have a table in a database which stores items. Each item has a unique ID, which the DB generates upon insertion (auto-increment).
A user may perform a specific task that will add X items to the database, however my program (C++ server application using MySQL connector) should return the IDs that the database generated right away. For example, if I add 6 items, the server must return 6 new unique IDs to the client.
What is the fastest/cleanest way to do such a thing? So far I have been doing an INSERT followed by a SELECT for each new item, or an INSERT followed by last_insert_id; however, if there are 50 items to add it takes at least a few seconds, which is not good at all for the user experience.
sql_task.query("INSERT INTO `ItemDB` (`ItemName`, `Type`, `Time`) VALUES ('%s', '%d', '%d')", strName.c_str(), uiType, uiTime);
Getting the ID:
uint64_t item_id { sql_task.last_id() }; //This calls mysql_insert_id
I believe you need to rethink your design slightly. Let's use the analogy of a sales order. With a sales order (or invoice #) the user gets an invoice number (auto_increment) as well as multiple line item numbers (also auto_increment).
The sales order and all of its line items are selected for insert (from the GUI) and the inserts are performed. First, the sales order row is inserted and its id is saved in a variable for the subsequent calls that insert the line items. The line items are then just inserted, without an immediate return of their auto_increment id values. The application is merely returned the sales order number in the end. How your app uses that sales order number in subsequent calls is up to you, but it does not need to retrieve all X (or 50) rows immediately, as it has the sales order number saved somewhere. Let's call that sales order number XYZ.
When you actually need the information, an example call could look like:
select lineItemId
from lineItems
where salesOrderNumber=XYZ
order by lineItemId
You need to remember that in a multi-user system there is no guarantee of receiving a contiguous block of numbers. Nor should it matter to you, as they are all attached to the correct sales order number.
Again, the above is just an analogy, used for illustration purposes.
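In MySQL terms the pattern might look roughly like this (salesOrders, createdAt and itemName are made-up names; lineItems, lineItemId and salesOrderNumber come from the query above):
INSERT INTO salesOrders (createdAt) VALUES (NOW());
SET @orderId = LAST_INSERT_ID();   -- one id to keep, instead of one per line item
INSERT INTO lineItems (salesOrderNumber, itemName)
VALUES (@orderId, 'item A'), (@orderId, 'item B'), (@orderId, 'item C');
-- later, only when the line item ids are actually needed:
SELECT lineItemId
FROM lineItems
WHERE salesOrderNumber = @orderId
ORDER BY lineItemId;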
That's a common but hard-to-solve problem. I'm unsure about MySQL, but PostgreSQL uses sequences to generate automatic ids. Inserting frameworks (object-relational mappers) use that when they expect to insert many values: they query the sequence directly for a bunch of IDs and then insert the new rows using those already-known IDs. That way, there is no need for an additional query after each insert to get the ID.
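A minimal sketch of that idea in PostgreSQL (the sequence name items_id_seq is an assumption):
SELECT nextval('items_id_seq') FROM generate_series(1, 6);  -- reserve 6 ids up front
The client keeps those six values and inserts the new rows with the id column set explicitly, so no follow-up query per row is needed.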
The downside is that the relationship between ID and insertion time can be non-monotonic when different writers interleave their inserts. It is not a problem for the database, but some (poorly written?) program could expect it to be.
As your ID is auto-incremented, you can do just two SELECT queries, before and after the INSERT queries:
SELECT AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'dbTable' AND table_schema = DATABASE();
--
-- INSERT INTO dbTable... (one or many, does not matter);
--
SELECT LAST_INSERT_ID() AS lastID;
This will give you the range between the first and last inserted IDs. Then you can easily calculate how many there are.