Pentaho PDI get SQL SUM() with conditions - kettle

I'm using Pentaho PDI 7.1. I'm trying to migrate data from one MySQL database to another, changing the structure of the data along the way.
I'm reading the source table (customers) and for each row I have to run another query to calculate the balance.
I was trying to use the Database value lookup step to accomplish this, but maybe it is not the best way.
I have to run a query like this to get the balance:
SELECT
SUM(
CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END
)
FROM Movimento WHERE contoFidelizzato_id = ?
I need to set the parameter from a field of the previous step. Any advice?

The Database value lookup step may be a good idea, especially if you are used to database reasoning, but it may result in many queries, which may not be the most efficient approach.
A more PDI-ish style would be to write a single grouped query like:
SELECT contoFidelizzato_id
, SUM(CASE WHEN direzione='ENTRATA' THEN -importo ELSE +importo END)
FROM Movimento
GROUP BY contoFidelizzato_id
and use it as the info source of a Stream lookup step.
An even more PDI-ish style would be to split the source table (customer) into two flows: one in which you keep the source rows, and one that you group by contoFidelizzato_id. Of course, you need a formula, a JavaScript step, or an expression in the SQL of the Table input to change the sign when needed.
Test to find out which strategy is better in your case. You'll soon discover that PDI is very good at handling large data.
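The grouped-query idea can be tried outside PDI. The sketch below is a minimal illustration using Python's sqlite3 module with an in-memory database instead of MySQL. The sample rows, the 'USCITA' direction value, and the default balance of 0.0 for customers without movements are assumptions made for the example, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movimento (contoFidelizzato_id INTEGER, direzione TEXT, importo REAL);
    INSERT INTO Movimento VALUES
        (1, 'ENTRATA', 10.0),
        (1, 'USCITA', 4.0),
        (2, 'USCITA', 7.5);
""")

# One grouped query for all accounts instead of one query per customer row.
balances = dict(conn.execute("""
    SELECT contoFidelizzato_id,
           SUM(CASE WHEN direzione = 'ENTRATA' THEN -importo ELSE importo END)
    FROM Movimento
    GROUP BY contoFidelizzato_id
"""))

# Stand-in for the customer stream: look each id up in the grouped result.
customers = [1, 2, 3]
rows = [(cid, balances.get(cid, 0.0)) for cid in customers]
print(rows)
```

The dictionary plays the role of the lookup stream: it is built once and then consulted for every customer row, which is what avoids the per-row query.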

Insert many items from list into SQLite

I have a list with a lot of data (close to 1000 items). I want to add it all to the database in one go. Is this as straightforward as a for loop over the list with multiple inserts? Multiple commits? Is this bad practice? Thanks.
I haven't tried it yet because I'm still setting up the table columns, of which there are many, so I need to know whether this is feasible.
If you're using SQL to insert:
INSERT INTO 'tablename' ('column1', 'column2') VALUES
('data1', 'data2'),
('data1', 'data2'),
('data1', 'data2'),
('data1', 'data2');
If you're using code, generate the query above in a for loop, then run it.
An alternative, which also works on SQLite versions older than 3.7.11 (where multi-row VALUES is not available), is a union, as shown in: Is it possible to insert multiple rows at a time in an SQLite database?
insert into 'tablename' ('column1','column2')
select data1 as 'column1',data2 as 'column2'
union select data3,data4
union...
In SQLite you don't have network latency, so performance-wise it does not really matter if you issue many small requests to the engine. For more on that, see this page from the official documentation: https://www.sqlite.org/np1queryprob.html
But in write mode (insert or update), each individual query has to pay the cost of an implicit transaction. To avoid that, group your insert queries in an explicit transaction. How you do that depends on your programming language. Here is a code sample showing how to do it in Go. I've simplified the error handling to keep the gist visible.
tx, _ := db.Begin()
for _, item := range items {
tx.Exec(`INSERT INTO testtable (col1, col2) VALUES (?, ?)`, item.Field1, item.Field2)
}
tx.Commit()
If you detect an error in your loop, call tx.Rollback() instead of tx.Commit() to cancel all previous writes to your database, so that the final state is as if no insert query had been issued at all.
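The same pattern can be sketched in Python with the standard sqlite3 module. This is an illustration, not a translation of the Go code: insert_all is a hypothetical helper, the table layout is made up for the example, and the NOT NULL constraint exists only to provoke a failure.

```python
import sqlite3

def insert_all(conn, items):
    """Insert every (col1, col2) pair in one transaction."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.executemany(
            "INSERT INTO testtable (col1, col2) VALUES (?, ?)", items
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testtable (col1 TEXT NOT NULL, col2 TEXT)")

insert_all(conn, [("a%d" % i, "b%d" % i) for i in range(1000)])
count = conn.execute("SELECT COUNT(*) FROM testtable").fetchone()[0]

# A batch that fails halfway leaves the table unchanged.
try:
    insert_all(conn, [("ok", "x"), (None, "violates NOT NULL")])
except sqlite3.IntegrityError:
    pass
count_after_failure = conn.execute("SELECT COUNT(*) FROM testtable").fetchone()[0]
```

All 1000 rows are written under a single commit, and the failed second batch is rolled back as a whole, including its first, valid row.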

What is the best way to do intense read only queries in Django

We have a really big application in Django which uses a Postgres database. We want to build an analytics module.
This module uses a base query e.g.
someFoo = SomeFoo.objects.all() # Around 100000 objects returned.
Then we slice and dice this data, i.e.
someFoo.filter(Q(creator=owner) | Q(moderated=False))
These queries will be very intense, and since this will be an analytics and reporting dashboard, the queries will hit the database very hard.
What is the best way to handle complex queries in such conditions? That is, when you have a base query that is sliced and diced very often in a short span of time and then never used again.
A few possible solutions that we have thought of are:
A read-only database and a write-only database.
Writing raw SQL queries and using them, as the Django ORM can be quite inefficient for certain types of queries.
Caching heavily (we have not thought about this or done any research on it).
Edit : E.g. query
select sport."sportName", sport.id, pop.name, analytics_query.loc_id, "new count"
from "SomeFoo_sportpop" as sportpop join "SomeFoo_pop" as pop on (sportpop.pop_id=pop.id) join "SomeFoo_sport" as sport on (sportpop.sport_id=sport.id) join
(select ref.catcher_pop_id as loc_id,
(select count(*) from "SomeFoo_pref" where catcher_pop_id=ref.catcher_pop_id and status='pending' and exists=True) as "new count"
from "SomeFoo_pref" as ref
where ref.exists=TRUE and ref.catcher_pop_id is not NULL
group by ref.catcher_pop_id) as analytics_query on (sportpop.pop_id=analytics_query.loc_id)
order by sport."sportName", pop.name asc
This is an example of a raw SQL query we are planning to make, and it's going to have a lot of WHERE and GROUP BY clauses. Basically we are going to slice and dice the base query a lot.
Is there any other possible solution or method that you can point us to? Any help is highly appreciated.
I can think of prepared statements and a faster server, maybe on Linux...

Cassandra NOT EQUAL Operator

Question to all Cassandra experts out there.
I have a column family with about a million records.
I would like to query these records in such a way that I should be able to perform a Not-Equal-To kind of operation.
I Googled on this and it seems I have to use some sort of Map-Reduce.
Can somebody tell me what are the options available in this regard.
I can suggest a few approaches.
1) If you have a limited number of values that you would like to test for not-equality, consider modeling those as boolean columns (e.g. a column isEqualToUnitedStates with true or false).
2) Otherwise, consider emulating the unsupported query != X by combining results of two separate queries, < X and > X on the client-side.
3) If your schema cannot support either type of query above, you may have to resort to writing custom routines that will do client-side filtering and construct the not-equal set dynamically. This will work if you can first narrow down your search space to manageable proportions, such that it's relatively cheap to run the query without the not-equal.
So let's say you're interested in all purchases of a particular customer of every product type except Widget. An ideal query could look something like SELECT * FROM purchases WHERE customer = 'Bob' AND item != 'Widget'; Now of course, you cannot run this, but in this case you should be able to run SELECT * FROM purchases WHERE customer = 'Bob' without wasting too many resources and filter item != 'Widget' in the client application.
4) Finally, if there is no way to restrict the data in a meaningful way before doing the scan (querying without the equality check would return too many rows to handle comfortably), you may have to resort to MapReduce. This means running a distributed job that scans all rows in the table across the cluster. Such jobs will obviously run a lot slower than native queries, and are quite complex to set up. If you want to go this way, please look into Cassandra Hadoop integration.
If you want to emulate a not-equals operator on a specific partition key and get all the other data from the table, you can use a combination of range queries and the TOKEN function from CQL.
For example, to fetch all rows except the ones whose partition key is 'abc', execute the two queries below:
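Approach 3 can be sketched in a few lines of Python. The list of dictionaries stands in for whatever the driver returns for SELECT * FROM purchases WHERE customer = 'Bob'; the sample rows and the helper name exclude_item are made up for the illustration, and no Cassandra driver call is shown.

```python
def exclude_item(rows, item):
    """Return a lazy iterator over the rows whose 'item' differs from the given value."""
    return (row for row in rows if row["item"] != item)

# Stand-in for the result of the narrowed query on customer = 'Bob'.
bob_purchases = [
    {"customer": "Bob", "item": "Widget"},
    {"customer": "Bob", "item": "Gadget"},
    {"customer": "Bob", "item": "Gizmo"},
]

# The not-equal condition is applied on the client, after the cheap query.
non_widgets = [row["item"] for row in exclude_item(bob_purchases, "Widget")]
```

Because the filter is lazy, it can be applied to a paged result set without first loading every row into memory.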
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) < TOKEN('abc');
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) > TOKEN('abc');
But beware that the result is going to be huge (depending on the size of the table and the fields you need), so you might want to use this in conjunction with a utility such as dsbulk. Also note that there is no guarantee of ordering in your result. This is essentially a data dump, which will most probably be useful for scenarios such as a one-time data migration.

Getting generated auto-increment ID without second query (MySQL)

I have been searching for a while on how to get the generated auto-increment ID from an "INSERT INTO ... (...) VALUES (...)". Even on Stack Overflow, I only find the answer of using a "SELECT LAST_INSERT_ID()" in a subsequent query. I find this solution unsatisfactory for a number of reasons:
1) This will effectively double the queries sent to the database, especially since it is mostly handling inserts.
2) What will happen if more than one thread accesses the database at the same time? What if more than one application accesses the database at the same time? It seems to me the values are bound to become erroneous.
It's hard for me to believe that the MySQL C++ Connector wouldn't offer the feature that the Java Connector as well as the PHP Connector offer.
An example taken from http://forums.mysql.com/read.php?167,294960,295250
sql::Statement* stmt = conn->createStatement();
sql::ResultSet* res = stmt->executeQuery("SELECT @@identity AS id");
res->next();
my_ulong retVal = res->getInt64("id");
In a nutshell, if your ID column is an auto_increment column, you can also use
SELECT @@identity AS id
EDIT:
Not sure what you mean by second query/round trip. At first I thought you were looking for a different way to get the ID of the last inserted row, but it looks like you are more interested in knowing whether you can save the round trip?
If that's the case, then I completely agree with @WhozCraig; you can put both your queries in a single statement like insert into tab values ...; select last_insert_id() which will be a single call
OR
you can have stored procedure like below to do the same and save the round trip
DELIMITER //
CREATE PROCEDURE myproc()
BEGIN
  INSERT INTO mytab VALUES ...;
  SELECT LAST_INSERT_ID();
END //
DELIMITER ;
Let me know if this is not what you are trying to achieve.
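Many client libraries hand back the generated key together with the result of the INSERT itself, so no second statement is needed; the MySQL C API, for instance, has mysql_insert_id(). The sketch below shows the idea with Python's DB-API attribute lastrowid. It uses the standard sqlite3 module and an in-memory table so that it runs without a server, which is an assumption made for the illustration; it does not show the MySQL C++ Connector.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytab (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)

# The cursor returned by the INSERT already carries the generated key.
cur = conn.execute("INSERT INTO mytab (name) VALUES (?)", ("first",))
first_id = cur.lastrowid   # no extra SELECT is issued

cur = conn.execute("INSERT INTO mytab (name) VALUES (?)", ("second",))
second_id = cur.lastrowid
conn.commit()
```

On the concurrency concern: in MySQL, LAST_INSERT_ID() is tracked per connection, so inserts made by other connections do not change the value a client sees, provided each thread uses its own connection.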

QSqlQuery using with indexes

I have my own data store mechanism for storing data, but I want to implement a standard data manipulation and query interface for end users, so I thought Qt SQL would be suitable for my case.
But I still cannot understand how to involve my indexes in an SQL query.
For example:
I have a table with columns A(int), B(int), C(int), D(int), and column A is indexed. Assume I execute a query like select * from Foo where A = 10;
How do I involve my index in searching for the results?
You have written your own storage system and want to manipulate it using an SQL-like syntax? I don't think Qt SQL is the right tool for that job. It offers connectivity to various SQL servers and is not meant for parsing SQL statements. Qt expects to "pass through" the queries and then parse the result set and transform it into a Qt-friendly representation.
So if you only want to have a Qt-friendly representation, I don't see a reason to go through the indirection of SQL.
But regarding your problem:
In SQL, indexes are usually not stated in the queries, but declared when the table schema is created. SQL Server, however, lets you "hint" indexes in a query; is that what you are looking for?
SELECT column_list FROM table_name WITH (INDEX (index_name) [, ...]);
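The point that the index lives in the schema and not in the query can be seen with any SQL engine. The sketch below uses Python's sqlite3 module, not Qt or SQL Server, purely as an illustration; the index name idx_foo_a and the sample data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foo (A INTEGER, B INTEGER, C INTEGER, D INTEGER)")
# The index is declared once, as part of the schema.
conn.execute("CREATE INDEX idx_foo_a ON Foo (A)")
conn.executemany(
    "INSERT INTO Foo VALUES (?, ?, ?, ?)",
    [(i, i * 2, i * 3, i * 4) for i in range(100)],
)

# The query itself never names the index ...
rows = conn.execute("SELECT * FROM Foo WHERE A = 10").fetchall()

# ... yet the query planner reports that it uses it.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM Foo WHERE A = 10").fetchall()
plan_detail = " ".join(row[-1] for row in plan)
print(plan_detail)
```

Choosing the index is the job of the engine's query planner, which is the part a custom storage system would have to provide itself if it wanted to accept SQL.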