I am configuring DAS 3.1.0 + APIM 2.0.0 with Oracle database 11g (relational database).
When I enable DAS analytics statistics to integrate with API Manager, almost everything works fine, except that DAS dramatically raises the CPU consumption of the machine hosting the database.
I noticed that it always runs this query:
MERGE INTO API_REQ_USER_BROW_SUMMARY dest
USING (SELECT :1 api, :2 version, :3 apiPublisher, :4 tenantDomain,
              :5 total_request_count, :6 year, :7 month, :8 day,
              :9 requestTime, :10 os, :11 browser
       FROM dual) src
ON (dest.api = src.api AND dest.version = src.version AND
    dest.apiPublisher = src.apiPublisher AND dest.year = src.year AND
    dest.month = src.month AND dest.day = src.day AND dest.os = src.os AND
    dest.browser = src.browser AND dest.tenantDomain = src.tenantDomain)
WHEN NOT MATCHED THEN
  INSERT (api, version, apiPublisher, tenantDomain, total_request_count,
          year, month, day, requestTime, os, browser)
  VALUES (src.api, src.version, src.apiPublisher, src.tenantDomain,
          src.total_request_count, src.year, src.month, src.day,
          src.requestTime, src.os, src.browser)
WHEN MATCHED THEN
  UPDATE SET dest.total_request_count = src.total_request_count,
             dest.requestTime = src.requestTime
I would like to know if there is a way to optimize this so that the CPU of the machine hosting the database is not hit so hard that it causes a performance drop.
Has anyone run into this problem before and could help me?
What happens in the above query is: records are inserted into the database if no rows with the same primary key values exist; if matching rows do exist, the existing rows are updated instead.
The table "API_REQ_USER_BROW_SUMMARY" has two columns, "OS" and "browser", which are part of that table's primary key. It has been observed that when NULL values are inserted into "OS" and "browser", the analytics server and the database hang.
What you can do is the following (you might need to shut down the analytics server and restart the DB server before these steps):
Go to {Analytics_server}/repository/deployment/server/carbonapps and open org_wso2_carbon_analytics_apim-1.0.0.car as a zip file.
Go to the folder APIM_USER_AGENT_STATS_1.0.0.
Open APIM_USER_AGENT_STATS.xml.
At the end of the script (before the closing tag), you will see a Spark SQL query like the one below.
INSERT INTO TABLE APIUserBrowserData SELECT api,version,apiPublisher,tenantDomain,total_request_count,year,month,day,requestTime,os,browser FROM API_REQUEST_USER_BROWSER_SUMMARY_FINAL;
Replace that line with the following.
INSERT INTO TABLE APIUserBrowserData SELECT api,version,apiPublisher,tenantDomain,total_request_count,year,month,day,requestTime, if(os is null, "UNKNOWN",os), if(browser is null, "UNKNOWN", browser) FROM API_REQUEST_USER_BROWSER_SUMMARY_FINAL;
This will prevent Spark from inserting NULL values into the "OS" and "browser" columns of the table "API_REQ_USER_BROW_SUMMARY".
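To see why NULL key values cause trouble here, a minimal Oracle sketch (using a hypothetical pk_demo table, not part of the product schema): NULL never satisfies the equality tests in the MERGE ON clause, and the primary key columns reject NULLs on the INSERT branch anyway.
-- hypothetical table for illustration only
CREATE TABLE pk_demo (os VARCHAR2(32), browser VARCHAR2(32), PRIMARY KEY (os, browser));
-- the equality test used in the MERGE ON clause is never TRUE when either side is NULL
SELECT COUNT(*) FROM dual WHERE NULL = NULL;  -- 0 rows: NULL = NULL does not match
-- and the NOT MATCHED insert branch fails outright for NULL key values
INSERT INTO pk_demo (os, browser) VALUES (NULL, 'Chrome');  -- ORA-01400: cannot insert NULL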
Please check whether the CPU consumption is still high after making the above changes.
Edit: #artCampos, I cannot comment, so I am editing my original answer to reply to your comment. There will not be any side effect, but note that we are replacing the NULL values with the string value "UNKNOWN". I don't think that will be a problem in this case. You don't need to discard any of the existing data. Please also note that, in any case, if NULL values are inserted into DB primary keys, it will fail in most RDBMSs.
Related
I wanted to do an insert and an update at the same time in Redshift. To do this I insert the data into a temporary table, remove the updated entries from the original table, and then insert all the new and updated entries. Since the statements run concurrently, entries are sometimes duplicated because the delete starts before the insert has finished. With a very large sleep between operations this does not happen, but then the script is very slow. Is it possible to run queries in parallel in Redshift?
Hope someone can help me, thanks in advance!
You should read up on MVCC (multi-version concurrency control) and transactions. Redshift can only run one query at a time (per session), but that is not the issue. You want to COMMIT both changes at the same time (COMMIT is the action that makes changes visible to others). You do this by wrapping your SQL statements in a transaction (BEGIN ... COMMIT) executed in the same session (it is not clear whether you are using multiple sessions). All changes made within the transaction are only visible to the session making them UNTIL COMMIT, when ALL the changes made by the transaction become visible to everyone at the same moment.
A few things to watch out for - if your connection is in AUTOCOMMIT mode then you may break out of your transaction early and COMMIT partial results. Also, while you are working in a transaction your view of the source tables is unchanging (so you see consistent data during your transaction), and that snapshot isn't allowed to change under you. This means that if you have multiple sessions changing table data you need to be careful about the order in which they COMMIT so that each one sees the right version of the data.
begin transaction;
<run the queries in parallel>
end transaction;
In this specific case do this:
-- stage the new and changed rows (this can happen outside the transaction)
create temp table stage (like target);
insert into stage
select * from source
where source.filter = 'filter_expression';
-- delete-then-insert inside one transaction so other sessions only ever see
-- the state before or after the upsert, never the intermediate state
begin transaction;
delete from target
using stage
where target.primarykey = stage.primarykey;
insert into target
select * from stage;
end transaction;
-- the staging table is no longer needed
drop table stage;
See:
https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html
https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html
I use the C API functions PQprepare and PQexecPrepared to insert data into PostgreSQL; the frequency is 10-100 statements per second, with every insert statement containing 100 rows.
What confuses me is that the data I have just inserted can't be selected. About 1-3 minutes later I can select the data.
For example, I insert some data at 18:00, and in my C++ program PQresultStatus reports success, yet when I select the data at the same time the result is empty. A few minutes later, I can select the data.
I want to know whether there is a queue or something similar in PostgreSQL when I insert data at a high frequency.
There are two possibilities:
You don't commit the inserting transaction right away. That would be a bug in your application.
Your query is running on a standby server, and the modifications haven't been applied there yet.
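If you suspect the second case, here is a minimal sketch of checks you can run on the server you are actually querying (these are standard PostgreSQL functions):
SELECT pg_is_in_recovery();                      -- true means you are connected to a standby
SELECT now() - pg_last_xact_replay_timestamp();  -- rough replication lag when on a standby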
I'm trying to create a transformation that can change a field value in my DB (PostgreSQL is what I use).
Case:
In the PostgreSQL DB I have a table called Monitoring with several fields such as id, date, starttime, endtime, duration, transformation name, status, desc. All those values come from Transformation Logging.
So, when I run the transformation it inserts a row into the Monitoring table and sets the status field to Running, and when it is done it updates the status to Finish. What I am trying to do is define the value of that field myself rather than take it from Transformation Logging, so I can customize the value however I want.
The goal is to update the transformation status value from 'running' to 'finish/error/abort etc.' in my DB using Pentaho, and to display that status in a web app.
I have been thinking of using the Modified Java Script step to do it, but is there another, better way? (I just need an opinion on this.)
Apart from my remark, did you try the Value Mapper?
The Modified Java Script step is not a good idea; ideally it shouldn't be used because of its performance cost. You can use the "Add constants" step or a "User Defined Java Class" step as an alternative.
You cannot change the values of the built-in logging tables, for the simple reason that they are reserved for PDI usage. This causes a known issue in the case of a hard error: for example, the status is not set to Finish when the database server crashes, or when a NullException is not caught by the PDI code.
You have some workarounds.
The simplest, the one used in the ETL pilot, is to test (Status = Finish OR LogDate < 15 minutes ago) in the web app.
You can update the table when the transformation is not running. For example, set up an hourly (or more frequent) crontab that sets the status to Finish for any transformation whose LogDate is older than 15 minutes. This crontab may be a simple SQL statement (see the sketch below) or be included in a transformation that also checks the table sizes and/or sends an email in case of a potential error.
You can copy the table (if that is a non-locking operation in your DB system), modify the Status column, and use that copy for your web app.
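For the second workaround, the hourly SQL could be as simple as the sketch below (assuming a logdate timestamp column on the Monitoring table; adjust the column and status names to your schema):
UPDATE Monitoring
SET    status = 'Finish'
WHERE  status = 'Running'
AND    logdate < now() - interval '15 minutes';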
I am using C++ 4.8 (4.9 available) and the pqxx driver, version 4.0.1.
The PostgreSQL DB is the latest stable release.
My problem is all about complexity and resource balance:
I need to execute an INSERT into the database (and there is optionally a pqxx::result), and the id in that table is based on nextval(table_seq_id).
Is it possible to get the id of the inserted row as a result? A workaround is to ask the DB for the current value of the sequence and issue the insert query with currentvalue+1 (or +n), but this requires an "insert and ask" chain.
The DB should be able to store more than 6K large requests per second, so I would like to ask about the id as infrequently as possible. Bulk insert is not an option.
As documented in the PostgreSQL INSERT documentation, you can add a RETURNING clause to the INSERT query to return values from the inserted row(s). The docs give an example similar to what you want, returning an ID:
INSERT INTO distributors (did, dname) VALUES (DEFAULT, 'XYZ Widgets')
RETURNING did;
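Applied to the sequence-based id in the question, a sketch might look like this (the table and column names are placeholders); the returned id can then be read from the pqxx::result of the executed statement:
INSERT INTO requests (payload)
VALUES ('large request body')
RETURNING id;  -- assuming id defaults to nextval('table_seq_id')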
I have an update query that is based on the result of a select, typically returning more than 1000 rows.
If some of these rows are updated by other queries before this update can touch them, could that cause a problem with the records? For example, could they get out of sync with the original query?
If so would it be better to select and update individual rows rather than in batch?
If it makes a difference, the query is being run on Microsoft SQL Server 2008 R2
Thanks.
No.
A table cannot be updated while something else is in the process of updating it.
Databases use concurrency control and have ACID properties to prevent exactly this type of problem.
I would recommend reading up on isolation levels. The default in SQL Server is READ COMMITTED, which means that other transactions cannot read data that has been updated but not committed by a given transaction.
This means that data returned by your select/update statement will be an accurate reflection of the database at a moment in time.
If you were to change your database to READ UNCOMMITTED, then you could get into a situation where the data from your select/update is out of sync.
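If you want to confirm or explicitly pin the isolation level for your session, a minimal sketch (standard SQL Server commands):
DBCC USEROPTIONS;  -- the 'isolation level' row shows the current setting
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;  -- the default, shown here for completeness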
If you're selecting first, then updating, you can use a transaction
BEGIN TRAN
-- your select WITHOUT LOCKING HINT
-- your update based upon select
COMMIT TRAN
However, if you're updating directly from a select, then there's no need to worry about it. A single transaction is implied.
UPDATE mytable
SET value = mot.value
FROM myOtherTable mot
WHERE mytable.id = mot.id  -- illustrative join condition; substitute your own key columns
BUT... do NOT do the following, otherwise you'll run into a deadlock
UPDATE mytable
SET value = mot.value
FROM myOtherTable mot WITH (NOLOCK)
WHERE mytable.id = mot.id  -- illustrative join condition; substitute your own key columns