WSO2 DAS purging configuration not working?

WSO2 DAS version: 3.0.1
I set up WSO2 DAS after reviewing the Minimum High Availability Deployment document
(https://docs.wso2.com/display/CLUSTER44x/Minimum+High+Availability+Deployment+-+DAS+3.0.1),
and the DAS product uses MariaDB.
MariaDB has two databases (WSO2_ANALYTICS_EVENT_STORE_DB, WSO2_ANALYTICS_PROCESSED_DATA_STORE_DB), and the tables in those databases contain data after API requests.
Finally, I set up the purging configuration after reviewing the Purging Data document
(https://docs.wso2.com/display/DAS301/Purging+Data)
and changed the configuration as shown below.
<analytics-dataservice-configuration>
   <!-- The name of the primary record store -->
   <primaryRecordStore>EVENT_STORE</primaryRecordStore>
   <!-- Analytics Record Store - properties related to record storage implementation -->
   <analytics-record-store name="EVENT_STORE">
      <implementation>org.wso2.carbon.analytics.datasource.rdbms.RDBMSAnalyticsRecordStore</implementation>
      <properties>
         <property name="datasource">WSO2_ANALYTICS_EVENT_STORE_DB</property>
         <property name="category">large_dataset_optimized</property>
      </properties>
   </analytics-record-store>
   <analytics-record-store name="PROCESSED_DATA_STORE">
      <implementation>org.wso2.carbon.analytics.datasource.rdbms.RDBMSAnalyticsRecordStore</implementation>
      <properties>
         <property name="datasource">WSO2_ANALYTICS_PROCESSED_DATA_STORE_DB</property>
         <property name="category">large_dataset_optimized</property>
      </properties>
   </analytics-record-store>
   <!-- The data indexing analyzer implementation -->
   <analytics-lucene-analyzer>
      <implementation>org.apache.lucene.analysis.standard.StandardAnalyzer</implementation>
   </analytics-lucene-analyzer>
   <!-- The number of index data replicas the system should keep, for H/A, this should be at least 1, e.g. the value 0 means
        there aren't any copies of the data -->
   <indexReplicationFactor>1</indexReplicationFactor>
   <!-- The number of index shards, should be equal or higher to the number of indexing nodes that is going to be working,
        ideal count being 'number of indexing nodes * [CPU cores used for indexing per node]' -->
   <shardCount>6</shardCount>
   <!-- The amount of index data (in bytes) to be processed at a time by a shard index worker. Minimum value is 1000. -->
   <shardIndexRecordBatchSize>20971520</shardIndexRecordBatchSize>
   <!-- The interval in milliseconds, which a shard index processing worker thread will sleep during index processing operations. This setting
        along with the 'shardIndexRecordBatchSize' setting can be used to increase the final index batched data amount the indexer processes
        at a given time. Usually, higher the batch data amount, higher the throughput of the indexing operations, but will have a higher latency
        of record insertion to indexing. Minimum value of this is 10, and a maximum value is 60000 (1 minute). -->
   <shardIndexWorkerInterval>1500</shardIndexWorkerInterval>
   <!-- Data purging related configuration -->
   <analytics-data-purging>
      <!-- Below entry will indicate purging is enable or not. If user wants to enable data purging for cluster then this property
           need to be enable in all nodes -->
      <purging-enable>true</purging-enable>
      <cron-expression>0 50 11 * * ?</cron-expression>
      <!-- Tables that need include to purging. Use regex expression to specify the table name that need include to purging.-->
      <purge-include-tables>
         <table>.*</table>
         <!--<table>.*jmx.*</table>-->
      </purge-include-tables>
      <!-- All records that insert before the specified retention time will be eligible to purge -->
      <data-retention-days>365</data-retention-days>
   </analytics-data-purging>
</analytics-dataservice-configuration>
When I checked in the DAS Carbon console, the ORG_* tables' data was deleted after the purging time.
But the data in the two databases (WSO2_ANALYTICS_EVENT_STORE_DB, WSO2_ANALYTICS_PROCESSED_DATA_STORE_DB) is still there.
The question is:
Does the purging configuration cover the ORG_* tables?
Or is my setting wrong?

Could you please clarify what you mean by
"But the data in the two databases (WSO2_ANALYTICS_EVENT_STORE_DB, WSO2_ANALYTICS_PROCESSED_DATA_STORE_DB) is still there"?
Do you mean that you can still see tables in the databases? If yes, then it is by design: DAS data purging only deletes records older than the specified retention time; it does not delete the tables themselves.

Related

Cannot write to AWS Timestream table: throttling caused by ActiveMagneticStorePartitions

I was importing some old data into a Timestream table successfully for a while, but then it started to give this error:
Timestream error: com.amazonaws.services.timestreamwrite.model.ThrottlingException: Your magnetic store writes to Timestream are throttled for this database. Refer to Timestream documentation for 'ActiveMagneticStorePartitions' metric to prevent recurrence of this issue or contact AWS support.
The metric it refers to rises to the limit of 250 but drops back to 0 after a while. Even after that, when I start the import it immediately hits the limit again and the error is raised, so nothing is imported at all.
I am not running imports in parallel, only one at a time, but it still raises the error.
As a workaround, I decided to increase the memory retention period for this table, but I still get the same error for some reason, even when importing data that falls within the new memory retention period.
If you're ingesting old data, you should try to sort your data by timestamp. This will help to create fewer active partitions.
Then, before inserting the old data into Timestream, you should check the active partitions.
I met with the AWS support team several times to understand the best way to ingest data into the magnetic store (the memory store doesn't have this constraint). They suggested ingesting data sorted by timestamp. So if you have multiple devices, you should ingest the data by timestamp instead of by device.
The criteria behind an active partition are not clear, and they always talk about likelihood...
I've run load tests to ingest the same data into the magnetic store and ended up with different numbers of active partitions.
Here are the results of my load tests:
I ingested 2142288 records belonging to January 2022, which are written to the magnetic store under my current Timestream configuration. Between executions, I increased the record version to override the previous records.
January (total active partitions: 0)
Ingest 2142288 records -> new 16 active partitions (new: 16)
Ingest 2142288 records -> new 16 active partitions (new: 16, total: 32)
Ingest 2142288 records -> new 16 active partitions (new: 16, total: 48)
Ingest 2142288 records -> new 0 active partitions (new: 0, total: 48)
Ingest 2142288 records -> new 0 active partitions (new: 0, total: 48)
Without waiting for the active partitions to drop to zero, I ingested 1922784 records belonging to February 2022.
February (total active partitions: 48)
Ingest 1922784 records -> new 0 active partitions (new: 0, total: 48)
I waited until the active partitions decreased to zero, increased the record version, and ran the same tests.
February (total active partitions: 0)
Ingest 1922784 records -> new 82 active partitions (new: 82, total: 82)
As you can see, there is no clear pattern to the creation of active partitions, but if you sort your data by timestamp you'll have a better likelihood of success while ingesting data into the magnetic store.
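To illustrate the sorting idea in code, here is a minimal sketch using the AWS SDK for Java v1 (the same SDK the ThrottlingException above comes from). It assumes you already have a list of prepared Record objects for the backfill; the database and table names are placeholders. It sorts the records by their timestamp and writes them in chunks of 100, the per-call limit of WriteRecords.

import com.amazonaws.services.timestreamwrite.AmazonTimestreamWrite;
import com.amazonaws.services.timestreamwrite.AmazonTimestreamWriteClientBuilder;
import com.amazonaws.services.timestreamwrite.model.Record;
import com.amazonaws.services.timestreamwrite.model.WriteRecordsRequest;

import java.util.Comparator;
import java.util.List;

public class SortedBackfill {

    // Sketch: sort historical records by timestamp so that consecutive
    // WriteRecords batches touch as few magnetic-store partitions as possible.
    public static void backfill(List<Record> history, String database, String table) {
        AmazonTimestreamWrite client = AmazonTimestreamWriteClientBuilder.defaultClient();

        // Record.getTime() holds the timestamp as a string (epoch value in the record's TimeUnit).
        history.sort(Comparator.comparingLong((Record r) -> Long.parseLong(r.getTime())));

        int batchSize = 100; // WriteRecords accepts at most 100 records per call
        for (int i = 0; i < history.size(); i += batchSize) {
            List<Record> batch = history.subList(i, Math.min(i + batchSize, history.size()));
            client.writeRecords(new WriteRecordsRequest()
                    .withDatabaseName(database)
                    .withTableName(table)
                    .withRecords(batch));
        }
    }
}

This only reduces how many distinct magnetic-store partitions a run touches at once; as the load tests above show, it does not guarantee you stay under the 250 ActiveMagneticStorePartitions limit.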
I have the exact same issue.
We have a table that we were ingesting historical records into, and it was working fine until we got past some threshold (not sure if it's worth mentioning, but this table is also being written to in real time with current data as it arrives). We got ~500M rows into the table without ever hitting the 250 active partitions limit, the data is ordered, etc.
Then a few weeks ago something changed, and ever since then, whenever we write historical rows to the table, it almost immediately jumps from 0 to 250 active magnetic partitions and historical ingestion is halted. We've been battling this for weeks.
Our solution was to create another empty table, import historical records into that, and then every ~50M rows use a scheduled query to copy all the data from this "temp" table into the actual table we want to use.
This temp table's settings are basically minimal memory store and maximal magnetic store, since we're only writing historical data that's at least 6 months old.
For some reason this works fine; all rows are accounted for and it never hits the 250 active partition limit. It costs a bit more, but not much more in our case, and it's the only thing we've found that works.
If we write the same data to our original table, it immediately hits 250 active magnetic partitions. Pause the process, change the target table, run it again, and the new target table barely gets beyond the 8-12 active magnetic partition range for the same data.
Running the scheduled query to copy the data from the temp table to the target table seems to have zero impact on the target table's active partition counter as far as I can see; I'm assuming it's just happening behind the scenes somewhere.
At present this seems to be the only path to finishing our historical data import.
Streaming real-time or "present day" data to the memory store always works fine; this is specifically only happening when writing historical data to the magnetic storage.

WSO2 ESB | Extracting data from a database and inserting it into another database every hour using tasks

I want to extract some data from a database and insert it into another database every hour using tasks, and I don't know how to start. The thing is that in the first database the data is uploaded every hour, so I only want to extract and insert (into the second database) the data inserted in the last hour.
I created a task like this:
<task class="org.apache.synapse.startup.tasks.MessageInjector" group="synapse.simple.quartz" name="task.sms.send" xmlns="http://ws.apache.org/ns/synapse">
   <trigger interval="3600"/>
   <property name="injectTo" value="sequence" xmlns:task="http://www.wso2.org/products/wso2commons/tasks"/>
   <property name="message" xmlns:task="http://www.wso2.org/products/wso2commons/tasks">
      <a xmlns="">task</a>
   </property>
   <property name="sequenceName" value="seq.test" xmlns:task="http://www.wso2.org/products/wso2commons/tasks"/>
</task>
What should the code in the sequence look like?
To answer your question: first, you will have to create two Data Services, one for the source database and one for the destination. The source data service should allow you to retrieve records inserted after a given timestamp, so you should be able to pass the TIMESTAMP to the Data Service and retrieve the records inserted in the last hour. (You can also implement this logic in the query itself; different database providers have built-in functions for working with time.)
Then, in the sequence, you can have two Call mediators: one to retrieve the records and a second one to insert them (assuming no data transformation is required). Also note that WSO2 supports batch processing, so there is no need to insert records one by one. You can refer to this.
Having said that, the approach you are trying to follow is OK if you don't mind missing records or processing duplicate entries. If that's a concern, you can save the last processed record ID either in the WSO2 registry or somewhere else and fetch the records after the last processed one.

AWS Neptune Node counts timing out

We're running a large bulk load into AWS Neptune and can no longer query the graph to get node counts without the query timing out. What options do we have to ensure we can audit the total counts in the graph?
It fails both via curl and from a SageMaker notebook.
There are a few things you could consider.
The easiest is to just increase the timeout specified in the cluster and/or instance parameter group, so that the query can (hopefully) complete.
If your Neptune engine version is 1.0.5.x then you can use the DFE engine to improve Gremlin count performance. You just need to enable the DFE engine using DFEQueryEngine=viaQueryHint in the cluster parameter group.
If you get the status of the load, it will show you a value for the number of records processed so far. In this context a record is not a row from a CSV file or RDF format file; instead it is the count of triples loaded in the RDF case, and the count of property values and labels in the property graph case. As a simple example, imagine a CSV file with 100 rows where each row has 6 columns: not counting the ID column, that is a label and 4 properties, so the total number of records to load will be 100*5, i.e. 500. If you have sparse rows then the calculation will be approximate unless you add up every non-ID column with an actual value.
If you have the Neptune streams feature enabled, you can inspect the stream and find the last vertex or edge created. Note that enabling streams just for this purpose may not be the ideal choice, since adding to the stream adds some overhead and will slow down the load.

DynamoDB transaction limits increase

Currently, I need to update two tables concurrently: one table contains configurations and the other contains the items linked to a configuration. Whenever a configuration is updated, it provides me with the list of items that belong to this configuration (it can be 100-1000 items or more). How can I update DynamoDB using a transaction?
I need to update two tables in concurrency
See https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/transaction-apis.html
TransactWriteItems is a synchronous and idempotent write operation that groups up to 25 write actions in a single all-or-nothing operation. These actions can target up to 25 distinct items in one or more DynamoDB tables within the same AWS account and in the same Region. The aggregate size of the items in the transaction cannot exceed 4 MB. The actions are completed atomically so that either all of them succeed or none of them succeeds.
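Given the 25-action limit quoted above and the 100-1000 items per configuration mentioned in the question, a single TransactWriteItems call cannot cover one update; you would have to split the write actions into chunks of 25. Below is a minimal sketch using the AWS SDK for Java v1, assuming you have already built the list of TransactWriteItem actions (the configuration update plus its item updates). Note that each chunk is atomic on its own, but the update as a whole is not one all-or-nothing transaction.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.TransactWriteItem;
import com.amazonaws.services.dynamodbv2.model.TransactWriteItemsRequest;

import java.util.List;

public class ChunkedTransactions {

    private static final int TRANSACTION_LIMIT = 25; // max write actions per TransactWriteItems call

    // Sketch: apply a large list of prepared write actions as a series of
    // 25-item transactions. Each chunk succeeds or fails as a unit; the run
    // as a whole is NOT a single all-or-nothing transaction.
    public static void writeAll(List<TransactWriteItem> actions) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        for (int i = 0; i < actions.size(); i += TRANSACTION_LIMIT) {
            List<TransactWriteItem> chunk =
                    actions.subList(i, Math.min(i + TRANSACTION_LIMIT, actions.size()));
            client.transactWriteItems(new TransactWriteItemsRequest().withTransactItems(chunk));
        }
    }
}

If the whole configuration update really must be all-or-nothing, you would need to keep the number of affected items within a single transaction's limit, or handle partial failures yourself (for example by retrying or rolling back failed chunks).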

Why does WSO2 AM execute a count query against the MB_METADATA table over and over again?

We have enabled advanced throttling for WSO2 AM 2.6.0. Once this was enabled and the execution plans were appropriately created, we noticed that over 35M select count queries per hour were executing against the MB_METADATA table.
Also, the MB_METADATA and MB_CONTENT tables are constantly growing, and the row count never goes down.
I have disabled all statistics as well as tracing. We have 4 WSO2 servers, each one running independently with the gateway, key manager, and traffic manager on the same box. The DB is Oracle.
We are seeing this query run 35 million times per hour:
SELECT COUNT(MESSAGE_ID) AS count
FROM MB_METADATA
WHERE QUEUE_ID=:1
AND MESSAGE_ID BETWEEN :2 AND :3
AND DLC_QUEUE_ID=-1
I would expect the table sizes to be manageable and this query not to run at such a high rate.
Any suggestions on what might be going on? Maybe there is a configuration that I need to disable?
Sharing the MB database is not correct. Each traffic manager node should have its own MB database, and it can be the default H2 one.
Quoted from the docs:
Do not share the WSO2_MB_STORE_DB database among the nodes in an Active-Active set-up or Traffic Manager HA scenario, because each node should have its own local WSO2_MB_STORE_DB database to act as separate Traffic Managers. The latter mentioned DBs can be either H2 DBs or any RDBMS such as MySQL. If the database gets corrupted then you need to replace the database with a fresh database that is available in the product distribution.
Ref: https://docs.wso2.com/display/AM260/Installing+and+Configuring+the+Databases