How to implement the Tungsten Replicator BuildAuditTable filter - database-replication

This is regarding:
com.continuent.tungsten.replicator.filter.BuildAuditTable
I tried both Tungsten 2.2 and 3.0.
My initial configuration:
./tools/tpm install alpha \
--topology=master-slave \
--master=host1 \
--replication-user=tungsten \
--replication-password=password \
--install-directory=/opt/continuent \
--members=host1,host2 \
--start
Then I tried adding the BuildAuditTable filter in the following two ways:
Try 1:
./tools/tpm update alpha \
--property='replicator.filter.bidiSlave.auditf=com.continuent.tungsten.replicator.filter.BuildAuditTable' \
--property='replicator.filter.bidiSlave.auditf.targetTableName=indiaresorts.audit_table' \
--repl-svc-applier-filters=auditf
Try 2:
./tools/tpm update alpha \
--property='replicator.filter.auditf=com.continuent.tungsten.replicator.filter.BuildAuditTable' \
--property='replicator.filter.auditf.targetTableName=indiaresorts.audit_table' \
--repl-svc-applier-filters=auditf
But both times I got the following error on host2 (the slave):
ERROR:
pendingExceptionMessage: Plugin class name property is missing or null: key=replicator.filter.auditf
Please let me know how I can get past this issue. I also have a question about the audit table: is it created automatically or do we have to create it ourselves, and what will its schema be (column names, etc.)?
Waiting for your kind response.

I had to add a .tpl (template) file prior to tpm install in order to define a new property in the static-{service_name}.properties configuration file.
Create a new directory, tungsten-replicator/filters, under the location where you extracted the Tungsten tarball.
Starting from 3.0.0, this can also be in a directory specified with --template-search-path.
Add tungsten-replicator/filters/your_name_of_choice.tpl containing the custom property keys and their default values, for example:
replicator.filter.custom=com.continuent.tungsten.replicator.filter.JavaScriptFilter
replicator.filter.custom.script=
replicator.filter.custom.config=
Install:
./tools/tpm install alpha \
...
--property='replicator.filter.custom.script=path/to/script' \
--property='replicator.filter.custom.config=path/to/config' \
--repl-svc-applier-filters=custom
You can check the service configuration file at path/to/installation_directory/{service_name}/tungsten/tungsten-replicator/conf/static-{service_name}.properties on one of the nodes to see if the template file was incorporated.
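For the BuildAuditTable filter from the original question, the same approach would look roughly like this (a sketch based only on the property keys the question already uses; I have not verified which other properties the filter accepts):
# tungsten-replicator/filters/auditf.tpl
replicator.filter.auditf=com.continuent.tungsten.replicator.filter.BuildAuditTable
replicator.filter.auditf.targetTableName=
# install (or update), overriding the default value and enabling the filter on the applier
./tools/tpm install alpha \
...
--property='replicator.filter.auditf.targetTableName=indiaresorts.audit_table' \
--repl-svc-applier-filters=auditf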
As for the audit table, a casual glance at the source code seems to indicate that you need to create the table yourself, and that its schema is the same as that of the incoming table. Which means either the replication must be restricted to a single table or the audit table must contain all possible columns contained in the database.
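If, for example, the service replicates a single table, a minimal sketch of creating the audit table by hand (assuming MySQL; the source table name below is a placeholder) would be:
-- assumes replication is restricted to this one source table
CREATE TABLE indiaresorts.audit_table LIKE indiaresorts.resort_bookings;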

Related

BigQuery Query from terminal

I have a set of date-sharded tables that came from a sink between Cloud Logging and BigQuery, and I'm trying to run this query from my terminal to get the results locally (so I can export them, because I can't add a new service account or anything like that).
This is the query:
bq query --use_legacy_sql=true \
'
SELECT
timestamp AS Time,
logName AS Log,
textPayload AS Message
FROM (TABLE_DATE_RANGE([mytable.stdout_], DATE_ADD(CURRENT_TIMESTAMP(), -1, 'MONTH'), CURRENT_TIMESTAMP()))
'
It works perfectly fine in the Google Cloud console, but unfortunately when I run it from my terminal it gives me this error:
Error in query string: Error processing job 'mytable_0000017e01909804_1': Timestamp evaluation error: (L2:1): SELECT query which references non constant fields or uses aggregation functions or has one or more of WHERE, OMIT IF, GROUP
BY, ORDER BY clauses must have FROM clause.
How can I fix it?
BigQuery supports two SQL dialects: Standard SQL and Legacy SQL. In this case, I'm using Standard SQL, so I set the flag use_legacy_sql=false, because some commands and syntax differ between the two dialects. In Standard SQL there is no TABLE_DATE_RANGE; the equivalent is a wildcard table filtered on _TABLE_SUFFIX.
Here, you can see more information about this flag.
bq query --use_legacy_sql=false \
'
SELECT *
FROM `myproject.mydataset.people_*`
WHERE _TABLE_SUFFIX BETWEEN
  FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))
  AND FORMAT_DATE("%Y%m%d", CURRENT_DATE())
'
You can see this documentation for the Legacy SQL functions and syntax that BigQuery supports.
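Applied to the original query (assuming the log tables are date-sharded as stdout_YYYYMMDD under the same dataset), the full command would look roughly like this; using double quotes inside the query also avoids breaking the outer single-quoted shell string:
bq query --use_legacy_sql=false \
'
SELECT
  timestamp AS Time,
  logName AS Log,
  textPayload AS Message
FROM `mytable.stdout_*`
WHERE _TABLE_SUFFIX BETWEEN
  FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))
  AND FORMAT_DATE("%Y%m%d", CURRENT_DATE())
'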

How to set write preference when creating a BigQuery Data Transfer through command line

I'm trying to create a data transfer with bq mk --transfer_config.
I want the write preference to be MIRROR, but I can't find how to set this value through the params.
--params='{"data_path_template":"gs://",
"destination_table_name_template":"",
"file_format":"CSV",
"max_bad_records":"1",
"ignore_unknown_values":"true",
"field_delimiter":"^",
"skip_leading_rows":"1",
"write_preference":"MIRROR",
"allow_quoted_newlines":"true",
"allow_jagged_rows":"true",
"delete_source_files":"false"}' \
Doesn't work:
BigQuery error in mk operation: Data source definition doesn't define this parameter Id: write_preference
And I can't find any documentation about how to set this value.
"write_disposition":"MIRROR"
Seems to work.
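For reference, a full command using that key might look like the following sketch (the data source, dataset, display name, and bucket path are placeholder assumptions; the relevant part is write_disposition in --params):
bq mk --transfer_config \
--data_source=google_cloud_storage \
--target_dataset=mydataset \
--display_name="gcs_mirror_load" \
--params='{"data_path_template":"gs://mybucket/*.csv",
"destination_table_name_template":"mytable",
"file_format":"CSV",
"field_delimiter":"^",
"skip_leading_rows":"1",
"write_disposition":"MIRROR"}'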

BigQuery EXPORT DATA statement creating multiple files with no data and just a header record

I have read about a similar issue here but am not able to tell whether it has been fixed:
Google bigquery export table to multiple files in Google Cloud storage and sometimes one single file
I am using the BigQuery EXPORT DATA OPTIONS below to export the data from 2 tables into a file. I have written a SELECT query for this.
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT
Only 2 rows are returned by my SELECT query, so I assumed that only one file would be created in Google Cloud Storage; as I understand it, multiple files are created only when the data is more than 1 GB.
However, 3 files got created in Cloud Storage: 2 files had just the header record, and the third file had 3 records (one header and 2 actual data records).
radhika_sharma_ibm#cloudshell:~ (whr-asia-datalake-nonprod)$ gsutil ls gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000000.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000001.csv
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_2021-02-04000000000002.csv
Why are empty files getting created?
Can anyone please help? We don't want empty files. I believe only one file should be created when the data is under 1 GB; above 1 GB we should have multiple files, but not empty ones.
You have to force all the data to be processed by one worker. That way you will export only one file (if it is under 1 GB).
My workaround: add a SELECT DISTINCT * on top of the SELECT statement.
Under the hood, BigQuery uses multiple workers to read and process different sections of the data, and when we use wildcards, each worker creates a separate output file.
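Applied to the statement from the question, the workaround is just a wrapper around the existing query (a sketch; the inner SELECT is the original one combining the 2 tables):
EXPORT DATA OPTIONS(
uri='gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master_'||CURRENT_DATE()||'*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter='|') AS
SELECT DISTINCT * FROM (
-- the original SELECT combining the two tables goes here
SELECT ...
)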
Currently BigQuery produces empty files even if no data is returned and thus we get multiple empty files. The Bigquery product team is aware of this issue and they are working to fix this, however there is no ETA which can be shared.
There is a public issue tracker that will be updated with periodic progress. You can STAR the issue to receive automatic updates and give it traction by referring to this link.
However, for the time being I would like to provide a workaround as follows:
If you know that the output will be less than 1 GB, you can specify a single URI to get a single output file; however, the EXPORT DATA statement doesn't support a single URI.
You can use the bq extract command to export the BQ table.
bq --location=location extract \
--destination_format format \
--compression compression_type \
--field_delimiter delimiter \
--print_header=boolean \
project_id:dataset.table \
gs://bucket/filename.ext
In fact, bq extract should not have the empty-file issue that the EXPORT DATA statement has, even when you use a wildcard URI.
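A concrete invocation along the lines of the question might look like this (the location, dataset, and table name are placeholder assumptions):
bq --location=asia-south1 extract \
--destination_format CSV \
--field_delimiter '|' \
--print_header=true \
whr-asia-datalake-nonprod:mydataset.customer_master \
gs://whr-asia-datalake-dev-standard/outbound/Adobe/Customer_Master.csv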
I faced the same empty-files issue when using EXPORT DATA.
After doing a bit of R&D I found a solution: put LIMIT xxx in your SELECT SQL and it will do the trick.
You can find the row count and use that as the LIMIT value.
SELECT ....
FROM ...
WHERE ...
LIMIT xxx
It turns out the wildcard syntax is what enforces multiple files: a set of files for CSV, or a folder for other formats like AVRO.
The uri option must be a single-wildcard URI, as described at
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements
Specifying a wildcard seems to start several workers on the extract, and as per the documentation, the size of the exported files will vary.
Zero-length files are unusual but technically possible if the first worker finishes before any of the others really get started. That is why the wildcard is expected to be used only when you think your exported data will be larger than 1 GB.
I have just faced the same issue with Parquet but found that the bq CLI works, which should do the trick for any format.
See (and star for traction) https://issuetracker.google.com/u/1/issues/181016197

Doctrine is not retrieving new columns

I'm using the default find() method to get data. Everything was OK until I added a new property to the entity: the same code is not retrieving the new column, and I don't see it in the generated SQL query!
I've added other properties with different types and the problem remains: the new properties aren't visible in the SQL query!
This is the result of doctrine:schema:validate:
Mapping
-------
[OK] The mapping files are correct.
Database
--------
[ERROR] The database schema is not in sync with the current mapping file.
So it seems the schema is not OK. How can I find the problem?
UPDATE 1
I have another project with no database error. I've updated an existing entity:
bin/console make:entity
Then updated the schema:
bin/console doctrine:schema:update --force
But the new column is not retrieved!
UPDATE 2
I've generated a new entity with the same properties and Doctrine returns all the columns. I've made a diff between the two entities and the two repositories: they are identical!
I've cleared Doctrine's cache:
php bin/console doctrine:cache:clear-metadata
php bin/console doctrine:cache:clear-query
php bin/console doctrine:cache:clear-result
...but the problem persists
It was a Doctrine cache issue. Here is the relevant part of my doctrine.yaml:
orm:
    metadata_cache_driver: apcu
    result_cache_driver: apcu
    query_cache_driver: apcu
I've tried to clear the cache with php -r "apcu_clear_cache();": no effect! So I changed the cache drivers:
orm:
    metadata_cache_driver: array
    result_cache_driver: array
    query_cache_driver: array
And now it works fine :)

Attribute selection with Filtered classifier for saved model Weka

I trained my model with a FilteredClassifier with attribute selection in Weka. Now I am unable to use the serialized model for test data classification; I searched a lot but really couldn't figure it out. This is what I am doing at the moment:
java -cp $CLASSPATH weka.filters.supervised.attribute.AddClassification \
-serialized Working.model \
-classification \
-remove-old-class \
-i full_data.arff \
-c last
It gives me an error saying
weka.core.WekaException: Training header of classifier and filter dataset don't match
But they aren't supposed to match, right? The test data shouldn't have the class in the header. How should I use it? Also, I hope the selected attributes are serialized and saved in the model, since the same attribute selection needs to be applied to the test data.
I prefer not to use the batch classifier, since it defeats the point of saving the model and requires me to run the whole training again each time.
One easy way to get it to work is to add the nominal class attribute to the ARFF file you created, filled with dummy values, and then remove it with the -remove-old-class option.
So your command stays the same, but your ARFF file will have the class attribute this time.
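A minimal sketch of what the padded test ARFF might look like (the feature names and class labels here are made up; the only point is the dummy class attribute at the end, which must match the training header):
@relation full_data

@attribute feature1 numeric
@attribute feature2 numeric
% dummy class attribute, filled with an arbitrary label so the header matches training
@attribute class {yes,no}

@data
1.0,2.5,yes
0.3,4.1,yes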