How to set up replication in BerkeleyDB

I've been struggling for some time now with setting up a "simple" BerkeleyDB replication using the db_replicate utility.
However, I've had no luck making it actually work, and I'm not finding any concrete example of how things should be set up.
Here is the setup I have so far. The environment is Debian Wheezy with BDB 5.1.29.
Database generation
A simple Python script reads "CSV" files and inserts each line into the BDB file:
from glob import glob
from bsddb.db import DBEnv, DB
from bsddb.db import DB_CREATE, DB_PRIVATE, DB_INIT_MPOOL, DB_BTREE, DB_HASH, DB_INIT_LOCK, DB_INIT_LOG, DB_INIT_TXN, DB_INIT_REP, DB_THREAD

env = DBEnv()
env.set_cachesize(0, 1024 * 1024 * 32)
env.open('./db/', DB_INIT_MPOOL | DB_INIT_LOCK | DB_INIT_LOG |
         DB_INIT_TXN | DB_CREATE | DB_INIT_REP | DB_THREAD)

db = DB(env)
db.open('apd.db', dbname='stuff', flags=DB_CREATE, dbtype=DB_BTREE)

for csvfile in glob('Stuff/*.csv'):
    for line in open(csvfile):
        db.put(line.strip(), None)

db.close()
env.close()
DB Configuration
The DB_CONFIG file is where I'm missing the most important part, I guess. So far it only contains:
repmgr_set_local_site localhost 6000
Actual replication attempt
# Copy the database file to begin with
db5.1_hotbackup -h ./db/ -b ./other-place
# Start replication master
db5.1_replicate -M -h db
# Then try to connect to it
db5.1_replicate -h ./other-place
The only thing I currently get from the replicate tool is:
db5.1_replicate(20648): DB_ENV->open: No such file or directory
Edit: after stracing the process, I found out it was trying to access __db.001, so I copied those files manually. The current output is:
db5.1_replicate(22295): repmgr is already started
db5.1_replicate(22295): repmgr is already started
db5.1_replicate(22295): repmgr_start: Invalid argument
I suppose I'm missing the actual configuration value for the client to connect to the server, but so far no luck, as all the settings I tried yielded "unrecognized name-value pair" errors.
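From reading the repmgr documentation, I would expect each site's DB_CONFIG to contain something along these lines (this is only my guess, using the 5.1-era repmgr_*/rep_* DB_CONFIG directives; the hosts, ports and priorities below are placeholders), though I haven't managed to confirm it:
On the master:
repmgr_set_local_site master-host 6000
rep_set_priority 100
On the client:
repmgr_set_local_site client-host 6001
repmgr_add_remote_site master-host 6000
rep_set_priority 10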
Does anyone know how this setup might be completed? Maybe I'm not even headed in the right direction and this should be something completely different?

Related

ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream

I am doing distributed training using the GCP Vertex platform. The model is trained in parallel across 4 GPUs using PyTorch and HuggingFace. After training, when I save the model from the local container to the GCP bucket, it throws me the error below.
Here is the code:
I launch the train.py this way:
python -m torch.distributed.launch --nproc_per_node 4 train.py
After training is complete, I save the model files using this. There are 3 files that need to be saved.
trainer.save_model("model_mlm") #Saves in local directory
subprocess.call('gsutil -o GSUtil:parallel_composite_upload_threshold=0 cp -r /pythonPackage/trainer/model_mlm gs://*****/model_mlm', shell=True, stdout=subprocess.PIPE) #from local to GCP
Error:
ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream; this can happen if a file changes size while being uploaded
And sometimes I get this error:
ResumableUploadAbortException: 409 The object has already been created in an earlier attempt and was overwritten, possibly due to a race condition.
As per the documentation, this is a name conflict: you are trying to overwrite a file that has already been created.
So I would recommend changing the destination location to include a unique identifier per training run so you don't receive this type of error, for example by adding a timestamp string at the end of your bucket path like:
- gs://pypl_bkt_prd_row_std_aiml_vertexai/model_mlm_vocab_exp2_50epocs/20220407150000
I would like to mention that this kind of error is retryable, as mentioned in the error docs.
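A minimal sketch of that approach, reusing the subprocess-based gsutil call from the question (the bucket path here is a placeholder, not the real one):
import subprocess
from datetime import datetime

# Build a unique destination per training run by appending a timestamp
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
destination = "gs://my-bucket/model_mlm/" + timestamp  # placeholder bucket path

subprocess.call(
    "gsutil -o GSUtil:parallel_composite_upload_threshold=0 cp -r /pythonPackage/trainer/model_mlm " + destination,
    shell=True, stdout=subprocess.PIPE)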

Increase HADOOP_HEAPSIZE in Amazon EMR to run a job with a few million input files

I am running into an issue with my EMR jobs where too many input files throw out-of-memory errors. Doing some research, I think changing the HADOOP_HEAPSIZE config parameter is the solution. Old Amazon forums from 2010 say it cannot be done.
Can we do that now, in 2018?
I run my jobs using the C# API for EMR, and normally I set configurations using statements like the ones below. Can I set HADOOP_HEAPSIZE using similar commands?
config.Args.Insert(2, "-D");
config.Args.Insert(3, "mapreduce.output.fileoutputformat.compress=true");
config.Args.Insert(4, "-D");
config.Args.Insert(5, "mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec");
config.Args.Insert(6, "-D");
config.Args.Insert(7, "mapreduce.map.output.compress=true");
config.Args.Insert(8, "-D");
config.Args.Insert(9, "mapreduce.task.timeout=18000000");
If I need to bootstrap using a file, I can do that too; someone would just need to show me the contents of the file for the config change.
Thanks
I figured it out...
I created a shell script to increase the memory size on the master machine (code at the end).
I run a bootstrap action like this:
ScriptBootstrapActionConfig bootstrapActionScriptForHeapSizeIncrease = new ScriptBootstrapActionConfig
{
    Path = "s3://elasticmapreduce/bootstrap-actions/run-if",
    Args = new List<string> { "instance.isMaster=true", "<s3 path to my shell script>" },
};
The shell script code is this
#!/bin/bash
SIZE=8192
if ! [ -z "$1" ]; then
    SIZE=$1
fi
echo "HADOOP_HEAPSIZE=${SIZE}" >> /home/hadoop/conf/hadoop-user-env.sh
Now I am able to run an EMR job with the master machine type as r3.xlarge and process 31 million input files.

Qt - pdftocairo PDF conversion process not working when application is on auto-start

I am running my Qt (4.8, QWS server, QWidget app) application on an ARM/embedded Linux platform. In my application, I have a module/widget to view PDF files.
Since this is a slower processor, it was better to go for a conversion of the PDF file to image files using pdftocairo. The module also has a feature to import any PDF file from a flash drive and convert it to images using pdftocairo. The entire module works as expected when I manually start the application from the command line. Here is the code that imports the PDF file into the device in the form of images:
QString CacheName = PDFList->currentItem()->text(); //name of PDF file without ".pdf"
QString PDFString = "pdftocairo -jpeg -r 200 \"/media/usb/" + CacheName + ".pdf\" \"/opt/.pdf/" + CacheName + "\"";
qDebug() << PDFString;
QProcess PDFCacheprocess;
PDFCacheprocess.startDetached(PDFString); //or PDFCacheprocess.start(PDFString)
The ultimate goal of the project is to have the application auto-start when the device boots up. However, when starting the application automatically, the import feature doesn't seem to do anything. I am stumped, not being able to identify the problem because I do not have any debug output (which I do have when executing the app normally).
I normally execute the application manually with
/opt/[path]/[application name] -qws
When auto-starting, I redirect the application's output into a file, log.txt, by adding &>/opt/log.txt. The output seems to be the same as when I am running with the manual command. This is the content of the file during the import process (no error being reported):
"pdftocairo -jpeg -r 200 "/media/usb/manual.pdf" "/opt/.pdf/manual"
Strangely enough, every other command (other than pdftocairo) is working. I tried to replace this command with QString PDFString = "/opt/./importPDF.sh". The script was executed for any other command it contained (like reboot), but again, it would fail if it contained the pdftocairo command.
I also tried to add a slot connected to QProcess::finished(int) to show the QProcess output:
connect(&PDFCacheprocess, SIGNAL(finished(int)), this, SLOT(pdfImportStatus(int)));
void UserManual::pdfImportStatus(int)
{
    qDebug() << PDFCacheprocess.errorString() << '\t' << PDFCacheprocess.exitCode();
}
For the manual execution (when import works), I would get:
"pdftocairo -jpeg -r 200 "/media/usb/manual.pdf" "/opt/.pdf/manual""
"Unknown error" 0
For the auto-start, log.txt only shows this (it seems like the slot isn't being triggered?):
"pdftocairo -jpeg -r 200 "/media/usb/manual.pdf" "/opt/.pdf/manual""
Any help is appreciated. Thanks in advance :)
Apparently, the problem was that the command was not being recognized from the working directory (only when auto-starting, for some reason). When using a QProcess, it turns out that it is always good to give the full path, even if the file/command exists in the environment variables - as was the case here ($PATH).
I had to replace the QString with:
QString PDFString = "/usr/local/bin//pdftocairo -jpeg -r 200 \"/media/usb/" + CacheName + ".pdf\" \"/opt/.pdf/" + CacheName + "\"";

FileName Port is not supported with connection or merge option

I need to create a CSV flat file and store it in a particular path on FTP.
The file name should be created dynamically with a timestamp. I have created the FileName port in Informatica and mapped it to an expression I created. When I ran the workflow, I got the error below:
Severity Timestamp Node Thread Message Code Message
ERROR 28-06-2017 07:31:19 PM node01_oktst93 WRITER_1_*_1 WRT_8419 Flat File Target [NewOrders] FileName Port is not supported with connection or merge option.
Please help me resolve this without deleting the FileName port.
Thanks
If your requirement is to create a dynamic file during each session run, please check the steps below:
1) Connect the Source Qualifier to an Expression transformation. In the Expression transformation, create an output port (call it File_Name) and assign it the expression 'FileNameXXX'||to_char(sessstarttime, 'YYYYMMDDHH24MISS')||'.csv'
2) Now connect the Expression transformation to the target and connect the File_Name port of the Expression transformation to the FileName port of the target file definition.
3) Create a workflow and run the workflow.
I have used sessstarttime, as it is constant throughout the session run. If you use sysdate, a new file will be created whenever a new transaction occurs in the session run.
The FileName port option doesn't work with the FTP target option. If you are simply using a local flat file, please disable the 'Append if Exists' option at the session level.
Please refer to the Informatica KB below:
https://kb.informatica.com/solution/11/Pages/102937.aspx
Late answer but may help some.
Since the FileName port option doesn't work with the FTP target option, another way is to:
Create a variable in the workflow.
Then create an Assignment task in between.
Then you may set the $variable to the full path, i.e.
'/path/to_drop/file/name_of_file_'||to_char(SYSDATE, 'YYYYMMDD')||'.csv'
Use that $variable now in your session under the workflow.
Add it in your mappings now.

How to set up autoreload with Flask+uWSGI?

I am looking for something like uWSGI + Django's autoreload mode, but for Flask.
I am running uwsgi version 1.9.5 and the option
uwsgi --py-autoreload 1
works great
If you're configuring uwsgi with command arguments, pass --py-autoreload=1:
uwsgi --py-autoreload=1
If you're using a .ini file to configure uwsgi and using uwsgi --ini, add the following to your .ini file:
py-autoreload = 1
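For context, a minimal sketch of such an .ini file (the module and socket values here are placeholders, not part of the original answer):
[uwsgi]
# placeholders: adjust module/socket to your app
module = myapp:app
master = true
socket = /tmp/myapp.sock
# reload workers when a .py file changes (development only)
py-autoreload = 1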
For a development environment, you can try using uwsgi's --python-autoreload parameter.
Looking at the source code, it may work only in threaded mode (--enable-threads).
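For example, combining the two (assuming an otherwise-working uwsgi invocation; the .ini file name is a placeholder):
uwsgi --ini myapp.ini --python-autoreload=1 --enable-threads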
You could try using supervisord as a manager for your Uwsgi app. It also has a watch function that auto-reloads a process when a file or folder has been "touched"/modified.
You will find a nice tutorial here: Flask+NginX+Uwsgi+Supervisord
The auto-reloading functionality of development-mode Flask is actually provided by the underlying Werkzeug library. The relevant code is in werkzeug/serving.py -- it's worth taking a look at. But basically, the main application spawns the WSGI server as a subprocess that stats every active .py file once per second, looking for changes. If it sees any, the subprocess exits, and the parent process starts it back up again -- in effect reloading the changes.
There's no reason you couldn't implement a similar technique at the layer of uWSGI. If you don't want to use a stat loop, you can try using underlying OS file-watch commands. Apparently (according to Werkzeug's code), pyinotify is buggy, but perhaps Watchdog works? Try a few things out and see what happens.
Edit:
In response to the comment, I think this would be pretty easy to reimplement. Building on the example provided from your link, along with the code from werkzeug/serving.py:
""" NOTE: _iter_module_files() and check_for_modifications() are both
copied from Werkzeug code. Include appropriate attribution if
actually used in a project. """
import uwsgi
from uwsgidecorators import timer
import sys
import os
def _iter_module_files():
for module in sys.modules.values():
filename = getattr(module, '__file__', None)
if filename:
old = None
while not os.path.isfile(filename):
old = filename
filename = os.path.dirname(filename)
if filename == old:
break
else:
if filename[-4:] in ('.pyc', '.pyo'):
filename = filename[:-1]
yield filename
#timer(3)
def check_for_modifications():
# Function-static variable... you could make this global, or whatever
mtimes = check_for_modifications.mtimes
for filename in _iter_module_files():
try:
mtime = os.stat(filename).st_mtime
except OSError:
continue
old_time = mtimes.get(filename)
if old_time is None:
mtimes[filename] = mtime
continue
elif mtime > old_time:
uwsgi.reload()
return
check_for_modifications.mtimes = {} # init static
It's untested, but should work.
py-autoreload=1
in the .ini file does the job
import gevent.wsgi
import werkzeug.serving

@werkzeug.serving.run_with_reloader
def runServer():
    gevent.wsgi.WSGIServer(('', 5000), app).serve_forever()
(You can use an arbitrary WSGI server)
I am afraid that Flask is really too bare bones to have an implementation like this bundled by default.
Dynamically reloading code in production is generally a bad thing, but if you are only concerned about a dev environment, take a look at this bash shell script: http://aplawrence.com/Unixart/watchdir.html
Just change the sleep interval to whatever suits your needs and substitute the echo command with whatever you use to reload uwsgi. I run uwsgi in master mode and just send a killall uwsgi command.