Incremental backups: how to track file deletions - c++

I have an offsite backup solution written in C++ which breaks files into blocks and keeps track of the blocks using MD5 hashes in an SQLite3 database. It then transfers the blocks, along with the database, to a remote site.
So, when I want to do a restore, it queries the SQLite3 database and restores the blocks accordingly.
When the first backup runs, it creates a big table called the base_backup. Every subsequent file change or new file is added as a new record in a new table. If I want to do a restore, I query the base_backup table plus all the differences and restore the files.
When the backup runs, it scans all the files in a given folder and checks the archive bit; if the bit is cleared, it verifies whether a record already exists in the database and decides whether or not to back the file up.
Coming to my question: if a file is deleted on the local computer, how do I keep track of it and update the offsite backup accordingly? When I do a restore, I don't want to restore all the garbage files. Is there any way of knowing whether files have been deleted from a folder? I do not want to run a verify check against the database, since it would take too long.

inotify with IN_DELETE?
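On Linux, a minimal sketch of that idea might look like the following; the watched path and what is done with each event are placeholders, and IN_MOVED_FROM is added here so files moved out of the folder are also treated as deletions:

// Minimal inotify sketch (Linux): report files deleted from one directory.
// The watched path and the handling of each event are illustrative only.
#include <sys/inotify.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = inotify_init();
    if (fd < 0) { std::perror("inotify_init"); return 1; }

    // Watch for deletions inside the folder; IN_MOVED_FROM also catches
    // files moved out of the folder, which look like deletions to a backup.
    int wd = inotify_add_watch(fd, "/path/to/backup/folder",
                               IN_DELETE | IN_MOVED_FROM);
    if (wd < 0) { std::perror("inotify_add_watch"); return 1; }

    char buf[4096];
    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));   // blocks until events arrive
        if (len <= 0) break;
        for (char* p = buf; p < buf + len; ) {
            inotify_event* ev = reinterpret_cast<inotify_event*>(p);
            if (ev->len > 0)
                std::printf("deleted: %s\n", ev->name);  // record this in the backup DB
            p += sizeof(inotify_event) + ev->len;
        }
    }
    close(fd);
    return 0;
}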

Create a Service to monitor the directory (Use FindFirstChangeNotification or ReadDirectoryChangesW)
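For a Windows service, a rough synchronous sketch of the ReadDirectoryChangesW approach could look like this; the path is a placeholder, and a real service would run the loop on its own thread and record the removals in the backup database:

// Rough synchronous ReadDirectoryChangesW sketch (Windows): log removed files.
// The watched path is a placeholder; error handling is minimal.
#include <windows.h>
#include <cwchar>

int main() {
    HANDLE dir = CreateFileW(L"C:\\path\\to\\backup\\folder",
                             FILE_LIST_DIRECTORY,
                             FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                             nullptr, OPEN_EXISTING,
                             FILE_FLAG_BACKUP_SEMANTICS, nullptr);
    if (dir == INVALID_HANDLE_VALUE) return 1;

    alignas(DWORD) char buf[64 * 1024];
    DWORD bytes = 0;
    while (ReadDirectoryChangesW(dir, buf, sizeof(buf), TRUE,   // TRUE = watch subtree
                                 FILE_NOTIFY_CHANGE_FILE_NAME,  // creates/deletes/renames
                                 &bytes, nullptr, nullptr)) {
        auto* info = reinterpret_cast<FILE_NOTIFY_INFORMATION*>(buf);
        for (;;) {
            if (info->Action == FILE_ACTION_REMOVED ||
                info->Action == FILE_ACTION_RENAMED_OLD_NAME) {
                // FileName is not null-terminated; FileNameLength is in bytes.
                std::wprintf(L"removed: %.*ls\n",
                             static_cast<int>(info->FileNameLength / sizeof(WCHAR)),
                             info->FileName);
            }
            if (info->NextEntryOffset == 0) break;
            info = reinterpret_cast<FILE_NOTIFY_INFORMATION*>(
                reinterpret_cast<char*>(info) + info->NextEntryOffset);
        }
    }
    CloseHandle(dir);
    return 0;
}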

You could add a new piece of information to your database which lists which files existed during the last backup. Then, even if a file had not changed, a new (small) entry would be made during the backup, indicating that it still existed.
When restoring a backup from a given date in the past, only select the files which have entries indicating that they existed during the most recent backup before that date.
For example, a pair of tables like this might work:
Path (text)       BackupIndex (int)
path/to/file1     1
path/to/file2     1
path/to/file1     2
Notice that path/to/file2 does not appear in backup #2, as it was not in the directory during the backup (it must have been deleted).
BackupIndex (int)     Timestamp (timestamp)
1                     2011-03-12 7:42:31 UTC
2                     2011-03-20 8:21:56 UTC
If somebody wants to restore the files as they existed on March 15th, you look at the table of backup indices, see that backup #1 was the most recent one before that date, and look up all the paths that existed in backup #1 from the paths table.
So basically, you are pushing the decision about whether a file was deleted off onto the restore operation, rather than the backup operation.
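For illustration, here is a sketch of what those two tables and the restore lookup could look like through the SQLite3 C API the backup tool already uses; the table and column names (backups, backup_files) are made up for this example:

// Sketch: per-backup file inventory in SQLite, queried at restore time.
// Table and column names (backups, backup_files) are illustrative only.
#include <sqlite3.h>
#include <cstdio>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("backup.db", &db) != SQLITE_OK) return 1;

    const char* schema =
        "CREATE TABLE IF NOT EXISTS backups("
        "  backup_index INTEGER PRIMARY KEY, started_at TEXT);"
        "CREATE TABLE IF NOT EXISTS backup_files("
        "  path TEXT NOT NULL, backup_index INTEGER NOT NULL);";
    sqlite3_exec(db, schema, nullptr, nullptr, nullptr);

    // At restore time: find the latest backup taken on or before the requested
    // date, then list every path recorded as existing in that backup.
    const char* restore_sql =
        "SELECT path FROM backup_files WHERE backup_index ="
        " (SELECT MAX(backup_index) FROM backups WHERE started_at <= ?);";
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, restore_sql, -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "2011-03-15 00:00:00", -1, SQLITE_STATIC);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        std::printf("restore: %s\n",
                    reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}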

Related

S3 Folder Containing Redshift Spectrum Table Deleted Randomly

I have an external table in Redshift. When I use UNLOAD to fill this table, sometimes the S3 folder that contains the data gets deleted, seemingly at random (I couldn't figure out the reason).
Here's the script I use to fill the external table:
UNLOAD ('SELECT * FROM PUBLIC.TABLE_NAME T1 INNER JOIN EXTERNAL_SCHEMA.TABLE_NAME')
TO 's3://bucket-name/main_folder/folder_that_gets_deleted/'
IAM_ROLE 'arn:aws:iam::000000000000:role/my_role'
FORMAT AS PARQUET
CLEANPATH
PARALLEL OFF;
I'm not sure there's enough info in the question to uniquely identify a solution. What jumps out at me is this: the CLEANPATH parameter deletes the files under the path targeted by TO. If something fails elsewhere that causes the UNLOAD not to complete (e.g. if it's a big unload, competing workloads may be slowing things down), then perhaps the CLEANPATH deletion completes but no files are created to replace the deleted ones.
Perhaps, instead of CLEANPATH, try ALLOWOVERWRITE. This parameter means that existing files are overwritten with the output of the UNLOAD command, so if the UNLOAD fails, nothing gets deleted.

How does Snowflake internally perform updates?

As far as I know, the underlying (columnar-format) files are immutable. My question is: if the files are immutable, how are updates performed? Does Snowflake maintain different versions of the same row and return the latest version based on a key, or does it insert the data into new files behind the scenes and delete the old files? How does performance get affected in these scenarios (when querying current data) if Time Travel is set to 90 days, since Snowflake needs to maintain different versions of the same row? And since Snowflake doesn't enforce keys, how are the different versions even detected? Any insights (documents/videos) on the detailed internals are appreciated.
It's a complex question, but the basic ideas are as follows (quite a bit simplified):
- records are stored in immutable micro-partitions on S3
- a table is a list of micro-partitions
- when a record is modified:
  - its old micro-partition is marked as inactive (from that moment)
  - a new micro-partition is created, containing the modified record as well as the other records from that micro-partition
  - the new micro-partition is added to the table's list (marked as active from that moment)
- inactive micro-partitions are not deleted for some time, allowing Time Travel
So Snowflake doesn't need a record key, as each record is stored in only one micro-partition that is active at a given time.
The impact of updates on query performance is marginal; the only visible effect might be that the new files need to be fetched from S3 and cached on the warehouses.
For more info, I'd suggest going to Snowflake forums and asking there.

C++ and SQLite DELETE query doesn't actually delete the value from the database file

I've come across this issue with SQLite and C++ and I can't find any answer to it.
Everything is working fine with SQLite and C++ (all queries, all outputs, all functions), but I have one question I can't find a solution to.
I create a database MyTest.db
I create a table test with an id and a name as fields
I enter two rows: id=1 name=Name1 and id=2 name=Name2
I delete the second row (id=2)
The data inside the table now shows that I have only id=1 with name=Name1
When I open MyTest.db with notepad.exe, the values I deleted (id=2 name=Name2) are still inside the database file, even though they no longer appear in the query results for the table.
What I would like to ask anyone who knows about this is:
Is there some other step needed so that the value is also deleted from the database file, or is it a mistake in my use of SQLite's DELETE (which I doubt)?
It's as if the database file keeps collecting all the trash inside it without ever removing DELETED values from its tables...
Any help or suggestion is much appreciated.
If you use "PRAGMA secure_delete=ON;" then SQLite overwrites deleted content with zeros. See https://www.sqlite.org/pragma.html#pragma_secure_delete
Even with secure_delete=OFF, the deleted space will be reused (and overwritten) to store new content the next time you INSERT. SQLite does not leak disk space, nor is it necessary to VACUUM in order to reclaim space. But normally, deleted content is not overwritten as that uses extra CPU cycles and disk I/O.
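If you actually want deleted content scrubbed from the file, a small sketch of enabling that pragma from the C++ side (using the test table from the question) could look like this:

// Sketch: open the database and enable secure_delete so that deleted
// content is overwritten with zeros instead of lingering in free pages.
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("MyTest.db", &db) != SQLITE_OK) return 1;

    // With secure_delete ON, DELETE overwrites the freed pages' content.
    sqlite3_exec(db, "PRAGMA secure_delete=ON;", nullptr, nullptr, nullptr);
    sqlite3_exec(db, "DELETE FROM test WHERE id = 2;", nullptr, nullptr, nullptr);

    sqlite3_close(db);
    return 0;
}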
Basically, all databases only mark rows as active or inactive; they won't remove the actual data from the file immediately. That would be a huge waste of time and resources, since that part of the file can simply be reused.
Since your queries show that the row isn't active in results, is this in some way an issue? You can always run a VACUUM on the database if you want to reclaim the space, but I would just let the database engine handle everything by itself. It won't "keep collecting all the trash inside it", don't worry.
If you see that the file size is growing and the space is not reused, then you can issue vacuums from time to time.
You can also test this by just inserting other rows after deleting old ones. The engine should reuse those parts of the file at some point.

Sitecore media conversion tool eating storage space

I have a question regarding the media conversion tool for Sitecore.
With this module you can convert media items between a hard drive location and a Sitecore database, and vice versa. But each time I convert some items it keeps taking additional harddrive space.
So when I convert 3 GB to the hard drive, it adds an additional 3 GB (which seems logical: 6 GB total), but when I then convert the items back to blob format it adds another 3 GB (9 GB total) instead of overwriting the previous version in the database.
Is there a way to clean the previous blobs or something? Because now it is using too much hard drive space.
Thanks in advance.
Using "Clean Up Databases" should work, but if the size gets too large, as my client's blob table did, the clean up will fail due to either a SQL timeout or because SQL Server uses up all the available locks.
Another solution is to run a script to manually clean up the blobs table. We had this issue previously and Sitecore support was able to provide us with a script to do so:
DECLARE @UsableBlobs table(
ID uniqueidentifier
);

INSERT INTO @UsableBlobs
SELECT convert(uniqueidentifier,[Value]) as EmpID from [Fields]
where [Value] != ''
and (FieldId='{40E50ED9-BA07-4702-992E-A912738D32DC}' or FieldId='{DBBE7D99-1388-4357-BB34-AD71EDF18ED3}')

DELETE from [Blobs]
where [BlobId] not in (SELECT * from @UsableBlobs)
This basically looks for blobs that are still in use and stores their IDs in a temp table. It then compares the items in this table to the Blobs table and deletes the ones that aren't in the temp table.
In our case, even this was bombing out due to the SQL Server locks problem, so I updated the delete statement to be delete top (x) from [Blobs] where x is a number you feel is more appropriate. I started at 1000 and eventually went up to deleting 400,000 records at a time. (Yes, it was that large)
So try the built-in "Clean Up Databases" option first and failing that, try to run the script to manually clean the table.

Sqlite3/C++ executes DELETE statement without changing the db size

How is it possible? I have a simple C++ app that is using SQLite3 to INSERT/DELETE records.
I use a single database and a single table inside. When I choose to store some data in the db, it gets inserted and the size of my.db increases, naturally.
But there is a problem with DELETE: the size does not decrease. If I do:
sqlite3 my.db
sqlite> select count(*) from mytable;
then 0 is returned, which is expected, but if I do ls -l on the folder containing my.db, the size is the same.
Can anybody explain?
When you execute a DELETE query, SQLite does not actually delete the records and rearrange the data; that would take too much time. Instead, it just marks the deleted records and ignores them from then on.
If you actually want to reduce the data size, execute the VACUUM command. There is also an option for auto-vacuuming. See http://www.sqlite.org/lang_vacuum.html.
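For example, a minimal sketch of issuing VACUUM (or enabling auto-vacuum) from the same C++ code that does the INSERT/DELETE might be:

// Sketch: reclaim free pages so the file size of my.db actually shrinks.
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("my.db", &db) != SQLITE_OK) return 1;

    // Option 1: rebuild the file on demand after large deletes.
    sqlite3_exec(db, "VACUUM;", nullptr, nullptr, nullptr);

    // Option 2: enable full auto-vacuum instead; note this only takes effect
    // if set before any tables are created (or if followed by a VACUUM).
    // sqlite3_exec(db, "PRAGMA auto_vacuum = FULL;", nullptr, nullptr, nullptr);

    sqlite3_close(db);
    return 0;
}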
The scenario is listed in the SQLite Frequently Asked Questions:
(12) I deleted a lot of data but the database file did not get any smaller. Is this a bug?
No. When you delete information from an SQLite database, the unused disk space is added to an internal "free-list" and is reused the next time you insert data. The disk space is not lost. But neither is it returned to the operating system.
If you delete a lot of data and want to shrink the database file, run the VACUUM command. VACUUM will reconstruct the database from scratch. This will leave the database with an empty free-list and a file that is minimal in size. Note, however, that the VACUUM can take some time to run (around a half second per megabyte on the Linux box where SQLite is developed) and it can use up to twice as much temporary disk space as the original file while it is running.
As of SQLite version 3.1, an alternative to using the VACUUM command is auto-vacuum mode, enabled using the auto_vacuum pragma.
The documentation is your friend; please use it.
Also from the documentation:
When information is deleted in the database, and a btree page becomes empty, it isn't removed from the database file, but is instead marked as 'free' for future use. When a new page is needed, SQLite will use one of these free pages before increasing the database size. This results in database fragmentation, where the file size increases beyond the size required to store its data, and the data itself becomes disordered in the file.
Another side effect of a dynamic database is table fragmentation. The pages containing the data of an individual table can become spread over the database file, requiring longer for it to load. This can appreciably slow database speed because of file system behavior.
Compacting fixes both of these problems.
The easiest way to remove empty pages is to use the SQLite command VACUUM. This can be done from within SQLite library calls or the sqlite utility.