Are flatpak tmp/cache files supposed to be 5 times as large as the objects? - flatpak

I know flatpak uses a lot of disk space due to the sandbox design and I am okay with that. But I recently found that /var/lib/flatpak/repo/tmp on my Debian system is full of cache directories and much larger than the objects directory.
I associate tmp and cache with things that can be deleted. Is the repo/tmp directory supposed to be that large, or are these leftovers that can be deleted?

Related

How to copy Gulden blockchain data from one wallet to another

Which files should I copy from the blocks directory to copy all available blockchain data to another wallet?
Should I include:
The blk*.dat files?
The rev*.dat files?
The index directory and its contents?
Furthermore, would it be safe to symlink all but the latest .dat files instead of copying them, to save storage (assuming that the earlier .dat files never change)?
After copying/linking, should I run GuldenD with additional command line arguments like -rescan?
I'm working off https://github.com/Gulden/gulden-official/blob/master/doc/files.md as a reference of the data directory structure for Gulden.
Relevant parts below:
blocks/blk000??.dat: block data (custom, 128 MiB per file);
blocks/rev000??.dat: block undo data (custom);
blocks/index/*: block index (LevelDB);
chainstate/*: block chain state database (LevelDB);
The above four essentially contain the blockchain, so would be required:
blocks/blk*.dat - The actual blocks are written here
blocks/index/* - A fast index into the above files is stored here. Technically it is possible to find the blocks without this, but it becomes a slow process.
chainstate/* - The UTXO set (the list of all unspent transaction outputs) is stored here. It is needed for verifying blocks. It can be regenerated if lost, but that is an expensive process.
blocks/rev*.dat - This contains chainstate 'undo' information for the most recent blocks so that if the chain is reorganised the changes to the chainstate can be rolled back easily.
Under usual circumstances the older blocks/*.dat files would not be touched but there are possibly edge cases (pruning) where they might be, and future developments may want to touch them in order to free up space etc. - so I don't know if this is a safe assumption to rely on.
If the aim is to save space while hosting multiple wallets on one server a dedup filesystem is perhaps a safer way to do this without relying on a symlink.
If you are setting up a new GuldenD, then no rescan or additional options are necessary after copying the files into place. If it is a GuldenD with existing addresses which may have received funds in the past, then you would want to run a rescan.
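To make the list above concrete, here is a minimal sketch that copies the four items (blk*.dat, rev*.dat and index/ all live under blocks/, plus chainstate/) into a new data directory. The helper name copy_chain_data is made up for illustration, and the sketch assumes that both GuldenD instances are stopped while the copy runs, which the answer does not state explicitly.

```python
import shutil
from pathlib import Path

# Hypothetical helper; directory names are taken from the files.md excerpt above.
# Assumption: no GuldenD is running against either data directory during the copy.
def copy_chain_data(src_datadir, dst_datadir):
    """Copy blocks/ (blk*.dat, rev*.dat, index/) and chainstate/ to dst_datadir."""
    src, dst = Path(src_datadir), Path(dst_datadir)
    for name in ("blocks", "chainstate"):
        # dirs_exist_ok (Python 3.8+) allows the destination directory to exist already.
        shutil.copytree(src / name, dst / name, dirs_exist_ok=True)
    return dst
```

This makes real copies rather than symlinks, in line with the caution above about relying on older .dat files never changing.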

Is there a way to completely remove an inode when the Link count is 2?

Currently my data is organised in a volume which has a cache directory (where all the files are first created or transferred). After that there are suitable directories on the volume which in their subdirs, contain files hardlinked to files in the cache.
This is done so that the same inode (file) can be hardlinked multiple times in multiple directories.
Now, when trying to clean up the volume, I recursively go through the dirs (not the cache) and, based on certain criteria, unlink the files (which basically reduces the link count of the cache entry's inode by 1). Is there a way for me to delete the cache entry directly when I am deleting the last hardlink (that is, bringing the count down from 2 to 1)? This way I would not have to manually parse through the whole cache directory to clear any inodes from it which have a link count of just 1.
I have gone through the unlink/remove functions and could not find anything specific of use. Is there some purging algorithm that takes care of this internally? If so, I can try to implement that.
Any help on this would be highly appreciated. In anticipation of a prompt reply.
I saw this and a few other places which instruct you how to delete all hardlinks from the shell (use find -samefile and call remove on each file). You could call it via system, although that might be frowned on by some people.
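If shelling out is not wanted, the same search can be done in-process. The following is only a sketch of the idea behind find -samefile, with a made-up function name; like find, it still has to walk the whole tree, so it does not avoid the scan the question is trying to get rid of.

```python
import os

def find_hardlinks(root, target):
    """Yield every path under root that refers to the same inode as target."""
    wanted = os.stat(target)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # samestat compares device and inode number.
                if os.path.samestat(os.lstat(path), wanted):
                    yield path
            except FileNotFoundError:
                continue  # the file vanished while we were walking
```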
No, there isn't anything that does what you want out of the box.
It might be useful to do the deletion when unlinking the hardlink and noticing that the link count is 1, since at that point the inode should be in the page cache; this of course is dependent on knowing the name of the file in the cache directory.
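A minimal sketch of that suggestion, assuming the caller already knows the name of the matching cache entry (the function name and both parameters are hypothetical). Note that the check and the second unlink are not atomic, so another process could add a link in between.

```python
import os

def unlink_and_purge(link_path, cache_path):
    """Remove link_path; if only the cache name is left afterwards, remove it too.

    Returns True when the cache entry was deleted as well.
    """
    if not os.path.samefile(link_path, cache_path):
        raise ValueError("paths do not refer to the same inode")
    os.unlink(link_path)
    # After the unlink, a link count of 1 means the cache name is the last one.
    if os.stat(cache_path).st_nlink == 1:
        os.unlink(cache_path)
        return True
    return False
```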

How can I quickly enumerate directories on Win32?

I'm trying to speed up directory enumeration in C++, where I'm recursing into subdirectories. I currently have an app which spends 95% of its time in the FindFirstFile/FindNextFile APIs, and it takes several minutes to enumerate all the files on a given volume. I know it's possible to do this faster because there is an app that does: Everything. It enumerates my entire drive in seconds.
How might I accomplish something like this?
I realize this is an old post, but there is a project on SourceForge that does exactly what you are asking, and the source code is available.
You can find the project here: NTFS-Search
"Everything" builds an index in the background, so queries are against the index not the file system itself.
There are a few improvements to be made, at least over the straightforward algorithm:
First, breadth search over depth search. That is, enumerate and process all files in a single folder before recursing into the sub folders you found. This improves locality - usually a lot.
On Windows 7 / W2K8R2, you can use FindFirstFileEx with FindExInfoBasic, the main speedup being omitting the short file name on NTFS file systems where this is enabled.
Separate threads help if you enumerate different physical disks (not just drives). For the same disk it only helps if it's an SSD ("zero seek time"), or you spend significant time processing a file name (compared to the time spent on disk access).
[edit] Wikipedia actually has some comments:
Basically, they are skipping the file system abstraction layer and accessing NTFS directly. This way, they can batch calls and skip expensive services of the file system, such as checking ACLs.
A good starting point would be the NTFS Technical Reference on MSDN.
"Everything" accesses directory information at a lower level than the Win32 FindFirst/FindNext APIs.
I believe it reads and interprets the NTFS MFT structures directly, and that this is one of the main reasons for its performance. It's also why it requires admin privileges and why "Everything" only indexes local or removable NTFS volumes (not network drives, for example).
A couple of other utilities that do similar things are:
FindOnClick by 2Brightsparks
Search GT
A little reverse engineering with a debugger on these tools might give you some insight on the techniques they use.
Don't recurse immediately, save a list of directories you find and dive into them when finished. You want to do linear access to each directory, to take advantage of locality of reference and any caching the OS is doing.
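The "finish one directory before diving into the next" idea can be sketched as follows. The question is about C++ and Win32, so this Python version only illustrates the traversal order; the function name is made up, and it does not use FindExInfoBasic or any NTFS-level access.

```python
import os
from collections import deque

def enumerate_breadth_first(root):
    """Yield file paths, reading each directory completely before descending."""
    pending = deque([root])  # directories found but not yet read
    while pending:
        directory = pending.popleft()
        try:
            with os.scandir(directory) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        pending.append(entry.path)  # dive in later
                    else:
                        yield entry.path
        except OSError:
            continue  # unreadable directory; skip it
```

The same queue structure carries over to a C++ loop around FindFirstFileEx/FindNextFile: push subdirectories onto the queue instead of recursing at the point where they are found.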
If you're already doing the best you can to get the maximum speed from the API, the next step is to do low-level disk accesses and bypass Windows altogether. You might get some guidance from the NTFS drivers for Linux, or perhaps you can use one directly.
If you are doing this on NTFS, here's a lib for low level access: NTFSLib.
You can enumerate through all file records in $MFT, each representing a real file on disk. You can get all file attributes from the record, including $DATA.
This may be the fastest way to enumerate all files/directories on NTFS volumes, 200k~300k files per minute as I tested.

How does rsync behave for concurrent file access?

I'm using rsync to run backups of my machine twice a day, and the ten to fifteen minutes it spends searching my files for modifications, slowing everything down considerably, are starting to get on my nerves.
Now I'd like to use the inotify interface of my kernel (I'm running Linux) to write a small background app that collects notifications about modified files and adds their pathnames to a list which is then processed regularly by a call to rsync.
Now, because this process by definition always works on files I've just been - and might still be - working on, I'm wondering whether I'll get loads of corrupted / partially updated files in my backup as rsync copies the files while I'm writing to them.
I couldn't find anything in the manpage and have so far been unsuccessful in googling for the answer. I could go read the source, but that might take quite a while. Does anybody know how concurrent file access is handled inside rsync?
It's not handled at all: rsync opens the file, reads as much as it can and copies that over.
So it depends on how your applications handle this: do they rewrite the file in place (not creating a new one), or do they create a temp file and rename it when all data has been written (as they should)?
In the first case, there is little you can do: if two processes access the same data without any kind of synchronization, the result will be a mess. What you could do is defer the rsync for N minutes, assuming that the writing process will have finished by then. Reschedule the file if it changes again within this time limit.
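The "defer for N minutes, reschedule on another change" idea could look roughly like this. It is a sketch only: the class name, the injectable clock parameter and the way paths arrive (for example from an inotify watcher calling touch) are all assumptions, and nothing here talks to rsync or inotify itself.

```python
import time

class DeferredQueue:
    """Hold modified paths until they have been quiet for `delay` seconds."""

    def __init__(self, delay, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self.last_change = {}  # path -> time of the most recent modification

    def touch(self, path):
        # A repeated change overwrites the timestamp, which restarts the wait.
        self.last_change[path] = self.clock()

    def pop_ready(self):
        """Return and forget all paths that have not changed for `delay` seconds."""
        now = self.clock()
        ready = [p for p, t in self.last_change.items() if now - t >= self.delay]
        for p in ready:
            del self.last_change[p]
        return ready
```

The list returned by pop_ready is what would be handed to rsync on each pass. A file that is still being written after the delay can of course still be copied half-finished, so this narrows the window rather than closing it.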
In the second case, you must tell rsync to ignore temp files (*.tmp, *~, etc).
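If the list of changed paths is built by your own app, the temp files can also be dropped before the list ever reaches rsync. A small sketch, using the two patterns named above as the default (the function name is made up; rsync's own --exclude option is the alternative when rsync does the scanning):

```python
import os
from fnmatch import fnmatch

TEMP_PATTERNS = ("*.tmp", "*~")  # patterns from the answer; extend as needed

def drop_temp_files(paths, patterns=TEMP_PATTERNS):
    """Return paths whose file name matches none of the temp-file patterns."""
    return [
        p for p in paths
        if not any(fnmatch(os.path.basename(p), pat) for pat in patterns)
    ]
```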
It isn't handled in any way. If it is a problem, you can use e.g. LVM snapshots, and take the backup from the snapshot. That won't in itself guarantee that the files will be in a usable state, but it does guarantee that, as the name implies, it's a snapshot at a specific time.
Note that this doesn't have anything to do with whether you're letting rsync handle the change detection itself or if you use your own app. Your app, or rsync itself, just produces a list of files that have been changed, and then for each file, the rsync binary diff algorithm is run. The problem is if the file is changed while the rsync algorithm runs, not when producing the file list.

Quick file access in a directory with 500,000 files

I have a directory with 500,000 files in it. I would like to access them as quickly as possible. The algorithm requires me to repeatedly open and close them (I can't have 500,000 files open simultaneously).
How can I do that efficiently? I had originally thought that I could cache the inodes and open the files that way, but *nix doesn't provide a way to open files by inode (security or some such).
The other option is to just not worry about it and hope the FS does a good job of file lookup in a directory. If that is the best option, which filesystems would work best? Do certain filename patterns look up faster than others, e.g. 01234.txt vs foo.txt?
BTW this is all on Linux.
Assuming your file system is ext3, your directory is indexed with a hashed B-tree if dir_index is enabled. That's going to give you as much of a boost as anything you could code into your app.
If the directory is indexed, your file naming scheme shouldn't matter.
http://lonesysadmin.net/2007/08/17/use-dir_index-for-your-new-ext3-filesystems/
A couple of ideas:
a) If you can control the directory layout then put the files into subdirectories.
b) If you can't move the files around, then you might try different filesystems. I think XFS might be good for directories with lots of entries.
If you've got enough memory, you can use ulimit to increase the maximum number of files that your process can have open at one time. I have successfully done this with 100,000 files; 500,000 should work as well.
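The same limit that ulimit adjusts from the shell can be raised from inside a process. This sketch raises only the soft limit, and only up to the hard limit; going beyond the hard limit needs privileges or system configuration that is not shown here, and the function name is made up.

```python
import resource

def raise_open_file_limit(wanted):
    """Raise the soft RLIMIT_NOFILE towards `wanted`; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    # Never ask for more than the hard limit allows.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling raise_open_file_limit(500000) and comparing the return value with 500000 shows whether the hard limit on the machine is high enough for the approach described above.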
If that isn't an option for you, try to make sure that your dentry cache has enough room to store all the entries. The dentry cache is the filename -> inode mapping that the kernel uses to speed up file access based on filename. Accessing huge numbers of different files can effectively eliminate the benefit of the dentry cache as well as introduce an additional performance hit. A stock 2.6 kernel has a hash with up to 256 * (MB of RAM) entries in it at a time; if you have 2 GB of memory you should be okay for up to a little over 500,000 files.
Of course, make sure you perform the appropriate profiling to determine if this really causes a bottleneck.
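The arithmetic behind "a little over 500,000", taking the 256-entries-per-MB figure from the answer at face value (it is the answer's number, not something checked against kernel sources here):

```python
ENTRIES_PER_MB = 256  # figure quoted in the answer for a stock 2.6 kernel

def dentry_hash_entries(ram_mb):
    """Upper bound on dentry hash entries for a machine with ram_mb MB of RAM."""
    return ENTRIES_PER_MB * ram_mb

print(dentry_hash_entries(2048))  # 2 GB of RAM -> 524288
```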
The traditional way to do this is with hashed subdirectories. Assume your file names are all uniformly-distributed hashes, encoded in hexadecimal. You can then create 256 directories based on the first two characters of the file name (so, for instance, the file 012345678 would be named 01/2345678). You can use two or even more levels if one is not enough.
As long as the file names are uniformly distributed, this will keep the directory sizes manageable, and thus make any operations on them faster.
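A short sketch of the hashed-subdirectory scheme. The first function follows the example above exactly; the second is an assumption added for names that are not already uniformly distributed (such as foo.txt), where a SHA-1 prefix picks the directory and the original name is kept. Both function names are made up.

```python
import hashlib

def shard_path(name, levels=1, width=2):
    """Split the leading characters of an already-hashed name into directories."""
    parts = [name[i * width:(i + 1) * width] for i in range(levels)]
    parts.append(name[levels * width:])
    return "/".join(parts)

def shard_path_hashed(name, levels=1, width=2):
    """For arbitrary names: choose directories from a SHA-1 prefix, keep the name."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    prefix = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(prefix + [name])

print(shard_path("012345678"))            # 01/2345678
print(shard_path("012345678", levels=2))  # 01/23/45678
```

With one level and two hex characters, 500,000 files spread over 256 directories come to roughly 2,000 files per directory.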
Another question is how much data is in the files? Is an SQL back end an option?