Python - Set file EOF - python-2.7

Say, for some reason, I have a 1 TB file. In Python, if I wanted to add 10 bytes, I could just seek to the end, and write a 10 byte string. However, say I'd like to cut 10 bytes off the end of it. Obviously, it would take a ridiculous amount of time (and there may not even be HDD space) to copy this file without the excess 10 bytes, then delete the old one.
In C++ on Windows, there's a function, SetEndOfFile, that lets me change the file size to something smaller without rewriting the file.
Is there a similar function in python that will do this? I've researched and cannot find anything...

Wow, I guess I hadn't looked hard enough: truncate
f = open(fname, 'rb+')  # must be opened for updating in binary mode, not plain read
f.seek(-10, 2)          # jump to 10 bytes before the end (whence=2 means "from EOF")
f.truncate()            # truncate at the current position
f.close()

Related

Regex on binary data in python 3.6 shows weird behaviour

I'm trying to write a small script that reads binary data from a file before processing it further, with e.g. regex for simplifying some steps.
During the regex step I'm seeing some weird behavior I just can't figure out. Code basically goes like this (heavily stripped down to just the relevant part):
import re
from binascii import hexlify

fh = open(filename, 'rb')
bd = fh.read(32)        # one 32-byte chunk of binary data
xlen = bd[3]            # byte that specifies the length of a command - may vary for each 32-byte chunk
bd_x = bd[4:4 + xlen]   # pick out the interesting part of the data; for the failing chunks len(bd_x) is always 7
if re.match(b'\x00((.*?){%d})\x30' % (xlen - 2), bd_x):
    # update some other lists, etc.
    pass
I just need to check that the start of the interesting data is \x00 and the end is \x30, with 5 other bytes in between whose values are irrelevant. The total length I'm trying to match, including the start and end bytes, is thus 7, as mentioned.
In a sample file I have with random data, this works on about 100 of 130 32-byte chunks, whereas it should match on all 130 and not just 100.
I did print out the content of bd_x for both cases, i.e. for chunks where it worked and chunks where it didn't. Output from print(xlen,hexlify(bd_x)) (n for negative, p for positive):
n 7 b'000000290a0030'
n 7 b'0000002b0a0030'
n 7 b'0000002d0a0030'
n 7 b'0000002f0a0030'
n 7 b'000000310a0030'
n 7 b'000000330a0030'
p 7 b'00000003000030'
p 7 b'00000005000030'
p 7 b'00000000000030'
p 7 b'00000000020030'
As far as I can see, all samples should have matched in the regex. If I change to re.search it matches on all 130 chunks, but tbh I don't know why re.match isn't working for all chunks of data as the start always matches with \x00 and the rest should match too.
I've manually checked all the entries of the test file that fail in a hex-editor, and I just can't see why it doesn't work on those entries.
I know I can probably just do hexlify(bd_x) and just operate on the output from that function instead, but for now I'm interested in figuring out why this doesn't work.
Suggestions / solutions are appreciated.
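For what it's worth, the failing samples above all contain a 0x0a (newline) byte, and in Python's re module the . metacharacter does not match a newline unless re.DOTALL is given; this also holds for bytes patterns. re.search can still succeed because it is free to start matching later in the string (e.g. at the \x00 right before the final \x30). A minimal sketch of the difference, reusing one of the failing samples from the output above:

import re

xlen = 7
bd_x = bytes.fromhex('000000290a0030')   # one of the "n" samples; contains a 0x0a byte

# Without re.DOTALL, '.' refuses to match the 0x0a byte, so re.match returns None.
print(re.match(b'\x00((.*?){%d})\x30' % (xlen - 2), bd_x))             # None
print(re.match(b'\x00((.*?){%d})\x30' % (xlen - 2), bd_x, re.DOTALL))  # match object

So adding re.DOTALL as the flags argument should make re.match behave the same on all 130 chunks.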

how to delete the last line in a text file with 100M lines without having to rewrite the whole file?

Suppose I have a really large text file, say 100 million lines or 1 GB, and I want to delete the last line. Is there any way to do this without having to rewrite 99,999,999 lines to a new file and delete the old one? Suppose the file is so large that the rewrite option is prohibitively expensive. What would you do to delete the last line then? Thank you.
You can open the file, read from the end backwards until you find the first line delimiter (normally LF or CR/LF, depending on platform), calculate the file offset at that point, and truncate the file to that file offset.
You should use a truncation function, but neither FILE* nor iostream supports it.
However, there are usually OS-specific functions at the lower level to truncate a file.
On Unix you can use ftruncate, but you'll need to find the offset where you want to truncate first (does each line have a fixed size?).
Be careful that, if you have opened a FILE* to find the offset, you need to keep it synchronized with the lower level. The simplest approach is to fclose the file, then reopen it with open and call ftruncate at the chosen offset.
Similar questions: https://stackoverflow.com/a/873653/2741329 and https://stackoverflow.com/a/15154682/2741329
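Since the top question in this thread is about Python, here is a minimal sketch of the same scan-backwards-and-truncate idea using Python's truncate. It assumes '\n' line endings and a trailing newline after the last line; error handling is omitted.

import os

def drop_last_line(fname):
    # Scan backwards from EOF for the delimiter that ends the second-to-last
    # line, then truncate just after it.
    with open(fname, 'rb+') as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        pos = end - 1                   # skip the newline terminating the last line
        while pos > 0:
            pos -= 1
            f.seek(pos)
            if f.read(1) == b'\n':
                break
        f.truncate(pos + 1 if pos > 0 else 0)   # keep everything up to and including that '\n'

Reading one byte at a time is slow for very long lines; a real implementation would step backwards in larger blocks, but the truncate call at the end is the part that avoids rewriting the file.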

Algorithm for writing limited number of lines to text file

I have a program where I need to write text lines to a log file very frequently. I would like to limit the number of lines in the log file to 1000. When I write lines to the file, it should append them normally. Once the file reaches 1000 lines, I'd like to get rid of the first line and then append the new one. Does anyone know if there is a way to do this without rewriting the entire file each time?
Generally it's a little bit better for a case like this to remove more than one line at a time from the beginning.
That is, if your limit is 1000 lines, and you hit 1000 lines, delete the first 300 or so, and then resume writing. That way, you're not performing the delete operation with every single line written thereafter, only every 300 times. If you need to persist 1000 lines, then instead keep up to 1300 and delete 300 when 1300 is reached.
Files have to stay aligned to the file system's cluster size, so, no, there's no direct way: you can append a line to a file, but you can't delete the first line without rewriting the file.
You can use 2 files by turns.
Or use some buffer in memory and flush it periodically.
I think you still have to scan the file to find out how many lines it currently contains. In that case, you can read it into some sort of buffer that you can easily add to and delete from.
Then you do your logging, and when you are done you can "re-write" the file from the buffer (or only the last 1000 lines).
Other alternatives are discussed above.
And yeah, try to avoid deleting line-by-line. Generally, it is a costly operation.
I've found some similar topics here and on CodeProject:
Small logger class;
Flexible logger class using standard streams in C++
http://www.codeproject.com/Articles/584794/Simple-logger-for-Cplusplus
Hope you find them useful :)
Any time you want to log, you can open the file, read your write index, jump to the position, and write the fixed-width log entry. When your index hits your upper threshold, simply set it back to 0.
There are a lot of warnings with this, though - first is that each proper log entry (assuming you close the file in between) will require an open, a read, a seek, a write, a seek, a write and a close - to find your index, go to it, write the new entry, then update your index. You also have the inherent issues of writing a fixed-size data element. Also, a human reader will depend on your content to know where the "beginning" of the file is. Most people expect "line 1" to be the first line.
I'm a much bigger advocate for simply having a few files and "rolling" them, so that each file on its own is coherent, but if you want just one file with a fixed number of lines, the circular buffer idea can work.
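A minimal sketch of that index-plus-fixed-width-records scheme, in Python for brevity; the record width, the 4-byte little-endian index stored at the front of the file, and the padding are all assumptions:

import os
import struct

RECORD_SIZE = 128   # assumed fixed width of one log entry, padded with spaces
MAX_RECORDS = 1000
HEADER_SIZE = 4     # a 4-byte write index kept at the front of the file

def append_log(path, line):
    # Open for update; create the file (with index 0) on first use.
    mode = 'r+b' if os.path.exists(path) else 'w+b'
    with open(path, mode) as f:
        header = f.read(HEADER_SIZE)
        index = struct.unpack('<I', header)[0] if len(header) == HEADER_SIZE else 0
        record = line.encode('utf-8')[:RECORD_SIZE].ljust(RECORD_SIZE, b' ')
        f.seek(HEADER_SIZE + index * RECORD_SIZE)
        f.write(record)                                         # overwrite the slot at the index
        f.seek(0)
        f.write(struct.pack('<I', (index + 1) % MAX_RECORDS))   # wrap the index when full

As noted above, a reader then has to use the stored index to work out where the logical "first" line is.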
When you only want to use one file, and the lines are not of constant length, there is no way to do this without rewriting the whole file.
Depending on how often you are appending to the file, I don't see any problem doing so. 1000 lines of approx. 100 chars are only about 100 KB, which is not too much. Additionally, you may add some hysteresis.
However:
If the line length is constant (or you hard-limit the line length to some constant), you could just overwrite the oldest line. But then you have to keep track of the log file positions of old/new lines.
I would use two files: The first one where you append lines. When the file gets full, rename it to a second one, and fill the first one from the beginning.
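A sketch of that two-file rotation in Python; the file names, the line limit, and counting existing lines once at startup are just placeholder choices:

import os

class RollingLog:
    # Two files used by turns: append to 'current' until it reaches max_lines,
    # then rename it to 'previous' and start a fresh 'current' file.
    def __init__(self, current='log.txt', previous='log.old.txt', max_lines=1000):
        self.current, self.previous, self.max_lines = current, previous, max_lines
        # count existing lines once at startup so restarts keep working
        self.count = sum(1 for _ in open(current)) if os.path.exists(current) else 0

    def write(self, line):
        with open(self.current, 'a') as f:
            f.write(line + '\n')
        self.count += 1
        if self.count >= self.max_lines:
            os.replace(self.current, self.previous)   # roll over (use os.rename on Python 2)
            self.count = 0

The trade-off is that at any moment you have between max_lines and 2 * max_lines lines on disk, split across two files, but no line is ever rewritten.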

Adding specific record to binary file c++

Suppose I have a binary file and a text file of all worker records.
The default total month hours are all set to 0.
How do I actually access a particular month in the binary file and change it to the desired value?
This is in text file format
ID Name J F M
1 Jane 0 0 0
2 Mark 0 0 0
3 Kelvin 0 0 0
to
ID Name J F M
1 Jane 0 0 25
2 Mark 0 0 30
3 Kelvin 0 0 40
The 25 is the number of hours worked in March.
I think the first question here is what you mean by "binary". Are you showing the format of the file literally? In other words, at input, is the character going to be '0' or '\0'? When you're done, do you want the file to contain the two digits '3' and '0' or the single character '\25', '\30' or '\40'?
If you're dealing with a single character at a known offset in each record for input, and want to replace it by a single character for the result, things are pretty easy: seek to the right offset in the file, write a byte, seek to the next offset, and continue 'til you've updated all the records.
If the input file contains character strings, so when you update the value its length will (probably) change, then you're pretty much stuck with reading data in, modifying it in memory, and writing the new data back out (usually to a new file). This is pretty easy too, but can be slow if your file is large.
If you're doing this in a real program, I'd think twice about doing it on your own at all. I'd consider using something like SQLite to handle the data instead. This not only allows you to simplify your code, but also makes life quite a bit nicer for your clients. It uses a known/documented file format, so other tools can work with the data, do backups, etc. It supports transactions, logging, roll-backs, etc. In short, they get a robust solution instead of yet another fragile problem.
A file is a stream of bytes. You can access a file by using the C family of functions (fopen, fread, fwrite) or through C++ iostream operations. In either case you will need to find the record, usually by knowing its position, and then read and write that record. If the records are not of fixed size, you will have to handle moving all subsequent records.
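If the records really are fixed-size, the seek-and-overwrite approach described above looks roughly like this. The sketch is in Python for brevity (C++ fstream's seekg/seekp and read/write work the same way), and the record layout shown is purely an assumption for illustration:

import struct

# Assumed fixed-size record layout: 4-byte ID, 20-byte name, 3 x 4-byte month totals
RECORD_FMT = '<i20s3i'
RECORD_SIZE = struct.calcsize(RECORD_FMT)   # 36 bytes per record

def set_month_hours(path, record_index, month_index, hours):
    # month_index: 0 = January, 1 = February, 2 = March
    offset = record_index * RECORD_SIZE + 4 + 20 + month_index * 4
    with open(path, 'rb+') as f:
        f.seek(offset)                      # jump straight to the field
        f.write(struct.pack('<i', hours))   # overwrite it in place

# e.g. set Jane's (first record) March hours to 25:
# set_month_hours('workers.dat', 0, 2, 25)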

Remove duplicates from two large text files using unordered_map

I am new to a lot of these C++ libraries, so please forgive me if my questions comes across as naive.
I have two large text files, about 160 MB each (about 700000 lines each). I need to remove from file2 all of the duplicate lines that appear in file1. To achieve this, I decided to use unordered_map with a 32 character string as my key. The 32 character string is the first 32 chars of each line (this is enough to uniquely identify the line).
Anyway, so I basically just go through the first file and push the 32 char substring of each line into the unordered_map. Then I go through the second file and check whether the line in file2 exists in my unordered_map. If it doesn't exist, then I write the full line to a new text file.
This works fine for the smaller files (40 MB each), but for the 160 MB files it takes very long to insert into the hashtable (before I even start looking at file2). At around 260,000 inserts it seems to have halted or is going very slowly. Is it possible that I have reached my memory limitations? If so, can anybody explain how to calculate this? If not, is there something else that I could be doing to make it faster? Maybe choosing a custom hash function, or specifying some parameters that would help optimize it?
My key/value pair in the hash table is (string, int), where the string is always 32 chars long and the int is a count I use to handle duplicates.
I am running a 64 bit Windows 7 OS w/ 12 GB RAM.
Any help would be greatly appreciated.. thanks guys!!
You don't need a map because you don't have any associative data. An unordered set will do the job. Also, I'd go with some memory efficient hash set implementation like Google's sparse_hash_set. It is very memory efficient and is able to store contents on disk.
Aside from that, you can work on smaller chunks of data. For example, split your files into 10 blocks, remove duplicates from each, then combine them until you reach a single block with no duplicates. You get the idea.
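To show the shape of the set-based approach, here is a short Python sketch (the question is about C++, where an unordered_set plays the same role as the set here; the 32-character key comes from the question, the function name is just a placeholder):

def remove_duplicates(file1, file2, out):
    seen = set()
    with open(file1, 'rb') as f1:
        for line in f1:
            seen.add(line[:32])            # first 32 chars uniquely identify a line
    with open(file2, 'rb') as f2, open(out, 'wb') as o:
        for line in f2:
            if line[:32] not in seen:      # keep only lines not already in file1
                o.write(line)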
I would not write a C++ program to do this, but use some existing utilities.
In Linux, Unix and Cygwin, perform the following:
cat the two files into 1 large file:
# cat file1 file2 > file3
Use sort -u to extract the unique lines:
# sort -u file3 > file4
Prefer to use operating system utilities rather than (re)writing your own.