I'm trying to write a small script that reads binary data from a file before processing it further, with e.g. regex for simplifying some steps.
During the regex step I'm seeing some weird behavior I just can't figure out. The code basically goes like this (heavily stripped down to just the relevant part):
import re

fh = open(filename, 'rb')
bd = fh.read(32)  # binary data
xlen = bd[3]  # byte that specifies length of a command - this may vary for each 32h byte read
bd_x = bd[4:4+xlen]  # pick out interesting part of the data. for the data I see the weird behavior the length of bd_x will always be 7
if re.match(b'\x00((.*?){%d})\x30' % (xlen-2), bd_x):
    ...  # update some other lists, etc
Just need to check if start of interesting data is \x00 and if end is \x30 with 5 other elements in between, for which value is irrelevant. Total length, including start and end I'm trying to match is thus 7, as mentioned.
In a sample file I have with random data, this matches on about 100 of 130 32h byte chunks, whereas it should match on all 130.
I printed out the content of bd_x for both cases, i.e. for chunks where it worked and chunks where it didn't. Output from print(xlen,hexlify(bd_x)) (n for negative, p for positive):
n 7 b'000000290a0030'
n 7 b'0000002b0a0030'
n 7 b'0000002d0a0030'
n 7 b'0000002f0a0030'
n 7 b'000000310a0030'
n 7 b'000000330a0030'
p 7 b'00000003000030'
p 7 b'00000005000030'
p 7 b'00000000000030'
p 7 b'00000000020030'
As far as I can see, all samples should have matched in the regex. If I change to re.search it matches on all 130 chunks, but tbh I don't know why re.match isn't working for all chunks of data as the start always matches with \x00 and the rest should match too.
I've manually checked all the entries of the test file that fail in a hex-editor, and I just can't see why it doesn't work on those entries.
I know I can probably just do hexlify(bd_x) and just operate on the output from that function instead, but for now I'm interested in figuring out why this doesn't work.
Suggestions / solutions are appreciated.
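One likely explanation, going by the dumps above: every failing sample contains the byte 0a, which is a newline, and by default `.` does not match a newline. re.search still succeeds because it is free to start at the later \x00 and let all five `(.*?)` groups match nothing. Below is a minimal sketch that reproduces this with two of the samples from the question. The names PATTERN, SIMPLER and check are made up for the illustration, and the simpler pattern is only a suggestion, not something taken from the original code.

```python
import re

# Same pattern as in the question, with xlen - 2 == 5
PATTERN = b'\x00((.*?){%d})\x30' % 5

failing = bytes.fromhex('000000290a0030')  # contains 0x0a (newline)
passing = bytes.fromhex('00000003000030')  # no newline byte


def check(data, flags=0):
    """Return True if PATTERN matches at the start of data."""
    return re.match(PATTERN, data, flags) is not None


print(check(passing))                           # no newline, '.' can cover every byte
print(check(failing))                           # '.' refuses to cross 0x0a
print(check(failing, re.DOTALL))                # DOTALL lets '.' match any byte
print(re.search(PATTERN, failing) is not None)  # search may start at the last \x00

# A stricter alternative (an assumption about the intent): exactly five
# arbitrary bytes between \x00 and \x30, anchored at both ends.
SIMPLER = re.compile(b'\x00.{5}\x30', re.DOTALL)
print(SIMPLER.fullmatch(failing) is not None)
```

If the intent really is "starts with \x00, ends with \x30, total length 7", the anchored version avoids accidental matches such as the one re.search finds.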
So I have some code that does essentially this:
REAL, DIMENSION(31) :: month_data
INTEGER :: no_days
no_days = get_no_days()
month_data = [fill array with some values]
WRITE(1000,*) (month_data(d), d=1,no_days)
So I have an array with values for each month, in a loop I fill the array with a certain number of values based on how many days there are in that month, then write out the results into a file.
It took me quite some time to wrap my head around the whole 'write out an array in one go' aspect of WRITE, but this seems to work.
However this way, it writes out the numbers in the array like this (example for January, so 31 values):
0.00000 10.0000 20.0000 30.0000 40.0000 50.0000 60.0000
70.0000 80.0000 90.0000 100.000 110.000 120.000 130.000
140.000 150.000 160.000 170.000 180.000 190.000 200.000
210.000 220.000 230.000 240.000 250.000 260.000 270.000
280.000 290.000 300.000
So it prefixes a lot of spaces (presumably to make columns line up even when there are larger values in the array), and it wraps lines to make it not exceed a certain width (I think 128 chars? not sure).
I don't really mind the extra spaces (although they inflate my file sizes considerably, so it would be nice to fix that too...) but the breaking-up of lines screws up my other tooling. I've tried reading several Fortran manuals, but while some of them mention 'output formatting', I have yet to find one that mentions newlines or columns.
So, how do I control how arrays are written out when using the syntax above in Fortran?
(also, while we're at it, how do I control the number of decimal digits? I know these are all integer values so I'd like to leave out any decimals altogether, but I can't change the data type to INTEGER in my code because of reasons).
You probably want something similar to
WRITE(1000,'(31(F6.0,1X))') (month_data(d), d=1,no_days)
Explanation:
The use of * as the format specification is called list directed I/O: it is easy to code, but you are giving away all control over the format to the processor. In order to control the format you need to provide explicit formatting, via a label to a FORMAT statement or via a character variable.
Use the F edit descriptor for real variables in decimal form. Its syntax is Fw.d, where w is the total width of the field (including the sign and the decimal point) and d is the number of digits after the decimal point. F6.0 therefore means a field 6 characters wide with no digits after the decimal point; the decimal point itself is still printed.
Spaces can be added with the X control edit descriptor.
Repetitions of edit descriptors can be indicated with the number of repetitions before a symbol.
Groups can be created with (...), and they can be repeated if preceded by a number of repetitions.
No more items are printed beyond the last provided variable, even if the format specifies how to print more items than the ones actually provided - so you can ask for 31 repetitions even if for some months you will only print data for 30 or 28 days.
Besides,
New lines could be added with the / control edit descriptor; e.g., if you wanted to print the data with 10 values per row, you could do
WRITE(1000,'(4(10(F6.0,:,1X),/))') (month_data(d), d=1,no_days)
Note the : control edit descriptor in this second example: it indicates that, if there are no more items to print, nothing else should be printed - not even spaces corresponding to control edit descriptors such as X or /. While it could have been used in the previous example, it is more relevant here, in order to ensure that, if no_days is a multiple of 10, there isn't an empty line after the 3 rows of data.
If you want to remove the decimal point completely, you would instead need to print the nearest integers, using the nint intrinsic and the Iw (integer) edit descriptor:
WRITE(1000,'(31(I6,1X))') (nint(month_data(d)), d=1,no_days)
Say, for some reason, I have a 1 TB file. In Python, if I wanted to add 10 bytes, I could just seek to the end, and write a 10 byte string. However, say I'd like to cut 10 bytes off the end of it. Obviously, it would take a ridiculous amount of time (and there may not even be HDD space) to copy this file without the excess 10 bytes, then delete the old one.
In C++ for Windows, there's a function, SetEndOfFile, that lets me change the file size to something smaller without rewriting the file.
Is there a similar function in Python that will do this? I've researched and cannot find anything...
Wow, I guess I hadn't looked hard enough: truncate
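f = open(fname, 'r+b') # read/write binary mode, so end-relative seek and truncate are allowed
f.seek(-10,2) # jump 10 bytes before end (whence=2 means relative to end of file)
f.truncate() # truncate it at the current position!!
f.close()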
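For reuse, the same idea can be wrapped in a small helper. This is only a sketch: the name chop_tail is made up here, and it assumes the file is writable and that asking to remove more bytes than the file holds should simply leave an empty file.

```python
import os


def chop_tail(path, nbytes):
    """Remove the last nbytes bytes of the file at path, in place.

    Returns the new file size. No data is copied.
    """
    # 'r+b' opens for reading and writing without truncating on open
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)                # move to the end to learn the size
        new_size = max(f.tell() - nbytes, 0)  # never go below zero
        f.truncate(new_size)                  # shrink the file to new_size bytes
    return new_size
```

Calling chop_tail("big.dat", 10) on a hypothetical file big.dat would drop its last 10 bytes; how long that takes depends on the filesystem, not on copying the data.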
So first a sample of the actual data mangled (data is originally a mix of text and numbers, there's no significance to any of the data at this point and some of the patterns are just because I replaced most of the characters with 0s, 1s and Zs because the random number generator in my brain is broken):
011.0ZN1ZZ 001.F5ZS1Z 001.ZO5ZY0
014.5ZZZ1Z 001.1SZZOZ 001.ZLMZY0
016.01NM1SU54 001.EX0Z1Z 001.LIZZOZ
018.01NM1SS41 001.F83Z1Z 001.0011M1SU54
014.ZZ1YZZ 001.ZZZ1IZ 001.0011M1SS41
013.2EBSIZ 001.ZZZ11Z 001.0011SE4
01N.ZINSIZ 001.ZZZZ1Z P01.ZZZZ1Z
01N.01NSE4 001.LSZZHG N01.ZZZZ1Z
001.01ON5O 001.5Z21OL F01.ZZZZ1Z
001.NE5ZO1 001.ZOM05O D01.ZZZZ1Z
001.ZO5ZOZ 001.01NO1G Z01.ZZZZ1Z
001.ZO5ZOZ 001.01NO1G Z01.ZZZZ1Z
001.011ZOZ 001.01NZ0Y
Some additional comments.. I can clean up whitespace and deal with record length with no issues, so I'd like to simplify the question to this, I'm just including the above in case there's a solution to the simplified version that can't be easily extended to a more complex version.
1 7 13
2 8 14
3 9 15
4 10 16
5 11 17
6 12 18
19 25
20 26
21 27
22 28
23 29
24
So there will be a variable number of pages, but the same number of columns and rows on each page (in case it matters significantly, it's actually 12x3 instead of 6x3, but I wanted to keep it simple if possible), although the last page may have some empty rows/columns.
I'm using Notepad++ but I have access to various GNU utilities, so if there's a solution that's way, way better than a regular expression I don't mind, although since I'll be using this a lot and use Notepad++ a lot I'd appreciate a regex solution if it isn't too insane.
If you've got Git installed on your Windows machine, you may use the Perl bundled with it from Git bash. Provided your input file is named data, try the following command (caution: it will overwrite the input file):
echo >>data ; \
perl -i.bak -lane'
    $i=0;
    push @{$c[$i++]}, $_ foreach @F;
    if (/^\s*$/) {
        push @l, @{$_} foreach @c;
        print "@l\015";
        @l=@c=();
    }' data
The Perl command treats each line of input as space-delimited fields and accumulates the fields in the @c matrix. When it encounters an empty line (if (/^\s*$/) ...), it prints the matrix columns concatenated in a list.
The input file is changed in-place. A backup copy data.bak is created.
The input file may not end with an empty line so I add one with echo >>data. This makes the Perl script shorter and easier.
Another trick is the trailing \015 in print "@l\015";. This allows us to get Windows CRLF line endings in the Unix-flavoured Git bash environment.
A demo can be found here: https://ideone.com/vnYoOd. But since Ideone forbids file read/write, the original command has been modified to make the code run there.
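For anyone who prefers Python to Perl, the same column-accumulation idea can be sketched as below. This is a restatement, not a translation of the command above: the function name flatten_pages is made up, and it assumes, like the Perl version, that pages are separated by blank lines and that missing cells only occur at the end of a row.

```python
def flatten_pages(text):
    """Return one output line per page, reading each page column by column."""
    out = []
    columns = []  # columns[i] collects the i-th field of every row on the page
    # The extra "" acts as a final blank line so the last page gets flushed
    for line in text.splitlines() + [""]:
        fields = line.split()
        if fields:
            for i, field in enumerate(fields):
                while len(columns) <= i:
                    columns.append([])
                columns[i].append(field)
        elif columns:
            # Blank line: emit column 0 top to bottom, then column 1, and so on
            out.append(" ".join(f for col in columns for f in col))
            columns = []
    return out


sample = "1 7 13\n2 8 14\n3 9 15\n\n19 25\n20 26\n21\n"
for row in flatten_pages(sample):
    print(row)
```

Unlike the Perl command, this sketch does not touch any file; reading the input and writing the result back is left to the caller.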
Suppose I have a binary file and text file of all record workers.
The default total month hours are all set to 0.
How do I actually access a particular month in the binary file and change it to the desired value?
This is in text file format
ID Name J F M
1 Jane 0 0 0
2 Mark 0 0 0
3 Kelvin 0 0 0
to
ID Name J F M
1 Jane 0 0 25
2 Mark 0 0 30
3 Kelvin 0 0 40
The 25 is actually the amount of hours worked in march.
I think the first question here is what you mean by "binary". Are you showing the format of the file literally? In other words, at input, is the character going to be '0' or '\0'? When you're done, do you want the file to contain the two digits '3' and '0' or the single character '\25', '\30' or '\40'?
If you're dealing with a single character at a known offset in each record for input, and want to replace it by a single character for the result, things are pretty easy: seek to the right offset in the file, write a byte, seek to the next offset, and continue 'til you've updated all the records.
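As a rough illustration of the seek-and-overwrite approach, here is a sketch in Python. The record layout is entirely hypothetical (a 4-byte id, a 16-byte name, then twelve one-byte monthly hour counts), as are the names RECORD, HOURS_OFFSET and set_hours; the real file's layout has to be known before anything like this can work.

```python
import struct

# Hypothetical fixed-size record: int32 id, 16-byte name, 12 one-byte hour counts
RECORD = struct.Struct("<i16s12B")
# Offset of the first monthly byte inside one record (id + name)
HOURS_OFFSET = struct.calcsize("<i16s")


def set_hours(path, record_index, month_index, hours):
    """Overwrite one month's hours (0-255) for one record, in place."""
    offset = record_index * RECORD.size + HOURS_OFFSET + month_index
    with open(path, "r+b") as f:  # read/write without truncating the file
        f.seek(offset)            # jump straight to the byte to change
        f.write(bytes([hours]))   # replace exactly one byte
```

With this layout, set_hours("workers.bin", 0, 2, 25) would set March for the first record to 25, where workers.bin is a placeholder file name. Because every record has the same size, nothing else in the file moves.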
If the input file contains character strings, so when you update the value its length will (probably) change, then you're pretty much stuck with reading data in, modifying it in memory, and writing the new data back out (usually to a new file). This is pretty easy too, but can be slow if your file is large.
If you're doing this in a real program, I'd think twice about doing it on your own at all. I'd consider using something like SQLite to handle the data instead. This not only allows you to simplify your code, but also makes life quite a bit nicer for your clients. It uses a known/documented file format, so other tools can work with the data, do backups, etc. It supports transactions, logging, roll-backs, etc. In short, they get a robust solution instead of yet another fragile problem.
A file is a stream of bytes. You can access a file by using the C family of functions fopen, fread and fwrite, or through C++ iostream operations. In each case you will need to find the record, usually by knowing its position, and then read and write that record. If the records are not of fixed size you will have to handle moving all subsequent records.
Imagine two characters n and 1, where I need to insert a new character between them. We just need to use an input command (ended with Esc) such as i (insert before cursor). This command leaves vi in input mode until you press Esc.
Now let's say there is a range of such two-character pairs:
n and 1
n and 2
n and 3
n and 4
n and 5
n and 6
n and 8
n and 9
...so on.
e.g. "ginBulk1" (added Bulk between n and 1)
Now I need to insert a UNIQUE character between these. So instead of manually going to each line one by one, pressing i, then inserting, can I just do it with a simple command in vi?
Try this:
:g/n and 1/s//n and x 1/g
If you do not understand this, then post a few lines of actual before and after data.
I'm not sure I 100% understand, but try a regex replace:
:%s/n\([0-9]\)/nBulk\1/g
Which will replace all instances of n followed by a number with nBulk followed by the same number. I notice you say UNIQUE in your question, so if by this you mean that the word to be inserted is different every time (so n1 -> nBulk1, n2 -> nCat2 for example), then you need to explain your question more clearly, like is there some sort of pattern in the replacements?