Adding specific record to binary file c++ - c++

Suppose I have a binary file and text file of all record workers.
The default total month hours are all set to 0.
How to I actually access to the particular month in the binary and change it to the desired value?
This is in text file format
ID Name J F M
1 Jane 0 0 0
2 Mark 0 0 0
3 Kelvin 0 0 0
to
ID Name J F M
1 Jane 0 0 25
2 Mark 0 0 30
3 Kelvin 0 0 40
The 25 is actually the amount of hours worked in march.

I think the first question here is what you mean by "binary". Are you showing the format of the file literally? In other words, at input, is the character going to be '0' or '\0'? When you're done, do you want the file to contain the two digits '3' and '0' or the single character '\25', '\30' or '\40'?
If you're dealing with a single character at a known offset in each record for input, and want to replace it by a single character for the result, things are pretty easy: seek to the right offset in the file, write a byte, seek to the next offset, and continue 'til you've updated all the records.
If the input file contains character strings, so when you update the value its length will (probably) change, then you're pretty much stuck with reading data in, modifying it in memory, and writing the new data back out (usually to a new file). This is pretty easy too, but can be slow if your file is large.
If you're doing this in a real program, I'd think twice about doing it on your own at all. I'd consider using something like SQLite to handle the data instead. This not only allows you to simplify your code, but also makes life quite a bit nicer for your clients. It uses a known/documented file format, so other tools can work with the data, do backups, etc. It supports transactions, logging, roll-backs, etc. In short, they get a robust solution instead of yet another fragile problem.

A file is a stream of bytes. You can access a file by using the c family of functions fopen fread fwrite. Or though c++ iostream operations. In each case you will need to find the record usually by knowing its position and then reading and writing that record. If the records are not of fixed size you will have to handle moving all subsequent records.

Related

Problems with format descriptors in Fortran

I'm learning Fortran and found some strange things when writing with a format (I'm using Fortran onlinegdb)
Program Hello
real, dimension(3,2):: array
array = 0
write(*, '(A,/, A,/, F5.2, F5.2)') &
"1","2",((array(i, j), i = 1,3), j = 1,2)
End Program Hello
I expected
1
2
0.00 0.00
0.00 0.00
0.00 0.00
I get
1
2
0.00 0.00
0.00 0.00
What's wrong?
Vladimir F is correct in saying that the format given does not suit the items that are provided for output: with format reversion after writing two real values, the control goes back to looking at the edit descriptor A but what corresponds to that isn't another character variable. This is not allowed.
However, the format suggested in an earlier revision of that other answer also does not give the output that you expect. If you want to write pairs of numbers on each line relying on an unlimited repeat specification, you'll need to explicitly put the file positioning into the format:
write(*, '(2(A,/),*(2F5.2,:,/))') "1", "2", transpose(array)
Without the / edit in there at the end, the repeat will mean that all elements of the array go in the same record. We also have : there, so that we don't get an extra line break after the final array element.
(I've also transposed the output array, as that's probably what you really mean. The implied do loop in the original output is a little unexpected and makes more sense moving over the final index first.)
With a limited repeat specification, as shown in the corrected form of that answer, format reversion does imply positioning:
write(*, '(2(A,/), 2(F5.2))') "1","2", transpose(array)
After processing the 2(F5.2) reversion has this reused while there are still elements to write out.
In summary, if you are relying on format reversion to "skip" earlier parts of a format while keeping new records, you must correctly mark the part of the whole format to revert to using parentheses. With just the whole format surrounded by parentheses, and no others, format reversion reuses the whole format.
You requested two strings to be printed, each on a new line, then two floating point numbers. That happened correctly.
But then there were still remaining items in the array. The format interpretation started again from the beginning with a new line and again for two strings and two numbers. But the array did not contain any strings...
Try '(A,/, A,/, (F5.2, F5.2))' instead. That will repeat the two-times floating point number format group until there are still numbers to be processed but this time the format does not return to the very beginning. (Note: the old and untested revision of the answer featured an extra repeat count - I did not realize this will disable the format reversion.)

How to save a Huffman table in a file In a way that use the least storage?

It's my first question in stack overflow. it's long but I have explained it in detail and I think it's understandable.
I'm writing huffman code by c++ and saved characters and codes in a table like this:
Text: AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE
Table: (Made by huffman tree)
Table
Now, I want to save this table to a file in the best way.
I can't save like this: A1B001C010D001E000
When it change to bits: 01000001101000010001010000110100100010000101000101000
Because I can't decode this.
If I save table in normal way, every character use 8 bit for saving it's code.
While my characters have 1bit or 3bit code. (In this case.)
this way use much storage.
My idea is add a separator character and set a code for it.
If we add a separator character and make huffman tree and write codes, have a table like this.
table2
Now, we can write codes in this way.
A0SepB110SepC100SepD1111sepE1110sep.
binary= 0100000101010100001011010101000011100101010001001111101010001011110101
I decode it in this way:
sep = 101.
Read 8 bit : 01000001 -> it's A.
rest = 01010100001011010101000011100101010001001111101010001011110101.
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1), Read 1 bit : 0 (like sep2), Read 1 bit : 1 (like sep3(end))
Sep was found so A = everything was befor sep = 0;
rest = 0100001011010101000011100101010001001111101010001011110101.
Read 8 bit : 01000010 -> it's B.
rest = 11010101000011100101010001001111101010001011110101.
Read 1 bit : 1 (like sep1)- Read 1 bit : 1 (unlike sep2)
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1) - Read 1 bit : 0 (like sep2) - Read 1 bit :1 (like sep3(end))
Sep was found so B = everything was befor sep = 110;
And so on ...
This way still use a little storage for separator ( number of characters * separator size )
My question: Is there a way to save first table in a file and use less storage?
For example like this: A1B001C010D001E000.
Don't save the table with the codes. Just save the lengths. See Canonical Huffman Code.
You can store the lengths of the codes (as Mark said) as a 256 byte header at the start of your compressed data. Each byte stores the length of the code, and because you're working with bytes with 256 possible values, and the huffman tree can only be of a certain depth (number of possible values - 1) you only need 8 bits to store the codes.
The first byte would store the code length of the value 0x00, the second byte stores the code length of 0x01, and so on and so forth.
However, if compressing English text, there is a better way to store the table.
Store the shape of the tree, 0s for nodes and 1s for leaves. Then, after you store the nodes and the leaves, you store the values of the leaves.
The tree for AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE looks like this:
*
/ \
* A
/ \
* *
/ \ / \
E D C B
So you would store the shape of the tree as such:
000110111EDCBA
The reason why storing the huffman codes in this way is better for when you are compressing English text is that storing the shape of the tree costs 10n - 1 bits (where n is the number of unique characters in the data you are trying to compress) while storing the code lengths costs a flat 2048 bits. Therefore, for numbers of unique characters less than 205, storing the shape of the tree is more efficient, and because the average English string of text isn't going to have all that many of the possible 256 possible ASCII characters, you're usually better off storing the tree shape.
If you aren't just compressing text, and you're compressing more general data where there is a high likelihood that the number of unique characters could be greater than or equal to 205, you should probably use the code length storing format, or include 1 bit at the start of your header that says whether there's going to be a tree or a bunch of code lengths, and then write your decoder to decode either one depending on what that bit is set to.

Controlling newlines when writing out arrays in Fortran

So I have some code that does essentially this:
REAL, DIMENSION(31) :: month_data
INTEGER :: no_days
no_days = get_no_days()
month_data = [fill array with some values]
WRITE(1000,*) (month_data(d), d=1,no_days)
So I have an array with values for each month, in a loop I fill the array with a certain number of values based on how many days there are in that month, then write out the results into a file.
It took me quite some time to wrap my head around the whole 'write out an array in one go' aspect of WRITE, but this seems to work.
However this way, it writes out the numbers in the array like this (example for January, so 31 values):
0.00000 10.0000 20.0000 30.0000 40.0000 50.0000 60.0000
70.0000 80.0000 90.0000 100.000 110.000 120.000 130.000
140.000 150.000 160.000 170.000 180.000 190.000 200.000
210.000 220.000 230.000 240.000 250.000 260.000 270.000
280.000 290.000 300.000
So it prefixes a lot of spaces (presumably to make columns line up even when there are larger values in the array), and it wraps lines to make it not exceed a certain width (I think 128 chars? not sure).
I don't really mind the extra spaces (although they inflate my file sizes considerably, so it would be nice to fix that too...) but the breaking-up-lines screws up my other tooling. I've tried reading several Fortran manuals, but while some of the mention 'output formatting', I have yet to find one that mentions newlines or columns.
So, how do I control how arrays are written out when using the syntax above in Fortran?
(also, while we're at it, how do I control the nr of decimal digits? I know these are all integer values so I'd like to leave out any decimals all together, but I can't change the data type to INTEGER in my code because of reasons).
You probably want something similar to
WRITE(1000,'(31(F6.0,1X))') (month_data(d), d=1,no_days)
Explanation:
The use of * as the format specification is called list directed I/O: it is easy to code, but you are giving away all control over the format to the processor. In order to control the format you need to provide explicit formatting, via a label to a FORMAT statement or via a character variable.
Use the F edit descriptor for real variables in decimal form. Their syntax is Fw.d, where w is the width of the field and d is the number of decimal places, including the decimal sign. F6.0 therefore means a field of 6 characters of width with no decimal places.
Spaces can be added with the X control edit descriptor.
Repetitions of edit descriptors can be indicated with the number of repetitions before a symbol.
Groups can be created with (...), and they can be repeated if preceded by a number of repetitions.
No more items are printed beyond the last provided variable, even if the format specifies how to print more items than the ones actually provided - so you can ask for 31 repetitions even if for some months you will only print data for 30 or 28 days.
Besides,
New lines could be added with the / control edit descriptor; e.g., if you wanted to print the data with 10 values per row, you could do
WRITE(1000,'(4(10(F6.0,:,1X),/))') (month_data(d), d=1,no_days)
Note the : control edit descriptor in this second example: it indicates that, if there are no more items to print, nothing else should be printed - not even spaces corresponding to control edit descriptors such as X or /. While it could have been used in the previous example, it is more relevant here, in order to ensure that, if no_days is a multiple of 10, there isn't an empty line after the 3 rows of data.
If you want to completely remove the decimal symbol, you would need to rather print the nearest integers using the nint intrinsic and the Iw (integer) descriptor:
WRITE(1000,'(31(I6,1X))') (nint(month_data(d)), d=1,no_days)

Meaning of 3F7.1 in Fortran data format

I am trying to create an MDM file using HLM 7 Student version, but since I don't have access to SPSS I am trying to import my data using ASCII input. As part of this process I am required to input the data format Fortran style. Try as I might I have not been able to understand this step. Could someone familiar with Fortran (or even better HLM itself) explain to me how this works? Here is my current understanding
From the example EG3.DAT they give
(A4,1X,3F7.1)
I think
A4 signifies that the ID is 4 characters long.
1X means skip a space.
F.1 means that it should read 1 decimal places.
I am very confused about what 3F7 might mean.
EG3.DAT
2020 380.0 40.3 12.5
2040 502.0 83.1 18.6
2180 777.0 96.6 44.4
Below are examples from the help documents.
Rules for format statement
Format statement example
EG1 data format
EG2 data format
EG3 data format
One similar question is Explaining Fortran Write Format. Unfortunately it does not explicitly treat the F descriptor.
3F7.1 means 3 floating point numbers, each printed over 7 characters, each with one decimal number behind the decimal point. Leading characters are blanks.
For reading you don't need the .1 info at all, just read a floating point number from those 7 characters.
You guessed the meaning of A4 (string of four characters) and 1X (one blank) correctly.
In Fortran, so-called data edit descriptors (which format the input or output of data) may have repeat specifications.
In the format (A4,1X,3F7.1) the data edit descriptors are A4 and F7.1. Only F7.1 has a repeat specification (the number before the F). This simply means that the format is as though the descriptor appeared repeated: like F7.1, F7.1, F7.1. With a repeat specification of 1, or not given, there is just the single appearance.
The format of the question, then, is like
(A4,1X,F7.1,F7.1,F7.1)
This format is one that is covered by the rules provided in one of the images of the question. In particular, the aspect of repeat specification is given in rule 2 with the corresponding example of rule 3.
Further, in Fortran proper, a repeat count specifier may also be * as special case: that's like an exceptionally large repeat count. *(F7.1) would be like F7.1, F7.1, F7.1, .... I see no indication that this is supported by HLM but if this is needed a very large repeat count may be given instead.
In 1X the 1 isn't a repeat specification but an integral, and necessary, part of the position edit descriptor.
Procedure for making MDM file from excel for HLM:
-Make sure ALL the characters in ALL the columns line up
Select a column, then right click and select Format Cells
Then click on 'Custom' and go to the 'Type' box and enter the number
of 0s you need to line everything up
-Remove all the tabs from the document and replace them with spaces.
Open the document in word and use find and replace
-To save the document as .dat
First save it as .txt
Then open it in Notepad and save it as .dat
To enter the data format (FORTRAN-Style)
The program wants to read the data file space by space, so you have to specify it perfectly so that it reads the whole set properly.
If something is off, even by a single space, then your descriptive stats will be wonky compared to if you check them in another program.
Enclose the code with brackets ()
Divide the entries with commas ,
-Need ID column for all levels
ID column needs to be sorted so that it is in order from smallest to
largest
Use A# with # being the number of characters in the ID
Use an X1 to
move from the ID to the next column
-Need to say how many characters are needed in each column
Use F
After F is the number of characters needed for that column -Use F# (#= number)
There need to be enough character spaces to provide one 'gap' space
between each column
There need to be enough to character spaces to allow for the decimal
As part of the F you need to specify the number of decimal places
You do this by adding a decimal point after the F number and then a
number to represent the spaces you need -F#.#
You can use a number in front of the F so as to 'repeat' it. Not
necessary though. -#F#.#
All in all, it should look something like this:
(A4,X1,F4.0,F5.1)
Helpful links:
https://books.google.de/books?id=VdmVtz6Wtc0C&pg=PA78&lpg=PA78&dq=data+format+fortran+style+hlm&source=bl&ots=kURJ6USN5e&sig=fdtsmTGSKFxn04wkxvRc2Vw1l5Q&hl=en&sa=X&ved=0ahUKEwi_yPurjYrYAhWIJuwKHa0uCuAQ6AEIPzAC#v=onepage&q&f=false
http://www.ssicentral.com/hlm/help6/error/Problems_creating_MDM_files.pdf
http://www.ssicentral.com/hlm/help7/faq/FAQ_Format_specifications_for_ASCII_data.pdf

Python - Set file EOF

Say, for some reason, I have a 1 TB file. In Python, if I wanted to add 10 bytes, I could just seek to the end, and write a 10 byte string. However, say I'd like to cut 10 bytes off the end of it. Obviously, it would take a ridiculous amount of time (and there may not even be HDD space) to copy this file without the excess 10 bytes, then delete the old one.
In c++ for Windows, there's a function, SetEndOfFile, that lets me change file size to something smaller without file rewriting.
Is there a similar function in python that will do this? I've researched and cannot find anything...
Wow, I guess I hadn't looked hard enough: truncate
f = open(fname)
f.seek(-10,2) # jump 10 bytes before end
f.truncate() # truncate it!!
f.close()