Crystal reading x bytes from file - crystal-lang

I have this code:
a = File.open("/dev/urandom")
b = a.read(1024)
a.close
puts b
I expected to get the first 1024 bytes from the /dev/urandom device/file; instead I got an error saying that read accepts only a Slice and not an Integer.
So I tried to do it like that:
b = a.read(("a" * 1000).to_slice)
But then I got back "1000" in the output.
What is the right way to read x bytes from a file in Crystal?

What you did is not ideal, but it actually worked. IO#read(Slice(UInt8)) returns the number of bytes actually read, because the file may be smaller than what you requested or the data may not be available for some other reason. In other words, it's a partial read. So you get 1000 in b because the slice you passed was filled with 1000 bytes. There is also IO#read_fully(Slice(UInt8)), which blocks until it has fulfilled as much of the request as possible, but even that can't guarantee the full count in every case.
A better approach looks like this:
File.open("/dev/urandom") do |io|
  buffer = Bytes.new(1000) # Bytes is an alias for Slice(UInt8)
  bytes_read = io.read(buffer)
  # We truncate the slice to what we actually got back.
  # /dev/urandom never blocks, so this isn't needed in this specific
  # case, but it's good practice in general.
  buffer = buffer[0, bytes_read]
  pp buffer
end
IO also provides various convenience functions for reading strings until specific tokens or up to a limit, in various encodings. Many types also implement the from_io interface, which allows you to easily read structured data.

Related

Efficient ways to filter unwanted data from a buffer in c++

Let's assume data is stored in a character buffer in the format below:
===========================================================================
|length| message1| length| message2| length| message3|...|length |messagen|
===========================================================================
Length indicates the size of the following message.
Suppose that in this buffer only message2 is unwanted and all the rest are relevant.
How can message2 be removed efficiently so that all the remaining data in the buffer can still be used?
I have come across an in-place algorithm where the messages are shifted within the buffer itself without any additional copy,
but even then there is the overhead of shifting (n-2) messages just because message2 is irrelevant.
Is there a better approach or solution to this in C++?
Let me add more details:
The requirement is to remove/filter the irrelevant data from the buffer and then pass it as input to another function for further processing.
The irrelevant data can appear at any position in the character buffer; message2 is just an example.
You don't say how many bits are in your length field, but assuming you can spare an extra bit to make the length values signed rather than unsigned, I'd be tempted to adopt a convention that says "if a length-header has a negative value, that indicates that its message-body is invalid and should be ignored".
Once you've adopted that convention, then flagging message 2 as invalid is just a matter of overwriting its length-header with the negation of its current value.
Of course, the code that reads the buffer later on also has to follow the convention, so if, for example, it sees a length-header with a value of -57, it should just skip ahead by 57 bytes without processing them.
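A minimal sketch of that convention, assuming 32-bit signed length headers stored directly in front of each message body (the field width and the function names here are just illustrative, not from the question):

#include <cstddef>
#include <cstdint>
#include <cstring>

// Flag a message as invalid by negating its length header in place.
// 'offset' is the position of that message's length field in the buffer.
void invalidate_message(char* buf, std::size_t offset) {
    int32_t len;
    std::memcpy(&len, buf + offset, sizeof(len));
    if (len > 0) {
        len = -len;
        std::memcpy(buf + offset, &len, sizeof(len));
    }
}

// Walk the buffer, skipping messages whose header is negative.
void for_each_valid_message(const char* buf, std::size_t size,
                            void (*handle)(const char* msg, int32_t len)) {
    std::size_t pos = 0;
    while (pos + sizeof(int32_t) <= size) {
        int32_t len;
        std::memcpy(&len, buf + pos, sizeof(len));
        pos += sizeof(int32_t);
        if (len < 0) {
            pos += static_cast<std::size_t>(-len);  // invalid: skip the body
        } else {
            handle(buf + pos, len);                 // valid: process the body
            pos += static_cast<std::size_t>(len);
        }
    }
}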

How to decompress less than original size with Lz4 library?

I'm using LZ4 library and when decompressing data with:
int LZ4_decompress_fast_continue (void* LZ4_streamDecode, const char* source, char* dest, int originalSize);
I need only first n bytes of the originally encoded N bytes, where n < N. So in order to improve the performance, it makes sense to decompress only a part of the original buffer.
I wonder if I can pass n instead of N to the originalSize argument of the function?
My initial test showed, that it's not possible (I got incorrectly decompressed data). Though maybe there is a way, for example if n is a multiple of some CHUNK_SIZE? All original N bytes were compressed with 1 call of a compress function.
LZ4_decompress_safe_continue() and LZ4_decompress_fast_continue() can only decode full blocks. They consider a partial block as an error, and report it as such. They also consider that if there is not enough room to decompress a full block, it's also an error.
The functionality you are looking for doesn't exist yet. But there is a close cousin that might help.
LZ4_decompress_safe_partial() can decode a part of a block.
Note that, in contrast with _continue() variants, it only works on independent blocks.
Note also that the compressed block must nonetheless be complete, and the output buffer must nonetheless have enough space to decode the entire block. So the only advantage provided by this function is speed: if you want only the first 10 bytes, it will stop as soon as it has generated enough bytes.
"as soon as" doesn't mean "exactly at 10". It could be much later, and in the worst case, it could be after decoding the entire block. That's because the internal decoding engine is still the same : it decodes entire sequences, and doesn't "break them" in the middle, for speed considerations.
If you need to extract fewer bytes than a full block in order to save some memory, I'm afraid there is no solution yet. Report it as a feature request upstream.
This seems to have been implemented in lz4 1.8.3.
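For the independent-block case, a minimal sketch of using LZ4_decompress_safe_partial(), assuming lz4 >= 1.8.3 (where the output buffer only needs to hold the requested prefix rather than the whole decoded block); the sample data is purely illustrative:

#include <lz4.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // Build some sample data and compress it as a single independent block.
    std::string original(4096, 'x');
    for (size_t i = 0; i < original.size(); ++i)
        original[i] = static_cast<char>('a' + i % 7);

    std::vector<char> compressed(LZ4_compressBound(static_cast<int>(original.size())));
    int compressedSize = LZ4_compress_default(original.data(), compressed.data(),
                                              static_cast<int>(original.size()),
                                              static_cast<int>(compressed.size()));
    if (compressedSize <= 0) return 1;

    // Decode only the first 100 bytes of the block.
    const int wanted = 100;
    std::vector<char> partial(wanted);
    int produced = LZ4_decompress_safe_partial(compressed.data(), partial.data(),
                                               compressedSize, // whole compressed block
                                               wanted,         // stop once this many bytes exist
                                               wanted);        // capacity of the output buffer
    if (produced < 0) return 1;

    std::printf("decoded %d of %zu original bytes\n", produced, original.size());
    return 0;
}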

C++ reading buffer size

Suppose that this file is 2 and 1/2 blocks long, with a block size of 1024.
aBlock = 1024;
char* buffer = new char[aBlock];
while (!myFile.eof()) {
    myFile.read(buffer, aBlock);
    // do more stuff
}
The third time it reads, it is only going to fill half of the buffer, leaving the other half with invalid data. Is there a way to know how many bytes it actually wrote into the buffer?
istream::gcount returns the number of bytes read by the previous read.
Your code is both overly complicated and error-prone.
Reading in a loop and checking only for eof is a logic error since this will result in an infinite loop if there is an error while reading (for whatever reason).
Instead, you need to check all fail states of the stream, which can be done by simply checking for the istream object itself.
Since this is already returned by the read function, you can (and, indeed, should) structure any reader loop like this:
while (myFile.read(buffer, aBlock))
    process(buffer, aBlock);
process(buffer, myFile.gcount());
This is at the same time shorter, doesn’t hide bugs and is more readable since the check-stream-state-in-loop is an established C++ idiom.
You could also look at istream::readsome, which actually returns the number of bytes read.
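For completeness, a minimal self-contained version of that reader loop; the file name and the process() helper are placeholders:

#include <fstream>
#include <iostream>

void process(const char* data, std::streamsize n) {
    // Placeholder: consume n bytes of data.
    std::cout << "got " << n << " bytes\n";
}

int main() {
    const std::streamsize aBlock = 1024;
    char buffer[aBlock];

    std::ifstream myFile("example.bin", std::ios::binary);
    while (myFile.read(buffer, aBlock))   // true only while a full block was read
        process(buffer, aBlock);
    process(buffer, myFile.gcount());     // the final, possibly partial block
    return 0;
}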

Adding control symbols to a byte stream

Problem: sometimes we have to interleave multiple streams into one,
in which case it's necessary to provide a way to identify block
boundaries within a stream. What kind of format would be good
for such a task?
(All processing has to be purely sequential and i/o operations are
blockwise and aligned.)
1. From the decoding side, the best way is to have length prefixes
for blocks. But at the encoding side, it requires either random access
to the output file (seek to the stream start and write the header), or
being able to cache whole streams, which is, in general, impossible.
2. Alternatively, we can add length headers (+ some flags) to
blocks of cacheable size. It's surely a solution, but handling is much more
complex than [1], especially at encoding (presuming that i/o operations
are done with aligned fixed-size blocks).
Well, one possible implementation is to write a 0 byte into the buffer,
then stream data until its filled. So prefix byte = 0 would mean that
there're bufsize-1 bytes of stream data next, and !=0 would mean that
there's less... in which case we would be able to insert another prefix
byte if end-of-stream is reached. This would only work with bufsize=32k
or so, because otherwise the block length would require 3+ bytes to store,
and there would be a problem handling the end-of-stream case
when there's only one byte of free space left in the buffer.
(One solution to that would be storing 2-byte prefixes to each buffer
and adding 3rd byte when necessary; another is to provide a 2-byte encoding
for some special block lengths like bufsize-2).
Either way it's not so good, because even 1 extra byte per 64k would accumulate
to a noticeable number with large files (1526 bytes per 100M). Also, hardcoding
the block size into the format is bad too.
3. Escape prefix. E.g. EC 4B A7 00 = EC 4B A7, EC 4B A7 01 = end-of-stream.
Now this is really easy to encode, but decoding is pretty painful - requires
a messy state machine even to extract single bytes.
But overall it adds least overhead, so it seem that we still need to find
a good implementation for buffered decoding.
4. Escape prefix with all same bytes (e.g. FF FF FF). Much easier to check,
but runs of the same byte in the stream would produce a huge overhead (like 25%),
and that's not unlikely with any byte value chosen for the escape code.
5. Escape postfix. Store the payload byte before the marker - then the decoder
just has to skip 1 byte before a masked marker, and 4 bytes for a control code.
So this basically introduces a fixed 4-byte delay for the decoder, while [3]
has a complex path where the marker bytes have to be returned one by one.
Still, with [3] the encoder is much simpler (it just has to write an extra 0
when the marker matches), and this doesn't really simplify the buffer processing.
Update: Actually I'm pretty sure that [3] or [5] would be the option I'd use;
I only listed the other options in the hope of getting more alternatives (for example, it
would be OK if the redundancy is 1 bit per block on average). So the main question
at the moment is how to parse the stream for [3]... the current state machine looks like this:
int saved_c;
int o_last, c_last;

int GetByte( FILE* f ) {
  int c;
Start:
  if( o_last>=10 ) {
    if( c_last>=(o_last-10) ) { c=saved_c; o_last=0; }
    else c=byte("\xEC\x4B\xA7"[c_last++]);
  } else {
    c = getc(f);
    if( o_last<3 ) {
      if( char(c)==("\xEC\x4B\xA7"[o_last]) ) { o_last++; goto Start; }
      else if( o_last>0 ) { saved_c=c; c_last=0; o_last+=10; goto Start; } // 11,12
      // else just return c
    } else {
      if( c>0 ) { c=-1-c, o_last=0; printf( "c=%i\n", c ); }
      else { saved_c=0xA7; c_last=0; o_last+=10-1; goto Start; } // 12
    }
  }
  return c;
}
and it's certainly ugly (and slow).
How about using blocks of fixed size, e.g. 1KB?
Each block would contain a byte (or 4 bytes) indicating which stream it is, followed by just the data (see the sketch after the lists below).
Benefits:
You don't have to pay attention to the data itself. Data cannot accidentally trigger any behaviour in your system (e.g. accidentally terminate the stream).
It does not require random file access when encoding. In particular, you don't store the length of the block, since it is fixed.
If data gets corrupted, the system can recover at the next block.
Drawbacks:
If you have to switch from stream to stream a lot, with each having only a few bytes of data, the blocks may be underutilised. Lots of bytes will be empty.
If the block size is too small (e.g. if you want to solve the above problem), you can get huge overhead from the header.
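A minimal sketch of this fixed-size-block scheme, assuming 1 KB blocks with a single stream-ID byte in front (the block size, ID width, and function names are illustrative only):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

const std::size_t BLOCK_SIZE = 1024;  // 1 stream-ID byte + 1023 payload bytes

// Write one block belonging to 'stream_id', zero-padding the unused tail.
bool write_block(std::FILE* out, std::uint8_t stream_id,
                 const char* data, std::size_t len) {
    if (len > BLOCK_SIZE - 1) return false;
    char block[BLOCK_SIZE] = {0};
    block[0] = static_cast<char>(stream_id);
    std::memcpy(block + 1, data, len);
    return std::fwrite(block, 1, BLOCK_SIZE, out) == BLOCK_SIZE;
}

// Read the next block; 'payload' must have room for BLOCK_SIZE - 1 bytes.
// Returns false on EOF or a short read.
bool read_block(std::FILE* in, std::uint8_t& stream_id, char* payload) {
    char block[BLOCK_SIZE];
    if (std::fread(block, 1, BLOCK_SIZE, in) != BLOCK_SIZE) return false;
    stream_id = static_cast<std::uint8_t>(block[0]);
    std::memcpy(payload, block + 1, BLOCK_SIZE - 1);
    return true;
}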
From a standpoint of simplicity for sequential read and write, I would take solution 1, and just use a short buffer, limited to 256 bytes. Then you have a one-byte length followed by data. If a single stream has more than 256 consecutive bytes, you simply write out another length header and the data.
If you have any further requirements though you might have to do something more elaborate. For instance random-access reads probably require a magic number that can't be duplicated in the real data (or that's always escaped when it appears in the real data).
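A minimal sketch of that one-byte-length framing, capping each chunk at 255 payload bytes (the function names are placeholders, and end-of-stream/stream-switch markers are left out):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Write 'len' bytes of one stream as a sequence of (1-byte length, payload) chunks.
bool write_chunked(std::FILE* out, const char* data, std::size_t len) {
    while (len > 0) {
        std::size_t n = std::min<std::size_t>(len, 255);
        if (std::fputc(static_cast<int>(n), out) == EOF) return false;
        if (std::fwrite(data, 1, n, out) != n) return false;
        data += n;
        len  -= n;
    }
    return true;
}

// Read one chunk into 'buf' (which needs at least 255 bytes of room).
// Returns the payload length, or -1 on end of input.
int read_chunk(std::FILE* in, char* buf) {
    int n = std::fgetc(in);
    if (n == EOF) return -1;
    if (std::fread(buf, 1, static_cast<std::size_t>(n), in)
            != static_cast<std::size_t>(n)) return -1;
    return n;
}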
So I ended up using the ugly parser from the OP in my actual project
( http://encode.ru/threads/1231-lzma-recompressor ).
But now it seems that the actual answer is to let the data type handlers
terminate their streams however they want, which means escape coding
in the case of uncompressed data; but when some kind of entropy coding is
used in the stream, it's usually possible to incorporate a more efficient EOF code.

Python and C++ Sockets converting packet data

First of all, to clarify my goal: there exist two programs written in C in our laboratory. I am working on a proxy server (bidirectional) for them (which will also manipulate the data), and I want to write that proxy server in Python. It is important to know that I know close to nothing about these two programs; I only know the definition file of the packets.
Now: assuming a packet definition in one of the C++ programs reads like this:
unsigned char Packet[0x32]; // Packet[Length]
int z=0;
Packet[0]=0x00; // Spare
Packet[1]=0x32; // Length
Packet[2]=0x01; // Source
Packet[3]=0x02; // Destination
Packet[4]=0x01; // ID
Packet[5]=0x00; // Spare
for(z=0;z<=24;z+=8)
{
    Packet[9-z/8]=((int)(720000+armcontrolpacket->dof0_rot*1000)/(int)pow((double)2,(double)z));
    Packet[13-z/8]=((int)(720000+armcontrolpacket->dof0_speed*1000)/(int)pow((double)2,(double)z));
    Packet[17-z/8]=((int)(720000+armcontrolpacket->dof1_rot*1000)/(int)pow((double)2,(double)z));
    Packet[21-z/8]=((int)(720000+armcontrolpacket->dof1_speed*1000)/(int)pow((double)2,(double)z));
    Packet[25-z/8]=((int)(720000+armcontrolpacket->dof2_rot*1000)/(int)pow((double)2,(double)z));
    Packet[29-z/8]=((int)(720000+armcontrolpacket->dof2_speed*1000)/(int)pow((double)2,(double)z));
    Packet[33-z/8]=((int)(720000+armcontrolpacket->dof3_rot*1000)/(int)pow((double)2,(double)z));
    Packet[37-z/8]=((int)(720000+armcontrolpacket->dof3_speed*1000)/(int)pow((double)2,(double)z));
    Packet[41-z/8]=((int)(720000+armcontrolpacket->dof4_rot*1000)/(int)pow((double)2,(double)z));
    Packet[45-z/8]=((int)(720000+armcontrolpacket->dof4_speed*1000)/(int)pow((double)2,(double)z));
    Packet[49-z/8]=((int)armcontrolpacket->timestamp/(int)pow(2.0,(double)z));
}
if(SendPacket(sock,(char*)&Packet,sizeof(Packet)))
    return 1;
return 0;
What would be the easiest way to receive that data, convert it into a readable python format, manipulate them and send them forward to the receiver?
You can receive the packet's 50 bytes with a .recv call on a properly connected socket (it might actually take more than one call in the unlikely event the TCP packet gets fragmented, so check the incoming length until you have exactly 50 bytes in hand;-).
After that, understanding that C code is the puzzling part. The assignments of ints (presumably 4 bytes each) to Packet[9], Packet[13], etc., give the impression that the intention is to set 4 bytes at a time within Packet, but that's not what happens: each assignment sets exactly one byte in the packet, from the lowest byte of the int that's the RHS of the assignment. But those bytes are the bytes of (int)(720000+armcontrolpacket->dof0_rot*1000) and so on...
So must those last 44 bytes of the packet be interpreted as 11 4-byte integers (signed? unsigned?) or 44 independent values? I'll guess the former, and do...:
import struct
f = '>x4Bx11i'
values = struct.unpack(f, packet)
The format f indicates: big-endian, 4 unsigned-byte values surrounded by two ignored "spare" bytes, then 11 4-byte signed integers. The tuple values ends up with 15 items: the four single bytes (50, 1, 2, 1 in your example), then the 11 signed integers. You can use the same format string to pack a modified version of the tuple back into a 50-byte packet to resend.
Since you explicitly place the length in the packet, it may be that different packets have different lengths (though that's incompatible with the fixed-length declaration in your C sample), in which case you need to be a bit more careful when receiving and unpacking it; however, such details depend on information you don't give, so I'll stop trying to guess;-).
Take a look at the struct module, specifically the pack and unpack functions. They work with format strings that allow you to specify what types you want to write or read and what endianness and alignment you want to use.