Question about file transfer for socket programming

Is there a good way to transfer a file from, say, a client to a server?
Probably just images, but my professor was asking for any type of files.
I've looked around and am a little confused as to the general idea.
So if we have a large file, we can split that file into segments...? Then send each segment off to the server.
Should I also use a while loop to receive all the files / segments on the server side? Also, how will my server know if all the segments were received without previously knowing how many segments there are?
I was looking on the Cplusplus website and found that there is like a binary transfer of files...
Thanks for all the help =)

If you are using TCP:
You are right, there is no way to "know" how much data you will be receiving. This gives you a few options:
1) Before transmitting the image data, first send the number of bytes to expect. So the first 4 bytes on the wire might be the 4-byte integer 4096 (both sides must agree on its byte order; network byte order is the usual choice). The receiver reads those 4 bytes, knows that it is expecting 4096 bytes, and can malloc(4096) to hold the rest. Then the sender can send() the 4096 bytes of image data.
When you do this, be aware that you might have to recv() multiple times: a single call can return fewer bytes than you asked for. Check the return value of recv() each time (the number of bytes actually read, 0 if the peer closed the connection, negative on error) and keep reading until you have gotten everything.
2) If you are just sending one file, you could just have your receiver read it. And it can keep recv()ing from the socket until the server closes the connection. This is a bit harder - you will have to keep track of how much you have received, and then if your buffer is full, you will have to reallocate it. I don't recommend this method, but it would technically accomplish the task.
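Option 1 can be sketched roughly as follows. This is only an illustration under a few assumptions: the length prefix is a 4-byte big-endian integer, and `ReadFn`, `read_exact` and `read_message` are made-up names. The reader callback stands in for recv() so the loop can be shown without a live socket.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Stand-in for recv(): reads up to `len` bytes into `buf` and returns the
// number of bytes read, 0 on orderly close, or a negative value on error.
// With POSIX sockets this could wrap recv(fd, buf, len, 0).
using ReadFn = std::function<long(unsigned char*, std::size_t)>;

// Keep calling the reader until exactly `len` bytes have arrived.
bool read_exact(const ReadFn& read_some, unsigned char* buf, std::size_t len) {
    assert(buf != nullptr);
    std::size_t got = 0;
    while (got < len) {
        long n = read_some(buf + got, len - got);
        if (n <= 0) return false;  // peer closed early, or an error occurred
        got += static_cast<std::size_t>(n);
    }
    return true;
}

// Read one message: a 4-byte big-endian length, then that many payload bytes.
// `max_len` guards against allocating whatever a broken peer announces.
bool read_message(const ReadFn& read_some, std::size_t max_len,
                  std::vector<unsigned char>& out) {
    unsigned char hdr[4];
    if (!read_exact(read_some, hdr, sizeof hdr)) return false;
    std::uint32_t len = (std::uint32_t(hdr[0]) << 24) | (std::uint32_t(hdr[1]) << 16) |
                        (std::uint32_t(hdr[2]) << 8) | std::uint32_t(hdr[3]);
    if (len > max_len) return false;
    out.resize(len);
    return len == 0 || read_exact(read_some, out.data(), len);
}
```

The same loop also covers option 2 if you drop the header and simply read until the callback returns 0.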
If you are using UDP:
This means that you don't have reliable transfer. So packets might be dropped. They might also arrive out of order. So if you are going to use UDP, you must fragment your data into little segments. Both the sender and receiver must have agreement on how large a segment is (100 bytes? 1000 bytes?)
Not only that, but you must also transmit a sequence number with each packet, that is, label each packet #1, #2, etc. Your receiver must be able to tell whether any packets are missing (you receive packets 1, 2 and 4, and are thus missing #3) and to put them back in order (you receive 3, 2, then 1, but when you save them to the file they must be written in the correct order: 1, 2, then 3).
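A minimal sketch of the receiving side of such a scheme. The `Chunk` layout is an assumption made for illustration (sequence numbers start at 0 here, and every chunk also carries the total chunk count so the receiver knows when it is done); `Reassembler` is a hypothetical helper, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical contents of one datagram after its header has been parsed.
struct Chunk {
    std::uint32_t seq;    // 0-based position of this chunk in the file
    std::uint32_t total;  // how many chunks make up the whole file
    std::vector<unsigned char> payload;
};

class Reassembler {
public:
    // Store a chunk; duplicates and out-of-range sequence numbers are ignored.
    void accept(const Chunk& c) {
        if (c.seq >= c.total) return;
        total_ = c.total;
        chunks_.emplace(c.seq, c.payload);  // emplace keeps the first copy
    }
    bool complete() const { return total_ != 0 && chunks_.size() == total_; }

    // Sequence numbers not seen yet, e.g. to request retransmission.
    std::vector<std::uint32_t> missing() const {
        std::vector<std::uint32_t> out;
        for (std::uint32_t i = 0; i < total_; ++i)
            if (chunks_.count(i) == 0) out.push_back(i);
        return out;
    }

    // Concatenate the chunks in sequence order, whatever order they arrived in.
    std::vector<unsigned char> assemble() const {
        assert(complete());
        std::vector<unsigned char> file;
        for (const auto& kv : chunks_)  // std::map iterates in key order
            file.insert(file.end(), kv.second.begin(), kv.second.end());
        return file;
    }

private:
    std::uint32_t total_ = 0;
    std::map<std::uint32_t, std::vector<unsigned char>> chunks_;
};
```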
So for your assignment, well, it will depend on what protocol you have to/are allowed to use.

If you use a UDP-based transfer protocol, you will have to break the file up into chunks for network transmission. You'll also have to reassemble them in the correct order on the receiving end and verify the results. If you use a TCP-based transfer protocol, all of this will be taken care of under the hood.
You should consult Beej's Guide to Network Programming for how best to send and receive data and use sockets in general. It explains most of the things about which you are asking.

There are many ways of transferring files. If you're transferring files in a lossless manner, then you're basically going to divide the file into chunks, tag each chunk with a sequence number, send the chunks to the other side and reconstitute the file. Stream-oriented protocols are simpler since lost packets are retransmitted for you. If you're using an unreliable protocol, then you will need to retransmit missing packets and resequence chunks that arrive out of order.
If lossy transfer is acceptable (like transferring video or on-line game data), then use an unreliable protocol. Lossy transfer is simpler because you don't have to retransmit missing chunks. All you need to do is make sure the chunks are processed in the proper sequence.
Many protocols send a terminator packet to indicate the end of transmission. You could use this strategy if you don't want to send the number of chunks to the other side before transmission.

Related

What does the receiver do if the message is segmented by TCP?

I raised this question when reading the source code of muduo (C++ network library).
If a client sends a large message that gets segmented by TCP, what happens on the server side? (Does the server know the message has been segmented?)
And is it necessary for the network library to wait for the whole message and not interrupt the upper layer until then?

When dealing with a stream protocol like TCP, you already have to reassemble received data into chunks of your own choosing. That's either a fixed number of bytes per chunk, or it's decided dynamically by parsing the data in terms of your application's protocol (e.g. HTTP).
You don't know when you receive a packet from the network layer that it has been segmented: you only know that you received some data. You may know (because you understand your own protocol) that you're expecting more data to finish the chunk, but you won't know whether there is any more data until you receive it. If you do receive it.
Conversely, a single TCP packet may well contain more than a single chunk of your application-layer data! Again, you need to be aware that there is no direct relationship between the two things.
You can, however, depend on TCP delivering the bytes to your application in the same order in which they were sent, which is nice.
Simple analogy: a big ol' ship, carrying cargo. It may be carrying 40 cars, or it may be carrying just half the quantity of parts required to construct an airplane. Or it may be carrying both! You don't know until you read the shipping manifest and consult your own records on delivery. It's then your responsibility to unpack what you've received and do what you need to do with it.
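To make the "no direct relationship" point concrete, here is a small sketch of a decoder that accepts bytes in whatever pieces they happen to arrive and hands back complete application messages. It assumes a made-up protocol in which each message is preceded by a 4-byte big-endian length; `FrameDecoder` is a hypothetical name and has nothing to do with what muduo actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class FrameDecoder {
public:
    // Append whatever one read returned, however the stream was split.
    void feed(const unsigned char* data, std::size_t n) {
        assert(data != nullptr || n == 0);
        buf_.insert(buf_.end(), data, data + n);
    }

    // Extract the next complete message, if the buffer holds one.
    bool next(std::string& msg) {
        if (buf_.size() < 4) return false;  // length prefix not complete yet
        std::uint32_t len = (std::uint32_t(buf_[0]) << 24) | (std::uint32_t(buf_[1]) << 16) |
                            (std::uint32_t(buf_[2]) << 8) | std::uint32_t(buf_[3]);
        if (buf_.size() - 4 < len) return false;  // body not complete yet
        msg.assign(buf_.begin() + 4, buf_.begin() + 4 + len);
        buf_.erase(buf_.begin(), buf_.begin() + 4 + len);
        return true;
    }

private:
    std::vector<unsigned char> buf_;
};
```

One read may yield zero, one or several messages, so the caller would typically call next() in a loop after every feed().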
And is it necessary for the network library to wait for the whole message and not interrupt the upper layer until then?
If the library wants to pass a full "message" to the upper layer, then usually yes. Some approaches will just block waiting for a full message, but that's not common nowadays. Asynchronous I/O is your friend.
(This was a generic answer, written with no knowledge of what muduo does specifically.)

Efficiently send a stream of UDP packets

I know how to open a UDP socket in C++, and I also know how to send packets through that. When I send a packet I correctly receive it on the other end, and everything works fine.
EDIT: I also built a fully working acknowledgement system: packets are numbered, checksummed and acknowledged, so at any time I know how many of the packets that I sent, say, during the last second were actually received from the other endpoint. Now, the data I am sending will be readable only when ALL the packets are received, so that I really don't care about packet ordering: I just need them all to arrive, so that they could arrive in random sequences and it still would be ok since having them sequentially ordered would still be useless.
Now, I have to transfer a really big chunk of data (say 1 GB) and I need it to be transferred as fast as possible. So I split the data into, say, 512-byte chunks and send them through the UDP socket.
Now, since UDP is connectionless it obviously doesn't provide any speed or transfer efficiency diagnostics. So if I just try to send a ton of packets through my socket, my socket will just accept them, then they will be sent all at once, and my router will send the first couple and then start dropping them. So this is NOT the most efficient way to get this done.
What I did then was making a cycle:
Sleep for a while
Send a bunch of packets
Sleep again and so on
I tried to do some calibration and I achieved pretty good transfer rates, however I have a thread that is continuously sending packets in small bunches, but I have nothing but an experimental idea on what the interval should be and what the size of the bunch should be. In principle, I can imagine that sleeping for a really small amount of time, then sending just one packet at a time would be the best solution for the router, however it is completely unfeasible in terms of CPU performance (I probably would need to busy wait since the time between two consecutive packets would be really small).
So is there any other solution? Any widely accepted solution? I assume that my router has a buffer or something like that, so that it can accept SOME packets all at once, and then it needs some time to process them. How big is that buffer?
I am not an expert in this so any explanation would be great.
Please note, however, that for technical reasons there is no way at all I can use TCP.

As mentioned in some other comments, what you're describing is a flow control system. The wikipedia article has a good overview of various ways of doing this:
http://en.wikipedia.org/wiki/Flow_control_%28data%29
The solution that you have in place (sleeping for a hard-coded period between packet groups) will work in principle, but in order to get reasonable performance in a real-world system you need to be able to react to changes in the network. This means implementing some kind of feedback where you automatically adjust both the outgoing data rate and packet size in response to network characteristics, such as throughput and packet loss.
One simple way of doing this is to use the number of re-transmitted packets as an input into your flow control system. The basic idea would be that when you have a lot of re-transmitted packets, you would reduce the packet size, reduce the data rate, or both. If you have very few re-transmitted packets, you would increase packet size & data rate until you see an increase in re-transmitted packets.
That's something of a gross oversimplification, but I think you get the idea.
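As a rough illustration of that feedback loop, the sketch below adjusts only the data rate, using an additive-increase/multiplicative-decrease rule. That rule is borrowed from TCP congestion control and is my choice, not something the answer prescribes. The class name, the 2% loss threshold and the step size are arbitrary assumptions you would have to tune.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sender-side controller: call on_interval() once per
// measurement period with the packet counts observed in that period.
class RateController {
public:
    RateController(double start_bps, double min_bps, double max_bps)
        : rate_(start_bps), min_(min_bps), max_(max_bps) {
        assert(min_bps > 0 && min_bps <= max_bps);
    }

    void on_interval(unsigned sent, unsigned retransmitted) {
        const double loss_threshold = 0.02;  // assumed: 2% retransmits is "a lot"
        const double step_bps = 100000.0;    // assumed additive increase per period
        if (sent == 0) return;
        double loss = static_cast<double>(retransmitted) / sent;
        if (loss > loss_threshold)
            rate_ *= 0.5;       // many retransmits: back off quickly
        else
            rate_ += step_bps;  // few retransmits: probe for more bandwidth
        rate_ = std::max(min_, std::min(rate_, max_));
    }

    double rate() const { return rate_; }  // current target, in bits per second

private:
    double rate_, min_, max_;
};
```

The sending thread would then pace itself so that the bytes sent per period match rate(), instead of using a hard-coded sleep.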

Sending large chunks of data over Boost TCP?

I have to send mesh data via TCP from one computer to another... These meshes can be rather large. I'm having a tough time thinking about what the best way to send them over TCP will be as I don't know much about network programming.
Here is my basic class structure that I need to fit into buffers to be sent via TCP:
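#include <vector>
class Primitive; // forward declarations so the pointer vectors below compile
class Vertex;
class PrimitiveCollection
{
public:
    std::vector<Primitive*> primitives;
};
class Primitive
{
public:
    PRIMTYPES primType; // PRIMTYPES is just an enum (defined elsewhere) with values for fan, strip, etc...
    unsigned int numVertices;
    std::vector<Vertex*> vertices;
};
class Vertex
{
public:
    float X;
    float Y;
    float Z;
    float XNormal;
    float ZNormal;
};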
I'm using the Boost library and their TCP stuff... it is fairly easy to use. You can just fill a buffer and send it off via TCP.
However, of course this buffer can only be so big and I could have up to 2 megabytes of data to send.
So what would be the best way to get the above class structure into the buffers needed and sent over the network? I would need to deserialize on the receiving end also.
Any guidance in this would be much appreciated.
EDIT: I realize after reading this again that this really is a more general problem that is not specific to Boost... It's more of a problem of chunking the data and sending it. However I'm still interested to see if Boost has anything that can abstract this away somewhat.

Have you tried it with Boost's TCP? I don't see why 2 MB would be an issue to transfer. I'm assuming we're talking about a LAN running at 100 Mbps or 1 Gbps, a computer with plenty of RAM, and that you don't need response times better than 20 ms? If your goal is just to get all 2 MB from one computer to another, just send it; TCP will handle chunking it up for you.
I have a TCP latency checking tool that I wrote with Boost, that tries to send buffers of various sizes, I routinely check up to 20MB and those seem to get through without problems.
I guess what I'm trying to say is don't spend your time developing a solution unless you know you have a problem :-)
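One caveat when "just sending it": a single low-level write may accept only part of a large buffer, so something has to loop until everything is out. In Boost.Asio the free function boost::asio::write does that looping for you. With a plain socket the loop looks roughly like the sketch below, where `WriteFn` and `send_all` are illustrative names and the callback stands in for send().

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for send(): tries to write `len` bytes and returns how many were
// actually accepted, or a value <= 0 on failure.
using WriteFn = std::function<long(const unsigned char*, std::size_t)>;

// Keep writing until the whole buffer has been handed to the transport.
bool send_all(const WriteFn& write_some, const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::size_t sent = 0;
    while (sent < len) {
        long n = write_some(data + sent, len - sent);
        if (n <= 0) return false;  // connection closed or error
        sent += static_cast<std::size_t>(n);
    }
    return true;
}
```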
--------- Solution Implementation --------
Now that I've had a few minutes on my hands, I went through and made a quick implementation of what you were talking about: https://github.com/teeks99/data-chunker There are three big parts:
The serializer/deserializer: Boost has its own, but it's not much better than rolling your own, so I did.
Sender - Connects to the receiver over TCP and sends the data
Receiver - Waits for connections from the sender and unpacks the data it receives.
I've included the .exe(s) in the zip, run Sender.exe/Receiver.exe --help to see the options, or just look at main.
More detailed explanation:
Open two command prompts, and go to DataChunker\Debug in both of them.
Run Receiver.exe in one of them.
Run Sender.exe in the other one (possibly on a different computer, in which case add --remote-host=IP.ADD.RE.SS after the executable name; if you want to try sending more than once, add --num-sends=10 to send ten times).
Looking at the code, you can see what's going on: the receiver and sender ends of the TCP socket are created in the respective main() functions. The sender creates a new PrimitiveCollection and fills it in with some example data, then serializes and sends it... the receiver deserializes the data into a new PrimitiveCollection, at which point the primitive collection could be used by someone else, but I just wrote to the console that it was done.
Edit: Moved the example to github.

Without anything fancy, from what I remember in my network class:
Send a message to the receiver asking what size data chunks it can handle
Take a minimum of that and your own sending capabilities, then reply saying:
What size you'll be sending, how many you'll be sending
After you get that, just send each chunk. You'll want to wait for an "Ok" reply, so you know you're not wasting time sending to a client that's not there. This is also a good time for the client to send a "I'm canceling" message instead of "Ok".
Send until all packets have been replied with an "Ok"
The data is transferred.
This works because TCP guarantees in-order delivery. UDP would require packet numbers (for ordering).
Compression is the same, except you're sending compressed data. (Data is data, it all depends on how you interpret it). Just make sure you communicate how the data is compressed :)
As for examples, all I could dig up was this page and this old question. I think what you're doing would work well in tandem with Boost.Serialization.

I would like to add one more point to consider - setting TCP socket buffer size in order to increase socket performance to some extent.
There is a utility, Iperf, that lets you test the speed of exchange over a TCP socket. I ran a few tests on Windows in a 100 Mbps LAN. With the default 8 KB TCP window size the speed is 89 Mbits/sec, and with a 64 KB TCP window size the speed is 94 Mbits/sec.
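For reference, the socket buffer sizes are set with the SO_SNDBUF and SO_RCVBUF socket options. The sketch below uses the POSIX flavour of the calls (Winsock offers the same options, with slightly different argument types); the helper names are made up, and the operating system is free to round or cap the value you request.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Ask the OS for `bytes` of send and receive buffer on socket `fd`.
// Returns false if either request is rejected.
bool set_socket_buffers(int fd, int bytes) {
    return setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof bytes) == 0 &&
           setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes) == 0;
}

// Read back the receive buffer size the OS actually granted, or -1 on error.
int receive_buffer_size(int fd) {
    int bytes = 0;
    socklen_t len = sizeof bytes;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, &len) != 0) return -1;
    return bytes;
}
```

Set the sizes before connecting or listening, and measure (for example with Iperf, as above) rather than assuming a bigger buffer is always faster.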

In addition to how to chunk and deliver the data, another issue you should consider is platform differences. If the two computers are the same architecture, and the code running on both sides is the same version of the same compiler, then you should, probably, be able to just dump the raw memory structure across the network and have it work on the other side. If everything isn't the same, though, you can run into problems with endianness, structure padding, field alignment, etc.
In general, it's good to define a network format for the data separately from your in-memory representation. That format can be binary, in which case numeric values should be converted to standard forms (mainly, changing endianness to "network order", which is big-endian), or it can be textual. Many network protocols opt for text because it eliminates a lot of formatting issues and because it makes debugging easier. Personally, I really like JSON. It's not too verbose, there are good libraries available for every programming language, and it's really easy for humans to read and understand.
One of the key issues to consider when defining your network protocol is how the receiver knows when it has received all of the data. There are two basic approaches. First, you can send an explicit size at the beginning of the message; the receiver then knows to keep reading until it has gotten that many bytes. The other is to use some sort of end-of-message delimiter. The latter has the advantage that you don't have to know in advance how many bytes you're sending, but the disadvantage that you have to figure out how to make sure the end-of-message delimiter can't appear in the message.
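One common way to keep the delimiter out of the message is to escape it. The sketch below assumes a newline as the end-of-message delimiter and a backslash as the escape character; both choices, and the names `frame` and `unframe`, are purely illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const char kEnd = '\n';  // assumed end-of-message delimiter
const char kEsc = '\\';  // assumed escape character

// Escape the payload so a raw delimiter never appears inside it, then
// terminate it with the delimiter.
std::string frame(const std::string& payload) {
    std::string out;
    for (char c : payload) {
        if (c == kEnd) { out += kEsc; out += 'n'; }
        else if (c == kEsc) { out += kEsc; out += kEsc; }
        else out += c;
    }
    out += kEnd;
    return out;
}

// Pull every complete message out of `stream`; an unfinished tail stays in
// `stream` until more bytes are appended to it.
std::vector<std::string> unframe(std::string& stream) {
    std::vector<std::string> msgs;
    std::size_t pos;
    while ((pos = stream.find(kEnd)) != std::string::npos) {
        std::string msg;
        for (std::size_t i = 0; i < pos; ++i) {
            if (stream[i] == kEsc && i + 1 < pos) {
                ++i;  // look at the escaped character
                msg += (stream[i] == 'n') ? kEnd : stream[i];
            } else {
                msg += stream[i];
            }
        }
        msgs.push_back(msg);
        stream.erase(0, pos + 1);
    }
    return msgs;
}
```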
Once you decide how the data should be structured as it's flowing across the network, then you should figure out a way to convert the internal representation to that format, ideally in a "streaming" way, so you can loop through your data structure, converting each piece of it to network format and writing it to the network socket.
On the receiving side, you just reverse the process, decoding the network format to the appropriate in-memory format.
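For the binary variant, the conversion in both directions might look like the sketch below. It makes several assumptions that are not in the question: the structs are simplified to hold values instead of pointers, primType is sent as a 32-bit integer, every number goes out big-endian, and float is a 32-bit IEEE-754 type on both machines. All function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Vertex { float X, Y, Z, XNormal, ZNormal; };
struct Primitive { std::uint32_t primType; std::vector<Vertex> vertices; };
typedef std::vector<unsigned char> Bytes;

void put_u32(Bytes& out, std::uint32_t v) {  // most significant byte first
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<unsigned char>((v >> shift) & 0xFF));
}
void put_f32(Bytes& out, float f) {  // send the float's bit pattern
    static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    put_u32(out, bits);
}

// Layout: primitive count, then per primitive: type, vertex count, vertices.
Bytes encode(const std::vector<Primitive>& prims) {
    Bytes out;
    put_u32(out, static_cast<std::uint32_t>(prims.size()));
    for (const Primitive& p : prims) {
        put_u32(out, p.primType);
        put_u32(out, static_cast<std::uint32_t>(p.vertices.size()));
        for (const Vertex& v : p.vertices) {
            put_f32(out, v.X); put_f32(out, v.Y); put_f32(out, v.Z);
            put_f32(out, v.XNormal); put_f32(out, v.ZNormal);
        }
    }
    return out;
}

struct Reader {  // bounds-checked cursor over the received bytes
    const unsigned char* p; const unsigned char* end; bool ok;
    Reader(const unsigned char* d, std::size_t n) : p(d), end(d + n), ok(true) {}
    std::uint32_t u32() {
        if (end - p < 4) { ok = false; return 0; }
        std::uint32_t v = (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
                          (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
        p += 4;
        return v;
    }
    float f32() {
        std::uint32_t bits = u32();
        float f;
        std::memcpy(&f, &bits, sizeof f);
        return f;
    }
};

// Returns false if the input is truncated or has trailing bytes.
bool decode(const Bytes& in, std::vector<Primitive>& prims) {
    Reader r(in.data(), in.size());
    prims.clear();
    std::uint32_t nprims = r.u32();
    for (std::uint32_t i = 0; r.ok && i < nprims; ++i) {
        Primitive prim;
        prim.primType = r.u32();
        std::uint32_t nverts = r.u32();
        for (std::uint32_t j = 0; r.ok && j < nverts; ++j) {
            Vertex v;
            v.X = r.f32(); v.Y = r.f32(); v.Z = r.f32();
            v.XNormal = r.f32(); v.ZNormal = r.f32();
            prim.vertices.push_back(v);
        }
        prims.push_back(prim);
    }
    return r.ok && r.p == r.end;
}
```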
My recommendation for your case is to use JSON. 2 MB is not a lot of data, so the overhead of generating and parsing won't be large, and you can easily represent your data structure directly in JSON. The resulting text will be self-delimiting, human-readable, easy to stream, and easy to parse back into memory on the destination side.

What should I know about UDP programming?

I don't mean how to connect to a socket. What should I know about UDP programming?
Do I need to worry about bad data in my socket?
Should I assume that if I send 200 bytes I may get 120 and 60 bytes separately?
Should I worry about another connection sending me bad data on the same port?
If data doesn't arrive, typically how long may I not see data for (250ms? 1 second? 1.75sec?)
What do I really need to know?

"I should assume if I send 200 bytes I may get 120 and 60 bytes separately?"
When you're sending UDP datagrams, your read size will equal your write size. This is because UDP is a datagram protocol, as opposed to TCP's stream protocol. However, you can only write data up to the size of the path MTU before the packet may be fragmented or dropped by a router. For general internet use, the conservative figure is 576 bytes including headers, which is the datagram size every IPv4 host is required to accept.
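The arithmetic behind that figure, as a small sketch: an IPv4 header is 20 to 60 bytes and a UDP header is 8 bytes, so a 576-byte datagram leaves somewhere between 508 and 548 bytes for your data. The constant and function names below are just for illustration.

```cpp
#include <cassert>
#include <cstddef>

constexpr int kSafeDatagram = 576;  // datagram size every IPv4 host must accept
constexpr int kIpv4HeaderMin = 20;  // IPv4 header without options
constexpr int kIpv4HeaderMax = 60;  // IPv4 header with the maximum options
constexpr int kUdpHeader = 8;

// Bytes left for application data in one datagram.
constexpr int payload_budget(int datagram, int ip_header) {
    return datagram - ip_header - kUdpHeader;
}

// How many datagrams it takes to carry `bytes` at `payload` bytes each.
std::size_t datagrams_needed(std::size_t bytes, std::size_t payload) {
    assert(payload > 0);
    return (bytes + payload - 1) / payload;  // round up
}

static_assert(payload_budget(kSafeDatagram, kIpv4HeaderMin) == 548, "typical case");
static_assert(payload_budget(kSafeDatagram, kIpv4HeaderMax) == 508, "worst case");
```

So the 200-byte message from the question fits comfortably in a single datagram.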
"Should I worry about another connection sending me bad data on the same port?"
You don't have a connection, you have a port. You will receive any data sent to that port, regardless of where it's from. It's up to you to determine if it's from the right address.
"If data doesn't arrive, typically how long may I (typically) not see data for (250ms? 1 second? 1.75sec?)"
Data can be lost forever, data can be delayed, and data can arrive out of order. If any of those things bother you, use TCP. Writing a reliable protocol on top of UDP is far from trivial, and for almost all applications there is no reason to do so.

Should I worry about another connection sending me bad data on the same port?
Yes, you should worry about it. Any application can send data to your open UDP port at any time. One of the big uses of UDP is many-to-one style communication, where you multiplex communications with several peers on a single port, using the address passed back by recvfrom to differentiate between peers.
However, if you want to avoid this and only accept packets from a single peer, you can actually call connect on your UDP socket. This causes the IP stack to reject packets coming from any host:port combination (socket) other than the one you want to talk to.
A second advantage of calling connect on your UDP socket is that on many OSes it gives a significant speed / latency improvement. When you call sendto on an unconnected UDP socket, the OS actually temporarily connects the socket, sends your data and then disconnects the socket, adding significant overhead.
A third advantage of using connected UDP sockets is it allows you to receive ICMP error messages back to your application, such as routing or host unknown due to a crash. If the UDP socket isn't connected the OS won't know where to deliver ICMP error messages from the network to and will silently discard them, potentially leading to your app hanging while waiting for a response from a crashed host ( or waiting for your select to time out ).
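A minimal sketch of a connected UDP socket using the POSIX socket API (on Windows the calls are nearly the same through Winsock). The helper name `connected_udp_socket` is made up for this example.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Create a UDP socket and connect() it to ip:port. No packets are exchanged
// by connect() on UDP; it only records the peer address in the kernel.
// Returns the socket descriptor, or -1 on failure.
int connected_udp_socket(const char* ip, std::uint16_t port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;
    sockaddr_in peer;
    std::memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &peer.sin_addr) != 1 ||
        connect(fd, reinterpret_cast<sockaddr*>(&peer), sizeof peer) != 0) {
        close(fd);
        return -1;
    }
    // From here on, plain send()/recv() can be used instead of
    // sendto()/recvfrom(), and datagrams from other peers are not delivered.
    return fd;
}
```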

Your packet may not get there.
Your packet may get there twice or even more often.
Your packets may not be in order.
You have a size limitation on your packets imposed by the underlying network layers. The packet size may be quite small (possibly 576 bytes).
None of this says "don't use UDP". However you should be aware of all the above and think about what recovery options you may want to take.

Fragmentation and reassembly happens at the IP level, so you need not worry about that (Wikipedia). (This means that you won't receive split or truncated packets).
UDP packets have a checksum for the data and the header, so receiving bogus data is unlikely, but possible. Lost or duplicate packets are also possible. You should check your data in any case anyway.
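If you want your own check on top of the UDP checksum, one option is to append a CRC to each payload and verify it on receipt. The sketch below uses the common CRC-32 (the variant used by Ethernet and zlib); picking CRC-32 is my assumption, since the answer does not name a particular algorithm, and `crc32_ieee` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320). Slow but tiny; a
// table-driven version is the usual optimisation.
std::uint32_t crc32_ieee(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

The sender would transmit the payload followed by its CRC, and the receiver would drop any datagram whose recomputed CRC does not match.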
There's no congestion control, so you may wish to consider that, if you plan on clogging the tubes with a lot of UDP packets.

UDP is a connectionless protocol. Data sent over UDP may reach the receiver, but it may also get lost in transit. UDP suits things like broadcasting and streaming audio or video, where an occasional dropped packet is tolerable. So if you need to ensure your data gets to the other side, stick with TCP.
UDP has less overhead than TCP and is therefore often faster. (TCP needs to build a connection first, and it acknowledges and retransmits data, which takes time.)
Fragmented UDP packets (i.e. packets bigger than about half a KB) may be dropped by routers, so split your data into small chunks before sending it. (In some cases, the OS can take care of that.) Note that it is always a whole packet that makes it, or not. Half packets aren't processed.
Latency over long distances can be quite big. If you want to do retransmission of data, I would go with something like 5 to 10 times the average latency over the current connection. (You can measure the latency by sending and receiving a few packets.)
Hope this helps.

I won't follow suit with the other people who answered this, they all seem to push you toward TCP, and that's not for gaming at all, except maybe for login/chat info. Let's go in order:
Do I need to worry about bad data in my socket?
Yes. Even though UDP contains an extremely simple checksum for routers and such, it is not 100% effective. You can add your own checksum, but most of the time UDP is used when reliability is already not an issue, so data that doesn't conform should just be dropped.
I should assume if I send 200bytes I may get 120 and 60 bytes separately?
No, UDP is a direct data write and read. However, if the data is too large, some routers will truncate it and you lose part of the data permanently. Some have said roughly 576 bytes with header; I personally wouldn't use more than 256 bytes (a nice round power of two).
Should I worry about another connection sending me bad data on the same port?
UDP listens for any data from any computer on a port, so in this sense yes. Also note that UDP is primitive and a raw packet can be crafted to fake the sender, so you should use some sort of "key" that lets the listener verify the sender against their IP.
If data doesnt arrive typically how long may I (typically) not see data for (250ms? 1 second? 1.75sec?)
Data sent on UDP is usually disposable, so if you don't receive data, it can simply be ignored. However, sometimes you want "semi-reliable" delivery without the 'ordered reliable' delivery that TCP uses; in that case 1 second is a good estimate for declaring a packet dropped. You can number your packets on a rotation and write your own ACK communication: when a packet is received, the receiver records its number and sends back a bitfield letting the sender know which packets it received. You can read this unfinished document for more information (although unfinished, it still yields valuable info):
http://gafferongames.com/networking-for-game-programmers/
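A sketch of the receiver-side bookkeeping for such an ACK bitfield. The details are assumptions chosen for illustration: 16-bit sequence numbers that wrap around, and a 32-bit field in which bit n means "packet latest-(n+1) was received". `AckTracker` is a made-up name.

```cpp
#include <cassert>
#include <cstdint>

class AckTracker {
public:
    // Record the arrival of the packet numbered `seq`.
    void on_receive(std::uint16_t seq) {
        if (!any_) { any_ = true; latest_ = seq; return; }
        std::uint16_t ahead = static_cast<std::uint16_t>(seq - latest_);
        if (ahead == 0) return;  // duplicate of the newest packet
        if (ahead < 0x8000) {    // newer than anything seen so far
            // Slide the window: the old newest packet becomes bit ahead-1.
            if (ahead > 32)       bits_ = 0;
            else if (ahead == 32) bits_ = 1u << 31;
            else                  bits_ = (bits_ << ahead) | (1u << (ahead - 1));
            latest_ = seq;
        } else {                 // older packet arriving late
            std::uint16_t behind = static_cast<std::uint16_t>(latest_ - seq);
            if (behind <= 32) bits_ |= 1u << (behind - 1);
        }
    }

    // The two values to send back to the sender in every ACK.
    std::uint16_t latest() const { return latest_; }
    std::uint32_t bits() const { return bits_; }

    // Was `seq` received, as far as the current window can tell?
    bool acked(std::uint16_t seq) const {
        if (!any_) return false;
        std::uint16_t behind = static_cast<std::uint16_t>(latest_ - seq);
        if (behind == 0) return true;
        return behind <= 32 && ((bits_ >> (behind - 1)) & 1u) != 0;
    }

private:
    bool any_ = false;
    std::uint16_t latest_ = 0;
    std::uint32_t bits_ = 0;
};
```

The sender can treat any packet that falls out of the 32-packet window without being acknowledged as lost and decide whether it is worth resending.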

The big thing to know when attempting to use UDP is:
Your packets might not all make it over the line, which means you have to expect data loss.
If you're working on an application where 100% of the data needs to arrive reliably to provide functionality, use TCP. If you're working on an application where some loss is allowable (streaming media, etc.) then go for UDP, but don't expect everything to get from one end of the pipe to the other intact.

One way to look at the difference between applications appropriate for UDP vs. TCP is that TCP is good when data delivery is "better late than never", UDP is good when data delivery is "better never than late".
Another aspect is that the stateless, best-effort nature of most UDP-based applications can make scalability a bit easier to achieve. Also note that UDP can be multicast while TCP can't.

In addition to don.neufeld's recommendation to use TCP.
For most applications TCP is easier to implement. If you need to maintain packet boundaries in a TCP stream, a good way is to transmit a two byte header before the data to delimit the messages. The header should contain the message length. At the receiving end just read two bytes and evaluate the value. Then just wait until you have received that many bytes. You then have a complete message and are ready to receive the next 2-byte header.
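The sending side of that two-byte header could look like the sketch below. The byte order (big-endian) is an assumption, and `frame_message` and `peek_length` are illustrative names. Note that a two-byte length limits each message to 65535 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Append a 2-byte big-endian length followed by the body to `out`.
// Returns false if the body does not fit in a 2-byte length.
bool frame_message(const std::string& body, std::vector<unsigned char>& out) {
    if (body.size() > 0xFFFF) return false;
    out.push_back(static_cast<unsigned char>(body.size() >> 8));
    out.push_back(static_cast<unsigned char>(body.size() & 0xFF));
    out.insert(out.end(), body.begin(), body.end());
    return true;
}

// Receiving side: turn the two header bytes back into the body length.
std::size_t peek_length(const unsigned char* header) {
    assert(header != nullptr);
    return (static_cast<std::size_t>(header[0]) << 8) | header[1];
}
```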
This gives you some of the benefit of UDP without the hassle of lost data, out-of-order packet arrival etc.

And don't assume that if you send a packet it got there.

If there is a packet size limitation imposed by some router along the way, your UDP packets could be silently truncated to that size.

Two things:
1) You may or may not receive what was sent
2) Whatever you receive may not be in the same order it was sent.

Count the number of packets sent to a server from a client?

So I'm almost done with an assignment involving Win32 programming and sockets, but I have to generate and analyze some statistics about the transfers. The only part I'm having trouble with is how to figure out the number of packets that were sent to the server from the client.
The data sent can be variable-length, so I can't just divide the total bytes received by a #define'd value.
We have to use asynchronous calls to do everything, so I've been trying to increment a counter with every FD_READ message I get for the server's socket. However, because I have to be able to accept a potentially large file size, I have to call recv/recvfrom with a buffer size around 64k. If I send a small packet (a-z), there are no problems. But if I send a string of 1024 characters 10x, the server reports 2 or 3 packets received, but 0% data loss in terms of bytes sent/received.
Any idea how to get the number of packets?
Thanks in advance :)

This really boils down to what you mean by 'packet.'
As you are probably aware, when a TCP/UDP message is sent on the wire, the data being sent is 'wrapped,' or prepended, with a corresponding TCP/UDP header. This is then 'wrapped' in an IP header, which is in turn 'wrapped' in an Ethernet frame. You can see this breakout if you use a sniffing package like Wireshark.
The point is this. When I hear the term 'packet,' I think of data at the IP level. IP data is truly packetized on the wire, so packet counts make sense when talking about IP. However, if you're using regular sockets to send and receive your data, the IP headers, as well as the TCP/UDP headers, are stripped off, i.e., you don't get this information from the socket. And without that information, it is impossible to determine the number of 'packets' (again, I'm thinking IP) that were transmitted.
You could do what others are suggesting by adding your own header with a length and a counter. This information will help you accurately size your receive buffers, but it won't help you determine the number of packets (again, IP...), especially if you're doing TCP.
If you want to accurately determine the number of packets using Winsock sockets, I would suggest creating a 'raw' socket as suggested here. This socket will collect all IP traffic seen by your local NIC. Use the IP and TCP/UDP headers to filter the data based on your client and server sockets, i.e., IP addresses and port numbers. This will give an accurate picture of how many IP packets were actually used to transmit your data.

Not a direct answer to your question but rather a suggestion for a different solution.
What if you send a length-descriptor in front of the data you want to transfer? That way you can already allocate the correct buffer size (not too much, not too little) on the client and also check if there were any losses when the transfer is over.
With TCP you should have no problem at all because the protocol itself handles the error-free transmission or otherwise you should get a meaningful error.
Maybe with UDP you could also split up your transfer into fixed-size chunks with a proper sequence ID. You'd have to accumulate all incoming packets before you sort them (UDP makes no guarantee on the receive order) and paste the data together.
On the other hand, you should consider whether it is really necessary to support UDP, as there is quite some manual overhead if you want to make that protocol error-safe... (see the Wikipedia article on TCP for a list of the problems to get around)

Do your packets have a fixed header, or are you allowed to define your own? If you can define your own, include a packet counter in the header, along with the length. You'll have to keep a running total that accounts for rollover in your counter, but this will ensure you're counting packets sent, rather than packets received. For a simple assignment, you probably won't be encountering loss (with UDP, obviously), but if you were, a packet counter would make sure your statistics reflected the sent messages accurately.
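A sketch of the rollover accounting on the receiving side. It assumes a 16-bit counter in your own header that starts at 0 and increases by one per packet sent, and that packets are never reordered by more than half the counter range; `SentCounter` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

class SentCounter {
public:
    // `counter` is the 16-bit value taken from the header of a received packet.
    void on_packet(std::uint16_t counter) {
        ++received_;
        if (!started_) { started_ = true; highest_ = counter; return; }
        // Distance from the highest counter seen so far, modulo 65536.
        std::uint16_t delta = static_cast<std::uint16_t>(
            counter - static_cast<std::uint16_t>(highest_));
        if (delta != 0 && delta < 0x8000)
            highest_ += delta;  // moved forward, possibly across a rollover
        // otherwise: a duplicate or a late packet, highest_ stays put
    }

    // Packets the sender has emitted so far, as far as the receiver can tell.
    std::uint64_t sent() const { return started_ ? highest_ + 1 : 0; }
    std::uint64_t received() const { return received_; }
    std::uint64_t lost() const {
        return sent() > received_ ? sent() - received_ : 0;
    }

private:
    bool started_ = false;
    std::uint64_t highest_ = 0;   // counter extended past 16 bits
    std::uint64_t received_ = 0;
};
```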
