What would you use to implement a fast and lightweight file server? - c++

I need to have as part of a desktop application a file server which should respond as fast as possible to file transfer requests (from remote clients, usually located on the same LAN). There will be many file requests for small sized files. The server should be able to provide both upload and download services.
I am not tied to any particular technology, so I am open to any programming language, toolkit, or library as long as it runs on Windows.
My initial take is to go with a C/C++ implementation using Windows Sockets, or to use the services provided by libraries such as Boost (asio or similar). I have also thought of Erlang, but I would have to learn it, so the performance benefits would need to justify the extra development time.
LATER EDIT: I appreciate the answers that say use FTP or HTTP or basically anything that has been already created but considering you still want to write one from scratch, what would you do?

Why not just go with FTP? You should be able to find an adequate server implementation in any language, and client access libraries too.
It sounds like a lot of wheel-reinvention. Granted, FTP is not ideal, and has a few odd spots, but ... it's there, it's standard, well-known, and already very widely implemented.

For frequent uploads of small files, the fastest way would be to implement your own proprietary protocol, but that would require a considerable amount of work - and also it would be non-standard, meaning future integration would be difficult unless you are able to implement your protocol in any client you'll support. If you choose to do it anyway, this is my suggestion for a simple protocol:
Command: 1 byte to identify what'll be done: (0x01 for upload request, 0x02 for download request, 0x11 for upload response, 0x12 for download response, etc).
File name: can be fixed-size or prefixed with a byte for the length (assuming the name is at most 255 bytes)
Checksum, MD5 for instance (if upload request or download response)
File size (if upload request or download response)
payload (if upload request or download response)
This could be implemented on top of a simple TCP socket. You can also use UDP, avoiding the cost of establishing a connection but in this case you have to deal with retransmission control.
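A minimal sketch of how the message layout above could be encoded and decoded in C++. Several details are assumptions that the description leaves open: the file name uses the one-byte length prefix, the checksum is a fixed 16 bytes (the MD5 size) computed elsewhere, the file size is a 4-byte big-endian integer, and the names Command, Message, encode and decode are made up for illustration. Socket I/O is left out; the functions only turn a message into bytes and back.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Command codes from the list above.
enum class Command : std::uint8_t {
    UploadRequest    = 0x01,
    DownloadRequest  = 0x02,
    UploadResponse   = 0x11,
    DownloadResponse = 0x12,
};

struct Message {
    Command command{};
    std::string name;                         // at most 255 bytes
    std::array<std::uint8_t, 16> checksum{};  // assumed MD5-sized, computed elsewhere
    std::vector<std::uint8_t> payload;        // file contents
};

// Only upload requests and download responses carry checksum, size and payload.
inline bool carries_file(Command c) {
    return c == Command::UploadRequest || c == Command::DownloadResponse;
}

std::vector<std::uint8_t> encode(const Message& m) {
    if (m.name.size() > 255) throw std::length_error("file name longer than 255 bytes");
    if (m.payload.size() > 0xFFFFFFFFu) throw std::length_error("payload too large");
    std::vector<std::uint8_t> out;
    out.push_back(static_cast<std::uint8_t>(m.command));
    out.push_back(static_cast<std::uint8_t>(m.name.size()));
    out.insert(out.end(), m.name.begin(), m.name.end());
    if (carries_file(m.command)) {
        out.insert(out.end(), m.checksum.begin(), m.checksum.end());
        const auto size = static_cast<std::uint32_t>(m.payload.size());
        for (int shift = 24; shift >= 0; shift -= 8)  // big-endian file size
            out.push_back(static_cast<std::uint8_t>(size >> shift));
        out.insert(out.end(), m.payload.begin(), m.payload.end());
    }
    return out;
}

Message decode(const std::vector<std::uint8_t>& in) {
    std::size_t pos = 0;
    // Throws if fewer than n bytes remain after the current position.
    auto need = [&](std::size_t n) {
        assert(pos <= in.size());
        if (in.size() - pos < n) throw std::runtime_error("truncated message");
    };
    Message m;
    need(2);
    m.command = static_cast<Command>(in[pos++]);
    const std::size_t name_len = in[pos++];
    need(name_len);
    m.name.assign(in.begin() + pos, in.begin() + pos + name_len);
    pos += name_len;
    if (carries_file(m.command)) {
        need(16 + 4);
        std::copy(in.begin() + pos, in.begin() + pos + 16, m.checksum.begin());
        pos += 16;
        std::uint32_t size = 0;
        for (int i = 0; i < 4; ++i) size = (size << 8) | in[pos++];
        need(size);
        m.payload.assign(in.begin() + pos, in.begin() + pos + size);
    }
    return m;
}
```

Over TCP the receiver would still have to read until a whole message has arrived, since a single read may return only part of it; the length fields are what make that possible.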
Before deciding to implement your own protocol, take a look at HTTP libraries like libcurl, you could make your server use standard HTTP commands like GET for download and POST for upload. This would save a lot of work and you'll be able to test the download with any web browser.
Another suggestion to improve performance is to use something like SQLite as the file repository instead of the filesystem. You can create a single table containing one char column for the file name and one blob column for the file contents. Since SQLite is lightweight and caches efficiently, you'll avoid the disk access overhead most of the time.
I'm assuming you don't need client authentication.
Finally: although you prefer C++ for its raw native code speed, that is rarely the major bottleneck in this kind of application. Most probably it will be disk access and network bandwidth. I'm mentioning this because in Java you'll probably be able to make a servlet that does exactly the same thing (using HTTP GET for download and POST for upload) in less than 100 lines of code. Use Derby instead of SQLite in this case, put that servlet in any container (Tomcat, Glassfish, etc) and it's done.

If all the machines are running on Windows on the same LAN, why do you need a server at all? Why not simply use Windows file sharing?

I would suggest not to use FTP, or SFTP, or any other connection oriented technique. Instead, go for a connectionless protocol or technique.
The reason is that, if you require lots of small files to be uploaded or downloaded, and the response should be as fast as possible, you want to avoid the cost of setting up and destroying connections.
I would suggest that you look at either using an existing implementation or implementing your own HTTP or HTTPS server/service.

Your bottlenecks are likely to come from one of the following sources:
Hard disk I/O - The WD VelociRaptor is supposed to have a random access speed of about 100MB/s. It also matters whether you set it up as RAID 0, 1, 5 or something else. Some read fast but write slowly. Trade-offs.
Network I/O - Assuming that you have the fastest harddisks in a fast RAID setup, unless you use Gbit I/O, your network will be slow. If your pipes are big, you still need to supply it with data.
Memory cache - The in-memory file-system cache will need to be big enough to buffer all the network I/O so that it does not slow you down. That will require large amounts of memory for the kind of work you're looking at.
File-system structure - Assuming that you have gigabytes worth of memory, then the bottleneck will most likely be the data-structure that you use for the file-system. If the file-system structure is cumbersome it will slow you down.
Only once all the other problems are solved should you worry about your application itself. Notice that most of the bottlenecks are outside your software's control. Therefore, whether you code it in C/C++ or use specific libraries, you will still be at the mercy of the OS and hardware.

Sounds like you should use an SFTP (SSH) server: it's firewall/NAT safe, secure, and already does what you want and more. You could also use Samba or Windows file sharing for an even simpler implementation.

Why not use something existing, for example a normal Web server handles a lot of small files (images) very well and fast.
And lots of people already spent time in optimizing the code.
The second benefit is that the transfer is done over HTTP, which is an established protocol and is easily switched to SSL if you need more security.
For the uploads, they are also no problem with a script or custom module - with the same method you can also add authorization.
As long as you don't need to seek within the files dynamically, I guess this would be one of the best solutions.

It's a new part to an existing desktop application? What's the goal of the server? Is it protecting the files that are uploaded/downloaded and providing authentication and/or authorisation? Does it provide some kind of structure for the uploads to be stored in?
One option may be to install Apache HTTP Server on the machine and serve the file via that. Use POST to upload and GET to download.
If the clients are within a LAN could you not just share a drive?

Related

Segmented FTP upload

How can I upload a file over FTP in a segmented way? Is there any open source tool/library that I can use for it?
Is there any server side change needed to combine the uploads? Currently I am using vsftpd.
The first thing to consider is that segmented transfers are not considered good net citizen behaviour (i.e. you are gaming the system by setting up multiple downloads on a shared link, gaining more than your fair share of bandwidth). As such, the protocol definitions do not specifically support segmented upload (or download, for that matter). Resume, yes.
Segmented DOWNLOAD is a hack by some tools that use the RESUME function of the protocol to transfer different parts of the same file at the same time. This behaviour is "NON-STANDARD" and not the intention of the protocol specifications.
Segmented UPLOAD is possible, but the client AND the ftpd server (or whatever protocol server you're using) would need to support this NON-STANDARD and frowned-upon implementation.
Again, this is not supported specifically in any standards as such poor behaviour is not encouraged by an open standard.
HOWEVER, you will find tools like lftp that support segmented ftp downloads. But currently, I have not seen any implementation of segmented upload that uses common open protocols like ftp.
I did find a java (Custom open source) based udp tool that did this, but udp needs tcp fallback if you want reliability in the internet. (udp is dropped by some internet gateways)
In FTP protocol, you can implement a transfer by parts using REST command.
The REST command defines offset in a file, where transfer starts. You then transfer as many bytes as you want. And then you can restart the transfer again from a further offset.
vsftpd server supports REST command.
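To make the REST-based approach concrete, here is a small sketch that only works out the offsets. It assumes stream mode, where the REST argument is a decimal byte offset, and the names Segment, split_into_segments and rest_command are hypothetical. Sending the commands and moving the data is left to whatever FTP client code you already have.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Segment {
    std::uint64_t offset;  // value to send with REST before starting the transfer
    std::uint64_t length;  // number of bytes to transfer from that offset
};

// Split a file of file_size bytes into consecutive parts of at most segment_size bytes.
std::vector<Segment> split_into_segments(std::uint64_t file_size, std::uint64_t segment_size) {
    assert(segment_size > 0);
    std::vector<Segment> segments;
    for (std::uint64_t offset = 0; offset < file_size; offset += segment_size) {
        const std::uint64_t remaining = file_size - offset;
        segments.push_back({offset, remaining < segment_size ? remaining : segment_size});
    }
    return segments;
}

// Control-connection line that sets the restart offset for the next transfer.
std::string rest_command(std::uint64_t offset) {
    return "REST " + std::to_string(offset) + "\r\n";
}
```

For a 10-byte file split into 4-byte parts this gives offsets 0, 4 and 8, with the last part only 2 bytes long. Each part is then a REST followed by the usual transfer command, stopped after length bytes.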

How to communicate between C++ server app and django web app

I have a framework doing a specific task in C++ and a django-based web app. The idea is to launch this framework, receive some data from it, send it some data or requests, and check its status periodically.
I'm looking for the best way of communication. Both apps run on the same server. I was wondering if a JSON server in C++ is a good idea. Django would send a request to this server, and the server would parse it and delegate a worker thread to complete the task. Almost all data that needs to be sent is string-like. Other data will be stored in a database, so there is no problem with that.
Is JSON a good idea? Maybe you know some better mechanism for local communication between C++ and django?
If the C++ application is guaranteed to always be on the same machine as the Django web application, include the C++ code by converting it into a shared library and wrapping Python around it, as described in "Calling C/C++ from python?"
JSON and other serializations make sense if you are going to do remote calls and the code needs to communicate across machines.
JSON seems like a fair enough choice for data serialization - it's good at handling strings and there are existing libraries for encoding/decoding JSON in both python & C++.
However, I think your bigger problem is likely to be the transport protocol that you use for transferring JSON between your client and server. Here's some options:
You could build an HTTP server into your C++ application (which I think might be what you mean by "JSON server" in your question), which would work fine, though might be a bit of a pain to implement unless you get a hold of a library to handle the hard work for you.
Another option might be to use the 0MQ library to send JSON (or otherwise) messages between your client and server. I think this would probably be a lot easier than implementing a full HTTP server, and 0MQ has some interprocess communication code that would likely be a lot faster than sending things over the network.
A third option would just be to run your C++ as a standalone application and pass the data in to it via stdin or command line parameters. This is probably the simplest way to do things, though it may not be the most flexible. If you were to go this way, you might be better off just building a Python/C++ binding as suggested by ablm.
Alternatively you could attempt to build some sort of job queue based on redis or some other database system. The idea is that your django application puts some JSON describing the job into the job queue, and the C++ application periodically polls the queue and uses a separate redis entry to pass the results back to the client. This could have the advantage that you could reasonably easily have several "workers" reading from the job queue with minimal effort.
There's almost definitely some other ways to go about it, but those are the ones that immediately spring to mind.

Jabber and expensive data (xml its trash)

I'm working on a project that uses Jabber as its communication platform.
The thing is that I need clients (a lot of clients) to communicate with each other not only for signalling, but also to exchange data.
Imagine that client A has 3 services available. Client B could ask A to start sending it info from each service (like a stream service) until client B tells A to stop the services.
These services could send just one character every 100ms, or 1000 characters every 100ms, or only send some data when it's needed.
When the info sent to B arrives, B has to know which service it corresponds to, what action, and the values (for example), so I'm using JSON over Jabber.
My problem is that I'm wasting a lot of bandwidth with the Jabber/XMPP protocol just to send a message with a body like:
{"s":"x", "x":5} //each 100ms (5 represents any number)
I really don't want to have parallel communication (like direct sockets), because Jabber has all of that implemented, it scales easily, it avoids firewall problems, and sometimes I use HTTP communications (I'm using BOSH in this case).
I know that there is some compression that I can do, but I'm wondering if you would recommend something else that doesn't put such an amount of XML around my message while still using Jabber.
Thanks a lot for your help.
Best Regards,
Eduardo
It sounds like, except for your significant data transfer, XMPP suits your application well.
As you probably know, XMPP was never designed or intended to be used as a big pipe for data transfer. Most applications that involve significant data transfer, such as file transfers and voice/video, use XMPP just for negotiation of a separate "out of band" stream. You say this might cause problems for you because of firewalls and web clients.
If your application is mostly transferring text, then you really should try out compression... it offers significant savings on bandwidth, if that's your most constrained resource. The downside is that it will take more client and server memory (around 300KB by default, but that can be reduced with marginal compression loss).
Alternatively you can look at tunneling your data base64-encoded using In-Band Bytestreams. I don't have your sample data, or know how you are wrapping them for transport, and this could come off worse or better. I would say it would come off better if you stripped out your JSON and made it into a more efficient binary format instead. Base64 data will not compress so well, and is roughly 33% larger than the raw data. The savings would be in being able to strip out JSON and any other extraneous wrappings, while keeping the data within the XMPP stream.
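The "roughly 33%" figure follows from how padded Base64 works: every group of 3 raw bytes becomes 4 output characters, and a final partial group is padded up to 4. A tiny sketch of that arithmetic, with a made-up function name:

```cpp
#include <cassert>
#include <cstddef>

// Length of the padded Base64 encoding of raw_bytes bytes:
// 4 output characters for every started group of 3 input bytes.
constexpr std::size_t base64_encoded_size(std::size_t raw_bytes) {
    return 4 * ((raw_bytes + 2) / 3);
}

// 300 raw bytes become 400 characters, i.e. one third more.
static_assert(base64_encoded_size(300) == 400, "3 bytes in, 4 characters out");
```

For very small payloads the padding makes it worse than 33%: a single raw byte still costs 4 characters. With messages as small as the ones in the question, the surrounding stanza will most likely outweigh either encoding, so it is worth measuring real traffic before switching.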
In the end scaling most applications is hard, whichever technologies you use. It requires primarily insight - you shouldn't change anything without testing it first, and you should be testing beforehand to find out what you ought to change. You should be analyzing your system for the primary bottlenecks (is it really the client's bandwidth??). Rarely in my experience has XML itself been the direct bottleneck. However ultimately all these things are unique to your application, it's not easy to give generic advice at scale.
No, XML is not trash. It's human-readable, very extensible and can be compressed extremely well.
XMPP supports stream compression, and this stream compression (mostly zlib) works extremely well according to all my tests. So if it's important for you to optimize the number of bytes you send over the wire, or you are on low bandwidth, then use stream compression when you are on sockets. When you are on BOSH you have to use either a server which supports HTTP compression or a proxy in between to enable compression. But keep in mind that BOSH also has lots of overhead with all the HTTP headers.

Socket Server vs. Standard Servers

I'm working on a project of which a large part is server side software. I started programming in C++ using the sockets library. But, one of my partners suggested that we use a standard server like IIS, Apache or nginx.
Which one is better in the long run? When I program it in C++, I have direct access to the raw requests, whereas with a standard server I need to use a scripting language to handle the requests. In any case, which one is the better option and why?
Also, when it comes to security against things like DDoS attacks, do the standard servers already have protection? If I wanted to implement it in my socket server, what is the best way?
"Server side software" could mean lots of different things, for example this could be a trivial app which "echoes" everything back on a specific port, to a telnet/ftp server to a webserver running lots of "services".
So where in this gamut of possibilities does your particular application lie? Without further information, it's difficult to make any suggestions, but let's see..
1. Web Services, i.e. your "server side" requirement is to handle individual requests and respond after doing some set of business logic. Typically communication is via SOAP/XML, and this is ideal if you have web based clients (though nothing prevents you from accessing these services via standalone clients). Typically you host these on web servers as you mentioned, and often they are easiest written in Java (I've yet to come across one that needed to be written in C++!)
2. Simple web site - slightly different to the above, responds to HTTP GET/POST requests and serves up static or dynamic content (I'm guessing this is not what you're after!)
3. Standalone server which responds to something specific. Here you'd have to implement your own "messaging"/protocols etc., and the server will carry out a specific function on each incoming request and potentially send responses back. The key thing here is that the server does something specific, and is not a generic container (at which point 1 makes more sense!)
So where does your application lie? If 1/2 use Java or some scripting language (such as Perl/ASP/JSP etc.) If 3, you can certainly use C++, and if you do, use a suitable abstraction, such as boost::asio and Google Protocol buffers, save yourself a lot of headache...
With regard to security, of course bugs and security holes are found all the time; however, the good thing with some of these open source projects is that the community will tackle and fix them. Let's just say you'll be safer using them than your own custom handrolled implementation. The likelihood that you'll be able to address all the issues that they have encountered in the years they've been around is very small (no disrespect to your abilities!)
EDIT: now that there's a little more info, here is one possible approach (this is what I've done in the past, and I've used Java most of the way..)
1. The client facing server should be something reliable, esp. if it's over the internet. Here I would use a proven product; something like Apache is good, or IIS (depends on which technologies you have available). IMHO, I would go for jBoss AS - a really powerful and easily customisable piece of kit that integrates really nicely with lots of different things (all Java of course!) You could then have a simple bit of Java which delegates to your actual Server processes that do the work..
2. For the Server processes you can use C++ if that's what you are comfortable with
There is one key bit which I left out, and this is how 1 & 2 talk to each other. This is where you should look at an open source messaging product (even higher level than asio or protocol buffers), and here I would look at something like Zero MQ or Red Hat Messaging (both are MQ messaging protocols). The great advantage of this type of "messaging bus" is that there is no tight coupling between your servers. With your own handrolled implementation, you'll be doing lots of boilerplate to get the interaction to work just right; with something like MQ, you'll have multiplatform communication without having to get into the details... You will save yourself a lot of time and bother if you elect to use something like that.. (btw. there are other messaging products out there, and some are easier to use - such as Tibco RV or EMS etc. - however they are commercial products and licenses will cost a lot of money!)
With a messaging solution your servers become trivial, as they simply handle incoming messages and send messages back out again, and you can focus on the business logic...
my two pennies... :)
If you opt for the 1st solution in Nim's list (web services), I would suggest you have a look at WSO2's Web Services Framework for C++, Axis CPP, and the Axis2/C web services framework (if you are not restricted to C++). Web Services might be the best solution for your requirement, as you can quickly build them and use them either as processing or proxy modules on the server side of your system.

Increasing SSL handshaking performance

I've got a short-lived client process that talks to a server over SSL. The process is invoked frequently and only runs for a short time (typically for less than 1 second). This process is intended to be used as part of a shell script used to perform larger tasks and may be invoked pretty frequently.
The SSL handshaking it performs each time it starts up is showing up as a significant performance bottleneck in my tests and I'd like to reduce this if possible.
One thing that comes to mind is taking the session id and storing it somewhere (kind of like a cookie), and then re-using this on the next invocation, however this is making me feel uneasy as I think there would be some security concerns around doing this.
So, I've got a couple of questions:
Is this a bad idea?
Is this even possible using OpenSSL?
Are there any better ways to speed up the SSL handshaking process?
After the handshake, you can get the SSL session information from your connection with SSL_get_session(). You can then use i2d_SSL_SESSION() to serialise it into a form that can be written to disk.
When you next want to connect to the same server, you can load the session information from disk, then unserialise it with d2i_SSL_SESSION() and use SSL_set_session() to set it (prior to SSL_connect()).
The on-disk SSL session should be readable only by the user that the tool runs as, and stale sessions should be overwritten and removed frequently.
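A rough sketch of the storage side only, using std::filesystem from C++17. It assumes the bytes produced by i2d_SSL_SESSION() are already in a buffer; the OpenSSL calls themselves are not shown, and save_private and load_if_fresh are hypothetical helper names. The owner-only permissions are meaningful on POSIX systems. On Windows the standard library's permission model is much coarser, so access control would need to be handled separately there.

```cpp
#include <chrono>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Write the serialised session so that only the owning user can read it.
bool save_private(const fs::path& path, const std::vector<std::uint8_t>& bytes) {
    {
        // Create (or empty) the file first, so the permissions are tightened
        // before any session data is written to it.
        std::ofstream create(path, std::ios::binary | std::ios::trunc);
        if (!create) return false;
    }
    std::error_code ec;
    fs::permissions(path, fs::perms::owner_read | fs::perms::owner_write,
                    fs::perm_options::replace, ec);
    if (ec) return false;
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(bytes.data()),
              static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out.flush());
}

// Return the stored bytes, or an empty vector if the file is missing or older
// than max_age. A stale file is removed.
std::vector<std::uint8_t> load_if_fresh(const fs::path& path, std::chrono::seconds max_age) {
    std::error_code ec;
    const auto written = fs::last_write_time(path, ec);
    if (ec) return {};
    if (fs::file_time_type::clock::now() - written > max_age) {
        fs::remove(path, ec);
        return {};
    }
    std::ifstream in(path, std::ios::binary);
    return std::vector<std::uint8_t>(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>());
}
```

The client would call load_if_fresh() at startup, hand any returned bytes to d2i_SSL_SESSION(), and fall back to a full handshake when the result is empty or the server declines to resume the session.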
You should be able to use a session cache securely (which OpenSSL supports), see the documentation on SSL_CTX_set_session_cache_mode, SSL_set_session and SSL_session_reused for more information on how this is achieved.
Could you perhaps use a persistent connection, so the setup is a one-time cost?
You could abstract away the connection logic so your client code still thinks it's doing a connect/process/disconnect cycle.
Interestingly enough I encountered an issue with OpenSSL handshakes just today. The implementation of RAND_poll, on Windows, uses the Windows heap APIs as a source of random entropy.
Unfortunately, due to a "bug fix" in Windows 7 (and Server 2008), the heap enumeration APIs (which are debugging APIs after all) can now take over a second per call once the heap is full of allocations. This means that both SSL connects and accepts can take anywhere from 1 second to more than a few minutes.
The ticket contains some good suggestions on how to patch OpenSSL to achieve far FAR faster handshakes.