Jabber and expensive data (xml its trash) - compression

Im working in a project that has jabber has communication platform.
The thing is that i need clients (a lot of clients) to communicate between each other not only for signalization, but to change data between them.
Imagine that the client A has 3 services available. The client B could request to A to start sending him info from each service (like a stream service) until the client B says to A to stop the services.
These services could only send one character with 100ms interval or 1000characters with 100ms interval or even send some data when its needed.
When the info sended to B, arrives it has to know what service corresponds, what action and the values (example), so im using json over jabber.
My problem is that im wasting a lot of bandwith with jabber xmpp protocol just to send a message with a body like:
{"s":"x", "x":5} //each 100ms (5 represents any number)
I really don't want to have parallel communication (like direct sockets), because jabber has all of that implemented and its easy scalable, firewall problems, sometimes i use http communications (im using BOSH in this case).
I know that there is some compression that i can do, but im wondering if you recommends something else that could not have such ammount of xml behind my message and still, using jabber.
Thanks a lot for your help.
Best Regards,
Eduardo

It sounds like, except for your significant data transfer, XMPP suits your application well.
As you probably know, XMPP was never designed or intended to be used as a big pipe for data transfer. Most applications that involve significant data transfer, such as file transfers and voice/video, use XMPP just for negotiation of a separate "out of band" stream. You say this might cause problems for you because of firewalls and web clients.
If your application is mostly transferring text, then you really should try out compression... it offers significant savings on bandwidth, if that's your most constrained resource. The downside is that it will take more client and server memory (around 300KB by default, but that can be reduced with marginal compression loss).
Alternatively you can look at tunneling your data base64-encoded using In-Band Bytestreams. I don't have your sample data, or know how you are wrapping them for transport, and this could come off worse or better. I would say it would come off better if you stripped out your JSON and made it into a more efficient binary format instead. Base64 data will not compress so well, and is roughly 33% larger than the raw data. The savings would be in being able to strip out JSON and any other extraneous wrappings, while keeping the data within the XMPP stream.
In the end scaling most applications is hard, whichever technologies you use. It requires primarily insight - you shouldn't change anything without testing it first, and you should be testing beforehand to find out what you ought to change. You should be analyzing your system for the primary bottlenecks (is it really the client's bandwidth??). Rarely in my experience has XML itself been the direct bottleneck. However ultimately all these things are unique to your application, it's not easy to give generic advice at scale.

No, Xml is no trash. Its human readable, very extensible and can be compressed extremely well.
XMPP supports stream compression, and this stream compression (mostly zlib) works extremely well according to all my tests. So if its important for you that you optimize the number of bytes you send over the wire or are on low bandwidth then use stream compression when you are on sockets. When you are on Bosh then you have to use either a server which supports HTTP compression or use a proxy in between to enable compression. But keep in mind that BOSH has also lots of overhead with all the HTTP headers.

Related

Web LiveStreaming WebRTC and Sockets (Flask Backend)

I want to build a live streaming app.
My thought process:
Get the Video/Audio data from the
navigator.mediaDevices.getUserMedia(constraints); [client-streamer]
create rooms using sockets(Socket.IO or WebSockets from flask) [backend]
Send the data in 1 to the room members using sockets.
display the media on the client-side.
Is that correct? How should I do it?
how do I broadcast data to specific room members and not to everyone? (flask)
How to consistently send data from the streamer -> server -> room members. the stream is given from 1 is an object, where is the data?
any other better ideas will be great! thanks.
I need to implement the server-side by myself without help from libraries that will do the work for me.
Implementing a streaming platform is not trivial. Unfortunately, it is not as simple as emitting chunks received from the MediaRecorder with onndatavailable and forwarding them to users using a WebSocket server - this is not scalable nor efficient nor reliable.
Below are some strategies you can try for different types of scenarios:
P2P: If you want to have simple peer-to-peer streaming, you can use WebRTC to achieve that with a simple socket.io server for signaling purposes.
Conference: Here things start to get more complicated. You will need a media server if you want to be somewhat scalable. One approach is to route your stream to the users using an SFU or MCU. This will take care of forwarding/processing media to different peers efficiently.
Broadcast: Here things are also non-trivial. Common WebRTC-based architectures include ingesting the WebRTC stream and forward that to an HLS server which will let your stream chunks available for clients through a CDN, or perform RTP forwarding of the WebRTC stream, convert it to RTMP using something like FFmpeg and deliver it through Youtube Live or Twitch to leverage from their infrastructure.
Be aware that the last 2 items are resource-intensive and will certainly not be cheap to maintain.
Below are some open source projects that could help you along the way:
Janus
MediaSoup
AntMedia
Jitsi
Good luck!
Explaining all this is far beyond the scope of a Stack Overflow answer.
Here are a few hints:
You need to use the MediaRecorder API to capture compressed data from your gUM (getUserMedia) stream. MediaRecorder support is inconsistent between makes and models of browser. though.
It kicks a Blob into its onndatavailable handler every so often.
They're compressed as a webm data stream.
You can push those Blobs to a server with socket.io, and the server can turn around and push them to whatever clients you want to.
Playing the webm on the clients is tricky. You may, on some makes and models of browsers, be able to feed the webm stream to the Media Source API using appendBuffer(). But some browsers cannot consume the webm streams.
These webm streams are useless to a player without all their Blob data in order. You can't just start sending a new client the Blobs of the stream when they sign in; you have to restart the MediaRecorder.
(You may be able to make it work without a MediaRecorder restart if you send the first few k bytes of the stream to each new client before sending the current Blob. Extracting those bytes is an intricate programming job involving the ebml package to parse the webm stream and extract the prologue. I have not proven this concept.)
Because getting all this to work -- originator -- server -- viewer is such a pain in the xxx neck, you may want to investigate using something like mediasoup instead. It uses WebRTC transport rather than socket.io, and works cross-platform.

Real-time duplication of data among EC2 instances located in different regions

I'm new to AWS and back-end architecture in general. My current configuration is an EC2 instance (south-east region Singapore) running a Twisted real-time server for a real-time chat app.
Currently, in my implementation, whenever a sender sends a message to the server, it is stored in a python dictionary on the server if the receiver is not online. So basically it is storing this message in the instance's RAM. Now, I want to make the app available worldwide, so I'll be running it on instances of different regions. So my question is, how am I supposed to duplicate/replicate this dictionary stored in RAM of one instance to all the other instance, so it is readily available in all regions? (The reason of storing the messages in RAM and not in a database is the nature of the app. The app involves a large volume of messages sent in bursts, which requires it to be considerably faster than speeds offered by a persistent DB store's I/O read-writes.) My aim is to make the app available globally, and having real-time performance.
(Kindly don't flag this question as an "opinion-based" question and close it. I'm new to server side architecture and I really need someone to at least just point me in the right direction. And I don't think I'll be able to find help on this anywhere other than StackOverflow.)
Here's a few things I would think of if I had to build it myself (I've implemented most of these pointers in our own project and it took me quite a while).
If you really really need all servers to be in sync you'll need a consensus protocol. If you do. Don't built this yourself. It's going to take a lot of time and errors.
If you can, partition your chat data into chatrooms and have only a few servers handle one chatroom.
I've used msgpack to encode my data. It's faster and smaller than json.
You'll benefit a lot of compressing your data before you send it over the wire. Have a look at something like zlib or lz4
Even though the size of compressed msgpack is almost the same of that compressed json. I'd choose msgpack because it's faster. It's easier to parse because it's length prefixed encoded.
I would try to send messages together. Batch up all messages every x ms. In my project I chose 100ms batching up messages will save you a lot of bandwidth since your compression algorithm can remove more duplication.
You'll have to handle connection timeouts. Only regard a message as sent and done when you get a reply back (you'll have to design/choose your protocol to handle that)
Think of what is acceptable, how much data you're willing to loose when something crashes or otherwise fails. If you're not willing to loose data you'll have to implement something that stores data to disk.
I've had the problem that writes to database we use (Google Cloud Datastore) take a long time as well. Like somewhere between 100ms and 900ms depending on how much I store. What I did was only store this data every x seconds and set flags on objects that need to be saved next run. Of course you can only do this if you're willing to loose some data when your program crashes.
You'll need something to keep track of what servers are running and which server is responsible for which piece of data
Set up something that checks whether your connection is alive. For example send echoRequests and echos every x time. The sooner you detect a faillure the better. Note however if your reactor is blocked by some cpu intensive task it will not send your echo in time.
If you're not in control of how much data comes in you'll have to slow down or penalize connections that would otherwise take up all of your server time.
EDIT: I only now see that you're looking into redis. As far as I know it's a good queueing system. Use that if you can. Implementing the stuff above would take a lot of time to get it right.

Creating C++ client app for some abstract windows server - how to manage TCP connection to server speed?

So we have some server with some address port and ip. we are developing that server so we can implement on it what ever we need for help. What are standard/best practices for data transfer speed management between C++ windows client app and server (C++)?
My main point is in how to get how much data can be uploaded/downloaded from/to client via his low speed network to my relatively super fast server. (I need it for set up of his live stream Audio/Video bit rate)
My try on explaining number 3.
We do not care how fast is our server. It is always faster than needed. We care about client tyring to stream out to our server his media. he streams encoded (via ffmpeg) live video data to our server. But he has say ADSL with 500kb/s of outgoing traffic. Also he uses some ICQ or what so ever so he has less than 500 kb/s per second. And he wants to stream live video! So we need to set up our ffmpeg to encode video with respect to the bit rate user can provide. We develop server side and client side. We need a way of finding out how much user can upload per second currently (so value can change dynamically over time)
Check this CodeProject Article
it's dot-net but you can try figure out the technique from there.
I found what I wanted. "thrulay, network capacity tester" A C++ code library for Available bandwidth tracking in real time on clients. And there is "Spruce" and it is also oss. It is made using some of linux code but I use Boost library so it will be easy to rewrite.
Offtop: I want to report that there is some group of people on SO down voting on all questions on this topic - I do not know why they are so angry but they deffenetly exist.

What would you use to implement a fast and lightweight file server?

I need to have as part of a desktop application a file server which should respond as fast as possible to file transfer requests (from remote clients, usually located on the same LAN). There will be many file requests for small sized files. The server should be able to provide both upload and download services.
I am not tight to any particual technology so I am open to any programming language, toolkits, libraries as long as they can run on Windows.
My initial take is to go with a C/C++ implementation using Windows Sockets or use the services provided by libraries such as Boost (asio or such). I have also thought of Erlang but that I'll have to learn and so the performance benefits should justify the increased development time due to having to learn the language.
LATER EDIT: I appreciate the answers that say use FTP or HTTP or basically anything that has been already created but considering you still want to write one from scratch, what would you do?
Why not just go with FTP? You should be able to find an adequate server implementation in any language, and client access libraries too.
It sounds like a lot of wheel-reinvention. Granted, FTP is not ideal, and has a few odd spots, but ... it's there, it's standard, well-known, and already very widely implemented.
For frequent uploads of small files, the fastest way would be to implement your own proprietary protocol, but that would require a considerable amount of work - and also it would be non-standard, meaning future integration would be difficult unless you are able to implement your protocol in any client you'll support. If you choose to do it anyway, this is my suggestion for a simple protocol:
Command: 1 byte to identify what'll be done: (0x01 for upload request, 0x02 for download request, 0x11 for upload response, 0x12 for download response, etc).
File name: can be fixed-size or prefixed with a byte for the length (assuming the name is less than 255 bytes)
Checksum, MD5 for instance (if upload request or download response)
File size (if upload request or download response)
payload (if upload request or download response)
This could be implemented on top of a simple TCP socket. You can also use UDP, avoiding the cost of establishing a connection but in this case you have to deal with retransmission control.
Before deciding to implement your own protocol, take a look at HTTP libraries like libcurl, you could make your server use standard HTTP commands like GET for download and POST for upload. This would save a lot of work and you'll be able to test the download with any web browser.
Another suggestion to improve performance is to use as the file repository not the filesystem, but something like SQLite. You can create a single table containing one char column for the file name and one blob column for the file contents. Since SQLite is lightweight and does an efficient caching, you'll most of the time avoid the disk access overhead.
I'm assuming you don't need client authentication.
Finally: although C++ is your preference to give you raw native code speed, rarely this is the major bottleneck in this kind of application. Most probably will be disk access and network bandwidth. I'm mentioning this because in Java you'll probably be able to make a servlet to do exactly the same thing (using HTTP GET for download and POST for upload) with less than 100 lines of code. Use Derby instead of SQLite in this case, put that servlet in any container (Tomcat, Glassfish, etc) and it's done.
If all the machines are running on Windows on the same LAN, why do you need a server at all? Why not simply use Windows file sharing?
I would suggest not to use FTP, or SFTP, or any other connection oriented technique. Instead, go for a connectionless protocol or technique.
The reason is that, if you require lots of small files to be uploaded or downloaded, and the response should be as fast as possible, you want to avoid the cost of setting up and destroying connections.
I would suggest that you look at either using an existing implementation or implementing your own HTTP or HTTPS server/service.
Your bottlenecks are likely to come from one of the following sources:
Harddisk I/O - The WD velociraptor is supposed to have a random access speed of about 100MB/s. Also, it is important whether you set it up as RAID0,1,5 or what nots. Some read fast but write slow. Trade-offs.
Network I/O - Assuming that you have the fastest harddisks in a fast RAID setup, unless you use Gbit I/O, your network will be slow. If your pipes are big, you still need to supply it with data.
Memory cache - The in-memory file-system cache will need to be big enough to buffer all the network I/O so that it does not slow you down. That will require large amounts of memory for the kind of work you're looking at.
File-system structure - Assuming that you have gigabytes worth of memory, then the bottleneck will most likely be the data-structure that you use for the file-system. If the file-system structure is cumbersome it will slow you down.
Assuming that all the other problems are solved, then do you worry about your application itself. Notice, that most of the bottlenecks are outside your software control. Therefore, whether you code it in C/C++ or use specific libraries, you will still be at the mercy of the OS and hardware.
Sounds like you should use an SFTP (SSH) server, it's firewall/NAT safe, secure, and already does what you want and more. You could also use SAMBA or windows file sharing for an even more simple implementation.
Why not use something existing, for example a normal Web server handles a lot of small files (images) very well and fast.
And lots of people already spent time in optimizing the code.
And the second benefit is that the transfer is done with HTTP which is an established protocol. And is easily switched to SSL if you need more security.
For the uploads, they are also no problem with a script or custom module - with the same method you can also add authorization.
As long as you don't need to dynamically seek the files i guess this would be one of the best solutions.
It's a new part to an existing desktop application? What's the goal of the server? Is it protecting the files that are uploaded/downloaded and providing authentication and/or authorisation? Does it provide some kind of structure for the uploads to be stored in?
One option may be to install Apache HTTP Server on the machine and serve the file via that. Use POST to upload and GET to download.
If the clients are within a LAN could you not just share a drive?

Best messaging medium for real-time SOA applications?

I'm working on a real time application implemented using in a SOA-style (read loosely coupled components connected via some messaging protocol - JMS, MQ or HTTP).
The architect who designed this system opted to use JMS to connect the components. This system is real time so there no need to queue up messages should one component fail (the transaction will simply time out). Further, there is no need for guaranteed delivery or rollback.
In this instance, is there any benefit to using JMS over something like an HTTP web service (speed, resource footprint, etc)?
One thing that I'm thinking is since the JMS approach requires us to set a thread pool size (the number of components listening to a JMS topic/queue), wouldn't a HTTP service be a better fit since this additional configuration is not needed (a new thread is created for each HTTP request making the application scalable to an "unlimited" number of requests until the server runs out of resources).
Am I missing something?
I don't disagree with the points made by S.Lott at all, but here are a couple of points to consider regarding HTTP web services:
Your clients only need to know how to communicate via HTTP - a protocol well supported by just about every modern langauge in one form or another. JMS, though popular, is more specialist than HTTP, and so restricts the languages your interconnected systems can use. Perhaps not an issue for your system at the moment, but will you need to plug in other systems later that might struggle to support JMS connectivity?
Standards like WSDL and SOAP which you could levarage for your services are well supported by many langauges and there are plenty of tools around that will generate code to implement both ends of the pipeline (client and server) for you from a WSDL file, reducing the amount of dev you'll have to do. These standards also make it relatively simple to define and publish the specification of the data you'll be passing between your systems, something you'll presumably have to do by hand using a queueing technology like JMS.
On the downside, as pointed out by S.Lott, JMS gives you functionality that you throw away using the (stateless) HTTP protocol: guaranteed ordering & reliability; monitoring; scalability; etc. Are you sure you don't need these, and won't need these going forward?
Great question, btw.
I think it's really dependent on the situation. Where I work, we support Remoting, JMS, MQ, HTTP, and sFTP. We are implementing a middleware appliance that speaks Remoting, JMS, MQ, and HTTP, and a software middleware component that speaks JMS, MQ, and HTTP.
As sgreeve alluded to above, standards help us become flexible, but proprietary formats allow more functionality.
In a nutshell, I'd say use HTTP for stateless calls (which could end up meeting almost all of your needs), and whatever proprietary formats you need for stateful calls. If you work in a big enterprise, a hardware appliance is usually a great fit as middleware: Lightning fast compression, encryption, transformation, and translation, with very low total cost of ownership.
I don't know enough about your requirements, but you may be overlooking Manageability, Flexibility and Performance.
JMS allows you to monitor and manage the queue. These are features HTTP lacks, and you'd have to build rather than buy from a vendor.
Also, There are queues and topics in JMS, allowing multiple subscribers to a single publisher. Not possible in HTTP.
While you may not need those things in release 1.0, you might want them in the future.
Also, JMS may be able to use other transport mechanisms like named sockets, which reduces the overheads if there isn't all that socket negotiation going on with (almost) every request.
If you go down the HTTP route and you want to support more than one machine or some kind of reliability - you are going to need a load balancer capable of discovering the available web servers and loading requests across them - then failing over to another web server if a particular box/process dies. Clients making HTTP requests are also going to have to deal with servers failing and retrying operations in some loop.
This is one of the main features of a message queue - reliable load balancing with failover and loose coupling among the producers and consumers without them having to include retry logic - so your client or server code doesn't have to worry about this kinda thing. This is totally separate to whether or not you want message persistence or want to use ACID transactions to produce/consume messages (which can be very handy BTW).
If you focus just on the server side using Java - whether Servlets or MessageListener/MDBs they are kinda similar either way really. The difference is the load balancer.
So maybe the question should really be - is a JMS broker easier to setup & work with than setting up your DNS/NAT/IP/HTTP load balancer infrastructure?
I suppose it depends on what you mean by real-time... Neither JMS nor HTTP in my opinion support "real-time" applications well, meaning they cannot offer predictable/deterministic performance nor properly prioritize flows in the presence of contention.
Part of it is that these technologies are built on top of TCP which serializes all traffic into a single FIFO meaning that different traffic flows cannot be easily prioritized. Moreover TCP timers are not easily controlled resulting unpredictable blocking and timeouts... For this reason many streaming applications use UDP instead of TCP as an underlying protocol.
Another problem with JMS is that typical implementations use a broker that centralizes message dispatch. This is not the best architecture to get deterministic performance.
If you are looking for a middleware that can offer you the kind of reliability guarantees and publish-subscribe semantics you get with JMS, but was developed to fit the real-time application domain I recommend you take a look at the OMG Data-Distribution Service (DDS). See dds.omg.org and this article I wrote arguing why DDS is the best middleware to implement a real-time SOA. http://soa.sys-con.com/node/467488