The project I'm working on logs data on distributed devices, and that data needs to be consolidated into a single database on a remote server.
The logs cannot be streamed as they are recorded (the network may not be available, etc.), so they must be sent occasionally as bulky 0.5-1 GB text-based CSV files.
As far as I understand, this means having a web service receive the data in the form of POST requests is out of the question because of the file sizes.
So far I've come up with this approach: use some file transfer protocol (FTP or similar) to upload files from device to server. Devices would have to come up with a unique filename to do this. Have the server periodically check for new files, process them by committing them to the database, and delete them afterwards.
It seems like a very naive way to go about it, but simple to implement.
However, I want to avoid any pitfalls before I implement any specifics. Is this approach scalable (more devices, larger files)? Implementation will be done either on a private/company-owned server or on a cloud service (Azure, for instance) - will it work on different platforms?
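For concreteness, here is a rough sketch of the server-side polling/ingest step I have in mind (paths, table and column names are placeholders, not anything that exists yet):

```python
import csv
import glob
import os
import sqlite3  # stand-in for whatever database is actually used

UPLOAD_DIR = "/srv/uploads"   # hypothetical directory the devices upload into
DB_PATH = "/srv/logs.db"      # hypothetical database

def ingest_pending_files():
    conn = sqlite3.connect(DB_PATH)
    for path in glob.glob(os.path.join(UPLOAD_DIR, "*.csv")):
        # NOTE: a real implementation should make sure the file is fully
        # uploaded before touching it, e.g. devices upload to "*.part" and
        # rename once the transfer finishes.
        with conn, open(path, newline="") as f:
            reader = csv.DictReader(f)
            # Column names here are invented; match them to the real CSV layout.
            conn.executemany(
                "INSERT INTO device_logs (device_id, ts, value) VALUES (?, ?, ?)",
                ((row["device_id"], row["timestamp"], row["value"]) for row in reader),
            )
        os.remove(path)  # delete only after the transaction committed
    conn.close()
```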
You could actually do this over the web/HTTP as well, after setting a higher limit for POST requests in the web server (post_max_size and upload_max_filesize for PHP). This will allow devices to interact regardless of platform. It shouldn't be too hard to make a POST request to the server from any device; a simple cURL request could get the job done.
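Or, if Python happens to be available on the devices, a streaming upload could look roughly like this (the endpoint URL and header names below are made up):

```python
import os

import requests

def upload_log(path, device_id):
    # Stream the file as the request body instead of building a multipart
    # upload in memory; endpoint and header names are hypothetical.
    with open(path, "rb") as f:
        resp = requests.post(
            "https://example.com/upload",
            data=f,  # streamed from disk, so a 1 GB file never sits in RAM
            headers={"X-Device-Id": device_id, "X-Filename": os.path.basename(path)},
            timeout=600,
        )
    resp.raise_for_status()
```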
FTP is also possible. Or SCP, to make it safer.
Either way, I think this does need some application on the server to be able to fetch and manage these files using a database. Perhaps a small web application? ;)
As for the unique name, you could use a combination of the device's unique ID/name along with the current Unix time. You could even hash this (MD5/SHA-1) afterwards if you like.
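For example (a quick sketch; the device ID is a placeholder and the hashing step is optional):

```python
import hashlib
import time

def unique_filename(device_id):
    raw = f"{device_id}-{int(time.time())}"                  # e.g. "device42-1700000000"
    return hashlib.sha1(raw.encode()).hexdigest() + ".csv"   # hashed, opaque variant
```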
I'm working on my own "Auto update service", to support automatic updates for every desktop application I create.
Below is my basic idea.
Client
A program that runs as an independent process and is included in every product I make.
When my product runs, the updater runs first and queries the server for a newer version of the product.
If there is a new version, it downloads the binary file from the server and replaces the product's binary with it.
None of the above should require any user input other than choosing whether or not to proceed with the update.
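A rough sketch of that check-and-download flow, in Python just for illustration (the endpoint paths and response fields are invented, not an existing API):

```python
from typing import Optional

import requests

SERVER = "https://updates.example.com"   # hypothetical update server

def check_and_download(product_id, current_version) -> Optional[bytes]:
    # Ask the server for the latest release info for this product (REST-style).
    info = requests.get(f"{SERVER}/products/{product_id}/latest", timeout=30).json()
    if info["version"] == current_version:
        return None  # already up to date
    # Download the new binary; the caller decides whether to apply it.
    resp = requests.get(
        f"{SERVER}/products/{product_id}/{info['version']}/binary", timeout=300
    )
    resp.raise_for_status()
    return resp.content
```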
Server
Product IDs are stored in a database.
For each product, the binary files and release information for each version are stored.
It supports querying products and versions with REST-style HTTP requests and sends back the binary file.
On the server, I need to do something to check that the requesting client is a valid one. So I came up with a way to issue a secret key to each distribution of my product, just like a game CD key, and check it in a header of the HTTP request (a rough sketch of that check appears after my questions below). This is the best I've come up with, but I'm still concerned:
On the server side, is it safe to store secret keys in the database? If not, how should the server store and remember them?
On the client side, is it safe to store secret keys in the client? What if an attacker tries to decompile the client program?
Any other better ideas?
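For reference, the header check I described might look roughly like this on the server side (a sketch only; the header name is made up, and comparing hashes rather than raw keys is just one assumption, not a recommendation):

```python
import hashlib
import hmac

def is_valid_key(request_headers, stored_key_hashes):
    # "X-Product-Key" is a made-up header name. The idea: the client sends its
    # secret key, and the server compares a hash of it against the hashes it
    # keeps, rather than keeping the raw keys around.
    key = request_headers.get("X-Product-Key", "")
    digest = hashlib.sha256(key.encode()).hexdigest()
    return any(hmac.compare_digest(digest, stored) for stored in stored_key_hashes)
```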
I am new to developing web services, so I don't have much knowledge; please understand.
I am trying to find the best practice for streaming images from S3 to a client's app.
I created a grid-like layout using Flutter on a mobile device (similar to Instagram). How can my client access all of its images?
Here is my current setup: the client opens its profile screen (which contains the grid-like layout for all images, sorted by timestamp). This automatically requests all images from the server. My Python 3 backend server uses boto3 to access S3 and DynamoDB tables. The DynamoDB table has a list of all image paths the client has uploaded, sorted by timestamp. Once I get the paths, I use them to download all the images to my server first and then send them to the client.
Basically my server is the middleman, downloading the images and then sending them back to the client. Is this the right way of doing it? It seems that if the client accessed S3 directly, it would be faster, but I'm not sure if that is safe. Plus I don't know how I can give clients access to S3 without giving them AWS credentials...
Any suggestions would be appreciated. Thank you in advance!
What you are doing will work, and it's probably the best option if you are optimising for getting something working quickly, without worrying too much about wasted server resources and unnecessary computation, and if you don't have scalability concerns.
However, if you're worrying about scalability and lower latency, as well as secure access to these image resources, you might want to improve your current architecture.
"Once I get the paths, I use that to download all images to my server first and then send it to the client."
This part is the first thing I would try to get rid of, as you don't really need your backend to download these images and stream them itself. However, it still seems necessary to control access to the resources based on who owns them. I would consider switching to the setup below to improve latency and spend fewer server resources:
Once you get the paths in your backend service, generate presigned URLs for the S3 objects, which will give your client temporary access to these resources (depending on your needs, you can adjust how long a URL stays valid).
Then send these links to your client so that it can fetch the images directly from S3, rather than your server being the middleman.
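Generating such a presigned URL with boto3 is a single call; the bucket name and default expiry below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

def presigned_image_url(key, expires_in=3600):
    # Returns a time-limited URL the app can fetch directly from S3.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-image-bucket", "Key": key},  # bucket name is hypothetical
        ExpiresIn=expires_in,
    )
```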
Once you have this setup working, I would consider using Amazon CloudFront to improve access to your objects through the CDN capabilities that CloudFront gives you, especially if your clients are distributed across different geographical regions. As far as I can see, you can also make CloudFront work with presigned URLs.
"Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe."
Presigned URLs are your way of mitigating uncontrolled access to your S3 objects. You probably need to worry about edge cases, though (e.g. how the client should behave when its access to an S3 object has expired, so that users won't notice, etc.). All of these are the costs of making something work at scale, if you have those scalability concerns.
I want to create an application that, when executed, has runtime functions that are accessible by other applications.
For example, a C++ application that stores values in files and retrieves this information. While this application is running, any other C++ application could access its save and retrieve functionality to save and retrieve data, but would have no other connection to this system.
Sounds like a simple job for web services, or a remote database, or even an LDAP server.
Store and retrieve are operations common to all of these.
If the goal is to learn some specific technology, then ask a more specific question. Otherwise, don't reinvent any wheels. There are plenty of things out there for store and retrieve.
One of the simplest "store and retrieve" APIs I know of is Berkeley DB or Sleepycat.
We built a giant, clustered, simple key-based database for a major telecom company using LDAP on top of Berkeley DB (aka Sleepycat). It was all open-source software on commodity hardware, and it supports mission-critical operations for millions of customers.
A more modern rendition of this might use memcached as well.
If you go HTTP-based, you can use something as simple as libcurl against an Apache web server to implement "RESTful" services with GET and PUT commands.
If you run it locally (same server), and access via localhost (127.0.0.1) then there is very little latency in the TCP stack, and it amounts to little more than memcpys at the kernel level.
Simple message passing would also do, say JSON over ØMQ, or an RPC layer such as msgpack-rpc, protobuf-remote, or Cap'n Proto RPC.
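As a tiny illustration of the message-passing option (Python with pyzmq purely for brevity; in the asker's case it would be C++ with the equivalent bindings), a store/retrieve service over a REQ/REP socket could look like:

```python
import zmq

def serve(store):
    # One process owns the data; other processes connect to the same port and
    # send small JSON requests like {"op": "save", "key": ..., "value": ...}.
    sock = zmq.Context().socket(zmq.REP)
    sock.bind("tcp://127.0.0.1:5555")
    while True:
        req = sock.recv_json()
        if req["op"] == "save":
            store[req["key"]] = req["value"]
            sock.send_json({"ok": True})
        else:  # "retrieve"
            sock.send_json({"ok": True, "value": store.get(req["key"])})
```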
Currently, I'm working on a project where I have a server-client relationship between two Django applications running on separate hosts.
The server has to store and provide a large amount of relational data, e.g. Suppliers, Companies, Products, etc.
The client downloads data on request from the server and adds it to its own database. Clients can also upload from their station to the server's database to expand it.
The previous person who developed this used XMLRPC to transfer the vast (13 MB typical) XML file from server to client. Now, really all we're sending are database-agnostic objects to be stored in a database, so I wondered if there was a more efficient way of doing it?
Please ask for more details if you need them; I wasn't really sure what you'd need to know.
EDIT: Efficient in terms of Networking, and Server Side Processing. Clients can do the heavy lifting.
A shared database design seems more suitable. But of course there may be security, political or organisational reasons ruling that out. Plus there would be significant re-design required.
To reduce network bandwidth first check that HTTP gzip compression is enabled.
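If both ends are Django, gzip can also be enabled at the application level rather than in the front-end web server (a sketch; which layer does the compressing is a deployment choice):

```python
# settings.py - GZipMiddleware compresses responses for clients that send
# "Accept-Encoding: gzip"; it should sit near the top of the middleware list.
MIDDLEWARE = [
    "django.middleware.gzip.GZipMiddleware",
    # ... the rest of your middleware ...
]
```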
If it's just a dumb data transfer JSON would generally be a lot more compact than XMLRPC. Does the data look amenable to a straight translation to JSON? This would still require some server-side processing.
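For example, Django can already serialize querysets straight to JSON, so that processing can stay fairly minimal (the app and Supplier model below are stand-ins for your actual ones):

```python
from django.core import serializers
from django.http import HttpResponse

from myapp.models import Supplier  # hypothetical app/model names


def export_suppliers(request):
    # Serialize the queryset straight to JSON; the client can deserialize it
    # back into objects on its own side.
    data = serializers.serialize("json", Supplier.objects.all())
    return HttpResponse(data, content_type="application/json")
```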
For minimal server-side processing (if the database tables are relatively similar) it may be very efficient to just send the client a dump of the relevant db query. Of course unless the tables have the same schema you would have to do some client-side processing of raw SQL, which is not ideal.
I have to transfer some files to a third party. We can invent the file format, but want to keep it simple, like CSV. These won't be big files - a few 10s of MB at most and there won't be many - 3 files per night.
Our preference for the protocol is sftp. We've done this lots in the past and we understand it well.
Their preference is to do it via a web service/SOAP/https call.
The reason they give is reliability, mainly around knowing that they've fully received the file.
I don't buy this as a killer argument. You can easily build something into your file transfer process using SFTP to make sure the transfer has completed, e.g. use headers/footers in the files, or move files between directories, etc.
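For example, the classic "upload under a temporary name, then rename" convention gives the receiver an unambiguous completion signal (a sketch using Python's paramiko; host, credentials and paths are invented):

```python
import paramiko

def upload_when_complete(local_path, remote_name):
    # Upload under a temporary name, then rename; the receiver only ever
    # processes files that don't end in ".part", so it never sees a partial file.
    transport = paramiko.Transport(("sftp.example.com", 22))   # hypothetical host
    transport.connect(username="feed_user", password="...")    # or key-based auth
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        sftp.put(local_path, remote_name + ".part")
        sftp.rename(remote_name + ".part", remote_name)
    finally:
        sftp.close()
        transport.close()
```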
The only other argument I can think of is that over http(s), ports 80/443 will be open, so there might be less firewall work for our infrastructure guys.
Can you think of any other arguments either way on this? Is there a consensus on what would be best practice here?
Thanks in advance.
File completeness is a common issue in "managed file transfer". If you went for a compromise "best practice", you'd end up running either AS2 (a web-service-ish way to transfer files that incorporates non-repudiation via signed integrity checks) or AS3 (the same thing over FTP or FTPS).
One of the problems with file integrity and SFTP is that you can't arbitrarily extend the protocol like you can with FTP and FTPS. In other words, you can't add an XSHA1 command to your SFTP transfer just because you want to.
Yes, there are other workarounds (like transactional files that contain hashes of files received), but at the end of the day someone's going to have to do some work...but it really shouldn't be this hard.
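One such workaround is shipping a small sidecar file containing the hash next to each data file, so the receiver can verify completeness itself (a sketch; the naming convention is made up):

```python
import hashlib

def write_sha1_sidecar(path):
    # Hash the file in chunks and write "<path>.sha1" next to it; the receiver
    # recomputes the hash and only processes the data file once they match.
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    sidecar = path + ".sha1"
    with open(sidecar, "w") as out:
        out.write(sha1.hexdigest() + "\n")
    return sidecar
```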
If the third party you're talking to really doesn't have a non-web-service way to accept large files, you might be their guinea pig as they try to navigate a brand new world. (Or they may have just fired all their transmissions folks and are only now realizing that the world doesn't operate on SOAP... yet - seen that happen too.)
Either way, unless they GIVE you the magic code/utility/whatever to do the file-to-SOAP transaction for them (and that happens too), I'd stick to your sftp guns until they find the right guy on their end to talk bulk data transmissions.
SFTP is a protocol for secure file transfer; SOAP is an API protocol, which can be used for sending files as attachments (e.g. MIME attachments) or as Base64-encoded data.
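If the SOAP route wins, embedding a file as Base64 is straightforward, at the cost of roughly a third more bytes on the wire (a sketch):

```python
import base64

def file_as_base64(path):
    # Read the whole file (fine for a few tens of MB) and Base64-encode it
    # for embedding in a SOAP/XML payload.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```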
SFTP adds additional potential complexity around separate processes for encrypting/decrypting files (at rest, if they contain sensitive data), file archiving, data latency, coordinating job scheduling, and setting up FTP service accounts.