sftp versus SOAP call for file transfer - web-services

I have to transfer some files to a third party. We can invent the file format, but want to keep it simple, like CSV. These won't be big files - a few 10s of MB at most and there won't be many - 3 files per night.
Our preference for the protocol is sftp. We've done this lots in the past and we understand it well.
Their preference is to do it via a web service/SOAP/https call.
The reason they give is reliability, mainly around knowing that they've fully received the file.
I don't buy this as a killer argument. You can easily build something into your file transfer process using sftp to make sure the transfer has completed, e.g. use headers/footers in the files, or move files between directories, etc.
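For illustration, here's a minimal sketch (in C++) of that footer idea: the sender appends a trailer row carrying the record count, and the receiving job refuses to process a file whose trailer is missing or whose count doesn't match. The file name and trailer layout are made up; a real feed would use whatever format the two parties agree on.

    // Receiver-side completeness check for a CSV with a "TRAILER,<row count>" last line.
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    bool file_is_complete(const std::string& path) {
        std::ifstream in(path);
        if (!in) return false;

        std::vector<std::string> lines;
        std::string line;
        while (std::getline(in, line))
            if (!line.empty()) lines.push_back(line);

        if (lines.empty()) return false;

        // Expect a trailer such as "TRAILER,42" as the last line.
        const std::string& trailer = lines.back();
        if (trailer.rfind("TRAILER,", 0) != 0) return false;

        std::size_t declared = 0;
        std::istringstream count(trailer.substr(8));
        if (!(count >> declared)) return false;

        std::size_t actual = lines.size() - 1;   // data rows, excluding the trailer
        return declared == actual;
    }

    int main() {
        // "nightly_extract.csv" is a placeholder name for one of the nightly files.
        std::cout << (file_is_complete("nightly_extract.csv")
                          ? "complete\n" : "incomplete or truncated\n");
    }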
The only other argument I can think of is that over http(s), ports 80/443 will be open, so there might be less firewall work for our infrastructure guys.
Can you think of any other arguments either way on this? Is there a consensus on what would be best practice here?
Thanks in advance.

File completeness is a common issue in "managed file transfer". If you went for a compromise "best practice", you'd end up running either AS/2 (a web service-ish way to transfer files that incorporates non-repudiation via signed integrity checks) or AS/3 (same thing over FTP or FTPS).
One of the problems with file integrity and SFTP is that you can't arbitrarily extend the protocol like you can with FTP and FTPS. In other words, you can't add an XSHA1 command to your SFTP transfer just because you want to.
Yes, there are other workarounds (like transactional files that contain hashes of files received), but at the end of the day someone's going to have to do some work...but it really shouldn't be this hard.
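As a rough sketch of that "transactional file containing a hash" workaround, assuming OpenSSL is available and a sidecar naming convention (data.csv alongside data.csv.sha1, a hex digest) that both sides have agreed on:

    // Receiver recomputes SHA-1 of the data file and compares it to the sender's control file.
    #include <openssl/evp.h>
    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    std::string sha1_hex(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        EVP_MD_CTX* ctx = EVP_MD_CTX_new();
        EVP_DigestInit_ex(ctx, EVP_sha1(), nullptr);

        std::vector<char> buf(64 * 1024);
        while (in.read(buf.data(), buf.size()) || in.gcount() > 0)
            EVP_DigestUpdate(ctx, buf.data(), static_cast<size_t>(in.gcount()));

        unsigned char digest[EVP_MAX_MD_SIZE];
        unsigned int len = 0;
        EVP_DigestFinal_ex(ctx, digest, &len);
        EVP_MD_CTX_free(ctx);

        std::ostringstream hex;
        for (unsigned int i = 0; i < len; ++i)
            hex << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(digest[i]);
        return hex.str();
    }

    int main() {
        std::ifstream control("data.csv.sha1");   // placeholder control-file name
        std::string expected;
        control >> expected;                       // digest published by the sender

        if (sha1_hex("data.csv") == expected)
            std::cout << "transfer verified, safe to process\n";
        else
            std::cout << "digest mismatch, file incomplete or corrupted\n";
    }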
If the third party you're talking to really doesn't have a non-web-service call to accept large files, you might be their guinea pig as they try to navigate a brand new world. (Or they may have just fired all their transmissions folks and are only now realizing that the world doesn't operate on SOAP...yet - seen that happen too.)
Either way, unless they GIVE you the magic code/utility/whatever to do the file-to-SOAP transaction for them (and that happens too), I'd stick to your sftp guns until they find the right guy on their end to talk bulk data transmissions.

SFTP is a protocol for secure file transfer; SOAP is an API protocol, which can be used for sending file attachments (i.e. MIME attachments) or as Base64-encoded data.
SFTP adds additional potential complexity around separate processes for encrypting/decrypting files (at rest, if they contain sensitive data), file archiving, data latency, coordinating job scheduling, and setting up FTP service accounts.
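To make the Base64 option concrete, here is a hedged sketch that reads a file, Base64-encodes it with OpenSSL's EVP_EncodeBlock, and wraps it in a SOAP-style body. The envelope and element names are invented for illustration; a real integration would follow whatever the partner's WSDL defines.

    #include <openssl/evp.h>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    std::string base64_encode(const std::vector<unsigned char>& data) {
        std::string out(4 * ((data.size() + 2) / 3) + 1, '\0');
        int n = EVP_EncodeBlock(reinterpret_cast<unsigned char*>(&out[0]),
                                data.data(), static_cast<int>(data.size()));
        out.resize(n);
        return out;
    }

    int main() {
        std::ifstream in("report.csv", std::ios::binary);   // placeholder file name
        std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                         std::istreambuf_iterator<char>());

        // Illustrative envelope only; element names are not from any real WSDL.
        std::string envelope =
            "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
            "<soap:Body><UploadFile><FileName>report.csv</FileName><Content>" +
            base64_encode(bytes) +
            "</Content></UploadFile></soap:Body></soap:Envelope>";

        std::cout << envelope.substr(0, 200) << "...\n";   // just show the start
    }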

C++: MD5 Value of uploaded file on Azure

I've seen the example from the Azure GitHub repo, and you can see that when uploading data it uses a stream and also provides the MD5 hash of that stream.
My main purpose is to upload a file to Azure and provide the MD5; if the local MD5 and the one calculated by Azure don't match, it will return an error.
I also know that I can use the uploadFrom function, which takes a filename, opens it, and takes care of the "chunking" etc. The main problem is that it doesn't allow me to specify an MD5 hash the way upload does. uploadFrom accepts a different type of options structure, which doesn't have a TransactionalContentHash member.
Is there any functionality in Azure which would allow me to send a file and provide the MD5? I know I can open it, read chunks, calculate the MD5 and send them one by one, but I want to avoid that headache if possible.
It depends on what you want to do. If you want an online checksum calculator, it's not worth it.
If you still want that, the simplest way is probably to do two uploads: one for the file (the data) and one for a checksum file (the "control" file), with input validation of course; then proceed from there, calculating the MD5 yourself and checking it against the control file.
If you want to create a repository for another business process, then it's a different story. In fairness, what you described is not how a validation flow is designed at all. You don't ask a client for data plus their version of the checksum and then check the data by validating that the two checksums are identical. The reason behind this is simple: a server should never, ever trust a client.
Imagine this scenario: there is data. It can be a file or transient data; it's irrelevant. I edited it (again, the purpose is irrelevant; I can do it to sabotage, for fun, or because I am infecting the data with malware). I know the original checksum no longer matches and that you would catch that. So what I do is send you the altered file and the new checksum, not the one the file had before I edited it. Now you check the data and see the checksums are identical. You tell me "thank you sir, you are honest and did not modify anything". But I did, and I lied, didn't I? :)
The real flow of a validation approach is this (a minimal sketch follows the list):
1. The client and server agree beforehand on what data should be sent, so you have a list of the valid checksums
2. The client sends the data to the server
3. The server calculates the checksum and sees whether it matches
4. If it doesn't, the server rejects the data; if it does, it moves on to the next step
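A minimal sketch of that server-side step, assuming the agreed checksums are already held in a set and using OpenSSL's one-shot EVP_Digest for MD5 (fine as an integrity check, but see the caveat about MD5 below). The payload and the agreed digests are placeholders.

    #include <openssl/evp.h>
    #include <iomanip>
    #include <iostream>
    #include <set>
    #include <sstream>
    #include <string>

    std::string md5_hex(const std::string& payload) {
        unsigned char digest[EVP_MAX_MD_SIZE];
        unsigned int len = 0;
        EVP_Digest(payload.data(), payload.size(), digest, &len, EVP_md5(), nullptr);

        std::ostringstream hex;
        for (unsigned int i = 0; i < len; ++i)
            hex << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(digest[i]);
        return hex.str();
    }

    bool accept_upload(const std::string& payload, const std::set<std::string>& agreed_md5s) {
        // Never trust a digest supplied by the client; recompute it server-side.
        return agreed_md5s.count(md5_hex(payload)) > 0;
    }

    int main() {
        // Well-known MD5 test vector for the pangram below.
        std::set<std::string> agreed = { "9e107d9d372bb6826bd81d3542a419d6" };
        std::string upload = "The quick brown fox jumps over the lazy dog";
        std::cout << (accept_upload(upload, agreed) ? "accepted\n" : "rejected\n");
    }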
Now, I know there are sites that give you files and their checksums in a separate file, but that is the reversed flow: server to client. In theory, a client could trust a server.
There is an approach, though, that works in a way similar to the one you describe: the client sends data to the server (or to a centralized service that distributes the data to other clients) and also sends a sort of checksum, and the server uses that to validate that nobody changed the data along the way (e.g. MITM attacks).
In that case you should consider secure transmission (look up HTTPS and TLS) and digital signatures (also look up certificates and public/private keys). While this paradigm is usually used for web services, not file uploads, it can be done if the file is packaged after being signed and encrypted by an app client-side (for example in a p7s file). For that you need a key exchange between your server and the client.
Small caveat: MD5 is not secure. Don't rely on it for sensitive data.

How do we dump data into Informatica?

I have to dump data from various sources into Informatica. The sources are some manual files which would be dropped via an SFTP server, some APIs, and some direct DB connections. In that case, how do we connect to the files on the server: via some kind of connection to the SFTP server, an API endpoint connection, a DB connection via a DB endpoint? In these cases, how do we authenticate? I don't want to use username/password; is there a way to use Active Directory authentication?
How does Informatica verify that the source of the files is genuine?
If you mean the source itself, then you need to decide whether the source is genuine before you create a connection to it.
If you mean how to secure the connection, then that is a property of the source and is defined by the owner of the source. Informatica can use almost any industry-standard secure protocol and authentication method.
As for "Any way to scan for malicious files?": Informatica can implement any business rules you want to define to determine whether the data in a file is malicious.
If you are asking whether there is a "magic button" you can press that will tell you if a file is malicious, then the answer is no.
Answer to Question about PocketETL
Once you've identified all the functionality required to implement your overall architecture, you have 2 basic options for how you satisfy these requirements:
1. Identify a single tool that covers as much of the functionality as possible and then fill in the gaps with other tools
   - simplest to implement
   - should "just work"
   - unlikely to be "best of breed" in all areas
   - unlikely to be the cheapest solution
2. Implement point solutions for each area of functionality
   - likely to be a better solution, for you, in each area
   - may be cheaper
   - but you have to get all the components working together, which is unlikely to be trivial
   - you need to know how to implement and configure multiple products, not just one
So you could use Informatica to do everything, or you could use PocketETL to do the first piece of data movement and then other tools to implement the rest of the data pipeline.

Uploading large files to server

The project I'm working on logs data on distributed devices that needs to be joined in a single database on a remote server.
The logs cannot be streamed as they are recorded (the network may not be available, etc.), so they must occasionally be sent as bulky 0.5-1 GB text-based CSV files.
As far as I understand, this means having a web service receive the data in the form of POST requests is out of the question because of the file sizes.
So far I've come up with this approach: Use some file transfer protocol (ftp or similar) to upload files from device to server. Devices would have to figure out a unique filename to do this with. Have the server periodically check for new files, process them by committing them to the database and deleting them afterwards.
It seems like a very naive way to go about it, but simple to implement.
However, I want to avoid any pitfalls before I implement any specifics. Is this approach scalable (more devices, larger files)? Implementation will either be done using a private/company-owned server or a cloud service (Azure, for instance) - will it work for different platforms?
You could actually do this through web/HTTP as well, after setting a higher value for POST requests in the web server (post_max_size and upload_max_filesize for PHP). This will allow devices to interact regardless of platform. It shouldn't be too hard to make a POST request to the server from any device; a simple cURL request could get the job done.
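As a sketch of that device-side POST, assuming libcurl: it streams one log file as a multipart/form-data upload. The endpoint URL, form field name, and file name are placeholders; server-side limits such as post_max_size still apply.

    #include <curl/curl.h>
    #include <iostream>

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* curl = curl_easy_init();
        if (!curl) return 1;

        curl_mime* form = curl_mime_init(curl);
        curl_mimepart* part = curl_mime_addpart(form);
        curl_mime_name(part, "logfile");                       // form field name (placeholder)
        curl_mime_filedata(part, "device42_1700000000.csv");   // file streamed from disk (placeholder)

        curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/upload");  // placeholder endpoint
        curl_easy_setopt(curl, CURLOPT_MIMEPOST, form);

        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK)
            std::cerr << "upload failed: " << curl_easy_strerror(res) << "\n";

        curl_mime_free(form);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return res == CURLE_OK ? 0 : 1;
    }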
FTP is also possible. Or SCP, to make it safer.
Either way, I think this does need some application on the server to be able to fetch and manage these files using a database. Perhaps a small web application? ;)
As for the unique name, you could use a combination of the device's unique ID/name along with the current Unix time. You could even hash this (MD5/SHA-1) afterwards if you like.
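Putting the pieces together, a small C++17 sketch: a device-side helper that builds the device-ID-plus-Unix-time file name, and a server-side loop that polls a drop directory, imports each CSV, and deletes it afterwards. The directory path and the process_csv() body are placeholders.

    #include <chrono>
    #include <filesystem>
    #include <iostream>
    #include <string>

    namespace fs = std::filesystem;

    // Device side: build a name that won't collide across devices or uploads.
    std::string unique_name(const std::string& device_id) {
        auto now = std::chrono::system_clock::now().time_since_epoch();
        auto secs = std::chrono::duration_cast<std::chrono::seconds>(now).count();
        return device_id + "_" + std::to_string(secs) + ".csv";
    }

    // Server side: placeholder for "commit the rows to the database".
    void process_csv(const fs::path& file) {
        std::cout << "importing " << file << "\n";
    }

    void poll_incoming(const fs::path& dir) {
        for (const auto& entry : fs::directory_iterator(dir)) {
            if (entry.is_regular_file() && entry.path().extension() == ".csv") {
                process_csv(entry.path());
                fs::remove(entry.path());   // only after a successful import in real code
            }
        }
    }

    int main() {
        std::cout << unique_name("device42") << "\n";   // e.g. device42_1700000000.csv
        // poll_incoming("/srv/ftp/incoming");          // run periodically on the server
    }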

Secure configuration file in clients

In a project we will create a configuration file for each client (it could also be SQLite on each client instead of a configuration file). These files will include critical information like policies, so the end user mustn't be able to add, delete, or change the configuration file or anything in it.
I am considering using Active Directory to prevent users from opening the folder that contains my configuration file.
Is there a standard way to use secure configuration files?
EDIT:
Of course, the speed of reading the file is as important as security.
EDIT2:
I can't do that with a DB server because my policies must be accessible without an internet connection too. A server will update that file or the SQLite tables periodically. And I am using C++.
I'm sorry to crush your hopes and dreams, but if your security is based on that configuration file on the client, you're screwed.
The configuration file is loaded and decrypted by your application and this means that values can be changed using special tools when the application runs.
If security is important, do those checks on the server, not the client.
Security is a fairly broad matter. What happens if your system is compromised? Does someone lose money? Does someone get extra points in a game? Does someone gain access to nuclear missile launch codes? Does someone's medical data get exposed to the public?
All of these are more or less important security concerns, but as you can imagine, nuclear missile launching has far higher requirements for being completely secure than some game where someone may boost their score; money and health obviously end up somewhere in the middle of that range, with lots of other things we could add to the list.
It also matters what type of "users" you are trying to protect against. Is it national level security experts (e.g. FBI, CIA, KGB, etc), hobby hackers, or just normal computer users? Encrypting the file will stop a regular user, and perhaps a hobby hacker, but national security experts certainly won't be foiled by that.
Ultimately, though, if the machine holding the data also knows how to read the data, then you cannot have a completely secure system. The system can be bypassed by reading the key in the code and re-implementing whatever encryption/decryption is part of your "security". And once the data is in plain text, it can be modified, re-encrypted, and stored back.
You can of course make it more convoluted, which will mean that someone will have to have a stronger motive to work their way through your convoluted methods, but in the end, it comes down to "If the machine knows how to decrypt something, someone with access to the machine can decrypt the content".
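To make that concrete, here is a sketch of the kind of check being discussed: the client verifies an HMAC over the configuration file before trusting it. The key necessarily lives in (or is derivable from) the binary, which is exactly the weakness described above, so this only raises the bar for casual tampering. Assumes OpenSSL; the file names and key are placeholders.

    #include <openssl/evp.h>
    #include <openssl/hmac.h>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    static const char kKey[] = "embedded-key-anyone-can-extract";   // the weak point

    std::vector<unsigned char> hmac_sha256(const std::string& data) {
        unsigned char mac[EVP_MAX_MD_SIZE];
        unsigned int len = 0;
        HMAC(EVP_sha256(), kKey, static_cast<int>(sizeof(kKey) - 1),
             reinterpret_cast<const unsigned char*>(data.data()), data.size(), mac, &len);
        return std::vector<unsigned char>(mac, mac + len);
    }

    int main() {
        std::ifstream cfg("policies.cfg", std::ios::binary);
        std::string contents((std::istreambuf_iterator<char>(cfg)), std::istreambuf_iterator<char>());

        std::ifstream sig("policies.cfg.mac", std::ios::binary);
        std::vector<unsigned char> stored((std::istreambuf_iterator<char>(sig)),
                                          std::istreambuf_iterator<char>());

        if (hmac_sha256(contents) == stored)
            std::cout << "config accepted\n";       // deters casual tampering only
        else
            std::cout << "config was modified\n";
    }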
It is up to you (and obviously your "customers" and/or "partners" whose data you are looking after) whether you think that is something you can risk or not.

File web service architecture

I need to implement a web service which can provide requested files to other internal applications or components running on different networks. The files are dispersed across different servers in different locations and can be as big as a few gigabytes.
I am thinking of creating a RESTful web service which will discover the file, redirect the HTTP request to another web service in a different location, and send the file via HTTP.
Is it a good idea to send the file via HTTP or will it be better for the web service to copy the file to the location where requester component could access it?
The biggest problem with distributing large files over HTTP is that you will come across all sorts of limits that prevent it. As a simple example, WCF allows you to configure the maximum payload size, but you can only configure it up to 2 GB. You will likely run across issues like this in all layers of your stack. I doubt any of them are insurmountable (to work around the above limitation you can stream chunks of the file rather than the entire file, although that introduces its own problems), but you will likely have lots of timeouts and random failures, which get fixed by tweaking the configuration of this or that service or client.
Also, when dealing with large files, you have to carefully consider how you deal with the inevitable failures during transfer (e.g. the network drops out). Depending on the specific technologies you use, they may have some "resume" functionality, but you will want to be sure this is reliable before committing to it.
One possibility would be to do what Facebook does when distributing large binaries - use BitTorrent. So, your web-service serves a torrent of the file, not the file itself. The big advantages of BitTorrent are it is very robust, and can scale well. It's worth considering, but it will depend a lot on your environment and specific workload.
If the files you are going to serve do not change often, or do not change at all, you could use many strategies, such as the one advised by RB, or use pure HTTP, which supports partial data operations; see RFC 2616.
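For what it's worth, here is a sketch of the partial-data idea with libcurl: an interrupted download is resumed by asking for bytes from the current local size onward (an HTTP Range request). The URL and local file name are placeholders, and the server must support range requests for this to work.

    #include <curl/curl.h>
    #include <cstdio>
    #include <filesystem>

    int main() {
        const char* url = "https://example.com/files/big.bin";   // placeholder
        const char* local = "big.bin.partial";                   // placeholder

        // How much do we already have on disk? Resume from there.
        curl_off_t have = std::filesystem::exists(local)
            ? static_cast<curl_off_t>(std::filesystem::file_size(local)) : 0;

        std::FILE* out = std::fopen(local, "ab");   // append the remaining bytes
        if (!out) return 1;

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL* curl = curl_easy_init();
        if (!curl) return 1;

        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);           // default callback writes to the FILE*
        curl_easy_setopt(curl, CURLOPT_RESUME_FROM_LARGE, have);  // sends "Range: bytes=<have>-"

        CURLcode res = curl_easy_perform(curl);
        std::fclose(out);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return res == CURLE_OK ? 0 : 1;
    }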
But depending on your usage scenario, I would also suggest taking a look at Amazon Web Services S3 (Simple Storage Service), which probably already does what you are trying to do; it's cheap and has high availability.