I've seen the example on the Azure GitHub, and you can see that when uploading data it uses a stream and also provides the MD5 hash of that stream.
My main goal is to upload a file into Azure and provide its MD5; if the local MD5 and the one calculated by Azure don't match, the upload should return an error.
I also know that I can use the uploadFrom function, which takes a filename, opens it, and takes care of the "chunking" etc. The main problem is that it doesn't let me specify an MD5 hash the way upload does: uploadFrom accepts a different type of options structure, one that doesn't have a TransactionalContentHash member.
Is there any functionality in Azure that would allow me to send a file and provide an MD5? I know I can open the file, read it in chunks, calculate the MD5, and send the chunks one by one, but I want to avoid that headache if possible.
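For reference, the chunked MD5 calculation itself is only a few lines in Python; the hashlib part below is standard library, while the commented azure-storage-blob call and its validate_content flag are an assumption about the Python SDK and may not match the SDK you are actually using:

```python
import hashlib

def file_md5(path, chunk_size=4 * 1024 * 1024):
    """Compute the MD5 of a file without loading it into memory at once."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.digest()

# With the Python SDK (azure-storage-blob), validate_content=True asks the
# library to send a Content-MD5 per chunk so the service can verify it.
# This is an assumption about the Python SDK; other SDKs expose it differently.
# from azure.storage.blob import BlobClient
# blob = BlobClient.from_connection_string(conn_str, "container", "blob.bin")
# with open("data.bin", "rb") as f:
#     blob.upload_blob(f, overwrite=True, validate_content=True)
```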
It depends on what you want to do. If all you want is an online checksum calculator, it's not worth it.
If you do want that, the simplest way is probably two uploads: one of the file (the data) and one of a checksum file (the "control" file), with input validation of course; then proceed from there, calculating the MD5 and checking it against the control file.
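As a rough illustration of that two-upload idea (the file names and helper names below are made up), the client writes a small control file holding the MD5 next to the data file, and the server recomputes the hash and compares:

```python
import hashlib
from pathlib import Path

def write_control_file(data_path: str) -> str:
    """Client side: write <file>.md5 next to the data file."""
    digest = hashlib.md5(Path(data_path).read_bytes()).hexdigest()
    control_path = data_path + ".md5"
    Path(control_path).write_text(digest)
    return control_path

def verify_against_control_file(data_path: str) -> bool:
    """Server side: recompute the MD5 and compare it with the control file."""
    expected = Path(data_path + ".md5").read_text().strip()
    actual = hashlib.md5(Path(data_path).read_bytes()).hexdigest()
    return actual == expected
```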
If you want to create a repository for another business process, then it's a different story. In fairness, what you described is not how a validation flow is designed at all. You don't ask a client for data plus their version of a checksum, then check the data and validate that the two checksums are identical. The reason behind this is simple: a server should never, ever trust a client.
Imagine this scenario: there is data. It can be a file or transient data; it's irrelevant. I edited it (again, the purpose is irrelevant: I can do it to sabotage, for fun, or because I am infecting the data with malware). I know the new checksum is different and I know you will catch that. So what I do is send you the altered file and the new checksum, not the one the file had before I edited it. Now you check the data and see the checksums are identical. You tell me "thank you sir, you are honest and did not modify anything". But I did, and I lied, didn't I? :)
The real flow of a validation approach is this:
The data to be sent was agreed upon with the client beforehand, so you have a list of the valid checksums
The client sends the data to the server
The server calculates the checksum and checks whether it matches
If it doesn't, the server rejects the data; if it does, it moves on to the next step (a minimal sketch of this check follows below)
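A minimal sketch of that server-side check, assuming the agreed-upon checksums are simply kept in a dictionary (in practice they would live in a database or configuration, and the value shown is just an example):

```python
import hashlib

# Checksums agreed upon beforehand, keyed by an identifier the client sends.
KNOWN_CHECKSUMS = {
    "invoice-2024-01.csv": "9e107d9d372bb6826bd81d3542a419d6",  # example value
}

def accept_upload(name: str, payload: bytes) -> bool:
    """Server side: recompute the checksum and compare it with the agreed value."""
    expected = KNOWN_CHECKSUMS.get(name)
    if expected is None:
        return False  # nothing was agreed for this name, reject
    actual = hashlib.md5(payload).hexdigest()
    return actual == expected
```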
Now I know there are sites where they give you files and their checksum in a separate file, but that is the reversed flow: server to client. In theory, a client could trust a server.
There is an approach, though, that works in a way similar to the one you describe: the client sends data to the server (or to a centralized service that distributes the data to other clients) along with a sort of checksum, and the server uses that to validate that nobody changed the data along the way (e.g. MITM attacks).
In that case you should consider secure transmission (look up HTTPS and TLS) and digital signatures (also look up certificates and public/private keys). While this paradigm is usually used for web services rather than file uploads, it can be done if the file is packaged by an app on the client side (for example as a p7s file). For that you need a key exchange between your server and the client.
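To make the idea concrete, here is a heavily simplified sketch using the third-party cryptography package: the client signs the payload with its private key and the server verifies it with the matching public key (key distribution, certificates and the actual transport are all left out):

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# In reality the key pair is generated once and the public key is exchanged
# with the server beforehand (e.g. inside a certificate).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

payload = b"file contents go here"

# Client: sign the payload.
signature = private_key.sign(
    payload,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Server: verify; raises InvalidSignature if the payload was tampered with.
public_key.verify(
    signature,
    payload,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
```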
Small caveat: MD5 is not secure. Don't rely on it for sensitive data.
Related
I have to dump data from various sources into Informatica. The sources are some manual files that would be dropped via an SFTP server, some APIs, and some direct DB connections. In that case, how do we pick up the files from the server: via some kind of connection to the SFTP server, an API endpoint connection, or a DB connection via a DB endpoint? And in these cases, how do we authenticate? I don't want to use username/password; is there a way to use an Active Directory connection?
How does Informatica verify whether the source of the files is genuine?
If you mean the source itself, then you need to decide whether the source is genuine before you create a connection to it.
If you mean how to secure the connection, then that is a property of the source and defined by the owner of the source. Informatica can use almost any industry-standard secure protocol and authentication method.
Any way to scan for malicious files?
Informatica can implement any business rules you want to define to determine whether the data in a file is malicious.
If you are asking whether there is a "magic button" you can press that will tell you if a file is malicious, then the answer is no.
Answer to Question about PocketETL
Once you've identified all the functionality required to implement your overall architecture, you have 2 basic options for how you satisfy these requirements:
Identify a single tool that covers as much of the functionality as possible and then fill in the gaps with other tools
simplest to implement
should "just work"
unlikely to be "best of breed" in all areas
unlikely to be the cheapest solution
Implement point solutions for each area of functionality
likely to be a better solution, for you, in each area
may be cheaper
but you have to get all the components working together, which is unlikely to be trivial
you need to know how to implement and configure multiple products, not just one
So you could use Informatica to do everything, or you could use PocketETL to do the first piece of data movement and then other tools to implement the rest of the data pipeline.
The project I'm working on logs data on distributed devices, and that data needs to be collected into a single database on a remote server.
The logs cannot be streamed as they are recorded (the network may not be available, etc.), so they must be sent occasionally as bulky 0.5-1 GB text-based CSV files.
As far as I understand, this means having a web service receive the data in the form of POST requests is out of the question because of the file sizes.
So far I've come up with this approach: use some file transfer protocol (FTP or similar) to upload files from the device to the server. Devices would have to figure out a unique filename to do this. Then have the server periodically check for new files, process them by committing them to the database, and delete them afterwards.
It seems like a very naive way to go about it, but simple to implement.
However, I want to avoid any pitfalls before I implement any specifics. Is this approach scalable (more devices, larger files)? Implementation will be done either on a private/company-owned server or on a cloud service (Azure, for instance) - will it work on different platforms?
You could actually do this over web/HTTP as well, after raising the request limits in the web server (post_max_size and upload_max_filesize for PHP). This allows devices to interact regardless of platform. It shouldn't be too hard to make a POST request to the server from any device; a simple cURL request could get the job done.
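For example, a device could push a file with a plain HTTP POST; here is a minimal sketch with Python's requests library, where the endpoint URL is a placeholder:

```python
import requests

def upload_log(path: str) -> None:
    # https://example.com/upload is a placeholder endpoint.
    with open(path, "rb") as f:
        # Passing the file object as the body lets requests stream it instead
        # of loading the whole 0.5-1 GB file into memory.
        response = requests.post(
            "https://example.com/upload",
            data=f,
            headers={"Content-Type": "application/octet-stream"},
            timeout=600,  # large files need a generous timeout
        )
    response.raise_for_status()
```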
FTP is also possible. Or SCP, to make it safer.
Either way, I think this does need some application on the server to be able to fetch and manage these files using a database. Perhaps a small web application? ;)
As for the unique name, you could use a combination of the device's unique ID/name along with the current Unix time. You could even hash this (MD5/SHA-1) afterwards if you like.
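A sketch of that naming scheme (the device_id value would come from wherever you keep the device's identity):

```python
import hashlib
import time

def unique_upload_name(device_id: str, extension: str = "csv") -> str:
    """Combine the device identity and the current Unix time, then hash it."""
    raw = f"{device_id}-{int(time.time())}"
    return f"{hashlib.sha1(raw.encode()).hexdigest()}.{extension}"

# e.g. unique_upload_name("sensor-042") -> "<40-hex-char sha1>.csv"
```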
I would like to ensure that our web service works, but I don't know how to do it because the web service's data is controlled by a back office and changes multiple times every day.
The data loaded by the web service doesn't come from a database but from JSON files that are dynamically loaded and distributed. I've considered replacing those files when testing the behavior, but bad data is a frequent cause of dysfunction, so I would rather test both simultaneously, or at least have some way to ensure that the data is valid for the currently deployed sources.
I would also welcome book suggestions.
This is a big problem and it is difficult to find a single solution. Instead, you should split the task into smaller subtasks:
Does the web service work at all? Connect to it and perform normal operations. If you are using real data, you cannot verify that it is correct; just check that you get a valid-looking reply. You should also have a known set of data on a different server, call it staging. There you can verify that a new version of the web service gives correct output.
How do you check that the files you get from the back office are valid? It is not efficient to test them manually just before deployment. You mentioned several reasons why this is not possible, so you have to live with it. Because your files are JSON, it should be possible to write a test suite that checks their validity (see the sketch after this list).
How do you check that the real JSON files produce correct output in the web service? This is your original question. You have a set of JSON files: how easy is it to calculate what the web service should respond based on them? In some cases you would need to write your own web service engine, which is why testers usually do the first two steps first.
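As a starting point for the test suite mentioned in the second step, a validity check over a directory of JSON files can be very small; the directory layout and REQUIRED_KEYS below are placeholders for whatever your back office actually produces:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"id", "name", "price"}  # placeholder for your real schema

def check_json_files(directory: str) -> list:
    """Return a list of problems found; an empty list means the files look valid."""
    problems = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            data = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: not valid JSON ({exc})")
            continue
        if not isinstance(data, dict):
            problems.append(f"{path.name}: expected a JSON object at the top level")
            continue
        missing = REQUIRED_KEYS - data.keys()
        if missing:
            problems.append(f"{path.name}: missing keys {sorted(missing)}")
    return problems
```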
I have two Django servers, each with its own database, and I want to exchange some specific objects between them over HTTP.
Actually, I planned to create some views that generate XML output on one side to be imported on the other side. Is there a nicer way?
Is there a reason this needs to happen over HTTP?
If you just want to read data from one server to be used on the other, you could create a simple API that returns a representation of the object you queried for (in XML/JSON or whatever other format you want); a sketch of this option follows below.
If there is going to be a decent amount of processing going on, or slow communication, and you don't need it to happen real time (in the request/response cycle), you could look at a message queue. Something like RabbitMQ for instance.
If you want both servers to have direct access to both databases, you could try to take advantage of Django's multiple database support.
If it's more of a one-off copy of data, just write a small (non-Django) script to do it.
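For the simple-API option mentioned above, a rough sketch could look like this; the Book model, URL and field names are all made up for illustration:

```python
# Server that owns the data: a read-only JSON view (Book and its fields are made up).
from django.http import JsonResponse
from myapp.models import Book  # hypothetical model

def book_list(request):
    books = list(Book.objects.values("id", "title", "author"))
    return JsonResponse({"books": books})

# Other server: pull the objects over HTTP and store them locally.
import requests

def import_books():
    url = "https://server-a.example.com/api/books/"  # placeholder URL
    payload = requests.get(url, timeout=30).json()
    for item in payload["books"]:
        Book.objects.update_or_create(
            id=item["id"],
            defaults={"title": item["title"], "author": item["author"]},
        )
```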
I have to transfer some files to a third party. We can invent the file format, but we want to keep it simple, like CSV. These won't be big files (a few tens of MB at most) and there won't be many (three files per night).
Our preference for the protocol is SFTP. We've done this a lot in the past and we understand it well.
Their preference is to do it via a web service/SOAP/HTTPS call.
The reason they give is reliability, mainly around knowing that they've fully received the file.
I don't buy this as a killer argument. You can easily build something into your file transfer process using SFTP to make sure the transfer has completed, e.g. use headers/footers in the files, move files between directories, etc.
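One common way to do the "move between directories" trick is to upload under a temporary name and rename only once the transfer finishes, so the receiver never picks up a partial file; here is a sketch with Python's paramiko library, where the host, credentials and paths are placeholders:

```python
import paramiko

def sftp_upload_atomic(local_path: str, remote_path: str) -> None:
    # Placeholder host/credentials; real code would use key-based auth.
    transport = paramiko.Transport(("sftp.example.com", 22))
    transport.connect(username="user", password="secret")
    sftp = paramiko.SFTPClient.from_transport(transport)
    try:
        tmp_path = remote_path + ".part"
        sftp.put(local_path, tmp_path)      # upload under a temporary name
        sftp.rename(tmp_path, remote_path)  # rename only after the upload completes
    finally:
        sftp.close()
        transport.close()
```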
The only other argument I can think of is that over http(s), ports 80/443 will be open, so there might be less firewall work for our infrastructure guys.
Can you think of any other arguments either way on this? Is there a consensus on what would be best practice here?
Thanks in advance.
File completeness is a common issue in "managed file transfer". If you went for a compromise "best practice", you'd end up running either AS/2 (a web service-ish way to transfer files that incorporates non-repudiation via signed integrity checks) or AS/3 (same thing over FTP or FTPS).
One of the problems with file integrity and SFTP is that you can't arbitrarily extend the protocol like you can with FTP and FTPS. In other words, you can't add an XSHA1 command to your SFTP transfer just because you want to.
Yes, there are other workarounds (like transactional files that contain hashes of files received), but at the end of the day someone's going to have to do some work...but it really shouldn't be this hard.
If the third party you're talking to really doesn't have a non-web-service way to accept large files, you might be their guinea pig as they try to navigate a brand new world. (Or they may have just fired all their transmissions folks and are only now realizing that the world doesn't operate on SOAP...yet - seen that happen too.)
Either way, unless they GIVE you the magic code/utility/whatever to do the file-to-SOAP transaction for them (and that happens too), I'd stick to your sftp guns until they find the right guy on their end to talk bulk data transmissions.
SFTP is a protocol for secure file transfer; SOAP is an API protocol, which can be used for sending file attachments (i.e. MIME attachments) or Base64-encoded data.
SFTP adds additional potential complexity around separate processes for encrypting/decrypting files (at rest, if they contain sensitive data), file archiving, data latency, coordinating job scheduling, and setting up FTP service accounts.