watermark files on the fly when served - amazon-web-services

I am looking for a very general answer to the feasibility of the idea, not a specific implementation.
If you want to serve small variations of the same media file to different people (say, an ePub or music file), is it possible to serve most of the file to everybody but individualized small portions of the file to each recipient for watermarking using something like Amazon WS.
If yes, would it be possible to create a Dropbox-like file hosting service with these individualized media files, where all users “see” mostly the same physical stored file but with tiny parts of the file served individually? If, say, 1000 users had the same 10 MB MP3 file with different watermarks on a server, that would amount to 10 GB of storage. But if the same 1000 users were served the same file except for a tiny 10 kB individually watermarked portion, it would only amount to about 20 MB in total.

An EPUB is a single file and must be served/downloaded as such, not in pieces. Why don't you implement simple server-side logic to customize the necessary components, build the EPUB from the common assets and the customized ones, and then let users download that?
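Since an EPUB is just a ZIP container, that customization step can be done per download by combining the shared assets with one small personalized entry. Below is a minimal Python sketch of the idea, assuming the common assets (including the rest of the usual EPUB structure) live in a local common/ directory; the directory name, output path, and watermark payload are all hypothetical.

```python
# Minimal sketch: build a personalized EPUB per download from shared assets
# plus one tiny per-user file. All names here are hypothetical.
import os
import zipfile

def build_epub(user_id: str, out_path: str):
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as epub:
        # The mimetype entry must come first and be stored uncompressed.
        epub.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        # Add the shared, identical-for-everyone assets.
        for root, _, files in os.walk("common"):
            for name in files:
                full = os.path.join(root, name)
                epub.write(full, arcname=os.path.relpath(full, "common"))
        # Add the tiny personalized part that carries the watermark.
        epub.writestr("OEBPS/watermark.xhtml",
                      f"<html><body><!-- issued to {user_id} --></body></html>")

build_epub("user-1234", "book-user-1234.epub")
```

Only the small watermark entry differs between users; everything else is read from the same assets on disk, which is what keeps the storage cost close to a single copy of the book.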

The answer is, of course, yes, it can be done, using an EC2 instance -- or any other machine that can run a web server, for that matter. The problem is that each type of media file presents a different level of complexity when it comes to customizing it... from the simplest case, where the file contains a string of bytes at a known position that can simply be overwritten with your watermark data, to a more complex format that would have to be fully or partially disassembled and repackaged every time a download is requested.
The bottom line is that for any format I can think of, the server would spend some amount of CPU resources -- possibly a significant amount -- crunching the data and preparing/reassembling the file for download. The ultimate solution would be very format-specific and, as a side note, really has nothing to do with AWS, other than the fact that you can host web servers in EC2.
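For the simplest case mentioned above, the per-download work can be as small as copying the shared file and overwriting a reserved region. A minimal sketch, assuming the format leaves a fixed-size placeholder at a known offset; the file name, offset, and size are hypothetical.

```python
# Minimal sketch of the "overwrite bytes at a known position" case, assuming
# the media file contains a fixed-size placeholder region reserved for the
# watermark. Offsets, sizes, and file names are hypothetical.
import shutil

MASTER_FILE = "track-master.mp3"
PLACEHOLDER_OFFSET = 1_048_576      # byte offset of the reserved region
PLACEHOLDER_SIZE = 10 * 1024        # 10 kB reserved for per-user data

def serve_watermarked_copy(user_id: str, out_path: str):
    shutil.copyfile(MASTER_FILE, out_path)      # shared bytes, copied per download
    payload = user_id.encode("utf-8").ljust(PLACEHOLDER_SIZE, b"\x00")[:PLACEHOLDER_SIZE]
    with open(out_path, "r+b") as f:            # patch the copy in place
        f.seek(PLACEHOLDER_OFFSET)
        f.write(payload)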

Related

S3 vs EFS propagation delay for distributed file system?

I'm working on a project that uses multiple Docker containers which all need access to the same files for comparison purposes. What's important is that if a file is visible to one container, there is minimal delay before it becomes visible to the other containers.
As an example, here's the situation I'm trying to avoid:
Let's say we have two files, A and B, and two containers, 1 and 2. File A is both uploaded to the file system and submitted for comparison at roughly the same time. Immediately after, the same happens to file B. Soon after, file A becomes visible to container 1 and file B becomes visible to container 2. Due to the way the files propagated on the distributed file system, file B is not yet visible to container 1 and file A is not yet visible to container 2. Container 1 is now told to compare file A to all other files, and container 2 is told to compare file B to all other files. Because of the propagation delay, A and B were never compared to each other.
I'm trying to decide between EFS and S3 as the place to store all of these files. I'm wondering which would better fit my needs (or if there's a third option I'm unaware of).
The characteristics of the files/containers are:
- All files are small text files averaging 2 kB in size (although rarely they can be 10 kB)
- There's currently 20 MB of files in total, but I expect there to be 1 GB by the end of the year
- These containers are not in a swarm
- The output of each comparison is already being uploaded to S3
- Making sure that every file is compared to every other file is extremely important, so the propagation delay is definitely the most important factor
(One last note: if I end up using S3, I would probably use sync to pull down all new files put into the bucket)
Edit: To answer Kannaiyan's questions, what I'm trying to achieve is to have every file compared to every other file at least once. I can't say exactly what I'm comparing, but the comparison happens by executing a closed-source Linux binary that takes in the file you want to compare and the files you want to compare it against (the distributed file system holds all the files I want to compare against). They need to be in containers for two reasons:
The binary relies heavily on a specific file system setup, and containerizing it ensures that the file system will always be correct (I know it's dumb, but again, the binary is closed source and there's no way around it)
The binary only runs on Linux, and containerizing it makes development easier in terms of testing on local machines.
Lastly, the files only accumulate over time as we get more and more submissions. Every file is only read from and never modified after being added to the system.
In the end, I decided that the approach I was originally going for was too complicated. Instead, I ended up using S3 to store all the files, and DynamoDB to act as a cache for the keys of the most recently stored files. Keys are added to the DynamoDB table only after a successful upload to S3. Whenever a comparison operation runs, the containers sync the desired S3 directory and then check the DynamoDB table to see if any files are missing. Due to S3's read-after-write consistency, any missing files can be pulled from S3 directly without needing to wait for propagation to all the S3 caches. This allows for a pretty much instantly propagating distributed file system.
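A minimal sketch of that sync-and-check step using boto3, assuming flat object keys, a DynamoDB table that only holds recently uploaded keys, and hypothetical bucket, table, and directory names.

```python
# Minimal sketch: after syncing, fetch any file listed in DynamoDB that has
# not yet shown up locally. Bucket, table, and directory names are hypothetical.
import os
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

BUCKET = "comparison-files"        # hypothetical S3 bucket
TABLE = "recent-file-keys"         # hypothetical DynamoDB table of recent uploads
LOCAL_DIR = "/data/files"          # directory the comparison binary reads from

def sync_missing_files():
    """Pull any recently uploaded object that is not yet present locally."""
    table = dynamodb.Table(TABLE)
    # A scan is acceptable here because the table only holds recent keys.
    keys = [item["key"] for item in table.scan()["Items"]]
    for key in keys:
        local_path = os.path.join(LOCAL_DIR, key)
        if not os.path.exists(local_path):
            # Read-after-write consistency lets us GET the object directly,
            # without waiting for it to appear in a LIST operation.
            s3.download_file(BUCKET, key, local_path)

if __name__ == "__main__":
    sync_missing_files()
```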

How to optimize loading of a large XAP file

We have inherited a Silverlight application used in banks. It is a huge Silverlight application with a single XAP file of 6.5 MB.
Recently, one of the core banking applications has updated its services to delete the entire browser cache from the users' machines on a daily basis.
This impacts the Silverlight application directly. We cannot afford to download the 6 MB file every day. In the long term I know we need to break this monolith into smaller, manageable pieces and load them dynamically.
I wanted to check if there are any short term alternatives.
Can we have the Silverlight runtime load the XAP into a different directory?
Will making the application an Out-of-Browser application give us any additional flexibility in terms of where we load the XAP from?
Any other suggestions that could give us a short-term solution would be helpful.
What is inside your XAP file? (Change the extension to .zip and check what is inside.)
Are you including images or sound inside your XAP file?
Are all the DLLs included actually necessary?
Short-term alternatives are:
- do some cleanup of your application (remove unused DLLs, images, code, ...)
- re-zip your XAP file using a more powerful compression tool to save some space
Also, some tools exist to "minify" the size of your XAP file (but I have never tried them).
Here is a link that has helped me to reduce my XAP size:
http://www.silverlightshow.net/items/My-XAP-file-is-5-Mb-size-is-that-bad.aspx
Edit to answer your comment:
I would suggest using Isolated Storage.
Quote from http://www.silverlightshow.net/items/My-XAP-file-is-5-Mb-size-is-that-bad.aspx :
Use Isolated Storage: keep in cache XAP files, DLL’s, resources and application data. This won't enhance the first load of the application, but in subsequent downloads you can check whether the resource is already available on the client local storage and avoid downloading it from the server. Three things to bear in mind: It’s up to you building a timestamp mechanism to invalidate the cached information, by default isolated storage is limited to 1 Mb size (by user request it can grow), and users can disable isolated storage on a given app. More info: Basics about Isolated Storage and Caching and sample.
Related links :
http://msdn.microsoft.com/en-us/magazine/dd434650.aspx
http://timheuer.com/blog/archive/2008/09/24/silverlight-isolated-storage-caching.aspx

Replace strings in large file

I have a server-client application where clients are able to edit data in a file stored on the server side. The problem is that the file is too large to load into memory (8 GB+). There could be around 50 string replacements per second invoked by the connected clients, so copying the whole file and replacing the specified string with the new one for every edit is out of the question.
I was thinking about saving all changes in a cache on the server side and performing all the replacements once a certain amount of data has accumulated. At that point I would apply the update by copying the file in small chunks and replacing the specified parts.
This is the only idea I came up with, but I was wondering if there might be another way, or what problems I could encounter with this method.
When you have more than 8 GB of data which is edited by many users simultaneously, you are far beyond what can be handled with a flat file.
You seriously need to move this data to a database. Regarding your comment that "the file content is no fit for a database": sorry, but I don't believe you. Especially regarding your remark that "many people can edit it" - that's one more reason to use a database. On a filesystem, only one user at a time can have write access to a file. But a database allows concurrent write access for multiple users.
We could help you come up with a database schema if you open a new question telling us exactly how your data is structured and what your use cases are.
You could use some form of indexing on your data (in a separate file) to allow quick access to the relevant parts of this gigantic file. We've been doing this successfully with large files (~200-400 GB), but as Phillipp mentioned, you should move that data to a database, especially for the read/write access. Some frameworks (like OSG) already come with a database back end for 3D terrain data, so you can peek there to see how they do it.
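If the replaced strings keep their original byte length, the index idea can be combined with in-place patching, so the big file never has to be copied at all. A minimal sketch, assuming fixed-length records and a separate JSON index mapping each record key to its offset and length; the file names and the index format are hypothetical.

```python
# Minimal sketch of the indexing idea, assuming each replacement has the same
# byte length as the original so the file can be patched in place.
# File names and the index format are hypothetical.
import json

INDEX_PATH = "data.idx"   # maps record key -> [offset, length] in the big file
DATA_PATH = "data.bin"    # the 8 GB+ flat file

def load_index():
    with open(INDEX_PATH, "r", encoding="utf-8") as f:
        return json.load(f)

def replace_record(key, new_value: bytes):
    """Overwrite a fixed-length record in place without rewriting the file."""
    offset, length = load_index()[key]
    if len(new_value) != length:
        raise ValueError("replacement must have the same length as the original")
    with open(DATA_PATH, "r+b") as f:   # open for in-place binary update
        f.seek(offset)
        f.write(new_value)
```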

Structure for storing data from thousands of files on a mobile device

I have more than 32000 binary files that store a certain kind of spatial data. I access the data by file name. The files range in size from 0-400 kB. I need to be able to access the content of these files randomly and at various points in time. I don't like the idea of having 32000+ separate data files installed on a mobile device (even though the total size is < 100 MB). I want to merge the files into a single structure that will still let me access the data I need just as quickly. I'd like suggestions as to the best way to do this. Any suggestions should have C/C++ libs for accessing the data and should have a liberal license that allows inclusion in commercial, closed-source applications without any issue.
The only thing I've thought of so far is storing everything in an SQLite database, though I'm not sure if this is the best method, or what considerations I need to take into account for storing blob data with quick lookup times (i.e., what schema I'd use).
Why not roll your own?
Your requirements sound pretty simple and straightforward. Just bundle everything into a single binary file and add an index at the beginning telling which file starts where and how big it is.
30 lines of C++ code max. Invest a good 10 minutes designing a good interface for it so you can replace the implementation if and when the need arises.
That is of course if the data is read only. If you need to change it as you go, it gets hairy fast.
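A minimal sketch of the bundle layout the answer describes, with a small index at the front mapping each file name to its offset and size. The answer suggests doing this in C++; this Python version only illustrates the layout, and all names are hypothetical.

```python
# Minimal sketch of a single-file bundle: a length-prefixed JSON index at the
# front mapping name -> [offset, size], followed by the concatenated contents.
import json
import os
import struct

def pack(file_paths, bundle_path):
    index, blobs, offset = {}, [], 0
    for path in file_paths:
        data = open(path, "rb").read()
        index[os.path.basename(path)] = (offset, len(data))
        blobs.append(data)
        offset += len(data)
    header = json.dumps(index).encode("utf-8")
    with open(bundle_path, "wb") as out:
        out.write(struct.pack("<I", len(header)))   # 4-byte header length
        out.write(header)                           # the index itself
        out.write(b"".join(blobs))                  # all file contents back to back

def read_entry(bundle_path, name):
    with open(bundle_path, "rb") as f:
        header_len = struct.unpack("<I", f.read(4))[0]
        index = json.loads(f.read(header_len))
        offset, size = index[name]
        f.seek(4 + header_len + offset)             # jump straight to the entry
        return f.read(size)
```

Because every lookup is a single seek plus a read, access stays as fast as opening the individual files, while the device only sees one installed file.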

Is there a way to limit the number of output files of a process?

An application of our company uses pdfimages (from Xpdf) to check whether certain pages in a PDF file, on which we know there is no text, consist of a single image.
For this we run pdfimages on that page and count whether zero, one, or two or more output files have been created (they could be JPG, PPM, PGM, or PBM).
The problem is that for some PDF files, we get millions of 14-byte PPM images, and the process has to be killed manually.
We know that by assigning the process to a job we can restrict how long the process will run for. But it would probably be better if we could ensure that the process creates at most two new files during its execution.
Do you have any ideas for doing that?
Thank you.
One approach is to monitor the directory for file creations: http://msdn.microsoft.com/en-us/library/aa365261(v=vs.85).aspx - the monitoring app could then terminate the PDF image extraction process.
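A minimal sketch of that monitoring idea, assuming pdfimages is on the PATH and writes into a dedicated, initially empty output directory; the page number, input file name, and file limit shown here are hypothetical.

```python
# Minimal sketch: run pdfimages on one page and kill it as soon as it produces
# more output files than expected. Paths, page number, and limit are hypothetical.
import glob
import os
import subprocess
import time

OUTPUT_DIR = "pdfimages-out"            # dedicated, initially empty directory
IMAGE_ROOT = os.path.join(OUTPUT_DIR, "img")
MAX_FILES = 2                           # more than this means "not a single image"

os.makedirs(OUTPUT_DIR, exist_ok=True)
# Extract images from page 12 only (hypothetical page and input file).
proc = subprocess.Popen(["pdfimages", "-f", "12", "-l", "12", "input.pdf", IMAGE_ROOT])

while proc.poll() is None:              # while pdfimages is still running
    if len(glob.glob(IMAGE_ROOT + "*")) > MAX_FILES:
        proc.kill()                     # stop the runaway extraction early
        break
    time.sleep(0.1)

print("output files created:", len(glob.glob(IMAGE_ROOT + "*")))
```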
Another would be to use a simple ramdisk which limited the number of files that could be created: you might modify something like http://support.microsoft.com/kb/257405.
If you can set up a FAT16 filesystem, I think there's a limit of 128 files in the root directory, 512 in other dirs? - with such small files that would be reached quickly.
Also, aside from my 'joke' comment, you might want to check out _setmaxstdio and see if that helps ( http://msdn.microsoft.com/en-us/library/6e3b887c(VS.71).aspx ).