Unzipping Large Files in AWS - amazon-web-services

We've recently run into an issue with file corruption after large files are unzipped. The unzip process completes without error, but the extracted file can be missing the last 5 KB or so.
Our current process: a .ZIP file is downloaded from S3 onto the Linux pod, Perl code using IO::Uncompress::Unzip extracts a single .JSON file, and the .JSON is uploaded back to S3.
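In outline, the unzip step looks something like the following (a simplified sketch with placeholder file names, not our exact code; the error handling here is illustrative, and it assumes the archive holds a single member as described):
use strict;
use warnings;
use IO::Uncompress::Unzip qw($UnzipError);
# Placeholder paths; the real files come from / go back to S3.
my $zip_path  = 'download.zip';
my $json_path = 'extracted.json';
# Single .JSON member assumed; multi-member archives would need nextStream().
my $u = IO::Uncompress::Unzip->new($zip_path)
    or die "Cannot open $zip_path: $UnzipError\n";
open my $out, '>:raw', $json_path or die "Cannot write $json_path: $!\n";
my ( $buffer, $bytes ) = ( '', 0 );
while (1) {
    my $status = $u->read( $buffer, 64 * 1024 );
    die "Read error on $zip_path: $UnzipError\n" if $status < 0;
    last if $status == 0;                           # end of the member
    print {$out} $buffer or die "Write failed: $!\n";
    $bytes += length $buffer;
}
close $out or die "Close failed on $json_path: $!\n";   # a failed close can silently truncate
$u->close;
print "Wrote $bytes bytes to $json_path\n";
Checking the return values of read() and of close() on the output handle is the part we most want to be sure about, to rule out silent truncation.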
There is another layer to the challenge. When we use native Windows or Linux tools locally, the files unzip completely with no missing bytes. However, at times single characters are changed within the file (we've seen corrupted JSON, with "}]}" becoming "}M}", and misspelled words, such as "item" becoming "idem"). This problem seems worse with tools like 7-Zip and WinRAR.
Checking the details on the .ZIP file, it looks like Windows was used for the encoding/compression, which research says uses GBK encoding. I suspect there may be a decoding issue with Linux and with tools that decode as UTF-8, but I've been unable to confirm that. Plus, we've seen even the local Windows unzip process change single characters.
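One way to probe the encoding theory (an illustrative sketch with a placeholder file name, not part of our pipeline) is to check whether an extracted copy even decodes as UTF-8; GBK-encoded content read as UTF-8 will usually fail to decode, whereas the single-character swaps above are plain ASCII and would pass:
use strict;
use warnings;
use Encode ();
my $path = 'extracted.json';    # placeholder: an extracted copy of the JSON
open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
my $bytes = do { local $/; <$fh> };    # slurp raw bytes
close $fh;
# FB_CROAK makes decode() die at the first malformed sequence, reporting the bad byte.
if ( eval { Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK ); 1 } ) {
    printf "%s: %d bytes, decodes cleanly as UTF-8\n", $path, length $bytes;
}
else {
    print "$path is not valid UTF-8: $@";
}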
We've tried using IO::Uncompress::Unzip locally, which resulted in an incomplete file.
We've tried using Archive::Zip locally, which errors out on any files over 4 GB.
We've tried using Compress::Raw::Zlib, but that also didn't work.
We've tried autoflush on the file handle, which also resulted in an incomplete file.
Has anyone encountered similar behaviors?

Related

Is there a way to create a folder that is interpreted by the OS (Windows, macOS, Linux) as a single file?

The reason I need this: there are lots of files and folders inside a "some_important_folder" folder. A user can normally browse into "some_important_folder" and go deeper to see its subfolders and files, as in any normal file explorer. In my use case, though, the user doesn't need to interact with the files and folders inside "some_important_folder" at all. So I was wondering whether there is any way to hide the complexity of "some_important_folder" and present it to the user as a single file, while my programs (written in C++) can still access the files and folders inside it as normal, for example: "C:\Users\user\Documents\some_important_folder\someFolder\someFileThatUserDoesntNeedToKnow.exe"
Something like a .rar or .zip file would do, but since "some_important_folder" might be very big (more than a TB), I don't think it would be good to convert the whole folder to a .zip file, as it would take a lot of redundant space on the hard disk and the process would be very slow.
Have you considered encrypting your folders? That way, if you wanted to access the folder only through your C++ app, you could pass down the password/decryption key for it, making your app the only access point to that folder.
Yes, both Windows and Linux have similar technology.
On Windows, you can use the "Compound File Binary Format". It is a general-purpose file format that provides a file-system-like structure within a file for the storage of arbitrary, application-specific streams of data. In fact, the earlier Office .doc file format is based on this technology. The following are the documentation links from Microsoft and Wikipedia, and I believe you can google some sample code.
https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb/53989ce4-7b05-4f8d-829b-d08d6148375b
https://en.wikipedia.org/wiki/Compound_File_Binary_Format
On Linux, you can loop-mount a file as a file system, as #stark mentioned. You can google "linux loop mount file"; the following is the first article I found:
https://www.jamescoyle.net/how-to/2096-use-a-file-as-a-linux-block-device

Git, SourceTree, Visual Studio and corrupted .cpp files

Been working in C++ using Git (via SourceTree) for version control.
My .cpp files will randomly become seemingly corrupted when I pull the project.
GitHub still has the correct version of the file, and even selecting the 'Open After' option in SourceTree shows me the file unaffected.
The fact that Git and GitHub both have no problem showing me the file suggests to me that it's a Visual Studio issue, but I don't know.
One thing to also note is that SourceTree can't seem to display some of my .cpp files and just treats them like binary files (but I'm not sure if this is related or not).
It's not a massive issue, since I can just copy the code from GitHub, but it happens almost every time I pull, so it's rather annoying. Any help solving this would be massively appreciated.
No solution, but some things you could check:
What is your Git version? Old versions (< 2.0) on Windows had a bug like that; upgrade to the latest version, which is quite good.
Verify that your files are encoded in UTF-8.
Check that Git doesn't touch your files when committing ('autocrlf' set to false).

ColdFusion file content no longer readable when we open the file in an editor

My ColdFusion 11 app is working fine on my local machine (Windows 10). But when I open any .cfm file in Eclipse, or even in Notepad, the content is not readable. It was not happening before. How can I make the files readable again?
UPDATE:
The website is using IIS 10 on Windows 10. Could the above issue have something to do with IIS? I've noticed that when I open a copy of the same .cfm file from my backup folder, I can read the file and the issue does not exist.
You are looking at an encrypted file. Most CFML files within the CFIDE folder are encrypted to protect the code from prying eyes. This has nothing to do with IIS.
You can find a short description of this topic here.
Unless there is some batch encryption in place for your own files, you should be able to open your own code and see readable content. If you accidentally encrypted your own files, you will have to restore your backup or get the original files from source control. In the past, a Google search brought up a (technically illegal) tool for decrypting such files; I have no idea whether this still works with newer ColdFusion editions.

C++ Embed external .exe into my compiled .exe

I have a quick question on a topic that I'm quite a noob about. I have a program I made that sends a command to another .exe in a folder I called "tools". I send it in this format:
system("tools\\program.exe -r -w file.dat file_new.dat");
Everything works great; however, when I build my program into a .exe, it will still require the other executable to be in that separate folder. Is there any way to include the external .exe in my project so the final product is just one .exe?
I am using Visual Studio 2008 (lol) and run Windows 7 64-bit.
Thanks :)
Typically, the management of external dependencies would be handled by the installer. NSIS is my favoured solution for the Windows platform.
The alternative: convert the binary to base64 and embed it as a header file in your project. When the application is run, convert the base64 representation of the exe back to a binary sequence and then write that sequence of bytes to a file in a temporary directory (like C:\windows\temp or %AppData%\Local\Temp). Then run the exe. Once you're done with it, remove the exe.
You can add the file to your resources. Before the command is executed, check whether the second executable exists; if it doesn't, extract the data from the resource and store it to a file...
This thread dealt with reading HTML from a resource; it is very similar for a binary file.

File formats with included versioning

I like the idea of using compressed folders as containers for file formats; they are used by LibreOffice and Dia. If I want to define a special-purpose file format, I can define a folder and file structure, zip the root folder, and have all the data in a single file. Imported files just live as originals inside the compressed file. Defining a binary file format from scratch with these features would be a lot of work.
Now to my question: are there applications that use compressed folders as file formats and do versioning inside the folder? The benefits would be great: you could commit a state of your project into your file, with the versioning wrapped in functions from your own application, and diffs could be presented your own way.
Libraries for working with compressed files and for versioning are available. The versioning system used should be a distributed one, where the repository lives inside your working folder rather than separately, as with, for example, Subversion and its client-server model.
What do you think? I'm sure there are applications out there using this approach, but I couldn't find one. Or is there a major drawback to this approach?
Sounds like an interesting idea. I know many applications claim they have "unlimited" undo and redo, but that's only back to the most recent time I opened this file. With your system, your application could "undo" to previous versions of the file, even before the version I saw the most recent time I opened this file -- that might be a nifty feature.
Have you looked at TortoiseHg? TortoiseHg uses Mercurial, which is "a distributed system, where the repository lives inside your working folder". Rather than defining a new compressed versioned file format and all the software to work with it from scratch, perhaps you could use the Mercurial file format and borrow the TortoiseHg and Mercurial source code to work with it.
What happens if I'm working on a project using 2 different applications, and each application wants to store the entire project in its own slightly different compressed versioned file format?
What I found now is that OpenOffice, a.k.a. LibreOffice, has a kind of versioning built in. A LibreOffice file is a zip file with structured content (XML files, directories, ...) inside. You are able to mark the current content as a version. This creates a VersionList.xml, which contains information about all the versions, and adds a Versions directory containing files like Version1, Version2, and so on. These files are the actual documents at that state.
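To make that concrete, here is a minimal sketch (using Archive::Zip, with made-up member names loosely modelled on the LibreOffice layout described above, not the actual ODF format) of how an application could write such a container with an embedded version snapshot:
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );
my $zip = Archive::Zip->new();
my $current = '<project><item name="example"/></project>';    # made-up document content
$zip->addString( $current, 'content.xml' );                   # current state
$zip->addString( $current, 'Versions/Version1' );             # snapshot of version 1
$zip->addString( '<versions><version id="1" comment="initial snapshot"/></versions>',
                 'VersionList.xml' );                          # version metadata
$zip->writeToFileNamed('project.container') == AZ_OK
    or die "Could not write project.container\n";
A distributed repository (for example the Mercurial store suggested above) could live under a member prefix in the same way, instead of storing full snapshots.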