How to store a non-UTF-8 file in a Fossil repository? - fossil

When I try to add a Latin-9 encoded file to a Fossil repository I get the error:
... contains invalid UTF-8. Use --no-warnings or the "encoding-glob" setting to disable this warning.
But from the documentation I think this will suppress just the warning and will still do the wrong thing, which means that a Latin-9 file gets imported as a UTF-8 file.
How can I import a Latin-9 file as a Latin-9 file? How to specify the encoding of a file or all files?

What Fossil does is warn you during a commit that a file contains data it didn't expect (binary, not Unicode, etc.). It will not actually alter the contents of the file unless you explicitly choose the convert option at the warning prompt. If you select convert, it first converts the file and then asks you to commit it in a separate step.
When you suppress warnings with --no-warnings, it will not show the warning and assume that you want to commit the file (without converting it).
For a more permanent solution, the encoding-glob setting (which can be set locally to the repository or globally) can contain a pattern (such as *.txt) that denotes files containing text in other encodings (the binary-glob setting does the same for binary files). When Fossil encounters non-Unicode content in a matching file, it will not raise the warning and will assume this is what you want. Again, it does not convert the file; the setting just tells Fossil that you know what you are doing and that the non-Unicode content is intentional.
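The warning boils down to a UTF-8 validity check on the file's bytes. A minimal sketch in Python of how such a check distinguishes Latin-9 content from UTF-8 (the function is illustrative, not Fossil's actual implementation):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# In Latin-9 (ISO 8859-15), 'é' is the single byte 0xE9, which is
# not a valid UTF-8 sequence on its own, so the check flags it.
latin9_bytes = "café".encode("iso-8859-15")
utf8_bytes = "café".encode("utf-8")
```

Because the check only reads the bytes, suppressing it (via --no-warnings or encoding-glob) leaves the Latin-9 bytes stored exactly as they are.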

Related

Why does the g++ compiler add spaces between every character in my cpp file?

I'm trying to compile 3 cpp files. For only one of them, the g++ compiler on Linux reads spaces between every character in the file, making it impossible to compile. I get hundreds, if not thousands, of x.cpp:n:n: warning: null character(s) ignored (where x is a name and n is a number). I wrote the program in Visual Studio and copied the files to Linux. The other 2 files compile fine, and I've done this for dozens of projects. How does this happen?
I managed to fix this issue by creating a new file and copying the text from the original cpp instead of copying the file.
Now I get an error from the terminal saying Permission Denied when I try to launch the .o file
Your compiler problem has nothing to do with line breaks.
You're trying to compile a file saved as UTF-16 (Unicode). Visual Studio will do this behind your back if the file contains any non-ASCII characters.
Solution 1 (recommended): stick to ASCII. Then the problem simply won't arise in the first place.
Solution 2: save the file in Visual Studio as UTF-8, as described here. You might need to save the file without a BOM (byte-order mark) as described here.
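If you'd rather fix the file outside Visual Studio, re-encoding it is straightforward. A hedged sketch in Python of the byte-level conversion (Python's "utf-16" codec consumes the BOM and detects the byte order for you):

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes (BOM expected) as UTF-8 without a BOM."""
    # The "utf-16" codec reads the BOM and picks the right byte order,
    # so this handles both little- and big-endian input.
    text = data.decode("utf-16")
    return text.encode("utf-8")

# UTF-16 with BOM, as Visual Studio typically saves such files:
utf16_bytes = "int main() {}".encode("utf-16")
utf8_bytes = utf16_to_utf8(utf16_bytes)
```

The resulting pure-ASCII/UTF-8 file compiles without the null-character warnings, since g++ no longer sees the interleaved zero bytes of UTF-16.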
WRT your other problem, look for a file called a.out (yes, really) and try running that. And don't specify -c on the g++ command line.
There is no text but encoded text.
Dogmatic corollaries:
Authors choose a character encoding for a text file.
Readers must know what it is.
Any kind of shared understanding will do: specification, convention, internal tagging, metadata upon transfer, …. (Even last century's way of converting upon transfer would do in some cases.)
It seems you 1) didn't know what you chose, 2) didn't bring that knowledge with you when you copied the file between systems, and 3) didn't tell GCC.
Unfortunately, there has been a culture of hiding these basic communication needs instead of handling them mindfully; so your experience is all too common.
To tell GCC,
g++ -finput-charset=utf-16
Obviously, if you are using some sort of project system that keeps track of the required metadata of the relevant text files and passes it to the tools, that would be preferable.
You could try adopting UTF-8 Everywhere. That won't eliminate the need for communication (until maybe the middle of this century) but it could make it more agreeable.

VS2010 doesn't understand the string encoding of its own files

I am working with a VS2010 Unicode project, which all works fine. When I remove my local files and download a fresh copy from source control (Perforce), the resource.h file reads wrong (as Chinese characters).
//{{NO_DEPENDENCIES}}
਍⼀⼀ 䴀椀挀爀漀猀漀昀琀 嘀椀猀甀愀氀 䌀⬀⬀ 最攀渀攀爀愀琀攀搀 椀渀挀氀甀搀攀 昀椀氀攀⸀ഀഀ
// Used by MyDemo.rc
਍⼀⼀ഀഀ
#define IDM_ABOUTBOX 0x0010
਍⌀搀攀昀椀渀攀 䤀䐀䐀开䄀䈀伀唀吀䈀伀堀                    ㄀  ഀഀ
Why does VS2010 do that, and how can I fix it? It is essentially an identical file, but in one instance it opens fine and in another VS2010 is not able to figure out the file encoding.
Although this is an MFC project, it doesn't look like that has anything to do with this issue.
This appears to be a problem with Perforce's line ending conversion and a failure to correctly deduce the UTF16 format of resource.h
Following the steps here may fix the problem if you encounter it in future:
Problem
On Windows, after syncing "text" (Perforce file type) files containing
utf16 encoding, the file in my workspace seems corrupted.
Solution
As the utf16 character encoding is a double byte character encoding
and Perforce treats "text" files as single byte, you may encounter
rendering or corruption issues in a Windows environment. Windows line
endings are not correctly converted within the UTF16 character set for
"text" files. This corrupts the utf16 file content.
File revisions with utf16 content should always be submitted using the
"utf16" file type (on add, Perforce will automatically detect utf16
files unless the user or a typemap rule overrides this behavior).
In order to fix your issue follow these steps:
Edit your workspace specification and change the value of the LineEnding field to "unix"
Force sync the file (no line ending conversion will be done)
Check that the workspace file is now rendered properly
Checkout the file(s), changing the file type to utf16 (change from "text" to "utf16")
Edit your workspace specification and change the value of the LineEnding field back to "local"
Submit a new revision of the file
Example:
p4 client bruno_ws
LineEnd: unix
p4 sync -f myfile.txt
p4 edit -t utf16 myfile.txt
p4 client bruno_ws
LineEnd: local
p4 submit -d "Fixing unicode file"
I think I have seen this issue before, so I am going to answer it now that I figured it out. Somehow the resource.h encoding or format was messed up, and I don't know why; I haven't made any manual changes to it. Perforce was not able to detect changes and display both versions correctly side by side in a comparison. However, it didn't show the "The files are identical" message that it normally does for identical files. And if I do "Revert unchanged files", it rolls the file back, not detecting any changes.
I used a hex comparison tool, and the internals of the two files were different. I simply picked the one that was working. The file size was also different for some reason.
The correct file shows as following
//{{NO_DEPENDENCIES}}
// Microsoft Visual C++ generated include file.
// Used by MyDemo.rc
//
#define IDD_ABOUTBOX 100
....
Resource.h needs to be in ANSI format.
Sometimes Visual Studio converts it to Unicode and puts a 2-byte BOM at the beginning. However, when the file is loaded in the IDE editor, the IDE cannot recognize it and displays it as Chinese.
If you take a look with a hex editor, you will be able to read the file contents.
The solution is to use an independent text editor (I use Notepad++ or Notepad2) and make sure the file is encoded in ANSI format without BOM.
Then check in the file and don't open it with Visual Studio anymore.
If you need to do changes, always go through the external editor and make sure that after saving the encoding is still ANSI.
I don't know why this happens. My assumption is that the OS default locale is different from the VS project resource locale. The IDE then gets confused and probably tries to convert the resource file to Unicode in order to avoid conversion problems, but Resource.h is not an ordinary text file. The compilers seem not to understand Unicode sources with a BOM.

How to get the content-type of a file

I am implementing an HTTP/1.0 server that processes GET or HEAD requests.
I've finished Date, Last-Modified, and Content-Length, but I don't know how to get the Content-Type of a file.
It has to return directory for a directory (which I can determine using the stat() function), and for a regular file, text/html for a text or HTML file and image/gif for an image or GIF file.
Should this be hard-coded, using the name of the file?
I wonder if there is any function to get this Content-Type.
You could either look at the file extension (which is what most web servers do -- see e.g. the /etc/mime.types file), or you could use libmagic to automatically determine the content type by looking at the first few bytes of the file.
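As an illustration of the extension-lookup approach (shown in Python, whose stdlib mimetypes module reads tables like /etc/mime.types; your own server would implement the equivalent mapping in whatever language it is written in):

```python
import mimetypes

def content_type_for(path: str) -> str:
    """Guess a Content-Type from the file extension,
    defaulting to application/octet-stream for unknown extensions."""
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"
```

Falling back to application/octet-stream for unknown extensions is a common convention: it tells the client the payload is opaque binary rather than claiming a type you can't verify.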
It depends how sophisticated you want to be.
If the files in question are all properly named and there are only a few types to handle, a switch based on the file suffix is sufficient. In the extreme case, making the right decision no matter what the file is would probably require either duplicating the functionality of the Unix file command or running it on the file in question (and then translating the output to the proper Content-Type).

Appropriate file upload validation

Background
In a targeted issue tracking application (in django) users are able add file attachments to internal messages. Files are mainly different image formats, office documents and spreadsheets (microsoft or open office), PDFs and PSDs.
A custom file field type (a type extending FileField) currently validates that files don't exceed a given size and that the file's content_type is in the application's MIME type whitelist. But as the user base is very varied (multinational and multi-platform), we frequently have to adjust our whitelist, because users on old or brand-new application versions send different MIME types (even though the files are valid and open correctly for other users within the business).
Note: Files are not 'executed' by apache, they are just stored (with unix permissions 600) and can be downloaded by users.
Question
What are the pro's and con's for the different types of validation?
A few options:
MIME type whitelist or blacklist
File extension whitelist or blacklist
Django file upload input validation and security even suggests "you have to actually read the file to be sure it's a JPEG, not an .EXE" (is that even viable when numerous types of files are to be accepted?)
Is there a 'right' way to validate file uploads?
Edit
Let me clarify. I can understand that actually checking the entire file in the program it should be opened with, to ensure it works and isn't broken, would be the only way to fully confirm that the file is what it says it is and that it isn't corrupted.
But the files in question are like email attachments. We can't possibly verify that every PSD is a valid and working Photoshop image, and the same goes for JPG or any other type. Even if a file is what it says it is, we couldn't guarantee that it's a fully functional file.
So what I was hoping to get at is: Is file magic absolutely crucial? What protection does it really add? And does a MIME type whitelist actually add any protection that a file extension whitelist doesn't? If a file has an extension of CSV, JPG, GIF, DOC, or PSD, is it really viable to check that it is what it says it is, even though the application itself doesn't depend on the file?
Is it dangerous to use a simple file extension whitelist, excluding the obvious offenders (EXE, BAT, etc.) and, I think, disallowing files that are dangerous to users?
The best way to validate that a file is what it says it is is by using magic.
Er, that is, magic. Files can be identified by the first few bytes of their content. It's generally more accurate than extensions or mime types, since you're judging what a file is by what it contains rather than what the browser or user claimed it to be.
There's an article on FileMagic on the Python wiki
You might also look into using the python-magic package
Note that you don't need to get the entire file before using magic to determine what it is. You can read the first chunk of the file and send those bytes to be identified by file magic.
Clarification
Just to point out that using magic to identify a file really just means reading the first small chunk of the file. It's definitely more overhead than just checking the extension, but not too much work. All that file magic does is check that the file "looks" like the file you want. It's like checking the file extension, only you're looking at the first few bytes of the content instead of the last few characters of the filename. It's harder to spoof than just changing the filename. I'd recommend against a MIME type whitelist. A file extension whitelist should work fine for your needs; just make sure that you include all possible extensions. Otherwise a perfectly valid file might be rejected just because it ends with .jpeg instead of .jpg.
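Putting this advice together, a sketch of combining an extension whitelist with a magic-byte sniff of the first few bytes. The signature table below is an illustrative subset, not an exhaustive or authoritative list; in practice python-magic/libmagic does this lookup against a much larger database:

```python
# Illustrative whitelist for this question's file types.
ALLOWED_EXTENSIONS = {"jpg", "jpeg", "gif", "png", "pdf", "csv", "doc", "psd"}

# Leading magic bytes for a few common formats (illustrative subset only).
MAGIC_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"%PDF-": "application/pdf",
}

def extension_allowed(filename: str) -> bool:
    """Check the extension against the whitelist, case-insensitively."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return ext in ALLOWED_EXTENSIONS

def sniff_type(first_bytes: bytes):
    """Identify a file by its leading magic bytes; None if unrecognized."""
    for signature, mime in MAGIC_SIGNATURES.items():
        if first_bytes.startswith(signature):
            return mime
    return None
```

Note that only the first chunk of the upload needs to be read: you can call sniff_type on the first few kilobytes before deciding whether to accept the rest of the stream.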

How to store the Visual C++ debug settings?

The debug settings are stored in a .user file which should not be added to source control. However, this file does contain useful information. Now I need to set it up each time I try to build a fresh checkout.
Is there some workaround to make this less cumbersome?
Edit: It contains the debug launch parameters. This is often not really a per-user setting. The default is $(TargetPath), but I often set it to something like $(SolutionDir)TestApp\test.exe with a few command line arguments. So it isn't a local machine setting per se.
Well, I believe this file is human readable (XML format, I think), so you could create a template that is put into source control, for instance settings.user.template, that everyone would check out. Each developer would then copy it to settings.user (or whatever the name is) and modify the contents to be what they need.
It's been a while since I've looked at that file, but I've done similar things numerous times.
Set the debug launch parameters in a batch file, add the batch file to source control. Set the startup path in VS to startup.bat $(TargetPath).