c++ convert a char* of ascii characters to a unix filename - c++

I have a char* which contains only ASCII characters (decimal: 32-126). I'm looking for a C++ function which escapes (adds a backslash before) characters that have special meanings in the Unix filesystem, like '/' or '.'. I want to open the file with fopen later.
I'm not sure whether manually replacing them would be a good option, because I don't know all the characters with special meanings. I also don't know whether '?' or '*' would work with fopen.

Actually, Unix (or more specifically, the Single UNIX Specification, SUS) disallows only the byte values '/' and '\0' in file names. Everything else is fair game. The exact strings "." and ".." (exact in the sense that they immediately follow and are followed by a '/') are reserved for relative path access, but they are perfectly valid elements of a Unix path.
Apart from those two reserved names, any number and sequence of '.' is allowed in a Unix filename, so "..." or "a..b" are fine. And yes, newlines and all other control characters are perfectly valid in Unix filenames too.
Of course the file system you're using may have a different idea about what's permissible, but you were just asking about Unix.
Update:
Oh, and it should be noted that Unix doesn't specify any "parse" method for filenames. This essentially means that a filename is treated as a binary blob key into a key→value database. It also means that there's no such thing as "escaping" for Unix filenames.
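The rule above is small enough to check directly. Here is a minimal sketch, assuming you want to validate a single path component before passing it to fopen. The function name is_valid_unix_component is made up for illustration, and the check ignores any length limit or extra restrictions a particular file system may impose.

```cpp
#include <string_view>

// Hypothetical helper: returns true if `name` can name a new file inside a
// directory on a Unix-like system. Only '/' and '\0' are forbidden bytes;
// "." and ".." are reserved for the current and parent directory.
bool is_valid_unix_component(std::string_view name) {
    if (name.empty() || name == "." || name == "..") {
        return false;
    }
    // The explicit length keeps the embedded '\0' as part of the search set.
    const std::string_view forbidden("/\0", 2);
    return name.find_first_of(forbidden) == std::string_view::npos;
}
```

Note that characters such as '*', '?' and space pass this check: they only matter to a shell, not to fopen.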

POSIX filenames don't have a concept of escape characters. There is no way to have a slash as an element of a filename (when the system renders filenames using Unicode you may be able to create a filename which looks as if it contains a slash, though). I think all other printable characters are just fine, although using special characters like * and ? in a filename will probably cause problems when people try to use them from a shell.

Related

How to manage file name in flutter

I have a string that is going to be a filename. I want to check whether it contains any special characters and replace them, so that there won't be a problem when I create the file. Is it good practice to replace them with "_"?
I used the line below. Is it correct? Are there other characters besides letters and numbers that can be used in a file name? Which characters should I avoid in file names?
String filename = ch.replaceAll(RegExp('[^A-Za-z0-9]'), '_');
The list of allowed filename characters depends on the underlying filesystem. On (most) Unix, anything except / and \0 is allowed. On Windows, the rules get weird. For example, you (usually) can't end a filename with a period; you can't name a file NUL, etc.
Other considerations: It would be confusing to allow spaces at the beginning/end of a filename. Spaces within a filename break certain tools (looking at you, make). Is your filesystem case-sensitive or case-preserving? Does it have a maximum filename length?
Which characters should I avoid in file names?
Wrong question. Do you have a particular need to allow "unusual" characters in filenames?
If these are machine-generated names, just do what you're doing (I prefer hyphens, but that's a stylistic decision). If these are user-generated filenames, just try saving the file -- if it fails, get the user to choose another name.
tl;dr: use URL-safe characters: [A-Za-z0-9_-]+.
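For readers working in C++ rather than Dart, the same whitelist idea could be sketched as follows. The function name sanitize_filename and the default replacement character are assumptions for illustration, not part of any library.

```cpp
#include <string>

// Hypothetical helper: keeps [A-Za-z0-9_-] and replaces every other byte
// with `replacement`. Dots are replaced too, so append any extension
// after sanitizing.
std::string sanitize_filename(const std::string& input, char replacement = '_') {
    std::string out;
    out.reserve(input.size());
    for (unsigned char c : input) {
        const bool safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                          (c >= '0' && c <= '9') || c == '_' || c == '-';
        out.push_back(safe ? static_cast<char>(c) : replacement);
    }
    return out;
}
```

With this sketch, sanitize_filename("my report: v1.2") gives "my_report__v1_2". An empty input stays empty, so the caller still has to handle that case.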

Difference between \\ and / when working with path directories

Whenever I do any sort of file read or write, I always use the '/'
but I've seen some examples where the value of the given filepath is '\\' instead.
So what's the difference?
Am I doing it wrong or introducing bugs if I use '/'?
There's nothing wrong with using / on systems that support it. In fact, on UNIX systems it's the only thing that works.
Windows supports both / and \ as path separator in most situations.
Note that a platform agnostic option is available in the form of std::filesystem::path.
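As a rough illustration of the std::filesystem::path option (C++17), the sketch below joins two parts without hard-coding a separator. The helper name join_path is made up for this example.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: joins a directory and a file name using the
// library's operator/ instead of a hard-coded '/' or '\\'.
std::string join_path(const std::string& dir, const std::string& file) {
    const std::filesystem::path p = std::filesystem::path(dir) / file;
    // generic_string() always uses '/' as the separator; p.string() would
    // return the platform's native form instead.
    return p.generic_string();
}
```

For example, join_path("data", "log.txt") gives "data/log.txt" on both Windows and Unix-like systems.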
The usual convention for paths on Windows is the reverse of the Linux one: paths use backslashes and look something like C:\abc\abc.txt, although it's your own choice which separator you use to access a file or folder.
The \\ in source code is an escape sequence that produces a single backslash in the string. Note that you can't use a single backslash inside a string literal, because the compiler reads it together with the next character as an escape sequence (e.g. \n, \b, etc.).
That's it.

Convert path to \\

Okay, after two days of searching the web and MSDN, I haven't found any real solution to this problem, so I'm going to ask here in the hope that I've overlooked something.
I have an open-file dialog, and after I get the location of the selected file, it gives me the string in the following form: C:\file.exe. For the next part of my program I need C:\\file.exe. Is there any Microsoft function that can solve this problem, or some workaround?
ofn.lpstrFile = fileName;
char fileNameStr[sizeof(fileName)+1] = "";
if (GetOpenFileName(&ofn))
strcpy(fileNameStr, fileName);
DeleteFile(fileName); // doesn't work, invalid path
I've posted only this part of the code, because everything else works fine and isn't relevant to this problem. Any assistance is greatly appreciated, as I've been going mad for the last two days.
You are confusing the requirement in C and C++ to escape backslash characters in string literals with what Windows requires.
Windows allows double backslashes in paths in only two circumstances:
Paths that begin with "\\?\"
Paths that refer to share names such as "\\myserver\foo"
Therefore, "C:\\file.exe" is never a valid path.
The problem here is that Microsoft made the (disastrous) decision decades ago to use backslashes as path separators rather than forward slashes like UNIX uses. That decision has been haunting Windows programmers since the early 1980s because C and C++ use the backslash as an escape character in string literals (and only in literals).
So in C or C++, if you type something like DeleteFile("c:\file.exe"), what DeleteFile will see is "c:ile.exe" with an unprintable form-feed character (0x0C) inserted between the colon and "ile.exe". That's because the compiler sees the backslash and treats it together with the next character as an escape sequence. In this case, the next character is an f, and "\f" is the escape sequence for form feed. Therefore, the compiler converts "\f" into the single character 0x0C, which isn't valid in a Windows file name.
So how do you create the path "c:\file.exe" in a C/C++ program? You have two choices:
"c:/file.exe"
"c:\\file.exe"
The first choice works because in the Win32 API (and only the API, not the command line), forward slashes in paths are accepted as path separators. The second choice works because the first backslash tells the compiler to treat the next character specially. If the next character is a letter such as f, n, or t, you get the corresponding control character. If the next character is another backslash, the pair is interpreted as a single literal backslash and your string will be correct.
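To make the difference between the source text and the resulting string concrete, here is a small sketch. The variable names are only for illustration, and the raw string literal is an additional C++11 option not mentioned above.

```cpp
#include <string>

// Three ways to spell a usable Windows path in C++ source code.
const std::string escaped = "c:\\file.exe";    // escaped backslash
const std::string raw     = R"(c:\file.exe)";  // raw string literal (C++11)
const std::string forward = "c:/file.exe";     // forward slash, accepted by the Win32 API

// The mistake from the question: "\f" is the form-feed escape (0x0C),
// so this string has no backslash and no 'f' after the colon.
const std::string broken  = "c:\file.exe";
```

Here escaped and raw hold the same 11 characters with exactly one backslash, while broken is only 10 characters long. A path that arrives at run time, for instance from GetOpenFileName, never goes through the compiler and so needs no doubling at all.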
The library Boost.Filesystem "provides portable facilities to query and manipulate paths, files, and directories".
In short, you should not use strings as file or path names. Use boost::filesystem::path instead. You can still init it from a string or char* and you can convert it back to std::string, but all manipulations and decorations will be done correctly by the class.
I'm guessing you mean converting "C:\file.exe" to "C:\\file.exe":
std::string output_string;
for (auto character : input_string)
{
if (character == '\\')
{
output_string.push_back(character);
}
output_string.push_back(character);
}
Please note it is actually looking for a single backslash to replace, the double backslash used in the code is to escape the first one.

QString::split() and "\r", "\n" and "\r\n" convention

I understand that QString::split should be used to get a QStringList from a multiline QString. But if I have a file and I don't know if it comes from Mac, Windows or Unix, I'm not sure if QString.split("\n") would work well in all the cases. What is the best way to handle this situation?
If it's acceptable to remove blank lines, you can try:
QString.split(QRegExp("[\r\n]"),QString::SkipEmptyParts);
This splits the string wherever either newline character (line feed or carriage return) is found. Consecutive line breaks (e.g. \r\n\r\n or \n\n) are treated as multiple delimiters with empty parts between them, and those empty parts are skipped.
Emanuele Bezzi's answer misses a couple of points.
In most cases, a string read from a text file will have been read using a text stream, which automatically translates the OS's end-of-line representation to a single '\n' character. So if you're dealing with native text files, '\n' should be the only delimiter you need to worry about. For example, if your program is running on a Windows system, reading input in text mode, line endings will be marked in memory with single \n characters; you'll never see the "\r\n" pairs that exist in the file.
But sometimes you do need to deal with "foreign" text files.
Ideally, you should probably translate any such files to the local format before reading them, which avoids the issue. Only the translation utility needs to be aware of variant line endings; everything else just deals with text.
But that's not always possible; sometimes you might want your program to handle Windows text files when running on a POSIX system (Linux, UNIX, etc.), or vice versa.
A Windows-format text file on a POSIX system will appear to have an extra '\r' character at the end of each line.
A POSIX-format text file on a Windows system will appear to consist of one very long line with embedded '\n' characters.
The most general approach is to read the file in binary mode and deal with the line endings explicitly.
I'm not familiar with QString.split, but I suspect that this:
QString.split(QRegExp("[\r\n]"),QString::SkipEmptyParts);
will ignore empty lines, which will appear either as "\n\n" or as "\r\n\r\n", depending on the format. Empty lines are perfectly valid text data; you shouldn't ignore them unless you're certain that it makes sense to do so.
If you need to deal with text input delimited either by "\n", "\r\n", or "\r", then I think something like this:
QString.split(QRegExp("\n|\r\n|\r"));
would do the job. (Thanks to parsley72's comment for helping me with the regular expression syntax.)
Another point: you're not likely to encounter text files that use just '\r' to delimit lines. That's the format used by Mac OS up to version 9. Mac OS X is based on UNIX, and it uses standard UNIX-style '\n' line endings (though it probably tolerates '\r' line endings as well).
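If you read the file in binary mode and want to handle the line endings yourself without a regular expression, the idea could be sketched in plain standard C++ like this. The function name split_lines is an assumption for illustration and it is not a Qt API.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits text on "\r\n", "\n" or "\r" and keeps
// empty lines. A trailing terminator does not add an extra empty entry.
std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '\r' || c == '\n') {
            // Treat the pair "\r\n" as one terminator, not two.
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n') {
                ++i;
            }
            lines.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) {
        lines.push_back(current);  // last line without a terminator
    }
    return lines;
}
```

For the input "a\r\nb\n\nc\rd" this yields five entries: "a", "b", an empty line, "c" and "d". Unlike a regex-based split, it does not produce a trailing empty string when the text ends with a line terminator.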

C++ change newline from CR+LF to LF

I am writing code that runs in Windows and outputs a text file that later becomes the input to a program in Linux. This program behaves incorrectly when given files that have newlines that are CR+LF rather than just LF.
I know that I can use tools like dos2unix, but I'd like to skip the extra step. Is it possible to get a C++ program in Windows to use the Linux newline instead of the Windows one?
Yes, you have to open the file in "binary" mode to stop the newline translation.
How you do it depends on how you are opening the file.
Using fopen:
FILE* outfile = fopen( "filename", "wb" );
Using ofstream:
std::ofstream outfile( "filename", std::ios_base::binary | std::ios_base::out );
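A slightly fuller sketch of the ofstream variant is shown below, with a second helper that reads the bytes back so you can confirm that no carriage returns were added. The names write_lf_file and read_raw are made up for this example.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: writes `text` byte for byte. Binary mode disables
// the "\n" -> "\r\n" translation that text mode performs on Windows.
bool write_lf_file(const std::string& path, const std::string& text) {
    std::ofstream out(path, std::ios_base::binary | std::ios_base::out);
    out << text;
    return static_cast<bool>(out.flush());  // false if opening or writing failed
}

// Hypothetical helper: reads the whole file back without any translation.
std::string read_raw(const std::string& path) {
    std::ifstream in(path, std::ios_base::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

After write_lf_file("out.txt", "one\ntwo\n"), read_raw("out.txt") should return exactly the same eight bytes on Windows and on Linux.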
OK, so this is probably not what you want to hear, but here's my $0.02 based on my experience with this:
If you need to pass data between different platforms, in the long run you're probably better off using a format that doesn't care what line breaks look like. If it's text files, users will sometimes mess with them. If by messing the line endings up they cause your application to fail, this is going to be a support intensive application.
Been there, done that, switched to XML. Made the support guys a lot happier.
A much cleaner solution is to use the ASCII escape sequence for the LF character (decimal 10): '\012' or '\x0A' represents an explicit single line feed regardless of platform.
Note that this does not work with at least some compilers; for example, with MSVC 2019 16.11.6, both '\012' and '\x0A' are still written as carriage return plus line feed, because '\n' already has the value 10 there and the translation happens when a text-mode stream is written. It also does not matter whether a string literal ("\012") or a char literal ('\012') is used.
This method also avoids string length surprises, as '\n' can expand to two characters. But so can multibyte Unicode characters, in UTF-8, when written directly into a string literal in the source code.
Note also that '\r' is the platform-independent code for a single carriage return (decimal 13). The '\f' character is not the line feed, but rather the form feed (decimal 12), which is not a newline on any platform I am aware of. C does not offer a single-character backslash escape for the line feed, thus the need for the longer octal or hexadecimal escapes.