Unreserved characters in C/C++ - c++

I need to encode all occurrences of the < character in a C/C++ code file. To prevent conflicts, I need to know which characters are not reserved in the C/C++ standard. For example, if $ is not reserved, I can encode < to $ temporarily and restore the original C/C++ code later.
I need this encoding for my C/C++ code in the XML-like intermediate language.
Thanks in advance.

Rather than list unreserved characters (there are infinitely many), here are the reserved ones from 2.3.1 of the standard:
space, horizontal tab, vertical tab, form feed, new line
a through z
A through Z
0 through 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '
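To check a candidate substitute character against this list, a small lookup is enough. This is a minimal sketch: the function name in_basic_source_set is made up for illustration, and the table mirrors the list quoted above, so it reflects that edition of the standard only (later revisions may differ).

```cpp
#include <cstring>

// Returns true if c is in the basic source character set as listed above.
// The name and the table layout are illustrative assumptions.
bool in_basic_source_set(char c) {
    static const char allowed[] =
        " \t\v\f\n"
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    // strchr would also match the terminating '\0', so exclude it explicitly.
    return c != '\0' && std::strchr(allowed, c) != nullptr;
}
```

By this list, $, @ and ` are not part of the set, but a compiler may still accept them (many accept $ in identifiers), so they can appear in real code.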

If you convert all < characters to $, how will you preserve any instances of $ in your original file?
Since you say you're targeting an XML-like intermediate language, why not use XML escaping and convert < to &lt; instead? (You'll also need to convert & in that case, say to &amp;.) There are lots of open source libraries available to help you do this. If you can't find any stand-alone module, here's code I've written which could have its XML (un)escaping functionality extracted.
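A minimal sketch of the escaping direction, assuming the source is held in a std::string. The function name xml_escape is hypothetical, and escaping > is optional but harmless. A real project would more likely use an XML library's own escaping.

```cpp
#include <string>

// Escapes the characters that are significant in XML text content.
// '&' must be handled too, otherwise an existing "&lt;" in the source
// could not be told apart from an escaped '<'.
std::string xml_escape(const std::string& src) {
    std::string out;
    out.reserve(src.size());
    for (char c : src) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;       break;
        }
    }
    return out;
}
```

For example, xml_escape("a<b && c>d") yields "a&lt;b &amp;&amp; c&gt;d".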

It depends on what you mean by "reserved". An implementation
is only required to understand a very limited number of
characters in input, with all others being input by means of
universal character names. An implementation is allowed (and
I would even say encouraged) to support more, see §2.2, point 1.
In practice, there are (or should be) no reserved characters
in comments, and in string and character literals (at least the
wide character forms, and in C++11, the Unicode forms). Your
best bet is probably something like quoted printable.
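A quoted-printable style scheme avoids the need for an unused character: the escape character itself is also encoded, so the transformation is reversible for any input. The sketch below is an assumption about how that could look here (function names encode_lt and decode_lt are made up). It encodes only < and =, using their hexadecimal ASCII codes.

```cpp
#include <cstddef>
#include <string>

// Encodes '<' as "=3C" and '=' as "=3D"; every other byte is copied unchanged.
std::string encode_lt(const std::string& src) {
    std::string out;
    for (char c : src) {
        if (c == '<')      out += "=3C";
        else if (c == '=') out += "=3D";
        else               out += c;
    }
    return out;
}

// Reverses encode_lt. Assumes the input was produced by encode_lt.
std::string decode_lt(const std::string& enc) {
    std::string out;
    for (std::size_t i = 0; i < enc.size(); ++i) {
        if (enc.compare(i, 3, "=3C") == 0)      { out += '<'; i += 2; }
        else if (enc.compare(i, 3, "=3D") == 0) { out += '='; i += 2; }
        else                                    { out += enc[i]; }
    }
    return out;
}
```

Because every literal = in the source is itself encoded, an = in the encoded text always starts an escape, which is what makes decoding unambiguous.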

Related

Difference between \\ and / when working with path directories

Whenever I do any sort of file read or write, I always use the '/'
but I've seen some examples where the value of the given filepath is '\\' instead.
So what's the difference?
Am I doing it wrong or introducing bugs if I use '/'?
There's nothing wrong with using / on systems that support it. In fact, on UNIX systems it's the only thing that works.
Windows supports both / and \ as path separator in most situations.
Note that a platform agnostic option is available in the form of std::filesystem::path.
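As a rough illustration of the std::filesystem::path option (C++17), the helper below joins a directory and a file name without hard-coding a separator. The function name join_portably is made up for this sketch; generic_string() is used so that the result is written with / on every platform.

```cpp
#include <filesystem>
#include <string>

// Joins dir and file with operator/, letting the library insert the separator.
// generic_string() returns the path in the portable '/' form.
std::string join_portably(const std::string& dir, const std::string& file) {
    return (std::filesystem::path(dir) / file).generic_string();
}
```

Calling join_portably("abc", "abc.txt") gives "abc/abc.txt". Using string() or native() instead would give the platform's preferred form.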
The common path convention on Windows is the mirror image of the one on Linux: it uses backslashes, as in C:\abc\abc.txt. Which separator you use to access a file or folder is your own choice.
In a string literal, \\ is the escape sequence for a single backslash. Note that you won't be able to use a single backslash inside a string literal, since the compiler reads it together with the next character as an escape sequence (e.g. \n, \b, etc.).
That's it.

Are these characters safe to use in HTML, Postgres, and Bash?

I have a project where I'm trying to enable other, possibly hostile, coders to label, in lowercase various properties that will be displayed in differing contexts, including embed in HTML, saved and manipulated in Postgres, used as attribute labels in JavaScript, and manipulated in the shell (say, saving a data file as продажи.zip) as well as various data analysis tools like graph-tool, etc.
I've worked on multilingual projects before, but they were either smaller customers that didn't need to especially worry about sophisticated attacks or they were projects that I came to after the multilingual aspect was in place, so I wasn't the one responsible for verifying security.
I'm pretty sure these should be safe, but I don't know if there are gotchas I need to look out for, like, say, a special [TAB] or [QUOTE] character in the Chinese character set that might escape my escaping.
Am I ok with these in my regex filter?
dash = '-'
english = 'a-z'
italian = ''
russain = 'а-я'
ukrainian = 'ґї'
german = 'äöüß'
spanish = 'ñ'
french = 'çéâêîôûàèùëï'
portuguese = 'ãõ'
polish = 'ąćęłńóśźż'
turkish = 'ğışç'
dutch = 'áíúýÿìò'
swedish = 'å'
danish = 'æø'
norwegian = ''
estonian = ''
romainian = 'șî'
greek = 'α-ωίϊΐόάέύϋΰήώ'
chinese = '([\p{Han}]+)'
japanese = '([\p{Hiragana}\p{Katakana}]+)'
korean = '([\p{Hangul}]+)'
If you restrict yourself to text encodings with a 7-bit ASCII compatible subset, you're reasonably safe treating anything above 0x7f (U+007f) as "safe" when interacting with most saneish programming languages and tools. If you use perl6 you're out of luck ;)
You should avoid supporting or take special care with input or output of text using the text encoding Shift-JIS, where the ¥ symbol is at 0x5c where \ would usually reside. This offers opportunities for nefarious trickery by exploiting encoding conversions.
Avoid or take extra care with other non-ASCII-compatible encodings too. EBCDIC is one, but you're unlikely to ever meet it in the wild. UTF-16 and UTF-32 obviously, but if you misprocess them the results are glaringly obvious.
Reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
Personally I think your approach is backwards. You should define input and output functions to escape and unescape strings according to the lexical syntaxes of each target tool or language, rather than trying to prohibit any possible metacharacter. But then I don't know your situation, and maybe it's just impractical for what you're doing.
I'm not quite sure what your actual issue is. If you correctly convert your text to the target format, then you don't care what the text could possibly be. This will ensure both proper conversion AND security.
For instance:
If your text is to be included in HTML, it should be escaped using appropriate HTML quoting functions.
Example:
Wrong:
// XXX DON'T DO THIS XXX
echo "<span>".$variable."</span>";
Right:
// Actual encoding function varies based on your environment
echo "<span>".htmlspecialchars($variable)."</span>";
Yes, this will also properly handle the case of text containing & or <.
If your text is to be used in an SQL query, you should use parameterised queries.
Example:
Wrong:
// XXX DON'T DO THIS XXX
perform_sql_query("SELECT this FROM that WHERE thing=".$variable);
Right:
// Actual syntax and function will vary
perform_sql_query("SELECT this FROM that WHERE thing=?", [$variable]);
If your text is to be included in JSON, just use appropriate JSON-encoding functions.
Example:
Wrong:
// XXX DON'T DO THIS XXX
echo '{"this":"'.$variable.'"}';
Right:
// Actual syntax and function may vary
echo json_encode(['this' => $variable]);
The shell is a bit more tricky, and it's often a pain to deal with non-ASCII characters in many environments (e.g. FTP or doing an scp between different environments). So don't use explicit names for files, use identifiers (numeric id, uuid, hash...) and store the mapping to the actual name somewhere else (in a database).
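The identifier approach can be sketched as follows. This is only an illustration: the class name LabelStore, the file-N.zip naming pattern and the in-memory map are all assumptions, and a real system would keep the mapping in a database as suggested above.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hands out ASCII-only file names and remembers which label each one stands for.
class LabelStore {
public:
    // Stores the untrusted label and returns a file name built only from the id.
    std::string add(const std::string& label) {
        const std::uint64_t id = next_id_++;
        labels_[id] = label;
        return "file-" + std::to_string(id) + ".zip";
    }

    // Looks the original label up again; throws std::out_of_range if unknown.
    const std::string& label_for(std::uint64_t id) const {
        return labels_.at(id);
    }

private:
    std::uint64_t next_id_ = 1;
    std::map<std::uint64_t, std::string> labels_;
};
```

The untrusted label never reaches the shell or the file system. Only the generated name does, and it contains nothing but ASCII letters, digits, a dash and a dot.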

Convert path to \\

Okay, after two days of searching the web and MSDN, I didn't find any real solution to this problem, so I'm going to ask here in the hope that I've overlooked something.
I have an open-file dialog window, and after I get the location of the selected file, it gives the string in the following form: C:\file.exe. For the next part of my program I need C:\\file.exe. Is there any Microsoft function that can solve this problem, or some workaround?
ofn.lpstrFile = fileName;
char fileNameStr[sizeof(fileName)+1] = "";
if (GetOpenFileName(&ofn))
    strcpy(fileNameStr, fileName);
DeleteFile(fileName); // doesn't work, invalid path
I've posted only this part of the code, because everything else works fine and isn't relevant to this problem. Any assistance is greatly appreciated, as I've been going mad for the last two days.
You are confusing the requirement in C and C++ to escape backslash characters in string literals with what Windows requires.
Windows allows double backslashes in paths in only two circumstances:
Paths that begin with "\\?\"
Paths that refer to share names such as "\\myserver\foo"
Therefore, "C:\\file.exe" is never a valid path.
The problem here is that Microsoft made the (disastrous) decision decades ago to use backslashes as path separators rather than forward slashes like UNIX uses. That decision has been haunting Windows programmers since the early 1980s because C and C++ use the backslash as an escape character in string literals (and only in literals).
So in C or C++ if you type something like DeleteFile("c:\file.exe") what DeleteFile will see is "c:ile.exe" with an unprintable form feed character (0x0C) inserted between the colon and "ile.exe". That's because the compiler sees the backslash and interprets it to mean the next character isn't what it appears to be. In this case, the next character is an f, and "\f" is the escape sequence for form feed. Therefore, the compiler converts "\f" into the character 0x0C, which isn't valid in a file name.
So how do you create the path "c:\file.exe" in a C/C++ program? You have two choices:
"c:/file.exe"
"c:\\file.exe"
The first choice works because in the Win32 API (and only the API, not the command line), forward slashes in paths are accepted as path separators. The second choice works because the first backslash tells the compiler to treat the next character specially. If the next character is one of the recognized escape letters (such as n, t, or f), you get the corresponding control character. If the next character is another backslash, it will be interpreted as exactly that and your string will be correct.
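The effect of the escape processing can be seen by comparing what each literal actually contains. The function names below are made up for this sketch. The raw string literal (C++11) is a third option not mentioned above, shown here only for comparison.

```cpp
#include <string>

// "\f" is the form feed escape, so this string holds 10 characters,
// with 0x0C where the backslash and the 'f' were written.
std::string unescaped_literal() { return "c:\file.exe"; }

// "\\" is one backslash, so this string holds the 11 characters c:\file.exe
std::string escaped_literal() { return "c:\\file.exe"; }

// In a raw string literal no escape processing happens at all.
std::string raw_literal() { return R"(c:\file.exe)"; }
```

escaped_literal() and raw_literal() return the same string. A path obtained at run time, for example from a file dialog, never goes through this processing and needs no doubling.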
The library Boost.Filesystem "provides portable facilities to query and manipulate paths, files, and directories".
In short, you should not use strings as file or path names. Use boost::filesystem::path instead. You can still init it from a string or char* and you can convert it back to std::string, but all manipulations and decorations will be done correctly by the class.
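I'm guessing you mean convert "C:\file.exe" to "C:\\file.exe"
#include <string>

// Returns a copy of input_string with every backslash doubled.
std::string double_backslashes(const std::string& input_string)
{
    std::string output_string;
    for (auto character : input_string)
    {
        if (character == '\\')
        {
            output_string.push_back(character); // emit the extra backslash
        }
        output_string.push_back(character);
    }
    return output_string;
}
Please note that it is actually looking for a single backslash to replace. The double backslash used in the code is there to escape the first one.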

Using preg_replace/ preg_match with UTF-8 characters - specifically Māori macrons

I'm writing some autosuggest functionality which suggests page names that relate to the terms entered in the search box on our website.
For example typing in "rubbish" would suggest "Rubbish & Recycling", "Rubbish Collection Centres" etc.
I am running into a problem that some of our page names include macrons - specifically the macron used to correctly spell "Māori" (the indigenous people of New Zealand).
Users are going to type "maori" into the search box and I want to be able to return pages such as "Māori History".
The autosuggestion is sourced from a cached array built from all the pages and keywords. To try and locate Māori I've been trying various regex expressions like:
preg_match('/\m(.{1})ori/i',$page_title)
Which also returns page titles containing "Moorings" but not "Māori". How does preg_match/ preg_replace see characters like "ā" and how should I construct the regex to pick them up?
Cheers
Tama
Use the /u modifier for UTF-8 mode in regexes.
You're better off on the whole doing an iconv('utf-8','ascii//TRANSLIT',$string) on both name & search and comparing those.
One thing you need to remember is that UTF-8 gives you multi-byte characters for anything outside of ASCII. I don't know if the string $page_title is being treated as a Unicode object or a dumb byte string. If it's the byte string option, you're going to have to do double dots there to catch it instead, or {1,4}. And even then you're going to have to verify the up to four bytes you grab between the M and the o form a singular valid UTF-8 character. This is all moot if PHP does unicode right, I haven't used it in years so I can't vouch for it.
The other issue to consider is that ā can be constructed in two ways; one as a single character (U+0101) and one as TWO unicode characters ('a' plus a combining diacritic in the U+0300 range). You're likely just only going to ever get the former, but be aware that the latter is also possible.
The only language I know of that does this stuff reliably well is Perl 6, which has all kinds of insane modifiers for internationalized text in regexps.
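The idea of folding both the page title and the search term before comparing them can be sketched at the byte level. This is an illustration in C++ rather than PHP. The names fold_macrons and title_matches are made up, only the ten macron vowels used in Māori are handled, and a general solution would use a Unicode library. Both spellings mentioned above are covered: the precomposed characters and a plain vowel followed by the combining macron (U+0304).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Folds UTF-8 macron vowels to plain ASCII vowels, drops the combining
// macron, and lowercases ASCII letters. Other bytes are copied unchanged.
std::string fold_macrons(const std::string& utf8) {
    static const std::pair<const char*, char> table[] = {
        {"\xC4\x81", 'a'}, {"\xC4\x80", 'a'},  // U+0101, U+0100
        {"\xC4\x93", 'e'}, {"\xC4\x92", 'e'},  // U+0113, U+0112
        {"\xC4\xAB", 'i'}, {"\xC4\xAA", 'i'},  // U+012B, U+012A
        {"\xC5\x8D", 'o'}, {"\xC5\x8C", 'o'},  // U+014D, U+014C
        {"\xC5\xAB", 'u'}, {"\xC5\xAA", 'u'},  // U+016B, U+016A
    };
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        if (utf8.compare(i, 2, "\xCC\x84") == 0) {  // combining macron U+0304
            i += 2;
            continue;
        }
        bool matched = false;
        for (const auto& entry : table) {
            if (utf8.compare(i, 2, entry.first) == 0) {
                out.push_back(entry.second);
                i += 2;
                matched = true;
                break;
            }
        }
        if (!matched) {
            out.push_back(static_cast<char>(
                std::tolower(static_cast<unsigned char>(utf8[i]))));
            ++i;
        }
    }
    return out;
}

// True if the folded query occurs anywhere in the folded title.
bool title_matches(const std::string& title, const std::string& query) {
    return fold_macrons(title).find(fold_macrons(query)) != std::string::npos;
}
```

With this, a query of "maori" matches the title "Māori History" but not "Moorings".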

ASCII Value for Nothing

Is there an ascii value I can put into a char in C++, that represents nothing? I tried 0 but it ends up screwing up my file so I can't read it.
ASCII 0 is null. Other than that, there are no "nothing" characters in traditional ASCII. If appropriate, you could use a control character like SOH (start of heading), STX (start of text), or ETX (end of text). Their ASCII values are 1, 2, and 3 respectively.
For the full list of ASCII codes that I used for this explanation, see this site
Sure. Use any character value that won't appear in your regular data. This is commonly referred to as a delimited text file. Popular choices for delimiters include spaces, tabs, commas, semi-colons, vertical-bar characters, and tilde.
In a C++ source file, '\0' represents a 0 byte. However, C-style strings are null-terminated, which means that '\0' represents the end of the string - which may be what is messing up your file.
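The difference shows up when a 0 byte sits in the middle of the data. In this small sketch (function names are made up), a std::string built with an explicit length keeps all five bytes, while anything that treats the same data as a C-style string stops at the first 0 byte.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Builds a 5-byte string whose third byte is 0. The explicit length is
// required: std::string("ab\0cd") alone would stop at the 0 byte.
std::string make_with_embedded_null() {
    return std::string("ab\0cd", 5);
}

// Length as seen by C-style string functions, which stop at the first 0 byte.
std::size_t c_style_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

Writing such data with a function that takes a length, for example std::ostream::write(s.data(), s.size()) on a stream opened in binary mode, keeps the 0 byte. Writing s.c_str() does not.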
If you really want to store a 0 byte in a data file, you need to use some other encoding. A simplistic one would use some other character - 0xFF, for example - that doesn't appear in your data, or some length/data format or something similar.
Whatever encoding you choose to use, the application writing the file and the one reading it need to agree on what the encoding is. And that is a whole new nightmare.
The null character '\0' still takes up a byte.
Does your software recognize the null character as an end-of-file character?
If your software is reading in this file, you can define a place holder character (one that isn't the same as data) but you'll also need to handle that character. As in, say '*' is your place-holder. You will read in the character but not add it to the structure that stores your data. It will still take up space in your file, but it won't take up space in your data structure.
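The placeholder idea can be sketched like this, assuming '*' never occurs in the real data. The function name drop_placeholder is made up for illustration.

```cpp
#include <algorithm>
#include <string>

// Returns the data read from the file with every placeholder character removed.
// Assumes the placeholder never appears as genuine data.
std::string drop_placeholder(std::string raw, char placeholder = '*') {
    raw.erase(std::remove(raw.begin(), raw.end(), placeholder), raw.end());
    return raw;
}
```

For example, drop_placeholder("ab*c**d") returns "abcd". The placeholders still occupy bytes in the file but never reach the in-memory data.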
Am I answering your question or missing it?
Do you mean a value you can write which won't actually change the file? The answer is no.
Maybe post a little more about what you're trying to accomplish.
It would depend on what kind of file it is and who is parsing it.