Extra backslash added to \n when parsing a string from XML (C++)

I read XML data into a C++ application. Some of the data are multiline strings. Each line break is written as the '\n' escape sequence. But when the data is loaded into the program, each backslash-n gets an extra backslash on its left. For example:
In XML:
<node attrStr = "Hello!\nWhat's your name?" />
In the program:
"Hello!\\nWhat's your name?"
So it causes '\' and 'n' to become separate characters.
It doesn't happen if the string is hardcoded into the program source code.
How can this issue be solved?
It is important to note that the XML string is read into a std::wstring to take care of Unicode characters.
Found the answer here.
Replacing '\n' with the character reference &#10; (equivalently &#xA;) inside the XML solves the issue.

If you want to escape a newline character in XML, you have to use a character reference for it, such as &#10; (or &#xA;). So the correct XML would look like:
<node attrStr = "Hello!&#10;What's your name?" />
Since XML does not allow character escaping with a backslash, the string "\n" is read as two normal characters, '\' and 'n'.
If you want to load the XML content with correct line breaks, you must replace the "\n" parts with an actual line feed character, as suggested in the answer by Angew.
Alternatively, you could also modify or pre-process the XML file before reading it.

The two characters \ and n after each other do not inherently have any special meaning. In some contexts, these two characters are used to encode a newline. String literals in C++ source files are such a context. XML files are not such a context.
This means that when parsing an XML file containing the substring \n, you will get a string containing the substring \n in the memory of your C++ program. Anything else would be wrong. If you want \n in your data to represent a newline, you have to use string substitution once the data is in memory.
After parsing the string, simply replace each occurrence of \n with the ASCII character LF and you're set. This is how you could do it (inefficiently) with the standard library:
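std::string s = getTheStringFromXml();
for (size_t idx = 0;;)
{
    idx = s.find("\\n", idx);  // find the two characters '\' and 'n'
    if (idx == s.npos)
        break;
    s[idx] = '\n';             // overwrite the backslash with a real LF
    s.erase(idx + 1, 1);       // erase only the 'n'; erase(idx + 1) would drop the rest of the string
    ++idx;                     // continue searching after the LF
}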
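Since the question reads the data into a std::wstring, a wide-string variant may be closer to what is needed. The following is only a sketch: unescapeNewlines and demoUnescapeNewlines are hypothetical names, not part of any XML library, and the function builds the result in a single pass instead of editing the string in place.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: replaces every two-character sequence
// backslash + 'n' with a single line feed character.
std::wstring unescapeNewlines(const std::wstring& in)
{
    std::wstring out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == L'\\' && i + 1 < in.size() && in[i + 1] == L'n')
        {
            out.push_back(L'\n'); // emit one LF for the pair
            ++i;                  // skip the 'n'
        }
        else
        {
            out.push_back(in[i]); // copy everything else unchanged
        }
    }
    return out;
}

// Usage sketch with the string from the question, as the parser delivers it.
void demoUnescapeNewlines()
{
    const std::wstring parsed = L"Hello!\\nWhat's your name?";
    assert(unescapeNewlines(parsed) == L"Hello!\nWhat's your name?");
}
```

A lone backslash that is not followed by n is copied through unchanged; whether that is the right policy depends on what else the data may contain.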

This issue also occurs in JavaScript, and the fix works well.

Related

XSLT fn:tokenize ignore leading and trailing spaces

A simple date string needs to be tokenized. I'm using this sample XSLT code:
fn:tokenize(date, '[ .\s]+')
All variants of badly formatted dates (e.g. "10.10.2020", "10. 10 .2020", "10 . 10. 2020") are tokenized correctly by the function above, except when a leading space is present (e.g. " 10.10.2020"). If a leading space is present, the first element is tokenized as a " " blank space.
Is there an option to ignore these leading spaces as well, so that no matter how bad the format is, only the delimiter "." starts another token and all spaces are stripped?

The right solution seems to be:
fn:tokenize(normalize-space(date), '[ .\s]+')

JSONCPP is adding extra double quotes to string

I have a root value in JsonCpp that holds a string value, like this:
Json::Value root;
std::string val = "{\"stringval\": \"mystring\"}";
Json::Reader reader;
bool parsingpassed = reader.parse(val, root, false);
Now I am trying to retrieve this value with the following code:
Json::StreamWriterBuilder builder;
builder.settings_["indentation"] = "";
std::string out = Json::writeString(builder, root["stringval"]);
Ideally, the string out should contain:
"mystring"
whereas it actually contains:
"\"mystring\"" \\ you see this in debug mode if you inspect the string content
By the way, if you print this value to stdout, it is printed like this:
"mystring" \\ because \" is an escape sequence and prints " to stdout
It should be printed like this to stdout:
mystring \\ expected output
Any idea how to avoid this kind of output when converting JSON output to std::string?
Please avoid suggesting FastWriter, as it also adds a newline character and is a deprecated API as well.
Constraint: I do not want to modify the string by removing the extra \" with string manipulation; rather, I want to know how to do that with JsonCpp directly.
This is the StreamWriterBuilder reference code which I have used.
I also found this solution, which removes the extra quotes from the current string, but I don't want them to be there in the first place.

I had this problem also until I realized you have to use the Json::Value class accessor functions, e.g. root["stringval"] will be "mystring", but root["stringval"].asString() will be mystring.

Okay, so this question did not get an answer even after a thorough explanation, and I had to go through the JsonCpp APIs and documentation for a while.
I have not found any API so far that takes care of this scenario of extra double quotes being added.
From their wikibook I could figure out that some escape sequences may appear in a string. It is as designed, and they haven't mentioned the exact scenario.
\" - quote
\\ - backslash
\/ - slash
\n - newline
\t - tabulation
\r - carriage return
\b - backspace
\f - form feed
\uxxxx, where x is a hexadecimal digit - any 2-byte symbol
Link explaining which escape sequences may appear in a string
If anyone coming across this finds a better explanation for the same issue, please feel free to post your answer. Until then, I guess string manipulation is the only option for removing those extra escape sequences.

Strategy to replace spaces in string

I need to store a string with its spaces replaced by some character. When I retrieve it, I need to replace that character with spaces again. I have thought of this strategy: while storing, I replace (space with _a) and (_a with _aa), and while retrieving, I replace (_a with space) and (_aa with _a), i.e. even if the user enters _a in the string, it will be handled. But I don't think this is a good strategy. Please let me know if anyone has a better one.

Replacing spaces with something is a problem when something is already in the string. Why don't you simply encode the string - there are many ways to do that, one is to convert all characters to hexadecimal.
For instance
Hello world!
is encoded as
48656c6c6f20776f726c6421
The space is 0x20. Then you simply decode back (hex to ascii) the string.
This way there are no spaces in the encoded string.
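The hex encoding described above could be sketched in C++ roughly as follows. The names hexEncode, hexValue, hexDecode and demoHex are made up for this illustration, and treating malformed input as an exception is an assumption, not something the answer specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes every byte as two lowercase hex digits.
std::string hexEncode(const std::string& in)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in)
    {
        out.push_back(digits[c >> 4]);   // high nibble
        out.push_back(digits[c & 0x0F]); // low nibble
    }
    return out;
}

// Converts one hex digit to its value; throws on anything else.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Hypothetical helper: turns pairs of hex digits back into bytes.
std::string hexDecode(const std::string& in)
{
    if (in.size() % 2 != 0)
        throw std::invalid_argument("odd number of hex digits");
    std::string out;
    out.reserve(in.size() / 2);
    for (std::size_t i = 0; i < in.size(); i += 2)
        out.push_back(static_cast<char>(hexValue(in[i]) * 16 + hexValue(in[i + 1])));
    return out;
}

// Usage sketch with the example from the answer.
void demoHex()
{
    assert(hexEncode("Hello world!") == "48656c6c6f20776f726c6421");
    assert(hexDecode("48656c6c6f20776f726c6421") == "Hello world!");
}
```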
-- Edit - optimization --
You replace all % and all spaces in the string with %xx where xx is the hex code of the character.
For instance
Wine having 12% alcohol
becomes
Wine%20having%2012%25%20alcohol
%20 is space
%25 is the % character
This way, neither % nor (space) are a problem anymore - Decoding is easy.
Encoding algorithm
- replace all `%` with `%25`
- replace all ` ` with `%20`
Decoding algorithm
- replace all `%xx` with the character having `xx` as hex code
(You may even optimize more since you need to encode only two characters: use %1 for % and %2 for space, but I recommend the %xx solution as it is more portable - and may be utilized later on if you need to code more characters)
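A minimal C++ sketch of the %xx scheme, under the assumption that only % and space need encoding. The names percentEncode, fromHexDigit, percentDecode and demoPercent are hypothetical. Both directions work in a single pass over the input, so the order of the two replacements does not matter and an already decoded % is never decoded a second time.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes '%' as %25 and ' ' as %20, copies the rest.
std::string percentEncode(const std::string& in)
{
    std::string out;
    for (char c : in)
    {
        if (c == '%')      out += "%25";
        else if (c == ' ') out += "%20";
        else               out.push_back(c);
    }
    return out;
}

// Converts one hex digit to its value; throws on anything else.
int fromHexDigit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Hypothetical helper: replaces every %xx with the character whose hex code is xx.
std::string percentDecode(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] != '%')
        {
            out.push_back(in[i]);
            continue;
        }
        if (i + 2 >= in.size())
            throw std::invalid_argument("truncated %xx sequence");
        out.push_back(static_cast<char>(fromHexDigit(in[i + 1]) * 16 + fromHexDigit(in[i + 2])));
        i += 2; // skip the two hex digits just consumed
    }
    return out;
}

// Usage sketch with the example from the answer.
void demoPercent()
{
    const std::string encoded = percentEncode("Wine having 12% alcohol");
    assert(encoded == "Wine%20having%2012%25%20alcohol");
    assert(percentDecode(encoded) == "Wine having 12% alcohol");
}
```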

I'm not sure your solution will work. When reading, how would you
distinguish between strings that were originally " a" and strings that
were originally "_a": if I understand correctly, both will end up
"_aa".
In general, given a situation where a specific set of characters cannot
appear as such, but must be encoded, the solution is to choose one of
allowed characters as an "escape" character, remove it from the set of
allowed characters, and encode all of the forbidden characters
(including the escape character) as a two (or more) character sequence
starting with the escape character. In C++, for example, a new line is
not allowed in a string or character literal. The escape character is
\; because of that, it must be encoded as an escape sequence as well.
So we have "\n" for a new line (the choice of n is arbitrary), and
"\\" for a \. (The choice of \ for the second character is also
arbitrary, but it is fairly usual to use the escape character, escaped,
to represent itself.) In your case, if you want to use _ as the
escape character, and "_a" to represent a space, the logical choice
would be "__" to represent a _ (but I'd suggest something a little
more visually suggestive—maybe ^ as the escape, with "^_" for
a space and "^^" for a ^). When reading, anytime you see the escape
character, the following character must be mapped (and if it isn't one
of the predefined mappings, the input text is in error). This is simple
to implement, and very reliable; about the only disadvantage is that in
an extreme case, it can double the size of your string.
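As a rough illustration of the escape-character approach with ^ as the escape, "^_" for a space and "^^" for a ^, here is a C++ sketch. The function names escapeSpaces, unescapeSpaces and demoEscape are made up for this example, and reporting bad input via std::invalid_argument is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes ' ' as "^_" and the escape character '^' as "^^".
std::string escapeSpaces(const std::string& in)
{
    std::string out;
    for (char c : in)
    {
        if (c == ' ')      out += "^_";
        else if (c == '^') out += "^^";
        else               out.push_back(c);
    }
    return out;
}

// Hypothetical helper: reverses escapeSpaces; any escape sequence that is not
// one of the two predefined mappings is reported as an error in the input.
std::string unescapeSpaces(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] != '^')
        {
            out.push_back(in[i]);
            continue;
        }
        if (i + 1 == in.size())
            throw std::invalid_argument("dangling escape character");
        ++i; // look at the character after the escape
        if (in[i] == '_')      out.push_back(' ');
        else if (in[i] == '^') out.push_back('^');
        else throw std::invalid_argument("unknown escape sequence");
    }
    return out;
}

// Usage sketch: inputs that look like escape sequences still round-trip.
void demoEscape()
{
    assert(escapeSpaces("a b^c") == "a^_b^^c");
    assert(unescapeSpaces(escapeSpaces("^_ is not a space")) == "^_ is not a space");
}
```

Because the escape character itself is always encoded, two different inputs can never produce the same stored string.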

You want to implement this using C/C++? I think you should split your string into multiple parts, separated by spaces.
If your string is like this: "a__b" (multiple consecutive spaces), it will be split into:
sub[0] = "a";
sub[1] = "";
sub[2] = "b";
Hope this will help!

If your strings can use X distinct characters, you cannot encode them with only X-1 distinct characters while keeping one output character per input character.
You can use a combination of 2 chars to replace a given character (this is exactly what you are trying in your example).
To do this, loop through your string to count the spaces, allocate a new character array of the resulting length, and replace the spaces with "//" (this is just an example). The problem with this approach is that you cannot have "//" in your input string.
Another approach would be to use a rarely used char, for example "^", to replace the spaces.
The last approach, and the popular one, is a combination of these two. It is used in Unix and PHP to put a syntax character into a string as a literal: if you want a " inside a string, you simply write it as \", and so on.

Why don't you use the Replace function?
String* stringWithoutSpace= stringWithSpace->Replace(S" ", S"replacementCharOrText");
So now stringWithoutSpace contains no spaces. When you want to put those spaces back,
String* stringWithSpacesBack= stringWithoutSpace ->Replace(S"replacementCharOrText", S" ");

I think just coding to ascii hexadecimal is a neat idea, but of course doubles the amount of storage needed.
If you want to do this using less memory, then you will need two-letter sequences, and have to be careful that you can go back easily.
You could e.g. replace blank by _a, but you also need to take care of your escape character _. To do this, replace every _ by __ (two underscores). You need to scan through the string once and do both replacements simultaneously.
This way, in the resulting text all original underscores will be doubled, and the only other occurrence of an underscore will be in the combination _a. You can safely translate this back. Whenever you see an underscore, you need a lookahead of 1 to see what follows. If an a follows, then this was a blank before. If _ follows, then it was an underscore before.
Note that the point is to replace your escape character (_) in the original string, and not the character sequence to which you map the blank. Your idea of replacing _a breaks, as you do not know whether _aa was originally _a or " a" (a blank followed by a).

I'm guessing that there is more to this question than appears; for example, that the strings you are storing must not only be free of spaces, but must also look like words or some such. You should be clear about your requirements (and you might consider satisfying the curiosity of the spectators by explaining why you need to do such things).
Edit: As JamesKanze points out in a comment, the following won't work in the case where you can have more than one consecutive space. But I'll leave it here anyway, for historical reference. (I modified it to compress consecutive spaces, so it at least produces unambiguous output.)
std::string out;
char prev = 0;
for (char ch : in) {
    if (ch == ' ') {
        if (prev != ' ') out.push_back('_');
    } else {
        if (prev == '_' && ch != '_') out.push_back('_');
        out.push_back(ch);
    }
    prev = ch;
}
if (prev == '_') out.push_back('_');

how to use fout ()

Can someone help me? I have created this command:
fout <<"osql -Ubatascan -Pdtsbsd12345 -dpos -i""c:\\temp_pd.sql"""<<endl;
Resulting output:
osql -Ubatascan -Pdtsbsd12345 -dpos -ic:\temp_pd.sql
Output that I want:
osql -Ubatascan -Pdtsbsd12345 -dpos -i"c:\temp_pd.sql"
Can someone help?

What you're doing is actually writing multiple string literals next to each other. The expression
"foo""bar"
gets parsed as the two string literals "foo" and "bar". The C and C++ languages say that when you have string literals next to each other, they get pasted together into one big string literal at compile time. So, the above expression is entirely equivalent to the single string literal "foobar".
Hence, your expression gets parsed as the following three string literals:
"osql -Ubatascan -Pdtsbsd12345 -dpos -i"
"c:\\temp_pd.sql"
""
Which when pasted together form the string "osql -Ubatascan -Pdtsbsd12345 -dpos -ic:\\temp_pd.sql" (note that the third string is the empty string "").
What you want to do is to use the escape sequence \" to include a literal quotation mark within your string literal. Write it like this:
"osql -Ubatascan -Pdtsbsd12345 -dpos -i\"c:\\temp_pd.sql\""
Normally, the quotation mark " gets interpreted as the end of a string literal, except when it's preceded by a backslash, in which case it gets interpreted as a quotation mark character within the string.
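To see the difference side by side, here is a small C++ sketch. It writes to a std::ostringstream as a stand-in for the file stream in the question, and the command is shortened (the -U and -P options are dropped) to keep the literals readable; the function names are hypothetical. The third variant uses a raw string literal, which needs C++11 or later and was not part of the original answer.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Adjacent string literals are pasted together at compile time,
// so the doubled quotes contribute no characters at all.
std::string commandWithAdjacentLiterals()
{
    std::ostringstream fout; // stand-in for the file stream in the question
    fout << "osql -dpos -i""c:\\temp_pd.sql""";
    return fout.str();
}

// The escape sequence \" puts a real quotation mark into the literal.
std::string commandWithEscapedQuotes()
{
    std::ostringstream fout;
    fout << "osql -dpos -i\"c:\\temp_pd.sql\"";
    return fout.str();
}

// A raw string literal needs no escaping of quotes or backslashes.
std::string commandWithRawLiteral()
{
    std::ostringstream fout;
    fout << R"(osql -dpos -i"c:\temp_pd.sql")";
    return fout.str();
}

// Usage sketch: only the first variant loses the quotation marks.
void demoQuotes()
{
    assert(commandWithAdjacentLiterals().find('"') == std::string::npos);
    assert(commandWithEscapedQuotes() == commandWithRawLiteral());
}
```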

Groovy Regex illegal Characters

I have a Groovy script that converts some very poorly formatted data into XML. This part works fine, but it's also happily passing some characters along that aren't legal in XML. So I'm adding some code to strip these out, and this is where the problem is coming from.
The code that isn't compiling is this:
def illegalChars = ~/[\u0000-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
What I'm wondering is, why? What am I doing wrong here? I tested this regex in http://regexpal.com/ and it works as expected, but I'm getting an error compiling it in Groovy:
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] line 23:26: unexpected char: 0x0
The line above is line 23. The surrounding lines are just variable declarations that I haven't changed while working on the regex.
Thanks!
Update:
The code compiles, but it's not filtering as I'd expected it to.
In regexpal I put the regex:
[\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]
and the test data:
name='lang'>E</field><field name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc>
<doc><field name='page'>72-88</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field name='issue'>NUMBER</field>
<field name='auth'>Dvorak, A.</field><field name='pub'>KARGER</field><field
name='rr'>GBP013.51</field><field name='issn'>1660-2242</field><field
name='class1'>TS</field><field name='freq'>S</field><field
name='class2'>616.079</field><field name='text'>Subcellular Localization of the
Cytokines, Basic Fibroblast Growth Factor and Tumor Necrosis Factor- in Mast
Cells</field><field name='id'>RN170369808</field><field name='volume'>VOL 85</field>
<field name='year'>2005</field><field name='lang'>E</field><field
name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc><doc><field
name='page'>89-97</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field
It's a grab from a file with one of the illegal characters, so it's a little random. regexpal highlights only the illegal character, but Groovy replaces even the '<' and '>' characters with empty strings, so it's basically annihilating the entire document.
The code snippet:
def List parseFile(File file){
    println "reading File name: ${file.name}"
    def lineCount = 0
    List data = new ArrayList()
    file.eachLine {
        String input ->
        lineCount ++
        String line = input
        if(input =~ illegalChars){
            line = input.replaceAll(illegalChars, " ")
        }
        Map document = new HashMap()
        elementNames.each(){
            token ->
            def val = getValue(line, token)
            if(val != null){
                if(token.equals("ISSUE")){
                    List entries = val.split(";")
                    document.putAt("year",entries.getAt(0).trim())
                    if(entries.size() > 1){
                        document.putAt("volume", entries.getAt(1).trim())
                    }
                    if(entries.size() > 2){
                        document.putAt("issue", entries.getAt(2).trim())
                    }
                } else {
                    document.putAt(token, val)
                }
            }
        }
        data.add(document)
    }
    println "done"
    return data
}
I don't see any reason that the two should behave differently; am I missing something?
Again, thanks!

line 23:26: unexpected char: 0x0
This error message points to this part of the code:
def illegalChars = ~/[\u0000-...
12345678901234567890123
It looks like for some reason the compiler doesn't like having the Unicode character 0 in the source code. That said, you should be able to fix this by doubling the backslash. This prevents Unicode escapes from being processed at the source code level, and lets the regex engine handle the Unicode escapes instead:
def illegals = ~/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/
Note that I've also combined the character classes into one instead of as alternates. I've also removed the range definition when they're not necessary.
References
regular-expressions.info/Character Classes
On doubling the slash
Here's the relevant quote from java.util.regex.Pattern
Unicode escape sequences such as \u2014 in Java source code are processed as described in JLS 3.3. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
To illustrate, in Java:
System.out.println("\n".matches("\\u000A")); // prints "true"
However:
System.out.println("\n".matches("\u000A"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is because \u000A, which is the newline character, is translated in the second snippet at the source code level. The source code essentially becomes:
System.out.println("\n".matches("
"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is not legal Java source code.

Try this regular expression to remove Unicode characters from the string:
/*\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/

OK here's my finding:
>>> print "XYZ".replaceAll(
/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
"-"
)
---
>>> print "X\0YZ".replaceAll(
/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
"-"
)
X-YZ
>>> print "X\0YZ".replaceAll(
"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
"-"
)
X-YZ
In other words, my \\uNNNN answer within /pattern/ is WRONG. What happens is that 0-\ becomes part of the range, and this includes <, > and all capital letters.
The \\uNNNN only works in "pattern", not in /pattern/.
I will edit my official answer based on comments to this "answer".
Related questions
How to escape Unicode escapes in Groovy’s /pattern/ syntax

try
def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
