I have the following output created using a printf() statement:
printf("She said time flies like an arrow, but fruit flies like a banana.");
but I want to put the actual quotation in double-quotes, so the output is
She said "time flies like an arrow, but fruit flies like a banana".
without interfering with the double quotes used to wrap the string literal in the printf() statement.
How can I do this?
Escape the quotes with backslashes:
printf("She said \"time flies like an arrow, but fruit flies like a banana\".");
There are special escape characters that you can use in string literals, and these are denoted with a leading backslash.
Thankfully, with C++11 there is also the more pleasing approach of using raw string literals.
printf("She said \"time flies like an arrow, but fruit flies like a banana\".");
Becomes:
printf(R"(She said "time flies like an arrow, but fruit flies like a banana".)");
Regarding the parentheses added after the opening quote and before the closing quote: they can be combined with a delimiter of your choice, placed between the quote and the parenthesis at both ends (for example R"xyz(...)xyz"). The delimiter can be almost any combination of up to 16 characters, which helps avoid the situation where the closing sequence is present in the string itself. Specifically:
"any member of the basic source character set except: space, the left
parenthesis (, the right parenthesis ), the backslash \, and the
control characters representing horizontal tab, vertical tab, form
feed, and newline" (N3936 §2.14.5 [lex.string] grammar) and "at most
16 characters" (§2.14.5/2)
How much clearer this makes short strings is debatable, but for longer formatted strings like HTML or JSON it is far clearer.
I am trying to write a regular expression that can be used to identify long sentences in a document, in my case a scientific manuscript. I want to do this either in LibreOffice or in any text editor with regex search.
So far I got the following expression to work on most occasions:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+){24,}?(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
btw, I got inspired from this post
It contains:
group1:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*,*\:*\s+)
a repetition element (stating how many words n - 1):
{24,}?
group2:
(\[*\(*[\w|\-|–|−|\/|≥|≤|’|“|”|μ]+\%*\)*\]*[\.|?|!|$])
The basic functioning is:
group1 matches any number of word characters OR other characters that are present in the text followed by one or more spaces
group1 has to be repeated at least 24 times (adjust the number to however long you want the sentences to be)
group2 matches any number of word characters OR other characters that are present in the text followed by a full stop, exclamation mark, question mark or paragraph break.
Any string that fulfills all the above would then be highlighted.
What I can't solve so far is to make it work when a dot appears in the text with another meaning than a full stop. Things like: i.e., e.g., et al., Fig., 1.89, etc....
Also I don't like that I had to manually adjust it to be able to handle sentences that contain non-word characters such as , [ ( % - # µ " ' and so on. I would have to extend the expression every time I come across some other uncommon character.
I'd be happy for any help or suggestions of other ways to solve this.
You can do a lot with the swiss-army-knife that is regular expressions, but the problem you've presented approaches regex's limits. Some of the things you want to detect can probably be handled with really small changes, while others are a bit harder. If your goal is to have some kind of tool that accurately measures sentence length for every possible mutation of characters, you'll probably need to move outside LibreOffice to a dedicated custom piece of software or a third-party tool.
But, that said, there are a few tricks you can worm into your existing regex to make it work better, if you want to avoid programming or another tool. Let's look at a few of the techniques that might be useful to you:
You can probably tweak your regex for a few special cases, like Fig. and Mr., by including them directly. Where you currently have [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, which is basically [\w]+ with a bunch of other "special" characters, you could use something like ([\w|...]+|Mr\.|Mrs\.|Miss\.|Fig\.) (substituting in all the special characters where I wrote ..., of course). Regexes are "greedy" algorithms, and will try to consume as much of the text as possible, so by including special "dot words" directly, you can make the regex "skip over" certain period characters that are problematic in your text. Make sure that when you want to add a "period to skip" that you always precede it with a backslash, like in i\.e\., so that it doesn't get treated as the special "any" character.
A similar trick can capture numbers better by assuming that digits followed by a period followed by more digits are supposed to "eat" the period: ([\w|...]+|\d+\.\d+|...) That doesn't handle everything, and if your document authors are writing stuff like 0. in the middle of sentences then you have a tough problem, but it can at least handle pi and e properly.
Also, right now, your regex consumes characters until it reaches any terminating punctuation character: a ., a !, a ?, or the end of the paragraph ($). That's a problem for things like i.e. and 3.14, since as far as your regex is concerned, the sentence stops at the first dot. You could require your regex to stop the sentence only once a period followed by a space is reached. That wouldn't fix mismatches for words like Mr., but it would treat "words" like 3.14 as a word instead of as the end of a sentence, which is closer than you currently are. To do this, you'll have to include an odd sequence as part of the "word" regex, something like (\.[^ ]), which says "dot followed by not-a-space" is part of the word; and then you'll have to change the terminating sequence to (\. |!|\?|$), with the ? escaped so that it isn't read as a quantifier. Repeat the changes similarly for ! and ?.
Another useful trick is to take advantage of character-code ranges, instead of encoding each special character directly. Right now, you're doing it the hard way, by spelling out every accented character and digraph and diacritic in the universe. Instead, you could just say that everything that's a "special character" is considered to be part of the "word": Instead of [\w|\-|–|−|\/|≥|≤|’|“|”|μ]+, write [\w\-\/\u0080-\uFFFF]+, which captures every character except emoji and a few from really obscure dead languages. (Inside a character class, | is a literal pipe rather than "or", so the separators can simply be dropped.) LibreOffice seems to have Unicode support, so using \uXXXX patterns should work inside [ character ranges ].
This is probably enough to make your regex somewhat acceptable in LibreOffice, and might even be enough to answer your question. But if you're really intent on doing more complex document analysis like this, you may be better off exporting the document as plain text and then running a specialized tool on it.
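If you do export the manuscript as plain text, a small script can apply the same ideas (skip known abbreviations, skip dots between digits, end a sentence only at punctuation followed by whitespace). The following is a minimal Python sketch, not a finished tool: the abbreviation list, the placeholder character, the function name long_sentences and the 25-word threshold are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately incomplete list; extend it for your manuscript.
ABBREVIATIONS = ("i.e.", "e.g.", "et al.", "Fig.")
PLACEHOLDER = "\u0000"  # assumed never to occur in the text


def long_sentences(text, min_words=25):
    """Return the sentences of `text` that contain at least `min_words` words."""
    protected = text
    # Hide dots inside known abbreviations so they are not taken as sentence ends.
    for abbr in ABBREVIATIONS:
        protected = protected.replace(abbr, abbr.replace(".", PLACEHOLDER))
    # Hide dots between digits, as in 1.89.
    protected = re.sub(r"(?<=\d)\.(?=\d)", PLACEHOLDER, protected)
    # A sentence ends at ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", protected)
    sentences = [part.replace(PLACEHOLDER, ".").strip() for part in parts]
    return [s for s in sentences if len(s.split()) >= min_words]
```

A sentence that genuinely ends in one of the listed abbreviations will be merged with the following one, so the list is a trade-off rather than a complete solution.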
I am using this software, dk-brics-automaton, to get the number of states
of regular expressions. For example, I have this type of RE:
^SEARCH\s+[^\n]{10}
When I insert it as a string, as below, the compiler reports an invalid escape sequence:
RegExp r = new RegExp("^SEARCH\s+[^\n]{10}", ALL);
where ALL is a certain flag.
When I use a double backslash before the lowercase s, the compiler accepts the string.
But here \s is supposed to mean whitespace, and I am worried that with a double backslash
it will be read as a literal backslash followed by "s" rather than as whitespace.
I have thousands of such regular expressions for which I want to compute the finite automaton
states. Does that mean I have to add the backslashes manually in all of them?
Here is a link where they have explained something related to this but I am not getting it:
http://www.brics.dk/automaton/doc/index.html
Please help me if anyone has some past experience in this software or if you have any idea to solve this issue.
I had another look at that documentation. "automaton" is a Java package, therefore I think you have to treat its regexes like Java regexes. So just double every backslash inside a regex.
The thing here is, Java does not know "raw" strings. So you have to escape for two levels. The first level that evaluates escape sequences is the string level.
The string level does not know an escape sequence \s; that is the error. \n is fine: the string evaluates it and stores the single character 0x0A instead of the two characters \ (0x5C) and n (0x6E).
The string is then stored and handed over to the regex constructor, where the next round of escape sequence evaluation happens.
So if you want to escape for the regex level, then you have to double the backslashes. The string level will evaluate the \\ to \ and so the regex level gets the correct escape sequences.
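The two levels can be seen in any language whose string literals process escapes. As an illustration only, here is a small Python sketch (Python's re module, not dk.brics.automaton) showing that the doubled form is the same text the regex engine needs, plus a hypothetical helper, to_java_literal, that doubles the backslashes mechanically in case you really do need to paste many patterns into Java source.

```python
import re

# String level: "\\s" in source code is two characters, a backslash and "s".
doubled = "^SEARCH\\s+[^\\n]{10}"
# A raw literal skips the string level, so it is written exactly as the regex.
raw = r"^SEARCH\s+[^\n]{10}"

# Both literals produce the same text, which is what the regex level receives.
same_text = doubled == raw
matches = re.match(doubled, "SEARCH   abcdefghij") is not None


def to_java_literal(pattern):
    """Hypothetical helper: turn regex text into a Java string literal."""
    escaped = pattern.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'
```

Escape processing at the string level happens only for literals written in source code. Patterns loaded at run time, for example one per line from a text file, reach the regex constructor unchanged and need no extra backslashes.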
I need to be able to handle data that can look like:
set setting1 "bind button_x +actionslot1;bind button_y \" bind button_x +stance \" "
bind button_a jump
set setting2 1 1 0 1
toggle setting_3 " \"value 1\" \"value 2\" \"value 3\" "
These are what some of the commands for the console of a game look like, and I'm trying to write an emulator of sorts that will interpret the code the same way the game will.
The first thing that comes to mind is regex, but I'm not sure it's the best option. For example, when matching the value of a setting, I might try something like /set [\w_]+ "?(.+)"?/, but the wildcard matches the ending quote because it's greedy, and if I make it lazy, it matches the quote inside the value. If I keep it greedy and stop it from matching quotes, it won't match the escaped quotes in the values.
Even if there are possible regex solutions, they seem like the wrong option. I had asked before about how programs like Visual Studio and Notepad++ know which parentheses and curly braces matched, and I was told there was something similar to regex in some ways but much more powerful.
The only other thing I can think of is to go through the lines of code character by character and use booleans to track the state of the current character.
What are my options here? What do game developers use to handle console commands?
edit: Here's another possible command which strongly deters me from using regex:
set setting4 "bind button_a \" bind button_b "\" set setting1 0 \" " \" "
The commands include not just escaped quotes, but quotes of the manner "\" inside escaped quotes.
I would suggest you read about lexical analysis: this is the process of tokenizing your text using a grammar.
I think it will help you with what you are trying to do.
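To give a rough idea of what a hand-written lexer for such commands might look like, here is a minimal Python sketch. It rests on assumptions about the syntax, not on knowledge of the actual game: tokens are separated by whitespace, a double quote opens or closes a quoted token, and \" stands for a literal quote. The function name tokenize is made up for the example, and the sketch does not attempt the deeper nesting from the edited question.

```python
def tokenize(line):
    """Split one console line into tokens (assumed syntax, see lead-in)."""
    tokens = []
    current = []        # characters of the token being built
    in_quotes = False   # True while inside an unescaped "..." pair
    has_token = False   # lets an empty quoted string still count as a token
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] == '"':
            # Escaped quote: keep a literal quote, skip both characters.
            current.append('"')
            has_token = True
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
            has_token = True
        elif ch.isspace() and not in_quotes:
            # Whitespace outside quotes ends the current token.
            if has_token:
                tokens.append("".join(current))
                current, has_token = [], False
        else:
            current.append(ch)
            has_token = True
        i += 1
    if has_token:
        tokens.append("".join(current))
    return tokens
```

The quoted value comes back as a single token with its escaped quotes turned into plain quotes, so it can be passed through tokenize again when the inner command is executed.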
I don't want to keep you on the path of regex -- you are correct that there are non-regex solutions that may be more appropriate (I just don't know what they are). However, here is one possible regex that should fix your quotes issue:
/set [\w_]+ "?((\\"|[^"])+)"?/
I changed .+ to (\\"|[^"])+. Basically it's matching occurrences of \" OR of anything that isn't a quote. In other words, it will match anything except quotes that aren't escaped.
Again, if someone can suggest a more sophisticated non-regex solution, you should strongly consider it.
Edit: The updated example you've provided breaks this solution, and I think it would break any regex solution.
Edit 2: Here is a C# string version of your regex. It uses @ to tell the compiler to treat the string as a verbatim literal, which means it ignores \ as an escape character. The only caveat is that in order to represent " in a verbatim literal you have to type it as "", but it's still better than having slashes everywhere. Given the prevalence of escape sequences in regexes, I recommend using verbatim literals anywhere that you have to type a regex in a string.
string pattern = @"set [\w_]+ ""?((\\""|[^""])+)""?";
I want to be able to take a string of text from the user that should be formatted like this:
.ext1 .ext2 .ext3 ...
Basically, I am looking for a dot, then a string of alphanumeric characters of any length, then a space, and so on. I am a little confused about how to say "I need a period, a string of characters and a space". Also, the last extension could be followed by nothing, by a space, or by a series of spaces. I guess the extensions could also be separated by any number of spaces?
EDIT: I made it clearer what I was looking for.
Thanks!
Try this:
^(?:\.[A-Za-z0-9]+ +)*\.[A-Za-z0-9]+ *$
(Rubular)
In a Java string literal you need to escape the backslashes:
"^(?:\\.[A-Za-z0-9]+ +)*\\.[A-Za-z0-9]+ *$"
(\.\w+)\s* Match this and get your results.
^((\.\w+)\s*)*$ Check this and if it's true, your String is exactly what you want.
As for the last requirement, you can't (AFAIK) both extract all the extensions separately and check what follows the last one with a single pattern. Either you check your string, or you extract the extensions from it.
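The check-then-extract idea can be sketched in two steps. The example below is in Python and reuses the validation pattern from the first answer; the function name parse_extensions and the choice to return None for invalid input are assumptions made for illustration.

```python
import re

# Whole-string check: extensions separated by one or more spaces,
# optional trailing spaces after the last one.
VALID = re.compile(r"^(?:\.[A-Za-z0-9]+ +)*\.[A-Za-z0-9]+ *$")
# Single extension: a dot followed by alphanumeric characters.
EXTENSION = re.compile(r"\.[A-Za-z0-9]+")


def parse_extensions(text):
    """Return the list of extensions, or None if the input is malformed."""
    if not VALID.match(text):
        return None
    return EXTENSION.findall(text)
```

Validating first means the extraction step can stay simple, since it only ever sees input that already has the right shape.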
I'd start with something like: ^\.[a-z0-9]+([\t\n\v ]+\.[a-z0-9]+)*$ (note the escaped dots; an unescaped . would match any character).