I am working on a lexical analyzer using C++Builder XE6, and this is what I've done so far: I have two memos (memoIN, memoOUT). memoIN contains the text to be analyzed and memoOUT the output (a list of tokens).
First, I strip all comments from the memoIN content using boost::regex, and this works like a charm. Now I'm stuck on how to extract all double-quoted strings from the text and display them in the output memo.
All I have so far is an expression that removes all double quotes, which is not what I need. I need to extract the quoted strings and display them, for example:
memoIN :
This is a "Double" Quote and this is "another one"
memoOUT :
<(String "Double") #Line 01 #Length 06)>
<(String "another one") #Line 01 #Length 11)>
Using Boost.Regex
Here's some sample code that demonstrates using boost::regex to extract text within quotes.
#include <string>
#include <iostream>
#include <boost/regex.hpp>
using namespace std;
using namespace boost;
int main(int argc, char **argv) {
// Capture any non-quotes that occur within double quotes.
boost::regex re("\"([^\"]+)\"");
// Input text
std::string memoIN = "This is a \"Double\" Quote and this is \"another one\"";
// Iterate through memoIN
boost::sregex_iterator m1(memoIN.begin(), memoIN.end(), re);
// Ending iterator (using the default constructor)
boost::sregex_iterator m2;
for (; m1 != m2; ++m1) {
// Replace this with code to organize memoOUT
std::cout << (*m1)[1].str() << std::endl;
}
return 0;
}
Using a lexer library
Depending on how sophisticated your needs are, you may find that you're better off in the long run using a dedicated lexer and parser generator (like ANTLR3 C) than writing your own with Boost.Regex.
Interfacing with UnicodeString
There are several approaches to handling mismatches between C++Builder's AnsiString and UnicodeString and Standard C++'s std::string and std::wstring. One simple approach is to convert UnicodeString to std::string for internal text manipulation then convert it back to UnicodeString for the UI. For example:
// Use AnsiString to convert from UTF-16 to a narrow character encoding
std::string memoIN_text = AnsiString(MemoIN->Text).c_str();
std::string memoOUT_text;
// Insert Boost.Regex manipulation here and assign the results to memoOUT_text
// Use implicit conversion from const char* to AnsiString/UnicodeString
MemoOUT->Text = memoOUT_text.c_str();
Converting from Unicode to ANSI may lose data, so you may want to use SetMultiByteConversionCodePage to tell C++Builder to use UTF-8 for AnsiString. (Character encoding is complicated enough to be its own topic.)
Related
I want to print the first letter of a string.
#include <iostream>
#include <string>
using namespace std;
int main() {
string str = "다람쥐 헌 쳇바퀴 돌고파.";
cout << str.at(0) << endl;
}
I want '다' to be printed, as it would be in Java, but '?' is printed.
How can I fix it?
That text you have in str -- how is it encoded?
Unfortunately, you need to know that to get the first "character". The std::string class only deals with bytes. How bytes turn into characters is a rather large topic.
The magic word you are probably looking for is UTF-8. See here for more information: How do I properly use std::string on UTF-8 in C++?
If you want to go down this road yourself, look here: Extract (first) UTF-8 character from a std::string
And if you're really interested, here's an hour-long video that is actually a great explanation of text encoding: https://www.youtube.com/watch?v=_mZBa3sqTrI
I tried a very simple code in C++:
#include <iostream>
#include <string>
int main()
{
std::wstring test = L"asdfa-";
test += u'ç';
std::wcout << test;
}
But the result was:
asdfa-?
It was not possible to print 'ç' with cout or wcout. How can I print this string correctly?
OS: Linux.
P.S.: I use wstring instead of string because I sometimes need to calculate the length of the string, and that length must match what appears on the screen.
P.S.: I need to concatenate the Unicode char; it can't go in the string constructor.
First, here's something that does work:
#include <iostream>
#include <string>
int main() {
std::string test = "asdfa-";
test += "ç";
std::cout << test;
}
I used just regular strings here and let C++ keep everything in UTF-8. I think you already know that this would work because you mentioned that you wanted to concatenate the ç rather than just leaving it in the string constructor.
Dealing with char, char16_t, char32_t, and wchar_t in C++ has never really been fun. You have to be careful with the L, u, and U prefixes.
However, where possible, if you deal with utf-8 strings, and avoid characters, you can generally get things to work much better. And since most consoles (with the possible exception of old Windows machines) understand utf-8 pretty well, this is the approach that often just works the best. So if you have wide characters, see if you can convert them to regular std::string objects and work in that domain.
One general way of handling this would be:
Input (convert from multibyte to wide using current locale)
Your App: work with wide strings
Output or saving to a file (convert from wide to multibyte)
For wide string manipulations such as counting characters or taking substrings, there is the wcsXXX family of functions.
If you are using libstdc++ on Linux: you forgot an essential call at the beginning of the program
std::locale::global(std::locale(""));
This is assuming you are on Linux and your locale supports UTF-8.
If you are using libc++: forget about using wstreams. This library does not support I/O of wide characters in a useful way (i.e. translation to UTF-8 like libstdc++ does).
Windows has a wholly separate set of quirks regarding Unicode. You are lucky if you don't have to deal with them.
demo with gcc/libstdc++ and a call to std::locale
demo with gcc/libstdc++ and no call to std::locale
Different versions of clang/libc++ behave differently with this example: some output ? instead of the non-ascii char, some output nothing; some crash on call to std::locale, some don't. None do the right thing, which is printing the ç, or maybe I just haven't found one that works. I don't recommend using libc++ if you need anything related to locale or wchar_t.
I solved this problem using a conversion function:
#include <iostream>
#include <string>
#include <codecvt>
#include <locale>
std::string wstr2str(const std::wstring& wstr) {
std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
return myconv.to_bytes(wstr);
}
int main()
{
std::wstring test = L"asdfa-";
test += L'ç';
std::string str = wstr2str(test);
std::cout << str;
}
I have to use a Unicode range in a regex in C++. Basically, I need a regex that accepts all valid Unicode characters. I tried the test expression below and ran into issues with it.
std::regex reg("^[\\u0080-\\uDB7Fa-z0-9!#$%&'*+/=?^_`{|}~-]+$");
Is the issue with \\u?
This should work fine but you need to use std::wregex and std::wsmatch. You will need to convert the source string and regular expression to wide character unicode (UTF-32 on Linux, UTF-16(ish) on Windows) to make it work.
This works for me where source text is UTF-8:
inline std::wstring from_utf8(const std::string& utf8)
{
// code to convert from utf8 to utf32/utf16
}
inline std::string to_utf8(const std::wstring& ws)
{
// code to convert from utf32/utf16 to utf8
}
int main()
{
std::string test = "john.doe#神谕.com"; // utf8
std::string expr = "[\\u0080-\\uDB7F]+"; // utf8
std::wstring wtest = from_utf8(test);
std::wstring wexpr = from_utf8(expr);
std::wregex we(wexpr);
std::wsmatch wm;
if(std::regex_search(wtest, wm, we))
{
std::cout << to_utf8(wm.str(0)) << '\n';
}
}
Output:
神谕
Note: If you need a UTF conversion library I used THIS ONE in the example above.
Edit: Or, you could use the functions given in this answer:
Any good solutions for C++ string code point and code unit?
Rule I must abide by
Do not use loops or character arrays to process strings for any of the questions below. Use member functions of the string class. You can use a loop to read the file and to count the number of processors.
Some Tips
Here are some functions that you might find useful:
File class: getline
String class: find, rfind, substr, length, c_str, constant npos
Misc. functions: atoi, atof
(may require the C standard library for C++, i.e., )
istringstream
(Both of the above are ways to convert a string to a number.)
Here is an example string I would need to extract:
"46 bits physical, 48 bits virtual"
I can go through the same string twice. I'd want to grab 46 and store it and then do the same for 48.
I'm not sure the best way to go about this. Is it possible to do something like this:
string.find_first_of(integer);
string.find_last_not_of(integer);
Or possibly regex? I think I can use that as long as I don't need to use a 3rd party library or anything like that.
The following ended up working for me.
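#include <iostream>
#include <sstream>
#include <string>
using namespace std;
int main() {
string myString = "hello 47";
string word;
int val = 0;
istringstream iss(myString);
// Read and discard the leading word first; extracting an int directly would fail on "hello".
iss >> word >> val;
cout << val << endl;
// Prints 47.
}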
Since you indicated in the comments that STL is allowed, you can use a generic programming approach relying on STL algorithms. For example,
#include <iostream>
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>
int main()
{
using namespace std;
string haystack = "46 bits physical, 48 bits virtual";
string result;
// Copy only digits and whitespace; every other character is dropped.
remove_copy_if(begin(haystack), end(haystack),
back_inserter(result),
[](unsigned char c) { return !isspace(c) && !isdigit(c); } );
cout << result;
}
You basically treat the characters in the string as a stream of inputs, filter out every character that is not a digit, and keep whatever delimiter character you want to use. My example keeps whitespace as the delimiter.
The above gives the following output (every space from the input is kept, so the numbers are separated by several spaces):
46 48
I am trying to print Unicode characters in C++. My Unicode characters are Old Turkic, and I have the font. When I use a letter's code, it gives me a different character. For example:
#include <iostream>
#include <string>
using namespace std;
int main()
{
string str = "\u10C00"; // My character's unicode code.
cout << str << endl;
return 0;
}
This snippet outputs a different letter with a 0 right after it.
For example, it gives me this (let's assume that I want to print the letter 'Ö'):
A0
But when I copy and paste the actual letter into my source from the character map application in Ubuntu, it gives me what I want. What is the problem here? I want to use the character code form "\u10C00", but it doesn't work properly. I think this string is too long, so it uses the first 6 characters and pops out the 0 at the end. How can I fix this?
The \u escape must be followed by exactly 4 hexadecimal digits. If you need more, use \U, which takes exactly 8 hexadecimal digits.
Example:
"\u00D6" // 'Ö' letter
"\u10C00" // incorrect: parsed as "\u10C0" followed by a literal '0'
"\U00010C00" // your character
std::string does not really support Unicode; use std::wstring instead.
But even std::wstring can have problems, since wchar_t is only 16 bits on some platforms (notably Windows) and cannot hold every code point in a single element.
An alternative would be to use an external string class such as Glib::ustring if you use gtkmm, or QString in the case of Qt.
Almost every GUI toolkit and many other libraries provide their own string class to handle Unicode.