Why can't regex find the "(" in a Japanese string in C++?

I have a huge file of Japanese example sentences. It's set up so that one line is the sentence, and then the next line is comprised of the words used in the sentence separated by {}, () and []. Basically, I want to read a line from the file, find only the words in the (), store them in a separate file, and then remove them from the string.
I'm trying to do this with regexp. Here is the text I'm working with:
は 二十歳(はたち){20歳} になる[01]{になりました}
And here's the code I'm using to find the stuff between ():
std::smatch m;
std::regex e ("\(([^)]+)\)"); // matches things between ( and )
if (std::regex_search (components,m,e)) {
    printToTest(m[0].str(), "what we got"); //Prints to a test file "what we got: " << m[0].str()
    components = m.prefix().str().append(m.suffix().str());
    //components is a string
    printToTest(components, "[COMP_AFTER_REMOVAL]");
    //Prints to test file "[COMP_AFTER_REMOVAL]: " << components
}
Here's what should get printed:
what we got:はたち
[COMP_AFTER_REMOVAL]:は 二十歳(){20歳} になる[01]{になりました}
Here's what gets printed:
what we got:は 二十歳(はたち
[COMP_AFTER_REMOVAL]:){20歳} になる[01]{になりました}
It seems like somehow the は is being confused for a (, which makes the regexp go from は to ). I believe it's a problem with the way the line is being read in from the file. Maybe it's not being read in as utf8 somehow. Here's what I do:
xml_document finalDoc;
string sentence;
string components;
ifstream infile;
infile.open("examples.utf");
unsigned int line = 0;
string linePos;
bool eof = infile.eof();
while (!eof && line < 1){
    getline(infile, sentence);
    getline(infile, components);
    MakeSentences(sentence, components, finalDoc);
    line++;
}
Is something wrong? Any tips? Need more code? Please help. Thanks.

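You forgot to escape your backslashes. Inside an ordinary string literal the backslash is consumed by the compiler, so "\(([^)]+)\)" reaches the regex engine as (([^)]+)) (most compilers warn about the unknown escape sequence \(). That pattern has no literal parentheses at all: it is just two nested groups around [^)]+, so it matches everything from the start of the string up to the first ), which is exactly the output you got.
You need to type "\\(([^)]+)\\)". Also note that m[0] is the whole match, parentheses included; the text between them is m[1].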
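A raw string literal avoids the double escaping altogether. The following is only a sketch: it assumes the line is UTF-8 held in a std::string (the bytes for ( and ) never occur inside a multi-byte UTF-8 sequence, so byte-wise matching is safe for this pattern), and the helper name extractParenthesized is made up for illustration. It returns capture group 1 and removes the whole group from the string, parentheses included, which differs slightly from the expected output in the question, where the empty () stays behind.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text inside the first "(...)" and removes
// that whole group (parentheses included) from `components`.
// Returns an empty string and leaves `components` untouched if nothing matches.
std::string extractParenthesized(std::string& components) {
    // Raw string literal: the regex engine sees \(([^)]+)\) exactly as written.
    static const std::regex pattern(R"re(\(([^)]+)\))re");

    std::smatch m;
    if (!std::regex_search(components, m, pattern)) {
        return "";
    }
    const std::string inside = m[1].str();  // group 1: between the parentheses
    const std::string remaining = m.prefix().str() + m.suffix().str();
    components = remaining;                 // m is not used after this point
    return inside;
}
```

Calling it in a loop until it returns an empty string would collect every parenthesized word on the line.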

Related

Deleting from a certain point in a file to the end of the line?

I'm having some trouble with detecting two '//' as a char and then deleting from the first '/' till the end of the line (im guessing /n comes into use here).
{
    ifstream infile;
    char comment = '//';
    infile.open("test3.cpp");
    if (!infile)
    {
        cout << "Can't open input file\n";
        exit(1);
    }
    char line;
    while (!infile.eof())
    {
        infile.get(line);
        if (line == comment)
        {
            cout << "found it" << endl;
        }
    }
    return 0;
}
In the test3.cpp file there are three comments, so 3 lots of '//'. But I can't detect the double slash and can only detect a single / which will affect other parts of the c++ file as I only want to delete from the beginning of a comment to the end of the line?
I'm having some trouble with detecting two '//' as a char
That's because // is not a character. It is a sequence of two characters. A sequence of characters is known as a string. You can make string literals with double quotation marks: "//".
A simple solution is to compare the current input character from the stream to the first character of the string "//" which is '/'. If it matches, then compare the next character from the stream with the second character in the string that is searched for. If you find two '/' in a row, you have your match. Or you could be smart and read the entire line into a std::string and use the member functions to find it.
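The second suggestion (read a whole line into a std::string and search it) might look like the sketch below. The function name stripLineComment is hypothetical, and the approach is deliberately naive: it also cuts at a // that appears inside a string literal, such as a URL.

```cpp
#include <string>

// Returns `line` with everything from the first "//" to the end removed.
// Naive on purpose: a "//" inside a string literal is treated as a comment too.
std::string stripLineComment(const std::string& line) {
    const std::string::size_type pos = line.find("//");
    if (pos == std::string::npos) {
        return line;             // no comment on this line
    }
    return line.substr(0, pos);  // keep only the part before the comment
}
```

Each line read with std::getline can be passed through this function before it is written back out.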
Also:
while (!infile.eof())
{
    infile.get(line);
    // using line without testing eof- and badbit
This piece of code is wrong. It tests eofbit before the read, not after it, so when get fails at the end of the file the loop body still processes line, which then holds a stale value. Test the result of the read itself instead.
Also, line is a confusing name for that variable, since it holds a single character rather than an entire line.

Convert text to csv file in C++?

I got a text file that contain lots of line like the following:
data[0]: a=123 b=234 c=3456 d=4567 e=123.45 f=234.56
I am trying to extract the numbers and convert them to a CSV file so that Excel can import and recognize it.
My logic is: find the " " character, then chop the data out. For example, chop between the first " " and the second " ". Is it viable? I have been trying this but did not succeed.
Actually I want to create a csv file like
a, b, c, d, e, f
123, 234, 3456 .... blablabla
234, 345, 4567 .... blablabla
But it seems it is quite difficult to do this specific task.
Are there any utilities/better method that could help me to do this?
I suggest you take a look at boost::tokenizer; this is the best approach I have found. You will find several examples on the web. Also have a look at this high-score question.
Steps: for each line:
Cut string in two parts using the : character
Cut the right part into several strings using space character
separate the values using the = character, and stuff these into a std::vector<std::string>
Put these values in a file.
Last part can be something like:
std::ofstream f( "myfile.csv" );
for( const auto& s: vstrings )
    f << s << ',';
f << "\n";
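The first three steps can be sketched with std::string::find and substr alone. Note that this swaps boost::tokenizer for plain standard-library calls, and that the function name extractValues is an assumption for illustration; tokens without an = are silently skipped.

```cpp
#include <string>
#include <vector>

// Follows the steps above for one line such as
//   data[0]: a=123 b=234 c=3456
// and returns the values {"123", "234", "3456"}.
std::vector<std::string> extractValues(const std::string& line) {
    std::vector<std::string> values;

    // Step 1: keep only the part to the right of the first ':'.
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) {
        return values;
    }
    const std::string right = line.substr(colon + 1);

    // Step 2: walk over the space-separated tokens.
    std::string::size_type start = right.find_first_not_of(' ');
    while (start != std::string::npos) {
        const std::string::size_type end = right.find(' ', start);
        // If end is npos, substr simply takes the rest of the string.
        const std::string token = right.substr(start, end - start);

        // Step 3: keep what follows the '='.
        const std::string::size_type equals = token.find('=');
        if (equals != std::string::npos) {
            values.push_back(token.substr(equals + 1));
        }
        start = right.find_first_not_of(' ', end);
    }
    return values;
}
```

The returned vector plays the role of vstrings in the snippet above.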
An easy way with no non-Standard libraries is:
std::string line;
while (getline(input_stream, line))
{
    std::istringstream iss(line);
    std::string word;
    if (iss >> word) // throw away "data[n]:"
    {
        std::string identifier;
        std::string value;
        while (getline(iss, identifier, '=') && iss >> value)
            std::cout << value << ",";
        std::cout << '\n';
    }
}
You can tweak it if trailing commas cause Excel any trouble, and add more sanity checks (e.g. that each value is numeric, that fields are consistent across all lines), but the basic parsing above is a start.
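Two of those tweaks might look like this. Both helper names (joinCsv, isNumeric) are hypothetical, and the sketch assumes the values contain no commas or quotes, so no CSV quoting is done. isNumeric relies on std::strtod, which also accepts forms such as leading whitespace, hexadecimal floats and "inf".

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Joins fields with commas and leaves no trailing comma at the end of the row.
std::string joinCsv(const std::vector<std::string>& fields) {
    std::string row;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i != 0) {
            row += ',';
        }
        row += fields[i];
    }
    return row;
}

// True if std::strtod consumes the entire string as a number.
bool isNumeric(const std::string& s) {
    if (s.empty()) {
        return false;
    }
    char* end = nullptr;
    std::strtod(s.c_str(), &end);
    return end == s.c_str() + s.size();
}
```

Collecting the values of a line into a vector first, then writing joinCsv(values), replaces the std::cout << value << "," step.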

Read File line by line using C++

I am trying to read a file line by line using the code below :
void main()
{
    cout << "b";
    getGrades("C:\Users\TOUCHMATE\Documents\VS projects\GradeSystem\input.txt");
}
void getGrades(string file){
    string buf;
    string line;
    ifstream in(file);
    if (in.fail())
    {
        cout << "Input file error !!!\n";
        return;
    }
    while(getline(in, line))
    {
        cout << "read : " << buf << "\n";
    }
}
For some reason it keeps returning "Input file error !!!". I have tried the full path and a relative path (just the file name, as the file is located in the same folder as the project). What am I doing wrong?
You did not escape the backslashes in the string. Try changing it to:
getGrades("C:\\Users\\TOUCHMATE\\Documents\\VS projects\\GradeSystem\\input.txt");
Otherwise every \something sequence is interpreted as an escape sequence.
As Felice said, '\' is the escape character, so you need two of them.
Or you can use the '/' character, which Windows has accepted as a directory separator for a decade or more:
getGrades("C:/Users/TOUCHMATE/Documents/VS projects/GradeSystem/input.txt");
This has the advantage that it looks much neater.
First, to put a '\' in a string you have to write '\\'; that is the path issue.
Second, the string buf is never assigned anything from the file, so the loop prints an empty string every time; print line instead.
The backslash in C strings introduces escape sequences (e.g. \n is a newline, \r a carriage return, \t a tab, ...), so your string gets garbled: the compiler replaces each backslash+character sequence with the corresponding escape character. To enter a literal backslash in a C string you have to escape it, using \\:
getGrades("C:\\Users\\TOUCHMATE\\Documents\\VS projects\\GradeSystem\\input.txt");
By the way, it's int main, not void main, and you should return an exit code (usually 0 if everything went fine).

How to compare a string with certain words and if a match is found print the whole string

I am trying to write a little program that will load in a file, compare each line with a specific array of words, and if that line has any of those words in it then I want to "print" that line out to a file.
My current code is:
int main()
{
    string wordsToFind[13] =
    {"MS SQL", "MySQL", "Virus", "spoof", "VNC", "Terminal", "imesh", "squid",
    "SSH", "tivo", "udp idk", "Web access request dropped", "bounce"};
    string firewallLogString = "";
    ifstream firewallLog("C:\\firewalllogreview\\logfile.txt");
    ofstream condensedFirewallLog("C:\\firewalllogreview\\firewallLog.txt");
    if(firewallLog.fail())
    {
        cout << "The file does not exist. Please put the file at C:\\firewalllogreview and run this program again." << endl;
        system("PAUSE");
        return 0;
    }
    while(!firewallLog.eof())
    {
        getline(firewallLog, firewallLogString);
        for(int i = 0; i < 13; i++)
        {
            if(firewallLogString == wordsToFind[i])
            {
                firewallLogString = firewallLogString + '\n';
                condensedFirewallLog << firewallLogString;
                cout << firewallLogString;
            }
        }
    }
    condensedFirewallLog.close();
    firewallLog.close();
}
When I run the program it will compare the string, and if it matches it will only print out the specific word instead of the string. Any help would be much appreciated.
If I understand your problem correctly, you want to check whether the line contains one of the words and print it if it does.
Right now what you are doing is this:
if(firewallLogString == wordsToFind[i])
This checks whether the string exactly matches the word. So if the string contains one of the words but also has other words in it, the test fails.
Instead, check if the word is part of the string, like this:
if(firewallLogString.find(wordsToFind[i]) != string::npos)
There is a problem in your code. In this line
getline(firewallLog, firewallLogString);
you are reading a whole line, not a word, but later you compare that whole line with a single word from your array, so the if only succeeds when the line consists of exactly that word.
Instead, use a substring search such as strstr (on the underlying C strings) to look for each word in firewallLogString, and run the rest of your code when one is found.
Use std::string's find method to find the occurrence of your pattern words.

How to read a word into a string ignoring a certain character

I am reading a text file which contains a word with a punctuation mark on it and I would like to read this word into a string without the punctuation marks.
For example, a word may be " Hello, "
I would like the string to get " Hello " (without the comma). How can I do that in C++ using ifstream libraries only.
Can I use the ignore function to ignore the last character?
Thank you in advance.
Try ifstream::get(Ch* p, streamsize n, Ch term).
An example:
char buffer[64];
std::cin.get(buffer, 64, ',');
// will read up to 63 characters (the 64th slot holds the terminating '\0') or until a ',' is found
// For the string "Hello," it would stream in "Hello"
If you need to be more robust than simply a comma, you'll need to post-process the string. The steps might be:
Read the stream into a string
Use string::find_first_of() to help "chunk" the words
Return the word as appropriate.
If I've misunderstood your question, please feel free to elaborate!
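The post-processing steps listed above could be sketched as follows. The function name splitWords and the default delimiter set are assumptions chosen for illustration; adjust the set to whatever punctuation the file actually contains.

```cpp
#include <string>
#include <vector>

// Splits `text` into words, treating every character in `delimiters`
// (whitespace and common punctuation by default) as a separator.
std::vector<std::string> splitWords(const std::string& text,
                                    const std::string& delimiters = " \t\n,.;:!?") {
    std::vector<std::string> words;
    std::string::size_type start = text.find_first_not_of(delimiters);
    while (start != std::string::npos) {
        const std::string::size_type end = text.find_first_of(delimiters, start);
        // If end is npos, substr takes the rest of the text.
        words.push_back(text.substr(start, end - start));
        start = text.find_first_not_of(delimiters, end);
    }
    return words;
}
```

For the input " Hello, " this yields the single word Hello, with neither the comma nor the surrounding spaces.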
If you only want to ignore ',', you can use getline with ',' as the delimiter.
const int MAX_LEN = 128;
ifstream file("data.txt");
char buffer[MAX_LEN];
while(file.getline(buffer,MAX_LEN,','))
{
    cout<<buffer;
}
EDIT: This uses std::string and does away with MAX_LEN
ifstream file("data.txt");
string string_buffer;
while(getline(file,string_buffer,','))
{
    cout<<string_buffer;
}
One way would be to use the Boost String Algorithms library. There are several "replace" functions that can be used to replace (or remove) specific characters or strings in strings.
You can also use the Boost Tokenizer library for splitting the string into words after you have removed the punctuation marks.
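If Boost is not available, the standard library can do the removal as well. The sketch below is not the Boost API; it uses the erase-remove idiom with std::ispunct, and the name removePunctuation is made up for illustration. It assumes single-byte text in the default "C" locale.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns a copy of `word` with every punctuation character removed.
std::string removePunctuation(std::string word) {
    word.erase(std::remove_if(word.begin(), word.end(),
                              // unsigned char avoids undefined behaviour in
                              // std::ispunct for negative char values
                              [](unsigned char c) { return std::ispunct(c) != 0; }),
               word.end());
    return word;
}
```

A word read with file >> word can be passed through this function before it is stored.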