Passing string Argument, read from a file - c++

I am Trying to find a regex pattern in a text. Let's call the text: the original Text.
The following is the code for the patternFinder() program:
vector <pair <long,long> >CaddressParser::patternFinder(string pattern)
{
string m_text1=m_text;
int begin =0;
int end=0;
smatch m;
regex e (pattern);
vector<pair<long, long>> indices;
if(std::regex_search(m_text1,m,e))
{
begin=m.position();
end=m.position()+m.length()-1;
m_text1 = m.suffix().str();
indices.push_back(make_pair(begin,end));
while(end<m_length&&std::regex_search(m_text1,m,e))
{
begin=end+m.prefix().length()+1;
end=end+m.prefix().length()+m.length();
indices.push_back(make_pair(begin,end));
m_text1 = m.suffix().str();
}
return indices;
}
else return indices;
}
I have the following regular Expression:
"\\b[0-9]{3}\\b.*(Street).*[0-9]{5}"
and the Original text mentioned at the beginning is:
way 10.01.2013 700 West Market Street OH 35611 asdh
and only the bold text is supposed to match the regex.
Now the Problem is when the regex is passed as a string which has been read from a text file the patternFinder() does not recognize the pattern.Though when a direct string (which is identical to the one in the text file) is passed as an argument to patternFinder() it works.
Where could this problem coming from?
The following is the code of my fileReader() function which I don't think is very relevant to mention:
string CaddressParser::fileReader(string fileName)
{
string text;
FILE *fin;
fin=fopen(fileName.c_str(),"rb" );
int length=getLength(fileName);
char *buffer= new char[length];
fread(buffer,length,1,fin);
buffer[length]='\0';
text =string(buffer);
fclose(fin);
return text;
}

Note that there is an apparent syntactic difference when writing the regex directly into C++ code and when reading it from a file.
In C++, the backslash character has escape semantics, so to put a literal backslash into a string literal, you must escape it itself with a backslash. So to get a a two-character string \b in memory, you have to use a string literal "\\b". The two backslashes are interpreted by the C++ compiler as a single backslash character to be stored in the literal. In other words, strlen("\\b") is 2.
On the other hand, contents of a text file are read by your program and never processed by the C++ compiler. So to get the two characters \ and b into a string read from a file, write just the two-character string \b into the file.

The problem is probably in the function reading the string from the file. Print the string read and make sure the regular expression is being read correctly.

The problem is in these 2 lines
buffer[length]='\0';
text =string(buffer);
buffer[length] should have been buffer[length - 1]

Related

How can I read CSV file in to vector in C++

I'm doing the project that convert the python code to C++, for better performance. That python project name is Adcvanced EAST, for now, I got the input data for nms function, in .csv file like this:
"[ 5.9358170e-04 5.2773970e-01 5.0061589e-01 -1.3098677e+00
-2.7747922e+00 1.5079222e+00 -3.4586751e+00]","[ 3.8175487e-05 6.3440394e-01 7.0218205e-01 -1.5393494e+00
-5.1545496e+00 4.2795391e+00 -3.4941311e+00]","[ 4.6003381e-05 5.9677261e-01 6.6983813e-01 -1.6515008e+00
-5.1606908e+00 5.2009044e+00 -3.0518508e+00]","[ 5.5172237e-05 5.8421570e-01 5.9929764e-01 -1.8425952e+00
-5.2444854e+00 4.5013981e+00 -2.7876694e+00]","[ 5.2929961e-05 5.4777789e-01 6.4851379e-01 -1.3151239e+00
-5.1559062e+00 5.2229333e+00 -2.4008298e+00]","[ 8.0250458e-05 6.1284608e-01 6.1014801e-01 -1.8556541e+00
-5.0002270e+00 5.2796564e+00 -2.2154367e+00]","[ 8.1256607e-05 6.1321974e-01 5.9887391e-01 -2.2241254e+00
-4.7920742e+00 5.4237065e+00 -2.2534993e+00]
one unit is 7 numbers, but a '\n' after first four numbers,
I wanna read this csv file into my C++ project,
so that I can do the math work in C++, make it more fast.
using namespace std;
void read_csv(const string &filename)
{
//File pointer
fstream fin;
//open an existing file
fin.open(filename, ios::in);
vector<vector<vector<double>>> predict;
string line;
while (getline(fin, line))
{
std::istringstream sin(line);
vector<double> preds;
double pred;
while (getline(sin, pred, ']'))
{
preds.push_back(preds);
}
}
}
For now...my code emmmmmm not working ofc,
I'm totally have no idea with this...
please help me with read the csv data into my code.
thanks
Unfortunately parsing strings (and consequently files) is very tedious in C++.
I highly recommend using a library, ideally a header-only one, like this one.
If you insist on writing it yourself, maybe you can draw some inspiration from this StackOverflow question on how to parse general CSV files in C++.
You could look at getdelim(',', fin, line),
But the other issue will be those quotes, unless you /know/ the file is always formatted exactly this way, it becomes difficult.
One hack I have used in the past that is NOT PERFECT, if the first character is a quote, then the last character before the comma must also be a matching quote, and not escaped.
If it is not a quote then getdelim() some more, but the auto-alloc feature of getdelim means you must use another buffer. In C++ I end up with a vector of all the pieces of getdelim results that then need to be concatenated to make the final string:
std::vector<char*> gotLine;
gotLine.push_back(malloc(2));
*gotLine.back() = fgetch();
gotLine.back()[1] = 0;
bool gotquote = *gotLine.back() == '"'; // perhaps different classes of quote
if (*gotLine.back() != ',')
for(;;)
{
char* gotSub= nullptr;
gotSub=getdelim(',');
gotLine.push_back(gotSub);
if (!gotquote) break;
auto subLen = strlen(gotSub);
if (subLen>1 && *(gotSub-1)=='"') // again different classes of quote
if (sublen==2 || *(gotSub-2)!='\\') // needs to be a while loop
break;
}
Then just concatenate all these string segments back together.
Note that getdelim supports null bytes. If you expect null bytes in the content, and not represented by the character sequences \000 or \# you need to store the actual length returned by getdelim, and use memcpy to concatenate them.
Oh, and if you allow utf-8 extended quotes it gets very messy!
The case this doesn't cover is a string that ends \\" or \\\\". Ideally you need to while count the number of leading backslashes, and accept the quote if the count is even.
Note that this leave the issue of unescaping the quoted content, i.e. converting any \" into ", and \\ into \, etc. Also discarding the enclosing quotes.
In the end a library may be easier if you need to deal with completely arbitrary content. But if the content is "known" you can live without.

c++ Function to add an extra '\' to a filepath?

I have about 3500 full file paths to sort through (ex. "C:\Users\Nick\Documents\ReadIns\NC_000852.gbk"). I just learned that c++ does not recognize the single backslash when reading in a file path. I have about 3500 file paths that I am reading in so it would be overly tedious to manually change each one.
I have this for loop that finds the single backslash and inserts a double backslash at that index. This:
string line = "C:\Users\Nick\Documents\ReadIns\NC_000852.gbk";
for (unsigned int i = 0; i < filepath.size(); i++) {
if(filepath[i] == '\') {
filepath.insert(i, '\');
}
}
However, c++, specifically on c::b, does not compile because of the backslash character. Is there a way to add in the extra backslash character with a function?
I am reading the filepaths in from a text file, so they are being read into the string filepath variable, this is just a test.
Use double backslash as '\\' and "C:\\Users...". Because single backslash with the next character makes an escape.
Also the string::insert() method's 2nd argument expects number of characters, which is missing in your code.
With all those fixes, it compiles fine:
string filepath = "C:\\Users\\Nick\\Documents\\ReadIns\\NC_000852.gbk";
// ^^ ^^ ^^ ^^ ^^
for (unsigned int i = 0; i < filepath.size(); i++) {
if(filepath[i] == '\\') {
// ^^
filepath.insert(i, 1, '\\');
} // ^^^^^^^
}
I am not sure, how above logic will work. But below is my preferred way:
for(auto pos = filepath.find('\\'); pos != string::npos; pos = filepath.find('\\', ++pos))
filepath.insert(++pos, 1, '\\');
If you had only single character to be replaced (e.g. linux system or probably supported in windows); then, you may also use std::replace() to avoid the looping as mentioned in this answer:
std::replace(filepath.begin(), filepath.end(), '\\', '/');
I assumed that, you already have a file created which contains single backslashes and you are using that for parsing.
But from your comments, I notice that apparently you are getting the file paths directly in runtime (i.e. while running the .exe). In that case, as #MSalters has mentioned, you need not worry about such transformations (i.e. changing the backslashes).
The problem that you're seeing is because in C++, string literals are commonly enclosed in "" quotes. This brings up one minor problem: how do you put a quote inside a string literal, when that quote would end the string literal. The solution is escaping it with a \. This can also be used to add a few other characters to a string, such as \n (newline). And since \ now has a special meaning in string literals, it's also used to escape itself. So "\\" is a string containing just one character (and of course a trailing NUL).
This also applies to character literals: char example[4] = {'a', '\\', 'b', 0} is an alternative way to write "a\\b".
Now this is all about compile time, when the compiler needs to separate C++ code and string contents. Once your executable is running, a backslash is just one char. std::cout << "a\\b" prints a single backslash, because there's only one in memory. std::String word; std::cin >> word will read a single word, and if you enter one backslash then word will contain one backslash. The compiler isn't involved in that.
So if you read 3500 filenames from a std::ifstream list_of_filenames and then use that to create a further 3500 std::ifstreams, you only need to worry about backslashes in specifying that very first filename in code. And if ou take that filename from argv[1] instead, you don't need to care at all.
One way to get rid of special handling of backslash is to keep all file names in a separate disk file as such and use file stream objects such as ifstream to get file names in C++ format.
TCHAR tcszFilename[MAX_PATH] = {0};
ifstream ObjInFiles( "E:\\filenames.txt" );
ObjInFiles.getline( tcszFilename, MAX_PATH );
ObjInFiles.close();
Suppose first file name stored in filenames.txt is "e:\temp\abc.txt" then after executing getline() above, the variable tcszFilename will hold "e:\\temp\\abc.txt".

Why can't regex find the "(" in a Japanese string in C++?

I have a huge file of Japanese example sentences. It's set up so that one line is the sentence, and then the next line is comprised of the words used in the sentence separated by {}, () and []. Basically, I want to read a line from the file, find only the words in the (), store them in a separate file, and then remove them from the string.
I'm trying to do this with regexp. Here is the text I'm working with:
は 二十歳(はたち){20歳} になる[01]{になりました}
And here's the code I'm using to find the stuff between ():
std::smatch m;
std::regex e ("\(([^)]+)\)"); // matches things between ( and )
if (std::regex_search (components,m,e)) {
printToTest(m[0].str(), "what we got"); //Prints to a test file "what we got: " << m[0].str()
components = m.prefix().str().append(m.suffix().str());
//commponents is a string
printToTest(components, "[COMP_AFTER_REMOVAL]");
//Prints to test file "[COMP_AFTER_REMOVAL]: " << components
}
Here's what should get printed:
what we got:はたち
[COMP_AFTER_REMOVAL]:は 二十歳(){20歳} になる[01]{になりました}
Here's what gets printed:
what we got:は 二十歳(はたち
[COMP_AFTER_REMOVAL]:){20歳} になる[01]{になりました}
It seems like somehow the は is being confused for a (, which makes the regexp go from は to ). I believe it's a problem with the way the line is being read in from the file. Maybe it's not being read in as utf8 somehow. Here's what I do:
xml_document finalDoc;
string sentence;
string components;
ifstream infile;
infile.open("examples.utf");
unsigned int line = 0;
string linePos;
bool eof = infile.eof();
while (!eof && line < 1){
getline(infile, sentence);
getline(infile, components);
MakeSentences(sentence, components, finalDoc);
line++;
}
Is something wrong? Any tips? Need more code? Please help. Thanks.
You forgot to escape your backslashes. The compiler sees "\(([^)]+)\)" and interprets it as (([^)]+)) which is not the regex you wanted.
You need to type "\\(([^)]+)\\)"

seekg() not working as expected

I have a small program, that is meant to copy a small phrase from a file, but it appears that I am either misinformed as to how seekg() works, or there is a problem in my code preventing the function from working as expected.
The text file contains:
//Intro
previouslyNoted=false
The code is meant to copy the word "false" into a string
std::fstream stats("text.txt", std::ios::out | std::ios::in);
//String that will hold the contents of the file
std::string statsStr = "";
//Integer to hold the index of the phrase we want to extract
int index = 0;
//COPY CONTENTS OF FILE TO STRING
while (!stats.eof())
{
static std::string tempString;
stats >> tempString;
statsStr += tempString + " ";
}
//FIND AND COPY PHRASE
index = statsStr.find("previouslyNoted="); //index is equal to 8
//Place the get pointer where "false" is expected to be
stats.seekg(index + strlen("previouslyNoted=")); //get pointer is placed at 24th index
//Copy phrase
stats >> previouslyNotedStr;
//Output phrase
std::cout << previouslyNotedStr << std::endl;
But for whatever reason, the program outputs:
=false
What I expected to happen:
I believe that I placed the get pointer at the 24th index of the file, which is where the phrase "false" begins. Then the program would've inputted from that index onward until a space character would have been met, or the end of the file would have been met.
What actually happened:
For whatever reason, the get pointer started an index before expected. And I'm not sure as to why. An explanation as to what is going wrong/what I'm doing wrong would be much appreciated.
Also, I do understand that I could simply make previouslyNotedStr a substring of statsStr, starting from where I wish, and I've already tried that with success. I'm really just experimenting here.
The VisualC++ tag means you are on windows. On Windows the end of line takes two characters (\r\n). When you read the file in a string at a time, this end-of-line sequence is treated as a delimiter and you replace it with a single space character.
Therefore after you read the file you statsStr does not match the contents of the file. Every where there is a new line in the file you have replaced two characters with one. Hence when you use seekg to position yourself in the file based on numbers you got from the statsStr string, you end up in the wrong place.
Even if you get the new line handling correct, you will still encounter problems if the file contains two or more consecutive white space characters, because these will be collapsed into a single space character by your read loop.
You are reading the file word by word. There are better methods:
while (getline(stats, statsSTr)
{
// An entire line is read into statsStr.
std::string::size_type posn = statsStr.find("previouslyNoted=");
// ...
}
By reading entire text lines into a string, there is no need to reposition the file.
Also, there is a white-space issue when reading by word. This will affect where you think the text is in the file. For example, white space is skipped, and there is no telling how many spaces, newlines or tabs were skipped.
By the way, don't even think about replacing the text in the same file. Replacement of text only works if the replacement text has the same length as the original text in the file. Write to a new file instead.
Edit 1:
A better method is to declare your key strings as array. This helps with positioning pointers within a string:
static const char key_text[] = "previouslyNoted=";
while (getline(stats, statsStr))
{
std::string::size_type key_position = statsStr.find(key_text);
std::string::size_type value_position = key_position + sizeof(key_text) - 1; // for the nul terminator.
// value_position points to the character after the '='.
// ...
}
You may want to save programming type by making your data file conform to an existing format, such as INI or XML, and using appropriate libraries to parse them.

How to read a word into a string ignoring a certain character

I am reading a text file which contains a word with a punctuation mark on it and I would like to read this word into a string without the punctuation marks.
For example, a word may be " Hello, "
I would like the string to get " Hello " (without the comma). How can I do that in C++ using ifstream libraries only.
Can I use the ignore function to ignore the last character?
Thank you in advance.
Try ifstream::get(Ch* p, streamsize n, Ch term).
An example:
char buffer[64];
std::cin.get(buffer, 64, ',');
// will read up to 64 characters until a ',' is found
// For the string "Hello," it would stream in "Hello"
If you need to be more robust than simply a comma, you'll need to post-process the string. The steps might be:
Read the stream into a string
Use string::find_first_of() to help "chunk" the words
Return the word as appropriate.
If I've misunderstood your question, please feel free to elaborate!
If you only want to ignore , then you can use getline.
const int MAX_LEN = 128;
ifstream file("data.txt");
char buffer[MAX_LEN];
while(file.getline(buffer,MAX_LEN,','))
{
cout<<buffer;
}
EDIT: This uses std::string and does away with MAX_LEN
ifstream file("data.txt");
string string_buffer;
while(getline(file,string_buffer,','))
{
cout<<string_buffer;
}
One way would be to use the Boost String Algorithms library. There are several "replace" functions that can be used to replace (or remove) specific characters or strings in strings.
You can also use the Boost Tokenizer library for splitting the string into words after you have removed the punctuation marks.