C++ Finding an incomplete string from a text file

I have a program that reads text files and parses information from them. What I am trying to accomplish is this:
A text file contains about 500 characters of data, and somewhere in that data is a user name, like so:
this_just_some_random_data_in_the_file_hdfhehhr2342t543t3y3y
_please_don't_mind_about_me(username: "sara123452")reldgfhfh
2134242gt3gfd2342353ggf43t436tygrghrhtyj7i6789679jhkjhkuklll
We only need to find sara123452 in that text file and write it to a string. The user name is of course not known in advance and does not have a fixed length.
Here is what I have managed to do so far:
std::string Profile = "http://something.com/all_users/data.txt";
std::string FileName = "profileInfo.txt";
std::string Buffer, ProfileName;
std::ifstream FileReader;
DeleteUrlCacheEntryA(Profile .c_str());
URLDownloadToFileA(0, Profile .c_str(), FileName.c_str(), 0, 0);
FileReader.open(FileName);
if (FileReader.is_open())
{
std::ostringstream FileBuffer;
FileBuffer << FileReader.rdbuf();
Buffer= FileBuffer.str();
if (Buffer.find("(username: ") != std::string::npos) {
cout << "dont know how to continue" << endl;
}
FileReader.close();
DeleteFileA(FileName.c_str());
}
else {
}
cin.get();
So how can I get the user name string and assign/copy it to ProfileName string?

I believe what you're looking for is something like the code below -- possibly with minor tweaks to account for the username being quoted. The key here is to remember that your Buffer variable is a std::string and you can use substring once you have a definite start and end.
std::size_t userNameStartIndex, userNameEndIndex;
...
userNameStartIndex = Buffer.find("(username: ");
if (userNameStartIndex != std::string::npos) {
    userNameStartIndex += 11; // skip past "(username: " (11 characters)
    userNameEndIndex = Buffer.find(")", userNameStartIndex);
    if (userNameEndIndex != std::string::npos)
    {
        ProfileName = Buffer.substr(userNameStartIndex, userNameEndIndex - userNameStartIndex);
    }
}
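Putting the pieces together, here is a minimal sketch that also strips the surrounding quotes. The helper name extract_username and the marker string are assumptions for illustration; the marker has to match the text in the downloaded file exactly.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the question): returns the text between
// `(username: "` and the next `"`, or an empty string if either is missing.
std::string extract_username(const std::string& buffer)
{
    const std::string marker = "(username: \"";
    std::size_t start = buffer.find(marker);
    if (start == std::string::npos)
        return "";                           // marker not present
    start += marker.size();                  // first character of the name
    const std::size_t end = buffer.find('"', start);
    if (end == std::string::npos)
        return "";                           // closing quote not present
    return buffer.substr(start, end - start);
}
```

With the variables from the question, the call would be ProfileName = extract_username(Buffer);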

There are many other ways to do it, but this one would be less painful I guess.
#include <regex>
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
using namespace std;
struct Profile
{ // ...
string name;
};
int main(int argc, const char * argv[])
{
std::cout.sync_with_stdio(false); // optional
// read from file
string filename {"data1.txt"};
ifstream in {filename};
vector<Profile> profiles;
// tweak this pattern in case you're not happy with it
regex pat {R"(\(username:\s*\"(.*?)\"\))"};
for (string line; getline(in,line); ) {
Profile p;
sregex_iterator first(cbegin(line), cend(line), pat);
const sregex_iterator last;
while (first != last) {
// Dereferencing first gives you an smatch object.
// [1] gives you the matched sub-string:
p.name = (*first)[1]; // (*first)[1] = sara123452
profiles.push_back(p);
++first;
}
}
// Test
for (const auto& p : profiles)
cout << p.name << '\n';
}

Related

How can i split adjacent numbers and letters in c++?

I've got a large text document that contains adjacent numbers and letters,
like this:
JACK1940383DAVID30284HAROLD68372TROY4392 etc.
How can I split it as shown below in C++?
List: Jack / 1940383 , David/30284, ...

You can use std::string::find_first_of() and std::string::find_first_not_of() in a loop, using std::string::substr() to extract each piece, eg:
std::string s = "JACK1940383DAVID30284HAROLD68372TROY4392";
std::string::size_type start = 0, end;
while ((end = s.find_first_of("0123456789", start)) != std::string::npos) {
std::string name = s.substr(start, end-start);
start = end;
int number;
if ((end = s.find_first_not_of("0123456789", start)) != std::string::npos) {
number = std::stoi(s.substr(start, end-start));
}
else {
number = std::stoi(s.substr(start));
}
start = end;
// use name and number as needed...
}

You can use regex like this:
#include <iostream>
#include <string>
#include <regex>
#include <vector>
// create a struct to group your data
// this makes it easy to store it in a vector.
struct person_t
{
std::string name;
std::string number;
};
// overloaded output operator for printing one person's details
std::ostream& operator<<(std::ostream& os, const person_t& person)
{
os << person.name << ": " << person.number << std::endl;
return os;
}
// get a vector of person_t based on the input
auto get_persons(const std::string& input)
{
// make a regex in this case a regex that will match one or more capital letters
// and groups them using the ()
// then match one or more digits and group them too.
static const std::regex rx{ "([A-Z]+)([0-9]+)" };
std::smatch match;
// a vector to hold all the persons
std::vector<person_t> persons;
// start at begin of string and look for first part of the string
// that matches the regex.
auto cbegin = input.cbegin();
while (std::regex_search(cbegin, input.cend(), match, rx))
{
// match[0] will contain the whole match,
// match[1]-match[n] will contain the groups from the regular expressions
// match[1] will contain the match with characters and thus the name
// match[2] will contain the match with the numbers and thus the number.
// create a person_t struct with this info
person_t person{ match[1], match[2] };
// and add it to the vector
persons.push_back(person);
cbegin = match.suffix().first;
}
return persons;
}
int main()
{
// parse and split the string
auto persons = get_persons("JACK1940383DAVID30284HAROLD68372TROY4392");
// show the output
for (const auto& person : persons)
{
std::cout << person;
}
}

As pointed out in the other good answers, you can use
find_first_of(), find_first_not_of() and substr() from std::string in a loop, or
regex.
But that may be too much. I will add 3 more examples that you may find
simpler.
The first 2 programs expect the file name on the command line for (my) convenience here, and the test file is in.txt. Its contents are the same as posted:
JACK1940383DAVID30284HAROLD68372TROY4392
The last example just parses the string data declared as a char[]
1. Using fscanf()
Since the target is to consume formatted data, fscanf() is an option. As the data structure is very simple, the program is just a one line loop:
char mask[] = "%49[^0-9]%49[0-9]"; // width 49 leaves room for the '\0' in a 50-byte buffer
while ( 2 == fscanf(F, mask, tk_key, tk_value))
std::cout << tk_key << "/" << tk_value << "\n";
program output
output is the same for all examples
JACK/1940383
DAVID/30284
HAROLD/68372
TROY/4392
code for ex. 1
#include <cstdio>
#include <iostream>
int main(int argc,char** argv)
{
if (argc < 2)
{ std::cerr << "Use: pgm FileName\n";
return -1;
}
FILE* F = fopen(argv[1], "r");
if (F == NULL)
{
perror("Could not open file");
return -1;
}
std::cerr << "File: \"" << argv[1] << "\"\n";
char tk_key[50], tk_value[50];
char mask[] = "%49[^0-9]%49[0-9]"; // the field width must leave room for the '\0'
while ( 2 == fscanf(F, mask, tk_key, tk_value))
std::cout << tk_key << "/" << tk_value << "\n";
fclose(F);
return 0;
}
2. Using a state machine
There are just 2 states so it is not a fancy FSA ;) State machines are good for representing this kind of stuff, albeit here this seems to be overkill.
#define S_LETTER 0
#define S_DIGIT 1
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
using iich = std::istream_iterator<char>;
int main(int argc,char** argv)
{
if (argc < 2) return -1; // a file name is required
std::ifstream in_file{argv[1]};
if ( not in_file.good()) return -1;
iich p {in_file}, eofile{};
std::string token{}; // string to build values
char st = S_LETTER; // state value for FSA
std::for_each(p, eofile,
[&token,&st](char ch)
{
char temp = 0;
switch (st)
{
case S_LETTER:
if ((ch >= '0') && (ch <= '9'))
{
std::cout << token << "/";
token = ch;
st = S_DIGIT; // now in number
}
else token += ch; // concat in string
break;
case S_DIGIT:
default:
if ((ch < '0') || (ch > '9'))
{ // is a letter
std::cout << token << "\n";
token = ch;
st = S_LETTER; // now in name
}
else token += ch; // concat in string
break;
}; // switch()
});
std::cout << token << "\n"; // print last token
}
Here we have no explicit loop: for_each gets the data from an iterator and passes each character to a function that builds the name and the value as strings and prints them.
Output is the same
3. a simple FSA to consume the data
#define S_LETTER 0
#define S_DIGIT 1
#include <iostream>
int main(void)
{
char one[] = "JACK1940383DAVID30284HAROLD68372TROY4392";
char* p = one; // the array decays to a pointer to its first element
char* token = p;
char st = S_LETTER;
char temp = 0;
while (*p != 0)
{
switch (st)
{
case S_LETTER:
if ((*p >= '0') && (*p <= '9'))
{
temp = *p;
*p = 0;
std::cout << token << "/";
*p = temp;
token = p;
st = S_DIGIT; // now in number
}
break;
case S_DIGIT:
default:
if ( (*p < '0') || (*p > '9'))
{ // letter
temp = *p;
*p = 0;
std::cout << token << "\n";
*p = temp;
token = p;
st = S_LETTER; // now in name
}
break;
}; // switch()
p += 1; // next symbol
}; // while()
std::cout << token << "\n"; // print last token
}
This code just uses a C-style loop to parse the input data

How to remove stopwords from a vector of sentences?

I am working on some code that requires stopwords to be removed from sentences. My current solution does not work.
I have a vector of two test sentences:
std::vector<std::string> sentences = {"this is a test", "another a test"};
I have an unordered set of strings containing stopwords:
std::unordered_set<std::string> stopwords;
Now I tried to loop over the sentences in the vector, check and compare each word with the stopwords, and if it is a stopword it should get removed.
sentences.erase(std::remove_if(sentences.begin(), sentences.end(),
[](const std::string &s){return stopwords.find(s) != stopwords.end();}),
sentences.end());
The idea is that my vector -after removing the stopwords- contains the sentences without the stopwords, but for now, I get the exact same sentences back. Any idea why?
My unordered set is filled with the following function:
void load() {
std::ifstream file;
file.open ("stopwords.txt");
if(!file.is_open()) {return;}
std::string stopword;
while (file >> stopword) {
stopwords.insert(stopword);
}
}

Your current code cannot work, since you are not deleting words from each individual string. Your erase/remove_if call takes an entire string and tries to match the word in the set with the entire string.
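To make the point concrete, here is a small sketch that mirrors the call from the question. The function name erase_matching is made up for illustration. It shows that the lookup is done on whole vector elements, so a sentence is only removed if the entire sentence is itself a stopword.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Mirrors the erase/remove_if call from the question: each element of
// `sentences` is looked up in `stopwords` as one whole string.
std::vector<std::string> erase_matching(std::vector<std::string> sentences,
                                        const std::unordered_set<std::string>& stopwords)
{
    sentences.erase(std::remove_if(sentences.begin(), sentences.end(),
                        [&stopwords](const std::string& s) {
                            return stopwords.find(s) != stopwords.end();
                        }),
                    sentences.end());
    return sentences;
}
```

Called with the sentences "this is a test", "another a test" and "a" and the stopwords "a" and "is", only the third element is dropped; the first two come back unchanged because neither full sentence is in the set.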
First, you should write a simple function that, given a std::string and a set of words to delete, returns the string with those words removed.
Here is a small function using std::istringstream that can do this:
#include <unordered_set>
#include <sstream>
#include <string>
#include <iostream>
std::string remove_stop_words(const std::string& src, const std::unordered_set<std::string>& stops)
{
std::string retval;
std::istringstream strm(src);
std::string word;
while (strm >> word)
{
if ( !stops.count(word) )
retval += word + " ";
}
if ( !retval.empty())
retval.pop_back();
return retval;
}
int main()
{
std::string test = "this is a test";
std::unordered_set<std::string> stops = {"is", "test"};
std::cout << "Changed word:\n" << remove_stop_words(test, stops) << "\n";
}
Output:
Changed word:
this a
So once you have this working correctly, the std::vector version is nothing more than looping through each item in the vector and calling the remove_stop_words function:
int main()
{
std::vector<std::string> test = {"this is a test", "another a test"};
std::unordered_set<std::string> stops = {"is", "test"};
for (size_t i = 0; i < test.size(); ++i)
test[i] = remove_stop_words(test[i], stops);
std::cout << "Changed words:\n";
for ( auto& s : test )
std::cout << s << "\n";
}
Output:
Changed words:
this a
another a
Note that you can utilize the std::transform function to remove the hand-rolled loop in the above example:
#include <algorithm>
//...
int main()
{
std::vector<std::string> test = {"this is a test", "another a test"};
std::unordered_set<std::string> stops = {"is", "test"};
// Use std::transform
std::transform(test.begin(), test.end(), test.begin(),
[&](const std::string& s){return remove_stop_words(s, stops);});
std::cout << "Changed words:\n";
for ( auto& s : test )
std::cout << s << "\n";
}

Difficulties with string declaration/reference parameters (c++)

Last week I got a homework assignment to write a function: the function gets a string and a char value and should divide the string into two parts, before and after the first occurrence of that char.
The code worked, but my teacher told me to do it again because it is not well-written code. I don't understand how to make it better. I understand so far that defining two strings filled with white space is not good, but I get out-of-bounds exceptions otherwise. Since the input string changes, the string size changes every time.
#include <iostream>
#include <string>
using namespace std;
void divide(char search, string text, string& first_part, string& sec_part)
{
bool firstc = true;
int counter = 0;
for (int i = 0; i < text.size(); i++) {
if (text.at(i) != search && firstc) {
first_part.at(i) = text.at(i);
}
else if (text.at(i) == search&& firstc == true) {
firstc = false;
sec_part.at(counter) = text.at(i);
}
else {
sec_part.at(counter) = text.at(i);
counter++;
}
}
}
int main() {
string text;
string part1=" ";
string part2=" ";
char search_char;
cout << "Please enter text? ";
getline(cin, text);
cout << "Please enter a char: ? ";
cin >> search_char;
divide(search_char,text,part1,part2);
cout << "First string: " << part1 <<endl;
cout << "Second string: " << part2 << endl;
system("PAUSE");
return 0;
}

I would suggest that you learn to use the C++ standard library functions. There are plenty of utility functions that can help you in programming.
#include <algorithm>
#include <string>
void divide(const std::string& text, char search, std::string& first_part, std::string& sec_part)
{
std::string::const_iterator pos = std::find(text.begin(), text.end(), search);
first_part.append(text, 0, pos - text.begin());
sec_part.append(text, pos - text.begin());
}
int main()
{
std::string text = "thisisfirst";
char search = 'f';
std::string first;
std::string second;
divide(text, search, first, second);
}
Here I used std::find together with iterators.
You have some other mistakes. You are passing text by value, which makes a copy every time you call the function. Pass it by reference, and qualify it with const to indicate that it is an input parameter, not an output.

Why is your teacher right?
The fact that you need to initialize your destination strings with blank space is terrible:
If the input string is longer, you'll get out-of-bounds errors.
If it's shorter, you get a wrong answer, because in IT and programming, "It works " is not the same as "It works".
In addition, your code does not fit the specification. It should work every time, independently of the value currently stored in your output strings.
Alternative 1: your code but working
Just clear the destination strings at the beginning. Then iterate as you did, but use += or push_back() to add chars at the end of the string.
void divide(char search, string text, string& first_part, string& sec_part)
{
bool firstc = true;
first_part.clear(); // make destinations strings empty
sec_part.clear();
for (int i = 0; i < text.size(); i++) {
char c = text.at(i);
if (firstc && c != search) {
first_part += c;
}
else if (firstc && c == search) {
firstc = false;
sec_part += c;
}
else {
sec_part += c;
}
}
}
I used a temporary c instead of text.at(i) or text[i] in order to avoid multiple indexing. This is not really required, though: nowadays, optimizing compilers should produce equivalent code whichever variant you use here.
Alternative 2: use string member functions
This alternative uses the find() function, and then constructs a string from the start until that position, and another from that position. There is a special case when the character was not found.
void divide(char search, string text, string& first_part, string& sec_part)
{
auto pos = text.find(search);
first_part = string(text, 0, pos);
if (pos== string::npos)
sec_part.clear();
else sec_part = string(text, pos, string::npos);
}

As you already realize yourself, these declarations
string part1=" ";
string part2=" ";
do not make sense, because the string entered into the object text can be much longer than both initialized strings. In that case the string method at can throw an exception, or else the strings will have trailing spaces.
From the description of the assignment it is not clear whether the searched character should be included in one of the strings. You assume that the character should be included in the second string.
Take into account that the parameter text should be declared as a constant reference.
Also, instead of using loops it is better to use methods of the class std::string, for example find.
The function can look the following way:
#include <iostream>
#include <string>
void divide(const std::string &text, char search, std::string &first_part, std::string &sec_part)
{
std::string::size_type pos = text.find(search);
first_part = text.substr(0, pos);
if (pos == std::string::npos)
{
sec_part.clear();
}
else
{
sec_part = text.substr(pos);
}
}
int main()
{
std::string text("Hello World");
std::string first_part;
std::string sec_part;
divide(text, ' ', first_part, sec_part);
std::cout << "\"" << text << "\"\n";
std::cout << "\"" << first_part << "\"\n";
std::cout << "\"" << sec_part << "\"\n";
}
The program output is
"Hello World"
"Hello"
" World"
As you can see, the separating character is included in the second string, though I think it might be better to exclude it from both strings.
An alternative and, in my opinion, clearer approach can look the following way:
#include <iostream>
#include <string>
#include <utility>
std::pair<std::string, std::string> divide(const std::string &s, char c)
{
std::string::size_type pos = s.find(c);
return { s.substr(0, pos), pos == std::string::npos ? "" : s.substr(pos) };
}
int main()
{
std::string text("Hello World");
auto p = divide(text, ' ');
std::cout << "\"" << text << "\"\n";
std::cout << "\"" << p.first << "\"\n";
std::cout << "\"" << p.second << "\"\n";
}
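If the separator should be left out of both halves, as this answer suggests might be preferable, the pair-returning version only needs a small change. This is a sketch; the name divide_excluding is an assumption, and sending the whole text to the first part when the character is absent is one possible convention, not something the assignment specifies.

```cpp
#include <string>
#include <utility>

// Hypothetical variant: splits at the first occurrence of c and drops c
// itself from both parts.
std::pair<std::string, std::string> divide_excluding(const std::string& s, char c)
{
    const std::string::size_type pos = s.find(c);
    if (pos == std::string::npos)
        return { s, std::string() };              // separator absent: everything goes to the first part
    return { s.substr(0, pos), s.substr(pos + 1) }; // pos + 1 skips the separator
}
```

For "Hello World" and ' ' this yields "Hello" and "World".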

Your code will only work as long as the character is found within part1.length(). You need something similar to this:
void string_split_once(const char s, const string & text, string & first, string & second) {
first.clear();
second.clear();
std::size_t pos = text.find(s);
if (pos != string::npos) {
first = text.substr(0, pos);
second = text.substr(pos);
}
}

The biggest problem I see is that you are using at where you should be using push_back. See std::basic_string::push_back. at is designed to access an existing character to read or modify it. push_back appends a new character to the string.
divide could look like this :
void divide(char search, string text, string& first_part,
string& sec_part)
{
bool firstc = true;
for (int i = 0; i < text.size(); i++) {
if (text.at(i) != search && firstc) {
first_part.push_back(text.at(i));
}
else if (text.at(i) == search&& firstc == true) {
firstc = false;
sec_part.push_back(text.at(i));
}
else {
sec_part.push_back(text.at(i));
}
}
}
Since you aren't handling exceptions, consider using text[i] rather than text.at(i).
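The difference between at and push_back can be seen in isolation with a small sketch. Both function names are made up for illustration.

```cpp
#include <stdexcept>
#include <string>

// Returns true if writing through at(0) on an empty string throws,
// which is what happens when at() is used to "append".
bool at_throws_on_empty()
{
    std::string s;
    try {
        s.at(0) = 'x';          // no character exists at index 0 yet
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}

// Builds a copy of text one character at a time with push_back.
std::string copy_with_push_back(const std::string& text)
{
    std::string out;            // starts empty, no padding needed
    for (char c : text)
        out.push_back(c);       // grows the string by one character
    return out;
}
```

This is why the destination strings in the question had to be pre-filled with spaces: at can only touch characters that already exist, while push_back creates them.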

String.erase giving out_of_range exception

I was meant to write some program which will read text from text file and erase given words.
Unfortunately, something's wrong with this particular part of code, I get the following exception notification:
This text is just a sample, based on other textterminate called after throwing
an instance of 'std::out_of_range' what<>: Basic_string_erase
I guess that there is something wrong with the way I use erase. I'm trying to use a do-while loop: each time through the loop I determine the beginning of the word that is meant to be erased, and then erase the text starting at that position, using the word's length.
#include <iostream>
#include <string>
using namespace std;
void eraseString(string &str1, string &str2) // str1 - text, str2 - phrase
{
size_t positionOfPhrase = str1.find(str2);
if(positionOfPhrase == string::npos)
{
cout <<"Phrase hasn't been found... at all"<< endl;
}
else
{
do{
positionOfPhrase = str1.find(str2, positionOfPhrase + str2.size());
str1.erase(positionOfPhrase, str2.size());//**IT's PROBABLY THE SOURCE OF PROBLEM**
}while(positionOfPhrase != string::npos);
}
}
int main(void)
{
string str("This text is just a sample text, based on other text");
string str0("text");
cout << str;
eraseString(str, str0);
cout << str;
}

Your function is wrong. It is unclear why you call the find method twice in a row.
Try the following code.
#include <iostream>
#include <string>
std::string & eraseString( std::string &s1, const std::string &s2 )
{
std::string::size_type pos = 0;
while ( ( pos = s1.find( s2, pos ) ) != std::string::npos )
{
s1.erase( pos, s2.size() );
}
return s1;
}
int main()
{
std::string s1( "This text is just a sample text, based on other text" );
std::string s2( "text" );
std::cout << s1 << std::endl;
std::cout << eraseString( s1, s2 ) << std::endl;
return 0;
}
The program output is
This text is just a sample text, based on other text
This is just a sample , based on other

I think your trouble is that positionOfPhrase inside the do loop can be string::npos, in which case erase will throw an exception. Since positionOfPhrase already holds the first match when the loop is reached, erase first and search again afterwards, so that the loop ends as soon as find fails:
while (positionOfPhrase != string::npos) {
str1.erase(positionOfPhrase, str2.size());
positionOfPhrase = str1.find(str2, positionOfPhrase);
}
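A sketch of a complete function along these lines, with the npos check as the loop condition. The name erase_all is hypothetical, and the guard against an empty phrase is an addition: find would report a match of length zero at every position, so the loop would never end.

```cpp
#include <string>

// Hypothetical helper: returns a copy of text with every occurrence of
// phrase removed.
std::string erase_all(std::string text, const std::string& phrase)
{
    if (phrase.empty())
        return text;                         // nothing sensible to erase
    std::string::size_type pos = text.find(phrase);
    while (pos != std::string::npos) {
        text.erase(pos, phrase.size());      // pos is a valid index here
        pos = text.find(phrase, pos);        // continue from the same spot
    }
    return text;
}
```

Searching again from pos rather than pos + phrase.size() matters because the erased characters are gone and the text after them has moved up to pos.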

Easy way to remove extension from a filename?

I am trying to grab the raw filename without the extension from the filename passed in arguments:
int main ( int argc, char *argv[] )
{
// Check to make sure there is a single argument
if ( argc != 2 )
{
cout<<"usage: "<< argv[0] <<" <filename>\n";
return 1;
}
// Remove the extension if it was supplied from argv[1] -- pseudocode
char* filename = removeExtension(argv[1]);
cout << filename;
}
The filename should for example be "test" when I passed in "test.dat".

size_t lastindex = fullname.find_last_of(".");
string rawname = fullname.substr(0, lastindex);
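Beware of the case when there is no ".": find_last_of then returns npos. With substr(0, npos) that just yields the whole name, but any other use of lastindex needs an explicit check.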

This works:
std::string remove_extension(const std::string& filename) {
size_t lastdot = filename.find_last_of(".");
if (lastdot == std::string::npos) return filename;
return filename.substr(0, lastdot);
}

Since C++17 you can use std::filesystem::path::replace_extension with a parameter to replace the extension or without to remove it:
#include <iostream>
#include <filesystem>
int main()
{
std::filesystem::path p = "/foo/bar.txt";
std::cout << "Was: " << p << std::endl;
std::cout << "Now: " << p.replace_extension() << std::endl;
}
Compile it with:
g++ -std=c++17 -O2 -Wall -pedantic -pthread main.cpp && ./a.out
Running the resulting binary leaves you with:
Was: "/foo/bar.txt"
Now: "/foo/bar"
However this does only remove the last file extension:
Was: "/foo/bar.tar.gz"
Now: "/foo/bar.tar"
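If every extension should go, one option is to repeat replace_extension until nothing is left to strip. This is a sketch; remove_all_extensions is a hypothetical name, and it assumes that everything after the first dot in the file name counts as an extension, so a name such as report.v2.txt would end up as report.

```cpp
#include <filesystem>

// Hypothetical helper: strips extensions one at a time until none is left.
std::filesystem::path remove_all_extensions(std::filesystem::path p)
{
    while (p.has_extension())
        p.replace_extension();   // removes only the last extension
    return p;
}
```

Only the final path component is affected, so a dot in a directory name is left alone.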

In my opinion this is the easiest and most readable solution:
#include <boost/filesystem/convenience.hpp>
std::string removeFileExtension(const std::string& fileName)
{
return boost::filesystem::change_extension(fileName, "").string();
}

For those who like boost:
Use boost::filesystem::path::stem. It returns the filename without the last extension. So ./myFiles/foo.bar.foobar becomes foo.bar. So when you know you are dealing with only one extension you could do the following:
boost::filesystem::path path("./myFiles/fileWithOneExt.myExt");
std::string fileNameWithoutExtension = path.stem().string();
When you have to deal with multiple extensions you might do the following:
boost::filesystem::path path("./myFiles/fileWithMultiExt.myExt.my2ndExt.my3rdExt");
while(!path.extension().empty())
{
path = path.stem();
}
std::string fileNameWithoutExtensions = path.stem().string();
(taken from here: http://www.boost.org/doc/libs/1_53_0/libs/filesystem/doc/reference.html#path-decomposition found in the stem section)
BTW works with rooted paths, too.

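The following works for a std::string, as long as it contains a dot (if it does not, find_last_of returns npos and erase throws std::out_of_range):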
string s = filename;
s.erase(s.find_last_of("."), string::npos);

More complex, but it handles special cases (for example "foo.bar/baz" and "c:foo.bar") and works for Windows paths too:
std::string remove_extension(const std::string& path) {
if (path == "." || path == "..")
return path;
size_t pos = path.find_last_of("\\/.");
if (pos != std::string::npos && path[pos] == '.')
return path.substr(0, pos);
return path;
}

You can do this easily :
string fileName = argv[1];
string fileNameWithoutExtension = fileName.substr(0, fileName.rfind("."));
Note that if there is no dot, rfind returns npos and substr then returns the whole name unchanged. You should still test for the dot if you use the position for anything else, but you get the idea.

In case someone just wants a simple solution for Windows:
Use PathCchRemoveExtension (see MSDN)
... or PathRemoveExtension (deprecated!)

Try the following trick to extract the file name without its extension from a path in C++, with no external libraries:
#include <iostream>
#include <string>
using std::string;
string getFileName(const string& s) {
char sep = '/';
#ifdef _WIN32
sep = '\\';
#endif
size_t i = s.rfind(sep, s.length());
if (i != string::npos)
{
string filename = s.substr(i+1, s.length() - i);
size_t lastindex = filename.find_last_of(".");
string rawname = filename.substr(0, lastindex);
return(rawname);
}
return("");
}
int main(int argc, char** argv) {
string path = "/home/aymen/hello_world.cpp";
string ss = getFileName(path);
std::cout << "The file name is \"" << ss << "\"\n";
}

Just loop through the string and replace the first (or last) occurrence of a '.' with a null terminator. That will end the string at that point.
Or make a copy of the string up to the '.', but only if you want to return a new copy. That can get messy, since a dynamically allocated string can be a source of memory leaks.
/* extension points to the full file name; needs <cstring> and <cstdlib> */
int len, i;
for (len = (int)strlen(extension); len >= 0 && extension[len] != '.'; len--)
    ;
if (len < 0)
    len = (int)strlen(extension); /* no '.' found: copy the whole name */
char *str = (char *)malloc(len + 1);
for (i = 0; i < len; i++)
    str[i] = extension[i];
str[i] = '\0'; /* the caller must free(str) */
