How to keep track of distinct chars and words? - c++

Trying to write a function that analyzes an input file and outputs info such as distinct characters, the average length of each word, and the number of total words. I'm having trouble figuring out how to keep track of the distinct characters in a string. As an example the line:
To be or not TO BE, THAT IS the question.
Should return 10 total words, 12 distinct characters, and 3.2 average word length.
This is the code I have so far:
void fileInfo(const string& fileName)
{
ifstream in(fileName);
if (in.fail())
{
cout << "Error, bad input file.";
}
string line = "";
int wordTotal = 0;
while (getline(in, line))
{
istringstream ss(line);
string word = "";
while (ss >> word)
{
wordTotal++;
for (size_t i = 0, len = word.size(); i < len; i++)
{
if (word.at(i))
}
}
{
}

One solution is to use a std::unordered_set<char> to store the letters of each word. Since an unordered_set does not store duplicates, you end up with a set of distinct letters.
Also, you only want to count alphabetic characters, not punctuation or digits, so you need to filter each character to ensure it is alphabetic before placing it in the set.
void fileInfo(const string& fileName)
{
std::unordered_set<char> cSet;
//...
while (ss >> word)
{
wordTotal++;
for (auto v : word)
{
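// cast to unsigned char first: negative char values are undefined for <cctype> functions
if (std::isalpha(static_cast<unsigned char>(v)))
cSet.insert(static_cast<char>(std::tolower(static_cast<unsigned char>(v))));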
}
}
//...
}
A character is inserted into the set only if it is alphabetic. Also note that the character inserted is the lower-case version, so 'T' and 't' count as one letter.

Related

How to count how many words are in a line? Smarter way?

How do I find out how many words are in a line? I know the method where you count the spaces. But what if someone hits two spaces in a row, or starts the line with a space?
Is there any other or smarter way to solve this?
And is there any remark on my way of solving it or my code?
I solved it like this:
#include <iostream>
#include <cctype>
#include <cstring>
using namespace std;
int main( )
{
char str[80];
cout << "Enter a string: ";
cin.getline(str,80);
int len;
len=strlen(str);
int words = 0;
for(int i = 0; str[i] != '\0'; i++) //is space after character
{
if (isalpha(str[i]))
{
if(isspace(str[i+1]))
words++;
}
}
if(isalpha(str[len]))
{
words++;
}
cout << "The number of words = " << words+1 << endl;
return 0;
}
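With the standard library, the count takes two lines (istream_iterator needs a named stream, not a temporary):
istringstream iss(str);
words = distance(istream_iterator<string>(iss), istream_iterator<string>());
Streams skip whitespace by default, including runs of several spaces.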
So if you do something like:
string word;
int numWords = 0;
while (cin >> word) ++numWords;
That should count the number of words for simple cases (not considering what the format of a word is, skipping spaces).
If you want a per-line count, you could first read the line, create a stream from the string, and do the same thing:
string line, word;
int wordCount = 0;
getline(cin, line);
stringstream lineStream(line);
while (lineStream >> word) ++wordCount;
Prefer the free function std::getline over cin.getline. It reads into a std::string that grows as needed, so there is no fixed-size buffer to size correctly and no truncated input.
First, you need a very specific definition of "word." Most of the answers will give slightly different counts than your attempt because you're using different definitions of what constitutes a word. Your example specifically requires alpha characters in certain positions. The answers based on streams will allow any non-space character to be part of a word.
The general solution is to come up with a precise definition of a word, transform this into a regular expression or finite state machine, and then count each instance of a match.
Here's a sample state machine solution:
std::size_t CountWords(const std::string &line) {
std::size_t count = 0;
enum { between_words, in_word } state = between_words;
for (const auto c : line) {
switch (state) {
case between_words:
if (std::isalpha(c)) {
state = in_word;
++count;
}
break;
case in_word:
if (std::isspace(c)) state = between_words;
break;
}
}
return count;
}
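For comparison with the state machine above, here is a sketch of the stream-based definition, where any run of non-space characters is a word. CountTokens is a hypothetical name chosen for this illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>

// Counts whitespace-delimited tokens: any run of non-space characters is a "word".
std::size_t CountTokens(const std::string& line)
{
    std::istringstream iss(line);  // must be a named object, not a temporary
    return static_cast<std::size_t>(std::distance(
        std::istream_iterator<std::string>(iss),
        std::istream_iterator<std::string>()));
}
```

The two definitions disagree on input such as "the answer is 42": CountTokens reports 4, while CountWords reports 3 because a token has to start with a letter to be counted.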
Some test cases to consider (and that highlight the differences among the definitions of a word):
"" empty string
" " just spaces
"a"
" one "
"count two"
"hyphenated-word"
"\"That's Crazy!\" she said." punctuation between alpha characters and adjacent spaces
"the answer is 42" should the number count as a word?

How to take a specified number of lines as input?

cin>>string takes input until space or new line. But getline(cin,string) takes input until new line. Again, getline(cin,string,'c') takes input until 'c'. Is there any way to ignore a few '\n' character and take a specified number of lines as input?
I tried the code below but it didn't work
int main()
{
string a;
for(int i=0;i<4;i++)
{
getline(cin,a);//take string input
}
cout<<a;
}
For the following input:
ksafj kfaskjf(\n)1st
uuiiuo akjfksad(\n)2nd
ksafj kasfj(\n)3rd
asdfed kkkl(\n) when the 4th enter comes input terminate
string a only holds "asdfed kkkl". I want it to hold all the characters, including the end-of-lines (\n).
Do you want to get the first n lines?
std::string get_n_lines(std::istream& input, const std::size_t n)
{
std::ostringstream result;
std::string line;
std::size_t i = 0;
// test the count first so no extra line is consumed once n lines have been read
while (i < n && std::getline(input, line))
{
result << line << '\n';
++i;
}
return result.str();
}
std::string first_4_lines = get_n_lines(std::cin, 4);

Reading in only letters from a text file

I am trying to read in a poem from a text file that contains commas, spaces, periods, and newline characters. I am trying to use getline to read in each separate word. I do not want to read in any of the commas, spaces, periods, or newline characters. As I read in each word I am capitalizing each letter, then calling my insert function to insert each word into a binary search tree as a separate node. I do not know the best way to separate each word. I have been able to separate each word by spaces, but the commas, periods, and newline characters keep being read in.
Here is my text file:
Roses are red,
Violets are blue,
Data Structures is the best,
You and I both know it is true.
The code I am using is this:
string inputFile;
cout << "What is the name of the text file?";
cin >> inputFile;
ifstream fin;
fin.open(inputFile);
//Input once
string input;
getline(fin, input, ' ');
for (int i = 0; i < input.length(); i++)
{
input[i] = toupper(input[i]);
}
//check for duplicates
if (tree.Find(input, tree.Current, tree.Parent) == true)
{
tree.Insert(input);
countNodes++;
countHeight = tree.Height(tree.Root);
}
Basically I am using the getline(fin,input, ' ') to read in my input.
I was able to figure out a solution. I read an entire line of text into the variable line, then walked through it character by character, kept only the letters, and stored them in word. Then I called my insert function to insert the node into my tree.
const int MAXWORDSIZE = 50;
const int MAXLINESIZE = 1000;
char word[MAXWORDSIZE], line[MAXLINESIZE];
int lineIdx, wordIdx, lineLength;
//get a line
fin.getline(line, MAXLINESIZE - 1);
lineLength = strlen(line);
while (fin)
{
for (int lineIdx = 0; lineIdx < lineLength;)
{
//skip over non-alphas, and check for end of line null terminator
while (!isalpha(line[lineIdx]) && line[lineIdx] != '\0')
++lineIdx;
//make sure not at the end of the line
if (line[lineIdx] != '\0')
{
//copy alphas to word c-string
wordIdx = 0;
while (isalpha(line[lineIdx]))
{
word[wordIdx] = toupper(line[lineIdx]);
wordIdx++;
lineIdx++;
}
//make it a c-string with the null terminator
word[wordIdx] = '\0';
//THIS IS WHERE YOU WOULD INSERT INTO THE BST OR INCREMENT FREQUENCY COUNTER IN THE NODE
if (tree.Find(word) == false)
{
tree.Insert(word);
totalNodes++;
//output word
//cout << word << endl;
}
else
{
tree.Counter();
}
}
}
//get the next line
fin.getline(line, MAXLINESIZE - 1);
lineLength = strlen(line);
}
This is a good time for a technique I've posted a few times before: define a ctype facet that treats everything but letters as white space (searching for imbue will show several examples).
From there, it's a matter of std::transform with istream_iterators on the input side, a std::set for the output, and a lambda to capitalize the first letter.
You can make a custom getline function for multiple delimiters:
std::istream &getline(std::istream &is, std::string &str, std::string const& delims)
{
str.clear();
// the 3rd parameter type and the condition part on the right side of &&
// should be all that differs from std::getline
for(char c; is.get(c) && delims.find(c) == std::string::npos; )
str.push_back(c);
return is;
}
And use it:
getline(fin, input, " \n,.");
You can use std::regex to select your tokens.
Depending on the size of your file, you can read it either line by line or entirely into a std::string.
To read the whole file you can use:
std::ifstream t("file.txt");
std::string sin((std::istreambuf_iterator<char>(t)),
std::istreambuf_iterator<char>());
and this will match each run of characters that contains no comma or whitespace:
std::regex word_regex("[^,\\s]+");
auto what =
std::sregex_iterator(sin.begin(), sin.end(), word_regex);
auto wend = std::sregex_iterator();
std::vector<std::string> v;
for (; what != wend; ++what) {
std::smatch match = *what;
v.push_back(match.str());
}
I think that to separate tokens delimited by a comma, a space, or a newline you should use this regex: (,| \n| )[[:alpha:]].+ . I have not tested it, though, so you will need to check it.
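If the goal is simply "letters only", one option is to match runs of letters directly instead of describing the delimiters. This is a sketch under that assumption; upper_case_words is a hypothetical helper name.

```cpp
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Returns each run of letters in `text`, upper-cased, in order of appearance.
std::vector<std::string> upper_case_words(const std::string& text)
{
    static const std::regex word_regex("[[:alpha:]]+");
    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), word_regex), end;
         it != end; ++it)
    {
        std::string w = it->str();
        for (char& c : w)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        words.push_back(w);
    }
    return words;
}
```

With this pattern the commas, periods, spaces, and newlines never appear in a match, so nothing has to be stripped afterwards.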

Using erase() in a while loop and segfault C++

Okay, so I'm having a bit of a problem here. The thing is this code works on a friend's computer but I'm getting segmentation faults when I try to run it.
I am reading a file looking like so:
word 2 wor ord
anotherword 7 ano oth the her erw wor ord
...
And I want to parse every word of the file. The first two words (e.g. word and 2) are to be erased but saving the first one in another variable in the process.
I've looked around a bit on accomplishing this, and I've come up with this half-assed piece of code that seems to work on my friends' computer but not mine.
Dictionary::Dictionary() {
ifstream ip;
ip.open("words.txt", ifstream::in);
string input;
string buf;
vector<string> tokens; // Holds words
while(getline(ip, input)){
if(input != " ") {
stringstream ss(input);
while(ss >> buf) {
tokens.push_back(buf);
}
string werd = tokens.at(0);
tokens.erase(tokens.begin()); // Remove the word from the vector
tokens.erase(tokens.begin()); // Remove the number indicating trigrams
Word curr(werd, tokens);
words[werd.length()].push_back(curr); // Put the word at the vector with word length i.
tokens.clear();
}
}
ip.close();
}
What's the best way of parsing this kind of structure in a file and removing the first two elements but saving the others? As you can see, I'm making a Word object that contains a string and a vector for later use.
Regards
EDIT; It seems to add the first line fine, but on removal of the second element, it crashes with a segmentation fault error.
EDIT; words.txt contain this:
addict 4 add ddi dic ict
sinister 6 ini ist nis sin ste ter
test 2 est tes
cplusplus 7 cpl lus lus plu plu spl usp
Without leading blank spaces or ending blanks. Not that it reads all the way anyway.
Word.cc:
#include <string>
#include <vector>
#include <algorithm>
#include "word.h"
using namespace std;
Word::Word(const string& w, const vector<string>& t) : word(w), trigrams(t) {}
string Word::get_word() const {
return word;
}
unsigned int Word::get_matches(const vector<string>& t) const {
vector<string> sharedTrigrams;
set_intersection(t.begin(),t.end(), trigrams.begin(), trigrams.end(), back_inserter(sharedTrigrams));
return sharedTrigrams.size();
}
First of all, there is an error in the number of closing }s in your posted code. If you indent them properly, you will see that your code is:
while(getline(ip, input))
{
if(input != " ")
{
stringstream ss(input);
while(ss >> buf) {
tokens.push_back(buf);
}
}
string werd = tokens.at(0);
tokens.erase(tokens.begin());
tokens.erase(tokens.begin());
Word curr(werd, tokens);
words[werd.length()].push_back(curr);
tokens.clear();
}
}
Assuming that is a small typo in posting, the other problem is that tokens is an empty list when input == " " yet you continue to use tokens as though it has 2 or more items in it.
You can fix that by moving everything inside the if statement.
while(getline(ip, input))
{
if(input != " ")
{
stringstream ss(input);
while(ss >> buf) {
tokens.push_back(buf);
}
string werd = tokens.at(0);
tokens.erase(tokens.begin());
tokens.erase(tokens.begin());
Word curr(werd, tokens);
words[werd.length()].push_back(curr);
tokens.clear();
}
}
I would add further checks to make it more robust.
while(getline(ip, input))
{
if(input != " ")
{
stringstream ss(input);
while(ss >> buf) {
tokens.push_back(buf);
}
string werd;
if ( !tokens.empty() )
{
werd = tokens.at(0);
tokens.erase(tokens.begin());
}
if ( !tokens.empty() )
{
tokens.erase(tokens.begin());
}
Word curr(werd, tokens);
words[werd.length()].push_back(curr);
tokens.clear();
}
}
You forgot to include the initialization of the variable "words" in your code. Just looking at it, I am guessing you are initializing "words" to be a fixed-length array of vectors, but then read a word that is off the end of the array. Bang, you're dead. Add a check to "werd.length()" to ensure it is strictly less than the length of "words".
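A sketch of that bounds check, assuming words really is a fixed-size array of vectors indexed by word length. MAX_LETTERS, WordTable, and store_word are hypothetical names (the original post does not show the declaration), and plain strings stand in for Word objects to keep the example self-contained.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: the table has one bucket per word length, 0 .. MAX_LETTERS - 1.
// MAX_LETTERS is a made-up limit for illustration.
const std::size_t MAX_LETTERS = 25;
using WordTable = std::array<std::vector<std::string>, MAX_LETTERS>;

// Stores `w` in the bucket for its length. Returns false instead of
// indexing past the end of the table when the word is too long.
bool store_word(WordTable& words, const std::string& w)
{
    if (w.length() >= words.size())
        return false;
    words[w.length()].push_back(w);
    return true;
}
```

The caller can then skip or report words that do not fit, which avoids writing outside the array.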
ifstream ip;
ip.open("words.txt", ifstream::in);
string input;
while(getline(ip, input)){
istringstream iss(input);
string str;
unsigned int count = 0;
if(iss >> str >> count) {
vector<string> tokens { istream_iterator<string>(iss), istream_iterator<string>() }; // Holds words
if(tokens.size() == count)
words[str.length()].emplace_back(str, tokens);
}
}
ip.close();
This is what I used to make it work.

How to input text containing less than 1000 words with spaces and punctuations?

I am able to input string using the following code:
string str;
getline(cin, str);
But I want to know how to put an upper limit on the number of words that can be given as input.
You cannot do what you are asking with just getline or even read. If you want to limit the number of words, you can use a simple for loop and the stream extraction operator (>>).
#include <vector>
#include <string>
int main()
{
std::string word;
std::vector<std::string> words;
for (size_t count = 0; count < 1000 && std::cin >> word; ++count)
words.push_back(word);
}
This will read up to 1000 words and stuff them into a vector.
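If the input still has to be read a line at a time with getline, the same cap can be applied while splitting each line. This is a sketch only; read_limited is a hypothetical helper name.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads lines from `in` and collects whitespace-separated words until
// `limit` words have been stored or the input ends.
std::vector<std::string> read_limited(std::istream& in, std::size_t limit)
{
    std::vector<std::string> words;
    std::string line, word;
    // Check the count before reading so no extra line is consumed.
    while (words.size() < limit && std::getline(in, line))
    {
        std::istringstream ss(line);
        while (words.size() < limit && ss >> word)
            words.push_back(word);
    }
    return words;
}
```

Calling read_limited(std::cin, 1000) keeps at most 1000 words. Punctuation stays attached to the word it touches, because the split is on whitespace only.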
getline() reads characters and has no notion of what a word is. The definition of a word is likely to change with context and language. You'll need to read a stream one character at a time, extracting words that match your definition of a word and stop when you have met your limit.
You can either read one character at a time, or only process 1000 characters from your string(s).
You may be able to set a limit on std::string and use that.
The following reads only the first count words (separated by whitespace) into a vector and discards the rest.
Punctuation attached to a word is read as part of it, because words are split on whitespace only, so you need to strip it from the vector's elements yourself.
std::vector<std::string> v;
int count=1000;
std::copy_if(std::istream_iterator<std::string>(std::cin),
// can use a ifstream here to read from file
std::istream_iterator<std::string>(),
std::back_inserter(v),
[&](const std::string & s){return --count >= 0;}
);
Hope this program helps you out. This code handles input of multiple words on a single line as well.
#include<iostream>
#include<string>
#include<cstring> // strtok
#include<cstdio> // getchar
using namespace std;
int main()
{
const int LIMIT = 5;
int counter = 0;
string line;
string words[LIMIT];
bool flag = false;
char* word;
do
{
cout<<"enter a word or a line";
getline(cin,line);
word = strtok(&line[0]," "); // strtok writes into the buffer, so use the string's own storage instead of casting away const from c_str()
while(word)
{
if(LIMIT == counter)
{
cout<<"Limit reached";
flag = true;
break;
}
words[counter] = word;
word = strtok(NULL," ");
counter++;
}
if(flag)
{
break;
}
}while(counter>0);
getchar();
}
As of now, this program has the limit to accept only 5 words and put it in a string array.
Use the following function:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684961%28v=vs.85%29.aspx
You can specify the third argument to limit the amount of read characters.