Removing punctuation marks using ispunct()


ispunct() works well when words are separated like this: "one, two; three". It then removes ", ;" and replaces them with the given character.
But if the string is given as "ts='TOK_STORE_ID';", it takes "ts='TOK_STORE_ID';" as one single token, and it takes
"one,one, two;four$three two" as three tokens: 1. "one,one" 2. "two;four$three" 3. "two"
Is there any way for "one,one, two;four$three two" to be treated as "one one two four three two", with each word a separate token?
Writing manual code like:
for(i=0;i<str.length();i++)
{
//validating each character
}
This operation will become very costly when string is very very long.
So is there any other function like ispunct()? or anything else?
In C we do this to compare each character:
for(i=0;i<str.length();i++)
{
if(str[i]==',' || str[i]==';') // Is there any way here to compare with all punctuation in one shot?
str[i]=' '; //replace with space
}
In C++ what is the correct way for this?

"This operation will become very costly when string is very very long."
No, it won't. It is an O(n) operation, which is good for this problem. You cannot do better than that, because whichever way you go about it, you have to look at each and every character in the string.
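A single linear pass is also enough to get the tokens the question asks for. The following is a minimal sketch, not a definitive implementation: split_words is a hypothetical helper name, and it assumes the default "C" locale, where ispunct() treats characters such as ',', ';', '$', '\'' and '_' as punctuation.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: turns every punctuation character into a space,
// then splits the text on whitespace.
std::vector<std::string> split_words(std::string text) {
    // One O(n) pass over the copy of the input.
    std::replace_if(text.begin(), text.end(),
                    [](unsigned char c) { return std::ispunct(c) != 0; }, ' ');

    // operator>> skips runs of whitespace, so empty tokens never appear.
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string word; in >> word;) {
        words.push_back(word);
    }
    return words;
}
```

With the input "one,one, two;four$three two" this should yield the six tokens one, one, two, four, three, two. Note that '_' counts as punctuation as well, so "TOK_STORE_ID" would be split into three tokens.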

Assuming you're dealing with a typical 8-bit character set, I'd start by building a translation table:
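#include <cctype>   // ispunct
#include <climits>  // UCHAR_MAX
#include <vector>
std::vector<char> trans(UCHAR_MAX + 1); // one entry for every unsigned char value
for (int i = 0; i <= UCHAR_MAX; i++)
    trans[i] = ispunct(i) ? ' ' : i;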
Then processing a string of text can be something like this:
for (auto &ch : str)
ch = trans[(unsigned char)ch];
For an 8-bit character set, the translation table will typically all fit in your L1 cache, and the loop has only one branch that's highly predictable (always taken except when you reach the end of the string) so it should be fairly fast.
Just to be clear, when I say "fairly fast", I mean it's extremely unlikely that this would be the bottleneck in the process you've described. You'd need a combination of a slow processor and fast network connection to stand any chance of this being the bottleneck in processing data you're obtaining over a network.
If you have a Raspberry Pi with a 10 GbE network connection, you might need to do a little more optimization work for this to keep up (but I'm not sure even then). For any less radical mismatch, the network is clearly going to be the bottleneck.

"So is there any other function like ispunct()? or anything else?"
As a matter of fact, there is. man ispunct gives me this beautiful list:
int isalnum(int c);
int isalpha(int c);
int isascii(int c);
int isblank(int c);
int iscntrl(int c);
int isdigit(int c);
int isgraph(int c);
int islower(int c);
int isprint(int c);
int ispunct(int c);
int isspace(int c);
int isupper(int c);
int isxdigit(int c);
Take whichever you want.

You can also use std::remove_copy_if to remove the punctuation completely:
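#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
#include <string>
int main() {
    std::string words = "I,love.punct-uation!";
    std::string result; // receives the text with the punctuation removed
    // Copy every character that is not punctuation into result.
    // std::ptr_fun was removed in C++17, so a lambda is used here; taking an
    // unsigned char avoids undefined behavior for negative char values.
    std::remove_copy_if(words.begin(), words.end(),
                        std::back_inserter(result), // store output
                        [](unsigned char c) { return std::ispunct(c) != 0; });
    // ta-da!
    std::cout << result << '\n'; // prints "Ilovepunctuation"
}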

Related

Is there a way to seek the "\n" character that is faster than looping through char one at a time?

Looking at the sample implementation of wc.c, when counting the number of lines it loops through the file one character at a time, incrementing a counter for each '\n' to count the number of newlines:
#define COUNT(c) \
ccount++; \
if ((c) == '\n') \
lcount++;
Is there a way to just seek the file for '\n' and keep jumping to the newline characters and do a count?
Would seeking for '\n' be the same as just reading characters one at a time until we see '\n' and count it?

Well, all characters are not '\n', except for one.
A branch-less algorithm is likely to be faster.
Have you tried std::count, though?
#include <string>
#include <algorithm>
int main() {
const auto s = std::string("Hello, World!\nfoo\nbar\nbaz");
const auto lines_in_s = std::count(s.cbegin(), s.cend(), '\n');
return lines_in_s;
}
Or with a file:
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
int main() {
if (std::ifstream is("filename.txt"); is) {
const auto lines_in_file =
std::count(std::istreambuf_iterator<char>(is),
std::istreambuf_iterator<char>{}, '\n');
std::cout << lines_in_file << '\n';
}
}

The only way you could skip looking at every character would be if you had domain knowledge about the string you're currently looking at:
If you knew that you're handling a text with continuous paragraphs of at least 50 words or so, you could, after each '\n', advance by 100 or 200 chars, thus saving some time. You'd need to test and refine that jump length, of course, but then you wouldn't need to check every single char.
For a general-purpose counting function you're stuck with looking at every possible char.

Q: Is there a faster way to count the number of lines in a file than reading one character at a time?
A: The quick answer is no, but one can parallelize the counting, which might shorten the runtime, although the program would still have to run through every byte once. Such a program may be I/O bound, so how useful parallelization is in this case depends on the hardware involved.
Q: Is there a way to skip from one newline character to the next without having to read through all the bytes in between?
A: The quick answer is no, but if one had a really large text file, for example, one could make an 'index' file of offsets. One would still have to make one pass over the file in order to generate such a file, but once it was made, one could find the nth line by reading the nth offset in the index and then 'seek'-ing to it. The index would have to be maintained or regenerated every time the file changed, though. If one used fixed-width offsets, one could seek straight to the required entry in the index with some simple arithmetic, read the offset stored there, then seek to that position in the file. A line count can be obtained at the same time as generating the index. Once the index is generated, a line count could quickly be determined from the size of the index file if it has to be computed again.
It should probably be mentioned that counting '\n' bytes does not give the number of lines for every character encoding. In ASCII and UTF-8 the byte 0x0A only ever encodes a newline, so counting bytes works, but in an encoding such as UTF-16 the same byte value can occur inside other characters. In that case, one needs to scan the file character by character rather than just byte by byte, and to do that, one needs to know what character encoding scheme is being used.
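The offset index described in this answer can be sketched as follows. This is a simplified illustration under stated assumptions: it keeps both the text and the index in memory instead of in files, it assumes a single-byte or UTF-8 encoding, and build_line_index and nth_line are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One pass over the data: record the byte offset at which each line starts.
std::vector<std::size_t> build_line_index(const std::string& data) {
    std::vector<std::size_t> starts;
    if (!data.empty()) {
        starts.push_back(0);
    }
    for (std::size_t i = 0; i < data.size(); ++i) {
        // A new line starts after every '\n' that is not the last byte.
        if (data[i] == '\n' && i + 1 < data.size()) {
            starts.push_back(i + 1);
        }
    }
    return starts;
}

// Jump straight to line n (0 based) by using the index.
// Throws std::out_of_range if n is not a valid line number.
std::string nth_line(const std::string& data,
                     const std::vector<std::size_t>& starts, std::size_t n) {
    const std::size_t begin = starts.at(n);
    const std::size_t end = data.find('\n', begin);
    return data.substr(begin,
                       end == std::string::npos ? std::string::npos : end - begin);
}
```

Here starts.size() doubles as the line count, which mirrors the point about deriving the count from the size of the index.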

You can use the strchr function to "jump" to the next '\n' in a string. It will be faster on some platforms, because strchr is usually implemented in assembly language and uses processor instructions that can scan memory faster, where such instructions are available.
Something like this:
#include <string.h>
unsigned int count_newlines(const char *str) {
unsigned result = 0;
const char *s = str;
while ((s = strchr(s, '\n')) != NULL) {
++result; // found one '\n'
++s; // and start searching again from the next character
}
return result;
}
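strchr stops at the first '\0', so it only suits NUL-terminated strings. For a buffer read from a file, which has a known length and may contain '\0' bytes, the same idea can be sketched with std::memchr. The function name below is hypothetical, and whether this is faster than a plain loop depends on the platform's memchr implementation.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: counts '\n' bytes in a buffer of known size.
std::size_t count_newlines_in_buffer(const char* data, std::size_t size) {
    std::size_t count = 0;
    const char* p = data;
    const char* const end = data + size;
    while (p < end) {
        // memchr scans at most (end - p) bytes and returns nullptr on a miss.
        const void* hit = std::memchr(p, '\n', static_cast<std::size_t>(end - p));
        if (hit == nullptr) {
            break;
        }
        ++count;                                 // found one '\n'
        p = static_cast<const char*>(hit) + 1;   // continue after it
    }
    return count;
}
```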

String Management C/C++ & Writing and Reading From txt File

I am facing a problem with reading and writing a string from and to a file respectively.
Purpose:
To enter a string into a text file as a complete sentence, read the string from the text file and separate all words that start from a vowel using a function and display them as a sentence. (The sentence just needs to consist of the words from the string that start with a vowel.)
Problem:
The code runs, but because I used the getline() function to read the string from the txt file, when I extract a substring from it, the substring includes the rest of the file after the vowel instead of just the word. I cannot work out how to make the substring include only the word.
Code:
#include <fstream>
#include <string>
#include <iostream>
#include <cstring>
using namespace std;
string vowels(string a)
{
int c=sizeof(a);
string b[c];
string d;
static int n;
for(int i=1;i<=c;i++)
{
if (a.find("a")!=-1)
{
b[i]=a.substr(a.find("a",n));
d+=b[i];
n=a.find("a")+1;
}
else if (a.find("e")!=-1)
{
b[i]=a.substr(a.find("e",n));
d+=b[i];
n=a.find("e")+1;
}
else if (a.find("i")!=-1)
{
b[i]=a.substr(a.find("i",n));
d+=b[i];
n=a.find("i")+1;
}
else if (a.find("o")!=-1)
{
b[i]=a.substr(a.find("o",n));
d+=b[i];
n=a.find("o")+1;
}
else if (a.find("u")!=-1)
{
b[i]=a.substr(a.find("u",n));
d+=b[i];
n=a.find("u")+1;
}
}
return d;
}
int main()
{
string input,lne,e;
ofstream file("output.txt", ios::app);
cout<<"Please input text for text file input: ";
getline(cin,input);
file << input;
file.close();
ifstream myfile("output.txt");
getline(myfile,lne);
e=vowels(lne);
cout<<endl<<"Text inside file reads: ";
cout<<lne;
cout<<endl;
cout<<e<<endl;
system("pause");
myfile.close();
return 0;
}

I haven't read your code VERY carefully, but several things stand out:
Look up find_first_of - it'll simplify your code A LOT.
sizeof(a) certainly doesn't do what you think it does [unless you think it gives you the size of the std::string class type - which makes it rather strange as a use-case, why not use either 12 or 24?]
find (and find_first_of), technically speaking, doesn't return -1 when the function doesn't find what you want. It returns std::string::npos [which may appear to be -1, but a) is not guaranteed to be, and b) is unsigned so can't be negative].
Your program only reads one line.
x.substr(n) will give you the string of x from position n - is that what you want?
Don't repeat find, use p = x.find("X"); and then do x.substr(p) [assuming that is what you want].
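As a small illustration of the find_first_of and npos points in the list above, here is a hedged sketch. The helper names find_vowel and has_vowel are made up for this example and do not appear in the question.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: position of the first vowel at or after `from`,
// or std::string::npos if there is none.
std::size_t find_vowel(const std::string& text, std::size_t from = 0) {
    return text.find_first_of("aeiouAEIOU", from);
}

// Compare against npos rather than -1.
bool has_vowel(const std::string& text) {
    return find_vowel(text) != std::string::npos;
}
```

One call replaces the five separate find branches, and the result is stored once instead of calling find repeatedly.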

There are various problems with your code.
int c = sizeof( a );
This is the number of bytes that a string takes up in memory. And you certainly don't want to create an array of this many strings as it makes no sense for what you're trying to achieve. Don't do this to yourself. You're only copying one string inside the loop, all you need is one string and you already have string d.
To get the actual size of a string, you have to call
str.size()
string.substr(..) takes up to two arguments. With only one argument, an index, it returns the substring starting at that index in the original string (the string starting at the vowel all the way through to the end of the string).
What you are probably looking for is the call with two arguments: the start index (the beginning of the word) and the number of characters to extract (the length of the word, not the index where it ends).
The string input will not take the newline that you enter to flush cin. And then you add it to the file in append mode, so after running the program a few times your file is a huge one-liner. Did you really intend to do this?
Maybe you should explicitly add a new line to the file after entering the input. Something like file << std::endl;
Also, the conditions in the ifs
if (a.find("a")!=-1)
Don't match what you do next,
b[i]=a.substr(a.find("a",n));
Then you use a static int,
static int n;
This is bad, because this function will only work once. You're lucky that static initializes its values to zero, but you should always initialize explicitly. In your case, you don't need this to be static.
Finally: "so i was unsure of how many loops to run"
When you don't know how many loops you have to run, then a for loop is not adequate.
You should use a while loop or a do while.
You shouldn't try to learn C++ by guessing, because that's what it looks like you're doing. You're trying to do more than you know and making some very silly mistakes. Find a good book to learn from, or at the very least google the functions you're using to see what they do and how to use them properly. (e.g. http://www.cplusplus.com/reference/string/string/substr/ )
Here's a list of books from stackoverflow's FAQ: The Definitive C++ Book Guide and List
The last thing is about finding vowels. When you find a vowel, you have to make sure it's at the beginning of a word. Then you want to read it until the word ends, that is when you find a character that is not part of a word. (a whitespace, certain punctuation, ... ) This should mark the beginning and end of the word.
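One possible way to put that last point into code is sketched below. It is only an illustration under a simplifying assumption: words are whatever operator>> yields, so they are separated by whitespace only and any attached punctuation stays part of the word. The function name vowel_words is hypothetical.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical helper: keeps only the words whose first letter is a vowel
// and joins them with single spaces.
std::string vowel_words(const std::string& sentence) {
    std::istringstream in(sentence);
    std::string result;
    for (std::string word; in >> word;) {
        // operator>> never yields an empty word, so word[0] is valid.
        const char first = static_cast<char>(
            std::tolower(static_cast<unsigned char>(word[0])));
        if (std::string("aeiou").find(first) != std::string::npos) {
            if (!result.empty()) {
                result += ' ';
            }
            result += word;
        }
    }
    return result;
}
```

Because each word is extracted whole before its first letter is checked, the "rest of the line" problem from the question does not arise.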

want to optimize this string manipulation program c++

I've just solved this problem:
http://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&page=show_problem&problem=3139
Here's my solution:
https://ideone.com/pl8K3K
int main(void)
{
string s, sub;
int f,e,i;
while(getline(cin, s)){
f=s.find_first_of("[");
while(f< s.size()){
e= s.find_first_of("[]", f+1);
sub = s.substr(f, e-f);
s.erase(f,e-f);
s.insert(0, sub);
f=s.find_first_of("[", f+1);
}
for(i=0; i<s.size(); i++){
while((s[i]==']') || (s[i]=='[')) s.erase(s.begin()+i);
}
cout << s << endl;
}
return 0;
}
I get TLE (time limit exceeded), and I'd like to know which operation in my code is too expensive and how to optimize the code.
Thanks in advance.

If I am reading your problem correctly, you need to rethink your design. There is no need for functions to search, no need for erase, substr, etc.
First, don't think about the [ or ] characters right now. Start out with a blank string and add characters to it from the original string. That is the first thing that speeds up your code. A simple loop is what you should start out with.
Now, while looping, when you actually do encounter those special characters, all you need to do is change the "insertion point" in your output string to either the beginning of the string (in the case of [) or the end of the string (in the case of ]).
So the trick is to not only build a new string as you go along, but also change the point of insertion into the new string. Initially, the point of insertion is at the end of the string, but that will change if you encounter those special characters.
If you are not aware, you can build a string not by just using += or +, but also using the std::string::insert function.
So for example, you always build your output string this way:
out.insert(out.begin() + curInsertionPoint, original_text[i]);
curInsertionPoint++;
The out string is the string you're building, the original_text is the input that you were given. The curInsertionPoint will start out at 0, and will change if you encounter the [ or ] characters. The i is merely a loop index into the original string.
I won't post any more than this, but you should get the idea.
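For readers who want to see the idea end to end, here is a minimal sketch of the insertion-point approach, assuming '[' moves the insertion point to the beginning and ']' moves it to the end as described above. The function name apply_home_end is hypothetical, and the index form of std::string::insert is used in place of the iterator form.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds the output while tracking where the next
// character has to be inserted.
std::string apply_home_end(const std::string& original_text) {
    std::string out;
    std::size_t curInsertionPoint = 0;  // end of the (still empty) output
    for (char ch : original_text) {
        if (ch == '[') {
            curInsertionPoint = 0;            // jump to the beginning
        } else if (ch == ']') {
            curInsertionPoint = out.size();   // jump to the end
        } else {
            out.insert(curInsertionPoint, 1, ch);
            ++curInsertionPoint;
        }
    }
    return out;
}
```

Inserting into the middle of a std::string shifts the characters behind the insertion point, so input with many '[' jumps can still be slow. If that turns out to matter, a linked structure such as std::list<char> with an iterator as the insertion point is one alternative to consider.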

What is wrong with my UVa code

I tried to solve this problem on UVa, but I am getting a wrong answer and I can't seem to find the error.
http://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&page=show_problem&problem=2525
#include<cstdio>
#include<cstring>
using namespace std;
int main()
{
int t,j,k,i=1;
char a[1000];
while(scanf("%d",&t)!=EOF && t)
{
int sum=0;
getchar();
gets(a);
k=strlen(a);
for(j=0;j<k;j++)
{ if(a[j]=='a'||a[j]=='d'||a[j]=='g'||a[j]=='j'||a[j]=='m'||a[j]=='p'||a[j]=='t'||a[j]=='w'||a[j]==32)
sum=sum+1;
else if(a[j]=='b'||a[j]=='e'||a[j]=='h'||a[j]=='k'||a[j]=='n'||a[j]=='q'||a[j]=='u'||a[j]=='x')
sum=sum+2;
else if(a[j]=='c'||a[j]=='f'||a[j]=='i'||a[j]=='l'||a[j]=='o'||a[j]=='r'||a[j]=='v'||a[j]=='y')
sum=sum+3;
else if(a[j]=='s'||a[j]=='z')
sum=sum+4;
}
printf("Case #%d: %d\n",i,sum);
i++;
}
return 0;
}

In the problem description there is a single number that indicates the number of texts that will be in the input afterwards. Your original code was trying to read the number before every row of input.
The attempt to read the number in each one of the rows will fail since the input character set does not include any digits, so you could be inclined to think that there should be no difference. But there is, when you try to read a number it will start by consuming the leading whitespace. If the input is:
< space >< space >a
The output should be 3 (two '0' and one '2' keys), but the attempt to read the number out of the line will consume the two leading whitespace characters and the later gets will read the string "a", rather than " a". Your count will be off by the amount of leading whitespace.

Separate your code into functions that do specific things: read the data from the file, calculate the number of key presses for each input, output the result.
Benefit:
You can test each function independently. It is also easier to reason about the code.
The maximum size of an input is 100, which means you only need an array of 101 characters (including the final \0) for each input, not 1000.
Since this question is also tagged C++, try to use std::vector and std::string in your code.
The inner for loop seems right at a cursory glance. The benefit of having a specialized function that computes the number of key presses is that you can easily verify it does the correct thing. Make sure you check it thoroughly.

How to read a file and get words in C++

I am curious as to how I would go about reading the input from a text file with no set structure (Such as notes or a small report) word by word.
The text for example might be structured like this:
"06/05/1992
Today is a good day;
The worm has turned and the battle was won."
I was thinking maybe getting the line using getline, and then seeing if I can split it into words via whitespace from there. Then I thought using strtok might work! However I don't think that will work with the punctuation.
Another method I was thinking of was getting everything char by char and omitting the characters that were undesired. Yet that one seems unlikely.
So, to cut a long story short:
Is there an easy way to read an input from a file and split it into words?

Since it's easier to write than to find the duplicate question,
#include <iostream>
#include <iterator>
#include <string>
std::istream_iterator<std::string> word_iter( my_file_stream ), word_iter_end;
std::size_t wordcnt = 0; // must be initialized before it is printed
for ( ; word_iter != word_iter_end; ++ word_iter, ++ wordcnt ) {
    std::cout << "word " << wordcnt << ": " << * word_iter << '\n';
}
The std::string argument to istream_iterator tells it to return a string when you do *word_iter. Every time the iterator is incremented, it grabs another word from its stream.
If you have multiple iterators on the same stream at the same time, you can choose between data types to extract. However, in that case it may be easier just to use >> directly. The advantage of an iterator is that it can plug into the generic functions in <algorithm>.

Yes. You're looking for std::istream::operator>> :) Note that it will remove consecutive whitespace but I doubt that's a problem here.
i.e.
std::ifstream file("filename");
std::vector<std::string> words;
std::string currentWord;
while(file >> currentWord)
words.push_back(currentWord);

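You can use getline with a space character as the delimiter, e.g. stream.getline(buffer, 1000, ' '); where buffer is a char array of at least 1000 characters. Note that this splits on spaces only, not on newlines.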
Or perhaps you can use this function to split a string into several parts, with a certain delimiter:
string StrPart(string s, char sep, int i) {
string out="";
int n=0, c=0;
for (c=0;c<(int)s.length();c++) {
if (s[c]==sep) {
n+=1;
} else {
if (n==i) out+=s[c];
}
}
return out;
}
Notes: This function assumes that you have declared using namespace std;.
s is the string to be split.
sep is the delimiter
i is the part to get (0 based).

You can use the scanner technique to grab words, numbers, dates, etc. It is very simple and flexible. The scanner normally returns tokens (word, number, real, keyword, etc.) to a parser.
If you later intend to interpret the words, I would recommend this approach.
I can warmly recommend the book "Writing Compilers and Interpreters" by Ronald Mak (Wiley Computer Publishing)
