How to extract specific elements from a string?

I am trying to extract the first number from each block of numbers in the following string.
string s = "f 1079//2059 1165//2417 1164//2414 1068//1980";
In this example I need to extract 1079, 1165, 1164 and 1068.
I have tried getline and substr, but I have not been able to make them work.

You can use <regex> (the C++ regular expression library) with the pattern (\\d+)//, which matches a run of digits immediately followed by a double slash. The parentheses form a capture group, so the digits alone are available as a submatch.
Usage:
std::string s = "f 1079//2059 1165//2417 1164//2414 1068//1980";
std::regex pattern("(\\d+)//");
auto match_iter = std::sregex_iterator(s.begin(), s.end(), pattern);
auto match_end = std::sregex_iterator();
for (; match_iter != match_end; ++match_iter)
{
    const std::smatch& m = *match_iter;
    std::cout << m[1].str() << std::endl; // sub-match for the token in parentheses: 1079, 1165, ...
    // m[0]: whole match, e.g. "1079//"
    // m[1]: first submatch, e.g. "1079"
}

I usually reach for istringstream for this kind of thing:
std::string input = "f 1079//2059 1165//2417 1164//2414 1068//1980";
std::istringstream is(input);
char f;
if (is >> f)
{
int number, othernumber;
char slash1, slash2;
while (is >> number >> slash1 >> slash2 >> othernumber)
{
// Process 'number'...
}
}
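For completeness, here is a self-contained sketch of the istringstream approach above that collects the leading numbers into a vector. The function name extractFirstNumbers is hypothetical, and the sketch assumes the line starts with a one-character tag and that every block has the exact form N//M.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the number before "//" in each block.
// Assumes a leading one-character tag (such as 'f') and blocks of the
// form <int>//<int> separated by whitespace.
std::vector<int> extractFirstNumbers(const std::string& input)
{
    std::vector<int> result;
    std::istringstream is(input);
    char tag;
    if (is >> tag)
    {
        int number, otherNumber;
        char slash1, slash2;
        // Extraction stops at the first block that does not parse.
        while (is >> number >> slash1 >> slash2 >> otherNumber)
        {
            if (slash1 == '/' && slash2 == '/')
                result.push_back(number);
        }
    }
    return result;
}
```

Called with the string from the question, this should yield 1079, 1165, 1164 and 1068. Note that formatted extraction silently stops at the first malformed block rather than reporting an error.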

Here is an attempt with getline and substr that works.
auto extractValues(const std::string& source)
-> std::vector<std::string>
{
auto target = std::vector<std::string>{};
auto stream = std::stringstream{ source };
auto currentPartOfSource = std::string{};
while (std::getline(stream, currentPartOfSource, ' '))
{
auto partBeforeTheSlashes = std::string{};
auto positionOfSlashes = currentPartOfSource.find("//");
if (positionOfSlashes != std::string::npos)
{
target.push_back(currentPartOfSource.substr(0, positionOfSlashes));
}
}
return target;
}

Another way is to split the string into tokens, although this involves some string copying.
Consider a split_by function like
std::vector<std::string> split_by(const std::string& str, const std::string& delim);
Possible implementations are in "Split a string in C++?".
Split the string by spaces first, then split each token by // and take the first item.
std::vector<std::string> tokens = split_by(s, " ");
std::vector<std::string> words;
std::transform(tokens.begin() + 1, tokens.end(), // drop first "f"
std::back_inserter(words),
[](const std::string& s){ return split_by(s, "//")[0]; });
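The snippet above only declares split_by. One possible implementation is sketched here; the behavior for edge cases is an assumption of this sketch, not something the answer specifies: empty tokens between adjacent delimiters are kept, and an empty delimiter returns the whole string as a single token.

```cpp
#include <string>
#include <vector>

// A possible split_by: splits str on every occurrence of delim.
// Always returns at least one element, so split_by(s, "//")[0] is safe.
std::vector<std::string> split_by(const std::string& str, const std::string& delim)
{
    std::vector<std::string> parts;
    if (delim.empty())
    {
        parts.push_back(str); // assumption: nothing to split on
        return parts;
    }
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = str.find(delim, start)) != std::string::npos)
    {
        parts.push_back(str.substr(start, pos - start));
        start = pos + delim.size();
    }
    parts.push_back(str.substr(start)); // the tail after the last delimiter
    return parts;
}
```

Because every token is copied into a new std::string, this is simple rather than fast.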

Related

Could you recommend, how reimplement split function to work with string_view?

I wrote this split function, but I can't find an easy way to split by a string_view (several separator characters).
My function:
size_t split(std::vector<std::string_view>& result, std::string_view in, char sep) {
result.reserve(std::count(in.begin(), in.end(), in.find(sep) != std::string::npos) + 1);
for (auto pfirst = in.begin();; ++pfirst) {
auto pbefore = pfirst;
pfirst = std::find(pfirst, in.end(), sep);
result.emplace_back(q, pfirst-pbefore);
if (pfirst == in.end())
return result.size();
}
}
I want to call this split function with string_view separator. For example:
str = "apple, phone, bread\n keyboard, computer"
split(result, str, "\n,")
Result:['apple', 'phone', 'bread', 'keyboard', 'computer']
My question is: how can I implement this function so that it is as fast as possible?

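First, you are using std::count() incorrectly. It counts the elements equal to its third argument, and here that argument is a bool (the result of comparing find() with npos), not the separator.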
Second, std::string_view has its own find_first_of() and substr() methods, which you can use in this situation, instead of using iterators. find_first_of() allows you to specify multiple characters to search for.
Try something more like this:
size_t split(std::vector<std::string_view>& result, std::string_view in, std::string_view seps) {
result.reserve(std::count_if(in.begin(), in.end(), [&](char ch){ return seps.find(ch) != std::string_view::npos; }) + 1);
std::string_view::size_type start = 0, end;
while ((end = in.find_first_of(seps, start)) != std::string_view::npos) {
result.push_back(in.substr(start, end-start));
start = in.find_first_not_of(' ', end+1);
}
if (start != std::string_view::npos)
result.push_back(in.substr(start));
return result.size();
}

This is my take on splitting a string view. It loops once over all the characters in the string view and returns a vector of string_views (so no data is copied).
The calling code can still use words.size() to get the size if needed.
(I use the C++20 std::set::contains function.)
Live demo here : https://onlinegdb.com/tHfPIeo1iM
#include <iostream>
#include <set>
#include <string_view>
#include <vector>
auto split(const std::string_view& string, const std::set<char>& separators)
{
std::vector<std::string_view> words;
auto word_begin{ string.data() };
std::size_t word_len{ 0ul };
for (const auto& c : string)
{
if (!separators.contains(c))
{
word_len++;
}
else
{
// we found the end of a word, not a repeated separator
if (word_len > 0)
{
words.emplace_back(word_begin, word_len);
word_begin += word_len;
word_len = 0;
}
word_begin++;
}
}
// string_view doesn't have a trailing zero so
// also no trailing separator so if there is still
// a word in the "pipeline" add it too
if (word_len > 0)
{
words.emplace_back(word_begin, word_len);
}
return words;
}
int main()
{
std::set<char> seperators{ ' ', ',', '.', '!', '\n' };
auto words = split("apple, phone, bread\n keyboard, computer", seperators);
bool comma = false;
std::cout << "[";
for (const auto& word : words)
{
if (comma) std::cout << ", ";
std::cout << word;
comma = true;
}
std::cout << "]\n";
return 0;
}

I do not know about performance, but this code seems a lot simpler
std::vector<std::string> ParseDelimited(
const std::string &l, char delim )
{
std::vector<std::string> token;
std::stringstream sst(l);
std::string a;
while (getline(sst, a, delim))
token.push_back(a);
return token;
}

Split strings into tokens with delimiter (/ and -) in c++

I have some text (meaningful text or arithmetical expression) and I want to split it into words.
If I had a single delimiter, I'd use:
std::stringstream stringStream(inputString);
std::string word;
while(std::getline(stringStream, word, delimiter))
{
wordVector.push_back(word);
}
How can I break the string into tokens with several delimiters?

Assuming one of the delimiters is newline, the following reads the line and further splits it by the delimiters. For this example I've chosen the delimiters space, apostrophe, and semi-colon.
std::stringstream stringStream(inputString);
std::string line;
while(std::getline(stringStream, line))
{
std::size_t prev = 0, pos;
while ((pos = line.find_first_of(" ';", prev)) != std::string::npos)
{
if (pos > prev)
wordVector.push_back(line.substr(prev, pos-prev));
prev = pos+1;
}
if (prev < line.length())
wordVector.push_back(line.substr(prev, std::string::npos));
}

If you have boost, you could use:
#include <boost/algorithm/string.hpp>
std::string inputString("One!Two,Three:Four");
std::string delimiters("|,:");
std::vector<std::string> parts;
boost::split(parts, inputString, boost::is_any_of(delimiters));

Using std::regex
A std::regex can do string splitting in a few lines:
std::regex re("[\\|,:]");
std::sregex_token_iterator first{input.begin(), input.end(), re, -1}, last;//the '-1' is what makes the regex split (-1 := what was not matched)
std::vector<std::string> tokens{first, last};
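A self-contained sketch of the same idea, wrapped in a function. The name splitOnAny and the fixed delimiter set are assumptions made for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical wrapper: splits input on any of '|', ',' or ':'.
std::vector<std::string> splitOnAny(const std::string& input)
{
    // Inside a bracket expression '|' is a literal, so no escape is needed.
    static const std::regex re("[|,:]");
    // -1 selects the text between matches instead of the matches themselves.
    std::sregex_token_iterator first(input.begin(), input.end(), re, -1), last;
    return std::vector<std::string>(first, last);
}
```

Adjacent delimiters produce empty tokens with this approach, so callers that do not want them need to filter the result.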

I don't know why nobody pointed out the manual way, but here it is:
const std::string delims(";,:. \n\t");
inline bool isDelim(char c) {
for (int i = 0; i < delims.size(); ++i)
if (delims[i] == c)
return true;
return false;
}
and in function:
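std::stringstream stringStream(inputString);
std::string word; int c; // int, so EOF can be told apart from any char
while (stringStream) {
    word.clear();
    // Read word
    while ((c = stringStream.get()) != EOF && !isDelim(static_cast<char>(c)))
        word.push_back(static_cast<char>(c));
    if (c != EOF)
        stringStream.unget();
    if (!word.empty()) // skip the empty word produced by leading delimiters
        wordVector.push_back(word);
    // Read delims
    while ((c = stringStream.get()) != EOF && isDelim(static_cast<char>(c)));
    if (c != EOF)
        stringStream.unget();
}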
This way you can do something useful with the delims if you want.

And here, ages later, a solution using C++20:
constexpr std::string_view words{"Hello-_-C++-_-20-_-!"};
constexpr std::string_view delimiters{"-_-"};
for (const auto word : std::views::split(words, delimiters)) {
    // each element is a subrange; build a string_view from its iterators
    std::cout << std::quoted(std::string_view(word.begin(), word.end())) << ' ';
}
// outputs: "Hello" "C++" "20" "!"
Required headers:
#include <iomanip>
#include <iostream>
#include <ranges>
#include <string_view>
Reference: https://en.cppreference.com/w/cpp/ranges/split_view

If you are interested in how to do it yourself without using boost:
Assume the delimiter string may be very long, say of length M. Checking every character of your string against every delimiter costs O(M) per character, so doing this in a loop over a string of length N is O(M*N).
I would use a dictionary instead (like a map from delimiter to boolean, but here a simple boolean array that holds true at the index equal to each delimiter's character code).
Now checking whether a character is a delimiter is O(1), which gives O(N) overall (plus O(M) to fill the table).
Here is my sample code:
const int dictSize = 256;
vector<string> tokenizeMyString(const string &s, const string &del)
{
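bool dict[dictSize] = { false }; // not static, so each call starts with a fresh table
vector<string> res;
for (size_t i = 0; i < del.size(); ++i) {
dict[static_cast<unsigned char>(del[i])] = true; // cast avoids a negative index
}
string token("");
for (auto &i : s) {
if (dict[static_cast<unsigned char>(i)]) {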
if (!token.empty()) {
res.push_back(token);
token.clear();
}
}
else {
token += i;
}
}
if (!token.empty()) {
res.push_back(token);
}
return res;
}
int main()
{
string delString = "MyDog:Odie, MyCat:Garfield MyNumber:1001001";
//the delimiters are " " (space) and "," (comma)
vector<string> res = tokenizeMyString(delString, " ,");
for (auto &i : res) {
cout << "token: " << i << endl;
}
return 0;
}
Note: tokenizeMyString returns the vector by value. Because res is a local variable that is returned directly, the compiler can typically apply return value optimization (RVO) or move the vector, so no deep copy is made.

Using Eric Niebler's range-v3 library:
https://godbolt.org/z/ZnxfSa
#include <string>
#include <iostream>
#include "range/v3/all.hpp"
int main()
{
std::string s = "user1:192.168.0.1|user2:192.168.0.2|user3:192.168.0.3";
auto words = s
| ranges::view::split('|')
| ranges::view::transform([](auto w){
return w | ranges::view::split(':');
});
ranges::for_each(words, [](auto i){ std::cout << i << "\n"; });
}

How to replace multiple sets of keywords in a string?

So I have a file of strings that I am reading in, and I have to replace certain values in them with other values. The number of possible replacements is variable: the patterns and their replacements are read from a file. Currently I store them in a vector<pair<string,string>>. However, I run into an issue:
Example:
Input string: abcd.eaef%afas&333
Delimiter patterns:
. %%%
% ###
& ###
Output I want: abcd%%%eaef###afas###333
Output I get: abcd#########eaef###afas###333
The issue is that it ends up replacing the % sign, or any other symbol, that was itself produced by an earlier replacement. It should not do that.
My code is (relevant portions):
std::string& replace(std::string& s, const std::string& from, const std::string& to){
if(!from.empty())
for(size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size()) s.replace(pos, from.size(), to);
return s;
}
string line;
vector<pair<string, string>> myset;
while(getline(delimiterfile, line)){
istringstream is(line);
string delim, pattern;
if(is >> delim >> pattern){
myset.push_back(make_pair(delim, pattern));
} else {
throw runtime_error("Invalid pattern pair!");
}
}
while(getline(input, line)){
string temp = line;
for(auto &item : myset){
replace(temp, item.first, item.second);
}
output << temp << endl;
}
Can someone please tell me what I'm messing up and how to fix it?

In pseudo-code a simple replacement algorithm could look something like this:
string input = getline();
string output; // The string containing the replacements
for (each char in input)
{
if (char == '.')
output += "%%%";
// TODO: Other replacements
else
output += char;
}
If you implement the above code, once it's done the variable output will contain the string with all replacements made.
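The pseudo-code above can be generalized to the vector<pair<string,string>> from the question. The following is a minimal sketch, not the asker's code: the function name replaceAll and the "first matching rule wins" policy are assumptions. The key point is that replacement text is appended to the output and never scanned again.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: scans the input once. At each position it applies the
// first rule whose "from" text matches there; otherwise it copies one char.
std::string replaceAll(const std::string& input,
                       const std::vector<std::pair<std::string, std::string>>& rules)
{
    std::string output;
    std::string::size_type pos = 0;
    while (pos < input.size())
    {
        bool matched = false;
        for (const auto& rule : rules)
        {
            const std::string& from = rule.first;
            if (!from.empty() && input.compare(pos, from.size(), from) == 0)
            {
                output += rule.second; // goes to the output, never rescanned
                pos += from.size();
                matched = true;
                break;
            }
        }
        if (!matched)
            output += input[pos++];
    }
    return output;
}
```

With the rules from the question (". %%%", "% ###", "& ###"), the input abcd.eaef%afas&333 should become abcd%%%eaef###afas###333, because the % characters inserted for the dot are not examined again.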

I would suggest you use stringstream. This way you will be able to achieve what you are looking for very easily.

Remove whitespace, convert case, in string except in quotes

I am using C++03 without Boost.
Suppose I have a string such as.. The day is "Mon day"
I want to process this to
THEDAYISMon day
That is, convert to upper case what is not in the quote, and remove whitespace that isn't in the quote.
The string may not contain quotes, but if it does, there will only be 2.
I tried using STL algorithms but I get stuck on how to remember if it's in a quote or not between elements.
Of course I can do it with good old for loops, but I was wondering if there is a fancy C++ way.
Thanks.
This is what I have using a for loop
while (getline(is, str))
{
// remove whitespace and convert case except in quotes
temp.clear();
bool bInQuote = false;
for (string::const_iterator it = str.begin(), end_it = str.end(); it != end_it; ++it)
{
char c = *it;
if (c == '\"')
{
bInQuote = (! bInQuote);
}
else
{
if (! ::isspace(c))
{
temp.push_back(bInQuote ? c : ::toupper(c));
}
}
}
swap(str, temp);
}

You can do something with STL algorithms like the following:
#include <iostream>
#include <string>
#include <algorithm>
#include <cctype>
using namespace std;
struct convert {
void operator()(char& c) { c = toupper((unsigned char)c); }
};
bool isSpace(char c)
{
return std::isspace((unsigned char)c) != 0; // cast: negative char values are undefined for isspace
}
int main() {
string input = "The day is \"Mon Day\" You know";
cout << "original string: " << input <<endl;
string::size_type firstQuote = input.find("\""); // size_type, so the npos comparisons below are reliable
string::size_type secondQuote = input.find_last_of("\"");
string firstPart="";
string secondPart="";
string quotePart="";
if (firstQuote != string::npos)
{
firstPart = input.substr(0,firstQuote);
if (secondQuote != string::npos)
{
secondPart = input.substr(secondQuote+1);
quotePart = input.substr(firstQuote+1, secondQuote-firstQuote-1);
//drop those quotes
}
std::for_each(firstPart.begin(), firstPart.end(), convert());
firstPart.erase(remove_if(firstPart.begin(),
firstPart.end(), isSpace),firstPart.end());
std::for_each(secondPart.begin(), secondPart.end(), convert());
secondPart.erase(remove_if(secondPart.begin(),
secondPart.end(), isSpace),secondPart.end());
input = firstPart + quotePart + secondPart;
}
else //does not contains quote
{
std::for_each(input.begin(), input.end(), convert());
input.erase(remove_if(input.begin(),
input.end(), isSpace),input.end());
}
cout << "transformed string: " << input << endl;
return 0;
}
It gave the following output:
original string: The day is "Mon Day" You know
transformed string: THEDAYISMon DayYOUKNOW
With the test case you have shown:
original string: The day is "Mon Day"
transformed string: THEDAYISMon Day

Just for laughs, use a custom iterator, std::copy and a std::back_insert_iterator, and an operator++ that knows to skip whitespace and set a flag on a quote character:
CustomStringIt& CustomStringIt::operator++ ()
{
if(index_<originalString_.size())
++index_;
if(!inQuotes_ && isspace(originalString_[index_]))
return ++(*this);
if('\"'==originalString_[index_])
{
inQuotes_ = !inQuotes_;
return ++(*this);
}
return *this;
}
char CustomStringIt::operator* () const
{
char c = originalString_[index_];
return inQuotes_ ? c : std::toupper(c) ;
}

You can use stringstream and getline with the \" character as the delimiter instead of newline.
Split your string into 3 cases: the part of the string before the first quote, the part in quotes, and the part after the second quote.
You would process the first and third parts before adding to your output, but add the second part without processing.
If your string contains no quotes, the entire string will be contained in the first part. The second and third parts will just be empty.
while (getline (is, str)) {
    string processed;
    stringstream line(str);
    string beforeFirstQuote;
    string inQuotes;
    string afterSecondQuote;
    getline(line, beforeFirstQuote, '\"');
    Process(beforeFirstQuote, processed);
    getline(line, inQuotes, '\"');
    processed += inQuotes;
    getline(line, afterSecondQuote, '\"');
    Process(afterSecondQuote, processed);
}
void Process(const string& input, string& output) {
for (string::const_iterator it = input.begin(), end_it = input.end(); it != end_it; ++it)
{
char c = *it;
if (! ::isspace(c))
{
output.push_back(::toupper(c));
}
}
}

Need a regular expression to extract only letters and whitespace from a string

I'm building a small utility method that parses a line (a string) and returns a vector of all the words. The istringstream code below works fine except when there is punctuation, so my fix is to "sanitize" the line before I run it through the while loop.
I would appreciate some help in using the regex library in C++ for this. My initial solution was to use substr() and go to town, but that seems complicated, as I would have to iterate over each character, test what it is, and then perform some operations.
vector<string> lineParser(Line * ln)
{
vector<string> result;
string word;
string line = ln->getLine();
istringstream iss(line);
while(iss)
{
iss >> word;
result.push_back(word);
}
return result;
}

You don't need regular expressions just for punctuation:
// Replace all punctuation with space character.
std::replace_if(line.begin(), line.end(),
std::ptr_fun<int, int>(&std::ispunct),
' '
);
Or if you want everything but letters and numbers turned into space:
std::replace_if(line.begin(), line.end(),
std::not1(std::ptr_fun<int,int>(&std::isalnum)),
' '
);
While we are here:
Your while loop is broken and will push the last value into the vector twice.
It should be:
while(iss)
{
    iss >> word;
    if (iss) // If the read failed, the state of iss is bad,
    {        // so only push_back() while the state is still good.
        result.push_back(word);
    }
}
Or the more common version:
while(iss >> word) // Loop is only entered if the read of the word worked.
{
result.push_back(word);
}
Or you can use the stl:
std::copy(std::istream_iterator<std::string>(iss),
std::istream_iterator<std::string>(),
std::back_inserter(result)
);

[^A-Za-z\s] should do what you need if you replace the matching characters with nothing. It matches every character that is not a letter or whitespace, so removing the matches leaves only letters and whitespace. Use [^A-Za-z0-9\s] if you want to keep digits too.
You can use online tools such as http://gskinner.com/RegExr/ (Replace tab) to test your patterns. Some modifications may be required depending on the regex library you are using.
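With the standard <regex> library, this pattern can be applied through std::regex_replace. A minimal sketch follows; the function name keepLettersAndSpaces is hypothetical, and the pattern only covers ASCII letters.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes every character that is neither an ASCII
// letter nor whitespace, using the pattern suggested above.
std::string keepLettersAndSpaces(const std::string& line)
{
    // The backslash is doubled because this is a C++ string literal.
    static const std::regex unwanted("[^A-Za-z\\s]");
    return std::regex_replace(line, unwanted, "");
}
```

The sanitized line can then be fed to the istringstream loop from the question. Removing punctuation outright joins words that were separated only by punctuation (for example "a,b" becomes "ab"); replacing with a space instead avoids that.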

I'm not positive, but I think this is what you're looking for:
#include<iostream>
#include<regex>
#include<string>
#include<vector>
int
main()
{
std::string line("some words: with some punctuation.");
std::regex words("[\\w]+");
std::sregex_token_iterator i(line.begin(), line.end(), words);
std::vector<std::string> list(i, std::sregex_token_iterator());
for (auto j = list.begin(), e = list.end(); j != e; ++j)
std::cout << *j << '\n';
}
Output:
some
words
with
some
punctuation

The simplest solution is probably to create a filtering
streambuf to convert all non alphanumeric characters to space,
then to read using std::copy:
class StripPunct : public std::streambuf
{
std::streambuf* mySource;
char myBuffer;
protected:
virtual int underflow()
{
int result = mySource->sbumpc();
if ( result != EOF ) {
if ( !::isalnum( result ) )
result = ' ';
myBuffer = result;
setg( &myBuffer, &myBuffer, &myBuffer + 1 );
}
return result;
}
public:
explicit StripPunct( std::streambuf* source )
: mySource( source )
{
}
};
std::vector<std::string>
LineParser( std::istream& source )
{
StripPunct sb( source.rdbuf() );
std::istream src( &sb );
return std::vector<std::string>(
(std::istream_iterator<std::string>( src )),
(std::istream_iterator<std::string>()) );
}
