How can I extract the name from a line - C++

Assume that I have a line from a file that I want to read:
>NZ_FNBK01000055.1 Halorientalis regularis
How can I extract the name from a line that begins with a greater-than sign? Everything following the greater-than sign (excluding the newline at the end of the line) is the name.
The name should be:
NZ_FNBK01000055.1 Halorientalis regularis
Here is my code so far:
bool file::load(istream& file)
{
string line;
while(getline(genomeSource, line)){
if(line.find(">") != string::npos)
{
m_name =
}
}
return true;
}

You could easily handle both conditions using regular expressions. C++ introduced <regex> in C++11. Using this and a regex like:
>.*? (.*?) .*$
> Get the literal character
.*? Non greedy search for anything stopping at a space
(.*?) Non greedy search for anything stopping at a space, but grouping the characters beforehand.
.*$ Greedy search until the end of the string.
With this you can easily check if this line meets your criteria and get the name at the same time. For the code, the C++11 regex library is very simple:
std::string s = ">NZ_FNBK01000055.1 Halorientalis regularis ";
std::regex rgx(">.*? (.*?) .*$"); // Make the regex
std::smatch matches;
if(std::regex_search(s, matches, rgx)) { // Do a search
if (matches.size() > 1) { // If there are matches, print them.
std::cout << "The name is " << matches[1].str() << "\n";
}
}

Related

How can I replace all words in a string except one

So, I would like to change all words in a string except one, that stays in the middle.
#include <boost/algorithm/string/replace.hpp>
#include <iostream>
#include <string>
using namespace std;
int main()
{
string test = "You want to join player group";
string find = "You want to join group";
string replace = "This is a test about group";
boost::replace_all(test, find, replace);
cout << test << endl;
}
The output was expected to be:
This is a test about player group
But it doesn't work, the output is:
You want to join player group
The problem is on finding out the words, since they are a unique string.
Is there a function that reads all the words, no matter their position, and changes only what I want?
EDIT2:
This is the best example of what I want to happen:
char* a = "This is MYYYYYYYYY line in the void Translate"; // This is the main line
char* b = "This is line in the void Translate"; // This is what needs to be found in the main line
char* c = "Testing - is line twatawtn thdwae voiwd Transwlate"; // This needs to replace ALL the words in the char* b, preserving the MYYYYYYYYY
// The output is expected to be:
Testing - is MYYYYYYYY is line twatawtn thdwae voiwd Transwlate
You need to invert your thinking here. Instead of matching "All words but one", you need to try to match that one word so you can extract it and insert it elsewhere.
We can do this with Regular Expressions, which became standardized in C++11:
std::string test = "You want to join player group";
static const std::regex find{R"(You want to join (\S+) group)"};
std::smatch search_result;
if (!std::regex_search(test, search_result, find))
{
std::cerr << "Could not match the string\n";
exit(1);
}
else
{
std::string found_group_name = search_result[1];
auto replace = boost::format("This is a test about %1% group") % found_group_name;
std::cout << replace;
}
To match the word "player" I used a pretty simple regular expression, (\S+), which means "match one or more non-whitespace characters (greedily) and put that into a group".
"Groups" in regular expressions are enclosed by parentheses. The 0th group is always the entire match, and since we only have one set of parentheses, your word is therefore in group 1, hence the resulting access of the match result at search_result[1].
To create the regular expression, you'll notice I used the perhaps-unfamiliar string literal syntax R"(...)". This is called a raw string literal and was also standardized in C++11. It was basically made for describing regular expressions without needing to escape backslashes. If you've used Python, it's the same as r'...'. If you've used C#, it's the same as @"...".
I threw in some boost::format to print the result because you were using Boost in the question and I thought you'd like to have some fun with it :-)
In your example, find is not a substring of test, so boost::replace_all(test, find, replace); has no effect.
Removing group from find and replace solves it:
#include <boost/algorithm/string/replace.hpp>
#include <iostream>
int main()
{
std::string test = "You want to join player group";
std::string find = "You want to join";
std::string replace = "This is a test about";
boost::replace_all(test, find, replace);
std::cout << test << std::endl;
}
Output: This is a test about player group.
In this case, there is just one replace of the beginning of the string because the end of the string is already the right one. You could have another call of replace_all to change the end if needed.
Some other options:
one is in the other answer.
split the strings into a vector (or array) of words, then insert the desired word (player) at the right spot of the replace vector, then build your output string from it.

Regex in C++ how to search for valid Linux Device Node?

Given a device node in Linux such as "/dev/sda1" or "/dev/sdb", I'd like to match all valid choices to know if I have a valid device node.
Here's what I have so far:
static bool isUSBNameValid(const std::string &node) {
std::regex device("/dev/sd[a-z]*");
if (std::regex_match(node, device)) {
return true;
}
return false;
}
This does not work. Why is this?
How to make this work with any valid Linux device node?
Your /dev/sd[a-z]* pattern matches the literal substring /dev/sd followed by zero or more lowercase ASCII letters. When used with regex_match, the pattern must match the whole string. Since /dev/sda1 ends with a digit, regex_match fails for it, but it succeeds with /dev/sdb.
So, if you plan to only match SATA devices, you will need to use /dev/sd[a-z][0-9]* pattern, else, to match arbitrary number of alphanumeric chars after /dev/, you may use /dev/[[:alnum:]]+.
std::regex device_sata("/dev/sd[a-z][0-9]*");
std::regex device_any("/dev/[[:alnum:]]+");
See the C++ demo:
#include <regex>
#include <string>
#include <iostream>
using namespace std;
bool isUSBNameValid(const std::string &node, std::regex device) {
if (std::regex_match(node, device)) {
return true;
}
return false;
}
int main() {
std::regex device_sata("/dev/sd[a-z][0-9]*");
std::regex device_any("/dev/[[:alnum:]]+");
cout<< ( isUSBNameValid("/dev/sda1", device_sata) ? "Found" : "Not found")<<endl;
cout<< ( isUSBNameValid("/dev/sdb", device_sata) ? "Found" : "Not found")<<endl;
cout<< ( isUSBNameValid("/dev/ttyS0", device_any) ? "Found" : "Not found")<<endl;
return 0;
}
I would suggest the following pattern instead:
std::regex device("/dev/sd[a-z][0-9]*");
Add capturing groups around the [a-z] and [0-9]* if that becomes important.
If you truly want to match any device it would be:
std::regex device("/dev/[[:alnum:]]+");
with an additional check that what you have matched is not a directory. It would probably be good to add such a check (using stat) anyway.

Regexp matching fails with invalid special open parenthesis

I am trying to use regexps in C++11, but my code always throws a std::regex_error of "Invalid special open parenthesis.". A minimal example which tries to find the first duplicate character in a string:
std::string regexp_string("(?P<a>[a-z])(?P=a)"); // Nothing to be escaped here, right?
std::regex regexp_to_match(regexp_string);
std::string target("abbab");
std::smatch matched_regexp;
std::regex_match(target, matched_regexp, regexp_to_match);
for(const auto& m: matched_regexp)
{
std::cout << m << std::endl;
}
Why do I get an error and how do I fix this example?
There are 2 issues here:
std::regex flavors do not support named capturing groups / backreferences, you need to use numbered capturing groups / backreferences
You should use regex_search rather than regex_match that requires a full string match.
Use
std::string regexp_string(R"(([a-z])\1)");
std::regex regexp_to_match(regexp_string);
std::string target("abbab");
std::smatch matched_regexp;
if (std::regex_search(target, matched_regexp, regexp_to_match)) {
std::cout << matched_regexp.str() << std::endl;
}
// => bb
The R"(([a-z])\1)" raw string literal defines the ([a-z])\1 regex that matches any lowercase ASCII letter and then matches the same letter again.
http://en.cppreference.com/w/cpp/regex/ecmascript says that ECMAScript (the default type for std::regex) requires (?= for positive lookahead.
The reason your regex throws is that named groups are not supported by std::regex. However, you can still use what is available to find the first duplicated character in a string:
#include <iostream>
#include <regex>
int main()
{
std::string s = "abc def cde";
std::smatch m;
std::regex r("(\\w).*?(?=\\1)");
if (std::regex_search(s, m, r))
std::cout << m[1] << std::endl;
return 0;
}
Prints
c

Why does std::regex_match not support "zero-length assertions"?

#include <regex>
int main()
{
bool b = std::regex_match("building", std::regex("^\w*uild(?=ing$)"));
//
// b is expected to be true, but the actual value is false.
//
}
My compiler is clang 3.8.
Why does std::regex_match not support "zero-length assertions"?
regex_match is only for matching the entire input string. Your regex (written correctly as "^\\w*uild(?=ing$)" with the backslash escaped, or as a raw string R"(^\w*uild(?=ing$))") only actually matches (consumes) the prefix build. It looks ahead for ing$, and will successfully find it, but since the whole input string wasn't consumed, regex_match rejects the match.
If you want to use regex_match but only capture the first part, you could use ^(\w*uild)ing$ (or just (\w*uild)ing since the whole string must be matched) and access the 1st capture group.
But since you're using ^ and $ anyway, you might as well use regex_search instead:
#include <iostream>
#include <regex>
int main()
{
std::cmatch m;
if (std::regex_search("building", m, std::regex(R"(^\w*uild(?=ing$))"))) {
std::cout << "m[0] = " << m[0] << std::endl; // prints "m[0] = build"
}
return 0;
}

Unchecked exception while running regex - get file name without extension from file path

I have this simple program
string str = "D:\Praxisphase 1 project\test\Brainstorming.docx";
regex ex("[^\\]+(?=\.docx$)");
if (regex_match(str, ex)){
cout << "match found"<< endl;
}
I expect the result to be true. My regex works, since I have tried it online, but when I run it in C++ the app throws an unchecked exception.
First of all, use raw string literals when defining regex to avoid issues with backslashes (the \. is not a valid escape sequence, you need "\\." or R"(\.)"). Second, regex_match requires a full string match, thus, use regex_search.
#include <iostream>
#include <regex>
#include <string>
using namespace std;
int main() {
string str = R"(D:\Praxisphase 1 project\test\Brainstorming.docx)";
// OR
// string str = "D:\\Praxisphase 1 project\\test\\Brainstorming.docx";
regex ex(R"([^\\]+(?=\.docx$))");
if (regex_search(str, ex)){
cout << "match found"<< endl;
}
return 0;
}
Note that R"([^\\]+(?=\.docx$))" = "[^\\\\]+(?=\\.docx$)", the \ in the first are literal backslashes (and you need two backslashes in a regex pattern to match a \ symbol), and in the second, the 4 backslashes are necessary to declare 2 literal backslashes that will match a single \ in the input text.