Getting full name and values from string with regex and c++

I have a project where I am reading data from a text file in C++. Each line is one entry: a person's name followed by up to 4 numbers, like this:
Dave Light 89 71 91 89
Hua Tran Du 81 79 80
I am wondering whether regex would be an efficient way of splitting the name from the numerical values, or whether I should find an alternative method.
I would also like to detect errors in the text file as I read each entry, such as a letter where a number is expected, as in an entry like this:
Andrew Van Den J 88 95 85

You would be better off using a separator other than a space. The separator could be :, |, ^ or anything else that cannot be part of your data. With this approach, your data would be stored as:
Dave Light:89:71:91:89
Hua Tran Du:81:79:80
And then you can use find, find_first_of, strchr or strstr or any other searching (and re-searching) to find relevant data.

This non-regex solution:
std::string str = "Dave Light 89 71 91 89";
std::size_t firstDig = str.find_first_of("0123456789");
std::string str1 = str.substr (0,firstDig);
std::string str2 = str.substr (firstDig);
would give you the letter part in str1 and the number part in str2.
Check this code at ideone.com.
It sounds like this is roughly what you want, though I'm not quite sure what kind of errors you mean to catch. As paxdiablo pointed out, a name can be quite complex, so taking everything before the first digit as the name is probably the safest approach.

Try this code.
#include <iostream>
#include <regex>
#include <string>
#include <vector>
int main(){
std::vector<std::string> data {"Dave Light 89 71 91 ","Hua Tran Du 81 79 80","zyx 1 2 3 4","zyx 1 2"};
std::regex pat {R"((^[A-Za-z\s]*)(\d+)\s*(\d+)\s*(\d+)(\s*)$)"};
for(auto& line : data) {
std::cout<<line<<std::endl;
std::smatch matches; // matched strings go here
if (regex_search(line, matches, pat)) {
//std::cout<<"size:"<<matches.size()<<std::endl;
if (matches.size()==6)
std::cout<<"Name:"<<matches[1].str()<<"\t"<<"data1:"<<matches[2].str()<<"\tdata2:"<<matches[3].str()<<"\tdata3:"<<matches[4].str()<<std::endl;
}
}
}
With a regex, the amount of code is greatly reduced. The main trick is choosing the right pattern. Note that, as written, this pattern matches only lines with exactly three numbers.
Hope this will help you.

Related

Printing elements of a tuple

I am trying to print the elements of a tuple returned by a function where I am comparing the elements of a vector of addresses to those in a database. The fields are: 32-bit int representing the address, int for prefix matching, string containing ASN, string containing matching address, string containing the original address being queried.
for (auto itr = IPs.begin(); itr != IPs.end(); itr++) {
tuple<int,int,string,string,string> entry = Compare(*itr, database);
string out = get<3>(entry) + "/" + to_string(get<1>(entry)) + " " + get<2>(entry) + " " + get<4>(entry) + "\n";
cout << out;
}
I want each line of the output to look like this:
"{prefix}/{# bits of prefix} {ASN} {address}\n"
However, the output looks like this:
12.105.69.1528 15314
12.125.142.190 6402
57.0.208.2450 6085
208.148.84.30 4293
208.148.84.16 4293
208.152.160.797 5003
192.65.205.2509 5400
194.191.154.806 2686
199.14.71.79 1239
199.14.70.79 1239
The expected output is:
12.105.69.144/28 15314 12.105.69.152
12.125.142.16/30 6402 12.125.142.19
57.0.208.244/30 6085 57.0.208.245
208.148.84.0/30 4293 208.148.84.3
208.148.84.0/24 4293 208.148.84.16
208.152.160.64/27 5003 208.152.160.79
192.65.205.248/29 5400 192.65.205.250
194.191.154.64/26 2686 194.191.154.80
199.14.71.0/24 1239 199.14.71.79
199.14.70.0/24 1239 199.14.70.79
The part that confuses me the most is the fact that when I print each element on separate lines by replacing each separator with line breaks, it prints the elements correctly:
12.105.69.144
28
15314
12.105.69.152
12.125.142.16
30
6402
12.125.142.19
57.0.208.244
30
6085
57.0.208.245
208.148.84.0
30
4293
208.148.84.3
208.148.84.0
24
4293
208.148.84.16
208.152.160.64
27
5003
208.152.160.79
192.65.205.248
29
5400
192.65.205.250
194.191.154.64
26
2686
194.191.154.80
199.14.71.0
24
1239
199.14.71.79
199.14.70.0
24
1239
199.14.70.79
I suppose that I could just write another function that formats the line breaks into the correct format afterwards, but I am curious about what is causing this. Any ideas?

Could you provide a little more code, so it can be debugged to precisely track the problem?
I think tuple and get are used correctly.
I guess the problem is in the content of the strings, or at least in the string returned by get<2>(entry).
Here is a little example which shows what might be wrong
std::string aa = "AAAAA\r"; //"\r" is extra character in aa string
std::string bb = "bbb";
std::cout << aa + " " + bb; //output is " bbbA" not "AAAAA bbb"
The problem obviously doesn't occur when each string is printed on its own line.
Double-check that the strings returned by get<X> don't contain any special characters, such as a stray carriage return left over from mixing OS X, Linux, and Windows line endings.

BOOST Regex global Search behavior

My question is about whether the boost regex engine can do "global searches".
I've tried and I can't get it to do it.
The match_results class contains the base pointer of the string, so after incrementing the
starting position manually then setting the match_flag_type to match_not_bob | match_prev_avail,
I would have thought the boost regex engine would be able to know it is in the middle of a string.
Since I'm using this engine in my software, I'd like to know whether it can in fact do this correctly and I'm doing something wrong, or whether global searching is simply not possible with this engine.
Below are sample code/output using BOOST regex, and an equivalent Perl script.
Edit: Just to clarify, in the Boost example below the Start iterator is always treated as a boundary. The engine doesn't seem to consider text to the left of that position when making a match.
At least in this case.
7/22/2014 - The Solution for Global Search
Posting this update as the solution. It's not a workaround or kludge.
After googling 'regex_iterator' I knew that regex_iterator sees the text to the left of the
current search position. And, I came across all the same source code. One site (like the others)
had a simple passing explanation of how it works, saying that it calls 'regex_search()'
when the regex_iterator is incremented.
So down in the bowels of the regex_iterator class, I saw that it indeed called regex_search() when
the iterator was incremented ->Next().
This 'regex_search()' overload wasn't documented and comes in only 1 type.
It includes a BIDI parameter at the end named 'base'.
bool regex_search(BidiIterator first, BidiIterator last,
match_results<BidiIterator, Allocator>& m,
const basic_regex<charT, traits>& e,
match_flag_type flags,
BidiIterator base)
{
if(e.flags() & regex_constants::failbit)
return false;
re_detail::perl_matcher<BidiIterator, Allocator, traits> matcher(first, last, m, e, flags, base);
return matcher.find();
}
It appears that base is the wall to the left of the start BIDI: the earliest position that an initial lookbehind may examine when checking its conditions.
So, I tested it out and it seemed to work.
The bottom line is to set base BIDI to the start of the input, and put the start BIDI anywhere after.
Effectively, this is like setting the pos() variable in Perl.
And, to emulate global positional increment on a zero-length match, a simple conditional is all that's
needed:
Start = ( _M[0].length() == 0) ? _M[0].first + 1 : _M[0].second; (see below)
BOOST Regex 1.54 regex_search() using 'base' BIDI
Note - in this example, Start always = _M[0].second;
The regex is purposely unlike the two other examples (below it), to demonstrate in fact
the text from 'Base' to 'Start' is considered each time when matching this regex.
typedef std::string::const_iterator SITR;
boost::regex Rx( "(?<=(.)).", regex_constants::perl );
regex_constants::match_flag_type Flags = match_default;
string str("0123456789");
SITR Start = str.begin();
SITR End = str.end();
SITR Base = Start;
boost::smatch _M;
while ( boost::regex_search( Start, End, _M, Rx, Flags, Base) )
{
string str1(_M[1].first, _M[1].second );
string str0(_M[0].first, _M[0].second );
cout << str1 << str0 << endl;
// This line implements the Perl global match flag m//g ->
Start = ( _M[0].length() == 0) ? _M[0].first + 1 : _M[0].second;
}
output:
01
12
23
34
45
56
67
78
89
Perl 5.10
use strict;
use warnings;
my $str = "0123456789";
while ( $str =~ /(?<=(..))/g )
{
print ("$1\n");
}
output:
01
12
23
34
45
56
67
78
89
BOOST Regex 1.54 regex_search() no 'base'
string str("0123456789");
std::string::const_iterator Start = str.begin();
std::string::const_iterator End = str.end();
boost::regex Rx("(?<=(..))", regex_constants::perl);
regex_constants::match_flag_type Flags = match_default;
boost::smatch _M;
while ( boost::regex_search( Start, End, _M, Rx, Flags) )
{
string str(_M[1].first, _M[1].second );
cout << str << "\n";
Flags |= regex_constants::match_prev_avail;
Flags |= regex_constants::match_not_bob;
Start = _M[0].second;
}
output:
01
23
45
67
89

Updated in response to the comments (Live On Coliru):
#include <boost/regex.hpp>
int main()
{
using namespace boost;
std::string str("0123456789");
std::string::const_iterator start = str.begin();
std::string::const_iterator end = str.end();
boost::regex re("(?<=(..))", regex_constants::perl);
regex_constants::match_flag_type flags = match_default;
boost::smatch match;
while (start<end &&
boost::regex_search(start, end, match, re, flags))
{
std::cout << match[1] << "\n";
start += 1; // NOTE
//// some smartness that should work for most cases:
// start = (match.length(0)? match[0] : match.prefix()).first + 1;
flags |= regex_constants::match_prev_avail;
flags |= regex_constants::match_not_bob;
std::cout << "at '" << std::string(start,end) << "'\n";
}
}
Prints:
01 at '123456789'
12 at '23456789'
23 at '3456789'
34 at '456789'
45 at '56789'
56 at '6789'
67 at '789'
78 at '89'
89 at '9'

reading and parsing a file, assigning each piece of the parsed string to its own variable

int Student::loadStudents() {
Student newStudent;
string comma;
string line;
ifstream myfile("student.dat");
string name,email="";
string status="";
int id;
if (myfile.is_open()){
while ( getline (myfile,line) ) {
//parse line
string myText(line);
istringstream iss(myText);
if(!(iss>>id)) id=0;

std::ignore(1,',');
std::getline(iss,name,',');
std::getline(iss,status,',');
std::getline(iss,email,',');
cout<<name<<endl;
Student newStudent(id,name,status,email);
Student::studentList.insert(std::pair<int,Student>(id,newStudent));
Above is the method I am defining. When the cout is executed the output is:
John Doe
Matt Smith
Before I added in the second getline(iss,name,',') the cout did nothing.
Can anyone explain why it works with the line repeated and why the same code won't work for status and email?
example line from file:
1,john doe,freshman,jd#email.com
EDIT:
I used std::ignore(1,',') before the first getline(iss,name,',') and received the error 'ignore' is undeclared in this namespace 'std'.

Can anyone explain why it works with the line repeated and why the same code won't work for status and email?
Because your first operation on iss is iss>>id.
Presumably your input file is of the form id,name,status,email. That first operation reads up to but not including the first comma. That first comma is still in the input stream. This means your first std::getline(iss,name,',') reads all the stuff remaining before that first comma and that first comma. All the stuff remaining before that first comma -- that's an empty string.
It's best not to mix parsing concepts. Split the line along the commas, then parse each of those split elements.
Edit
Another way to deal with this issue: call iss.ignore() instead of that first call to std::getline. (ignore is a member function of the stream, std::istream::ignore, not a free function in namespace std, which is why std::ignore(1,',') fails to compile.) The next character to be read should be a comma, so just ignore it. This is okay if you can assume a properly formatted input file. It is not okay if you have to deal with the vagaries of input files created by humans.
Another issue: Suppose someone's name is "John Doe, PhD" or the email address is "John Doe, PhD "?
Edit 2
Just to clarify, suppose the line contains "1234,John Doe,freshman,jdoe#college_name.edu".
Input pointer prior to iss>>id:
1234,John Doe,freshman,jdoe#college_name.edu
^
The call to iss>>id sets id to 1234 and advances the input pointer to the first non-numeric character -- the first comma.
Input pointer after iss>>id (prior to first call to std::getline):
1234,John Doe,freshman,jdoe#college_name.edu
____^
The first std::getline(iss,name,',') sees the input pointer is at a comma. It sets name to the empty string and advances the input pointer to just after the comma.
Input pointer after first call to std::getline (prior to second call to std::getline):
1234,John Doe,freshman,jdoe#college_name.edu
_____^
The second std::getline(iss,name,',') reads up to the second comma. It sets name to "John Doe" and advances the input pointer to just after the second comma.
Input pointer after second call to std::getline (prior to third call to std::getline):
1234,John Doe,freshman,jdoe#college_name.edu
______________^

How to show regex output in pair in c++?

I don't know if this makes any sense, but here it is:
Is there a way I can get two words from a regex result each time?
Suppose I have a text file which contains a string such as the following:
Alex Fenix is an Engineer who works for Ford Automotive Company. His
Personal ID is <123456>;etc....
Basically, if I use \w I would get a list of:
Alex
Fenix
is
an
Engineer
and etc
They are all separated by whitespace and punctuation marks.
What I am asking is whether there is a way to get a list such as:
Alex Fenix
is an
Engineer who
works for
Ford Automotive
Company His
Personal ID
is 123456
How can I achieve such a format?
Is it even possible, or should I store the first results in an array and then iterate through them to create the second list?
By the way, please note that the item Alex Fenix is actually an abstraction of a map or a similar container.
The reason I am asking is that I am trying to see whether I can read a file directly, apply a regex to it, and get this second list without any further processing overhead
(by which I mean reading into a map or string, then iterating through it, creating pairs of the tokens, and then carrying on with whatever is needed).

Try this regex
\w+ \w+
It will match any word followed by a space and another word.
That said, you can achieve such a format relatively easily without using a regex. Take a look at this, for instance:
#include <iostream>
#include <sstream>
#include <string>
#include <algorithm>
int main() {
std::string s("Alex Fenix is an Engineer who works for Ford Automotive Company. His Personal ID is <123456>");
// Remove any occurrences of '.', '<' or '>' (erase-remove idiom).
s.erase(std::remove_if(begin(s), end(s), [] (const char c) {
return (c == '.' || c == '<' || c == '>');
}), end(s));
// Tokenize.
std::istringstream iss(s);
std::string t1, t2;
while (iss >> t1 >> t2) {
std::cout << t1 << " " << t2 << std::endl;
}
}
Output:
Alex Fenix
is an
Engineer who
works for
Ford Automotive
Company His
Personal ID
is 123456

How to extract a string that is present between two brackets?

For example if the string is:
XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]
The output should be:
20 BB EC 45 40 C8 97 20 84 8B 10
int main()
{
char input = "XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]";
char output[500];
// what to write here so that i can get the desired output as:
// output = "20 BB EC 45 40 C8 97 20 84 8B 10"
return 0;
}

In C, you could do this with a scanset conversion (though it's a bit RE-like, so the syntax gets a bit strange):
sscanf(input, "[%*[^]]][%[^]]]", second_string);
In case you're wondering how that works, the first [ matches an open bracket literally. Then you have a scanset, which looks like %[allowed_chars] or %[^not_allowed_chars]. In this case, you're scanning up to the first ], so it's %[^]]. In the first one, we have a * between the % and the rest of the conversion specification, which means sscanf will try to match that pattern, but ignore it -- not assign the result to anything. That's followed by a ] that gets matched literally.
Then we repeat essentially the same thing over again, but without the *, so the second data that's matched by this conversion gets assigned to second_string.
With the typo fixed and a bit of extra code added to skip over the initial XYZ ::, working (tested) code looks like this:
#include <stdio.h>
int main() {
char *input = "XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]";
char second_string[64];
sscanf(input, "%*[^[][%*[^]]][%[^]]]", second_string);
printf("content: %s\n", second_string);
return 0;
}

Just find the second [ and start extracting (or just printing) until next ]....

You can use string::substr if you are willing to convert to std::string
If you don't know the location of brackets, you can use string::find_last_of for the last bracket and again string::find_last_of to find the open bracket.

Well, say, your file looks like this:
XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]
XYZ ::[1][Maybe some other text]
XYZ ::[1][Some numbers maybe: 123 98345 123 9-834 ]
XYZ ::[1][blah-blah-blah]
The code that will extract the data will look something like this:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
//opening the file to read from
std::ifstream file( "in.txt" );
if( !file.is_open() )
{
cout << "Cannot open the file";
return -1;
}
std::string in, out;
int blockNumber = 1;//Which bracket block we are looking for. We are currently looking for the second one.
while( getline( file, in ) )
{
int n = 0;//Variable for storing index in the string (where our target text starts)
int i = 0;//Counter for [] blocks we have encountered.
while( i <= blockNumber )
{
//What we are doing here is searching for the position of [ symbol, starting
//from the n + 1'st symbol of the string.
n = in.find_first_of('[', n + 1);
i++;
}
//Getting our data and printing it.
out = in.substr( n + 1, ( in.find_first_of(']', n) - n - 1) );
std::cout << out << std::endl;
}
return 0;
}
The output after executing this will be:
20 BB EC 45 40 C8 97 20 84 8B 10
Maybe some other text
Some numbers maybe: 123 98345 123 9-834
blah-blah-blah

The simplest solution is something along the lines of:
std::string
match( std::string const& input )
{
static boost::regex const matcher( ".*\\[[^]]*\\]\\[(.*)\\]" );
boost::smatch matched;
return regex_match( input, matched, matcher )
? matched[1]
: std::string();
}
The regular expression looks a bit complicated because you need to match
meta-characters, and because the compiler I use doesn't support raw
strings yet. (With raw strings, I think the expression would be
R"^(.*\[[^]]*\]\[(.*)\])^". But I can't verify that.)
This returns an empty string in case there is no match; if you're sure
about the format, you might prefer to throw an exception. You can also
extend it to do as much error checking as necessary: in general, the
more you validate a text input, the better it is, but you didn't give
precise enough information about what was legal for me to fill it out
completely. (For your example string, for example, you might replace
the ".*" at the beginning of the regular expression with
"\\u{3}\\s*::": three upper case characters followed by zero or more
whitespace, then two ':'. Or the first [] group might be
"\\[\\d\\]", if you're certain it's always a single digit.)

This could work for you in a very specific sense:
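std::string str(input);
std::size_t open = str.find_last_of('[');
std::size_t close = str.find_last_of(']');
// Characters strictly between the last '[' and the last ']'.
std::string output = str.substr(open + 1, close - open - 1);
This assumes both brackets are present; otherwise check the positions against std::string::npos first. You probably need to define your question a little better as well, since this will only work if you want the bracketed string at the end.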

Using the C string library. Here is a code snippet that processes a single line; it can be used in a loop that reads the file line by line. NOTE: string.h should be included.
int length = strlen( input );
char* output = 0;
// Search
char* firstBr = strchr( input, '[' );
if( 0 != firstBr++ ) // check for null pointer
{
char* secondBr = strchr( firstBr, '[' );
// we don't need '['
if( 0 != secondBr++ )
{
int nOutLen = strlen( secondBr ) - 1;
if( 0 < nOutLen )
{
output = new char[nOutLen+1];
strncpy( output, secondBr, nOutLen );
output[ nOutLen ] = '\0';
}
}
}
if( 0 != output )
{
cout << output;
delete[] output;
output = 0;
}
else
{
cout << "Error!";
}

You could use this sscanf scanset (regex-like, though not a true regular expression) to get what is inside "<" and ">":
// Format: "<%999[^>]>" (max of 999 bytes, so dest needs room for 1000)
int n1 = sscanf(source, "<%999[^>]>", dest);
