Arabic regex matching - c++

I need to check whether a given string contains Arabic letters, which fall in the ranges \u0600-\u06FF and \u0750-\u077F.
I have written the program below:
std::vector<STD_STRING> strFieldvalues;
std::string pattern = "/[\u0600-\u06FF\u0750-\u077F]/";
std::string strFieldVal;
gboolArabic = false;
int i = 0;
int j = 0;
for ( ;i < fieldValues.size() && j< fieldNames.size(); i++,j++) //for loop its entering
{
strFieldVal=fieldValues[i].GetPString();
if (std::regex_match(strFieldVal, std::regex("(sub)(/[\u0600-\u06FF\u0750-\u077F]/)")))
{
gboolArabic = true;
gArabicFieldNames.push_back(fieldNames[j].GetPString());
}
}
strFieldVal arrives as <0067><062A><0627>, but execution never enters the if block.
Can anyone help?
The sample program given below works in an online compiler. In Visual Studio, it does not enter the if block. Adding screenshots.

It appears you need to remove regex delimiters on both ends of the regex and apply a + quantifier to the regex pattern because regex_match requires a full string match:
#include <iostream>
#include <regex>
int main() {
std::string strFieldVal("المتحدة");
std::regex pattern("[\u0600-\u06FF\u0750-\u077F]+");
if (std::regex_match(strFieldVal, pattern))
{
std::cout << strFieldVal << " is Arabic.\n";
}
return 0;
}
See the C++ demo; output: المتحدة is Arabic.

#include <iostream>
#include <regex>
int main() {
std::wstring strFieldVal(L"المتحدة");
std::wregex pattern(L"[\u0600-\u06FF\u0750-\u077F]+");
if (std::regex_match(strFieldVal, pattern))
{
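// a std::wstring must be written to std::wcout; non-ASCII output may also need a suitable locale
std::wcout << strFieldVal << L" is Arabic.\n";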
}
return 0;
}
The above version works correctly. In Visual Studio, when I added a C++ source file with this content, it asked about the file encoding and I answered Yes. Then it worked perfectly.

Related

Using the already-defined regex pattern in another regex pattern and a question about applying regex to file

How can I use an already-defined regex pattern inside another regex pattern? For example, in the following code sign and number are defined, and I want to use them in defining relation:
regex sign("=<|=|>|<=|<>|>=");
regex number("^[1-9]\\d*");
regex relation(number, sign, number)
So, I need to find all matches (to the pattern like 23<=34 or 123<>2000) in the given file.
Since I haven't completed the relation, I've been testing with sign:
#include <iostream>
#include <fstream>
#include <regex>
using namespace std;
int main() {
regex sign("=<|=|>|<=|<>|>=");
regex digit("[0-9]");
regex number("^[1-9]\\d*");
//regex relation("^[1-9]\d*[=<|=|>|<=|<>|>=]^[1-9]\d*"); (this part is what I couldn't do)
string line;
ifstream fin;
fin.open("name.txt");
if (fin.good()) {
while (getline(fin, line)) {
bool match_sign = regex_search(line, sign);
if (match_sign) {
cout << line << endl; // but I need to print the match only
}
}
}
return 0;
}
When I want to print the matches in the file, it prints the whole line which contains any match. How can I make it print only the match itself but not the whole line?
Update:
#include <iostream>
#include <fstream>
#include <vector>
#include <regex>
using namespace std;
#define REGEX_SIGN "=<|=|>|<=|<>|>="
#define REGEX_DIGIT "[0-9]"
#define REGEX_NUMBER "^" REGEX_DIGIT "\\d*"
int main() {
regex sign(REGEX_SIGN);
regex digit(REGEX_DIGIT);
regex number(REGEX_NUMBER);
regex relation(REGEX_NUMBER REGEX_SIGN REGEX_NUMBER);
string line, text;
ifstream fin;
fin.open("name.txt");
if (fin.good()) {
while (getline(fin, line)) {
text += line + " ";
}
int count = 0;
string word = "";
for (int i = 0; i < text.length(); i++) {
if (text[i] == ' ') {
cout << "word = " << word << " | match: " << regex_match(word, relation) << endl;
if (regex_match(word, relation)) {
cout << word << endl;
}
word = "";
}
else {
word += text[i];
}
}
}
// cout << text << endl;
return 0;
}
Current name.txt looks like this:
But I think the regular expression is not working right:
It says no word matches. Where is the problem?
Reusing a smaller regex object inside a larger regex is not really possible.
The only workaround I can see is to define the regex strings as macros, and use the compiler's string-literal concatenation feature to create larger strings:
#define REGEX_SIGN "=<|=|>|<=|<>|>="
#define REGEX_DIGIT "[0-9]"
#define REGEX_NUMBER "^" REGEX_DIGIT "\\d*"
regex sign(REGEX_SIGN);
regex digit(REGEX_DIGIT);
regex number(REGEX_NUMBER);
regex relation(REGEX_NUMBER REGEX_SIGN REGEX_NUMBER);
This doesn't reuse the actual regex objects; it only creates longer literal strings from smaller ones.

How to find all sentences except those defined using regular expressions?

The bottom line is that I need to find all the comments in some Python code and cut them out, leaving only the code itself.
But I can't do it the other way around. That is, I can find the comments themselves, but I cannot find everything except them.
I tried using "?!" and came up with a regular expression like "(.*)(?!#.*)", but it does not work as I expected.
The code I attached also has an "else" branch that I tried to use, writing to different variables, but for some reason execution never even gets there.
#include <iostream>
#include <fstream>
#include <string>
#include <regex>
int main()
{
std::string line;
std::string new_line;
std::string result;
std::string result_re;
std::string path;
std::smatch match;
std::regex re("(#.*)");
std::cout << "Enter the path\n";
std::cin >> path;
std::ifstream in(path);
if (in.is_open())
{
while (getline(in, line))
{
if (std::regex_search(line, match, re))
{
for (int i = 0; i < match.size(); i++)
result_re += match[i + 1];
result_re += "\n";
}
else
{
for (int i = 0; i < match.size(); i++)
result += match[i];
//result += "\n";
}
std::cout << line << std::endl;
}
}
in.close();
std::cout << result_re << std::endl;
std::cout << "End of program" << std::endl;
std::cout << result << std::endl;
system("pause");
return 0;
}
As I said above, I want to get everything except the comments, not the other way around.
I also need to search for multi-line comments, which are written as """Text""".
But with this implementation I can't even imagine how to do it: the file is read line by line, so I don't see how to capture a multi-line comment with a regular expression.
I would be grateful for your advice and help.
1. Don't try parsing your input file line by line. Instead, read in the whole text and let regex replace all the comments; this way your entire program would look like this:
#include <iostream>
#include <string>
#include <fstream>
#include <iterator> // istream_iterator
#include <regex>
using namespace std; // for brevity
int main() {
cout << "Enter the path: ";
string filename;
getline(cin, filename);
string pprg{ istream_iterator<char>(ifstream{filename, ifstream::in} >> noskipws),
istream_iterator<char>{} };
pprg = regex_replace(pprg, regex{"#.*"}, "");
cout << pprg << endl;
}
2. Handling multi-line Python literals """...""" with C++ regex is quite awkward (unlike the example above): there are a few mutually exclusive requirements (imho):
the regex should be extended POSIX, but
POSIX regex does not support empty regex matches; however,
crafting an RE to match a negated sequence of characters requires a negative look-ahead assertion, which is an empty match :(
Thus you would need to put together some programming logic of your own to remove multi-line Python text literals.

Lexical Analyzer Project - Vector not outputting correctly

I have the following code, which is part of a larger project. It is supposed to go through the line character by character looking for "tokens". The token I am looking for in this code is an ID, which is defined as a letter followed by zero or more digits or letters.
When a letter is detected, the code enters the inner loop and walks through the following characters, appending each digit or letter to the ID string until it finds the end-of-ID character (which is defined in the code), and then adds that ID string to a vector. At the end of the line it should output each element of the vector. I'm not getting the output I need. I hope this is enough information to understand what is going on in the code. If someone could help me fix this problem I would be very grateful. Thank you!
The output I need: ab : ab
What I get: a : a
#include <iostream>
#include <regex>
#include <string>
#include <vector>
int main()
{
std::vector<std::string> id;
std::regex idstart("[a-zA-Z]");
std::regex endID("[^a-z]|[^A-Z]|[^0-9]");
std::string line = "ab ab";
//Loops character by character through the line
//Adding each recognized token to the appropriate vector
for ( int i = 0; i<line.length(); i++ )
{
std::string tempstring(1,line[i]);
//Character is letter
if ( std::regex_match(tempstring,idstart) )
{
std::string tempIDString = tempstring;
int lineInc = 0;
for ( int j = i + 1; j<line.length(); j++)
{
std::string tempstring2(1,line[j]);
//Checks next character for end of potential ID
if ( std::regex_match(tempstring2,endID) )
{
i+=lineInc+1;
break;
}
else
{
tempIDString+=tempstring2;
lineInc++;
}
}
id.push_back(tempIDString);
}
}
std::cout << id.at(0) << " : " << id[1] << std::endl;
return 0;
}
The question is 2.5 years old, and by now you will maybe laugh when you see it. You break out of the inner for loop as soon as the second character matches endID, so tempstring2 is never appended to tempIDString. That happens for every character, because "[^a-z]|[^A-Z]|[^0-9]" matches any single character: 'b', for example, is not in A-Z.
But let's forget about that code. There is no good design here.
You had a good idea in using std::regex, but you did not know how it worked.
So let's have a look at a correct implementation:
#include <iostream>
#include <string>
#include <algorithm>
#include <iterator> // std::ostream_iterator
#include <vector>
#include <regex>
// Our test data (raw string). So, containing also \n and so on
std::string testData(
R"#( :-) IDcorrect1 _wrongID I2DCorrect
3FALSE lowercasecorrect Underscore_not_allowed
i3DCorrect,i4 :-)
}
)#");
std::regex re("(\\b[a-zA-Z][a-zA-Z0-9]*\\b)");
int main(void)
{
// Define the variable id as vector of string and use the range constructor to read the test data and tokenize it
std::vector<std::string> id{ std::sregex_token_iterator(testData.begin(), testData.end(), re, 1), std::sregex_token_iterator() };
// For debug output. Print complete vector to std::cout
std::copy(id.begin(), id.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
return 0;
}
This does all the work in the definition of the variable and by calling the range constructor. So, a typical one-liner.
Hope somebody can learn from this code . . .

How do I check for stored "\t" in a string?

Can someone explain to me how to properly search for a "tab" character stored in a string class?
For example:
text.txt contents:
std::cout << "Hello"; // contains one tab space
User enters on prompt: ./a.out < text.txt
main.cpp:
string arrey;
getline(cin, arrey);
int i = 0;
while( i != 10){
if(arrey[i] == "\t") // error here
{
std::cout << "I found a tab!!!!"
}
i++;
}
Since there is only one tab space in the textfile, I am assuming it is stored in index [0], but the problem is that I can't seem to make a comparison and I don't know any other way of searching it. Can someone help explain an alternative?
Error: ISO C++ forbids comparison between pointer and integer
First of all, what is i? And secondly, when you use array-indexing of a std::string object, you get a character (i.e. a char) and not a string.
The char is converted to an int and then the compiler tries to compare that int with the pointer to the string literal, and you can't compare plain integers with pointers.
You can however compare a character with another character, like in
arrey[i] == '\t'
std::string::find() might help.
Try this:
...
if(arrey.find('\t') != string::npos)
{
std::cout << "I found a tab!!!!";
}
More info on std::string::find is available here.
Why not use what the C++ standard library provides? You could do it like this:
#include <iostream>
#include <string>
#include <algorithm>
using namespace std;
int main()
{
string arrey;
getline(cin, arrey);
if (arrey.find("\t") != std::string::npos) {
std::cout << "found a tab!" << '\n';
}
return 0;
}
The code is based on this answer. Here is the ref for std::find.
About your edit: how can you be sure that the input is going to be 10 characters long? That might be too few or too many! If it is less than the actual size of the input, you won't look at all the characters of the string, and if it is more, you will read past the end of the string!
You could use .size(), which gives the size of the string, and use a for loop like this:
#include <iostream>
#include <string>
using namespace std;
int main() {
string arrey;
getline(cin, arrey);
for(unsigned int i = 0; i < arrey.size(); ++i) {
if (arrey[i] == '\t') {
std::cout << "I found a tab!!!!";
}
}
return 0;
}

C++ tokenize a string using a regular expression

I'm trying to teach myself some C++ from scratch at the moment.
I'm well-versed in Python, Perl and JavaScript but have only encountered C++ briefly, in a
classroom setting in the past. Please excuse the naivete of my question.
I would like to split a string using a regular expression but have not had much luck finding
a clear, definitive, efficient and complete example of how to do this in C++.
In Perl this action is common, and thus can be accomplished in a trivial manner:
/home/me$ cat test.txt
this is aXstringYwith, some problems
and anotherXY line with similar issues
/home/me$ cat test.txt | perl -e'
> while(<>){
> my @toks = split(/[\sXY,]+/);
> print join(" ",@toks)."\n";
> }'
this is a string with some problems
and another line with similar issues
I'd like to know how best to accomplish the equivalent in C++.
EDIT:
I think I found what I was looking for in the boost library, as mentioned below.
boost regex-token-iterator (why don't underscores work?)
I guess I didn't know what to search for.
#include <iostream>
#include <string>
#include <boost/regex.hpp>
using namespace std;
int main(int argc, char* argv[])
{
string s;
do{
if(argc == 1)
{
cout << "Enter text to split (or \"quit\" to exit): ";
getline(cin, s);
if(s == "quit") break;
}
else
s = "This is a string of tokens";
boost::regex re("\\s+");
boost::sregex_token_iterator i(s.begin(), s.end(), re, -1);
boost::sregex_token_iterator j;
unsigned count = 0;
while(i != j)
{
cout << *i++ << endl;
count++;
}
cout << "There were " << count << " tokens found." << endl;
}while(argc == 1);
return 0;
}
The boost libraries are usually a good choice, in this case Boost.Regex. There even is an example for splitting a string into tokens that already does what you want. Basically it comes down to something like this:
boost::regex re("[\\sXY]+");
std::string s;
while (std::getline(std::cin, s)) {
boost::sregex_token_iterator i(s.begin(), s.end(), re, -1);
boost::sregex_token_iterator j;
while (i != j) {
std::cout << *i++ << " ";
}
std::cout << std::endl;
}
If you want to minimize use of iterators, and pithify your code, the following should work:
#include <string>
#include <iostream>
#include <boost/regex.hpp>
int main()
{
const boost::regex re("[\\sXY,]+");
for (std::string s; std::getline(std::cin, s); )
{
std::cout << regex_replace(s, re, " ") << std::endl;
}
}
Unlike in Perl, regular expressions are not "built in" into C++.
You need to use an external library, such as PCRE.
Regular expressions are part of TR1, which is included in Visual C++ 2008 SP1 (including the Express edition) and G++ 4.3.
The header is <regex> and the namespace is std::tr1. It works well with the STL.
Getting started with C++ TR1 regular expressions
Visual C++ Standard Library : TR1 Regular Expressions