Why does PCRE regex only capture 19 groups?


My regex pattern is: (a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)(q)(r)(s)(t)(u)(v)(w)(x)(y)(z)
My string is: abcdefghijklmnopqrstuvwxyz
The code's output is:
i_0:0 i_1:26 i_2:0 i_3:1 i_4:1 i_5:2 i_6:2 i_7:3 i_8:3 i_9:4 i_10:4 i_11:5 i_12:5 i_13:6 i_14:6 i_15:7 i_16:7 i_17:8 i_18:8 i_19:9 i_20:9 i_21:10 i_22:10 i_23:11 i_24:11 i_25:12 i_26:12 i_27:13 i_28:13 i_29:14 i_30:14 i_31:15 i_32:15 i_33:16 i_34:16 i_35:17 i_36:17 i_37:18 i_38:18 i_39:19 i_40:0 i_41:0 i_42:0 i_43:0 i_44:0 i_45:0 i_46:0 i_47:0 i_48:0 i_49:0 i_50:0 i_51:0 i_52:0 i_53:0 i_54:0 i_55:0 i_56:0 i_57:0 i_58:0 i_59:0
Question: Why does PCRE only capture 19 groups?
My code:
#include <pcre.h>
#include <iostream>
#include <string>
pcre* _rex;
pcre_extra* _rexEx;
void CompileRexStr(const std::string& rex) {
    const char* errorinfo;
    int errpos = 0;
    _rex = NULL;
    _rexEx = NULL;
    _rex = pcre_compile(rex.c_str(), PCRE_UTF8, &errorinfo, &errpos, NULL);
    _rexEx = pcre_study(_rex, PCRE_STUDY_JIT_COMPILE, &errorinfo);
}
int main(){
    std::string rex = "(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)(m)(n)(o)(p)(q)(r)(s)(t)(u)(v)(w)(x)(y)(z)";
    CompileRexStr(rex);
    std::string str = "abcdefghijklmnopqrstuvwxyz";
    int result[60] = {0};
    int cur = 0;
    int pos = pcre_exec(_rex, _rexEx, str.c_str(), str.length(), cur, 0, result, 60);
    for(int i=0;i < 60; i++) {
        std::cout << "i_" << i << ":" << result[i] << " ";
    }
    return 0;
}

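It returns 19 capture groups because you passed an ovecsize of 60. Only the first two-thirds of the vector (40 integers, i.e. 20 start/end pairs) is used to return matches, and the first pair holds the whole matching string, which leaves room for 19 groups. The PCRE manual says: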
Captured substrings are returned to the caller via a vector of integers whose address is passed in ovector. The number of elements in the vector is passed in ovecsize, which must be a non-negative number. Note: this argument is NOT the size of ovector in bytes.
The first two-thirds of the vector is used to pass back captured substrings, each substring using a pair of integers. The remaining third of the vector is used as workspace by pcre_exec() while matching capturing subpatterns, and is not available for passing back information. The number passed in ovecsize should always be a multiple of three. If it is not, it is rounded down.
Source: Manual for PCRE
If you have 26 capture groups, you need to pass a vector containing at least (26 + 1) × 3 = 81 elements.
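The arithmetic above can be written down as a small sketch. The helper names requiredOvecSize, reportablePairs and makeOvector are made up for illustration and are not part of PCRE; they only encode the "pairs plus one third workspace" rule quoted from the manual.

```cpp
#include <vector>

// Number of ints pcre_exec() needs in ovector to report `captureGroups`
// groups plus the whole match: one (start, end) pair each, plus one third
// of the vector as workspace.
constexpr int requiredOvecSize(int captureGroups) {
    return (captureGroups + 1) * 3;
}

// Number of (start, end) pairs that can be reported for a given ovecsize,
// including pair 0 for the whole match. Integer division mirrors the manual's
// "rounded down to a multiple of three".
constexpr int reportablePairs(int ovecSize) {
    return ovecSize / 3;
}

// Hypothetical helper: allocate an ovector big enough for `captureGroups` groups.
std::vector<int> makeOvector(int captureGroups) {
    return std::vector<int>(requiredOvecSize(captureGroups), 0);
}
```

With the code from the question, reportablePairs(60) is 20, i.e. the whole match plus 19 groups. A vector from makeOvector(26) could then be passed to pcre_exec as ovector, with its size as ovecsize.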

Related

Logical error. Elements in std::string not replaced properly with for loop

I'm currently doing a programming exercise from a C++ book for beginners. The task reads as follows: "Write a function that reverses the characters in a text string by using two pointers. The only function parameter shall be a pointer to the string."
My issue is that I haven't been able to make the characters swap properly, see the output below. (And I also made the assumption that the function parameter doesn't count, hence why I'm technically using three pointers).
I am almost certain that the problem has to do with the for loop. I wrote this pseudocode:
Assign value of element number i in at_front to the 1st element in transfer_back.
Assign value of element number elem in at_back to element number i in at_front.
Assign value of the 1st element in transfer_back to element number elem in at_back.
Increment i, decrement elem. Repeat loop until !(i < elem)
I wasn't sure whether or not I was supposed to take the null terminator into account. I tried writing (elem - 1) but that messed up the characters even more, so I've currently left it as it is.
#include <iostream>
#include <string>
using namespace std;
void strrev(string *at_front) {
string *transfer_back = at_front, *at_back = transfer_back;
int elem = 0;
while(at_back->operator[](elem) != '\0') {
elem++;
}
for(int i = 0; i < elem; i++) {
transfer_back->operator[](0) = at_front->operator[](i);
at_front->operator[](i) = at_back->operator[](elem);
at_back->operator[](elem) = transfer_back->operator[](0);
elem--;
}
}
int main() {
string str = "ereh txet yna";
string *point_str = &str;
strrev(point_str);
cout << *point_str << endl;
return 0;
}
Expected output: "any text here"
Terminal window: "xany text her"
The fact that the 'x' has been assigned to the first element is something I haven't been able to grasp.

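Here is a corrected version:
void strrev(string *at_front) {
    string *at_back = at_front;
    char transfer_back;   // a plain char temporary, not another pointer to the same string
    int elem = 0;
    while(at_back->operator[](elem) != '\0') {
        elem++;
    }
    elem--;               // step back from the null terminator to the last character
    for(int i = 0; i < elem; i++) {
        transfer_back = at_front->operator[](i);
        at_front->operator[](i) = at_back->operator[](elem);
        at_back->operator[](elem) = transfer_back;
        elem--;
    }
}
Let me explain why you got that error. After string *transfer_back = at_front, both pointers refer to the same string object, so transfer_back->operator[](0) = at_front->operator[](i); overwrites the first character of the string itself instead of saving a copy. A plain char temporary avoids that. In addition, after the counting loop elem is the index of the null terminator, not of the last character, so it has to be decremented once before the swapping starts.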

"Write a function that reverses the characters in a text string by using two pointers. The only function parameter shall be a pointer to the string."
This sounds to me like the question addresses C strings but not std::string.
Assuming my feeling is right, this could look like:
#include <iostream>
#include <utility> // std::swap
void strrev(char *at_front) {
char *at_back = at_front;
if (!*at_back) return; // early out in edge case
// move at_back to end (last char before 0-terminator)
while (at_back[1]) ++at_back;
// reverse by swapping contents of front and back
while (at_front < at_back) {
std::swap(*at_front++, *at_back--);
}
}
int main() {
char str[] = "ereh txet yna";
strrev(str);
std::cout << str << '\n';
return 0;
}
Output:
any text here
Note:
I stored the original string in a char str[].
If I had used char *str = "ereh txet yna"; I would have assigned the address of a constant string literal to str. That would be wrong here, because I want to modify the contents of str, which must not be done on constants.
strrev():
The at_back[1] reads the next char after address in at_back. For a valid C string, this should be always possible as I excluded the empty string (consisting of 0-terminator only) before.
The swapping loop moves at_front as well as at_back. As the pointer is given as value, this has no "destructive" effect outside of strrev().
Concerning std::swap(*at_front++, *at_back--);:
The swapping combines access to pointer contents with pointer increment/decrement, using postfix-increment/-decrement. IMHO, one of the rare cases where the postfix operators are useful somehow.
Alternatively, I could have written:
std::swap(*at_front, *at_back); ++at_front; --at_back;
Please, note that std::string is a container class. A pointer to the container cannot be used to address its contained raw string directly. For this, std::string provides various access methods like e.g.
std::string::operator[]()
std::string::at()
std::string::data()
etc.
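If the exercise really does mean a pointer to std::string, the same two-pointer idea can still be applied by taking the character pointers from the container's access methods. This is only a sketch under that assumption, not the book's intended solution:

```cpp
#include <string>
#include <utility>

// Reverses the characters of *s in place using two char pointers that are
// obtained from the container, not from the std::string pointer itself.
void strrev(std::string* s) {
    if (s == nullptr || s->empty()) return;   // nothing to reverse
    char* front = &(*s)[0];                   // first character
    char* back = front + (s->size() - 1);     // last character, before the terminator
    while (front < back) {
        std::swap(*front++, *back--);
    }
}
```

Calling strrev(&str) on a std::string holding "ereh txet yna" leaves "any text here" in str.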

Trying to properly substring with String

Given the following function:
If I execute setColor("R:0,G:0,B:255,");
I'm expecting the red, grn, blu values to be:
0 0 255 except I'm getting 0 0 0
It's working fine for R:255,G:0,B:0, or R:0,G:255,B:0, though.
int setColor(String command) {
//Parse the incoming command string
//Example command R:123,G:100,B:50,
//RGB values should be between 0 to 255
int red = getColorValue(command, "R:", "G");
int grn = getColorValue(command, "G:", "B");
int blu = getColorValue(command, "B:", ",");
// Set the color of the entire Neopixel ring.
uint16_t i;
for (i = 0; i < strip.numPixels(); i++) {
strip.setPixelColor(i, strip.Color(red, grn, blu));
}
strip.show();
return 1;
}
int getColorValue(String command, String first, String second) {
int rgbValue;
String val = command.substring(command.indexOf(first)+2, command.indexOf(second));
val.trim();
rgbValue = val.toInt();
return rgbValue;
}

Without knowing your String implementation, I can only make an educated guess.
What happens is that indexOf(second) doesn't give you what you think.
"R:0,G:0,B:255,"
    ^    ^- indexOf("B:")
    |- indexOf(",")
It works for your red and green lookups because "G" and "B" each occur only once in the string. The comma occurs several times, so the lookup for blue finds the first one, which lies before the "B:".
Looking at the SparkCore Docs we find the documentation for both indexOf and substring.
indexOf()
Locates a character or String within another String. By default, searches from the beginning of the String, but can also start from a given index, allowing for the locating of all instances of the character or String.
string.indexOf(val)
string.indexOf(val, from)
substring()
string.substring(from)
string.substring(from, to)
So, to fix your problem, you can use the second variant of indexOf and pass it the index found by your first search.
int getColorValue(String command, String first, String second) {
int rgbValue;
int beg = command.indexOf(first)+2;
int end = command.indexOf(second, beg);
String val = command.substring(beg, end);
val.trim();
rgbValue = val.toInt();
return rgbValue;
}
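For readers without the Spark String class at hand, the same fix can be sketched with std::string, where find(str, pos) plays the role of indexOf(val, from). This is an analogous standard C++ version, not the Spark API. Returning -1 for a missing marker is an assumption of this sketch.

```cpp
#include <cstdlib>
#include <string>

// Returns the integer between the marker `first` and the next occurrence of
// `second` after that marker, or -1 if either one is missing.
int getColorValue(const std::string& command, const std::string& first,
                  const std::string& second) {
    std::string::size_type beg = command.find(first);
    if (beg == std::string::npos) return -1;
    beg += first.size();                                      // skip past "R:", "G:" or "B:"
    std::string::size_type end = command.find(second, beg);   // search only after the marker
    if (end == std::string::npos) return -1;
    return std::atoi(command.substr(beg, end - beg).c_str());
}
```

With this version, getColorValue("R:0,G:0,B:255,", "B:", ",") yields 255, because the search for the comma starts after the "B:".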

In this instance, I would split the string using a comma as the delimiter, then parse each substring into a key-value pair. You could use a vector of values for the second part if you always have the sequence "R,G,B", in which case why have the "R:", "G:" or "B:" at all?
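A minimal sketch of that split-then-parse idea, written in standard C++ rather than with the Spark String class. The function name parseColor and the use of std::map are assumptions made for illustration.

```cpp
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>

// Splits a command such as "R:0,G:0,B:255," on commas and parses each
// "key:value" piece. Pieces without a key or a ':' are skipped.
std::map<char, int> parseColor(const std::string& command) {
    std::map<char, int> values;
    std::istringstream pieces(command);
    std::string piece;
    while (std::getline(pieces, piece, ',')) {
        std::string::size_type colon = piece.find(':');
        if (colon == std::string::npos || colon == 0) continue;
        values[piece[0]] = std::atoi(piece.c_str() + colon + 1);
    }
    return values;
}
```

Because every piece is parsed on its own, repeated commas can no longer confuse the lookup for the last value.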

I suppose that command.indexOf(second) will always find the first comma, so for B the val becomes an empty string.
Assuming that indexOf is similar to .NET's, maybe try:
int start = command.indexOf(first)+2;
int end = command.indexOf(second, start);
String val = command.substring(start, end);
Note the second argument in the second call to indexOf; I think it makes indexOf look for matches after start. I also think you'd be better off passing "," as second for all calls, adjusting end by +1 or -1 if needed to compensate for passing "," instead of "G" and "B".
Or just use a different terminator for the B part, like R:0,G:0,B:0. (a dot instead of a comma).

I ended up just modifying my code:
int setColor(String command) {
int commaIndex = command.indexOf(',');
int secondCommaIndex = command.indexOf(',', commaIndex+1);
int lastCommaIndex = command.lastIndexOf(',');
String red = command.substring(0, commaIndex);
String grn = command.substring(commaIndex+1, secondCommaIndex);
String blu = command.substring(lastCommaIndex+1);
// Set the color of the entire Neopixel ring.
uint16_t i;
for (i = 0; i < strip.numPixels(); i++) {
strip.setPixelColor(i, strip.Color(red.toInt(), grn.toInt(), blu.toInt()));
}
strip.show();
return 1;
}
I simply just do: 255,0,0 and it works a treat.

Looping through a string, more than one char gets output

I am using C++ and SDL to make a game for fun.
I show the kill count on the screen by turning it into a surface using TTF_RenderText, which needs a const char*. There are gaps in between where I want the individual digits to be shown so I split up the string into individual chars.
This is the code I wrote to split up the string and render it on the screen:
SpareStream.str("");
SpareStream << Kills;
std::string KillsString = SpareStream.str();
for (int i = 0; i <= 4; i++)
{
if(i < KillsString.size())
{
std::string Cheat = KillsString.substr(i,i+1);
const char *KillsChar = Cheat.c_str();
Message = TTF_RenderText_Solid(EightBitLimit,KillsChar,White);
}
else Message = TTF_RenderText_Solid(EightBitLimit,"0",White);
ApplySurface(540 + (45 * i),(500 - HUD->h) + 24,Message,Screen);
}
However, when the kill count exceeds 100, this happens:
the tens and ones are shown where only the tens should be.
Why is this happening?

According to my reference, std::string::substr is declared as:
string substr (size_t pos = 0, size_t len = npos) const;
That's a start/length pair, not start/end, so you need:
std::string Cheat = KillsString.substr(i,1);
By the way, while start/length pairs are rarely used for containers in the standard library, they were so universal for char* management in C that they frequently turn up in string classes.
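To make the difference concrete, here is a small sketch that splits a string into one-character strings with substr(i, 1). The helper name splitChars is made up for this example.

```cpp
#include <string>
#include <vector>

// Splits a string into one-character strings, e.g. to render digits separately.
std::vector<std::string> splitChars(const std::string& text) {
    std::vector<std::string> out;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        out.push_back(text.substr(i, 1));   // (position, length), not (start, end)
    }
    return out;
}
```

For "123" this gives "1", "2" and "3". By contrast, std::string("123").substr(1, 2) is "23", which is why the tens and ones appeared together in the question.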

Switch every pair of words in a string (“ab cd ef gh ijk” becomes “cd ab gh ef ijk”) in c/c++

Switch every pair of words in a string (“ab cd ef gh ijk” becomes “cd ab gh ef ijk”) in c++ without any library function.
int main(){
char s[]="h1 h2 h3 h4";//sample input
switch_pair(s);
std::cout<<s;
return 0;
}
char * switch_pair(char *s){
char * pos = s;
char * ptr = s;
int sp = 0;//counts number of space
while(*pos){
if(*pos==' ' && ++sp==2){ //if we hit a space and it is second space then we've a pair
revStr_iter(ptr,pos-1);//reverse the pair so 'h1 h2' -> '2h 1h'
sp=0;//set no. of space to zero to hunt new pairs
ptr=pos+1;//reset ptr to nxt word after the pair i.e. h3'
}
pos++;
}
if(sp==1) //tackle the case where input is 'h1 h2' as only 1 space is there
revStr_iter(ptr,pos-1);
revWord(s); //this will reverse each individual word....i hoped so :'(
return s;
}
char* revStr_iter(char* l,char * r){//trivial reverse string algo
char * p = l;
while(l<r){
char c = *l;
*l = *r;
*r = c;
l++;
r--;
}
return p;
}
char* revWord(char* s){//this is the villain....need to fix it...Grrrr
char* pos = s;
char* w1 = s;
while(*pos){
if(*pos==' '){//reverses each word before space
revStr_iter(w1,pos-1);
w1=pos+1;
}
pos++;
}
return s;
}
Input - h1 h2 h3 h4
expected - h2 h1 h4 h3
actual - h2 h1 h3 4h
can any noble geek soul help plz :(((

IMO, what you're working on so far looks/seems a lot more like C code than C++ code. I think I'd start from something like:
break the input into word objects
swap pairs of word objects
re-construct string of rearranged words
For that, I'd probably define a really minimal string class. Just about all it needs (for now) is the ability to create a string given a pointer to char and a length (or something on that order), and the ability to assign (or swap) strings.
I'd also define a tokenizer. I'm not sure if it should really be a function or a class, but for the moment, let's just say "function". All it does is look at a string and find the beginning and end of a word, yielding something like a pointer to the beginning, and the length of the word.
Finally, you need/want an array to hold the words. For a first-step, you could just use a normal array, then later when/if you want to have the array automatically expand as needed, you can write a small class to handle it.
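The three steps above (tokenize, swap pairs, rebuild) might look roughly like the sketch below. The names Word, tokenize and switchPairs, the fixed capacity of 64 words, and writing into a separate output buffer are all assumptions of this sketch; it uses no library functions beyond the std::size_t type.

```cpp
#include <cstddef>

// Hypothetical token type: where a word starts and how long it is.
struct Word {
    const char* begin;
    std::size_t length;
};

// Tokenizer: fills `words` with up to `maxWords` space-separated words from
// `s` and returns how many were found.
std::size_t tokenize(const char* s, Word* words, std::size_t maxWords) {
    std::size_t count = 0;
    while (*s && count < maxWords) {
        while (*s == ' ') ++s;             // skip separators
        if (!*s) break;
        const char* start = s;
        while (*s && *s != ' ') ++s;       // advance to the end of the word
        words[count].begin = start;
        words[count].length = static_cast<std::size_t>(s - start);
        ++count;
    }
    return count;
}

// Writes the words of `in` to `out` with every pair swapped. `out` must have
// room for at least as many characters as `in`, plus the terminator.
void switchPairs(const char* in, char* out) {
    Word words[64];                        // fixed capacity, an assumption of this sketch
    std::size_t n = tokenize(in, words, 64);
    for (std::size_t i = 0; i + 1 < n; i += 2) {   // a trailing odd word stays in place
        Word tmp = words[i];
        words[i] = words[i + 1];
        words[i + 1] = tmp;
    }
    for (std::size_t i = 0; i < n; ++i) {
        if (i != 0) *out++ = ' ';
        for (std::size_t k = 0; k < words[i].length; ++k) *out++ = words[i].begin[k];
    }
    *out = '\0';
}
```

Swapping the (pointer, length) records instead of the characters avoids the reverse-everything-then-reverse-each-word bookkeeping from the question.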

int Groups = 1; // Count 1 for the first group of letters
for ( int Loop1 = 0; Loop1 < strlen(String); Loop1++)
if (String[Loop1] == ' ') // Any extra groups are delimited by space
Groups += 1;
int* GroupPositions = new int[Groups]; // Stores the positions
for ( int Loop2 = 0, Position = 0; Loop2 < strlen(String); Loop2++)
{
if (String[Loop2] != ' ' && (Loop2 == 0 || String[Loop2-1] == ' ')) // check the index before reading String[Loop2-1]
{
GroupPositions[Position] = Loop2; // Store position of the first letter
Position += 1; // Increment the next position of interest
}
}
If you can't use strlen, write a function that counts any letters until it encounters a null terminator '\0'.

Efficiently check string for one of several hundred possible suffixes

I need to write a C/C++ function that would quickly check if string ends with one of ~1000 predefined suffixes. Specifically the string is a hostname and I need to check if it belongs to one of several hundred predefined second-level domains.
This function will be called a lot so it needs to be written as efficiently as possible. Bitwise hacks etc anything goes as long as it turns out fast.
Set of suffixes is predetermined at compile-time and doesn't change.
I am thinking of either implementing a variation of Rabin-Karp or writing a tool that would generate a function with nested ifs and switches, custom-tailored to the specific set of suffixes. Since the application in question is 64-bit, to speed up comparisons I could store suffixes of up to 8 bytes in length as a sorted const array and do a binary search within it.
Are there any other reasonable options?

If the suffixes don't contain any expansions/rules (like a regex), you could build a Trie of the suffixes in reverse order, and then match the string based on that. For instance
suffixes:
foo
bar
bao
reverse order suffix trie:
o
-a-b (matches bao)
-o-f (matches foo)
r-a-b (matches bar)
These can then be used to match your string:
"mystringfoo" -> reverse -> "oofgnirtsym" -> trie match -> foo suffix

You mention that you're looking at second-level domain names only, so even without knowing the precise set of matching domains, you could extract the relevant portion of the input string.
Then simply use a hashtable. Dimension it in such a way that there are no collisions, so you don't need buckets; lookups will be exactly O(1). For small hash types (e.g. 32 bits), you'd want to check if the strings really match. For a 64-bit hash, the probability of another domain colliding with one of the hashes in your table is already so low (order 10^-17) that you can probably live with it.
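As a rough sketch of the extract-then-look-up idea: the code below uses std::unordered_set instead of a hand-dimensioned collision-free table, so it is a stand-in for the approach rather than the tuned version described above. It also assumes that "second-level domain" means exactly the last two dot-separated labels. The function names are made up.

```cpp
#include <string>
#include <unordered_set>

// Returns the last two dot-separated labels of a hostname,
// e.g. "www.example.com" -> "example.com". A hostname with fewer than two
// dots is returned unchanged.
std::string secondLevelDomain(const std::string& host) {
    std::string::size_type lastDot = host.rfind('.');
    if (lastDot == std::string::npos || lastDot == 0) return host;
    std::string::size_type prevDot = host.rfind('.', lastDot - 1);
    if (prevDot == std::string::npos) return host;
    return host.substr(prevDot + 1);
}

// True if the hostname's last two labels are in the predefined set.
bool isListedDomain(const std::string& host,
                    const std::unordered_set<std::string>& domains) {
    return domains.count(secondLevelDomain(host)) != 0;
}
```

Because the set stores the full strings, a hash collision can never produce a false positive here; the 64-bit-hash-only variant trades that guarantee for speed.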

I would reverse all of the suffix strings, build a prefix tree of them and then test the reverse of your hostname string against that.
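A minimal sketch of such a reversed prefix tree, assuming the suffixes are plain strings. It iterates backwards instead of building reversed copies, and uses std::map per node for brevity rather than speed. The names TrieNode, SuffixTrie and endsWithAny are made up.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// One node of a trie keyed on the characters of each suffix in reverse order.
struct TrieNode {
    bool terminal = false;                               // a complete suffix ends here
    std::map<char, std::unique_ptr<TrieNode>> children;
};

class SuffixTrie {
public:
    explicit SuffixTrie(const std::vector<std::string>& suffixes) {
        for (const std::string& suffix : suffixes) {
            TrieNode* node = &root_;
            // Insert the suffix back to front.
            for (auto it = suffix.rbegin(); it != suffix.rend(); ++it) {
                std::unique_ptr<TrieNode>& child = node->children[*it];
                if (!child) child.reset(new TrieNode());
                node = child.get();
            }
            node->terminal = true;
        }
    }

    // Walks `text` from its last character; true if some stored suffix matches.
    bool endsWithAny(const std::string& text) const {
        const TrieNode* node = &root_;
        for (auto it = text.rbegin(); it != text.rend(); ++it) {
            auto found = node->children.find(*it);
            if (found == node->children.end()) return false;
            node = found->second.get();
            if (node->terminal) return true;
        }
        return false;
    }

private:
    TrieNode root_;
};
```

With the suffixes foo, bar and bao, endsWithAny("mystringfoo") follows o, o, f and stops at a terminal node, while "mystringfox" fails on the first character.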

I think that building your own automaton would be the most efficient way. It is a variant of your second solution: starting from a finite set of suffixes, you generate an automaton fitted to those suffixes.
I think you can easily use flex to do it, taking care to reverse the input or to handle specially the fact that you are only looking for suffixes (purely for efficiency).
By the way, a Rabin-Karp approach would be efficient too, since your suffixes will be short. You can fill a hash set with all the suffixes needed and then:
take a string
take the suffix
calculate the hash of the suffix
check if suffix is in the table

Just create a 26x26 array of set of domains. e.g. thisArray[0][0] will be the domains that end in 'aa', thisArray[0][1] is all the domains that end in 'ab' and so on...
Once you have that, just search your array for thisArray[2nd last char of hostname][last char of hostname] to get the possible domains. If there's more than one at that stage, just brute force the rest.
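A sketch of that 26x26 bucket idea, assuming every domain and hostname ends in two lowercase ASCII letters (anything else is treated as a non-match). The class name TwoCharIndex is made up.

```cpp
#include <string>
#include <vector>

class TwoCharIndex {
public:
    explicit TwoCharIndex(const std::vector<std::string>& domains)
        : buckets_(26 * 26) {
        for (const std::string& d : domains) {
            int slot = slotFor(d);
            if (slot >= 0) buckets_[slot].push_back(d);
        }
    }

    // True if `host` ends with one of the stored domains.
    bool matches(const std::string& host) const {
        int slot = slotFor(host);
        if (slot < 0) return false;
        // Brute-force the few candidates that share the last two characters.
        for (const std::string& d : buckets_[slot]) {
            if (host.size() >= d.size() &&
                host.compare(host.size() - d.size(), d.size(), d) == 0) {
                return true;
            }
        }
        return false;
    }

private:
    // Maps the last two characters to a bucket index, or -1 if they are not
    // both lowercase letters.
    static int slotFor(const std::string& s) {
        if (s.size() < 2) return -1;
        char a = s[s.size() - 2];
        char b = s[s.size() - 1];
        if (a < 'a' || a > 'z' || b < 'a' || b > 'z') return -1;
        return (a - 'a') * 26 + (b - 'a');
    }

    std::vector<std::vector<std::string>> buckets_;
};
```

How well this works depends on how evenly the suffixes spread over their last two characters; if most of them end in "om", one bucket holds nearly everything.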

I think that the solution should be very different depending on the type of input strings. If the strings are some kind of string class that can be iterated from the end (such as stl strings) it is a lot easier than if they are NULL-terminated C-strings.
String Class
Iterate the string backwards (don't make a reverse copy - use some kind of backward iterator). Build a Trie where each node consists of two 64-bit words, one pattern and one bitmask. Then check 8 characters at a time in each level. The mask is used if you want to match less than 8 characters - e.g. deny "*.org" would give a mask with 32 bits set. The mask is also used as termination criteria.
C strings
Construct an NDFA that matches the strings on a single-pass over them. That way you don't have to first iterate to the end but can instead use it in one pass. An NDFA can be converted to a DFA, which will probably make the implementation more efficient. Both construction of the NDFA and conversion to DFA will probably be so complex that you will have to write tools for it.

After some research and deliberation I've decided to go with trie/finite state machine approach.
The string is parsed starting from the last character and going backwards through a trie, for as long as the portion parsed so far can correspond to multiple suffixes. At some point we either hit the first character of one of the possible suffixes, which means we have a match; hit a dead end, which means there are no more possible matches; or reach a situation where there is only one suffix candidate, in which case we just compare the remainder of the suffix.
Since each trie lookup takes constant time, worst-case complexity is O(maximum suffix length). The function turned out to be pretty fast: on a 2.8 GHz Core i5 it can check 33,000,000 strings per second against 2K possible suffixes. The 2K suffixes total 18 kilobytes and expand to a 320 KB trie/state machine table. I guess I could have stored it more efficiently, but this solution seems to work well enough for the time being.
Since suffix list was so large, I didn't want to code it all by hand so I ended up writing C# application that generated C code for the suffix checking function:
public static uint GetFourBytes(string s, int index)
{
byte[] bytes = new byte[4] { 0, 0, 0, 0};
int len = Math.Min(s.Length - index, 4);
Encoding.ASCII.GetBytes(s, index, len, bytes, 0);
return BitConverter.ToUInt32(bytes, 0);
}
public static string ReverseString(string s)
{
char[] chars = s.ToCharArray();
Array.Reverse(chars);
return new string(chars);
}
static StringBuilder trieArray = new StringBuilder();
static int trieArraySize = 0;
static void Main(string[] args)
{
// read all non-empty lines from input file
var suffixes = File
.ReadAllLines(@"suffixes.txt")
.Where(l => !string.IsNullOrEmpty(l));
var reversedSuffixes = suffixes
.Select(s => ReverseString(s));
int start = CreateTrieNode(reversedSuffixes, "");
string outFName = @"checkStringSuffix.debug.h";
if (args.Length != 0 && args[0] == "--release")
{
outFName = @"checkStringSuffix.h";
}
using (StreamWriter wrt = new StreamWriter(outFName))
{
wrt.WriteLine(
"#pragma once\n\n" +
"#define TRIE_NONE -1000000\n"+
"#define TRIE_DONE -2000000\n\n"
);
wrt.WriteLine("const int trieArray[] = {{{0}\n}};", trieArray);
wrt.WriteLine(
"inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {\n"+
" int len = trie[0];\n"+
" if (curr - str < len) return false;\n"+
" const char* cmp = (const char*)(trie + 1);\n"+
" while (len-- > 0) {\n"+
" if (*--curr != *cmp++) return false;\n"+
" }\n"+
" return true;\n"+
"}\n\n"+
"bool checkStringSuffix(const char* str, int len) {\n" +
" if (len < " + suffixes.Select(s => s.Length).Min().ToString() + ") return false;\n" +
" const char* curr = (str + len - 1);\n"+
" int currTrie = " + start.ToString() + ";\n"+
" while (curr >= str) {\n" +
" assert(*curr >= 0x20 && *curr <= 0x7f);\n" +
" currTrie = trieArray[currTrie + *curr - 0x20];\n" +
" if (currTrie < 0) {\n" +
" if (currTrie == TRIE_NONE) return false;\n" +
" if (currTrie == TRIE_DONE) return true;\n" +
" return checkSingleSuffix(str, curr, trieArray - currTrie - 1);\n" +
" }\n"+
" --curr;\n"+
" }\n" +
" return false;\n"+
"}\n"
);
}
}
private static int CreateTrieNode(IEnumerable<string> suffixes, string prefix)
{
int retVal = trieArraySize;
if (suffixes.Count() == 1)
{
string theSuffix = suffixes.Single();
trieArray.AppendFormat("\n\t/* {1} - {2} */ {0}, ", theSuffix.Length, trieArraySize, prefix);
++trieArraySize;
for (int i = 0; i < theSuffix.Length; i += 4)
{
trieArray.AppendFormat("0x{0:X}, ", GetFourBytes(theSuffix, i));
++trieArraySize;
}
retVal = -(retVal + 1);
}
else
{
var groupByFirstChar =
from s in suffixes
let first = s[0]
let remainder = s.Substring(1)
group remainder by first;
string[] trieIndexes = new string[0x60];
for (int i = 0; i < trieIndexes.Length; ++i)
{
trieIndexes[i] = "TRIE_NONE";
}
foreach (var g in groupByFirstChar)
{
if (g.Any(s => s == string.Empty))
{
trieIndexes[g.Key - 0x20] = "TRIE_DONE";
continue;
}
trieIndexes[g.Key - 0x20] = CreateTrieNode(g, g.Key + prefix).ToString();
}
trieArray.AppendFormat("\n\t/* {1} - {2} */ {0},", string.Join(", ", trieIndexes), trieArraySize, prefix);
retVal = trieArraySize;
trieArraySize += 0x60;
}
return retVal;
}
So it generates code like this:
inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {
int len = trie[0];
if (curr - str < len) return false;
const char* cmp = (const char*)(trie + 1);
while (len-- > 0) {
if (*--curr != *cmp++) return false;
}
return true;
}
bool checkStringSuffix(const char* str, int len) {
if (len < 5) return false;
const char* curr = (str + len - 1);
int currTrie = 81921;
while (curr >= str) {
assert(*curr >= 0x20 && *curr <= 0x7f);
currTrie = trieArray[currTrie + *curr - 0x20];
if (currTrie < 0) {
if (currTrie == TRIE_NONE) return false;
if (currTrie == TRIE_DONE) return true;
return checkSingleSuffix(str, curr, trieArray - currTrie - 1);
}
--curr;
}
return false;
}
Since for my particular set of data len in checkSingleSuffix was never more than 9, I tried to replace the comparison loop with switch (len) and hardcoded comparison routines that compared up to 8 bytes of data at a time but it didn't affect overall performance at all either way.
Thanks to everyone who contributed their ideas!
