regex get comma separated values between two words - regex

I have the following query and PCRE regex, from which I want to get the table names.
FROM student s, #prefix#.sometable, subject s, marks s WHERE ...
(?<=\sfrom)\s+\K(\w*)(?=\s+where)
Desired result: student s subject s marks s
I can't figure out how to extract them from the first match.
I'm trying to find & replace in sublime text editor.

Try this: \s+(\w*\s)*s
pcre *myregexp;
const char *error;
int erroroffset;
myregexp = pcre_compile("\\s+(\\w*\\s)*s", PCRE_CASELESS | PCRE_EXTENDED | PCRE_MULTILINE | PCRE_DUPNAMES | PCRE_UTF8, &error, &erroroffset, NULL);
if (myregexp) {
int offsets[2*3]; // (max_capturing_groups+1)*3
int offsetcount = pcre_exec(myregexp, NULL, subject, strlen(subject), 0, 0, offsets, 2*3);
if (offsetcount > 0) {
pcre_get_substring(subject, offsets, offsetcount, 1, &result); // pass the offsets array itself (int *), not its address
// group offset = offsets[1*2];
// group length = offsets[1*2+1] - offsets[1*2];
} else {
result = NULL;
}
} else {
// Syntax error in the regular expression at erroroffset
result = NULL;
}

Using @bobblebubble's solution, which worked about 90% of the time, I added a few more conditions to match my case. It works, but it is very aggressive and hangs the editor on large or multiple files. Still, I can live with what I have. Here's the solution:
(?is)(?:\bFROM\b|\G(?!^))(?:[\s,]|#[^\s,]++)*(\b\K(?:\s*(?!WHERE|LEFT\b)\w+){4,})\b(?=.*?\bWHERE\b)
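Outside an editor's find-and-replace, the same extraction can be done in two cheaper steps: capture everything between FROM and WHERE once, then split on commas. The following is a minimal Python sketch of that idea, not the editor-based approach used above. The function name `table_refs` is hypothetical, and it assumes that entries starting with `#` (such as `#prefix#.sometable`) are the ones to skip.

```python
import re


def table_refs(sql):
    """Return the comma-separated items between FROM and WHERE, skipping '#...' entries."""
    # Lazily capture the text between the first FROM and the next WHERE.
    m = re.search(r"\bFROM\b(.*?)\bWHERE\b", sql, re.IGNORECASE | re.DOTALL)
    if m is None:
        return []
    # Split the captured list on commas and trim surrounding whitespace.
    items = [item.strip() for item in m.group(1).split(",")]
    # Assumption: entries beginning with '#' are prefixed tables to be ignored.
    return [item for item in items if item and not item.startswith("#")]


if __name__ == "__main__":
    query = "SELECT * FROM student s, #prefix#.sometable, subject s, marks s WHERE id = 1"
    print(" ".join(table_refs(query)))
```

Because the lazy capture runs once per statement instead of re-scanning at every `\G` position, this avoids the heavy backtracking that can make the single-pattern version slow on large files.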

Related

If statement fails with regex comparison

public list[str] deleteBlockComments(list[str] fileLines)
{
bool blockComment = false;
list[str] sourceFile = [];
for(fileLine <- fileLines)
{
fileLine = trim(fileLine);
println(fileLine);
if (/^[\t]*[\/*].*$/ := fileLine)
{
blockComment = true;
}
if (/^[\t]*[*\/].*$/ := fileLine)
{
blockComment = false;
}
println(blockComment);
if(!blockComment)
{
sourceFile = sourceFile + fileLine;
}
}
return sourceFile;
}
For some reason, I am not able to detect /* at the beginning of a string. If I execute this on the command line, it seems to work fine.
Can someone tell me what I am doing wrong? In the picture below you can see the string to be compared above the comparison result (false).
[\/*] is a character set that matches either a forward slash or a star, not one followed by the other. Remove the square brackets and escape the star (an unescaped * would act as a quantifier on the slash), and your pattern should start behaving as you expect.
While we're at it, let's also get rid of the superfluous square brackets around \t:
^\t*\/\*.*$
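The difference between a character set and a two-character sequence is easy to check in any regex engine. Here is a small Python sketch, not Rascal, so the delimiters differ; the names `CHAR_CLASS`, `SEQUENCE`, and `starts_block_comment` are made up for illustration.

```python
import re

# Character set: one character that is either '/' or '*'.
CHAR_CLASS = re.compile(r"^\t*[/*].*$")
# Sequence: a '/' followed by a literal '*' (the star is escaped).
SEQUENCE = re.compile(r"^\t*/\*.*$")


def starts_block_comment(line):
    """True only if the line begins (after optional tabs) with '/*'."""
    return SEQUENCE.match(line) is not None


if __name__ == "__main__":
    for sample in ["/* comment", "* continuation", "/ division"]:
        print(repr(sample), bool(CHAR_CLASS.match(sample)), starts_block_comment(sample))
```

The character set accepts all three samples, while the escaped sequence accepts only the line that really opens a block comment.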

nginx module regex unable to match

Hi guys, I'm trying to build a module for nginx and need to match a substring. Here is what I'm using to try to match it:
int match_chan(ngx_http_request_t *r, ngx_pool_t *temp_pool, ngx_str_t *body, ngx_str_t *channel) {
u_char errstr[NGX_MAX_CONF_ERRSTR];
ngx_regex_compile_t *rc;
int captures[2];
if ((rc = ngx_pcalloc(temp_pool, sizeof(ngx_regex_compile_t))) == NULL) {
ngx_log_error(NGX_LOG_ERR, r->connection->log, 0, "unable to allocate memory to compile agent patterns");
return 0;
}
//ngx_memzero(rc, sizeof(ngx_regex_compile_t));
ngx_str_t pat = ngx_string("test(:|%3[Aa])([a-zA-Z0-9]+)");
rc->pattern = pat;
rc->pool = temp_pool;
rc->err.len = NGX_MAX_CONF_ERRSTR;
rc->err.data = errstr;
if (ngx_regex_compile(rc) != NGX_OK) {
ngx_log_error(NGX_LOG_ERR, r->connection->log, 0, "unable to compile regex pattern %V", rc->pattern);
return 0;
}
ngx_log_error(NGX_LOG_ERR, r->connection->log, 0, "%V, %V", &pat, body);
if (ngx_regex_exec(rc->regex, body, captures, 2) >= 0) {
ngx_log_error(NGX_LOG_ERR, r->connection->log, 0, "It Matched");
//ngx_memcpy(channel->data, body->data + captures[0], body->len);
return 1;
}
ngx_log_error(NGX_LOG_ERR, r->connection->log, 0, "It did not match");
return 0;
}
ngx_str_t *channel = NULL;
if(match_chan(r, temp_pool, aux, channel)) {
//ngx_log_error(NGX_LOG_ERR, r->connection->log, 0, " match: %c", match);
}
The message that is passed looks like this:
2014/07/04 13:28:49 [error] 10695#0: *38 test:([a-z0-9]+), MSG%0Atest%3Ahello%0A%0A%0Awins%00
2014/07/04 13:28:49 [error] 10695#0: *38 It did not match
(taken from the nginx log)
I've tested the regex in a pure C app and it worked fine. I thought nginx was similar, but I guess it has its differences.
I've looked all over Google and tried reading other nginx modules, with still no luck. Please help me :)
Thanks
Dave
The problem is that the string you are trying to match is URL-encoded, so it doesn't match the pattern provided. There are two options:
Construct a regular expression that matches the encoded string as well ("test(:|%3[Aa])([a-zA-Z0-9]+)" will match both the unescaped and the escaped form);
Unescape the string you are matching. In nginx, this is done with the ngx_unescape_uri() function.
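Both options can be tried quickly outside nginx. The sketch below is Python rather than nginx C, so `urllib.parse.unquote` stands in for `ngx_unescape_uri()`. The helper names are hypothetical, and the body string is the one from the log above.

```python
import re
from urllib.parse import unquote

# Option 1: accept either a literal ':' or its URL-encoded form '%3A' / '%3a'.
ENCODED_OR_PLAIN = re.compile(r"test(?::|%3[Aa])([a-zA-Z0-9]+)")
# Option 2: a plain pattern, applied after decoding the body.
PLAIN_ONLY = re.compile(r"test:([a-zA-Z0-9]+)")


def channel_from_encoded(body):
    """Match directly against the still-encoded body."""
    m = ENCODED_OR_PLAIN.search(body)
    return m.group(1) if m else None


def channel_after_unescape(body):
    """Percent-decode the body first, then match the plain pattern."""
    m = PLAIN_ONLY.search(unquote(body))
    return m.group(1) if m else None


if __name__ == "__main__":
    body = "MSG%0Atest%3Ahello%0A%0A%0Awins%00"
    print(PLAIN_ONLY.search(body))        # the plain pattern alone finds nothing
    print(channel_from_encoded(body))
    print(channel_after_unescape(body))
```

Decoding first keeps the pattern simple, while matching the encoded form avoids an extra copy of the body; which is preferable depends on how the rest of the module uses the data.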

RE2 regular expressions on streams?

Is it possible to use Google RE2 with streams? Some input literals that we are supposed to process with regular expressions can potentially be too large to hold in memory.
If there is a maximum match length, you could read the data in blocks of at least twice that length. If the match fails, or starts less than that many characters from the end, cut the current string, and append another block.
The length of the match string would never be more than the block length + max match length.
Example in C#:
public static IEnumerable<StreamMatch> MatchesInStream(
this Regex pattern, TextReader reader,
int maxMatchLength, int blockLength)
{
if (maxMatchLength <= 0)
{
throw new ArgumentException("Must be positive", "maxMatchLength");
}
if (blockLength < maxMatchLength)
{
throw new ArgumentException("Must be at least as long as maxMatchLength", "blockLength");
}
char[] buffer = new char[blockLength];
string chunk = "";
int matchOffset = 0;
// Read one block, and append to the string
int charsRead = reader.ReadBlock(buffer, 0, blockLength);
chunk += new string(buffer, 0, charsRead);
while (charsRead > 0 && chunk.Length > maxMatchLength)
{
int cutPosition = 0;
foreach (Match match in pattern.Matches(chunk))
{
if (match.Index > chunk.Length - maxMatchLength)
{
// The match could possibly have matched more characters.
// Read another block before trying again.
break;
}
yield return new StreamMatch(matchOffset, match);
cutPosition = match.Index + match.Length;
}
cutPosition = Math.Max(cutPosition, chunk.Length - maxMatchLength);
matchOffset += cutPosition;
chunk = chunk.Substring(cutPosition);
charsRead = reader.ReadBlock(buffer, 0, blockLength);
chunk += new string(buffer, 0, charsRead);
}
// Stream has ended. Try to match the last remaining characters.
foreach (Match match in pattern.Matches(chunk))
{
yield return new StreamMatch(matchOffset, match);
}
}
public class StreamMatch
{
public int MatchOffset { get; private set; }
public Match Match { get; private set; }
public StreamMatch(int matchOffset, Match match)
{
MatchOffset = matchOffset;
Match = match;
}
}
// One horrible XML parser
var reader = new StreamReader(stream);
var pattern = new Regex(@"<(/?)([\w:-]{1,15})([^<>]{0,50}(?<!/))(/?)>");
foreach (StreamMatch match in pattern.MatchesInStream(reader, 69, 128))
{
Console.WriteLine(match.Match.Value);
}
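The same block-wise idea can be sketched in Python. This uses Python's `re` module, not RE2, and the function name `matches_in_stream` is made up for illustration. It assumes that no match is longer than `max_match_length`, that the pattern never matches the empty string, and that it does not rely on anchors or lookbehind that a cut in the text could disturb.

```python
import io
import re


def matches_in_stream(pattern, reader, max_match_length, block_length):
    """Yield (absolute_offset, text) for each match, reading the stream block by block."""
    if max_match_length <= 0:
        raise ValueError("max_match_length must be positive")
    if block_length < max_match_length:
        raise ValueError("block_length must be at least max_match_length")
    chunk = ""
    offset = 0  # absolute stream position of chunk[0]
    while True:
        block = reader.read(block_length)
        if not block:
            break  # end of stream
        chunk += block
        # Matches starting after this limit might still grow with more input.
        limit = len(chunk) - max_match_length
        cut = 0
        for m in pattern.finditer(chunk):
            if m.start() > limit:
                break  # read another block before trusting this match
            yield offset + m.start(), m.group()
            cut = m.end()
        # Drop everything that can no longer take part in a match.
        cut = max(cut, limit, 0)
        offset += cut
        chunk = chunk[cut:]
    # Stream has ended: whatever is left can be matched as-is.
    for m in pattern.finditer(chunk):
        yield offset + m.start(), m.group()


if __name__ == "__main__":
    stream = io.StringIO("ab12cd345ef6")
    for position, text in matches_in_stream(re.compile(r"\d+"), stream, 3, 4):
        print(position, text)
```

At any time the buffer holds at most one block plus `max_match_length` characters, which is what bounds the memory use.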

extract domain between two words

I have in a log file some lines like this:
11-test.domain1.com Logged ...
37-user1.users.domain2.org Logged ...
48-me.server.domain3.net Logged ...
How can I extract each domain without the subdomains? Something between "-" and "Logged".
I have the following code in C++ (Linux), but it doesn't extract correctly. A function that returns the extracted string would be great, if you have an example.
regex_t preg;
regmatch_t mtch[1];
size_t rm, nmatch;
char tempstr[1024] = "";
int start;
rm=regcomp(&preg, "-[^<]+Logged", REG_EXTENDED);
nmatch = 1;
while(regexec(&preg, buffer+start, nmatch, mtch, 0)==0) /* Found a match */
{
strncpy(host, buffer+start+mtch[0].rm_so+3, mtch[0].rm_eo-mtch[0].rm_so-7);
printf("%s\n", tempstr);
start +=mtch[0].rm_eo;
memset(host, '\0', strlen(host));
}
regfree(&preg);
Thank you!
P.S. No, I cannot use Perl for this, because this part is inside a larger C program that was written by someone else.
EDIT:
I replaced the code with this:
const char *p1 = strstr(buffer, "-")+1;
const char *p2 = strstr(p1, " Logged");
size_t len = p2-p1;
char *res = (char*)malloc(sizeof(char)*(len+1));
strncpy(res, p1, len);
res[len] = '\0';
which extracts the whole domain, including subdomains, correctly.
How can I extract just domain.com or domain.net from abc.def.domain.com?
Is strtok a good option, and how can I work out which dot is the last one?
#include <algorithm> // std::for_each
#include <iostream>  // std::cout
#include <string>
#include <vector>
#include <boost/regex.hpp>
int main()
{
boost::regex re(".+-(?<domain>.+)\\s*Logged");
std::string examples[] =
{
"11-test.domain1.com Logged ...",
"37-user1.users.domain2.org Logged ..."
};
std::vector<std::string> vec(examples, examples + sizeof(examples) / sizeof(*examples));
std::for_each(vec.begin(), vec.end(), [&re](const std::string& s)
{
boost::smatch match;
if (boost::regex_search(s, match, re))
{
std::cout << match["domain"] << std::endl;
}
});
}
http://liveworkspace.org/code/1983494e6e9e884b7e539690ebf98eb5
Something like this with boost::regex. I don't know about PCRE.
Is the data in a standard format?
It appears so. Is there a split function?
Edit:
Here is some logic:
Iterate through each line to be parsed.
Find the index of the first marker string, "-".
Find the index of the second marker string, "Logged"; the length to extract is that index minus the position just after the first marker.
Now you have the full domain.
Split the full domain on "." into a collection of your choice (I used an array).
With the array in hand, pick the indexes of the parts you want and concatenate them to get only the domain.
Note: written in C#.
Main method, which defines the first and second values:
static void Main(string[] args)
{
string firstValue = "-";
string secondValue = "Logged";
List<string> domains = new List<string> { "11-test.domain1.com Logged", "37-user1.users.domain2.org Logged", "48-me.server.domain3.net Logged" };
foreach (string dns in domains)
{
Debug.WriteLine(Utility.GetStringBetweenFirstAndSecond(dns, firstValue, secondValue));
}
}
Method to parse the string:
public static string GetStringBetweenFirstAndSecond(string str, string firstStringToFind, string secondStringToFind)
{
string domain = string.Empty;
if (string.IsNullOrEmpty(str))
{
//throw an exception, return gracefully, whatever you determine
}
else
{
//Take the text between the first occurrence of each marker.
//This can all be done in one line, but it is broken apart so it can be better understood.
int start = str.IndexOf(firstStringToFind) + 1;
int end = str.IndexOf(secondStringToFind);
//Trim() drops the space left in front of "Logged".
domain = str.Substring(start, end - start).Trim();
string[] dArray = domain.Split('.');
//Keep only the last two labels, e.g. domain1.com
if (dArray.Length > 2)
{
domain = string.Format("{0}.{1}", dArray[dArray.Length - 2], dArray[dArray.Length - 1]);
}
}
return domain;
}
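For comparison, the steps described above fit in a few lines of Python. This is only a sketch under the same assumption as the C# version, namely that the wanted domain is always the last two dot-separated labels. That holds for the sample lines but not for suffixes such as co.uk. The function name `last_two_labels` is hypothetical.

```python
def last_two_labels(line, first="-", second="Logged"):
    """Return the last two dot-separated labels of the host between the two markers."""
    # Position just after the first marker.
    start = line.index(first) + len(first)
    # Position of the second marker, searched from 'start' onwards.
    end = line.index(second, start)
    # Full host name, without the space that precedes "Logged".
    host = line[start:end].strip()
    # Assumption: the domain is the last two labels (wrong for e.g. co.uk).
    return ".".join(host.split(".")[-2:])


if __name__ == "__main__":
    for sample in [
        "11-test.domain1.com Logged ...",
        "37-user1.users.domain2.org Logged ...",
        "48-me.server.domain3.net Logged ...",
    ]:
        print(last_two_labels(sample))
```

`str.index` raises `ValueError` when a marker is missing, so malformed lines fail loudly instead of producing a wrong slice.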

Getting word under caret - C++, wxWidgets

I am writing a text editor using the wxWidgets framework. I need to get the word under caret from the text control. Here is what I came up with.
static bool IsWordBoundary(wxString& text)
{
return (text.Cmp(wxT(" ")) == 0 ||
text.Cmp(wxT('\n')) == 0 ||
text.Cmp(wxT('\t')) == 0 ||
text.Cmp(wxT('\r')) == 0);
}
static wxString GetWordUnderCaret(wxTextCtrl* control)
{
int insertion_point = control->GetInsertionPoint();
wxTextPos last_position = control->GetLastPosition();
int start_at, ends_at = 0;
// Finding starting position:
// from the current caret position, move back each character until
// we hit a word boundary.
int caret_pos = insertion_point;
start_at = caret_pos;
while (caret_pos)
{
wxString text = control->GetRange (caret_pos - 1, caret_pos);
if (IsWordBoundary (text)) {
break;
}
start_at = --caret_pos;
}
// Finding ending position:
// from the current caret position, move forward each character until
// we hit a word boundary.
caret_pos = ends_at = insertion_point;
while (caret_pos < last_position)
{
wxString text = control->GetRange (caret_pos, caret_pos + 1);
if (IsWordBoundary (text)) {
break;
}
ends_at = ++caret_pos;
}
return (control->GetRange (start_at, ends_at));
}
This code works as expected, but I am wondering: is this the best way to approach the problem? Do you see any possible improvements to the code above?
Any help would be great!
Is punctuation part of a word? It is in your code -- is that what you want?
Here is how I would do it:
wxString word_boundary_marks = " \n\t\r";
wxString text_in_control = control->GetValue();
int ends_at = text_in_control.find_first_of( word_boundary_marks, insertion_point) - 1;
int start_at = text_in_control.Mid(0,insertion_point).find_last_of(word_boundary_marks) + 1;
I haven't tested this, so there likely are one or two "off-by-one" errors and you should add checks for "not found", end of string, and any other word markers. My code should give you the basis for what you need.
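The boundary handling is easier to verify on a plain string, away from the text control. The Python sketch below is not wxWidgets code. It scans outward from the caret and treats only the four whitespace characters above as word boundaries, so punctuation still counts as part of a word. The name `word_under_caret` is made up for illustration.

```python
WORD_BOUNDARIES = " \n\t\r"


def word_under_caret(text, caret):
    """Return the run of non-boundary characters touching position 'caret'."""
    # Walk left until the character before 'start' is a boundary.
    start = caret
    while start > 0 and text[start - 1] not in WORD_BOUNDARIES:
        start -= 1
    # Walk right until the character at 'end' is a boundary or the text ends.
    end = caret
    while end < len(text) and text[end] not in WORD_BOUNDARIES:
        end += 1
    return text[start:end]


if __name__ == "__main__":
    sample = "hello world"
    for caret in (0, 2, 5, 6, len(sample)):
        print(caret, repr(word_under_caret(sample, caret)))
```

Working on the whole string once, as the answer suggests with GetValue(), avoids calling GetRange() for every single character; the explicit bounds checks take the place of the "not found" handling that find_first_of and find_last_of would need.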