Regular expression appears to ignore tab character

I have a regular expression that parses lines in a driver inf file to extract just the variable names and values ignoring whitespace and end of line comments that begin with a semicolon.
It looks like this:
"^([^=\s]+)[ ]*=[ ]*([^;\r\n]+)(?<! )"
Most of the time it works just fine as per the example here: regex example 1
However, when it encounters a line that has a tab character anywhere between the variable name and the equals sign, the expression fails as per the example here: regex example 2
I have tried replacing "\s" with "\t" and "\x09" and it still doesn't work. I have edited the text file that contains the tab character with a hex editor and confirmed that it is indeed ASCII "09". I don't want to use a positive character match, as the variable could actually contain quite a large number of special characters.
The appearance of the literal "=" seems to cause the problem but I cannot understand why.
For example, if I strip back the expression to this: regex example 3
and use the line with the tab character in it, it works fine. But as soon as I add the literal "=" as per the example here: regex example 4, it no longer matches, appearing to ignore the tab character.

The two [ ]* match only space characters (U+0020 SPACE) and not other whitespace characters.
Change both to [ \t]* to match tabs as well. The result would now look like:
"^([^=\s]+)[ \t]*=[ \t]*([^;\r\n]+)(?<! )"
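A quick way to see the difference is to run both patterns on a line that has a tab before the equals sign. The sketch below uses Python's re module; the sample line and the helper name parse_inf_line are made up for illustration and do not come from the question. With the original pattern, [^=\s]+ stops at the tab (because \s includes it), [ ]* cannot consume it, and the literal = then fails.

```python
import re

# Original pattern: only U+0020 spaces are allowed around "=".
ORIGINAL = re.compile(r"^([^=\s]+)[ ]*=[ ]*([^;\r\n]+)(?<! )")
# Fixed pattern: spaces or tabs are allowed around "=".
FIXED = re.compile(r"^([^=\s]+)[ \t]*=[ \t]*([^;\r\n]+)(?<! )")


def parse_inf_line(pattern, line):
    """Return (name, value) if the line matches, otherwise None."""
    match = pattern.match(line)
    return match.groups() if match else None


# Hypothetical INF line with a tab between the name and "=".
sample = "DriverVer\t= 01/01/2014 ; trailing comment"

print(parse_inf_line(ORIGINAL, sample))  # None
print(parse_inf_line(FIXED, sample))     # ('DriverVer', '01/01/2014')
```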

You've just added the \t tab character in the wrong part I think.
This was your example 2 (not working):
^([^=\s]+)[ ]*=[ ]*([^;\r\n]+)(?<! )
This is your example 2 ... working (with a tab):
^([^=\s]+)[ \t]*=[ ]*([^;\r\n]+)(?<! )
            ^^ tab here
Seems to do the trick and match your first example: http://regex101.com/r/kQ1zH4/1

^([^=\s]+)\s*=\s*([^;\r\n]+)(?<!\s)
Try this. See the demo:
http://regex101.com/r/tV8oH3/2
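One practical difference from the [ \t]* variant is that (?<!\s) also trims a trailing tab in front of the comment, whereas (?<! ) only trims trailing spaces. A small Python check of that behaviour, using a made-up sample line (the variable names are illustrative only):

```python
import re

# Trailing trim handles spaces only.
SPACE_ONLY_TRIM = re.compile(r"^([^=\s]+)[ \t]*=[ \t]*([^;\r\n]+)(?<! )")
# Trailing trim handles any whitespace character.
ANY_WS_TRIM = re.compile(r"^([^=\s]+)\s*=\s*([^;\r\n]+)(?<!\s)")

# Hypothetical line: a tab separates the value from the comment.
sample = "Provider = Contoso\t; end-of-line comment"

space_only_value = SPACE_ONLY_TRIM.match(sample).group(2)
any_ws_value = ANY_WS_TRIM.match(sample).group(2)

print(repr(space_only_value))  # 'Contoso\t'
print(repr(any_ws_value))      # 'Contoso'
```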

Related

Regex to match(extract) string between dot(.)

I want to select certain string combinations (words joined with dots) from a very long string (SQL). The full string could be a single line or several lines separated by newlines, and the combination could appear at the start (on the first line), on a later line, or in both places.
I need help writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add prefix or suffix words (or a space) that must be present wherever this logic is checked? In the case above, if the statement is an update, the regex should pick test.table, but if the statement is something like select test.table, the regex should not pick up this combination. The same applies to suffixes.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected output: test.table, test.table (the keywords are INTO and FROM)
Things not needed in the selection: as_t.numb, as_t.old, as_t.date
Once I have the regex, I can use it in my program to extract these words.
Note: The words before and after the combination could be anything, such as update, select, { or (, so we have to find every occurrence of words joined together with a . (dot), however many there are.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.

To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
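Both patterns can be tried against the examples from the question. This is a minimal sketch in Python; the variable names are illustrative. Python's re module only accepts fixed-width lookbehinds, which happens to work here because INTO and FROM have the same length; other keyword sets may need a different approach or another regex engine.

```python
import re

# Any run of non-whitespace that contains a dot with text on both sides.
ANY_DOTTED = re.compile(r"[^\s]+\.[^\s]+")
# Same, but only directly after INTO or FROM plus one whitespace character.
AFTER_KEYWORD = re.compile(r"(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+")

sentence = "I am testing something like test.test.test in sentence."
sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

print(ANY_DOTTED.findall(sentence))  # ['test.test.test']
print(AFTER_KEYWORD.findall(sql))    # ['test.table', 'test.table']
```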

If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that matches any word character \w between 0 and unlimited times, followed by a dot, followed by another series of word characters that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java (the backslashes must be doubled inside a string literal):
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): if the input contains something like '.word' or 'word..word', the regex will still match it. I assume you don't have strings like that.
See this screenshot for more details

Regex to match text between single, double and triple quotes

I have a text file that I want to parse strings from. The thing is that there are strings enclosed in either single ('), double (") or 3x single (''') quotes within the exact same file. The best result I was able to get so far is to use this:
((?<=["])(.*?)(?=["]))|((?<=['])(.*?)(?=[']))
to match only single-line strings between single and double quotes. Please note that the strings enclosed in each type of quote can be either single- or multi-line, and that each type of string repeats several times within the file.
Here's a sample string:
<thisisthefirststring
'''- This is the first line of text
- This is the second line of text
- This is the third line of text
'''
>
<thisisanotheroption
"Just a string between quotes"
>
<thisisalsopossible
'Single quotes
Multiple lines.
With blank lines in between
'
>
<lineBreaksDoubleQoutes
"This is the first sentence here
After the first sentence, comes the blank line, and then the second one."
>

Use this:
((?:'|"){1,3})([^'"]+)\1
Test it online
Using the group reference \1 simplifies the work: the closing quote must be the same text as the opening one.
To get only what is inside the quotes, use the 2nd group of the match.

This regex: ('{3}|["']{1})([^'"][\s\S]+?)\1
does what you want.

Using Notepad++, you can use: ('''|'|")((?:(?!\1).)+)\1
Explanation:
('''|'|") : group 1, all types of quote
( : group 2
(?:(?!\1).)+ : anything that is not the quote in group 1
) : end group 2
\1 : back reference to group 1 (i.e. same quote as the beginning)
Here is a screen capture of the result.
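The same pattern can be tried outside Notepad++. The sketch below uses Python and assumes that re.DOTALL plays the role of Notepad++'s ". matches newline" option, which the multi-line strings need; the sample text is a shortened version of the one in the question.

```python
import re

# Group 1: opening quote (''' is tried first). Group 2: everything up to
# the next occurrence of that same quote. DOTALL lets "." cross newlines.
QUOTED = re.compile(r"""('''|'|")((?:(?!\1).)+)\1""", re.DOTALL)

sample = (
    "<thisisthefirststring\n"
    "'''- This is the first line of text\n"
    "- This is the second line of text\n"
    "'''\n"
    ">\n"
    "<thisisanotheroption\n"
    "\"Just a string between quotes\"\n"
    ">\n"
    "<thisisalsopossible\n"
    "'Single quotes\n"
    "Multiple lines.\n"
    "'\n"
    ">\n"
)

# Keep only the text between the quotes (group 2 of each match).
strings = [match.group(2) for match in QUOTED.finditer(sample)]
for text in strings:
    print(repr(text))
```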

Here's something that may work for you.
^(\"([^\"\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"|'([^'\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*'|\"\"\"((?!\"\"\")[^\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"\"\")$
Replace the triple double quotes with triple single quotes. See it in action at regex101.com.

Named Group Version
Avoids problems when used in larger expressions by explicitly referring to the name of the group storing the last found quote.
Should work for most systems:
(?<Qt>'''|'|")(.*?)\k<Qt>
.NET version:
(?<Qt>'''|'|"")(.*?)\k<Qt>
Works as follows:
'''|'|": Check first for ''', then ', and finally ". Done in this order so ''' has priority over '.
(?<Qt>'''|'|""): When matched, place the match in <Qt> for later use.
(.*?): Capture the results of a lazy search for 0 or more of anything .*? - will return empty strings. To prevent empty strings from being returned, change to a lazy search for 1 or more of anything .+?.
\k<Qt>: Search for the value last stored in <Qt>.
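For comparison, here is a sketch of the same idea in Python, whose re module spells named groups (?P<Qt>...) and named backreferences (?P=Qt) rather than the (?<Qt>...) and \k<Qt> forms shown above. The re.DOTALL flag and the sample string are assumptions added for illustration, so that .*? can span line breaks.

```python
import re

# Named group Qt stores the opening quote; (?P=Qt) requires the same
# quote to close the string. ''' is listed first so it wins over '.
QUOTED = re.compile(r"""(?P<Qt>'''|'|")(.*?)(?P=Qt)""", re.DOTALL)

# Hypothetical input containing all three quote styles.
sample = "a = 'one'; b = \"two\"; c = '''three\nfour'''"

# Group 2 holds the text between the quotes.
contents = [match.group(2) for match in QUOTED.finditer(sample)]
print(contents)  # ['one', 'two', 'three\nfour']
```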

How do you "quantify" a variable number of lines using a regexp?

Say you know the starting and ending lines of some section of text, but the chars in some lines and the number of lines between the starting and ending lines are variable, à la:
aaa
bbbb
cc
...
...
...
xx
yyy
Z
What quantifier do you use, something like:
aaa\nbbbb\ncc\n(.*\n)+xx\nyyy\nZ\n
to parse those sections of text as a group?

You can use the s flag to match multi-line text. You can do it like this:
~\w+ ~s.
There is a similar question here:
Javascript regex multiline flag doesn't work

If I understood correctly, you know that your text begins with aaa\nbbbb\ncc and ends with xx\nyyy\nZ\n. You could use aaa.+?bbbb.+?cc(.+?)xx.+?yyy.+?Z so that no quantifier is greedy and you don't accidentally capture two sections at once. The text in between would be in match group 1. You also need to turn on the setting that makes the dot match newlines.
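As a rough illustration of this answer in Python, the dot-matches-newline setting is the re.DOTALL flag. The middle lines of the sample text are invented; only the first three and last three lines come from the question.

```python
import re

# Lazy quantifiers everywhere; DOTALL makes "." match "\n" as well.
SECTION = re.compile(r"aaa.+?bbbb.+?cc(.+?)xx.+?yyy.+?Z", re.DOTALL)

# Known header and footer lines around a variable number of lines.
text = "aaa\nbbbb\ncc\nfirst\nsecond\nthird\nxx\nyyy\nZ\n"

match = SECTION.search(text)
middle = match.group(1)
print(repr(middle))  # '\nfirst\nsecond\nthird\n'
```

Group 1 keeps the newlines that surround the variable part; calling strip() on it removes them if they are not wanted.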

Try this:
aaa( |\n)bbbb( |\n)cc( |\n)( |\n){0,1}(.|\n)*xx( |\n)yyy( |\n)Z
( |\n) matches a space or a newline (so your starting and ending phrases can be split into different lines)
RegExr

At the end of the day what worked for me using Kate was:
( )+aaa\n( )+bbbb\n( )+cc\n(.|\n)*( )+xx\n( )+yyy\n( )+Z\n
using such regexps you can clear pages of quite a bit of junk.

find a single quote at the end of a line starting with "mySqlQueryToArray"

I'm trying to use regex to find single quotes (so I can turn them all into double quotes) anywhere in a line that starts with mySqlQueryToArray (a function that makes a query to a SQL DB). I'm doing the regex in Sublime Text 3 which I'm pretty sure uses Perl Regex. I would like to have my regex match with every single quote in a line so for example I might have the line:
mySqlQueryToArray($con, "SELECT * FROM Template WHERE Name='$name'");
I want the regex to match in that line both of the quotes around $name but no other characters in that line. I've been trying to use (?<=mySqlQueryToArray.*)' but it tells me that the look behind assertion is invalid. I also tried (?<=mySqlQueryToArray)(?<=.*)' but that's also invalid. Can someone guide me to a regex that will accomplish what I need?

To find any number of single quotes in a line starting with your keyword you can use the \G anchor ("end of last match") by replacing:
(^\h*mySqlQueryToArray|(?!^)\G)([^\n\r']*)'
With \1\2<replacement>: see demo here.
Explanation
( ^\h*mySqlQueryToArray # beginning of line: check the keyword is here
| (?!^)\G ) # if not at the BOL, check we did match something on this line
( [^\n\r']* ) ' # capture everything until the next single quote
The general idea is to match everything until the next single quote with ([^\n\r']*)' in order to replace it with \2<replacement>, but do so only if this everything is:
right after the beginning keyword (^mySqlQueryToArray), or
after the end of the last match ((?!^)\G): in that case we know we have the keyword and are on a relevant line.
\h* accounts for any leading indentation, as suggested by Xælias (\h being a shortcut for any kind of horizontal whitespace).

https://stackoverflow.com/a/25331428/3933728 is a better answer.
I'm not good enough with RegEx nor ST to do this in one step. But I can do it in two:
1/ Search for all mySqlQueryToArray strings
Open the search panel: ⌘F or Find->Find...
Make sure the Regex (.*) button (bottom left) and the wrap selector are selected (all others should be off)
Search for: ^\s*mySqlQueryToArray.*$
^ beginning of line
\s* any indentation
mySqlQueryToArray your call
.* whatever follows
$ end of line
Click on Find All
This will select every occurrence of what you want to modify.
2/ Enter the replace mode
⌥⌘F or Find->Replace...
This time, make sure that wrap, Regex AND In selection are all active.
Then search for '([^']*)' and replace with "\1".
' are your single quotes
(...) is the capturing group, referenced by \1 in the replace field
[^']* is any character that is not a single quote, repeated
Then hit Replace All
I know this is a little more complex than the other answer, but it handles cases where your line contains several single-quoted strings. Like this:
mySqlQueryToArray($con, "SELECT * FROM Template WHERE Name='$name' and Value='1234'");
If this is too much, I guess something like find: (?<=mySqlQueryToArray)(.*?)'([^']*)'(.*?) and replace it with \1"\2"\3 will be enough.
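Outside the editor, the same two-step idea (select the matching lines, then replace quotes only inside them) can be sketched in Python. This is not the Sublime Text workflow itself, only an equivalent under that assumption; the function name requote_calls and the first line of the sample are invented for illustration.

```python
import re

# Step 1: whole lines that start (after optional indentation) with the call.
CALL_LINE = re.compile(r"^\s*mySqlQueryToArray.*$", re.MULTILINE)
# Step 2: a single-quoted run; group 1 is the text between the quotes.
SINGLE_QUOTED = re.compile(r"'([^']*)'")


def requote_calls(source):
    """Turn '...' into "..." only on lines matched by CALL_LINE."""
    def fix_line(match):
        return SINGLE_QUOTED.sub(r'"\1"', match.group(0))
    return CALL_LINE.sub(fix_line, source)


source = (
    "$other = 'left alone';\n"
    "mySqlQueryToArray($con, \"SELECT * FROM Template "
    "WHERE Name='$name' and Value='1234'\");\n"
)

result = requote_calls(source)
print(result)
```

Only the second line changes; single quotes on lines that do not start with the call are left untouched.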

You can use a regex like this:
(mySqlQueryToArray.*?)'(.*?)'(.*)
Working demo
Check the substitution section.

You can use \K, see this regex:
mySqlQueryToArray[^']*\K'(.*?)'
Here is a regex demo.

Regex with Tab delimited text containing \x09

I've got a tough one.
I've got tab-delimited text to match with a regex.
My regex looks like:
^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$
and an example source text is (tabs converted to \t for clarity):
JJ\t345\t0\tTest\tSome test text\tmore text: pcre:"/\x20\x62\x3b\x0a\x09\x61\x2e\x53\x74\x61\x72/"\tNone
However, the problem is that in my source text, the 6th field contains a regex string. Therefore, it can contain \x09, which naturally blows up the regex since it's seen as a tab as well.
Is there any way to tell the regex engine, "Match on \t but not on the text \x09"? My guess is no, since they're the same thing.
If not, is there any character that could be safely used for delimiting text that contains a regex string?

I would recommend encoding all of the characters in the pcre string prior to running the regular expression against it.

Seems like a problem with the test case. A regex might have tabs in it, but your sample above doesn't. Your string in Java would look like:
String testString = "JJ\t345\t0\tTest\tSome test text\tmore text: pcre:\"/\\x20\\x62\\x3b\\x0a\\x09\\x61\\x2e\\x53\\x74\\x61\\x72/\"\tNone";
If you look at this string in the debugger you'll have \x09 as 4 characters instead of as 1 (the tab).
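The same point can be checked in Python. In the sketch below, the sixth field is written as a raw string, so \x09 stays as four literal characters (backslash, x, 0, 9), and the fields are joined with real tab characters. The variable names are illustrative; the pattern is the one from the question.

```python
import re

# Pattern from the question: seven tab-separated fields.
FIELDS = re.compile(
    r"^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$"
)

# Raw string: "\x09" here is literal text, not a tab character.
pcre_text = r'more text: pcre:"/\x20\x62\x3b\x0a\x09\x61\x2e\x53\x74\x61\x72/"'

# Real tabs only appear between fields.
line = "\t".join(["JJ", "345", "0", "Test", "Some test text", pcre_text, "None"])

match = FIELDS.match(line)
print(match is not None)         # True
print("\t" in match.group(6))    # False
print(match.group(6) == pcre_text)  # True
```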

We Keep Coding
