Regex with Tab delimited text containing \x09

I've got a tough one.
I've got tab-delimited text to match with a regex.
My regex looks like:
^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$
and an example source text is (tabs converted to \t for clarity):
JJ\t345\t0\tTest\tSome test text\tmore text: pcre:"/\x20\x62\x3b\x0a\x09\x61\x2e\x53\x74\x61\x72/"\tNone
However, the 6th field of my source text contains a regex string. It can therefore contain \x09, which naturally blows up the regex, since it's seen as a tab as well.
Is there any way to tell the regex engine, "Match on \t but not on the text \x09"? My guess is no, since they're the same thing.
If not, is there any character that could be safely used for delimiting text that contains a regex string?

I would recommend encoding all of the characters in the pcre string prior to running the regular expression against it.

Seems like a problem with the test case. A regex might have tabs in it, but your sample above doesn't. Your string in Java would look like:
String testString = "JJ\t345\t0\tTest\tSome test text\tmore text: pcre:\"/\\x20\\x62\\x3b\\x0a\\x09\\x61\\x2e\\x53\\x74\\x61\\x72/\"\tNone";
If you look at this string in the debugger you'll have \x09 as 4 characters instead of as 1 (the tab).
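To check this in practice, here is a minimal Python sketch (Python is my choice for illustration, not the asker's language, and names such as PATTERN and pcre_field are made up). It joins the sample fields with real tabs while keeping \x09 as four literal characters inside the 6th field, then shows that the match only fails once that escape is decoded into a real tab.

```python
import re

# The asker's pattern: seven tab-separated fields.
PATTERN = re.compile(
    r"^([\w ]+)\t(\d*)\t(\d+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)$"
)

# Raw string: \x09 stays as the four characters '\', 'x', '0', '9'.
pcre_field = r'more text: pcre:"/\x20\x62\x3b\x0a\x09\x61\x2e\x53\x74\x61\x72/"'

# Join the fields with real tab characters.
line = "\t".join(["JJ", "345", "0", "Test", "Some test text", pcre_field, "None"])

match = PATTERN.match(line)
fields = match.groups() if match else None

# If the escape were decoded into a real tab, the line would have eight
# tab-separated fields and the seven-field pattern would no longer match.
decoded_line = line.replace(r"\x09", "\t")
decoded_match = PATTERN.match(decoded_line)

print(fields[5] if fields else "no match")
print(decoded_match)
```

So the thing to verify is whether the source data holds the four-character escape or an actual tab byte. Only the latter conflicts with the delimiter.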

Related

Regex expression that selects only specific word

Welcome guys, I am just new to this community!
Here is the case, I am having some strings like these
thatisanappleaaa
thatisanappleaaa bad
thatisanappleaaa.bad
thatisanapplebadaaa
thatisanbadappleaaa
thatisanbadbadappleaaa
badthatisanappleaaa
and trying to use Sublime Text 3 Find and replace function to achieve the following (note that only the first line is being replaced)
thatisanorangeaaa
thatisanappleaaa bad
thatisanappleaaa.bad
thatisanapplebadaaa
thatisanbadappleaaa
thatisanbadbadappleaaa
badthatisanappleaaa
Is there a regex that matches "apple" only in "thatisanappleaaa" (line one), that is, only when "bad" does not appear at any position in the string? The string "bad" is spelled the same way every time it appears.

Try
(\w+)apple(\w+).*
This selects all the text wrapped around apple. If you only want the text trailing after apple, use
apple(\w+).*

After reading your description, I'm assuming you want to replace the word apple only in lines that contain no occurrence of the word bad.
The regex below uses a negative lookahead to reject any line containing bad, and parentheses to capture apple, which can then be replaced with any word (in your case, orange).
Regex: ^(?!.*bad).*(apple)
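As a rough illustration of the lookahead idea, here is a Python sketch run against the question's sample lines. One adaptation is mine rather than the answerer's: the capture group is moved to the text before apple, so that a whole-match replacement can put that prefix back instead of discarding it.

```python
import re

lines = [
    "thatisanappleaaa",
    "thatisanappleaaa bad",
    "thatisanappleaaa.bad",
    "thatisanapplebadaaa",
    "thatisanbadappleaaa",
    "thatisanbadbadappleaaa",
    "badthatisanappleaaa",
]

# (?!.*bad) rejects any line that contains "bad" anywhere.
# (.*) keeps the text before "apple" so the replacement can restore it.
pattern = re.compile(r"^(?!.*bad)(.*)apple")

replaced = [pattern.sub(r"\g<1>orange", line) for line in lines]

for before, after in zip(lines, replaced):
    print(before, "->", after)
```

Only the first line changes. Every other line contains bad somewhere, so the lookahead fails at the start of the line and nothing is replaced.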

Regular expression appears to ignore tab character

I have a regular expression that parses lines in a driver inf file to extract just the variable names and values ignoring whitespace and end of line comments that begin with a semicolon.
It looks like this:
"^([^=\s]+)[ ]*=[ ]*([^;\r\n]+)(?<! )"
Most of the time it works just fine as per the example here: regex example 1
However, when it encounters a line that has a tab character anywhere between the variable name and the equals sign, the expression fails as per the example here: regex example 2
I have tried replacing "\s" with "\t" and "\x09" and it still doesn't work. I have edited the text file that contains the tab character with a hex editor and confirmed that it is indeed ASCII "09". I don't want to use a positive character match as the variable could actually contain quite a large number of special characters.
The appearance of the literal "=" seems to cause the problem but I cannot understand why.
For example, if I strip back the expression to this: regex example 3
and use the line with the tab character in it, it works fine. But as soon as I add the literal "=" as per the example here: regex example 4, it no longer matches, appearing to ignore the tab character.

The two [ ]* match only space characters (U+0020 SPACE) and not other whitespace characters.
Change both to [ \t]* to match tabs as well. The result would now look like:
"^([^=\s]+)[ \t]*=[ \t]*([^;\r\n]+)(?<! )"
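A small Python sketch of the difference, using a made-up INF-style line (the DriverVer line below is an assumption for illustration, not taken from the asker's file):

```python
import re

# Original pattern: only spaces are allowed around the equals sign.
SPACES_ONLY = re.compile(r"^([^=\s]+)[ ]*=[ ]*([^;\r\n]+)(?<! )")

# Fixed pattern: spaces or tabs are allowed around the equals sign.
SPACES_AND_TABS = re.compile(r"^([^=\s]+)[ \t]*=[ \t]*([^;\r\n]+)(?<! )")

# Hypothetical line with a tab between the name and the equals sign.
line = "DriverVer\t= 01/01/2014 ; build date"

before = SPACES_ONLY.match(line)
after = SPACES_AND_TABS.match(line)

print(before)
print(after.groups() if after else None)
```

The name group excludes whitespace, so it stops before the tab. With [ ]* nothing can consume the tab ahead of the literal =, which is why the literal appeared to cause the failure.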

You've just added the \t tab character in the wrong part I think.
This was your example 2 (not working):
^([^=\s]+)[ ]*=[ ]*([^;\r\n]+)(?<! )
This is your example 2 ... working (with a tab):
^([^=\s]+)[ \t]*=[ ]*([^;\r\n]+)(?<! )
            ^^ tab here
Seems to do the trick and match your first example: http://regex101.com/r/kQ1zH4/1

^([^=\s]+)\s*=\s*([^;\r\n]+)(?<!\s)
Try this. See the demo:
http://regex101.com/r/tV8oH3/2

How to extract big mgrs using regex

I have an input json:
{"id":12345,"mgrs":"04QFJ1234567890","code":"12345","user":"db3e1a-3c88-4141-bed3-206a"}
I would like to use a regular expression to extract the MGRS of 1000 kilometer; in my example the result should be: 04QFJ1267
The first 2 symbols are always digits, the next 3 are always letters, and the rest are always digits. The MGRS always has a fixed length of 15 characters.
Is it possible?
Thanks.

All you really need to do is remove characters 8-10 and 13-15. If you want or need to do that with a regex, you could use a replace method with the pattern below (edited to also remove the rest of the string):
.*?(\w{7})\d{3}(\d{2})\d+.*
and replacement string:
$1$2
I see now you are using Java. So the relevant code line might look like:
resultString = subjectString.replaceAll(".*?(\\w{7})\\d{3}(\\d{2})\\d+.*", "$1$2");
The above assumes all your strings look like what you showed, and there is no need to test to be sure that "mgrs" is in the string.
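If some validation is wanted, one possible variation (sketched in Python, with hypothetical names MGRS_PATTERN and shorten_mgrs) is to parse the JSON first and then apply a stricter pattern to the "mgrs" field alone. It assumes the layout given in the question: 2 digits, 3 letters, then 10 digits.

```python
import json
import re

# 2 digits + 3 letters + first 2 easting digits, skip 3 digits,
# keep the first 2 northing digits, skip the last 3 digits.
MGRS_PATTERN = re.compile(r"(\d{2}[A-Za-z]{3}\d{2})\d{3}(\d{2})\d{3}")


def shorten_mgrs(raw_json: str) -> str:
    """Return the shortened MGRS from a JSON record with an "mgrs" field."""
    record = json.loads(raw_json)
    match = MGRS_PATTERN.fullmatch(record["mgrs"])
    if match is None:
        raise ValueError("unexpected MGRS format: " + record["mgrs"])
    return match.group(1) + match.group(2)


sample = (
    '{"id":12345,"mgrs":"04QFJ1234567890","code":"12345",'
    '"user":"db3e1a-3c88-4141-bed3-206a"}'
)
print(shorten_mgrs(sample))  # prints 04QFJ1267
```

Working on the parsed field rather than the whole JSON string means other fields, such as "id" or "code", cannot be matched by accident.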

regular expression to extract JSON string from text

I'm looking for a regex to extract a JSON string from text.
I have the text below, which contains JSON strings (mTitle, mPoster, mYear, mDate) like this:
{"999999999":"138138138","020202020202":{"846":{"mTitle":"\u0430","mPoster":{"
small":"\/upload\/ms\/b_248.jpg","middle":"600.jpg","big":"400.jpg"},"mYear"
:"2013","mDate":"2014-01-01"},"847":{"mTitle":"\u043a","mPoster":"small":"\/upload\/ms\/241.jpg","middle":"600.jpg","big":"
138.jpg"},"mYear":"2013","mDate":"2013-12-26"},"848":{"mTitle":"\u041f","mPoster":{"small":"\/upload\/movies\/2
40.jpg","middle":"138.jpg","big":"131.jpg"},"mYear":"2013","mDate":"2013-12-19"}}}
In order to parse the JSON, I first need to extract the JSON string from the text.
So my question is: could you help me get only the JSON string from the text?
I've tried this regular expression with no success:
{"mTitle":(\w|\W)*"mDate":(\w|\W)*}

The following regex should work:
\{\s*"mTitle"\s*:\s*(.+?)\s*,\s*"mPoster":\s*(.+?)\s*,\s*"mYear"\s*:\s*(.+?)\s*,\s*"mDate"\s*:\s*(.+?)\s*\}
The main difference from your regex is the .+? part, that, broken down, means:
Match any character (.)
One or more times (+)
As little as possible (?)
The ? operator after the + is very important here: if you removed it, the first .+ (in \{\s*"mTitle"\s*:\s*(.+?)) would match as much text as it could, running on to the last "mPoster" in the text instead of stopping at the first one, which is what you want.
Notice it is just a more complicated version of \{"mTitle":(.+?),"mPoster":(.+?),"mYear":(.+?),"mDate":(.+?)\} (with \s* to match spaces, allowed by the JSON notation).
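Here is a short Python sketch of the pattern in use. The sample text is a shortened, well-formed stand-in for the question's data (an assumption, since the original sample is wrapped across lines), and RECORD and records are illustrative names.

```python
import json
import re

# The answer's pattern, split over two lines for readability.
RECORD = re.compile(
    r'\{\s*"mTitle"\s*:\s*(.+?)\s*,\s*"mPoster":\s*(.+?)\s*,'
    r'\s*"mYear"\s*:\s*(.+?)\s*,\s*"mDate"\s*:\s*(.+?)\s*\}'
)

# Raw strings keep the JSON escapes (\u0430, \u043a) as written.
text = (
    r'{"846":{"mTitle":"\u0430","mPoster":{"small":"a.jpg","big":"400.jpg"},'
    r'"mYear":"2013","mDate":"2014-01-01"},'
    r'"847":{"mTitle":"\u043a","mPoster":{"small":"b.jpg","big":"138.jpg"},'
    r'"mYear":"2013","mDate":"2013-12-26"}}'
)

# Each full match is itself a JSON object, so it can be parsed directly.
records = [json.loads(m.group(0)) for m in RECORD.finditer(text)]

for record in records:
    print(record["mYear"], record["mDate"], record["mPoster"]["small"])
```

If the surrounding text is valid JSON as a whole, parsing it with a JSON library and walking the result is likely to be more robust than a regex, because the pattern depends on the keys appearing in this exact order.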

How to cycle through delimited tokens with a Regular Expression?

How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)

/###(.+?)###/
if you want the ###'s then you need
/(###.+?###)/
the ? means non greedy, if you didn't have the ?, then it would grab too much.
e.g. '###token1### text text ###token2###' would all get grabbed.
My initial answer had a * instead of a +. * means 0 or more. + means 1 or more. * was wrong because that would allow ###### as a valid thing to find.
For playing around with regular expressions, I highly recommend http://www.weitz.de/regex-coach/ for Windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.

In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while($text =~ m/###(.+?)###/g) {
    print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.+?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
@tokens = $text =~ m/###(.+?)###/g;
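For comparison, the same iteration can be sketched in Python (not part of the original answers). It also shows what the greedy form grabs when the ? is left out.

```python
import re

text = "text ###token1### text text ###token2### text text"

# Lazy match, capturing only the text between the delimiters.
inner = re.findall(r"###(.+?)###", text)

# Lazy match, keeping the delimiters (group 0 is the whole match).
with_delims = [m.group(0) for m in re.finditer(r"###.+?###", text)]

# Greedy match: runs from the first ### to the last ###.
greedy = re.findall(r"###.+###", text)

print(inner)        # ['token1', 'token2']
print(with_delims)  # ['###token1###', '###token2###']
print(greedy)       # ['###token1### text text ###token2###']
```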

Assuming you want to match ###token2### as well...
/###.+###/

Use () to capture and \1, \2, and so on to refer to the captures. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 in the replacement expression (\1 for the first set, \2 for the second), assuming you're doing a search/replace in an editor. For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
If you're using a regexp library in a program, you'd presumably call a function to get at the contents of the first and second tokens, which you've indicated with the ()s around them.

With delimiters like these, you basically match the opening delimiter, then anything that is not the closing delimiter, then the closing delimiter itself. One caution: in cases like the example above, [^#] on its own does not work as the check that the end delimiter has not been reached, since a single # inside a token would cause the regex to fail (e.g. "###foo#bar###"). For the example above, the regex would be the following, assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###
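A quick Python check of this pattern on a token that contains a single #. One adaptation is mine: the repeated group is made non-capturing and wrapped in an outer group, so that findall returns the whole token rather than only its last repetition.

```python
import re

# Inside a token, allow: a non-#, or # + non-#, or ## + non-#.
# Three #s in a row can therefore only be the closing delimiter.
TOKEN = re.compile(r"###((?:[^#]|#[^#]|##[^#])*)###")

text = "text ###foo#bar### text ###token2### text"

tokens = TOKEN.findall(text)
empty = TOKEN.findall("######")

print(tokens)  # ['foo#bar', 'token2']
print(empty)   # ['']
```

The second call shows the empty-token case mentioned above. Changing * to + makes "######" stop matching.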
