C# Regex match + next n characters - regex

I'm new to regex and I need to parse source code from a website. Can anyone tell me the syntax to match a word followed by the next n characters in the string?
Let's say I want to match the word "country" followed by the next 15 characters in the string.
If the string is "...<tr class="hover"><td>country</td><td>RO</td></t......", I need to get "country</td><td>RO". I can deal with the string like this; ideally it would be only "country RO", but I don't want to ask for too much.

Something like: (country)<\/td><td>(..)
Using $1 $2 as your output should give you what you need.
Explanation:
Putting the () brackets around something lets you back reference it with the $1, etc.
Otherwise you are able to match exact characters.
Remember to escape special regex characters such as / with a backslash.
The second group simply matches the next two characters, whatever they are. If you know which characters can appear there (e.g. [A-Za-z]), it is better to use that set instead.
With that assumption I would use something like: (country)<\/td><td>([A-Za-z]{2})
Also helps to find a good reference: http://www.regular-expressions.info/reference.html
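As a rough illustration of the capture-group approach, here is a minimal Python sketch (the question is about C#, so treat the language as an assumption made only for illustration). It uses the sample string from the question and the [A-Za-z]{2} variant of the pattern. In Python the / needs no escaping, and the groups are read with group(1) and group(2) rather than $1 and $2.

```python
import re

# Sample text from the question; the pattern assumes exactly this markup.
html = '...<tr class="hover"><td>country</td><td>RO</td></t......'

# Group 1 is the literal word, group 2 is the two letters in the next cell.
pattern = re.compile(r"(country)</td><td>([A-Za-z]{2})")

match = pattern.search(html)
# Join the two captured groups, or fall back to None when nothing matched.
result = f"{match.group(1)} {match.group(2)}" if match else None
print(result)  # country RO
```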

Depending on your flavor of Regex engine:
"country.{15}"
Should match "country" exactly, followed by 15 characters of any kind.
Note that this requires at least 15 characters after "country". If fewer than 15 follow (or a line break intervenes, since . does not match a newline by default), the match fails. That could be problematic for you.
"country.{1,15}"
This will match "country" followed by between 1 and 15 characters of any kind, taking as many as it can. Again, this could be problematic depending on your use case.
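The difference between the two quantifiers can be sketched in Python (an assumption for illustration; the same patterns should behave alike in most engines). The two input strings are made up for the example.

```python
import re

long_text = "country</td><td>RO</td></t"   # more than 15 characters follow
short_text = "country RO"                  # only 3 characters follow

exact = re.compile(r"country.{15}")     # needs 15 characters after the word
up_to = re.compile(r"country.{1,15}")   # accepts anywhere from 1 to 15

long_match = exact.search(long_text).group(0)
short_exact = exact.search(short_text)            # None: too few characters
short_up_to = up_to.search(short_text).group(0)

print(long_match)    # country</td><td>RO</td
print(short_exact)   # None
print(short_up_to)   # country RO
```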

Related

regex code: does or does not contain a character

I can't figure this out. I want to capture the string inside the square brackets, with or without characters in it.
[5123512], [412351, 1235123, 5125123], [12312-AA] and []
I want to convert the square brackets into double quotes:
[5123512] ==> "5123512"
[412351, 1235123, 5125123] ==> "412351, 1235123, 5125123"
[12312-AA] ==> "12312-AA"
[] ==> ""
I tried \[\d+\] and it's not working.
This is my sample data; it's in JSON format.
Square brackets inside the description must not change, only those in the attributes.
{"results":
[{"listing": 4613456431,"sku": [5123512],"category":[412351, 1235123,
5125123],"subcategory": "12312-AA", "description":"This is [123sample]"}
{"listing": 121251,"sku":[],"category": [412351],"subcategory": "12312-AA",
"description": "product sample"}]}
TIA
Your regex doesn't work for three reasons:
[ is a meta-character that opens a character class. To match a literal [, you need to escape it with a backslash. ] also is a meta-character when it follows the [ meta-character, but if you escape the [ you shouldn't need to escape the ] (not that it hurts to do so).
\d only matches decimal digits, but your sample contains the letter A. If that is a hexadecimal digit, you will probably want to use [\dA-F] instead of \d, or [\dA-Fa-f] if the digits can be lowercase. If it can be any letter, you could use [\dA-Z] or [\dA-Za-z], depending on whether you need to match lowercase letters.
+ means "one or more occurrences", so it wouldn't match an empty []. Use the * ("0 or more occurrences") quantifier instead.
Additionally, you probably need to capture the sequence of digits in a (capturing group) in order to be able to reference it in your replacement pattern.
However, as Andrew Morton suggests, it looks like you should be able to use a plain text search/replace.
First off, regex is a horrible tool for parsing JSON formatted data. I'm sure you'll find plenty of tools to simply read your JSON in vb.net and mangle it in simpler ways than taking it in as text... For example: How to parse json and read in vb.net
Original answer (edited slightly):
You're almost there, but here's a few things you need to change:
in your regex pattern, escape the square brackets: \[ and \]
if you only want to capture all characters in the brackets, then . is a good way to go
the plus sign + means "at least one"; if you want to match empty brackets too, use *? instead
the question mark means "lazy": it explicitly tells the regex to match the shortest sequence of characters possible (instead of going over to the next square bracket...)
wrap the .*? in parentheses so that you can refer to that part later when substituting
finally, the output value / pattern to substitute with is \1 or $1, depending on the context
or "\1" or "$1" if you really need the double quotes in the output (maybe you just need a string variable?)
All in all this becomes:
Find this: \[(.*?)\]
Replace with: \1
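A small Python sketch of this find/replace, run on the four bracketed examples from the question (the question targets vb.net, so the language here is only an illustration). The replacement wraps \1 in double quotes because that is the output the question asks for. It is applied to the isolated values only: on the full JSON text the same pattern would also touch the outer array bracket and the brackets inside the description.

```python
import re

samples = ["[5123512]", "[412351, 1235123, 5125123]", "[12312-AA]", "[]"]

# Lazy .*? stops at the first closing bracket; group 1 may be empty.
bracket = re.compile(r"\[(.*?)\]")

# Replace each bracketed run with its contents wrapped in double quotes.
converted = [bracket.sub(r'"\1"', s) for s in samples]

for before, after in zip(samples, converted):
    print(before, "==>", after)
```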

ColdFusion Regex Match for Digits of Exact Length

I need some assistance constructing a regular expression in a ColdFusion application. I apologize if this has been asked. I have searched, but I may not be asking for the correct thing.
I am using the following to search an email subject line for an issue number:
reMatchNoCase("[0-9]{5}", mailCheck.subject)
The issue number contains only numeric values, and should be exactly 5 digits. This is working except in cases where I have a longer number that appears in the string, such as 34512345. It takes the first 5 digits of that string as a valid issue number as well.
What I want is to retrieve only 5 digit numbers, nothing shorter or longer. I am then placing these into a list to be looped over and processed. Do I perhaps need to include spaces before and after in the regex to get the desired result?
Thank you.
The general way to exclude content from occurring before/after a match is to use negative lookbehind before the match and a negative lookahead afterwards. To do this for numeric digits would be:
(?<!\d)\d{5}(?!\d)
(Where \d is the shorthand for [0-9])
CF's regex supports lookaheads, but unfortunately not lookbehinds, so that wouldn't work directly in rematch - however that probably doesn't matter in this case because it's likely that you don't want, for example, abc12345 to match either - so what you more likely want is:
\b\d{5}\b
Where \b is a "word boundary": roughly, it checks for a change between a "word character" and a non-word character (or vice versa). So in this case the first \b checks that there is NOT one of [a-zA-Z0-9_] before the first digit, and the second \b checks that there isn't one after the fifth digit. A \b does not add any characters to the match (i.e. it is a zero-width assertion).
Since you're not dealing with case, you don't need the nocase variant and can simply write:
rematch( '\b\d{5}\b' , mailCheck.subject )
The benefit of this over simply checking for spaces is that the result is five digits (no need to trim), but the downside is that it would match values such as [12345] or 3.14159^2 which are probably not what you want?
To check for spaces, or the start/end of the string, you can do:
rematch( '(?:^| )\d{5}(?= |$)' , mailCheck.subject )
Then use trim on each result to remove spaces.
If that's not what you're after, go ahead and provide more details.
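To compare the three patterns side by side, here is a hedged Python sketch (Python's re module does support lookbehind, unlike ColdFusion as described above, so the first pattern runs here). The subject line is invented for the example.

```python
import re

# Hypothetical subject line, made up to exercise each edge case.
subject = "12345 issue 34512345 abc54321 [67890] 24680"

# Lookarounds: no digit directly before or after the five digits.
lookaround = re.findall(r"(?<!\d)\d{5}(?!\d)", subject)

# Word boundaries: also rejects letters or underscores next to the digits.
word_boundary = re.findall(r"\b\d{5}\b", subject)

# Spaces or string start/end only; the leading space is part of the match,
# so each result is stripped afterwards.
spaced = [m.strip() for m in re.findall(r"(?:^| )\d{5}(?= |$)", subject)]

print(lookaround)     # ['12345', '54321', '67890', '24680']
print(word_boundary)  # ['12345', '67890', '24680']
print(spaced)         # ['12345', '24680']
```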

regex not working as it should

I'm trying to catch up on regex and I have made one as below;
^(.){1};(\d){4};(\d){8};[A,K]{1};(\d){7,8};(\d){8};[A-Z ]{1,};[ ,\d]{1};(\d){8};(\d){1};(\d){1}; $
and the sample is;
ä;1234;00126434;K;11821111;00000000;SOME TEXT ; 0;00000000;0;0;
As far as I've read
. matches any character, \d matches a digit, and {n} and its variations specify n repetitions or, depending on the variation, a range of repetitions.
What could be the problem?
A few suggestions/observations:
You can remove all {1}s, they don't do anything.
[A,K] means "A", "," or "K". If you want to match any letter between A and K, use [A-K].
You should place the capturing group around the repetitions: (\d{7,8}) captures a 7-8 digit number; (\d){7,8} will only capture the last digit.
[ ,\d]{1} fails on your regex because there are two characters (space and 0) at that point in the string.
You might need to remove the space before the final $, unless there actually is a space in your string after the last semicolon.
Here's a version that matches (and captures each element in a separate group):
^(.);(\d{4});(\d{8});([A-K]);(\d{7,8});(\d{8});([A-Z ]+);([ ,\d]+);(\d{8});(\d);(\d); *$
See it in action on regex101.com.
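The point about group placement can be checked with a short Python sketch (language chosen only for illustration). It assumes the first field of the sample line is the single character ä.

```python
import re

field = "11821111"

# Group inside the repetition: only the last repeated digit is kept.
last_only = re.fullmatch(r"(\d){7,8}", field).group(1)
# Group around the repetition: the whole number is kept.
whole = re.fullmatch(r"(\d{7,8})", field).group(1)

# Sample line from the question (first field assumed to be one character).
sample = "ä;1234;00126434;K;11821111;00000000;SOME TEXT ; 0;00000000;0;0;"
pattern = (r"^(.);(\d{4});(\d{8});([A-K]);(\d{7,8});(\d{8});"
           r"([A-Z ]+);([ ,\d]+);(\d{8});(\d);(\d); *$")
groups = re.match(pattern, sample).groups()

print(last_only)  # 1
print(whole)      # 11821111
print(groups)
```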
Please, don't abuse regexps for everything.
Your format is a CSV format, so just split at ; and then validate the individual parts properly. This is perfectly valid, usually similarly efficient, and easier to debug.
With regexp, make sure you properly escape (i.e. double escape!). In most programming languages, \ is a reserved character in strings, and you will need to use \\ to get the desired effect.
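A minimal sketch of the split-then-validate idea in Python. The per-field rules and the validate helper are hypothetical, chosen to mirror the pieces of the regex discussed above; real validation rules would depend on what each field means.

```python
import re

# Hypothetical per-field rules, one per semicolon-separated column.
FIELD_RULES = [r".", r"\d{4}", r"\d{8}", r"[A-K]", r"\d{7,8}", r"\d{8}",
               r"[A-Z ]+", r"[ \d]+", r"\d{8}", r"\d", r"\d"]

def validate(line):
    """Return True if every field of the line matches its rule."""
    fields = line.rstrip().split(";")
    # A trailing semicolon produces one empty field at the end; drop it.
    if fields and fields[-1] == "":
        fields.pop()
    if len(fields) != len(FIELD_RULES):
        return False
    return all(re.fullmatch(rule, value) is not None
               for rule, value in zip(FIELD_RULES, fields))

sample = "ä;1234;00126434;K;11821111;00000000;SOME TEXT ; 0;00000000;0;0;"
print(validate(sample))    # True
print(validate("a;1234"))  # False
```

Checking one field at a time makes it easy to report which column failed, which a single large regex does not do.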
Try this:
^(.){1};(\d){4};(\d){8};[A-K]{1};(\d){7,8};(\d){8};[A-Z ]{1,};[ \d]{2};(\d){8};(\d){1};(\d){1};$
Here is what was wrong with your regex:
^(.){1};(\d){4};(\d){8};[A,K]{1};(\d){7,8};(\d){8};[A-Z ]{1,};[ ,\d]{1};(\d){8};(\d){1};(\d){1}; $
You have an extra space before $ at the end.
To specify a range, use - and not a comma; your range should be [A-K].
With [ ,\d]{1} you have restricted that field to one character. It should be {2}: one for the space and one for the digit.
Additional: you don't need to specify {1}, since a token matches exactly once by default.
If yours does not work, you can try this one :
^(.){1};(\d){4};(\d){8};[A,K]{1};(\d){7,8};(\d){8};[A-Z ]{1,};( \d){1};(\d){8};(\d){1};(\d){1};$

Can't use regular expression to match exact string

Given a string below:
String s = "sschk##123456sschk##123456gme##100&200&300&1,2,3,4,5$6,7,8,9,0sschk##123456";
I apply the pattern sschk##\\d+? or sschk##.+? because I want to find every sschk##123456 and replace it with an empty string. Please note that the number after sschk## might be different each time, for example sschk##321321.
But I only got
[sschk##1, sschk##1, sschk##1]
What pattern should I apply to get exact each sschk##123456, so that I can do find and replace later.
Thanks a lot.
The problem with your regex is the "?" marker, which makes the preceding "+" lazy. So "sschk##\d+?" means "the string sschk## followed by one or more digits, matching as few digits as possible", which is why you only get one digit. Removing the "?" gives "the string sschk## followed by one or more digits, matching as many digits as possible".
Your regex might look like this: sschk##\\d{6}, which matches the string "sschk##" followed by exactly 6 digits. If you want to match "sschk##" followed by a variable number of digits, but no more than 6, use sschk##\\d{1,6}. If you need to match any number of digits after "sschk##", use sschk##\\d+
I think I got it done.
Just apply the pattern like this
(sschk##\\d+)
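The lazy versus greedy behaviour is easy to reproduce. The sketch below uses Python rather than the Java from the question (an assumption for illustration); with Python raw strings a single backslash is enough, whereas a Java string literal needs \\d.

```python
import re

s = ("sschk##123456sschk##123456gme##100&200&300&1,2,3,4,5$6,7,8,9,0"
     "sschk##123456")

# Lazy quantifier: stops after the first digit.
lazy = re.findall(r"sschk##\d+?", s)
# Greedy quantifier: takes every digit that follows.
greedy = re.findall(r"sschk##\d+", s)
# Replace every full token with an empty string.
cleaned = re.sub(r"sschk##\d+", "", s)

print(lazy)     # ['sschk##1', 'sschk##1', 'sschk##1']
print(greedy)   # ['sschk##123456', 'sschk##123456', 'sschk##123456']
print(cleaned)  # gme##100&200&300&1,2,3,4,5$6,7,8,9,0
```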

How can I make a regex match the next 4 characters immediately after finding something?

I'm trying to write a regex to sift through a sizable amount of data. After it finds something, I want it to match the next 4 characters whatever they are. How can I do this?
/match long stuff here..../
The . in a regex is "Any character." Four of them gets you four characters. You could also do:
/match long stuff here.{4}/
This may depend on what language you are writing your regex in.
The expression .... matches any four characters. Append that to your pattern, and put parentheses around it so that whatever those characters are will be captured.
For example:
[Hh]ello [Ww]orld(....)
Look at this example: I want to match an IP and the next 4 characters after it. I have a regex
(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(.{4})
If you match that against the string 192.167.45.45xabc, the first part (?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) will match the IP and the last part (.{4}) will match xabc. (I added ?: at the beginning to make the first block non-capturing. If you want to capture the IP too, just remove the ?:)
I hope this helps
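Both examples from this thread can be tried in a few lines of Python (the language is an assumption, since the question does not name one). The greeting string is invented for the sketch.

```python
import re

# IP address followed by the next four characters, as in the answer above.
ip_text = "192.167.45.45xabc"
ip_match = re.search(r"(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(.{4})", ip_text)
after_ip = ip_match.group(1)  # only group: the four trailing characters

# Four dots instead of .{4}; the input string here is made up.
greeting = "hello World!?#1 and more"
after_greeting = re.search(r"[Hh]ello [Ww]orld(....)", greeting).group(1)

print(after_ip)        # xabc
print(after_greeting)  # !?#1
```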