Regular Expression for formatting a file - regex

My file has data in which each line starts with a specific pattern:
1000000179|abcd.....
1000000180|wedwedw...
1000000181|wnewedwed...
I've opened the file in Visual Studio and need a regular expression to find any line that does not begin with the correct sequence. In the example below, lines 3 and 4 are not valid. How can I isolate them using a regex?
1000000179|abcd.....
1000000180|wedwedw...
1000xyadaa|wnewedwed...
%dfgxyadaa|wnewedwed...

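Something as simple as ^[^0-9]{1,10}[^|].*$ detects lines that start with non-digit characters. It does not catch lines such as 1000xyadaa|... that begin with digits; where negative lookahead is available, ^(?!\d{10}\|).*$ finds every line that does not start with ten numerals and a pipe.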
If you just want to select just the first part of the line, then ^[^0-9]{1,10}[^|]
NB: you can replace [^0-9] with \D (case sensitive!) if you prefer that syntax, eg ^\D{1,10}[^|]
To reverse the logic (ie find the correct lines), use ^[0-9]{10}\|.*$ or ^\d{10}\|
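For engines that support negative lookahead, the "find the invalid lines" check can be sketched in Python. This is only an illustration and not part of the original answers: the pattern ^(?!\d{10}\|).*$ and the names BAD_LINE and sample are assumptions chosen here, and the sample lines are the ones from the question.

```python
import re

# Negative lookahead: a line matches only when it does NOT start
# with ten digits followed by a pipe.
BAD_LINE = re.compile(r"^(?!\d{10}\|).*$", re.MULTILINE)

sample = """1000000179|abcd.....
1000000180|wedwedw...
1000xyadaa|wnewedwed...
%dfgxyadaa|wnewedwed..."""

# findall returns the full text of every offending line.
bad_lines = BAD_LINE.findall(sample)
print(bad_lines)
```

With the question's four lines, only the third and fourth are reported.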
EDIT: For VS2005's search/replace "regular expressions":
To find lines that DO NOT start with 10 numerics followed by a pipe: ^~([0-9]^10\|)
To find lines that DO start with 10 numerics followed by a pipe: ^[0-9]^10\|
Note that the \d and \D syntax does not work in VS2005, as per KennethK.'s comment below. The equivalent of a single digit, i.e. [0-9], in VS regular expressions is :d.
Refer to http://msdn.microsoft.com/en-us/library/2k3te2cs(VS.80).aspx for the list of regular expressions available in VS2005.

If I understand what you are trying to find, try the following expression:
^~(1000000).*$
Where the ^, .*, and $ all function as in typical regex, and the ~(...) means "not match". So the overall intent of the pattern is to find lines that do not start with the string "1000000".


Matching an expression including arbitrary lines with regex in Vim

In a text file opened with Vim, I'm trying to match the occurrence of two strings, DRIVER_ACTIVITY and DriverGroup, with an arbitrary amount of lines in between:
2013-07-01 05:06:23,801 DRIVER_ACTIVITY
2013-07-01 05:06:23,804 text
2013-07-01 05:06:23,804 more text
2013-07-01 05:06:23,805 DriverGroup
using:
/DRIVER_ACTIVITY(.*)DriverGroup/s
/DRIVER_ACTIVITY((.|\n|\r)*)DriverGroup
/\vDRIVER_ACTIVITY((.|\n|\r)*)DriverGroup
/DRIVER_ACTIVITY\[\S\s\]*DriverGroup
Nothing matches. How do I match all the lines/new lines?
If you want to use the more common (...) for grouping, you need to include the \v atom to switch Vim's regular expression syntax to "very magic"; else, it's \(...\). But for your case, Vim has a special atom that matches arbitrary characters including newlines: \_., like this:
/DRIVER_ACTIVITY\_.*DriverGroup
There's no way around learning Vim's different regular expression dialect; see :help pattern.
The \_s atom matches whitespace including newlines, so it can be alternated with . to cross line boundaries:
/DRIVER_ACTIVITY\(\_s\|.\)*DriverGroup
OK, I see the problem. In this sample file the third attempt matches, as do Ingo Karkat's and Explosion Pills' suggestions. I did not succeed because all of these patterns are greedy. In "the big file" the search keeps scanning and does not return a match within several seconds, even though the marker is located on the same line where the first match should appear. So the pattern does match; my patience was the problem :)
I made it non-greedy and it worked:
/DRIVER_ACTIVITY\_.\{-}DriverGroup
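The greedy versus non-greedy difference is not specific to Vim. As a rough analogue (Python's re module, not Vim's dialect), re.DOTALL plays the role of \_. and *? the role of the non-greedy quantifier. The log text below is made up for illustration by repeating the question's markers, and the variable names are assumptions.

```python
import re

# Hypothetical log with two DRIVER_ACTIVITY ... DriverGroup blocks.
log = (
    "2013-07-01 05:06:23,801 DRIVER_ACTIVITY\n"
    "2013-07-01 05:06:23,804 text\n"
    "2013-07-01 05:06:23,805 DriverGroup\n"
    "2013-07-01 05:06:24,000 DRIVER_ACTIVITY\n"
    "2013-07-01 05:06:24,002 DriverGroup\n"
)

# Greedy: runs to the LAST DriverGroup in the text.
greedy = re.search(r"DRIVER_ACTIVITY.*DriverGroup", log, re.DOTALL)
# Non-greedy: stops at the FIRST DriverGroup after the start marker.
lazy = re.search(r"DRIVER_ACTIVITY.*?DriverGroup", log, re.DOTALL)

print(greedy.group().count("DriverGroup"))
print(lazy.group().count("DriverGroup"))
```

The greedy match swallows both blocks, while the non-greedy one ends at the first closing marker, which is also why it returns sooner on a large file.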

How to extract file location using Regular Expressions(VB.NET)

I am facing a problem whereby I am given a string that contains a path to a file and the file's name and I only want to extract the path (without the file's name)
For example, I will receive something like
C:\Users\OopsD\Projects\test.acdbd
and from that string I want to extract only
C:\Users\OopsD\Projects
I was trying to create a RegEx to match a backslash followed by a word, followed by a dot followed by another word - this is to match the
\test.acdbd
part and replace it with empty string so that the final result is
C:\Users\OopsD\Projects
Can anyone, familiar with RegEx, help me on this one? Also, I will be using regular expressions quite a lot in the future. Is there a (free) program I can download to create regular expressions?
Are you really sure you need to use a regex for such a simple task? How about this:
Dim file As New IO.FileInfo("C:\Users\OopsD\Projects\test.acdbd")
MsgBox(file.Directory.FullName)
Regarding the free regex program, I would definitely recommend http://www.gskinner.com/RegExr/; I use it all the time. But always consider the alternatives before going the regex way.
The regex that you are looking for is:
\\[^\\]+$
Replacing its match with an empty string removes the file name and leaves the directory. Here,
\\ (escaped backslash): Matches one literal backslash, the separator in front of the file name.
[^\\] (negated character class): Matches any single character except a backslash. Inside square brackets a leading caret negates the class; outside a class the caret is an anchor that matches the position at the start of the string.
$ (dollar): Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break.
+ (plus): Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once.
More reference can be found out at this link.
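For comparison, here is a short Python sketch of both routes discussed in this thread: stripping the final backslash-delimited component with a regex, and letting a path library do the parsing. It is not the VB.NET code from the answers; the variable names are assumptions, and the path is the one from the question.

```python
import re
from pathlib import PureWindowsPath

full_path = r"C:\Users\OopsD\Projects\test.acdbd"

# Regex route: drop the last backslash and everything after it.
dir_by_regex = re.sub(r"\\[^\\]+$", "", full_path)

# Library route: PureWindowsPath parses Windows paths on any OS.
dir_by_lib = str(PureWindowsPath(full_path).parent)

print(dir_by_regex)
print(dir_by_lib)
```

Both lines print the directory part. The library route is usually the safer choice because it already knows about edge cases such as drive roots.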
Many regex tools are available. Some of them are:
www.gskinner.com/RegExr/
www.txt2re.com
Rubular (it is not just for Ruby)

Notepad++ Regex: Find all 1 and 2 letter words

I’m working with a text file with 200.000+ lines in Notepad++. Each line has only one word. I need to strip out all words that contain only one letter (e.g. I) and words that contain only two letters (e.g. as).
I thought I could just paste in a regular regex like [a-zA-Z]{1,2}, but it does not recognize anything (I’m trying to Mark them).
I’ve done a manual search and I know that words of that length exist, so it can only be my regex that’s wrong. Does anyone know how to do this in Notepad++?
Cheers,
- Mestika
If you want to remove only the words but leave the lines empty, this works:
^[a-zA-Z]{1,2}$
Replace this with an empty string. ^ and $ are anchors for the beginning and the end of a line (because Notepad++'s regexes work in multi-line mode).
If you want to remove the lines completely, search for this:
^[a-zA-Z]{1,2}\r\n
And replace with an empty string. However, this won't work before Notepad++ 6, so make sure yours is up-to-date.
Note that you will have to replace \r\n with the specific line-endings of your file!
As Tim Pietzker suggested, a platform independent solution that also removes empty lines would be:
^[a-zA-Z]{1,2}[\r\n]+
A platform-independent solution that does not remove empty lines but only those with one or two letters would be:
^[a-zA-Z]{1,2}(\r\n?|\n)
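Outside Notepad++, the last pattern can be tried in Python to see exactly which lines it removes. This is a small sketch under assumptions: the word list is invented for illustration, and re.MULTILINE is needed so that ^ matches at every line start, as it does in Notepad++.

```python
import re

# Hypothetical word list: one word per line, plus one empty line.
words = "I\nas\ncat\nto\nhouse\n\nab\nend\n"

# Remove each one- or two-letter word together with its line ending.
# Empty lines are left untouched.
cleaned = re.sub(r"^[a-zA-Z]{1,2}(\r\n?|\n)", "", words, flags=re.MULTILINE)

print(repr(cleaned))
```

Only cat, house, the empty line, and end survive; the short words disappear along with their line endings.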
I don't use Notepad++, but my guess is that you have too many matches: your expression will match every run of one or two letters, even inside longer words. Try including word boundaries:
\b[a-zA-Z]{1,2}\b
The regex you specified should find runs of one or two characters (even in Notepad++'s Find dialog), but not in the way you'd think. Make sure the match starts at the beginning of the line and ends at the end of it, with ^ and $ respectively:
^[a-zA-Z]{1,2}$
Notepad++ version 6.0 introduced the PCRE engine, so if this doesn't work in your current version try updating to the most recent.
You seem to use the version of Notepad++ that doesn't support explicit quantifiers: that's why there's no match at all (as { and } are treated as literals, not special symbols).
The solution is to use their somewhat more lengthy replacement:
\w\w?
... but that's only part of the story, as this regex will match inside any word, and not just short words. To restrict it to whole lines, you need something like this:
^\w\w?$

regex to match strings not ending with a pattern?

I am trying to form a regular expression that will match strings that do NOT end with a DOT FOLLOWED BY A NUMBER.
eg.
abcd1
abcdf12
abcdf124
abcd1.0
abcd1.134
abcdf12.13
abcdf124.2
abcdf124.21
I want to match first three.
I tried modifying this post but it didn't work for me as the number may have variable length.
Can someone help?
You can use something like this:
^((?!\.[\d]+)[\w.])+$
It anchors at the start and end of a line. It basically says:
Anchor at the start of the line
DO NOT match the pattern .NUMBERS
Take every letter, digit, etc, unless we hit the pattern above
Anchor at the end of the line
So, this pattern matches this (no dot then number):
This.Is.Your.Pattern or This.Is.Your.Pattern2012
However it won't match this (dot before the number):
This.Is.Your.Pattern.2012
EDIT: In response to Wiseguy's comment, you can use this:
^((?!\.[\d]+$)[\w.])+$, which adds an end anchor after the number. The rejected suffix must therefore be a dot followed only by digits up to the end of the string (not that you specified that in your question).
If you can relax your restrictions a bit, you may try using this (extended) regular expression:
^[^.]*\.?[^0-9]*$
You may omit the anchoring metasymbols ^ and $ if you're using a function or tool that matches against the whole string.
Explanation: This regex allows any characters except a dot until an (optional) dot is found, after which only non-digit characters are allowed. It won't work for numbers in an improper format, as in abcd1...3 or abcd1.fdfd2. It also won't work correctly for some strings with multiple dots, like abcd.ab123cd.a (the problem description is a bit ambiguous).
Philosophical explanation: When using regular expressions, you often don't need to do exactly what the task seems to require, so even a simple regex will do the job. An abstract example: you have a file whose lines are either numbers or complicated names (without digits), and you want to separate the numbers from the names. Filtering on the first character is enough: grep '^[0-9]' selects the numeric lines, and grep -v '^[0-9]' selects the rest.
But if your task is more complex and requires validating the format or doing other fancy processing on the data, why not use a simple script (say, in awk, Python, Perl or another language)? Or a short hand-written function, if you're implementing a stand-alone application. Regexes are cool, but they are often not the right tool to use.
I would just use a simple negative look-behind anchored at the end:
.*(?<!\\.\\d+)$
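Engines differ in whether they accept a variable-length lookbehind such as the one above; Python's built-in re module, for instance, requires fixed-width lookbehind. As a hedged alternative, the same idea can be written as a negative lookahead anchored at the start. The pattern and the names below are assumptions for illustration; the test strings are the ones from the question.

```python
import re

# Reject any string that ends with a dot followed by one or more digits.
NO_DOT_NUMBER = re.compile(r"^(?!.*\.\d+$).*$")

candidates = [
    "abcd1", "abcdf12", "abcdf124",
    "abcd1.0", "abcd1.134", "abcdf12.13", "abcdf124.2", "abcdf124.21",
]

# Keep only the strings the pattern accepts.
matches = [s for s in candidates if NO_DOT_NUMBER.match(s)]
print(matches)
```

Only the first three strings are kept, which is the result the question asks for.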

Regular Expression Using the Dot-Matches-All Mode

Normally the . doesn't match a newline unless I tell the engine to do so with the (?s) flag. I tried this regexp in my editor's (UltraEdit v14.10) regexp engine using Perl-style regexp mode:
(?s).*i
The search text contains multiple lines and each line contains many 'i' characters.
I expect the above regexp means: search as many characters (because with the '?s' the . now matches anything including newline) as possible (because of the greediness for *) until reaching the character 'i'.
This should mean "from the first character to the last 'i' in the last sentence" (greediness should reach the last sentence, right?).
But with UltraEdit's test, it turns out to be "from the first character to the last 'i' in the first sentence that contains an i". Is this result correct? Did I make any wrong interpretation of my reg expression?
e.g. given this text
aaa
bbb
aiaiaiaiaa
bbbicicid
it is
aaa
bbb
aiaiaiai
matched. But I expect:
aaa
bbb
aiaiaiaiaa
bbbicici
Your regex is correct, and so are your expectations of its performance.
This is a long-known bug in UltraEdit's regex implementation, about which I have written to support repeatedly. As far as I know, it still hasn't been fixed. The problem appears to lie in the fact that UE's regex implementation is essentially line-based, and additional lines are taken into the match only if necessary. So .* will match greedily on the current line, but it will not cross a newline boundary if it doesn't have to in order to achieve a match.
There are some other subtle bugs with line endings. For example, lookbehind doesn't work across newlines, either.
Write to IDM support, or change to an editor with decent regex support. I did both.
Yes, you are right, this looks like a bug.
Your interpretation is correct if you are in Perl mode and not POSIX mode.
However, it should apply to POSIX as well.
Defining the modifier inline the way you do is rare, though.
Usually you provide a string with delimiters and put the modifier afterwards, like /.*i/s
But this doesn't matter, because your way is correct too. And if it weren't supported, it wouldn't match the first newline either.
So yes, this is definitely a bug in your program.
You're right that that regex should match the entire string (all 4 lines). My guess is that UltraEdit is attempting to do some sort of optimization by working line by line, and only accumulating new lines "when necessary".
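The expected behaviour is easy to confirm in another engine. The sketch below uses Python's re module as a stand-in for a Perl-style engine (an assumption; it is not UltraEdit's engine) and runs the question's pattern against the question's four lines.

```python
import re

# The four lines from the question, joined with newlines.
text = "aaa\nbbb\naiaiaiaiaa\nbbbicicid"

# (?s) lets . match newlines, so the greedy .* runs to the end of the
# text and then backtracks to the last 'i'.
match = re.search(r"(?s).*i", text)

print(repr(match.group()))
```

The match runs from the first character through the last i on the fourth line, which is what the asker expected.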