I have a huge CSV list that needs to be broken up into smaller pieces (say, groups of 100 values each). How do I match 100 lines? The following does not work:
(^.*$){100}
If you must, you can use (flags: multi-line, not global):
(^.*[\r\n]+){100}
But, realistically, using regex to find lines probably is the worst-performing method you could come up with. Avoid.
You don't need regex for this. Your language should have other tools for it, and even if it has none, simple string processing is enough to get those lines.
However this is the regular expression that should match 100 lines:
/([^\n]+\n){100}/
But you really shouldn't use that; it's just to show how to do such a task if it is ever needed. It searches for one or more non-newline characters ([^\n]+) followed by a newline (\n), repeated {100} times. Note that, because of the +, it will not match across empty lines.
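For comparison, here is a minimal sketch of the "simple string processing" route in Python. The function name chunk_lines and the sample data are made up for illustration; the original question does not say which language is in use.

```python
from itertools import islice

def chunk_lines(text, size=100):
    """Yield successive lists of at most `size` lines from `text`."""
    lines = iter(text.splitlines())
    while True:
        chunk = list(islice(lines, size))
        if not chunk:
            return
        yield chunk

# Hypothetical input: 250 values, one per line.
sample = "\n".join(f"value{i}" for i in range(250))
chunks = list(chunk_lines(sample))
print([len(c) for c in chunks])  # [100, 100, 50]
```

The last group is simply shorter when the line count is not a multiple of the group size.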
I am wondering if there is a better approach than what I am currently taking to parse this file. I have a string that is in the general format of:
[Chunk of text]
--------------------
[Another chunk of text]
(There can be multiple chunks of text with the same separator between them)
I am trying to parse the chunks of text into elements of a list, which I can do with data.split('-'*20) [in this case], however if there are not exactly 20 hyphens the split will not work as intended. I have been playing around with regex however am currently unsure of a proper regex that could be used.
Are there any better methods that I should use in this situation, or is there a regex I should use as opposed to the .split() method?
You want a regex split. I'm not python-literate, but I found the function in the official 2.7.10 documentation, and modified to your case:
>>> re.split('\n\-{4,}\n', input)
4 is the minimum number of dashes you want to match.
\n are the newlines before and after. You probably don't want those in your text.
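A short runnable sketch of that split, assuming Python 3 and a raw string for the pattern. The sample text is invented; note that a lone hyphen inside a chunk is left alone.

```python
import re

# Hypothetical sample: separators of different lengths, plus a hyphen in the text.
data = (
    "first chunk\nstill first\n"
    "--------------------\n"
    "second chunk\n"
    "-----\n"
    "third - with a hyphen inside"
)

# Split on a whole line made of four or more hyphens.
chunks = re.split(r"\n-{4,}\n", data)
print(chunks)
# ['first chunk\nstill first', 'second chunk', 'third - with a hyphen inside']
```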
I would try to use re.split() with the regex --+ which means:
- - one hyphen
-+ - one or more hyphens
... this way it would not match a single hyphen, but everything more than one, alternatively you could use -{2,} which means two or more.
I searched and found some apparently similar questions, but none of them quite matched.
I often find myself needing to replace leading 4-space indentations with tabs. I always do this with RegEx ^(\t*) {4}, replacing with $1\t. And then I just do multiple passes to catch nested indents. It works, it's easy. But I'm wondering, is it possible to write a RegEx that can do this in one pass (to handle nested indents)?
EDIT
Apologies for lack of input/output examples, I was in a hurry. Here's an example, let s mean space and t mean tab:
SMA
ssssRTP
ssssssssATR
ssssssssOLN
ssssOWH
ssssERE
TOGO
Output:
SMA
tRTP
ttATR
ttOLN
tOWH
tERE
TOGO
Essentially, the RegEx would need to allow for arbitrarily deeply nested chunks of 4 spaces. It does not need to allow for tabs following spaces in the initial input.
PCRE
(^\t*|\G) {4} replace with $1\t or (^|\G)( {4}|\t) replace with \t. You should use multiline mode.
How this works:
^\t*: matches the start of a line (in multiline mode) followed by any number of tabs.
\G: matches at the position where the previous match ended.
A space followed by {4}: matches exactly four spaces.
So this regular expression matches four spaces at the start of a line (after any tabs already there), or four spaces immediately following the four spaces matched by the previous application of this same expression.
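If your engine has no \G (as far as I know, Python's built-in re module does not), a replacement callback gets the same one-pass result. This is a different technique from the \G pattern above, not a translation of it: it matches the whole leading run of 4-space groups and tabs, then rewrites it. The function name spaces_to_tabs is made up.

```python
import re

# The whole leading run of "four spaces" or "tab" units on each line.
LEADING = re.compile(r"^(?: {4}|\t)+", re.MULTILINE)

def spaces_to_tabs(text):
    """Replace every leading group of four spaces with one tab, in one pass."""
    return LEADING.sub(lambda m: m.group().replace(" " * 4, "\t"), text)

# The example from the question, with s written out as real spaces.
src = "SMA\n    RTP\n        ATR\n        OLN\n    OWH\n    ERE\nTOGO"
out = spaces_to_tabs(src)
print(repr(out))
# 'SMA\n\tRTP\n\t\tATR\n\t\tOLN\n\tOWH\n\tERE\nTOGO'
```

Leftover spaces that do not fill a group of four (for example six leading spaces) stay as they are after the converted part.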
Tested this with .NET's regex engine. JavaScript's (at least Mozilla's) won't work, though; it relies on lookbehind, which isn't available. PCRE wants fixed-length lookbehinds, so this won't work there either, unfortunately.
(?<=^( {4}|\t)*) {4}
Basic idea is to match four spaces preceded by the beginning of a line plus all the spots where a previous match would naturally go. Since replacement is done atomically, there's no chance of missing such a spot; all such matches are gathered at once. Then make sure you're using Multiline flag and replace with a single tab character and you're good to go.
Test data, which is just random pseudocode in a vaguely Pythonesque style:
def a:
return true
# comment with embedded spaces etc.
I have a file containing many lines of text, and I want to match only those lines that contain a number of words. All words must be present in the line, but they can come in any order.
So if we want to match one, two, three, the first 2 lines below would be matched:
three one four two <-- match
four two one three <-- match
one two four five
three three three
Can this be done using QRegExp (without splitting the text and testing each line separately for each word)?
Yes it is possible. Use a lookahead. That will check the following parts of the subject string, without actually consuming them. That means after the lookahead is finished the regex engine will jump back to where it started and you can run another lookahead (of course in this case, you use it from the beginning of the string). Try this:
^(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)[^\r\n]*$
The negated character classes [^\r\n] make sure that we can never look past the end of the line. Because the lookaheads don't actually consume anything for the match, we add the [^\r\n]* at the end (after the lookaheads) and $ for the end of the line. In fact, you could leave out the $, due to greediness of *, but I think it makes the meaning of the expression a bit more apparent.
Make sure to use this regex with multi-line mode (so that ^ and $ match the beginning of a line).
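To try the pattern outside Qt, here is a small sketch using Python's re module with its MULTILINE flag; the input is the four sample lines from the question, without the "<-- match" annotations.

```python
import re

ALL_WORDS = re.compile(
    r"^(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)[^\r\n]*$",
    re.MULTILINE,
)

text = (
    "three one four two\n"
    "four two one three\n"
    "one two four five\n"
    "three three three"
)

print(ALL_WORDS.findall(text))
# ['three one four two', 'four two one three']
```

Be aware that the words are matched as plain substrings, so a line containing "someone" would also satisfy the "one" lookahead; add \b around each word if that matters.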
EDIT:
Sorry, QRegExp apparently does not support multi-line mode m:
QRegExp does not have an equivalent to Perl's /m option, but this can be emulated in various ways for example by splitting the input into lines or by looping with a regexp that searches for newlines.
It even recommends splitting the string into lines, which is what you want to avoid.
Since QRegExp also does not support lookbehinds (which would help emulating m), other solutions are a bit more tricky. You could go with
(?:^|\r|\n)(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)([^\r\n]*)
Then the line you want should be in capturing group 1. But I think splitting the string into lines might make for more readable code than this.
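The multi-line-free variant can be checked the same way. This sketch uses Python's re without any flags (so ^ only matches at the very start of the string), which I am assuming is close enough to the QRegExp situation to illustrate the idea; it is not QRegExp itself.

```python
import re

# No MULTILINE flag: line starts are found via "start of string, \r or \n".
NO_MULTILINE = re.compile(
    r"(?:^|\r|\n)(?=[^\r\n]*one)(?=[^\r\n]*two)(?=[^\r\n]*three)([^\r\n]*)"
)

text = (
    "three one four two\n"
    "four two one three\n"
    "one two four five\n"
    "three three three"
)

# findall returns the content of capturing group 1 for each match.
print(NO_MULTILINE.findall(text))
# ['three one four two', 'four two one three']
```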
You can use the MultilineOption PatternOption from the new Qt5 QRegularExpression like:
QRegularExpression("\\w+", QRegularExpression::MultilineOption)
I'm trying to run a unix regEXP on every log file in a 1.12 GB directory, then replace the matched pattern with ''. Test run on a 4 meg file took about 10 minutes, but worked. Obviously something is damaging performance by several orders of magnitude.
UPDATE: I am noticing that searching for ^(155[0-2]).*$ takes ~7 seconds in a 5.6 MB file with 77 matches. Adding the Negative Lookahead Assertion, ?!, so that the regExp becomes ^(?!155[0-2]).*$ is causing it to take at least 5-10 minutes; granted, there will be thousands and thousands of matches.
Should the negative lookahead assertion be extremely detrimental to performance when there are many matches?
If you can get rid of that .* at the beginning it would help. What can be before it, just whitespace? If so, try:
^(?!\s*155[0-2][0-9]{4}\s).*$
If it really can be anything, try making it non-greedy:
^(?!.*?155[0-2][0-9]{4}\s).*$
Note: in both examples, I removed the second .*, since the third one would match the same thing as well.
It helps to think about what the regex engine will actually be doing.
Match ^ (beginning of line). No problem.
Try to match the negative look-ahead assertion
Grab as much as possible with .*. This means it grabs the entire line.
Is the next character 1? If not, make the .* match one fewer character and repeat until it does match a 1.
You can see that this means for every line that doesn't match, it will backtrack through the entire line. Now, if you just use \s* at the beginning, then that will only grab whitespace, not the entire line. If it really can be anything, .*? will be faster on lines that do match the 155 pattern, and it will be about the same on lines that don't. (On lines that don't match, it will keep growing the .* until it has grabbed the whole line.)
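One thing to keep in mind is that the two suggested patterns are not interchangeable: the \s* version only rejects lines where the number comes first (after optional whitespace), while the .*? version rejects it anywhere in the line. A quick sketch in Python, with invented log lines:

```python
import re

# Lookahead that only skips leading whitespace before the number.
LEADING_WS = re.compile(r"^(?!\s*155[0-2][0-9]{4}\s).*$", re.MULTILINE)
# Lookahead that lazily scans the whole line for the number.
ANYWHERE = re.compile(r"^(?!.*?155[0-2][0-9]{4}\s).*$", re.MULTILINE)

# Hypothetical log lines, no blank lines and no trailing newline.
log = "\n".join([
    "15501234 start",
    "  15519999 indented",
    "note 15521111 mid-line",
    "unrelated entry",
])

print(LEADING_WS.findall(log))  # ['note 15521111 mid-line', 'unrelated entry']
print(ANYWHERE.findall(log))    # ['unrelated entry']
```

So pick the cheaper \s* form only if the number really is always at the start of the line.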
Basically: The regex implementation you are using is non-linear and can only deal with a subset of the regular expression language with any efficiency. See my question about a regex implementation that can handle machine generated regexes efficiently for more background.
If you can select another implementation, you're in luck; back when I was looking these were scarce. Two reasonable options are RE2, and TRE, but both are libraries, not standalone executables.
Another option you have is to use the unix utility (grep?) you've used in the past; grep certainly has a windows port as do many other unix utilities.
I'm trying to get a regex that will match:
somefile_1.txt
somefile_2.txt
somefile_{anything}.txt
but not match:
somefile_16.txt
I tried
somefile_[^(16)].txt
with no luck (it includes even the "16" record)
Some regex libraries allow lookahead:
somefile_(?!16\.txt$).*?\.txt
Otherwise, you can still use multiple character classes:
somefile_([^1].|1[^6]|.|.{3,})\.txt
or, to achieve maximum portability:
somefile_([^1].|1[^6]|.|....*)\.txt
[^(16)] means: Match any single character except parentheses, 1, and 6.
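A quick way to convince yourself that the brackets define a set of single characters rather than the string "16" (Python sketch, file names invented):

```python
import re

# [^(16)] is one character that is not "(", "1", "6" or ")".
pattern = re.compile(r"somefile_[^(16)].txt")

names = ["somefile_1.txt", "somefile_2.txt", "somefile_6.txt", "somefile_(.txt"]
matched = [n for n in names if pattern.fullmatch(n)]
print(matched)  # ['somefile_2.txt']
```

Note that somefile_1.txt is rejected as well, which is not what the question asked for.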
The best solution has already been mentioned:
somefile_(?!16\.txt$).*\.txt
This works, and is greedy enough to take anything coming at it on the same line. If you know, however, that you want a valid file name, I'd suggest also limiting invalid characters:
somefile_(?!16)[^?%*:|"<>]*\.txt
If you're working with a regex engine that does not support lookahead, you'll have to consider how to make up that !16. You can split files into two groups, those that start with 1, and aren't followed by 6, and those that start with anything else:
somefile_(1[^6]|[^1]).*\.txt
If you want to allow somefile_16_stuff.txt but NOT somefile_16.txt, these regexes above are not enough. You'll need to set your limit differently:
somefile_(16.|1[^6]|[^1]).*\.txt
Combine this all, and you end up with two possibilities, one which blocks out the single instance (somefile_16.txt), and one which blocks out all families (somefile_16*.txt). I personally think you prefer the first one:
somefile_((16[^?%*:|"<>]|1[^6?%*:|"<>]|[^1?%*:|"<>])[^?%*:|"<>]*|1)\.txt
somefile_((1[^6?%*:|"<>]|[^1?%*:|"<>])[^?%*:|"<>]*|1)\.txt
Here are the same two patterns without the special-character exclusions, so they are easier to read:
somefile_((16.|1[^6]|[^1]).*|1)\.txt
somefile_((1[^6]|[^1]).*|1)\.txt
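A small check of the two readable, lookahead-free patterns in Python. The file names are made up, and I am using fullmatch so the whole name has to fit the pattern:

```python
import re

# Blocks only somefile_16.txt itself.
SINGLE = re.compile(r"somefile_((16.|1[^6]|[^1]).*|1)\.txt")
# Blocks every name that starts with somefile_16.
FAMILY = re.compile(r"somefile_((1[^6]|[^1]).*|1)\.txt")

names = [
    "somefile_1.txt",
    "somefile_2.txt",
    "somefile_16.txt",
    "somefile_16_stuff.txt",
    "somefile_1666.txt",
]

single = [n for n in names if SINGLE.fullmatch(n)]
family = [n for n in names if FAMILY.fullmatch(n)]
print(single)
# ['somefile_1.txt', 'somefile_2.txt', 'somefile_16_stuff.txt', 'somefile_1666.txt']
print(family)
# ['somefile_1.txt', 'somefile_2.txt']
```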
To adhere strictly to your specification and be picky, you should rather use:
^somefile_(?!16\.txt$).*\.txt$
so that somefile_1666.txt which is {anything} can be matched ;)
but sometimes it is just more readable to use...:
ls | grep -e 'somefile_.*\.txt' | grep -v -e 'somefile_16\.txt'
somefile_(?!16).*\.txt
(?!16) means: Assert that it is impossible to match the regex "16" starting at that position.
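The difference between (?!16) and (?!16\.txt$) is easy to see side by side. A sketch in Python with invented file names:

```python
import re

# Rejects only the exact name somefile_16.txt.
EXACT = re.compile(r"^somefile_(?!16\.txt$).*\.txt$")
# Rejects anything whose variable part starts with 16.
PREFIX = re.compile(r"somefile_(?!16).*\.txt")

names = [
    "somefile_1.txt",
    "somefile_2.txt",
    "somefile_16.txt",
    "somefile_1666.txt",
    "somefile_16_stuff.txt",
]

exact = [n for n in names if EXACT.fullmatch(n)]
prefix = [n for n in names if PREFIX.fullmatch(n)]
print(exact)
# ['somefile_1.txt', 'somefile_2.txt', 'somefile_1666.txt', 'somefile_16_stuff.txt']
print(prefix)
# ['somefile_1.txt', 'somefile_2.txt']
```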
Sometimes it's just easier to use two regular expressions. First look for everything you want, then ignore everything you don't. I do this all the time on the command line where I pipe a regex that gets a superset into another regex that ignores stuff I don't want.
If the goal is to get the job done rather than find the perfect regex, consider that approach. It's often much easier to write and understand than a regex that makes use of exotic features.
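The same two-step idea outside the shell might look like this in Python. The helper name select is made up, and I am matching whole names with fullmatch, which is slightly stricter than what grep does by default:

```python
import re

WANTED = re.compile(r"somefile_.*\.txt")      # the superset
UNWANTED = re.compile(r"somefile_16\.txt")    # what to throw away again

def select(names):
    """Keep names that fit WANTED but not UNWANTED."""
    return [n for n in names
            if WANTED.fullmatch(n) and not UNWANTED.fullmatch(n)]

names = ["somefile_1.txt", "somefile_16.txt", "somefile_1666.txt", "other.txt"]
print(select(names))  # ['somefile_1.txt', 'somefile_1666.txt']
```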
Without using lookahead
somefile_(|.|[^1].+|10|11|12|13|14|15|17|18|19|.{3,})\.txt
Read it like: somefile_ followed by either:
nothing.
one character.
any one character except 1, followed by one or more further characters.
one of 10 .. 19; note that 16 has been left out.
three or more characters.
and finally followed by .txt.
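A quick check of this alternation in Python, with the dot before txt escaped and invented file names:

```python
import re

NO_LOOKAHEAD = re.compile(
    r"somefile_(|.|[^1].+|10|11|12|13|14|15|17|18|19|.{3,})\.txt"
)

names = [
    "somefile_.txt",
    "somefile_1.txt",
    "somefile_12.txt",
    "somefile_16.txt",
    "somefile_23.txt",
    "somefile_166.txt",
]

matched = [n for n in names if NO_LOOKAHEAD.fullmatch(n)]
print(matched)
# ['somefile_.txt', 'somefile_1.txt', 'somefile_12.txt', 'somefile_23.txt', 'somefile_166.txt']
```

Only somefile_16.txt falls through every alternative.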