Remove all lines after a 4-digit number from a large number of .txt files - regex

I have files that are split into two blocks: the first half contains the information I need, and the second half always starts with a 4-digit number (between 1400 and 1900). I need to delete the second block, hence my question:
How do I delete all lines in a file after (and including) the first 4-digit number?
I believe that should be doable using Notepad++ and regular expressions, but I'm new to regex and have no idea how...
I know it's a noob-ish question but nonetheless, help would be greatly appreciated.

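The regex 1(?:[4-8]\d\d|900)(?:.|[\r\n])+\z selects the text from the first number between 1400 and 1900 through the end of the file; replace the match with nothing to delete it. Note that the pattern is not anchored, so it also matches those four digits inside a longer number.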
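For a large number of files, the same cut can also be scripted. The following is a minimal Python sketch, not part of the original answer. It assumes each file fits in memory, uses Python's \Z where the regex above uses \z, and adds \b word boundaries (an assumption on my part) so that digits inside a longer number are not treated as the marker. The names keep_first_block and truncate_files are hypothetical.

```python
import re
from pathlib import Path

# Same number range as the regex above (1400-1900). The \b boundaries are an
# added assumption so that e.g. "21500" is not treated as the marker.
CUTOFF = re.compile(r"\b1(?:[4-8]\d\d|900)\b.*\Z", re.DOTALL)


def keep_first_block(text):
    """Return text with everything from the first 1400-1900 number onward removed."""
    return CUTOFF.sub("", text, count=1)


def truncate_files(folder):
    """Rewrite every .txt file in folder in place (hypothetical helper)."""
    for path in Path(folder).glob("*.txt"):
        original = path.read_text(encoding="utf-8")
        path.write_text(keep_first_block(original), encoding="utf-8")
```

Since truncate_files overwrites the files, it is probably worth trying keep_first_block on a copy of one file first.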

regex search window size

I have a large document that I am running a regex on. Below is an example of a similar expression:
(?=( aExample| bExample)(?=.*(XX))(?=.*(P1)))
This works most of the time, but sometimes, due to other text within the document, the condition is met by looking across the entire document. For example, there might be 10 characters between "aExample" and "XX", but 1,000 characters between "XX" and "P1". I would like to constrain the expression to N characters (let's say 50 for the sake of the example) so that the regex is a little more conservative. How can I reduce the size of the regex's window to N characters instead of the entire string/document? Any help is appreciated. Thanks!
(?=( aExample| bExample)(?=.{1,50}(XX))(?=.{1,50}(P1)))
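You want to limit how many characters each . may match, so replace .* with a bounded quantifier in braces, such as .{1,50}. Note that both windows are measured from the end of aExample/bExample, not from XX.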
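To see the effect of the bounded window, here is a small Python sketch. The tokens aExample, XX and P1 come from the question; the function name find_windowed and the sample strings in the usage note are made up for illustration. Without re.DOTALL the dot does not cross line breaks, which may or may not be what you want.

```python
import re


def find_windowed(text, window=50):
    """Return the captured groups for each position where XX and P1 both
    appear within `window` characters after aExample/bExample."""
    pattern = re.compile(
        r"(?=( aExample| bExample)(?=.{1,%d}(XX))(?=.{1,%d}(P1)))" % (window, window)
    )
    return [m.groups() for m in pattern.finditer(text)]
```

For instance, find_windowed(" aExample foo XX bar P1") finds one hit, while a string with 1,000 filler characters between XX and P1 finds none.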

Regex (Python) data extraction - overlapping or incomplete results

I'm trying to extract data from some WHO codebooks that I've converted from PDF to text with the Python slate library.
The text I want to match starts with 2 digits, a dash, and 2 digits, followed by some text, and ends with "Q" + 1 or 2 digits and again "Q" + 1 or 2 digits:
17-17How old are you?Q1Q1
31-31During the past 30 days, how many times per day did you usually eat fruit, such as bananas, apples, oranges, dates, or any other fruits?Q7Q11
Sometimes those phrases end with a blank, and sometimes the next question starts immediately (here are three questions; observe Q4Q424-29 and Q5Q530-30):
20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q424-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q530-30During the past 30 days, how often did you go hungry because there was not enough food in your home?Q6Q7
With
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(\d)*?
I get pretty close, but I'm missing the second digit when the second "Q" has two digits.
I've tried to add a negative lookahead
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d((\d)(?!\d\d-))
to exclude the start of the pattern with two digits and a dash.
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d{1,2}
includes the second digit of the "Q" but generates overlapping results e.g. at Q4Q424-29 where the first string ends with Q4Q42 and the second string starts with 4-29.
The regex with parts of the original sample text is here: https://regex101.com/r/d9Dlga/2/
Any suggestions on how to extract the correct strings, like:
17-17How old are you?Q1Q1
20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q4
24-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q5
31-31During the past 30 days, how many times per day did you usually eat fruit, such as bananas, apples, oranges, dates, or any other fruits?Q7Q11
Thanks!
I see the problem now. New attempt that I think works:
\d{2}-\d{2}.+?Q\d{1,2}Q\d{1,2}(?!\d-\d{2})
I put a negative lookahead at the end to test if a new section has begun.
9 matches
Correctly grabs the full 2-digit endings
Demo
The following pattern should work:
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(\d(?!\d-))?
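Both suggested patterns can be checked against the run-together sample from the question. The following Python sketch is only a quick verification harness; the names LOOKAHEAD_END, OPTIONAL_DIGIT and extract are hypothetical.

```python
import re

# Run-together sample taken from the question.
SAMPLE = (
    "20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q4"
    "24-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q5"
    "30-30During the past 30 days, how often did you go hungry because there was "
    "not enough food in your home?Q6Q7"
)

# First answer: lazy body, negative lookahead after the final Q number.
LOOKAHEAD_END = re.compile(r"\d{2}-\d{2}.+?Q\d{1,2}Q\d{1,2}(?!\d-\d{2})")
# Second answer: the second digit is optional and rejected if a new
# "dd-dd" prefix would start with it.
OPTIONAL_DIGIT = re.compile(r"\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(\d(?!\d-))?")


def extract(pattern, text):
    """Return the full text of every match (group 0), ignoring capture groups."""
    return [m.group(0) for m in pattern.finditer(text)]


for item in extract(LOOKAHEAD_END, SAMPLE):
    print(item)
```

On this sample both patterns split the text into the same three questions, each ending in its own Q codes.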

How to get a count of the word sizes in a large amount of text?

I have a large amount of text - roughly 7000 words.
I would like to get a count of the word sizes, e.g. the count of 4-letter words and 6-letter words, using regex.
I am unsure how to go about this - my thought process so far would be to split the sentence into a String array, which would allow me to count each individual element's size. Is there an easier way to go about this using a regex? I am using Groovy for this task.
EDIT: So I did get this working using a normal array, but it was slightly messy. The final solution simply used Groovy's countBy() method coupled with a small amount of logic, for anyone who might come across a similar problem.
Don't forget the word boundary token \b. If you don't put it at both ends of a \w{n} token, then all words longer than n characters are also found. For a 4-character word use \b\w{4}\b; for a 6-character word use \b\w{6}\b. Here is a demo with 7000 words as the input string.
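Java implementation:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String dummy = "....."; // the input text
// \b on both sides so that only words of exactly six characters match
Pattern pattern = Pattern.compile("\\b\\w{6}\\b");
Matcher matcher = pattern.matcher(dummy);
int count = 0;
while (matcher.find()) {
    count++;
}
System.out.println(count);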
Read the file using any stream word by word and calculate their length. Store counters in an array and increment values after reading each word.
You could generate regexes for each size you want.
\b\w{6}\b would match each word with exactly 6 letters
\b\w{7}\b would match each word with exactly 7 letters
and so on...
So you could run one of these regexes on the text with the global flag enabled (finding every instance in the whole string). This gives you an array of every match, and the length of that array is the count for that word size.
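Rather than running one regex per size, all sizes can be tallied in a single pass. The sketch below is in Python rather than Groovy, purely as an illustration of the idea behind the countBy() approach mentioned in the edit; the function name word_length_counts is hypothetical, and "word" here means a run of \w characters, which includes digits and underscores.

```python
import re
from collections import Counter


def word_length_counts(text):
    """Map each word length to the number of words of that length."""
    words = re.findall(r"\b\w+\b", text)
    return Counter(len(word) for word in words)


counts = word_length_counts("The quick brown fox jumps over the lazy dog")
print(counts[4])  # number of 4-letter words
```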

RegEx to find numbers over certain value with commas, and another text value, on same line

I'm new to Regular Expressions, and I have been trying to figure out how to code this: I need to find numbers greater than 25000 where the same line also has the number " 19" somewhere on that line (that's a space then 19). The problem is that the numbers have commas in them. I tried a few options:
This finds lines with any numbers over 25000:
^.*(25,|26,|27,|28,|29,|30,|31,|32,|33,|34,|35,|36,|37,|38,|39,|40,|41,|42,|43,|44,|45,|46,|47,|48,|49,|50,|51,|52,|53,|54,|55,|56,|57,|58,|59,|60,|61,|62,|63,|64,|65,|66,|67,|68,|69,|70,|71,|72,|73,|74,|75,|76,|77,|78,|79,|80,|81,|82,|83,|84,|85,|86,|87,|88,|89,|90,|91,|92,|93,|94,|95,|96,|97,|98,|99,|100,|101,|102,|103,|104,|105,|106,|107,|108,|109,|110,|111,|112,|113,|114,|115,|116,|117,|118,|119,|120,|121,|122,|123,|124,).*$
This finds lines with both " 19" and 26 (but not with the comma after the 26):
^.*( 19.*26).*$
Any help is appreciated!
Numbers over 25000 can be represented as follows :
\d{6,}|2[5-9]\d{3}|[3-9]\d{4}
That is, in English:
numbers of 6 digits or more
numbers of 5 digits starting with 2 followed by a digit equal to or greater than 5
numbers of 5 digits starting with a digit greater than 2
So the complete regex would look like this:
.*(\d{6,}|2[5-9]\d{3,}|[3-9]\d{4,}).* 19.*
That is, such a number somewhere in the line, followed by " 19" later in the same line.
Here is a test run on regex101 for you to test with your data.
I also second the comment that this isn't a job for regular expressions, which as you can see work on characters rather than numbers.
I would try something like this:
^(([0-9,]*([3-9][0-9]|2[5-9]),?[0-9]{3})\s?)$
That should handle the numeric part. You didn't really explain if the " 19" would come before or after that, and what would delimit that from the numeric part, but just insert (\s19) wherever that bit needs to go.
example
Thanks everyone. The following RegEx worked for me:
^.* 19.*(25,|26,|27,|28,|29,|30,|31,|32,|33,|34,|35,|36,|37,|38,|39,|40,|41,|42,|43,|44,|45,|46,|47,|48,|49,|50,|51,|52,|53,|54,|55,|56,|57,|58,|59,|60,|61,|62,|63,|64,|65,|66,|67,|68,|69,|70,|71,|72,|73,|74,|75,|76,|77,|78,|79,|80,|81,|82,|83,|84,|85,|86,|87,|88,|89,|90,|91,|92,|93,|94,|95,|96,|97,|98,|99,|100,|101,|102,|103,|104,|105,|106,|107,|108,|109,|110,|111,|112,|113,|114,|115,|116,|117,|118,|119,|120,|121,|122,|123,|124,).*$
This finds lines that have " 19" first in the line and then a number greater than 25K later in the line, when the numbers have commas in them. I couldn't use the "number ranges" shortcut that was suggested because there are other numbers on the lines without commas that are over 25K that I don't want to flag. Maybe there's an easier way than my brute-force method, but if not, at least this works. Thanks again!
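As noted earlier in the thread, regular expressions work on characters rather than numbers. If a script is an option, one alternative is to let a regex find only the comma-formatted numbers and then compare them numerically. The Python sketch below is a hedged illustration, not what the poster used: the names COMMA_NUMBER and flag_line are hypothetical, it only considers numbers written with thousands separators (as the poster wanted), and it accepts " 19" anywhere on the line rather than only before the number.

```python
import re

# A number written with thousands separators, e.g. 26,500 or 1,250,000.
# The lookarounds keep it from matching a slice of a longer digit/comma run.
COMMA_NUMBER = re.compile(r"(?<![\d,])\d{1,3}(?:,\d{3})+(?![\d,])")


def flag_line(line, threshold=25000):
    """True if the line contains ' 19' and a comma-formatted number above threshold."""
    if " 19" not in line:
        return False
    values = (int(token.replace(",", "")) for token in COMMA_NUMBER.findall(line))
    return any(value > threshold for value in values)
```

Numbers without commas, such as 30000, are deliberately ignored, so they never cause a line to be flagged.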

Regex in dreamweaver and notepad++

I have a problem. I need to use regex in Notepad++, Dreamweaver, or some other editor to handle a large number of .html files.
I need to find all HTML files that contain the line below, but there is an important condition.
/myfolder/401(something)a.js
It must find files that contain the line above, but ONLY those files that have at least one digit in the middle:
/myfolder/401(at least one digit 0-9)a.js
It can contain letters, but there must be at least one digit somewhere between 401 and a.js.
If there are no digits between 401 and a.js, then skip it (don't mark that one).
For example:
/myfolder/401dhfgsadfdf1a.js
/myfolder/401d7sd7fdf8a.js
Those above should be marked as correct, but:
/myfolder/401dfdsfsdfsa.js
The one above should not be marked because it doesn't contain a single digit between 401 and a.js.
Is there any regex expert around here? Thanks in advance for any help.
Inside Notepad++ I ran this query in the Find dialog:
/myfolder/401.*\d.*a\.js
This matches text that starts with /myfolder/401, then contains anything with at least one digit, and ends with a.js.
With the following as my test data:
/myfolder/401dhfgsadfdf1a.js
/myfolder/401d7sd7fdf8a.js
/myfolder/401dfdsfsdfsa.js
and the results of "Find All in Current Document" were:
Search "/myfolder/401.*\d.*a\.js" (2 hits in 1 file)
new 2 (2 hits)
Line 1: /myfolder/401dhfgsadfdf1a.js
Line 2: /myfolder/401d7sd7fdf8a.js
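The same check can be reproduced outside the editor. The Python sketch below reuses the regex from this answer unchanged; the function names matching_lines and html_files_with_match are hypothetical, and the file-scanning helper assumes the .html files are readable as UTF-8. Because .* is greedy and unrestricted, a line with unrelated digits followed by another a.js later on could also match, so treat it as a starting point.

```python
import re
from pathlib import Path

# Regex taken from the answer above.
PATTERN = re.compile(r"/myfolder/401.*\d.*a\.js")


def matching_lines(text):
    """Return the lines of text that contain a match."""
    return [line for line in text.splitlines() if PATTERN.search(line)]


def html_files_with_match(folder):
    """Return the .html files under folder with at least one matching line."""
    hits = []
    for path in sorted(Path(folder).rglob("*.html")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if matching_lines(text):
            hits.append(path)
    return hits
```

Run on the three test lines above, matching_lines returns the first two and skips the one without a digit.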