Select files between specified range with regex - regex

I have a folder with 100 folders, named like:
parent_folder/05/01/
parent_folder/05/02/
parent_folder/05/03/
parent_folder/05/04/
...
parent_folder/05/29/
parent_folder/05/30/
How can I specify a path, with regex, that would select only the contents of folders 01 to 10, then 11 to 20 and, finally, 21 to 30?
I am trying:
"parent_folder/05/[1-10]*/*"
but it also selects 11, 12, ... all the way to 19.
EDIT: I want to read a large dataset in pyspark by 10-day intervals, and all suggested answers, so far, seem to fail.

If you want the "10" to be grouped with your 01...09 set, you can use something like this:
parent_folder\/05\/(0[1-9]|10)\/
then, for your 11...20 set,
parent_folder\/05\/(1[1-9]|20)\/
and so on.
You can try these regexps at the following link: https://regex101.com/r/cXAYbS/2
In Python, write the pattern as a raw string:
regex = r"parent_folder\/05\/(1[1-9]|20)\/"
The link above has a Python code generator, where you can borrow some code:
https://regex101.com/r/cXAYbS/2/codegen?language=python

How about this:
parent_folder/05/(?:0[1-9]|10)/
The '?:' marks the group as non-capturing.
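For the PySpark case mentioned in the question, one option is to skip pattern matching in the path and build each 10-day folder list explicitly. The following is only a sketch under assumptions: the helper names `day_pattern` and `day_folders` are hypothetical, and it assumes the day folders are always zero-padded to two digits, as in the question.

```python
import re

PREFIX = "parent_folder/05/"  # folder layout taken from the question


def day_pattern(first, last, prefix=PREFIX):
    """Compile a regex matching paths under day folders first..last (inclusive)."""
    days = "|".join(f"{day:02d}" for day in range(first, last + 1))
    return re.compile(rf"^{re.escape(prefix)}(?:{days})/")


def day_folders(first, last, prefix=PREFIX):
    """List the day folders first..last explicitly, one path per day."""
    return [f"{prefix}{day:02d}/" for day in range(first, last + 1)]


# Made-up file names, one per day folder, just to exercise the pattern.
paths = [f"{PREFIX}{day:02d}/part-0.csv" for day in range(1, 31)]
first_ten = [p for p in paths if day_pattern(1, 10).match(p)]
print(len(first_ten))
print(day_folders(11, 20)[:2])
```

If the reader you use accepts a list of paths (PySpark's DataFrame readers generally do, but check this against your Spark version), the result of `day_folders(1, 10)` could be passed to it directly, which avoids depending on how the path string is interpreted.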

Related

Using REGEXEXTRACT in an array, searching multiple columns

Can someone please tell me what I am doing wrong in this formula?
=ARRAYFORMULA(REGEXEXTRACT((A2:A&"")+(B2:B&"")+(C2:C&"")), "02(\d{14})37")
I'm trying to extract a 14-digit number that sits between 02 and 37 and that may be in column A, column B or column C.
I've also tried this, but the expected result shows on the first row only:
=ARRAYFORMULA(REGEXEXTRACT(textjoin(" ",true,A2:C),"02(\d{6,14})37"))
I'm really confuzzled.
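In Sheets, + adds numbers and does not join text (that is what & is for), so the first formula cannot work. Check each column separately and fall back to the next one with IFERROR, like this: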
=ARRAYFORMULA(IFERROR(IFERROR(IFERROR(IFERROR(
REGEXEXTRACT(A2:A&"", "02(\d{14})37"),
REGEXEXTRACT(B2:B&"", "02(\d{14})37")),
REGEXEXTRACT(C2:C&"", "02(\d{14})37")))))
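The nested IFERROR calls implement a simple fallback: try column A, then B, then C. As an illustration only (Python rather than Sheets, with a hypothetical helper `first_code` and made-up sample rows), the same logic can be sketched like this:

```python
import re

CODE = re.compile(r"02(\d{14})37")  # same pattern as in the formula


def first_code(row):
    """Return the first 14-digit code found in the row's cells, or '' if none."""
    for cell in row:
        match = CODE.search(str(cell))
        if match:
            return match.group(1)
    return ""


# Made-up rows: code in column A, code in column B, no code at all.
rows = [
    ["021234567890123437", "", ""],
    ["n/a", "x029876543210987637y", ""],
    ["", "", "nothing here"],
]
print([first_code(row) for row in rows])
```

Returning an empty string for a row without a match mirrors what the outermost IFERROR does in the formula.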

how to use regexp to select files in a specific order? - matlab

Let's say I have 14 files with names:
file_001.txt file_002.txt file_003.txt file_004.txt ... file_014.txt
I'm trying to write a regex that selects my files in a specific order. Assuming ls outputs:
file_001.txt file_002.txt file_003.txt ... file_014.txt
regexp(ls ,'file_0+([135]|[246])\.txt','match') gives me:
file_001.txt
file_002.txt
file_003.txt
file_004.txt
file_005.txt
file_006.txt
but what I'm aiming at is:
file_001.txt
file_003.txt
file_005.txt
file_002.txt
file_004.txt
file_006.txt
Regex is simply not the right tool for this.
You'll end up with an expression that looks like this:
file_([1-9][0-9]?|100)[1-5][5-8](12[1-9]|1[3-9][0-9]|[2-4][0-9]{2}|5[0-2][0-9])(73[89]|7[4-9][0-9]|8[0-9]{2}|9[0-8][0-9]|99[01])(9|[1-9][0-9]{1,2}|[1-7][0-9]{3}|80[0-9]{2}|81[01][0-9]|812[0-8])(83[4-9]|8[4-9][0-9]|9[0-9]{2}|[1-9][0-9]{3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])\.txt
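A regex only decides which names match; the order has to come from a sort. The sketch below shows the idea in Python rather than MATLAB, so treat it as an illustration of the approach and not a drop-in answer. The function name `odd_then_even` is hypothetical.

```python
import re


def odd_then_even(names):
    """Sort file names so odd-numbered files come first, then even-numbered ones."""
    def key(name):
        number = int(re.search(r"_(\d+)\.txt$", name).group(1))
        return (number % 2 == 0, number)  # False (odd) sorts before True (even)
    return sorted(names, key=key)


names = [f"file_{n:03d}.txt" for n in range(1, 7)]
for name in odd_then_even(names):
    print(name)
```

The sort key is a pair: the first element groups odd before even, and the second keeps the numeric order inside each group.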

Regex code how to filter all names that contain only numbers and end with .jpg and/or _number.jpg?

How to filter all names that consist of numbers and end with .jpg and/or _number.jpg?
Background info:
In SSIS 2008 I have a foreach loop that will store the filename into a variable for all jpg files. The enumerator configuration for Files is currently: *.jpg
This will handle all jpg files.
What pattern would make it handle only names like these:
3417761506233.jpg
3417761506233_1.jpg
5414233177487.jpg
5414233177487_1.jpg
5414233177487_14.jpg
but not names like:
abc.jpg
abc123.jpg
def.png
456.png
The numbers represent EAN codes by the way.
I thought about this code:
\d|_|.jpg
but SSIS returns an error stating there are no files that meet the criteria, even though the files are in the folder.
You could use a Script Task within the loop to do the regex filtering:
http://microsoft-ssis.blogspot.com/2012/04/regex-filter-for-foreach-loop.html
Or you could use a (free) Third Party Enumerator:
http://microsoft-ssis.blogspot.com/2012/04/custom-ssis-component-foreach-file.html
For that, you can use the following regex (note the escaped dot, so that it matches only a literal "."):
^\d+(_\d+)?\.jpg$
Demo: http://regex101.com/r/qC7oV3
^(\d+(?:_\d+)?\.jpg$)
DEMO --> http://regex101.com/r/dM9rJ7
Matches:
3417761506233.jpg
3417761506233_1.jpg
5414233177487.jpg
5414233177487_1.jpg
5414233177487_14.jpg
Excludes:
abc.jpg
abc123.jpg
def.png
456.png
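To check the pattern against the sample names outside SSIS, a short script is enough. This sketch uses Python's `re` module purely for verification; inside an SSIS Script Task the filtering would be written in the task's own language, and the name `EAN_JPG` is just a label chosen here.

```python
import re

# Digits, an optional _digits suffix, then a literal ".jpg".
EAN_JPG = re.compile(r"^\d+(?:_\d+)?\.jpg$")

names = [
    "3417761506233.jpg",
    "3417761506233_1.jpg",
    "5414233177487.jpg",
    "5414233177487_1.jpg",
    "5414233177487_14.jpg",
    "abc.jpg",
    "abc123.jpg",
    "def.png",
    "456.png",
]

kept = [name for name in names if EAN_JPG.match(name)]
for name in kept:
    print(name)
```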

Searching for Social security number using Lucene 4 regexp

I'm trying to use a Lucene 4 Regexp query to find social security numbers. If the field is analyzed using the StandardAnalyzer or the EnglishAnalyzer, is there still some way to match strings like 222-33-4444 or 222 33 4444?
As far as I can see, these analyzers tokenize the components of the SSN, so there is no way to match the three components consecutively. Ideally, I'd like 222 33 4444 to match something like "/[0-9]{3}/ /[0-9]{2}/ /[0-9]{4}/", but that doesn't seem to work, perhaps because phrase queries do not work with regexps. Any suggestions?
If you simply have a field of identifiers, or some such, use a StringField or some other untokenized field, in which case a plain RegexpQuery is simple enough to define.
If you are trying to pull them out of a full-text field, which must be tokenized (and I assume this is the case), you can use the SpanQuery API to construct the appropriate query:
SpanQuery span1 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{3}")));
SpanQuery span2 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{2}")));
SpanQuery span3 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{4}")));
// Slop 0 with inOrder = true: the three terms must be adjacent and in this order.
Query query = new SpanNearQuery(new SpanQuery[] {span1, span2, span3}, 0, true);
searcher.search(query, maxResults);
You can use the INTERVAL flag:
/<000-999>/ /<00-99>/ /<0000-9999>/
> INTERVAL
I do not know about Lucene, but this regex works:
'\d{3}[ \-]\d{2}[ \-]\d{4}'
It matches both:
222 33 4444
and
222-33-4444
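A quick way to confirm what this pattern accepts is to run it over a sample string. The sketch below uses Python's `re` module on raw text, so it says nothing about how an analyzed Lucene field behaves; the sample text is made up.

```python
import re

# The pattern from the answer, unchanged.
SSN = re.compile(r"\d{3}[ \-]\d{2}[ \-]\d{4}")

text = "ids: 222 33 4444 and 222-33-4444, but not 2223344444"
found = SSN.findall(text)
print(found)
```

Because each separator is matched independently, a mixed form such as 222 33-4444 is accepted as well; if that is unwanted, the first separator could be captured and reused with a backreference.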

Stacking related lines together in notepad++

Hi, I'm trying to use find and replace in Notepad++ with regular expressions to do the following:
I have two sets of lines
first set:
[c][eu][e]I37ANKCB[/e]
[c][eu][e]OIL8ZEPW[/e]
[c][eu][e]4OOEL75O[/e]
[c][eu][e]PPNW5FN4[/e]
[c][eu][e]E2BXCWUO[/e]
[c][eu][e]SD9UQNT8[/e]
[c][eu][e]E6BK6IGO[/e]
second set:
[u]7ubju2jvioks[u2]_261
[u]89j408tah1lz[u2]_262
[u]j673xnd49tq0[u2]_263
[u]dv73osmh1wzu[u2]_264
[u]twz3u4yiaeqr[u2]_265
[u]cuhtg6r71kud[u2]_266
[u]yts0ktvt9a3r[u2]_267
Now I want each line of the second set to be placed after the corresponding line of the first set, like this:
[c][eu][e]I37ANKCB[/e][u]7ubju2jvioks[u2]_261
[c][eu][e]OIL8ZEPW[/e][u]89j408tah1lz[u2]_262
[c][eu][e]4OOEL75O[/e][u]j673xnd49tq0[u2]_263
[c][eu][e]PPNW5FN4[/e][u]dv73osmh1wzu[u2]_264
[c][eu][e]E2BXCWUO[/e][u]twz3u4yiaeqr[u2]_265
[c][eu][e]SD9UQNT8[/e][u]cuhtg6r71kud[u2]_266
[c][eu][e]E6BK6IGO[/e][u]yts0ktvt9a3r[u2]_267
Any suggestions?
You can mark the second block in column mode using ALT and the left mouse button. Then just copy and paste it at the end of the first row.
There is no need for regex here, and it is not possible with regex anyway.
I would solve this via a simple script written in Python or Ruby or something equally quick. This works, for example:
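import os
# Look for both input files next to this script.
path = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(path, 'file1')) as file1:
    with open(os.path.join(path, 'file2')) as file2:
        # Pair line i of file1 with line i of file2.
        lines = zip(file1.readlines(), file2.readlines())
        # Strip the newline from the first part and keep the second part's.
        print(''.join(a.rstrip() + b for a, b in lines))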
Running it gives the correct result:
> python join.py
[c][eu][e]I37ANKCB[/e][u]7ubju2jvioks[u2]_261
[c][eu][e]OIL8ZEPW[/e][u]89j408tah1lz[u2]_262
[c][eu][e]4OOEL75O[/e][u]j673xnd49tq0[u2]_263
[c][eu][e]PPNW5FN4[/e][u]dv73osmh1wzu[u2]_264
[c][eu][e]E2BXCWUO[/e][u]twz3u4yiaeqr[u2]_265
[c][eu][e]SD9UQNT8[/e][u]cuhtg6r71kud[u2]_266
[c][eu][e]E6BK6IGO[/e][u]yts0ktvt9a3r[u2]_267
Customize to suit your needs.