REGEX - how to extract a specific number of rows from a text - regex

I need to find out how to extract a specific number of rows from a text( the number of rows that i want to extract would be variable).
In this case, i want to extract anything from 07/06/2021, up to SOLD FINAL ZI 1
TEXT
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccccccc
07/06/2021 P2P 00.00
T d r 0000 R A cc R A
r : aadr
REF. ------------------
P l p 00.00
P XX/XX/XXXX 0000000000 :00000000000 P R R
A B OO 0000000000 v e: 00.00 n 0000000000
c t 0.00 n
REF. ------------------
P2P 00.00
T d r 0000 R A c R A
rr : Saracie
REF. ------------------
P2P 00.00
T d r 0000 A. B c R A rr : Sanity
REF. ------------------
P l p 00.00
P XX/XX/XXXX 0000000000 00000000000 P R R
D OO 0000000000 V T: 00.00 n 0000000000 c
T 0.00 n
REF. ------------------
XX/XX/XXXX RULAJ ZI 1 3
SOLD FINAL ZI 1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccccccc
In regex, i start with \n(\d{2}/\d{2}/\d{4}) in order to get the data 07/06/2021, but i don't know how to extract the rest.
Thank you in advance!

Hello and welcome to stackoverflow,
your question might not solve your actual problem. Do you REALLY want to "extract a specific number of rows"? This might be a XYProblem.
I like the solution from MDR to extract everything up to SOLD FINAL:
^(\d{2}\/\d{2}\/\d{4})[\s\S]+SOLD FINAL.
I like this because I guess you know the word at the end and not the number of lines. But we can't tell.
Anyway to give you the answer to your question (as your actual problem might look different than we expect) you can use this regex:
^(\d{2}\/\d{2}\/\d{4}).*$(\n^.*$){n}
^ --> look at the beginning of a row
(\d{2}\/\d{2}\/\d{4}) --> your regex for the date
.*$ --> also take the rest of the line
(\n^.*$){n} --> take the next n lines
\n --> the line break
^ --> again: beginning of a new line
.* --> as much characters as needed to match the next (non greedy)
$ --> the end of a line
{n}--> the number of lines you want to extract (replace n ;) )

Related

Use a regular expression extract substring from data frame columns in R

I am fairly new to R so please go easy on me if this is a stupid question.
I have a dataframe called foo:
< head(foo)
Old.Clone.Name New.Clone.Name File
1 A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
2 B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
3 C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
4 D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
5 E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
6 F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd
I want to extract codes from the File column that match the regular expression (S[A-Z]{3}[0-9]{1,2}-[0-9]_02), to give me:
SAEE7-1_02
SADQ15-1_02
SAEC16-1_02
SAEJ6-1_02
SAED9-1_02
SAGP3-1_02
I then want to use these codes to search another directory for other files that contain the same code.
I fail, however, at the first hurdle and cannot extract the codes from that column of the data frame.
I have tried:
library('stringr')
str_extract(foo[3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = TRUE))
but this just returns [1] NA.
Am I simply missing something obvious? I look forward to cracking this with a bit of help from the community.
Hello if you are reading the data as a table file then foo[3] is a list and str_extract does not accept lists, only strings, then you should use lapply to extract the match of every element.
lapply(foo[3], function(x) str_extract(x, "[sS][a-zA-Z]{3}[0-9]{1,2}-[0-9]_02"))
Result:
[1] "SAEE7-1_02" "SADQ15-1_02" "SAEC16-1_02" "SAEJ6-1_02" "SAED9-1_02"
[6] "SAGP3-1_02"
str_extract(foo[3],"(?i)S[A-Z]{3}[0-9]{1,2}-[0-9]_02")
seems to work. Somehow, my R gave me
"Error in check_pattern(pattern, string) : could not find function "regex""
when using your original expression.
The following code will repeat what you asked (just copy and paste to your R console):
library(stringr)
foo = scan(what='')
Old.Clone.Name New.Clone.Name File
A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd
foo = matrix(foo,ncol=3,byrow=T)
colnames(foo)=foo[1,]
foo = foo[-1,]
foo
str_extract(foo[,3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = T))
The reason you get NULL is hidden: R stores entries by column, hence foo[3] is the 3rd row and 1st column of foo matrix/data frame. To quote the third column, you may need to use foo[,3]. or foo<-data.frame(foo); foo[[3]].

pattern matching in R using grepl

I have a dataframe dat like this
P pedigree cas
1 M rs2745406 T
2 M rs6939431 A
3 M SNP_DPB1_33156641 G
4 M SNP_DPB1_33156664_G P
5 M SNP_DPB1_33156664_A A
6 M SNP_DPB1_33156664_T A
I want to exclude all rows where the pedigree column starts with SNP_ and ends with either G, C, T, or A (_[GCTA]). In this case, this would be rows 4,5,6.
How can I achieve this in R? I have tried
multisnp <- which(grepl("^SNP_*_[GCTA]$", dat$pedigree)=="TRUE")
new_dat <- dat[-multisnp,]
My multisnp vector is empty, but I can't figure out how to fix it so that it matches the pattern I want. I think it is my wildcard * usage that is wrong.
You can use the following with .*? (match everything in non greedy way):
multisnp <- which(grepl("^SNP_.*?_[GCTA]$", dat$pedigree))
^^^
You can subset dat like this
new_dat <- dat[!grepl("^SNP_.*_[GCTA]$", dat$pedigree), ]
Regarding the code that you've tried, I'm not sure that grepl("^SNP_*_[GCTA]$") will complete without an error since you aren't passing in an x vector to grepl. See ?grepl for more info.

Remove Data from Address Line

I have the following address that I pulled from a database. I am trying to clear everything up until ST|AVE|BLVD. I am trying to get rid of 1ST or the random 1.
9999-1000 N CLARK ST 1 1
4567-5678 W BELMONT AVE
1200 N HAMLIN AVE 1ST 1
8220 W CERMAK RD 1ST
1240 W 69TH ST 1ST
7901 W ADDISON ST 1ST
So that it reads:
1. 9999-1000 N CLARK ST
2. 4567-5678 W BELMONT AVE
3. 1200 N HAMLIN AVE
4. 8220 W CERMAK RD
5. 1240 W 69TH ST
6. 7901 W ADDISON ST
You can try the following regex:
^(.*?)(\s*(?:ST|AVE|BLVD).*)$
Your data is in capturing group 1.
See example here.

Solr - grab previous/next X amount words from keywords

Is there a way to query for keywords and grab the previous x amount of words and the next x amount of words?
Example
(Searching for "Test")
Aa bb cc dd ee ff gg hh ii jj kk ll Test mm nn oo pp qq rr ss tt…
Where x was 5 would return
“hh ii jj kk ll Test mm nn oo pp qq rr ss”
With “Test” highlighted.
or
(Searching for "Test" AND/OR "Spam")
Aa bb cc dd ee ff gg hh ii jj kk ll Test mm nn Spam oo pp qq rr ss tt…
Where x was 5 would return
“hh ii jj kk ll Test mm nn Spam oo pp qq rr ss tt”
With “Test” and "Spam" highlighted.
Any help would be much appreciated. I've been looking into Regex but I'm quite clueless there. Here are the resources I've been using. Also, my contains $,. and other random punctuation (tried going down the isolate by sentences). Let's just assume spaces to separate.
http://lucidworks.lucidimagination.com/display/solr/Highlighting#Highlighting-UsingBoundaryScannerswiththeFastVectorHighlighter
http://wiki.apache.org/solr/HighlightingParameters/
Thanks!
Use the Highlighting tool - it will give you snippets of the matched document with the search terms italicized (in HTML). You can then home in on those markers (<em>) and then go backward and forward character by character until you accumulate five space characters.

Regular expression puzzle

This is not homework, but an old exam question. I am curious to see the answer.
We are given an alphabet S={0,1,2,3,4,5,6,7,8,9,+}. Define the language L as the set of strings w from this alphabet such that w is in L if:
a) w is a number such as 42 or w is the (finite) sum of numbers such as 34 + 16 or 34 + 2 + 10
and
b) The number represented by w is divisible by 3.
Write a regular expression (and a DFA) for L.
This should work:
^(?:0|(?:(?:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\
+)*[369]0*)*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:
\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[
258](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0
\+)*[147])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+)(?:\+(?:0|(?:(?
:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:\+?(?:0\+)*
[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[258](?:0*(?
:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])*
(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)
*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+))*$
It works by having three states representing the sum of the digits so far modulo 3. It disallows leading zeros on numbers, and plus signs at the start and end of the string, as well as two consecutive plus signs.
Generation of regular expression and test bed:
a = r'0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*'
b = r'a[147]'
c = r'a[258]'
r1 = '[369]|[147](?:bc)*(?:c|bb)|[258](?:cb)*(?:b|cc)'
r2 = '(?:0|(?:(?:' + r1 + ')0*)+)'
r3 = '^' + r2 + r'(?:\+' + r2 + ')*$'
r = r3.replace('b', b).replace('c', c).replace('a', a)
print r
# Test on 10000 examples.
import random, re
random.seed(1)
r = re.compile(r)
for _ in range(10000):
x = ''.join(random.choice('0123456789+') for j in range(random.randint(1,50)))
if re.search(r'(?:\+|^)(?:\+|0[0-9])|\+$', x):
valid = False
else:
valid = eval(x) % 3 == 0
result = re.match(r, x) is not None
if result != valid:
print 'Failed for ' + x
Note that my memory of DFA syntax is woefully out of date, so my answer is undoubtedly a little broken. Hopefully this gives you a general idea. I've chosen to ignore + completely. As AmirW states, abc+def and abcdef are the same for divisibility purposes.
Accept state is C.
A=1,4,7,BB,AC,CA
B=2,5,8,AA,BC,CB
C=0,3,6,9,AB,BA,CC
Notice that the above language uses all 9 possible ABC pairings. It will always end at either A,B,or C, and the fact that every variable use is paired means that each iteration of processing will shorten the string of variables.
Example:
1490 = AACC = BCC = BC = B (Fail)
1491 = AACA = BCA = BA = C (Success)
Not a full solution, just an idea:
(B) alone: The "plus" signs don't matter here. abc + def is the same as abcdef for the sake of divisibility by 3. For the latter case, there is a regexp here: http://blog.vkistudios.com/index.cfm/2008/12/30/Regular-Expression-to-determine-if-a-base-10-number-is-divisible-by-3
to combine this with requirement (A), we can take the solution of (B) and modify it:
First read character must be in 0..9 (not a plus)
Input must not end with a plus, so: Duplicate each state (will use S for the original state and S' for the duplicate to distinguish between them). If we're in state S and we read a plus we'll move to S'.
When reading a number we'll go to the new state as if we were in S. S' states cannot accept (another) plus.
Also, S' is not "accept state" even if S is. (because input must not end with a plus).