Solr - grab previous/next X amount words from keywords - regex

Is there a way to query for keywords and grab the previous x amount of words and the next x amount of words?
Example
(Searching for "Test")
Aa bb cc dd ee ff gg hh ii jj kk ll Test mm nn oo pp qq rr ss tt…
Where x was 5 would return
“hh ii jj kk ll Test mm nn oo pp qq rr ss”
With “Test” highlighted.
or
(Searching for "Test" AND/OR "Spam")
Aa bb cc dd ee ff gg hh ii jj kk ll Test mm nn Spam oo pp qq rr ss tt…
Where x was 5 would return
“hh ii jj kk ll Test mm nn Spam oo pp qq rr ss tt”
With “Test” and "Spam" highlighted.
Any help would be much appreciated. I've been looking into regex but I'm quite clueless there. Here are the resources I've been using. Also, my text contains $ , . and other random punctuation (I tried isolating by sentences, without success). Let's just assume that spaces separate the words.
http://lucidworks.lucidimagination.com/display/solr/Highlighting#Highlighting-UsingBoundaryScannerswiththeFastVectorHighlighter
http://wiki.apache.org/solr/HighlightingParameters/
Thanks!

Use the Highlighting tool - it will give you snippets of the matched document with the search terms italicized (in HTML). You can then home in on those markers (<em>) and then go backward and forward character by character until you accumulate five space characters.
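The post-processing step described above can be sketched in Python. This is only an illustration under assumptions: the function name window_around_highlights is made up here, the snippet is assumed to use <em>...</em> markers, and words are assumed to be separated by whitespace (as the question allows). It counts whole words rather than walking character by character, but the idea is the same.

```python
def window_around_highlights(snippet, x, open_tag="<em>", close_tag="</em>"):
    """Return x words before the first highlight through x words after the last one.

    Hypothetical helper: assumes whitespace-separated words and that every
    highlighted term is wrapped in open_tag/close_tag.
    """
    words = snippet.split()
    # Positions of words that carry a highlight marker.
    hits = [i for i, w in enumerate(words) if open_tag in w or close_tag in w]
    if not hits:
        return ""
    start = max(hits[0] - x, 0)
    end = min(hits[-1] + x + 1, len(words))
    return " ".join(words[start:end])


snippet = "Aa bb cc dd ee ff gg hh ii jj kk ll <em>Test</em> mm nn oo pp qq rr ss tt"
print(window_around_highlights(snippet, 5))
```

With several highlighted terms, the window runs from x words before the first marker to x words after the last one, so everything between the terms is kept.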


REGEX - how to extract a specific number of rows from a text

I need to find out how to extract a specific number of rows from a text (the number of rows that I want to extract would be variable).
In this case, I want to extract everything from 07/06/2021 up to SOLD FINAL ZI 1.
TEXT
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccccccc
07/06/2021 P2P 00.00
T d r 0000 R A cc R A
r : aadr
REF. ------------------
P l p 00.00
P XX/XX/XXXX 0000000000 :00000000000 P R R
A B OO 0000000000 v e: 00.00 n 0000000000
c t 0.00 n
REF. ------------------
P2P 00.00
T d r 0000 R A c R A
rr : Saracie
REF. ------------------
P2P 00.00
T d r 0000 A. B c R A rr : Sanity
REF. ------------------
P l p 00.00
P XX/XX/XXXX 0000000000 00000000000 P R R
D OO 0000000000 V T: 00.00 n 0000000000 c
T 0.00 n
REF. ------------------
XX/XX/XXXX RULAJ ZI 1 3
SOLD FINAL ZI 1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccccccc
In regex, I start with \n(\d{2}/\d{2}/\d{4}) in order to get the date 07/06/2021, but I don't know how to extract the rest.
Thank you in advance!
Hello and welcome to Stack Overflow.
Your question might not describe your actual problem. Do you REALLY want to "extract a specific number of rows"? This might be an XY problem.
I like the solution from MDR to extract everything up to SOLD FINAL:
^(\d{2}\/\d{2}\/\d{4})[\s\S]+SOLD FINAL.
I like this because I guess you know the word at the end and not the number of lines. But we can't tell.
Anyway to give you the answer to your question (as your actual problem might look different than we expect) you can use this regex:
^(\d{2}\/\d{2}\/\d{4}).*$(\n^.*$){n}
^ --> match at the beginning of a line (this needs multiline mode, where ^ and $ match at every line break)
(\d{2}\/\d{2}\/\d{4}) --> your regex for the date
.*$ --> also take the rest of the line
(\n^.*$){n} --> take the next n lines
\n --> the line break
^ --> again: beginning of a new line
.* --> as many characters as possible up to the end of the line (greedy, but . does not match a line break)
$ --> the end of a line
{n}--> the number of lines you want to extract (replace n ;) )
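As a rough illustration of the pattern above, here is a minimal Python sketch. The function name date_block is hypothetical, and re.MULTILINE is what makes ^ and $ match at each line break; other regex engines have an equivalent flag. The slashes are left unescaped because Python does not need them escaped.

```python
import re


def date_block(text, n):
    """Return the first line starting with a dd/mm/yyyy date plus the next n lines.

    Hypothetical helper; returns None if no such block exists.
    """
    pattern = re.compile(
        r"^(\d{2}/\d{2}/\d{4}).*$(?:\n^.*$){" + str(n) + "}",
        re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(0) if match else None


sample = "aaa\n07/06/2021 P2P 00.00\nline1\nline2\nline3"
print(date_block(sample, 2))
```

If the end marker is known but the number of lines is not, the [\s\S]+ variant quoted earlier is the better fit.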

How do I assign value to a new variable from another variable's value in SAS

So I create a new database called sitesweb2 from another one called sitesweb because I just want to keep certain variables. Those are only binary variables.
Then I want to create a new variable called fonction which will take:
Value 1 when there is a 1 in the M N O P Q R S T U variables.
Value 2 when there is a 1 in the AA AB AC AD AE AF AG AH AI AJ variables.
Value 3 when there is a 1 in the AK AL variables.
I have the following code but it doesn't create the fonction variable:
Data DEV1.SITESWEB2;
set DEV1.SITESWEB ;
keep INDUSTRIE M N O P Q R S T U AA AB AC AD AE AF AG AH AI AJ AK AL ;
if M or N or O or P or Q or R or S or T or U in ('1') then fonction = 1 ;
else if AA or AB or AC or AD or AE or AF or AG or AH or AI or AJ in ('1') then fonction = 2;
else if AK or AL in ('1') then fonction = 3;
run;
What is wrong?
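Each variable needs its own operator and right operand. As written, the IN ('1') test applies only to the last variable in each list, so the condition does not mean what you intend. Write it out like this: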
If M='1' or N='1' or ... then fonction=1;
If they were number variables you could add them up in a sum function:
if sum(M,N,...)>=1 then...
but apparently they are text variables rather than actual binary (number) variables.
Incidentally I'm sure you could also use a %do loop and an asc() function within a %sysfunc() passing the appropriate ascii numbers to iterate through the alphabet variables and then pass the results as macro variables into the datastep to automate building the if. It requires a bit more effort but if you're repeatedly running this, it's a more elegant solution.
Hope this helps,
Phil
Check for '1' in the result of concatenating an array that references the variables.
array _1_vars M N O P Q R S T U;
array _2_vars AA AB AC AD AE AF AG AH AI AJ;
array _3_vars AK AL;
select;
when (index(cats(of _1_vars(*)),'1')) fonction = '1';
when (index(cats(of _2_vars(*)),'1')) fonction = '2';
when (index(cats(of _3_vars(*)),'1')) fonction = '3';
otherwise ;
end;
This will work for both character and numeric variables. The variables of an array must all be of the same type.
To check if a value appears in a list of variables you can use the WHICH() or WHICHC() function. It will return the number of the variable where it is first found. If not found it will return zero. SAS treats 0 as false and any other number as true.
data DEV1.SITESWEB2;
set DEV1.SITESWEB ;
keep INDUSTRIE M N O P Q R S T U AA AB AC AD AE AF AG AH AI AJ AK AL ;
if whichc('1', of M N O P Q R S T U) then fonction = 1 ;
else if whichc('1', of AA AB AC AD AE AF AG AH AI AJ) then fonction = 2;
else if whichc('1', of AK AL) then fonction = 3;
keep fonction ;
run;
Make sure to KEEP your new variable.

Use a regular expression extract substring from data frame columns in R

I am fairly new to R so please go easy on me if this is a stupid question.
I have a dataframe called foo:
> head(foo)
Old.Clone.Name New.Clone.Name File
1 A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
2 B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
3 C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
4 D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
5 E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
6 F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd
I want to extract codes from the File column that match the regular expression (S[A-Z]{3}[0-9]{1,2}-[0-9]_02), to give me:
SAEE7-1_02
SADQ15-1_02
SAEC16-1_02
SAEJ6-1_02
SAED9-1_02
SAGP3-1_02
I then want to use these codes to search another directory for other files that contain the same code.
I fail, however, at the first hurdle and cannot extract the codes from that column of the data frame.
I have tried:
library('stringr')
str_extract(foo[3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = TRUE))
but this just returns [1] NA.
Am I simply missing something obvious? I look forward to cracking this with a bit of help from the community.
Hello. If you read the data in as a table, then foo[3] is a list (a one-column data frame), and str_extract does not accept lists, only character vectors. Use lapply to extract the match from every element:
lapply(foo[3], function(x) str_extract(x, "[sS][a-zA-Z]{3}[0-9]{1,2}-[0-9]_02"))
Result:
[1] "SAEE7-1_02" "SADQ15-1_02" "SAEC16-1_02" "SAEJ6-1_02" "SAED9-1_02"
[6] "SAGP3-1_02"
str_extract(foo[3],"(?i)S[A-Z]{3}[0-9]{1,2}-[0-9]_02")
seems to work. Somehow, my R gave me
"Error in check_pattern(pattern, string) : could not find function "regex""
when using your original expression.
The following code will repeat what you asked (just copy and paste to your R console):
library(stringr)
foo = scan(what='')
Old.Clone.Name New.Clone.Name File
A Aa A_mask_MF_final_IS2_SAEE7-1_02.nrrd
B Bb B_mask_MF_final_IS2ViaIS2h_SADQ15-1_02.nrrd
C Cc C_mask_MF_final_IS2ViaIS2h_SAEC16-1_02.nrrd
D Dd D_mask_MF_final_IS2ViaIS2h_SAEJ6-1_02.nrrd
E Ee F_mask_MF_final_IS2_SAED9-1_02.nrrd
F Ff F_mask_MF_final_IS2ViaIS2h_SAGP3-1_02.nrrd
foo = matrix(foo,ncol=3,byrow=T)
colnames(foo)=foo[1,]
foo = foo[-1,]
foo
str_extract(foo[,3],regex("(S[A-Z]{3}[0-9]{1,2}-[0-9]_02)", ignore_case = T))
The reason you get NA is hidden in how indexing works. A matrix is stored by column, so foo[3] on a matrix is the entry in the 3rd row and 1st column. On a data frame, foo[3] is a one-column data frame rather than a character vector. To select the third column as a vector, use foo[,3], or foo<-data.frame(foo); foo[[3]].

match elements from two files, how to write the intended format to a new file

I am trying to update my text file by matching the first column to another updated file's first column, after match it, it will update the old file.
Here is my oldfile:
Name Chr Pos ind1 in2 in3 ind4
foot 1 5 aa bb cc
ford 3 9 bb cc 00
fake 3 13 dd ee ff
fool 1 5 ee ff gg
fork 1 3 ff gg ee
Here is the newfile:
Name Chr Pos
foot 1 5
fool 2 5
fork 2 6
ford 3 9
fake 3 13
The updated file will be like:
Name Chr Pos ind1 in2 in3 ind4
foot 1 5 aa bb cc
fool 2 5 ee ff gg
fork 2 6 ff gg ee
ford 3 9 bb cc 00
fake 3 13 dd ee ff
Here is my code:
#!/usr/bin/env python
import sys
inputfile_1 = sys.argv[1]
inputfile_2 = sys.argv[2]
outputfile = sys.argv[3]
inputfile1 = open(inputfile_1, 'r')
inputfile2 = open(inputfile_2, 'r')
outputfile = open(outputfile, 'w')
ind = inputfile1.readlines()
cm = inputfile2.readlines()[1:]
outputfile.write(ind[0]) #add header
for i in ind:
    i = i.split()
    for j in cm:
        j = j.split()
        if j[0] == i[0]:
            outputfile.writelines(j[0:3] + i[3:])
            outputfile.write('\n')
inputfile1.close()
inputfile2.close()
outputfile.close()
When I ran it, ./compare_substitute_2files.py oldfile newfile output
the values were updated for the file, but they did not follow the order of the new file, and no space was there as indicated in the output below.
Name Chr Pos ind1 in2 in3 ind4
foot15aabbcc
ford39bbcc00
fake313ddeeff
fool25eeffgg
fork26ffggee
My question is how to match to the exact order and give spaces to each element in the list when write them out? Thanks!
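file.write accepts a single string as its parameter, while file.writelines accepts a sequence of strings. Neither adds separators between the strings, which is why the fields run together in your output.
Join the fields with a space before writing them. This line replaces both the writelines call and the separate write('\n'):
outputfile.write(' '.join(j[0:3] + i[3:]) + '\n')
To follow the order of the new file, make the loop over the new file's lines the outer loop and look up the matching old-file line inside it.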
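One way to get both the new file's row order and space-separated fields is sketched below. It is a minimal example, not the asker's script: the function name merge_rows is made up, it works on lists of lines (for example from readlines()), and it assumes the first column is a unique key and that the first three columns come from the new file.

```python
def merge_rows(old_lines, new_lines):
    """Combine Name/Chr/Pos from new_lines with the extra columns from old_lines.

    Hypothetical helper: rows come out in the order of new_lines, and names
    missing from old_lines are skipped.
    """
    header = old_lines[0].rstrip("\n")
    # Index the old rows by their first column for direct lookup.
    old_by_name = {}
    for line in old_lines[1:]:
        fields = line.split()
        if fields:
            old_by_name[fields[0]] = fields
    merged = [header]
    for line in new_lines[1:]:
        fields = line.split()
        if fields and fields[0] in old_by_name:
            extra = old_by_name[fields[0]][3:]
            merged.append(" ".join(fields[:3] + extra))
    return merged


old = ["Name Chr Pos ind1 in2 in3 ind4\n", "foot 1 5 aa bb cc\n", "fool 1 5 ee ff gg\n"]
new = ["Name Chr Pos\n", "fool 2 5\n", "foot 1 5\n"]
print("\n".join(merge_rows(old, new)))
```

Using a dictionary for the old rows also avoids the nested loop, which matters once the files grow.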

Regular expression puzzle

This is not homework, but an old exam question. I am curious to see the answer.
We are given an alphabet S={0,1,2,3,4,5,6,7,8,9,+}. Define the language L as the set of strings w from this alphabet such that w is in L if:
a) w is a number such as 42 or w is the (finite) sum of numbers such as 34 + 16 or 34 + 2 + 10
and
b) The number represented by w is divisible by 3.
Write a regular expression (and a DFA) for L.
This should work:
^(?:0|(?:(?:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\
+)*[369]0*)*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:
\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[
258](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0
\+)*[147])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+)(?:\+(?:0|(?:(?
:[369]|[147](?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)
*\+?(?:0\+)*[258])*(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]|0*(?:\+?(?:0\+)*
[369]0*)*\+?(?:0\+)*[147]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])|[258](?:0*(?
:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147])*
(?:0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[147]|0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)
*[258]0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*[258]))0*)+))*$
It works by having three states representing the sum of the digits so far modulo 3. It disallows leading zeros on numbers, and plus signs at the start and end of the string, as well as two consecutive plus signs.
Generation of regular expression and test bed:
a = r'0*(?:\+?(?:0\+)*[369]0*)*\+?(?:0\+)*'
b = r'a[147]'
c = r'a[258]'
r1 = '[369]|[147](?:bc)*(?:c|bb)|[258](?:cb)*(?:b|cc)'
r2 = '(?:0|(?:(?:' + r1 + ')0*)+)'
r3 = '^' + r2 + r'(?:\+' + r2 + ')*$'
# Expand the placeholders b and c first, then a (which they both contain).
r = r3.replace('b', b).replace('c', c).replace('a', a)
print(r)
# Test on 10000 examples.
import random, re
random.seed(1)
r = re.compile(r)
for _ in range(10000):
    x = ''.join(random.choice('0123456789+') for j in range(random.randint(1,50)))
    # Reject a leading or doubled plus, a trailing plus, and leading zeros.
    if re.search(r'(?:\+|^)(?:\+|0[0-9])|\+$', x):
        valid = False
    else:
        valid = eval(x) % 3 == 0
    result = re.match(r, x) is not None
    if result != valid:
        print('Failed for ' + x)
Note that my memory of DFA syntax is woefully out of date, so my answer is undoubtedly a little broken. Hopefully this gives you a general idea. I've chosen to ignore + completely. As AmirW states, abc+def and abcdef are the same for divisibility purposes.
Accept state is C.
A=1,4,7,BB,AC,CA
B=2,5,8,AA,BC,CB
C=0,3,6,9,AB,BA,CC
Notice that the above language uses all 9 possible ABC pairings. It will always end at either A,B,or C, and the fact that every variable use is paired means that each iteration of processing will shorten the string of variables.
Example:
1490 = AACC = BCC = BC = B (Fail)
1491 = AACA = BCA = BA = C (Success)
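The reduction in the two examples can be replayed with a small Python sketch. The names DIGIT_CLASS, COMBINE, final_state and divisible_by_three are made up for illustration; the tables are copied from the A/B/C rules above, and the input is assumed to be a non-empty string of digits with any plus signs already removed.

```python
from functools import reduce

# Each digit maps to a class by its remainder mod 3: 1 -> A, 2 -> B, 0 -> C.
DIGIT_CLASS = {d: "CAB"[int(d) % 3] for d in "0123456789"}

# Pair rules from the answer: A=BB,AC,CA  B=AA,BC,CB  C=AB,BA,CC.
COMBINE = {
    "BB": "A", "AC": "A", "CA": "A",
    "AA": "B", "BC": "B", "CB": "B",
    "AB": "C", "BA": "C", "CC": "C",
}


def final_state(digits):
    """Fold a non-empty digit string left to right into a single class letter."""
    classes = [DIGIT_CLASS[d] for d in digits]
    return reduce(lambda left, right: COMBINE[left + right], classes)


def divisible_by_three(digits):
    """C is the accept state."""
    return final_state(digits) == "C"


print(final_state("1490"), final_state("1491"))
```

Folding from the left is just one evaluation order; because the pair table behaves like addition modulo 3, any order of reductions ends in the same class.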
Not a full solution, just an idea:
(B) alone: The "plus" signs don't matter here. abc + def is the same as abcdef for the sake of divisibility by 3. For the latter case, there is a regexp here: http://blog.vkistudios.com/index.cfm/2008/12/30/Regular-Expression-to-determine-if-a-base-10-number-is-divisible-by-3
To combine this with requirement (A), we can take the solution of (B) and modify it:
First read character must be in 0..9 (not a plus)
Input must not end with a plus, so: Duplicate each state (will use S for the original state and S' for the duplicate to distinguish between them). If we're in state S and we read a plus we'll move to S'.
When reading a number we'll go to the new state as if we were in S. S' states cannot accept (another) plus.
Also, S' is not "accept state" even if S is. (because input must not end with a plus).
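The duplicated-state idea can be simulated directly. Below is a minimal Python sketch under assumptions: the function name accepts is made up, the three S states are tracked as the digit sum modulo 3, and a boolean flag stands in for "we are in an S' state". The start state is treated like an S' state so that the first character must be a digit. Like the outline above, it does not deal with leading zeros.

```python
def accepts(w):
    """Simulate the DFA: digit sum mod 3, plus a flag for 'just read a plus'."""
    remainder = 0
    after_plus = True  # start behaves like S': a digit must come next
    for ch in w:
        if ch == "+":
            if after_plus:
                return False  # plus at the start, or two pluses in a row
            after_plus = True
        elif ch in "0123456789":
            remainder = (remainder + int(ch)) % 3
            after_plus = False
        else:
            return False  # symbol outside the alphabet
    # S' states never accept, so the input cannot end with a plus (or be empty).
    return not after_plus and remainder == 0


for w in ["42", "34+16", "12+3", "3+", "+3", "3++3"]:
    print(w, accepts(w))
```

Six states plus a dead state are enough here: three remainders, each in an S and an S' version.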