Finding position of a sequence of words (strings) in a sentence - regex

I have a sentence with two markers, <e1> and </e1>. I need the word indices of the sequence of words between these two markers. Note that the comma and any other punctuation characters count as words!
sent="Hi please help me to <e1>solve, this problem please</e1> Thank you."
What I need (the desired output):
[5, 6, 7, 8, 9]
If you count each word from the beginning of the sentence, I need the index of the sequence between two markers:
solve -> 5
, -> 6
this -> 7
problem -> 8
please -> 9
I tried these two solutions:
Solution 1:
import re
from nltk.tokenize import word_tokenize

sent = "Hi please help me to <e1>solve, this problem please</e1> Thank you."
E1 = re.search('<e1>(.*)</e1>', sent).group(1)
sent = sent.replace('<e1>', '')
sent = sent.replace('</e1>', '')
sent = word_tokenize(sent)
E1_indx = []
E1_lis = word_tokenize(E1)
print(E1_lis)
for item in E1_lis:
    E1_indx.append(sent.index(item))
print(E1_indx)
But the output is:
[5, 6, 7, 8, 1]
Solution 2:
sent="Hi please help me to <e1>solve, this problem please</e1> Thank you."
e1_st = re.findall(r'<e1>\w+', sent)
e1_end = re.findall(r'\w+</e1>', sent)
e1_st=(''.join(str(x) for x in e1_st))
e1_end=(''.join(str(x) for x in e1_end))
e1_st = e1_st.replace('<e1>', '')
e1_end = e1_end.replace('</e1>', '')
sent = sent.replace('<e1>', '')
sent = sent.replace('</e1>', '')
sent = word_tokenize(sent)
print(list(range(sent.index(e1_st), sent.index(e1_end)+1)))
Output:
[]
The problem arises when a word from the marked sequence also occurs earlier in the sentence (here "please").
Is there a straightforward solution?

It looks like this question.
If you compute the offsets as follows and remove the markers, you should get the expected results. sub_b and sub_e are then the character offsets of the marked span in the cleaned sentence.
sub_b = sent.find('<e1>')
sent = sent.replace('<e1>', '')
sub_e = sent.find('</e1>')
sent = sent.replace('</e1>', '')
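Character offsets still have to be converted into word indices. The following is a minimal sketch of one way to do that, not necessarily what the answer had in mind: it counts the tokens before the opening marker and the tokens inside the markers. The helper name marked_token_indices and the regex tokenizer are assumptions; the tokenizer is a simple stand-in for NLTK's word_tokenize and may split some text differently.

```python
import re

# Assumed tokenizer: a run of word characters, or one punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")

def marked_token_indices(sent, open_tag="<e1>", close_tag="</e1>"):
    """Return the token indices of the text between open_tag and close_tag."""
    start = sent.find(open_tag)
    end = sent.find(close_tag)
    if start == -1 or end == -1 or end < start:
        raise ValueError("markers not found")
    before = sent[:start]                      # text preceding the marker
    inside = sent[start + len(open_tag):end]   # text between the markers
    first = len(TOKEN.findall(before))         # index of the first marked token
    count = len(TOKEN.findall(inside))         # number of marked tokens
    return list(range(first, first + count))

sent = "Hi please help me to <e1>solve, this problem please</e1> Thank you."
print(marked_token_indices(sent))  # [5, 6, 7, 8, 9]
```

Because the indices come from counting tokens rather than from looking up each word, a repeated word such as "please" no longer causes a wrong result.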

Related

In Power Query, how can I remove duplicates either side of a delimiter?

I wish to turn a value of the form text:text into text:
For example amazon:amazon becomes amazon:
This is doable by hand using the replace values function, but I need a way to do it programmatically.
Thanks!
You can try this transform. If it doesn't work, provide detail as to the nature of the failure, examples of data on which it doesn't work, and any error messages along with the line which returns the error.
remDups = Table.TransformColumns(#"Changed Type", {"Column1", each
    let
        sep = ":",
        splitList = Text.Split(_, " "),
        sepString = List.FindText(splitList, sep){0},
        sepStringPosition = List.PositionOf(splitList, sepString),
        // if both sides of the separator are the same, remove the last one
        splitSep = Text.Split(sepString, sep),
        replString = if splitSep{0} = splitSep{1} then splitSep{0} & sep else sepString,
        // put the string back together
        replList = List.ReplaceRange(splitList, sepStringPosition, 1, {replString})
    in
        Text.Combine(replList, " ")
})

Find starting and ending index of each unique character in a string in Python

I have a string with repeated characters. My job is to find the starting index and ending index of each unique character in that string. Below is my code.
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    mo = re.search(item, x)
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag, m, n)
Output :
a 0 1
b 3 4
c 7 8
Here the end indexes of the characters are not correct. I understand why this happens, but how can I pass the character to be matched dynamically to the regex search function? For instance, if I hardcode the character in the search function, it matches the whole run:
x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+", x)
flag = "b"
m = mo.start()
n = mo.end()
print(flag, m, n)
Output:
b 2 6
The above code matches the whole run, but here I can't pass the characters to be matched dynamically.
It would be a real help if someone could tell me how to achieve this; any hint will also do. Thanks in advance.
String literal formatting to the rescue:
import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    # for patterns, prefer raw strings, and format the letter into the pattern
    mo = re.search(fr"{item}+", x)  # fr and rf both work: it is a raw formatted literal
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag, m, n)  # for an inclusive upper limit, print n-1
Output:
a 0 3 # you do see that the upper limit is off by 1?
b 3 7 # see above for fix
c 7 9
Your pattern does not need the [] around the letter - you are matching just one anyhow.
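One caveat when formatting a character straight into a pattern: if the string can contain regex metacharacters such as "." or "*", the character should be escaped first. A minimal sketch, assuming a hypothetical helper char_spans that reports an inclusive end index for the first run of each character:

```python
import re

def char_spans(text):
    """Map each distinct character to (start, inclusive end) of its first run."""
    spans = {}
    for ch in sorted(set(text)):
        # re.escape makes characters like "." match literally
        match = re.search(f"{re.escape(ch)}+", text)
        spans[ch] = (match.start(), match.end() - 1)  # end() is exclusive
    return spans

print(char_spans("aaabbbbcc"))  # {'a': (0, 2), 'b': (3, 6), 'c': (7, 8)}
```

Sorting the set only makes the output order predictable; it is not needed for correctness.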
Without regex [1]:
x = "aaabbbbcc"
last_ch = x[0]
start_idx = 0
# process the remainder
for idx, ch in enumerate(x[1:], 1):
    if last_ch == ch:
        continue
    else:
        print(last_ch, start_idx, idx-1)
        last_ch = ch
        start_idx = idx
print(ch, start_idx, idx)
output:
a 0 2 # not off by 1
b 3 6
c 7 8
[1] RegEx: And now you have 2 problems...
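The same run scanning can also be sketched with itertools.groupby from the standard library. This is a swapped-in technique, not the loop the answer used, and run_spans is a hypothetical helper name. Unlike the loop, it also copes with an empty string and with a one-character string.

```python
from itertools import groupby

def run_spans(text):
    """Return (character, start, inclusive end) for each run of equal characters."""
    spans = []
    position = 0
    for ch, group in groupby(text):
        length = sum(1 for _ in group)  # number of characters in this run
        spans.append((ch, position, position + length - 1))
        position += length
    return spans

print(run_spans("aaabbbbcc"))  # [('a', 0, 2), ('b', 3, 6), ('c', 7, 8)]
```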
Looking at the output, I'm guessing that another option would be,
import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)
start = 0
output = ''
for item in xs:
    end = start + len(item[0])
    output += (f"{item[1]} {start} {end}\n")
    start = end
print(output)
Output
a 0 3
b 3 7
c 7 9
I think it runs in O(N) time; you can benchmark it, though, if you like.
import re, time
timer_on = time.time()
for i in range(10000000):
    x = "aabbbbccc"
    xs = re.findall(r"((.)\2*)", x)
    start = 0
    output = ''
    for item in xs:
        end = start + len(item[0])
        output += (f"{item[1]} {start} {end}\n")
        start = end
timer_off = time.time()
timer_total = timer_off - timer_on
print(timer_total)

How do you print a list of values in one column?

I have calculated a list of values from an equation and I want to print all 140 of the output values in a single column, so that I can convert it to a txt document with one column of data. When I say print(values), it prints the output in multiple columns.
For example:
N = [1,2,3,4,5]
print(N)
This is the result: [1, 2, 3, 4, 5]
I want these values in a single column.
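l = [1, 2, 3, 4]  # let this be the list
for i in range(len(l)):
    print(l[i])
N = [1, 2, 3, 4, 5]  # your list here
for i in range(0, len(N)):
    print(N[i])
print() already ends each call with a newline, so the values appear in one column without an explicit "\n". If you do add a newline yourself, remember to use a backslash ("\n"), not a forward slash ("/n").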
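Since the goal is a text file with one column, it may be simpler to build the column as a string and write it out directly. A minimal sketch; format_column, write_column and the file name values.txt are hypothetical names chosen for illustration:

```python
from pathlib import Path

def format_column(values):
    """Return the values as text with one value per line."""
    return "\n".join(str(v) for v in values)

def write_column(values, path):
    """Write the values to a text file, one per line."""
    Path(path).write_text(format_column(values) + "\n", encoding="utf-8")

N = [1, 2, 3, 4, 5]
print(format_column(N))
# write_column(N, "values.txt")  # uncomment to create the file
```

For printing only, print(*N, sep="\n") gives the same single column without a loop.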

Python: referring to each duplicate item in a list by unique index

I am trying to extract particular lines from a txt output file. The lines I am interested in are a few lines above and a few lines below the key_string that I am using to search through the results. The key string is the same for each result.
fi = open('Inputfile.txt')
fo = open('Outputfile.txt', 'a')
lines = fi.readlines()
filtered_list = []
for item in lines:
    if item.startswith("key string"):
        filtered_list.append(lines[lines.index(item)-2])
        filtered_list.append(lines[lines.index(item)+6])
        filtered_list.append(lines[lines.index(item)+10])
        filtered_list.append(lines[lines.index(item)+11])
fo.writelines(filtered_list)
fi.close()
fo.close()
The output file contains the right lines for the first record, but repeated once for every record available. How can I update the indexing so it reads every individual record? I've tried to find the solution, but as a novice programmer I struggled to use the enumerate() function or the collections package.
First of all, it would probably help if you said what exactly goes wrong with your code (a stack trace, it doesn't work at all, etc). Anyway, here's some thoughts. You can try to divide your problem into subproblems to make it easier to work with. In this case, let's separate finding the relevant lines from collecting them.
First, let's find the indexes of all the relevant lines.
key = "key string"
relevant = []
for i, item in enumerate(lines):
    if item.startswith(key):
        relevant.append(i)  # store the index, not the line itself
enumerate is actually quite simple. It takes a list (or any iterable) and yields (index, item) pairs. So, list(enumerate(['a', 'b', 'c'])) gives [(0, 'a'), (1, 'b'), (2, 'c')].
What I had written above can be achieved with a list comprehension:
relevant = [i for (i, item) in enumerate(lines) if item.startswith(key)]
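To see why the original loop repeats the first record, compare list.index, which always returns the position of the first equal item, with enumerate, which reports the position of the item actually being visited. A small sketch with made-up sample lines:

```python
# Made-up sample data: the key line appears twice with identical text.
lines = ["key string\n", "a\n", "key string\n", "b\n"]
key = "key string"

# list.index finds the first equal line every time, so both hits map to 0.
first_match_only = [lines.index(item) for item in lines if item.startswith(key)]

# enumerate yields the real position of each visited line.
all_matches = [i for i, item in enumerate(lines) if item.startswith(key)]

print(first_match_only)  # [0, 0]
print(all_matches)       # [0, 2]
```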
So, we have the indexes of the relevant lines. Now, let's collect them. You are interested in the line 2 lines before each match and the lines 6, 10 and 11 lines after it. If one of your first two lines contains the key, you have a problem: the index becomes negative, and you don't really want lines[-1], because that's the last item! Also, you need to handle the situation in which your offset would take you past the end of the list; otherwise Python will raise an IndexError.
out = []
for r in relevant:
    for offset in -2, 6, 10, 11:
        index = r + offset
        if 0 <= index < len(lines):  # index 0 is a valid line
            out.append(lines[index])
You could also catch the IndexError, but that won't save us much typing, as we have to handle negative indexes anyway.
The whole program would look like this:
key = "key string"
with open('Inputfile.txt') as fi:
    lines = fi.readlines()
relevant = [i for (i, item) in enumerate(lines) if item.startswith(key)]
out = []
for r in relevant:
    for offset in -2, 6, 10, 11:
        index = r + offset
        if 0 <= index < len(lines):  # index 0 is a valid line
            out.append(lines[index])
with open('Outputfile.txt', 'a') as fo:
    fo.writelines(out)
To get rid of duplicates you can convert the list to a set; example:
x = ['a', 'b', 'a']
y = set(x)
print(y)
will result in a set (the element order is not guaranteed):
{'a', 'b'}

Reading In Integers in Python

So, my question is simple; I'm just struggling with syntax here. I need to read in a set of integers: 3, 11, 2, 4, 4, 5, 6, 10, 8, -12. I want to place those integers in a list as I read them. The first value, n, is the size of the n x n array in which the rest are presented. So if n = 3, I will be passed something like 3 \n 11 2 4 \n 4 5 6 \n 10 8 -12 (\n symbolizing a new line in the input file).
n = int(raw_input().strip())
a = []
for a_i in xrange(n):
    value = int(raw_input().strip())
    a.append(value)
print(a)
I receive this error from the above code:
value = int(raw_input().strip())
ValueError: invalid literal for int() with base 10: '11 2 4'
The actual challenge can be found here, https://www.hackerrank.com/challenges/diagonal-difference .
I have already completed this in Java and C++ and am simply trying to do it in Python now, but I'm not good at Python. If someone is willing (no obligation), I'd like to see the proper way to read in an entire line, say " 11 2 4 ", create a new list out of that line, and add it to an already existing list. Then all I have to do is look up the desired index in list[ desiredInternalList[ ] ].
You can split the string at white space and convert the entries into integers.
This gives you one list:
for a_i in xrange(n):
    a.extend([int(x) for x in raw_input().split()])
and this a list of lists:
for a_i in xrange(n):
    a.append([int(x) for x in raw_input().split()])
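In Python 3, raw_input and xrange no longer exist (input and range replace them). A minimal Python 3 sketch of the list-of-lists variant follows; read_square_matrix is a hypothetical helper that takes any iterable of lines, so it can be tried without typing input by hand.

```python
def read_square_matrix(lines):
    """Parse an n x n integer matrix: the first line is n, then n rows follow."""
    rows = iter(lines)
    n = int(next(rows).strip())
    # split() with no argument ignores leading, trailing and repeated spaces
    return [[int(tok) for tok in next(rows).split()] for _ in range(n)]

sample = ["3", " 11 2 4 ", "4 5 6", "10 8 -12"]
print(read_square_matrix(sample))  # [[11, 2, 4], [4, 5, 6], [10, 8, -12]]
```

For real input, the same function can be given sys.stdin (after import sys), since a file object yields its lines one by one.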
You get this error because each input line holds several integers, and int() cannot convert a whole line such as '11 2 4' at once. To handle this you may use this code:
n = int(raw_input().strip())
a = []
while len(a) < n*n:
    x = raw_input().strip()
    x = map(int, x.split())
    a.extend(x)
print(a)