Attempting to splice a recurring item out of a list

I have extracted files from an online database that consist of roughly 100 titles. Each title has an associated DOI number, and the DOI is different for each title. To automate this, I converted the contents of the website to a list and then created a for loop to iterate through each item of the list. What I want the program to do is iterate through the entire list, find each place where it says "DOI:", and take the number that follows it. However, all my for loop seems to do is print out the first DOI number and then terminate. How do I make the loop keep going once I have found the first one?
Here is the code:
resulttext = resulttext.split()
print(resulttext)
for item in resulttext:
    if item == "DOI:":
        DOI = resulttext[resulttext.index("DOI:") + 1]  # This finds "DOI:", then takes the item which follows it
        print(DOI)
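A likely cause is that `list.index("DOI:")` always returns the position of the first match, so every pass through the loop looks up the same DOI. A minimal sketch of one way around this, using `enumerate` to track the current position; the sample text and the `dois` list are hypothetical stand-ins for the real page contents:

```python
# Hypothetical sample text standing in for the scraped page contents.
resulttext = "Title one DOI: 10.1000/abc Title two DOI: 10.1000/xyz".split()

dois = []
for i, item in enumerate(resulttext):
    # Use the current position i rather than index(), which only finds the first match.
    if item == "DOI:" and i + 1 < len(resulttext):
        dois.append(resulttext[i + 1])

print(dois)  # ['10.1000/abc', '10.1000/xyz']
```

Collecting the values in a list rather than printing them inside the loop also makes them available for later processing.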

Related

find index based on first element in a nested list

I have a list that contains sublists. The order of the elements in each sublist is fixed, as is the number of elements.
schedule = [['date1', 'action1', beginvalue1, endvalue1],
['date2', 'action2', beginvalue2, endvalue2],
...
]
Say I have a date and I want to find what I have to do on that date, meaning I need to retrieve the entire sublist given only the date.
I did the following (which works): I created an intermediate list containing the first value of each sublist. Using the index of the date in that list, I was able to retrieve the entire sublist, as follows:
dt = 'date150' # To just have a value to make underlying code more clear
ls_intermediate = [item[0] for item in schedule]
index = ls_intermediate.index(dt)
print(schedule[index])
It works, but it does not seem like the Pythonic way to do this. How can I improve this piece of code?
To be complete: there are no duplicate 'date' entries in the list. Every date is unique and appears only once.
Learning Python, and having quite a journey in front of me...
thank you!
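Two common alternatives to the intermediate list, shown as a sketch: `next()` over a generator expression for a one-off lookup, or a dictionary keyed by date when many lookups are needed. The schedule values and the `by_date` name below are made up for illustration:

```python
# Hypothetical schedule data; the numeric values are placeholders.
schedule = [
    ["date1", "action1", 0, 10],
    ["date2", "action2", 10, 20],
    ["date150", "action150", 20, 30],
]
dt = "date150"

# One-off lookup: stops at the first matching sublist, returns None if absent.
entry = next((row for row in schedule if row[0] == dt), None)
print(entry)

# Repeated lookups: build a dict once (this relies on the dates being unique).
by_date = {row[0]: row for row in schedule}
print(by_date[dt])
```

The `next()` form avoids building a second list, while the dictionary trades a little memory for constant-time lookups afterwards.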

How to create new column that parses correct values from a row to a list

I am struggling to create a formula in Power BI that would split a single row's value into a list of the values that I want.
So I have a column that is called ID and it has values such as:
"ID001122, ID223344" or "IRRELEVANT TEXT ID112233, MORE IRRELEVANT;ID223344 TEXT"
What is important is to keep each ID and the 6 digits after it. The first example would turn into a list like this: {"ID001122","ID223344"}. The second example would produce exactly the same list, with all the irrelevant text in between stripped out.
I was looking for some kind of loop formula where you could use the text find function to locate the starting point of each ID and the middle function to extract 8 characters from there, but I made no progress in finding one. I tried making lists using the comma as a separator, but I noticed that not all rows had commas to separate the IDs.
The end result would be that the original value is in one column, next to the list of parsed values, which could then be expanded to new rows.
ID                                 Parsed ID
"Random ID123456, Text;ID23456"    List {"ID123456","ID23456"}
Does anyone have prior experience with this?
I found the answer myself, using a good article about a similar problem.
Here is my solution, without any further text parsing, which I can do later on.
each let
    PosList = Text.PositionOf([ID], "ID", Occurrence.All),
    List = List.Transform(PosList, (x) => Text.Middle([ID], x, 8))
in List
For example, this would turn "(ID343137,ID352973) ID358388" into {ID343137,ID352973,ID358388}.
It ended up being easier than I thought. I suppose the solution came down to lists once again!

Using regex extract a particular text from a paragraph

I have used the below to extract a string from a paragraph.
data = '''actions/steps to (re-) produce the problem:
1) Media--> Music collectio--> on right side--> click on Add Favourite icon--> on clicking Add from Favourite icon--> (Delete from favourite ) will display--> again click on Delete the favourite
expected result/behaviour:
it should display the track as well
observed result/behavior:
1st track list will display then
2nd list of songs will display
3rd no records will display
this behaviour will appear again and again
possible impact:
this can be an issue while driving
actions/steps to recover from error:
software version tested (including supplied software or CAF version if relevant):
MGU :- 17w.25.4-2'''
observed=[]
for i in data["Error Description"]:
    if len(re.findall(r'(Observed result\/behavior:|observed result\/behavior:)([^(]*)Possible impact:', i))==1:
        observed.append((re.findall(r'(Observed result\/behavior:|observed result\/behavior:)([^(]*)Possible impact:', i))[0][1])
    else:
        observed.append(" ".join((re.findall(r'(Observed result\/behavior:|observed result\/behavior:)([^(]*)Possible impact:', i))))
OUTPUT:
It shows nothing when the "observed" section has 4 lines. If the section has only one line and is immediately followed by "possible impact:", then it displays the output.
I need the output even when the observed section has any number of lines.
Please help.
This should work, on the assumption that the observed result/behavior: section is followed by one blank line before the next paragraph:
begin = data.index('observed result/behavior:')
end = data[begin:].index('\n\n')
output = data[begin:(begin+end)]
print(output)
observed result/behavior:
1st track list will display then
2nd list of songs will display
3rd no records will display
this behaviour will appear again and again
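If a regex is preferred, a sketch along the following lines handles any number of lines by letting `.` match newlines (`re.DOTALL`) and ignoring case (`re.IGNORECASE`), so it does not depend on a blank line or on how "Possible impact:" is capitalised. The shortened `data` string and the `pattern` name are illustrative only:

```python
import re

# Shortened, hypothetical version of the report text.
data = """expected result/behaviour:
it should display the track as well

observed result/behavior:
1st track list will display then
2nd list of songs will display

possible impact:
this can be an issue while driving
"""

# Lazily capture everything between the two headings, across line breaks.
pattern = re.compile(
    r"observed result/behaviou?r:\s*(.*?)\s*possible impact:",
    re.IGNORECASE | re.DOTALL,
)
match = pattern.search(data)
observed = match.group(1) if match else ""
print(observed)
```

The `u?` also accepts the "behaviour" spelling; whether the real data needs that is an assumption.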

Why does random.sample() add square brackets and single quotes to the item sampled?

I'm trying to sample an item (which is one of the keys in a dictionary) from a list and later use the index of that item to find its corresponding value (in the same dictionary).
questions = list(capitals.keys())
answers = list(capitals.values())
for q in range(10):
    queswrite = random.sample(questions, 1)
    number = questions.index(queswrite)
    crtans = answers[number]
Here, capitals is the original dictionary from which the states (keys) and capitals (values) are being sampled.
But apparently the random.sample() method adds square brackets and single quotes to the sampled item, which prevents it from being used to look up the index in the list of questions.
Traceback (most recent call last):
  File "F:\test.py", line 30, in <module>
    number = questions.index(queswrite)
ValueError: ['Delaware'] is not in list
How can I prevent this?
random.sample() returns a list, containing the number of elements you requested. See the documentation:
Return a k length list of unique elements chosen from the population sequence or set. Used for random sampling without replacement.
If you want to pick just one element, you don't want a sample; you want to choose a single item. For that you'd use the random.choice() function instead:
question = random.choice(questions)
However, given that you are using a loop, you probably really wanted to get 10 unique questions. Don't use a loop over range(10), instead pick a sample of 10 random questions. That's exactly what random.sample() would do for you:
for question in random.sample(questions, 10):
    # pick the answer for this question.
    ...
Next, putting both keys and values into two separate lists, then using the index of one to find the other is... inefficient and unnecessary; the keys you pick can be used directly to find the answers:
questions = list(capitals)
for question in random.sample(questions, 10):
    crtans = capitals[question]
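To make the difference concrete, here is a small self-contained sketch. The four-entry `capitals` dictionary and the `quiz` list are made-up stand-ins for the real data:

```python
import random

# Hypothetical, shortened dictionary of states and capitals.
capitals = {
    "Delaware": "Dover",
    "Texas": "Austin",
    "Ohio": "Columbus",
    "Utah": "Salt Lake City",
}

# random.sample() always returns a list, even when k is 1.
sampled = random.sample(list(capitals), 1)
print(type(sampled).__name__, len(sampled))  # list 1

# random.choice() returns the element itself, usable directly as a key.
state = random.choice(list(capitals))
print(state, "->", capitals[state])

# Several unique questions at once, each paired with its answer.
quiz = [(q, capitals[q]) for q in random.sample(list(capitals), 3)]
print(quiz)
```

The square brackets and quotes in the error message are just how Python displays a one-element list of strings; nothing is added to the string itself.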

Why does ConditionalFreqDist not work in NLTK?

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for target in ['america']
    for fileid in inaugural.fileids())
It works fine, but I don't know why every sample has a count of 1 in every file.
This has nothing to do with the FreqDist. It's about what you feed it – you need to know how the generator expressions work.
In your case it's a two-way nested generator. Look at it like a for loop:
for target in ['america']:
    for fileid in inaugural.fileids():
        # do something with target and fileid
In this case, the "do something" part is simply adding a pair of strings to the FreqDist. The string pairs look like this:
('america', <prefix_of_file_1>)
('america', <prefix_of_file_2>)
('america', <prefix_of_file_3>)
...
The first element is always the same, because you have just one item in the target list. The second element is made of the first 4 characters of the file ID. You get exactly one entry per file, regardless of whether 'america' is in the file or not, because you don't look at the content of the file, you just iterate over file IDs.
The way to do it is like the first example in your original post, before you deleted it:
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for target in ['america']
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    if w.lower().startswith(target))
Let's have a look at this three-way nested generator expression, written as for loops:
for target in ['america']:
    for fileid in inaugural.fileids():
        for w in inaugural.words(fileid):
            if w.lower().startswith(target):
                # add target and fileid[:4] to the FreqDist
So here you iterate over all words (inner-most loop) in every file (middle loop), and you do this for every target (first loop; here there is just one, so there's not much looping). And then you skip all words that do not start with "america".
For example, let's say file 1 has two occurrences of "America" (or "American"), the second file has no mention of the target, and the third file has 3 occurrences. Then the pairs added to the FreqDist will look like this:
('america', <prefix_of_file_1>)
('america', <prefix_of_file_1>)
('america', <prefix_of_file_3>)
('america', <prefix_of_file_3>)
('america', <prefix_of_file_3>)
...
So for every occurrence of the target, you give the FreqDist an entry to count. Files without an occurrence are not counted, and multiple occurrences are counted multiple times.
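The two behaviours can be compared without downloading the corpus. The sketch below uses `collections.Counter` as a stand-in for the FreqDist and a hypothetical three-file toy `corpus` dictionary in place of `inaugural`, so the file names and words are invented:

```python
from collections import Counter

# Hypothetical toy corpus: file ID -> list of words (stand-in for nltk's inaugural).
corpus = {
    "1789-Washington.txt": ["America", "is", "American", "soil"],
    "1793-Washington.txt": ["no", "mention", "here"],
    "1797-Adams.txt": ["america", "Americans", "AMERICA"],
}
targets = ["america"]

# Two-way nesting: one pair per file, regardless of the file's content.
per_file = Counter(
    (target, fileid[:4])
    for target in targets
    for fileid in corpus
)

# Three-way nesting with a filter: one pair per matching word.
per_occurrence = Counter(
    (target, fileid[:4])
    for target in targets
    for fileid in corpus
    for w in corpus[fileid]
    if w.lower().startswith(target)
)

print(per_file)
print(per_occurrence)
```

With this toy data, `per_file` counts 1 for each of the three files, while `per_occurrence` counts 2 for 1789, 3 for 1797, and has no entry for 1793.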