Separating columns from a file by last column - regex

I'm trying to do the following: if the last column is a negative number from -1 to -5, write the second and last columns to a file "neg.txt". If the last column is a positive number, the second and last columns need to be written to "pos.txt". Both of my output files end up empty after execution. I don't know what's wrong with the code, since I thought an if statement could handle multiple conditions. I also tried regular expressions, but that didn't work, so I made the code as simple as possible to see what is not working.
The input file looks like this:
abandon odustati -2
abandons napusta -2
abandoned napusten -2
absentee odsutne -1
absentees odsutni -1
aboard na brodu 1
abducted otet -2
accepted prihvaceno 1
My code is:
from urllib.request import urlopen
import re
pos=open('lek_pos.txt','w')
neg=open('lek_neg.txt','w')
allCondsAreOK1 = ( parts[2]=='1' and parts[2]=='2' and
    parts[2]=='3' and parts[2]=='4' and parts[2]=='5' )
allCondsAreOK2 = ( parts[2]=='-1' and parts[2]=='-2' and
    parts[2]=='-3' and parts[2]=='-4' and parts[2]=='-5' )
with open('leksicki_resursi.txt') as pos:
    for line in pos:
        parts=line.split() # split line into parts
        if len(parts) > 1: # if at least 2 columns (parts)
            if allCondsAreOK:
                pos.write(parts[1]+parts[2])
            elif allCondsAreOK2:
                neg.write(parts[1]+parts[2])
            else:
                print("nothing matches")

You don't need a regex. You just need an if/elif: cast the last value to int and check whether it falls between -5 and -1, in which case you write to the neg file; if the value is any non-negative number, you write to the pos file:
with open('leksicki_resursi.txt') as f, open('lek_pos.txt','w') as pos, open('lek_neg.txt','w') as neg:
    for row in map(str.split, f):
        a, b = row[1], int(row[-1])
        if b >= 0:
            pos.write("{},{}\n".format(a, b))
        elif -5 <= b <= -1:
            neg.write("{},{}\n".format(a, b))
If the positive numbers must also be between 1 and 5, you can use a condition similar to the negative one (b is already an int at this point):
        if 1 <= b <= 5:
            pos.write("{},{}\n".format(a, b))
        elif -5 <= b <= -1:
            neg.write("{},{}\n".format(a, b))
Also if you have empty lines you can filter them out:
for row in filter(None,map(str.split, f)):
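As for why the original version never writes anything: an expression such as parts[2]=='1' and parts[2]=='2' can never be true, because one value cannot equal two different strings at once, so a range test after int() (or a chain of or) is what was intended. Below is a minimal sketch of the same classification as a function, assuming whitespace-separated lines whose last field is an integer. The name split_by_sign is a hypothetical helper, and it returns lists instead of writing files so that it is easy to try out:

```python
def split_by_sign(lines):
    """Return (pos, neg) lists of "second_column,last_column" strings."""
    pos, neg = [], []
    # str.split turns a blank line into [], which filter(None, ...) drops
    for row in filter(None, map(str.split, lines)):
        word, score = row[1], int(row[-1])
        if 1 <= score <= 5:
            pos.append("{},{}".format(word, score))
        elif -5 <= score <= -1:
            neg.append("{},{}".format(word, score))
    return pos, neg


sample = ["abandon odustati -2", "", "accepted prihvaceno 1", "absentee odsutne -1"]
pos, neg = split_by_sign(sample)
print(pos)  # ['prihvaceno,1']
print(neg)  # ['odustati,-2', 'odsutne,-1']
```

Writing the two lists to lek_pos.txt and lek_neg.txt then only needs a loop over each list inside the with block shown above.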


Efficient way for the following code

I saw this problem on hackerrank.com, the problem is to find a 4 letter palindrome from a given string which can be a long string also.
Constraint is as follows:
where, |s| is the length of the string and a,b,c,d are the positions of the corresponding letters in the palindrome.
I found a solution, but it isn't efficient enough: it fails with a 'time out' error during processing. The code is as follows:
s='kkkkkkz'
n=0
c_i,c_j,c_k,c_l=0,0,0,0
for i in range(len(s)):
    j=0;c_i+=1
    while j>=0 and j<len(s):
        c_j+=1
        if j>i:
            k=0
            while k>=0 and k<len(s):
                c_k+=1
                if k>j:
                    l=0
                    while l>=0 and l<len(s):
                        c_l+=1
                        if l>k:
                            a=s[i]+s[j]+s[k]+s[l]
                            if a[0]==a[3] and a[1]==a[2]: n+=1
                        l+=1
                k+=1
        j+=1
print n
I counted the number of times each loop runs, which right now is 7, 49, 147 and 245.
This is still better than the techniques I tried before, but I am not able to do better than this.
Any suggestions?
One way is to use the following, but this will still not be efficient enough. Scores 12/40 ..
import itertools
s=WHATEVERSTRING
n=0
for a in itertools.combinations(s, 4):
    n += (a[0] == a[3])*(a[1]==a[2])
print(n)
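A working solution goes down the following route: build the set of unique characters in the string and, for each outer character c, keep running counts in a dictionary of the pairs (c, d) seen so far. From those, accumulate the number of triples (c, d, d), and add that number to the total whenever another c closes the palindrome.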
from collections import defaultdict as di
data = [x for x in s.strip()]
chars = set(data)
sum_a = 0
for c in chars:
    a = 0
    b = di(int)
    double_pairs = 0
    for d in data:
        if d == c:
            sum_a += double_pairs
            double_pairs += b[c]
            b[c]+=a
            a += 1
        else:
            double_pairs += b[d]
            b[d] += a
print(sum_a%(10**9+7))
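To make the counting easier to check, here is a sketch that wraps the same idea in a function with descriptive names and compares it against a brute-force version. The names count_fast, count_brute and MOD are chosen for illustration and are not part of the original answer:

```python
from collections import defaultdict
from itertools import combinations

MOD = 10**9 + 7


def count_brute(s):
    """Check every 4-element subsequence (only viable for short strings)."""
    return sum(a == d and b == c for a, b, c, d in combinations(s, 4)) % MOD


def count_fast(s):
    """Same count in O(len(s) * number of distinct characters)."""
    total = 0
    for c in set(s):
        seen_c = 0                # occurrences of c so far
        pairs = defaultdict(int)  # pairs[d]: (c, d) pairs seen so far
        triples = 0               # (c, d, d) triples seen so far
        for d in s:
            if d == c:
                total += triples  # this c closes every open (c, d, d)
                triples += pairs[c]
                pairs[c] += seen_c
                seen_c += 1
            else:
                triples += pairs[d]
                pairs[d] += seen_c
    return total % MOD


print(count_fast("kkkkkkz"))  # 15
```

For 'kkkkkkz' the result is 15, the number of ways to choose four of the six k characters, and the brute-force function gives the same value on short inputs.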

Grabbing columns with special characters and upper case letters

I have a data frame and I'm trying to loop through the data frame to identify those columns which contain a special character or which are all capital letters.
I have tried a few things, but nothing where I'm able to catch the column names within the loop.
data = data.frame(one=c(1,3,5,1,3,5,1,3,5,1,3,5), two=c(1,3,5,1,3,5,1,3,5,1,3,5),
thr=c("A","B","D","E","F","G","H","I","J","H","I","J"),
fou=c("A","B","D","A","B","D","A","B","D","A","B","D"),
fiv=c(1,3,5,1,3,5,1,3,5,1,3,5),
six=c("A","B","D","E","F","G","H","I","J","H","I","J"),
sev=c("A","B","D","A","B","D","A","B","D","A","B","D"),
eig=c("A","B","D","A","B","D","A","B","D","A","B","D"),
nin=c(1.24,3.52,5.33,1.44,3.11,5.33,1.55,3.66,5.33,1.32,3.54,5.77),
ten=c(1:12),
ele=rep(1,12),
twe=c(1,2,1,2,1,2,1,2,1,2,1,2),
thir=c("THiS","THAT34","T(&*(", "!!!","#$#","$Q%J","who","THIS","this","this","this","this"),
stringsAsFactors = FALSE)
data
colls <- c()
spec=c("$","%","&")
for( col in names(data) ) {
if( length(strings[stringr::str_detect(data[,col], spec)]) >= 1 ){
print("HORRAY")
colls <- c(collls, col)
}
else print ("NOOOOOOOOOO")
}
for( col in names(data) ) {
if( any(data[,col]) %in% spec ){
print("HORRAY")
colls <- c(collls, col)
}
else print ("NOOOOOOOOOO")
}
Can anyone shed light on a good way to tackle this problem.
EDIT:
The end goal is to have a vector with a name of column names which meet that criteria. Sorry for my poor SO question, but hopefully this will help with what I'm trying to do
I would use grep() to search for the pattern you are interested in. See here.
[:upper:] Matches any upper case letters.
Combining it with anchors (^,$) and match one or more times (+) gives ^[[:upper:]]+$ and should only match entries completely in capitals.
The following would match the special characters in your toy data set (but is not guaranteed to match all special characters in your real data set i.e form feeds, carriage returns)
[:punct:] #Matches punctuation - ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
Note that rather than use [:punct:] you could define your special characters manually.
We can try the resultant code on the first row of your data set:
#Using grepl() rather than grep() so that we return a list of logical values.
grepl(x= data[1,], pattern = "^[[:upper:]]+$|[[:punct:]]")
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
This gives us our expected response except for column nine which has the value 1.24. Here the decimal point is being recognised as punctuation and is being flagged as a match.
We can add a "negative lookahead assertion" - (?!\\.) - to remove any periods from consideration, before they are even tested for being punctuation characters. Note that we use \\ to escape the period, because the backslash itself has to be escaped inside an R string.
grepl(x= data[1,], perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
This returns a better response: it no longer matches decimal points. NOTE: This might not be what you want, as this pattern also won't match any full stops in character fields. You would need to refine the pattern further.
Rather than use a 'for loop' to repeat this code for every row in your dataframe, I would use vectorization instead, which is 'more R like'.
To do this we must convert our script into a function, which we will call with apply():
myFunction <- function(x){
  matches <- grepl(x= x, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
  #Given a set of logical vectors 'matches', is at least one of the values true? using any()
  return(any(matches))
}
apply(X = data, 1, myFunction)
The 1 above instructs apply() to iterate across rows rather than columns.
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
In your example data set all rows have an entry containing a special character or a string of all capital letters. This is unsurprising as many columns in your example data set are a list of single capital letters.
If you are just interested in which values in column thirteen fit the stated criteria you can use:
matches <- grepl(x= data$thir, perl = TRUE, pattern = "(?!\\.)(^[[:upper:]]+$|[[:punct:]])")
matches
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
To subset your dataframe on matching rows:
data[matches,]
one two thr fou fiv six sev eig nin ten ele twe thir
3 5 5 D D 5 D D D 5.33 3 1 1 T(&*(
4 1 1 E A 1 E A A 1.44 4 1 2 !!!
5 3 3 F B 3 F B B 3.11 5 1 1 #$#
6 5 5 G D 5 G D D 5.33 6 1 2 $Q%J
8 3 3 I B 3 I B B 3.66 8 1 2 THIS
To subset your dataframe on non-matching rows:
data[!matches,]
one two thr fou fiv six sev eig nin ten ele twe thir
1 1 1 A A 1 A A A 1.24 1 1 1 THiS
2 3 3 B B 3 B B B 3.52 2 1 2 THAT34
7 1 1 H A 1 H A A 1.55 7 1 1 who
9 5 5 J D 5 J D D 5.33 9 1 1 this
10 1 1 H A 1 H A A 1.32 10 1 2 this
11 3 3 I B 3 I B B 3.54 11 1 1 this
12 5 5 J D 5 J D D 5.77 12 1 2 this
Note that the regular expression used doesn't match THAT34 as it isn't composed wholly of capitalised letters, having the number 34 at the end.
EDIT:
To get a list of column names identifying columns that fulfill the criteria in your edit use myFunction described above with:
colnames(data)[apply(X = data, 2, myFunction)]
"thr" "fou" "six" "sev" "eig" "thir"
The number in apply() changes from 1 to 2 to iterate across columns rather than rows. We pass the output from apply(), a logical vector of matches (TRUE or FALSE), to colnames(data); this returns the matching column names via subsetting.
I would collapse the data into strings (one string per row)
strings = apply(data, 1, paste, collapse = "")
contains_only_caps = strings == toupper(strings)
strings[contains_only_caps]
# [1] "33BB3BBB3.52 212THAT34" "55DD5DDD5.33 311T(&*(" "11EA1EAA1.44 412!!!" "33FB3FBB3.11 511#$#"
# [5] "55GD5GDD5.33 612$Q%J" "33IB3IBB3.66 812THIS"
# escaping special characters
spec=c("\\$","%","\\&")
contains_spec = stringr::str_detect(strings, pattern = paste(spec, collapse = "|"))
strings[contains_spec]
# [1] "55DD5DDD5.33 311T(&*(" "33FB3FBB3.11 511#$#" "55GD5GDD5.33 612$Q%J"
You could also use which on contains_spec or contains_only_caps to get the corresponding row numbers for the original data frame. I think that using strings rather than row-wise data frame elements will be much faster, as long as you want to search the whole strings, not certain columns for certain conditions.

I before E program not working

Just wrote a python program to determine how useful the mnemonic "I before E except after C" is.
With the input:
'I before e except when conducting an efficient heist on eight foreign neighbors. I believe my friend has left the receipt for the diet books hidden in the ceiling'
It would display:
Number of times the rule helped: 5
Number of times the rule was broken: 5
I changed a few things and thought I had changed them back, but the code is now broken. Any advice would be helpful.
while True:
line = input("Line: ")
count = 0
h = 0
nh = 0
words = line.split()
for x in range(0, len(words)):
word = words[count]
if "ie" in word:
if "cie" in word:
nh += 1
else:
h +=1
if "ei" in word:
if "cei" in word:
h += 1
else:
nh += 1
else:
h += 0
count += 1
print("Number of times the rule helped:",h)
print("Number of times the rule was broken:",nh)
print()
Good god I'm an idiot. I've probably spent a total of 3 hours or so trying to fix this thing.
for x in range(0, len(words)):
word = words[count]
if "ie" in word:
if "cie" in word:
nh += 1
else:
h +=1
if "ei" in word:
if "cei" in word:
h += 1
else:
nh += 1
count += 1
Can anyone spot the difference between this and the corresponding part of the old code? That 'count+=1' at the end is just indented an additional time. All those hours wasted... sorry if I wasted anyone else's time here :|
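One way to rule out this kind of slip entirely is to iterate over the words directly instead of keeping a separate count index. The sketch below does that with the same "ie"/"ei" checks as above; score_rule is a hypothetical function name, and it returns the two counters instead of printing them:

```python
def score_rule(text):
    """Return (helped, broken) counts for "I before E except after C"."""
    helped = broken = 0
    for word in text.split():  # iterate directly; no manual index to update
        if "ie" in word:
            if "cie" in word:
                broken += 1
            else:
                helped += 1
        if "ei" in word:
            if "cei" in word:
                helped += 1
            else:
                broken += 1
    return helped, broken


sentence = ("I before e except when conducting an efficient heist on eight "
            "foreign neighbors. I believe my friend has left the receipt for "
            "the diet books hidden in the ceiling")
print(score_rule(sentence))  # (5, 5)
```

With the sample sentence from the question this gives 5 and 5, matching the expected output.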
You'd be well-served to better articulate your test cases. I don't exactly understand what you're trying to do.
It seems like all you need is to see how many times 'cie' or 'cei' occurred in the text.
In which case:
for i in range(0, len(line)):
    print("scanning {0}".format(line[i:i+3]))
    if line[i:i+3].lower() == "cie":
        nh += 1
    if line[i:i+3].lower() == "cei":
        h += 1

Loop problems Even Count

I have a beginner question. Loops are extremely hard for me to understand, so it's come to me asking for help.
I am trying to create a function to count the amount of even numbers in a user input list, with a negative at the end to show the end of the list. I know I need to use a while loop, but I am having trouble figuring out how to walk through the indexes of the input list. This is what I have so far, can anyone give me a hand?
def find_even_count(numlist):
count = 0
numlist.split()
while numlist > 0:
if numlist % 2 == 0:
count += 1
return count
numlist = raw_input("Please enter a list of numbers, with a negative at the end: ")
print find_even_count(numlist)
I used the split to separate out the indexes of the list, but I know I am doing something wrong. Can anyone point out what I am doing wrong, or point me to a good step by step explanation of what to do here?
Thank you guys so much, I know you probably have something more on your skill level to do, but appreciate the help!
You were pretty close, just a couple of corrections:
def find_even_count(numlist):
    count = 0
    lst = numlist.split()
    for num in lst:
        if int(num) % 2 == 0:
            count += 1
    return count
numlist = raw_input("Please enter a list of numbers, with a negative at the end: ")
print find_even_count(numlist)
I have used a for loop rather than a while loop, stored the outcome of numlist.split() to a variable (lst) and then just iterated over this.
You have a couple of problems:
You split numlist, but don't assign the resulting list to anything.
You then try to operate on numlist, which is still the string of all numbers.
You never try to convert anything to a number.
Instead, try:
def find_even_count(numlist):
    count = 0
    for numstr in numlist.split(): # iterate over the list
        num = int(numstr) # convert each item to an integer
        if num < 0:
            break # stop when we hit a negative
        elif num % 2 == 0:
            count += 1 # increment count for even numbers
    return count # return the total
Or, doing the whole thing in one line:
def find_even_count(numlist):
    return sum(num % 2 == 0 for num in map(int, numlist.split()) if num > 0)
(Note: the one-liner will fail in cases where the user tries to trick you by putting more numbers after the "final" negative number, e.g. with numlist = "1 2 -1 3 4")
If you must use a while loop (which isn't really the best tool for the job), it would look like:
def find_even_count(numlist):
    index = count = 0
    numlist = list(map(int, numlist.split()))
    # the length check avoids an IndexError if no negative number was entered
    while index < len(numlist) and numlist[index] > 0:
        if numlist[index] % 2 == 0:
            count += 1
        index += 1
    return count
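If you want a compact version that also stops at the first negative number, so that input such as "1 2 -1 3 4" is handled, itertools.takewhile is one option. This is a sketch in Python 3 syntax; it assumes that only a negative number ends the list, so 0 is counted as an even number:

```python
from itertools import takewhile


def find_even_count(numlist):
    """Count the even numbers that appear before the first negative value."""
    numbers = map(int, numlist.split())
    # takewhile yields items until the predicate first fails, then stops
    before_sentinel = takewhile(lambda n: n >= 0, numbers)
    return sum(n % 2 == 0 for n in before_sentinel)


print(find_even_count("1 2 -1 3 4"))  # 1
print(find_even_count("2 4 6 7 -1"))  # 3
```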

Understanding Recursive Function

I'm working through the book NLP with Python, and I came across this example from an 'advanced' section. I'd appreciate help understanding how it works. The function computes all possibilities of a number of syllables to reach a 'meter' length n. Short syllables "S" take up one unit of length, while long syllables "L" take up two units of length. So, for a meter length of 4, the return statement looks like this:
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
The function:
def virahanka1(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        return s + l
The part I don't understand is how the 'SSL', 'SLS', and 'LSS' matches are made, if s and l are separate lists. Also in the line "for prosody in virahanka1(n-1)," what is prosody? Is it what the function is returning each time? I'm trying to think through it step by step but I'm not getting anywhere. Thanks in advance for your help!
Adrian
Let's just build the function from scratch. That's a good way to understand it thoroughly.
Suppose then that we want a recursive function to enumerate every combination of Ls and Ss to make a given meter length n. Let's just consider some simple cases:
n = 0: Only way to do this is with an empty string.
n = 1: Only way to do this is with a single S.
n = 2: You can do it with a single L, or two Ss.
n = 3: LS, SL, SSS.
Now, think about how you might build the answer for n = 4 given the above data. Well, the answer would either involve adding an S to a meter length of 3, or adding an L to a meter length of 2. So, the answer in this case would be LL, LSS from n = 2 and SLS, SSL, SSSS from n = 3. You can check that this is all possible combinations. We can also see that n = 2 and n = 3 can be obtained from n = 0,1 and n=1,2 similarly, so we don't need to special-case them.
Generally, then, for n ≥ 2, you can derive the strings for length n by looking at strings of length n-1 and length n-2.
Then, the answer is obvious:
if n = 0, return just an empty string
if n = 1, return a single S
otherwise, return the result of adding an S to all strings of meter length n-1, combined with the result of adding an L to all strings of meter length n-2.
By the way, the function as written is a bit inefficient because it recalculates a lot of values. That would make it very slow if you asked for e.g. n = 30. You can make it faster very easily by using lru_cache from functools (available since Python 3.2):
from functools import lru_cache
@lru_cache(maxsize=None)
def virahanka1(n):
    ...
This caches results for each n, making it much faster.
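A complete sketch of the cached version, assuming Python 3. The name virahanka_cached is made up for this example, and it returns tuples rather than lists, which is a choice made here so that a caller cannot accidentally modify a result that is stored in the cache:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def virahanka_cached(n):
    """Return all S/L strings of meter length n as a tuple."""
    if n == 0:
        return ("",)
    if n == 1:
        return ("S",)
    s = tuple("S" + prosody for prosody in virahanka_cached(n - 1))
    l = tuple("L" + prosody for prosody in virahanka_cached(n - 2))
    return s + l


print(list(virahanka_cached(4)))  # ['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
print(len(virahanka_cached(20)))  # 10946
```

Each value of n is now computed once. The number of strings still grows like the Fibonacci sequence, so building the result for a large n remains expensive even with the cache.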
I tried to melt my brain. I added print statements to explain to me what was happening. I think the most confusing part about recursive calls is that it seems to go into the call forward but come out backwards, as you may see with the prints when you run the following code;
def virahanka1(n):
    if n == 4:
        print 'Lets Begin for ', n
    else:
        print 'recursive call for ', n, '\n'
    if n == 0:
        print 'n = 0 so adding "" to below'
        return [""]
    elif n == 1:
        print 'n = 1 so returning S for below'
        return ["S"]
    else:
        print 'next recursivly call ' + str(n) + '-1 for S'
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        print '"S" + each string in s equals', s
        if n == 4:
            print '**Above is the result for s**'
        print 'n =',n,'\n', 'next recursivly call ' + str(n) + '-2 for L'
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        print '\t','what was returned + each string in l now equals', l
        if n == 4:
            print '**Above is the result for l**','\n','**Below is the end result of s + l**'
        print 'returning s + l',s+l,'for below', '\n','='*70
        return s + l
virahanka1(4)
Still confusing for me, but with this and Jocke's elegant explanation, I think I can understand what is going on.
How about you?
Below is what the code above produces;
Lets Begin for 4
next recursivly call 4-1 for S
recursive call for 3
next recursivly call 3-1 for S
recursive call for 2
next recursivly call 2-1 for S
recursive call for 1
n = 1 so returning S for below
"S" + each string in s equals ['SS']
n = 2
next recursivly call 2-2 for L
recursive call for 0
n = 0 so adding "" to below
what was returned + each string in l now equals ['L']
returning s + l ['SS', 'L'] for below
======================================================================
"S" + each string in s equals ['SSS', 'SL']
n = 3
next recursivly call 3-2 for L
recursive call for 1
n = 1 so returning S for below
what was returned + each string in l now equals ['LS']
returning s + l ['SSS', 'SL', 'LS'] for below
======================================================================
"S" + each string in s equals ['SSSS', 'SSL', 'SLS']
**Above is the result for s**
n = 4
next recursivly call 4-2 for L
recursive call for 2
next recursivly call 2-1 for S
recursive call for 1
n = 1 so returning S for below
"S" + each string in s equals ['SS']
n = 2
next recursivly call 2-2 for L
recursive call for 0
n = 0 so adding "" to below
what was returned + each string in l now equals ['L']
returning s + l ['SS', 'L'] for below
======================================================================
what was returned + each string in l now equals ['LSS', 'LL']
**Above is the result for l**
**Below is the end result of s + l**
returning s + l ['SSSS', 'SSL', 'SLS', 'LSS', 'LL'] for below
======================================================================
This function says that:
virahanka1(n) is the same as [""] when n is zero, ["S"] when n is 1, and s + l otherwise.
Where s is the result of prepending "S" to each element in the list returned by virahanka1(n - 1), and l is the result of prepending "L" to each element of virahanka1(n - 2).
So the computation would be:
When n is 0:
[""]
When n is 1:
["S"]
When n is 2:
s = ["S" + "S"]
l = ["L" + ""]
s + l = ["SS", "L"]
When n is 3:
s = ["S" + "SS", "S" + "L"]
l = ["L" + "S"]
s + l = ["SSS", "SL", "LS"]
When n is 4:
s = ["S" + "SSS", "S" + "SL", "S" + "LS"]
l = ["L" + "SS", "L" + "L"]
s + l = ["SSSS", "SSL", "SLS", "LSS", "LL"]
And there you have it, step by step.
You need to know the results of the other function calls in order to calculate the final value, which can be pretty messy to do manually, as you can see. It is important, though, that you do not try to think recursively in your head. This would cause your mind to melt. I described the function in words so that you can see that these kinds of functions are descriptions, and not a sequence of commands.
The prosody you see in the definitions of s and l is a variable. It is used in a list comprehension, which is a way of building lists: it takes the value of each element of the list returned by the recursive call, one at a time. I've described earlier how this list is built.
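To make the role of prosody concrete, here is the same list comprehension written out as an explicit loop. prepend_all is a hypothetical helper used only for this illustration:

```python
def prepend_all(prefix, strings):
    """Explicit-loop equivalent of [prefix + prosody for prosody in strings]."""
    result = []
    for prosody in strings:  # prosody takes each string in turn
        result.append(prefix + prosody)
    return result


shorter = ["SS", "L"]  # what virahanka1(2) returns
print(prepend_all("S", shorter))               # ['SSS', 'SL']
print(["S" + prosody for prosody in shorter])  # ['SSS', 'SL']
```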