tab-delimited file output inconsistent - python-2.7

I am attempting to write elements from a nested list to individual lines in a file, with each element separated by tab characters. Each of the nested lists is of the following form:
('A', 'B', 'C', 'D')
The final output should be of the form:
A B C D
E F G H
. . . .
. . . .
However, my output seems to have reproducible inconsistencies such that the output is of the general form:
A B C D
E F G H
I J K L
M N O P
. . . .
. . . .
I've inspected the lists before writing and they seem identical in form. The code I'm using to write is:
with open("letters.txt", 'w') as outfile:
    outfile.writelines('\t'.join(line) + '\n' for line in letter_list)
Importantly, if I replace '\t' with, for example, '|', the file is created without such inconsistencies. I know whitespace parsing can become an issue for certain file I/O operations, but I don't know how to troubleshoot it here.
Thanks for the time.
EDIT: Here is some actual input data (in nested-list form) and output:
IN
[('5', '+', '5752624-5752673', 'alt_region_8161'), ('1', '+', '621461-622139', 'alt_region_67'), ('1', '+', '453907-454063', 'alt_region_60'), ('1', '+', '539611-539815', 'alt_region_61'), ('4', '+', '14610049-14610103', 'alt_region_6893'), ('4', '+', '14610049-14610144', 'alt_region_6895'), ('4', '+', '14610049-14610144', 'alt_region_6897'), ('4', '+', '14610049-14610144', 'alt_region_6896')]
OUT
4 + 12816011-12816087 alt_region_6808
1 + 21214720-21214747 alt_region_2377
4 + 9489968-9490833 alt_region_7382
1 + 12121545-12126263 alt_region_650
4 + 9489968-9490811 alt_region_7381
4 + 12816011-12816087 alt_region_6807
1 + 2032338-2032740 alt_region_157
5 + 4695084-4695628 alt_region_9316
1 + 22294677-22295134 alt_region_2424
1 + 22294677-22295139 alt_region_2425
1 + 22294677-22295139 alt_region_2426
1 + 22294677-22295139 alt_region_2427
1 + 22294677-22295134 alt_region_2422
1 + 22294677-22295134 alt_region_2423
1 + 22294384-22295198 alt_region_2428
1 + 22294384-22295198 alt_region_2429
5 + 20845105-20845211 alt_region_9784
5 + 20845105-20845206 alt_region_9783
3 + 2651447-2651889 alt_region_5562
EDIT: Thanks to everyone who commented. Sorry if the question was poorly phrased. I appreciate the help in clarifying the issue (or, apparently, non-issue).

There are no spaces (' ') in your output, only tabs ('\t').
>>> print(repr('1 + 21214720-21214747 alt_region_2377'))
'1\t+\t21214720-21214747\talt_region_2377'
  ^^ ^^                 ^^
Tabs are not equivalent to a fixed number of spaces (in most editors). Rather, a tab advances the character that follows it to the next multiple of x columns from the left margin, where x depends on the viewer: it is most commonly 8, though it is 4 here on SO.
>>> for i in range(7):
...     print('x'*i + '\tx')

        x
x       x
xx      x
xxx     x
xxxx    x
xxxxx   x
xxxxxx  x
If you want your output to appear aligned to the naked eye, you should use string formatting:
>>> for line in data:
...     print('{:4} {:4} {:20} {:20}'.format(*line))

5    +    5752624-5752673      alt_region_8161
1    +    621461-622139        alt_region_67
1    +    453907-454063        alt_region_60
1    +    539611-539815        alt_region_61
4    +    14610049-14610103    alt_region_6893
4    +    14610049-14610144    alt_region_6895
4    +    14610049-14610144    alt_region_6897
4    +    14610049-14610144    alt_region_6896
Note, however, that this will not necessarily be readable by code that expects a tab-separated value file.

In some text editors, tabs are displayed like that. The contents of the file are correct; it is only a matter of how the file is rendered on screen. This happens with tabs but not with '|', which is why you don't see it when you use '|'.
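To convince yourself the file really is clean, read it back and look at repr() of each line, or round-trip it through the csv module. A minimal sketch (the file name and sample rows here are illustrative):

```python
import csv

rows = [('5', '+', '5752624-5752673', 'alt_region_8161'),
        ('1', '+', '621461-622139', 'alt_region_67')]

with open("letters.txt", "w") as outfile:
    outfile.writelines('\t'.join(line) + '\n' for line in rows)

# repr() shows the tabs explicitly instead of rendering them
with open("letters.txt") as infile:
    for line in infile:
        print(repr(line))

# csv.reader recovers exactly the original fields
with open("letters.txt") as infile:
    parsed = [tuple(row) for row in csv.reader(infile, delimiter='\t')]
print(parsed == rows)  # True
```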

Related

Remove all words containing '#' from list in DataFrame

I have a DataFrame in which one column contains lists of words.
>>> dataset.head(2)
                        contain
0  [name, Place, ect#gtr, nick]
1       [gf#e, nobel, play, hi]
I want to remove all the words which contain '#'. In the above example, I want to remove "ect#gtr" and "gf#e".
Try this one:
ab = np.column_stack([~df[col].str.contains(r"#") for col in df])
new_df = df.loc[ab.any(axis=1)]
print(new_df)
Use a list comprehension with filtering; regex is not necessary here:
df = pd.DataFrame({'contain': [['name', 'Place', 'ect#gtr', 'nick'],
                               ['gf#e', 'nobel', 'play', 'hi']]})
print (df)
                        contain
0  [name, Place, ect#gtr, nick]
1       [gf#e, nobel, play, hi]
df.contain = df.contain.apply(lambda x: [y for y in x if '#' not in y])
Or:
df.contain = [[y for y in x if '#' not in y] for x in df.contain]
print (df)
               contain
0  [name, Place, nick]
1    [nobel, play, hi]
EDIT: To remove such words from whole strings, combine split with join:
df = pd.DataFrame({'contain': ['name Place ect#gtr nick', 'gf#e nobel play hi']})
print (df)
                   contain
0  name Place ect#gtr nick
1       gf#e nobel play hi
df.contain = df.contain.apply(lambda x: ' '.join([y for y in x.split() if '#' not in y]))
print (df)
           contain
0  name Place nick
1    nobel play hi

Regex to pull out numbers and operators

I am trying to write a regex that captures seven groups: four numbers and three operators.
Individual lines in the file look like this:
[ 9] -21 - ( 12) - ( -5) + ( -26) = ______
The number in brackets is the line number, which will be ignored. I want the four integer values (including the '-' if an integer is negative), which in this case are -21, 12, -5 and -26. I also want the operators, which are -, - and +.
I will then take those values (match objects) and actually compute the answer:
-21 - 12 - -5 + -26 = -54
I have this:
[\s+0-9](-?[0-9]+)
In Pythex it grabs the [ 9] but it also then grabs every integer in separate match objects (four additional match objects). I don't know why it does that.
If I add a ? to the end: [\s+0-9](-?[0-9]+)? thinking it will only grab the first integer, it doesn't. I get seventeen matches?
I am trying to say, via the regex: grab the line number and its brackets (that part works), then grab the first integer including sign, then the operator, then the next integer including sign, then the next operator, etc.
It appears that I have failed to explain myself clearly.
The file has hundreds of lines. Here is a five line sample:
[ 1] 19 - ( 1) - ( 4) + ( 28) = ______
[ 2] -18 + ( 8) - ( 16) - ( 2) = ______
[ 3] -8 + ( 17) - ( 15) + ( -29) = ______
[ 4] -31 - ( -12) - ( -5) + ( -26) = ______
[ 5] -15 - ( 12) - ( 14) - ( 31) = ______
The operators are only '-' or '+', but any combination of them may appear in a line. The integers will all be from -99 to 99, but that shouldn't matter if the regex works. The goal (as I see it) is to extract seven groups: four integers and three operators, then add the numbers exactly as they appear. The number in brackets is just the line number and plays no role in the computation.
Good luck with regex; if you just need the result:
import re

s = "[ 9] -21 - ( 12) - ( -5) + ( -26) = ______"
s = s[s.find("]")+1:s.find("=")]  # cut away line nr and "= ..."
if not re.sub(r"[-+0-9() ]*", "", s):  # weak attempt to prevent python code injection
    print(eval(s))
else:
    print("wonky chars inside, only numbers, +, -, space and () allowed.")
Output:
-54
Make sure to read up on the dangers of eval(), and have a look at:
https://opensourcehacker.com/2014/10/29/safe-evaluation-of-math-expressions-in-pure-python/
https://softwareengineering.stackexchange.com/questions/311507/why-are-eval-like-features-considered-evil-in-contrast-to-other-possibly-harmfu/311510
https://www.kevinlondon.com/2015/07/26/dangerous-python-functions.html
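If you would rather avoid eval() altogether, the gist of the links above is: parse the tokens yourself. A hedged sketch (the token pattern is mine, not from the answer) that pulls out signed integers and standalone +/- operators with re.findall and folds them left to right:

```python
import re

def compute(line):
    # drop the bracketed line number and everything from '=' on
    expr = line[line.find("]") + 1 : line.find("=")]
    # signed integers, or standalone +/- operators, in order of appearance
    tokens = re.findall(r'-?\d+|[+-]', expr)
    total = int(tokens[0])
    for op, num in zip(tokens[1::2], tokens[2::2]):
        total = total + int(num) if op == '+' else total - int(num)
    return total

print(compute("[ 9] -21 - ( 12) - ( -5) + ( -26) = ______"))  # -54
```

This sidesteps code injection entirely: nothing is executed, only integers are parsed.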
Example for hundreds of lines:
import re
import random

def calcIt(line):
    s = line[line.find("]")+1:line.find("=")]
    if not re.sub(r"[-+0-9() ]*", "", s):
        return eval(s)
    else:
        print(line + " has wonky chars inside, only numbers, +, -, space and () allowed.")
        return None

random.seed(42)
pattern = "[ {}] -{} - ( {}) - ( -{}) + ( -{}) = "
for n in range(1000):
    nums = [n]
    nums.extend([random.randint(0, 100), random.randint(-100, 100),
                 random.randint(-100, 100), random.randint(-100, 100)])
    c = pattern.format(*nums)
    print(c, calcIt(c))
Ahh... I had a cup of coffee and sat down in front of Pythex again.
I figured out the correct regex:
[\s+0-9]\s+(-?[0-9]+)\s+([-+])\s+\(\s+(-?[0-9]+)\)\s+([-+])\s+\(\s+(-?[0-9]+)\)\s+([-+])\s+\(\s+(-?[0-9]+)\)
Yields:
-21
-
12
-
-5
+
-26
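For completeness, here is a sketch that applies a lightly tidied variant of that regex (leading bracket escaped, `[-|+]` reduced to `[-+]`, since `|` is literal inside a character class) to a sample line and computes the answer:

```python
import re

# one group per integer and per operator: seven groups in total
pattern = re.compile(
    r'\[\s*\d+\]\s+(-?\d+)\s+([-+])\s+\(\s*(-?\d+)\)'
    r'\s+([-+])\s+\(\s*(-?\d+)\)\s+([-+])\s+\(\s*(-?\d+)\)'
)

line = "[ 1] 19 - ( 1) - ( 4) + ( 28) = ______"
n1, op1, n2, op2, n3, op3, n4 = pattern.search(line).groups()

total = int(n1)
for op, n in ((op1, n2), (op2, n3), (op3, n4)):
    total = total + int(n) if op == '+' else total - int(n)
print(total)  # 42
```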

postgresql: regexp_split_to_table - how to split text by delimiters

I need to split my text to table by delimiters '<=' and '=>', for example
select regexp_split_to_table('kik plz <= p1 => and <= p2 => too. A say <=p1 =>','regexp');
The result must be:
table:
--------------
1 | 'kik plz '
2 | '<= p1 =>'
3 | ' and '
4 | '<= p2 =>'
5 | ' too. A say '
6 | '<=p1 =>'
I think the answer is in positional patterns, but my skills are not enough.
select regexp_split_to_table('kik plz <= p1 => and <= p2 => too. A say <=p1 =>', '((\s)(?=<=))|((\s)(?!=>))')
This returns the wrong result.
select regexp_split_to_table(
           replace(
               replace('kik plz<= p1 =>and<= p2 =>too. A say <=p1 =>', '<=', E'\001<=')
           , '=>', E'=>\001')
       , E'\001');
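As a cross-check of the expected segments, the same split is a one-liner in Python, since re.split keeps any capturing group in its output (shown purely for illustration, not as a replacement for the SQL):

```python
import re

s = 'kik plz <= p1 => and <= p2 => too. A say <=p1 =>'
# the capturing group makes re.split return the delimiters too;
# filter out the empty strings it leaves at the edges
parts = [p for p in re.split(r'(<=.*?=>)', s) if p]
print(parts)
# ['kik plz ', '<= p1 =>', ' and ', '<= p2 =>', ' too. A say ', '<=p1 =>']
```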

Combining values from an arbitrary number of pandas columns into a new column — a 'join' in the not-SQL sense

I'm trying to do what's described here, but it's not the case that only one of my columns is populated, and I want to have a delimiter.
The code I'd like to replace (with something that will take an arbitrary number of k's) is:
raw_df["all ks"] = raw_df["k1"].fillna("") + "/" + \
                   raw_df["k2"].fillna("") + "/" + \
                   raw_df["k3"].fillna("") + "/" + \
                   raw_df["k4"].fillna("")
I wondered if this solution could be somehow responsive, but I'm hoping for something simpler.
Thanks for any helpful suggestions. Searching the web has been frustrating because I'm trying to do a join (in the pythonic sense) and most search results relate to joining columns in the database sense (including as adapted in pandas).
You could use the cat string method to concatenate the string values. With this method you can specify the delimiter and what the NaN values should be replaced with.
For example, here's a DataFrame:
>>> df = pd.DataFrame({'a': ['x', np.nan, 'x'],
...                    'b': ['y', 'y', np.nan],
...                    'c': ['z', 'z', np.nan]})
>>> df
     a    b    c
0    x    y    z
1  NaN    y    z
2    x  NaN  NaN
Then starting with column a and passing in the remaining columns using a list comprehension:
>>> df['a'].str.cat(others=[df[col] for col in df.columns[1:]],
...                 sep='/', na_rep='')
0    x/y/z
1     /y/z
2      x//
So this is what I came up with. It uses apply() and a function. Not as concise as I hoped, but it works with an arbitrary number of ks. Maybe someone will come up with something better.
Generating a dataframe
d = {'k1': [np.nan, 'a', 'b'], 'k2': ['c', np.nan, 'c'],
     'k3': ['r', 't', np.nan], 'k4': [np.nan, 't', 'e']}
raw_df = pd.DataFrame(d)
raw_df
    k1   k2   k3   k4
0  NaN    c    r  NaN
1    a  NaN    t    t
2    b    c  NaN    e
define a function
def concatKs(s):
    allK = ''
    for k in s:
        if k is not np.nan:
            allK += k + '/'
        else:
            allK += '' + '/'
    return allK
then apply(), passing our function:
raw_df['all ks'] = raw_df.apply(concatKs, axis=1)
raw_df
    k1   k2   k3   k4   all ks
0  NaN    c    r  NaN   /c/r//
1    a  NaN    t    t  a//t/t/
2    b    c  NaN    e  b/c//e/
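A more concise alternative to the apply()-plus-loop version, assuming every k column holds strings or NaN, is to fill the NaNs and join each row. Note this yields one fewer trailing '/' than concatKs, matching the question's original fillna chain instead:

```python
import numpy as np
import pandas as pd

d = {'k1': [np.nan, 'a', 'b'], 'k2': ['c', np.nan, 'c'],
     'k3': ['r', 't', np.nan], 'k4': [np.nan, 't', 'e']}
raw_df = pd.DataFrame(d)

# replace NaN with '' so join sees only strings, then join each row
raw_df['all ks'] = raw_df.fillna('').apply('/'.join, axis=1)
print(raw_df['all ks'].tolist())  # ['/c/r/', 'a//t/t', 'b/c//e']
```

This scales to any number of k columns without naming them individually.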

posix regexp to split a table

I'm currently working on data migration in PostgreSQL. Since I'm new to posix regular expressions, I'm having some trouble with a simple pattern and would appreciate your help.
I want a regular expression that splits my table on each alphanumeric character in a column, e.g. when a column contains the string 'abc' I'd like to split it into 3 rows: ['a', 'b', 'c'], and I need a regexp for that.
The second case is a little more complicated, I'd like to split an expression '105AB' into ['105A', '105B'], I'd like to copy the numbers at the beginning of the string and split the table on uppercase letters, in the end joining the number with exactly 1 uppercase letter.
The function I'll be using is probably regexp_split_to_table(string, regexp).
I'm intentionally providing very little data not to confuse anyone, since what I posted is the essence of the problem. If you need more information please comment.
The first case just needs an empty pattern:
select regexp_split_to_table(s, ''), i
from (values
    ('abc', 1),
    ('def', 2)
) s(s, i);

 regexp_split_to_table | i
-----------------------+---
 a                     | 1
 b                     | 1
 c                     | 1
 d                     | 2
 e                     | 2
 f                     | 2
In the second case you don't say whether the numerics are always the first three characters:
select
    left(s, 3) || regexp_split_to_table(substring(s from 4), ''), i
from (values
    ('105AB', 1),
    ('106CD', 2)
) s(s, i);

 ?column? | i
----------+---
 105A     | 1
 105B     | 1
 106C     | 2
 106D     | 2
For a variable number of numerics:
select n || a, i
from (
    select
        substring(s, '^\d{1,3}') n,
        regexp_split_to_table(substring(s, '[A-Z]+'), '') a,
        i
    from (values
        ('105AB', 1),
        ('106CD', 2)
    ) s(s, i)
) s;

 ?column? | i
----------+---
 105A     | 1
 105B     | 1
 106C     | 2
 106D     | 2
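If the same transformation is ever needed outside the database, a quick Python sketch of both cases (purely illustrative, assuming digits always precede the uppercase run):

```python
import re

# case 1: split a string into one row per character
print(list('abc'))  # ['a', 'b', 'c']

# case 2: prefix the leading digits onto each trailing uppercase letter
def expand(s):
    num = re.match(r'\d+', s).group()          # leading digit run
    letters = re.search(r'[A-Z]+', s).group()  # trailing uppercase run
    return [num + ch for ch in letters]

print(expand('105AB'))  # ['105A', '105B']
```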