First "for loop" stops after first iteration python - python-2.7

I am reading two different files with nested for loops. The first "for loop" stops after its first iteration. The printed output is only line 1 of f1 paired with all lines of f2, and then the loop exits.
for line1 in f1:
    line1 = line1.split('\t')
    for line2 in f2:
        line2 = line2.split('\t')
        print line1,line2
f1:
x1
x2
x3
f2:
y1
y2
y3
desired output:
x1 y1
x1 y2
x1 y3
x2 y1
x2 y2
x2 y3
x3 y1
x3 y2
x3 y3

Your loops are currently nested, which means that your program reads the entire contents of f2 for every line in f1. But once the end of f2 is reached (at the end of the first pass of the outer loop), there are no more lines in f2 to read, so the inner loop has nothing to do on later passes. The fix is to manually reset the cursor to the beginning.
Attempt 4:
You were not resetting the cursor on file 2 once you got to the end of the file the first time round. Unless you reopen the file in each iteration, you must move the cursor back to the beginning manually.
If I have now understood you correctly:
def print_both(f1, f2):
    f1.seek(0)
    f2.seek(0)
    for line1 in f1:
        line1 = line1.split('\t')
        for line2 in f2:
            line2 = line2.split('\t')
            print(line1, line2)
        # rewind f2 so the next line of f1 sees all of f2 again
        f2.seek(0)
print_both(open("f1.tsv", 'r'), open("f2.tsv", 'r'))
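An alternative to rewinding, sketched here under the assumption that f2 is small enough to hold in memory, is to read f2 into a list once and loop over that list instead. The helper name pair_lines and the io.StringIO stand-ins for the two files are illustrative only and do not come from the question.

```python
import io

def pair_lines(f1, f2):
    """Pair every line of f1 with every line of f2 (hypothetical helper)."""
    # Read f2 once; a list can be iterated repeatedly, an exhausted file cannot.
    rows2 = [line.rstrip('\n').split('\t') for line in f2]
    pairs = []
    for line1 in f1:
        row1 = line1.rstrip('\n').split('\t')
        for row2 in rows2:
            pairs.append((row1, row2))
    return pairs

# In-memory stand-ins for the two files from the question.
f1 = io.StringIO(u"x1\nx2\nx3\n")
f2 = io.StringIO(u"y1\ny2\ny3\n")
pairs = pair_lines(f1, f2)
for row1, row2 in pairs:
    print(row1, row2)
```

This avoids touching the file cursor at all, at the cost of keeping every line of f2 in memory.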

Related

SED - Insert saved value after certain pattern when a specific string doesn't exist in the file

I have files that look similar to this:
LET: DimX (2660.0)
LET: DimZ (1050.0)
LET: DimS (0.8)
LET: DREHEN (20)
DIM: X DimX
Z DimZ+0.5
S DimS
REF: X1 FOD-23.24
X2 FOD-23.24
Z1 FOD-24.86
Z2 FOD-24.86
POS: CENT_FUNC 1
QSU 10 QSD 10
TURN_AROUND
And like this - note the STAINLESS in the 2nd one:
LET: DimX (2660.0)
LET: DimZ (1050.0)
LET: DimS (0.7)
LET: DREHEN (20)
DIM: X DimX
Z DimZ+0.5
S DimS
STAINLESS
REF: X1 FOD-23.24
X2 FOD-23.24
Z1 FOD-24.86
Z2 FOD-24.86
POS: CENT_FUNC 1
QSU 10 QSD 10
TURN_AROUND
I want to save the value in brackets in the line matching
LET: DimS (XX)
and insert a new line
MAT: 'XX correction' - where XX is the saved value,
after S DimS
but only if the whole file doesn't contain the string STAINLESS.
So this should be the outcome for the 1st example:
LET: DimX (2660.0)
LET: DimZ (1050.0)
LET: DimS (0.8)
LET: DREHEN (20)
DIM: X DimX
Z DimZ+0.5
S DimS
MAT: '0.8 correction'
REF: X1 FOD-23.24
X2 FOD-23.24
Z1 FOD-24.86
Z2 FOD-24.86
POS: CENT_FUNC 1
QSU 10 QSD 10
TURN_AROUND
Outcome of the 2nd example should stay as it was as it contains STAINLESS:
LET: DimX (2660.0)
LET: DimZ (1050.0)
LET: DimS (0.7)
LET: DREHEN (20)
DIM: X DimX
Z DimZ+0.5
S DimS
STAINLESS
REF: X1 FOD-23.24
X2 FOD-23.24
Z1 FOD-24.86
Z2 FOD-24.86
POS: CENT_FUNC 1
QSU 10 QSD 10
TURN_AROUND
I've tried this to add the line after S DimS pattern:
sed -i -E '/S DimS/I a \\t MAT: "'"0.8 correction"'"'
But that just gives me:
LET: DimX (2660.0)
LET: DimZ (1050.0)
LET: DimS (0.8)
LET: DREHEN (20)
DIM: X DimX
Z DimZ+0.5
S DimS
MAT: "0.8 correction"
REF: X1 FOD-23.24
X2 FOD-23.24
Z1 FOD-24.86
Z2 FOD-24.86
POS: CENT_FUNC 1
QSU 10 QSD 10
TURN_AROUND
And:
LET: DimX (2660.0)
LET: DimZ (1050.0)
LET: DimS (0.7)
LET: DREHEN (20)
DIM: X DimX
Z DimZ+0.5
S DimS
MAT: "0.8 correction"
STAINLESS
REF: X1 FOD-23.24
X2 FOD-23.24
Z1 FOD-24.86
Z2 FOD-24.86
POS: CENT_FUNC 1
QSU 10 QSD 10
TURN_AROUND
And it obviously won't save the value in the brackets...
Can anyone help me please?
Thank you.
If you're amenable to an awk solution, which is often easier than sed when there's logic to perform beyond simple string manipulation ...
# Capture the correction amount
/LET: DimS/ { correction = $3; gsub(/[()]/, "", correction) }
# Get ready to print
$1 == "DIM:" { f = 1 }
# Abort print!!
$1 == "STAINLESS" { f = 0 }
# Now is the time to print the extra line if the flag is still set
NF == 0 && f { printf " MAT: '%s correction'\n", correction; f = 0 }
# Output the original lines of the file
{ print }
Test first example:
$ awk -f a.awk file
LET: DimX (2660.0)
LET: DimZ (1050.0)
LET: DimS (0.8)
LET: DREHEN (20)
DIM: X DimX
Z DimZ+0.5
S DimS
MAT: '0.8 correction'
REF: X1 FOD-23.24
X2 FOD-23.24
Z1 FOD-24.86
Z2 FOD-24.86
POS: CENT_FUNC 1
QSU 10 QSD 10
TURN_AROUND
Try it with your second "STAINLESS" example and you'll see that the extra line is not printed.
This might work for you (GNU sed):
sed -E ':a;N;$!ba;/STAINLESS/b;
s/^(.*LET:\sDimS\s*\(([^)]*).*\n(\s*)S DimS[^\n]*\n)/\1\3MAT: '\''\2 correction'\''/' file
Gather up the entire file into memory and test whether it contains the word STAINLESS. If so, bail out; otherwise, use pattern matching to insert the required text.
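The same "check the whole file first, then insert" logic can also be sketched in Python. This is only an illustration: the function name add_correction is made up, and the leading spaces in the shortened sample are an assumption about how the real files are laid out.

```python
import re

def add_correction(text):
    """Return text with a MAT line added after 'S DimS', unless STAINLESS occurs."""
    if "STAINLESS" in text:
        return text  # leave such files untouched
    match = re.search(r"LET:\s*DimS\s*\(([^)]*)\)", text)
    if match is None:
        return text  # no DimS value to reuse
    value = match.group(1)
    out = []
    for line in text.splitlines():
        out.append(line)
        if line.strip() == "S DimS":
            # Reuse the indentation of the 'S DimS' line for the new line.
            indent = line[:len(line) - len(line.lstrip())]
            out.append("%sMAT: '%s correction'" % (indent, value))
    return "\n".join(out) + "\n"

# Shortened sample; the leading spaces are an assumption about the real layout.
plain = "LET: DimS (0.8)\nDIM: X DimX\n     S DimS\nREF: X1 FOD-23.24\n"
stainless = plain.replace("REF:", "     STAINLESS\nREF:")
print(add_correction(plain))
```

Because the STAINLESS test runs on the whole text before anything is inserted, it does not matter whether STAINLESS appears before or after the S DimS line.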

Replacing string by incrementing number

Input File
AAAAAA this is some content.
This is AAAAAA some more content BBBBBB. BBBBBB BBBBBB
This is yet AAAAAA some more BBBBBB BBBBBB BBBBBB content.
I can accomplish this partially with this code:
awk '/AAAAAA/{gsub("AAAAAA", "x"++i)}1' test.txt > test1.txt
awk '{for(x=1;x<=NF;x++)if($x~/BBBBBB/){sub(/BBBBBB/,"y"++i)}}1' test1.txt
Output:
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y4 y5 y6 content.
Is there any way to get this output?
Expected Output:
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y1 y2 y3 content.
another one
$ awk '{sub("AAAAAA","x"(++x)); y=0; while(sub("BBBBBB","y"(++y)));}1' file
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y1 y2 y3 content.
You may use this single awk:
awk '{
   j=0
   for (x=1; x<=NF; x++)
      if ($x ~ /^A{6}/)
         sub(/^A{6}/, "x" (++i), $x)
      else if ($x ~ /^B{6}/)
         sub(/^B{6}/, "y" (++j), $x)
} 1' file
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y1 y2 y3 content.
You just need to reset i=0 after each line has been processed:
awk '/AAAAAA/{gsub("AAAAAA", "x"++i)}1' test.txt > test1.txt
awk '{for(x=1;x<=NF;x++)if($x~/BBBBBB/){sub(/BBBBBB/,"y"++i)}{i=0}}1' test1.txt
Output:
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y1 y2 y3 content.
Here is an alternate awk that is easily extended to add as many tags as you wish:
awk 'BEGIN{ rep["AAAAAA"]="x"; cnts["AAAAAA"]=1; reset["AAAAAA"]=0
            rep["BBBBBB"]="y"; cnts["BBBBBB"]=1; reset["BBBBBB"]=1
            # and so on...
}
{
    for (e in rep) {
        cnts[e]=(reset[e]) ? reset[e] : cnts[e]
        # increment only after a successful substitution, so the final
        # failed attempt on each line does not skip a number
        while ( sub(e, rep[e] cnts[e]) )
            cnts[e]++
    }
} 1' file
Prints:
x1 this is some content.
This is x2 some more content y1. y2 y3
This is yet x3 some more y1 y2 y3 content.
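For comparison, here is a minimal Python sketch of the same idea: one counter that runs across the whole input for AAAAAA and one that restarts on every line for BBBBBB. The function name number_tags is hypothetical, and this is not a translation of any particular awk answer above.

```python
import itertools
import re

def number_tags(lines):
    """Replace AAAAAA with x1, x2, ... (global) and BBBBBB with y1, y2, ... (per line)."""
    a_counter = itertools.count(1)      # keeps counting across all lines
    out = []
    for line in lines:
        b_counter = itertools.count(1)  # restarts on every line
        line = re.sub("AAAAAA", lambda m: "x%d" % next(a_counter), line)
        line = re.sub("BBBBBB", lambda m: "y%d" % next(b_counter), line)
        out.append(line)
    return out

lines = [
    "AAAAAA this is some content.",
    "This is AAAAAA some more content BBBBBB. BBBBBB BBBBBB",
    "This is yet AAAAAA some more BBBBBB BBBBBB BBBBBB content.",
]
result = number_tags(lines)
print("\n".join(result))
```

The replacement callback is only invoked for actual matches, so a counter never advances on a line that does not contain its tag.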

Reading csv with several subgroups

I have a csv file that contains "pivot-like" data that I would like to store in a pandas DataFrame. The original data file uses different amounts of leading whitespace to differentiate between the levels in the pivot data, like so:
Text that I do not want to include,,
,Text that I do not want to include,Text that I do not want to include
,header A,header B
Total,100,100
A,,2.15
a1,,2.15
B,,0.22
b1,,0.22
" slightly longer name"...,,0.22
b3,,0.22
C,71.08,91.01
c1,57.34,73.31
c2,5.34,6.76
c3,1.33,1.67
x1,0.26,0.33
x2,0.26,0.34
x3,0.48,0.58
x4,0.33,0.42
c4,3.52,4.33
x5,0.27,0.35
x6,0.21,0.27
x7,0.49,0.56
x8,0.44,0.47
x9,0.15,0.19
x10,,0.11
x11,0.18,0.23
x12,0.18,0.23
x13,0.67,0.85
x14,0.24,0.2
x15,0.68,0.87
c5,0.48,0.76
x16,,0.15
x17,0.3,0.38
x18,0.18,0.23
d2,6.75,8.68
d3,0.81,1.06
x19,0.3,0.38
x20,0.51,0.68
Others,24.23,0
N/A,,
"Text that I do not want to include(""at all"") ",,
(It looks awful, but you should be able to paste it into e.g. Notepad to see it a bit more clearly.)
Basically, there are only two columns, a and b, but the rows are indented using 0, 3, 6, 9, etc. whitespaces to differentiate between the levels. So for instance,
zero level, the main group, A has 0 spaces,
first level a1 has 3 spaces,
second level a2 has 6 spaces,
third level a3 has 9 spaces and
the fourth and final level has 12 spaces, with the corresponding values for columns a and b respectively.
I would now like to be able to read and group this data on these levels in order to create a new summarizing DataFrame, with columns corresponding to these different levels, looking like:
Level 4   Diff(a,b)   Level 0   Level 1   Level 2   Level 3
x7        525         C         c1        c2        c3
x5        -0.03       A         a1        a22       NaN
x4        -0.04       A         a1        a22       NaN
x8        -0.08       C         c1        c2        c3
…
Any clue on how to do this?
Thanks
The easiest approach is to split this into separate steps:
read the file
parse the lines
generate the 'tree'
construct the DataFrame
Parse the lines
def parse_file(file):
    import ast
    import re
    pat = re.compile(r'^( *)(\w+),([\d.]+),([\d.]+)$')
    for line in file:
        r = pat.match(line)
        if r:
            spaces, label, a, b = r.groups()
            diff = ast.literal_eval(a) - ast.literal_eval(b)
            yield len(spaces)//3, label, diff
This reads each line and, using a regular expression, yields the level, the label and the diff. Lines that do not match (for example rows with a missing value) are skipped. I use ast to convert the strings to int or float.
Generate the tree
def parse_lines(lines):
    previous_label = list(range(5))
    for level, label, diff in lines:
        previous_label[level] = label
        if level == 4:
            yield tuple(previous_label), diff
This initializes a list of length 5, and then for each row overwrites the entry for the level that row is on. Each level-4 row yields the current list as its path.
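The two steps above can be tried out without pandas. The sketch below is only an illustration: the indentation in SAMPLE is an assumption (the whitespace in the question's data did not survive), the name leaf_paths is made up, and unlike parse_lines it blanks deeper labels whenever a shallower row starts a new branch.

```python
import re

# A few rows from the question; the 3-space indentation steps are assumed.
SAMPLE = """\
C,71.08,91.01
   c1,57.34,73.31
      c2,5.34,6.76
         c3,1.33,1.67
            x1,0.26,0.33
            x2,0.26,0.34
         c4,3.52,4.33
            x5,0.27,0.35
"""

ROW = re.compile(r'^( *)(\w+),([\d.]+),([\d.]+)$')

def leaf_paths(text, indent=3, depth=5):
    """Yield (path, diff) for every row on the deepest level."""
    path = [None] * depth
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip headers and rows with a missing value
        level = len(m.group(1)) // indent
        path[level] = m.group(2)
        # Forget deeper labels left over from an earlier branch.
        for i in range(level + 1, depth):
            path[i] = None
        if level == depth - 1:
            diff = round(float(m.group(3)) - float(m.group(4)), 2)
            yield tuple(path), diff

rows = list(leaf_paths(SAMPLE))
for path, diff in rows:
    print(path, diff)
```

Each yielded tuple can be used directly as a MultiIndex entry, which is what the next step does.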
Construct the DataFrame
from io import StringIO
import pandas as pd

# file_content holds the text of the csv file as a single string
with StringIO(file_content) as file:
    lines = parse_file(file)
    index, data = zip(*parse_lines(lines))
idx = pd.MultiIndex.from_tuples(index, names=[f'level_{i}' for i in range(len(index[0]))])
df = pd.DataFrame(data={'Diff(a,b)': list(data)}, index=idx)
This opens the file, constructs the index and generates the DataFrame with the different levels in the index. If you don't want the levels in the index, you can add a .reset_index() or construct the DataFrame slightly differently.
df
level_0  level_1  level_2  level_3  level_4  Diff(a,b)
A        a1       a2       a3       x1       -0.07
A        a1       a2       a3       x2       -0.08000000000000002
A        a1       a22      a3       x3       -0.04999999999999999
A        a1       a22      a3       x4       -0.04000000000000001
A        a1       a22      a3       x5       -0.03
A        a1       a22      a3       x6       -0.06999999999999998
C        c1       c2       c3       x7       525.0
C        c1       c2       c3       x8       -0.08000000000000002
alternative for missing levels
def parse_lines(lines):
    labels = [None] * 5
    previous_level = None
    for level, label, diff in lines:
        labels[level] = label
        if level == 4:
            if previous_level < 3:
                # keep the labels up to the previous level and blank the rest;
                # padding with 4 - previous_level keeps the list at length 5
                labels = labels[:previous_level + 1] + [None] * (4 - previous_level)
                labels[level] = label
            yield tuple(labels), diff
        previous_level = level
In the first version, the items under a22 don't seem to have a level_3, so that label is carried over from the previous branch. If this is unwanted, you can use the variation above, which leaves the missing levels empty.
df
level_0  level_1  level_2  level_3  level_4  Diff(a,b)
C        c1       c2       c3       x1       -0.07
C        c1       c2       c3       x2       -0.08000000000000002
C        c1       c2       c3       x3       -0.09999999999999998
C        c1       c2       c3       x4       -0.08999999999999997
C        c1       c2       c4       x5       -0.07999999999999996
C        c1       c2       c4       x6       -0.060000000000000026
C        c1       c2       c4       x7       -0.07000000000000006
C        c1       c2       c4       x8       -0.02999999999999997
C        c1       c2       c4       x9       -0.04000000000000001
C        c1       c2       c4       x11      -0.05000000000000002
C        c1       c2       c4       x12      -0.05000000000000002
C        c1       c2       c4       x13      -0.17999999999999994
C        c1       c2       c4       x14      0.03999999999999998
C        c1       c2       c4       x15      -0.18999999999999995
C        c1       c2       c5       x17      -0.08000000000000002
C        c1       c2       c5       x18      -0.05000000000000002
C        c1       d2       d3       x19      -0.08000000000000002
C        c1       d2       d3       x20      -0.17000000000000004

extract data from txt file using regexp in matlab

I need to extract some info from a txt file which looks like this using regexp:
##FileName = disp_20120803_064635_1
#Plane1
x1 = 10008 x2= -9991 x3= -9991
y1 = 137 y2 = 10 y3 = 158
z1= 844 z2= 779 z3 = 700
#Plane2
x1 = -16 x2= 193 x3= 320
y1 = -4472 y2 = -556 y3 = 5143
z1= 3215 z2= -1309 z3 = 370
#Plane3
x1 = -8145 x2= 5387 x3= 8070
y1 = -4808 y2 = 7643 y3 = 3051
z1= 4212 z2= 4120 z3 = -4176
##end
I want to extract the file name by the following code:
buffer = fileread('test.txt') ;
pattern = '##FileName\s=\s+(\w+?\d+)';
tokens = regexp(buffer, pattern, 'tokens');
fileName = [tokens{:}]
But the result is just disp_20120803, which is not the complete file name.
Any help?
Use this pattern instead:
pattern = '##FileName\s=\s+(\w+)';
Edit:
I don't know MATLAB syntax, but you can use the following regex to capture the variable names and their values:
pattern = '([xyz][123])\s*=\s*(-?\d+)'
The variable name is in group 1 and its value in group 2.
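To see why the lazy \w+? stops early, the patterns can be tried in Python's re module, which treats \w, \d and lazy quantifiers the same way for this input. This is a sketch in Python, not MATLAB, and the text below is a shortened copy of the sample file.

```python
import re

text = """##FileName = disp_20120803_064635_1
#Plane1
x1 = 10008 x2= -9991 x3= -9991
y1 = 137 y2 = 10 y3 = 158
z1= 844 z2= 779 z3 = 700
##end
"""

# Lazy \w+? takes as little as possible, so \d+ ends at the first underscore after the digits.
lazy = re.search(r'##FileName\s=\s+(\w+?\d+)', text).group(1)
# Greedy \w+ consumes letters, digits and underscores up to the end of the name.
greedy = re.search(r'##FileName\s=\s+(\w+)', text).group(1)
# Each match yields (name, value) from group 1 and group 2.
pairs = re.findall(r'([xyz][123])\s*=\s*(-?\d+)', text)
values = {name: int(value) for name, value in pairs}

print(lazy)
print(greedy)
print(values)
```

Since \w already covers digits and underscores, the extra \d+ in the original pattern is not needed once the quantifier is greedy.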

Using RegEx to insert data by the nth term

I would like to add a term after the nth term, with that nth term substituted into it. For instance, how would I change the following in Notepad++:
x1 y1 z1 a1 b1 c1
x2 y2 z2 a2 b2 c2
x3 y3 z3 a3 b3 c3
to
x1 y1 z1 ["z1"] a1 b1 c1
x2 y2 z2 ["z2"] a2 b2 c2
x3 y3 z3 ["z3"] a3 b3 c3
where x, y, z, a, b and c are strings separated by spaces.
another example:
apples bananas pears grapes oranges lemons
to
apples bananas pears grapes fruit:(grapes) oranges lemons
and so on.
Suppose you have one pattern that matches your elements, for example [1-9], and another pattern that matches the separator between your elements, for example [\,\.]. Then you can write the following:
([1-9][\,\.]){n}([\,\.])([1-9][\,\.])*
This will match the first n elements and the separator after it.
You can then use the matched pattern to substitute the content of the second match with your values.
Is that something you're looking for?
In the Find field put:
(\w\s\w\s\w\s)
In the Replace field put:
\1["z"]
See this question for more info.
NotePad++ replace problem
I guess this would also make sense for the find...
(x[ ]y[ ]z[ ])
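For terms longer than one character, use \w+ instead of \w, and anchor the match with ^ so that only the start of each line is matched. If the term to repeat is the 3rd item, it would be
find:
^(\w+\s\w+\s)(\w+)
replace:
\1\2 fruit(\2)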
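The capture-and-reuse idea can be checked outside Notepad++ with a small Python sketch. The helper insert_after_nth and its template argument are made up for illustration, and the sketch assumes terms are separated by single spaces.

```python
import re

def insert_after_nth(line, n, template):
    """Insert template (filled with the nth term) right after the nth term."""
    # Group 1: the first n-1 terms with their trailing spaces; group 2: the nth term.
    pattern = r'^((?:\S+ ){%d})(\S+)' % (n - 1)
    return re.sub(
        pattern,
        lambda m: m.group(1) + m.group(2) + ' ' + template.format(m.group(2)),
        line,
    )

first = insert_after_nth("x1 y1 z1 a1 b1 c1", 3, '["{}"]')
second = insert_after_nth("apples bananas pears grapes oranges lemons", 4, 'fruit:({})')
print(first)
print(second)
```

The ^ anchor is what keeps the replacement from being applied a second time further along the same line.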