I have two variables, var1 and var2, holding two different line numbers. My task is:
1. Open an input file, search for lines beginning with var1, and insert a comment ABOVE each matching line.
2. Open the same input file, search for lines beginning with var2, and insert a comment BELOW each matching line.
I was able to achieve 1 but not 2.
var1 = 2 #line number
var2 = 5 #line number
comment1 = "inserted text above var1"
comment2 = "inserted text below var2"
some for loop:
found1 = False
found2 = False
for line in fileinput.input(source.txt, inplace=True):
    if not found and line.startswith(var1):
        print comment1
        found1 = True
    print line,
    if not found and line.startswith(var2):
        print line
        found1 = True
    print comment2,
Input File:
1 abc
2 def
3 ghi
4 jkl
5 mno
6 pqr
7 stu
Output should be:
1 abc
inserted text above var1
2 def
3 ghi
4 jkl
5 mno
inserted text below var2
6 pqr
7 stu
You seem to have a lot of variables that you don't need. The way I read your code, based on the indenting, you'll always print comment2, and sometimes you'll print the line more than once.
You do not need two for loops. You never use found1 or found2, and the variable found is never defined; these are variables you probably don't need. You either need to define source.txt as a variable, or pass it as a string (put quotes around it, which is what I think you meant to do). var1 and var2 should probably be strings as well, since str.startswith() expects a string to be passed in.
Simplify it a little and I think you won't have a problem. Assuming that you are handling reading the file correctly, something like this should do the trick.
var1= "2" #line number
var2 = "5" #line number
comment1 = "inserted text above var1"
comment2 = "inserted text below var2"
for line in fileinput.input("source.txt", inplace=True):
if line.startswith(var1):
print comment1
print line
if line.starswith(var2):
print comment2
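If you are on Python 3, where print is a function, a minimal equivalent sketch (same source.txt and markers as above) would be:

import fileinput

var1 = "2"  # line number marker
var2 = "5"  # line number marker
comment1 = "inserted text above var1"
comment2 = "inserted text below var2"

for line in fileinput.input("source.txt", inplace=True):
    if line.startswith(var1):
        print(comment1)
    print(line, end="")  # line already ends with a newline
    if line.startswith(var2):
        print(comment2)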
Related
I am trying to remove a specific pattern of numbers from a string using the regexr() function in Stata. I want to remove any run of digits that is not bounded by a character (other than whitespace) or a letter. For example, if the string contained t370 or 6-test, I would want those to remain; it is only the stand-alone numbers that I want removed.
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
I would like to end up with:
ID string
1 7-test
2 67-tty
3 j3782 3hty
I've tried different regex statements to find when numbers are wrapped in a word boundary: regexr(string, "\b[0-9]+\b", ""); in addition to manually adding the whitespace, " [0-9]+", which only replaces when the pattern occurs in the middle of the string, not at the start. If it's easier to do this without regex expressions that's fine; I was just trying to become more familiar with them.
Following up on the loop suggestion from the comments, you could do something like the following:
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end
gen N_words = wordcount(string)     // # words in each string
qui sum N_words
global max_words = r(max)           // max # words in all strings
split string, gen(part) parse(" ")  // split string at space (p.s. space is the default)
gen string2 = ""
forval i = 1/$max_words {
    * add in parts that contain at least one letter
    replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
    replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}
drop part* N_words
where the result would be
. list
+----------------------------------------+
| id string string2 |
|----------------------------------------|
1. | 1 9884 7-test 58 - 489 7-test |
2. | 2 67-tty 783 444 67-tty |
3. | 3 j3782 3hty j3782 3hty |
+----------------------------------------+
Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
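If it helps to see the same token filter outside Stata, here is a rough Python sketch of the logic (keep only the whitespace-separated words that contain at least one letter); the sample strings are taken from the question:

import re

rows = [
    "9884 7-test 58 - 489",
    "67-tty 783 444",
    "j3782 3hty",
]

for s in rows:
    # keep only tokens containing at least one letter
    kept = [w for w in s.split() if re.search(r"[a-zA-Z]", w)]
    print(" ".join(kept))
# prints: 7-test / 67-tty / j3782 3hty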
I need to apply a for loop to a file containing the records output by a command, to convert one of the columns into a list. Please advise; thanks in advance.
The data is as below:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
asset 8 100 1009663 1
asset 7 200 523533 1
asset 9 319710 319710 0
asset 5 870935 870935 0
This is my code :
lag_list = []
with open(fname) as f:
    f.readline()
    lines = f.readlines()[1: ]
    length = len(lines)
    print(length)
    for line in lines:
        print "Hello"
        print line
        print "hello 2"
        data=line.split(' ')
        lag_list.append(data[4])
        data=line.split("\t")
        lag_list.append(data[4])
print lag_list
return
But it returns this error:
lag_list.append(data[4])
IndexError: list index out of range
Your data has
- fewer than 4 tabs in a line,
- or fewer than 4 spaces in a line,
- or a \n after the last line of your source data.
When reading those lines and splitting them, you do not have 5 elements in the resulting list - hence index error when accessing data[4].
Splitting the same line by spaces and then by tabs does not make much sense to me - I hope it does for your data and application.
Check your split list before indexing into it:
lag_list = []
with open(fname) as f:
    f.readline()
    lines = f.readlines()[1: ]
    length = len(lines)
    print(length)
    for line in lines:
        print "Hello"
        print line
        print "hello 2"
        data = line.split(' ')
        if len(data) >= 5:  # check if safe to index into
            lag_list.append(data[4])
        else:
            print("Not enough elements - need 5 at least:", data)
        data = line.split("\t")
        if len(data) >= 5:  # check if safe to index into
            lag_list.append(data[4])
        else:
            print("Not enough elements - need 5 at least:", data)
print lag_list
return
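Two further notes. First, f.readline() followed by f.readlines()[1:] skips two lines (the header and the first data row), which is probably not intended. Second, if the columns are separated by a mix of spaces and tabs, line.split() with no argument splits on any run of whitespace and avoids the empty strings that line.split(' ') produces. A minimal sketch along those lines, assuming the file layout shown in the question:

lag_list = []
with open(fname) as f:
    next(f)                  # skip the header line only
    for line in f:
        data = line.split()  # split on any run of whitespace
        if len(data) >= 5:
            lag_list.append(data[4])
print(lag_list)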
Basically I have a very long text containing multiple spaces, special characters, etc. in one cell in an Excel file, and I need to extract only specific words from it, each one to a separate cell in another column.
What I'm looking for:
Symbols that are always 9 characters in length, and always contain at least one digit (and up to nine).
So for example, in A1 I have:
euhe: djj33 dkdakofja. kaowdk ---------- jffjbrjjjj j jrjj 08/01/2222 999ABC123
fjfjfj 321XXX888 .... ........ 123456789AA
And in the end I want to have:
999ABC123 in B1
and
321XXX888 in B2.
Right now I'm doing this by using the Text to Columns feature and then just looking for the specific words manually, but sometimes the volume is so big that it takes too much time; it would be great to automate this.
Can anyone help with this? Thank you!
EDIT:
More examples:
INPUT: '10/01/2016 1,060X 8.999%!!! 1.33 0.666 928888XE0'
OUTPUT: '928888XE0'
INPUT: 'ABCDEBATX ..... ,,00,001% 20///^^ addcA7 7777a 123456789 djaoij8888888 0.000001 12#'
OUTPUT: '123456789'
INPUT: 'FAR687465 B22222222 __ djj^66 20/20/20/20 1:'
OUTPUT: 'FAR687465' in B1 'B22222222' in B2
INPUT: 'fil476 .00 20/.. BUT AAAAAAAAA k98776 000.0001'
OUTPUT: 'blank'
To clarify: the 9-character strings can appear anywhere; there is no rule about what comes before or after them. They can be next to each other, or at the beginning and end of this wall of text. The text is random, taken out of some system, and can contain dates etc. The symbols are always 9 characters long, and they are not the only 9-character words in the text. I call them symbols, but they consist only of numbers and letters: they can be all numbers, but never all letters. The A1 cell can contain multiple spaces/tabs between words/symbols.
Also if possible to do this not only for A1, but the whole column A until it finds the first blank cell.
Try this code
Sub Test()
    Dim r As Range
    Dim i As Long
    Dim m As Long
    With CreateObject("VBScript.RegExp")
        .Global = True
        .Pattern = "\b[a-zA-Z\d]{9}\b"
        For Each r In Range("A1", Range("A" & Rows.Count).End(xlUp))
            If .Test(r.Value) Then
                For i = 0 To .Execute(r.Value).Count - 1
                    If CBool(.Execute(r.Value)(i) Like "*[0-9]*") Then
                        m = IIf(Cells(1, 2).Value = "", 1, Cells(Rows.Count, 2).End(xlUp).Row + 1)
                        Cells(m, 2).Value = .Execute(r.Value)(i)
                    End If
                Next i
            End If
        Next r
    End With
End Sub
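The heart of this is the \b[a-zA-Z\d]{9}\b pattern plus the at-least-one-digit check, which is easy to test outside VBA. Purely for illustration, the same matching rule as a short Python sketch (the sample text is hypothetical):

import re

text = "euhe: djj33 999ABC123 fjfjfj 321XXX888 .... 123456789AA"

# 9-character alphanumeric words...
candidates = re.findall(r"\b[A-Za-z0-9]{9}\b", text)
# ...that contain at least one digit
symbols = [w for w in candidates if re.search(r"\d", w)]
print(symbols)  # ['999ABC123', '321XXX888']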
This bit of code is almost there... it just needs to check the strings... but Excel crashes on the Str line of code.
Sub Test()
    Dim Outputs, i As Integer, LastRow As Long, Prueba, Prueba2
    Outputs = Split(Range("A1"), " ")
    For i = 0 To UBound(Outputs)
        If Len(Outputs(i)) = 9 Then
            Prueba = 0
            Prueba2 = 0
            On Error Resume Next
            Prueba = Val(Outputs(i))
            Prueba2 = Str(Outputs(i))
            On Error GoTo 0
            If Prueba <> 0 And Prueba2 <> 0 Then
                LastRow = Range("B10000").End(xlUp).Row + 1
                Cells(LastRow, 2) = Outputs(i)
            End If
        End If
    Next i
End Sub
If someone could help to set up the string check, that would do it, I guess.
I have a file rev.txt like this:
header1,header2
1, some text here
2, some more text here
3, text and more text here
I also have a vocabulary document with all unique words from rev.txt, like so (but sorted):
a
word
list
text
here
some
more
and
I want to generate a term frequency table for each line in rev.txt, listing the occurrence of each vocabulary word in each line of rev.txt, like so:
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 2 1 0 1 1
They could be comma separated as well.
This is similar to a question here. However, instead of search through the entire document, I want to do this line by line, using the complete vocabulary I already have.
Re: Jean-François Fabre
Actually, I am performing these steps in MATLAB. However, bash (I believe) would be faster for this preprocessing, as I have direct disk access to the files.
Normally I would use Python, but limiting myself to bash, this hacky one-liner solution works for the given test case.
perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt | sed '1d' | awk -F' ' 'FILENAME=="wordlist.txt" {wc[$1]=0; wl[wllen++]=$1; next}; {for(i=1; i<=NF; i++){wc[$i]++}; for(i=0; i<wllen; i++){print wc[wl[i]]" "; wc[wl[i]]=0; if(i+1==wllen){print "\n"} }}' ORS="" wordlist.txt -
Explanation/My thinking...
In the first part, perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt was used to pull out everything after the first comma (and remove the leading whitespace) from "rev.txt".
In the next part, sed '1d' was used to remove the first (i.e. header) line.
In the next part, we specified awk -F' ' ... ORS="" wordlist.txt - to use whitespace as the field delimiter, the empty string as the output record separator (note: we will print separators as we go), and to read input from wordlist.txt (i.e. the "vocabulary document with all unique words from rev.txt") and stdin.
In the awk command, if the FILENAME is equal to "wordlist.txt", then (1) initialize array wc where the keys are the vocab words and the count is 0, and (2) initialize a list wl where the word order in the same as wordlist.txt.
FILENAME=="wordlist.txt" {
wc[$1]=0;
wl[wllen++]=$1;
next
};
After initialization, for each word in a line of stdin (i.e. the tidy rev.txt), increment the count of the word in wc.
{ for (i=1; i<=NF; i++) {
wc[$i]++
};
After the word counts have been added for a line, for each word in the list wl, print the count of that word followed by a space and reset its count in wc back to 0. If the word is the last in the list, print a newline.
for (i=0; i<wllen; i++) {
print wc[wl[i]]" ";
wc[wl[i]]=0;
if(i+1==wllen){
print "\n"
}
}
}
Overall, this should produce the specified output.
Here's one in awk. It reads in the vocabulary file voc.txt (it's a piece of cake to produce it automatically in awk), copies the word list for each row of text and counts the word frequencies:
$ cat program.awk
BEGIN {
    PROCINFO["sorted_in"]="#ind_str_asc"  # order for copying vocabulary array w
}
NR==FNR {                  # store the voc.txt to w
    w[$1]=0
    next
}
FNR>1 {                    # process text files to matrix
    for(i in w)            # copy voc array
        a[i]=0
    for(i=2; i<=NF; i++)   # count freqs
        a[$i]++
    for(i in a)            # output matrix row
        printf "%s%s", a[i], OFS
    print ""
}
Run it:
$ awk -f program.awk voc.txt rev.txt
0 0 1 0 0 1 1 0
0 0 1 0 1 1 1 0
0 1 1 0 1 0 2 0
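Since the question mentions that Python would be the usual tool for this, here is the equivalent per-line term frequency table as a short Python sketch, assuming rev.txt as above and the vocabulary in voc.txt (called wordlist.txt in the first answer), one word per line:

# read the vocabulary, preserving file order
with open("voc.txt") as f:
    vocab = [w.strip() for w in f if w.strip()]

with open("rev.txt") as f:
    next(f)                                    # skip the header line
    for line in f:
        words = line.split(",", 1)[1].split()  # drop the leading "N," field
        counts = {w: 0 for w in vocab}
        for w in words:
            if w in counts:
                counts[w] += 1
        print(" ".join(str(counts[w]) for w in vocab))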
I am trying to read a file that looks as follows:
Data Sampling Rate: 256 Hz
*************************
Channels in EDF Files:
**********************
Channel 1: FP1-F7
Channel 2: F7-T7
Channel 3: T7-P7
Channel 4: P7-O1
File Name: chb01_02.edf
File Start Time: 12:42:57
File End Time: 13:42:57
Number of Seizures in File: 0
File Name: chb01_03.edf
File Start Time: 13:43:04
File End Time: 14:43:04
Number of Seizures in File: 1
Seizure Start Time: 2996 seconds
Seizure End Time: 3036 seconds
So far I have this code:
fid1 = fopen('chb01-summary.txt')
data = struct('id',{},'stime',{},'etime',{},'seizenum',{},'sseize',{},'eseize',{});
if fid1 == -1
    error('File cannot be opened ')
end
tline = fgetl(fid1);
while ischar(tline)
    i = 1;
    disp(tline);
end
I want to use regexp to find the expressions and so I did:
line1 = '(.*\d{2} (\.edf)'
data{1} = regexp(tline, line1);
tline = fgetl(fid1);
time = '^Time: .*\d{2}: \d{2} :\d{2}';
data{2} = regexp(tline, time);
tline = fgetl(fid1);
seizure = '^File: .*\d';
data{4} = regexp(tline, seizure);
if data{4} > 0
    stime = '^Time: .*\d{5}';
    tline = fgetl(fid1);
    data{5} = regexp(tline, seizure);
    tline = fgetl(fid1);
    data{6} = regexp(tline, seizure);
end
I tried using a loop to find the line at which the file name starts:
for (firstline<1) || (firstline>1)
    firstline = strfind(tline, 'File Name')
    tline = fgetl(fid1);
end
and now I'm stumped.
Suppose I am at the line where the information is: how do I store the information with regexp? I got an empty array for data after running the code once...
Thanks in advance.
I find it easiest to read the lines into a cell array first using textscan:
%// Read lines as strings
fid = fopen('input.txt', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
and then apply regexp on it to do the rest of the manipulations:
%// Parse field names and values
C = regexp(C{:}, '^\s*([^:]+)\s*:\s*(.+)\s*', 'tokens');
C = [C{:}]; %// Flatten the cell array
C = reshape([C{:}], 2, []); %// Reshape into name-value pairs
Now you have a cell array C of field names and their corresponding (string) values, and all you have to do is plug it into struct in the correct syntax (using a comma-separated list in this case). Note that the field names have spaces in them, so this needs to be taken care of before they can be used (e.g. replace them with underscores):
C(1, :) = strrep(C(1, :), ' ', '_'); %// Replace spaces with underscores
data = struct(C{:});
Here's what I get for your input file:
data =
Data_Sampling_Rate: '256 Hz'
Channel_1: 'FP1-F7'
Channel_2: 'F7-T7'
Channel_3: 'T7-P7'
Channel_4: 'P7-O1'
File_Name: 'chb01_03.edf'
File_Start_Time: '13:43:04'
File_End_Time: '14:43:04'
Number_of_Seizures_in_File: '1'
Seizure_Start_Time: '2996 seconds'
Seizure_End_Time: '3036 seconds'
Of course, it is possible to prettify it even more by converting all relevant numbers to numerical values, grouping the 'channel' fields together and such, but I'll leave this to you. Good luck!
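For comparison only, the same name/value parsing is just as compact outside MATLAB; a rough Python sketch of the approach (split each "name: value" line on the first colon and normalise the names) might look like:

import re

data = {}
with open("input.txt") as f:
    for line in f:
        m = re.match(r"\s*([^:]+?)\s*:\s*(.+?)\s*$", line)
        if m:
            name, value = m.groups()
            # repeated names (e.g. the two File Name blocks) keep the last value,
            # matching the output shown above
            data[name.replace(" ", "_")] = value

print(data.get("File_Start_Time"))  # e.g. '13:43:04'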