My source file is a text file from which I want to select specific lines based on a few tab-separated values, and write those lines to a destination text file. Every line holds values from the same small set (say a or b), spread over about 10 columns (one value per column).
I have looked at solutions on SO and elsewhere online. I have defined search queries, yet they give me error messages. I am just starting out with Python. Thank you for your help.
My code:
searchquery1 = \ta\ta\ta # 3 a's spaced by tab
with open(oldest) as f1: # source input file
    with open('newtest.txt', 'a') as f2: # output file
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if line.endswith(searchquery1):
                f2.writelines(line + "\n")
A short example:
source file:
A1 a a b b
A2 a a a a
A3 b b a a
...
with searchquery1 = 'a a a' (values are spaced by a tab)
destination file:
A2 a a a a (copy line 2 from source)
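One possible way to do this selection is to split each line on tabs and compare the trailing fields, instead of matching raw text with endswith (which also has to contend with the newline at the end of every line). The sketch below is only an illustration: the function name select_lines and its parameter wanted are made up for this example and do not come from the original post.

```python
def select_lines(lines, wanted):
    """Return the lines whose last len(wanted) tab-separated fields equal wanted.

    `lines` is any iterable of strings (for example an open file) and
    `wanted` is a list such as ["a", "a", "a"]. Both names are hypothetical.
    """
    selected = []
    for line in lines:
        # Drop the line ending first, otherwise the last field would be "a\n".
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= len(wanted) and fields[-len(wanted):] == wanted:
            # Make sure every kept line ends with exactly one newline.
            selected.append(line if line.endswith("\n") else line + "\n")
    return selected
```

With files, this could be used along the lines of: open the source as f1 and 'newtest.txt' as f2, then call f2.writelines(select_lines(f1, ["a", "a", "a"])).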
Related
Input: I have more than 100 samples. Each sample consists of two files, one with the extension *.column and one with the extension *.datatypes.
The *.column file holds the column names, and the *.datatypes file holds the datatype descriptions.
What I need is one output file per sample.
Each output file should have the column names along with their datatypes.
Currently I am getting the data of all 100 files merged and saved into one file.
Eg: file_1:
column names datatypes
id int
name string
Eg: file_2:
column names datatypes
id int
name string
I got the column names and datatypes of all files merged into one single file.
What I need is a separate merged file for each sample.
for name in os.listdir("C:\Python27"):
    if name.endswith(".column"):
        for file in name:
            file = os.path.join(name)
            joined = file+ ".joined"
        with open(joined,"w") as fout:
            filenames = glob.glob('*.column')
            for filename in filenames:
                with open(filename) as f1:
                    file_names = glob.glob('*.datatypes')
                    for filename in file_names:
                        with open(filename) as f2:
                            for line1,line2 in zip(f1,f2):
                                x = ("{0} {1} \n".format(line1.rstrip(),line2.rstrip()))
                                y = x.strip()
                                fout.write(y.strip() + ',\n')
Please assist me.
Hopefully the code below works. It assumes that each *.column file has a corresponding *.datatypes file with the same base name; if not, the code will fail with a file-not-found error.
import os

# Note: the bare file names only resolve if the script is run from this folder.
for colname in os.listdir(r"C:\Python27"):
    if colname.endswith(".column"):
        print('Processing:' + colname)
        file = os.path.splitext(colname)[0]  # base name shared by the pair
        joined = file+ ".joined"
        with open(joined,"w") as fout:
            with open(colname) as f1:
                datname = file+'.datatypes'  # matching datatypes file
                with open(datname) as f2:
                    # pair line N of the column file with line N of the datatypes file
                    for line1,line2 in zip(f1,f2):
                        x = ("{0} {1}".format(line1.rstrip(),line2.rstrip()))
                        y = x.strip()
                        fout.write(y.strip() + ',\n')
        print('Finished writing to :'+joined)
I test-ran this with a few sample input files, as below.
file1.column
date_sev
pos
file1.datatypes
timestamp
date
file2.column
id
name
file2.datatypes
int
string
file3.column
id
name
file3.datatypes
int
string
When I run the script, I get the output below in the console:
Processing:file1.column
Finished writing to :file1.joined
Processing:file2.column
Finished writing to :file2.joined
Processing:file3.column
Finished writing to :file3.joined
And the output files I get are:
file1.joined
date_sev timestamp,
pos date,
file2.joined
id int,
name string,
file3.joined
id int,
name string,
Also, if you want cleaner output in the files, I would make the changes below.
From
x = ("{0} {1}".format(line1.rstrip(),line2.rstrip()))
To
x = ("{0},{1}".format(line1.rstrip(),line2.rstrip()))
From
fout.write(y.strip() + ',\n')
To
fout.write(y.strip() + '\n')
In the original solution posted at the beginning, I left the formatting as it was in your initial version.
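The pairing idea from the answer above (same base name, different extension) can also be wrapped in small functions. This is a minimal sketch under assumptions rather than a replacement for the answer: the names join_sample and join_all, the separator parameter, and the choice of pathlib are all introduced here for illustration.

```python
from pathlib import Path


def join_sample(column_path, separator=","):
    """Write <base>.joined next to <base>.column and return its path.

    Hypothetical helper: assumes <base>.datatypes exists in the same folder.
    """
    column_path = Path(column_path)
    datatypes_path = column_path.with_suffix(".datatypes")
    joined_path = column_path.with_suffix(".joined")
    # The output file is opened last, so a missing *.datatypes file raises
    # FileNotFoundError before an empty *.joined file is created.
    with column_path.open() as cols, datatypes_path.open() as types, \
            joined_path.open("w") as out:
        for name, dtype in zip(cols, types):
            out.write(f"{name.strip()}{separator}{dtype.strip()}\n")
    return joined_path


def join_all(folder):
    """Join every *.column file in folder with its *.datatypes partner."""
    return [join_sample(path) for path in sorted(Path(folder).glob("*.column"))]
```

Because the paths are built from the folder argument, this version does not depend on the current working directory.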
I have problems combining multiple for loops. I will give an example with two loops that I would like to combine; if I know how to do it with two, I will also be able to do it with more.
If anyone knows how to write this as an lapply call, that would also be nice.
require(ncdf4)
#### download files from this link to directory: (I just downloaded manually,two files are sufficient to answer the example)
#### ftp://rfdata:forceDATA#ftp.iiasa.ac.at/WFDEI/LWdown_daily_WFDEI/
setwd("C:/place_where_I_have_downloaded_my_files_from_link/")
temp = list.files(pattern="*.nc") #list imported netcdf files
list2env(
lapply(setNames(temp, make.names(gsub("*.nc$", "", temp))),
nc_open), envir = .GlobalEnv) #import all parameters lists to global environment
#### first loop - # select parameter out of netcdf files and combine into a List of 2
list_temp<-list() #create empty list before loop
for (t in temp[1:2]){
list_temp[t]<-list(data.frame(LWdown=ncvar_get(nc_open(t),"LWdown")[428,176,],xcoor=176,ycoor=428))
}
LW_bind<-do.call(rbind,list_temp)
rownames(LW_bind)<-NULL
#### second loop # select parameter out of one netcdf file per x-coordinate and combine into a List of 2
list_temp<-list() #create empty list before loop
for (x in 176:177){
list_temp[as.character(x)]<-list(data.frame(LWdown=ncvar_get(nc_open(temp[1]),"LWdown")[428,x,],xcoor=x,ycoor=428)) #index by x, not by the leftover t
}
LW_bind<-do.call(rbind,list_temp)
rownames(LW_bind)<-NULL
How I tried to combine them, which did not work:
#### combined loops
list_temp<-list()
for (t in temp[1:2]){for (x in 176:177){
#ncin<-list()
ncin<-nc_open(t)
list_temp[x][t]<-list(data.frame(LWdown=ncvar_get(ncin,"LWdown")[428,x,],x=x,y=428))
}}
LWdown_1to2<-do.call(rbind,list_temp)
rownames(LWdown_1to2)<-NULL
I already solved my problem; see below. But I am still curious how one could combine the two for loops as described above, so I will leave the question open and unanswered.
Here is my solution:
require(arrayhelpers);require(stringr);require(plyr);require(ncdf4)
# store all files from ftp://rfdata:forceDATA#ftp.iiasa.ac.at/WFDEI/ in the following folder:
setwd("C:/folder")
temp = list.files(pattern="*.nc") #list all the file names
param<-gsub("_\\S+","",temp,perl=T) #extract parameter from file name
xcoord=seq(176,180,by=1) #The X-coordinates you are interested in
ycoord=seq(428,433,by=1) #The Y-coordinates you are interested in
list_var<-list() # make an empty list
for (t in 1:length(temp)){
temp_year<-str_sub(temp[],-9,-6) #take the characters from last place minus 9 to last place minus 6 to extract the year from the file name
temp_month<-str_sub(temp[],-5,-4) #take the characters from last place minus 5 to last place minus 4 to extract the month from the file name
temp_netcdf<-nc_open(temp[t])
temp_day<-rep(seq(1:length(ncvar_get(temp_netcdf,"day"))),length(xcoord)*length(ycoord)) # make a vector of day numbers the same length as the number of values
dim.order<-sapply(temp_netcdf[["var"]][[param[t]]][["dim"]],function(x) x$name) # gives the name of each level of the array
start <- c(lon = 428, lat = 176, tstep = 1) # indicates the starting value of each variable
count <- c(lon = 6, lat = 5, tstep = length(ncvar_get(temp_netcdf,"day"))) # indicates how many values of each variable have to be present starting from start
tempstore<-ncvar_get(temp_netcdf, param[t], start = start[dim.order], count = count[dim.order]) # array with parameter values
df_temp<-array2df (tempstore, levels = list(lon=ycoord, lat = xcoord, day = NA), label.x = "value") # convert array to dataframe
Add_date<-sort(as.Date(paste(temp_year[t],"-",temp_month[t],"-",temp_day,sep=""),"%Y-%m-%d"),decreasing=FALSE) # make vector with the dates
list_var[t]<-list(data.frame(Add_date,df_temp,parameter=param[t])) #add dates to data frame and store in a list of all output files
### nc_close(temp_netcdf) #close nc file to prevent data loss and errors
}
All_NetCDF_var_in1df<-do.call(rbind,list_var)
with open("file.txt") as f:
    lines = f.readlines()
    lines = [l for l in lines if "util.exe" in l]
    with open("Lines.txt", "w") as f1:
        new=f1.writelines(lines)
This is my example code, which writes the lines containing the text "util.exe". But I need to read the line directly below each "util.exe" line.
For example, I have a text file with these lines:
1/16 joc_...
cd D:\cmd\find\joc
util.exe line
pcm wav line I need to read
Here I need to read the line "pcm wav line I need to read", which is directly below the util.exe line.
Can you please guide me on this?
Here is a suggested solution. Just modify your list comprehension by enumerating the lines and then accessing the line after the util.exe. So here is modified code:
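with open("file.txt") as f:
    lines = f.readlines()
    # take the line after each match; the bounds check skips a match on the last line
    lines = [lines[index + 1] for index, l in enumerate(lines)
             if "util.exe" in l and index + 1 < len(lines)]
    with open("Lines.txt", "w") as f1:
        f1.writelines(lines)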
So when I run the modified script here are the results:
Contents of the example "file.txt":
this is a test line
and here is another one
util.exe is here
this one should be recorded
but this one should not
now it appears at the end util.exe
this should also be saved. finally,
another case
the test phrase util.exe is in the middle
where this should be saved
but not this one
This is the output file "Lines.txt":
this one should be recorded
this should also be saved. finally,
where this should be saved
destLines = []
with open("file.txt") as fp:
    lines = fp.readlines()
    index = 0
    for line in lines:
        index += 1  # index now points at the line after the current one
        if 'exe' in line and len(lines) > index:  # > (not >=) so a match on the last line is skipped
            destLines.append(lines[index])
with open("lines.txt", "w") as fp:
    fp.writelines(destLines)
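The same "take the line after each match" idea can be written without loading the whole file or doing index arithmetic, by remembering whether the previous line matched. This is a sketch only; the generator name lines_after and its marker parameter are assumptions made for this example.

```python
def lines_after(lines, marker):
    """Yield each line that directly follows a line containing marker.

    `lines` can be any iterable of strings, including an open file object.
    A match on the very last line simply yields nothing.
    """
    previous_matched = False
    for line in lines:
        if previous_matched:
            yield line
        # Remember the result for the next iteration.
        previous_matched = marker in line
```

With the files from the question, this could be used as: open "file.txt" as f and "Lines.txt" as f1, then call f1.writelines(lines_after(f, "util.exe")).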
I am trying to split one file with two articles in it into two separate files with one article in each, for subsequent analysis of the articles. Each article in the initial file has an ID that I want to use to separate the files with, using RE.
Below is the initial input file, with ID number:
166068619 #### "Epilepsy: let's end our ignorance of this neglected condition
Helen Stephens is a young woman with epilepsy [...]."
106899978 #### "Great British Payoff shows that BBC governance is broken
If it was a television series, they'd probably call it [...]."
However, when I run my code, I do get two separate files as an output but they are empty.
This is my code:
def file_split(path_to_file):
    """Function splits bigger file into N smaller ones, based on a certain RE
    match, that is used to break the bigger file into smaller ones"""
    def pattern_extract(path_to_file):
        """Function identifies the number of RE occurrences in a file,
        No. can be used in further analysis as range No."""
        import re
        x = []
        with open(path_to_file) as f:
            for line in f:
                match = re.search(r'^\d+?\t####\t', line)
                if match:
                    a = match.group()
                    x.append(a)
        return len(x)
    y = pattern_extract(path_to_file)
    m = y + 1
    files = [open('filename%i.txt' %i, 'w') for i in range(1,m)]
    with open(path_to_file) as f:
        for line in f:
            match = re.search(r'^\d+?\t####\t', line)
            if match:
                a = match.group()
                #files = [open('filename%i.txt' %i, 'w') for i in range(1, m)]
                files[i-1].write(a)
    for f in files:
        f.close()
    return files
Output result is as follows:
file_split(path)
Out[19]:
[<open file 'filename1.txt', mode 'w' at 0x7fe121b130c0>,
<open file 'filename2.txt', mode 'w' at 0x7fe121b131e0>]
I am new to Python and I am not quite sure where the problem lies. I checked some other answers that addressed the multiple file outputs but cannot figure out the solution. Help would be very much appreciated.
There are two problems with your code:
you write only the line matching the ID (actually, just the match itself), not the rest
you are always writing to the last file, as you use i, the loop variable "left over" from the list comprehension
To fix it, you could change the lower portion of your code to this:
y = pattern_extract(path_to_file)
files = [open('filename%i.txt' %i, 'w') for i in range(y)]
n = -1
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            n += 1
        files[n].write(line)
But you do not have to read the file two times at all, just to count the matches: Just open another file when the line matches an ID line and directly write to that last file in the list, then close all the files.
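import re

open_files = []
with open(path_to_file) as f:
    for line in f:
        # an ID line starts a new article, so open the next output file
        if re.search(r'^\d+\s+####\s+', line):
            open_files.append(open('filename%d.txt' % len(open_files), 'w'))
        # every line goes to the most recently opened file
        open_files[-1].write(line)
for f in open_files:
    f.close()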
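If keeping many file handles open at once is a concern, the grouping and the writing can be separated: first collect the text of each article under its ID, then write one file per article. The following is a rough sketch under assumptions, not the method from the answer above; the names ID_LINE, split_articles and write_articles, and the output name pattern, are all hypothetical.

```python
import re

# An article starts with digits, whitespace, "####", whitespace.
ID_LINE = re.compile(r'^(\d+)\s+####\s+')


def split_articles(lines):
    """Group lines into articles keyed by the ID that starts each article."""
    articles = {}
    current_id = None
    for line in lines:
        match = ID_LINE.match(line)
        if match:
            current_id = match.group(1)
            articles[current_id] = []  # a repeated ID would overwrite the earlier article
        if current_id is not None:  # skip anything before the first ID line
            articles[current_id].append(line)
    return {article_id: "".join(parts) for article_id, parts in articles.items()}


def write_articles(articles, pattern="article_%s.txt"):
    """Write each article to its own file, named after its ID (assumed pattern)."""
    for article_id, text in articles.items():
        with open(pattern % article_id, "w") as out:
            out.write(text)
```

Naming the output files after the article ID, rather than a running counter, is a design choice made here so that each file can be traced back to its article during later analysis.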
I have 251 CSV files in a folder. They are named "returned UDTs 1-12-13.csv", "returned UDTs 1-13-13.csv", and so on. The dates are not consecutive, however: holidays and weekends may be missing, so the next file may be "returned UDTs 1-17-13.csv". Each file has one column of data. I need to extract each column and append it to a single column in one new output CSV file, and I want to write a Python script to do so. In a dummy folder with 3 dummy CSV files (csv1.csv, csv2.csv, and csv3.csv) I created the following script, which works:
import csv, os, sys
out_csv = r"C:\OutCSV\csvtest.csv"
path = r"C:\CSV_test"
fout=open(out_csv,"a")
# first file:
for line in open(path + "\csv1.csv"):
    fout.write(line)
# now the rest:
for num in range(2,4):
    f = open(path + "\csv"+str(num)+".csv")
    f.next() # skip the header
    for line in f:
        fout.write(line)
    f.close() # dont know if needed
fout.close()
fout.close()
The issue is the date in the file name and how to deal with it. Any help would be appreciated.
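One way to cope with the non-consecutive dates is not to generate the file names at all, but to list whatever files exist with glob and sort them by the date parsed from each name. The sketch below assumes the dates are in month-day-two-digit-year order and that every file starts with a header line, as the dummy script implies; the function names file_date and combine, and the default pattern, are made up for this example.

```python
import glob
import os
import re
from datetime import datetime

# Assumed name format: "... M-D-YY.csv", e.g. "returned UDTs 1-12-13.csv".
DATE_IN_NAME = re.compile(r'(\d{1,2}-\d{1,2}-\d{2})\.csv$')


def file_date(path):
    """Parse the date at the end of a file name (assumed month-day-year)."""
    match = DATE_IN_NAME.search(os.path.basename(path))
    if match is None:
        raise ValueError("no date found in %r" % path)
    return datetime.strptime(match.group(1), "%m-%d-%y").date()


def combine(folder, out_csv, pattern="returned UDTs *.csv"):
    """Append every matching file to out_csv in date order, keeping one header."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)), key=file_date)
    with open(out_csv, "w") as fout:
        for number, path in enumerate(paths):
            with open(path) as fin:
                if number > 0:
                    next(fin, None)  # skip the header of every file but the first
                for line in fin:
                    # guard against a last line without a trailing newline
                    fout.write(line if line.endswith("\n") else line + "\n")
    return paths
```

A call such as combine(r"C:\CSV_test", r"C:\OutCSV\csvtest.csv") would then pick up only the files that actually exist, so missing weekend or holiday dates need no special handling. Sorting by the parsed date rather than by file name matters because "1-2-13" would otherwise sort after "1-17-13" as text.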