Python: Writing input data as it is to output file - python-2.7

So, I'm a total noob when it comes to programming, especially Python. I'm trying to merge two files (based on my requirements) and store the result in an output file. The merge itself is easy enough; the issue I'm facing is with the data format. You see, the data in the output file must be in the same format as in the input files.
These are my input files:
File1:
ID,CLASS,BOARD_ID
3620694,Smart,233049933699
3620724,Smart,233200309044
3620819,Smart,233200971094
3620831,Smart,233201075614
3620865,Smart,233201516374
3620870,Smart,233201553354
3620906,Smart,233201863244
3620929,Smart,233201972254
3620963,Smart,233202244014
3621008,Smart,233202600234
3621107,Smart,233203534474
3621158,Smart,233204019454
3621179,Smart,233204093854
3621252,Smart,233204801364
3621254,Smart,233204815324
3621266,Smart,233205000144
3621275,Smart,233205104774
3621288,Smart,233205182584
File2:
CDUS,CBSCS,CTRS,CTRS_ID
0,0,0.000000000375,233010056572
0,0,4.0746,233200309044
0,0,0.6182,233200971094
0,0,15.4834,233201075614
0,0,2.2459,233201516374
0,0,0.148,233201553354
0,0,0.0468,233201863244
0,0,0.5045,233201972254
0,0,0.0000000000485,233049933699
This is my Python script:
import pandas

#pandas.set_option('display.precision', 13)
csv1 = pandas.read_csv('OUTPUT_1707000867_BundleCrossCellData_45432477_0_0.txt', dtype={'BOARD_ID': str})
csv2 = pandas.read_csv('I2.txt', dtype={'CTRS_ID': str}).rename(columns={'CTRS_ID': 'BOARD_ID'})
merged = pandas.merge(csv1, csv2, left_on=['BOARD_ID'], right_on=['BOARD_ID'], how='left', suffixes=('#x', '#y'), sort=True)
merged.to_csv("Op2.txt", index=False, date_format='%Y/%m/%d %H:%M:%S.000', float_format='%.13f')
Output received upon execution:
Op2.txt
ID,CLASS,BOARD_ID,CDUS,CBSCS,CTRS
3620694,Smart,233049933699,0.0000000000000,0.0000000000000,0.0000000000485
3620724,Smart,233200309044,0.0000000000000,0.0000000000000,4.0746000000000
3620819,Smart,233200971094,0.0000000000000,0.0000000000000,0.6182000000000
3620831,Smart,233201075614,0.0000000000000,0.0000000000000,15.4834000000000
3620865,Smart,233201516374,0.0000000000000,0.0000000000000,2.2459000000000
3620870,Smart,233201553354,0.0000000000000,0.0000000000000,0.1480000000000
3620906,Smart,233201863244,0.0000000000000,0.0000000000000,0.0468000000000
3620929,Smart,233201972254,0.0000000000000,0.0000000000000,0.5045000000000
3620963,Smart,233202244014,,,
3621008,Smart,233202600234,,,
3621107,Smart,233203534474,,,
3621158,Smart,233204019454,,,
3621179,Smart,233204093854,,,
3621252,Smart,233204801364,,,
3621254,Smart,233204815324,,,
3621266,Smart,233205000144,,,
3621275,Smart,233205104774,,,
3621288,Smart,233205182584,,,
Expected Output:
ID,CLASS,BOARD_ID,CDUS,CBSCS,CTRS
3620694,Smart,233049933699,0,0,0.0000000000485
3620724,Smart,233200309044,0,0,4.0746000000000
3620819,Smart,233200971094,0,0,0.6182000000000
3620831,Smart,233201075614,0,0,15.4834000000000
3620865,Smart,233201516374,0,0,2.2459000000000
3620870,Smart,233201553354,0,0,0.1480000000000
3620906,Smart,233201863244,0,0,0.0468000000000
3620929,Smart,233201972254,0,0,0.5045000000000
3620963,Smart,233202244014,,,
3621008,Smart,233202600234,,,
3621107,Smart,233203534474,,,
3621158,Smart,233204019454,,,
3621179,Smart,233204093854,,,
3621252,Smart,233204801364,,,
3621254,Smart,233204815324,,,
3621266,Smart,233205000144,,,
3621275,Smart,233205104774,,,
3621288,Smart,233205182584,,,
As you can see in Op2.txt, the values of CDUS and CBSCS are not in the same format as in File2. For obvious reasons, both have to be in the same format.
Is there a way to resolve this issue?
Also, since the input files are generated at runtime, I cannot statically cast a particular column to a particular type.
I also referenced the following links, but didn't find an appropriate solution. Any help would be very much appreciated.
How do I suppress scientific notation in Python?
Force python to not output a float in standard form / scientific notation / exponential form
Casting float to string without scientific notation
supress scientific notation when writing python floats to files
Suppressing scientific notation in pandas?
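For reference, one way to sidestep the float reformatting entirely is to read every column as plain text, so pandas never re-parses the numbers and the merge writes back exactly the characters it read. This is a sketch only, assuming BOARD_ID stays the join key:
import pandas

# dtype=str keeps every value as the literal text from the input files
csv1 = pandas.read_csv('OUTPUT_1707000867_BundleCrossCellData_45432477_0_0.txt', dtype=str)
csv2 = pandas.read_csv('I2.txt', dtype=str).rename(columns={'CTRS_ID': 'BOARD_ID'})

merged = pandas.merge(csv1, csv2, on='BOARD_ID', how='left', sort=True)
merged.to_csv('Op2.txt', index=False)  # unmatched rows simply come out as empty fields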

Related

How to parse space separated data in pyspark?

I have the kind of data shown below, which is space separated. I want to split it on spaces, but I run into trouble when a particular element contains a space itself.
2018-02-13 17:21:52.809 "EWQRR.OOM" "ERW WERT11"
I have used the code below:
import shlex
rdd = line.map(lambda x: shlex.split(x))
but it's returning a deserialized result like \x00\x00\x00.
Use re.findall() with the regex ".+?"|\S+, or "[^"]*"|\S+ (suggested by @ctwheels), which performs better.
rdd = line.map(lambda x: re.findall(r'".+?"|\S+', x))
Input:
"1234" "ewer" "IIR RT" "OOO"
Getting Output:
1234, ewer, IIR, RT, OOO
Desired Output:
1234, ewer, IIR RT, OOO
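For what it's worth, the regex can be sanity-checked in plain Python first, stripping the quotes to get the clean tokens (a quick sketch, no Spark required):
import re

line = '"1234" "ewer" "IIR RT" "OOO"'
# keep quoted chunks together, then strip the surrounding quotes
tokens = [t.strip('"') for t in re.findall(r'"[^"]*"|\S+', line)]
print(tokens)  # ['1234', 'ewer', 'IIR RT', 'OOO']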
By default, all the text lines are encoded as unicode if you are using sparkContext's textFile api, as the api documentation of textFile says:
Read a text file from HDFS, a local file system (available on all
nodes), or any Hadoop-supported file system URI, and return it as an
RDD of Strings.
If use_unicode is False, the strings will be kept as str (encoding
as utf-8), which is faster and smaller than unicode. (Added in
Spark 1.2)
And this option is true by default:
@ignore_unicode_prefix
def textFile(self, name, minPartitions=None, use_unicode=True):
And that's the reason you are getting unicode characters like \x00\x00\x00 in the results.
You should include the use_unicode option while reading the data file into the rdd:
import shlex
rdd = sc.textFile("path to data file", use_unicode=False).map(lambda x: shlex.split(x))
Your results should be:
['2018-02-13', '17:21:52.809', 'EWQRR.OOM', 'ERW WERT11']
You can even include utf-8 encoding in the map function:
import shlex
rdd = sc.textFile("path to the file").map(lambda x: shlex.split(x.encode('utf-8')))
I hope the answer is helpful.
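As a quick local check (plain byte strings, so the unicode issue never appears), shlex alone already splits the sample line as desired:
import shlex

line = '2018-02-13 17:21:52.809 "EWQRR.OOM" "ERW WERT11"'
# shlex keeps quoted fields together and drops the quotes
print(shlex.split(line))  # ['2018-02-13', '17:21:52.809', 'EWQRR.OOM', 'ERW WERT11']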

BadDataError when editing a .dbf file using dbf package

I have recently produced several thousand shapefile outputs and accompanying .dbf files from an atmospheric model (HYSPLIT) on a unix system. The converter txt2dbf is used to convert shapefile attribute tables (text file) to a .dbf.
Unfortunately, something has gone wrong (probably a separator/field-length error), because there are two problems with the output .dbf files, as follows:
1. Some fields of the dbf contain data that should not be there; this data has "spilled over" from neighbouring fields.
2. An additional field has been added that should not be there (it actually comes from a section of the first record of the text file, "1000 201").
This is an example of the first record in the output dbf (retrieved using the dbview unix package):
Trajnum : 1001 2
Yyyymmdd : 0111231 2
Time : 300
Level : 0.
1000 201:
Here's what I expected:
Trajnum : 1000
Yyyymmdd : 20111231
Time : 2300
Level : 0.
Separately, I'm looking at how to prevent this from happening again, but ideally I'd like to be able to repair the existing .dbf files. Unfortunately the text files are removed for each model run, so "fixing" the .dbf files is the only option.
My approaches to the above problems are:
1. Extract the information from the fields that do exist to a new variable using dbf.add_fields and dbf.write (python package dbf), then delete the old incorrect fields using dbf.delete_fields.
2. Delete the unwanted additional field.
This is what I've tried:
with dbf.Table(db) as db:
    db.add_fields("TRAJNUMc C(4)")  # create new fields
    db.add_fields("YYYYMMDDc C(8)")
    db.add_fields("TIMEc C(4)")
    for record in db:  # extract data from fields
        dbf.write(TRAJNUMc=int(str(record.Trajnum)[:4]))
        dbf.write(YYYYMMDDc=int(str(record.Trajnum)[-1:] + str(record.Yyyymmdd)[:7]))
        dbf.write(TIMEc=record.Yyyymmdd[-1:] + record.Time[:])
    db.delete_fields('Trajnum')  # delete the incorrect fields
    db.delete_fields('Yyyymmdd')
    db.delete_fields('Time')
    db.delete_fields('1000 201')  # delete the unwanted field
    db.pack()
But this produces the following error:
dbf.ver_2.BadDataError: record data is not the correct length (should be 31, not 30)
Given the apparent problems with the txt2dbf conversion, I'm not surprised to find an error in the record data length. However, does this mean that the file is completely corrupted and that I can't extract the information I need (frustrating, because I can see that it exists)?
EDIT:
Rather than attempting to edit the 'bad' .dbf files, it seems a better approach to 1) extract the required data from the bad files to a text file and then 2) write it to a new dbf. (See Ethan Furman's comments/answer below.)
EDIT:
An example of a faulty .dbf file that I need to fix/recover data from can be found here:
https://www.dropbox.com/s/9y92f7m88a8g5y4/p0001120110.dbf?dl=0
An example .txt file from which the faulty dbf files were created can be found here:
https://www.dropbox.com/s/d0f2c0zehsyy8ab/attTEST.txt?dl=0
To fix the data and recreate the original text file, this snippet should help:
import dbf

table = dbf.Table('/path/to/scramble/table.dbf')
with table:
    fixed_data = []
    for record in table:
        # convert to str/bytes while skipping the delete flag
        data = record._data[1:].tostring()
        trajnum = data[:4]
        ymd = data[4:12]
        time = data[12:16]
        level = data[16:].strip()
        fixed_data.append((trajnum, ymd, time, level))

new_file = open('repaired_data.txt', 'w')
for line in fixed_data:
    new_file.write(','.join(line) + '\n')
new_file.close()
Assuming all your data files look like your sample (the big IF being that the data has no embedded commas), this rough code should help translate your text files into dbfs:
raw_data = open('some_text_file.txt').read().split('\n')
final_table = dbf.Table(
    'dest_table.dbf',
    'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)',
)
with final_table:
    for line in raw_data:
        fields = line.split(',')
        final_table.append(tuple(fields))
# table has been populated and closed
Of course, you could get fancier and use actual date and number fields if you want to:
# the dbf field spec becomes
'trajnum N; yyyymmdd D; time C(4); level N'

# the appending loop becomes
for line in raw_data:
    trajnum, ymd, time, level = line.split(',')
    trajnum = int(trajnum)
    ymd = dbf.Date(ymd[:4], ymd[4:6], ymd[6:])
    level = int(level)
    final_table.append((trajnum, ymd, time, level))
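As a quick way to confirm the rebuilt table reads back cleanly, something like this should work (the path and field names follow the snippets above):
import dbf

# re-open the rebuilt table and spot-check the first record
with dbf.Table('dest_table.dbf') as table:
    for record in table:
        print('%s %s %s %s' % (record.trajnum, record.yyyymmdd, record.time, record.level))
        break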

[C++]: Writing numerical data to an ODS file, ODS does not treat it as numbers

When I export my calculations via ofstream in C++ to an ODS (Apache OpenOffice) file, the numbers are shown correctly there, but I cannot perform any calculations in that ODS file.
For example, when I try to add, say, 0.9191 in A1 and 0.5757 in A2, =SUM(A1:A2) returns zero.
I tried to solve this through cell formatting, but nothing has worked so far. Any suggestions? Thank you.
Edit: The portion of code that does the exporting job.
string datafolder = "c:/Users/cousinvinnie/Desktop/Code Vault/ArmaTut3/" + Jvalue;
string graph_path = datafolder + "/Graphavgs.ods";
ofstream graphavgs;
graphavgs.open(graph_path);
for (int ctr = 0; ctr < cycledata; ctr++) {
    cyclepoints = (howmanyDC + 1) * (ctr + 1);
    graphavgs << (ctr + 1) << " ";
    calcguy = sum(wholedata.row(cyclepoints)) / nextgenpop;
    secondbiggiesavg(ctr) = -log(calcguy);
    graphavgs << secondbiggiesavg(ctr) << " ";
    calcguy = sum(thirdbiggest.row(cyclepoints)) / nextgenpop;
    thirdbiggiesavg(ctr) = -log(calcguy);
    graphavgs << thirdbiggiesavg(ctr) << " ";
    calcguy = sum(matrixavgs.row(cyclepoints)) / nextgenpop;
    avgmatrixdata(ctr) = -log(calcguy);
    graphavgs << avgmatrixdata(ctr) << " " << endl;
}
graphavgs.close();
This code creates the Graphavgs.ods file. In that file I have
1 0.111753 0.182331 0.358724
2 0.147015 0.259202 0.48334
3 0.195855 0.362397 0.648719
4 0.25348 0.476696 0.839261
5 0.314722 0.618828 1.0633
6 0.420704 0.857286 1.37501
7 0.536699 1.1179 1.69503
8 0.76933 1.56382 2.13464
9 0.90525 1.89921 2.42443
10 1.15678 2.41533 2.82584
Now these numbers are not treated as numbers. When I try to apply a function to them, like =SUM(A1:A2), the return is zero.
When I do =LN(A1), the return is #VALUE!
SOLVED: Find & Replace all dots with commas.
You are confusing the CSV file format, the ODS file format, and the way both are displayed in OpenOffice or LibreOffice.
What you build is a CSV file, that is, a pure text file that only contains a textual representation of values. By default, your C++ program writes floating-point values with a dot as the decimal separator.
An ODS file is in fact a ZIP file containing meta-data (name of creator, date of creation, date of last print, etc.), the actual data, and formatting information. That is why an ODS file is opened directly by LibreOffice or OpenOffice.
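You can verify this with a genuine ODS saved from LibreOffice or OpenOffice, for instance with a couple of lines of Python (the filename is a placeholder):
import zipfile

# a real .ods is a ZIP archive; its members include content.xml, styles.xml, meta.xml, ...
print(zipfile.ZipFile('real_spreadsheet.ods').namelist())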
When you open a CSV file in LibreOffice or OpenOffice, you actually import it. That means the program makes some assumptions about the data separator, the decimal separator and, if appropriate, the date format, in order to translate the textual values into numeric (or date) ones. Those assumptions are based on your system locale, and the formatting is normally the default one. Depending on the version you use, a dialog box with import options is displayed either always or only if you explicitly import the file (menu File/Import). That dialog box allows you to specify the separators and the decimal separator that the CSV file contains.
Once you have correctly loaded a CSV file, it is recommended to save it in ODS format to make sure that you will not have that import problem again.

reading in certain rows from a csv file using python

Say I have a csv file that looks like the following, with the first column containing frequencies and the second column containing the power level (dBm):
Frequency | dBm
1 -11.43
2.3 -51.32
2.5 -12.11
2.8 -11.21
3.1 -73.22
3.2 -21.13
I only want to read in the data sets of this file that have a dBm value between -13 and -10. Therefore, in this example I only want the data sets (1, -11.43), (2.5, -12.11), (2.8, -11.21) to be read into my program variables x1 and y1. Could someone give me some help with how I could do this?
You can just use the csv library and check whether each row meets your criteria.
Something like this should work on your file:
import csv

with open('file.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ')
    reader.next()  # skip the two header lines
    reader.next()
    for row in reader:
        a = [float(i) for i in row if i != '']
        if a[1] >= -13 and a[1] <= -10:
            print (a[0], a[1])
Edit: If you're working with table data I would suggest trying out Pandas, it's really helpful in these situations.
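For example, a rough pandas version of the same filter might look like this (assuming, as the csv answer above does, a whitespace-separated file with two header lines; the filename is a placeholder):
import pandas as pd

# read the two columns, skipping the two header lines
df = pd.read_csv('file.csv', delim_whitespace=True, skiprows=2, names=['frequency', 'dbm'])
subset = df[df['dbm'].between(-13, -10)]  # inclusive on both ends
x1 = subset['frequency'].tolist()
y1 = subset['dbm'].tolist()
print(x1, y1)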

I need to write a Python stub to print names of image files and whether they are blurry or not

New user here, and just started Python a few days ago!
My question is:
I need to write a Python stub that prints the names of image files and whether or not they are blurry. They are considered blurry if the value is > 0.3. There are five pieces of information in each line; the second one (index 1) is the number in question. In total there are 1868 lines.
Here is a sample of the data:
['out04-32-44-03.tif,0.295554,536047.6051,5281850.4252,19.8091\n',
'out04-32-44-15.tif,0.337232,536047.2831,5281850.5974,19.8256\n',
'out04-32-44-27.tif,0.2984,536046.9611,5281850.7696,19.8420\n',
'out04-32-44-39.tif,0.311989,536046.6392,5281850.9418,19.8584\n',
'out04-32-44-51.tif,0.346901,536046.3172,5281851.1140,19.8749\n',
'out04-32-44-63.tif,0.358519,536045.9953,5281851.2862,19.8913\n',
'out04-32-44-75.tif,0.342837,536045.6733,5281851.4584,19.9078\n',
'out04-32-44-87.tif,0.32909,536045.3513,5281851.6306,19.9242\n',
'out04-32-44-99.tif,0.294824,536045.0294,5281851.8028,19.9406\n']
Any suggestions greatly appreciated :-)
Based on the code you have written in the comments; this is for Python 2.7:
fin = open(r'E:\KGG 375 - GIS Advanced\Assignment 2 - Python\TIR043109gpxpos.txt')
for line in fin:  # no need to read these into a list first
    info = line.split(',')
    blurry = float(info[1])
    print info[0],
    if blurry > 0.3:
        print ' is blurry'
    else:
        print ' is not blurry'
Explanation:
There is no need to read the lines of a file into a list first; you can iterate over the file object directly and it will be read line by line.
To be able to compare against a float, you need to convert the second element (info[1]) into a float.
print info[0], prints the filename, and the trailing comma prevents a line break, so " is blurry" is printed on the same line. Note: this is Python 2.7 syntax, so it will not work with Python 3.x.
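For reference, an equivalent sketch in Python 3 syntax (same logic; the path is shortened to a placeholder):
# Python 3 version: print is a function, so the trailing-comma trick is not needed
with open('TIR043109gpxpos.txt') as fin:
    for line in fin:
        info = line.split(',')
        label = 'is blurry' if float(info[1]) > 0.3 else 'is not blurry'
        print(info[0], label)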