Python 2.7: read a txt file, split and group columns counted from the right - python-2.7

Because the .txt file has some flaws, it needs to be split from the right. Below is part of the file. Notice that the first row has only 4 columns while the other rows have 5. I want the data from the 2nd, 3rd, and 4th columns counted from the right:
5123 - SENTRAL REIT - SENTA.KL - [$SENT]
KIPT - 5280 - KIP REAL EST - KIPRA.KL - [$KIPR]
ALIT - 5269 - AL-SALAM REAL - ALSAA.KL - [$ALSA]
KLCC - 5235SS - KLCC PROP - KLCCA.KL - [$KLCC]
IGBgggREIT - 5227 - IGB RT - IGREA.KL - [$IGRE]
SUNEIT - 5176 - SUNWAY RT - SUNWA.KL - [$SUNW]
ALA78QAR - 5116 - AL-AQAR HEA RT - ALQAA.KL - [$ALQA]
I want the file saved as .csv so it can be read by pandas later.
The desired output is:
Code,Company,RIC
5123,SENTRAL REIT,SENTA.KL
5280,KIP REAL EST,KIPRA.KL
5269,AL-SALAM REAL,ALSAA.KL
5235SS,KLCC PROP,KLCCA.KL
5227,IGB RT,IGREA.KL
5176,SUNWAY RT,SUNWA.KL
5116,AL-AQAR HEA RT,ALQAA.KL
My code is below:
>>> with open('abc.txt', 'r') as reader:
...     [x for x in reader.read().strip().split(' - ') if x]
It returns a flat list, and I am unable to group the fields into the right columns because of the flaw in the data (some rows have an unequal number of columns when counted from the left).
Please advise how to get the desired output.

This should do the trick :)
import pandas as pd

with open('abc.txt', 'r') as reader:
    # Take the 2nd, 3rd and 4th fields counted from the right, which works
    # regardless of how many fields a row has on the left
    data = [line.split(' - ')[-4:-1] for line in reader.readlines()]

df = pd.DataFrame(columns=['Code', 'Company', 'RIC'], data=data)
df.to_csv('abc.csv', sep=',', index=False)
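Since the file is meant to be read by pandas later, a quick sanity check (a minimal sketch, reading back the abc.csv produced above) could be:

import pandas as pd

df = pd.read_csv('abc.csv')
print(df.head())  # should show the Code, Company and RIC columns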

Related

How to append a specific row from an existing csv file to a new one with Python 2

I have two csv files, test1.csv and test2.csv, that contain rows of values (altitude, time).
test1.csv is quite a bit larger than test2.csv.
I want to compare the altitudes at matching times.
I have found this piece of code that runs on Python 2:
import csv

with open('test1.csv', 'rb') as master:
    master_indices = dict((r[0], i) for i, r in enumerate(csv.reader(master)))

with open('test2.csv', 'rb') as hosts:
    with open('results.csv', 'wb') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        writer.writerow(next(reader, []) + ['result'])
        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                message = 'Same time is found (row {})'.format(index)
            else:
                message = 'No same time is found'
            writer.writerow(row + [message])
It works fine: it writes the index of the matching row from test1.csv.
The results csv contains the time and altitude from test2.csv, plus a message that shows whether or not there is a match on the time value.
Since I'm quite new to Python, I'm trying to find a way to make results.csv also contain the altitude column from test1.csv.
I tried to replicate the above code for the test1.csv file in order to add that column, appending the following to the existing script:
with open('test1.csv', 'rb') as master:
    with open('results.csv', 'wb') as results:
        writer = csv.writer(results)
        reader2 = csv.reader(master)
        writer.writerow(next(reader2, []) + ['altitude'])
        for row in reader2:
            writer.writerow(row)
But I got a csv file without the previous result column and a new but empty altitude column.
So eventually results.csv should contain the following columns:
time,altitude(from test2.csv),altitude(from test1.csv),result
How can this be achieved?
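One possible way to get there (a sketch only, untested, and assuming the first column of each file is the time and the second is the altitude): keep the whole matching row from test1.csv instead of just its index, then write its altitude next to the result message in a single pass:

import csv

# Map each time value in test1.csv to its full (time, altitude) row
with open('test1.csv', 'rb') as master:
    master_rows = dict((r[0], r) for r in csv.reader(master))

with open('test2.csv', 'rb') as hosts:
    with open('results.csv', 'wb') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        # the 'altitude_test1' column name is made up for this sketch
        writer.writerow(next(reader, []) + ['altitude_test1', 'result'])
        for row in reader:
            match = master_rows.get(row[0])
            if match is not None:
                writer.writerow(row + [match[1], 'Same time is found'])
            else:
                writer.writerow(row + ['', 'No same time is found'])

Because everything is written in one pass, the result column and the extra altitude column end up in the same file instead of overwriting each other.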

Split string, extract and add to another column regex BIGQUERY

I have a table with an Equipment column containing strings. I want to split each string, take a part of it, and add that part to a new column (SerialNumber_Asset). The part I want to extract always has the same pattern: A + 7 digits. Example:
   Equipment                               SerialNumber_Asset
1  AXION 920 - A2302888 - BG-ADM-82 -NK    A2302888
2  Case IH Puma T4B 220 - BG-AEH-87 - NK   null
3  ARION 650 - A7702047 - BG-ADZ-74 - MU   A7702047
4  ARION 650 - A7702039 - BG-ADZ-72 - NK   A7702039
My code:
select x, y, z,
    regexp_extract(Equipment, r'([\A][\d]{7})') as SerialNumber_Asset
FROM `aa.bb.cc`
The message I got:
Cannot parse regular expression: invalid escape sequence: \A
Any suggestions on what could be wrong? Thanks
Inside a character class, \A is an invalid escape sequence, which is exactly what the error message says; you want the literal letter A, so just use A instead of [\A]. Check the example below:
select regexp_extract('AXION 920 - A2302888 - BG-ADM-82 -NK', r'(A[\d]{7})') as SerialNumber_Asset
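Applied to the original query (same table and column names as in the question; only the regex changes), that would look something like:

select x, y, z,
    regexp_extract(Equipment, r'(A[\d]{7})') as SerialNumber_Asset
FROM `aa.bb.cc`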

Adding constant values at the begining of a dataframe in pyspark

I am trying to read a CSV file from an HDFS location, and 3 columns need to be added at the beginning: batchid, load timestamp, and a delete indicator. I am using Spark 2.3.2 and Python 2.7.5. Sample values for the 3 columns to be added are given below.
batchid - YYYYMMdd (int)
Load timestamp - current timestamp (timestamp)
delete indicator - blank (string)
Your question is a little bit vague, but you can do something along these lines. First, create your timestamp using plain Python:
import time
import datetime
timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
Then, assuming you use the DataFrame API, you plug that into a column:
import pyspark.sql.functions as psf

df = (df
      .withColumn('time',
                  psf.unix_timestamp(
                      psf.lit(timestamp), 'yyyy-MM-dd HH:mm:ss'
                  ).cast("timestamp"))
      # YYYYMMdd batch id derived from the load timestamp, cast to int as requested
      .withColumn('batchid', psf.date_format('time', 'yyyyMMdd').cast('int'))
      # blank delete indicator
      .withColumn('delete', psf.lit('')))
To reorder your columns:
front = ["time", "batchid", "delete"]
df = df.select(front + [k for k in df.columns if k not in front])
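As a side note (an alternative sketch, not part of the original answer): Spark can also generate the load timestamp itself with current_timestamp(), which avoids formatting a Python string first:

import pyspark.sql.functions as psf

df = (df
      .withColumn('time', psf.current_timestamp())   # timestamp taken at query execution
      .withColumn('batchid', psf.date_format('time', 'yyyyMMdd').cast('int'))
      .withColumn('delete', psf.lit('')))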

Python - creating a dictionary from large text file where the key matches regex pattern

My question: how do I create a dictionary from a list by assigning dictionary keys based on a regex pattern match ('^--L-[0-9]{8}'), and assigning the values using all the lines between each key?
Example excerpt from the raw file:
SQL> --L-93752133
SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;
SQL>
SQL> --L-52852243
SQL>
SQL> SELECT log_mode FROM v$database;
LOG_MODE
------------
NOARCHIVELOG
SQL>
SQL> archive log list
Database log mode No Archive Mode
Automatic archival Disabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 3
Current log sequence 5
SQL>
SQL> --L-42127143
SQL>
SQL> SELECT t.name "TSName", e.encryptionalg "Algorithm", d.file_name "File Name"
2 FROM v$tablespace t
3 , v$encrypted_tablespaces e
4 , dba_data_files d
5 WHERE t.ts# = e.ts#
6 AND t.name = d.tablespace_name;
no rows selected
Some additional detail: The raw file can be large (at least 80K+ lines, but often much larger) and I need to preserve the original spacing so the output is still easy to read. Here's how I'm reading the file in and removing "SQL>" from the beginning of each line:
with open(rawFile, 'r') as inFile:
    content = inFile.read()

rawList = content.splitlines()
for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
Finding the dictionary keys I'm looking for is easy:
pattern = re.compile(r'^--L-[0-9]{8}')
if pattern.search(cleanLine) is not None:
    itemID = pattern.search(cleanLine)
    print(itemID.group(0))
But how do I assign all lines between each key as the value belonging to the most recent key preceding them? I've been playing around with new lists, tuples, and dictionaries but everything I do is returning garbage. The goal is to have the data and keys linked to each other so that I can return them as needed later in my script.
I spent a while searching for a similar question, but in most other cases the source file was already in a dictionary-like format so creating the new dictionary was a less complicated problem. Maybe a dictionary or tuple isn't the right answer, but any help would be appreciated! Thanks!
In general, you should question why you would read the entire file, split the lines into a list, and then iterate over the list. This is a Python anti-pattern.
For line oriented text files, just do:
with open(fn) as f:
    for line in f:
        # process a line
It sounds, however, like you have multi-line, block-oriented patterns. If so, with smaller files, read the entire file into a single string and run a regex over that. Then use group 1 and group 2 as the key and value in your dict:
pat = re.compile(pattern, flags)
with open(file_name) as f:
    di = {m.group(1): m.group(2) for m in pat.finditer(f.read())}
With a larger file, use mmap:
import re, mmap

pat = re.compile(pattern, flags)
with open(file_name, 'r+') as f:
    mm = mmap.mmap(f.fileno(), 0)
    for i, m in enumerate(pat.finditer(mm)):
        # process each block accordingly...
As for the regex, I am a little unclear on what you are trying to capture or not. I think this regex is what you want:
^SQL> (--L-[0-9]{8})(.*?)(?=SQL> --L-[0-9]{8}|\Z)
In either case, running that regex with the example string yields:
>>> pat=re.compile(r'^SQL> (--L-[0-9]{8})\s*(.*?)\s*(?=SQL> --L-[0-9]{8}|\Z)', re.S | re.M)
>>> with open(file_name) as f:
... di={m.group(1):m.group(2) for m in pat.finditer(f.read())}
...
>>> di
{'--L-52852243': 'SQL> \nSQL> SELECT log_mode FROM v;\n\n LOG_MODE\n ------------\n NOARCHIVELOG\n\nSQL> \nSQL> archive log list\n Database log mode No Archive Mode\n Automatic archival Disabled\n Archive destination USE_DB_RECOVERY_FILE_DEST\n Oldest online log sequence 3\n Current log sequence 5\nSQL>',
'--L-93752133': 'SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;\nSQL>',
'--L-42127143': 'SQL> \nSQL> SELECT t.name TSName, e.encryptionalg Algorithm, d.file_name File Name\n 2 FROM v t\n 3 , v e\n 4 , dba_data_files d\n 5 WHERE t.ts# = e.ts#\n 6 AND t.name = d.tablespace_name;\n\n no rows selected'}
Something like this?
with open(rawFile, 'r') as inFile:
    content = inFile.read()

rawList = content.splitlines()
keyed_dict = {}
in_between_lines = ""
last_key = None
pattern = re.compile(r'^--L-[0-9]{8}')  # compile once, not on every iteration

for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
    itemID = pattern.search(cleanLine)
    if itemID is not None:
        # a new key starts; store the block collected under the previous key
        if last_key:
            keyed_dict[last_key] = in_between_lines
        last_key = itemID.group(0)
        in_between_lines = ""
    else:
        in_between_lines += cleanLine + "\n"  # keep newlines to preserve spacing

if last_key:  # flush the block collected after the final key
    keyed_dict[last_key] = in_between_lines
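Either way, once the dictionary is built, a block can be looked up by its key later in the script (using one of the keys from the example excerpt above):

print(keyed_dict['--L-52852243'])  # prints the lines captured under that key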

Print columns of Pandas dataframe to separate files + dataframe with datetime (min/sec)

I am trying to print a Pandas dataframe's columns to separate *.csv files in Python 2.7.
Using this code, I get a dataframe with 4 columns and an index of dates:
import pandas as pd
import numpy as np
import datetime as dt

rows = 10  # implied by np.random.randn(10, 4) below
col_headers = list('ABCD')
dates = pd.date_range(dt.datetime.today().strftime("%m/%d/%Y"), periods=rows)
df2 = pd.DataFrame(np.random.randn(10, 4), index=dates, columns=col_headers)
df = df2.tz_localize('UTC')  # this does not seem to be giving me hours/minutes/seconds
I then remove the index and set it to a separate column:
df['Date'] = df.index
col_headers.append('Date') #update the column keys
At this point, I just need to print all 5 columns of the dataframe to separate files. Here is what I have tried:
for ijk in range(0, len(col_headers)):
    df.to_csv('output' + str(ijk) + '.csv', columns=col_headers[ijk])
I get the following error message:
KeyError: "[['D', 'a', 't', 'e']] are not in ALL in the [columns]"
If I say:
for ijk in range(0,len(col_headers)-1):
then it works, but it does not print the 'Date' column. That is not what I want; I need the date column printed as well.
Questions:
How do I get it to print the 'Date' column to a *.csv file?
How do I get the time with hours, minutes and seconds? If the number of rows is changed from 10 to 5000, will the seconds change from one row of the dataframe to the next?
EDIT:
- Answer for Q2 ==> in the case of my particular code, see this:
dates = pd.date_range(dt.datetime.today().strftime("%m/%d/%Y %H:%M"),periods=rows)
I don't quite understand your logic, but the following is a simpler method:
for col in df:
    df[col].to_csv('output' + col + '.csv')
example:
In [41]:
for col in df2:
    print('output' + col + '.csv')
outputA.csv
outputB.csv
outputC.csv
outputD.csv
outputDate.csv
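One caveat worth checking against your pandas version (an assumption on my part, not something from the question): Series.to_csv has historically not written a header row by default, so each file may contain only the dates and the values. If you want the column name in the file as well, pass header=True:

for col in df:
    # header=True also writes the column name; the date index is written by default
    df[col].to_csv('output' + col + '.csv', header=True)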