Grabbing parts of filename with python & boto3 - amazon-web-services

I just started with Python and I am still a newbie. I want to create a function that grabs the parts of filenames matching a certain pattern; the files are stored in an S3 bucket.
So in my case, let's say I have 5 .txt files
Transfarm_DAT_005995_20190911_0300.txt
Transfarm_SupplierDivision_058346_20190911_0234.txt
Transfarm_SupplierDivision_058346_20200702_0245.txt
Transfarm_SupplierDivision_058346_20200703_0242.txt
Transfarm_SupplierDivision_058346_20200704_0241.txt
I want the script to go through these filenames and grab the category (e.g. "Transfarm_DAT") and the date (e.g. "20190911") that come before the filename extension.
Can you point me in the direction to which Python modules and possibly guides that could assist me?

Check out the split and join functions if your filenames are always like this. Otherwise, regex is another avenue.
files_list = ['Transfarm_DAT_005995_20190911_0300.txt',
              'Transfarm_SupplierDivision_058346_20190911_0234.txt',
              'Transfarm_SupplierDivision_058346_20200702_0245.txt',
              'Transfarm_SupplierDivision_058346_20200703_0242.txt',
              'Transfarm_SupplierDivision_058346_20200704_0241.txt']
category_list = []
date_list = []
for f in files_list:
    stem = f.split('.')[0]        # drop the .txt extension
    date = stem.split('_', 2)[2]  # everything after the second underscore
    category = '_'.join(stem.split('_')[:2])
    # print(category, date)
    category_list.append(category)
    date_list.append(date)
print(category_list, date_list)
Output lists:
['Transfarm_DAT', 'Transfarm_SupplierDivision', 'Transfarm_SupplierDivision', 'Transfarm_SupplierDivision', 'Transfarm_SupplierDivision'] ['005995_20190911_0300', '058346_20190911_0234', '058346_20200702_0245', '058346_20200703_0242', '058346_20200704_0241']
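To answer the modules question: re handles the parsing, and with boto3 you would first list the keys (e.g. via list_objects_v2) and then apply the same parsing to each key. A minimal regex sketch, assuming the layout <category>_<numeric id>_<8-digit date>_<time>.txt seen in the five samples:

```python
import re

# The sample keys from the question; with boto3 these names would come
# from s3.list_objects_v2(Bucket=...)['Contents'] instead (hypothetical bucket).
files_list = [
    'Transfarm_DAT_005995_20190911_0300.txt',
    'Transfarm_SupplierDivision_058346_20190911_0234.txt',
]

# Assumed layout: <category>_<numeric id>_<8-digit date>_<time>.txt
pat = re.compile(r'^(?P<category>[A-Za-z_]+?)_\d+_(?P<date>\d{8})_\d+\.txt$')

results = [(m.group('category'), m.group('date'))
           for m in (pat.match(name) for name in files_list) if m]
print(results)
```

Unlike the split approach, this pulls out just the 8-digit date rather than the whole id_date_time suffix.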

Related

How to extract parts of logs based on identification numbers?

I am trying to extract and preprocess log data for a use case.
For instance, the log consists of problem numbers with information for each ID underneath. Each element starts with:
#!#!#identification_number###96245#!#!#change_log###
action
action1
change
#!#!#attribute###value_change
#!#!#attribute1###status_change
#!#!#attribute2###<None>
#!#!#attribute3###status_change_fail
#!#!#attribute4###value_change
#!#!#attribute5###status_change
#!#!#identification_number###96246#!#!#change_log###
action
change
change1
action1
#!#!#attribute###value_change
#!#!#attribute1###status_change_fail
#!#!#attribute2###value_change
#!#!#attribute3###status_change
#!#!#attribute4###value_change
#!#!#attribute5###status_change
I extracted the identification numbers and saved them as a .csv file:
import re

with open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8") as f:
    change_log = f.read()  # findall needs a single string, not a list of lines
number = re.findall('#!#!#identification_number###(.+?)#!#!#change_log###', change_log)
Now what I am trying to achieve is that, for every ID in the .csv file, I can append the corresponding log content, which is:
action
change
#!#!#attribute###
Since I am rather new to Python and only started working with regex a few days ago, I was hoping for some help.
Each log for an ID starts with "#!#!#identification_number###" and ends with "#!#!#attribute5###<entry>".
I have tried the following code, but the result is empty:
In:
x = re.findall("\[^#!#!#identification_number###((.|\n)*)#!#!#attribute5###((.|\n)*)$]", str(change_log))
In:
print(x)
Out:
[]
Try this:
pattern='entification_number###(.+?)#!#!#change_log###(.*?)#!#!#id'
re.findall(pattern, string+'#!#!#id', re.DOTALL)
The DOTALL flag makes the dot match newlines as well, so the second capturing group should contain the log body.
If you want to get the attributes for each identification number, you can parse the log (found by the search above) of each ID with the following:
pattern='#!#!#attribute(.*?)###(.*?)#!#'
re.findall(pattern, string_for_each_log_match+'#!#', re.DOTALL)
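A runnable sketch of this two-step idea on a small hypothetical sample. Note it swaps the appended sentinel for a lookahead in the record pattern, so adjacent markers are not consumed, and drops DOTALL in the attribute pattern so each value stops at its line end:

```python
import re

# Hypothetical two-record sample in the question's log format.
log = (
    "#!#!#identification_number###96245#!#!#change_log###\n"
    "action\naction1\n"
    "#!#!#attribute###value_change\n"
    "#!#!#attribute1###status_change\n"
    "#!#!#identification_number###96246#!#!#change_log###\n"
    "change\n"
    "#!#!#attribute###status_change_fail\n"
)

# Step 1: (id, body) pairs; the lookahead leaves the next record's marker
# unconsumed, so no sentinel is needed and no record is skipped.
record_pat = (r'#!#!#identification_number###(.+?)#!#!#change_log###(.*?)'
              r'(?=#!#!#identification_number###|\Z)')
records = re.findall(record_pat, log, re.DOTALL)

# Step 2: per body, (attribute_suffix, value) pairs, one per line.
parsed = {id_no: re.findall(r'#!#!#attribute(\d*)###(.*)', body)
          for id_no, body in records}
print(parsed)
```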
If you put each ID into the regex using str.format() when you search, you can grab the lines that contain the correct changelog.
import re

with open(r'path\to\csv.csv', 'r') as f:
    ids = [line.strip() for line in f]  # strip newlines so the regex can match
with open(r'C:\Users\reszi\Desktop\Temp\output_new.txt', encoding="utf8") as f:
    change_log = f.readlines()

matches = {}
for id_no in ids:
    for i in range(len(change_log)):
        reg = '#!#!#identification_number###({})#!#!#change_log###'.format(id_no)
        if re.search(reg, change_log[i]):
            matches[id_no] = i
            break
This will create a dictionary with the structure {id_no:line_no,...}.
So once you have all of the lines that tell you where each log starts, you can grab the lines you want that come after these lines.
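As a sketch of that last step on hypothetical data: each ID's block runs from the line after its marker up to the next recorded start line.

```python
# Hypothetical miniature of the scenario: the file's lines plus the
# {id_no: line_no} dictionary built by the matching loop.
change_log = [
    "#!#!#identification_number###96245#!#!#change_log###\n",
    "action\n",
    "#!#!#attribute###value_change\n",
    "#!#!#identification_number###96246#!#!#change_log###\n",
    "change\n",
]
matches = {"96245": 0, "96246": 3}

# Each ID's block: from the line after its marker to the next marker
# (or to the end of the file for the last ID).
starts = sorted(matches.values()) + [len(change_log)]
blocks = {}
for id_no, line_no in matches.items():
    end = min(s for s in starts if s > line_no)
    blocks[id_no] = change_log[line_no + 1:end]
print(blocks)
```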

Automate process for merging CSV files in Python

I am trying to work with 12 different csv files that are stored in datafolder. I have a function that opens each file separately (openFile) and performs a specific calculation on the data. I then want to be able to apply a function to each file. The names of the files are all similar to this: UOG_001-AC_TOP-Accelerometer-2017-07-22T112654.csv. The code below shows how I was planning to read the files into the openFile function:
for file in os.listdir(DATA_PATH + 'datafolder/'):
    if file.endswith('.csv'):
        abs_path = os.path.abspath(DATA_PATH + 'datafolder/' + file)
        print(abs_path)
        data = openFile(abs_path)
        data2 = someFunction(data)
I need to merge specific files that have the same two letters in the file name; at the end I should have 6 files instead of 12. The files in datafolder are not stored in the order they need to be merged, since this will eventually be used for a larger number of files. The files all have the same header.
Am I able to supply a list of the two letters that are the key words in the file to use in regex? e.g.
list = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
Any suggestions on how I can achieve this with or without regex?
You could walk through the file tree and then, based on the first two letters of the file name, save each pair of files that needs to be merged.
fileList = {'AC': [], 'FO': [], 'CK': [], 'OR': [], 'RS': [], 'IK': []}

for file in os.listdir(DATA_PATH + 'datafolder/'):
    if file.endswith('.csv'):  # ensure we are looking at a csv
        # Add the file to its correct bucket based on the letters in its name
        fileList[extractTwoLettersFromFileName(file)].append(file)

for twoLetters, files in fileList.items():
    mergeFiles(files)
I did not provide implementations for extracting the letters and merging the files, but from your question you seem to already have them.
You can first do a simple substring check, and then based on that, classify the filenames into groups:
letters_list = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
for letters in letters_list:
    if letters in filename:
        filename_list = filename_dict.get(letters, list())
        filename_list.append(filename)
        filename_dict[letters] = filename_list
Here I use the Path object from pathlib to create a list of files whose names end with '.csv'. Then a function examines each file's name with a regex for the presence of one of the strings you mentioned, so that I can create a list of pairs of these strings with their associated filenames. Notice that the length of this list of pairs is 12, and that the filenames can be recovered from it.
Having made that list, I can use groupby from itertools to create two-element lists of files that share the strings in the file_kinds list (note that groupby expects its input sorted by the grouping key). You can merge the items in these lists.
>>> import re
>>> from itertools import groupby
>>> from operator import itemgetter
>>> from pathlib import Path
>>> file_kinds = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
>>> def try_match(filename):
... m = re.search('(%s)'%'|'.join(file_kinds), filename)
... if m:
... return m.group()
... else:
... return None
...
>>> all_files_list = [(item, try_match(item.name)) for item in list(Path(r'C:/scratch/datafolder').glob('*.csv')) if try_match(item.name)]
>>> len(all_files_list)
12
Expression for extracting full paths from all_files_list:
[str(_[0]) for _ in all_files_list]
>>> for kind, files_list in groupby(all_files_list, key=itemgetter(1)):
... kind, [str(_[0]) for _ in list(files_list)]
...
('AC', ['C:\\scratch\\datafolder\\UOG_001-AC_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-AC__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('CK', ['C:\\scratch\\datafolder\\UOG_001-CK_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-CK__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('FO', ['C:\\scratch\\datafolder\\UOG_001-FO_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-FO__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('IK', ['C:\\scratch\\datafolder\\UOG_001-IK_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-IK__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('OR', ['C:\\scratch\\datafolder\\UOG_001-OR_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-OR__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('RS', ['C:\\scratch\\datafolder\\UOG_001-RS_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-RS__B_TOP-Accelerometer-2017-07-22T112654.csv'])
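None of the answers show the merge step itself. Since the files share a header, a minimal stdlib sketch could concatenate each group while keeping the header once (io.StringIO streams here are hypothetical stand-ins for one group's opened files):

```python
import csv
import io

# Hypothetical in-memory stand-ins for two files of the same 'AC' group;
# on disk you would open() the grouped paths instead.
file_a = "time,accel\n0,1.2\n1,1.3\n"
file_b = "time,accel\n0,2.2\n1,2.3\n"

def merge_csvs(streams):
    """Concatenate CSV streams that share a header, keeping the header once."""
    merged = []
    for i, stream in enumerate(streams):
        rows = list(csv.reader(stream))
        merged.extend(rows if i == 0 else rows[1:])  # drop repeated headers
    return merged

rows = merge_csvs([io.StringIO(file_a), io.StringIO(file_b)])
print(rows)
```

The merged rows can then be written back out with csv.writer.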

'~' leading to null results in python script

I am trying to extract a dynamic value (static characters) from a csv file in a specific column and output the value to another csv.
The data element I am trying to extract is '12385730561818101591' from the value 'callback=B~12385730561818101591' located in a specific column.
I have written the below python script, but the output results are always blank. The regex '=(~[0-9]+)' was validated to successfully pull out the '12385730561818101591' value. This was tested on www.regex101.com.
When I use this in Python, no results are displayed in the output file. I have a feeling the '~' is causing the error. When I tried searching for '~' in the original CSV file, no results were found, but it is there!
Can the community help me with the following:
(1) Determine the root cause of the missing output and validate whether '~' is the problem. Could the problem also be the way I'm splitting the rows? I'm not sure if the rows should be split by ';' instead of ','.
import csv
import sys
import ast
import re

filename1 = open("example.csv", "w")
with open('example1.csv') as csvfile:
    data = None
    patterns = '=(~[0-9]+)'
    data1 = csv.reader(csvfile)
    for row in data1:
        var1 = row[57]
        for item in var1.split(','):
            if re.search(patterns, item):
                for data in item:
                    if 'common' in data:
                        filename1.write(data + '\n')
filename1.close()
Here I have tried to write some sample code. I hope this helps you solve the problem:
import re

s = "callback=B~12385730561818101591"
rc = re.match(r'.*=B~([0-9]+)', s)
print(rc.group(1))
Your regex is wrong for your example:
=(~[0-9]+) will never match callback=B~12385730561818101591 because of the B between the = and the ~.
Also, you include the ~ in the capturing group.
Not exactly sure what your goal is, but this could work. Give more details if you have more restrictions.
=.+~([0-9]+)
EDIT
Following the new provided information :
patterns = '=.+~([0-9]+)'
...
result = re.search(patterns, item)
if result:
    number = result.group(1)  # group(1) is just the digits; group(0) is the whole match
    filename1.write(number + '\n')
...
...
Concerning your line split on '\t' (tab): you should show an example of a full line.
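Putting the pieces together, a minimal runnable sketch of the whole extraction using the pattern from the answer, with a hypothetical one-row sample standing in for column 57 of the real file:

```python
import csv
import io
import re

# Hypothetical one-row sample; column index 1 here stands in for the
# question's column 57 holding the 'callback=B~...' field.
sample = 'id,info\n1,"foo;callback=B~12385730561818101591;bar"\n'

pattern = re.compile(r'=.+~([0-9]+)')  # digits after the '~', excluded from the match
out = []
for row in csv.reader(io.StringIO(sample)):
    for item in row[1].split(';'):
        m = pattern.search(item)
        if m:
            out.append(m.group(1))
print(out)  # ['12385730561818101591']
```

With the real file, open('example1.csv') replaces the StringIO and the matches are written to the output file instead of collected in a list.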

Renaming files with no fixed char length in Python

I am currently learning Python 2.7 and am really impressed by how much it can do.
Right now, I'm working my way through basics such as functions and loops. I'd reckon a more 'real-world' problem would spur me on even further.
I use a satellite recording device to capture TV shows etc to hard drive.
The naming convention is set by the device itself. It makes finding the shows you want to watch after recording more difficult, as the show name is preceded by lots of redundant info...
The recordings (in .mts format) are dumped into a folder called "HBPVR" at the root of the drive. I'd be running the script on my Mac when the drive is connected to it.
Example.
"Channel_4_+1-15062015-2100-Exams__Cheating_the_....mts"
or
"BBC_Two_HD-19052015-2320-Newsnight.mts"
I included the double-quotes.
I'd like a Python script that (ideally) would remove the broadcaster name, reformat the date info, strip the time info and then put the show's name to the front of the file name.
E.g "BBC_Two_HD-19052015-2320-Newsnight.mts" ->> "Newsnight 19 May 2015.mts"
What may complicate matters is that the broadcaster names are not all of equal length.
The main pattern is that broadcaster name runs up until the first hyphen.
I'd like to be able to re-run this script at later points for newer recordings and not have already renamed recordings renamed further.
Thanks.
Try this:
import calendar

filename = "BBC_Two_HD-19052015-2320-Newsnight.mts"

# Remove the broadcaster name (everything before the first hyphen)
filename = '-'.join(filename.split("-")[1:])
# Get the show name
show = ''.join(' '.join(filename.split("-")[2:]).split(".mts")[:-1])
# Get the date string
timestr = filename.split("-")[0]
day = int(timestr[0:2])                         # the day is the first two digits
month = calendar.month_name[int(timestr[2:4])]  # the month is the next two digits
year = timestr[4:8]                             # the year is the last four digits
# And the new string:
new = show + " " + str(day) + " " + month + " " + year + ".mts"
print(new)  # "Newsnight 19 May 2015.mts"
I wasn't quite sure what the '2320' was, so I chose to ignore it.
Thanks Coder256.
That has given me a bit more insight into how Python can actually help solve real-world (first-world!) problems like mine.
I tried it out with some different combos of broadcaster and show names and it worked.
I would like, though, to use the script to rename a batch of recordings/files inside the folder from time to time.
The script threw an error when processing an already-renamed recording, which is to be expected, I guess. Should the renamed file have a special character at the start of its name to help avoid this?
e.g. "_Newsnight 19 May 2015.mts"
Or is there a more aesthetically pleasing way of doing this, without special chars being added on, etc.?
Thanks.
One way to approach this, since you have a defined pattern is to use regular expressions:
>>> import datetime
>>> import re
>>> s = "BBC_Two_HD-19052015-2320-Newsnight.mts"
>>> ts, name = re.findall(r'.*?-(\d{8}-\d{4})-(.*?)\.mts', s)[0]
>>> '{} {}.mts'.format(name, datetime.datetime.strptime(ts, '%d%m%Y-%H%M').strftime('%d %b %Y'))
'Newsnight 19 May 2015.mts'
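To address the batch-rename follow-up without special characters: an already-renamed file no longer matches the recorder's pattern, so a helper built on the regex answer can simply return None for it and the loop can skip those files. A sketch (the folder path would be the drive's HBPVR folder):

```python
import datetime
import re

# Recorder pattern: <broadcaster>-<ddmmyyyy>-<hhmm>-<show>.mts
PAT = re.compile(r'.*?-(\d{8}-\d{4})-(.*?)\.mts$')

def new_name(filename):
    """Return the renamed form, or None if the name does not match
    the recorder's pattern (e.g. it was already renamed)."""
    m = PAT.match(filename)
    if m is None:
        return None
    ts, show = m.groups()
    when = datetime.datetime.strptime(ts, '%d%m%Y-%H%M')
    return '{} {}.mts'.format(show, when.strftime('%d %b %Y'))

print(new_name('BBC_Two_HD-19052015-2320-Newsnight.mts'))  # Newsnight 19 May 2015.mts
print(new_name('Newsnight 19 May 2015.mts'))               # None (already renamed)
```

In the batch loop you would call os.rename only when new_name(f) returns a value, so re-running the script leaves renamed recordings untouched.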

String from CSV to list - Python

I don't get it. I have a CSV data with the following content:
wurst;ball;hoden;sack
1;2;3;4
4;3;2;1
I want to iterate over the CSV data and put the heads in one list and the content in another list. Heres my code so far:
data = [i.strip() for i in open('test.csv', 'r').readlines()]
for i_c, i in enumerate(data):
    if i_c == 0:
        heads = i
    else:
        content = i

heads.split(";")
content.split(";")
print heads
That always returns the following string, not a valid list.
wurst;ball;hoden;sack
Why does split not work on this string?
Greetings and merry Christmas,
Jan
The split method returns a new list; it does not modify the string in place. Try:
heads = heads.split(";")
content = content.split(";")
I've noticed also that your data seems to all be integers. You might consider instead the following for content:
content = [int(i) for i in content.split(";")]
The reason is that split returns a list of strings, and it seems like you might need to deal with them as numbers in your code later on. Of course, disregard if you are expecting non-numeric data to show up at some point.
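For what it's worth, the stdlib csv module handles the ';' delimiter and the header/content split directly; a short sketch with the question's data inlined (io.StringIO stands in for the open file):

```python
import csv
import io

# In-memory stand-in for test.csv from the question
data = "wurst;ball;hoden;sack\n1;2;3;4\n4;3;2;1\n"

reader = csv.reader(io.StringIO(data), delimiter=';')
heads = next(reader)                                 # first row is the header
content = [[int(v) for v in row] for row in reader]  # remaining rows as ints
print(heads)    # ['wurst', 'ball', 'hoden', 'sack']
print(content)  # [[1, 2, 3, 4], [4, 3, 2, 1]]
```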