Python exclude directory with fnmatch - python-2.7

I'm working with some legacy code that I can't change (for reasons).
It uses fnmatch.fnmatch to filter a list of paths, like so (simplified):
import fnmatch
paths = ['a/x.txt', 'b/y.txt']
for path in paths:
    if fnmatch.fnmatch(path, '*.txt'):
        print 'do things'
Via configuration I am able to change the pattern used to match the files. I need to exclude everything in b/. Is that possible?
From reading the docs (https://docs.python.org/2/library/fnmatch.html) it does not appear to be, but I thought it was worth asking.

From the fnmatch.fnmatch documentation:
Patterns are Unix shell style:
* matches everything
? matches any single character
[seq] matches any character in seq
[!seq] matches any char not in seq
When I run:
for path in paths:
    if fnmatch.fnmatch(path, '[!b]*'):
        print path
I get:
a/x.txt
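This works because [!seq] negates exactly one character position: '[!b]*' matches any path whose first character is not b, so it cannot express "does not start with a multi-character prefix". A minimal sketch of that behaviour (the extra path 'ba/z.txt' is added here purely for illustration):
import fnmatch

paths = ['a/x.txt', 'b/y.txt', 'ba/z.txt']
# '[!b]*' keeps only paths whose first character is not 'b'
print [p for p in paths if fnmatch.fnmatch(p, '[!b]*')]   # prints ['a/x.txt']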

Somehow this method only works on the single character right after the "!".
For example, in my case, from the list col_names
['# Spec No', 'Name', 'Date (DD/MM/YYYY)', 'Time (hh:mm:ss)', 'Year',
'Fractional day', 'Fractional time', 'Scans', 'Tint', 'SZA',
'NO2_UV.RMS', 'NO2_UV.RefZm', 'NO2_UV.RefNumber', 'NO2_UV.SlCol(bro)',
'NO2_UV.SlErr(bro)', 'NO2_UV.SlCol(ring)', 'NO2_UV.SlErr(ring)',
'NO2_UV.SlCol(HCHO)', 'NO2_UV.SlErr(HCHO)', 'NO2_UV.SlCol(O4)',
'NO2_UV.SlErr(O4)', 'NO2_UV.SlCol(O3a)', 'NO2_UV.SlErr(O3a)',
'NO2_UV.SlCol(O3223k)', 'NO2_UV.SlErr(O3223k)', 'NO2_UV.SlCol(NO2)',
'NO2_UV.SlErr(NO2)', 'NO2_UV.SlCol(no2a)', 'NO2_UV.SlErr(no2a)',
'NO2_UV.Offset (Constant)', 'NO2_UV.Err(Offset (Constant))',
'NO2_UV.Offset (Order 1)', 'NO2_UV.Err(Offset (Order 1))',
'NO2_UV.Shift(Spectrum)', 'NO2_UV.Stretch(Spectrum)1',
'NO2_UV.Stretch(Spectrum)2', 'HCHO.RMS', 'HCHO.RefZm', 'HCHO.RefNumber',
'HCHO.SlCol(bro)', 'HCHO.SlErr(bro)', 'HCHO.SlCol(ring)',
'HCHO.SlErr(ring)', 'HCHO.SlCol(HCHO)', 'HCHO.SlErr(HCHO)',
'HCHO.SlCol(O4)', 'HCHO.SlErr(O4)', 'HCHO.SlCol(O3a)',
'HCHO.SlErr(O3a)', 'HCHO.SlCol(O3223k)', 'HCHO.SlErr(O3223k)',
'HCHO.SlCol(NO2)', 'HCHO.SlErr(NO2)', 'HCHO.Offset (Constant)',
'HCHO.Err(Offset (Constant))', 'HCHO.Offset (Order 1)',
'HCHO.Err(Offset (Order 1))', 'HCHO.Shift(Spectrum)',
'HCHO.Stretch(Spectrum)1', 'HCHO.Stretch(Spectrum)2', 'Fluxes 318',
'Fluxes 330', 'Fluxes 390', 'Fluxes 440']
I wanted to find all the names that do not contain NO2_UV.
If I do
header_hcho = fnmatch.filter(col_names, '[!NO2_UV.]*');
it also excludes the second element, "Name", because it starts with N. The result is the same as if I do
header_hcho = fnmatch.filter(col_names, '[!N]*');
So I went with a rather old-school method:
header_hcho = []
idx = 0
for idx in range(0, len(col_names)):
    if col_names[idx].find("NO2_UV") == -1:
        header_hcho.append(col_names[idx])
    idx = idx + 1
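For what it's worth, a shorter way to get the same result is a list comprehension over col_names (a minimal sketch; fnmatch itself cannot express "does not contain this substring"):
# keep every column name that does not contain the substring "NO2_UV"
header_hcho = [name for name in col_names if "NO2_UV" not in name]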


Lemmatizing Italian sentences for frequency counting

I would like to lemmatize some Italian text in order to perform some frequency counting of words and further investigations on the output of this lemmatized content.
I prefer lemmatizing over stemming because I can extract the word meaning from the context of the sentence (e.g. distinguish between a verb and a noun) and obtain words that exist in the language, rather than roots of those words that don't usually have a meaning.
I found a library called pattern (pip2 install pattern) that should complement nltk in order to perform lemmatization of Italian; however, I am not sure the approach below is correct, because each word is lemmatized by itself, not in the context of a sentence.
Probably I should give pattern the responsibility of tokenizing a sentence (and so also annotating each word with metadata about verbs/nouns/adjectives etc.), then retrieve the lemmatized word, but I have not been able to do this and I am not even sure it is possible at the moment.
Also: in Italian some articles are rendered with an apostrophe, so for example "l'appartamento" (in English "the flat") is actually 2 words: "lo" and "appartamento". Right now I cannot find a way to split these 2 words with a combination of nltk and pattern, so I am not able to count the frequency of the words in the correct way.
import nltk
import string
import pattern

# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()

# the following function is just to get the lemma
# out of the original input word (but right now
# it may be losing the context about the sentence
# from where the word is coming, i.e.
# the same word could either be a noun/verb/adjective
# according to the context)
def lemmatize_word(input_word):
    in_word = input_word  #.decode('utf-8')
    # print('Something: {}'.format(in_word))
    word_it = pattern.it.parse(
        in_word,
        tokenize=False,
        tag=False,
        chunk=False,
        lemmata=True
    )
    # print("Input: {} Output: {}".format(in_word, word_it))
    the_lemmatized_word = word_it.split()[0][0][4]
    # print("Returning: {}".format(the_lemmatized_word))
    return the_lemmatized_word

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."

# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))

# 2nd remove punctuation and lower-case everything
word_tokenized_no_punct = [string.lower(x) for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))

# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))

# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))

# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))

# difference between stemmer and lemmatizer
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)"
    .format(
        word_tokenized_no_punct_no_sw[1],
        word_tokenized_no_punct_no_sw[6],
        word_tokenize_list_no_punct_lc_no_stowords_stem[1],
        word_tokenize_list_no_punct_lc_no_stowords_stem[6],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1]
    )
)
Gives this output:
1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2)
How can I effectively lemmatize sentences with pattern using its tokenizer (so that lemmas are recognized as nouns/verbs/adjectives etc.)?
Is there a Python alternative to pattern for Italian lemmatization with nltk?
How can I split articles that are bound to the next word by an apostrophe?
I'll try to answer your question, knowing that I don't know a lot about Italian!
1) As far as I know, the main responsibility for handling the apostrophe lies with the tokenizer, and as such the nltk Italian tokenizer seems to have failed here.
3) A simple thing you can do about it is to split on the apostrophe yourself (although you will probably have to use the re package for more complicated patterns), for example:
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]
It yields:
['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure']
2) An alternative to pattern would be treetagger; granted, it is not the easiest install of all (you need the Python package and the tool itself), but after that it works on Windows and Linux.
A simple example with your sentence above:
import treetaggerwrapper
from pprint import pprint
it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = tagger.tag_text(it_string)
pprint(treetaggerwrapper.make_tags(tags))
The pprint yields:
[Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'),
Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'),
Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'),
Tag(word=u'in', pos=u'PRE', lemma=u'in'),
Tag(word=u'due', pos=u'ADJ', lemma=u'due'),
Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'),
Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'),
Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'),
Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'),
Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'),
Tag(word=u'.', pos=u'SENT', lemma=u'.'),
Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'),
Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'),
Tag(word=u'la', pos=u'DET:def', lemma=u'il'),
Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'),
Tag(word=u'con', pos=u'PRE', lemma=u'con'),
Tag(word=u'le', pos=u'DET:def', lemma=u'il'),
Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'),
Tag(word=u'.', pos=u'SENT', lemma=u'.')]
It also tokenizes all'ippodromo quite nicely into al and ippodromo (which is hopefully correct) under the hood before lemmatizing. Now we just need to remove stop words and punctuation and it will be fine; see the sketch below.
The documentation for installing the TreeTaggerWrapper library for Python.
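A minimal sketch of that clean-up step, reusing the tags list and the it_stop_words list from the snippets above, and assuming every entry returned by make_tags is a regular Tag as in the output shown:
import string
import treetaggerwrapper

tag_objects = treetaggerwrapper.make_tags(tags)
# drop punctuation tokens and Italian stop-words, keep the lemmas of everything else
lemmas = [t.lemma for t in tag_objects
          if t.word not in string.punctuation and t.lemma not in it_stop_words]
print(lemmas)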
I know this issue was solved a few years ago, but I am facing the same problem with nltk tokenization and Python 3 with regard to parsing words like all'ippodromo or dall'Italia. So I want to share my experience and give a partial, although late, answer.
The first action/rule that an NLP pipeline must take into account is to prepare the corpus. I discovered that if the ' character is replaced with a proper typographic apostrophe ’ (with a careful regex replacement during text parsing, or simply a preparatory replace-all in a basic text editor), then the tokenization works correctly and I get the proper splitting with just nltk.tokenize.word_tokenize(text).
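A minimal sketch of that preprocessing idea, relying on the replacement reported to work above (str.replace is used here for brevity; a regex would let you target only article contractions):
# -*- coding: utf-8 -*-
import nltk

text = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo."
# swap the straight apostrophe for the typographic one before tokenizing
prepared = text.replace("'", u"\u2019")
print(nltk.tokenize.word_tokenize(prepared))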

How to extract files with date pattern using python

I have n files in a folder, like:
source_dir
    abc_2017-07-01.tar
    abc_2017-07-02.tar
    abc_2017-07-03.tar
    pqr_2017-07-02.tar
Let's consider a single pattern for now, 'abc' (but I get this pattern randomly from the database, so I need double filtering: one for the pattern and one for the last day).
And I want to extract the file for the last day, i.e. '2017-07-02'.
Here I can get the files matching the pattern, but not the exact last-day file.
Code
pattern = 'abc'
allfiles=os.listdir(source_dir)
m_files=[f for f in allfiles if str(f).startswith(pattern)]
print m_files
output:
[ 'abc_2017-07-01.tar' , 'abc_2017-07-02.tar' , 'abc_2017-07-03.tar' ]
This gives me all files related to the abc pattern, but how can I filter out only the last-day file for that pattern?
Expected :
[ 'abc_2017-07-02.tar' ]
Thanks
Just a minor tweak in your code can get you the desired result.
import os
from datetime import datetime, timedelta
allfiles=os.listdir(source_dir)
file_date = datetime.now() + timedelta(days=-1)
pattern = 'abc_' +str(file_date.date())
m_files=[f for f in allfiles if str(f).startswith(pattern)]
Hope this helps!
latest = max(m_files, key=lambda x: x[-14:-4])
will find the filename with latest date among filenames in m_files.
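For example, with the filenames from the output above (the slice [-14:-4] picks out the YYYY-MM-DD part that sits just before the .tar extension):
m_files = ['abc_2017-07-01.tar', 'abc_2017-07-02.tar', 'abc_2017-07-03.tar']
# '2017-07-03' sorts last lexicographically, so max() returns that file
latest = max(m_files, key=lambda x: x[-14:-4])
print latest   # abc_2017-07-03.tar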
Use Python's re module, like:
import re
import os
files = os.listdir(source_dir)
for file in files:
    match = re.search('abc_2017-07-(\d{2})\.tar', file)
    day = match.group(1)
and then you can work with day inside the loop to do whatever you want, e.g. create that list:
import re
import os
def extract_day(name):
    match = re.search('abc_2017-07-(\d{2})\.tar', name)
    day = match.group(1)
    return day

files = os.listdir(source_dir)
days = [extract_day(file) for file in files]
If the month is also variable, you can substitute '07' with '\d\d' or '\d{2}'. Be careful if you have files that don't match the pattern at all: match.group() will then raise an error, since match is None. In that case use:
def extract_day(name):
    match = re.search('abc_2017-07-(\d{2})\.tar', name)
    try:
        day = match.group(1)
    except AttributeError:  # match is None when the name does not fit the pattern
        day = None
    return day

Python - creating a dictionary from large text file where the key matches regex pattern

My question: how do I create a dictionary from a list by assigning dictionary keys based on a regex pattern match ('^--L-[0-9]{8}'), and assigning the values by using all lines between each key?
Example excerpt from the raw file:
SQL> --L-93752133
SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;
SQL>
SQL> --L-52852243
SQL>
SQL> SELECT log_mode FROM v$database;
LOG_MODE
------------
NOARCHIVELOG
SQL>
SQL> archive log list
Database log mode No Archive Mode
Automatic archival Disabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 3
Current log sequence 5
SQL>
SQL> --L-42127143
SQL>
SQL> SELECT t.name "TSName", e.encryptionalg "Algorithm", d.file_name "File Name"
2 FROM v$tablespace t
3 , v$encrypted_tablespaces e
4 , dba_data_files d
5 WHERE t.ts# = e.ts#
6 AND t.name = d.tablespace_name;
no rows selected
Some additional detail: The raw file can be large (at least 80K+ lines, but often much larger) and I need to preserve the original spacing so the output is still easy to read. Here's how I'm reading the file in and removing "SQL>" from the beginning of each line:
with open(rawFile, 'r') as inFile:
    content = inFile.read()
    rawList = content.splitlines()

for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
Finding the dictionary keys I'm looking for is easy:
pattern = re.compile(r'^--L-[0-9]{8}')
if pattern.search(cleanLine) is not None:
    itemID = pattern.search(cleanLine)
    print(itemID.group(0))
But how do I assign all lines between each key as the value belonging to the most recent key preceding them? I've been playing around with new lists, tuples, and dictionaries but everything I do is returning garbage. The goal is to have the data and keys linked to each other so that I can return them as needed later in my script.
I spent a while searching for a similar question, but in most other cases the source file was already in a dictionary-like format so creating the new dictionary was a less complicated problem. Maybe a dictionary or tuple isn't the right answer, but any help would be appreciated! Thanks!
In general, you should question why you would read the entire file, split the lines into a list, and then iterate over the list. This is a Python anti-pattern.
For line oriented text files, just do:
with open(fn) as f:
    for line in f:
        # process a line
It sounds, however, like you have multi-line, block-oriented patterns. If so, with smaller files, read the entire file into a single string and use a regex on that. Then you would use group 1 and group 2 as the key and value in your dict:
pat = re.compile(pattern, flags)
with open(file_name) as f:
    di = {m.group(1): m.group(2) for m in pat.finditer(f.read())}
With a larger file, use a mmap:
import re, mmap

pat = re.compile(pattern, flags)
with open(file_name, 'r+') as f:
    mm = mmap.mmap(f.fileno(), 0)
    for i, m in enumerate(pat.finditer(mm)):
        # process each block accordingly...
As for the regex, I am a little unclear on what you are trying to capture or not. I think this regex is what you want:
^SQL> (--L-[0-9]{8})(.*?)(?=SQL> --L-[0-9]{8}|\Z)
In either case, running that regex with the example string yields:
>>> pat=re.compile(r'^SQL> (--L-[0-9]{8})\s*(.*?)\s*(?=SQL> --L-[0-9]{8}|\Z)', re.S | re.M)
>>> with open(file_name) as f:
...     di={m.group(1):m.group(2) for m in pat.finditer(f.read())}
...
>>> di
{'--L-52852243': 'SQL> \nSQL> SELECT log_mode FROM v;\n\n LOG_MODE\n ------------\n NOARCHIVELOG\n\nSQL> \nSQL> archive log list\n Database log mode No Archive Mode\n Automatic archival Disabled\n Archive destination USE_DB_RECOVERY_FILE_DEST\n Oldest online log sequence 3\n Current log sequence 5\nSQL>',
'--L-93752133': 'SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;\nSQL>',
'--L-42127143': 'SQL> \nSQL> SELECT t.name TSName, e.encryptionalg Algorithm, d.file_name File Name\n 2 FROM v t\n 3 , v e\n 4 , dba_data_files d\n 5 WHERE t.ts# = e.ts#\n 6 AND t.name = d.tablespace_name;\n\n no rows selected'}
Something like this?
with open(rawFile, 'r') as inFile:
    content = inFile.read()
    rawList = content.splitlines()

keyed_dict = {}
in_between_lines = ""
last_key = 0
for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
    pattern = re.compile(r'^--L-[0-9]{8}')
    if pattern.search(cleanLine) is not None:
        itemID = pattern.search(cleanLine)
        # a new key starts: store the lines collected for the previous key
        if last_key: keyed_dict[last_key] = in_between_lines
        last_key = itemID.group(0)
        in_between_lines = ""
    else:
        in_between_lines += cleanLine + "\n"  # keep line breaks to preserve the original spacing
# store the block belonging to the final key as well
if last_key: keyed_dict[last_key] = in_between_lines

R- Subset a corpus by meta data (id) matching partial strings

I'm using the R (3.2.3) tm package (0.6-2) and would like to subset my corpus according to partial string matches contained in the metadatum "id".
For example, I would like to filter all documents that contain the string "US" within the "id" column. The string "US" would be preceded and followed by various characters and numbers.
I have found a similar example here. It is recommended to download the quanteda package but I think this should also be possible with the tm package.
Another more relevant answer to a similar problem is found here. I have tried to adapt that sample code to my context. However, I don't manage to incorporate the partial string matching.
I imagine there might be multiple things wrong with my code so far.
What I have so far looks like this:
US <- tm_filter(corpus, FUN = function(corpus, filter) any(meta(corpus)["id"] == filter), grep(".*US.*", corpus))
And I receive the following error message:
Error in structure(as.character(x), names = names(x)) :
'names' attribute [3811] must be the same length as the vector [3]
I'm also not sure how to come up with a reproducible example simulating my problem for this post.
It could work like this:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
# 502 704 708
# "502" "704" "708"
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 3
You are looking for "US" instead of "0". Have a look at ?grep for details (e.g. fixed=TRUE).

Python - Sort files based on timestamp

I have a list of file names and I want to sort them based on the timestamp that is built into each file name.
Note: in a file name such as Hello_Hi_2015-02-20T084521_1424543480.tar.gz, the part 2015-02-20T084521 represents "year-month-dayTHHMMSS" (this is what I want to sort on).
Input list below:
file_list = ['Hello_Hi_2015-02-20T084521_1424543480.tar.gz',
'Hello_Hi_2015-02-20T095845_1424543481.tar.gz',
'Hello_Hi_2015-02-20T095926_1424543481.tar.gz',
'Hello_Hi_2015-02-20T100025_1424543482.tar.gz',
'Hello_Hi_2015-02-20T111631_1424543483.tar.gz',
'Hello_Hi_2015-02-20T111718_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112502_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112633_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113427_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113456_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113608_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113659_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113809_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113901_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113955_1424543485.tar.gz',
'Hello_Hi_2015-03-20T114122_1424543485.tar.gz',
'Hello_Hi_2015-02-20T114532_1424543486.tar.gz',
'Hello_Hi_2015-02-20T120045_1424543487.tar.gz',
'Hello_Hi_2015-02-20T120146_1424543487.tar.gz',
'Hello_WR_2015-02-20T084709_1424543480.tar.gz',
'Hello_WR_2015-02-20T113016_1424543486.tar.gz']
Output should be:
file_list = ['Hello_Hi_2015-02-20T084521_1424543480.tar.gz',
'Hello_WR_2015-02-20T084709_1424543480.tar.gz',
'Hello_Hi_2015-02-20T095845_1424543481.tar.gz',
'Hello_Hi_2015-02-20T095926_1424543481.tar.gz',
'Hello_Hi_2015-02-20T100025_1424543482.tar.gz',
'Hello_Hi_2015-02-20T111631_1424543483.tar.gz',
'Hello_Hi_2015-02-20T111718_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112502_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112633_1424543484.tar.gz',
'Hello_WR_2015-02-20T113016_1424543486.tar.gz',
'Hello_Hi_2015-02-20T113427_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113456_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113608_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113659_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113809_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113901_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113955_1424543485.tar.gz',
'Hello_Hi_2015-02-20T114532_1424543486.tar.gz',
'Hello_Hi_2015-02-20T120045_1424543487.tar.gz',
'Hello_Hi_2015-02-20T120146_1424543487.tar.gz',
'Hello_Hi_2015-03-20T114122_1424543485.tar.gz']
Below is the code which I have tried.
def sort( dir ):
    os.chdir( dir )
    file_list = glob.glob('Hello_*')
    file_list.sort(key=os.path.getmtime)
    print("\n".join(file_list))
    return 0
Thanks in advance!!
This worked for me; it sorts files by their filesystem timestamp, which also works for files that do not have the timestamp in the name:
import os
import re
files = [file for file in os.listdir(".") if (file.lower().endswith('.gz'))]
files.sort(key=os.path.getmtime)
for file in sorted(files, key=os.path.getmtime):
    print(file)
Would this work?
You could write list contents to a file line by line and read the file:
lines = sorted(open(open_file).readlines(), key=lambda line: line.split("_")[2])
Further, you could print out lines.
Your code is trying to sort based on the filesystem-stored modified time, not the filename time.
Since your filename encoding is slightly sane :-) if you want to sort based on filename alone, you may use:
sorted(os.listdir(dir), key=lambda s: s[9:])
That will do, but only because the timestamp encoding in the filename is sane: fixed-length prefix, zero-padded, constant-width numbers, going in sequence from biggest time reference (year) to the lowest one (second).
If your prefix is not fixed-length, you can try something with a regexp like this (which will sort by everything after the second underscore):
import re
pat = re.compile('_.*?(_)')
sorted(os.listdir(dir), key=lambda s: s[pat.search(s).end():])
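Applied to the file_list from the question, a quick check (assuming the list is defined as above) sorts the names by the embedded timestamp:
import re

pat = re.compile('_.*?(_)')
# the key is everything after the second underscore, i.e. the embedded timestamp
sorted_list = sorted(file_list, key=lambda s: s[pat.search(s).end():])
print("\n".join(sorted_list))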