I want to convert my WordPress website to a static site on GitHub using Jekyll.
I used a plugin that exports my 62 posts to GitHub as Markdown. I now have these posts with extra frontmatter at the beginning of each file. It looks like this:
---
ID: 51
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
post_excerpt: ""
layout: post
permalink: >
https://myurl.com/slug
published: true
sw_timestamp:
- "399956"
sw_open_thumbnail_url:
- >
https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
sw_cache_timestamp:
- "408644"
swp_open_thumbnail_url:
- >
https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
swp_open_graph_image_data:
- '["https://i0.wp.com/myurl.com/wp-content/uploads/2014/08/Featured_image.jpg?fit=800%2C400&ssl=1",800,400,false]'
swp_cache_timestamp:
- "410228"
---
This block isn't parsed right by Jekyll, plus I don't need all this frontmatter. I would like to have each file's frontmatter converted to
---
ID: 51
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
layout: post
published: true
---
I would like to do this with regular expressions. But my knowledge of regex is not that great. With the help of this forum and lots of Google searches I didn't get very far. I know how to find the complete piece of frontmatter but how do I replace it with a part of it as specified above?
I might have to do this in steps, but I can't wrap my head around how to do this.
I use Textwrangler as the editor to do the search and replace.
YAML (like other relatively free-form formats such as HTML, JSON and XML) is best not transformed with regular expressions: a pattern that works for one example easily breaks on the next one that has extra whitespace, different indentation, and so on.
Using a YAML parser in this situation is not trivial either, as many either expect a single YAML document in the file (and barf on the Markdown part as extraneous stuff) or expect multiple YAML documents in the file (and barf because the Markdown is not YAML). Moreover, most YAML parsers throw away useful things like comments and reorder mapping keys.
I have used a similar format (YAML header, followed by reStructuredText) for many years for my ToDo items, and use a small Python program to extract and update these files. Given input like this:
---
ID: 51 # one of the key/values to preserve
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
post_excerpt: ""
layout: post
permalink: >
https://myurl.com/slug
published: true
sw_timestamp:
- "399956"
sw_open_thumbnail_url:
- >
https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
sw_cache_timestamp:
- "408644"
swp_open_thumbnail_url:
- >
https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
swp_open_graph_image_data:
- '["https://i0.wp.com/myurl.com/wp-content/uploads/2014/08/Featured_image.jpg?fit=800%2C400&ssl=1",800,400,false]'
swp_cache_timestamp:
- "410228"
---
additional stuff that is not YAML
and more
and more
And this program ¹:
import sys
import ruamel.yaml
from pathlib import Path

def extract(file_name, position=0):
    doc_nr = 0
    if not isinstance(file_name, Path):
        file_name = Path(file_name)
    yaml_str = ""
    with file_name.open() as fp:
        for line_nr, line in enumerate(fp):
            if line.startswith('---'):
                if line_nr == 0:  # don't count --- on first line as next document
                    continue
                else:
                    doc_nr += 1
            if position == doc_nr:
                yaml_str += line
    return ruamel.yaml.round_trip_load(yaml_str, preserve_quotes=True)

def reinsert(ofp, file_name, data, position=0):
    doc_nr = 0
    inserted = False
    if not isinstance(file_name, Path):
        file_name = Path(file_name)
    with file_name.open() as fp:
        for line_nr, line in enumerate(fp):
            if line.startswith('---'):
                if line_nr == 0:
                    ofp.write(line)
                    continue
                else:
                    doc_nr += 1
            if position == doc_nr:
                if inserted:
                    continue
                ruamel.yaml.round_trip_dump(data, ofp)
                inserted = True
                continue
            ofp.write(line)

data = extract('input.yaml')
for k in list(data.keys()):
    if k not in ['ID', 'post_title', 'author', 'post_date', 'layout', 'published']:
        del data[k]
reinsert(sys.stdout, 'input.yaml', data)
You get this output:
---
ID: 51 # one of the key/values to preserve
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
layout: post
published: true
---
additional stuff that is not YAML
and more
and more
Please note that the comment on the ID line is properly preserved.
¹ This was done using ruamel.yaml, a YAML 1.2 parser of which I am the author, which tries to preserve as much information as possible on round-trips.
Editing my post because I misinterpreted the question the first time: I failed to understand that the actual post was in the same file, right after the ---.
Using egrep and GNU sed (gsed, not the default BSD sed), it's relatively easy:
# create a working copy
mv file file.old
# get only the fields you need from the frontmatter and redirect that to a new file
egrep '(---|ID|post_title|author|post_date|layout|published)' file.old > file
# get everything from the old file, but discard the frontmatter
cat file.old |gsed '/---/,/---/ d' >> file
# remove working copy
rm file.old
And if you want it all in one go:
for i in `ls`; do mv $i $i.old; egrep '(---|ID|post_title|author|post_date|layout|published)' $i.old > $i; cat $i.old |gsed '/---/,/---/ d' >> $i; rm $i.old; done
For good measure, here's what I wrote as my first response:
===========================================================
I think you're making this way too complicated.
A simple egrep will do what you want:
egrep '(---|ID|post_title|author|post_date|layout|published)' file
redirect to a new file:
egrep '(---|ID|post_title|author|post_date|layout|published)' file > newfile
a whole dir at once:
for i in `ls`; do egrep '(---|ID|post_title|author|post_date|layout|published)' $i > $i.new; done
In cases like yours it is better to use an actual YAML parser and some scripting language. Cut the metadata off each file into standalone files (or strings), then use a YAML library to load the metadata. Once the metadata is loaded, you can modify it safely with no trouble. Then use the serialize method from the very same library to create a new metadata file, and finally put the files back together.
Something like this:
<?php
list ($before, $metadata, $after) = preg_split("/\n----*\n/ms", file_get_contents($argv[1]));
$yaml = yaml_parse($metadata);
$yaml_copy = [];
foreach ($yaml as $k => $v) {
    // copy the data you wish to preserve to $yaml_copy
    if (...) {
        $yaml_copy[$k] = $yaml[$k];
    }
}
file_put_contents('new/'.$argv[1], $before."\n---\n".yaml_emit($yaml_copy)."\n---\n".$after);
(It is just an untested draft with no error checks.)
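For readers who would rather stay in Python, here is a rough sketch of the same split/parse/filter/reassemble idea using PyYAML; it assumes the file starts with --- and is just as untested as the PHP draft above:
# Sketch of the split/parse/filter/reassemble approach, assuming the file starts with '---'.
import sys
import yaml

keep = ['ID', 'post_title', 'author', 'post_date', 'layout', 'published']

with open(sys.argv[1]) as f:
    # '' before the first ---, then the frontmatter, then the Markdown body
    _, metadata, body = f.read().split('---\n', 2)

data = yaml.safe_load(metadata)
reduced = {k: v for k, v in data.items() if k in keep}

with open('new/' + sys.argv[1], 'w') as f:
    f.write('---\n' + yaml.safe_dump(reduced, default_flow_style=False) + '---\n' + body)
Note that PyYAML will drop comments and reorder the remaining keys; this only illustrates the structure of the approach.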
You could do it with gawk like this:
gawk 'BEGIN {RS="---"; FS="\000" } (FNR == 2) { print "---"; n = split($1, fm, "\n"); for (i = 1; i <= n; i++) { if ( fm[i] ~ /^(ID|post_title|author|post_date|layout|published):/) {print fm[i]} } print "---" } (FNR > 2) {print}' post1.html > post1_without_frontmatter_fields.html
You basically want to edit the file, and that is what sed (the stream editor) is for. sed works line by line, so rather than one huge multi-line substitution, delete every line inside the first --- block that does not start with one of the keys you want to keep:
sed -E '1,/^---$/{/^(---|ID:|post_title:|author:|post_date:|layout:|published:)/!d;}' file
You can also use python-frontmatter:
import frontmatter
import io
from os.path import basename, splitext
import glob

# Where are the files to modify
path = "*.markdown"

# Loop through all files
for fname in glob.glob(path):
    with io.open(fname, 'r') as f:
        # Parse the file's front matter
        post = frontmatter.load(f)
        # Iterate over a copy of the keys so we can delete from the metadata
        for k in list(post.metadata.keys()):
            if k not in ['ID', 'post_title', 'author', 'post_date', 'layout', 'published']:
                del post[k]

    # Save the modified file
    newfile = io.open(fname, 'w', encoding='utf8')
    frontmatter.dump(post, newfile)
    newfile.close()
If you want to see more examples, visit this page.
Hope it helps.
Related
I am trying to run the script below. The intention of the script is to open different fasta files one after the other and extract the geneID. The script works well if I don't use the glob.glob function; with it, I get this message: TypeError: coercing to Unicode: need string or buffer, list found
files='/home/pathtofiles/files'
#print files
#sys.exit()
for file in files:
    fastas=sorted(glob.glob(files + '/*.fasta'))
    #print fastas[0]
    output_handle=(open(fastas, 'r+'))
    genes_files=list(SeqIO.parse(output_handle, 'fasta'))
    geneID=genes_files[0].id
    print geneID
I am running out of ideas on how to direct the script to open one file after another to give me the required information.
I see what you are trying to do, but let me first explain why your current approach is not working.
You have a path to a directory with fasta files and you want to loop over the files in that directory. But observe what happens if we do:
>>> files='/home/pathtofiles/files'
>>> for file in files:
...     print file
/
h
o
m
e
/
p
a
t
h
t
o
f
i
l
e
s
/
f
i
l
e
s
Not the list of filenames you expected! files is a string and when you apply a for loop on a string you simply iterate over the characters in that string.
Also, as doctorlove correctly observed, in your code fastas is a list and open expects a path to a file as first argument. That's why you get the TypeError: ... need string, ... list found.
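You can reproduce that error directly in the interpreter (Python 2, as in your script; the filenames are made up):
>>> open(['a.fasta', 'b.fasta'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to Unicode: need string or buffer, list found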
As an aside (and this is more a problem on Windows than on Linux or Mac), it is good practice to always use raw string literals (prefix the string with an r) when working with pathnames, to prevent the unwanted expansion of backslash escape sequences like \n and \t to newline and tab.
>>> path = 'C:\Users\norah\temp'
>>> print path
C:\Users
orah emp
>>> path = r'C:\Users\norah\temp'
>>> print path
C:\Users\norah\temp
Another good practice is to use os.path.join() when combining pathnames and filenames. This prevents subtle bugs where your script works on your machine but gives an error on the machine of your colleague who has a different operating system.
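For example (the separator shown is for a POSIX system; on Windows os.path.join() would insert a backslash instead):
>>> import os
>>> os.path.join('/home/pathtofiles/files', '*.fasta')
'/home/pathtofiles/files/*.fasta'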
I would also recommend using the with statement when opening files. This assures that the filehandle gets properly closed when you're done with it.
As a final remark, file is a built-in function in Python and it is bad practice to use a variable with the same name as a built-in function because that can cause bugs or confusion later on.
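For instance, once file is bound to a string, the built-in is shadowed and a later attempt to use it fails in a confusing way (hypothetical session):
>>> file = '/home/pathtofiles/files/a.fasta'
>>> file('other.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object is not callable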
Combining all of the above, I would rewrite your code like this:
import os
import glob
from Bio import SeqIO

path = r'/home/pathtofiles/files'
pattern = os.path.join(path, '*.fasta')

for fasta_path in sorted(glob.glob(pattern)):
    print fasta_path
    with open(fasta_path, 'r+') as output_handle:
        genes_records = SeqIO.parse(output_handle, 'fasta')
        for gene_record in genes_records:
            print gene_record.id
This is the way I solved the problem, and this script works.
import os,sys
import glob
from Bio import SeqIO

def extracting_information_gene_id():
    #to extract geneID information and add the reference gene to each different file
    files=sorted(glob.glob('/home/path_to_files/files/*.fasta'))
    #print file
    #sys.exit()
    for file in files:
        #print file
        output_handle=open(file, 'r+')
        ref_genes=list(SeqIO.parse(output_handle, 'fasta'))
        geneID=ref_genes[0].id
        #print geneID
        #sys.exit()
        #to extract the geneID as a reference record from the genes_files
        query_genes=(SeqIO.index('/home/path_to_file/file.fa', 'fasta'))
        #print query_genes[geneID].format('fasta') #check point
        #sys.exit()
        ref_gene=query_genes[geneID].format('fasta')
        #print ref_gene #check point
        #sys.exit()
        output_handle.write(str(ref_gene))
        output_handle.close()
        query_genes.close()

extracting_information_gene_id()
print 'Reference gene sequence have been added'
I've been using the -Wd argument for Python and discovered tons of changes I need to make in order to prepare my upgrade to Django 2.0
python -Wd manage.py runserver
The main thing is that on_delete is due to become a required argument.
RemovedInDjango20Warning: on_delete will be a required arg for ForeignKey in Django 2.0. Set it to models.CASCADE on models and in existing migrations if you want to maintain the current default behavior.
See https://docs.djangoproject.com/en/1.9/ref/models/fields/#django.db.models.ForeignKey.on_delete
Is there an easy regex (or way) I can use to put on_delete into all of my foreign keys?
Use with care
You can use
(ForeignKey|OneToOneField)\(((?:(?!on_delete|ForeignKey|OneToOneField)[^\)])*)\)
This will search for all foreign keys that currently do not already define what happens upon deletion and also ignores anywhere you have overridden ForeignKey.
It will then capture everything inside the brackets, which allows you to replace the inner text with the capture group plus the on_delete argument:
$1($2, on_delete=models.CASCADE)
It is not advised to do a replace-all with the above; you should still step through each match to ensure no issues are created (such as PEP 8 line-length warnings).
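If you want to sanity-check the pattern outside your editor first, here is a small Python sketch of it (the sample model line is made up; the re module writes the backreferences as \1 and \2 instead of $1 and $2):
import re

# Search pattern from the answer above.
pattern = re.compile(r'(ForeignKey|OneToOneField)\(((?:(?!on_delete|ForeignKey|OneToOneField)[^\)])*)\)')

line = "    user = models.ForeignKey('auth.User', related_name='posts')"
print(pattern.sub(r'\1(\2, on_delete=models.CASCADE)', line))
# prints:     user = models.ForeignKey('auth.User', related_name='posts', on_delete=models.CASCADE)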
I had to do this, and Sayse's solution worked:
import re
import fileinput
import os, fnmatch
import glob
from pathlib import Path

# https://stackoverflow.com/questions/41571281/easy-way-to-set-on-delete-across-entire-application
# https://stackoverflow.com/questions/11898998/how-can-i-write-a-regex-which-matches-non-greedy
# https://stackoverflow.com/a/4719629/433570
# https://stackoverflow.com/a/2186565/433570

regex = r'(.*?)(ForeignKey|OneToOneField)\(((?:(?!on_delete|ForeignKey|OneToOneField)[^\)])*)\)(.*)'

index = 0
for filename in Path('apps').glob('**/migrations/*.py'):
    print(filename)
    # fileinput wants filename strings, not Path objects (os.fspath() needs Python 3.6)
    filename = (os.fspath(filename), )
    for line in fileinput.FileInput(filename, inplace=1):
        a = re.search(regex, line)
        if a:
            print('{}{}({}, on_delete=models.CASCADE){}'.format(a.group(1), a.group(2), a.group(3), a.group(4)))
        else:
            print(line, end='')
I made this bash script that may help you.
#!/bin/bash
FK=()
IFS=$'\n'
count=0
for fk in $(cat $1 | egrep -i --color -o 'models\.ForeignKey\((.*?)');
do
    FK[$count]=$fk
    #FK+=$fk
    count=$(($count + 1))
done
for c in "${FK[@]}";
do
    r=`echo "${c}" | sed -e 's/)$/,on_delete=models.CASCADE)/g'`
    a="${c}"
    sed -i "s/${c}/${r}/g" $1
done
Maybe you want a safer approach: change sed -i to sed -e and redirect the output to a file, so you can compare it against your original models.py file.
Happy coding!!
I want to end up with the following YAML file:
---
classes:
- apache
- ntp
apache::first: 1
apache::package_ensure: present
apache::port: 999
apache::second: 2
apache::service_ensure: running
ntp::bla: bla
ntp::package_ensure: present
ntp::servers: '-'
After parsing and dumping, however, I get this output:
---
apache::first: 1
apache::package_ensure: present
apache::port: 999
apache::second: 2
apache::service_ensure: running
classes:
- apache
- ntp
ntp::bla: bla
ntp::package_ensure: present
ntp::servers: '-'
Here I found the properties that should make it possible to style the document. I tried to set line_break and indent, but it does not work.
with open(config['REPOSITORY_PATH'] + '/' + file_name, 'w+') as file:
    yaml.safe_dump(data_map, file, indent=10, explicit_start=True, explicit_end=True, default_flow_style=False,
                   line_break=1)
file.close()
Please advise me of a simple approach to style the output.
You cannot do that in PyYAML. The indent option only affects mappings and not sequences. PyYAML also doesn't preserve order of mapping keys on round-tripping.
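You can see the key reordering with a one-liner (PyYAML's safe_dump sorts keys by default):
>>> import yaml
>>> yaml.safe_dump({'classes': ['apache', 'ntp'], 'ntp::bla': 'bla', 'apache::port': 999}, default_flow_style=False)
'apache::port: 999\nclasses:\n- apache\n- ntp\nntp::bla: bla\n'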
If you use ruamel.yaml (disclaimer: I am the author of that package), then getting the exact same output as the input is easy:
import ruamel.yaml

yaml_str = """\
---
classes:
  - apache    # keep the indentation
  - ntp
apache::first: 1
apache::package_ensure: present
apache::port: 999
apache::second: 2
apache::service_ensure: running
ntp::bla: bla
ntp::package_ensure: present
ntp::servers: '-'
"""

data = ruamel.yaml.round_trip_load(yaml_str)
res = ruamel.yaml.round_trip_dump(data, indent=4, block_seq_indent=2,
                                  explicit_start=True)
assert res == yaml_str
Please note that it also preserves the comment I added to the first sequence element.
You can build this from "scratch" but adding a newline is not something for which a call exists in ruamel.yaml:
import ruamel.yaml
from ruamel.yaml.tokens import CommentToken
from ruamel.yaml.error import Mark
from ruamel.yaml.comments import CommentedMap, CommentedSeq
data = CommentedMap()
data['classes'] = classes = CommentedSeq()
classes.append('apache')
classes.append('ntp')
data['apache::first'] = 1
data['apache::package_ensure'] = 'present'
data['apache::port'] = 999
data['apache::second'] = 2
data['apache::service_ensure'] = 'running'
data['ntp::bla'] = 'bla'
data['ntp::package_ensure'] = 'present'
data['ntp::servers'] = '-'
m = Mark(None, None, None, 0, None, None)
data['classes'].ca.items[1] = [CommentToken('\n\n', m, None), None, None, None]
# ^ 1 is the last item in the list
data.ca.items['apache::service_ensure'] = [None, None, CommentToken('\n\n', m, None), None]
res = ruamel.yaml.round_trip_dump(data, indent=4, block_seq_indent=2,
                                  explicit_start=True)
print(res, end='')
You will have to add the newline as comment (without '#') to the last element before the newline, i.e. the last list element and the apache::service_ensure mapping entry.
Apart from that you should ask yourself if you really want to use PyYAML which only supports (most of) YAML 1.1 from 2005 and not the latest revision YAML 1.2 from 2009.
The wordpress page you linked to doesn't seem very serious (it doesn't even have the package name, PyYAML, correct).
import sys, hashlib
import os

inputFile = 'C:\Users\User\Desktop\hashes.txt'
sourceDir = 'C:\Users\User\Desktop\Test Directory'

hashMatch = False

for root, dirs, files in os.walk(sourceDir):
    for filename in files:
        sourceDirHashes = hashlib.md5(filename)
        for digest in inputFile:
            if sourceDirHashes.hexdigest() == digest:
                hashMatch = True
                break
        if hashMatch:
            print str(filename)
        else:
            print 'hash not found'
Contents of inputFile =
2899ebdb5f7a90a216e97b3187851fc1
54c177418615a90a6424cb945f7a6aec
dd18bf3a8e0a2a3e53e2661c7fb53534
Contents of sourceDir files =
test
test 1
test 2
I almost have the code working; I'm just tripping up somewhere. The code I have posted always hits the else branch and reports that the hash hasn't been found, even though the hashes do match, as I have verified. I have provided the contents of my sourceDir so that someone can try this: the file names are test, test 1 and test 2, and the same content is in the files.
I must add, however, that I am not looking for the script to print the actual file contents, but rather the name of the file.
Could anyone suggest where I am going wrong and why it says the condition is false?
You need to open the inputFile using open(inputFile, 'rt'); then you can read the hashes from it. Also, when you do read the hashes, make sure you strip them first to get rid of the newline characters \n at the end of the lines.
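A minimal sketch of that fix, keeping the rest of your script (including the md5-of-the-filename logic) as it is:
import os
import hashlib

inputFile = r'C:\Users\User\Desktop\hashes.txt'
sourceDir = r'C:\Users\User\Desktop\Test Directory'

# Read the hash list once, stripping the trailing newlines
with open(inputFile, 'rt') as f:
    digests = set(line.strip() for line in f)

for root, dirs, files in os.walk(sourceDir):
    for filename in files:
        if hashlib.md5(filename).hexdigest() in digests:
            print str(filename)
        else:
            print 'hash not found'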
I have a txt file that I would like to alter so I will be able to place the data into columns (see the example below). The reason behind this is so I can import the data into a database / array and perform calculations on it. I tried importing/pasting the data into LibreCalc, but it either imports everything into one column or opens the file in LibreWriter. I'm using Ubuntu 10.04. Any ideas? I'm willing to use another program to work around this issue. I could also work with a comma-delimited file, but I'm not too sure how to convert the data to that format automatically.
Trying to get this:
WAVELENGTH, WAVENUMBER, INTENSITY, CLASSIFICATION, CODE,
1132.8322, 88274.326, 2300, PT II, 9356- 97630, 05,
Here's a link to the full file.
pt.txt file
Try this:
sed -r 's/(\s+)/,\1/g' pt.txt
Is this what you want?
awk 'BEGIN{OFS=","}NF>1{$1=$1;print}' pt.txt
If you want the output format to look better, and you have "column" installed, you can try this too:
awk 'BEGIN{OFS=", "}NF>1{$1=$1;print}' pt.txt|column -t
The awk and sed one-liners are cool, but I expect you'll end up needing to do more than simply splitting up the file. If you do, and if you have access to Python 2.7, the following little script will get you going.
# -*- coding: utf-8 -*-
"""Convert to comma-delimited"""
import csv
from os import path
import re
import sys

def splitline(line):
    return re.split('\s{2,}', line)

def main():
    srcpath = path.abspath(sys.argv[1])
    targetpath = path.splitext(srcpath)[0] + '.csv'
    with open(srcpath) as infile, open(targetpath, 'w') as outfile:
        writer = csv.writer(outfile)
        for line in infile:
            if line.startswith(' '):
                line = line.strip()
                cols = splitline(line)
                writer.writerow(cols)

if __name__ == '__main__':
    main()
The easiest way turned out to be importing using a fixed width, like tohuwawohu suggested.
Thanks!
Without transforming it to a comma-separated file, you could access the CSV import options by simply changing the file extension to .csv (maybe you should remove the "header" part manually, so that only the column heads and the data rows remain). After that, you can try to use whitespace as the column delimiter, or even easier: select "fixed width" and set the columns manually. – tohuwawohu Oct 20 at 9:23