PyYAML, safe_dump adding line breaks and indent to the YAML file - python-2.7

I want to produce the following YAML file:
---
classes:
  - apache
  - ntp

apache::first: 1
apache::package_ensure: present
apache::port: 999
apache::second: 2
apache::service_ensure: running

ntp::bla: bla
ntp::package_ensure: present
ntp::servers: '-'
After parsing and dumping, I get this output instead:
---
apache::first: 1
apache::package_ensure: present
apache::port: 999
apache::second: 2
apache::service_ensure: running
classes:
- apache
- ntp
ntp::bla: bla
ntp::package_ensure: present
ntp::servers: '-'
I found a page describing the options that can be used to style the document. I tried setting line_break and indent, but it does not work.
with open(config['REPOSITORY_PATH'] + '/' + file_name, 'w+') as file:
    yaml.safe_dump(data_map, file, indent=10, explicit_start=True, explicit_end=True, default_flow_style=False,
                   line_break=1)
    file.close()
Please advise me on a simple approach to style the output.

You cannot do that in PyYAML. The indent option only affects mappings and not sequences. PyYAML also doesn't preserve order of mapping keys on round-tripping.
If you use ruamel.yaml (disclaimer: I am the author of that package), then getting output that exactly matches the input is easy:
import ruamel.yaml
yaml_str = """\
---
classes:
  - apache # keep the indentation
  - ntp

apache::first: 1
apache::package_ensure: present
apache::port: 999
apache::second: 2
apache::service_ensure: running

ntp::bla: bla
ntp::package_ensure: present
ntp::servers: '-'
"""
data = ruamel.yaml.round_trip_load(yaml_str)
res = ruamel.yaml.round_trip_dump(data, indent=4, block_seq_indent=2,
                                  explicit_start=True)
assert res == yaml_str
Please note that it also preserves the comment I added to the first sequence element.
You can build this from "scratch" but adding a newline is not something for which a call exists in ruamel.yaml:
from __future__ import print_function  # needed for print(..., end='') on Python 2.7
import ruamel.yaml
from ruamel.yaml.tokens import CommentToken
from ruamel.yaml.error import Mark
from ruamel.yaml.comments import CommentedMap, CommentedSeq
data = CommentedMap()
data['classes'] = classes = CommentedSeq()
classes.append('apache')
classes.append('ntp')
data['apache::first'] = 1
data['apache::package_ensure'] = 'present'
data['apache::port'] = 999
data['apache::second'] = 2
data['apache::service_ensure'] = 'running'
data['ntp::bla'] = 'bla'
data['ntp::package_ensure'] = 'present'
data['ntp::servers'] = '-'
m = Mark(None, None, None, 0, None, None)
data['classes'].ca.items[1] = [CommentToken('\n\n', m, None), None, None, None]
# ^ 1 is the last item in the list
data.ca.items['apache::service_ensure'] = [None, None, CommentToken('\n\n', m, None), None]
res = ruamel.yaml.round_trip_dump(data, indent=4, block_seq_indent=2,
                                  explicit_start=True)
print(res, end='')
You will have to add the newline as a comment (without '#') to the last element before the newline, i.e. the last list element and the apache::service_ensure mapping entry.
Apart from that, you should ask yourself whether you really want to use PyYAML, which only supports (most of) YAML 1.1 from 2005 and not the latest revision, YAML 1.2, from 2009.
The wordpress page you linked to doesn't seem very serious (it doesn't even have the package name, PyYAML, correct).

Related

Audio Timeout error in Speech to text API of Google Cloud

I am building my own Jarvis, an assistant that listens all the time and activates when I say "hello". I learned that the Google Cloud Speech-to-Text API does not listen for more than 60 seconds, but then I found the not-so-well-known script linked below, which listens indefinitely. The author of the GitHub script says the trick is that the script refreshes the stream after 60 seconds, so that the program doesn't crash.
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe_streaming_indefinite.py
Below is my modified version, since I wanted it to answer only the questions that follow "hello", rather than respond all the time. Now, if I ask my Jarvis a question whose answer takes more than 60 seconds, the script does not get the chance to refresh and the program crashes :(
#!/usr/bin/env python
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Google Cloud Speech API sample application using the streaming API.
NOTE: This module requires the additional dependency `pyaudio`. To install
using pip:
pip install pyaudio
Example usage:
python transcribe_streaming_indefinite.py
"""
# [START speech_transcribe_infinite_streaming]
from __future__ import division
import time
import re
import sys
import os
from google.cloud import speech
from pygame.mixer import *
from googletrans import Translator
# running=True
translator = Translator()
init()
import pyaudio
from six.moves import queue
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "C:\\Users\\mnauf\\Desktop\\rehandevice\\key.json"
from commands2 import commander
cmd=commander()
# Audio recording parameters
STREAMING_LIMIT = 55000
SAMPLE_RATE = 16000
CHUNK_SIZE = int(SAMPLE_RATE / 10) # 100ms
def get_current_time():
    return int(round(time.time() * 1000))
def duration_to_secs(duration):
    return duration.seconds + (duration.nanos / float(1e9))
class ResumableMicrophoneStream:
    """Opens a recording stream as a generator yielding the audio chunks."""
    def __init__(self, rate, chunk_size):
        self._rate = rate
        self._chunk_size = chunk_size
        self._num_channels = 1
        self._max_replay_secs = 5
        # Create a thread-safe buffer of audio data
        self._buff = queue.Queue()
        self.closed = True
        self.start_time = get_current_time()
        # 2 bytes in 16 bit samples
        self._bytes_per_sample = 2 * self._num_channels
        self._bytes_per_second = self._rate * self._bytes_per_sample
        self._bytes_per_chunk = (self._chunk_size * self._bytes_per_sample)
        self._chunks_per_second = (
            self._bytes_per_second // self._bytes_per_chunk)
    def __enter__(self):
        self.closed = False
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            channels=self._num_channels,
            rate=self._rate,
            input=True,
            frames_per_buffer=self._chunk_size,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )
        return self
    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()
    def _fill_buffer(self, in_data, *args, **kwargs):
        """Continuously collect data from the audio stream, into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue
    def generator(self):
        while not self.closed:
            if get_current_time() - self.start_time > STREAMING_LIMIT:
                self.start_time = get_current_time()
                break
            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]
            # Now consume whatever other data's still buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break
            yield b''.join(data)
def search(responses, stream, code):
    responses = (r for r in responses if (
        r.results and r.results[0].alternatives))
    num_chars_printed = 0
    for response in responses:
        if not response.results:
            continue
        # The `results` list is consecutive. For streaming, we only care about
        # the first result being considered, since once it's `is_final`, it
        # moves on to considering the next utterance.
        result = response.results[0]
        if not result.alternatives:
            continue
        # Display the transcription of the top alternative.
        top_alternative = result.alternatives[0]
        transcript = top_alternative.transcript
        # music.load("/home/pi/Desktop/rehandevice/end.mp3")
        # music.play()
        # Display interim results, but with a carriage return at the end of the
        # line, so subsequent lines will overwrite them.
        # If the previous result was longer than this one, we need to print
        # some extra spaces to overwrite the previous result
        overwrite_chars = ' ' * (num_chars_printed - len(transcript))
        if not result.is_final:
            sys.stdout.write(transcript + overwrite_chars + '\r')
            sys.stdout.flush()
            num_chars_printed = len(transcript)
        else:
            #print(transcript + overwrite_chars)
            # Exit recognition if any of the transcribed phrases could be
            # one of our keywords.
            if code=='ur-PK':
                transcript=translator.translate(transcript).text
            print("Your command: ", transcript + overwrite_chars)
            if "hindi assistant" in (transcript+overwrite_chars).lower():
                cmd.respond("Alright. Talk to me in urdu",code=code)
                main('ur-PK')
            elif "english assistant" in (transcript+overwrite_chars).lower():
                cmd.respond("Alright. Talk to me in English",code=code)
                main('en-US')
            cmd.discover(text=transcript + overwrite_chars,code=code)
            for i in range(10):
                print("Hello world")
            break
            num_chars_printed = 0
def listen_print_loop(responses, stream, code):
    """Iterates through server responses and prints them.
    The responses passed is a generator that will block until a response
    is provided by the server.
    Each response may contain multiple results, and each result may contain
    multiple alternatives; for details, see https://cloud.google.com/speech-to-text/docs/reference/rpc/google.cloud.speech.v1#streamingrecognizeresponse. Here we
    print only the transcription for the top alternative of the top result.
    In this case, responses are provided for interim results as well. If the
    response is an interim one, print a line feed at the end of it, to allow
    the next result to overwrite it, until the response is a final one. For the
    final one, print a newline to preserve the finalized transcription.
    """
    responses = (r for r in responses if (
        r.results and r.results[0].alternatives))
    music.load(r"C:\\Users\\mnauf\\Desktop\\rehandevice\\coins.mp3")
    num_chars_printed = 0
    for response in responses:
        if not response.results:
            continue
        # The `results` list is consecutive. For streaming, we only care about
        # the first result being considered, since once it's `is_final`, it
        # moves on to considering the next utterance.
        result = response.results[0]
        if not result.alternatives:
            continue
        # Display the transcription of the top alternative.
        top_alternative = result.alternatives[0]
        transcript = top_alternative.transcript
        # Display interim results, but with a carriage return at the end of the
        # line, so subsequent lines will overwrite them.
        #
        # If the previous result was longer than this one, we need to print
        # some extra spaces to overwrite the previous result
        overwrite_chars = ' ' * (num_chars_printed - len(transcript))
        if not result.is_final:
            sys.stdout.write(transcript + overwrite_chars + '\r')
            sys.stdout.flush()
            num_chars_printed = len(transcript)
        else:
            print("Listen print loop", transcript + overwrite_chars)
            # Exit recognition if any of the transcribed phrases could be
            # one of our keywords.
            if re.search(r'\b(hello)\b', transcript.lower(), re.I):
                #print("Give me order")
                music.play()
                search(responses, stream,code)
                break
            elif re.search(r'\b(ہیلو)\b', transcript, re.I):
                music.play()
                search(responses, stream,code)
                break
            num_chars_printed = 0
def main(code):
    cmd.respond("I am Rayhaan dot A Eye. How can I help you?",code=code)
    client = speech.SpeechClient()
    config = speech.types.RecognitionConfig(
        encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=SAMPLE_RATE,
        language_code='en-US',
        max_alternatives=1,
        enable_word_time_offsets=True)
    streaming_config = speech.types.StreamingRecognitionConfig(
        config=config,
        interim_results=True)
    mic_manager = ResumableMicrophoneStream(SAMPLE_RATE, CHUNK_SIZE)
    print('Say "Quit" or "Exit" to terminate the program.')
    with mic_manager as stream:
        while not stream.closed:
            audio_generator = stream.generator()
            requests = (speech.types.StreamingRecognizeRequest(
                audio_content=content)
                for content in audio_generator)
            responses = client.streaming_recognize(streaming_config,
                                                   requests)
            # Now, put the transcription responses to use.
            try:
                listen_print_loop(responses, stream, code)
            except:
                listen
if __name__ == '__main__':
    main('en-US')
# [END speech_transcribe_infinite_streaming]
You can call your functions after recognition in a different thread. Example:
from threading import Thread
new_thread = Thread(target=music.play)
new_thread.daemon = True # Not always needed, read more about daemon property
new_thread.start()
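As a minimal sketch of that idea (Python 3, standard library only), the slow work can be handed to a background thread so the loop that consumes audio is not blocked. Here `slow_response` is a hypothetical stand-in for whatever long-running call you make after recognition, for example `cmd.discover`; it is not part of the original script.

```python
import threading
import time

def slow_response(text, done):
    """Hypothetical stand-in for a slow handler, e.g. answering a question."""
    time.sleep(0.2)            # simulate work that would otherwise block the loop
    done.append(text.upper())

done = []
worker = threading.Thread(target=slow_response, args=("hello", done))
worker.daemon = True           # do not keep the process alive for this thread
worker.start()

# The calling code continues immediately after start(); join() is used here
# only so that the demo can inspect the result.
worker.join()
print(done)
```

In the real program you would not join right away; the recognition loop would simply carry on while the worker finishes.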
Or, if you just want to keep the exception from stopping the program, you can always use try/except. Example:
with mic_manager as stream:
    while not stream.closed:
        try:
            audio_generator = stream.generator()
            requests = (speech.types.StreamingRecognizeRequest(
                audio_content=content)
                for content in audio_generator)
            responses = client.streaming_recognize(streaming_config,
                                                   requests)
            # Now, put the transcription responses to use.
            listen_print_loop(responses, stream, code)
        except Exception as e:
            # Exception (not BaseException) so that Ctrl-C still stops the program
            print("Exception occurred - {}".format(str(e)))
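The restart-on-error pattern can be tried out without the Speech API. The sketch below is only an illustration under stated assumptions: `run_forever`, `flaky_session` and the `should_stop` callable are hypothetical names, with `flaky_session` playing the role of one streaming session that sometimes times out.

```python
def run_forever(session, should_stop):
    """Call session() repeatedly, restarting it after ordinary errors.

    Both arguments are hypothetical callables supplied by the caller.
    """
    restarts = 0
    while not should_stop():
        try:
            session()
        except Exception as exc:   # KeyboardInterrupt/SystemExit still propagate
            restarts += 1
            print("Session failed, restarting: {}".format(exc))
    return restarts


# Demo: a session that fails twice, then succeeds and asks the loop to stop.
state = {"calls": 0}

def flaky_session():
    state["calls"] += 1
    if state["calls"] <= 2:
        raise RuntimeError("stream timed out")

restarts = run_forever(flaky_session, lambda: state["calls"] >= 3)
```

Catching `Exception` rather than everything keeps Ctrl-C usable, which matters in a loop that is meant to run indefinitely.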

PyYAML shows "ScannerError: mapping values are not allowed here" in my unittest

I am trying to test a number of Python 2.7 classes using unittest.
Here is the exception:
ScannerError: mapping values are not allowed here
in "<unicode string>", line 3, column 32:
... file1_with_path: '../../testdata/concat1.csv'
Here is the example the error message relates to:
class TestConcatTransform(unittest.TestCase):
    def setUp(self):
        filename1 = os.path.dirname(os.path.realpath(__file__)) + '/../../testdata/concat1.pkl'
        self.df1 = pd.read_pickle(filename1)
        filename2 = os.path.dirname(os.path.realpath(__file__)) + '/../../testdata/concat2.pkl'
        self.df2 = pd.read_pickle(filename2)
        self.yamlconfig = u'''
        --- !ConcatTransform
        file1_with_path: '../../testdata/concat1.csv'
        file2_with_path: '../../testdata/concat2.csv'
        skip_header_lines: [0]
        duplicates: ['%allcolumns']
        outtype: 'dataframe'
        client: 'testdata'
        addcolumn: []
        '''
        self.testconcat = yaml.load(self.yamlconfig)
What is the problem?
Something not clear to me is that the directory structure I have is:
app
app/etl
app/tests
The ConcatTransform is in app/etl/concattransform.py and TestConcatTransform is in app/tests. I import ConcatTransform into the TestConcatTransform unittest with this import:
from app.etl import concattransform
How does PyYAML associate that class with the one defined in yamlconfig?
A YAML document can start with a document start marker ---, but that has to be at the beginning of a line, and yours is indented eight positions on the second line of the input. That causes the --- to be interpreted as the beginning of a multi-line plain (i.e. non-quoted) scalar, and within such a scalar you cannot have ": " (colon + space). You can only have ": " in quoted scalars. And if your document does not have a mapping or sequence at the root level, as yours doesn't, the whole document can only consist of a single scalar.
If you want to keep your sources nicely indented like you have now, I recommend you use dedent from textwrap.
The following runs without error:
import ruamel.yaml
from textwrap import dedent
yaml_config = dedent(u'''\
    --- !ConcatTransform
    file1_with_path: '../../testdata/concat1.csv'
    file2_with_path: '../../testdata/concat2.csv'
    skip_header_lines: [0]
    duplicates: ['%allcolumns']
    outtype: 'dataframe'
    client: 'testdata'
    addcolumn: []
''')
yaml = ruamel.yaml.YAML()
data = yaml.load(yaml_config)
You should get into the habit of putting a backslash (\) right after your opening triple quotes, so that your YAML document does not start with an empty line. If you do that, your error would have indicated line 2 instead of line 3, because the document doesn't start with an empty line anymore.
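The effect of dedent and of the trailing backslash can be checked with the standard library alone. This is a small sketch using a shortened version of the config (only the tag line and one key are kept, which is a simplification for the example):

```python
from textwrap import dedent

# Without the backslash and without dedent: a leading empty line, and the
# '---' marker is indented, so it is no longer at the start of a line.
indented = '''
        --- !ConcatTransform
        client: 'testdata'
        '''

# With the backslash and dedent: no empty first line, marker in column 0.
fixed = dedent('''\
        --- !ConcatTransform
        client: 'testdata'
        ''')

print(repr(indented.splitlines()[0]))
print(repr(fixed))
```

dedent removes the whitespace common to all lines and reduces the whitespace-only last line to nothing, so `fixed` starts directly with the document start marker.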
During loading, the YAML parser encounters the tag !ConcatTransform. A constructor for an object is probably registered with the PyYAML loader during the import, using PyYAML's add_constructor, which associates that tag with the class.
Unfortunately they registered their constructor with the default, non-safe loader. That is not necessary: they could have registered it with the SafeLoader, and thereby not forced users to risk problems with uncontrolled input.

how to get python to recognize the ® symbol [duplicate]

Hi there, I am trying to make Python recognize ® as a symbol (in case it doesn't show up well here, it is the capital R within a circle, known as the 'registered' symbol).
I understand that Python does not recognize it because of ASCII; however, I was wondering if anyone knows of a way to use a different encoding that includes this symbol, or a method to make Python 'ignore' it.
For some context:
I am trying to make an auto-checkout program for a website, so my program needs to match the item that the user wants. To do this I am using BeautifulSoup to scrape information; however, the symbol '®' appears in the names of a few of the items, causing Python to crash.
Here is the code that I am currently using, which does not work because of ASCII:
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
Any help would be appreciated
Here is the entirety of the program so far (ignore the mess, it is nowhere near done):
import time
import webbrowser
from selenium import webdriver
import mechanize
from bs4 import BeautifulSoup
import urllib2
from selenium.webdriver.support.ui import Select
CnI = []
item = []
colour = []
Uhrefs = []
Whrefs = []
FinalColours = []
selectItemindex = []
selectColourindex = []
#counters
Ccounter = 0
Icounter = 0
Splitcounter = 1
#wanted items suffix options:jackets, shirts, tops_sweaters, sweatshirts, pants, shorts, hats, bags, accessories, skate
suffix = 'accessories'
Wcolour = 'Black'
Witem = '2-Tone Nylon 6-Panel'
driver=webdriver.Chrome()
driver.get('http://www.supremenewyork.com/shop/all/'+suffix)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print(soup)
for colour in soup.find_all('a', attrs={"class":"name-link"}, href=True):
    CnI.append(str(colour.text))
    Uhrefs.append(str(colour.get('href')))
    print(colour)
print('#############')
for each in CnI:
    each.split(',')
    print(each)
while Splitcounter<=len(CnI):
    item.append(CnI[Splitcounter-1])
    FinalColours.append(CnI[Splitcounter])
    Whrefs.append(Uhrefs[Splitcounter])
    Splitcounter+=2
print(Uhrefs)
for each in item:
    print(each)
for z in FinalColours:
    print(z)
for i in Whrefs:
    print(i)
##for i in item:
##    hold = item.index(i)
##    print(hold)
##    if Witem == i and Wcolour == FinalColours[i]:
##        print('correct')
##
##
for count,elem in enumerate(item):
    if Witem in elem:
        selectItemindex.append(count+1)
for count,elem in enumerate(FinalColours):
    if Wcolour in elem:
        selectColourindex.append(count+1)
print(selectColourindex)
print(selectItemindex)
for each in selectColourindex:
    if selectColourindex[Ccounter] in selectItemindex:
        point = selectColourindex[Ccounter]
        print(point)
    else:
        Ccounter+=1
web = 'http://www.supremenewyork.com'+Whrefs[point-1]
driver.get(web)
elem1 = driver.find_element_by_name('commit')
elem1.click()
time.sleep(1)
elem2 = driver.find_element_by_link_text('view/edit basket')
elem2.click()
time.sleep(1)
elem3 = driver.find_element_by_link_text('checkout now')
elem3.click()
"®" is not an ASCII character, so in Python 2 converting it with str() fails, because str() encodes with the ASCII codec by default. Instead of using str(), use something like this:
unicode(input_string, 'utf8')
# or
unicode(input_string, 'unicode-escape')
Edit: Given the code surrounding the initial snippet that was posted later and the fact that BeautifulSoup actually returns unicode already, it seems that removal of str() might be the best course of action and @MarkTolonen's answer is spot-on.
BeautifulSoup returns Unicode strings. Stop converting them back to byte strings. Best practice when dealing with text is to:
1. Decode incoming text to Unicode (what BeautifulSoup is doing).
2. Process all text using Unicode.
3. Encode outgoing text from Unicode to bytes (to file, to database, to sockets, etc.).
Small example of your issue:
text = u'\N{REGISTERED SIGN}' # syntax to create a Unicode codepoint by name.
bytes = str(text)
Output:
Traceback (most recent call last):
  File "test.py", line 2, in <module>
    bytes = str(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 0: ordinal not in range(128)
Note that the first line works and supports the character. Converting it to a byte string fails because str() defaults to encoding in ASCII. You can explicitly encode it with another encoding (e.g. bytes = text.encode('utf8')), but that breaks rule 2 above and creates other issues.
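The same behaviour can be reproduced in Python 3, where str is already Unicode (in Python 2 the equivalent type is unicode). The sketch below uses a made-up item name, which is an assumption for illustration only: the text is processed as Unicode and encoded only at the output boundary.

```python
# Hypothetical item name containing the registered sign (U+00AE).
text = "Gore-Tex\N{REGISTERED SIGN} Jacket"

# Encoding to ASCII fails because U+00AE has no ASCII representation.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Matching works fine while the text stays Unicode.
matches = "Jacket" in text

# Encode only when the text leaves the program, and decode on the way back in.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

Keeping the scraped names as Unicode and comparing them directly means the registered sign never has to pass through the ASCII codec.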
Suggested reading:
https://nedbatchelder.com/text/unipain.html
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Rewrite YAML frontmatter with regular expression

I want to convert my WordPress website to a static site on GitHub using Jekyll.
I used a plugin that exports my 62 posts to GitHub as Markdown. I now have these posts with extra frontmatter at the beginning of each file. It looks like this:
---
ID: 51
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
post_excerpt: ""
layout: post
permalink: >
  https://myurl.com/slug
published: true
sw_timestamp:
  - "399956"
sw_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
sw_cache_timestamp:
  - "408644"
swp_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
swp_open_graph_image_data:
  - '["https://i0.wp.com/myurl.com/wp-content/uploads/2014/08/Featured_image.jpg?fit=800%2C400&ssl=1",800,400,false]'
swp_cache_timestamp:
  - "410228"
---
This block isn't parsed right by Jekyll, plus I don't need all this frontmatter. I would like to have each file's frontmatter converted to
---
ID: 51
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
layout: post
published: true
---
I would like to do this with regular expressions. But my knowledge of regex is not that great. With the help of this forum and lots of Google searches I didn't get very far. I know how to find the complete piece of frontmatter but how do I replace it with a part of it as specified above?
I might have to do this in steps, but I can't wrap my head around how to do this.
I use Textwrangler as the editor to do the search and replace.
YAML (like other relatively free-form formats such as HTML, JSON and XML) is best not transformed using regular expressions: it is easy to get something working for one example that then breaks on the next one, which has extra whitespace, different indentation, etc.
Using a YAML parser in this situation is not trivial, as many either expect a single YAML document in the file (and barf on the Markdown part as extraneous stuff) or expect multiple YAML documents in the file (and barf because the Markdown is not YAML). Moreover, most YAML parsers throw away useful things like comments and reorder mapping keys.
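One way around the mixed-content problem is to separate the header from the body before any YAML parser sees the file. The following is a minimal standard-library sketch of that splitting step only; `split_front_matter` is a hypothetical helper name, and it assumes the front matter is delimited by lines consisting of exactly `---`.

```python
def split_front_matter(text):
    """Return (yaml_header, body) for text that starts with a '---' line.

    If there is no complete leading front matter block, the header is ''
    and the body is the whole text.
    """
    lines = text.splitlines(True)            # keep the line endings
    if not lines or lines[0].rstrip() != '---':
        return '', text
    for idx in range(1, len(lines)):
        if lines[idx].rstrip() == '---':     # closing marker found
            header = ''.join(lines[1:idx])
            body = ''.join(lines[idx + 1:])
            return header, body
    return '', text                          # unterminated header: leave as-is


sample = "---\nID: 51\nlayout: post\n---\n# Title\n\nBody text\n"
header, body = split_front_matter(sample)
```

The header string can then be handed to a YAML library on its own, and the body written back unchanged after the rewritten header.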
I have used a similar format (YAML header, followed by reStructuredText) for many years for my ToDo items, and use a small Python program to extract and update these files. Given input like this:
---
ID: 51 # one of the key/values to preserve
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
post_excerpt: ""
layout: post
permalink: >
  https://myurl.com/slug
published: true
sw_timestamp:
  - "399956"
sw_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
sw_cache_timestamp:
  - "408644"
swp_open_thumbnail_url:
  - >
    https://myurl.com/wp-content/uploads/2014/08/Featured_image.jpg
swp_open_graph_image_data:
  - '["https://i0.wp.com/myurl.com/wp-content/uploads/2014/08/Featured_image.jpg?fit=800%2C400&ssl=1",800,400,false]'
swp_cache_timestamp:
  - "410228"
---
additional stuff that is not YAML
and more
and more
And this program ¹:
import sys
import ruamel.yaml
from pathlib import Path
def extract(file_name, position=0):
    doc_nr = 0
    if not isinstance(file_name, Path):
        file_name = Path(file_name)
    yaml_str = ""
    with file_name.open() as fp:
        for line_nr, line in enumerate(fp):
            if line.startswith('---'):
                if line_nr == 0:  # don't count --- on first line as next document
                    continue
                else:
                    doc_nr += 1
            if position == doc_nr:
                yaml_str += line
    return ruamel.yaml.round_trip_load(yaml_str, preserve_quotes=True)
def reinsert(ofp, file_name, data, position=0):
    doc_nr = 0
    inserted = False
    if not isinstance(file_name, Path):
        file_name = Path(file_name)
    with file_name.open() as fp:
        for line_nr, line in enumerate(fp):
            if line.startswith('---'):
                if line_nr == 0:
                    ofp.write(line)
                    continue
                else:
                    doc_nr += 1
            if position == doc_nr:
                if inserted:
                    continue
                ruamel.yaml.round_trip_dump(data, ofp)
                inserted = True
                continue
            ofp.write(line)
data = extract('input.yaml')
for k in list(data.keys()):
    if k not in ['ID', 'post_title', 'author', 'post_date', 'layout', 'published']:
        del data[k]
reinsert(sys.stdout, 'input.yaml', data)
You get this output:
---
ID: 51 # one of the key/values to preserve
post_title: Here's my post title
author: Frank Meeuwsen
post_date: 2014-07-03 22:10:11
layout: post
published: true
---
additional stuff that is not YAML
and more
and more
Please note that the comment on the ID line is properly preserved.
¹ This was done using ruamel.yaml, a YAML 1.2 parser of which I am the author, which tries to preserve as much information as possible on round-trips.
Editing my post because I misinterpreted the question the first time, I failed to understand that the actual post was in the same file, right after the ---
Using egrep and GNU sed, so not the bash built-in, it's relatively easy:
# create a working copy
mv file file.old
# get only the fields you need from the frontmatter and redirect that to a new file
egrep '(---|ID|post_title|author|post_date|layout|published)' file.old > file
# get everything from the old file, but discard the frontmatter
cat file.old |gsed '/---/,/---/ d' >> file
# remove working copy
rm file.old
And if you want it all in one go:
for i in `ls`; do mv $i $i.old; egrep '(---|ID|post_title|author|post_date|layout|published)' $i.old > $i; cat $i.old |gsed '/---/,/---/ d' >> $i; rm $i.old; done
For good measure, here's what I wrote as my first response:
===========================================================
I think you're making this way too complicated.
A simple egrep will do what you want:
egrep '(---|ID|post_title|author|post_date|layout|published)' file
redirect to a new file:
egrep '(---|ID|post_title|author|post_date|layout|published)' file > newfile
a whole dir at once:
for i in `ls`; do egrep '(---|ID|post_title|author|post_date|layout|published)' $i > $i.new; done
In cases like yours it is better to use an actual YAML parser and a scripting language. Cut the metadata out of each file into standalone files (or strings), then use a YAML library to load the metadata. Once the metadata is loaded, you can modify it safely. Then use the serialize method from the very same library to create the new metadata and finally put the files back together.
Something like this:
<?php
list ($before, $metadata, $after) = preg_split("/\n----*\n/ms", file_get_contents($argv[1]));
$yaml = yaml_parse($metadata);
$yaml_copy = [];
foreach ($yaml as $k => $v) {
    // copy the data you wish to preserve to $yaml_copy
    if (...) {
        $yaml_copy[$k] = $yaml[$k];
    }
}
file_put_contents('new/'.$argv[1], $before."\n---\n".yaml_emit($yaml_copy)."\n---\n".$after);
(It is just an untested draft with no error checks.)
You could do it with gawk like this:
gawk 'BEGIN {RS="---"; FS="\000" } (FNR == 2) { print "---"; split($1, fm, "\n"); for (line in fm) { if ( fm[line] ~ /^(ID|post_title|author|post_date|layout|published):/) {print fm[line]} } print "---" } (FNR > 2) {print}' post1.html > post1_without_frontmatter_fields.html
You basically want to edit the file. That is what sed (stream editor) is for.
sed -e s/^ID:(*)$^post_title:()$^author:()$^postdate:()$^layout:()$^published:()$/ID:\1\npost_title:\2\nauthor:\3\npostdate:\4\nlayout:\5\npublished:\6/g
You also can use python-frontmatter:
import frontmatter
import io
import glob
# Where are the files to modify
path = "*.markdown"
keep = ['ID', 'post_title', 'author', 'post_date', 'layout', 'published']
# Loop through all files
for fname in glob.glob(path):
    with io.open(fname, 'r', encoding='utf8') as f:
        # Parse file's front matter
        post = frontmatter.load(f)
    # Iterate over a copy of the keys, since entries are deleted in the loop
    for k in list(post.metadata):
        if k not in keep:
            del post[k]
    # Save the modified file
    with io.open(fname, 'w', encoding='utf8') as newfile:
        newfile.write(frontmatter.dumps(post))
If you want to see more examples visit this page
Hope it helps.

What is wrong with Django csv upload code?

Here is my code. I would like to import csv and save it to database via model.
class DataInput(forms.Form):
    file = forms.FileField(label="Select CSV file")
    def save(self, mdl):
        records = csv.reader(self.cleaned_data["file"].read().decode('utf-8'), delimiter=',')
        if mdl=='auction':
            auction = Auction()
            for line in records:
                auction.auction_name = line[0]
                auction.auction_full_name = line[1]
                auction.auction_url = line[2]
                auction.is_group = line[3]
                auction.save()
Now, it throws the following error.
Exception Type: IndexError
Exception Value: list index out of range
csv file
RTS,Rapid Trans System,www.rts.com,TRUE
ZAA,Zelon Advanced Auton,www.zaa.info,FALSE
Really stuck. Please, help.
First of all, the full stacktrace should reveal exactly where the error is. Give Django the --traceback argument, e.g. ./manage.py --traceback runserver.
As Burhan Khalid mentioned 10 minutes ago you miss the 5th column in your csv file (index 4), so that is the root of the error.
Once you read the file with .read(), you are passing in the complete string - which is why each row is an individual character.
You need to pass the entire file object, without reading it first:
records = csv.reader(self.cleaned_data["file"], delimiter=',')
If you need to decode it first, then you had better run through the file yourself:
for line in self.cleaned_data['file'].read().decode('utf-8').split('\n'):
    if line.strip():
        try:
            name, full_name, url, group = line.split(',')
        except ValueError:
            print('Invalid line: {}'.format(line))
            continue
        i = Auction()
        i.auction_name = name
        i.auction_full_name = full_name
        i.auction_url = url
        i.is_group = group
        i.save()
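The difference between handing csv.reader a string and handing it something that yields lines can be seen in isolation. This is a small Python 3 sketch using the two rows from the question as an in-memory byte string (standing in for the uploaded file, which is an assumption of the example):

```python
import csv
import io

raw = (b"RTS,Rapid Trans System,www.rts.com,TRUE\n"
       b"ZAA,Zelon Advanced Auton,www.zaa.info,FALSE\n")
decoded = raw.decode("utf-8")

# Passing the string itself makes csv.reader iterate over single characters,
# so every "row" holds one character and line[1] raises IndexError.
first_from_string = next(csv.reader(decoded))

# Wrapping the decoded text in StringIO makes csv.reader iterate over lines.
rows = list(csv.reader(io.StringIO(decoded, newline="")))
```

Compared with splitting on commas by hand, going through csv.reader also copes with quoted fields that contain commas.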