I need to strip blank lines from only the first 6 lines of a text file. I've attempted to cobble together a solution using this StackOverflow question and this file but to no avail.
Here's the sed script I'm using (aliased as faprep='~/misc-scripts/fa-prep.sed'); the last command is the one that's failing:
#!/opt/local/bin/sed -f
# Title Treatments
s|<\(/\?\)h1[^>]*\?>|[\1b]|g # Replace <h1></h1> with [b][/b] for saga titles
s|<\(/\?\)h2[^>]*\?>|[\1i]|g # Replace <h2></h2> with [i][/i] for arc titles
s|</\?h3[^>]*\?>||g # Strip <h3 id=""></h3> out without removing chapter title text
# HTML tag strips & substitutions
s|</\?p>||g # Strip all <p></p> tags
s|<\(/\?\)em>|[\1i]|g # Change <em></em> to [i][/i]
s|<\(/\?\)strong>|[\1b]|g # Change <strong></strong> to [b][/b]
# Character code substitutions
s/&\#822[01];/\"/g # Replace “ and ” with straight double quote (")
s/&\#8217;/\'/g # Replace ’ with straight single quote (')
s/&\#8230;/.../g # Replace … with a 3-period ellipsis (...)
s/&\#821[12];/--/g # Replace — with a 2-hyphen em dash (--)
# Final prep; stripping out unnecessary cruft
/<body>/,/<\/body>/!d # Delete everything OUTSIDE the <body></body> tags
/<\/\?body>/d # Then, delete the body tags :3
# Pay attention to meeeeeeee!!!!
1,6{/./!d} # Remove blank lines from around titles??
Here's the command I'm running from terminal, which shows the last line failing to strip whitespace from the first 6 lines of the file (after all of the other modifications have been made, of course):
calyodelphi@dragonpad:~/pokemon-story/compilations $ ch='ch6'; faprep $ch-mmd.html > $ch-fa.txt; head -6 $ch-fa.txt
[b]Hoenn Saga (S1)[/b]
[i]Next City Arc (A2)[/i]
Chapter 6: A Peaceful City Stroll... Or Not
calyodelphi@dragonpad:~/pokemon-story/compilations $
The rest of the file is composed of a blank line after the third title and then paragraphs all separated by blank lines. I want to keep those blank lines, so that only the blank lines between the titles at the very top are stripped.
Just to clarify a few points: this file has Unix line endings, and the lines are supposed to not have spaces. Even viewing in a text editor that shows whitespace, each blank line contains only a newline character.
Since the discussion in the comments made it clear that you want to ignore empty lines in the first six lines of the body tag -- in other words, the first six times that part of the script is reached -- rather than the first six lines of the overall input data, you cannot use the global line counters. Since you're not using the hold buffer, we can use it to build our own counter, though.
So, replace
1,6 { /./! d }
with
x # swap in hold buffer
/.\{6\}/! { # if the counter in it hasn't reached 6
s/^/./ # increment by one (i.e., append a character)
x # swap the input back in
/./!d # if it is empty, discard it
x # otherwise swap back
}
x # and swap back one more time. This dance ensures that the
# line from the input is in the pattern space when we drop
# out at the bottom to the printing, regardless of which
# branches were entered.
Or, if this seems too complicated, use @glennjackman's suggestion and pipe the output of the first sed script through sed '1,6 { /./! d; }', since the second process will have its own line counters working on the preprocessed data. There's no fun in it, but it'll work.
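For readers who find the hold-buffer dance hard to follow, here is the same counting idea as a minimal Python sketch. It is not part of the sed solution; the function name strip_leading_blank_lines and the sample lines are invented for illustration, and it assumes the lines passed in are already the body lines that reach this step.

```python
def strip_leading_blank_lines(body_lines, limit=6):
    """Drop empty lines among the first `limit` lines that reach this step."""
    seen = 0  # plays the role of the counter kept in the hold buffer
    for line in body_lines:
        if seen < limit:
            seen += 1  # every line in the window counts, blank or not
            if not line.strip("\n"):
                continue  # empty line inside the window: discard it
        yield line


# Hypothetical sample: three titles separated by blank lines, then paragraphs.
sample = ["[b]A[/b]\n", "\n", "[i]B[/i]\n", "\n", "C\n", "\n",
          "para 1\n", "\n", "para 2\n"]
result = list(strip_leading_blank_lines(sample))
print("".join(result), end="")
```

As in the sed version, the counter advances on every line it sees until it reaches the limit, so blank lines after the sixth line are left alone.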
This answer courtesy of @Wintermute's comments on my question pointing me in the right direction! I was mistakenly thinking that sed was working on the modified stream when I put that delete statement in at the very end. When I tried a different address (lines 9,14) it worked perfectly, but was too hackish for me to settle on. But this confirmed I needed to think of the stream as still including lines that I thought were already gone.
So I moved the delete statement up above the statement that clears out the <body> tags and everything outside them, and used a regex and the addr1,+N trick here to produce this final result:
The script:
#!/opt/local/bin/sed -f
# Title Treatments
s|<\(/\?\)h1[^>]*\?>|[\1b]|g # Replace <h1></h1> with [b][/b] for saga titles
s|<\(/\?\)h2[^>]*\?>|[\1i]|g # Replace <h2></h2> with [i][/i] for arc titles
s|</\?h3[^>]*\?>||g # Strip <h3 id=""></h3> out without removing chapter title text
# HTML tag strips & substitutions
s|</\?p>||g # Strip all <p></p> tags
s|<\(/\?\)em>|[\1i]|g # Change <em></em> to [i][/i]
s|<\(/\?\)strong>|[\1b]|g # Change <strong></strong> to [b][/b]
# Character code substitutions
s/&\#822[01];/\"/g # Replace “ and ” with straight double quote (")
s/&\#8217;/\'/g # Replace ’ with straight single quote (')
s/&\#8230;/.../g # Replace … with a 3-period ellipsis (...)
s/&\#821[12];/--/g # Replace — with a 2-hyphen em dash (--)
# Final prep; stripping out unnecessary cruft
/<body>/,+6{/^$/d} # Remove blank lines from around titles
/<body>/,/<\/body>/!d # Delete everything OUTSIDE the <body></body> tags
/<\/\?body>/d # Then, delete the body tags :3
And the resulting output:
calyodelphi@dragonpad:~/pokemon-story/compilations $ ch='ch6'; faprep $ch-mmd.html > $ch-fa.txt; head -6 $ch-fa.txt
[b]Hoenn Saga (S1)[/b]
[i]Next City Arc (A2)[/i]
Chapter 6: A Peaceful City Stroll... Or Not
The next two weeks of training passed by too quickly and too slowly at the same time. [rest of paragraph omitted for space]
calyodelphi@dragonpad:~/pokemon-story/compilations $
Thanks @Wintermute! :D
I am new to both python and regex. I am trying to process a text file where I want to remove lines with only digits and space. This is the regular expression I am using.
^\s*[0-9]*\s*$
I am able to match the lines I want to remove (in the Notepad++ find dialog).
But when I try to do the same in Python, the lines are not matched. Is there a problem with the regex itself, or with my Python code?
Python code that I am using :
contacts = re.sub(r'^\s*[0-9]*\s*$','\n',contents)
Sample text
Age:30
Gender:Male
20
Name:संगीता शर्मा
HusbandsName:नरेश कुमार शर्मा
HouseNo:10/183
30 30
Gender:Female
21
Name:मोनू शर्मा
FathersName:कैलाश शर्मा
HouseNo:10/183
30
Gender:Male
Use re.sub in multiline mode:
contacts = re.sub(r'^\s*([0-9]+\s*)+$', '\n', contents, flags=re.M)
If you want the beginning ^ and ending $ anchors to kick in, then you want to be in multiline mode.
In addition, use the following to represent a line only containing clusters of numbers, possibly separated by whitespace:
^\s*([0-9]+\s*)+$
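As a small illustration of the multiline point, the sketch below applies a line-anchored pattern with re.MULTILINE. It is a variation rather than the exact pattern above: it assumes only spaces and tabs should count as filler, so it uses [ \t] instead of \s to keep a match from running across a newline. The function name blank_digit_lines and the shortened sample are made up for the example.

```python
import re

# A whole line consisting only of digit groups separated by spaces or tabs.
# With re.MULTILINE, ^ and $ anchor at every line boundary, not just at the
# start and end of the whole string.
DIGITS_ONLY = re.compile(r'^[ \t]*[0-9]+(?:[ \t]+[0-9]+)*[ \t]*$', re.MULTILINE)


def blank_digit_lines(text):
    """Empty out every line that holds nothing but digits and spaces/tabs."""
    return DIGITS_ONLY.sub('', text)


sample = "Age:30\nGender:Male\n20\nName:X\n30 30\nGender:Female\n"
cleaned = blank_digit_lines(sample)
print(cleaned)
```

Lines such as Age:30 or HouseNo:10/183 are left alone because the pattern must cover the entire line.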
You don't even need regex for that; a simple translate() call that removes the characters you're not interested in, followed by a check whether anything is left, should more than suffice:
import string
# the characters we'd like to ignore, as bytes because the files are opened in binary mode
clear_chars = (string.digits + string.whitespace).encode("ascii")
# open input.txt for reading, output.txt for writing
with open("input.txt", "rb") as f_in, open("output.txt", "wb") as f_out:
    for line in f_in:  # iterate over the input file line by line
        if line.translate(None, clear_chars):  # remove the chars, check if anything is left
            f_out.write(line)  # write the line to the output file
        # uncomment the following if you want added newlines when pattern matched
        # else:
        #     f_out.write(b"\n")  # write a new line on match
Which will produce for your sample input:
Age:30
Gender:Male
Name:संगीता शर्मा
HusbandsName:नरेश कुमार शर्मा
HouseNo:10/183
Gender:Female
Name:मोनू शर्मा
FathersName:कैलाश शर्मा
HouseNo:10/183
Gender:Male
If you want the matching lines replaced with a new line, just uncomment the else clause.
I have a text file like this, with more than 500 thousand lines:
('12', '9', '56', 'Trojan.Genome.Win32.230770',
'04df65889035a471f8346565600841af',
'9190953854e36a248819e995078a060e0da2e687',
'b6488037431c283da6b9878969fecced695ca746afb738be49103bd57f37d4e4',
'2015-10-16 00:00:00', 'Zillya', '16', 'TROJAN', 'trojan.png',
'2016-01-14 21:35:44'); #line1
('13', '3', '54', 'UnclassifiedMalware',
'069506a02c4562260c971c8244bef301',
'd08e90874401d6f77768dd3983d398d427e46716',
'78e155e6a92d08cb1b180edfd4cc4aceeaa0f388cac5b0f44ab0af97518391a2',
'2015-10-15 00:00:00', 'Comodo', '6', 'MALWARE', 'malware.png',
'2016-01-14 21:35:44'); #line2
I want to reduce the file to something like this:
Trojan.Genome.Win32.230770, 04df65889035a471f8346565600841af,
9190953854e36a248819e995078a060e0da2e687,
b6488037431c283da6b9878969fecced695ca746afb738be49103bd57f37d4e4
#line1
UnclassifiedMalware, 069506a02c4562260c971c8244bef301,
d08e90874401d6f77768dd3983d398d427e46716,
78e155e6a92d08cb1b180edfd4cc4aceeaa0f388cac5b0f44ab0af97518391a2
#line2
I have tried every regex I could think of, but none of them worked.
If this needs to be done more than once, a bare regex might fall short, simply because it is undocumented.
Just applying a regex to a file (maybe without even saving it) is not really reproducible or understandable for others.
I'm proposing a small Python script to make clear what you are in fact doing. Besides, you get full control over the exact format of the output, where it is written, and so on.
# get regex module
import re
filename = 'path/to/your/file.txt'
# open file (assumes each record sits on one physical line)
with open(filename) as file_handle:
    for line in file_handle:
        # remove leading and trailing whitespace
        line = line.strip()
        # if line is empty forget about it
        if not line:
            continue
        # split into data part and comment part
        data, comment = line.split(';', 1)
        comment = comment.strip()
        # transform into comma separated values
        # aka. remove whitespace, parentheses, quotes
        data = re.sub(r'\s|\(|\)|\'', '', data)
        # file is built up like this (TODO: make names more logical)
        nr1, nr2, nr3, \
            name, \
            hash1, hash2, hash3, \
            first_date, discoverer, nr4, \
            category, snapshot_file, last_date = data.split(',')
        # print, or possibly write
        print("{name}, {hash1}, {hash2}, {hash3} {comment}".format(**locals()))
Since this is a comma-delimited file, you can use a regular expression to search and replace, although it won't be nearly as efficient as just splitting up your string in your programming language of choice.
'([^']*)',\s*
will find a single quote, then capture all the text until it encounters the next single quote, followed by the comma and any trailing whitespace.
You would then repeat that a bunch of times, once for each comma-separated field.
It would look a little like this, and then you can choose which fields to substitute back into your text. In this case, you want only fields \4 through \7 (the name and the three hashes).
Could it be written so that \1 through \3 are not captured? Certainly, using a non-capturing (?:...) group. Then your substitutions would range from \1 through \4. But this makes it flexible enough that if you want to include or exclude any of the other fields, it's as simple as including or excluding them in the substitution field.
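A rough Python sketch of the same idea follows. Instead of repeating the capture group once per field, it swaps in re.findall to collect every quoted field and then picks the ones it needs by position: the fourth through seventh fields, which are the name and the three hashes shown in the desired output. The function name keep_name_and_hashes is hypothetical, and the sketch assumes no field contains a single quote.

```python
import re

# One quoted field: a single quote, anything but a quote, a closing quote.
FIELD = re.compile(r"'([^']*)'")


def keep_name_and_hashes(record):
    """Return the 4th to 7th quoted fields of a record, comma separated."""
    fields = FIELD.findall(record)
    return ", ".join(fields[3:7])


record = """('12', '9', '56', 'Trojan.Genome.Win32.230770',
'04df65889035a471f8346565600841af',
'9190953854e36a248819e995078a060e0da2e687',
'b6488037431c283da6b9878969fecced695ca746afb738be49103bd57f37d4e4',
'2015-10-16 00:00:00', 'Zillya', '16', 'TROJAN', 'trojan.png',
'2016-01-14 21:35:44'); #line1"""
summary = keep_name_and_hashes(record)
print(summary)
```

Changing the slice is the equivalent of including or excluding backreferences in the substitution.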
I have a text file that takes the form of:
first thing: content 1
second thing: content 2
third thing: content 3
fourth thing: content 4
This pattern repeats throughout the entire text file. However, sometimes one of the rows is completely gone like so:
first thing: content 1
second thing: content 2
fourth thing: content 4
How could I search the document for these missing rows and just add it back with a value of "NA" or some filler to produce a new text file like this:
# 'third thing' was not there, so re-adding it with NA as content
first thing: content 1
second thing: content 2
third thing: NA
fourth thing: content 4
Current code boilerplate:
with open('original.txt', 'r') as infile:
    with open('output.txt', 'w') as outfile:
        # Search file for pattern (maybe regex?)
        # If pattern does not exist, add the line
        pass
Thanks for any help you all can offer!
You must look for 1-3 lines (less than 4) followed by newline:
^\n([^\n]*\n){1,3}\n
Demo: https://regex101.com/r/rL3eA5/2
This isn't pretty, but it works. Here's a regex to detect where lines are missing:
(?:^|\n)(second thing:\s*[^\n]+\n)|(first thing:\s*[^\n]+\n(?!second thing:))|(second thing:\s*[^\n]+\n(?!third thing:))|(third thing:\s*[^\n]+\n(?!fourth thing:))|(third thing:\s*[^\n]+\n\n)
regex101 demo here
Notice the Single Line flag.
When you've got a match, check which group matched. If it's the first one, the first line is missing. If it's the second one, the second line is missing, and so on for the third and fourth.
Here's an example of how to replace when the 1st group matches.
Here's an example of how to replace when the 3rd group matches.
Here's an example of how to replace when the 4th group matches.
You'll probably have to do some tweaking, but it should get you on your way ;)
Regards.
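If a regex feels too fragile for this, a plain line-by-line pass is another option. The sketch below is only one possible approach, under these assumptions: every line looks like "key: value", the keys always appear in the same fixed order, and at least one row of each record survives. The names EXPECTED and fill_missing are invented for the example.

```python
EXPECTED = ["first thing", "second thing", "third thing", "fourth thing"]


def fill_missing(lines, expected=EXPECTED, filler="NA"):
    """Re-insert any expected row that is absent, using `filler` as its content."""
    out = []
    pos = 0  # index of the key we expect to see next
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current record: pad it out and keep the blank.
            if pos:
                out.extend(f"{k}: {filler}" for k in expected[pos:])
                pos = 0
            out.append("")
            continue
        key = line.split(":", 1)[0].strip()
        idx = expected.index(key)  # raises ValueError on an unknown key
        if idx < pos:
            # The previous record stopped early: pad it before starting a new one.
            out.extend(f"{k}: {filler}" for k in expected[pos:])
            pos = 0
        # Pad the keys that were skipped just before this row.
        out.extend(f"{k}: {filler}" for k in expected[pos:idx])
        out.append(line)
        pos = (idx + 1) % len(expected)
    if pos:  # the final record stopped early
        out.extend(f"{k}: {filler}" for k in expected[pos:])
    return out


sample = ["first thing: content 1\n", "second thing: content 2\n",
          "fourth thing: content 4\n"]
filled = fill_missing(sample)
print("\n".join(filled))
```

Writing "\n".join(filled) to output.txt would give the new file; a record that is missing entirely cannot be detected this way.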
A while back, I asked a question about merging lines which have a common first field. Here's the original: Command line to match lines with matching first field (sed, awk, etc.)
Sample input:
a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output:
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
The idea is that if the first field matches, then the lines are merged. The input is sorted. The actual content is more complex, but uses the pipe as the sole delimiter.
The methods provided in the prior question worked well on my 0.5GB file, processing in ~16 seconds. However, my new file is approx 100x larger, and I prefer a method that streams. In theory, this will be able to run in ~30 minutes. The prior method failed to complete after running 24 hours.
Running on MacOS (i.e., BSD-type unix).
Ideas? [Note, the prior answer to the prior question was NOT a one-liner.]
You can append your results to a file on the fly so that you don't need to build a 50GB array (which I assume you don't have the memory for!). This command will concatenate the join fields for each of the different indices in a string which is written to a file named after the respective index with some suffix.
EDIT: on the basis of OP's comment that content may have spaces, I would suggest using -F"|" instead of sub and also the following answer is designed to write to standard out
(New) Code:
# split the file on the pipe using -F
# if index "i" is still $1 (and i exists) concatenate the string
# if index "i" is not $1 or doesn't exist yet, print current a
# (will be a single blank line for first line)
# afterwards, this will print the concatenated data for the last index
# reset a for the new index and take the first data set
# set i to $1 each time
# END statement to print the single last string "a"
awk -F"|" '$1==i{a=a"|"$2}$1!=i{print a; a=$2}{i=$1}END{print a}'
This builds a string of "data" while in a given index and then prints it out when index changes and starts building the next string on the new index until that one ends... repeat...
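The same streaming idea can be sketched in Python with itertools.groupby, which only ever holds one run of equal keys in memory. This is an illustration, not a translation of the awk command: it assumes sorted input, exactly one value per line after the first pipe, and (following the desired output in the question) it drops keys that occur only once. The function name merge_by_first_field is made up.

```python
from itertools import groupby


def merge_by_first_field(lines, sep="|"):
    """Yield 'key|v1|v2...' for each run of consecutive lines sharing a key."""
    rows = (line.rstrip("\n").split(sep, 1) for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        values = [row[1] for row in group]
        if len(values) > 1:  # keys that appear once are not merged, so skip them
            yield sep.join([key] + values)


sample = ["a|lorem\n", "b|ipsum\n", "b|dolor\n", "c|sit\n",
          "d|amet\n", "d|consectetur\n", "e|adipisicing\n", "e|elit\n"]
merged = list(merge_by_first_field(sample))
print("\n".join(merged))
```

Passing an open file object (or sys.stdin) instead of the sample list keeps the whole run streaming.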
sed '# label anchor for a jump
:loop
# load a new line in working buffer (so always 2 lines loaded after)
N
# verify if the 2 lines have same starting pattern and join if the case
/^\(\([^|]\)*\(|.*\)\)\n\2/ s//\1/
# if end of file quit (and print result)
$ b
# if lines are joined, cycle and re make with next line (jump to :loop)
t loop
# (No joined lines here)
# if more than 2 element on first line, print first line
/.*|.*|.*\n/ P
# remove first line (using last search pattern)
s///
# (if any modification was made) cycle (jump to :loop)
t loop
# exit and print working buffer
' YourFile
POSIX version (maybe --posix on Mac).
Self-commented.
Assumes sorted input, no empty lines, and no pipe in the data (not even an escaped one).
Use unbuffered mode (-u) for stream processing if available.
I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this:
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
As you can probably see the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)
The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).
Multiline in sed isn't necessarily tricky per se; it's just that it uses commands most people aren't familiar with, and those commands have certain side effects, like delimiting the current line from the next line with a '\n' when you use 'N' to append the next line to the pattern space.
Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
Input
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
Output
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
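Since the question says any tool will do, here is a hedged Python sketch of the same rule: a line that starts with a comma is glued onto the previous line, after removing an optional closing quote and trailing spaces from that previous line. The function name join_title_lines is invented for the example, and unlike the sed one-liner it would also glue several consecutive comma lines onto the same sentence.

```python
def join_title_lines(lines):
    """Join each line starting with ',' onto the line before it."""
    prev = None
    for raw in lines:
        line = raw.rstrip("\n")
        if prev is not None and line.startswith(","):
            # Mirror the sed pattern: drop trailing spaces, then one optional quote.
            head = prev.rstrip(" ")
            if head.endswith('"'):
                head = head[:-1]
            prev = head + line
        else:
            if prev is not None:
                yield prev
            prev = line
    if prev is not None:
        yield prev


sample = [
    "don't touch this line\n",
    'This is a long abstract describing something."\n',
    ",Title1\n",
    "This is another sentence that is running on one line.\n",
    ",Title2\n",
    "also, don't touch this line\n",
]
joined = list(join_title_lines(sample))
print("\n".join(joined))
```

Because it is a generator, it can be fed an open file object and written out line by line without loading the 400 MB file into memory.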
Yours works with a couple of small changes:
sed -n '1h;1!H;${;g;s/\."\?\n,/.,/g;p;}' inputfile
The ? needs to be escaped and . doesn't match newlines.
Here's another way to do it which doesn't require using the hold space:
sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile
Here is a commented version:
sed -n '
# for the last input line: print and quit
${
  p
  q
}
# otherwise, append the next line
N
# if the appended line starts with a comma
/\n,/{
  # delete an optional double quote and the newline, then print the result
  s/"\?\n//p
  # branch to the end to read the next line
  b
}
# the appended line does not start with a comma, so print the first line
P
# delete the first line of the pair (it has just been printed) and loop to the top
D
' inputfile