How to take document per line when combining multiple documents? - python-2.7

Hello everyone,
I have 3000 documents with me. I want to combine the content of those 3000 documents in one single document. I used
cat *.html > Combined_Text.txt
command to do the process. But, I would like to have the data of one document per line in the Combined_Text.txt which means I should just be having 3000 lines of content (one document per line). How to do it? Please help!

The following command will remove new lines from every html and then append the files to each other in Combined_Text.txt.
for f in *.html; do tr -d '\n' < "$f" >> Combined_Text.txt; echo >> Combined_Text.txt; done
That second echo seems a bit inelegant, and I'm sure there is a better way to put each file on its own line, but it does the job.
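For anyone who would rather do this from Python (the question is tagged python-2.7), here is a minimal sketch of the same idea. The helper names flatten and combine are made up for illustration, UTF-8 input is an assumption, and unlike tr -d '\n' it replaces each run of whitespace with a single space instead of deleting the newlines outright.

```python
import glob
import io


def flatten(text):
    """Collapse every run of whitespace, newlines included, into one space."""
    return u" ".join(text.split())


def combine(pattern, out_path):
    """Write each file matching `pattern` as exactly one line of `out_path`."""
    with io.open(out_path, "w", encoding="utf-8") as out:
        # sorted() makes the line order predictable across runs
        for path in sorted(glob.glob(pattern)):
            # errors="replace" is an assumption: keep going on badly encoded bytes
            with io.open(path, encoding="utf-8", errors="replace") as src:
                out.write(flatten(src.read()) + u"\n")
```

Calling combine("*.html", "Combined_Text.txt") from the directory holding the documents should then give one line per input file.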

Related

Should I use AWK or SED to remove commas between quotation marks from a CSV file? (BASH)

I have a bunch of daily printer logs in CSV format, and I'm writing a script to keep track of how much paper is being used and save the info to a database, but I've come across a small problem.
Essentially, some of the document names in the logs include commas in them (which are all enclosed within double quotes), and since it's in comma separated format, my code is messing up and pushing everything one column to the right for certain records.
From what I've been reading, it seems like the best way to go about fixing this would be using awk or sed, but I'm unsure which is the best option for my situation, and how exactly I'm supposed to implement it.
Here's a sample of my input data:
2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"MicrosoftWordDocument1",COMSYRWS14,A4,PCL6,,,NOT DUPLEX,GRAYSCALE,35kb,
And here's what I have so far:
#!/bin/bash
#Get today's file name
yearprefix="20"
currentdate=$(date +"%m-%d-%y");
year=${currentdate:6};
year="$yearprefix$year"
month=${currentdate:0:2};
day=${currentdate:3:2};
filename="papercut-print-log-$year-$month-$day.csv"
echo "The filename is: $filename"
# Remove commas in between quotes.
#Loop through CSV file
OLDIFS=$IFS
IFS=,
[ ! -f "$filename" ] && { echo "$filename: file not found"; exit 99; }
while read time user pages copies printer document client size pcl blank1 blank2 duplex greyscale filesize blank3
do
#Remove headers
if [ "$user" != "" ] && [ "$user" != "User" ]
then
#Remove any file name with an apostrophe
if [[ "$document" =~ "'" ]];
then
document="REDACTED"; # Lazy. Need to figure out a proper solution later.
fi
echo "$time"
#Save results to database
mysql -u username -p -h localhost -e "USE printerusage; INSERT INTO printerlogs (time, username, pages, copies, printer, document, client, size, pcl, duplex, greyscale, filesize) VALUES ('$time', '$user', '$pages', '$copies', '$printer', '$document', '$client', '$size', '$pcl', '$duplex', '$greyscale', '$filesize');"
fi
done < $filename
IFS=$OLDIFS
Which option is more suitable for this task? Will I have to create a second temporary file to get this done?
Thanks in advance!
As I wrote in another answer:
Rather than interfere with what is evidently source data, i.e. the stuff inside the quotes, you might consider replacing the field-separator commas (with say |) instead:
s/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g
And then splitting on | (assuming none of your data has | in it).
Is it possible to write a regular expression that matches a particular pattern and then does a replace with a part of the pattern
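If a scripting language is an option, the same "leave the quoted data alone" idea can be sketched with Python's csv module, which already understands double-quoted fields. The function names and the sample line below are invented for illustration (the sample is not taken from the real logs), and the parser strips the surrounding quotes from each field.

```python
import csv


def split_fields(line):
    """Split one CSV line into fields; commas inside double quotes are kept."""
    return next(csv.reader([line]))


def to_pipe_separated(line):
    """Re-join the parsed fields with '|' (assumes no field contains '|')."""
    return "|".join(split_fields(line))


# Hypothetical log line, invented for this example: the document name holds a comma.
sample = '2015-03-23 08:50:22,John.Doe,1,1,Ineo 4000p,"Report, final draft",COMSYRWS14,A4'
print(split_fields(sample)[5])
print(to_pipe_separated(sample))
```

This handles one line at a time, so it assumes no quoted field spans several lines.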
There is probably an easier way using sed alone, but this should work. Loop over the file; for each line, match the quoted strings with grep -o, then replace the commas inside them with spaces (or whatever you would like to use to get rid of the commas; if you want to preserve the data, you can use a non-printable character and convert it back to commas afterward).
i=1 && IFS=$(echo -en "\n\b") && for a in $(< test.txt); do
var="${a}"
for b in $(sed -n ${i}p test.txt | grep -o '"[^"]*"'); do
repl="$(sed "s/,/ /g" <<< "${b}")"
var="$(sed "s#${b}#${repl}#" <<< "${var}")"
done
let i+=1
echo "${var}"
done

Remove the data before the second repeated specified character in linux

I have a text file which has some below data:
AB-NJCFNJNVNE-802ac94f09314ee
AB-KJNCFVCNNJNWEJJ-e89ae688336716bb
AB-POJKKVCMMMMMJHHGG-9ae6b707a18eb1d03b83c3
AB-QWERTU-55c3375fb1ee8bcd8c491e24b2
I need to remove the data before the second hyphen (-) and produce another text file with the below output:
802ac94f09314ee
e89ae688336716bb
9ae6b707a18eb1d03b83c3
55c3375fb1ee8bcd8c491e24b2
I am pretty new to linux and trying sed command with unsuccessful attempts for the last couple of hours. How can I get the desired output with sed or any other useful command like awk?
You can use a simple cut call:
$ cat myfile.txt | cut -d"-" -f3- > myoutput.txt
Edit:
Some explanation, as requested in the comments:
cut breaks up a string of text to fields according to a given delimiter.
-d defines the delimiter, - in this case.
-f defines which fields to output. In this case, we want to eliminate everything before the second hyphen, or, in other words, return the third field and onwards (3-).
The rest of the command is just piping the output. cating the file into cut, and then saving the result to an output file.
Or, using sed:
cat myfile.txt | sed -e 's/^[^-]*-[^-]*-//'
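The same field logic can be sketched in Python with str.split and a split limit. The function name is made up for illustration, and returning lines with fewer than two hyphens unchanged is a choice made for this sketch, not something the question specifies.

```python
def after_second_hyphen(line):
    """Return everything after the second '-' in `line`."""
    # maxsplit=2 yields at most three pieces; the last one keeps any later hyphens
    parts = line.split("-", 2)
    return parts[2] if len(parts) == 3 else line


for row in ["AB-NJCFNJNVNE-802ac94f09314ee", "AB-QWERTU-55c3375fb1ee8bcd8c491e24b2"]:
    print(after_second_hyphen(row))
```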

Combine multiple lines of text documents into one

I have thousands of text documents with varying numbers of lines of text. I want to combine all the lines of each document into one single line, one document at a time. For example:
abcd
efgh
ijkl
should become
abcd efgh ijkl
I tried using sed commands, but they do not quite achieve what I want because the number of lines varies from document to document. Please suggest what I can do. I am working with Python on Ubuntu. One-line commands would be a great help. Thanks in advance!
If you place your script in the same directory as your files, the following code should work.
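import os

# Folder that holds the .txt files; the raw string keeps the backslashes literal.
folder = r'C:\Users\B\Desktop\newdocs'
count = 0
for name in os.listdir(folder):
    if not name.endswith(".txt"):
        continue
    with open(os.path.join(folder, name), 'r') as f:
        # split() with no arguments splits on any whitespace, newlines included
        single_space = ' '.join(f.read().split())
    # The numbered output files are written to the current working directory
    with open("new_doc{}.txt".format(count), "w") as out:
        out.write(single_space)
    count += 1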
@inspectorG4dget's code is more compact than mine, and thus I think it's better. I tried to make mine as user-friendly as possible. Hope it helps!
Using python wouldn't be necessary. This does the trick:
% echo `cat input.txt` > output.txt
To apply to a bunch of files, you can use a loop. E.g. if you're using bash:
for inputfile in /path/to/directory/with/files/* ; do
echo `cat "${inputfile}"` > "${inputfile}2"
done
Assuming all your files are in one directory, have a .txt extension, and you have access to a Linux box with bash, you can use tr like this:
for i in *.txt ; do tr '\n' ' ' < "$i" > "$i.one"; done
For every "file.txt", this will produce a "file.txt.one" with all the text on one line.
If you want a solution that operates on the files directly you can use gnu sed (NOTE THIS WILL CLOBBER YOUR STARTING FILES - MAKE A BACKUP OF THE DIRECTORY BEFORE TRYING THIS):
sed -i -n 'H;${x;s|\n| |g;p};' *.txt
If your files aren't in the same directory, you can use find with -exec:
find . -name "*.txt" -exec YOUR_COMMAND \{\} \;
If this doesn't work, maybe a few more details about what you're trying to do would help.

How to use awk and grep on 300GB .txt file?

I have a huge .txt file, 300GB to be more precise, and I would like to put all the distinct strings from the first column, that match my pattern into a different .txt file.
awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt
This is what I've tried, and as far as I can see it works fine but the problem is that after some time I get the following error:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="file_name" FNR=117897124 NR=117897124
Any suggestions?
The error message tells you:
Line 117897124 has too many fields (>32767).
You'd better check it out:
sed -n '117897124{p;q}' file_name
Use cut to extract 1st column:
cut -d ' ' -f 1 < file_name | ...
Note: You may change ' ' to whatever the field separator is. The default is $'\t'.
The 'number of fields' is the number of 'columns' in the input file, so if one of the lines is really long, then that could potentially cause this error.
I suspect that the awk and grep steps could be combined into one:
sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt
That might evade the awk problem entirely (that sed command substitutes any leading text which matches the pattern, in place of the entire line, and if it matches, prints out the line).
Seems to me that your awk implementation has an upper limit for the number of records it can read in one go of 117,897,124. The limits can vary according to your implementation, and your OS.
Maybe a sane way to approach this problem is to program a custom script that uses split to split the large file into smaller ones, with no more than 100,000,000 records each.
If you don't want to split the file, you could look up the limits that apply to your awk implementation. Maybe you can set the Number of Records value to unlimited, although I believe that is not a good idea, as you might end up using a lot of resources...
If you have enough free disk space (Vim creates a temporary .swp file), I suggest using Vim. Vim's regex syntax differs slightly from standard regex, but you can convert from standard regex to Vim regex with this tool: http://thewebminer.com/regex-to-vim
The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator and you'll only have 1 field per line and so avoid that problem, then merge the rest of the commands into one:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name
If that's a problem then try:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u
There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.
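As a rough illustration of the one-pass approach, here is a Python sketch that streams the file line by line and never splits a line into more than two pieces, so a very wide line is not a problem. The function names are invented for this example, the '/ns/' test is a plain substring check on the first column, and the set of seen values has to fit in memory (the same trade-off as the seen[] array in awk).

```python
def distinct_first_columns(lines, needle="/ns/"):
    """Yield each distinct first column containing `needle`, in first-seen order."""
    seen = set()
    for line in lines:
        parts = line.split(None, 1)  # cut off only the first whitespace-separated column
        if not parts:
            continue  # blank line
        first = parts[0]
        if needle in first and first not in seen:
            seen.add(first)
            yield first


def write_distinct(in_path, out_path):
    """Stream `in_path` and write the distinct matching first columns to `out_path`."""
    with open(in_path) as src, open(out_path, "w") as out:
        for value in distinct_first_columns(src):
            out.write(value + "\n")
```

With the names from the question, the call would be write_distinct("file_name", "test1.txt").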

making graphs with a shell script

I need to make a graph of numeric values over a time period; the values represent the number of online users on a web page.
The script will be executed by cron every 30 minutes, and the needed HTML file will be downloaded with wget. But there are some as yet unanswered questions and problems:
-I need to get just the numeric value from the HTML code (but grep returns the whole line). How can I get only the numeric value? I can get the line with grep; it looks like this:
Users online: 24 917 </div>
How can I get just the 24917?
-What would be easier: generating an .svg file with the graph, or saving the values in a .csv file (and generating the graph with OOo or something similar)? Maybe there are some other good ideas?
Thanks in advance,
-skazhy
You can do the following to get your number:
Set the regular expression:
digits='[[:digit:]]+ *[[:digit:]]*'
followed by these two lines:
num=$(echo $line | grep -Eo "$digits")
num=${num// }
or these:
# Bash >= 3.2 (syntax may be different for 3.0/3.1)
[[ $line =~ $digits ]]
num=${BASH_REMATCH[@]// }
to extract the number from the variable $line containing the line in your question.
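For comparison, the same extraction can be sketched in Python with the re module. The function name is made up for illustration, and the pattern is slightly more general than the one above in that it accepts any number of space-separated digit groups.

```python
import re

# One or more digit groups separated by single spaces, e.g. "24 917"
DIGITS = re.compile(r"\d+(?: \d+)*")


def extract_number(line):
    """Return the first number in `line` as an int, ignoring spaces between digit groups."""
    match = DIGITS.search(line)
    if match is None:
        return None  # no digits on this line
    return int(match.group(0).replace(" ", ""))


print(extract_number("Users online: 24 917 </div>"))
```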
Gnuplot should be readily available. A few examples of its output can be found here.
These are from here.
Just one process (grep):
array=( $(grep whatever filename ) ) && echo "${array[2]}${array[3]}"