I have a log with millions of lines that look like this:
1482364800 bunch of stuff 172.169.49.138 252377 + many other things
1482364808 bunch of stuff 128.169.49.111 131177 + many other things
1482364810 bunch of stuff 2001:db8:0:0:0:0:2:1 124322 + many other things
1482364900 bunch of stuff 128.169.49.112 849231 + many other things
1482364940 bunch of stuff 128.169.49.218 623423 + many other things
It's so big that I can't really read it into memory for Python to parse, so I want to zgrep out only the items I need into another, smaller file, but I'm not very good with grep. In Python I would normally gzip.open(log.gz) and then pull out data[0], data[4], data[5] into a new file, so my new file only has the epoch, the IP, and the number (the IP can be IPv6 or IPv4).
expected result of the new file:
1482364800 172.169.49.138 252377
1482364808 128.169.49.111 131177
1482364810 2001:db8:0:0:0:0:2:1 124322
1482364900 128.169.49.112 849231
1482364940 128.169.49.218 623423
How do I do this with zgrep?
Thanks
To select columns you have to use the cut command; zgrep/grep select lines.
So you can use the cut command like this:
cut -d' ' -f1,2,4
In this example I get columns 1, 2 and 4, with a space ' ' as the column delimiter.
You should know that the -f option is used to specify the column numbers and -d the delimiter.
I hope that I have answered your question.
I'm on OS X, and maybe that is the issue, but I couldn't get zgrep to work for filtering out columns, and zcat kept adding a .Z to the end of the .gz. Here's what I ended up doing:
awk '{print $1,$3,$4}' <(gzip -dc /path/to/source/Largefile.log.gz) | gzip > /path/to/output/Smallfile.log.gz
This let me filter out the 3 columns I needed from the Largefile to a Smallfile while keeping both the source and destination in compressed format.
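For reference, the pure-Python approach described in the question also works without loading the whole file into memory, because gzip.open can be iterated one line at a time. A minimal sketch, assuming the epoch, IP and number really are whitespace-separated fields 0, 4 and 5 as in the sample lines (the file names here are placeholders):
import gzip

with gzip.open('log.gz', 'rt') as src, open('small.log', 'w') as dst:
    for line in src:                  # streams line by line; nothing is held in memory
        data = line.split()
        if len(data) > 5:             # skip short or malformed lines
            dst.write(' '.join((data[0], data[4], data[5])) + '\n')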
I have a text file with data formatted as below. I figured out how to format the second part of the file for upload into a db table, but I'm hitting a wall trying to get just the first 7 lines to format the same way.
If it wasn't obvious, I'm trying to get it pipe-delimited with the exact same number of columns, so I can easily upload it to the db.
Year: 2019 Period: 03
Office: NY
Dept: Sales
Acct: 111222333
SubAcct: 11122234-8
blahblahblahblahblahblahblah
Status: Pending
1000
AAAAAAAAAA
100,000.00
2000
BBBBBBBBBB
200,000.00
3000
CCCCCCCCCC
300,000.00
4000
DDDDDDDDDD
400,000.00
Some kind folks answered my question about the bottom part; using the following regex I can format it to look like this:
(.*)\r?\n(.*)\r?\n(.*)(?:\r?\n|$)
substitute with |||||||$1|$2|$3\n
|||||||1000|AAAAAAAAAA|100,000.00
|||||||2000|BBBBBBBBBB|200,000.00
|||||||3000|CCCCCCCCCC|300,000.00
|||||||4000|DDDDDDDDDD|400,000.00
I just need help formatting the top part to look like this, so the entire file has the exact same number of columns:
Year: 2019|Period: 03|Office: NY|Dept: Sales|Acct: 111222333|SubAcct: 11122234-8|blahblahblahblahblahblahblah|Status: Pending|||
I'm ok with having multiple passes on the file to get the desired end result.
I've helped you on your previous question, so I will focus now on the first part of your file.
You can use this regex:
\n|\b(?=Period)
Working demo
And use | as the replacement string
If you don't want to keep the space before Period, then you can use:
\n|\s(?=Period)
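If you end up scripting this instead of running it in an editor, here is a minimal Python sketch of the same substitution applied to the seven header lines from the question; the trailing '|||' padding shown in the expected output is appended separately (this is an illustration, not part of the answer above):
import re

header = """Year: 2019 Period: 03
Office: NY
Dept: Sales
Acct: 111222333
SubAcct: 11122234-8
blahblahblahblahblahblahblah
Status: Pending"""

# Replace every newline, and the space before "Period", with a pipe.
one_line = re.sub(r'\n|\s(?=Period)', '|', header)

# Pad with trailing pipes so the header row ends up with the same
# number of columns as the detail rows.
print(one_line + '|||')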
I am just learning bash scripting and commands, and I need some help with this assignment.
I have a txt file that contains the following text, and I need to:
Extract the guest name (1.1.1, ...)
Sum each guest's results and output the guest name with the total.
I used sed with a simple regex to extract the name and the digits, but I have no idea how to sum the numbers, because a guest can have multiple record lines, as you can see in the txt file. Note: I can't use awk for processing.
Here is my code:
cat file.txt | sed -E 's/.*([0-9]{1}.[0-9]{1}.[0-9]{1}).*([0-9]{1})/\1 \2/'
And result is:
1.1.1 4
2.2.2 2
1.1.1 1
3.3.3 1
2.2.2 1
Here is the .txt file:
Guest 1.1.1 have "4
Guest 2.2.2 have "2
Guest 1.1.1 have "1
Guest 3.3.3 have "1
Guest 2.2.2 have "1
and the output should be:
1.1.1 = 5
2.2.2 = 3
3.3.3 = 1
Thank you in advance
I know your teacher won't let you use awk but, since beyond this one exercise you're trying to learn how to write shell scripts, FYI here's how you'd really do this job in a shell script:
$ awk -F'[ "]' -v OFS=' = ' '{sum[$2]+=$NF} END{for (id in sum) print id, sum[id]}' file
3.3.3 = 1
2.2.2 = 3
1.1.1 = 5
and here's a bash builtins equivalent which may or may not be what you've covered in class and so may or may not be what your teacher is expecting:
$ cat tst.sh
#!/usr/bin/env bash
declare -A sum
while read -r _ id _ cnt; do
    (( sum[$id] += "${cnt#\"}" ))
done < "$1"
for id in "${!sum[@]}"; do
    printf '%s = %d\n' "$id" "${sum[$id]}"
done
$ ./tst.sh file
1.1.1 = 5
2.2.2 = 3
3.3.3 = 1
See https://www.artificialworlds.net/blog/2012/10/17/bash-associative-array-examples/ for how I'm using the associative array. It'll be orders of magnitude slower than the awk script and I'm not 100% sure it's bullet-proof (since shell isn't designed to process text there are a LOT of caveats and pitfalls) but it'll work for the input you provided.
OK -- since this is a class assignment, I will tell you how I did it, and let you write the code.
First, I sorted the file. Then, I read the file one line at a time. If the name changed, I printed out the previous name and count, and set the count to be the value on that line. If the name did not change, I added the value to the count.
My second solution used an associative array to hold the counts, using the guest name as the index. Then you just add the new value to the count in the array element indexed by the guest name.
At the end, loop through the array, print out the indexes and values.
It's a lot shorter.
I have not done any programming in about 12 years and have been asked by one of my colleagues to help with what is apparently a basic Python 2.7 script. My question is very similar to what this person asked (though it has not been answered):
Python - Batch combine Multiple large CSV, filter data, skip header, appending vertically into a single CSV
I need to prompt the user for the folder path, read in each file from that folder (there are hundreds of CSV files), do the processing, and then write the processed output of every file into a single CSV file, with each file's output preceded by a blank line and the name of the file it was read from.
It would result in something like this:
CHEM_0_5
etc etc
etc etc
etc etc
LAW_4_1
etc etc
etc etc
LAW_7_3
etc etc
etc etc
Currently the script has to be edited with the name of the file it has to read, saved, and then run. Then the contents of the output file have to be manually copied into a new csv file. It is very tedious and time consuming.
This is what I currently have. Please note I have removed some of the processing from the example.
import time
import datetime

x = 0
stamp = 0
compare = 1
values = []

## INSERT NAME OF FILE YOU WANT TO CLEAN
g = open('CHEM_0_5.csv', 'r')

for line in g:
    lis = [line.split() for line in g]
    lis.pop(0)
    lis.pop(0)

timestamps = []
results = []
x = 0

# cl and ts are built by the processing that has been removed from this example
for i in cl:
    ## INSERT WHAT YOU WANT TO SAVE THE FILE AS
    fd = open('new.csv', 'a')
    fd.write(str(ts[x]) + "," + str(i) + "\n")
    fd.close()
    x = x + 1

g.close()
I have been trying to re-learn Python in the process of searching for answers, but given that I don't really know what I'm doing, I feel that could be something to do after I've completed the task for my colleague.
Thank you for taking the time to read my submission!
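One possible starting point is sketched below: prompt for the folder, loop over the CSV files in it, and append each file's processed rows to a single output file, preceded by a blank line and the source file name. This is only a hedged sketch of the structure described above; the names process_rows and combined.csv are illustrative, and the actual per-file cleaning is left as a placeholder:
import glob
import os

def process_rows(lines):
    # Placeholder for the per-file cleaning removed from the example above.
    return lines

folder = raw_input('Folder containing the CSV files: ')   # Python 2.7; use input() on Python 3
csv_files = sorted(glob.glob(os.path.join(folder, '*.csv')))

with open('combined.csv', 'w') as out:                    # written to the current directory
    for i, csv_path in enumerate(csv_files):
        if i:
            out.write('\n')                               # blank line between file blocks
        # file name without its extension as the block header, e.g. CHEM_0_5
        out.write(os.path.splitext(os.path.basename(csv_path))[0] + '\n')
        with open(csv_path) as src:
            for row in process_rows(src.readlines()):
                out.write(row)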
I have an interesting issue I am trying to solve, and I have taken a good stab at it but need a little help. I have a squishy file that contains some Lua code. I am trying to read this file and build a file path out of it. However, depending on where the file was generated, a line may contain all of the information or it may be missing some. Here is an example of the squishy file I need to parse:
Module "foo1"
Module "foo2"
Module "common.command" "common/command.lua"
Module "common.common" "common/common.lua"
Module "common.diagnostics" "common/diagnostics.lua"
Here is the code I have written to read the file and search for the lines containing Module. You will see that there are three different sections, or columns, in this file. If you look at line 3 you will have "Module" for column 1, "common.command" for column 2 and "common/command.lua" for column 3.
Taking column 3 as an example: if there is data in the 3rd column then I just need to strip the quotes off and grab the data in column 3. In this case it would be common/command.lua. If there is no data in column 3 then I need to get the data out of column 2, replace the period (.) with os.path.sep, and then tack a .lua extension onto the file. Using line 4 as an example, I would need to pull out common.common and make it common/common.lua.
squishyContent = []
if os.path.isfile(root + os.path.sep + "squishy"):
    self.Log("Parsing Squishy")
    with open(root + os.path.sep + "squishy") as squishyFile:
        lines = squishyFile.readlines()
    squishyFile.close()
    for line in lines:
        if line.startswith("Module "):
            path = line.replace('Module "', '').replace('"', '').replace("\n", '').replace(".", "/") + ".lua"
Just need some examples/help in getting through this.
This might sound silly, but the easiest approach is to convert everything you told us about your task to code.
for line in lines:
    # if the line doesn't start with "Module ", ignore it
    if not line.startswith('Module '):
        continue
    # strip the trailing newline so the quote-trimming below works
    line = line.strip()
    # As you said, there are 3 columns. They're separated by a blank, so split the text into columns.
    line = line.split(' ')
    # if there are more than 2 columns, use the 3rd column's text (and remove the quotes "")
    if len(line) > 2:
        line = line[2][1:-1]
    # otherwise, ...
    else:
        line = line[1]                          # use the 2nd column's text
        line = line[1:-1]                       # remove the quotes ""
        line = line.replace('.', os.path.sep)   # replace . with the path separator
        line += '.lua'                          # and add .lua
    print line  # prove it works.
With a simple problem like this, it's easy to make the program do exactly what you yourself would do if you did the task manually.
Below is the list in a file (unsorted-file) that needs to be sorted in Linux, preferably with a single-line Linux command.
03123456789abcd
02987654321pqrs
02123456789mnop
03987654321stuv
04123456789ghjk
01000000000
99000000000
97000000000
98000000000
Required sorted file output:
01000000000
02123456789mnop
03123456789abcd
04123456789ghjk
02987654321pqrs
03987654321stuv
97000000000
98000000000
99000000000
Requirements:
If the first two characters are 01, it is the header.
If the first two characters are greater than 90, they are trailers.
Sort order: positions 3-11 and then positions 1-2.
I tried a simple sort command like
$ sort unsorted-file > sorted-file
Requirement 3 failed. Then I tried
$ sort -k 1.3,1.11 -k 1.2 unsorted-file > sorted-file
The trailer records made it to the top of the file because of the zeros from position 3 onward.
The other option I know of is to strip out the header and trailers, sort the body, and then merge the header and trailer files back. Is there a way to do this in one (complex) Linux command?
Thanks for your time.
-R-
( grep '^01' unsorted-file                                           # header first
  grep -E -v '^(01|9)' unsorted-file | sort -k 1.3,1.11 -k 1.1,1.2   # body: positions 3-11, then 1-2
  grep '^9' unsorted-file | sort ) > sorted-file                     # trailers last, in order
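And in case the same ordering is ever needed from a script rather than the shell, the three-bucket logic above translates directly into a Python sort key. This is just an illustration of the idea, not part of the shell answer:
def key(line):
    # 0 = header (01...), 2 = trailers (first two chars > 90), 1 = everything else
    bucket = 0 if line.startswith('01') else 2 if line[:2] > '90' else 1
    return (bucket, line[2:11], line[:2])     # then positions 3-11, then positions 1-2

with open('unsorted-file') as src, open('sorted-file', 'w') as dst:
    dst.writelines(sorted(src, key=key))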