Python Aggregate column C based on A & B - python-2.7

I have some log files that I am trying to analyze. Using a little regex I have gotten the following structure:
Month/Year, URL, Count
Sep 2016,/,100513
Sep 2016,/,68221
Oct 2016,/,536365
Oct 2016,/,362350
Oct 2016,/,89203
Nov 2016,/,526455
Nov 2016,/,351360
Nov 2016,/,88279
Dec 2016,/,538702
Dec 2016,/,156063
Dec 2016,/,89094
Jan 2017,/,535684
Jan 2017,/,105867
Jan 2017,/,87492
Feb 2017,/,483897
Feb 2017,/,80502
Feb 2017,/,47554
Mar 2017,/,434830
Mar 2017,/,72355
Mar 2017,/,43036
It's several hundred thousand lines long, so I can't use Excel or Google Sheets. I am trying to aggregate the Count by both Month and URL in Python. What is a good method to do this?

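You can do this using pandas. Your example is a CSV file, so the following would work. If the header row really has a space after each comma, as shown, pass skipinitialspace=True so the column names come out as 'URL' and 'Count' rather than ' URL' and ' Count'.
import pandas as pd
# skipinitialspace drops the blank that follows each comma in the header row
df = pd.read_csv('x.csv', skipinitialspace=True)
print(df.groupby(['Month/Year', 'URL']).sum())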

If you need a solution without external dependencies (maybe a strict corporate environment):
months = {}
urls = {}
with open('./parsed-data.txt', 'r') as f:
    next(f)  # skip the "Month/Year, URL, Count" header row, if the file has one
    for line in f:
        # [Month, URL, Count]
        data = line.strip().split(',')
        months[data[0]] = months.get(data[0], 0) + int(data[2])
        urls[data[1]] = urls.get(data[1], 0) + int(data[2])
# months and urls now hold separate totals per month and per URL
# Do whatever with months and urls here
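The question asks for totals by both Month and URL together, which the two separate dictionaries above do not give. A minimal standard-library sketch, assuming the three-column layout shown in the question, keys a single dictionary on the (month, url) pair. The function name aggregate_counts and the in-memory sample are illustrative only.

```python
import csv
import io
from collections import defaultdict


def aggregate_counts(rows):
    """Sum the Count column for every (month, url) pair."""
    totals = defaultdict(int)
    for month, url, count in rows:
        totals[(month.strip(), url.strip())] += int(count)
    return dict(totals)


# Small in-memory stand-in for the real log extract (first rows of the question).
sample = io.StringIO(
    "Month/Year, URL, Count\n"
    "Sep 2016,/,100513\n"
    "Sep 2016,/,68221\n"
    "Oct 2016,/,536365\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
totals = aggregate_counts(reader)

for (month, url), count in sorted(totals.items()):
    print(month, url, count)
```

For a real file, the reader would wrap an open file object instead of the StringIO sample; rows are consumed one at a time, so the file does not have to fit in memory.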

Related

Grok pattern for [Mon Jan 04 08:36:12 2021]

I am working on shipping some logs to Elasticsearch using Logstash. I am unable to figure out the grok pattern for [Mon Jan 04 08:36:12 2021]. The format is Day Month Date Time Year. Help and suggestions are most welcome.
Log - [Mon Jan 04 08:36:12 2021]
Grok I tried - \[%{DAY:day} %{MONTH:month} %{TIME:time} %{YEAR:year}]
Result Expected - Day:Mon Month:Jan Date:04 Hour:08 Minute:36 Second:12 Year:2021
You forgot to specify %{MONTHDAY} between the month and time patterns. Give it a field name (here date) so the value is kept in the result.
You can use
\[%{DAY:day} %{MONTH:month} %{MONTHDAY:date} %{TIME:time} %{YEAR:year}]
Grok pattern list used:
DAY (?:Mon(?:day)?|Tue(?:sday)?|Wed(?:nesday)?|Thu(?:rsday)?|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)
MONTH \b(?:[Jj]an(?:uary|uar)?|[Ff]eb(?:ruary|ruar)?|[Mm](?:a|ä)?r(?:ch|z)?|[Aa]pr(?:il)?|[Mm]a(?:y|i)?|[Jj]un(?:e|i)?|[Jj]ul(?:y|i)?|[Aa]ug(?:ust)?|[Ss]ep(?:tember)?|[Oo](?:c|k)?t(?:ober)?|[Nn]ov(?:ember)?|[Dd]e(?:c|z)(?:ember)?)\b
MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])
TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
HOUR (?:2[0123]|[01]?[0-9])
MINUTE (?:[0-5][0-9])
SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
YEAR (?>\d\d){1,2}
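To sanity-check the field layout outside Logstash, the same structure can be approximated with a plain Python regular expression. This is only a rough sketch: the character classes below are simplified stand-ins for the grok definitions, not a translation of them, and the name LOG_TS is made up for the example. The strptime call assumes an English (C) locale for the day and month abbreviations.

```python
import re
from datetime import datetime

# Simplified stand-in for: [DAY MONTH MONTHDAY HOUR:MINUTE:SECOND YEAR]
LOG_TS = re.compile(
    r"\[(?P<day>[A-Z][a-z]{2}) (?P<month>[A-Z][a-z]{2}) (?P<date>\d{1,2}) "
    r"(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}) (?P<year>\d{4})\]"
)

match = LOG_TS.match("[Mon Jan 04 08:36:12 2021]")
fields = match.groupdict()
print(fields)

# If only the timestamp value is needed, strptime can parse it directly.
parsed = datetime.strptime("Mon Jan 04 08:36:12 2021", "%a %b %d %H:%M:%S %Y")
print(parsed.isoformat())
```

Leaving out the date group reproduces the original problem: the pattern then expects the time right after the month and fails to match.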

how do I use f-string with regex in Python

This code works if I use raw strings only. However, as soon as I add f to r it stops working.
Is there a way to make f-strings work with raw strings for re?
import re
lines = '''
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
'''
rmonth = 'a'
regex = fr'(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})'
date_found = re.findall(regex, lines)
date_found
f-strings treat curly braces as replacement fields, so {1,2} is evaluated as a Python expression instead of reaching the regex engine as a quantifier. Escape the braces you want to keep literally by doubling them:
regex = fr'(\d{{1,2}})/(\d{{1,2}})/(\d{{4}}|\d{{2}})'
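A short sketch of both behaviours, using a shortened copy of the sample text. The variable sep is hypothetical and only there to show a real interpolation next to the escaped quantifiers.

```python
import re

text = "04/20/2009; 04/20/09; 4/20/09; 4/3/09"

# Without doubling, {1,2} is evaluated as the tuple (1, 2) and its repr is inserted.
broken = fr"\d{1,2}"
print(broken)

# Doubled braces stay literal; single braces interpolate the variable.
sep = "/"  # hypothetical separator, shown only to demonstrate interpolation
pattern = fr"(\d{{1,2}}){sep}(\d{{1,2}}){sep}(\d{{4}}|\d{{2}})"
dates = re.findall(pattern, text)
print(pattern)
print(dates)
```

If the interpolated value can contain regex metacharacters, passing it through re.escape first keeps it from being interpreted as part of the pattern.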

python program to grep the output of a file with a time range

I am using python 2.6.6
I have a sample file 1.csv
1.csv
11887788201606180000 value=1 sat sun mon tue , 998848494 992920209 992828282 kdkkdkdf 992828228 o333448482
28283838201606180000 value-2 jan feb mar apr , 8849494994 49499494 499494949 49949494 499494484 449494994
33838383201606180000 value-2 jan feb mar apr , 8849494994 49499494 499494949 49949494 499494484 449494994
47474747201606190000 value-2 jan feb mar apr , 8849494994 49499494 499494949 49949494 499494484 449494994
47474747201606200000 value-2 jan feb mar apr , 8849494994 49499494 499494949 49949494 499494484 449494994
I want to get the data in the date range 20160618 to 20160619,
and my expected output should look like this:
11887788201606180000 value=1 sat sun mon tue , 998848494 992920209 992828282 kdkkdkdf 992828228 o333448482
28283838201606180000 value-2 jan feb mar apr , 8849494994 49499494 499494949 49949494 499494484 449494994
33838383201606180000 value-2 jan feb mar apr , 8849494994 49499494 499494949 49949494 499494484 449494994
47474747201606190000 value-2 jan feb mar apr , 8849494994 49499494 499494949 49949494 499494484 449494994
The code I have written is:
import csv
import sys
import time
import datetime
if __name__ == '__main__':
    from_raw = raw_input('\nEnter From date :')
    from_date = datetime
    print 'From date: = ' + str(from_date)
    to_raw = raw_input('\nEnter TO Date :')
    to_date = datetime
    in_file = './file.csv'
    for line in in_file:
        fields = line.split(',')
        found_date = datetime.date
        if from_date <= found_date <= to_date:
            print line
    in_file.close()
I am executing it like
python script.py 1.csv
I am able to key in the start date and end date with the script but not able to get the expected output
please help
Just reading your code the problem is in the line
fields = line.split(',')
You are splitting the line at the "," which is not what you want. Given that the date substring is consistently in the same place in the string I would try an easy solution which is
found_date = line[8:16]
And also remove the following line:
found_date = datetime.date
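This line overwrites found_date with the datetime.date class itself, not with a date taken from the line, which you do not want.
You also need to open the file (in_file is only a path string, so the loop iterates over its characters) and compare found_date against the values you typed in (from_raw and to_raw) rather than against the datetime module. With those changes the script should work as long as the input format is consistent.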
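A minimal sketch of the slicing approach, assuming every line starts with an 8-character id followed by a YYYYMMDD date, as in the sample. It is written for Python 3 rather than the 2.6 used in the question, and the function name lines_in_range is illustrative. Because the dates are zero-padded, plain string comparison orders them correctly and no datetime parsing is needed.

```python
def lines_in_range(lines, start, end):
    """Yield lines whose YYYYMMDD stamp (characters 8 to 15) lies in [start, end]."""
    for line in lines:
        stamp = line[8:16]
        if start <= stamp <= end:
            yield line


# Shortened copies of the sample rows from the question.
sample = [
    "11887788201606180000 value=1 sat sun mon tue",
    "28283838201606180000 value-2 jan feb mar apr",
    "33838383201606180000 value-2 jan feb mar apr",
    "47474747201606190000 value-2 jan feb mar apr",
    "47474747201606200000 value-2 jan feb mar apr",
]

selected = list(lines_in_range(sample, "20160618", "20160619"))
for line in selected:
    print(line)
```

With a real file, the list can be replaced by an open file object, for example the path given on the command line in sys.argv[1].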

Howto grep over months with defined start and end date

So here's my problem: I have big log files and want a script that greps certain periods of time and saves them to a file (sorted). Basically,
bash script.sh Jul 4 Sep 30
will return for example
Sep 30 user0 logged in
Sep 15 user1 logged in
Aug 6 user0 logged in
Aug 3 user1 logged in
Jul 28 user2 logged in
Jul 27 user2 logged in
Jul 4 user0 logged in
My first attempt was to give every month and every digit of the date its own variable, like
bash script.sh Jul 4 Sep 3 0
so I can use $1 for start month (July), $2 for start date (4) and so on in grep like
for logs in logs*
do
    grep -qEe "^\"$1\" [\"$2\"-9]\s" $messages >> result.txt
done
to get all logs from July 4 to 9, but I don't know how to get logs for a whole period that spans several months or that doesn't fit a single digit range like 1-9 or 10-19.
Any help is greatly appreciated!
EDIT:
As some people asked, here's what my log files look like (just much bigger and not sorted):
Sep 30 user0 logged in
Jul 27 user2 logged in
Aug 6 user0 logged in
Aug 31 user1 logged in
Jul 8 user2 logged in
Sep 5 user1 logged in
Jul 27 user2 logged in
Jul 14 user0 logged in
[...]
Here's my take:
#!/bin/bash
year="$(date +"%Y")"
start="$(date -d"$1 $2, $year" +'%s')"
# end is exclusive: midnight after the last requested day
end="$(($(date -d"$3 $4, $year" +'%s')+86400))"
for log in logs*; do
    while IFS= read -r line; do
        # turn the "Mon DD" prefix of the line into a Unix timestamp
        d="$(date -d"$(cut -d' ' -f1,2 <<< "$line"), $year" +'%s')"
        if (( $start <= $d && $d < $end )); then
            echo "$line"
        fi
    done < "$log"
done
You run it like this: ./script.sh Jul 04 Sep 03. Since no year is included in the logs, it assumes that all dates (including the ones on the command line) are in the current year. It's probably not the most efficient solution, but it works. It relies on date (GNU date, for the -d option), which it calls once per line to parse the date into a Unix timestamp. Unix timestamps are convenient because they are plain numbers and can be used in numeric comparisons.
$ range="Jul 4 Sep 30"
$ awk -v range="$range" '
BEGIN {
    numMths = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",m)
    for (i in m) {
        mths[m[i]] = i
    }
    split(range,r)
    beg = sprintf("%02d%02d", mths[r[1]], r[2])
    end = sprintf("%02d%02d", mths[r[3]], r[4])
}
{ cur = sprintf("%02d%02d", mths[$1], $2) }
(cur >= beg) && (cur <= end) { vals[$1,$2] = $0 }
END {
    for (mthNr=numMths; mthNr>0; mthNr--) {
        for (dayNr=31; dayNr>0; dayNr--) {
            date = m[mthNr] SUBSEP dayNr
            if (date in vals) {
                print vals[date]
            }
        }
    }
}
' file
Sep 30 user0 logged in
Sep 5 user1 logged in
Aug 31 user1 logged in
Aug 6 user0 logged in
Jul 27 user2 logged in
Jul 14 user0 logged in
Jul 8 user2 logged in
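For comparison, the same idea (turn month and day into a sortable key, filter on the range, print newest first) can be sketched in Python. This assumes every line starts with a three-letter English month and a day number and that all entries belong to the same year. The names MONTHS, date_key and select_range are illustrative. Unlike the awk version, which keeps one line per date, this keeps every matching line.

```python
MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}


def date_key(line):
    """Return (month_number, day) taken from the first two fields of a log line."""
    month, day = line.split()[:2]
    return MONTHS[month], int(day)


def select_range(lines, start, end):
    """Keep lines with start <= (month, day) <= end, newest first."""
    picked = [line for line in lines if start <= date_key(line) <= end]
    return sorted(picked, key=date_key, reverse=True)


# The unsorted sample from the question.
log = [
    "Sep 30 user0 logged in",
    "Jul 27 user2 logged in",
    "Aug 6 user0 logged in",
    "Aug 31 user1 logged in",
    "Jul 8 user2 logged in",
    "Sep 5 user1 logged in",
    "Jul 27 user2 logged in",
    "Jul 14 user0 logged in",
]

result = select_range(log, (MONTHS["Jul"], 4), (MONTHS["Sep"], 30))
august = select_range(log, (MONTHS["Aug"], 1), (MONTHS["Aug"], 31))
for line in result:
    print(line)
```

Tuples compare element by element, so (7, 27) sorts before (8, 6) without any zero padding.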

All CSV values in column 0 are strings

For some reason a CSV file I wrote with Python (on Windows 7) has every row stored as a single string in column 0, so I cannot perform any operations on the values.
It has no labels.
The format is (I would like to keep the last value - date - as a date format):
"Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0,"" date: Feb 04, 2016 """
EDIT - When I read it with the csv module it prints it out like:
['Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0," date: Feb 04, 2016\t\t\t"']
What is the best way to convert the strings into comma separated values like this?
Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0, date:, Feb 04, 2016
Thanks a lot.
s="Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0,"" date: Feb 04, 2016 """
print(s)
Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0, date: Feb 04, 2016
To add a comma after "date:" you need some extra logic, such as replacing ":" with ":," or inserting a comma after the first word.
First, your date field is quoted, which is ok (and needed) because there is a comma inside:
" date: Feb 04, 2016 "
But then the whole line also gets quoted (and thus seen as a single field). And because there are already quotes around the date field, those get escaped with another quote:
"Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0,"" date: Feb 04, 2016 """
So, if you remove that last quoting, everything should be fine (but you might want to trim the date field):
Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0," date: Feb 04, 2016 "
If you want it exactly like this, you need another comma after date: :
Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0, date:,"Feb 04, 2016"
On the other hand, it would be better to use a header instead:
Name,Name2,Ave,Max,Min,analist disp,date
Rob,Avanti,12.83,4.0,-21.9,-1.0,"Feb 04, 2016"
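If the file cannot be regenerated, the double quoting can be undone when reading. The sketch below is one possible approach, assuming every row has the shape shown in the question: the first csv pass strips the outer quotes and yields one field, and a second pass over that field yields the real columns. The date parsing assumes an English (C) locale for the month abbreviation.

```python
import csv
from datetime import datetime

# The row from the question: the whole line is wrapped in one extra pair of quotes.
raw = '"Rob,Avanti,Ave,12.83,Max,4.0,Min,-21.9,analist disp:,-1.0,"" date: Feb 04, 2016 """'

outer = next(csv.reader([raw]))        # first pass: a single field
fields = next(csv.reader([outer[0]]))  # second pass: the real columns

# Split the last field into its label and value, then parse the date.
label, _, value = fields[-1].strip().partition(":")
when = datetime.strptime(value.strip(), "%b %d, %Y").date()

print(len(outer), len(fields))
print(fields[:-1])
print(label, when.isoformat())
```

The numeric columns are still strings after parsing and can be converted with float() as needed.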