Know the number of times this word appears (Python 2.7)

I have a file.txt with about 1,000 lines that look like this:
--- Adding sections to FwLogger: [],2020-01-13 16:09:18,2020-01-13 16:09:22
--- Clearing all sections from FwLogger,2020-01-13 16:09:17,2020-01-13 16:09:22
--- (1/0) The value was discarded due to being too separated from previous value
--- (1/0) ContinueBoot#b7630fd Rebooting device due to capabilities request freeze
And I would need to know how many times the word "FwLogger" appears, as a number.

There are definitely more elegant ways to do it, but in my version you replace the delimiters manually:
counter = 0
with open('test.txt') as file:
    for line in (line.strip() for line in file):
        # here you replace all possible delimiters in your file with a space,
        # to split afterwards according to the spaces
        c = line.replace(",", " ").replace(";", " ").replace("#", " ").replace(":", " ")
        for word in c.split(" "):
            if word == "FwLogger":
                # print(line)
                counter = counter + 1
print(counter)
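A compact variant of the same idea, using re.split to handle all the delimiters in one pass (the delimiter set is taken from the answer above; this is a sketch, not the original poster's code):
import re

counter = 0
with open('test.txt') as f:
    for line in f:
        # split on spaces, commas, semicolons, hashes and colons at once
        counter += re.split(r'[ ,;#:]+', line.strip()).count("FwLogger")
print(counter)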

Read in your txt file and use the string find method in a loop, like below:
text = open('file.txt').read()
count, istart = 0, 0
while True:
    istart = text.find("FwLogger", istart)
    if istart == -1:  # not found: stop
        break
    count += 1  # each time one is found, increment the counter
    istart += 1  # continue searching just past the last match
istart is the position where the string you're looking for was last found; before starting your loop, assign istart = 0 (Python indexing starts at zero). Each time the string is found, increment the counter.

Related

Remove lines from buffer that match the selected text

When analyzing large log files, I often remove lines containing text I find irrelevant:
:g/whatever/d
Sometimes I find text that spans multiple lines, like stacktraces. For that, I record the steps taken (search, go to start anchor, delete to end anchor) and replay that macro with 100000@q. I'm searching for a function or a feature Vim already includes that allows me to mark text and remove all lines containing this text. Ideally this would also work for block selection.
If I understood your problem right, this command should do what you want:
:g/NullPointer/,/omitt/d
Example:
Before:
1
2
3
NullPointerException1
4
5
6
omitted
7
NullPointerException2
8
9
omitted
10
After:
1
2
3
7
10
Please read :h edit-paragraph-join; it contains a good explanation of the command. Your case just changes join into d.
:g/whatever/d2
will delete a line with whatever and the line after it. If you can find text that always appears in the first line, you can strip out all of the following text, provided it always spans the same number of lines, by changing 2 to whatever count you need.
You could actually just use some normal-mode commands in a global command to achieve what you want. Look at your example (I hope I understood it more or less right):
someText
NullPointerException
...
omitted
You want to delete from the line above the NPE until the line with omitted, right?
Just use the following:
:g/NullPointerException/execute "normal! kddd/omitted\<cr>dd"
It may look complex, but it isn't. It is not better than a macro[1], but I like commands more, because I always make errors recording macros.
Since it only uses normal Vim motions, it is easy to adapt. If you, for example, don't know where your previous anchor is, you could use ?anchor\<cr> instead of kd. For a better demonstration you would have to submit a realistic example.
[1] You could argue that this only needs to be run once, but that is also true for a recursive macro: http://vim.wikia.com/wiki/Record_a_recursive_macro
Thanks to the answers here, I was able to code a very handy function: the source below enables one to select text and remove all lines with the same (or similar) text in the current buffer. It works with both in-line and multiline selections. As I said, I was searching for something that would make me faster at analyzing log files. Log files typically contain dates and times, and these change all the time, so it's a good idea to have something that lets us ignore numbers. Let's see. I'm using these two mappings:
vnoremap d :<C-U>echo RemoveSelectionFromBuffer(0)<CR>
vnoremap D :<C-U>echo RemoveSelectionFromBuffer(1)<CR>
Typical usage:
Remove similar lines ignoring numbers: Shift+v, then Shift+d
Remove same matches (single line): Mark text inline (leaving out dates and times), then d
Remove same matches (multiline): Mark text across lines (leaving out dates and times), then d
Here's the source code:
" Removes lines matching the selected text from buffer.
function! RemoveSelectionFromBuffer(ignoreNumbers)
let lines = GetVisualSelection() " selected lines
" Escape backslashes and slashes (delimiters)
call map(lines, {k, v -> substitute(v, '\\\|/', '\\&', 'g')})
if a:ignoreNumbers == 1
" Substitute all numbers with \s*\d\s* - in formatted output matching
" lines may have whitespace instead of numbers. All backslashes need
" to be escaped because \V (very nomagic) will be used.
call map(lines, {k, v -> substitute(v, '\s*\d\+\s*', '\\s\\*\\d\\+\\s\\*', 'g')})
endif
let blc = line('$') " number of lines in buffer (before deletion)
let vlc = len(lines) " number of selected lines
let pattern = join(lines, '\_.') " support multiline patterns
let cmd = ':g/\V' . pattern . '/d_' . vlc " delete matching lines (d_3)
let pos = getpos('v') " save position
execute "silent " . cmd
call setpos('.', pos) " restore position
let dlc = blc - line('$') " number of deleted lines
let dmc = dlc / vlc " number of deleted matches
let cmd = substitute(cmd, '\(.\{50\}\).*', '\1...', '') " command output
let lout = dlc . ' line' . (dlc == 1 ? '' : 's')
let mout = '(' . dmc . ' match' . (dmc == 1 ? '' : 'es') . ')'
return printf('%s removed: %s', (vlc == 1 ? lout : lout . ' ' . mout), cmd)
endfunction
I took the GetVisualSelection() code from this answer.
function! GetVisualSelection()
    if mode() == "v"
        let [line_start, column_start] = getpos("v")[1:2]
        let [line_end, column_end] = getpos(".")[1:2]
    else
        let [line_start, column_start] = getpos("'<")[1:2]
        let [line_end, column_end] = getpos("'>")[1:2]
    endif
    if (line2byte(line_start)+column_start) > (line2byte(line_end)+column_end)
        let [line_start, column_start, line_end, column_end] =
            \ [line_end, column_end, line_start, column_start]
    endif
    let lines = getline(line_start, line_end)
    if len(lines) == 0
        return ''
    endif
    let lines[-1] = lines[-1][: column_end - 1]
    let lines[0] = lines[0][column_start - 1:]
    return lines
endfunction
Thanks, aepksbuck, DoktorOSwaldo and Kent.

Python word search in a text file

I have a text file in which I need to search for 3 specific words using Python. For example, the words are account, online and offer, and I need a count of how many times each appears in the file.
with open('fixtures/file1.csv') as f:
    print len(filter(
        lambda line: "account" in line or "online" in line or "offer" in line,
        f.readlines()
    ))
You can also check directly whether the words are in each line.
Update
To count how many times each word appears in a file, the most effective way I have found is to iterate once over the file, checking each line for every word. For that, try the following:
keys = ('account', 'online', 'offer')
with open('fixtures/file1.csv') as f:
    found = dict((k, 0) for k in keys)
    for line in f.readlines():
        for k in keys:
            found[k] += 1 if k in line else 0
found will then be a dictionary with what you are looking for.
Hope this helps!
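Note that found[k] above counts lines containing each word, not total occurrences. If per-occurrence counts are wanted instead, a minimal sketch (same keys and file path as above) could use str.count:
keys = ('account', 'online', 'offer')
found = dict((k, 0) for k in keys)
with open('fixtures/file1.csv') as f:
    for line in f:
        for k in keys:
            # str.count tallies every occurrence within the line
            found[k] += line.count(k)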
I am assuming it is a plain text document. In that case you would open('file.txt') as f, iterate over every line in f, check if 'word' in line.lower(), and then increment a counter accordingly (say wordtotal += 1).
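A minimal sketch of that approach (the filename file.txt and the word being searched for are placeholders):
count = 0
with open('file.txt') as f:
    for line in f:
        # case-insensitive check; counts lines containing the word
        if 'word' in line.lower():
            count += 1
print(count)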

Python :Replace repetitive line in a file with empty space but not on first / last occurrence

I have a file which has a repeated line <this is repeated> that I would like to replace with empty space "". However, the first or last occurrence of the repeated line does not need to be replaced. I tried replace() before, but that function replaces all matching strings in the file. Is there any way to write this to get the expected result? PS: it is a large text file.
The file is as follow:
<this is repeated>
second line
another lines
third line
<this is repeated>
<this is repeated>
Note: I realized after posting that if the last occurrence is the very last line, with no \n after it, this technique would leave it as well as the next-to-last occurrence.
First you would need to iterate over the file until you find the first occurrence:
file = open("file.txt")  # open your file here (the name is a placeholder)
rep_line = "<this is repeated>\n"
beginning = ""  # record all data until found
while True:  # broken when rep_line is found in the file (or end of file is reached)
    line = file.readline()
    if not line:
        raise EOFError("reached end of file before finding first occurrence")
    beginning += line
    if line == rep_line:
        break
rest = file.read()  # you can read the rest after iterating over the first few lines
Then you will have beginning, which contains everything up to and including the first occurrence, and rest.
So all you need to do with rest is count how many times the line occurs and replace all but the last one:
reps = rest.count(rep_line)
new_text = beginning + rest.replace(rep_line, "", reps - 1)  # don't replace the last one
However, this direct approach will pick up lines that merely end with the text (like "hello <this is repeated>", for example); this can be fixed by also checking that there is a \n right before the line:
reps = rest.count("\n"+rep_line)
new_text = beginning + rest.replace("\n"+rep_line,"\n",reps - 1)
# ^ replace with a single newline
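Putting the pieces together, a minimal end-to-end sketch (the filename repeated.txt is an assumption; it rewrites the file in place):
rep_line = "<this is repeated>\n"
with open('repeated.txt') as f:  # filename is a placeholder
    beginning = ""
    while True:
        line = f.readline()
        if not line:
            break  # no occurrence found at all
        beginning += line
        if line == rep_line:
            break
    rest = f.read()
reps = rest.count("\n" + rep_line)
new_text = beginning + rest.replace("\n" + rep_line, "\n", reps - 1)
with open('repeated.txt', 'w') as f:
    f.write(new_text)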

Command line to merge lines with matching first field, 50 GB input

A while back, I asked a question about merging lines which have a common first field. Here's the original: Command line to match lines with matching first field (sed, awk, etc.)
Sample input:
a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output:
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
The idea is that if the first field matches, then the lines are merged. The input is sorted. The actual content is more complex, but uses the pipe as the sole delimiter.
The methods provided in the prior question worked well on my 0.5 GB file, processing in ~16 seconds. However, my new file is approximately 100x larger, and I prefer a method that streams. In theory, this should be able to run in ~30 minutes. The prior method failed to complete after running for 24 hours.
Running on MacOS (i.e., BSD-type unix).
Ideas? [Note, the prior answer to the prior question was NOT a one-liner.]
You can append your results to a file on the fly so that you don't need to build a 50 GB array (which I assume you don't have the memory for!). This command will concatenate the joined fields for each of the distinct indices into a string, which is written to a file named after the respective index with some suffix.
EDIT: based on the OP's comment that the content may contain spaces, I would suggest using -F"|" instead of sub; the following answer is also designed to write to standard out.
(New) Code:
# split the file on the pipe using -F
# if index "i" is still $1, concatenate "|" and the second field onto a
# if index "i" is not $1 (or doesn't exist yet), print the collected string
# (skipping the very first, empty one) and restart it from the current line
# set i to $1 each time
# END statement to print the last collected string "a"
awk -F"|" '$1==i{a=a"|"$2}$1!=i{if(a!="")print a; a=$0}{i=$1}END{print a}'
This builds a string of "data" while in a given index and then prints it out when index changes and starts building the next string on the new index until that one ends... repeat...
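For comparison, here is the same group-and-join idea as a streaming sketch in Python (reading stdin and writing stdout; this is an illustration, not part of the original answers, and it only prints groups that actually merged, matching the sample output):
import sys

prev_key = None
parts = []
for line in sys.stdin:
    key, _, value = line.rstrip('\n').partition('|')
    if key == prev_key:
        parts.append(value)  # same first field: keep collecting
    else:
        if len(parts) > 1:  # only emit groups that merged at least once
            print('|'.join([prev_key] + parts))
        prev_key, parts = key, [value]
if len(parts) > 1:
    print('|'.join([prev_key] + parts))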
sed '# label anchor for a jump
:loop
# load a new line into the working buffer (so there are always 2 lines loaded afterwards)
N
# check whether the 2 lines have the same starting pattern and join them if so
/^\(\([^|]\)*\(|.*\)\)\n\2/ s//\1/
# if end of file, quit (and print the result)
$ b
# if lines were joined, cycle and redo with the next line (jump to :loop)
t loop
# (no joined lines here)
# if more than 2 elements on the first line, print the first line
/.*|.*|.*\n/ P
# remove the first line (using the last search pattern)
s///
# (if any modification) cycle (jump to :loop)
t loop
# exit and print the working buffer
' YourFile
POSIX version (you may need --posix on Mac).
Self-commented.
Assumes sorted input, no empty lines, and no pipe inside the data (nor an escaped one).
Use unbuffered mode (-u) for stream processing, if available.

findall function grabbing the wrong info

I am trying to write a piece of Python to read my files. The code is below:
import re, os
captureLevel = []  # capture read scale.
captureQID = []  # capture questionID.
captureDesc = []  # capture description.
file = open(r'E:\Grad\LIS\LIS590 Text mining\Final_Project\finalproject_data.csv', 'rt')
newfile = open('finalwordlist.csv', 'w')
mytext = file.read()
for row in mytext.split('\n'):
    grabLevel = re.findall(r'(\d{1})+\n', row)
    captureLevel.append(grabLevel)
    grabQID = re.findall(r'(\w{1}\d{5})', row)
    captureQID.append(grabQID)  # ERROR LINE.
    grabDesc = re.findall(r'\,+\s+(\w.+)', row)
    captureDesc.append(grabDesc)
    lineCount = 0
    wordCount = 0
    lines = ''.join(grabDesc).split('.')
    for line in lines:
        lineCount += 1
        for word in line.split(' '):
            wordCount += 1
            newfile.write(''.join(grabLevel) + '|' + ''.join(grabQID) + '|' + str(lineCount) + '|' + str(wordCount) + '|' + word + '\n')
newfile.close()
Here are three lines of my data:
a00004," another oakstr eetrequest, helped student request item",2
a00005, asked retiree if he used journal on circ list,2
a00006, asked scientist about owner of some archival notes,2
Here is the result:
22|a00002|1|1|a00002,
22|a00002|1|2|
22|a00002|1|3|scientist
22|a00002|1|4|looking
22|a00002|1|5|for
The first column of the result should be just one number, so why is it printing a two-digit number?
Any idea what the problem is here? Thanks.
It is the tab and space difference again. You need to be careful, especially in Python, since spaces are not treated as equivalent to tabs. Here is a helpful link discussing the difference: http://legacy.python.org/dev/peps/pep-0008/. To be brief, spaces are the recommended indentation method there. However, I find tabs work fine for indentation too. The important thing is to keep indentation consistent: if you use tabs, make sure you use them all the way through.
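As a small illustration of why consistency matters, here is a hedged sketch (not from the original answer) that compiles a snippet whose two body lines are indented with four spaces and one tab respectively; Python rejects the mix:
# the first body line is indented with spaces, the second with a tab ('\t')
src = "if True:\n    x = 1\n\ty = 2\n"
try:
    compile(src, '<mixed-indentation>', 'exec')
except IndentationError as err:  # TabError is a subclass of IndentationError
    print('rejected: %s' % err)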