findall function grabbing the wrong info - python-2.7

I am trying to write a piece of Python to read my files. The code is below:
import re, os

captureLevel = []  # capture read scale.
captureQID = []    # capture questionID.
captureDesc = []   # capture description.

file = open(r'E:\Grad\LIS\LIS590 Text mining\Final_Project\finalproject_data.csv', 'rt')
newfile = open('finalwordlist.csv', 'w')
mytext = file.read()

for row in mytext.split('\n'):
    grabLevel = re.findall(r'(\d{1})+\n', row)
    captureLevel.append(grabLevel)
    grabQID = re.findall(r'(\w{1}\d{5})', row)
    captureQID.append(grabQID)  # ERROR LINE.
    grabDesc = re.findall(r'\,+\s+(\w.+)', row)
    captureDesc.append(grabDesc)
    lineCount = 0
    wordCount = 0
    lines = ''.join(grabDesc).split('.')
    for line in lines:
        lineCount += 1
        for word in line.split(' '):
            wordCount += 1
            newfile.write(''.join(grabLevel) + '|' + ''.join(grabQID) + '|' + str(lineCount) + '|' + str(wordCount) + '|' + word + '\n')
newfile.close()
Here are three lines of my data:
a00004," another oakstr eetrequest, helped student request item",2
a00005, asked retiree if he used journal on circ list,2
a00006, asked scientist about owner of some archival notes,2
Here is the result:
22|a00002|1|1|a00002,
22|a00002|1|2|
22|a00002|1|3|scientist
22|a00002|1|4|looking
22|a00002|1|5|for
The first column of the result should be just one number, so why is it printing a two-digit number?
Any idea what the problem is here? Thanks.

It is the tab and space difference again. You need to be careful, especially in Python: spaces are not treated as equivalent to tabs. Here is a helpful link discussing the difference: http://legacy.python.org/dev/peps/pep-0008/. In brief, spaces are the indentation style recommended in that PEP. However, I find tabs work fine for indentation too. The important thing is to keep indentation consistent, so if you use tabs, use them all the way through.
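For illustration, a minimal sketch (the file name comes from the question, but the snippet itself is not from the original post): every line of the loop body below is indented with four spaces, and running the script with python -tt makes Python 2.7 raise an error whenever tabs and spaces are mixed within a block.

# consistent_indent.py -- each line of the loop body is indented with four spaces
for row in open('finalproject_data.csv'):
    fields = row.strip().split(',')
    print fields[0]

# Run as:  python -tt consistent_indent.py
# The -tt flag turns inconsistent tab/space indentation into an error in Python 2.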

Related

Know the number of times this word appears: PYTHON 2.7

I have a file.txt with about 1,000 lines that look like this:
--- Adding sections to FwLogger: [],2020-01-13 16:09:18,2020-01-13 16:09:22
--- Clearing all sections from FwLogger,2020-01-13 16:09:17,2020-01-13 16:09:22
--- (1/0) The value was discarded due to being too separated from previous value
--- (1/0) ContinueBoot#b7630fd Rebooting device due to capabilities request freeze
And I need to know how many times the word "FwLogger" appears (as a number).
There are definitely more elegant ways to do it, but in my version you replace the delimiters manually:
counter = 0
with open('test.txt') as file:
    for line in (line.strip() for line in file):
        # replace all possible delimiters in the file with a space,
        # so the line can be split on spaces afterwards
        c = line.replace(",", " ").replace(";", " ").replace("#", " ").replace(":", " ")
        for word in c.split(" "):
            if word == "FwLogger":
                # print(line)
                counter = counter + 1
print(counter)
Read in your txt file and use the string find method, like below:
loop
    istart = str.find(sub, istart)
    I = I + 1
end loop
istart is the position where the string you're looking for was last found; before starting your loop, assign istart = 0. Each time the word is found, increment a counter, i.e. I = I + 1.
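A minimal sketch of that loop in Python (the file name and search term come from the question; count_occurrences is just an illustrative helper name):

def count_occurrences(path, sub):
    # Count non-overlapping occurrences of sub in the file at path.
    text = open(path).read()
    count = 0
    istart = text.find(sub)                         # find() returns -1 when sub is absent
    while istart != -1:
        count = count + 1
        istart = text.find(sub, istart + len(sub))  # keep searching after this match
    return count

print(count_occurrences('file.txt', 'FwLogger'))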

Regex to find 4th value inside bracket

How can I read the 4th value (inside "", i.e. "vV0....") using regex in the condition below?
I am updating this part a bit - is it possible to first find the word "LaunchFileUploader" and then select the 4th value? If there are multiple instances of LaunchFileUploader in the file, just select the 4th value of the first occurrence found. Attaching a screenshot of the file where this needs to be searched (in the file the word is "LaunchFileUploader").
I tried this, but it gives the following - I need the 4th value (group 1 is giving me the third value):
\bLaunchFileUploader\b(\:?.*?,){3}.*?\)
Match 1
Full match 11030-11428 LaunchFileUploader("ERM-1BLX3D04R10-0001", 1662, "2ecbb644-34fa-4919-9809-a5ff47594c2d", "8dZOPyHKBK...
Group 1. n/a "2ecbb644-34fa-4919-9809-a5ff47594c2d",
I am still looking for a solution to this. Any help is appreciated.
Depending on what's available for you to use, there are a couple of ways to do it.
Either way, this would work better if there were no new lines in the string, just plain ("value1","value2","value3","value4") etc. It'll still work, but you may need to clean up some new lines from the resulting string.
The easy way - use code for the hard part. Grab the inner string with:
(?<=\().*?(?=\))
This will get everything that's between the 2 parentheses (using positive lookarounds). In code, you could then split/explode this string on , and take the 4th item.
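For example, a rough Python sketch of that split approach (the sample text is shortened from the question, and re.DOTALL is used because the real call spans several lines):

import re

text = '''LaunchFileUploader("ERM-1BLX3D04R10-0001",
    1662,
    "2ecbb644-34fa-4919-9809-a5ff47594c2d",
    "vV0mX3VadCSPnN8FsAO7")'''

inner = re.search(r'(?<=\().*?(?=\))', text, re.DOTALL).group(0)  # everything between the parentheses
fourth = inner.split(',')[3].strip()                              # 4th comma-separated item
print(fourth.strip('"'))                                          # -> vV0mX3VadCSPnN8FsAO7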
If you want to do it all in regex, you could use something along the lines of:
(?<=\()(?:.*?,){3}(.*?)(?=\))
This would a) match the entire contents of the parentheses and b) capture the 4th option in a capture group. To go even deeper:
(?<=\()(?:.*?,){3}\"(.*?)\"(?=\))
would capture the contents of the "" quotation marks only.
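As a quick check, here is how that pattern could be tried out in Python against the simple single-line form mentioned above (the values are placeholders):

import re

sample = '("value1","value2","value3","value4")'
m = re.search(r'(?<=\()(?:.*?,){3}\"(.*?)\"(?=\))', sample)
print(m.group(1))   # -> value4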
Some tools don't allow you to use lookarounds, if this is the case let me know and I'll see what other ways there are around it.
EDIT: Ran this in the JS console in a browser. This absolutely does work.
EDIT 2: I see you've updated your question with the text you're actually searching in. This pattern will include the space and the newline character, as per the copy/paste of the above text.
(?<=\(\")(?:.*?,\s?\n?){3}\"(.*?)\"(?=\))
See my second image for the test in console
This works for python and PHP:
(?<=\")(.*)(?:\"\);)\Z
Demo for Python and PHP
For Java, replace \Z with $ as follows:
(?:")(.*)(?:\"\);)$
Demo for JavaScript
NOTE: Be sure to look at the captured group and not the full match.
UPDATE:
Try this for your updated request:
"(.*)"(?:[\\);\] \/>}]*)$
Demo for updated input string
All the above regex patterns assume there is a line break after each comma.
Auto-generated Java Code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "\"(.*)\"(?:[\\\\);\\] \\/>\\}]*)$";
final String string = "\n"
        + "}$(document).ready( function(){ PathUploader\n"
        + " (\"ERM-1BLX3D04R10-0001\", \n"
        + " 1662, \n"
        + " \"1bff5c85-7a52-4cc5-86ef-a4ccbf14c5d5\", \n"
        + "\"vV0mX3VadCSPnN8FsAO7%2fysNbP5b3SnaWWHQETFy7ORSoz9QUQUwK7jqvCEr%2f8UnHkNNVLkJedu5l%2bA%2bne%2fD%2b2F5EWVlGox95BYDhl6EEkVAVFmMlRThh1sPzPU5LLylSsR9T7TAODjtaJ2wslruS5nW1A7%2fnLB%2bljZaQhaT9vZLcFkDqLjouf9vu08K9Gmiu6neRVSaISP3cEVAmSz5kxxhV2oiEF9Y0i6Y5%2f5ASaRiW21w3054SmRF0rq3IwZzBvLx0%2fAk1m6B0gs3841b%2fw%3d%3d\"); } );//]]>";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);

while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}

Remove lines from buffer that match the selected text

When analyzing large log files, I often remove lines containing text I find irrelevant:
:g/whatever/d
Sometimes I find text that spans multiple lines, like stack traces. For that, I record the steps taken (search, go to start anchor, delete to end anchor) and replay that macro with 100000#q. I'm searching for a function or a feature Vim already includes that allows me to mark text and remove all lines containing this text. Ideally this would also work for block selection.
If I understood your problem right, this command should do what you want:
:g/NullPointer/,/omitt/d
Example:
Before:
1
2
3
NullPointerException1
4
5
6
omitted
7
NullPointerException2
8
9
omitted
10
After:
1
2
3
7
10
Please read :h edit-paragraph-join; there is a good explanation of the command there. Your case just changes join into d.
:g/whatever/d2
will delete a line containing whatever and the line after it. If you can find text that always occurs in the first line, you can strip out all of the following lines, provided the block always has the same number of lines, by changing 2 to whatever count you need.
You could actually just use some normal-mode commands inside a global command to achieve what you want. Look at your example (I hope I understood it more or less right):
someText
NullPointerException
...
omitted
You want to delete from the line above the NullPointerException until the line with omitted, right?
Just use the following:
:g/NullPointerException/execute "normal! kddd/omitted\<cr>dd"
It maybe looks complex, but it isn't. It is not better than a macro [1], but I like commands more, because I always make errors when recording macros. Since it only uses normal Vim movements, it is easy to adapt. If you, for example, don't know where your previous anchor is, you could use ?anchor\<cr> instead of kd. For a better demonstration you would have to submit a realistic example.
[1] You could argue that this only needs to be run once, but that is also true for a recursive macro: http://vim.wikia.com/wiki/Record_a_recursive_macro
Thanks to the answers here, I was able to code a very handy function: the source below lets you select text and remove all lines with the same (or similar) text from the current buffer. It works with both in-line and multiline selections. As I said, I was searching for something that would make me faster at analyzing log files. Log files typically contain dates and times, and these change all the time, so it's a good idea to have something that lets us ignore numbers. Let's see. I'm using these two mappings:
vnoremap d :<C-U>echo RemoveSelectionFromBuffer(0)<CR>
vnoremap D :<C-U>echo RemoveSelectionFromBuffer(1)<CR>
Typical usage:
Remove similar lines ignoring numbers: Shift+v, then Shift+d
Remove same matches (single line): Mark text inline (leaving out dates and times), then d
Remove same matches (multiline): Mark text across lines (leaving out dates and times), then d
Here's the source code:
" Removes lines matching the selected text from buffer.
function! RemoveSelectionFromBuffer(ignoreNumbers)
    let lines = GetVisualSelection() " selected lines
    " Escape backslashes and slashes (delimiters)
    call map(lines, {k, v -> substitute(v, '\\\|/', '\\&', 'g')})
    if a:ignoreNumbers == 1
        " Substitute all numbers with \s*\d\s* - in formatted output matching
        " lines may have whitespace instead of numbers. All backslashes need
        " to be escaped because \V (very nomagic) will be used.
        call map(lines, {k, v -> substitute(v, '\s*\d\+\s*', '\\s\\*\\d\\+\\s\\*', 'g')})
    endif
    let blc = line('$')              " number of lines in buffer (before deletion)
    let vlc = len(lines)             " number of selected lines
    let pattern = join(lines, '\_.') " support multiline patterns
    let cmd = ':g/\V' . pattern . '/d_' . vlc " delete matching lines (d_3)
    let pos = getpos('v')            " save position
    execute "silent " . cmd
    call setpos('.', pos)            " restore position
    let dlc = blc - line('$')        " number of deleted lines
    let dmc = dlc / vlc              " number of deleted matches
    let cmd = substitute(cmd, '\(.\{50\}\).*', '\1...', '') " command output
    let lout = dlc . ' line' . (dlc == 1 ? '' : 's')
    let mout = '(' . dmc . ' match' . (dmc == 1 ? '' : 'es') . ')'
    return printf('%s removed: %s', (vlc == 1 ? lout : lout . ' ' . mout), cmd)
endfunction
I took the GetVisualSelection() code from this answer.
function! GetVisualSelection()
    if mode() == "v"
        let [line_start, column_start] = getpos("v")[1:2]
        let [line_end, column_end] = getpos(".")[1:2]
    else
        let [line_start, column_start] = getpos("'<")[1:2]
        let [line_end, column_end] = getpos("'>")[1:2]
    end
    if (line2byte(line_start)+column_start) > (line2byte(line_end)+column_end)
        let [line_start, column_start, line_end, column_end] =
        \   [line_end, column_end, line_start, column_start]
    end
    let lines = getline(line_start, line_end)
    if len(lines) == 0
        return ''
    endif
    let lines[-1] = lines[-1][: column_end - 1]
    let lines[0] = lines[0][column_start - 1:]
    return lines
endfunction
Thanks, aepksbuck, DoktorOSwaldo and Kent.

Regex with Replace String Python

I have this situation: I have a sentence with misplaced dots (.) to process. The sentence:
sentence = 'Hi. Long time no see .how are you ?can you follow .#abcde?'
I am trying to normalize this sentence; as you can see, there are some wrongly formatted parts (.how, ?can, and .#abcde). I am thinking of using regex to handle this because the sentence keeps changing. This is my code so far:
import re

character = ['.','?','#']
sentence = 'Hi. Long time no see .how are you ?can you follow .#abcde?'
sentence = str(sentence)
for i in character:
    charac = str(i)
    charac_after = re.findall(r'\\'+charac+r'\S*', sentence)
    if charac_after:
        print("Exist")
        sentence = sentence.replace(charac, charac+' ')
print(sentence)
The result somehow skips the dot (.) and the hash (#), and only processes the question mark (?). This is the result:
Exist
Hi. Long time no see .how are you ? can you follow .#abcde?
It's supposed to be "Hi. Long time no see . how are you ? can you follow . # abcde?". I don't know if the double backslash in r'\\'+charac+r'\S*' is wrong or something; did I miss something?
How can I process all the characters? Please help.
Without any knowledge of Python, I think you need to do it like this:
(as per suggestion from #Sebastian Proske)
import re

character = ['.','?','#']
sentence = str('Hi. Long time no see .how are you ?can you follow .#abcde?')
sentence = re.sub(r'([' + ''.join(map(re.escape, character)) + r'])(?=\S)', r'\1 ', sentence)
print(sentence)
I am not sure about the code, but the regex works; see here:
https://regex101.com/r/HXdeuK/2
see demo here https://repl.it/Fw5b/3
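For what it's worth, if I read the pattern right, running that substitution on the sentence from the question should print: Hi. Long time no see . how are you ? can you follow . # abcde? - the trailing ? stays untouched because the (?=\S) lookahead requires a non-space character after the punctuation.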

Python script to extract data from text file

I have a text file which has a list of website links, like
test.txt:
http://www.site1.com/
http://site232546ee.com/
https://www.site3eiue213.org/
http://site4.biz/
I want to make a simple Python script which can extract only site names with a length of up to 8 characters... no name longer than 8 characters... The output should be like:
output.txt:
site1
site2325
site3eiu
site4
I have written some code:
txt1 = open("test.txt").read()
txt2 = txt1.split("http://www.")
f = open('output.txt', 'w')
for us in txt2:
    f.write(us)
print './done'
But I don't know how to split() on more than one delimiter in one line... I also tried the re module, but I couldn't figure out how to write the code for it.
Can someone please help me make this script? :(
You can achieve this using a regular expression as below.
import re

no = 8
regesx = "\\bhttp://www.|\\bhttp://|\\bhttps://www."
text = "http://site232546ee.com/"
match = re.search(regesx, text)
start = match.end(0)
end = start + no
string1 = text[start:end]
end = string1.find('.')
if end > 0:
    final = string1[0:end]
else:
    final = string1
print(final)
You said you want to extract site names with 8 characters, but the output.txt example shows bits of domain names. If you want to filter out domain names which have eight or less characters, here is a solution.
Step 1: Get all the domain names.
import tldextract
import pandas as pd

text_s = ''
list_u = ('http://www.site1.com/', 'http://site232546ee.com/', 'https://www.site3eiue213.org/', 'http://site4.biz/')
#http:\//www.(\w+).*\/?
for l in list_u:
    extracted = tldextract.extract(l)
    text_s += extracted.domain + ' '
print(text_s)  # gives a string of domain names delimited by whitespace
Step 2: filter domain names with 8 or fewer characters.
word = text_s.split()
lent = [len(x) for x in text_s.split()]
word_len_list = pd.DataFrame(
    {'words': word,
     'char_length': lent,
    })
word_len_list[(word_len_list.char_length <= 8)]
Output looks like this:
   words  char_length
0  site1            5
3  site4            5
Disclaimer: I am new to Python. Please ignore any unnecessary and/or stupid steps I may have written
Have you tried printing txt2 before doing anything with it? You will see that it did not do what (I expect) you wanted it to do, since there's only one "http://www." available in the text. Try to split at a newline \n. That way you get a list of all the urls.
Then, for each url you'll want to strip the front and back, which you can do with regular expression but which can be quite hard, depending on what you want to be able to strip off. See here.
When you have found a regular expression that works for you, simply check each domain for its length and, using an if statement, write the domains that satisfy your condition to a file (if len(domain) <= 8: f.write(domain)).
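A minimal sketch of that suggestion, assuming the regular expressions below are good enough for the four URLs in the question (note that, like the tldextract answer above, this filters by length rather than truncating names to 8 characters):

import re

with open('test.txt') as infile, open('output.txt', 'w') as outfile:
    for url in infile.read().split('\n'):               # one URL per line
        if not url:
            continue
        # strip "http://", "https://" and an optional "www." from the front
        domain = re.sub(r'^https?://(www\.)?', '', url)
        # drop everything from the first dot onwards (".com/", ".org/", ...)
        domain = re.sub(r'\..*$', '', domain)
        if len(domain) <= 8:
            outfile.write(domain + '\n')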