While this seems very sketchy, I am doing this for my MGMT288 class: I am trying to write a program that searches a block of copied text for SSNs. I have very little Python background, and am just exploring regex and, by extension, pyperclip. Currently my code in its entirety looks like this:
import re,pyperclip
SSNREG=re.compile(r'(\d{3})(-)?(\d{2})(-)?(\d{4})')
SSN=[]
CB=pyperclip.paste()
for groups in SSNREG.findall(CB):
    SSN.append(groups[0])
if len(SSN)>0:
    pyperclip.copy('\n'.join(CB))
    print('Copied '+len(CB)+' SSN\'s to clipboard!')
    print('\n'.join(CB))
else:
    print('There were no SSN\'s to be found in the text.')
Whenever I have a 3-2-4 digit number copied with the dashes, it still prints out that there were no SSN's in the clipboard, and I can't figure out what's wrong.
I've just changed the /d to \d and it still doesn't seem to find anything.
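A few things in the posted code would trip it up regardless of the pattern. Because the pattern contains groups, findall returns one tuple per match, so groups[0] is only the first three digits. '\n'.join(CB) joins the characters of the clipboard string rather than the SSN list, and 'Copied '+len(CB) adds an int to a str, which raises a TypeError. Below is a minimal sketch of one way to restructure it, assuming the same 3-2-4 pattern with optional dashes. The names find_ssns and run_on_clipboard are illustrative, not from the post.

```python
import re

# Same 3-2-4 shape as the question, but with no capturing groups,
# so findall() returns the full matched text instead of tuples.
SSN_RE = re.compile(r'\d{3}-?\d{2}-?\d{4}')


def find_ssns(text):
    """Return every SSN-shaped substring found in text."""
    return SSN_RE.findall(text)


def run_on_clipboard():
    """Read the clipboard, copy the matches back, and report the count."""
    import pyperclip  # third-party; imported lazily so find_ssns works without it

    found = find_ssns(pyperclip.paste())
    if found:
        pyperclip.copy('\n'.join(found))
        print('Copied ' + str(len(found)) + " SSNs to clipboard!")
        print('\n'.join(found))
    else:
        print('There were no SSNs to be found in the text.')


sample = 'first 123-45-6789, then 987654321, then 12-34'
ssns = find_ssns(sample)
print(ssns)
```

Calling run_on_clipboard() after copying some text should then behave the way the original script was meant to.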
Using C++, I've managed to follow the Xapian tutorial found here.
https://getting-started-with-xapian.readthedocs.io/en/latest/practical_example/index.html#
The indexer program works as I expect it to, but the search program - https://getting-started-with-xapian.readthedocs.io/en/latest/practical_example/searching/building.html - works only with a caveat.
When, for example, I run the equivalent of:
python2 code/python/search1.py db Dent watch
No matches are found, unless I instead write the following:
python2 code/python/search1.py db '"Dent" "watch"'
Which works as well as I expect. The problem is in not quite knowing why it works (though I know the '"' symbol is a search query modifier of some kind), and in how I should aim to prepare queries for processing.
For example, does the Xapian::QueryParser class constructor have an option to add the '"' symbols for me? Or should I preprocess input before I try to retrieve matches?
For the record, calling queryParser.parse_query(input, queryParser.FLAG_PHRASE) appears to fix the issue I had.
I have a text file in Notepad++ that contains about 66,000 words all in 1 line, and it is a set of 200 "lines" of output that are all unique and placed in 1 line in the basic JSON form {output:[{output1},{output2},...}]}.
There is a set of characters matching the RegEx expression "id":.........,"kind":"track" that occurs about 285 times in total, and I am trying to either single them out, or copy all of them at once.
Basically, without some super complicated RegEx terms, I am stuck because I can't figure out how to highlight all of them at once, and also the Remove Unbookmarked Lines feature does not apply because this is all in one line. I have only managed to be able to Mark every single occurrence.
So does this require a large number of steps to get the file into multiple lines and work from there, or is there something else I am missing?
Edit: I have come up with a set of Macro schemes that make the process of doing this manually work much faster. It's another alternative but still takes a few steps and quite some time.
Edit 2: I intended there to be an answer for actually just highlighting the different sections all at once, but I guess that is not possible. The answer here turns out to be more useful in my case, allowing me to have a list of IDs without everything else.
You seem to already have a regex which matches single instances of your pattern, so assuming it works and that we must use Notepad++ for this:
Replace .*?("id":.........,"kind":"track").*?(?="id":.........,"kind":"track"|$) with \1.
If this textfile is valid JSON, this opens you up to other, non-notepad++ options, like using Python with the json module.
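As a rough illustration of the Python route, here is a sketch that pulls the matches out with the re module rather than json, since the exact JSON schema is not shown in the question. It assumes the nine dots in the question's pattern stand for the ID value, quotes included. The sample string and the name extract_track_ids are made up for the example.

```python
import re

# The question's pattern, with the nine wildcard characters captured as the ID.
TRACK_ID = re.compile(r'"id":(.{9}),"kind":"track"')


def extract_track_ids(text):
    """Return the nine-character ID field of every track entry in text."""
    return TRACK_ID.findall(text)


# Hypothetical one-line input in the spirit of the question.
sample = (
    '{"output":[{"id":"abc1234","kind":"track","x":1},'
    '{"id":"zzz9999","kind":"album"},'
    '{"id":"def5678","kind":"track"}]}'
)

ids = extract_track_ids(sample)
print('\n'.join(ids))
```

Reading the real file with open(...).read() and passing the result to extract_track_ids would give one ID per list entry, without needing to split the file into lines first.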
Edited to remove unnecessary steps
I am trying to find a string in Word. I can see 3 of the strings in the document; however, the remaining 600+ of them are not visible.
I'm trying to search using (this is the regex in the external tool I used initially):
(ABC-\d+)
Using a tool to search in Word I searched for
(ABC.*)
and all of the results ended up being some form of the following:
ABCNormal -13
I don't have a clue how to find out what that even means in this context.
I tried searching IN Word for the following REGEX and it doesn't find any except the 3 that don't have the "normal " thing.
ABC?#[0-9]#
That should mean: look for ABC, then some number of characters, then some number of digits.
I have tried turning on hidden text etc. within the display options, the paragraph icon on the ribbon, anything I can think of.
Any ideas how to figure out how to SEE what this is, and either fix it, or work around it?
In the external tool [(ABC)[^0-9]+(\d+)] finally worked, but I still don't understand how to remove the Normal Text that is in the string that is NOT visible.
For example the string I visibly see
ABC-13
the text Regex is seeing is
ABCNormal -13
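Outside Word, the pattern that finally worked can also be used to rewrite each hit into the form that is visible on screen. The sketch below assumes the outer square brackets shown around the pattern above are just delimiters from the external tool and not part of the regex. The name normalize is hypothetical. It does not explain where the hidden "Normal " text comes from; it only works around it.

```python
import re

# ABC, then any run of non-digits (e.g. "Normal -" or just "-"), then the number.
ABC_REF = re.compile(r'(ABC)[^0-9]+(\d+)')


def normalize(text):
    """Rewrite every match as ABC-<number>, dropping whatever sat in between."""
    return ABC_REF.sub(r'\1-\2', text)


cleaned = normalize('see ABCNormal -13 and ABC-7')
print(cleaned)
```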
I'm having some trouble with my regex.
I have some lines like this:
SomeText#"C:\\","Shadow Copy Components:\\","E:\\",""
SomeText#"D:\\"
SomeText#"E:\\","Shadow Copy Components:\\"
SomeText#"SET SNAP_ID=serv.a.x.com_1380312019","BACKUP H:\\ USING \\\\?\\GLOBALROOT\\Device\\HarddiskVolumeShadowCopy47\\ OPTIONS:ALT_PATH_PREFIX=c:\\VERITAS\\NetBackup\\temp\\_vrts_frzn_img_3200\"
SomeText#"SET SNAP_ID=serv.a.x.com_1380312019","BACKUP Y:\\Libs USING \\\\?\\GLOBALROOT\\Device\\HarddiskVolumeShadowCopy47\\ OPTIONS:ALT_PATH_PREFIX=c:\\VERITAS\\NetBackup\\temp\\_vrts_frzn_img_3200\"
What I would like is a group named jobFileList containing, for each line:
"C:\\","Shadow Copy Components:\\","E:\\",""
"D:\\"
"E:\\","Shadow Copy Components:\\"
H:\\
Y:\\Libs
You can see I only want the file list, but sometimes it is the full text after the # mark and sometimes there is a lot of other text that I need to remove.
The fact is I can't use a script in this case, so I need to do this with only ONE regexp; I can't just do a string replace on the leftovers after the regex.
What I did is:
SomeText(#.*BACKUP (?P<jobFileList>.*?) .*)?(#(?P<jobFileList>.*))?
But it seems I can't use the same group name twice :( If I replace the second jobFileList with another name it works perfectly, but that is not what I need.
Thanks for your help,
EDIT:
I can also have some lines like:
SomeText#/ahol5d72_1_2
SomeText#/p7ol4a1p_1_2
SomeText#Gvadag04SANDsk_Daily
SomeText#/bck_reco_a9ol5765_1_2_827497669
In all these cases I need all the text after the # mark.
A version which doesn't rely on the double quotes after the double backslash:
SomeText#(?:(.*?BACKUP) )?(?P<jobFileList>(?(1)[^ ]*|.*$))
This: (?(1)[^ ]*|.*$) is a conditional group that is supported in Python 2.7.5 (probably works for higher versions but I don't know for previous ones). If there's BACKUP, it grabs all the non-spaces and if there's no BACKUP, it grabs everything till the end of the string.
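To see the conditional group in action, here is a small Python sketch that runs the pattern above against three kinds of input. The sample lines are taken from the question, with the long BACKUP line shortened for readability; the helper name job_file_list is made up.

```python
import re

# Pattern from the answer: group 1 is set only when "BACKUP " is present,
# and the conditional (?(1)...|...) picks the branch accordingly.
JOB = re.compile(r'SomeText#(?:(.*?BACKUP) )?(?P<jobFileList>(?(1)[^ ]*|.*$))')


def job_file_list(line):
    """Return the jobFileList group for a line, or None if it does not match."""
    m = JOB.match(line)
    return m.group('jobFileList') if m else None


samples = [
    r'SomeText#"D:\\"',
    r'SomeText#"SET SNAP_ID=x","BACKUP Y:\\Libs USING \\\\?\\GLOBALROOT\\Device"',
    r'SomeText#/ahol5d72_1_2',
]

results = [job_file_list(s) for s in samples]
for r in results:
    print(r)
```

The first and third lines have no BACKUP, so the whole remainder is captured; the second stops at the first space after the path.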
regex101 demo
EDIT: As per comment, the regex that worked after @timmalos' modifications:
\#(?P<G>.*?[^E]BACKUP\s)?(?P<G2>f:\\\\Mailbox\\\)?(?P<jobFileList>(?(G)(?(G2)[^\]|\S)*|.*))
It is possible to match this with a single regular expression; however, I know nothing of Splunk. Maybe this will help:
("?[A-Z]:\\\\(?:".+|\S+)?)
Live demonstration here
I have a txt file that I’m trying to import as flat file into SQL2008 that looks like this:
"123456","some text"
"543210","some more text"
"111223","other text"
etc…
The file has more than 300.000 rows and the text is large (usually 200-500 chars), so scanning the file by hand is very time consuming and prone to error. Other similar (and even more complex files) were successfully imported.
The problem with this one is that "some lines" contain quotes in the text (this came from an export from an old SuperBase DB that didn't let you specify a text qualifier; there's nothing I can do with the file other than clean it and try to import it).
So the “offending” lines look like this:
"123456","this text "contains" a quote"
"543210","And the "above" text is bad"
etc…
You can see the problem here.
Now, 300.000 is not too much if I could perform a search using a text editor that can use regex, I’d manually remove the quotes from each line. The problem is not the number of offending lines, but the impossibility to find them with a simple search. I’m sure there are less than 500, but spread those in a 300.000 lines txt file and you know what I mean.
Based upon that, what would be the best regex I could use to identify these lines?
My first thought is: Tell me which lines contain more than 4 quotes (").
But I couldn’t come up with anything (I’m not good at Regex beyond the basics).
This pattern, ^("[^"]+){4,}, will match "lines containing more than 4 quotes".
You can experiment with replacing 4 with 5 or more, depending on your data.
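If a scripting language is available, the same pattern can be tried line by line. The sketch below uses Python with straight double quotes and the sample rows from the question; the name has_extra_quotes is hypothetical. One caveat worth knowing: [^"] also matches a newline, so the line ending has to be stripped first, or a clean row's closing quote plus newline counts as a fourth run.

```python
import re

# Four or more runs of: a quote followed by at least one non-quote character.
MANY_QUOTES = re.compile(r'^("[^"]+){4,}')


def has_extra_quotes(line):
    """True if the row has quotes beyond the four that delimit its two fields."""
    # Strip the line ending so the final closing quote is not followed by "\n".
    return MANY_QUOTES.match(line.rstrip('\r\n')) is not None


rows = [
    '"123456","some text"\n',
    '"123456","this text "contains" a quote"\n',
]

flags = [has_extra_quotes(r) for r in rows]
print(flags)
```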
I think that you can be more direct with a Regex than you're planning to be. Depending on your dialect of Regex, something like this should do it:
^"\d+",".*".*"
You could also use a regex to remove the outside quotes and use a better delimiter instead. For example, search for ^"([0-9]+)","(.*)"$ and replace it with \1+++++DELIM+++++\2.
Of course, this doesn't directly answer your question, but it might solve the problem.
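Both ideas from this answer can be sketched in Python, assuming the file uses straight double quotes and one record per line. The function names are illustrative, and the delimiter string is the one suggested above.

```python
import re

# Direct test: after the second field opens, at least two more quotes follow.
BAD_LINE = re.compile(r'^"\d+",".*".*"')
# Outer-quote stripper: keep the number and everything between the outer quotes.
OUTER = re.compile(r'^"([0-9]+)","(.*)"$')


def offending_line_numbers(lines):
    """Return 1-based numbers of the lines that contain embedded quotes."""
    return [n for n, line in enumerate(lines, start=1) if BAD_LINE.match(line)]


def redelimit(line):
    """Drop the outer quotes and join the two fields with an unlikely delimiter."""
    return OUTER.sub(r'\1+++++DELIM+++++\2', line)


lines = [
    '"123456","some text"',
    '"543210","And the "above" text is bad"',
    '"111223","other text"',
]

bad = offending_line_numbers(lines)
fixed = redelimit(lines[1])
print(bad)
print(fixed)
```

With the re-delimited output, the embedded quotes no longer matter to the importer, as long as the chosen delimiter never occurs in the text itself.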