I have a big text file (a message log from Discord that I'm preparing for use in ML).
It can contain content like this:
username1
#username2, whats up
username2
username1, im good, and username3 is also doing well
(The number of unique usernames is relatively small, so I can easily find and replace them manually; the real names are more natural and don't have numbers in them.)
The problem is that I need to delete the lines that contain only a username and nothing else.
So here the result would be:
#username2, whats up
username1, im good, and username3 is also doing well
Background: the library I'm using treats each line as a separate text, so if these lines are not deleted, it would tend to generate usernames rather than messages, because of the sheer quantity of header names.
A better example, taken from the actual file:
adyos
He reacted to invitation
adyos
And by the way, basic nightbot commands are up
adyos
And...
Captain Lea Skywalker
MEE6 is better, adyos 😐
Expected:
He reacted to invitation
And by the way, basic nightbot commands are up
And...
MEE6 is better, adyos 😐
If I understand your needs correctly and you have a small number of usernames, you can use the following:
Ctrl+H
Find what: ^(?:adyos|user2)$\R?
Replace with: LEAVE EMPTY
CHECK Match case
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
^ # beginning of line
(?: # non capture group
adyos # username 1
| # OR
user2 # username 2
) # end group, you can add all your usernames separated by a pipe |
$ # end of line
\R? # optional linebreak
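If the cleanup has to be scripted rather than done in the editor, the same idea can be sketched in Python. This is only a sketch under a few assumptions: the function name drop_username_lines and the username list are made up for illustration, and because Python's re module has no \R, the line break is matched with \r?\n (or the end of the text) instead.

```python
import re

# Hypothetical username list; the real log would supply its own names.
USERNAMES = ["adyos", "Captain Lea Skywalker"]


def drop_username_lines(text, usernames):
    """Remove every line that consists of exactly one of the usernames."""
    # Escape each name so characters like "." or "+" are taken literally.
    alternation = "|".join(re.escape(name) for name in usernames)
    # ^ ... then a line break (LF or CRLF) or the very end of the text.
    pattern = re.compile(rf"^(?:{alternation})(?:\r?\n|\Z)", re.MULTILINE)
    return pattern.sub("", text)


if __name__ == "__main__":
    log = "adyos\nHe reacted to invitation\nadyos\nAnd...\n"
    print(drop_username_lines(log, USERNAMES))
```

A username that appears inside a message (as in "MEE6 is better, adyos") is left alone, because the pattern only matches when the name fills the whole line.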
I'm new at this, so thanks in advance for the help.
I'm looking to write a regex that matches the end of the string but not the beginning; in some cases the string is only one character.
Here are the sample strings. I'm trying to match only the parts shown; otherwise there should be no match.
/en-ca/brand/atf-type-f/ # should match /brand/atf-type-f/
/ # no match
/en-ca # no match
/en-ca/ # no match
/es-xl # no match
/en-gb # no match
/ru-kz/ # no match
/knowledge-centre/sds # should match /knowledge-centre/sds
/en-us/brand/purity-fg # should match /brand/purity-fg
The regex engine I'm using is Google Analytics, and I'm looking to output the Page Path without the country ID and the language ID.
I figured this out.
Using the Advanced Filter within GA, I:
1) Used the regex ^(/..-..)?(/)?(.*)
2) Used Output To -> Constructor to assemble the groups I wanted. Each () within the GA Output Constructor is numbered, so $A1 picks up the first group and so on. Returning just $A3 gave me the path. I had to add the / back in at the beginning, so the output statement became /$A3
Hope this helps someone else.
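To check the pattern outside GA, the same three groups can be tried in Python. This is a rough sketch, not GA's filter engine: strip_locale is a made-up helper name, and "/" + group 3 stands in for the /$A3 constructor.

```python
import re

# Same pattern as in the GA filter: optional "/xx-yy", optional "/", then the rest.
PATH_PATTERN = re.compile(r"^(/..-..)?(/)?(.*)")


def strip_locale(path):
    """Return the path with a leading /xx-yy locale prefix removed."""
    match = PATH_PATTERN.match(path)  # always matches: every part is optional
    return "/" + match.group(3)       # mirrors the /$A3 output constructor


if __name__ == "__main__":
    for path in ["/en-ca/brand/atf-type-f/", "/knowledge-centre/sds", "/en-ca"]:
        print(path, "->", strip_locale(path))
```

Paths that consist only of the locale, such as /en-ca or /ru-kz/, come out as a bare /, since group 3 is empty for them.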
I'm trying to match the username of users on YouNow from a specific field.
I extracted this HTML and am trying to extract the username _You Won:
"\n\t\t\t\t\t\t14\n\t\t\t\t\t\t_You Won\n\t\t\t\t\t"
This is my regex attempt:
(\d+)[\\n\\t]+([\W\w]+[^\\n\\t"$])
This works fine: first I match a number, which is the level, then I match the username. However, if the username ends with either t or n, the last letter is not captured. So the username game 1n would get cut down to game 1.
Does someone know how I can fetch the username correctly?
Play it:
https://regex101.com/r/j8rufa/2
You could use a positive lookahead at the end instead of [^\\n\\t"$].
Your regex would be:
(\d+)[\\nt]+([\W\w]+(?=\\n\\t))
Demo: https://regex101.com/r/j8rufa/4
You can also add a positive lookbehind to ensure that the whole name is matched. For example, a name like t_You Won will then be matched without any issues:
(\d+)[\\nt]+(?<=\\t)([\W\w]+(?=\\n\\t))
Demo: https://regex101.com/r/j8rufa/6
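For a quick check outside regex101, the lookbehind version can be run in Python. This sketch assumes the field really contains literal backslash-n and backslash-t sequences (which is what the \\n and \\t in the patterns imply), so the sample is written as a raw string; extract_level_and_name is just an illustrative name.

```python
import re

# Raw string: the field holds literal "\n" and "\t" two-character sequences.
SAMPLE = r'"\n\t\t\t\t\t\t14\n\t\t\t\t\t\t_You Won\n\t\t\t\t\t"'

# Level, separator run, lookbehind for a literal \t, then the name up to \n\t.
PATTERN = re.compile(r"(\d+)[\\nt]+(?<=\\t)([\W\w]+(?=\\n\\t))")


def extract_level_and_name(field):
    """Return (level, username) or None if the field does not match."""
    match = PATTERN.search(field)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    print(extract_level_and_name(SAMPLE))
```

The lookbehind is what forces the separator run to back off by one character when the name itself starts with t or n.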
I have a file from which I only want to extract the email address and first name of each entry in our client list.
So a sample from the file:
a#abc.com,www.abc.com,2011-11-15 00:00:00,8.8.8.8,John,Doe,209 Park Rd,See,FL,33870,,,
b#abc.com,cde.com,2011-11-07 00:00:00,4.4.4.4,Erickson,Crast,136 Kua St # 1367,Pearl,HI,96782,,8084568190,
I would like to get back
a#abc.com,John
b#abc.com,Erickson
So basically the email address and first name.
I know I can do this in PowerShell, but maybe a find and replace in UltraEdit will be faster.
Note: some fields are not provided, so the line shows ",,", meaning those fields were left empty when the user signed up. The number of commas in each line is always the same: 12.
So basically there are fields separated by ",". Without validating the content (e.g. the email or timestamp would need a certain format, which could also be checked), let's just extract the values of the first and fifth fields.
I'd suggest a replace operation where you search for
^([^,]*),[^,]*,[^,]*,[^,]*,([^,]*),.*$
and replace it with
\1 # \2
Options: "Regular Expressions: Unix".
(I just inserted the # as a separator, although the whitespace alone would be sufficient. But you get the idea, I assume...)
Result:
a#abc.com # John
b#abc.com # Erickson
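Since the question mentions scripting as an alternative, here is roughly how the same search pattern behaves in Python. It is only a sketch: email_and_first_name is a made-up helper, and the replacement uses a plain comma (as in the asker's desired output) instead of the " # " separator above.

```python
import re

# Field 1 and field 5 are captured; fields 2-4 and everything after are skipped.
FIELDS_PATTERN = re.compile(r"^([^,]*),[^,]*,[^,]*,[^,]*,([^,]*),.*$")


def email_and_first_name(line):
    """Reduce one CSV line to 'email,first_name'."""
    return FIELDS_PATTERN.sub(r"\1,\2", line)


if __name__ == "__main__":
    sample = [
        "a#abc.com,www.abc.com,2011-11-15 00:00:00,8.8.8.8,John,Doe,209 Park Rd,See,FL,33870,,,",
        "b#abc.com,cde.com,2011-11-07 00:00:00,4.4.4.4,Erickson,Crast,136 Kua St # 1367,Pearl,HI,96782,,8084568190,",
    ]
    for line in sample:
        print(email_and_first_name(line))
```

This relies on no field containing an embedded comma; if quoted fields with commas can occur, a real CSV parser would be the safer choice.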
I'm trying to parse a gitolite.conf file, which is a whitespace-oriented conf file with a few regexes. The worst problem is that some options might appear anywhere:
#staff = dilbert alice # line 1
#projects = foo bar # line 2
repo #projects baz # line 3
RW+ = #staff # line 4
- master = ashok # line 5
RW = ashok # line 6
R = wally # line 7
config hooks.emailprefix = '[%GL_REPO] ' # line 8
Check the "master" attribute. Some repos have it, others do not. It's a real pain.
This answer assumes a goal of extracting key/value pairs into capturing groups, where key consists of contiguous non-whitespace before = and value includes everything after = but before #, trimmed of leading/trailing whitespace.
Basic version
([^\s]+)\s*=\s*((?:\s*[^\s#]+)*)
More advanced version
The regex above doesn't handle quoted strings very well (e.g. prefix = ' Quoted with # and leading/trailing whitespace '). Regex isn't great at this kind of thing but simple cases can be handled as follows:
([^\s]+)\s*=\s*('[^']*'|"[^"]*"|(?:(?:\s*[^\s#]+)*))
Here's the demo if you need to see what is captured and play around with it more: Debuggex Demo
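If you'd rather try it locally, the more advanced pattern can be applied line by line in Python. Treat this as a sketch: parse_line is a hypothetical helper, and it uses search, so the key is the last whitespace-free token before the = sign.

```python
import re

# Key, "=", then a single-quoted, double-quoted, or bare value that stops before "#".
KEY_VALUE = re.compile(
    r"""([^\s]+)\s*=\s*('[^']*'|"[^"]*"|(?:(?:\s*[^\s#]+)*))"""
)


def parse_line(line):
    """Return (key, value) for a line containing '=', otherwise None."""
    match = KEY_VALUE.search(line)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    for line in [
        "RW = ashok # line 6",
        "config hooks.emailprefix = '[%GL_REPO] ' # line 8",
    ]:
        print(parse_line(line))
```

Quoted values come back with their quotes still attached, and a bare value that begins with # is treated as a comment, so it is captured as an empty string.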
First, you should know that this isn't entirely possible with Regex. Regex is a great tool for parsing regular languages (including some types of configuration files), but as soon as you get into "Well, this line is actually a header line and we need all lines under it, and some lines might have this token, and others might not", it gets quite messy. I'm not saying it's impossible, but you're going to waste a lot of time debugging your Regex pattern instead of just writing a parser in whatever language you're using this with.
Second, if you're going to ask a question about regex, it is always helpful to say what you want out of the expression. Do you want to tokenize everything, do you only want the configuration keys, do you only want the comments?
That being said, I took my best guess, here's an expression to get you started:
^(?:([^=#]+?)\s.?=?\s.?([^=#]+?)\s.?(?:#|$))
With this expression, please apply the g and m flags (global and multiline). In PCRE, this would look like:
/^(?:([^=#]+?)\s.?=?\s.?([^=#]+?)\s.?(?:#|$))/gm
There are two capture groups, one is whatever is before the = sign, and the other is whatever is after. If there is no = sign, the first capture group contains everything. Anything after "#" is ignored.
Here's a fiddle to demonstrate: http://www.rexfiddle.net/eQexbZU
World's most convoluted title, I know; an example should explain it better. I have a large txt file in the format below, though the details and number of lines will change every time:
Username: john_joe Owner: John Joe
Account:
CLI:
Default:
LGICMD:
Flags:
Primary days:
Secondary days:
No access restrictions
Expiration:
Pwdlifetime:
Last Login:
Maxjobs:
Maxacctjobs:
Maxdetach:
Prclm:
Prio:
Queprio:
CPU:
Authorized Privileges:
BYPASS
Default Privileges:
SYSPRV
This sequence is repeated a couple of thousand times for different users. I need to find every user (ideally the entire first line of the above) that has SYSPRV under "Default Privileges".
I know I could write an application to do this; I was just hoping there might be a nice regex I could use.
Cheers
^Username:\s*(\S+)((?!^Username).)*Default Privileges:\s+SYSPRV
with the option to make ^ match start of line, and to make dot match newlines, will isolate those records and capture the username in backreference no. 1. Tell me which language you're using, and I'll provide a code sample.
Explanation:
^Username:\s*: match "Username" at the start of the line, a colon and any whitespace.
(\S+): match one or more non-whitespace characters and capture them into backreference no. 1. This will be the username.
((?!^Username).)*: match any character as long as it is not the start of a line that begins with "Username". This ensures that we won't accidentally cross over into the next record.
Default Privileges:\s+SYSPRV: match the required text.
So in Python, for example, you would use:
import re
result = re.findall(r"(?sm)^Username:\s*(\S+)(?:(?!^Username).)*Default Privileges:\s+SYSPRV", subject)  # (?:...) so findall returns plain usernames, not tuples
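To see that the match does not leak from one record into the next, here is a small self-contained run. The two records are shortened, and the second user (jane_roe with NETMBX) is invented purely for the test. The inner group is non-capturing so that findall returns the usernames as plain strings.

```python
import re

# (?s) lets "." cross newlines, (?m) lets "^" match at each line start.
SYSPRV_USERS = re.compile(
    r"(?sm)^Username:\s*(\S+)(?:(?!^Username).)*Default Privileges:\s+SYSPRV"
)

# Shortened, partly made-up records; the real file has many more fields.
SUBJECT = """Username: jane_roe Owner: Jane Roe
Account:
Authorized Privileges:
BYPASS
Default Privileges:
NETMBX
Username: john_joe Owner: John Joe
Account:
Authorized Privileges:
BYPASS
Default Privileges:
SYSPRV
"""


def find_sysprv_users(subject):
    """Return usernames whose Default Privileges section starts with SYSPRV."""
    return SYSPRV_USERS.findall(subject)


if __name__ == "__main__":
    print(find_sysprv_users(SUBJECT))
```

Note that the pattern only matches when SYSPRV is the first entry after "Default Privileges:"; a record listing it after other privileges would need a looser pattern.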