Regular expression to get only the first word from each line - regex

I have a text file
#sp_id int,
#sp_name varchar(120),
#sp_gender varchar(10),
#sp_date_of_birth varchar(10),
#sp_address varchar(120),
#sp_is_active int,
#sp_role int
Here, I want to get only the first word from each line. How can I do this? The spaces between the words may be space or tab etc.

Here is what I suggest:
Find what: ^([^ \t]+).*
Replace with: $1
Explanation: ^ matches the start of line, ([^ \t]+) matches 1 or more (due to +) characters other than space and tab (due to [^ \t]), and then any number of characters up to the end of the line with .*.
See settings:
In case you might have leading whitespace, you might want to use
^\s*([^ \t]+).*

I did something similar with this:
with open('handles.txt', 'r') as handles:
handlelist = [line.rstrip('\n') for line in handles]
newlist = [str(re.findall("\w+", line)[0]) for line in handlelist]
This gets a list containing all the lines in the document,
then it changes each line to a string and uses regex to extract the first word (ignoring white spaces)
My file (handles.txt) contained info like this:
JoIyke - personal twitter link;
newMan - another twitter handle;
yourlink - yet another one.
The code will return this list:
[JoIyke, newMan, yourlink]

Find What: ^(\S+).*$
Replace by : \1
You can simply use this to get the first word.Here we are capturing the first word in a group and replace the while line by the captured group.

Find the first word of each line with /^\w+/gm.

Related

Remove duplicate lines containing same starting text

So I have a massive list of numbers where all lines contain the same format.
#976B4B|B|0|0
#970000|B|0|1
#974B00|B|0|2
#979700|B|0|3
#4B9700|B|0|4
#009700|B|0|5
#00974B|B|0|6
#009797|B|0|7
#004B97|B|0|8
#000097|B|0|9
#4B0097|B|0|10
#970097|B|0|11
#97004B|B|0|12
#970000|B|0|13
#974B00|B|0|14
#979700|B|0|15
#4B9700|B|0|16
#009700|B|0|17
#00974B|B|0|18
#009797|B|0|19
#004B97|B|0|20
#000097|B|0|21
#4B0097|B|0|22
#970097|B|0|23
#97004B|B|0|24
#2C2C2C|B|0|25
#979797|B|0|26
#676767|B|0|27
#97694A|B|0|28
#020202|B|0|29
#6894B4|B|0|30
#976B4B|B|0|31
#808080|B|1|0
#800000|B|1|1
#803F00|B|1|2
#808000|B|1|3
What I am trying to do is remove all duplicate lines that contain the same hex codes, regardless of the text after it.
Example, in the first line #976B4B|B|0|0 the hex #976B4B shows up in line 32 as #976B4B|B|0|31. I want all lines EXCEPT the first occurrence to be removed.
I have been attempting to use regex to solve this, and found ^(.*)(\r?\n\1)+$ $1 can remove duplicate lines but obviously not what I need. Looking for some guidance and maybe a possibility to learn from this.
You can use the following regex replacement, make sure you click Replace All as many times as necessary, until no match is found:
Find What: ^((#[[:xdigit:]]+)\|.*(?:\R.+)*?)\R\2\|.*
Replace With: $1
See the regex demo and the demo screenshot:
Details:
^ - start of a line
((#[[:xdigit:]]+)\|.*(?:\R.+)*?) - Group 1 ($1, it will be kept):
(#[[:xdigit:]]+) - Group 2: # and one or more hex chars
\| - a | char
.* - the rest of the line
(?:\R.+)*? - any zero or more non-empty lines (if they can be empty, replace .+ with .*)
\R\2\|.* - a line break, Group 2 value, | and the rest of the line.

How can I delete the rest of the line after the second pipe character "|" for every line with python?

I am using notepad++ and I want to get rid of everything after one second (including the second pipe character) for every line in my txt file.
Basically, the txt file has the following format:
3.1_1.wav|I like apples.|I like apples|I like bananas
3.1_2.wav|Isn't today a lovely day?|Right now it is 1 in the afternoon.|....
The result should be:
3.1_1.wav|I like apples.
3.1_2.wav|Isn't today a lovely day?
I have tried using \|.* but then everything after the first pipe character is matched.
In Notepad++ do this:
Find what: ^([^\|]*\|[^\|]*).*
Replace with: $1
check "Regular expression", and "Replace All"
Explanation:
^ - anchor at start of line
( - start group, can be referenced as $1
[^\|]* - scan over any character other than |
\| - scan over |
[^\|]* - scan over any character other than |
) - end group
.* - scan over everything until end of line
in replace reference the captured group with $1
I'm not sure if this is the best way to do it, but try this:
[^wav]\|.*

startWith with regex kotlin [duplicate]

I am trying to work on regular expressions. I have a mainframe file which has several fields. I have a flat file parser which distinguishes several types of records based on the first three letters of every line. How do I write a regular expression where the first three letters are 'CTR'.
Beginning of line or beginning of string?
Start and end of string
/^CTR.*$/
/ = delimiter
^ = start of string
CTR = literal CTR
$ = end of string
.* = zero or more of any character except newline
Start and end of line
/^CTR.*$/m
/ = delimiter
^ = start of line
CTR = literal CTR
$ = end of line
.* = zero or more of any character except newline
m = enables multi-line mode, this sets regex to treat every line as a string, so ^ and $ will match start and end of line
While in multi-line mode you can still match the start and end of the string with \A\Z permanent anchors
/\ACTR.*\Z/m
\A = means start of string
CTR = literal CTR
.* = zero or more of any character except newline
\Z = end of string
m = enables multi-line mode
As such, another way to match the start of the line would be like this:
/(\A|\r|\n|\r\n)CTR.*/
or
/(^|\r|\n|\r\n)CTR.*/
\r = carriage return / old Mac OS newline
\n = line-feed / Unix/Mac OS X newline
\r\n = windows newline
Note, if you are going to use the backslash \ in some program string that supports escaping, like the php double quotation marks "" then you need to escape them first
so to run \r\nCTR.* you would use it as "\\r\\nCTR.*"
^CTR
or
^CTR.*
edit:
To be more clear: ^CTR will match start of line and those chars. If all you want to do is match for a line itself (and already have the line to use), then that is all you really need. But if this is the case, you may be better off using a prefab substr() type function. I don't know, what language are you are using. But if you are trying to match and grab the line, you will need something like .* or .*$ or whatever, depending on what language/regex function you are using.
Regex symbol to match at beginning of a line:
^
Add the string you're searching for (CTR) to the regex like this:
^CTR
Example: regex
That should be enough!
However, if you need to get the text from the whole line in your language of choice, add a "match anything" pattern .*:
^CTR.*
Example: more regex
If you want to get crazy, use the end of line matcher
$
Add that to the growing regex pattern:
^CTR.*$
Example: lets get crazy
Note: Depending on how and where you're using regex, you might have to use a multi-line modifier to get it to match multiple lines. There could be a whole discussion on the best strategy for picking lines out of a file to process them, and some of the strategies would require this:
Multi-line flag m (this is specified in various ways in various languages/contexts)
/^CTR.*/gm
Example: we had to use m on regex101
Try ^CTR.\*, which literally means start of line, CTR, anything.
This will be case-sensitive, and setting non-case-sensitivity will depend on your programming language, or use ^[Cc][Tt][Rr].\* if cross-environment case-insensitivity matters.
^CTR.*$
matches a line starting with CTR.
Not sure how to apply that to your file on your server, but typically, the regex to match the beginning of a string would be :
^CTR
The ^ means beginning of string / line
There's are ambiguities in the question.
What is your input string? Is it the entire file? Or is it 1 line at a time? Some of the answers are assuming the latter. I want to answer the former.
What would you like to return from your regular expression? The fact that you want a true / false on whether a match was made? Or do you want to extract the entire line whose start begins with CTR? I'll answer you only want a true / false match.
To do this, we just need to determine if the CTR occurs at either the start of a file, or immediately following a new line.
/(?:^|\n)CTR/
(?i)^[ \r\n]*CTR
(?i) -- case insensitive -- Remove if case sensitive.
[ \r\n] -- ignore space and new lines
* -- 0 or more times the same
CTR - your starts with string.

How remove 1st ":" word from line in txt file?

Please see my textfile data below
roydwk27:teenaibuchytilibu5762sumonkhan:IJQRiq&76:8801627574057
deonnarsi15:latashajcclaypoolejcv5946sumonkhan:JKVWjv&20:8801627573929
ernaalo68:lindaohschletteoha1797sumonkhan:OPYZoy&84:8801628302709
dorathyshi56:fredrickaslperkinsonsle8932sumonkhan:STJKsj&30:8801621846709
londassg15:nataliaunmcredmondung5478sumonkhan:UVDEud&61:8801624792536
xiaoexu39:miriamfyboatwrightfyr3810sumonkhan:IJZAiz&47:8801626854856
I am want delete first word until :
like
roydwk27:
deonnarsi15:
ernaalo68:
dorathyshi56:
actually I am want if sumonkhan starting line then no problem but if sumonkhan line area 1st position available : with something then need remove this.
below actually data show in my .txt file
nataliaunmcredmondung5478sumonkhan:UVDEud&61:8801624792536
miriamfyboatwrightfyr3810sumonkhan:IJZAiz&47:8801626854856
all line available sumonkhan so if sumon khan starting position like this then good else delete this : full word not full line.
I hope this regex would help you. This regex deletes everything until first colon(:).
If you are reading a file then, read it line by line and run following regex on each line.
$str = 'roydwk27:teenaibuchytilibu5762sumonkhan:IJQRiq&76:8801627574057';
$str =~ s/^(?:.*?):(.*)/$1/g;
This code is in perl, you can re-write equivalent code in any other language.
See this demo at regex101.com.
^[\w\d]+:(.*)
^ // match the beginning of a line
[\w\d]+ // match any letter and any number
: // match ":" literally
( // start of the capturing group
.* // match any characters
) // end of capturing group
Now in all your matches in the first group you have the text you want matched. Note the g (global) and m (multiline) modifiers.

Regular Expression: Extract the lines

I try to extract the name1 (first-row), name2 (second-row), name3 (third-row) and the street-name (last-row) with regex:
Company Inc.
JohnDoe
Foobar
Industrieterrein 13
The very last row is the street name and this part is already working (the text is stored in the variable "S2").
REGEXREPLACE(S2, "(.*\n)+(?!(.*\n))", "")
This expression will return me the very last line. I am also able the extract the first row:
REGEXREPLACE(S2, "(\n.*)", "")
My problem is, that I do not know how to extract the second and third row....
Also how do I test if the text contains one, two, three or more rows?
Update:
The regex is used in the context of Scribe (a ETL tool). The problem is I can not execute sourcecode, I only have the following functions:
REGEXMATCH(input, pattern)
REGEXREPLACE(input, pattern, replacement)
If the regex language provides support for lookaheads you may count rows backwards and thus get (assuming . does not match newline)
(.*)$ # matching the last line
(.*)(?=(\n.*){1}$) # matching the second last line (excl. newline)
(.*)(?=(\n.*){2}$) # matching the third last line (excl. newline)
just use this regex:
(.+)+
explain:
.
Wildcard: Matches any single character except \n.
+
Matches the previous element one or more times.
As for a regular expression that will match each of four rows, how about this:
(.*?)\n(.*?)\n(.*?)\n(.*)
The parentheses will match, and the \n will match a new line. Note: you may have to use \r\n instead of just \n depending; try both.
You can try the following:
((.*?)\n){3}