Parse a text file with no newline using RegEx

I have a text file like below. Every record has 12 fields which are separated by |, but there is no record delimiter like a newline, and every record starts with 555. I am trying to parse it with RegEx.
555|abc|user|2|20120914055204696|20120914055204718|0||||21|33555|def|udp|2|20120914055204696|20120914055204718|0||||22|33555|abc|user|2|20120914055204696|20120914055204718|0||||23|33
I tried with 555(\|.*?\|){12}(\d\d), but it did not work. Can anyone please help me with this?

You can use
555(?:\|[^|]*){11}(?=$|555)
See demo
It will match these records in the input string:
555|abc|user|2|20120914055204696|20120914055204718|0||||21|33
555|def|udp|2|20120914055204696|20120914055204718|0||||22|33
555|abc|user|2|20120914055204696|20120914055204718|0||||23|33
The regex 555(?:\|[^|]*){11}(?=$|555) matches:
555 - literal 555
(?:\|[^|]*){11} - 11 occurrences of | followed by any number of characters other than |
(?=$|555) - a positive lookahead requiring that the match is followed by either the end of the string or the next 555, neither of which is consumed as part of the match.
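As a rough illustration of the pattern above, here is a minimal Python sketch. Python and the variable names (`data`, `record_re`, `records`, `fields`) are assumptions for illustration only, since the question does not say which language is in use. It pulls out the records with `re.findall` and then splits each one on | into its 12 fields.

```python
import re

# Sample input from the question: three records, no record delimiter.
data = (
    "555|abc|user|2|20120914055204696|20120914055204718|0||||21|33"
    "555|def|udp|2|20120914055204696|20120914055204718|0||||22|33"
    "555|abc|user|2|20120914055204696|20120914055204718|0||||23|33"
)

# 555, then 11 "|field" pairs, followed by end of string or the next 555.
record_re = re.compile(r"555(?:\|[^|]*){11}(?=$|555)")

# The pattern has no capturing groups, so findall returns the full matches.
records = record_re.findall(data)

# Each record splits into 12 fields (some of them empty strings).
fields = [record.split("|") for record in records]

for row in fields:
    print(row)
```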

555(?:\|[^|]*?){11}\d\d
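You need to remove the second | from the repeated group: each field is preceded by exactly one |, so (\|.*?\|) consumes two separators per repetition. See demo.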
https://regex101.com/r/sS2dM8/31

Related

Regex to capture alpha numeric before pipe separated

I've been trying to create a regex that matches spaces and alphanumeric values.
Below I'm sharing the sample strings.
Manchester United 8547|12345678910
|12345678910
Manchester |12345678910
124587933 |12345678910
8457 Manchester United|12345678910
Manchester United|12345678910
I want to capture everything before the pipe (|) separator. Sometimes there may be only whitespace and no alphanumeric values before the pipe, as shown in the 2nd example. The regex should not capture the pipe or the numeric value after it (12345678910).
I've tried the regexes below, but none of them work for me.
^.*$
^[\s\w\d]+$
[a-zA-Z0-9\s]+
[a-zA-Z0-9\s\W]+
^[\sa-z|A-Z|0-9]+$
^[\sa-z|A-Z|0-9]+$
[^\s]*$
([^\"]*)
^[a-zA-Z0-9]$
^([^?]*)$
.+?(?=\w)
\s[a-zA-Z0-9]+
^[\sa-zA-Z0-9]+
I need a full match, not a group match.
For example, for Manchester 8457 the regex would be Manchester \d+, which gives me a full match rather than a group match.
You can try this.
input.substring(0,input.indexOf("|"))
If you want to match alphanumeric before the pipe and not get a group match, but a match only, you can use a character class with a positive lookahead (?=\|) (if that is supported) to assert the pipe at the right.
^[A-Za-z0-9 ]+(?=\|)
Regex demo
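The lookahead approach can be tried out with a short Python sketch. Python is an assumption here (the other answers in this thread use Java), and the sample lines are modelled on the question, with a single space before the pipe in the second one. Because the pattern has no capturing group, each result is the full match, and the pipe itself is never included.

```python
import re

# Sample lines modelled on the question; the second has only a space before the pipe.
text = "\n".join([
    "Manchester United 8547|12345678910",
    " |12345678910",
    "Manchester |12345678910",
    "8457 Manchester United|12345678910",
])

# re.MULTILINE makes ^ match at the start of every line.
# (?=\|) requires a pipe to follow but leaves it out of the match.
before_pipe = re.findall(r"^[A-Za-z0-9 ]+(?=\|)", text, flags=re.MULTILINE)

print(before_pipe)
```

Trailing spaces are part of the match here; call `.strip()` on each result if they are not wanted.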
Assuming that every line has a pipe, you could split the input string on line breaks (LF or CRLF), and then extract the portion to the left of the pipe:
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
String input = "Manchester United 8547|12345678910\n |12345678910\nManchester |12345678910\n124587933 |12345678910\n8457 Manchester United|12345678910\n Manchester United|12345678910\n";
// Split into lines (LF or CRLF), then keep the trimmed text left of the first pipe.
String[] parts = input.split("\r?\n");
List<String> contents = Arrays.stream(parts)
    .map(x -> x.split("\\|")[0].trim())
    .collect(Collectors.toList());
System.out.println(contents);
This prints:
[Manchester United 8547, , Manchester, 124587933, 8457 Manchester United, Manchester United]
To get the alphanumeric part, use the following:
^\s*\w(.+?)\|
This should answer your question, I guess.
^(.+?)\|
Please try this one; it matches only from the beginning of the string, and the trailing \| is for the pipe.
Try it here

How to remove specific characters in notepad++ with regex?

This is data present in my .txt file
+919000009998 SMS +919888888888
+919000009998 MMS +91988 88888 88
+919000009998 MMS abcd google
+919000009998 MMS amazon
I want to convert my .txt like this
919000009998 SMS 919888888888
919000009998 MMS 919888888888
919000009998 MMS abcd google
919000009998 MMS amazon
That is, I want to remove the + symbol, and also remove the spaces in the third column, but only if it is a number; if it is a string, nothing should be changed.
Is there a regex for this that I can use in search and replace in Notepad++?
Ctrl+H
Find what: \+|(?<=\d)\h+(?=\d)
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
Replace all
Explanation:
\+ # + sign
| # OR
(?<=\d) # positive lookbehind, make sure we have a digit before
\h+ # 1 or more horizontal spaces
(?=\d) # positive lookahead, make sure we have a digit after
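For anyone who wants to check the same substitution outside Notepad++, here is a minimal Python sketch. It rests on one assumption: Python's `re` module has no `\h`, so `[ \t]` stands in for horizontal whitespace. The variable names are illustrative only.

```python
import re

text = (
    "+919000009998 SMS +919888888888\n"
    "+919000009998 MMS +91988 88888 88\n"
    "+919000009998 MMS abcd google\n"
)

# "+" signs, or runs of spaces/tabs that sit between two digits.
# [ \t] is a stand-in for \h, which Python's re module does not support.
pattern = re.compile(r"\+|(?<=\d)[ \t]+(?=\d)")

# Replace every match with nothing, like "Replace with: (empty)" in Notepad++.
cleaned = pattern.sub("", text)

print(cleaned)
```

The space between the first number and SMS/MMS is untouched because it is followed by a letter, not a digit.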
All the previous answers will work perfectly.
However, I'm adding this in case you need it:
If for some reason you had non-phone values in the third column separated by spaces (a street address comes to mind, e.g. +919000009998 MMS street foo nº 123 4º-B), you may use this regex instead. It joins the digit groups only when the third column starts with +:
Search: ^[+](\S+\s+\S+\s++)(?:([^+][^\n]*)|[+])|\G\s*(\d+)
Replace by: \1\2\3
That avoids joining the 3 and the 4 in my example above.
You have a demo here.

Regex mask with varying ignored characters

I have a series of strings which look something like this:
foobar | ABC Some text 123
barfoo | DEF Some te 456
And I want to mask it such that I get the results
ABC123
DEF456
respectively. The text in between will always be a substring Some text which could potentially contain numbers (e.g. S0m3 t3xt or S0m3 t3). It will always be a substring starting from the left, so never me te.
So clearly I need to start the Regex with something like
(?<=| )[A-Z]{3}
which gets me ABC and DEF but I am at a loss of how to effectively concatenate the numbers at the end of the string.
Is there any way to do this with a single expression?
See http://regexr.com?375u8
(?<=\| )([A-Z]{3}).*(\d{3})
This gives you three characters in the range A-Z and three digits in two capturing groups, which you can concatenate into your desired output: $1$2. Note the escaped pipe: an unescaped | inside the lookbehind is alternation, so (?<=| ) would allow an empty alternative and no longer require the pipe.
This will even work if your Some text contains three digits in between, because the greedy .* leaves only the last three digits for the second group.
In case you want to replace everything with both of your capturing groups, add .* in front of the regex:
.*(?<=\| )([A-Z]{3}).*(\d{3})
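A minimal Python sketch of the two-group approach, under the assumption that the pipe inside the lookbehind is escaped (Python's `re` treats an unescaped | as alternation). The third sample string and the names `samples`, `mask_re` and `masked` are made up for illustration; concatenating the two groups plays the role of the $1$2 replacement.

```python
import re

samples = [
    "foobar | ABC Some text 123",
    "barfoo | DEF Some te 456",
    "foobar | ABC S0m3 t3xt 123",  # hypothetical variant with digits in the text
]

# Lookbehind for "| ", three capitals, anything, then the last three digits.
mask_re = re.compile(r"(?<=\| )([A-Z]{3}).*(\d{3})")

masked = []
for s in samples:
    m = mask_re.search(s)
    # Joining group 1 and group 2 mirrors the $1$2 replacement.
    masked.append(m.group(1) + m.group(2) if m else None)

print(masked)
```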
Another version, in JavaScript:
[
  'foobar | ABC Some text 123',
  'barfoo | DEF Some te 456'
].map(function(v) {
  return v.replace(/^.*\| ([A-Z]{3}) .* (\d{3})$/, '$1$2');
})
Gives
["ABC123", "DEF456"]

Regular expression= tabspace+STRING+tabspace

How can I write this as a regular expression?
tabspaceSTRINGtabspace
My data looks like this:
12345 adsadasdasdasd 30
34562 adsadasdasdasd asdadaads<adasdad 30
12313 adsadasdasdasd asdadas dsaads 313123<font="TNR">adsada 30
1232131 adsadasdasdasd asdadaads<adasdad"asdja <div>asdjaıda 30
I want to get
12345 30
34562 30
12313 30
1232131 30
\t*\t doesn't work.
try the following regular expression
\t.+\t
The problem there is your definition of String...
If you use something like the pattern suggested above, it will also match
tabspaceSTRINGtabspacetabspace
You get the picture. This might be acceptable, if not, you need to limit your "STRING" definition, like:
\t\w+\t
or:
\t[a-zA-Z]+\t
What characters are allowed in your string?
\t\w+\t
\w would allow letters, digits and the underscore (depending on your regex engine ASCII or Unicode)
See it here on Regexr, a good platform to test regular expressions.
Your "regex" \t*\t would match 0 or more tabs and then one tab. The * is a quantifier meaning 0 or more and is referring to the character or group before (here to your \t)
If your whitespace characters are not tabs, try this
\s+.+\s+30
\s is a whitespace character (space, tab, newline (not important for Notepad++)).
If you are not sure about the strings you are looking for, except that they are separated by tabs, a good approach is to describe such a string as everything but a tab, [^\t]*:
[^\t]*\t([^\t]*)\t[^\t]*
You can test it on regexpad.com.
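The "everything but a tab" idea can be sketched in Python as follows. The rows are hypothetical reconstructions of the question's data, assuming exactly three tab-separated columns, and the pattern is a variant of the one above that captures the two outer columns rather than the middle one.

```python
import re

# Hypothetical rows; "\t" marks the tabs that separate the three columns.
rows = [
    "12345\tadsadasdasdasd\t30",
    "34562\tadsadasdasdasd asdadaads<adasdad\t30",
    '12313\tadsadasdasdasd asdadas dsaads 313123<font="TNR">adsada\t30',
]

# Same idea as [^\t]*\t([^\t]*)\t[^\t]*, but capturing the outer columns.
row_re = re.compile(r"^([^\t]*)\t[^\t]*\t([^\t]*)$")

kept = []
for row in rows:
    m = row_re.match(row)
    if m:
        # Drop the middle column and keep the first and last, tab-separated.
        kept.append(f"{m.group(1)}\t{m.group(2)}")

print(kept)
```

Because `[^\t]*` can never cross a tab, spaces, quotes and angle brackets inside the middle column do not matter.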

What is wrong with this Regular Expression?

I am a beginner and have some problems with regexps.
Input text is : something idUser=123654; nick="Tom" something
I need to extract the value of idUser -> 123654
I tried this:
//idUser is already 8 digits number
MatchCollection matchsID = Regex.Matches(pk.html, @"\bidUser=(\w{8})\b");
Text = matchsID[1].Value;
but in the output I get idUser=123654; I need only the number.
The second problem is with nick="Tom": how can I get only the text Tom from this expression?
You don't show your output code, where you get the group from your match collection.
Hint: you will need group 1 and not group 0 if you want to have only what is in the parentheses.
.*?idUser=([0-9]+).*?
That regex should work for you :o)
Here's a pattern that should work:
\bidUser=(\d{3,8})\b|\bnick="(\w+)"
Given the input string:
something idUser=123654; nick="Tom" something
This yields 2 matches (as seen on rubular.com):
First match is idUser=123654, group 1 captures 123654
Second match is nick="Tom", group 2 captures Tom
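The match and group behaviour described above can be reproduced with a small Python sketch. Python is used only for illustration (the question itself is about .NET, where the equivalent lookup goes through Match.Groups), and the variable names are assumptions. In Python's `re`, a group that did not take part in a match is reported as None.

```python
import re

text = 'something idUser=123654; nick="Tom" something'

# Alternation: either idUser=<3 to 8 digits> (group 1) or nick="<word>" (group 2).
pattern = re.compile(r'\bidUser=(\d{3,8})\b|\bnick="(\w+)"')

# For every match, record the full match (group 0) and both capturing groups.
matches = [(m.group(0), m.group(1), m.group(2)) for m in pattern.finditer(text)]

for full, user_id, nick in matches:
    print(full, user_id, nick)
```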
Some variations:
In .NET regex, you can also use named groups for better readability.
If nick always appears after idUser, you can match the two at once instead of using alternation as above.
I've used {3,8} repetition to show how to match at least 3 and at most 8 digits.
API links
Match.Groups property
This is how you get what individual groups captured in a match
Use look-around
(?<=idUser=)\d{1,8}(?=(;|$))
To fix length of digits to 6, use (?<=idUser=)\d{6}(?=($|;))
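A quick way to see the look-around version in action is the Python sketch below. Python and the variable names are assumptions for illustration. The `nick` pattern is not part of the answer above; it is a hypothetical extension of the same look-around idea to the second half of the question. With look-arounds, the full match (group 0) is already the bare value.

```python
import re

text = 'something idUser=123654; nick="Tom" something'

# Digits that are preceded by "idUser=" and followed by ";" or end of string.
id_match = re.search(r"(?<=idUser=)\d{1,8}(?=(;|$))", text)
user_id = id_match.group(0) if id_match else None

# Hypothetical extension: word characters between nick=" and the closing quote.
nick_match = re.search(r'(?<=nick=")\w+(?=")', text)
nick = nick_match.group(0) if nick_match else None

print(user_id, nick)
```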