Regex to match two characters followed by an equals sign

I'm reading a file and splitting it on row separators, but the row separators in the file are not constant.
Here is an example of my file:
CN=100
adshnxhndxghdngfhdsfs
CN=200
jhnxrhewxrgewhgxew
XN=300
jskhd sa
ZP=400
jhnxrhewxrgewhgxew
XX=500
jhnxrhewxrgewhgxew
My row separators in the file above look like CN=, ZP=, XX=, XN=. There can be more, because it is going to be a very big file.
What regex can I use to find row separators of this pattern (CN=, ZP=, XX=, XN=)?

Simple as
^\w{2}=\d+
See a demo on regex101.com (mind the multiline modifier and tell us your programming language, though!)
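Since no programming language was given, here is a minimal sketch in Python of how the pattern above could be used to cut the file into records. The names SAMPLE, SEPARATOR and split_records are made up for illustration, and the capture groups are an addition to the pattern so the prefix and number can be read back. It assumes no body line starts with two word characters followed by = and digits.

```python
import re

# Sample data copied from the question.
SAMPLE = """CN=100
adshnxhndxghdngfhdsfs
CN=200
jhnxrhewxrgewhgxew
XN=300
jskhd sa
ZP=400
jhnxrhewxrgewhgxew
XX=500
jhnxrhewxrgewhgxew
"""

# re.MULTILINE makes ^ match at the start of every line,
# not only at the start of the whole text.
SEPARATOR = re.compile(r"^(\w{2})=(\d+)", re.MULTILINE)


def split_records(text):
    """Return (prefix, number, body) tuples, one per separator line."""
    matches = list(SEPARATOR.finditer(text))
    records = []
    for i, m in enumerate(matches):
        # The body runs from the end of this separator to the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        records.append((m.group(1), int(m.group(2)), body))
    return records


records = split_records(SAMPLE)
for prefix, number, body in records:
    print(prefix, number, body)
```

For a very big file you would probably want to read it line by line and test each line against the pattern instead of holding everything in memory.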

Related

Regex With Colons in Data

I have a text file which I'm looking to remove some data from. The data is separated using a colon ':' as the delimiter. There are approx 9 separations. The data after the 7th column is most often null and thus useless but the additional colons are still there.
An example of the file would look like this:
column1:column2:column3:column4:column5:column6:column7:column8:column9:column10
I hope to remove the info from after column8. So the data to be removed would be:
:column9:column10
Could someone advise me how to do so in Regex?
I've been reading, and nowhere have I found a way to isolate a colon and the text following it after x number of colons.
Any help you could offer would be much appreciated.
$_ = join ":", ( split /:/, $_, -1 )[0..7];
or
s/(?::[^:]*){2}\z//;
The following regex will keep the first 8 columns and discard all others.
s/^[^:]*(?::[^:]*){7}\K.*//;
Assumes simple single line records.
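For comparison, the same two ideas (split and rejoin, or a single regex) could be sketched in Python roughly like this. Python's re module has no \K, so this version uses a capture group instead. The function names are hypothetical, and it assumes single-line records as above.

```python
import re


def keep_first_columns(line, n=8, sep=":"):
    """Split on the separator and rejoin only the first n columns."""
    return sep.join(line.split(sep)[:n])


# First column, then seven more ":column" groups, captured as group 1;
# everything after that is matched and dropped.
KEEP_8 = re.compile(r"^([^:]*(?::[^:]*){7}).*$")


def keep_first_8_regex(line):
    """Regex variant; lines with fewer than 8 columns are left unchanged."""
    return KEEP_8.sub(r"\1", line)


row = "column1:column2:column3:column4:column5:column6:column7:column8:column9:column10"
print(keep_first_columns(row))
print(keep_first_8_regex(row))
```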

Load file in pig based on whitespace

I am trying to load a file in PIG in which words may be separated by spaces or tabs (possibly more than one). Is there a way to delimit the file load using a regex for whitespace? Or is there any other way to achieve the below?
Input:
COUNTESS This young gentlewoman had a father,--O, that
Output:
COUNTESS
This
young
gentlewoman
had
a
father,--O,
that
It would be great to have a comma delimiter also, but that would make it more complex. For now, only the whitespace delimiter should work for me.
Load the file as a line and then use TOKENIZE. If you have a mixture of tabs and spaces, add a step after loading the data to replace the tabs with spaces in the line, and then use TOKENIZE.
A = LOAD 'test2.txt' as (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line));
C = FOREACH B GENERATE TOBAG(*);
DUMP C;
I don't really know PIG, but here's some info:
https://pig.apache.org/docs/r0.9.1/func.html#strsplit
STRSPLIT(string, regex, limit)
The regex could be something like [\s,]+. That will split on any run of whitespace and commas. So, for instance, a b,c ,d, e would split into the individual letters. The order of spaces and commas does not matter.
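To see what such a pattern does to the sample line, here is a small sketch in Python rather than Pig (the tokenize helper is hypothetical). It shows the whitespace-only pattern next to the whitespace-and-comma one. Note that adding the comma also breaks up father,--O, which may or may not be what you want.

```python
import re


def tokenize(line, pattern=r"\s+"):
    """Split on the given delimiter regex and drop empty tokens."""
    return [token for token in re.split(pattern, line) if token]


# Sample line from the question, with mixed spaces and a tab added.
line = "COUNTESS This  young\tgentlewoman had a father,--O, that"

whitespace_only = tokenize(line)
with_commas = tokenize(line, r"[\s,]+")

print(whitespace_only)
print(with_commas)
```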

How to match the following?

The data I want to parse has columns with the following format:
Character Big Medium Meaning ImageCode Small Constitutens Lesson Frame Strokes JH JTPL Heisig Story koohiiStory1 koohiiStory2 On-Reading Kun-Reading Examples:
All of those are separated by tabs \t (even though it may not look like it in the browser). Also notice that at the end of each line there is a colon :. The problem is that the columns koohiiStory2 and Examples may or may not exist, and there may also be cases in which the data is corrupt and there is a tab inside Heisig Story, but those are the minority.
What I'm trying to match is the values for On-Reading, Kun-Reading and Examples. All of these are distinct from the rest because they don't use standard English characters (romaji) but Japanese characters instead, with the exception of perhaps a few commas or dots. It is also guaranteed that either Kun-Reading or Examples will end with a colon :, that On-Reading and Kun-Reading will exist, and that all three columns will be consecutive.
Here is some sample data.
How can I parse that to return this?
Alright, I'll give it a shot.
Since the content you expect is mostly non-ASCII characters sitting between a dot followed by a space or tab and a colon, you can use:
(?<=\.(\s|\t)) // Positive lookbehind for a 'dot' + 'space or tab'
[^\w]+ // Any non words
(?=\:) // Positive lookahead for a ':'
Working sample on regex101

Regex replace filename in javascript

I'm having trouble with a regular expression. I have several images whose file names need changing. I've done them by hand; it was quick, easy and painless. However, I wanted to know what I needed to do as a simple regex replacement using JavaScript, and that's where it doesn't quite work out. The image is called "multi blossom 02.png" and it's going to be resized and saved out as a JPEG with the name "iOS_multi_BLOSSOM_2048.jpg". The others are of the same form but have different nouns: winter, leaf, circus etc.
The file-name is structured as follows:
"multi" at the start (lower case),
white space,
the noun (lower case),
white-space,
a number (that may have a preceding 0 and may be one or two digits),
file extension which may be .png or .psd (lowercase).
It then needs to be changed to:
iOS_multi (camel case as written),
noun (UPPERCASE),
2048 (new fixed size),
new file extension .jpg(lowercase).
I know that ([a-z]+\s) matches "multi" and that (\s\d+.[a-z]+$) will match the numbers and file extension, but have no idea how to successfully match the bit in the middle as well. And do the uppercase on the noun. But I'm sure there is someone else that does. Thank you.
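In JavaScript you cannot do this with a plain replacement string, as the replacement syntax has no way to uppercase a captured group. (A function passed to replace as the second argument can do it, but matching and building the new name by hand is just as simple.) The match method returns an array which you can then manipulate.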
var oldImageName = "multi blossom 02.png";
var matches = oldImageName.match(/multi (\w+) \d{1,2}\.(?:png|psd)/);
var newImageName = "iOS_multi_" + matches[1].toUpperCase() + "_2048.jpg";
Note: this assumes that the "noun" is a single word with no spaces
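The same match-then-build approach, sketched in Python for comparison. PATTERN and new_image_name are names invented for this example. Unlike the snippet above, it anchors the pattern at both ends and returns None when the name does not fit the expected form, so a stray file does not raise an error.

```python
import re

# "multi", whitespace, a lowercase noun, whitespace, one or two digits,
# then .png or .psd. Only the noun is captured.
PATTERN = re.compile(r"^multi\s+([a-z]+)\s+\d{1,2}\.(?:png|psd)$")


def new_image_name(old_name):
    """Build the new file name, or return None if old_name does not match."""
    match = PATTERN.match(old_name)
    if match is None:
        return None
    return "iOS_multi_" + match.group(1).upper() + "_2048.jpg"


print(new_image_name("multi blossom 02.png"))
print(new_image_name("multi winter 7.psd"))
```

As with the JavaScript version, this assumes the noun is a single lowercase word.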
I was searching for "javascript Regex to replace characters that Windows doesn't accept in a filename" but found nothing,
so here is a regex to strip the characters that the Windows filesystem does not allow in a filename (/\:*?<>|"):
var originalFileName='some filename:with"forbidden/>\? chars.in';
var strippedFileName=originalFileName.replace(/[\/\\:*?<>|"]+/g, "");
console.log(strippedFileName);
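A rough Python equivalent, in case it helps to compare. FORBIDDEN and strip_forbidden are hypothetical names. The character class here also covers *, which Windows rejects in filenames as well. It only strips characters and does not deal with reserved names such as CON or with trailing dots and spaces.

```python
import re

# Characters Windows does not allow in a filename: \ / : * ? " < > |
FORBIDDEN = re.compile(r'[\\/:*?"<>|]+')


def strip_forbidden(name):
    """Remove every run of forbidden characters from the name."""
    return FORBIDDEN.sub("", name)


original = 'some filename:with"forbidden/>\\? chars.in'
stripped = strip_forbidden(original)
print(stripped)
```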

parse text with Matlab

I have a text file (output from an old program) that I'd like to clean. Here's an example of the file contents.
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
|||||||||||||||||
|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08|3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
||||||||||||asasasas
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
*|comment
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
I'd like it to look like:
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
*|comment
*|comment
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
*|comment||||||||||||||||
*|comment|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08||3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
*|comment
*|comment
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
The data are divided into 'packages' distinguished by their first letter (P, N, L, S). Each package must be preceded by at least two dedicated lines beginning with *|, which are then read as comments. Blank lines between packages with different letters are filled with *|, and where there are no such lines they must be added. Blank lines and 'random' characters between lines with the same letter are removed.
Perhaps it is clearer in the example files.
How do I manipulate the text? Thank you in advance for the help.
Use fileread to get your file into MATLAB.
text = fileread('my file to clean.txt');
Split the resulting character string into lines by splitting on the newlines. (The newline characters depend on your operating system, so the pattern below accepts both \r\n and \n.)
lines = regexp(text, '\r?\n', 'split');
It isn't entirely clear exactly how you want the file cleaned, but these things might get you started.
% Replace blank lines with comment string
blanks = cellfun(@isempty, lines);
comment = '*|comment';
lines(blanks) = cellstr(repmat(comment, sum(blanks), 1))
% Prepend comment string to lines that start with a pipe
lines = regexprep(lines, '^\|', '\*\|comment\|')
You'll be needing to know your way around regular expressions. There's a good guide to them at regular-expressions.info.
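For readers who want to check the logic outside MATLAB, the two starter steps above (blank lines become a comment, lines starting with a pipe get the comment prepended) could be sketched in Python as follows. COMMENT and clean_lines are hypothetical names, and like the MATLAB snippet this is only a starting point, not the full clean-up the question asks for.

```python
import re

COMMENT = "*|comment"


def clean_lines(text):
    """Apply the two starter rules to every line of the text."""
    cleaned = []
    # splitlines() copes with both \r\n and \n line endings.
    for line in text.splitlines():
        if line == "":
            # Rule 1: a completely empty line becomes the comment string.
            cleaned.append(COMMENT)
        else:
            # Rule 2: a leading pipe is replaced by the comment string plus a pipe.
            cleaned.append(re.sub(r"^\|", COMMENT + "|", line))
    return cleaned


example = "N|I|01\n\n|||x\r\nS|SH01"
result = clean_lines(example)
print("\n".join(result))
```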