Matlab - how to extract specific data from a vector - regex

I have some data from a GPS receiver, but some of the data are corrupted by extra characters. I want to extract the timestamp (the first field) and the data for the $GPGGA and $GPVTG sentences.
To be more clear, here is a sample of the data I have in a cell array:
'1458937887.70818 $GPGGA,200228.90,3555.3269,N,15552.9641,A*25'
'1458937887.709668 $GPVTG,56.740,T,56.740,M,0.069,N,0.127,K,D*2D'
'1458937887.712022 ªDe¾,…´apö$™°%=HfSrîU¾Õ½ôAqö‚>1ÀàHqgu$GPGGA,200229.00,3555.3269,N,15552.9641,C*2B'
'1458937887.714071 $GPVTG,286.847,T,286.847,M,0.028,N,0.051,K,D*28'
As you can see, the problem here is in the third line where some strange characters appear between the timestamp and the data.
Another problem is that sometimes this third line is split into two lines, something like this:
'1458937887.712022 ªDe¾,…´apö$™°'
'%=HfSrîU¾Õ½ôAqö‚>1ÀàHqgu$GPGGA,200229.00,3555.3269,N,15552.9641,D*24'
which makes using regexp very hard.
In summary, I want to format the third line (in both cases) as:
'1458937887.712022 $GPGGA,200229.00,3555.3269,N,15552.9641,D*2R'
Update:
Thanks to @excaza, this solves the first issue (removing the garbage):
regexprep(str, '(?<=\d\s)(.*)(?=\$GPGGA)', '')
As for the second issue, @Suever's question gave me an idea from looking at the format of the data. Is it possible to solve it while reading the data from a .txt file? Something like defining the delimiter to be * followed by two characters and a \n, since all packets end with this pattern?
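Both issues can be handled together by matching over the whole text instead of line by line, so that a record may span a line break. The following is a minimal sketch in Python rather than MATLAB, under stated assumptions: the names RECORD and clean_records are hypothetical, every record is assumed to start with a decimal timestamp at the beginning of a line, and every sentence is assumed to end with * followed by two hexadecimal digits. If a corrupted record lost its sentence entirely, the lazy match would pair its timestamp with the next sentence, so real data may need an extra sanity check.

```python
import re

# Assumed record shape: "<timestamp> <optional garbage>$GPGGA,...*HH"
RECORD = re.compile(
    r"^(\d+\.\d+) "                              # timestamp at the start of a line
    r".*?"                                       # garbage, may cross a line break
    r"(\$GP(?:GGA|VTG),[^*\n]*\*[0-9A-F]{2})",   # sentence up to its checksum
    re.MULTILINE | re.DOTALL,
)


def clean_records(lines):
    """Return 'timestamp sentence' strings with garbage and line splits removed."""
    # Re-join the lines so a record split in two becomes searchable as one unit.
    text = "\n".join(lines)
    return [f"{m.group(1)} {m.group(2)}" for m in RECORD.finditer(text)]
```

The same pattern should carry over to MATLAB's regexp with the 'tokens' option once the cell array has been joined into a single string, although that has not been verified here.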

Related

Importing file with split lines Azure Data Factory pipeline

I have a pipe-delimited text file with a header row (obtained via SFTP) that I need to import into an SQL Server table. That should be easy enough, but the input file has rows split over several lines if the data for the row exceeds 80 characters in length. The EOL character is a newline character, which ADF can cope with just fine.
So, we have something like:
Col1Name|Col2Name|Col3Name|Col4Name
aaaa|bbbbbb|cccc|ddddd
eeeeeee|fffff|gggggg|this is some data that pushes the row over the 80 character li\
mit
hhhhhh|iiiiiii|jjjjjjjj|kk
If some of the rows of data weren't split in this manner it would be straightforward to shunt the data into the destination table but I can't work out how to merge the split lines prior to mapping the data to output columns.
Things I have tried/looked at doing:
Using a text file source with pipes as delimiters and newlines as row terminators, replacing the backslash and newline combination with an empty string. Unfortunately, the data is already processed into separate rows at this point so this achieves nothing.
Mucking around with the column/row delimiters to read the file into one big blob and replacing the backslash/newline combos in the blob with an empty string. This doesn't work as the file gets truncated doing this.
Some combination of aggregate transformation with a collect() expression to merge the lines. Again, can't seem to manage this because there aren't any grouping columns in common between the lines the row has been split into to be able to perform this sort of aggregation.
Do I need to write an Azure function to pre-process the file and merge the split lines, or is there something I'm missing that would help?
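If pre-processing does turn out to be necessary, the merge step itself is small. The following is a minimal sketch in Python, assuming that a backslash immediately before a line break always marks a continuation and never occurs as real data at the end of a row, and that the file fits in memory. The function name merge_continuations is hypothetical, and how the function would be hosted (an Azure Function or otherwise) is not shown.

```python
def merge_continuations(text: str) -> str:
    """Join rows that were split with a trailing backslash before the newline."""
    # Handle Windows line endings first so that no stray "\r" is left behind.
    return text.replace("\\\r\n", "").replace("\\\n", "")
```

The merged text can then be written back out and read with the usual pipe delimiter and newline row terminator.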

How to ignore specific characters and new lines using regex

I am trying to validate a csv file using Apache-NiFi.
My CSV file has some defects.
id,name,address
1,sachith,{"Lane":"ABC.RTG.EED","No":"12"}
2,nalaka,{"Lane":"DEF",
"No":"23"}
3,muha,{"Lane":"GRF.FFF","No":"%$&%*^%"}
Here the second row has been split across two lines, and the third row contains some special characters.
I want to ignore both of these rows. For that I use \{("\w+":"\w+",)*[^%&*#]*\}, but this does not catch the row-split error and the newline.
I also tried \{("\w+":"\w+",)*[^%&*#]*\}$, but it doesn't give the right answer either.
This might be what you are looking for: ^[0-9]+,[a-z]+,\{("\w+":"[\w\.]+","\w+":"[a-zA-Z0-9]+")\}$
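To see which rows of the sample file that pattern accepts, it can be tried line by line. The following is a rough sketch using Python's re module rather than NiFi itself. The pattern uses only anchors, character classes, and \w, which should behave the same way in NiFi's Java regex engine, but that is an assumption. The names ROW and valid_rows are hypothetical, and each line is assumed to have had its trailing newline removed.

```python
import re

# The pattern suggested above: id, lowercase name, then a two-key JSON-like object.
ROW = re.compile(
    r'^[0-9]+,[a-z]+,\{("\w+":"[\w\.]+","\w+":"[a-zA-Z0-9]+")\}$'
)


def valid_rows(lines):
    """Return only the lines that match the expected row shape from start to end."""
    return [line for line in lines if ROW.match(line)]
```

Because the pattern is anchored at both ends, each half of a split row fails on its own, and so does the row whose second value contains special characters. The header line is rejected as well, so it would need to be handled separately.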

Getting Beyond Compare to Match Similar Lines Properly

I am using Beyond Compare 4.1.6 to diff text configuration files. There is one configuration parameter per line, and each line is formatted as follows:
:=
I would like to configure Beyond Compare such that it will align only lines when the : portion of the line is exactly the same in both files. Put differently, everything from the beginning of the line up to and including the colon must match exactly for the two lines to be aligned. Note that a colon cannot occur in , so the colon I want Beyond Compare to base its alignment decision on will always be the first colon in the line.
An example is:
# FILE 1
abcdefgh:string=5
# FILE 2
abcdefkh:string=5
Beyond Compare aligns these two lines even though I don't want it to.
I've been unable to coerce Beyond Compare to compare lines as desired by editing its grammar rules or by tweaking other features.
How may I get Beyond Compare to match lines as described above?
Thank you!
You can compare the files with a table compare.
Then you must set = as the field separator.
Once you have done this, you have two columns, and the first is the key column (if not, you can define it as the key).
After this you get the result you want (if I understood your question right).
If you need this often, you can store the settings in a file format.

Finding matches after a specific line in Perl/Notepad++

My problem is that I have a document that is split into sections. Each section is marked by a single-line header ([Header1], [Header2], etc.) and contains various types of data sets on individual lines, where each line begins with a label indicating what type of data follows, like this:
[Header1]
data_label_type1 = 1,2,3
data_label_type2 = 1,2,3,4
data_label_type1 = 1,2,3,4,5
data_label_type3 = 1,2
Note the headers/sections are out of order, so Header1 doesn't always start a document and Header2 won't always follow.
A bit off topic, but the data sets are results from an experiment I'm maintaining for a thesis.
I want to be able to capture type 1 data found only in the first section (under Header1) using a single regex function. After capturing it I was going to use replace and another function to convert the captured data to a different form.
Initially I was using the regex type1\h*=\h*([[:graph:]]*) but this only goes line by line, and I've got hundreds of documents - potentially tens of thousands of individual lines to catch.
I can use regex to convert my data well enough, but my problem is that I have no idea how to capture type 1 data from Header1 exclusively. Any help, tips or pointers to start some experimenting would be really appreciated!
Regex is apparently not capable of providing a solution, so I will use alternatives such as a parser instead.
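A parser for this layout can be very small: walk the lines, remember which section is current, and keep only the wanted label while inside the wanted section. The following is a minimal sketch in Python, assuming headers are lines of the form [Name] and data lines have the form label = values. The names HEADER and collect_section_values, and the default arguments, are hypothetical.

```python
import re

# A section header is assumed to be a whole line of the form "[Name]".
HEADER = re.compile(r"^\[(.+)\]$")


def collect_section_values(lines, header="Header1", label="data_label_type1"):
    """Collect the values for `label`, but only from lines under `[header]`."""
    values = []
    in_section = False
    for raw in lines:
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # A new section starts; remember whether it is the one we want.
            in_section = match.group(1) == header
            continue
        if in_section and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == label:
                values.append(value.strip())
    return values
```

Because the section state is tracked explicitly, the order of the headers in the document does not matter.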

Fortran 90: reading a generic string that contains some "/" characters

Hi everybody, I've run into a problem reading unformatted character strings from a simple file. When the first / is found, everything after it is lost.
This is an example of the text I would like to read: after the first 18 character blocks, which are fixed (from #Mod to Flow[kW]), there is a list of chemical species' names, whose number is variable (in this case 5) within the program I'm writing.
#Mod ID Mod Name Type C. #Coll MF[kg/s] Pres.[Pa] Pres.[bar] Temp.[K] Temp.[C] Ent[kJ/kg K] Power[kW] RPM[rad/s] Heat Flow[kW] METHANE ETHANE PROPANE NITROGEN H2O
I would like to skip the first 18 blocks, after some formal checks, and then read the chemical species. To do the former, I created a character array of dimension 18, with each element having a length of 20.
character(20), dimension(18) :: chapp
Then I would like to associate the 18 blocks to the character array
read(1,*) (chapp(i),i=1,18)
...but this is the result: chapp(1) to chapp(7) correctly hold the first 7 strings, but this is chapp(8)
chapp(8) = 'MF[kg '
and from here on, everything is left blank!
How could I overcome this reading problem?
The problem is due to your using list-directed input (the * as the format). List-directed input is useful for quick and dirty input, but it has its limitations and quirks.
You stumbled across a quirk: A slash (/) in the input terminates assignment of values to the input list for the READ statement. This is exactly the behavior that you described above.
This is not a choice of the compiler writer, but is mandated by all relevant Fortran standards.
The solution is to use formatted input. There are several options for this:
If you know that your labels will always be in the same columns, you can use a format string like '(1X,A4,2X,A2,1X,A3,2X)' (this is not complete) to read in the individual labels. This is error-prone, and is also bad if the program that writes out the data changes format for some reason or other, or if the labels are edited by hand.
If you can control the program that writes the labels, you can use tab characters to separate the individual labels (and also, later, the data). Read in the whole line, split it into tab-separated substrings using INDEX and read in the individual fields using an (A) format. Don't use list-directed format, or you will get hit by the / quirk mentioned above. This has the advantage that your labels can also include spaces, and that the data can be imported from/to Excel rather easily. This is what I usually do in such cases.
Otherwise, you can read in the whole line and split on multiple spaces. A bit more complicated than splitting on single tab characters, but it may be the best option if you cannot control the data source. You cannot have labels containing spaces then.