Validating XML through XSD takes entities as the referenced character - regex

I'm struggling with something strange. We have to send some generated XML documents to a website, where they are going to be validated against an XSD. This XSD includes a data type for text fields where a regular expression is used to exclude certain characters (in fact, the regex defines a closed list of valid characters, excluding everything outside it).
For example, standard double quotes (") are excluded.
On our XML generator I've changed (") to its entity &quot;, and with this change, the data that includes (") passes the validation.
Of course, quotes aren't our only problem; there are lots of different characters excluded by the regex that can be found in our source data.
So I made a little function in our Oracle package that checks each unit of data against the regex before adding it to the XML and, if any doesn't fit, loops through its content seeking invalid characters and replacing them with the numeric HTML entity associated with their encoding (thus, quotes are changed to &#34;).
My surprise is that, although the regex validates a piece of data containing &quot;, it doesn't validate a piece of data containing &#34; (and yes, (#) is included in the regex).
Is there any reason for this? I don't know... are numeric (&#...;) entities resolved before the validation against the XSD, or something like that?
BTW, the regExp is this:
[0-9a-zA-ZñáàéèíìóòúùÁÉÍÓÚÑü\s/çÇ¡!¿=\?%€#&#,;:\.\-_''\*\+\(\) ÀÈÌÒÙÜ’´´`·äëïöÄËÏÖ“”’]+
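For illustration, a minimal C# sketch of the scan-and-replace routine described above (the real one is a PL/SQL function in an Oracle package; the whitelist here is abbreviated and every name is hypothetical):

using System.Text;

static string EscapeInvalidChars(string input)
{
    // abbreviated stand-in for the full XSD character class above
    const string allowed = "0123456789abcdefghijklmnopqrstuvwxyz"
                         + "ABCDEFGHIJKLMNOPQRSTUVWXYZ .,;:()-_'?!#&%/+*=";
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (allowed.IndexOf(c) >= 0)
            sb.Append(c);                               // keep valid characters
        else
            sb.Append("&#").Append((int)c).Append(';'); // e.g. '"' becomes &#34;
    }
    return sb.ToString();
}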


Why won't CloudSearch find substring matches in filename text field?

I have a CloudSearch domain with a filename text field. My issue is that a text query won't match (some) documents with filenames I think it (logically) should. If I have documents with these filenames:
'cars'
'Cars Movie.jpg'
'cars.pdf'
'cars@.jpg'
and I perform a simple text query of 'cars', I get back files #1, #2, and #4 but not #3. If I search 'cars*' (or do a structured query using prefix) I can match #3. This doesn't make sense to me, especially since #4 matches but #3 does not.
TL;DR It's because of the way the tokenization algorithm handles periods.
When you perform a text search, you're performing a search against processed data, not the literal field. (Maybe that should've been obvious, but it wasn't how I was thinking about it before.)
The documentation gives an overview of how text is processed:
During indexing, Amazon CloudSearch processes text and text-array fields according to the analysis scheme configured for the field to determine what terms to add to the index. Before the analysis options are applied, the text is tokenized and normalized.
The part of the process that's ultimately causing this behavior is the tokenization:
During tokenization, the stream of text in a field is split into separate tokens on detectable boundaries using the word break rules defined in the Unicode Text Segmentation algorithm.
According to the word break rules, strings separated by whitespace such as spaces and tabs are treated as separate tokens. In many cases, punctuation is dropped and treated as whitespace. For example, strings are split at hyphens (-) and the at symbol (@). However, periods that are not followed by whitespace are considered part of the token.
The reason I was seeing the matches described in the question is because the file extensions are being included with whatever precedes them as a single token. If we look back at the example, and build an index according to these rules, it makes sense why a search of 'cars' returns documents #1, #2, and #4 but not #3.
# Text Index
1 'cars' ['cars']
2 'Cars Movie.jpg' ['cars', 'movie.jpg']
3 'cars.pdf' ['cars.pdf']
4 'cars@.jpg' ['cars', '.jpg']
Possible Solutions
It might seem like setting a custom analysis scheme could fix this, but none of the options there (stopwords, stemming, synonyms) help you overcome the tokenization problem. I think the only possible solution, to get the desired behavior, is to tokenize the filename (using a custom algorithm) before upload, and then store the tokens in a text-array field, as in the sketch below. Devising a custom tokenization algorithm that supports multiple languages is a large problem, though.
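A minimal sketch of that pre-upload tokenization (single-language and ASCII-oriented; the method name is hypothetical):

using System.Linq;
using System.Text.RegularExpressions;

static string[] TokenizeFilename(string filename)
{
    // split on anything that is not a letter or digit, so 'cars.pdf'
    // yields ['cars', 'pdf'] instead of the single token 'cars.pdf'
    return Regex.Split(filename.ToLowerInvariant(), "[^a-z0-9]+")
                .Where(t => t.Length > 0)
                .ToArray();
}

// TokenizeFilename("cars.pdf")       -> ["cars", "pdf"]
// TokenizeFilename("Cars Movie.jpg") -> ["cars", "movie", "jpg"]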

Flat file schema validation using regular expression - not allow new line and delimiter char

I know this must be a primitive question, but I am still not able to find a solution to my simple problem.
In a BizTalk solution, I want to validate an inbound flat file against a flat file schema (the delimiter char is pipe '|'). The rule is that there must be the exact same number of fields in every record (every line). So after disassembling, no field may contain a new line char (CR LF or \r\n) or a pipe '|' char.
Every line in the flat file is a single record and there are 10 fields in every record, so there must be exactly 9 '|' pipe chars in every line.
I tried to solve it using XSD regular expression validation, but since regex is not my area of expertise, I am not able to create a final regex. Currently I am testing with .*(?!([^\r\n\|])).* but it doesn't work when there are more than 9 '|' chars; however, it works when there are fewer than 9.
In short, I want an XSD regex which must not allow a new line char or '|' in the string, but may allow an empty '' value.
I referred to the links below while creating my regex:
XML Schema Regular Expressions
XML Schema - Regular Expressions
I think you're trying to solve the wrong problem.
First, do you really need to do this? I don't recall ever needing or even considering what you're describing.
Second, you can just Validate the parsed Xml. If the field count is wrong, it will fail there. If you really need to check for extra '|', you can put that in the Schema to test for it in a Map.
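That said, if a single pattern is really wanted: XSD patterns are implicitly anchored, so something along the lines of ([^|\r\n]*\|){9}[^|\r\n]* should describe "exactly ten pipe-delimited fields, none containing a pipe or newline, each possibly empty". A sketch of the same idea in C# (anchored by hand, since .NET patterns are not implicitly anchored; not verified against BizTalk's XSD validator):

using System.Text.RegularExpressions;

// nine "field then pipe" groups followed by a tenth field
static bool IsValidRecord(string line) =>
    Regex.IsMatch(line, @"^([^|\r\n]*\|){9}[^|\r\n]*$");

// IsValidRecord("a|b|c|d|e|f|g|h|i|j") -> true  (10 fields, 9 pipes)
// IsValidRecord("a|b|c")               -> false (too few fields)
// IsValidRecord("|||||||||")           -> true  (10 empty fields)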
IBM Integration Bus solves this problem by allowing you to describe the non-XML data format using an XSD. The technology is called Data Format Description Language (DFDL).
https://en.wikipedia.org/wiki/Data_Format_Description_Language

How can I replace text in a Siebel data mapping?

I have an outgoing web service to send data from Siebel 7.8 to an external system. In order for the integration to work, before I send the data, I must change one of the field values, replacing every occurrence of "old" with "new". How can I do this with EAI data mappings?
In an ideal world I would just use an integration source expression like Replace([Description], "old", "new"). However Siebel is far from ideal, and doesn't have a replace function (or if it does, it's not documented). I can use all the Siebel query language functions which don't need an execution context. I can also use the functions available for calculated fields (sane people could expect both lists to be the same, but Siebel documentation is also far from ideal).
My first attempt was to use the InvokeServiceMethod function and replace the text myself in eScript. So, this is my field map source expression:
InvokeServiceMethod('MyBS', 'MyReplace', 'In="' + [Description] + '"', 'Out')
After some configuration steps it works fine... except if my description field contains the " character: Error parsing expression 'In="This is a "test" with quotes"' for field '3' (SBL-DAT-00481)
I know why this happens. My double quotes are breaking the expression and I have to escape them by doubling the character, as in This is a ""test"" with quotes. However, how can I replace each " with "" in order to call my business service... if I don't have a replace function? :)
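For reference, the doubling itself would be a one-liner anywhere a replace function exists; a C# illustration of the transformation needed (purely to show what's missing in Siebel's expression language):

string description = "This is a \"test\" with quotes";
// double every " so it survives being embedded in a quoted argument
string escaped = description.Replace("\"", "\"\"");
string arg = "In=\"" + escaped + "\"";
// arg: In="This is a ""test"" with quotes"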
Oracle's support web has only one result for the SBL-DAT-00481 error, which, as a workaround, suggests placing the whole parameter inside double quotes (which I had already done). There's a linked document in which they acknowledge that the workaround is valid for a few characters such as commas or single quotes, but due to a bug in Siebel 7.7-7.8 (not present in 8.0+), it doesn't work with double quotes. They suggest instead passing the row id as an argument to the business service, and then retrieving the data directly from the BC.
Before I do that and end up with a performance-affecting workaround (pass only the ID) for the workaround (use double quotes) for the workaround (use InvokeServiceMethod) for not having a replace function... Am I going crazy here? Isn't there a simple way to do a simple text replacement in a Siebel data mapping?
The first thing that comes to mind (quite possibly far from the optimal one) is to create a calculated field on the source BC, e.g. NEW_VALUE, which evaluates to "NEW" for every record where the origin field has the value "OLD", and simply use this field in the integration map.

How can I use Regex to parse irregular CSV and not select certain characters

I have to handle a weird CSV format, and I have been running into problems. The string I have been able to work out thus far is
(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*?)+?
My files are often broken and irregular, since we have to deal with OCR'd text which is usually not checked by our users. Therefore, we tend to end up with lots of weird things, like a single " within a field, or even a newline character (which is why I am using Regex instead of my previous readLine()-based solution). I've gotten it to parse most everything correctly, except that it captures [,] [,]. How can I get it to NOT select fields with only a single comma? When I try to make it not select commas, it turns "156,000" into [156] and [000].
The test string I've been using is
"156,000","",""i","parts","dog"","","Monthly "running" totals"
The ideal capture output is
[156,000],[],[i],[parts],[dog],[],[Monthly "running" totals]
I can do with or without the internal quotes, since I can always just strip them during processing.
Thank you all very much for your time.
Your CSV is indeed irregular and difficult to parse. I suggest you first do 2 replacements on your data.
// remove all invalid double ""
input = Regex.Replace(input, @"(?<!,|^)""""(?=,|$)|(?<=,)""""(?!,|$)", "\"");
// now escape all inner " as \"
input = Regex.Replace(input, @"(?<!,|^)""(?!,|$)", "\\\"");
// at this stage you have proper CSV data and I suggest using a good .NET csv parser
// to parse your data and get the individual values
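Putting the two replacements together on the test string from the question (a self-contained sketch; the patterns are exactly the ones above):

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string input = "\"156,000\",\"\",\"\"i\",\"parts\",\"dog\"\",\"\",\"Monthly \"running\" totals\"";
        // collapse the stray doubled quotes that are not empty fields
        input = Regex.Replace(input, @"(?<!,|^)""""(?=,|$)|(?<=,)""""(?!,|$)", "\"");
        // escape the remaining inner quotes as \"
        input = Regex.Replace(input, @"(?<!,|^)""(?!,|$)", "\\\"");
        Console.WriteLine(input);
        // "156,000","","i","parts","dog","","Monthly \"running\" totals"
    }
}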

CloudSearch wildcard query not working with 2013 API after migration from 2011 API

I've recently upgraded a CloudSearch instance from the 2011 to the 2013 API. Both instances have a field called sid, which is a text field containing a two-letter code followed by some digits e.g. LC12345. With the 2011 API, if I run a search like this:
q=12345*&return-fields=sid,name,desc
...I get back 1 result, which is great. But the sid of the result is LC12345 and that's the way it was indexed. The number 12345 does not appear anywhere else in any of the resulting document fields. I don't understand why it works. I can only assume that this type of query is looking for any terms in any fields that even contain the number 12345.
The reason I'm asking is because this functionality is now broken when I query using the 2013 API. I need to use the structured query parser, but even a comparable wildcard query using the simple parser is not working e.g.
q.parser=simple&q=12345*&return=sid,name,desc
...returns nothing, although the document is definitely there i.e. if I query for LC12345* it finds the document.
If I could figure out how to get the simple query working like it was before, that would at least get me started on how to do the same with the structured syntax.
Why it's not working
CloudSearch v1 (2011) had a different way of tokenizing mixed alpha+numeric strings. Here's the logic as described in the archived docs (emphasis mine).
If a string contains both alphabetic and numeric characters and is at least three and no more than nine characters long, the alphabetic and numeric portions of the string are treated as separate tokens. For example, the string DOC298 is tokenized into two terms: doc 298
CloudSearch v2 (2013) text processing follows Unicode Text Segmentation, which does not specify that behavior:
Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).
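In other words: under the 2011 rules the sid LC12345 was indexed as the two terms lc and 12345, so the bare 12345* prefix query could match it; under the 2013 rules it stays a single token lc12345, which only an lc12345* (or LC12345*) query can prefix-match.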
Solution
You should just be able to search *12345 to get back results with any prefix. There may be some edge cases like getting back results you don't want (things with more preceding digits like AB99912345); I don't know enough about your data to say whether those are real concerns.
Another option would be to index the numeric portion separately from the alphabetic prefix, but that's additional work that may be unnecessary.
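A sketch of that split (names hypothetical; assumes the sid format from the question, letters followed by digits):

using System.Text.RegularExpressions;

// "LC12345" -> ("LC", "12345"); indexing the number in its own field
// lets a query against that field match the bare 12345 directly
static (string Prefix, string Number) SplitSid(string sid)
{
    Match m = Regex.Match(sid, @"^([A-Za-z]+)(\d+)$");
    return m.Success ? (m.Groups[1].Value, m.Groups[2].Value) : (sid, "");
}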
I'm guessing you are using CloudSearch in English, so maybe this isn't your specific problem, but also watch out for stop words in your search queries:
https://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-analysis-schemes.html#stopwords
For example, the word "jo" is a stop word in Danish and other languages; each supported language has a dictionary of very common stop words. If you don't specify a language for your text field, it defaults to English. You can see the stop word lists here: https://docs.aws.amazon.com/cloudsearch/latest/developerguide/text-processing.html#text-processing-settings