I am currently using Tesseract-OCR to recognize some text in a picture, but I have a question. Some words cannot be recognized, and even after I specifically trained for them it still did not work!
Do I need some extra files when training the language data, such as the DAWG files? I have no idea about that, because sometimes it can recognize a few of the words, but only when they appear at certain positions and orientations.
It is really confusing. I sincerely need your help. Thanks in advance!
Other info:
I am using Simplified Chinese. (I don't know whether there are any parameters I should have set when using Chinese.)
Since the picture I want to recognize is a table, there are a few ruled lines in it. Do you have any ideas for improving accuracy when recognizing tables?
Since I don't know whether the problem is caused by the particular shapes of the characters, I paste some of them directly here: 上下午一二三四五
Many thanks!
I am currently working on my thesis, analyzing the results of Illumina NGS sequencing. I am not really familiar with bioinformatics, and in this part of my project I am trying to compare two VCF files corresponding to the results from healthy tissue and tumor tissue. I want to compare these VCF files and remove their similarities; more specifically, I want to remove the information of the healthy tissue from the tumor one. Do you have any suggestions on which tool I should use, or any way I can do my analysis? If you can help me I would be more than thankful. Thank you in advance!
I understand your problem. The first thing I would recommend is a Unix tool called VCFtools (I don't know which OS you're running). It's pretty simple to use. But if you want to do all the processing with, for example, Python, you can use the pandas library, which helps you process data in column format, or the PyVCF library, which is a parser for VCF files. I can help you more if you provide some example data you're processing.
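If you go the Python route, a minimal PyVCF sketch for dropping tumor calls that also appear in the healthy sample could look like this (the file names and the choice of matching key are my assumptions; VCFtools' vcf-isec or bcftools isec do the same thing more robustly):

# Sketch: keep only tumor variants that are NOT present in the healthy sample.
# Assumes PyVCF is installed and the healthy call set fits in memory.
import vcf

healthy = vcf.Reader(filename="healthy.vcf")   # placeholder file names
tumor = vcf.Reader(filename="tumor.vcf")

# Identify a variant by chromosome, position, reference and alternate alleles.
healthy_keys = {(rec.CHROM, rec.POS, rec.REF, str(rec.ALT)) for rec in healthy}

writer = vcf.Writer(open("tumor_only.vcf", "w"), tumor)
for rec in tumor:
    if (rec.CHROM, rec.POS, rec.REF, str(rec.ALT)) not in healthy_keys:
        writer.write_record(rec)
writer.close()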
I was given the task of converting a large number of RTF tables into XML ones (around 100,000, possibly far more), but I have no idea how to even start, and I cannot get help from the lead developer because, ironically, he has never written a line of code.
I was thinking about C++ as I need it to be fast, but I'm open to any ideas.
What I need is some information I can start the project with, or any library/program that could help me. Thank you.
EDIT: I have XSD schemas to work with.
I found a solution after looking around for a while. I can use LibreOffice to save the files as HTML (or various other formats) that keep the tables as they are and produce clean markup I can validate against an XSD.
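In case it helps, here is a small sketch of how the batch conversion could be driven from Python using LibreOffice's headless mode (the folder names are placeholders, and you would still need an XSLT or a small parser to turn the exported HTML tables into XML that validates against your XSDs):

# Sketch: convert every .rtf file in a folder to HTML with LibreOffice,
# so the tables survive as <table> markup that is easy to post-process.
# Assumes "soffice" is on the PATH; folder names are placeholders.
import pathlib
import subprocess

src = pathlib.Path("rtf_input")
out = pathlib.Path("html_output")
out.mkdir(exist_ok=True)

for rtf in src.glob("*.rtf"):
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "html",
         "--outdir", str(out), str(rtf)],
        check=True,
    )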
As the title suggests, I want to build an ANPR application on Windows. I am working with Brazilian number plates, and I am using OpenCV for this.
So far I have managed to extract the letters from the number plate. The following images show some of the characters I have extracted.
The problem I am facing is how to recognize those letters. I tried Google Tesseract, but it sometimes fails to recognize them. Then I tried to train an OCR database using OpenCV with about 10 images for each character, but that did not work properly either.
So I am stuck here. I need this for my final-year project, so can anybody help me? I would really appreciate it.
The following site does it very nicely:
https://www.anpronline.net/demo.html
Thank you.
You could train an ANN or a multi-class SVM on the letter images, like here.
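For example, a minimal multi-class SVM sketch with OpenCV's ml module, assuming the extracted characters are saved as small grayscale crops sorted into one folder per character (the folder layout, the 20x20 size and the linear kernel are all assumptions on my part):

# Sketch: train a multi-class SVM on 20x20 character crops with OpenCV.
# Assumes a layout like train/A/*.png, train/B/*.png, ..., train/9/*.png.
import glob
import os

import cv2
import numpy as np

samples, labels = [], []
classes = sorted(os.listdir("train"))
for label, name in enumerate(classes):
    for path in glob.glob(os.path.join("train", name, "*.png")):
        img = cv2.resize(cv2.imread(path, cv2.IMREAD_GRAYSCALE), (20, 20))
        samples.append(img.flatten().astype(np.float32))
        labels.append(label)

samples = np.array(samples, dtype=np.float32)
labels = np.array(labels, dtype=np.int32)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)

# Classify one new character crop.
test = cv2.resize(cv2.imread("letter.png", cv2.IMREAD_GRAYSCALE), (20, 20))
_, pred = svm.predict(test.flatten().astype(np.float32).reshape(1, -1))
print(classes[int(pred[0][0])])

More training images per character (and features such as HOG instead of raw pixels) will usually improve the results noticeably.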
Check out OpenALPR (http://www.openalpr.com). It already has the problem solved.
If you need to do it yourself, you really do need to train Tesseract. It will give you the best results. 10 images per character is not enough, you need dozens or hundreds. If you can find a font that is similar to your plate characters, a good approach is to print out a sheet of paper with all of the characters used multiple times. Then take 5-10 pictures of the page with your camera. These can then be your input for training Tesseract.
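Whether or not you retrain, it can also help to constrain Tesseract to the characters that can actually appear on a plate and to tell it each image is a single character. A small sketch using the pytesseract wrapper (the wrapper, the whitelist string and the file name are my additions, not something from the answer above):

# Sketch: recognise one cropped plate character with Tesseract,
# restricted to the plate alphabet. Assumes pytesseract and Tesseract
# are installed; some Tesseract 4.x builds only honour the whitelist
# with the legacy engine (add --oem 0 in that case).
from PIL import Image
import pytesseract

PLATE_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
config = "--psm 10 -c tessedit_char_whitelist=" + PLATE_CHARS  # psm 10 = single character

text = pytesseract.image_to_string(Image.open("letter.png"), config=config).strip()
print(text)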
I am relatively new to Solr so please forgive me if I'm missing something obvious. I have an application that allows users to search for musical artists. The indexing comes from a read-only database with correct spellings so on the index side I have it figured out.
On the query side, however, I need to anticipate various spelling errors/differences and want to help Solr find those instances. From our old home-grown search solution, I have a list of regexes and the artists they apply to. When I tried to translate those to Solr using the PatternReplaceCharFilterFactory, I noticed that some worked perfectly, while others didn't work at all, with seemingly no rhyme or reason between them.
For example:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="em[ei]n[ei]m" replacement="Eminem"/>
accurately captures the common misspellings of Eminem. But for the band 311:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Tt]hree [Ee]leven" replacement="311"/>
Does not work. Another example is Nine Inch Nails:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="((nine|9).*inch.*nails\b)|(n\.? ?i\.? ?n\.?\b)" replacement="Nine Inch Nails"/>
works perfectly for finding the most common patterns for the band's name. But this one, for Eve 6, does not work either:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Ee]ve.{0,4}([Ss]ix|6)" replacement="Eve 6"/>
Is there something fundamental I'm missing in the usage of this filter? I've tried a number of variations on the regexes mentioned above (even going so far as using literals like 'three eleven'), but still with no success. I've tried making the filter in question the only PatternReplaceCharFilterFactory in the analyzer. I also know for sure that these items are in the index correctly, because when I search for the correct spelling it returns the proper results.
Any suggestions?
Snowdall
I suspect the problem is not with your char filter, but with what comes after it, specifically the tokenizer. If you use the standard tokenizer, it will get rid of the numbers you have just put into your stream. If you don't need the text to be split into tokens, you could look at KeywordTokenizerFactory instead.
In general, the best way to troubleshoot this in Solr 4+ is the Analysis screen in the admin web UI. It allows you to run your text through a particular field type and see what happens to it after each component in the analysis chain.
I would recommend using the SynonymFilter for the kind of application you describe. It allows you to provide an external file where you list words and their synonyms, like:
eminem <=> emenem
nine <=> 9
If you precede this with a LowerCaseFilter, you won't have to fuss about case normalization in your synonyms. You should be able to handle the 311 case too, as long as you don't tokenize (i.e., use a KeywordTokenizer, as Alexander Rafalovitch suggested).
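Putting the pieces together, a query-side analyzer along these lines should behave as described (this is only a sketch: the field type name text_artist and the synonyms.txt entries are illustrative, not something from the answers above):

<fieldType name="text_artist" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

With expand="true", a line like "three eleven, 311" in synonyms.txt maps either spelling to both forms at query time.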
In a nutshell, I want to have different faces for some types of file in dired mode. I don't think it matters, but I am using Aquamacs.
The example I will use here is .tex files. If I can do it for .tex, then I can just apply the same structure to create other faces for other types of files.
From what I understand, I have to create a variable, write a regular expression, then apply a hook. I read a bit about regex and so far I have
^(.+)\.tex$
I think my structure and regular expression are not really correct. I am not a programmer (though I have an interest in it), and I have only been using Emacs for two weeks or so, so any help would be greatly appreciated.
What I need is at least the basic structure of what I have to do. I understand there may already be modes that do something similar (such as Wdired and Dired-X), and I would not complain if someone told me about them, but what I really want is some Elisp code (either already written, or something I can work on), as I plan on learning a bit of Elisp to write my own customisations, and this would be a way to learn.
Thank you!
Since you want to learn how to do it, try checking out the extension dired+.el. This mode does a lot more than what you want, but it does add new faces. Specifically, look for the variable diredp-font-lock-keywords-1 and how it is used. That should get you going.
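If you want a minimal self-contained starting point while you read that code, something along these lines adds a face for .tex files in Dired buffers (the face name, its colours and the regexp are only illustrative, so treat this as a sketch rather than the dired+ way of doing it):

;; Sketch: give .tex files their own face in Dired.
(defface my-dired-tex-face
  '((t (:foreground "dark orange" :weight bold)))
  "Face used for .tex files in Dired buffers.")

(font-lock-add-keywords
 'dired-mode
 ;; Highlight a file name ending in .tex at the end of a Dired line.
 '(("\\(\\S-+\\.tex\\)$" 1 'my-dired-tex-face)))

From there you can duplicate the defface and the keyword entry for other extensions.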
Other SO questions that seem relevant are:
Match regular expression as keyword in define-generic-mode
Highlighting correctly in an emacs major mode
A hello world example for a major mode in emacs?