How to implement a network-based book query system - C++

Please note that I do not want to use a database.
I am currently learning Unix network programming, and I have my university library's entire book list stored in separate text files: for example, books whose names begin with 'b' are kept in b.txt. Across a-z there are about 1 million records in total, one line per book holding its name and other detailed information.
Now I want to write a program that provides a query service over this book list: given a book name, it returns the detailed information for that book if it exists.
So I first need to build a module that performs the query.
Then I will write the server side, which calls the query module, gets the result, and sends it to the client.
My question is: if I do not use a database, how should I implement the query module in C/C++? The obvious approach is to locate the file by first letter (for example, a name beginning with 'H' would be looked up in H.txt, or H1.txt and H2.txt), open it with fopen, read it line by line, and compare each line against the queried name with strstr/strcmp-style functions, returning the record if it is found. I suspect this is too time-consuming to be practical. Is there an existing query system I could use as a reference that does not use a database but still has acceptable query times?

There are several options. The cheapest option (= low development time, low maintenance, low hardware requirements), IMO, is to create an HTML page on a separate site that links to all the data files. Then you set up another page that uses google.com to search that site, and you just tell the Google web spider to index your site. That way you get excellent performance with minimal work. But... you don't get to program any C.
Simple solution using C:
Do as you yourself suggest. If you have lots of memory available for file caching, the performance won't be so bad unless the load gets high.
There will still be some work to do on the rest of the solution, since you should delegate the search to worker threads.
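For illustration, a minimal sketch of that approach, assuming one record per line with the book name at the start of the line (the actual record layout may differ):

    // Naive per-letter scan: pick the data file by the first letter of the
    // queried name, then read it line by line looking for a match.
    #include <cctype>
    #include <cstdio>
    #include <cstring>
    #include <string>

    // Returns the full record line if the book is found, or an empty string.
    std::string query_book(const std::string& name)
    {
        if (name.empty()) return "";
        char first = static_cast<char>(std::tolower(name[0]));
        std::string path(1, first);
        path += ".txt";                     // e.g. "h.txt" for "Harry Potter"

        FILE* fp = std::fopen(path.c_str(), "r");
        if (!fp) return "";

        char line[4096];
        std::string result;
        while (std::fgets(line, sizeof line, fp)) {
            // Prefix comparison against the record; a real version would also
            // check the field separator right after the name.
            if (std::strncmp(line, name.c_str(), name.size()) == 0) {
                result = line;
                break;
            }
        }
        std::fclose(fp);
        return result;
    }

Each query scans a whole per-letter file, so with hundreds of thousands of lines per file the cost is dominated by disk reads unless the files stay cached in memory.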
Intermediate solution using C:
Find a 3rd party search engine and integrate it with your network code.
Advanced solution using C:
Implement your own search engine.

The question is: why don't you want to use a database?
1. To make deployment easier?
SQLite may be a good choice.
2. To try another approach?
Lucene is a good choice for information retrieval; it is written in Java.
CLucene is a rewrite of Lucene in C++.
You may also need a stemmer (to get the root form of words), ICTCLAS (for extracting terms from Chinese text), etc.
3. Or do you want to do everything yourself?
It is easy to manage text files on a filesystem, but for a "query system" storage alone is not enough; the main problem is IR (information retrieval).
You will need to learn about building an index, and about storing and querying it; a minimal sketch follows below.
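As an illustration only (the '|' field separator and the file layout are assumptions), one very simple index maps each book name to the file and byte offset of its record, so a query becomes a map lookup plus one fseek instead of a full file scan:

    // Build an in-memory index of one per-letter file: book name -> location.
    #include <cstdio>
    #include <map>
    #include <string>

    struct Location { std::string file; long offset; };

    std::map<std::string, Location> build_index(const std::string& file)
    {
        std::map<std::string, Location> index;
        FILE* fp = std::fopen(file.c_str(), "r");
        if (!fp) return index;

        char line[4096];
        long offset = std::ftell(fp);
        while (std::fgets(line, sizeof line, fp)) {
            std::string record(line);
            // Assumes the book name is the first '|'-separated field.
            std::string name = record.substr(0, record.find('|'));
            index[name] = Location{file, offset};
            offset = std::ftell(fp);
        }
        std::fclose(fp);
        return index;
    }

    // At query time: look the name up in the index, fseek() to the stored
    // offset in the right file, and read just that one line.

For about a million records the index fits comfortably in memory; persisting it to disk in sorted order and binary-searching it would avoid rebuilding it on every start.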

Related

How to store and efficiently access a lot of files in a search engine (C++)?

I am currently building a little search engine as part of a university project.
The search engine has to be able to perform searches on quite a lot of documents. 500k-1M I would say. The documents are pure .txt files and are not big (max 1MB).
I am writing the search engine in C++, but I am not quite sure how to store and access the documents efficiently (in memory).
I am using an inverted index, which stores for every term a list of document ids (pure integers) in which that term occurs.
I thought about creating a Document class whose constructor takes the filename of the document and creates a document object.
For example
Document d("25.txt");
The document class would also hold other information about the document, which I would present to the user, if the user decides to take a look at the document.
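For concreteness, a minimal sketch of the structures described above (all names are illustrative):

    // Inverted index mapping each term to the ids of documents containing it,
    // plus a lightweight Document built from a filename only when needed.
    #include <string>
    #include <unordered_map>
    #include <vector>

    class Document {
    public:
        explicit Document(const std::string& filename) : filename_(filename) {
            // Load title, snippet, and other metadata from the file here.
        }
        const std::string& filename() const { return filename_; }
    private:
        std::string filename_;
    };

    // term -> ids of the documents containing that term
    using InvertedIndex = std::unordered_map<std::string, std::vector<int>>;

    std::vector<int> lookup(const InvertedIndex& index, const std::string& term)
    {
        auto it = index.find(term);
        return it == index.end() ? std::vector<int>{} : it->second;
    }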
However, if I search for a term that occurs in a lot of documents, for example "apple", then I would probably have to create hundreds or thousands of document objects. And when the queries get longer, that would blow up the heap (I guess?).
And I really need all of the potentially relevant documents in order to create a ranking.
What would be the right way to go here? Maybe somehow with serialization? Or using a DB? Or something else?
Please note that I cannot use something like Solr or Lucene, as this task is part of a university course.

Solr/Lucene "kit" to test searching?

Is there a "code free" way to get SOLR/LUCENE (or something similar) pointed at a set of word docs to make them quickly searchable by a user?
I am prototyping, seeing if there is value in, a system to search through some homegrown news articles. Before I stand up code to handle search string input and document indexing, I wanted to see if it was even worth it before I starting trying to figure it all out.
Thanks,
Judd
Using the bin/post tool of Solr and the Tika handler (named the ExtractingRequestHandler), you should be able to get something up and running for prototyping rather quickly.
See the introduction of Uploading Data with Solr Cell using Apache Tika. Tika is used to process a wide range of different document types.
You can give the Solr post tool a directory or a list of files to submit to the index.
Automatically detect content types in a folder, and recursively scan it for documents for indexing into gettingstarted.
bin/post -c gettingstarted afolder/

Store a large number of HTML files

I'm working on an academic project (a search engine). The main functions of this search engine are:
1. Crawling
2. Storing
3. Indexing
4. Page ranking
All the sites that my search engine will crawl are available locally, which means it is an intranet search engine.
After the crawler stores the files it finds, these files need to be served quickly for caching purposes.
So I wonder: what is the fastest way to store and retrieve these files?
The first idea that came up was to use FTP or SSH, but these are connection-based protocols; the time to connect, locate the file, and fetch it is lengthy.
I've already read about Google's anatomy (the original search-engine paper) and saw that they use a data repository; I'd like to do the same but I don't know how.
NOTE: I'm using Linux/Debian, and the search engine back end is written in C/C++. Help!
Storing individual files is quite easy - wget -r http://www.example.com will store a local copy of example.com's entire (crawlable) content.
Of course, beware of generated pages, where the content is different depending on when (or from where) you access the page.
Another thing to consider is that maybe you don't really want to store all the pages yourself, but just forward the user to the site that actually contains them. That way, you only need to store a reference recording which pages contain which words, not the entire pages. Since a lot of pages have much repeated content, you only really need to store the unique words in your database along with a list of the pages that contain each word (and if you also filter out words that occur on nearly every page, such as "if", "and", "it", "to", "do", etc., you can reduce the amount of data you need to store). Count the occurrences of each word on each page, then compare across pages to find the words that are meaningless to search on.
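A rough sketch of that idea: build a word-to-pages index and drop words that appear on nearly every page (here the filter uses document frequency, and the 90% threshold is an arbitrary example):

    // word -> set of page URLs containing that word
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>

    using WordIndex = std::map<std::string, std::set<std::string>>;

    // Add every whitespace-separated word of a page to the index.
    void index_page(WordIndex& index, const std::string& url, const std::string& text)
    {
        std::istringstream words(text);
        std::string word;
        while (words >> word)
            index[word].insert(url);
    }

    // Words present on more than `threshold` of all pages carry little meaning
    // for search and can be dropped from the index.
    void drop_common_words(WordIndex& index, std::size_t total_pages, double threshold = 0.9)
    {
        for (auto it = index.begin(); it != index.end(); )
            if (it->second.size() > threshold * total_pages)
                it = index.erase(it);
            else
                ++it;
    }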
Well, if the program is to be constantly running during operation, you could just store the pages in RAM - grab a gigabyte of RAM and you'd be able to store a great many pages. This would be much faster than caching them to the hard disk.
I gather from the question that the user is on a different machine from the search engine, and therefore from the cache. Perhaps I am overlooking something obvious here, but couldn't you just send them the HTML over the connection already established between the user and the search engine? Text is very light data-wise, after all, so it shouldn't be too much of a strain on the connection.

Compare two strings and find how closely they are related by meaning

Problem:
I have two strings, say, "Billie Jean" and "Thriller". I need to programmatically compare them and find how closely they are related. Both are songs by the same artist, so they should give a higher score (probability, percentage, etc.) than, say, "Brad Pitt" and "Jamaican Farewell".
One way of doing this is with an open-source Java tool named WikipediaMiner, which compares terms using the Wikipedia data dump, checking links, descriptions, etc.
Question:
Please suggest a better alternative that uses any or all of Wikipedia, DBpedia, Freebase and their cousins, or that combines a different approach. I would really prefer open-source software that can be downloaded and set up on a server (e.g. Apache Mahout), rather than a paid web service.
It's not so much a matter of programming, but of data.
So it's not really a question for StackOverflow.
What you really want, I guess, is to use WordNet. It is meant as a database for reasoning about the meaning of words. So, for example, the data explicitly states that data mining is a form of data processing, and which things are physical entities, and so on.
You see, the reasoning will only be as good as your data.
DBpedia may also include a mapping from WordNet to Wikipedia, perhaps?
You can't tell that "Thriller" is a song, rather than a music video, a film genre, or a Lambchop album, without additional context.
After you've identified what your items are, it's "simply" a matter of traversing the graph of connections in Freebase, MusicBrainz, or whatever other information sources you are using.
You'll need to decide how you're going to weight things for scoring, though. Are two Michael Jackson songs more closely related to each other because they share the same type, or are they more closely related to the artist Michael Jackson because they're directly connected to him?
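For illustration only, with a made-up entity graph: one simple scoring scheme traverses the connection graph breadth-first and scores by inverse path length; the weighting question above would turn these unweighted edges into weighted ones.

    // Breadth-first search over an adjacency list of entity names.
    #include <map>
    #include <queue>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    using Graph = std::map<std::string, std::vector<std::string>>;

    double relatedness(const Graph& g, const std::string& a, const std::string& b)
    {
        std::queue<std::pair<std::string, int>> frontier;
        std::set<std::string> seen{a};
        frontier.push({a, 0});
        while (!frontier.empty()) {
            auto [node, dist] = frontier.front();
            frontier.pop();
            if (node == b) return 1.0 / (1.0 + dist);   // closer => higher score
            auto it = g.find(node);
            if (it == g.end()) continue;
            for (const auto& next : it->second)
                if (seen.insert(next).second)
                    frontier.push({next, dist + 1});
        }
        return 0.0;   // no connection found
    }

    // Example: with edges both ways between each song and "Michael Jackson",
    // "Billie Jean" and "Thriller" are two hops apart, giving a score of 1/3.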

Looking for Ideas: How would you start to write a geo-coder?

Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and gargantuan task, so I am wondering how you might get started. What would you read? What algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data from? (unless you are willing to pay thousands of dollars for proprietary sets).
You could build a geocoding API on top of OpenStreetMap (they publish their data in dumps on a regular basis), I guess, but that data was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, OR they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users who were early adopters of GPS technology and were more than willing to enter street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance (as the crow flies). From there, try to figure out how to include roads. (My guess is you would have to have a data point for each and every curve, where you hold the geocoded location of the curve and the angle of the road on a north/south and east/west vector. You'd probably need to take incline into account, too, to get accurate road measurements.)
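For the crow-flies part, the usual formula is the haversine (great-circle) distance:

    // Great-circle distance between two lat/lon points, in kilometres.
    #include <cmath>

    double haversine_km(double lat1, double lon1, double lat2, double lon2)
    {
        const double kEarthRadiusKm = 6371.0;
        const double kDegToRad = 3.14159265358979323846 / 180.0;

        double dlat = (lat2 - lat1) * kDegToRad;
        double dlon = (lon2 - lon1) * kDegToRad;
        double a = std::sin(dlat / 2) * std::sin(dlat / 2) +
                   std::cos(lat1 * kDegToRad) * std::cos(lat2 * kDegToRad) *
                   std::sin(dlon / 2) * std::sin(dlon / 2);
        return 2.0 * kEarthRadiusKm * std::asin(std::sqrt(a));
    }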
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.
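A minimal sketch of that first prototype, assuming a made-up zip_latlon.csv file with one "zip,lat,lon" record per line:

    // Load the ZIP table into a map, then answer a single lookup as raw text.
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>

    int main(int argc, char** argv)
    {
        std::map<std::string, std::pair<double, double>> zips;
        std::ifstream in("zip_latlon.csv");            // assumed data file
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream row(line);
            std::string zip, lat, lon;
            std::getline(row, zip, ',');
            std::getline(row, lat, ',');
            std::getline(row, lon, ',');
            if (zip.empty() || lat.empty() || lon.empty()) continue;
            zips[zip] = {std::stod(lat), std::stod(lon)};
        }

        if (argc < 2) return 1;
        auto it = zips.find(argv[1]);
        if (it == zips.end()) { std::cout << "not found\n"; return 1; }
        std::cout << it->second.first << "," << it->second.second << "\n";
        return 0;
    }

The same lookup can then be wrapped in whatever web framework serves the RESTful API and the JSON/XML/CSV output formats.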