Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Novosoft spent over 1000 man-hours investigating the issues of calculating relevance for large volumes of data. As a result, Novosoft has designed an architecture that can support novel research activities on large-scale web data. The system is based on the following major concepts:
Distributed data processing
Because the Internet grows so quickly, it is almost impossible to keep an index up to date and perform thousands of query operations per second with one centralized server. To solve this problem, Novosoft has developed a distributed data processing technology. Each service can run on a dedicated computer or share a computer with another service, so all components are isolated from each other and can be reused. This isolation of components also improves the maintainability of the project.
Search by meaning feature
Generally, this means the following search scheme. First, the user enters a word to search for. Then the search engine tries to find an entry for this word in a dictionary and, along with the standard search results, generates a list of possible meanings of the word entered. If the user then selects a particular meaning, the system generates a refined search request, which consists of the word entered by the user and the synonyms obtained from the dictionary. This scheme guarantees that, when performing a refined search, the system selects the desired URLs first.
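The scheme above can be illustrated with a minimal Java sketch. It is not the actual Novosoft code: the class `MeaningSearchSketch`, its nested `Meaning` type, and the methods `addMeaning`, `meaningsOf`, and `refine` are hypothetical names, and a toy in-memory map stands in for the real dictionary.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the search-by-meaning flow; all names are assumptions. */
final class MeaningSearchSketch {

    /** One dictionary sense of a word: a short description plus its synonyms. */
    static final class Meaning {
        final String gloss;
        final List<String> synonyms;

        Meaning(String gloss, List<String> synonyms) {
            this.gloss = gloss;
            this.synonyms = List.copyOf(synonyms);
        }
    }

    // Toy in-memory dictionary standing in for the real thesaurus.
    private final Map<String, List<Meaning>> dictionary = new LinkedHashMap<>();

    void addMeaning(String word, String gloss, List<String> synonyms) {
        dictionary.computeIfAbsent(normalize(word), k -> new ArrayList<>())
                  .add(new Meaning(gloss, synonyms));
    }

    /** The meanings offered to the user alongside the standard search results. */
    List<Meaning> meaningsOf(String word) {
        return dictionary.getOrDefault(normalize(word), List.of());
    }

    /** The refined request: the word entered by the user plus the chosen meaning's synonyms. */
    Set<String> refine(String word, Meaning chosen) {
        Set<String> terms = new LinkedHashSet<>(); // keeps the user's word first
        terms.add(normalize(word));
        for (String synonym : chosen.synonyms) {
            terms.add(normalize(synonym));
        }
        return terms;
    }

    private static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }
}
```

In this sketch the refined request is simply an ordered set of terms. How the real system weights the original word against its synonyms when ranking URLs is not described in this document.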
The system also allows one server to use several independent indexes. This is useful for a wide range of applications, for example searching many independent sites with a single query.
The search engine can be scaled to any target system, from desktops to high-end computers. This is made possible by the distributed data processing architecture shown in the screenshot below.
The solution consists of the following components:
- URL Server
- Crawler
- Search engine
- Search engine core
- WEB front-end
- Thesaurus
- Thesaurus Editor
The URL Server handles information about all documents or, rather, the URLs of all documents in the system. It manages a simple but very efficient URL database based on hash tables with a high-performance rehashing algorithm. This database is called the URL Repository, and it is part of the URL Server implementation.
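As an illustration of a hash-table URL store that rehashes as it grows, here is a minimal Java sketch. It is not the actual URL Repository: the class name `UrlRepositorySketch`, the use of linear probing, the 0.75 load-factor limit, and the table-doubling policy are all assumptions chosen for brevity.

```java
/** Hypothetical sketch of a hash-table URL store; not the actual URL Repository. */
final class UrlRepositorySketch {
    private String[] slots = new String[16]; // capacity is always a power of two
    private int[] ids = new int[16];         // ids[i] is the id of the URL in slots[i]
    private int size;

    /** Returns the id of the URL, assigning the next free id if the URL is new. */
    int idFor(String url) {
        if ((size + 1) * 4 > slots.length * 3) {
            rehash(); // assumed policy: keep the load factor at or below 0.75
        }
        int i = find(slots, url);
        if (slots[i] == null) {
            slots[i] = url;
            ids[i] = size++;
        }
        return ids[i];
    }

    boolean contains(String url) {
        return slots[find(slots, url)] != null;
    }

    int size() {
        return size;
    }

    /** Linear probing: returns the slot holding the URL, or the first empty slot. */
    private static int find(String[] table, String url) {
        int mask = table.length - 1;
        int i = spread(url.hashCode()) & mask;
        while (table[i] != null && !table[i].equals(url)) {
            i = (i + 1) & mask;
        }
        return i;
    }

    /** Mixes the high bits into the low bits, since only the low bits pick a slot. */
    private static int spread(int h) {
        return h ^ (h >>> 16);
    }

    /** Doubles the table and reinserts every entry at its new position. */
    private void rehash() {
        String[] oldSlots = slots;
        int[] oldIds = ids;
        slots = new String[oldSlots.length * 2];
        ids = new int[oldIds.length * 2];
        for (int j = 0; j < oldSlots.length; j++) {
            if (oldSlots[j] != null) {
                int i = find(slots, oldSlots[j]);
                slots[i] = oldSlots[j];
                ids[i] = oldIds[j];
            }
        }
    }
}
```

Because the table is never allowed to fill completely, every probe sequence ends at either the URL or an empty slot. A production repository would also need persistence and deletion, which this sketch leaves out.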
The purpose of the Crawler is to retrieve documents from the Internet and pass them to the Indexer. Each Crawler runs 20 threads, and each thread keeps up to 20 connections open at the same time. This is necessary to retrieve web pages at high speed. At peak speed, the system can crawl over 100 web pages per second. A thread-pool technique is used to increase Crawler performance.
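The thread-pool technique mentioned above might look roughly like the following Java sketch, built on the standard `java.util.concurrent` executor API. The class `CrawlerPoolSketch` and its `fetcher` and `indexer` parameters are hypothetical stand-ins for the real download and Indexer hand-off code, and the sketch does not model the multiple connections held by each thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical sketch of a thread-pool crawler; not the actual Crawler. */
final class CrawlerPoolSketch {
    private final ExecutorService pool;
    private final Function<String, String> fetcher; // URL -> document body, or null on failure
    private final Consumer<String> indexer;         // receives each document; must be thread-safe

    CrawlerPoolSketch(int threads, Function<String, String> fetcher, Consumer<String> indexer) {
        this.pool = Executors.newFixedThreadPool(threads); // threads are reused across URLs
        this.fetcher = fetcher;
        this.indexer = indexer;
    }

    /** Fetches all URLs on the pool and returns how many documents reached the indexer. */
    int crawl(List<String> urls) {
        List<Callable<Boolean>> tasks = new ArrayList<>();
        for (String url : urls) {
            tasks.add(() -> {
                String body = fetcher.apply(url);
                if (body == null) {
                    return false;
                }
                indexer.accept(body);
                return true;
            });
        }
        int delivered = 0;
        try {
            // invokeAll blocks until every task has completed.
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                try {
                    if (result.get()) {
                        delivered++;
                    }
                } catch (ExecutionException e) {
                    // A fetch that threw an exception is skipped.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return delivered;
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

Constructing the sketch with `new CrawlerPoolSketch(20, fetcher, indexer)` mirrors the 20 threads described above. Reusing a fixed set of threads avoids the cost of creating a new thread for every page.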
The search engine consists of two logical modules: the Indexer and the Search engine core (see screenshot below). The first manages the Indexer database, and the second processes search requests. The Indexer is also responsible for re-indexing new documents retrieved by the Crawler.
The Indexer database is a simple, high-performance database designed to hold close to one million records.
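The split between the Indexer and the Search engine core can be pictured with a small inverted index in Java. This is only a sketch under stated assumptions: the class `InvertedIndexSketch`, its tokenization rule, and the "all query terms must match" semantics are illustrative choices, and the relevance calculation used by the real system is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of an in-memory inverted index; not the actual Indexer database. */
final class InvertedIndexSketch {
    // term -> sorted set of URLs whose text contains that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexer side: indexes a document, replacing any earlier version of the same URL. */
    void index(String url, String text) {
        // Simple but slow removal of the old version: scans every posting list.
        postings.values().forEach(urls -> urls.remove(url));
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(url);
        }
    }

    /** Search core side: returns the URLs that contain every term of the query. */
    List<String> search(String query) {
        Set<String> result = null;
        for (String term : tokenize(query)) {
            Set<String> urls = postings.getOrDefault(term, Set.of());
            if (result == null) {
                result = new TreeSet<>(urls);
            } else {
                result.retainAll(urls); // intersect with the next term's URLs
            }
        }
        return result == null ? List.of() : new ArrayList<>(result);
    }

    /** Assumed tokenization: lower-case, split on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

Calling `index` again with the same URL models re-indexing: the old terms are dropped before the new text is added. Results come back in URL order here only because no ranking is implemented.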
The WEB front-end for the system is implemented in JSP. This module communicates with two others: the Search engine core and the Thesaurus. An HTTP server is embedded in the WEB front-end component, so no separate HTTP daemon has to be started in order to use the system.
The purpose of the Thesaurus is to provide meanings and synonyms for a given word and to store relations between words. This information is used by the front-end application to provide a search-refining capability, which drastically increases the quality and relevance of search results. Two dictionaries were implemented: a gateway to WordNet® and Novosoft's own Custom Dictionary. WordNet® is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory.
The Thesaurus Editor provides a web front-end for managing the Thesaurus and editing the Custom Dictionary.