Despite the importance of large-scale web search engines, very little academic research has been done on them. Novosoft spent over 1000 man-hours investigating the issues of calculating relevance for large volumes of data. As a result, Novosoft has designed an architecture that can support novel research activities on large-scale web data. The system is based on the following major concepts:
Distributed data processing Because of the very rapid growth of the Internet, it is almost impossible to keep an index up to date and perform thousands of query operations per second on a single centralized server. To solve this problem, Novosoft has developed a distributed data processing technology. Each service can run on a dedicated computer or share a computer with another service; all components are isolated from each other and can be reused. This isolation also improves the maintainability of the project.
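As a minimal sketch of this idea (the service names, hosts, and ports below are hypothetical, not Novosoft's actual deployment), each component can be located through a mapping from service names to network addresses, so moving a service to its own machine is a configuration change rather than a code change:

```java
import java.net.InetSocketAddress;
import java.util.Map;

// Hypothetical service locator: each component is addressed by name,
// so a service can run on its own machine or share one with others.
public class ServiceLocator {
    // In a real deployment this mapping would come from a config file.
    private static final Map<String, InetSocketAddress> SERVICES = Map.of(
        "url-server",  new InetSocketAddress("host-a.example.com", 7001),
        "indexer",     new InetSocketAddress("host-b.example.com", 7002),
        "search-core", new InetSocketAddress("host-b.example.com", 7003) // shared host
    );

    public static InetSocketAddress lookup(String serviceName) {
        InetSocketAddress addr = SERVICES.get(serviceName);
        if (addr == null) {
            throw new IllegalArgumentException("Unknown service: " + serviceName);
        }
        return addr;
    }
}
```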
Search by meaning feature In general, this means the following search scheme. First, the user enters a word to search for. The search engine then tries to find an entry for this word in a dictionary and, along with the standard search results, generates a list of possible meanings of the entered word. If the user selects a particular meaning, the system generates a refined search request consisting of the word entered by the user plus the synonyms obtained from the dictionary. This scheme guarantees that, when performing the refined search, the system will select the desired URLs first.
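The refinement step can be sketched as follows. The `Thesaurus` type and its `synonymsOf` method here are hypothetical stand-ins for the system's dictionary lookup, not its actual API:

```java
import java.util.List;

// Sketch of the query-refinement step described above: the refined
// request is the user's word plus the synonyms for the chosen meaning.
public class QueryRefiner {
    interface Thesaurus {
        List<String> synonymsOf(String word, String meaning);
    }

    // Build the refined request from the original word and its synonyms.
    static String refine(Thesaurus thesaurus, String word, String meaning) {
        StringBuilder query = new StringBuilder(word);
        for (String synonym : thesaurus.synonymsOf(word, meaning)) {
            query.append(" OR ").append(synonym);
        }
        return query.toString();
    }
}
```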
Multi-core server This feature allows one server in the system to use several independent indexes. It can be useful for a wide range of applications, such as indexing many different independent sites for a one-time search.
Scalability The search engine can be scaled to any target system, from desktops to high-end computers. This is provided by the distributed data processing architecture shown in the screenshot below.
The solution consists of the following components:
- URL Server
- Crawler
- Search engine
- Indexer
- Search engine core
- WEB front-end
- Thesaurus
- Thesaurus Editor
The URL Server handles information about all documents in the system or, more precisely, the URLs of those documents. It manages a simple but very efficient URL database (this component is called the URL Repository, but it is part of the URL Server implementation) based on hash tables with a high-performance rehashing algorithm.
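The actual rehashing algorithm is not documented here; the following sketch only illustrates the general idea of a URL-to-id hash table that rehashes as it grows, using plain table doubling for simplicity:

```java
// Minimal sketch of a URL repository backed by an open-addressing hash
// table. The real URL Server's high-performance rehashing algorithm is
// not described in the article; this uses simple table doubling.
public class UrlRepository {
    private String[] urls = new String[16];
    private int[] ids = new int[16];
    private int size = 0;
    private int nextId = 1;

    // Return the document id for a URL, assigning a new id if unseen.
    public int idFor(String url) {
        if (size * 2 >= urls.length) rehash();
        int i = slotFor(url, urls);
        if (urls[i] == null) {
            urls[i] = url;
            ids[i] = nextId++;
            size++;
        }
        return ids[i];
    }

    // Linear probing: find the slot holding this URL, or an empty slot.
    private int slotFor(String url, String[] table) {
        int i = (url.hashCode() & 0x7fffffff) % table.length;
        while (table[i] != null && !table[i].equals(url)) {
            i = (i + 1) % table.length;
        }
        return i;
    }

    // Grow the table and reinsert every entry under the new capacity.
    private void rehash() {
        String[] oldUrls = urls;
        int[] oldIds = ids;
        urls = new String[oldUrls.length * 2];
        ids = new int[oldIds.length * 2];
        for (int j = 0; j < oldUrls.length; j++) {
            if (oldUrls[j] != null) {
                int i = slotFor(oldUrls[j], urls);
                urls[i] = oldUrls[j];
                ids[i] = oldIds[j];
            }
        }
    }
}
```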
The purpose of the Crawler is to retrieve documents from the Internet and pass them to the Indexer. Each Crawler runs 20 threads and keeps up to 20 connections open in each thread at the same time; this is necessary to retrieve web pages at high speed. At peak speed, the system can crawl over 100 web pages per second. A thread-pool technique is used to increase Crawler performance, as sketched below.
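The following is a minimal illustration of that thread-pool technique, not Novosoft's actual crawler code: a fixed pool of 20 worker threads fetches pages concurrently, and the hand-off to the Indexer is left as a placeholder:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Thread-pool crawling sketch: 20 workers fetching pages in parallel.
public class Crawler {
    private static final int THREADS = 20;
    private final HttpClient client = HttpClient.newHttpClient();
    private final ExecutorService pool = Executors.newFixedThreadPool(THREADS);

    public void crawl(List<String> urls) {
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                    HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                    handOffToIndexer(url, response.body());
                } catch (Exception e) {
                    // A real crawler would record the failure and retry later.
                }
            });
        }
    }

    private void handOffToIndexer(String url, String body) {
        // Placeholder: in the real system the document goes to the Indexer.
    }
}
```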
The Search engine consists of two logical modules: the Indexer and the Search engine core (see screenshot below). The first manages the Indexer database; the second processes search requests. The Indexer is also responsible for indexing the new documents retrieved by the Crawler.
The Indexer database is a simple, high-performance database intended to hold close to one million records.
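The article does not specify the Indexer database format; a common structure for a search index of this kind is an inverted index mapping each word to the documents containing it, sketched here for illustration only:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative inverted index: word -> list of document ids.
public class InvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Record every word of a document under its document id.
    public void index(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, w -> new ArrayList<>()).add(docId);
            }
        }
    }

    // Return the ids of documents containing the given word.
    public List<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(), List.of());
    }
}
```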
The WEB front-end for the system is implemented in JSP. This module communicates with two other components: the Search engine core and the Thesaurus. An HTTP server is embedded in the WEB front-end component, so no external HTTP daemon needs to be started to use the system.
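The embedded-server idea can be illustrated with the JDK's built-in HTTP server; the real module is implemented in JSP, so this sketch only mirrors the concept of starting the listener in-process, with a hypothetical /search endpoint:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// The front-end starts its own HTTP listener, so no external web
// daemon is required. Endpoint and port are illustrative.
public class FrontEnd {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/search", exchange -> {
            byte[] body = "search results would go here".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start(); // now reachable at http://localhost:8080/search
    }
}
```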
The purpose of the Thesaurus is to provide meanings and synonyms for a given word and to store relations between words. This information is used by the front-end application to provide a search-refining capability, which drastically increases the quality and relevance of search results. Two dictionaries were implemented: a gateway to WordNet® and Novosoft's own Custom Dictionary. WordNet® is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory.
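One way the two dictionaries could be combined is sketched below; the interfaces, and the choice to consult the Custom Dictionary before falling back to the WordNet® gateway, are assumptions for illustration, not the system's or WordNet's actual API:

```java
import java.util.List;

// Hypothetical composition of the two dictionaries described above.
public class CombinedThesaurus {
    interface Dictionary {
        List<String> meaningsOf(String word); // empty if the word is unknown
    }

    private final Dictionary custom;   // the editable Custom Dictionary
    private final Dictionary wordNet;  // gateway to the WordNet® database

    public CombinedThesaurus(Dictionary custom, Dictionary wordNet) {
        this.custom = custom;
        this.wordNet = wordNet;
    }

    // Prefer the custom entries; fall back to WordNet® when absent.
    public List<String> meaningsOf(String word) {
        List<String> meanings = custom.meaningsOf(word);
        return meanings.isEmpty() ? wordNet.meaningsOf(word) : meanings;
    }
}
```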
The Thesaurus Editor provides a web front-end for managing the Thesaurus and editing the Custom Dictionary.