This paper looks at some of the most popular search engine software packages currently available free of charge to end users. We compare the features offered by the different packages and explain in detail how the search engines actually operate. We also look at how effective each search engine’s crawler and indexing functions are.
Computers have become a particularly popular way to organize and store information. The internet has made this information accessible to everyone. The sheer volume of information available on the internet makes it difficult for users to find what they actually want.
The main purpose of a search engine is to help users retrieve the information that they are looking for. There are lots of different tools which can be used to build your own search engine. Searchtools.com lists many of these software packages and reviews their usefulness. What’s more, most of these packages are free as long as they are not used commercially.
Because there are so many different search engine software packages out there, it can be very difficult to choose the right one. This paper aims to help you decide which search engine would be best suited to your website. We will do this by highlighting the main benefits, features and basic information of the most popular search engine software packages.
We will start off by looking at a simple introduction to the world of free search engine creation software. We will then look at some of the basic information of the most popular software packages currently available. At the end of the paper we will compare the different pieces of software available.
There are lots of examples of free search engine software, and you can find them from a number of different sources, including codebeach.com, searchenginewatch.com, searchtools.com and sourceforge.net. Many of these packages are freeware, while some are open source, which means the source code is distributed with them.
Generally speaking, free search engine software is quite hard to understand because its documentation is often poor. This can make it very difficult to work out the features and functionality that a package provides.
There are two different types of search engine software, depending on where the search is actually carried out: server-side search engines and remote-side search engines. With remote-side search engines, all of the indexing and querying is done on a remote server. Server-side search engine software runs on your own computer and creates an independent, genuine search engine.
We will only look at server side search engine software as this is what constitutes a real search engine.
The types of search engines are further categorized into two groups. There are website based search engines and file system search engines. The file system search engines only catalog files which are stored on the local network. Website search engine software on the other hand can index remote websites by using web crawlers.
Lots of search engines use both of these systems and are capable of indexing remote servers and the local file system.
If you are interested in a website search engine package then you will need to make sure that it is a complete package. Any effective web search engine needs to include a web crawler, an indexer, a query engine and a search interface.
All of the software packages that we will look at have these main features built into them.
In this section we will look at some of the most popular search engine software packages and provide some basic information about each. The information given will include the website address, whether it is open source, licensing information, documentation, the platforms it can be used on, who built the software, and how complete the package is.
The licensing information covers whether the software is distributed as freeware for anyone to use or whether there are conditions on its use. It also mentions any upgrade options that are available.
If the source code is available then the package will be described as being open source. This is particularly useful if you want to customize the system.
It’s important that there is enough documentation available to use the package correctly. This will show you where the documentation can be found.
The platform information tells you which operating systems each solution can run on, i.e. can it run on Windows, Mac, or Linux?
The completeness information looks at the functionality of the package. Does it provide everything that you need? Does it have a web crawler, indexer, query engine and interface? If so, it is a complete package; if not, you might need other tools to complete it.
This information is really useful so that an administrator can make an informed decision when choosing the ideal search engine software package. Being able to find out whether the search engine will work on your platform in one place will make life much easier than normal. Once you have determined which packages will work on your equipment you can then find out more information about the packages and decide which one you want to install.
| Name | Licence | Website | Source code | Language | O/S | Complete | Developer |
|---|---|---|---|---|---|---|---|
| Alkaline | Free - non-commercial use only | www.alkaline.vestris.com | Not open source although source code can be purchased | C++ | Solaris, IRIX, Linux, FreeBSD, Windows | Complete | Founder is Daniel Doubrovkine. Plus various people from Lavtech Corp including Aleksey Botchkov, Kalimullin and Sergei Kartashoff. |
| Fluid Dynamics | Freeware version available or a free trial of commercial version available | | Source code available from website | Perl | Unix, Linux, Windows 95, 98, ME, 2000 | Complete | Created by Zoltan Milosevic. Owned by Fluid Dynamics Software Corporation |
| ht://Dig | Free | | Source code available from website | C and C++ | Linux, Mac OS, IRIX, HP/UX, SunOS | Complete | Created by Loic Dachary and Geoff Hutchinson. |
| Juggernaut-search | Free for non-commercial use | | | Perl | Linux, Windows NT, 2000. XP also supported with commercial version | Complete | Created by Donald Kasper. |
| mnoGoSearch | Free - Unix version is free | www.mnogosearch.org | Available freely | C | Unix, Linux, FreeBSD | Complete | Alexander Barkov |
| Perlfect | Free for both commercial and non-commercial use | | | Perl | Unix, Linux, Windows NT | Complete | N. Moraitakis |
| SWISH-E | Free for all | | | C and Perl | FreeBSD, SunOS, NetBSD, Windows NT | Not complete - need to use additional CGI code to start searching | The first version of SWISH was designed and made by Kevin Hughes. In 1996, the Library of UC Berkeley asked for permission to enhance the application, which created SWISH-E |
| Webinator | Free - free version only supports a maximum of 10,000 pages and 10,000 hits each day | | Source code is not available | Vortex-Tex | Unix, Linux, Windows NT, Windows 2000 | Complete | Thunderstone Inc |
| Webglimpse | Free - free version only suitable for educational and governmental use | | | C and Perl | IRIX, OSF, Rhapsody, AIX, SunOS | Complete | University of Arizona |
| LEXST-SEA | Free - supports up to 200,000 pages with a maximum of three nodes. Commercial versions can support billions of pages | | Not available | Unknown | 64-bit Windows operating system | Complete | Dust Gem Co. Ltd |
| Google GSA | Proprietary | | Strictly under lock and key | Unknown | Standalone system | Complete | Google |
| Zoom Search Engine | Free version available - limited indexing | http://www.wrensoft.com/zoom/ | Not available | Unknown | Cross-platform support | Complete | Wrensoft |
| Site Search Pro | Proprietary - cost depends on licence | http://www.site-search-pro.com/ | Not available | Unknown | Cross-platform support | Incomplete | Shedix |
We will now spend some time comparing the various different search engine packages available so that you can decide which one will suit your website the best. Each of the search engine packages will be compared by looking at four qualities.
The search technique covers how the search engine indexes documents and how it ranks the search results. These two choices affect how the server will be set up, the speed of results, and how much disc space is required on the server.
Almost every search engine speeds things up by indexing data before it is searched. It is generally much quicker and easier to search through data that has already been indexed than through raw data. It’s very important, however, that the indexed information is up to date, in a useful format and contains useful information.
There are a number of different indexing methods; the one normally used is the full-text inverted index. This requires a considerable amount of disc space and indexing can be very slow, because every occurrence of every word is stored in the index.
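To make the idea concrete, here is a minimal sketch of a full-text inverted index. The function names (`tokenize`, `build_inverted_index`) and the sample documents are our own illustrations and are not taken from any of the packages reviewed here.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map every word to {doc_id: [positions]} across all documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(position)
    return index


# Hypothetical sample documents.
docs = {
    "a.html": "Free search engine software",
    "b.html": "Search the index, not the raw data",
}
index = build_inverted_index(docs)
print(sorted(index["search"]))  # documents that contain "search"
print(index["the"]["b.html"])   # every position of "the" in b.html
```

Because every position of every word is recorded, the index grows with the size of the collection, which is where the disc space and indexing time go.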
There are also more efficient methods, such as indexing only certain parts of each document. This might mean extracting the title, description, keywords and possibly the author, and indexing only these. This makes indexing much quicker and speeds up the whole process. There are lots of interesting indexing methods which can be used. WebGlimpse, for example, uses two-level indexing. Alkaline uses a secret algorithm when indexing, and many of the other packages use unique features too.
Relevance ranking is the way a search engine orders documents in relation to a query. There are a number of factors that the search engine software can use to decide how relevant a document is. These factors can include word position, popularity, and word density. Different search engines will utilize different factors when ranking sites.
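As a rough illustration of indexing only selected parts of a page, the sketch below pulls the title and a few meta tags out of an HTML document with Python’s standard `html.parser` module. The class name, the choice of fields and the sample page are assumptions made for this example; none of the packages above is known to work exactly this way.

```python
from html.parser import HTMLParser


class AttributeExtractor(HTMLParser):
    """Collect the title and selected meta tags from an HTML page."""

    WANTED = {"description", "keywords", "author"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name in self.WANTED:
            self.fields[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is kept; body text is ignored.
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data


def extract_attributes(html):
    parser = AttributeExtractor()
    parser.feed(html)
    return parser.fields


# Hypothetical sample page.
page = (
    "<html><head><title>Search tools</title>"
    '<meta name="keywords" content="search, index">'
    '<meta name="description" content="A test page">'
    "</head><body>Long body text that is never indexed</body></html>"
)
print(extract_attributes(page))
```

Only a handful of short fields per page reach the index, which is why this approach is quicker and smaller than full-text indexing, at the cost of missing words that appear only in the body.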
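We will also spend time looking at a number of features of indexers and web crawlers.
We will look at searching from the following points of view: Boolean search, phrase matching, attribute search, fuzzy search, word forms, wild cards, regular expressions, numeric data search, case sensitivity and natural language queries.
There are also a number of other features that we will consider which do not fit into any of the above categories. These include international language support, page limits and customizable result formatting.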
| Feature | Alkaline | Fluid Dynamics | ht://Dig | Juggernautsearch | mnoGoSearch | Perlfect | SWISH-E | Webinator | Webglimpse | LEXST-SEA | Google GSA | Zoom Search Engine | Site Search Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Search Technique | | | | | | | | | | | | | |
| Indexing Method | Secret algorithm | Attribute indexing | Inverted index | Keywords index | Inverted index | Inverted index | Unknown | Inverted index | Two-level query | Unique algorithm | Unique algorithms | Various settings | Unknown |
| Relevance Ranking | Weighting words | Words weighted according to frequency | Word weighting | Word weighting | Word weighting | Algorithm | Unknown | Unknown | Unknown | Unknown | Unknown | Word weighting | Unknown |
| Crawler Features | | | | | | | | | | | | | |
| Robot Exclusion Standard Support | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Crawler Retrieval Depth Control | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Duplicate Page Detection | Yes | Yes | Yes | Yes | Yes | Don't know | Yes | Yes | Unknown | Yes | Yes | Yes | Yes |
| File Format to be Indexed | html, htm, text, SWF, shtml, PDF, doc, rtf, MP3 | htm, html, shtm, shtml, stm, mp3, PDF, and txt | htm, html, txt, MS Word, PDF, ppt, PS and XLS | txt, html, htm, shtml, shtm, powerpoint, word files, excel, rtf, C, CXX, CGI, Java, PL and PHP | htm, html, txt, postscript, pdf, word doc, MP3, fields within SQL database | pdf, html and txt | XML, html, txt, word doc, gzip, and PDF | htm, html, PDF, txt, word doc, asp, word perfect files, shtml, phtml and jhtml | HTML, Word doc, PDF natively. Also supports any other format which can be converted to plain text | HTML, Text | Supports over 220 file types including Microsoft Office, HTML and PDF | Supports HTML, Doc and PDF files as standard | Supports HTML files, PDF, XLS and SWF files |
| Index Protected Server | Yes | No | Yes | No | Yes | No | No | No | No | Yes | Yes | Unknown | Unknown |
| Searching Features | | | | | | | | | | | | | |
| Boolean Search | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Phrase Matching | No | Yes | No | No | Yes | No | Yes | Yes | No | Yes | Yes | Yes | Yes |
| Attribute Search | Yes | Yes | Yes | Yes | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes |
| Fuzzy Search | No | No | Yes | Unknown | Yes | No | Yes | Yes | Yes | Yes | Yes | Unknown | Unknown |
| Word Forms | Yes | Yes | Yes | Unknown | Yes | No | Yes | Yes | No | Yes | Yes | Yes | Yes |
| Wild Card | Yes | Yes | Yes | Unknown | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Unknown |
| Regular Expression | No | No | No | Unknown | No | No | No | Yes | Yes | Yes | Yes | Yes | No |
| Numeric Data Search | Yes | No | No | No | No | No | No | Yes | No | Yes | Yes | Yes | Unknown |
| Case Sensitivity | Yes | No | No | Unknown | No | No | No | No | Yes | Yes | Yes | No | No |
| Natural Language Query | No | No | No | No | No | No | No | Yes | No | Yes | Yes | No | No |
| Other Features | | | | | | | | | | | | | |
| International Language | No | Latin-extended languages | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes - auto detects language | Yes | Yes |
| Page Limit | 2 billion documents in theory. In practice recommended limit of 50k - 500k pages | Depends on computer, around 100,000 documents | Can index over 100,000 pages | Unlimited | Millions | 1,000+ | Unknown | 10,000 pages for free - commercial version supports more | Unknown | Billions, 200,000 pages for free version | Up to 30 million depending on model and contract chosen | Over 1 million | Unknown |
| Customizable Result Formatting | Yes | Yes | Yes | Unknown | Yes | Yes | Unknown | Yes | No | Yes | Yes | Yes | Yes |
We will now compare the features of the different software packages available so that you can decide which one is best for your requirements.
Alkaline is a very popular and sophisticated search engine which supports most of the features that you could want from search engine software.
Alkaline uses cellular expansion when indexing and searching documents. This is a unique algorithm which hashes the data and then finds the short search terms very quickly. This algorithm is said to make indexing very quick even when dealing with large documents.
The search method is supposed to adapt to what is being searched for. This means that the more precise a user is with their search terms, the better and more relevant the results will be. The word weighting feature gives a different weight to words depending on where they are mentioned, and the ranking weight can be adjusted if required. Words which should be given a lower ranking can be listed in a Weak Words file.
AlkalineBOT is the robot that Alkaline uses, and it is fully compliant with industry standard robots.txt files. It also respects nofollow tags in the HTML code. If required, the robots feature of Alkaline can be disabled in the setup.
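What robots.txt compliance means in practice can be sketched with Python’s standard `urllib.robotparser` module. This is not Alkaline’s own code; the rules, the agent name `ExampleBot` and the URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, parsed from a string so that the
# example does not need network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, agent="ExampleBot"):
    """Return True if the robots.txt rules allow this agent to fetch url."""
    return parser.can_fetch(agent, url)


print(may_crawl("http://example.com/index.html"))
print(may_crawl("http://example.com/private/report.html"))
```

A compliant crawler makes this check before every request and simply skips any URL for which the answer is False.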
It’s possible to set a maximum depth for the URLs which are followed. Alkaline also uses MD5 hashing to identify duplicated documents, which are then excluded from the index. This is very helpful because duplicate entries become a problem as your search engine grows in size.
Alkaline is capable of indexing htm, html, and shtml files directly. It can also index a wide variety of other file formats if external filters are installed. These formats include Flash objects, doc, PDF, RTF, WordPerfect, MP3 and XML documents. Such documents first need to be processed by an external filter and are then indexed by Alkaline.
Alkaline supports indexing and retrieval of secured pages using HTTP/1.0 Basic authentication and NTLM support for NT. However, it does not support SSL encryption.
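The general idea of hash-based duplicate detection can be sketched as follows. We do not know how Alkaline implements it internally, so the function names and sample pages here are purely illustrative.

```python
import hashlib


def content_digest(body: bytes) -> str:
    """Return the MD5 digest of a page body as a hex string."""
    return hashlib.md5(body).hexdigest()


def drop_duplicates(pages):
    """Keep the first URL seen for each distinct page body."""
    seen = set()
    unique = []
    for url, body in pages:
        digest = content_digest(body)
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique


# Hypothetical crawl results: two URLs serve identical content.
pages = [
    ("/index.html", b"<html>Welcome</html>"),
    ("/home.html", b"<html>Welcome</html>"),
    ("/about.html", b"<html>About us</html>"),
]
print(drop_duplicates(pages))
```

Comparing fixed-length digests is much cheaper than comparing whole documents, which matters more and more as the index grows.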
Alkaline supports a wide range of search features including wild card, numeric search, case sensitive search, Boolean search and attribute search. It doesn’t support Fuzzy search, natural language, phrase matching or regular expression queries.
Alkaline supports English only.
The theoretical limit that Alkaline can index is 2 billion documents. In practice, the recommended use of Alkaline is to index between 50,000 and 500,000 pages of content. It’s possible to customize the look of the results.
Fluid Dynamics makes use of attribute indexing. It extracts the text, description, keywords, address and title, which can then be used for searching. By default it indexes all the information on the site; however, it is possible to set a maximum character limit which limits the number of bytes read from each document. Keeping this at a low value will speed up indexing but will damage the search results.
Search results are ranked by the number of times the keywords appear in each document. Keywords found in the title, description or keywords fields are given extra weight, and this extra weighting can be changed by adjusting the settings. Each time a search term is found, the software adds one point to the relevance of the page. If the term is found in the title or description, a multiplier is applied so that the match raises the score much faster.
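The point-and-multiplier scheme described above can be sketched in a few lines. The multiplier value of 5, the field names and the sample page are our own assumptions; the real package’s defaults may well differ.

```python
import re


def count_term(term, text):
    """Count whole-word, case-insensitive occurrences of term in text."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


def relevance(term, page, multiplier=5):
    """One point per hit in the body, multiplier points per hit in a
    weighted field (title, description, keywords)."""
    score = count_term(term, page["body"])
    for field in ("title", "description", "keywords"):
        score += multiplier * count_term(term, page[field])
    return score


# Hypothetical page record.
page = {
    "title": "Perl search",
    "description": "",
    "keywords": "perl",
    "body": "search search perl",
}
print(relevance("search", page))  # 2 body hits + 1 title hit * 5 = 7
print(relevance("perl", page))    # 1 body hit + 2 field hits * 5 = 11
```

Raising the multiplier makes well-labelled pages rise faster in the results, which is the effect the adjustable weighting is meant to give the administrator.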
Fluid Dynamics fully supports the robot exclusion standard. This means that it will respect information in the robots.txt file and in nofollow meta tags. The depth of crawling can be controlled because the search engine can be made to stop for approval after every pass. It is capable of detecting duplicate pages and will not include them in the index.
Fluid Dynamics supports a wide range of file types including shtml, html, shtm, htm, and mp3. It is possible to index PDF files; however, you will need to download additional software from Foolabs at www.foolabs.com/xpdf. It has no support for accessing protected content.
Fluid Dynamics supports a wide range of search options including phrase matching, Boolean search, attribute search, wild card and word forms. It doesn’t support numeric data searching, case sensitive searching, natural language queries, regular expressions or fuzzy searches.
Fluid Dynamics supports a wide range of languages. It is designed to support any language that uses Latin characters. This includes Dutch, German and English.
The results and query interfaces are designed using templates which makes it very easy to customize the site. This also makes it very easy to translate the site between languages.
Fluid Dynamics doesn’t have any theoretical limit; however, because of hardware requirements the practical limit will be around 100,000 documents.
The indexing method used by ht://Dig is the most standard indexing method in use: the full-text inverted index. Pages are ranked according to their word weight, which is normally determined by how important a word is to a specific document.
The ht://Dig crawler is compliant with the industry standard robot exclusion standard. It’s also possible to limit the depth of crawling by adjusting the maxhops option when setting up the program. ht://Dig does detect duplicates by looking at the signatures; however, it does not seem to remove them from the index.
By default ht://Dig can index txt and html files. When using external converters or parsers it is possible to use ht://Dig to index PowerPoint, PostScript, Excel, PDF and Word doc files. Any parsers or converters used must be listed in the configuration file.
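A depth limit of this kind is easy to picture as a breadth-first crawl that stops following links after a fixed number of hops. The sketch below uses an in-memory link graph instead of real HTTP requests; the parameter name `max_hops` and the sample site are hypothetical and only loosely modelled on the option mentioned above.

```python
from collections import deque


def crawl(start, links, max_hops):
    """Breadth-first crawl that does not follow links beyond max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, hops = queue.popleft()
        order.append(url)
        if hops == max_hops:
            continue  # depth limit reached: index the page, ignore its links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return order


# Hypothetical site structure: page -> pages it links to.
links = {
    "/": ["/a", "/b"],
    "/a": ["/a/deep"],
    "/a/deep": ["/a/deep/er"],
}
print(crawl("/", links, max_hops=1))  # ['/', '/a', '/b']
print(crawl("/", links, max_hops=2))  # ['/', '/a', '/b', '/a/deep']
```

The `seen` set also stops the crawler from visiting the same URL twice when pages link to each other.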
It is possible for ht://Dig to index documents on protected servers. It can also be set up to use certain usernames and passwords when accessing this protected data.
ht://Dig supports many different search features including fuzzy search, word forms, Boolean search, attribute search and wild card. It doesn’t support numeric searching, case sensitive search, natural language queries, regular expression queries or phrase matching.
There is no fixed page limit; the practical limit depends on your hardware. ht://Dig can normally index over 100,000 pages of content.
It’s possible to easily change the look and feel of the search results because the site uses HTML templates.
Unfortunately, the documentation for Juggernautsearch is not complete enough to determine whether or not it supports a number of the advanced features that we have been looking at. However, Juggernautsearch uses a very interesting indexing method which makes searching and indexing very quick.
Juggernautsearch extracts the most important keywords from a document and indexes only these. The keywords are weighted depending on how frequently they appear in the document. The index then stores all of these keywords in weight-ranked order. When a query is run, only the keywords stored in the index files are searched. The weighting of the search terms then determines the ranking of the page.
Indexing and searching are very fast because only keywords are indexed. This also means that the space requirements for Juggernautsearch are much lower than those of many competing products.
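A keyword-only index along these lines might look like the sketch below. The stop-word list, the limit on keywords per document and the function names are our own assumptions, since the documentation does not describe the real algorithm.

```python
from collections import Counter
import re

# Hypothetical list of words too common to count as keywords.
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in"}


def top_keywords(text, limit=3):
    """Return the most frequent non-stop-words with their counts."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    return Counter(words).most_common(limit)


def build_keyword_index(docs, limit=3):
    """Map each keyword to its (weight, doc_id) pairs, heaviest first."""
    index = {}
    for doc_id, text in docs.items():
        for word, weight in top_keywords(text, limit):
            index.setdefault(word, []).append((weight, doc_id))
    for postings in index.values():
        postings.sort(reverse=True)
    return index


# Hypothetical sample documents.
docs = {
    "a": "search engine search index search engine crawler",
    "b": "crawler crawler crawler crawler search",
}
index = build_keyword_index(docs, limit=2)
print(index["search"])   # [(3, 'a'), (1, 'b')]
print("index" in index)  # False: not one of the top keywords of "a"
```

Words that do not make the cut are simply absent from the index, which keeps it small but also means they can never be found by a search.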
Juggernautsearch is fully compliant with the robot exclusion standard, which means it will not follow pages excluded in the robots.txt file or marked with nofollow tags. The crawler used by Juggernautsearch is Page runner. This is a quick crawler; however, it is not possible to control the depth of crawling. Juggernautsearch can detect duplicated content and will remove it from its index.
Juggernautsearch can support a large number of different file formats, including txt, html, htm, shtml, shtm, PowerPoint, Word files, Excel, rtf, C, CXX, CGI, Java, PL and PHP.
Juggernautsearch can use attribute searching, and this can be used to restrict search results to a single URL. It’s not possible to use Boolean search with this software, because this feature can only be used when the whole document is available for searching. Juggernautsearch does not know all the words in a document, only the keywords. For the same reason Juggernautsearch does not support numeric searching, phrase matching or natural language searching.
Juggernautsearch only supports the English language.
A full inverted index is used by mnoGoSearch. Words which are located in specific parts of the document are given different weights. mnoGoSearch will look at a number of different factors when deciding how relevant documents are.
mnoGoSearch is compliant with the industry standard robot exclusion standard. It’s also possible to limit the depth that the crawler crawls. It supports txt and html files out of the box. If you want to add support for other file types then an external parser is required. With a parser, PS, PDF and Word doc files can be indexed and searched. If your server supports HTTP 1.1 then mnoGoSearch can support MP3 files. It is also capable of indexing information inside an SQL database and of accessing information on protected servers.
mnoGoSearch supports attribute search, word forms, fuzzy search, phrase matching, wild card and Boolean search. It doesn’t support numeric search, natural language queries, or regular expressions.
mnoGoSearch will support almost every 8 bit character set language and will also include a number of multi-byte character sets. It is capable of using Chinese, Japanese, and Korean. Some of these languages may require a conversion table. mnoGoSearch also supports Mac character sets. mnoGoSearch supports over 700 languages, which is pretty impressive.
It’s possible for mnoGoSearch to index millions of documents. It is also a very flexible system and extremely easy to customize. It provides access with PHP3, C CGI and Perl.
Perlfect makes use of one of the most standard and commonly used indexing methods available, the inverted index. To calculate the weighting of pages it uses an algorithm developed by Gerard Salton.
Perlfect is not compliant with the robot exclusion standard. This is because the system is normally only intended to index an individual website. It’s also not possible to control the depth of crawling, and it can’t be used on protected servers.
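The text does not say which of Salton’s algorithms Perlfect uses. As an assumption, the sketch below shows the classic tf-idf weighting associated with his work, where a word’s weight in a document is its frequency there multiplied by the logarithm of how rare the word is across the collection. The function name and sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {word: weight}} with weight = tf * log(N / df)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for words in tokenized.values() for word in set(words))
    n = len(docs)
    return {
        doc_id: {word: tf * math.log(n / df[word])
                 for word, tf in Counter(words).items()}
        for doc_id, words in tokenized.items()
    }


# Hypothetical sample documents.
docs = {"a": "perl search search", "b": "perl index"}
weights = tf_idf(docs)
print(weights["a"]["perl"])    # 0.0: the word appears in every document
print(weights["a"]["search"])  # 2 * log(2): frequent here, rare elsewhere
```

Words that appear everywhere carry no weight, while words that are frequent in one document and rare in the others dominate its ranking.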
The only search feature supported by Perlfect is Boolean search. It’s possible to require a keyword by using the plus sign and to exclude a word by using the minus sign.
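A plus and minus query syntax of this sort can be sketched as below. How terms without a prefix are treated is our own assumption (here a page matches if it contains at least one of them), and the function name is hypothetical.

```python
def matches(query, text):
    """Check a page against a query using +required and -excluded terms."""
    words = set(text.lower().split())
    required, excluded, optional = [], [], []
    for term in query.lower().split():
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            excluded.append(term[1:])
        else:
            optional.append(term)
    if any(term in words for term in excluded):
        return False
    if required:
        return all(term in words for term in required)
    # Assumption: with no required terms, any plain term may match.
    return not optional or any(term in words for term in optional)


print(matches("+perl -python", "perl search engine"))  # True
print(matches("+perl -python", "perl and python"))     # False
```

Exclusions are checked first, so a single excluded word rejects the page regardless of what else it contains.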
The software supports a number of different languages including French, Italian and German. It’s also possible to customise the look and feel of the site including the language using the templates.
Perlfect is a good all round lightweight search engine. It can however only be used to index around a thousand documents.
The documentation included with SWISH-E has made it difficult to determine how SWISH-E indexes and ranks search results.
The crawler used by SWISH-E is compliant with Robot Exclusion Standards. This means that it’s possible to stop crawlers accessing certain pages. It’s also possible to control the depth of crawling. SWISH-E can’t be used on protected servers.
SWISH-E supports txt, xml and html files as standard. If you want to index other file types including PDF, gzip, or Word doc files then you will need to use a converter so that SWISH-E can index them. GIF, MOV and MPG files can be indexed by name; however, their content cannot.
SWISH-E supports fuzzy search, word forms, phrase matching, Boolean search, wild card and attribute search. SWISH-E doesn’t support numeric searches, case sensitive searches, natural language queries or regular expression searches.
SWISH-E will support any language that uses single byte characters like English, German or Italian.
Webinator indexes files using the inverted index. The pages are then ranked depending on word ordering, frequency, position and word proximity. You can adjust the significance of each of these things in the options.
The crawler used by Webinator is compliant with the industry standard robot exclusion standard. It’s also possible to control the maximum depth for crawling. However, Webinator is not capable of indexing content on protected servers.
Webinator can detect duplicate content and does this by hashing the data and comparing it to entries already in the database. Webinator supports PDF, DOC, HTM, HTML, SWF, SHTML, JSP, PHTML and JHTML as standard.
Webinator supports a number of different searching features which include wild card search, regular expression, word forms, fuzzy search, Boolean search, and natural language search. It doesn’t support case sensitive searching or attribute searching.
Webinator supports English only.
The free version of Webinator which can be downloaded will only index around 10,000 pages of data.
The interface is fully customizable.
The indexer used by WebGlimpse is Glimpse, which uses a special two-level query method. This means that it can index files very quickly while keeping the index small. It also supports approximate matching which further increases the speed. Two-level query is a mix between an inverted index and a sequential search.
The first thing that happens is that the indexer breaks up the information into small pieces, referred to as blocks. The number of blocks is limited to 256, which means that only one byte is needed to store the address of a block.
Every word is indexed; however, unlike an inverted index, not every occurrence of the word is indexed. This makes the index much smaller than normal because repeated occurrences of a word within a block are combined into a single entry.
Searching also has two phases. First, Glimpse searches the index to find any block which might contain the information you are looking for. Each matching block is then searched sequentially. The index is very small, which makes searching very quick.
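The two phases can be illustrated with a toy version of the idea. This is a simplified sketch, not Glimpse’s actual implementation: files are assigned to blocks round-robin, the block count is a parameter, and the names are our own.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_block_index(files, num_blocks=256):
    """Level one: map each word to the set of block numbers containing it."""
    names = sorted(files)
    block_of = {name: i % num_blocks for i, name in enumerate(names)}
    index = {}
    for name in names:
        for word in set(tokenize(files[name])):
            index.setdefault(word, set()).add(block_of[name])
    return index, block_of


def search(word, files, index, block_of):
    """Level two: sequentially scan only the files in candidate blocks."""
    candidates = index.get(word, set())
    return sorted(
        name for name, block in block_of.items()
        if block in candidates and word in tokenize(files[name])
    )


# Hypothetical sample files; two blocks keep the example small.
files = {
    "a.txt": "inverted index",
    "b.txt": "sequential search",
    "c.txt": "index of blocks",
}
index, block_of = build_block_index(files, num_blocks=2)
print(index["inverted"])                             # {0}
print(search("inverted", files, index, block_of))    # ['a.txt']
```

The index only says that block 0 contains "inverted"; the sequential scan of that block is what narrows the answer down to a.txt and rules out c.txt.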
WebGlimpse is compliant with the industry standard robot exclusion standards. It’s also possible to control the depth of the crawler. As standard WebGlimpse can index txt and html files. It’s also possible to add support for PDF files and many other file types when using filters.
WebGlimpse supports lots of searching features including regular expression, case sensitive searching, fuzzy search, Boolean search and wild card search. WebGlimpse does not support Numeric searching, natural language query or phrase matching.
WebGlimpse is capable of using all languages whose characters are stored in a single byte. However, it’s not possible to change the language of the interface unless you purchase a commercial license.
LEXST-SEA uses a unique indexing system with a secret algorithm. This means that indexing is very quick and also accurate. Unfortunately the techniques that LEXST uses are so secret that it’s not possible to find out exactly what techniques it uses to perform searches for users.
The crawler that LEXST-SEA uses is fully compliant with the robot exclusion standard. This means that pages excluded in the robots.txt file or marked with nofollow tags in the code will not be included. The software is fully customizable and the depth of crawling can be adjusted as required.
LEXST-SEA can detect duplicate content and will ensure that it is not included in the search engine database. LEXST-SEA supports a wide range of file formats including HTML, TXT and PDF.
Unlike all of the other search engine packages that we have looked at LEXST-SEA is capable of indexing billions of pages. All of the others run on single machines, however LEXST-SEA can run on a network of nodes all of which share the workload.
LEXST-SEA supports a wide range of different searching features including wild card search, Boolean search, and phrase matching.
LEXST-SEA can support any language, including English.
The free version has a limit of 200,000 indexed pages; however, the commercial versions are unlimited and can be used to index and search billions of pages.
Up until this point we have only been interested in free software packages. However it is a good idea to look at paid services from the big players in the industry like Google.
Google GSA is like no other search application that we have looked at. It is unique because it comes installed on a standalone box. This box is plugged into your network and configured remotely. Google is very secretive about the technology that the system uses to index sites, which is why it can be difficult to find anything out. In fact the box itself is sealed, and a contract is signed to the effect that you will not modify the box or software in any way.
Google GSA supports a wide range of different file formats, over 220 in fact. These formats include HTML, PDF and Word doc files, all without the need for external parsers.
There are a number of different versions of Google GSA, all of which are capable of storing different numbers of pages.
Google GSA is a very capable search engine and search results are returned very quickly. Google GSA is capable of using Boolean search, attribute search, wild card search and fuzzy search. It’s basically the same as the popular Google search only for your own corporate network.
Because Google GSA is a commercial product it is available in almost all languages. It can also support multiple languages for cross-languages searches. Auto language detection makes it possible to search in any language automatically. The list of supported languages includes Greek, Chinese, English, Turkish, Japanese and Hebrew among many others.
Indexing with Zoom Search Engine can be done in a number of different modes, including spider mode and offline mode. Spider mode is designed to crawl pages on internet or intranet sites and is suitable for use with both dynamic and static content.
Offline indexing is useful if you want to use the search engine to search for files on your own computer.
Full-text indexing is supported for many popular formats, including HTML and PDF documents. It’s also possible to add support for image files and MP3s in the latest versions.
Incremental indexing makes it possible to gradually update the index rather than having to completely re-index everything on the site.
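One common way to implement incremental indexing is to remember a content hash for every page and re-index only the pages whose hash has changed. We do not know how Zoom Search Engine detects changes, so the sketch below is only an illustration; the function name and sample pages are hypothetical.

```python
import hashlib


def incremental_update(index_state, pages):
    """Re-index only new or changed pages and return their URLs.

    index_state maps url -> content hash from the previous run and is
    updated in place.
    """
    changed = []
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if index_state.get(url) != digest:
            index_state[url] = digest
            changed.append(url)
    # Forget pages that have been removed from the site.
    for url in set(index_state) - set(pages):
        del index_state[url]
    return changed


state = {}
first = incremental_update(state, {"/": "home", "/a": "about"})
second = incremental_update(state, {"/": "home", "/a": "about us"})
print(first)   # ['/', '/a']: everything is new on the first run
print(second)  # ['/a']: only the changed page is re-indexed
```

On a large site where only a few pages change between runs, this avoids repeating almost all of the indexing work.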
Search keywords are ranked and weighted based on a number of factors including where the keywords are mentioned, keyword density, and word proximity among other things.
Zoom Search Engine is a very capable search engine which can search many different file formats. It also supports wildcard searching, word stemming and logic arguments.
Zoom search engine supports a number of different languages and character sets although not all of these languages are available in the free edition.
Depending on the package that you choose you will be able to index around one million pages using the software. There are a number of different versions including free, professional and enterprise.
It’s possible to customize the search results and interface as much as you like. The whole thing is built using templates which makes things very easy.
Depending on the package you choose, Site Search Pro is capable of indexing different file formats including XLS, Doc, PDF and HTML files. It also allows the administrator to control the crawler and limit how deep it can go.
Site Search Pro supports a wide range of different searching features including attributes search, Boolean logic search and numerical searches.
Site search Pro supports all of the popular languages in the world including English and other languages based on the same character set.
We have spent some time comparing the various search engine packages available. Every package is different and suitable for different applications. Depending on what you want a search engine for, you might be better off with one package or another.
It’s not possible to say which one is the best because each one is unique in its own way. Some of the search engine packages are much simpler than others but may be better suited to certain applications. Packages like LEXST-SEA are designed to index huge amounts of data, whereas simpler search engine software like Perlfect is designed for individual websites.
You need to choose a software package which matches as many of your requirements as possible.