
Using the Search Engine

introduction

The search engine currently used on our web servers is the Search'97 Information Server from Verity.

changes from the previous release NEW

This part contains an overview of the features that changed in the latest installation of the search engine, in September 2000. All changes are described in detail in the text.

collections

The collections are sets of index files that are maintained on a daily basis and that are available for searching. There are a few limitations to these collections relating to the choice of language and the choice of meta information to be indexed (see further). I have set up the collections to coincide with access rights and the physical locations of the data, while also trying to create useful logical entities.

Here is a list of collections that can be searched via the Search'97 Information Server search interface:

The collections listed above are grouped in meta-collections as follows:

The meta-collections EUROPAplus and EUROPAplusonly are available for searching via the IntraComm web server only. The meta-collection fullEUROPA is only available through the EUROPA and EUR-OP web servers.

Remark:

search form

<FORM ACTION="/search/s97.vts" METHOD="POST">
For public collections (the ones listed before) the URL "/search/s97.vts" should be used. "/search/s97.vts" cannot be used for queries on collections that contain restricted data. In those cases "/search97cgi/s97_cgi/..." should be used. Contact the DI-DC-D web team when you are planning a search on restricted data.
<INPUT TYPE="type" NAME="Collection" VALUE="collection">
Specifies the collection to be searched. See the previous section for a list of available collections. Repeat this input tag for every collection you want to search. You might want to use a series of checkbox input tags from which the user can select collections; this lets the user define the scope of the search.
<INPUT TYPE="type" NAME="Action" VALUE="FilterSearch">
This input tag indicates that the query will be filtered before being passed on to the search engine; this should be set in all but the simplest cases. The other possible value for this parameter is "Search". Use the default value "Search" only when the queries will not be filtered; this will give a small performance gain. "type" will normally be set to "Hidden".
<INPUT TYPE="type" NAME="Filter" VALUE="filter_template">
Specifies the filter template to be used with the "FilterSearch" action and "Simple" query mode only. Use "EUROPA_filter.hts" or "EUROPA_filt_topics.hts" to take advantage of the topics. "type" will normally be set to "Hidden".
<INPUT TYPE="type" NAME="ResultTemplate" VALUE="result_template">
Choose the template file that will be used for the results list:
On IntraComm use: EC_res-ln.hts, where "ln" equals "en" (English) or "fr" (French).
On EUROPA use: EUROPA_res-ln.hts, where "ln" equals "en" (English) or "fr" (French); translations are being worked on.
Optional but recommended input tag, because it implicitly sets the language used in the results list.
<INPUT TYPE="type" NAME="QueryMode" VALUE="query_mode">
This parameter sets the query mode. The default query mode "Simple" specifies that the Verity Query Language can be used. This query mode used together with the "FilterSearch" gives the best search results. The other possible values are "Boolean", "FreeText", and "Internet". Optional input tag.
<INPUT TYPE="type" NAME="SearchPage" VALUE="return_URL">
We created this variable to pass a return URL to the search engine in order for it to be displayed in the result lists. This inelegant solution was necessary because somehow somewhere the "HTTP-Referer" variable gets lost. "type" should normally be set to "Hidden". Recommended.
<TEXTAREA NAME="QueryText" ROWS=2 COLS=50></TEXTAREA>
This is what it is all about: entering a query.
<INPUT TYPE="text" SIZE=30 NAME="InKeywords" VALUE="">
Search for documents with a matching Keywords field. The search is ultimately combined with the rest of the query by inserting it into the QueryText. This optional input tag requires a "FilterSearch".
<INPUT TYPE="text" SIZE=30 NAME="InTitle" VALUE="">
Search for documents with a matching Title field. The search is ultimately combined with the rest of the query by inserting it into the QueryText. This optional input tag requires a "FilterSearch".
<INPUT TYPE="text" SIZE=9 NAME="StartDate" VALUE="">
Tells the search engine to select only documents with a "last modified" time stamp on or after the specified date. This optional input tag requires a "FilterSearch". Although a multitude of date formats are accepted, we recommend using "dd-mm-yyyy" (day, month, year).
<INPUT TYPE="type" NAME="SearchIn" VALUE="URL_path">
Limit the scope of the search to one (or more) URL paths. When more than one URL path is to be selected, use multiple input tags with NAME="SearchIn". Please note that the use of this parameter, while reducing noise significantly, will slow down the queries noticeably. If possible, use topics (TopicInclude) instead. See below for information on the topic tags and how to use them.
Optional input tag requiring a "FilterSearch".
<INPUT TYPE="type" NAME="ExcludeDir" VALUE="URL_path">
Limit the scope of the search by excluding one (or more) URL paths. When more than one URL path is to be excluded, use multiple input tags with NAME="ExcludeDir". Please note that the use of this parameter, while reducing noise significantly, will slow down the queries noticeably. If possible, use topics (TopicInclude) instead. See below for information on the topic tags and how to use them.
Optional input tag requiring a "FilterSearch".
<INPUT TYPE="type" NAME="HTMLonly" VALUE="yes|no">
Limit the scope of the search to HTML pages. Typically used with radio buttons and a "yes or no" choice. Optional input tag requiring a "FilterSearch".
<INPUT TYPE="type" NAME="SourceQueryText" VALUE="source_query_text">
Can be used to limit the scope of the search. The specified value will be "AND"-ed with the contents of the QueryText parameter.
The preceding parameters "StartDate", "SearchIn", "ExcludeDir" and "HTMLonly" are inserted into the SourceQueryText by the filter template. For example: setting "StartDate" to "1999-06-10" will result in "Modified >= 1999-06-10" being inserted in the SourceQueryText.
This is an optional input tag that under normal circumstances will be hidden.
<INPUT TYPE="type" NAME="ResultCount" VALUE="result_count">
Specifies the number of entries to be shown per result list page. The default value is "10". A larger number will generate larger result list pages, and a longer response time for the visitor. Optional input tag.
<INPUT TYPE="type" NAME="TopicMime" VALUE="topic_mime"> NEW
Limit the search to one (or more) document types, according to the topic_mime value. Possible values are
html001 HTML documents
pdf001 PDF documents
word001 Microsoft Word documents
You can also use the negation of these values. "<NOT> html001" for example will retrieve all documents that are not HTML.
Optional input tag requiring a "FilterSearch" and the "EUROPA_filt_topics.hts" template. type should normally be set to "hidden" or "select".
<INPUT TYPE="type" NAME="TopicLang" VALUE="language"> NEW
Limit the search to one (or more) languages. Possible values are
da001 Danish
de001 German
el001 Greek
en001 English
es001 Spanish
fi001 Finnish
fr001 French
it001 Italian
nl001 Dutch
pt001 Portuguese
sv001 Swedish
The documents are classified in the following way:
1. Information given in the language metatag (applies only to HTML documents which do have this tag).
2. Language information given in the document URL like "catalog/LL/filename" where LL is the language.
3. Language information given in the document file name like "catalog/filename_LL.ext" where LL is the language.
Please keep in mind that this information is not always dependable because
1. Not all HTML documents have the language metatag populated.
2. The search engine interprets non-alphanumeric characters like slash "/" or underscore "_" as blank space and might therefore index some documents incorrectly.
3. Documents that do not follow the above naming scheme are not classified according to language and cannot be searched with the use of this tag.
In spite of these limitations we believe that the language classification does provide some help. It is up to the publishers to decide about its usefulness.
The negation of these values is also possible. "<NOT> fr001" for example will retrieve the documents that are not in French, i.e. the documents that have not been classified as French according to the mechanism explained above.
Optional input tag requiring a "FilterSearch" and the "EUROPA_filt_topics.hts" template. type should normally be set to "hidden" or "select".
<INPUT TYPE="type" NAME="TopicInclude" VALUE="include"> NEW
Limit the search to documents hosted in a particular directory on the server. Possible values and their corresponding directories on the EUROPA server are
policy001 */europa.eu.int/pol/*
scadplus001 http://europa.eu.int/scadplus
eurlex001 http://europa.eu.int/eur-lex
generalreport001 */europa.eu.int/abc/doc/off/rg/*
monthlybulletin001 */europa.eu.int/abc/doc/off/bull/*
The use of these topics results in much faster retrievals compared to the same searches performed with the SearchIn input parameter.
Negation of the values is also possible to retrieve documents that do not belong to one of the above sets.
Optional input tag requiring a "FilterSearch" and the "EUROPA_filt_topics.hts" template. type should normally be set to "hidden" or "select".
Results sorted by
<SELECT NAME="SortField">
<OPTION SELECTED VALUE="Score">Score
<OPTION VALUE="Modified">Date
</SELECT>
in
<SELECT NAME="SortOrder">
<OPTION SELECTED VALUE="desc">Descending
<OPTION VALUE="asc">Ascending
</SELECT> order
Allow the visitors to have results sorted by score or date and in ascending or descending order. Optional.
<INPUT TYPE="Submit" VALUE="Search">
Go for it!
<INPUT TYPE="Reset" VALUE="Reset">
Reset the values of the form parameters to their default values. Optional.
</FORM>
That's it!
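
A script can send the same parameters that the form sends. The following Python sketch only builds the POST body for "/search/s97.vts"; it does not contact the server. It assumes the public EUROPA templates described above. The function name, its arguments and the collection names "collection_a" and "collection_b" are invented for illustration and are not names used on our servers.

```python
from urllib.parse import urlencode


def build_search_body(query_text, collections, lang="en",
                      start_date=None, topic_lang=None, result_count=10):
    """Return the URL-encoded POST body for a filtered search.

    The parameter names come from the form description above; the
    function itself and its arguments are illustrative assumptions.
    """
    # A list of pairs (not a dict) keeps repeated parameters such as
    # Collection, which may appear once per collection to be searched.
    params = [("Collection", name) for name in collections]
    params += [
        ("Action", "FilterSearch"),
        ("Filter", "EUROPA_filt_topics.hts"),
        ("ResultTemplate", f"EUROPA_res-{lang}.hts"),
        ("QueryMode", "Simple"),
        ("QueryText", query_text),
        ("ResultCount", str(result_count)),
    ]
    # Optional parameters are only sent when a value was supplied.
    if start_date:
        params.append(("StartDate", start_date))
    if topic_lang:
        params.append(("TopicLang", topic_lang))
    return urlencode(params)


# Hypothetical collection names; replace them with real ones.
body = build_search_body("competition policy",
                         ["collection_a", "collection_b"],
                         start_date="01-05-1999", topic_lang="en001")
print(body)
```

Building the body from a list of pairs mirrors the way the form repeats an input tag to search several collections at once.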

information available for searching

The indexer programs index most file formats that are available on our web servers. The complete contents are indexed. On top of that, a series of fields that are present either inside the documents or attached to the documents are indexed separately as well.
Here is an overview of these fields, with for each of them the corresponding HTML tag, PDF field, and MS-Word field:
field name       HTML tags                          standard PDF field  standard Word field
Title            Title tag                          FTS_Title           Title
Keywords         meta tag Keywords                  FTS_Keywords        Keywords
Modified         file time stamp                    file time stamp     file time stamp
Description      meta tags Description or Abstract  FTS_Subject (1)     Subject (1)
Author           meta tag Author                    FTS_Author          Author
Language         meta tag Language                  n/a                 n/a
Reference        meta tag Reference                 n/a                 n/a
Creator          meta tag Creator                   n/a                 n/a
Publisher        meta tag Publisher                 n/a                 n/a
Type             meta tag Type                      n/a                 n/a
DateAlarm        meta tag DateAlarm                 n/a                 n/a
DatePublication  meta tag DatePublication           n/a                 n/a
Classification   meta tag Classification            n/a                 n/a
(1) these fields are both being indexed as Subject fields

how to search on the contents of these fields

A query for "field_name <contains> string" can be used to find documents with a corresponding meta tag. Using <starts> or <matches> instead of <contains> where possible will give results more quickly because <contains> makes the search scan complete fields before deciding whether it found a match or not.
For numeric fields relational operators can be used. You can find more details in the help files on IntraComm. Here are a couple of examples to get you going:
Creator <contains> "COMM/DGX/D.2/EUROPAplus"
will return all documents with a "Creator" meta tag containing the specified string
DatePublication >= 1999-05-01
will return all documents with a DatePublication meta tag whose value is greater than or equal to the specified value
(DatePublication < 1999-05-01) <AND> (DatePublication > 1900-01-01)
will return all documents with a DatePublication meta tag whose value is smaller than the specified value; if you specified only the first part of the query, all documents that do not have the DatePublication meta tag set would be returned as well (the "null" value is the smallest possible value)
Language <matches> fr
will return all documents with the "Language" meta tag set to "fr"
Language <contains> fr
will return all documents that have the "Language" meta tag containing "fr"
Please consider implementing language searches with the TopicLang parameter. See the "search form" section of this document for information about it.
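
When queries are generated by a script, the field queries shown above can be assembled with a few small helpers. The following Python sketch only produces query strings in the forms used in the examples; the function names and the default lower bound "1900-01-01" (taken from the example above) are illustrative assumptions, not part of the search engine.

```python
def field_contains(field, text):
    """Query for documents whose field contains the given string."""
    return f'{field} <contains> "{text}"'


def date_from(field, date):
    """Query for documents whose date field is on or after the date."""
    return f"{field} >= {date}"


def date_before(field, date, floor="1900-01-01"):
    """Query for documents whose date field is before the date.

    The second clause excludes documents where the field is not set,
    because the "null" value compares as the smallest possible value.
    """
    return f"({field} < {date}) <AND> ({field} > {floor})"


print(field_contains("Creator", "COMM/DGX/D.2/EUROPAplus"))
print(date_from("DatePublication", "1999-05-01"))
print(date_before("DatePublication", "1999-05-01"))
```

Keeping the lower bound inside date_before means a caller cannot forget the guard against documents without the meta tag.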

special fields

The "Title" and "Keywords" fields are attached to the body of the document as hidden zones when the document is being indexed. This has two consequences.
The contents of these fields will be included in the body of the documents while being indexed. This way we make sure that these contents are found when using a basic search.
The Verity Query Language "proximity" operator "<IN>" can be used to search within one of these zones. For example: specifying "competition <IN> Keywords" as a query will return all documents that have the word "competition" specified in the "Keywords" field.
See also the locally created input tags "InTitle" and "InKeywords".

zone searching

Zones are specific regions of a document to which searches can be limited. In our environment zone searches can be used on HTML documents. For text regions of a document, searching a zone is much faster than searching a field. For numeric fields like dates, you can perform relative comparisons using the relational operators. This is not possible on zones.
We already mentioned two zones (title and keywords) before. You can find a list of standard zones that are extracted from HTML documents in the detailed help on IntraComm (chapter on attribute searching).

Note: adding an extra field to be indexed, or changing fields into zones necessitates a complete re-indexation of all web pages. With the amount of data and the number of collections involved this turns out to be a major operation that requires some planning.

preparing documents for better search results

While creating web pages and other documents you should initialize meta information. The document title and the "Keywords" and "Description" meta tags are especially useful. The indexer can retrieve this information from HTML pages, PDF documents and Word files, to name the most obvious file formats.

The results lists can show more useful information, provided that information is available.
The title of a document can only be shown when it is available in the document. Try to use meaningful titles: 2000 documents sharing the same title cannot really be said to have meaningful titles. The description (a.k.a. "abstract") of a document will be shown in the results lists whenever it is provided by the author. Otherwise the indexer will have a go at it and produce a summary that will be shown in the results list. The resulting summaries are not always convincing, to say the least.

The contents of meta tags can be searched upon. For visitors this can reduce the noise; for authors this might be a means to push certain documents to the foreground.

Recommendation: for HTML pages that include a search form, add "Search Form" in the Keywords meta tag. This will allow us to find all search forms in case changes are needed.
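
Authors may want to check their pages before publishing them. The following Python sketch reports which of the recommended items (title, "Keywords" and "Description" meta tags) are missing or empty in an HTML page. It uses only the standard html.parser module; the class name, the function name and the sample page are invented for illustration, and the check is not part of the indexer.

```python
from html.parser import HTMLParser


class MetaAudit(HTMLParser):
    """Collect the title and the named meta tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def missing_meta(html, required=("keywords", "description")):
    """Return the names of recommended items that are absent or empty."""
    audit = MetaAudit()
    audit.feed(html)
    missing = [] if audit.title.strip() else ["title"]
    missing += [name for name in required
                if not audit.meta.get(name, "").strip()]
    return missing


# Sample page with a title and keywords but no description.
page = """<HTML><HEAD><TITLE>Competition policy overview</TITLE>
<META NAME="Keywords" CONTENT="competition, state aid">
</HEAD><BODY>...</BODY></HTML>"""
print(missing_meta(page))
```

The same function could be run over a whole directory of pages to list the documents that would end up with an automatically generated summary.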

performance issues

limiting scope of search

The scope of a search can be limited in two ways: through the choice of a collection to search or through added query information (SearchIn, ExcludeDir, topics etc.). Limiting the scope through added query information can reduce "noise" significantly but might slow down the response from the search engine. Users are encouraged to use topics when possible since they improve response time significantly while SearchIn and ExcludeDir add delay. Limiting the scope through the selection of one particular collection also gives significantly faster response times.
One might be tempted to create more, smaller collections in order to have quicker searches with reduced noise. The problem with this approach is that it multiplies the number of index files that have to be opened for every search. For example: having one collection per language already creates 11 collections; multiply this number by the sub-domains on EUROPA requiring a separate collection because of subject, physical location, size, etc., and we quickly end up with dozens of collections. The search engine needs to open all these collections in order to allow for a global search throughout all the indexed information. The larger the number of files to be searched, the slower the queries will become. We might even run into system limits.

zone searching is faster than searching through fields

Where possible use zone searching. For example: the query "competition <in> keywords" is noticeably faster than the query "keywords <contains> competition". Even field searches using the <starts> or <matches> operators are much slower than zone searches.
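
A script that generates queries could apply this advice automatically. The following Python sketch rewrites a single-word "<contains>" field query into the corresponding zone query, but only for the two zones named in this document (Title and Keywords). The function name and the restriction to single words are assumptions made for illustration. The two query forms may not match exactly the same documents in every case, so treat the rewrite as a starting point to be checked against real results.

```python
import re

# Zones named in this document; other zones are not assumed here.
ZONES = {"title": "Title", "keywords": "Keywords"}

# Matches queries of the form: field <contains> word
_FIELD_QUERY = re.compile(r"\s*(\w+)\s*<contains>\s*(\w+)\s*", re.IGNORECASE)


def prefer_zone(query):
    """Rewrite 'field <contains> word' as 'word <IN> zone' when possible.

    Any query that does not have this simple form, or that names a
    field without a known zone, is returned unchanged.
    """
    match = _FIELD_QUERY.fullmatch(query)
    if match is None:
        return query
    field, word = match.groups()
    zone = ZONES.get(field.lower())
    if zone is None:
        return query
    return f"{word} <IN> {zone}"


print(prefer_zone("keywords <contains> competition"))
print(prefer_zone("Creator <contains> EUROPAplus"))
```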

limitations

dynamically generated data

The indexer procedures have been set up (by default) not to index output generated by CGI scripts. HTML data generated through CGI scripts is by definition of a dynamic nature: two calls to the same CGI script will probably not return the same results. It does not make sense to index data that is not going to be the same after being indexed.

Another reason not to index output generated through CGI scripts is that this data normally does not have a "last modified" time stamp assigned. Server-parsed HTML pages with server-side includes have the same "problem". If we were to index dynamically generated data then we would need to re-index everything each time the web spider runs. This means requesting the web server to execute each instance of every indexed call to a CGI script. Needless to say, this would cause a heavy performance hit, with uncertain results.

stemming

Stemming has been disabled.

stop words

Another problem related to the choice of the working language is stop words. Stop words are different in each language: "the" is a stop word in English but not in French. We decided not to use stop words in this release. All words are searchable.

meta information

Meta information, such as "title" and meta tags, is indexed and can be searched upon. The restriction is that the "fields" or "zones" that are to be indexed have to be set up at the time the collection is initialized. We chose to index the "compulsory" meta tags as described in the IPG for EUROPA.
Please note, nothing is provided in the indexer to give meta information a higher "weight" than other information. We are looking for a way to add weight to meta data while launching queries. Come back here later for more news.

one search, one collection format

A single search action can only be launched on collections with an identical setup. This means for example that collections with different language setups could only be searched through separate search actions. The same goes for collections with different sets of fields (meta tags, etc.) in the indexes.
Because we allow searches spanning all collections we need to have all collections set up identically.

clustering is not possible

Because all languages are living in the same collections we cannot use the "clustering" facility offered by the Search'97 Information Server. Clustering is used to automatically find the groups of similar documents returned by the search engine. In our case (non-English) stop words are also being used to build clusters around them, which creates a roaring noise.
 

Data Centre