Technical Guidelines: Using the Search Engine

introduction

The search engine currently used on our web servers is the Search'97 Information Server from Verity.

changes from the previous release `NEW`

This section gives an overview of the features that changed in the latest installation of the search engine, in September 2000. Each change is described in detail further on in this document.

New input tags are supported for a query form: TopicMime, TopicInclude, TopicLang. These tags use the Information Server topics feature to make searching within certain document subsets fast and efficient. Topics are, in effect, precompiled searches, which increases speed significantly. Where possible, use them instead of the SearchIn and ExcludeDir parameters. Topics can be used only in a filtered search with the new template EUROPA_filt_topics.hts. For instructions on how to use topics, see the "search form" section below. Publishers' comments, remarks and suggestions on the topics are very welcome and will help us improve the service. You can also contact the Data Center if you need more information or help.
Stemming has been disabled. It was useful only for English and could create noise when searching in other languages.
All indexes are case insensitive; case-sensitive searching is no longer possible, and the <case> operator no longer has any effect.
No stop words are used at all.
The classification tag is now available for searching.

collections

Collections are sets of index files that are maintained daily and are available for searching. They are subject to a few limitations relating to the choice of language and the choice of meta information to be indexed (see the "limitations" section). We have set up the collections to coincide with access rights and the physical locations of the data, while also trying to create useful logical entities. Here is a list of collections that can be searched via the Search'97 Information Server search interface:

EUROPApluscore: IntraComm pages.
EUROPAcore: EUROPA pages, without EuroStat Datashop, EUR-LEX and SCADplus pages.
EURLEXfiles: all EUR-LEX pages on EUROPA; indexed separately from the other EUROPA pages because of the size and the distinct structure.
SCADplus: all SCADplus pages on EUROPA; indexed separately from the other EUROPA pages because of the size and the distinct structure.
Citizens: the citizens.eu.int site still resides on a separate platform.
OPOCE: eur-op.eu.int pages.
CURIA: www.curia.eu.int pages.

The collections listed above are grouped in meta-collections as follows:

EUROPAplusonly: indexes for all the IntraComm pages; in principle this excludes pages that are also available through EUROPA;
fullEUROPA: all pages available through the EUROPA, EUR-OP, CURIA, and "Citizens First" servers;
EUROPAplus: all pages served via the IntraComm, EUROPA, EUR-OP, CURIA, and "Citizens First" servers.

The meta-collections EUROPAplus and EUROPAplusonly are available for searching via the IntraComm web server only. The meta-collection fullEUROPA is only available through the EUROPA and EUR-OP web servers.

Remark:

This list only shows collections that contain publicly available data. A few other collections containing restricted data do exist.

search form

<FORM ACTION="/search/s97.vts" METHOD="POST">

For public collections (the ones listed before) the URL "/search/s97.vts" should be used. "/search/s97.vts" cannot be used for queries on collections that contain restricted data. In those cases "/search97cgi/s97_cgi/..." should be used. Contact the DI-DC-D web team when you are planning a search on restricted data.

<INPUT TYPE="type" NAME="Collection" VALUE="collection">

Provides the collection to be searched. See previous paragraph for a list of available collections. You can repeat this input tag for all the collections you want to search on. You might want to use a list of checkbox input tags to list a series of collections from which the user can select. This can be used to define the scope of the search.
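
As a rough sketch of the checkbox approach described above, the fragment below offers three of the collections listed earlier. The collection names come from the "collections" section; the visible labels and the choice of which box is pre-checked are only assumptions for illustration.

```html
<!-- Sketch only: lets the visitor choose which collections to search. -->
<!-- Collection names are from the "collections" section; labels and the CHECKED default are illustrative. -->
<INPUT TYPE="checkbox" NAME="Collection" VALUE="EUROPAcore" CHECKED> EUROPA pages<BR>
<INPUT TYPE="checkbox" NAME="Collection" VALUE="EURLEXfiles"> EUR-LEX pages<BR>
<INPUT TYPE="checkbox" NAME="Collection" VALUE="SCADplus"> SCADplus pages<BR>
```

Each checked box sends its own "Collection" value, which corresponds to repeating the input tag once per collection.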

<INPUT TYPE="type" NAME="Action" VALUE="FilterSearch">

This input tag indicates that the query will be filtered before being passed on to the search engine; this should be set in all but the simplest cases. The other possible value for this parameter is "Search". Use the default value "Search" only when the queries will not be filtered; this will give a small performance gain. "type" will normally be set to "Hidden".

<INPUT TYPE="type" NAME="Filter" VALUE="filter_template">

Specifies the filter template to be used with the "FilterSearch" action and "Simple" query mode only. Use "EUROPA_filter.hts" or "EUROPA_filt_topics.hts" to take advantage of the topics. "type" will normally be set to "Hidden".
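
For example, a form that wants to use the topic tags would probably carry the following two hidden fields together. This is a minimal sketch using only the values named above; use "EUROPA_filter.hts" instead if the form does not use topics.

```html
<!-- Sketch: request a filtered search and select the topics-aware filter template. -->
<INPUT TYPE="hidden" NAME="Action" VALUE="FilterSearch">
<INPUT TYPE="hidden" NAME="Filter" VALUE="EUROPA_filt_topics.hts">
```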

<INPUT TYPE="type" NAME="ResultTemplate" VALUE="result_template">

Choose the template file that will be used for the results list:

On IntraComm use: EC_res-ln.hts, where "ln" equals "en" (English) or "fr" (French).

On EUROPA use: EUROPA_res-ln.hts, where "ln" equals "en" (English) or "fr" (French); further translations are being worked on.

Optional but recommended input tag, because it implicitly sets the language used in the results list.
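
A minimal sketch, assuming an English-language search page on EUROPA (replace "en" with "fr" for a French page, or use the EC_res-ln.hts name on IntraComm):

```html
<!-- Sketch: English results list on EUROPA; the language code "en" is the assumption here. -->
<INPUT TYPE="hidden" NAME="ResultTemplate" VALUE="EUROPA_res-en.hts">
```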

<INPUT TYPE="type" NAME="QueryMode" VALUE="query_mode">

This parameter sets the query mode. The default query mode "Simple" specifies that the Verity Query Language can be used. This query mode used together with the "FilterSearch" gives the best search results. The other possible values are "Boolean", "FreeText", and "Internet". Optional input tag.

<INPUT TYPE="type" NAME="SearchPage" VALUE="return_URL">

We created this variable to pass a return URL to the search engine in order for it to be displayed in the result lists. This inelegant solution was necessary because somehow somewhere the "HTTP-Referer" variable gets lost. "type" should normally be set to "Hidden". Recommended.

<TEXTAREA NAME="QueryText" ROWS=2 COLS=50></TEXTAREA>

This is what it is all about: entering a query.

<INPUT TYPE="text" SIZE=30 NAME="InKeywords" VALUE="">

Search for documents with a matching Keywords field. The search will eventually be combined with the rest of the query by inserting it in the QueryText. This optional input tag requires a "FilterSearch".

<INPUT TYPE="text" SIZE=30 NAME="InTitle" VALUE="">

Search for documents with a matching Title field. The search will eventually be combined with the rest of the query by inserting it in the QueryText. This optional input tag requires a "FilterSearch".

<INPUT TYPE="text" SIZE=9 NAME="StartDate" VALUE="">

Tells the search engine to select only documents whose "last modified" time stamp is on or after the specified date. This optional input tag requires a "FilterSearch". Although a multitude of date formats are accepted, we recommend using "dd-mm-yyyy" (day, month, year).
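
The fragment below sketches how a date restriction might be offered to the visitor. The prompt text and the pre-filled date are assumptions chosen for illustration; the hidden fields are there because this tag requires a "FilterSearch".

```html
<!-- Sketch: visitor-editable start date in the recommended dd-mm-yyyy format. -->
<!-- The prompt text and the pre-filled value are illustrative only. -->
<INPUT TYPE="hidden" NAME="Action" VALUE="FilterSearch">
<INPUT TYPE="hidden" NAME="Filter" VALUE="EUROPA_filter.hts">
Modified since (dd-mm-yyyy):
<INPUT TYPE="text" SIZE=10 NAME="StartDate" VALUE="01-09-2000">
```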

<INPUT TYPE="type" NAME="SearchIn" VALUE="URL_path">

Limits the scope of the search to one or more URL paths. To select several URL paths, use multiple input tags with NAME="SearchIn". Please note that this parameter, while reducing noise significantly, slows down queries noticeably. If possible, use topics (TopicInclude) instead. See below for information on the topic tags and how to use them.

Optional input tag requiring a "FilterSearch".
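
As an illustration of repeating the tag, the sketch below restricts a search to two directories. Both paths are hypothetical, and the exact form a URL path must take is not specified here, so check it against your own site before use.

```html
<!-- Sketch: limit the search to two URL paths by repeating the SearchIn tag. -->
<!-- Both paths are hypothetical examples, not real directories. -->
<INPUT TYPE="hidden" NAME="Action" VALUE="FilterSearch">
<INPUT TYPE="hidden" NAME="Filter" VALUE="EUROPA_filter.hts">
<INPUT TYPE="hidden" NAME="SearchIn" VALUE="/comm/example_one/">
<INPUT TYPE="hidden" NAME="SearchIn" VALUE="/comm/example_two/">
```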

<INPUT TYPE="type" NAME="ExcludeDir" VALUE="URL_path">

Limits the scope of the search by excluding one or more URL paths. To exclude several URL paths, use multiple input tags with NAME="ExcludeDir". Please note that this parameter, while reducing noise significantly, slows down queries noticeably. If possible, use topics (TopicInclude) instead. See below for information on the topic tags and how to use them.

Optional input tag requiring a "FilterSearch".

<INPUT TYPE="type" NAME="HTMLonly" VALUE="yes|no">

Limit the scope of the search to HTML pages. Typically used with radio buttons and a "yes or no" choice. Optional input tag requiring a "FilterSearch".
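
A possible radio-button layout for this choice is sketched below. The prompt text and the default of "no" are assumptions.

```html
<!-- Sketch: yes/no choice for restricting results to HTML pages. -->
<!-- The prompt wording and the CHECKED default are illustrative. -->
HTML pages only?
<INPUT TYPE="radio" NAME="HTMLonly" VALUE="yes"> yes
<INPUT TYPE="radio" NAME="HTMLonly" VALUE="no" CHECKED> no
```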

<INPUT TYPE="type" NAME="SourceQueryText" VALUE="source_query_text">

Can be used to limit the scope of the search. The specified value will be "AND"-ed with the contents of the QueryText parameter.

The preceding parameters "StartDate", "SearchIn", "ExcludeDir" and "HTMLonly" are inserted into the SourceQueryText by the filter template. For example: setting "StartDate" to "1999-06-10" will result in "Modified >= 1999-06-10" being inserted in the SourceQueryText.

This is an optional input tag that under normal circumstances will be hidden.
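
A sketch of a hidden SourceQueryText follows, using the "Language <matches> fr" expression from the field search examples later in this document. The angle brackets of the operator are written as character entities so that they remain valid inside an attribute value; the browser sends them as plain "<" and ">".

```html
<!-- Sketch: AND a fixed restriction onto whatever the visitor types in QueryText. -->
<!-- The expression is one of the field search examples from this document. -->
<INPUT TYPE="hidden" NAME="SourceQueryText" VALUE="Language &lt;matches&gt; fr">
```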

<INPUT TYPE="type" NAME="ResultCount" VALUE="result_count">

Specifies the number of entries to be shown per result list page. The default value is "10". A larger number will generate larger result list pages, and a longer response time for the visitor. Optional input tag.

<INPUT TYPE="type" NAME="TopicMime" VALUE="topic_mime"> NEW

Limit the search to one (or more) document types, according to the topic_mime value. Possible values are

html001	HTML documents
pdf001	PDF documents
word001	Microsoft Word documents

You can also use the negation of these values. "<NOT> html001" for example will retrieve all documents that are not HTML.

Optional input tag requiring a "FilterSearch" and the "EUROPA_filt_topics.hts" template. type should normally be set to "hidden" or "select".
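
One way to offer the document type as a choice is a selection list, sketched below with the three values from the table above. No "any type" entry is included, because how an empty value is handled is not described here.

```html
<!-- Sketch: let the visitor pick a document type; values are from the table above. -->
<INPUT TYPE="hidden" NAME="Action" VALUE="FilterSearch">
<INPUT TYPE="hidden" NAME="Filter" VALUE="EUROPA_filt_topics.hts">
<SELECT NAME="TopicMime">
<OPTION SELECTED VALUE="html001">HTML documents
<OPTION VALUE="pdf001">PDF documents
<OPTION VALUE="word001">Microsoft Word documents
</SELECT>
```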

<INPUT TYPE="type" NAME="TopicLang" VALUE="language"> NEW

Limit the search to one (or more) languages. Possible values are

da001	Danish
de001	German
el001	Greek
en001	English
es001	Spanish
fi001	Finnish
fr001	French
it001	Italian
nl001	Dutch
pt001	Portuguese
sv001	Swedish

The documents are classified in the following way:

1. Information given in the language metatag (applies only to HTML documents which do have this tag).

2. Language information given in the document URL like "catalog/LL/filename" where LL is the language.

3. Language information given in the document file name like "catalog/filename_LL.ext" where LL is the language.

Please keep in mind that this information is not always dependable because

1. Not all HTML documents have the language metatag populated.

2. The search engine interprets non-alphanumeric characters like slash "/" or underscore "_" as blank space and might therefore classify some documents incorrectly.

3. Documents that do not follow the above naming scheme are not classified according to language and cannot be searched with the use of this tag.

In spite of these limitations we believe that the language classification does provide some help. It is up to the publishers to decide about its usefulness.

The negation of these values is also possible. "<NOT> fr001" for example will retrieve the documents that are not in French, i.e. the documents that have not been classified as French according to the mechanism explained above.

Optional input tag requiring a "FilterSearch" and the "EUROPA_filt_topics.hts" template. type should normally be set to "hidden" or "select".
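
For a search page written in French, the language restriction could be fixed with a hidden field, as in this sketch. The choice of French is only an example.

```html
<!-- Sketch: restrict results to documents classified as French. -->
<INPUT TYPE="hidden" NAME="Action" VALUE="FilterSearch">
<INPUT TYPE="hidden" NAME="Filter" VALUE="EUROPA_filt_topics.hts">
<INPUT TYPE="hidden" NAME="TopicLang" VALUE="fr001">
```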

<INPUT TYPE="type" NAME="TopicInclude" VALUE="include"> NEW

Limit the search to documents hosted in a particular directory on the server. Possible values and their corresponding directories on the EUROPA server are

policy001	/europa.eu.int/pol/
scadplus001	http://europa.eu.int/scadplus
eurlex001	http://europa.eu.int/eur-lex
generalreport001	/europa.eu.int/abc/doc/off/rg/
monthlybulletin001	/europa.eu.int/abc/doc/off/bull/

The use of these topics results in much faster retrievals compared with the same searches performed with the SearchIn input parameter.

Negation of the values is also possible to retrieve documents that do not belong to one of the above sets.

Optional input tag requiring a "FilterSearch" and the "EUROPA_filt_topics.hts" template. type should normally be set to "hidden" or "select".
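
As a sketch, a form dedicated to the SCADplus pages might replace a SearchIn restriction with the corresponding topic. The choice of scadplus001 is only an example taken from the table above.

```html
<!-- Sketch: restrict the search to the SCADplus topic instead of using SearchIn. -->
<INPUT TYPE="hidden" NAME="Action" VALUE="FilterSearch">
<INPUT TYPE="hidden" NAME="Filter" VALUE="EUROPA_filt_topics.hts">
<INPUT TYPE="hidden" NAME="TopicInclude" VALUE="scadplus001">
```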

Results sorted by

<SELECT NAME="SortField">

<OPTION SELECTED VALUE="Score">Score

<OPTION VALUE="Modified">Date

</SELECT>

in

<SELECT NAME="SortOrder">

<OPTION SELECTED VALUE="desc">Descending

<OPTION VALUE="asc">Ascending

</SELECT> order

Allow the visitors to have results sorted by score or date and in ascending or descending order. Optional.

<INPUT TYPE="Submit" VALUE="Search">

Go for it!

<INPUT TYPE="Reset" VALUE="Reset">

Reset the values of the form parameters to their default values. Optional.

</FORM>

That's it!
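
Putting the pieces together, a minimal filtered search form could look like the sketch below. It uses only parameters described in this section and assumes an English-language page on EUROPA that searches the fullEUROPA meta-collection; adapt the collection and template names to your own server and language.

```html
<!-- Minimal sketch of a filtered search form. -->
<!-- Assumptions: page served from EUROPA, English results list, fullEUROPA meta-collection. -->
<FORM ACTION="/search/s97.vts" METHOD="POST">
<INPUT TYPE="hidden" NAME="Collection" VALUE="fullEUROPA">
<INPUT TYPE="hidden" NAME="Action" VALUE="FilterSearch">
<INPUT TYPE="hidden" NAME="Filter" VALUE="EUROPA_filter.hts">
<INPUT TYPE="hidden" NAME="ResultTemplate" VALUE="EUROPA_res-en.hts">
<INPUT TYPE="hidden" NAME="QueryMode" VALUE="Simple">
<INPUT TYPE="hidden" NAME="ResultCount" VALUE="10">
<TEXTAREA NAME="QueryText" ROWS=2 COLS=50></TEXTAREA>
<INPUT TYPE="Submit" VALUE="Search">
<INPUT TYPE="Reset" VALUE="Reset">
</FORM>
```

The "QueryMode" and "ResultCount" fields only restate the documented defaults and could be left out.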

information available for searching

The indexer programs index most of the file formats available on our web servers. The complete contents are indexed. In addition, a series of fields that are present either inside the documents or attached to them are indexed separately.
Here is an overview of these fields, each with its corresponding HTML tag, PDF field, and MS-Word field:

field name	HTML tags	standard PDF field	standard Word field
Title	Title tag	FTS_Title	Title
Keywords	meta tag Keywords	FTS_Keywords	Keywords
Modified	file time stamp	file time stamp	file time stamp
Description	meta tags Description or Abstract	FTS_Subject (1)	Subject (1)
Author	meta tag Author	FTS_Author	Author
Language	meta tag Language	n/a	n/a
Reference	meta tag Reference	n/a	n/a
Creator	meta tag Creator	n/a	n/a
Publisher	meta tag Publisher	n/a	n/a
Type	meta tag Type	n/a	n/a
DateAlarm	meta tag DateAlarm	n/a	n/a
DatePublication	meta tag DatePublication	n/a	n/a
Classification	meta tag Classification	n/a	n/a
(1) these fields are both being indexed as Subject fields

how to search on the contents of these fields

A query for "field_name <contains> string" can be used to find documents with a corresponding meta tag. Using <starts> or <matches> instead of <contains> where possible will give results more quickly because <contains> makes the search scan complete fields before deciding whether it found a match or not.
For numeric fields relational operators can be used. You can find more details in the help files on IntraComm. Here are a couple of examples to get you going:

Creator <contains> "COMM/DGX/D.2/EUROPAplus": will return all documents with a "Creator" meta tag containing the specified string
DatePublication >= 1999-05-01: will return all documents with a DatePublication meta tag whose value is greater than or equal to the specified value
(DatePublication < 1999-05-01) <AND> (DatePublication > 1900-01-01): will return all documents with a DatePublication meta tag whose value is smaller than the specified value; if you specified only the first part of the query, all documents that do not have the DatePublication meta tag set would be returned as well (the "null" value is the smallest possible value)
Language <matches> fr: will return all documents with the "Language" meta tag set to "fr"
Language <contains> fr: will return all documents that have the "Language" meta tag containing "fr"

Please consider implementing language searches with the TopicLang parameter. See the "search form" section of this document for details.

special fields

The "Title" and "Keywords" fields are attached to the body of the document as hidden zones when the document is being indexed. This has two consequences.
The contents of these fields will be included in the body of the documents while being indexed. This way we make sure that these contents are found when using a basic search.
The Verity Query Language "proximity" operator "<IN>" can be used to search within one of these zones. For example: specifying "competition <IN> Keywords" as a query will return all documents that have the word "competition" specified in the "Keywords" field.
See also the locally created input tags "InTitle" and "InKeywords".

zone searching

Zones are specific regions of a document to which searches can be limited. In our environment zone searching is available for HTML documents. For text regions of a document, searching a zone is much faster than searching a field. For numeric fields such as dates, you can perform relative comparisons using the relational operators; this is not possible on zones.
We already mentioned two zones (title and keywords) before. You can find a list of standard zones that are extracted from HTML documents in the detailed help on IntraComm (chapter on attribute searching).

Note: adding an extra field to be indexed, or changing fields into zones necessitates a complete re-indexation of all web pages. With the amount of data and the number of collections involved this turns out to be a major operation that requires some planning.

preparing documents for better search results

While creating web pages and other documents you should fill in the meta information. The document title and the "Keywords" and "Description" meta tags are especially useful. The indexer can retrieve this information from HTML pages, PDF documents and Word files, to name the most obvious file formats.

The results lists can show more useful information, provided that information is available.
The title of a document can only be shown when it is available in the document. Try to use meaningful titles: 2000 documents sharing the same title cannot be said to have meaningful titles. The description (a.k.a. "abstract") of a document will be shown in the results lists whenever it is provided by the author. Otherwise the indexer will have a go at it and produce a summary that will be shown in the results list. The resulting summaries are not always convincing, to say the least.

The contents of meta tags can be searched upon. For visitors this can reduce the noise; for authors this might be a means to push certain documents to the foreground.

Recommendation: for HTML pages that include a search form, add "Search Form" in the Keywords meta tag. This will allow us to find all search forms in case changes are needed.
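
The recommendations in this section could translate into a document head like the sketch below, here for a page that contains a search form. The title, keywords and description texts are invented for illustration; the meta tag names are those listed in the table of indexed fields.

```html
<!-- Sketch: meta information for a page that contains a search form. -->
<!-- Title, keyword and description texts are invented examples. -->
<HEAD>
<TITLE>Search the competition policy pages</TITLE>
<META NAME="Keywords" CONTENT="competition, state aid, Search Form">
<META NAME="Description" CONTENT="Search form covering the competition policy pages.">
<META NAME="Language" CONTENT="en">
</HEAD>
```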

performance issues

limiting scope of search

The scope of a search can be limited in two ways: through the choice of a collection to search or through added query information (SearchIn, ExcludeDir, topics etc.). Limiting the scope through added query information can reduce "noise" significantly but might slow down the response from the search engine. Users are encouraged to use topics when possible since they improve response time significantly while SearchIn and ExcludeDir add delay. Limiting the scope through the selection of one particular collection also gives significantly faster response times.
One might be tempted to create more, smaller, collections in order to have quicker searches with reduced noise. The problem with this approach is that it multiplies the number of index files that have to be opened for every search. For example: having one collection per language already creates 11 collections; multiply this number by the sub-domains on EUROPA requiring a separate collection because of subject, physical location, size, etc., and we quickly end up with dozens of collections. The search engine needs to open all these collections in order to allow for a global search throughout all the indexed information. The larger the number of files to be searched, the slower the queries will become. We might even run into system limits.

zone searching is faster than searching through fields

Where possible use zone searching. For example: the query "competition <in> keywords" is noticeably faster than the query "keywords <contains> competition". Even field searches using the <starts> or <matches> operators are much slower than zone searches.
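
If a form pre-fills the query box, the zone form of the query can be supplied as the default text, as in this sketch. The idea of pre-filling the box is an assumption made for illustration; the operator's angle brackets are written as character entities so the markup stays valid, and the browser submits them as plain "<" and ">".

```html
<!-- Sketch: query box pre-filled with the faster zone search from the example above. -->
<TEXTAREA NAME="QueryText" ROWS=2 COLS=50>competition &lt;in&gt; keywords</TEXTAREA>
```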

limitations

dynamically generated data

The indexer procedures have been set up (by default) not to index output generated by CGI scripts. HTML data generated through CGI scripts is by definition of a dynamic nature: two calls to the same CGI script will probably not return the same results. It does not make sense to index data that is not going to be the same after being indexed.

Another reason not to index output generated through CGI scripts is that this data normally does not have a "last modified" time stamp assigned. Server parsed HTML pages with server side includes have the same "problem". If we were to index dynamically generated data then we would need to re-index everything each time the web spider runs. This means: requesting the web server to execute each instance of every indexed call to a CGI script. Needless to say that this would generate a heavy performance hit, with uncertain results.

stemming

Stemming has been disabled.

stop words

Another problem related to the choice of the working language is the stop words. Stop words are different in each language: "the" is a stop word in English but not in French. We decided not to use stop words in this release. All words are searchable.

meta information

Meta information, such as the "title" and meta tags, is indexed and can be searched. The restriction is that the "fields" or "zones" that are to be indexed have to be set up at the time the collection is initialized. We chose to index the "compulsory" meta tags as described in the IPG for EUROPA.
Please note, nothing is provided in the indexer to give meta information a higher "weight" than other information. We are looking for a way to add weight to meta data while launching queries. Come back here later for more news.

one search, one collection format

One separate search action can only be launched on collections with an identical set up. This means for example that collections with different language set ups could only be searched through separate search actions. The same goes for collections with different sets of fields (meta tags, etc.) in the indexes.
Because we allow searches spanning all collections we need to have all collections set up identically.

clustering is not possible

Because all languages share the same collections we cannot use the "clustering" facility offered by the Search'97 Information Server. Clustering is used to automatically find groups of similar documents among those returned by the search engine. In our case (non-English) stop words are also used to build clusters, which creates a great deal of noise.
