SEARCHING YOUR PAGES
Our search server indexes the University of Kentucky web servers
continuously and may visit frequently changing pages several times
each week. It crawls our web pages by starting at the UK home page
(www.uky.edu) and following links. To be included in the index your
pages must:
· be on a University of Kentucky web server (This is primarily the
uky.edu domain, but a few other domains are included as well.);
· be on a web server that does not exclude the search server (Our
indexer follows the Robots Exclusion Protocol.);
· be accessible by following links from the UK home page (This usually
means having a link to your page from your department's, college's, or
other unit's page or from our site index.);
· not exclude indexing through HTML meta tags; and
· not require authentication using passwords.
SOME SEARCHING CONSIDERATIONS
Our search service, like most, has two parts: an indexing program (known
as a spider or robot or crawler) and a search program. The spider
follows links beginning with the www.uky.edu home page and examines and
indexes the pages it finds. The search program uses this index and
responds to user requests.
This means if you have a page that is not linked to by any other
page it won't be included in the search index! This may be
desirable for some pages, but for most it isn't. There must be an
unbroken chain of links from the UK home page at www.uky.edu to each
page that you want to be indexed. In some cases this might mean creating
a local index page for your site that includes links to your other
pages.
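The "unbroken chain of links" requirement can be pictured as a reachability check. The sketch below is only an illustration, not the spider's actual code: it walks a small in-memory link graph breadth-first from the home page. The function name and every URL other than www.uky.edu are hypothetical.

```python
from collections import deque

def reachable_pages(links, start):
    """Return the set of pages reachable from `start` by following links.

    `links` maps each page URL to the list of URLs it links to.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical link graph; every URL except the home page is made up.
links = {
    "www.uky.edu": ["www.uky.edu/Dept"],
    "www.uky.edu/Dept": ["www.uky.edu/Dept/staff.html"],
    # orphan.html links out, but no page links to it.
    "www.uky.edu/Dept/orphan.html": ["www.uky.edu/Dept"],
}

indexed = reachable_pages(links, "www.uky.edu")
print("www.uky.edu/Dept/staff.html" in indexed)   # linked from the unit page
print("www.uky.edu/Dept/orphan.html" in indexed)  # no inbound link
```

Adding the orphan page to a local index page that is itself linked from the unit's page would bring it into the reachable set.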
Most indexing services, including the numerous commercial services that
also search our servers, index each word in an HTML file, with words in
titles and headings generally given greater weight. In addition to HTML
pages, our spider indexes plain text files, PDF files, numerous word
processing formats, PowerPoint files, and many others. The search server
caches the files it indexes: it keeps a local copy which people
using the service may view. This can be useful, for example, if the
original server is down. For some pages caching is inappropriate and
this can be controlled with meta tags. For files other than HTML the
search server will offer to display the file in a simple HTML or text
form from its cached copy. This is useful if the user doesn't have the
application needed to read the original file.
Note that text that is included on a web page as an image rather than
actual text can't be indexed (the spider can't read pictures of
words) and thus won't be found by a search. Most spiders will
find text in alt attributes if you include them, however. If your page
depends on the browser interpreting some Java or JavaScript code to
locate the actual content, it won't be found by spiders. Pages using
frames can cause odd results, so frames must be used with care. Pages
using Flash and similar graphic-oriented products will require special
care to ensure that they are indexed appropriately.
Spiders generally have a limit on the size of the pages they will
download and process. For HTML files our system will index up to a size
of 2.5 megabytes, then discard the remainder of the file. Non-HTML
files of up to 30 megabytes will be processed; files larger than that
are ignored. Non-HTML files are converted to HTML and the first 2.5
megabytes of the HTML file are indexed. The remainder is discarded.
These sizes include just the HTML page or other document itself, not any
associated images in separate files, so the limits are unlikely to
affect anything but large PDF and PowerPoint files. You should consider
breaking large PDF documents into smaller parts in any case. One
megabyte is a good limit.
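As a rough illustration of these limits, the sketch below encodes the two thresholds as stated above. It is not taken from the search server itself. The function names are hypothetical, and the byte value of a megabyte is an assumption, since this page does not say whether it means 10**6 or 2**20 bytes.

```python
# Assumption: one megabyte is taken as 10**6 bytes; the page does not specify.
MB = 1_000_000
HTML_INDEX_LIMIT = 2_500_000   # 2.5 MB of HTML is indexed, the rest discarded
NON_HTML_MAX = 30 * MB         # non-HTML files larger than this are ignored

def is_processed(size_bytes, is_html):
    """Return True if a file of this size would be processed at all."""
    return is_html or size_bytes <= NON_HTML_MAX

def html_bytes_indexed(html_size):
    """Return how many bytes of an HTML (or converted-to-HTML) file are indexed."""
    return min(html_size, HTML_INDEX_LIMIT)

print(html_bytes_indexed(4 * MB))              # only the first 2.5 MB count
print(is_processed(31 * MB, is_html=False))    # over the non-HTML ceiling
```

For a non-HTML file the second function applies to the size of the converted HTML, which cannot be predicted from the size of the original file.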
SEARCHING META TAGS
Some spiders also look for meta tags containing additional information
about a web page. Our Google Search Appliance generally ignores meta
tags and doesn't display them on the results page, but they can be used
to help restrict searches. It also attempts to discover metadata in some
non-HTML files. When possible it will identify title, author, subject,
and keyword information and index them. For PDF files only the title is
used. Page titles are indexed separately and can be searched using the
intitle: search parameter. Indexed metadata is treated as part of the
page body.
Keywords and description tags are supported by some other services:
<meta name="keywords" content="your keywords here">
<meta name="description" content="your description here">
Put meta tags in the head section of an HTML file along with the title tag.
Indexing engines will only use a limited amount of text, so keep your
description and keywords list concise.
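To check what a spider that reads these tags would see, you can extract them yourself. The following is a minimal sketch using Python's standard html.parser module. The class and function names are hypothetical, and the sample page is made up around the two tags shown above.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

def read_meta(html_text):
    """Return a dict of meta tag names to their content values."""
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.meta

# Made-up sample page using the two tags from the text above.
page = """<html><head>
<title>Sample page</title>
<meta name="keywords" content="your keywords here">
<meta name="description" content="your description here">
</head><body><p>Body text.</p></body></html>"""

print(read_meta(page))
```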
More complex metadata schemes have been proposed to allow more elaborate
cataloging of web pages, although none have yet found wide acceptance.
See the World Wide Web Consortium's Metadata and Resource Description,
which includes material on the still-evolving Resource Description
Framework (RDF), and the Dublin Core Metadata Initiative.
CONTROLLING INDEXING OF YOUR PAGES
Most spiders follow a common Robots Exclusion Protocol, which allows
you to prevent some or all of your pages from being
indexed. Our spider follows this protocol. You can use the robots meta
tag in a specific page to control how it is handled — disable
indexing or disable following links from the page, for example. Google
respects the noindex, nofollow, and noarchive values of the robots meta
tag.
If you are running your own server you can use the robots.txt file
to globally control how your server is indexed. There is only a single
robots.txt file for each server. For example, the robots.txt file for
www.uky.edu is at the URL www.uky.edu/robots.txt.
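One way to see how a well-behaved spider would read a robots.txt file is Python's standard urllib.robotparser module. The sketch below is an illustration only: the rules, the /private/ directory, and the "example-spider" user agent are all hypothetical and are not the contents of the real www.uky.edu/robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every spider out of one directory.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse text directly, no network access

blocked = parser.can_fetch("example-spider", "http://www.uky.edu/private/notes.html")
allowed = parser.can_fetch("example-spider", "http://www.uky.edu/index.html")
print(blocked, allowed)
```

A spider that follows the protocol would skip the first URL and fetch the second.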
Remember that these methods are merely suggestions to the indexing
program. You have no way to compel their compliance. Sensitive
or confidential information shouldn't be placed on publicly available
web pages!
It is possible to exclude known spiders from your pages with a deny
directive in a .htaccess file. (This works on Apache servers, such
as www.uky.edu.) For example, our Google Search Appliance is
diogenes.uky.edu, so putting a .htaccess file containing:
deny from diogenes.uky.edu
in a directory will prevent our appliance from indexing and caching
the files in that directory and any subdirectories. We have more
information about restricting access to web pages.
CONTROLLING CACHING OF YOUR PAGES
Caching by the Google Search Appliance can be a particular concern if
you have pages that are restricted to certain groups of clients by their
IP address (see our Restricting Access To Web Pages information). Since
our GSA has a UK IP address, it will be given access
to any web page that is restricted to UK IP addresses. Since it caches
the pages it indexes, the cached copy then becomes available to all
users, regardless of their IP address. You can stop caching of pages by
including this meta tag:
<meta name="robots" content="noarchive">
The Google appliance also uses a small representative piece of each page
returned as a search result. You can disable this as well:
<meta name="googlebot" content="nosnippet">
(Note that instructions to the Google crawler can be placed in either
robots or googlebot meta tags.) These meta tags must be present in
documents at crawl time for the appliance to obey them!
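Because the tags must be in the document when it is crawled, it can help to verify a page before publishing it. The sketch below is a hypothetical checker, not part of the appliance. It gathers the comma-separated values from robots and googlebot meta tags so you can confirm that noarchive or nosnippet is really present.

```python
from html.parser import HTMLParser

class RobotsDirectives(HTMLParser):
    """Gather directives from robots and googlebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name not in ("robots", "googlebot"):
            return
        # A single tag may carry several comma-separated values.
        for value in (attributes.get("content") or "").split(","):
            value = value.strip().lower()
            if value:
                self.directives.add(value)

def robots_directives(html_text):
    """Return the set of directives found in the page's meta tags."""
    parser = RobotsDirectives()
    parser.feed(html_text)
    return parser.directives

# Made-up sample head section using the two tags from the text above.
sample = (
    '<head><meta name="robots" content="noarchive">'
    '<meta name="googlebot" content="nosnippet"></head>'
)
found = robots_directives(sample)
print("noarchive" in found, "nosnippet" in found)
```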
A special case is PDF files that should be indexed, but not cached.
There is no way to directly include meta information in a PDF file,
but if security is enabled for a PDF file it will be treated as if
the noarchive tag was specified. Security settings can be controlled
using Adobe Acrobat (not the free Reader).
RANKING SEARCH RESULTS
The Google Search Appliance considers both the content of web pages and
the link relationships among pages to rank the results. The default
results page is in order of decreasing rank. We can define special
keywords that will always return specified pages at the top of search
results.
CUSTOM SEARCHING
You can include a search form on your pages that will restrict the
search to your pages in a couple of ways. Here's an example:
Examine the HTML source of this page to see the details. The key items
are the as_dt and as_sitesearch fields. Set as_dt to "i" to include only
the specified path and specify the host and directory path in the
as_sitesearch field (e.g. "www.uky.edu/Providers"). Add a slash to the end
to exclude lower levels of subdirectories. Only a single URL can be
specified using this mechanism.
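Since the restriction becomes part of the search URL, the effect of the two fields can be sketched by building that URL directly. This is an illustration under stated assumptions: the endpoint below is a placeholder because this page does not give the search server's address, and "q" is assumed to be the name of the query field. Only as_dt and as_sitesearch come from the text above.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real search server URL is not given on this page.
SEARCH_URL = "http://search.example.edu/search"

def restricted_search_url(terms, site_path):
    """Build a search URL limited to one host and directory path."""
    params = {
        "q": terms,                  # assumed name of the query field
        "as_dt": "i",                # "i" includes only the specified path
        "as_sitesearch": site_path,  # host plus directory path
    }
    return SEARCH_URL + "?" + urlencode(params)

url = restricted_search_url("flu shots", "www.uky.edu/Providers")
print(url)
```

In a form, the same two values would normally be carried as hidden input fields alongside the user's text box.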
The qualification used becomes part of the search URL and is also displayed
on the results page and as part of the search box on the results page.
Note that if you are converting search forms from our old Ultraseek
search service there are several changes necessary, including the name
of the input field for the user of the form.
Another method of restricting searches uses the sitesearch parameter
rather than as_dt and as_sitesearch. In this case the restriction does
not appear on the results page as in the first example, although it is
still part of the URL.
If your unit requires more complex search qualifications (to include
results from several directories or servers, for example) you will need
to use the as_oq parameter instead of sitesearch or as_sitesearch. The
as_oq parameter adds additional terms to the query using a Boolean OR.
The as_oq string becomes part of the query string and results display,
so the results page can get rather cluttered. We may be able to set up a
subcollection for you to simplify this. The following example uses a
subcollection.
The Google Search Appliance has a number of keywords defined for items
of campus-wide interest. Even if a search is restricted using one of
these methods the keywords will still be found. You can suppress the
keyword matches by including the numgm parameter with a value of 0 (zero).
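If you already have a working search URL, the numgm parameter can be added to it mechanically. The sketch below uses only Python's standard urllib.parse module. The function name and the sample URL (including its host and its "q" query field) are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def suppress_keyword_matches(search_url):
    """Return `search_url` with numgm=0 set, replacing any existing numgm."""
    parts = urlsplit(search_url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "numgm"
    ]
    params.append(("numgm", "0"))
    return urlunsplit(parts._replace(query=urlencode(params)))

# Hypothetical search URL restricted with the sitesearch parameter.
original = "http://search.example.edu/search?q=parking&sitesearch=www.uky.edu"
quiet = suppress_keyword_matches(original)
print(quiet)
```

In a search form the equivalent is one more hidden field named numgm with the value 0.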
This page was last updated on 2006-06-26.
Please direct questions and comments regarding this page to
webmaster@www.uky.edu.