We have moved from a Google Search Appliance to Google Custom Search Engines. The Appliance was removed from service at the end of June 2013. This is the old documentation for the GSA; see the current documentation for current information.
Our search server indexes the University of Kentucky web servers continuously and may visit frequently changing pages several times each week, or even several times each day. It crawls our web pages by starting at the UK home page (www.uky.edu) and following links. To be included in the index your pages must:
· be on a University of Kentucky web server (This is primarily the uky.edu domain, but a few other domains are included as well.);
· be on a web server that does not exclude the search server (Our indexer follows the Robots Exclusion Protocol.);
· be accessible by following links from the UK home page (This usually means having a link to your page from your department’s, college’s, or other unit’s page or from our site index.);
· not exclude indexing through HTML meta tags; and
· not require authentication using passwords.
SOME SEARCHING CONSIDERATIONS
Our search service, like most, has two parts: an indexing program (known as a spider or robot or crawler) and a search program. The spider follows links beginning with the www.uky.edu home page and examines and indexes the pages it finds. The search program uses this index and responds to user requests.
This means if you have a page that is not linked to by any other page it won’t be included in the search index! This may be desirable for some pages, but for most it isn’t. There must be an unbroken chain of links from the UK home page at www.uky.edu to each page that you want to be indexed. In some cases this might mean creating a local index page for your site that includes links to your other pages.
Most indexing services, including the numerous commercial services that also search our servers, index each word in an HTML file, with words in titles and headings generally given greater weight. In addition to HTML pages, our spider indexes plain text files, PDF files, numerous word processing formats, PowerPoint files, and many others. The search server caches the files it indexes — it keeps a local copy which people using the service may view. This can be useful, for example, if the original server is down. For some pages caching is inappropriate, and this can be controlled with meta tags. For files other than HTML the search server will offer to display the file in a simple HTML or text form or a formatted preview from its cached copy. This is useful if the user doesn’t have the application needed to read the original file.
Spiders generally have a limit on the size of the pages they will download and process. For HTML files our system indexes the first 2.5 megabytes and discards the remainder of the file. Non-HTML files of up to 30 megabytes will be processed; files larger than that are ignored. Non-HTML files are converted to HTML, and the first 2.5 megabytes of the resulting HTML are indexed; the remainder is discarded. These sizes include just the HTML page or other document itself, not any associated images in separate files, so the limits are unlikely to affect anything but large PDF and PowerPoint files. You should consider breaking large PDF documents into smaller parts in any case; one megabyte is a good limit.
Our search appliance tries to visit pages often enough to catch any changes. It uses the Last-Modified HTTP response header to see when a page was last modified and will check back sooner if a page seems to be changing often. Some web servers and content management systems do not set the Last-Modified header to the date and time of the last content change, but simply use the current time or the time the page was last generated. This can cause the spider to check back very frequently, placing an unnecessarily high load on the server.
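A correctly configured server reports the time of the last content change, not the time the response was generated. A sketch of the relevant response headers (the dates shown are illustrative):

```text
HTTP/1.1 200 OK
Date: Mon, 03 Jun 2013 14:22:05 GMT
Last-Modified: Tue, 15 Jan 2013 19:03:20 GMT
Content-Type: text/html
```

Here the Date header changes with every response, but Last-Modified stays fixed until the content actually changes, which is what lets the spider space out its visits.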
SEARCHING META TAGS
Some spiders also look for meta tags containing additional information about a web page. Our Google Search Appliance generally treats the content of meta tags as additional text and indexes it. It also attempts to discover metadata in some non-HTML files. When possible it will identify title, author, subject, and keyword information and index them. For PDF files only the title is used. Page titles are indexed separately and can be searched using the intitle: search parameter. Indexed metadata is treated as part of the page body.
Keywords and description tags are supported by some other services:
<meta name="keywords" content="your keywords here">
<meta name="description" content="your description here">
Put meta tags in the head section of an HTML file along with the title tag. Indexing engines will only use a limited amount of text, so keep your description and keywords list concise.
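For example, a minimal head section with a title and both tags might look like this (the department name and content values are placeholders):

```html
<head>
  <title>Department of Example Studies</title>
  <meta name="keywords" content="example studies, research, courses">
  <meta name="description" content="Home page of the Department of Example Studies at the University of Kentucky.">
</head>
```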
More complex metadata schemes have been proposed to allow more elaborate cataloging of web pages, although none have yet found wide acceptance. See the World Wide Web Consortium’s Metadata and Resource Description, which includes material on the still evolving Resource Description Framework (RDF), and the Dublin Core Metadata Initiative.
CONTROLLING INDEXING OF YOUR PAGES
Most spiders follow a common Robots Exclusion Protocol which allows you to prevent some or all of your pages from being indexed. Our spider follows this protocol. You can use the robots meta tag in a specific page to control how it is handled — disable indexing or disable following links from the page, for example. Google respects the noindex, nofollow, and noarchive values of the robots meta tag.
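For example, to keep a page out of the index and prevent the spider from following its links, you could place this tag in the page's head section:

```html
<meta name="robots" content="noindex, nofollow">
```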
If you are running your own server you can use the robots.txt file to globally control how your server is indexed. There is only a single robots.txt file for each server. For example, the robots.txt file for www.uky.edu is at the URL www.uky.edu/robots.txt.
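A robots.txt file lists paths that crawlers should not visit. A minimal sketch (the User-agent value * applies to all crawlers; the paths are hypothetical):

```text
User-agent: *
Disallow: /private/
Disallow: /drafts/
```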
Remember that these methods are merely suggestions to the indexing program; you cannot compel compliance. Sensitive or confidential information should not be placed on publicly available web pages!
It is possible to exclude known spiders from your pages with a deny directive in a .htaccess file. (This works on Apache servers, such as www.uky.edu.) For example, our Google Search Appliance is diogenes.uky.edu, so putting a .htaccess file containing:
deny from diogenes.uky.edu
in a directory will prevent our appliance from indexing and caching the files in that directory and any subdirectories. We have more information about restricting access to web pages.
CONTROLLING CACHING OF YOUR PAGES
Caching by the Google Search Appliance can be a particular concern if you have pages that are restricted to certain groups of clients by their IP address (see our Restricting Access To Web Pages information). Since our GSA has a UK IP address, it will be given access to any web page that is restricted to UK IP addresses. Since it caches the pages it indexes, the cached copy then becomes available to all users, regardless of their IP address. You can stop caching of pages by including this meta tag:
<meta name="robots" content="noarchive">
The Google appliance also uses a small representative piece of each page returned as a search result. You can disable this as well:
<meta name="googlebot" content="nosnippet">
(Note that instructions to the Google crawler can be placed in either robots or googlebot meta tags.) These meta tags must be present in documents at crawl time for the appliance to obey them!
A special case is PDF files that should be indexed, but not cached. There is no way to directly include meta information in a PDF file, but if security is enabled for a PDF file it will be treated as if the noarchive tag had been specified. Security settings can be controlled using Adobe Acrobat (not the free Reader).
RANKING SEARCH RESULTS
The Google Search Appliance considers both the content of web pages and the link relationships among pages to rank the results. The default results page is in order of decreasing rank. We can define special keywords that will always return specified links at the top of search results.
You can include a search form on your pages that restricts the search to your own pages in a couple of ways. The key form fields are as_dt and as_sitesearch. Set as_dt to “i” to include only the specified path, and specify the host and directory path in the as_sitesearch field (e.g. “www.uky.edu/Providers”). Add a slash to the end to exclude lower levels of subdirectories. Only a single URL can be specified using this mechanism. The qualification used becomes part of the search URL and is also displayed on the results page and as part of the search box on the results page.
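A sketch of such a form; the action URL is a placeholder for the campus search page, and the as_sitesearch path is an example:

```html
<!-- The action URL below is a placeholder for the campus search page. -->
<form method="get" action="http://www.uky.edu/search">
  <input type="text" name="q" size="20">
  <input type="hidden" name="as_dt" value="i">
  <input type="hidden" name="as_sitesearch" value="www.uky.edu/Providers">
  <input type="submit" value="Search">
</form>
```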
Another method of restricting searches uses the sitesearch parameter rather than as_dt and as_sitesearch. In this case the restriction does not appear on the results page as in the first example, although it is still part of the URL.
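A form using the sitesearch parameter might look like this (the action URL and path are again placeholders):

```html
<form method="get" action="http://www.uky.edu/search">
  <input type="text" name="q" size="20">
  <input type="hidden" name="sitesearch" value="www.uky.edu/Providers">
  <input type="submit" value="Search">
</form>
```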
If your unit requires more complex search qualifications (to include results from several directories or servers, for example) you will need to use the as_oq parameter instead of sitesearch or as_sitesearch. The as_oq parameter adds additional terms to the query using a boolean or.
The as_oq string becomes part of the query string and the results display, so the results page can get rather cluttered. We may be able to set up a collection for you to simplify this.
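On the Google Search Appliance a collection is normally selected with the site parameter. A sketch of a form using a collection (the collection name and action URL are hypothetical; use the values we assign when the collection is set up):

```html
<form method="get" action="http://www.uky.edu/search">
  <input type="text" name="q" size="20">
  <input type="hidden" name="site" value="your_collection">
  <input type="submit" value="Search">
</form>
```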
The Google Search Appliance has a number of keywords defined for items of campus-wide interest. Even if a search is restricted using one of these methods the keywords will still be found. You can suppress the keyword matches by including the numgm parameter with a value of 0 (zero).
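To suppress keyword matches, add the parameter to your search form as a hidden field:

```html
<input type="hidden" name="numgm" value="0">
```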