SEARCHING YOUR PAGES

We have moved from a Google Search Appliance to Google Custom Search Engines. The Appliance was removed from service at the end of June 2013. The old GSA documentation is still available.

We are using Google Custom Search Engines, a Google-hosted service, to search our web pages. This searches a subset of Google’s index based on domain names or URL patterns. To be included in our campus search, your pages must:

· be on a University of Kentucky web server (this is primarily the uky.edu domain, but a few other domains are included as well; if you have a UK domain not ending in uky.edu, contact the address at the end of this page to have it included in the search);

· be on a web server that does not exclude the Google crawler (Google follows the Robots Exclusion Protocol);

· be accessible by following links from an indexed page (this usually means having a link to your page from your department’s, college’s, or other unit’s page, or from our site index);

· not exclude indexing through HTML meta tags; and

· not require authentication (using passwords, IP address restrictions, etc.).

SOME SEARCHING CONSIDERATIONS

Search services generally have two parts: an indexing program (known as a spider, robot, or crawler) and a search program. The spider follows links, beginning with the www.uky.edu home page or any other page it encounters, and examines and indexes the pages it finds. The search program uses this index to respond to user requests.

This means that if you have a page that is not linked to by any other page, it won’t be included in the search index! This may be desirable for some pages, but for most it isn’t. There must be an unbroken chain of links from some indexed page to each page that you want indexed. In some cases this might mean creating a local index page or sitemap that links to your other pages.
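
As an illustration only, a local index page can be as simple as an ordinary HTML page of links; the unit name and file names below are hypothetical:

  <html>
  <head><title>Department of Widgetry - Site Index</title></head>
  <body>
    <h1>Site Index</h1>
    <ul>
      <li><a href="projects.html">Current projects</a></li>
      <li><a href="staff.html">Staff directory</a></li>
      <li><a href="reports/annual-report.pdf">Annual report (PDF)</a></li>
    </ul>
  </body>
  </html>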

Most indexing services, including the numerous commercial services that search our servers, index each word in an HTML file, with words in titles and headings generally given greater weight. In addition to HTML pages, many crawlers index plain text files, PDF files, numerous word processing formats, PowerPoint files, and others. The search server may cache the files it indexes, keeping a local copy that people using the service may view. This can be useful, for example, if the original web server is down, or the copy may be used to present a preview to the user. For some pages caching is inappropriate, and it can be controlled with meta tags. For files other than HTML the search server may offer to display the file in a simple HTML or text form, or as a formatted preview from its cached copy. This is useful if the user doesn’t have the application needed to read the original file.

Note that text included on a web page as an image rather than as actual text generally can’t be indexed (most spiders can’t read pictures of words), and thus such text won’t be found by a search. Some will perform OCR if they recognize text in an image, but the accuracy of the results can be problematic. Most spiders will find text in alt or title attributes if you include them, however. If your page depends on the browser interpreting Java or JavaScript code to locate the actual content, it won’t be found by spiders. Pages using frames can cause odd results, so frames must be used with care. Pages using Flash and similar graphics-oriented products will require special care to ensure that they are indexed appropriately.
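
For example, a hypothetical image of a building name would still be searchable if it carries descriptive alternate text:

  <img src="images/main-building-sign.png" alt="Main Building, University of Kentucky">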

Spiders generally have a limit on the size of the pages they will download and process. Typically only the first few megabytes are read and the remainder is discarded; files larger than a certain size may be ignored entirely. These limits apply only to the HTML page or other document itself, not to any associated images in separate files, so they are unlikely to affect anything but large PDF and PowerPoint files. You should consider breaking large PDF documents into smaller parts in any case; one megabyte is a good limit.

Search services try to visit pages often enough to catch any changes. They may use the Last-Modified HTTP response header to see when a page was last modified and will check back sooner if a page seems to change often. Some web servers and content management systems do not set the Last-Modified header to the date and time of the last content change, but simply use the current time or the time the page was last generated. This can cause the spider to check back very frequently, placing an unnecessarily high load on the server.
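
A correctly set Last-Modified header gives the spider the date and time of the last content change; the value below is only an example:

  Last-Modified: Tue, 02 Jul 2013 14:30:00 GMT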

SEARCHING META TAGS

Some spiders also look for meta tags containing additional information about a web page. Google recognizes certain meta tags and will use them in indexing, sorting results, and creating page snippets. It also attempts to discover metadata in some non-HTML files. When possible it will identify title, author, subject, and keyword information and index it. For PDF files, the title metadata is used. Page titles are indexed separately and can be searched using the intitle: search operator. Indexed metadata is treated as part of the page body.
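
For example, a query such as the following (the search term is only illustrative) matches pages whose titles contain the word:

  intitle:registrar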

Keywords and description tags are supported by some services:

  <meta name="keywords" content="your keywords here">
  <meta name="description" content="your description here">

Put meta tags in the head section of an HTML file, along with the title tag. Indexing engines will use only a limited amount of text, so keep your description and keyword list concise.
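
A hypothetical head section with a title and both meta tags might look like this:

  <head>
    <title>Department of Widgetry - Home Page</title>
    <meta name="keywords" content="widgets, widget research, University of Kentucky">
    <meta name="description" content="Research, teaching, and outreach in widgetry at the University of Kentucky.">
  </head>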

More complex metadata schemes have been proposed to allow more elaborate cataloging of web pages, although none have yet found wide acceptance. See the World Wide Web Consortium’s Metadata and Resource Description, which includes material on the still evolving Resource Description Framework (RDF), and the Dublin Core Metadata Initiative.

CONTROLLING INDEXING OF YOUR PAGES

Most spiders follow a common Robots Exclusion Protocol which allows you to prevent some or all of your pages from being indexed. You can use the robots meta tag in a specific page to control how it is handled — disable indexing or disable following links from the page, for example. Google respects the noindex, nofollow, and noarchive values of the robots meta tag.
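
For example, to keep a single page out of the index and keep crawlers from following its links, place this tag in the page’s head section:

  <meta name="robots" content="noindex, nofollow">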

If you are running your own server you can use the robots.txt file to globally control how your server is indexed. There is only a single robots.txt file for each server. For example, the robots.txt file for www.uky.edu is at the URL www.uky.edu/robots.txt.
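
A robots.txt file is a plain text file of User-agent and Disallow lines. As an illustration only (the directory names are hypothetical), the following tells all crawlers to skip two directories while leaving the rest of the server open:

  User-agent: *
  Disallow: /private/
  Disallow: /drafts/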

Remember that these methods are merely suggestions to indexing programs; you have no way to compel compliance. Sensitive or confidential information should not be placed on publicly available web pages!

It is possible to exclude known spiders from your pages with a deny directive in a .htaccess file. (This works on Apache servers, such as www.uky.edu.) We have more information about restricting access to web pages.
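
One possible sketch, assuming an Apache 2.2-style server with mod_setenvif available; the user-agent string matched here is only an example:

  # Mark requests from a known crawler, then deny them
  SetEnvIfNoCase User-Agent "Googlebot" blocked_spider
  Order allow,deny
  Allow from all
  Deny from env=blocked_spider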

CONTROLLING CACHING OF YOUR PAGES

Caching by search services is generally desirable, but if you have pages that should not be cached you can stop it by including this meta tag:

  <meta name="robots" content="noarchive">

Google also displays a small representative piece of each page (a snippet) with each search result. You can disable this as well:

  <meta name="googlebot" content="nosnippet">

(Note that instructions to the Google crawler can be placed in either robots or googlebot meta tags.) These meta tags must be present in documents at crawl time for the service to obey them!

A special case is PDF files that should be indexed but not cached. There is no way to include robots meta information directly in a PDF file, but if security is enabled for a PDF file it will be treated as if the noarchive tag were specified. Security settings can be controlled using Adobe Acrobat (not the free Reader).

RANKING SEARCH RESULTS

Google considers both the content of web pages and the link relationships among pages to rank the results. The default results page lists results in order of decreasing rank. We can define special keywords that will always return specified links at the top of search results when using our Custom Search Engine. If you need a keyword added to our CSE, contact the address at the end of this page to have it included.

CUSTOM SEARCHING

If you want to include a search form on your pages that restricts the search to your own pages, you will need to create a Google Custom Search Engine using your URL pattern. This is a free service. More information is available from Google.

Sample pages using several CSEs are available as templates and for testing.

· Dedicated UK-wide CSE search page:
http://ukcc.uky.edu/cse/
This is most like the unrestricted search on our former search appliance.

· Page with a search box that shows results on a separate page:
http://ukcc.uky.edu/cse/searchbox.html
This uses the same UK-wide CSE.

· Search page using a CSE that searches only the Libraries pages — libraries.uky.edu, libguides.uky.edu, www.uky.edu/Libraries, and library.law.uky.edu: http://ukcc.uky.edu/cse/libsearch.html

· Search page using a CSE that searches only Healthcare and Medical Center-related pages — ukhealthcare.uky.edu, www.mc.uky.edu, www.hosp.uky.edu, and a few others:
http://ukcc.uky.edu/cse/medsearch.html

Examine the code on those pages to see how Custom Search Engine calls are set up. The full functionality of the Google CSE requires JavaScript in the web browser. A script (supplied by Google) is placed in the head section, and gcse:search or gcse:searchbox-only tags in the body dynamically define the search box and results. The sample pages also include noscript blocks to provide as much function as possible without JavaScript. The cx parameter selects the CSE that will be used; each CSE has an identifier assigned by Google. The sample pages use the cx codes for the CSEs described above.
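
As a rough sketch of the pattern used on those pages (based on the Google-supplied code of this era; the cx value below is a placeholder, not a real CSE identifier), the pieces fit together like this:

  <head>
    <title>Search Example</title>
    <script>
      (function() {
        var cx = '000000000000000000000:placeholder';  // placeholder CSE identifier
        var gcse = document.createElement('script');
        gcse.type = 'text/javascript';
        gcse.async = true;
        gcse.src = 'https://www.google.com/cse/cse.js?cx=' + cx;
        var s = document.getElementsByTagName('script')[0];
        s.parentNode.insertBefore(gcse, s);
      })();
    </script>
  </head>
  <body>
    <gcse:search></gcse:search>
  </body>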

If you have existing forms that use custom searches on our GSA, they may continue to work using our default CSE. We will be redirecting search queries sent to diogenes.uky.edu to google.com, but because of differences between the two services only the simplest queries are likely to perform well. Complex custom queries may return no results, or results from the entire UK-wide CSE rather than from a restricted subset.


This page was last updated on 2013-07-02. Please direct questions and comments regarding this page to webmaster@www.uky.edu.