Google Search Appliance - Excluding Content

This article explains how to exclude your web-based content from the University of Chicago Google Search Appliance (GSA).

The University of Chicago uses a Google Search Appliance (GSA) as the main search tool for public web-based content. The Appliance continually indexes new documents as they are posted to the University of Chicago websites, and guides users to relevant content using customized search results. The GSA uses the same technology as Google.com: it's a locally run instance of Google focused exclusively on the University of Chicago.

Excluding Content from the GSA

You can have a site permanently removed from our appliance crawl by contacting us and requesting an addition to the "Do Not Crawl" URL patterns list. Note: This step will only remove the content from the UChicago GSA index; other search engines will continue to crawl and index your site if it is publicly available. We recommend using the "robots" methods described below in order to maintain control over the indexing of your site.

To remove content from all search engines, consider using one of the methods described below.

Exclude individual pages:

To exclude individual pages from search engine crawls, include the following meta tag between the <head> and </head> tags on each page:

<meta name="robots" content="noindex, nofollow">

This prevents compliant crawlers (robots) from indexing the page or following any links on it. If the page has already been indexed, it will be removed from the index the next time Google crawls the page. If you later remove the tag, the page will be indexed again the next time Google crawls it.
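To confirm that a page actually carries the tag, a quick check along the following lines may help. This is only a minimal sketch using Python's standard html.parser module; the class name RobotsMetaParser, the function robots_directives, and the sample page are hypothetical and not part of the GSA or any University tooling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # The name attribute is matched case-insensitively.
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def robots_directives(html):
    """Return the set of robots meta directives present in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical sample page used only for illustration.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Internal notes</title>
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body><p>Not for search results.</p></body>
</html>
"""

if __name__ == "__main__":
    found = robots_directives(SAMPLE_PAGE)
    print("noindex" in found, "nofollow" in found)
```

The sketch inspects only the HTML it is given; it does not tell you whether any search engine has already indexed the page.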

Exclude an entire site:

To exclude an entire site or directory from search engine crawls, place a robots.txt file at the top level of the site. The contents of the robots.txt file should resemble the following lines (to exclude only a directory, replace / with that directory's path, such as /private/):

User-agent: *
Disallow: /

(Note: If you just want to restrict the UChicago appliance from crawling your site, the appliance user-agent is: "gsa-crawler (Enterprise; S5-PK7Z8TT6T2NJS; webadmin-bots@listhost.uchicago.edu,alantak@uchicago.edu)".)
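Before publishing a robots.txt file, it can be useful to test how its rules apply to different crawlers. The sketch below uses Python's standard urllib.robotparser module to evaluate a rule set that blocks only the appliance. It assumes the appliance honors a rule addressed to the short token "gsa-crawler" taken from the start of its user-agent string; the example URL and the is_allowed helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: block only the UChicago appliance, leave other crawlers alone.
ROBOTS_TXT = """\
User-agent: gsa-crawler
Disallow: /
"""

# Full user-agent string as given in the note above.
GSA_USER_AGENT = (
    "gsa-crawler (Enterprise; S5-PK7Z8TT6T2NJS; "
    "webadmin-bots@listhost.uchicago.edu,alantak@uchicago.edu)"
)

# Hypothetical URL used only for illustration.
EXAMPLE_URL = "https://www.example.edu/department/page.html"


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print("appliance allowed:", is_allowed(ROBOTS_TXT, GSA_USER_AGENT, EXAMPLE_URL))
    print("other crawler allowed:", is_allowed(ROBOTS_TXT, "Googlebot", EXAMPLE_URL))
```

This reflects how Python's parser interprets the file. Individual crawlers may match user-agent names slightly differently, so treat the result as a sanity check rather than a guarantee.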

More detailed instructions about robots.txt files can be found at The Web Robots Pages.

Keywords: gsa, custom, indexing, exclude, do_not_crawl, robots.txt, website, disallow
Doc ID: 15755
Owner: Alan T.
Group: University of Chicago
Created: 2010-11-28 19:00 CDT
Updated: 2017-04-04 10:18 CDT
Sites: University of Chicago, University of Chicago - Sandbox