INFORMATION RETRIEVAL CHAPTER 7 PRESENTED BY: Ashish Gautam BSc CSIT 7th Sem
Search Engine
• A program that searches documents for specified keywords and returns a list of the documents in which the keywords were found.
• Typically, a search engine works by sending out a spider to fetch as many documents as possible.
• Another program, the indexer, then reads these documents and creates an index based on the words contained in each document, so that only meaningful results are retrieved for a query.
• A web search engine is designed to search for information on the WWW.
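To make the indexer's role concrete, here is a minimal Python sketch of an inverted index, the basic structure indexers build: each word maps to the set of documents containing it, so a query term can be looked up directly. The sample documents are hypothetical.

```python
from collections import defaultdict

# Hypothetical fetched documents: ID -> text.
documents = {
    1: "web search engines crawl the web",
    2: "an indexer builds an index of words",
}

# Inverted index: word -> set of IDs of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

print(sorted(index["web"]))    # [1]
print(sorted(index["index"]))  # [2]
```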
How Does a Search Engine Work?
• A search engine operates in the following order: 1. Web crawling 2. Indexing 3. Searching
• Web search engines work by storing information about the many web pages they retrieve.
• These pages are retrieved by a web crawler (sometimes also called a spider), i.e. an automated web browser which follows every link on a site.
• Exclusions can be made by the use of robots.txt, as the sketch below shows.
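As a sketch of how robots.txt exclusions can be honoured, Python's standard urllib.robotparser can check a URL against a site's rules before fetching; the robots.txt content and URLs below are hypothetical examples.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything under /private/ is off-limits.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/a"))   # False
```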
• The contents of each page are analyzed to determine how it should be indexed.
• When a user enters a query into a search engine, the engine examines its index and provides a ranked listing of the best-matching web pages.
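A minimal sketch of this query step, assuming a toy in-memory page collection and a simple term-frequency score as a stand-in for the far richer ranking functions real engines use:

```python
# Hypothetical indexed pages: URL -> page text.
pages = {
    "pageA": "web search engine ranking of web pages",
    "pageB": "cooking recipes and kitchen tips",
    "pageC": "how a search engine ranks web pages",
}

def rank(query):
    """Score each page by how often it contains the query terms."""
    terms = query.lower().split()
    scores = {}
    for url, text in pages.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            scores[url] = score
    # Best-matching pages first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank("web search engine"))  # pageA ranks highest
```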
WEB CRAWLING
• Web crawling is the process by which we gather pages from the web, in order to index them and support a search engine.
• At a minimum, a crawler must provide: (1) Robustness [detect spider traps] (2) Politeness [follow the restrictions a site places on spiders (robots.txt)].
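One possible sketch of both properties in Python: politeness is approximated with a fixed per-host delay between requests, and robustness with a URL-depth cap that cuts off spider traps (machine-generated chains of ever-deeper links). The delay, depth threshold, and helper names are illustrative assumptions, not a standard algorithm.

```python
import time
from urllib.parse import urlparse

CRAWL_DELAY = 1.0   # seconds to wait between requests to one host (assumed)
MAX_DEPTH = 10      # path depth beyond which we suspect a spider trap (assumed)
last_fetch = {}     # host -> time of the most recent request

def polite_enough(url):
    """Allow a fetch only if the host has not been hit too recently."""
    host = urlparse(url).netloc
    now = time.monotonic()
    if now - last_fetch.get(host, float("-inf")) < CRAWL_DELAY:
        return False  # too soon: re-queue the URL instead of fetching
    last_fetch[host] = now
    return True

def looks_like_trap(url):
    """Flag suspiciously deep URL paths as likely spider traps."""
    depth = len([p for p in urlparse(url).path.split("/") if p])
    return depth > MAX_DEPTH

print(polite_enough("https://example.com/a"))               # True
print(polite_enough("https://example.com/b"))               # False (too soon)
print(looks_like_trap("https://example.com/" + "x/" * 20))  # True
```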
Features a Crawler Must Provide
• Distributed
• Performance and Efficiency
• Quality
• Freshness
• Extensible
Crawling Operation
• The crawler begins with one or more URLs that constitute a seed set.
• It picks a URL from this seed set, and then fetches the web page at that URL.
• The fetched page is then parsed to extract both the text and the links from the page.
• The extracted text is fed to a text indexer.
• The extracted links (URLs) are then added to a URL frontier, which at all times consists of URLs whose corresponding pages have yet to be fetched by the crawler. A sketch of this loop follows.
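A minimal Python sketch of the loop just described: the frontier is seeded, each iteration pops a URL, "fetches" and parses the page, hands the text to the indexer, and pushes newly discovered links onto the frontier. Fetching and parsing are faked with a small in-memory web so the sketch runs offline; a real crawler would use an HTTP client and an HTML parser.

```python
from collections import deque

# Hypothetical in-memory web: URL -> (page text, outgoing links).
fake_web = {
    "https://a.example": ("seed page", ["https://b.example"]),
    "https://b.example": ("second page", ["https://a.example"]),
}

def crawl(seed_urls):
    frontier = deque(seed_urls)   # URLs whose pages are not yet fetched
    seen = set(seed_urls)         # avoid re-queuing the same URL
    while frontier:
        url = frontier.popleft()
        text, links = fake_web.get(url, ("", []))  # stands in for fetch + parse
        print("indexing", url, "->", text)         # feed text to the indexer
        for link in links:                         # extend the URL frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)

crawl(["https://a.example"])
```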