Spider in Search Engine


INFORMATION RETRIEVAL CHAPTER 7 PRESENTED BY: Ashish Gautam, BSc CSIT 7th Sem

Search Engine
• A program that searches for documents containing specified keywords and returns a list of those documents.
• Typically, a search engine works by sending out a spider to fetch as many documents as possible.
• Another program, the indexer, then reads these documents and creates an index based on the words contained in each document, so that only meaningful results are retrieved for a query (a sketch of this follows below).
• A web search engine is designed to search for information on the WWW.
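As a rough illustration of the indexer's role, here is a minimal sketch in Python. The document names and texts are invented for the example; a real indexer is of course far more elaborate.

```python
# A minimal sketch of an indexer: it maps each word to the set of
# documents that contain it, so a keyword query can be answered by
# a simple lookup. Document ids and texts are hypothetical.
from collections import defaultdict

docs = {
    "doc1": "spiders crawl the web to fetch pages",
    "doc2": "an indexer reads fetched pages and builds an index",
    "doc3": "a search engine answers keyword queries from the index",
}

index = defaultdict(set)          # word -> set of document ids
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    """Return only the documents containing every query word."""
    results = set(docs)
    for word in query.lower().split():
        results &= index.get(word, set())
    return sorted(results)

print(search("index pages"))      # -> ['doc2']
```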

How Does a Search Engine Work?

• A search engine operates in the following order: 1. Web crawling 2. Indexing 3. Searching
• Web search engines work by storing information about the many web pages they retrieve.
• These pages are retrieved by a web crawler (sometimes also called a spider), i.e. an automated web browser which follows every link on a site.
• Exclusions can be made by the use of robots.txt (see the sketch below).
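The robots.txt check can be done with Python's standard library. A minimal sketch follows; the rule set and the crawler name are made up for illustration rather than fetched from a live site.

```python
# A sketch of honoring robots.txt exclusions with the standard
# library. The rules below are an invented example.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The crawler checks each URL before fetching it.
print(rp.can_fetch("MyCrawler", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))  # False
```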

• The contents of each page are analyzed to determine how it should be indexed.
• When a user enters a query into a search engine, the engine examines its index and provides a ranked listing of the best-matching web pages (a toy ranking sketch follows).
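A toy illustration of ranked retrieval, scoring pages by raw term frequency of the query words. The pages here are invented, and real engines combine many more signals than this.

```python
# A sketch of ranked retrieval: pages are scored by how often the
# query terms occur in them, and the listing is returned best
# match first. Page ids and texts are hypothetical.
docs = {
    "page1": "web crawler crawler spider",
    "page2": "web search engine index",
    "page3": "crawler index web web web",
}

def rank(query):
    terms = query.lower().split()
    scores = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            scores[doc_id] = score
    # Best-matching pages first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank("web crawler"))
# -> [('page3', 4), ('page1', 3), ('page2', 1)]
```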

WEB CRAWLING
• Web crawling is the process by which we gather pages from the web, in order to index them and support a search engine.
• Among the features a crawler must provide are: (1) Robustness [detect spider traps] and (2) Politeness [follow the restrictions sites place on spiders (robots.txt)]; a sketch of both follows.
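A sketch of how these two features might be approximated: politeness as a fixed delay between requests to the same host, and robustness as a cap on URL depth and on pages per host. The thresholds are invented for illustration, not taken from the slides.

```python
# Politeness: pace requests to the same host with a fixed delay.
# Robustness: cap URL depth and pages per host as a crude guard
# against spider traps. All thresholds are assumed values.
import time
from urllib.parse import urlparse

CRAWL_DELAY = 1.0      # seconds between hits to one host (assumed)
MAX_DEPTH = 10         # reject implausibly deep URL paths
MAX_PER_HOST = 1000    # stop crawling a host that yields endless URLs

last_hit = {}          # host -> time of last request
pages_per_host = {}    # host -> number of pages fetched so far

def may_fetch(url):
    host = urlparse(url).netloc
    depth = urlparse(url).path.count("/")
    if depth > MAX_DEPTH or pages_per_host.get(host, 0) >= MAX_PER_HOST:
        return False                        # likely a spider trap
    wait = CRAWL_DELAY - (time.time() - last_hit.get(host, 0))
    if wait > 0:
        time.sleep(wait)                    # politeness: pace requests
    last_hit[host] = time.time()
    pages_per_host[host] = pages_per_host.get(host, 0) + 1
    return True
```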

Features a Crawler Must Provide
• Distributed
• Performance and Efficiency
• Quality
• Freshness
• Extensible

Crawling Operation
• The crawler begins with one or more URLs that constitute a seed set.
• It picks a URL from this seed set, then fetches the web page at that URL.
• The fetched page is parsed to extract both the text and the links from the page.
• The extracted text is fed to a text indexer.
• The extracted links (URLs) are added to a URL frontier, which at all times consists of URLs whose corresponding pages have yet to be fetched by the crawler (a sketch of this loop follows).
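Putting the steps above together, here is a minimal sketch of the crawl loop using only Python's standard library. The seed URL is a placeholder, and the indexer hand-off is left as a comment.

```python
# A minimal crawl loop: a seed set initializes the URL frontier,
# each fetched page is parsed for links, and new links are added
# back to the frontier until it is empty or the page budget runs out.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)            # URLs not yet fetched
    seen = set(seeds)
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except (OSError, ValueError):
            continue                   # robustness: skip failed fetches
        max_pages -= 1
        parser = LinkParser()
        parser.feed(html)              # extract links from the page
        # (the page text would be handed to the indexer here)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

# crawl(["https://example.com/"])     # placeholder seed set
```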

To be Continued by Kshitiz…

Thank You!!!
