Search Engine


INTRODUCTION
A search engine is a software program that takes a search phrase as input, matches it against entries made earlier, and returns a set of links pointing to their locations. Actually, consumers would really prefer a finding engine rather than a search engine. Search engines match queries against an index that they create. The index consists of the words in each document, plus pointers to their locations within the documents; this structure is called an inverted file. A search engine or IR system comprises four essential modules:
• A document processor
• A query processor
• A search and matching function
• A ranking capability
While users focus on "search," the search and matching function is only one of the four modules. Each of these four modules may cause the expected or unexpected results that consumers get when they use a search engine.

CHAPTER 2 SEARCH ENGINE MODULES


1. Document Processor

The document processor prepares, processes, and inputs the documents, pages, or sites that users search against. It performs some or all of the following steps:
1. Processes the input.
2. Identifies potential indexable elements in documents.
3. Deletes stop words.
4. Stems terms.
5. Extracts index entries.
6. Computes weights.
7. Creates and updates the main inverted file against which the search engine searches in order to match queries to documents.

Processing the input
This step simply standardizes the multiple formats encountered when deriving documents from various providers or handling various Web sites. It serves to merge all the data into a single, consistent data structure that all the downstream processes can handle. The need for a well-formed, consistent format grows in direct proportion to the sophistication of the later steps of document processing.

Identifying potential indexable elements
This step dramatically affects the nature and quality of the document representation that the engine will search against. In designing the system, we must define the word "term." Is it the alphanumeric characters between blank spaces or punctuation? If so, what about non-compositional phrases (phrases in which the separate words do not convey the meaning of the phrase, like "skunk works" or "hot dog"), multiword proper names, or inter-word symbols such as hyphens or apostrophes that can denote the difference between "small business men" and "small-business men"? Each search engine depends on a set of rules that its document processor must execute to determine what action is to be taken by the "tokenizer," i.e., the software used to define a term suitable for indexing.
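The tokenizing rules above can be sketched in a few lines. Note this is a minimal illustration: the phrase list and the keep-internal-hyphens rule are assumptions for the example, not any particular engine's policy.

```python
import re

def tokenize(text, phrases=None):
    """Split text into index terms.

    `phrases` is a hypothetical list of non-compositional phrases
    (e.g. "hot dog") that should be kept as single terms.
    """
    text = text.lower()
    tokens = []
    if phrases:
        for phrase in phrases:
            if phrase in text:
                tokens.append(phrase)          # keep the phrase whole
                text = text.replace(phrase, " ")
    # default rule: alphanumeric runs, with hyphens kept inside words
    tokens += re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)
    return tokens

print(tokenize("Small-business men eat a hot dog.", phrases=["hot dog"]))
```

Here "hot dog" survives as one term and "small-business" is not split at the hyphen, which is exactly the kind of decision the tokenizer rules must encode.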

Deleting stop words
This step helps save system resources by eliminating from further processing, as well as from potential matching, those terms that have little value in finding useful documents in response to a customer's query. Since stop words may comprise up to 40 percent of the text words in a document, removing them yields significant savings. A stop word list typically consists of articles (a, the), conjunctions (and, but), interjections (oh, wow), prepositions (in, over), pronouns (he, it), and forms of the "to be" verb (is, are). To delete stop words, an algorithm compares index term candidates in the documents against a stop word list and eliminates those terms from inclusion in the index for searching.

Term stemming
Stemming is the ability of a search engine to search for variations of a word based on its stem. Stemming removes word suffixes, perhaps recursively in layer after layer of processing. The process has two goals. In terms of efficiency, stemming reduces the number of unique words in the index, which in turn reduces the storage space required for the index and speeds up the search process. In terms of effectiveness, stemming improves recall by reducing all forms of a word to a base or stemmed form. For example, if a user asks for analyze, they may also want documents which contain analysis, analyzing, analyzer, analyzes, and analyzed. Therefore, the document processor stems document terms to analy- so that documents which include various forms of analy- will have an equal likelihood of being retrieved; this would not occur if the engine indexed only the variant forms separately and required the user to enter all of them.

Of course, stemming does have a downside. It may negatively affect precision in that all forms of a stem will match when, in fact, a successful query for the user would have come from matching only the word form actually used in the query. Systems may implement either a strong stemming algorithm or a weak stemming algorithm. A strong stemming algorithm will strip off both inflectional suffixes (-s, -es, -ed) and derivational suffixes (-able, -aciousness, -ability), while a weak stemming algorithm will strip off only the inflectional suffixes (-s, -es, -ed).
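Stop-word deletion and weak/strong suffix stripping can be sketched together. This is a toy stemmer built only from the suffixes listed above; real stemmers such as Porter's apply many more rules, so the stems it produces (e.g. "analyz" rather than "analy-") are only illustrative.

```python
STOP_WORDS = {"a", "the", "and", "but", "oh", "in", "over", "he", "it", "is", "are"}

INFLECTIONAL = ("es", "ed", "s")                    # weak stemming strips only these
DERIVATIONAL = ("ability", "aciousness", "able")    # strong stemming also strips these

def stem(term, strong=False):
    """Strip one suffix layer; guard against stripping very short words."""
    suffixes = (DERIVATIONAL + INFLECTIONAL) if strong else INFLECTIONAL
    for suffix in sorted(suffixes, key=len, reverse=True):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def prepare(tokens, strong=False):
    """Delete stop words, then stem what remains."""
    return [stem(t, strong) for t in tokens if t not in STOP_WORDS]

print(prepare(["the", "analyzes", "are", "readable"], strong=True))
```

Running this drops "the" and "are" as stop words, weak-stems "analyzes" to "analyz", and strong-stems "readable" to "read".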

Extracting index entries
Having completed the preceding steps, the document processor extracts the remaining entries from the original document. For example, the following paragraph shows the full text sent to a search engine for processing:

Milosevic's comments, carried by the official news agency Tanjug, cast doubt over the governments at the talks, which the international community has called to try to prevent an all-out war in the Serbian province. "President Milosevic said it was well known that Serbia and Yugoslavia were firmly committed to resolving problems in Kosovo, which is an integral part of Serbia, peacefully in Serbia with the participation of the representatives of all ethnic communities," Tanjug said. Milosevic was speaking during a meeting with British Foreign Secretary Robin Cook, who delivered an ultimatum to attend negotiations in a week's time on an autonomy proposal for Kosovo with ethnic Albanian leaders from the province. Cook earlier told a conference that Milosevic had agreed to study the proposal.

Steps 1 to 6 reduce this text for searching to the following:

Milosevic comm carri offic new agen Tanjug cast doubt govern talk interna commun call try prevent all-out war Serb province President Milosevic said well known Serbia Yugoslavia firm commit resolv problem Kosovo integr part Serbia peace Serbia particip representa ethnic commun Tanjug said Milosevic speak meeti British Foreign Secretary Robin Cook deliver ultimat attend negoti week time autonomy propos Kosovo ethnic Alban lead province Cook earl told conference Milosevic agree study propos.

The output of this step is then inserted and stored in an inverted file that lists the index entries, along with an indication of their position and frequency of occurrence. The specific nature of the index entries, however, will vary based on the decision in Step 2 concerning what constitutes an "indexable term." More sophisticated document processors will have phrase recognizers, as well as named-entity recognizers and categorizers, to ensure that index entries such as Milosevic are tagged as a Person and entries such as Yugoslavia and Serbia as Countries.

[Figure: the BBC News home page rendered as "normal" and as text only]

Term weight assignment

Weights are assigned to terms in the index file. The simplest search engines just assign a binary weight: 1 for presence and 0 for absence. The more sophisticated the search engine, the more complex the weighting scheme. Measuring the frequency of occurrence of a term in the document creates more sophisticated weighting, with length-normalization of frequencies still more sophisticated. The optimal weighting comes from use of "tf/idf." This algorithm measures the frequency of occurrence of each term within a document, then compares that frequency against the frequency of occurrence in the entire database. The ranking takes two ideas into account for weighting: the term frequency in the given document (TF) and the inverse document frequency of the term in the whole database (IDF). The term frequency in the given document shows how important the term is in this document. The document frequency of the term (the percentage of documents which contain this term) shows how generally important the term is. A high weight in a tf/idf ranking scheme is therefore reached by a high term frequency in the given document and a low document frequency of the term in the whole database. Not all terms are good "discriminators" — that is, all terms do not single out one document from another very well. A simple example would be the word "the." This word appears in too many documents to help distinguish one from another. A less obvious example would be the word "antibiotic." In a sports database, when we compare each document to the database as a whole, the term "antibiotic" would probably be a good discriminator among documents, and therefore would be assigned a high weight. Conversely, in a database devoted to health or medicine, "antibiotic" would probably be a poor discriminator, since it occurs very often. The tf/idf weighting scheme assigns higher weights to those terms that really distinguish one document from the others.
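The two ideas above combine into one number per term. The sketch below uses one common tf/idf variant (length-normalized tf, logarithmic idf); the tiny corpus is invented for the example.

```python
import math

def tf_idf(term, doc, corpus):
    """tf/idf weight of `term` in `doc`.

    `doc` is a list of terms; `corpus` is a list of such documents.
    This is one of several common tf/idf variants.
    """
    tf = doc.count(term) / len(doc)                 # length-normalized frequency
    df = sum(1 for d in corpus if term in d)        # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["the", "match", "score"],
    ["the", "match", "goal"],
    ["the", "antibiotic", "dose"],
    ["the", "goal", "score"],
]
# "the" occurs in every document, so its idf (and weight) is zero;
# "antibiotic" occurs in one document, so it is a strong discriminator.
print(tf_idf("the", corpus[2], corpus))
print(tf_idf("antibiotic", corpus[2], corpus))
```

As the text argues, a term that appears everywhere gets weight 0, while a rare term in this collection gets a high weight.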
Creating the index. The index or inverted file is the internal data structure that stores the index information and that will be searched for each query. Inverted files range from a simple listing of every alphanumeric sequence in a set of documents to a more linguistically complex list of entries, the tf/idf weights, and pointers to where each term occurs inside each document. The more complete the information in the index, the better the search results.
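A positional inverted file of the kind described can be sketched as a nested mapping from term to document to positions; the document ids and terms below are illustrative.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted file: term -> {doc_id: [positions]}.

    `docs` maps a doc id to an already-processed list of terms
    (tokenized, stop-listed, stemmed).
    """
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {1: ["milosevic", "comm", "carri"], 2: ["cook", "milosevic"]}
index = build_index(docs)
print(index["milosevic"])   # {1: [0], 2: [1]}
```

Position lists like these are what later enable proximity and phrase matching; term frequency falls out as the length of each position list.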

2. Query Processor

Query processing has six possible steps, though a system can cut these steps short and proceed to match the query to the inverted file at any of a number of places during the processing. Document processing shares many steps with query processing. More steps and more documents make the process more expensive in terms of computational resources and responsiveness; however, the longer the wait for results, the higher their quality. Thus, search system designers must choose what is most important to their users: time or quality. The steps in query processing are:
1. Tokenize the query terms.
2. Parse the query.
3. Delete stop words.
4. Stem terms.
5. Create the query representation.
6. Weight the query terms.

Tokenizing. As soon as a user inputs a query, the search engine, whether a keyword-based system or a full natural language processing (NLP) system, must tokenize the query stream, i.e., break it down into understandable segments. Usually a token is defined as an alphanumeric string that occurs between white space and/or punctuation.

Parsing. Since users may employ special operators in their query, including Boolean, adjacency, or proximity operators, the system needs to parse the query into query terms and operators. These operators may occur in the form of reserved punctuation (e.g., quotation marks) or reserved terms in a specialized format (e.g., AND, OR). In the case of an NLP system, the query processor will recognize the operators implicitly, no matter how they are expressed in the language (e.g., prepositions, conjunctions, ordering).

At this point, a search engine may take the list of query terms and search them against the inverted file. In fact, this is the point at which the majority of publicly available search engines perform the search. Stop list and stemming Some search engines will go further and stop-list and stem the query, similar to the processes described above in the Document Processor section. However, since most publicly available search engines encourage very short queries, the engines may drop these two steps. Creating the query How each particular search engine creates a query representation depends on how the system does its matching. If a statistically based matcher is used, then the query must match the statistical representations of the documents in the system. Good statistical queries should contain many synonyms and other terms in order to create a full representation. If a Boolean matcher is utilized, then the system must create logical sets of the terms connected by AND, OR, or NOT. An NLP system will recognize single terms, phrases, and Named Entities. If it uses any Boolean logic, it will also recognize the logical operators from Step 2 and create a representation containing logical sets of the terms to be AND'd, OR'd, or NOT'd. Query term weighting (assuming more than one query term). The final step in query processing involves computing weights for the terms in the query. Sometimes the user controls this step by indicating either how much to weight each term or simply which term or concept in the query matters most and must appear in each retrieved document to ensure relevance. Leaving the weighting up to the user is not common, because research has shown that users are not particularly good at determining the relative importance of terms in their queries. They can't make this determination for several reasons. First, they don't know what else exists in the database, and document terms are weighted by being compared to the database as a whole. 
Second, most users seek information about an unfamiliar subject, so they may not know the correct terminology. Few search engines implement system-based query weighting, but some do an implicit weighting by treating the first term(s) in a query as having higher significance. The engines use this information to provide a list of documents/pages to the user. After this final step, the expanded, weighted query is searched against the inverted file of documents.
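The Boolean side of query creation described above can be sketched against an inverted file of term-to-document sets. This toy evaluator handles a flat query left to right; a real parser would also handle precedence and nesting.

```python
def boolean_search(query, index):
    """Evaluate a flat Boolean query such as "kosovo AND NOT war".

    `index` maps each term to the set of doc ids containing it.
    Operators are applied strictly left to right (no nesting).
    """
    tokens = query.lower().split()
    result = index.get(tokens[0], set())
    i = 1
    while i < len(tokens):
        if tokens[i] == "and" and i + 1 < len(tokens) and tokens[i + 1] == "not":
            result = result - index.get(tokens[i + 2], set())   # AND NOT
            i += 3
        elif tokens[i] == "and":
            result = result & index.get(tokens[i + 1], set())   # intersection
            i += 2
        elif tokens[i] == "or":
            result = result | index.get(tokens[i + 1], set())   # union
            i += 2
        else:
            i += 1
    return result

index = {"kosovo": {1, 2}, "war": {2, 3}, "cook": {3}}
print(boolean_search("kosovo AND NOT war", index))  # {1}
```

The logical sets the text mentions map directly onto Python set intersection, union, and difference.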

3. Search and Matching Function

How systems carry out their search and matching functions differs according to which theoretical model of information retrieval underlies the system's design philosophy. Searching the inverted file for documents meeting the query requirements, referred to simply as "matching," is typically a standard binary search, no matter whether the search ends after the first two, five, or all seven steps of query processing. The computational processing required for simple, unweighted, non-Boolean query matching is far lower than for an NLP-based query within a weighted, Boolean model; but it also follows that the simpler the document representation, the query representation, and the matching algorithm, the less relevant the results, except for very simple queries, such as one-word, non-ambiguous queries seeking the most generally known information.

Having determined which subset of documents or pages matches the query requirements to some degree, the system computes a similarity score between the query and each document/page based on its scoring algorithm. Scoring algorithms base their rankings on the presence or absence of query terms, term frequency, tf/idf, Boolean logic fulfillment, or query term weights. Some search engines use scoring algorithms based not on document contents, but rather on relations among documents or the past retrieval history of documents/pages. After computing the similarity of each document in the subset, the system presents an ordered list to the user. The sophistication of the ordering again depends on the model the system uses, as well as on the richness of the document and query weighting mechanisms. For example, a search engine that only requires the presence of any alphanumeric string from the query, occurring anywhere and in any order in a document, would produce a very different ranking from one that performs linguistically correct phrasing for both document and query representation and that utilizes the proven tf/idf weighting scheme.
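One standard way to turn term weights into a similarity score is the cosine measure over query and document weight vectors. The sketch below assumes documents are already represented as term-to-weight dictionaries (e.g. tf/idf weights); the documents themselves are invented.

```python
import math

def cosine(query_vec, doc_vec):
    """Cosine similarity between two sparse term-weight dicts."""
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[t] * doc_vec[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def rank(query_vec, doc_vecs):
    """Order doc ids by similarity to the query, best first."""
    scores = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": {"kosovo": 0.9, "serbia": 0.5},
    "d2": {"cook": 0.8},
}
print(rank({"kosovo": 1.0}, docs))  # ['d1', 'd2']
```

A document sharing no terms with the query scores 0 and drops to the bottom of the ordered list.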

4. Ranking

However the search engine determines rank, the ranked results list goes to the user, who can then simply click and follow the system's internal pointers to the selected document/page. More sophisticated systems will go even further at this stage and allow the user to provide relevance feedback or to modify the query based on the results already seen. If either of these is available, the system will adjust its query representation to reflect this value-added feedback and re-run the search with the improved query to produce either a new set of documents or a simple re-ranking of documents from the initial search.

What Document Features Make a Good Match to a Query?

• Term frequency: How frequently a query term appears in a document is one of the most obvious measures of a document's relevance to a query. While this premise most often holds, several situations can undermine it. First, many words have multiple meanings — they are polysemous. Think of words like "pool" or "fire." Many of the non-relevant documents presented to users result from matching the right word, but with the wrong meaning. Also, in a collection of documents in a particular domain, such as education, common query terms such as "education" or "teaching" occur so frequently that an engine's ability to distinguish the relevant from the non-relevant declines sharply. Search engines that don't use a tf/idf weighting algorithm neither down-weight the overly frequent terms appropriately nor assign higher weights to appropriately distinguishing (and less frequently occurring) terms, e.g., "early-childhood."

• Location of terms: Many search engines give preference to words found in the title, the lead paragraph, or the metadata of a document. Some studies show that the location in which a term occurs in a document or on a page indicates its significance to the document. Terms occurring in the title of a document or page that match a query term are therefore frequently weighted more heavily than terms occurring in the body of the document. Similarly, query terms occurring in section headings or the first paragraph of a document may be more likely to be relevant.
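Location preference is often implemented as per-field boosting. In this sketch the field names and boost values are purely illustrative assumptions, not weights reported by any study or engine.

```python
# Hypothetical boosts: title counts most, then headings, lead paragraph, body.
FIELD_BOOST = {"title": 3.0, "heading": 2.0, "lead": 1.5, "body": 1.0}

def location_score(term, fields):
    """Sum boosted occurrences of `term` across a document's fields.

    `fields` maps a field name to that field's list of terms.
    """
    return sum(
        FIELD_BOOST.get(name, 1.0) * terms.count(term)
        for name, terms in fields.items()
    )

doc = {"title": ["kosovo", "talks"], "body": ["kosovo", "war", "kosovo"]}
print(location_score("kosovo", doc))  # 3.0*1 + 1.0*2 = 5.0
```

A single title occurrence here outweighs two body occurrences, which is the behavior the text describes.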















• Link analysis: Web-based search engines have introduced one dramatically different feature for weighting and ranking pages. Link analysis is based on how well-connected each page is, as defined by Hubs and Authorities: Hub documents link to large numbers of other pages (out-links), while Authority documents are those referred to by many other pages, i.e., those with a high number of "in-links."

• Popularity: Google and several other search engines add popularity to link analysis to help determine the relevance or value of pages. Popularity utilizes data on the frequency with which a page is chosen by all users as a means of predicting relevance. While popularity is a good indicator at times, it assumes that the underlying information need remains the same.

• Date of publication: Some search engines assume that the more recent the information is, the more likely it will be useful or relevant to the user. These engines therefore present results beginning with the most recent and ending with the least current.

• Length: In a choice between two documents both containing the same query terms, the document that contains a proportionately higher occurrence of the term relative to its length is assumed more likely to be relevant.

• Proximity of query terms: When the terms in a query occur near each other within a document, it is more likely that the document is relevant to the query than if the terms occur farther apart. Some search engines clearly rank documents higher if the query terms occur adjacent to one another or in close proximity, as compared to documents in which the terms occur at a distance.

• Proper nouns: Proper nouns sometimes have higher weights, since so many searches are performed on people, places, or things. While this may be useful, if the search engine assumes that you are searching for a name instead of the same word as a normal everyday term, the search results may be peculiarly skewed. Imagine getting information on "Madonna," the rock star, when you were looking for pictures of madonnas for an art history class.
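The Hubs-and-Authorities idea behind link analysis can be sketched with the classic HITS iteration: authority scores grow with the hub scores of pages linking in, and hub scores grow with the authority of pages linked to. The three-page link graph below is invented for the example.

```python
import math

def hits(links, iterations=20):
    """Tiny HITS sketch. `links` maps each page to the set of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a page's authority grows with the hub scores of pages linking to it
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # a page's hub score grows with the authority of the pages it links to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # normalize so the scores stay bounded
        na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

links = {"a": {"c"}, "b": {"c"}, "c": set()}
hub, auth = hits(links)
# "c" collects the in-links, so it ends up with the top authority score
```

PageRank-style popularity scoring follows the same fixed-point pattern, but propagates a single score along in-links instead of two.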

CHAPTER 3 SPIDERS

SPIDERS OR ROBOTS

A "software spider" is an unmanned program operated by a search engine that surfs the Web just like you would. As it visits each Web site, it records all the words on each site and notes each link to other sites. It then "clicks" on a link, and off it goes to read, index, and store another Web site. The software spider often reads and then indexes the entire text of each Web site it visits into the main database of the search engine it is working for.

Recently many engines, such as AltaVista, have begun indexing only up to a certain number of pages of a site, often about 400, and then stopping. Apparently, this is because the Web has become so large that it is unfeasible to index everything. How many pages the spider will index is not entirely predictable. Therefore, it is a good idea to specifically submit each important page of your site that you want indexed, such as those that contain important keywords.

A software spider is like an electronic librarian who cuts out the table of contents of each book in every library in the world, sorts them into a gigantic master index, and then builds an electronic bibliography that stores information on which texts reference which other texts. Some software spiders can index over a million documents a day!

Search engines determine a site's relevancy based on a complex scoring system that they try to keep secret. This system adds or subtracts points based on things like how many times a keyword appeared on the page, where on the page it appeared, and how many total words were found. The pages that achieve the most points are returned at the top of the search results; the rest are buried at the bottom, never to be found.

As a software spider visits your site, it notes any links on your page to other sites. A search engine's vast database records all the links between sites. The search engine knows which sites you linked to and, more importantly, which ones linked to you. Many engines will even use the number of links to your site as an indication of popularity, and will then boost your ranking based on this factor.
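The spider's visit-record-follow loop can be sketched as a breadth-first traversal. To keep the example self-contained, `fetch` is a stand-in for a real HTTP download: it is assumed to return the page text and its out-links, and the tiny in-memory "web" is invented. The page cap mirrors engines that stop after roughly 400 pages per site.

```python
from collections import deque

def crawl(start, fetch, max_pages=400):
    """Breadth-first spider sketch.

    `fetch(url)` is assumed to return (text, out_links).
    Returns the word index and the link graph it discovered.
    """
    seen, queue = set(), deque([start])
    index, links = {}, {}
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue                      # don't re-index a visited page
        seen.add(url)
        text, out = fetch(url)
        index[url] = text.lower().split() # record the words on the page
        links[url] = out                  # record its links to other pages
        queue.extend(out)                 # "click" each link in turn
    return index, links

# Toy in-memory "web" standing in for real HTTP fetches.
web = {
    "a": ("Search engines", ["b"]),
    "b": ("Spiders crawl", ["a"]),
}
index, links = crawl("a", lambda url: web[url])
```

The `links` mapping is exactly the who-links-to-whom record that engines later mine for popularity and link analysis.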

CHAPTER 4 SEARCH ENGINE OPTIMIZATION


Search engine optimization (sometimes called SEO) is the act of making a website come up higher in the search results of the major search engines; i.e., it is the process of improving the positions of a web site within the search engines, using careful analysis and research techniques. It is the act of making one's website content more search-engine friendly so that it ranks higher. Each major search engine has a unique way of determining the importance of a given website. Some search engines focus on the content or verbiage. Some review meta tags to identify who and what a web site's business is. Most engines use a combination of meta tags, content, link popularity, click popularity, and longevity to determine a site's ranking. Google bases much of its results on popularity. The procedure of website optimization ensures that a website has all of the necessary ranking criteria to appeal to the individual search engines' needs.

Optimizing for the Correct Keywords
To get listed correctly in the search engines, each page of your site that you want listed needs to be optimized to the best of your ability. Since the keywords that you decide to target will be used throughout the optimization process, choosing the right keywords is essential. If you choose the wrong keywords, you will not be found in the search engines, and if you are not found in the search engines, how will anyone find your site?

• Think "specific keyword phrases," not "keywords." Due to the extreme amount of competition for general terms in the search engines, if your keyword phrases are too general it is very unlikely you will rank well. You stand a far better chance of ranking well for specific phrases where there is less competition. The resulting traffic, since it is more highly targeted, should also be of much higher quality!

Here's an example for a site selling shoes:

Much Too General      Much Better!
1. Shoes              Imported Italian shoes
2. Men's shoes        Men's leather penny loafers
3. Women's shoes      Women's aerobic sneakers
• Try to think like your target audience. What would they search for when looking for the page you are optimizing? Others will not necessarily use the same keywords as you. You should try to come up with as many keyword phrases as you can think of that relate to the page you are optimizing.

• Check out your competition for ideas. Do a search using keywords that you already know you want to target, and click through to the top sites that come up. Once on a site, view the source HTML code and look at the keywords in its meta tags; this should give you many more ideas. Make sure to use only keywords that relate to your site or page. To view the HTML code, simply click 'View' at the top of your web browser, then select 'Source' or 'Page Source'.

You should develop a list of keyword phrases for each page that you optimize for the search engines.

Optimizing Your Title Tag
The title tag of your page is the single most important factor to consider when optimizing your web page for the search engines. This is because most engines and directories place a high level of importance on keywords found in the title tag. The title tag is also what the search engines usually use for the title of your listing in the search results. Here is the format of a title tag:

<TITLE>Your Title Tag</TITLE>

The correct placement for the title tag is between the <HEAD> and </HEAD> tags within the HTML that makes up your page.

Tag limits: Your title tag should be between 50 and 80 characters long, including spaces. The length that the different search engines accept varies, but as long as you keep within this limit you should be okay.

Tag tips:



• Include one or two of your most important keyword phrases in the title tag, but be careful not to just list keywords. Your title tag should include your keyword phrases while remaining as close to a readable sentence as possible.
• Make your title enticing. Even if you get that #1 listing in the search engines, your listing still needs to say something that makes the surfer want to click through and visit your site.
• Since the length of your title tag could be a little long for some engines, place the keywords at the beginning of the tag.
• Each page of your site should have its own title tag, with its own keywords that relate to the page it appears on.

Optimizing Your Page Copy
The copy on your page is also very important for achieving better search engine listings. In fact, it is very close to being as important as your title tag, so make sure you keep reading. 'Copy' means the actual text that a visitor to your site would read.

Page text tips:
• For best results, each page you submit should have at least 200 words of copy on it. There are some cases where this much text can be difficult to put on a page, but the search engines really like it, so you should do your best to increase the amount of copy where you can.
• This text should include your most important keyword phrases, but should remain logical and readable. Be sure to use those phrases that you have used in your other tags during the optimization process.
• Add additional copy-filled pages to your site. These types of content pages not only help you in the search engines, but many other sites will link to them too.

Optimizing your page copy is one of the most important things you could possibly do to improve your listings in the search engines.

Optimizing Your Meta Tags
Meta tags are hidden text placed in the HEAD section of your HTML page that most major search engines use to index your site based on your keywords and descriptions. Meta tags were originally created to help search engines find out important information about your page that they might have had difficulty determining otherwise, for example, related keywords or a description of the page itself.

Many people incorrectly believe that good meta tags are all that is needed to achieve good listings in the search engines. While meta tags are usually part of a well-optimized page, they are not the be-all and end-all of optimizing your pages. The search engines now usually look at a combination of all the best search engine tips to determine your listings, not just your metas; some don't even look at them at all! What this means is that your page should have a combination of all the tips implemented, not just meta tags.

There are two meta tags that can help your search engine listings: the meta description and the meta keywords.

Description meta:
<META NAME="description" content="This would be your description of what is on your page. Your most important keyword phrases should appear in this description.">

Keywords meta:
<META NAME="keywords" content="keyword phrase 1, keyword phrase 2, keyword phrase 3, etc.">

The correct placement for both meta tags is between the <HEAD> and </HEAD> tags within the HTML that makes up your page. Their order does not really matter, but most people place the description first, then the keywords meta.

Tag limits: Your keywords meta should not exceed 1024 characters, including spaces, and your description meta tag should not exceed 250 characters, including spaces.
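The character limits quoted above are easy to check mechanically. This sketch validates only the lengths; the function name and return shape are illustrative.

```python
def check_meta(keywords, description):
    """Check the limits quoted in the text: 1024 characters for the
    keywords meta and 250 for the description meta, spaces included.
    Returns a list of problems (empty means both tags are within limits)."""
    problems = []
    if len(keywords) > 1024:
        problems.append("keywords meta exceeds 1024 characters")
    if len(description) > 250:
        problems.append("description meta exceeds 250 characters")
    return problems

print(check_meta("imported italian shoes, leather penny loafers",
                 "Imported Italian shoes and leather penny loafers."))  # []
```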

Meta description tips:
• Make sure that you accurately describe the content of your page while trying to entice visitors to click on your listing.
• Include 3-4 of your most important keyword phrases, especially those used in your title tag and page copy.
• Try to have your most important keywords appear at the beginning of your description. This often brings better results, and will help avoid having any search engine cut off your keywords if it limits the length of your description.

Meta keyword tips:
• Use only those keyword phrases that you also used in the copy of your page, title tag, meta description, and other tags. Any keyword phrases that do not appear in your other tags or page copy are unlikely to have enough prominence to help your listings for that phrase.
• Don't forget plurals. For example, a travel site might have both "Caribbean vacation" and "Caribbean vacations" in its keywords meta tag to make sure it shows up in both searches.
• Watch out for repeats! You want to include your most important phrases, but when doing so it can be difficult not to repeat one word many times. For example, "Caribbean vacation" and "Caribbean vacations" are two different phrases, but the word "Caribbean" appears twice. This is okay in order to get the phrases you need in there, but be careful not to repeat any one word excessively. There is no actual limit, but we recommend that no one word be repeated in the keywords meta more than 5 times.
• If your site has content of interest to a specific geographic location, be sure to include the actual location in your keywords meta.
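The repeat-count rule of thumb above can also be checked automatically; the 5-repeat limit is the one suggested in the text, not a published engine threshold.

```python
import re
from collections import Counter

def word_repeats(keywords_meta, limit=5):
    """Return the words repeated more than `limit` times in a keywords meta."""
    words = re.findall(r"[a-z']+", keywords_meta.lower())
    return [w for w, n in Counter(words).items() if n > limit]

meta = "caribbean vacation, caribbean vacations"
print(word_repeats(meta))  # [] -- "caribbean" appears only twice
```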

How Long Does it Take to Get Listed? Here's the length of time it currently takes to get listed at each of the major search engines once you have submitted your web page.

MSN             Up to 2 months
Google          Up to 4 weeks
AltaVista       Up to 1 week
Fast            Up to 2 weeks
Excite          Up to 6 weeks
Northern Light  Up to 4 weeks
AOL             Up to 2 months
HotBot          Up to 2 months
iWon            Up to 2 months

[Figure: Google, an example search engine]

CHAPTER 5 SPAMMING

What you should not do (spamming the search engines)
There are several things, considered "spamming," that you can do to try to get your page listed higher on a search engine results page. An example is when a word is repeated hundreds of times on a page to increase its frequency and propel the page higher in the listings. Basically, you should never try to trick a search engine in any way, or you risk being blacklisted by it. Since the majority of your traffic will come from search engines, the risk far outweighs the benefits in the long run. Below is a list of the more common things we recommend you never do when trying to achieve better listings:

• Listing keywords anywhere except in your keywords meta tag. By "list" we mean something like: keyword 1, keyword 2, keyword 3, keyword 4, etc. There are very few legitimate reasons that a list of keywords would actually appear on a web page or within the page's HTML code, and the search engines know this. Even if you have a legitimate reason for doing this, we recommend avoiding it so that you do not risk being penalized by the search engines.
• Using the same color text on your page as the page's background color. This has often been used to keyword-stuff a web page. Search engines can detect this and view it as spam.
• Using multiple instances of the same tag, for example, more than one title tag.
• Submitting identical pages. For example, do not duplicate a page of your site, give the copies different file names, and submit each one.
• Submitting the same page to any engine more than once within 24 hours.
• Using any keywords in your keywords meta tag that do not directly relate to the content of your page.

CONCLUSION


"Search engine" is the popular term for an information retrieval (IR) system. While researchers and developers take a broader view of IR systems, consumers think of them more in terms of what they want the systems to do, namely, search the Web, an intranet, or a database. Actually, consumers would really prefer a finding engine rather than a search engine. Optimization is the act of making one's website content more search-engine friendly to make it rank higher.

