Forum Crawler Under Supervision (FoCUS) is a supervised web-scale forum crawler. The goal of FoCUS is to crawl relevant forum content from the web with minimal overhead. Forum threads contain the information content that forum crawlers target. Although forums have different layouts or styles and are powered by different forum software packages, they always have similar implicit navigation paths, connected by specific URL types, that lead users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling problem to a URL-type recognition problem, and we show how to learn accurate and effective regular expression patterns of implicit navigation paths from automatically created training sets using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Our test results show that FoCUS achieved over 98 percent effectiveness and 97 percent coverage on a large set of test forums powered by over 150 different forum software packages. In addition, the results of applying FoCUS to more than 100 community Question and Answer sites and blog sites demonstrated that the concept of implicit navigation path could apply to other social media sites.
International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 02 Issue: 05 | Aug-2015
p-ISSN: 2395-0072
www.irjet.net
WEB FORUMS CRAWLER FOR ANALYSIS OF USER
SENTIMENTS
Dr.D.Devakumari1, R.Komalavalli2
1 Assistant Professor, PG and Research Department of Computer Science,
Government Arts College (Autonomous), Coimbatore, Tamil Nadu, India.
2 Research Scholar, Department of Computer Science, L.R.G Government Arts College for Women,
Tirupur, Tamil Nadu, India.
reported but both show the inefficiency of generic
crawlers. More information about this testing can be
found in Section 5.2.1. Besides duplicate links and
uninformative pages, a long forum board or thread is
usually divided into multiple pages which are linked
by page-flipping links, for example,
see Figs. 2, 3b, and 3c. Generic crawlers process each
page individually and ignore the relationships
between such pages. These relationships should be
preserved while crawling to facilitate downstream
tasks such as page wrapping and content indexing
[27]. For example, multiple pages belonging to a
thread should be concatenated together in order to
extract all the posts in the thread as well as the
reply relationships between posts. In addition to the above
two challenges, there is also the problem of entry URL
discovery. The entry URL of a forum points to its
homepage, which is the lowest common ancestor
page of all its threads. Our experiment “Evaluation of
Starting from Non-Entry URLs” shows that a crawler
starting from an entry URL can achieve a much
higher performance than one starting from non-entry
URLs. Previous works by Vidal et al. [25] and Cai et
al. [13] assumed that an entry URL is given.
different forum software packages used on the
Internet. Please refer to [2], [3], [5] for more
information about forum software packages. In
addition, many forums use their own customized
software. A recent and more comprehensive work on
forum crawling is iRobot by Cai et al. [13]. iRobot
aims to automatically learn a forum crawler with
minimum human intervention by sampling pages,
clustering them, selecting informative clusters via an
informativeness measure, and finding a traversal
path by a spanning tree algorithm. However, the
traversal path selection procedure requires human
inspection. Follow-up work by Wang et al. [26]
proposed an algorithm to address the traversal path
selection problem. They introduced the concepts of
skeleton link and page-flipping link. Skeleton links
are “the most important links supporting the
structure of a forum site.” Importance is determined
by informativeness and coverage metrics.
Page-flipping links are determined using a connectivity
metric. By identifying and only following skeleton
links and page-flipping links, they showed that
iRobot can achieve effectiveness and coverage.
According to our evaluation, its sampling strategy
and informativeness estimation are not robust, and its
tree-like traversal path does not allow more than one
path from a starting page node to the same ending page
node. For example, there are six paths from entry to
threads, but iRobot would only take the first path
(entry → board → thread). iRobot learns URL location
information to discover new URLs in crawling, but a
URL location might become invalid when the page
structure changes. As opposed to iRobot, we
explicitly define entry-index-thread (EIT) paths and
leverage page layouts to identify index pages and
thread pages. FoCUS also learns URL patterns instead
of URL locations to discover new URLs. Thus, it does
not need to classify new pages in crawling and would
not be affected by a change in page structures. The
respective results from iRobot and FoCUS demonstrated
that the EIT paths and URL patterns are more robust
than the traversal path and URL location feature in iRobot.
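To make the idea of URL-type recognition by pattern more concrete, the following is a minimal sketch that classifies URL paths against three regular expressions. The patterns, the `url_type` helper, and the example paths are hypothetical and written by hand for illustration; they are not patterns learned by FoCUS.

```python
import re

# Hypothetical index / thread / page-flipping patterns for a single forum.
# Hand-written assumptions for illustration, not output of the FoCUS learner.
URL_PATTERNS = {
    # Checked first, because a page-flipping URL extends an index or thread URL.
    "page_flipping": re.compile(r"^/(forumdisplay|showthread)\.php\?[ft]=\d+&page=\d+$"),
    "index": re.compile(r"^/forumdisplay\.php\?f=\d+$"),
    "thread": re.compile(r"^/showthread\.php\?t=\d+$"),
}


def url_type(path):
    """Return the name of the first pattern that matches the path, or None."""
    for name, pattern in URL_PATTERNS.items():
        if pattern.match(path):
            return name
    return None
```

Because the decision depends only on the URL string, a crawler following this approach would not have to download and classify a page before deciding whether to follow a link to it.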
Another related work is near-duplicate
detection. Forum crawling also needs to remove
duplicates. But content-based duplicate detection
[18], [21] is not bandwidth-efficient, because it can
only be carried out after pages have been
downloaded. URL-based duplicate detection [14],
[19] is not helpful either. It tries to mine rules of different
URLs with similar text. However, such methods still
need to analyze logs from sites or results of a
ISO 9001:2008 Certified Journal
previous crawl. In forums, index URLs, thread URLs,
and page-flipping URLs have specific URL patterns.
Thus, in this paper, by learning patterns of index
URLs, thread URLs, and page-flipping URLs and
adopting a simple URL string de-duplication
technique (e.g., a string hashset), FoCUS can avoid
duplicates without duplicate detection. To alleviate
unnecessary crawling, industry standards such as
“nofollow” [6], the Robots Exclusion Standard (robots.txt) [10],
and the Sitemap Protocol [9], [22] have been
introduced. By specifying the “rel” attribute with the
“nofollow” value (i.e., “rel = nofollow”), page authors
can inform a crawler that the destination content is
not endorsed. However, this is intended to reduce the
effectiveness of search engine spam; it is not meant
to block access to pages. The proper way to do that is
robots.txt [10], which is designed to specify which pages a
crawler is allowed to visit. A Sitemap [9] is an
XML file that lists URLs along with additional
metadata, including update time, change frequency,
etc. Generally speaking, the purpose of robots.txt and
Sitemap is to enable the site to be crawled
intelligently, so they may be useful for forum
crawling. However, it is difficult to maintain such
files for forums because their content continually changes.
In our experiment, more than 47 percent of the pages
crawled by a generic crawler that properly
understands these industry standards are
uninformative or duplicates.
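As an illustration of the two mechanisms discussed above, the sketch below combines a string hash set for URL de-duplication with a robots.txt check from Python's standard `urllib.robotparser` module. The `make_filter` helper, the user-agent name, and the example rules and URLs are assumptions made for this example only.

```python
from urllib.robotparser import RobotFileParser


def make_filter(robots_lines, user_agent="example-crawler"):
    """Build a should_fetch(url) function from the lines of a robots.txt file.

    The user agent name is a placeholder chosen for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that were already downloaded
    seen = set()                # simple URL string de-duplication (hash set)

    def should_fetch(url):
        if url in seen:
            return False        # exact same URL string was already handled
        seen.add(url)
        return parser.can_fetch(user_agent, url)

    return should_fetch
```

A crawler would call `should_fetch` on every discovered URL before queuing it. This only removes exact string duplicates and pages the site owner excluded; it does not by itself recognize uninformative pages.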
3. METHODS
To learn ITF regexes, FoCUS adopts a two-step supervised training procedure. The first step is
training set construction. The second step is regex
learning.
3.1. Constructing URL Training Sets
The goal of URL training set construction is
to automatically create sets of highly precise index
URL, thread URL, and page-flipping URL strings for
ITF regex learning. We use a similar procedure to
construct the index URL and thread URL training sets,
since they have very similar properties except for the
types of their destination pages; we present this part
first. Page-flipping URLs have their own specific
properties that differ from those of index URLs and
thread URLs; we present this part later.
Recall that an index URL is a URL that is on an
entry or index page; its destination page is another
index page; its anchor text is the board title of its
destination page. A thread URL is a URL that is on an
index page; its destination page is a thread page; its
anchor text is the thread title of its destination page.
Also note that the only way to distinguish index
URLs from thread URLs is the type of their
destination pages. Therefore, we need a method to
decide the page type of a destination page.
The index pages and thread pages each have
their own typical layouts. Usually, an index page has
many narrow records, relatively long anchor text,
and short plain text; while a thread page has a few
large records (user posts). Each post has a very long
text block and relatively short anchor text.
An index page or a thread page always has a
timestamp field in each record, but the timestamp
order in the two types of pages is reversed: the
timestamps are typically in descending order in an
index page, while they are in ascending order in a
thread page. In addition, each record in an index page
or a thread page usually has a link pointing to a user
profile page.
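The layout cues described above can be turned into a very rough heuristic, sketched below. This is not the page type classifier used in FoCUS, which is trained from annotated forums; the record fields (`timestamp`, `text_len`, `anchor_len`) and the decision rule are assumptions chosen only to illustrate how timestamp order and text length could be combined.

```python
from datetime import datetime


def timestamp_order(timestamps):
    """Return 'descending', 'ascending', or 'mixed' for a list of datetimes."""
    pairs = list(zip(timestamps, timestamps[1:]))
    if pairs and all(a >= b for a, b in pairs):
        return "descending"
    if pairs and all(a <= b for a, b in pairs):
        return "ascending"
    return "mixed"


def guess_page_type(records):
    """Guess 'index', 'thread', or 'unknown' from per-record features.

    Each record is a dict with hypothetical keys:
    'timestamp' (datetime), 'text_len' and 'anchor_len' (character counts).
    """
    if not records:
        return "unknown"
    order = timestamp_order([r["timestamp"] for r in records])
    avg_text = sum(r["text_len"] for r in records) / len(records)
    avg_anchor = sum(r["anchor_len"] for r in records) / len(records)
    # Index pages: newest first, anchor text dominates the short plain text.
    if order == "descending" and avg_anchor >= avg_text:
        return "index"
    # Thread pages: oldest first, long post text with short anchors.
    if order == "ascending" and avg_text > avg_anchor:
        return "thread"
    return "unknown"
```

A hand-written rule like this is expected to be brittle; it is shown only to make the features tangible.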
3.3. Page Flipping URL Training Set
Page-flipping URLs point to index pages or
thread pages but they are very different from index
URLs or thread URLs. The proposed “connectivity”
metric is used to distinguish page-flipping URLs from
other loop-back URLs. However, the metric only
works well on the “grouped” page-flipping URLs, i.e.,
more than one page-flipping URL in one page.
But in many forums, there is only one page-flipping URL in one page, which we call a single
page-flipping URL. Such URLs cannot be detected
using the “connectivity” metric. To address this
shortcoming, we observed some special properties of
page-flipping URLs and propose an algorithm to
detect page-flipping URLs based on these properties.
In particular, the grouped page-flipping URLs
have the following properties:
1. Their anchor text is either a sequence of
digits such as 1, 2, 3, or special text such as “last.”
2. They appear at the same location on the
DOM tree of their source page and the DOM trees of
their destination pages.
3. Their destination pages have similar layout
with their source pages. We use tree similarity to
determine whether the layouts of two pages are
similar or not. As to single page-flipping URLs, they
do not have the property 1, but they have another
special property.
4. The single page-flipping URLs appearing in
their source pages and their destination pages have
the same anchor text but different URL strings.
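Properties 1 and 4 can be checked on plain (anchor text, URL) pairs, as in the sketch below. Properties 2 and 3 need DOM trees and a tree similarity measure and are not covered here. The function names, the link representation, and the special anchor words other than “last” are assumptions made for this illustration.

```python
import re

# Property 1: anchor text is a digit sequence or special text such as "last".
# "first", "next", and "prev" are extra assumptions, not taken from the paper.
PAGE_ANCHOR = re.compile(r"\d+|first|last|next|prev", re.IGNORECASE)


def grouped_candidates(links):
    """Return links whose anchor text satisfies property 1.

    links is a list of (anchor_text, url) pairs from one page. A group needs
    more than one such link, otherwise an empty list is returned.
    """
    hits = [(a, u) for a, u in links if PAGE_ANCHOR.fullmatch(a.strip())]
    return hits if len(hits) > 1 else []


def single_candidates(source_links, dest_links):
    """Return source links that satisfy property 4.

    A link qualifies if the destination page contains a link with the same
    anchor text but a different URL string.
    """
    dest = {}
    for anchor, url in dest_links:
        dest.setdefault(anchor.strip(), set()).add(url)
    return [
        (a, u)
        for a, u in source_links
        if a.strip() in dest and u not in dest[a.strip()]
    ]
```

Candidates found this way would still have to pass the location and layout checks of properties 2 and 3 before being added to a training set.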
may notice that the k centroids change their location
step by step until no more changes are done. In other
words centroids do not move any more.
The k-means approach to clustering performs
an iterative alternating fitting process to form the
specified number of clusters. The k-means method
first selects a set of k points, called cluster seeds, as a
first guess of the means of the clusters. Each
observation is assigned to the nearest seed to form a
set of temporary clusters. The seeds are then
replaced by the cluster means, the points are
reassigned, and the process continues until no
further changes occur in the clusters.
The algorithm is as follows:
1. Place K points into the space represented
by the objects that are being clustered. These points
represent initial group centroids.
2. Assign each object to the group that has the
closest centroid.
3. When all objects have been assigned,
recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no
longer move. This produces a separation of the
objects into groups from which the metric to be
minimized can be calculated.
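The four steps above can be sketched in plain Python as follows. This is a minimal illustration that assumes points are equal-length tuples of numbers and that the initial centroids are sampled from the data with a fixed seed; the function names and parameters are chosen for this example and are not taken from the system described in this paper.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, labels) for a list of points given as tuples."""
    rng = random.Random(seed)
    # Step 1: place K initial centroids (here: K distinct data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: unchanged assignments mean the centroids no longer move.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recalculate each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, labels
```

The result depends on the initial centroids, so in practice the procedure is often run several times with different seeds and the solution with the smallest within-cluster distance is kept.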
The K-Means Algorithm Process
1. The dataset is partitioned into K clusters
and the data points are randomly assigned to the
clusters resulting in clusters that have roughly the
same number of data points.
2. For each data point:
3. Calculate the distance from the data point
to each cluster.
4. If the data point is closest to its own
cluster, leave it where it is. If the data point is not
closest to its own cluster, move it into the closest
cluster.
5. Repeat the above step until a complete
pass through all the data points results in no data
point moving from one cluster to another. At this
point the clusters are stable and the clustering
process ends.
6. The choice of initial partition can greatly
affect the final clusters that result, in terms of intercluster and intracluster distances and cohesion.
4. EXPERIMENTAL RESULTS
Table 5.1 presents the experimental results of the
proposed system for downloading the positive
command details. The table lists each forum ID with
the corresponding average number of positive
details.
Table 5.1 Positive Forum Command Analysis (Count)
20 1904
The proposed methodology efficiently analyzes user
sentiments. A key advantage of the
proposed model is that it scales easily to
networks with millions of posts. Since the proposed
model is sensitive to the number of social dimensions,
as shown in the experiment, further research is
needed to determine a suitable dimensionality
automatically.
Table 5.2 presents the experimental results of the
proposed system for downloading the negative
command analysis details. The table lists each forum
ID with the corresponding average number of
negative command details.
Table 5.2 Negative Forum Command Analysis
17 17 6
18 18 6
19 19 6
20 20 0
Fig 5.1 presents the experimental results of the
proposed system for downloading the positive
command details. The figure shows each forum ID
with the corresponding average number of positive
details.
[Figure: Negative Forum Command Analysis. x-axis: Forum ID (1 to 19); y-axis: Negative Command [%] (0 to 35); series: NEGATIVE PERCENT]
[Figure: Positive Forum Command Analysis. x-axis: Forum ID (1 to 19); y-axis: Positive Command [%]; series: POSITIVE PERCENT]
Fig 5.2 Negative Forum Command Analysis (Count)
Table 5.3 Analyzing average post per forum and average sentimental value
Columns: Forum Id, Forum Title, Post Count
Values: 1, 34, 37, 38, 39
Forum Title values: Google, Google+, Digital Point Ads, Google AdWords, Yahoo Search Marketing, Google, Azoogle, ClickBank, General Business, Payment Processing, Copywriting, Sites, Domains, eBooks, Content Creation
Fig 5.1 Positive Forum Command Analysis (Count)
Fig 5.2 presents the experimental results of the
proposed system for downloading the negative
command analysis details. The figure shows each
forum ID with the corresponding average number of
negative command details.
collected from forums.digitalpoint.com which
includes a range of 75 different topic forums.
Computation indicates that within the same time
window, forecasting achieves highly consistent
results with K-means clustering.
The forum topics are also represented using
graphs. These graphs show the forum titles, thread
count, post count, average posts per forum, average
sentiment value per forum, and the similarity or
relationship between the topics.
5. CONCLUSION
In this work, algorithms are developed to
automatically analyze the emotional polarity of a
text, from which a value for each piece of text is
obtained. The absolute value represents the
influential power of the text, and the sign denotes
its emotional polarity.
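The relationship between the signed value, the polarity, and the influential power can be illustrated with the small sketch below. The word lexicon and its scores are entirely hypothetical, and the summation rule is an assumption made for this example; the scoring algorithm actually used in this work is not reproduced here.

```python
# Hypothetical word scores, for illustration only.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}


def sentiment_value(text):
    """Sum the scores of known words; unknown words contribute 0."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return sum(LEXICON.get(w, 0.0) for w in words)


def polarity_and_power(value):
    """Split a signed sentiment value into (polarity, influential power)."""
    if value > 0:
        polarity = "positive"
    elif value < 0:
        polarity = "negative"
    else:
        polarity = "neutral"
    return polarity, abs(value)
```

For example, a post scored at -3.0 would be read as negative with an influential power of 3.0.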
K-means clustering is then applied to develop an
integrated approach for cluster analysis of online
sports forums. The clustering algorithm groups the
forums into clusters, with the center of each
cluster representing a hotspot forum within the
current time span.
In addition to clustering the forums based on
data from the current time window, a forecast is
also conducted for the next time window.
Empirical studies present strong evidence of
correlations between post text sentiment and
hotspot distribution. Education institutions, as
information seekers, can benefit from the hotspot
prediction approach in several ways.
They should follow the same rules as the academic
objectives, and be measurable, quantifiable, and time
specific. However, in practice the behavior of
parents and students is always hard to explore and
capture.
The hotspot prediction approach can help
education institutions understand their customers'
timely concerns regarding goods and services
information. Results generated by the approach can
also be combined with competitor analysis to yield
comprehensive decision support information.
6. FUTURE ENHANCEMENT
Future work will examine how to utilize the inferred
information and extend the framework for efficient
and effective network monitoring and application
design.
The new system will become more useful if the
following enhancements are made in the future.
• The application can be made web-service oriented
so that it can be further developed on any
platform.
• The application, if developed as a web site, can
be used from anywhere.
• At present, the number of posts per forum, the
average sentiment value per forum, the positive
percentage of posts per forum, and the negative
percentage of posts per forum are taken as the
feature space for K-Means clustering. In the future,
neutral replies and multiple-language replies can
also be taken as dimensions for clustering.
• In addition, only forums are currently used for
hotspot detection. Live text streams such as
chat messages can be tracked, and
classification can be adopted.
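The four features named above can be computed per forum from the sentiment values of its posts, as in the following sketch. The `forum_features` helper and its input format (a list of signed sentiment values, one per post) are assumptions made for this illustration.

```python
def forum_features(sentiment_values):
    """Return (post count, average sentiment, positive %, negative %).

    sentiment_values holds one signed sentiment value per post of a forum.
    """
    n = len(sentiment_values)
    if n == 0:
        return (0, 0.0, 0.0, 0.0)
    positive = sum(1 for v in sentiment_values if v > 0)
    negative = sum(1 for v in sentiment_values if v < 0)
    return (
        n,
        sum(sentiment_values) / n,
        100.0 * positive / n,
        100.0 * negative / n,
    )
```

Adding a new dimension, such as the percentage of neutral replies, would then amount to appending one more element to each tuple. Since post counts and percentages are on different scales, some form of normalization may be needed before distance-based clustering.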
The new system is designed so that these
enhancements can be integrated with the current
modules with little integration work.
[10] “The Web Robots Pages,” http://www.robotstxt.org/, 2012.
[11] “WeblogMatrix,” http://www.weblogmatrix.org/, 2012.
[12] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” Computer Networks and ISDN Systems, vol. 30, nos. 1-7, pp. 107-117, 1998.
[13] R. Cai, J.-M. Yang, W. Lai, Y. Wang, and L. Zhang, “iRobot: An Intelligent Crawler for Web Forums,” Proc. 17th Int’l Conf. World Wide Web, pp. 447-456, 2008.
[14] A. Dasgupta, R. Kumar, and A. Sasturkar, “De-Duping URLs via Rewrite Rules,” Proc. 14th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 186-194, 2008.
[15] C. Gao, L. Wang, C.-Y. Lin, and Y.-I. Song, “Finding Question-Answer Pairs from Online Forums,” Proc. 31st Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 467-474, 2008.
[16] N. Glance, M. Hurst, K. Nigam, M. Siegler, R. Stockton, and T. Tomokiyo, “Deriving Marketing Intelligence from Online Discussion,” Proc. 11th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 419-428, 2005.
[17] Y. Guo, K. Li, K. Zhang, and G. Zhang, “Board Forum Crawling: A Web Crawling Method for Web Forum,” Proc. IEEE/WIC/ACM Int’l Conf. Web Intelligence, pp. 475-478, 2006.
[18] M. Henzinger, “Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms,” Proc. 29th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 284-291, 2006.
[19] H.S. Koppula, K.P. Leela, A. Agarwal, K.P. Chitrapura, S. Garg, and A. Sasturkar, “Learning URL Patterns for Webpage De-Duplication,” Proc. Third ACM Conf. Web Search and Data Mining, pp. 381-390, 2010.
[20] K. Li, X.Q. Cheng, Y. Guo, and K. Zhang, “Crawling Dynamic Web Pages in WWW Forums,” Computer Eng., vol. 33, no. 6, pp. 80-82, 2007.
[21] G.S. Manku, A. Jain, and A.D. Sarma, “Detecting Near-Duplicates for Web Crawling,” Proc. 16th Int’l Conf. World Wide Web, pp. 141-150, 2007.
[22] U. Schonfeld and N. Shivakumar, “Sitemaps: Above and Beyond the Crawl of Duty,” Proc. 18th Int’l Conf. World Wide Web, pp. 991-1000, 2009.
[23] X.Y. Song, J. Liu, Y.B. Cao, and C.-Y. Lin, “Automatic Extraction of Web Data Records Containing User-Generated Content,” Proc. 19th Int’l Conf. Information and Knowledge Management, pp. 39-48, 2010.
[24] V.N. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.
[25] M.L.A. Vidal, A.S. Silva, E.S. Moura, and J.M.B. Cavalcanti, “Structure-Driven Crawler Generation by Example,” Proc. 29th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 292-299, 2006.
[26] Y. Wang, J.-M. Yang, W. Lai, R. Cai, L. Zhang, and W.-Y. Ma, “Exploring Traversal Strategy for Web Forum Crawling,” Proc. 31st Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 459-466, 2008.
BIOGRAPHIES
Dr. D. Devakumari received her M.Phil. degree from Manonmaniam Sundaranar University in 2003 and her Ph.D. from Mother Teresa Women's University in 2013. She is currently working as an Assistant Professor in the PG and Research Department of Computer Science, Government Arts College (Autonomous), Coimbatore, India. Her research papers have been published in international journals including Inderscience, Springer, etc. She has presented papers at national and international conferences. Her research interests include data pre-processing and pattern recognition.
Ms. R. Komalavalli received her B.Sc. (CS) and M.Sc. (IT) degrees from Maharaja Arts and Science College. She is pursuing her M.Phil. degree at L.R.G Government Arts College for Women. She is currently working as an Assistant Professor in the Department of Computer Science, L.R.G Government Arts College for Women, Tirupur, India.