Part 1 – Survey
Web data mining is a technique used to crawl through various web resources to collect required information, which enables an individual or a company to promote its business, understand marketing dynamics, follow new promotions circulating on the Internet, and so forth. There is a growing trend among companies, organizations, and individuals alike to gather information through web data mining and use it to their best advantage.
Because of the heterogeneity and lack of structure of Web data, automated discovery of targeted or unexpected knowledge is a challenging task. It calls for novel methods that draw from a wide range of fields spanning data mining, machine learning, natural language processing, statistics, databases, and information retrieval. In the past few years, there has been a rapid expansion of activity in the Web mining field, which consists of Web usage mining, Web structure mining, and Web content mining. Web usage mining refers to the discovery of user access patterns from Web usage logs. Web structure mining tries to discover useful knowledge from the structure of hyperlinks. Web content mining aims to extract or mine useful information or knowledge from Web page contents. For this special issue, we focus on Web content mining (Dimitrios Pierrakos, Georgios Paliouras, 2003).
Data mining is carried out with various kinds of data mining software. These range from simple tools to highly specialized ones built for detailed and extensive tasks that sift through more information to pick out finer pieces of it. For instance, if a company is looking for information on specialists, including their emails, fax, phone, location, and so forth, this information can be mined with one of these data mining programs. Information gathering through data mining has allowed companies to earn significant revenue by making better use of the web to gain business intelligence that helps them make key business decisions.
Web mining can be classified into the following three categories:
- Web usage mining
- Web content mining
- Web structure mining
Web Usage Mining
With the continued growth and proliferation of e-commerce, Web services, and Web-based information systems, the volumes of clickstream, transaction, and user profile data collected by Web-based organizations in their daily operations have reached astronomical proportions. Analyzing such data can help these organizations determine the lifetime value of customers, design cross-marketing strategies across products and services, evaluate the effectiveness of promotional campaigns, optimize the functionality of Web-based applications, provide more personalized content to visitors, and find the most effective logical structure for their Web space. This type of analysis involves the automatic discovery of meaningful patterns and relationships from a large collection of primarily semi-structured data, often stored in Web and application server access logs, as well as in related operational data sources.
An important task in any data mining application is the creation of a suitable target data set to which data mining and statistical algorithms can be applied. This is particularly important in Web usage mining because of the characteristics of clickstream data and its relationship to other related data collected from multiple sources and across multiple channels. Data preparation is often the most time-consuming and computationally intensive step in the Web usage mining process, and it frequently requires special algorithms and heuristics not commonly employed in other domains. This process is critical to the successful extraction of useful patterns from the data. It may involve pre-processing the original data, integrating data from multiple sources, and transforming the integrated data into a form suitable for input into specific data mining operations. Collectively, we refer to this process as data preparation.
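The pre-processing and transformation steps described above can be illustrated with a minimal sketch. It assumes that the access log follows the Common Log Format, that a visitor can be approximated by the host address, and that a 30-minute inactivity gap ends a session; the names `LOG_PATTERN`, `SESSION_TIMEOUT`, `STATIC_SUFFIXES`, `parse_line`, and `sessionize` are hypothetical and chosen only for illustration. Real deployments typically need more careful user identification and filtering.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: entries follow the Common Log Format; real logs may differ.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)
# Assumption: 30 minutes of inactivity closes a session (a common heuristic).
SESSION_TIMEOUT = timedelta(minutes=30)
# Assumption: requests for these file types are not page views.
STATIC_SUFFIXES = (".png", ".gif", ".jpg", ".css", ".js")


def parse_line(line):
    """Return (host, timestamp, url, status), or None for unparseable lines."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), timestamp, match.group("url"), int(match.group("status"))


def sessionize(lines):
    """Clean raw log lines and group page views into per-host sessions."""
    hits_by_host = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # drop malformed entries
        host, timestamp, url, status = record
        if status != 200 or url.endswith(STATIC_SUFFIXES):
            continue  # keep only successful page requests
        hits_by_host[host].append((timestamp, url))

    sessions = []
    for host, hits in hits_by_host.items():
        hits.sort()  # order each visitor's requests by time
        current = [hits[0][1]]
        for (previous_time, _), (timestamp, url) in zip(hits, hits[1:]):
            if timestamp - previous_time > SESSION_TIMEOUT:
                sessions.append((host, current))
                current = []
            current.append(url)
        sessions.append((host, current))
    return sessions
```

Each returned session is a time-ordered list of page URLs, which is the kind of target data set that pattern discovery steps can consume.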
Web usage mining has many advantages that make this technology attractive to corporations and to government agencies. It has enabled e-commerce companies to do personalized marketing, which eventually results in higher trade volumes. Government agencies use this technology to classify threats and fight terrorism. The predictive capability of mining applications can benefit society by identifying criminal activities. Companies can establish better customer relationships by giving customers exactly what they need. They can understand customer needs better and react to them faster. Companies can find, attract, and retain customers, and they can save on production costs by using the acquired insight into customer requirements. They can increase profitability through target pricing based on the profiles created. They can even identify a customer who might defect to a competitor; the company will then try to retain that customer with promotional offers, thereby reducing the risk of losing customers (Integrating business analytics into strategic planning for better performance, 2011).
Web usage mining by itself does not create problems, but when it is applied to data of a personal nature it may raise concerns. The most criticized ethical issue involving web usage mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this happens without their knowledge or consent. The obtained data are analyzed and clustered to form profiles; the data are made anonymous before clustering so that there are no personal profiles. In this way these applications de-individualize users by judging them by their mouse clicks. De-individualization can be defined as a tendency to judge and treat people on the basis of group characteristics rather than on their own individual characteristics and merits. Another important concern is that companies collecting the data for a specific purpose may use the data for an entirely different purpose, which essentially violates the user’s interests.
The growing trend of selling personal data as a commodity encourages website owners to trade personal data obtained from their sites. This trend has increased the amount of data being captured and traded, which increases the likelihood of one’s privacy being invaded. The companies that buy the data are obliged to make it anonymous, and these companies are considered the authors of any specific release of mining patterns. They are legally responsible for the contents of the release; any inaccuracies in the release can result in serious lawsuits, but there is no law preventing them from trading the data (Jimmy Lin and Miles Efron, 2013).
Some mining algorithms may use controversial attributes such as sex, race, religion, or sexual orientation to categorize individuals. These practices may violate anti-discrimination legislation. The applications make it hard to identify the use of such controversial attributes, and there is no strong rule against using such algorithms with such attributes. This process could result in the denial of a service or a privilege to an individual based on race, religion, or sexual orientation. At present, this situation can be avoided only through the high ethical standards maintained by the data mining company. The collected data are made anonymous so that the obtained data and the discovered patterns cannot be traced back to an individual. It may look as though this poses no threat to one’s privacy; however, additional information can be inferred by the application by combining two separate pieces of data collected from the user.
Web content mining is the process of extracting knowledge from content, typically the text of documents or their descriptions. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents on the web. It analyzes a site’s hyperlink and document structure. Finally, web usage mining, also known as web log mining, analyzes user behavior on a web site by extracting interesting patterns from the web server logs.
It is a technique for understanding user behavior as it relates to the use of websites. The results of web mining can be used to provide metrics on the effectiveness of a company’s web site or the success of a particular campaign.
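As a small illustration of such metrics, the sketch below computes two commonly used figures from sessionized usage data. It assumes that each session is a list of visited page names and that reaching a designated goal page counts as a successful campaign outcome; the function name `site_metrics`, the parameter `goal_page`, and the choice of these two metrics are illustrative assumptions rather than anything prescribed by the sources cited here.

```python
def site_metrics(sessions, goal_page):
    """Compute simple effectiveness metrics from a list of sessions.

    Assumption: a session is a list of page names in visit order, and a
    session "converts" if it contains goal_page at least once.
    """
    total = len(sessions)
    if total == 0:
        return {"bounce_rate": 0.0, "conversion_rate": 0.0}
    # A bounce is a session with exactly one page view.
    bounces = sum(1 for session in sessions if len(session) == 1)
    conversions = sum(1 for session in sessions if goal_page in session)
    return {
        "bounce_rate": bounces / total,
        "conversion_rate": conversions / total,
    }
```

Comparing these values before and after a campaign, or between visitor segments, gives a rough indication of whether the campaign changed user behavior.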
Recently, software vendors and researchers have been focusing on using the patterns extracted by web mining to predict the next user request during an online session with a web site, especially in e-commerce. Such systems are called recommender systems and are useful tools for predicting user requests (Ah-Hwee Tan, 2003).
The technique of sequential pattern mining attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. By using this approach, Web marketers can predict future visit patterns, which is helpful in placing advertisements aimed at certain user groups. Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, and similarity analysis. In the context of Web usage data, sequential pattern mining can be used to capture frequent navigational paths among user trails.
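A minimal sketch of the idea, restricted to patterns of length two, is given here. It treats each session as a time-ordered list of pages and counts how many sessions contain page A somewhere before page B; the function name `frequent_ordered_pairs` and the `min_support` threshold are hypothetical. Established sequential pattern mining algorithms handle longer patterns and scale far better than this brute-force count.

```python
from collections import Counter
from itertools import combinations


def frequent_ordered_pairs(sessions, min_support):
    """Return {(a, b): support} for pages a visited before b.

    Support is the fraction of sessions containing the ordered pair.
    The pages need not be adjacent within the session.
    """
    counts = Counter()
    for session in sessions:
        pairs_in_session = set()
        # combinations over positions yields every index pair with i < j.
        for i, j in combinations(range(len(session)), 2):
            if session[i] != session[j]:
                pairs_in_session.add((session[i], session[j]))
        counts.update(pairs_in_session)  # count each pair once per session

    total = len(sessions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }
```

Pairs that pass the threshold are candidates for the kind of navigational path a marketer might use when deciding where to place an advertisement.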
Sequential patterns (SPs) in Web usage data capture the Web page trails that are often visited by users, in the order in which they were visited. Contiguous sequential patterns (CSPs), a special form in which the items must appear adjacent to one another, can be used to capture frequent navigational paths among user trails. In contrast, items appearing in SPs, while preserving the underlying ordering, need not be adjacent, and thus they represent more general navigational patterns within the site.
The view of Web transactions as sequences of page views allows a number of useful and well-studied models to be used in discovering or analyzing user navigation patterns. One such approach is to model the navigational activities in the Web site as a Markov model: each page view (or a category) can be represented as a state, and the transition probability between two states can represent the likelihood that a user will navigate from one state to the other. This representation allows for the computation of a number of useful user or site metrics. For example, one may compute the probability that a user will make a purchase, given that she has performed a search in an online catalog. Markov models have been proposed as the underlying modeling machinery for link prediction as well as for Web prefetching to minimize system latencies. The goal of such approaches is to predict the next user action based on a user’s previous surfing behavior. They have also been used to discover high-probability user navigational trails in a Web site. More sophisticated statistical learning techniques, such as mixtures of Markov models, have also been used to cluster navigational sequences and perform exploratory analysis of users’ navigational behavior in a site (Bing Liu, Kevin Chen-Chuan Chang, 2006).
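The Markov view of navigation can be sketched in a few lines. The example assumes a first-order model in which each page is a state and transition probabilities are estimated as relative frequencies of consecutive page pairs in the sessions; the function names `fit_transitions` and `predict_next` are hypothetical, and no smoothing is applied for unseen transitions.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Each pair of consecutive pages is one observed transition.
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    return {
        page: {
            nxt: count / sum(next_counts.values())
            for nxt, count in next_counts.items()
        }
        for page, next_counts in counts.items()
    }


def predict_next(transitions, page):
    """Return the most probable next page, or None if the page has no known successor."""
    candidates = transitions.get(page)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

A prefetching component could call `predict_next` with the page a visitor is currently viewing and load the returned page in advance, which is the latency-reduction use mentioned above.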
This section identifies and discusses some of the opportunities that business managers working with the aforementioned mining technologies have for gaining competitive advantage for their businesses.
Data mining is primarily used for competitive advantage by companies with a strong customer focus. The focus of data mining applications among business leaders has been steadily evolving from customer analysis to relationship analysis.
Increasingly competitive business and consumer marketplaces make it imperative for companies not only to attract customers but also to retain them, especially the small percentage of highly profitable customers. Retention strategies for valued customers generally focus on financial and/or service-level incentives to promote loyalty. Since only a few companies can enjoy the economies of scale (or investment capital) needed to sustain competitive differentiation on price alone, many businesses seek to maximize customer value by building loyalty through brand and service differentiation. This approach puts a premium on the quality of each customer contact, as every interaction serves either to build the brand or to destroy it.
These business leaders use advanced analytics with data mining to optimize their customer relationships. Examples include improving the effectiveness of marketing campaigns and attracting new customers, maximizing the value of sales to existing customers (cross-selling and up-selling), and minimizing customer attrition (Michael S. Lew, Nicu Sebe, Chabane Djeraba, Ramesh Jain, 2006).
The essence of personalization is the adaptability of information systems to the needs of their users. This issue is becoming increasingly important on the Web, as non-expert users are overwhelmed by the quantity of information available online, while commercial Web sites strive to add value to their services in order to create loyal relationships with their visitors and customers. Web personalization can be viewed through the prism of the personalization policies adopted by Web sites, which implement a variety of functions. In this context, the area of Web usage mining is a valuable source of ideas and methods for the implementation of personalization functionality.
Early work in Web usage mining did not extensively consider its use for personalization. Its primary focus was on the discovery of decision-support knowledge, expressed in terms of descriptive data models to be evaluated and exploited by human experts. All that is required for the application of Web usage mining to Web personalization is a shift of focus from traditional decision-support knowledge discovery, i.e., the static modeling of usage data, to the discovery of operational knowledge for personalization, i.e., the dynamic modeling of users. This type of knowledge can be delivered directly back to the users in order to improve their experience of the site, without the intervention of any human expert. Thus, it is now widely recognized that usage mining is a valuable source of ideas and solutions for Web personalization.
Based on this view of the Web usage mining process, the cited survey reviews recent research in Web personalization. Starting with an analysis of the Web personalization concept and its relation to Web usage mining, the emphasis subsequently falls on the methodology adopted in Web usage mining, the various solutions that have been presented in the literature, and the way in which these methods can be applied to Web personalization systems. In reading the survey, it should be kept in mind that Web usage mining is not a mature research area. As a result, the survey also addresses many open issues, at both a technical and a methodological level (Dimitrios Pierrakos, Georgios Paliouras, 2003).
The main problem with content-based filtering is the difficulty of analyzing the content of Web pages and arriving at semantic similarities. Even if one ignores multimedia content, natural language itself is a rich and unstructured source of data. Despite the significant progress achieved in the research fields that deal with the analysis of textual data, we are still far from getting a machine to understand natural language the way humans do. Content-based filtering adopts a variety of statistical methods for the extraction of useful information from textual data. However, the problem of analyzing Web content remains, and it becomes even more critical when there is limited textual content. By reducing the emphasis on Web content, collaborative filtering addresses this important problem. Furthermore, collaborative filtering methods facilitate the exploitation of usage patterns that are not confined to strict semantic boundaries (Bamshad Mobasher and Olfa Nasraoui, 2006).
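To make the contrast concrete, the sketch below shows one simple form of collaborative filtering that never inspects page content. It assumes that usage data are available as a mapping from each user to the set of pages that user visited, and it scores page pairs by the cosine similarity of their visitor sets; the names `item_similarities` and `recommend` are hypothetical, and item-based cosine similarity is only one of several possible collaborative filtering schemes.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(visits):
    """Cosine similarity between pages, based only on who visited them.

    visits maps a user id to the set of pages that user visited.
    """
    users_by_page = defaultdict(set)
    for user, pages in visits.items():
        for page in pages:
            users_by_page[page].add(user)

    similarities = {}
    pages = sorted(users_by_page)
    for index, a in enumerate(pages):
        for b in pages[index + 1:]:
            common = len(users_by_page[a] & users_by_page[b])
            if common:
                score = common / sqrt(len(users_by_page[a]) * len(users_by_page[b]))
                similarities[(a, b)] = similarities[(b, a)] = score
    return similarities


def recommend(visits, user, similarities, top_n=3):
    """Rank unseen pages by their summed similarity to the user's visited pages."""
    seen = visits[user]
    scores = defaultdict(float)
    for (a, b), score in similarities.items():
        if a in seen and b not in seen:
            scores[b] += score
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda page: (-scores[page], page))[:top_n]
```

Because the similarity is computed from co-visitation alone, two pages with no textual overlap can still be related, which is the sense in which usage patterns are not confined to semantic boundaries.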
Part 2 – Approach towards the Solutions
New network technologies have opened the way for new distributed database architectures and protocols that are no longer built around a bottleneck in network communications. Current systems are optimized to minimize network communication, both because of historical bandwidth limitations and because network communication has been the major bottleneck of distributed systems in general.
In this section, we seek to explore the effect of new network technologies on distributed database systems and how distributed databases may be changed with this bottleneck no longer an issue. From the survey completed on this problem it is clear that multiple challenges and solutions exist.
The challenges in addressing the problem of Web data mining are as follows:
- How to fully utilize network resources?
- How to work better with new limits (i.e., propagation delay)?
- How to maintain a distributed architecture?
- How to enhance performance while maintaining ACID properties?
The corresponding goals are:
- To reduce the impact of propagation delay
- To keep the design changes simple
- To eliminate obsolete components of the system
The solution to the overall problem can be broken down into addressing each component individually. The overall solution in terms of the database itself is heavily based on the concepts of database replication and migration. In summary, we now have the ability to send large amounts of data quickly and reliably. A distributed database management system (DDBMS) should take advantage of this and send as much data as needed without being concerned with reduction or sending fragments.
References
Ah-Hwee Tan. (2003). Text Mining: The state of the art and the challenges. Singapore.
Bamshad Mobasher and Olfa Nasraoui. (2006). Web usage mining. Web data mining: Exploring hyperlinks, contents and usage data, p. 12.
Bing Liu, Kevin Chen-Chuan Chang. (2006). Editorial: Special Issue on Web Content Mining. SIGKDD Explorations, 6(1), pp. 1-4.
Dimitrios Pierrakos, Georgios Paliouras. (2003). Web Usage Mining as a Tool for Personalization: A Survey. User Modeling and User-Adapted Interaction, 13, pp. 311-372.
Enrique Flores, Alberto Barrón-Cedeño, Paolo Rosso, and Lidia Moreno. (2011). Towards the Detection of Cross-Language Source Code Reuse. Natural Language Processing and Information Systems (pp. 250-253). Berlin Heidelberg: Springer.
Eric Brewer, M. D. (2006). The Challenges of Technology Research for Developing Regions. IEEE Pervasive Computing, 5(2), pp. 15-23.
Integrating business analytics into strategic planning for better performance. (2011). Journal of Business Strategy, 6, pp. 30-39.
Joe F. Hair. (2007). Knowledge creation in marketing: the role of predictive analytics. European Business Review, 19(4), pp. 303-315.
Jimmy Lin and Miles Efron. (2013). Overview of the TREC-2013 microblog track. TREC, vol. 2013. Retrieved from http://trec.nist.gov/pubs/trec22/papers/MB.OVERVIEW.pdf
Marten Schläfke, Riccardo Silvi, Klaus Möller. (2012). A framework for business analytics in performance management. International Journal of Productivity and Performance Management, 62(1), pp. 110-122.
Michael S. Lew, Nicu Sebe, Chabane Djeraba, Ramesh Jain. (2006). Content-Based Multimedia Information Retrieval: State of the Art and Challenges. ACM Transactions on Multimedia Computing, Communications and Applications, 2(1), pp. 1-19.
R. Cooley, B. Mobasher, and J. Srivastava. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web. University of Minnesota, Minneapolis: IEEE.
Ranjit Bose. (2009). Advanced analytics: opportunities and challenges. Industrial Management & Data Systems, 109(2), pp. 155-172.