We present a new method to identify navigation-related Web usability problems based on comparing actual and anticipated usage patterns. The actual usage patterns can be extracted from Web server logs routinely recorded for operational websites by first processing the log data to identify users, user sessions, and user task-oriented transactions, and then applying a usage mining algorithm to discover patterns among actual usage paths. The anticipated usage, including information about both the path and time required for user-oriented tasks, is captured by our ideal user interactive path models constructed by cognitive experts based on their cognition of user behavior. The comparison is performed via the mechanism of a test oracle for checking results and identifying user navigation difficulties. The deviation data produced from this comparison can help us discover usability issues and suggest corrective actions to improve usability. A software tool was developed to automate a significant part of the activities involved. With an experiment on a small service-oriented website, we identified usability problems, which were cross-validated by domain experts, and quantified usability improvement by the higher task success rate and lower time and effort for given tasks after suggested corrections were implemented. This case study provides an initial validation of the applicability and effectiveness of our method.
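As a rough illustration of the comparison step described above, the following Python sketch checks one observed path against one anticipated path and reports simple deviation measures. The `IdealPath` structure, the `path_deviation` function, the page names, and the chosen measures are all assumptions made for illustration; the abstract does not specify the authors' model format or oracle logic.

```python
from dataclasses import dataclass


@dataclass
class IdealPath:
    """Hypothetical stand-in for an expert-built ideal path model."""
    pages: list          # anticipated page sequence for one task
    max_seconds: float   # anticipated time budget for the task


def path_deviation(actual_pages, actual_seconds, ideal):
    """Compare one observed path with the anticipated one (illustrative only)."""
    ideal_set = set(ideal.pages)
    # Pages the user visited that the ideal path never requires.
    off_path = [p for p in actual_pages if p not in ideal_set]
    return {
        # The task counts as completed if the final ideal page was reached.
        "completed": ideal.pages[-1] in actual_pages,
        "extra_steps": len(actual_pages) - len(ideal.pages),
        "off_path_pages": off_path,
        "time_overrun": max(0.0, actual_seconds - ideal.max_seconds),
    }


ideal = IdealPath(pages=["/home", "/services", "/book"], max_seconds=60.0)
report = path_deviation(
    ["/home", "/about", "/home", "/services", "/book"], 95.0, ideal
)
print(report)
```

Large values of `extra_steps` or `time_overrun` aggregated over many sessions would be the kind of deviation data that could point to a navigation problem, under the assumptions stated above.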
The WWW has a large number of pages and URLs that supply users with a great amount of content. In an era of rapidly growing information, analysing users' browsing behaviour is an important task. Web usage mining techniques are applied to web server logs to analyse user behaviour. Identification of user sessions is one of the key and most demanding tasks in the pre-processing stage of web usage mining. This paper focuses on two important shortcomings of existing session identification methods such as Time based and Referrer based sessionization. The first concerns comparing the current request’s referrer field with the URL of the previous request. The second concerns session creation: sessions are split or merged according to threshold values for page stay time and session time. The authors therefore developed an enhanced semantic distance based session identification algorithm that tackles these issues of traditional session identification methods. The enhanced semantic based method has an accuracy of 84 percent, which is higher than the Time based and Time-Referrer based session identification approaches. The authors also used adapted K-Means and Hierarchical Agglomerative clustering algorithms to improve the prediction of user browsing patterns. Clusters were found using a weighted dissimilarity matrix, which is calculated using two key parameters: page weight and session weight. The Dunn Index and Davies-Bouldin Index are then used to evaluate the clusters. Experimental results show that purer and more accurate session clusters are formed when the adapted clustering algorithms are applied to the weighted sessions rather than to sessions obtained from traditional sessionization algorithms. The accuracy of the semantic session clusters is higher than that of clusters of sessions obtained using traditional sessionization.
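For context, the traditional Time-Referrer baseline that this abstract criticises can be sketched as follows. This is not the authors' semantic distance algorithm. The `Request` record, the `sessionize` function, and the threshold defaults (600 seconds of page stay, 1800 seconds per session) are assumptions chosen for illustration.

```python
from typing import NamedTuple


class Request(NamedTuple):
    ts: float       # request time in seconds
    url: str        # requested URL
    referrer: str   # referrer field, "-" when absent


def sessionize(requests, page_stay=600.0, session_max=1800.0):
    """Baseline Time-Referrer sessionization for one user's requests."""
    sessions = []
    for req in sorted(requests, key=lambda r: r.ts):
        if sessions:
            current = sessions[-1]
            prev = current[-1]
            # Time rule: bounded page stay time and bounded total session time.
            within_time = (
                req.ts - prev.ts <= page_stay
                and req.ts - current[0].ts <= session_max
            )
            # Referrer rule: the referrer must match the previous request's URL.
            referred = req.referrer == prev.url
            if within_time and referred:
                current.append(req)
                continue
        sessions.append([req])
    return sessions


log = [
    Request(0, "/a", "-"),
    Request(30, "/b", "/a"),
    Request(50, "/c", "/x"),     # referrer mismatch starts a new session
    Request(2000, "/d", "/c"),   # long gap starts a new session
]
sessions = sessionize(log)
print([[r.url for r in s] for s in sessions])
```

The third request shows the first shortcoming named in the abstract: a strict referrer comparison splits a session even though the requests are only 20 seconds apart.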
Purpose
The purpose of this paper is to critically analyze the state-of-the-art session identification techniques used in the web usage mining (WUM) process in terms of their limitations, features, and methodologies.
Design/methodology/approach
In this research, a systematic literature review was conducted using a review protocol approach. The methodology consisted of a comprehensive search for relevant literature over the period 2005-2015, using four online database repositories (i.e. IEEE, Springer, ACM Digital Library, and ScienceDirect).
Findings
The findings revealed that this research area is still immature and that the existing literature lacks a critical review of recent session identification techniques used in the WUM process.
Originality/value
The contribution of this study is to provide a structured overview of the research developments, to critically review the existing session identification techniques, to highlight their limitations and associated challenges, and to identify areas where further improvements are required so as to complement the performance of existing techniques.
Issue Title: Special Issue: Data Mining Lessons Learned
The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception. With clickstreams being collected at the application-server layer, high-level events being logged, and data automatically transformed into a data warehouse using meta-data, common problems plaguing data mining using weblogs (e.g., sessionization and conflating multi-sourced data) were obviated, thus allowing us to concentrate on actual data mining goals. The paper briefly reviews the architecture and discusses many lessons learned over the last four years and the challenges that still need to be addressed. The lessons and challenges are presented across two dimensions: business-level vs. technical, and throughout the data mining lifecycle stages of data collection, data warehouse construction, business intelligence, and deployment. The lessons and challenges are also widely applicable to data mining domains outside retail e-commerce.
Web usage mining has proven to be an important advance for e-business systems, both by finding web user buying patterns and by suggesting ways to improve web user navigation. A primary input for web usage mining is web user sessions, which must be constructed from web server logs (a process called sessionization) when such sessions are not otherwise identified. We use bipartite cardinality matching and a more general integer program to construct sessions. We also propose several variations of our integer program to provide additional insights into session characteristics. For testing, we retrieve 15 months of web server logs and corresponding real sessions from an academic web site. We compare real sessions, results obtained by our optimization models, and results from a commonly used timeout heuristic. We find that our optimization models dominate the timeout heuristic on several comparison measures. Solution time for a typical month is seven hours for our integer program, 30 minutes for our bipartite cardinality matching, and about 1 minute for the heuristic. Although solution time is significantly greater for the integer program, its variations contribute additional analysis of web user behavior.
The World Wide Web is a popular “tool” for companies. It can be used as a method of communication between companies and their customers; it also allows organizations to set up virtual storefronts that can be accessed by customers from all over the world. The ability to understand customers’ behavior is extremely important as companies strive to increase the usability and profitability of their web services. The concept of a session is a popular unit of measurement used to analyze recorded information. However, this concept is currently rather abstract and lacks a precise definition. How to measure a session is a fundamental question for web services that rely on this concept, and it currently has no settled answer. This paper presents a session timeout threshold model based on empirical observations as an initial answer to this question. The model seeks to provide accurate session data with respect to individual web services.
Metrics derived from user visits or sessions provide a means of evaluating Websites and an important insight into online information seeking behaviour, the most important of them being the duration of sessions and the number of pages viewed in a session, a possible busyness indicator. However, the identification of sessions (often termed ‘sessionization’) is fraught with difficulty in that there is no way of determining from a transactional log file that a user has ended their session. No one logs out. Instead, a session delimiter has to be applied, and this is typically done on the basis of a standard period of inactivity. To date, researchers have discussed the issue of a timeout delimiter in terms of a single value: if a page view time exceeds the cut-off value, the session is deemed to have ended. This approach assumes that page view time follows a single distribution and that the cut-off value is one point on that distribution. The authors, however, argue that the page view time distribution is composed of a number of quite separate view time distributions because of the marked differences in view times between page types (abstract, contents page, full text). This implies that a number of timeout delimiters should be applied. Employing data from a study of the OhioLINK digital journal library, the authors demonstrate how the setting of a timeout delimiter affects the estimate of page view time and the estimated number of sessions. Furthermore, they show how a number of timeout delimiters might apply, and they argue that this gives a better and more robust estimate of the number of sessions, session time and page view time than the application of a single timeout delimiter.
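The idea of applying a separate timeout delimiter per page type, rather than a single cut-off, can be sketched as below. The timeout values in `TIMEOUTS`, the `split_sessions` function, and the sample page views are hypothetical and are not taken from the OhioLINK study.

```python
# Hypothetical per-page-type timeouts in seconds (illustrative values only).
TIMEOUTS = {"contents": 120.0, "abstract": 300.0, "full_text": 1800.0}


def split_sessions(views, timeouts=TIMEOUTS, default=600.0):
    """Split (timestamp, page_type) views into sessions.

    The allowed gap after a view depends on the type of the page just viewed;
    page types missing from `timeouts` fall back to `default`.
    """
    sessions = []
    for ts, page_type in sorted(views):
        if sessions:
            prev_ts, prev_type = sessions[-1][-1]
            if ts - prev_ts <= timeouts.get(prev_type, default):
                sessions[-1].append((ts, page_type))
                continue
        sessions.append([(ts, page_type)])
    return sessions


views = [
    (0, "contents"),
    (60, "abstract"),
    (200, "full_text"),
    (1500, "contents"),
    (1700, "abstract"),
]
per_type = split_sessions(views)
# Passing an empty mapping makes every page use the single default cut-off.
single = split_sessions(views, timeouts={}, default=600.0)
print([len(s) for s in per_type], [len(s) for s in single])
```

In this toy input both strategies yield two sessions, but the boundaries differ: the per-type rule tolerates a long gap after a full-text page and splits after a short-lived contents page, whereas the single cut-off does the opposite.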
Event log files are the most common data sets exploited by many companies for customer behavior analysis. Oftentimes these records are unordered and need to be grouped by a certain key for effective analysis. One such example is grouping similar users with different session IDs to facilitate further analysis. This kind of analysis is known as User Sessionization. In this paper, we propose a distributed framework combining Hadoop and MapReduce to analyze event log files and sessionize users based on IP address and timestamp. The evaluation results show that as the number of nodes increases, the execution time decreases and performance increases.
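The map, group, and reduce structure of such a sessionization job can be imitated in plain Python to show the data flow. This is a single-process sketch, not Hadoop code and not the authors' framework. The `ip timestamp url` line format, the function names, and the 1800-second timeout are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit (ip, (timestamp, url)) pairs from 'ip timestamp url' log lines."""
    for line in lines:
        ip, ts, url = line.split()
        yield ip, (float(ts), url)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(ip, events, timeout=1800.0):
    """Order one IP's events by time and cut a session at each long gap."""
    sessions = []
    for ts, url in sorted(events):
        if sessions and ts - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((ts, url))
        else:
            sessions.append([(ts, url)])
    return ip, sessions


log = [
    "10.0.0.1 100 /a",
    "10.0.0.2 50 /x",
    "10.0.0.1 40 /home",   # arrives out of order
    "10.0.0.1 5000 /b",    # long gap, so a new session
]
result = dict(reduce_phase(ip, ev) for ip, ev in shuffle(map_phase(log)).items())
print(result)
```

Because each key is reduced independently, the reduce step can run in parallel across nodes, which is consistent with the reported drop in execution time as nodes are added.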
Near Real-Time Tracking at Scale
Vasthimal, Deepak Kumar; Kumar, Sudeep; Somani, Mahesh
2017 IEEE 7th International Symposium on Cloud and Service Computing (SC2),
2017-Nov.
Conference Proceeding
Clickstream data analysis involves collecting, analyzing and aggregating data for business analytics. Key business indicators such as user experience, product checkout flows, and failed customer interactions are computed based on this data. A/B testing [18] or any data experimentation uses the clickstream data stream to compute business lifts or capture user feedback on new changes to the site. Handling such data at scale is extremely challenging, especially when designing a system that ensures little to no data loss, bot filtering, event ordering, aggregation and sessionization of user visits. The entire operation must be near real-time so that the computations performed can be fed back into services that help with targeted personalization and a better user experience. Sessions capture a group of user interactions within a stipulated time frame. Business metrics are often computed on these user sessions. User sessions are therefore critical for business analytics as they represent true user behavior. We describe the process of creating a highly available data pipeline and computational model for user sessions at scale.