Application of hybrid of Fuzzy Set, Trust and Genetic Algorithm in query log mining for effective Information Retrieval

The precision of an Information Retrieval (IR) system is low because user queries are imprecise and because of information overload on the web. Fuzzy sets are used to infer the user's information need from vague and imprecise queries, and web recommender systems are used to overcome the information overload problem. The performance of recommender systems nevertheless remains low due to data sparsity. The concept of trust is used to deal with the data sparseness problem and improves the performance of recommender systems. Genetic Algorithms (GA) have been applied in the domain of information retrieval to optimize web search results. In this research, a hybrid of Fuzzy set, GA and Trust is used in query log mining for personalized web search, in which fuzzy queries drive the recommendation of an optimally ranked set of trusted documents. The hybrid thus infers the user's information need from vague and imprecise queries and optimizes the ranking of trusted web pages for effective personalized web search. The experimental results were analyzed statistically and compared with GA IR and Fuzzy Trust based IR. The comparative analysis shows that the hybrid of Fuzzy Set, Trust and GA improves the average precision of search results and confirms the effective personalization of web search.


Introduction
Web information is voluminous and is accessed using search engines. Search engines retrieve a huge set of web documents, of which very few are relevant to the user. User search queries are vague and imprecise, which makes the user's information need difficult to infer and gives rise to the information overload problem. Fuzzy sets have been used to deal with vague and imprecise queries when inferring the user's information need. Recommender systems are used to overcome the information overload problem, but their performance is low due to data sparseness. Trust based recommender systems overcome data sparseness and generate reliable recommendations. Optimization techniques such as the Genetic Algorithm have been used to increase the effectiveness of recommender systems [1][2][3].

In this research an approach is proposed that uses a hybrid of Fuzzy set, Trust and Genetic Algorithm in query log mining to recommend relevant documents according to the user's information need. An algorithm is developed for a web page recommender system that uses the fuzzy user query to represent the user's information need; a cluster is selected using the fuzzy query for the recommendation of an optimal ranking of trusted documents for effective Information Retrieval. The algorithm executes in two phases: Phase I (offline) and Phase II (online). In Phase I, web query sessions are preprocessed and keyword vectors are generated for clustering. The term-document matrix W1 is formed using the tf.idf content of the clicked documents in query sessions. The fuzzy thesaurus FR is built from W1 and its transpose W1^T in order to create a term-term correlation matrix. The genetic algorithm is executed on the clicked URLs of a given cluster that are selected based on the trust threshold value, for rank optimization of trusted web pages (clicked URLs). In Phase II, the user input query, represented by fuzzy set A1, is expanded with related terms based on the fuzzy thesaurus FR. The cluster is selected using the expanded fuzzy input query for the recommendation of an optimal ranking of trusted documents. The user response to the recommended documents is captured, and a user profile keyword vector is generated and expanded with related words based on the fuzzy thesaurus. The expanded profile keyword vector then selects the cluster for the recommendation of optimally ranked documents. Thus the expansion of user profile terms and the recommendation of an optimal ranking of trusted documents are used for effective information retrieval.

The experiment was performed on a data set of clustered web user query sessions for web information retrieval. The precision of search results improved when Fuzzy, Trust and GA IR were used together, in comparison with using only Fuzzy Trust based IR [29] or GA based IR [21].

Related Work
In [4,5] user preferences were modeled with fuzzy profiles. In [6] fuzzy logic was used to summarize text by extracting the most relevant sentences. In [7] fuzzy queries on the web were processed using both a document index and a perception index. In [8] fuzzy logic based personalization of newsletters was introduced. In [9] a fuzzy set model was used to define fuzzy queries. In [10,11] a fuzzy relationship between query terms and documents was introduced. In [12] a fuzzy IR system used fuzzy logic to retrieve documents similar to the query document; the system was tested on Arabic documents. In [13] a fuzzy logic based ranking model was introduced. In [14] a trust metric was introduced in collaborative filtering, which increased its effectiveness. In recommender systems, the use of trust outperforms traditional collaborative filtering [15][16][17][18]. In [19] a genetic algorithm was used to solve the flow shop scheduling problem. In [20] a recommender system was proposed based on a genetic algorithm and the k-NN algorithm. In [21] a recommender system based on collaborative filtering was improved using a genetic algorithm. In [22] a genetic algorithm was used to generate the optimal combination of terms for effective information retrieval. In [23] wireless sensor network energy issues and secure routing were addressed using trust and a genetic algorithm. In [24] a hybrid of genetic algorithms and trust was used in a wireless sensor network. In [25] a Genetic Algorithm Inspired Load Balancing Protocol was proposed for congestion control in wireless sensor networks using a Trust Based Routing Framework. In [26] a Friend Recommender System for WBSN was proposed, in which learning was done using a genetic algorithm and trust propagation was used to address data sparseness. In [27] a novel trust model was proposed that combines peer profiling and anomaly detection, with a genetic algorithm used to detect anomalous behaviour. In [28] a trust oriented genetic algorithm (TOGA) was proposed for finding a near optimal service composition plan with QoS constraints. In [29] a method was proposed using both Fuzzy set and trust for web information retrieval.

The hybrid of Genetic Algorithm and Trust has thus been applied in various domains, and the reported results are promising. In this research, the hybrid of Fuzzy IR, GA and Trust is used to develop an algorithm for web search personalization. The proposed method is novel since no prior work has combined the advantages of Fuzzy IR, Trust and GA for effective web information retrieval.

Background

Information Scent
The Inferring User Need by Information Scent (IUNIS) algorithm infers the user's information need from the user's traversal path through web pages. The Information Scent isc_id of the clicked web page CP_id in session i is given in (1) and (2), where n1 is the number of distinct clicked pages in session i and MQ is the total number of query sessions [30][31][32][33].
PF.IPF(CP_id): PF is the normalized frequency f_CP_id of the clicked web page CP_id, and IPF is the ratio of the total number of query sessions MQ in the whole data set to the number of query sessions m_CP_d that contain the given page CP_d. Time(CP_id): the ratio of the time spent on the clicked web page CP_id to the total time of session i [34][35][36][37][38][39][40][41][42].
The session keyword vector QV_i models the information need of session i, as given in (3). The keyword vectors are clustered using k-means, and the quality of the clusters is assessed using a score function [43][44][45].
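The quantities described above can be illustrated with a short Python sketch. Because equations (1) to (3) are not reproduced in this text, the forms used here are assumptions: information scent is taken as the product of PF.IPF and the time ratio, PF is normalized by the maximum click frequency in the session, IPF is the plain ratio MQ/m, and the session keyword vector is the scent-weighted sum of page content vectors. All function names are hypothetical.

```python
from collections import Counter


def pf_ipf(session_clicks, page, all_sessions):
    """PF.IPF of `page` in one session (assumed form: normalized PF times MQ/m)."""
    counts = Counter(session_clicks)
    pf = counts[page] / max(counts.values())       # click frequency normalized by session maximum
    mq = len(all_sessions)                         # total number of query sessions
    m = sum(1 for s in all_sessions if page in s)  # sessions that contain the page
    return pf * (mq / m)


def information_scent(session_clicks, page, time_on_page, session_time, all_sessions):
    """Assumed product of PF.IPF and the fraction of session time spent on the page."""
    return pf_ipf(session_clicks, page, all_sessions) * (time_on_page / session_time)


def session_keyword_vector(scents, contents):
    """Scent-weighted sum of the content vectors (term -> weight) of the clicked pages."""
    vector = {}
    for page, scent in scents.items():
        for term, weight in contents[page].items():
            vector[term] = vector.get(term, 0.0) + scent * weight
    return vector


# Toy click log: three sessions, page "a" clicked twice in the first one.
sessions = [["a", "b", "a"], ["b"], ["a", "c"]]
scent_a = information_scent(sessions[0], "a", 30, 60, sessions)  # 0.75 for this toy input
```

The resulting vectors could then be passed to any k-means implementation for clustering.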

Fuzzy Set Theory in Information Retrieval (IR) System
Fuzzy information retrieval methods based on fuzzy sets handle vague, imprecise and uncertain information and infer the best results for a user query. Fuzzy Information Retrieval is implemented on the basis of the fuzzy thesaurus FR, where FR(x1, x2) in (4) expresses the degree to which the meaning of x1 is synonymous with the meaning of x2. The relation FR uses this synonymy relationship to retrieve documents relevant to the user query. The augmented query, represented by fuzzy set B1 on T1, is generated from the fuzzy input query A1 and the fuzzy thesaurus FR as given in (5) and (6) [29,46]:

$$B1 = A1 \circ FR \qquad (5)$$

$$B1(x2) = \max_{x1 \in T1} \min\left[A1(x1),\ FR(x1, x2)\right], \quad \forall\, x2 \in T1 \qquad (6)$$

where $\circ$ is the max-min composition.
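The query expansion by max-min composition can be sketched in a few lines of Python. This is a minimal illustration that represents fuzzy sets as dictionaries; the function name, the example terms and the membership grades are hypothetical and are not taken from the experiments.

```python
def max_min_compose(a1, fr):
    """Return B1 = A1 o FR under max-min composition.

    a1: dict mapping a term to its membership grade in the query fuzzy set A1.
    fr: dict mapping x1 to a dict of x2 -> synonymy degree FR(x1, x2).
    """
    terms = set(fr)
    for row in fr.values():
        terms.update(row)
    b1 = {}
    for x2 in terms:
        # B1(x2) = max over x1 of min(A1(x1), FR(x1, x2))
        grade = max(
            (min(mu, fr.get(x1, {}).get(x2, 0.0)) for x1, mu in a1.items()),
            default=0.0,
        )
        if grade > 0.0:
            b1[x2] = grade
    return b1


# Hypothetical two-term query and a tiny thesaurus.
a1 = {"car": 1.0, "price": 0.6}
fr = {
    "car": {"car": 1.0, "automobile": 0.8},
    "price": {"price": 1.0, "cost": 0.7},
}
b1 = max_min_compose(a1, fr)
# b1 == {"car": 1.0, "automobile": 0.8, "price": 0.6, "cost": 0.6}
```

The expanded query keeps the original terms at their own grades and adds related terms at a grade capped by both the query weight and the synonymy degree.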

Trust
The concept of trust has been used widely by research communities in various domains, such as online recommender systems. Trust is modeled on the social phenomenon of trust relationships between people, transferred to an artificial world such as the web [47,48]. In [49] the general properties of trust in e-services were surveyed and analyzed; they are listed as follows:
• Trust is relevant to specific transactions only.
• Trust is a measurable belief.
• Trust is directed.
• Trust exists in time.
• Trust evolves in time, even within the same transaction.
• Trust between collectives does not necessarily distribute to trust between their members.
• Trust is reflexive.
• Trust is a subjective belief.
In [50], trust is defined as a measure of how reliably a partner profile has delivered accurate recommendations in the past. Two models of trust, profile-level and item-level, are used to generate reliable and accurate recommendations. Trust can be incorporated into the collaborative recommendation process through trust-based weighting and trust-based filtering, using either profile-level or item-level trust metrics. The use of trust values reduces the overall prediction error rate and therefore improves prediction accuracy.

Genetic Algorithms (GA)
The Genetic Algorithm is based on the theory of natural evolution. Problem solving with a genetic algorithm operates on a population of chromosomes that reproduce and evolve over generations to produce the fittest chromosome as the solution to the problem. A problem-specific fitness function is defined and used to evaluate the chromosomes, and the selection of chromosomes for reproduction is based on their fitness values. Chromosomes with high fitness values are selected and evolved using genetic operators such as crossover and mutation. Numerous methods exist for selecting chromosomes for reproduction, such as roulette-wheel selection, stochastic universal selection, ranking selection, tournament selection and truncation selection. The crossover operators used are k-point Crossover, Uniform Crossover, Uniform Order-Based Crossover, Order-Based Crossover and Partially Matched Crossover (PMX). The mutation operator changes the gene at a specific position in the chromosome, for example bit-wise (point) mutation. The replacement techniques used for replacing the current generation of the population with the next generation during evolution are elitist replacement, generation-wise replacement, steady-state-no-duplicates and steady-state replacement [51][52][53].
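A few of the operators named above can be combined into a minimal Python sketch of a genetic algorithm over rankings (permutations). It uses tournament selection, uniform order-based crossover, a swap mutation and elitism. The fitness function, which rewards placing high-trust documents near the top, is a hypothetical stand-in because no fitness function is given in this text. Generational replacement is used as a simplification, and all names, trust values and default rates are illustrative assumptions.

```python
import random


def tournament(population, fitness, k, rng):
    """Pick the fittest of k randomly drawn chromosomes."""
    return max(rng.sample(population, k), key=fitness)


def uniform_order_crossover(p1, p2, rng):
    """Keep p1's genes where a random mask is set; fill the rest in p2's order."""
    mask = [rng.random() < 0.5 for _ in p1]
    kept = {g for g, m in zip(p1, mask) if m}
    fill = iter([g for g in p2 if g not in kept])
    return [g if m else next(fill) for g, m in zip(p1, mask)]


def swap_mutation(chrom, rate, rng):
    """With probability `rate`, swap two positions so the ranking stays a permutation."""
    child = list(chrom)
    if len(child) > 1 and rng.random() < rate:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child


def optimize_ranking(docs, fitness, pop_size=20, generations=100,
                     crossover_rate=0.8, mutation_rate=0.25, seed=0):
    rng = random.Random(seed)
    population = [rng.sample(docs, len(docs)) for _ in range(pop_size)]
    for _ in range(generations):
        next_gen = [list(max(population, key=fitness))]  # elitism: keep the best unchanged
        while len(next_gen) < pop_size:
            p1 = tournament(population, fitness, 3, rng)
            p2 = tournament(population, fitness, 3, rng)
            if rng.random() < crossover_rate:
                child = uniform_order_crossover(p1, p2, rng)
            else:
                child = list(p1)
            next_gen.append(swap_mutation(child, mutation_rate, rng))
        population = next_gen
    return max(population, key=fitness)


# Hypothetical trust values and fitness: high-trust documents should rank first.
trust = {"d1": 0.9, "d2": 0.4, "d3": 0.7, "d4": 0.6}


def fitness(ranking):
    return sum(trust[d] / (pos + 1) for pos, d in enumerate(ranking))


best = optimize_ranking(list(trust), fitness)
```

Both crossover and mutation preserve the permutation property, so every chromosome remains a valid ranking of the same documents.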

Proposed Approach
An algorithm is designed for a web page recommender system using a hybrid of Fuzzy set, Trust and Genetic Algorithm, and is implemented in two phases, Phase I and Phase II. During Phase I, the keyword vectors of web query sessions are generated based on Information Scent and the content of the clicked URLs. The session keyword vectors are clustered, which groups clicked URLs with similar content into one cluster. The trust value of each clicked URL is initialized using Information Scent and is subsequently calculated from the frequency of clicks on the URL relative to the total number of times it is recommended during online web search. The Genetic Algorithm is executed on the clicked URLs in each cluster that are selected based on the trust threshold value, for rank optimization using a fitness function. In Phase II, the expanded fuzzy user input query is used to select a cluster based on a similarity measure, and the selected cluster generates the recommendation of an optimally ranked set of relevant documents. The stepwise execution of Phase I and Phase II is given below.

Phase I: Offline Preprocessing
1. Collection and preprocessing of web query sessions.
2. The global term-document relation W1 and the fuzzy thesaurus FR are constructed using (4).
3. The Information Scent metric of each clicked URL is calculated using (1).
4. The trust of a given clicked URL d in session i is initialized to its Information Scent isc_id, for all d ∈ 1..n1 and i ∈ 1..MQ.
5. The query keyword vector of each session is modeled using the trust and TF.IDF content of its clicked URLs, where n1 is the number of distinct clicked URLs in a given session i.
6. The query keyword vectors are clustered, and each cluster is represented by its mean keyword vector (cluster_mean).
7. The clicked count and recommended count of each clicked URL are initialized to zero for trust computation.
8. The trust value of a given cluster i is computed as Trust1(i) = |{ClickedURLj : Trust1(ClickedURLj) > e, ∀ j ∈ 1..n2}|, where n2 is the number of distinct clicked URLs in cluster i.
9. For each jth cluster, identify the list Lj of clicked URLs for which Trust1(clicked URL) > e.
10. Apply the Genetic Algorithm for optimal ranking on the list Lj of clicked URLs to generate the optimal ranking TOR1j (Trusted Optimal Ranking j).

The Genetic Algorithm applied in step 10 includes the following steps.
6. ... array are selected using tournament selection; elitism is also followed, copying the best chromosome (or a few best chromosomes) to the new population without mutation and crossover.
7. Uniform order-based crossover and single point mutation are applied to the chromosomes obtained in step 6 that were not selected by elitism.
8. The steady-state-no-duplicates replacement policy replaces the population of parent chromosomes with the child chromosomes created in steps 6 and 7 to generate the next generation of population P.
9. If the required number of n2 iterations has been completed, or the terminating condition holds (the difference between the optimal fitness values of the last 50 generations is less than the threshold value τ), go to step 10; otherwise, go to step 3.
10. On the terminating condition, the chromosome with the maximum fitness value is selected, and the optimally ranked list of docids of the m1 selected clicked URLs is stored in the ordered list TOR1j associated with cluster j.

In Phase II, the cluster j that best matches the expanded fuzzy input query is identified.
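The trust bookkeeping in steps 7 to 9 of Phase I can be sketched as follows. The update rule is an assumption: trust is taken as the ratio of the clicked count to the recommended count, falling back to the Information Scent based initial value before any recommendation has been made. The function names, URLs and trust values are hypothetical.

```python
def update_trust(clicked_count, recommended_count, initial_trust):
    """Assumed rule: clicks per recommendation, or the initial value if never recommended."""
    if recommended_count == 0:
        return initial_trust
    return clicked_count / recommended_count


def trusted_urls(trust_by_url, threshold):
    """List Lj: clicked URLs of a cluster whose trust exceeds the threshold (step 9)."""
    return [url for url, t in trust_by_url.items() if t > threshold]


def cluster_trust(trust_by_url, threshold):
    """Cluster trust as the number of clicked URLs above the threshold (step 8)."""
    return len(trusted_urls(trust_by_url, threshold))


# Hypothetical cluster with three clicked URLs and a trust threshold of 0.5.
cluster = {"u1": 0.8, "u2": 0.3, "u3": 0.55}
l_j = trusted_urls(cluster, 0.5)           # ["u1", "u3"]
trust_of_cluster = cluster_trust(cluster, 0.5)  # 2
```

Only the URLs in `l_j` would be passed to the rank optimization of step 10.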

Results & Discussion
The experiment was executed on web user query sessions. The architecture was developed on an i3 processor running Windows 8 with 120 GB RAM and provides a GUI for web search. During data set generation, the search query was used to retrieve Google search results, which were displayed with check boxes. The user clicks were stored in a database as web query sessions. The web pages were crawled using WebSPHINX, and their tf.idf content was processed for session keyword vector generation. The clustering and the initialization of trust for both clusters and clicked URLs were done using an agent implemented in JADE. The parameters used for execution of the genetic algorithm were MAXGeneration, Length(P) (the length of the population), crossover rate [0.6-0.8] and mutation rate [0.1-0.3]. GA was executed for 100 generations on the clusters using different mutation and crossover rates. The graph shows the maximum fitness value of the population at different mutation rates in the Academics, Entertainment and Sports domains, since the mutation operator induces exploration of the search space. The results in Fig 1 show that the fitness value converges to its maximum earlier as the mutation rate increases and then remains stable as the number of generations increases. The results were optimal at a (crossover rate, mutation rate) of [0.8, 0.25], with the threshold value of Information Scent (ρ) at 0.5 and the threshold value of Trust set to 0.5.

The t value for the paired difference of average precision was outside the 95% confidence interval, and web information retrieval based on Fuzzy Trust GA shows a high t value over both Fuzzy Trust IR and GA IR. Thus Fuzzy Trust GA IR provides an effective method for information retrieval based on web search personalization.
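For readers who want to reproduce this kind of comparison, the paired t statistic on per-query precision values can be computed as in the sketch below. The input lists are made-up placeholder numbers used only to exercise the function; they are not the results of this study, and the function name is hypothetical.

```python
import math
import statistics


def paired_t(a, b):
    """t statistic for the paired differences a[i] - b[i], with n - 1 degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n))


# Placeholder scores for two methods on the same four queries (illustrative only).
method_a = [5, 4, 6, 3]
method_b = [3, 3, 4, 2]
t_value = paired_t(method_a, method_b)
```

The statistic is then compared with the critical value of the t distribution at the chosen confidence level and n - 1 degrees of freedom.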

Conclusion and Future Scope
In this research an algorithm is designed using a hybrid of Fuzzy set, Trust and Genetic Algorithm for web search personalization. The dimensionality of the data set used for experimental evaluation was high and influenced the performance of the system. The research work proposed in this paper can be further improved using a feature reduction method.