DISCOVERY AND ANALYSIS OF JOB ELIGIBILITY AS ASSOCIATION RULES BY APRIORI ALGORITHM

In data mining, association rule mining is used to discover rules of the form X→Y from transaction databases, where X and Y are disjoint sets of attributes or items. Such a rule states that whenever the itemset X is present in a set of transactions, the itemset Y is also present in the same transactions. Support and confidence are two quantitative measures for association rules. Support denotes the number of occurrences of a rule in the whole database, while confidence is the conditional probability of the presence of the itemset Y given the presence of the itemset X. Association rules have numerous application areas. In this paper, association rule mining is applied to discover and analyze eligibility criteria for jobs from a large data set, so that career and professional goals can be chosen effectively. For this purpose the data are collected through a wide survey and are suitably prepared and modelled. The Apriori algorithm is then implemented to discover the frequent itemsets and the association rules. The discovered rules are classified by kind of job and also by kind of qualification. Finally, the discovered results are analyzed and interpreted, and the computational performance is examined.


I. INTRODUCTION
A huge amount of data about career and job opportunities is generated and widely available in public domains, namely on the internet, in newspapers, on social media and elsewhere. Prospective candidates must judiciously analyze such data to select better and more prosperous career options based on their academic and professional backgrounds, so as to maximize the scope and growth of their careers. It is observed that candidates are often unprepared to extend and maximize career opportunities at an early stage, because knowledge is not disseminated in a manner that allows various career options to be evaluated for future scope, mobility and growth before a particular option is taken up. This is a critical need in the present day, as career opportunities are diverse and wide ranging [1].

A. Employment Data Analytics
The data about employment are huge, unstructured, varying and growing continuously, and in the context of big data such data carry considerable complexity [1]. The patterns of eligibility qualifications required for various jobs and careers can be discovered using the association rule mining technique. Information services designed on this basis can guide prospective candidates toward sustainable career growth, since the patterns discovered from actual data are evidence of the successful careers of various people. Such analytics-based techniques are more meticulous and can even be useful for guiding public investment in higher education.

B. Association Rule Mining
The objective of data mining is to discover hidden patterns and rules from large databases that are nontrivial, interesting, previously unknown and potentially useful [2]. Association rule mining is one such technique. It is used to discover rules of the form X→Y from transaction databases, where X and Y are disjoint sets of attributes or items. The significance of an association rule is that the itemsets X and Y involved in the rule are found to occur together in a number of transactions; in turn, the occurrence of the itemset X influences the occurrence of the itemset Y. These occurrences are quantitatively measured by the parameters support and confidence. Computing association rules from a large transaction database is expensive, as the search space grows exponentially with the number of attributes or database items [3] [4]. Data transfer activity also increases, since association rule mining methods are iterative and require multiple database scans.
There are various research issues in algorithms for association rule mining. These include scalability; controlling the exponential growth of the search space; reducing I/O and multiple database scans; designing efficient internal data structures; optimizing computational overheads; and handling the increase in discovered rules as the number of items in the database grows.

B.1 Terminology and Notation
An itemset is defined as a set of items, and a k-itemset is an itemset of cardinality k. The number of times an itemset X occurs in a database is called its support count, denoted σ(X). An itemset is large, or frequent, if its support is not less than a pre-specified minimum support called minsup [6]. Lk denotes the set of large (frequent) itemsets of cardinality k.

B.2 Definition of Association Rule and Measures of Interestingness
The problem of mining association rules is defined in [5] [6] and is described below.
Consider a set of literals I = {i1, i2, i3, …, im}, where the literals i1, i2, …, im represent the database items. Let D be a transaction database consisting of a set of transactions. The items in each transaction T are drawn from the set I, so that T ⊆ I. Let X and Y be sets of database items, called itemsets, such that X ⊆ I and Y ⊆ I. The itemsets X and Y are said to be present in a transaction T if and only if X ⊆ T and Y ⊆ T, respectively. An association rule is then defined as X => Y, where X ⊆ I, Y ⊆ I and X ∩ Y = Φ. Such a rule holds with confidence c if c% of the transactions of the database D that contain X also contain Y. The support of the rule X => Y is s if the itemset X∪Y is present in s% of the transactions in D. Support and confidence are the most widely used measures of interestingness to indicate the quality of an association rule. Correlation, or lift, is another measure of interest used to express the relationships between the items in a rule [7].
The frequency of occurrence of the itemset X∪Y in the transactions of the database D is called the support count of the rule X => Y, denoted σ(X∪Y).
The confidence of a rule X => Y is calculated as the ratio σ(X∪Y)/σ(X) in terms of support counts, or support(X∪Y)/support(X) in terms of percentage support. In other words, the confidence of a rule X => Y is the conditional probability of the presence of the itemset Y in a transaction, given the presence of the itemset X in that transaction. Confidence indicates the strength of a rule: the higher the support and confidence, the stronger the rule. The objective of association rule mining is to discover all rules whose support and confidence are not less than the pre-specified threshold values of minimum support (minsup) and minimum confidence (minconf).
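As an illustration, the computation of support and confidence described above can be sketched as follows. This is a minimal example with hypothetical item codes (q1, q2 for qualifications, j1, j2 for jobs), not the actual codes of the paper's dataset:

```python
def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions containing `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Conditional probability of Y given X: sigma(X u Y) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

# Toy transaction database (each transaction is a set of item codes)
transactions = [
    {"q1", "q2", "j1"},
    {"q1", "j1"},
    {"q2", "j2"},
    {"q1", "q2", "j1"},
]

X, Y = {"q1"}, {"j1"}
print(support(X | Y, transactions))    # 3 of 4 transactions -> 0.75
print(confidence(X, Y, transactions))  # 3 of the 3 transactions with q1 -> 1.0
```

Here the rule {q1} => {j1} has support 75% and confidence 100%, since every transaction containing q1 also contains j1.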

C. Organization of the Paper
A model was proposed in [1] to discover eligibility for jobs as association rules. In this paper the proposed model is implemented using the Apriori algorithm to discover the association rules connecting academic and skill backgrounds with prospective career or job opportunities. This helps in finding the academic course(s) and skills necessary for various career options.
A data set is prepared from field data collected for this work. The programs are tested on synthetic data sets before being applied to the prepared data set. The results obtained are analyzed and interpreted, and the performance of the implementation of the algorithm is examined. The paper is organized as follows. In section 2, relevant works on association rule mining are discussed, and applications of the data mining approach to employment are also referenced. The approach of mining eligibility for jobs is presented in section 3, along with the details of the data collection and the preparation of the data for the implementation. The working of the Apriori algorithm on a sample data set for job eligibility is shown in section 4. The implementation details and the experimental set-up are given in section 5. In section 6 the computational results are plotted and analyzed. Finally, a conclusion with scope for future work is given.

II. RELATED WORKS
Data mining techniques have been applied to finding employment and career opportunities. An association rule based technique is applied to find eligibility for employment in [1] and [8]. A job recommender system is designed by applying a classification technique in [9]; this method takes the job preferences of the candidates into account. Data mining techniques are also used for personnel selection in the high-technology industry [10]. Prediction of students' employability is done by classification techniques in [11] and [12]; several classification techniques are compared in [12] to find the most suitable algorithm for predicting the employability of students. The employability of graduates in Malaysia is studied using a data mining based model [13]. Classification techniques are used to discover knowledge about skills and vacancies and to develop an appraisal management system [14] [15].
The Apriori algorithm [5] [6] is one of the major algorithms proposed for mining association rules in centralized databases. It is a level-wise algorithm, as it employs breadth-first search and needs a number of passes over the database equal to the cardinality of the largest frequent itemset discovered.
Reducing the number of database scans is one of the major concerns in designing algorithms for mining frequent itemsets and association rules. The Partition algorithm [16] reduces the number of database scans to two: the database is divided into small partitions such that each partition can be accommodated in main memory. The AS-CPA algorithm [18] also applies a partitioning technique and needs at most two scans of the database. In the DLG algorithm [17], itemsets are converted to memory-resident bit vectors using the TIDs, and frequent itemsets are then computed by logical AND operations on the bit vectors. In the Dynamic Itemset Counting (DIC) algorithm [19], itemsets of different cardinalities are counted simultaneously during the database scans. Pincer-Search [20], AllMFS [21] and MaxMiner [22] are algorithms for finding maximal itemsets.
In the Apriori based approaches the frequent itemsets are discovered after generating candidate itemsets. The FP-growth algorithm [23] changed this conceptual foundation: it discovers the frequent itemsets without generating candidate itemsets, using a prefix-tree based pattern-growth technique. The ECLAT algorithm [4] uses a vertical data format to discover frequent itemsets. Parallel algorithms have also been designed for association rule mining, but they incur additional overheads of data transfer and message passing [24]. Sampling based algorithms have been proposed to reduce I/O overhead and to control the explosive growth of the search space [25], but sampled data are not always a true representation of the actual data [4], so the discovered association rules may not have correct values of the measures of interestingness. Attempts have also been made to reduce the huge set of frequent itemsets and association rules by mining sets of frequent patterns with average inter-itemset distance [3] [26].
Various quality measures for data mining are discussed in [27] [28]. Reduction of number of discovered association rules using ontologies is proposed in [29].

III. DISCOVERY OF ELIGIBILITY FOR JOBS
In this paper the Apriori algorithm is implemented to discover the eligibility conditions for various jobs, in the form of association rules, from the collected employment data.

A. Data Collection and Preparation of Data set
The data for performing the experiments are collected through an exhaustive survey from various primary sources by interacting with various individuals as well as from recruitment advertisements of different organizations from their websites and various other print and electronic media.
The collected data are preprocessed for mining. The dataset is prepared with 83 different educational qualifications and 49 different job titles. Based on the educational qualifications required for the various jobs, the transaction dataset for association rule mining is prepared with qualification codes and a job code as the items, in such a way that the job code appears as the last item of each transaction. The item codes in each transaction are also kept sorted in ascending order. In this way the semantics of the transactions are prepared. The attributes of the data set are shown in tables 1 and 2 below. The Apriori algorithm is then implemented, with the T-tree as the internal data structure, to discover the association rules showing eligibility for jobs. From the transaction data set of eligibility for jobs, the frequent itemsets are generated with respect to a pre-specified value of minimum support. The association rules satisfying the pre-specified minimum support and minimum confidence are generated from the frequent itemsets in such a way that the name of the job occurs in the consequent and all the required qualifications appear in the antecedent, giving rules of the form X→Y where X represents the set of qualifications for the job represented by Y. This additional condition is applied to the consequents of the rules, and only the rules satisfying it are considered. The results are analyzed and interpreted, and conclusions are drawn.
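The transaction layout described above can be sketched as follows. The codes and the code ranges here are hypothetical (the actual dataset uses 83 qualification codes and 49 job codes, given in tables 1 and 2); the sketch only illustrates the ordering convention, with qualification codes sorted ascending and the job code placed last:

```python
# Hypothetical code tables; the real ones come from tables 1 and 2.
QUAL_CODE = {"BSc": 1, "MSc": 2, "BTech": 3}
JOB_CODE = {"Software Engineer": 101, "Lecturer": 102}

def make_transaction(qualifications, job):
    """Encode one record: qualification codes sorted ascending, job code last."""
    items = sorted(QUAL_CODE[q] for q in qualifications)
    items.append(JOB_CODE[job])
    return items

t = make_transaction(["MSc", "BSc"], "Lecturer")
print(t)  # [1, 2, 102]
```

Keeping the items sorted with the job code last gives every transaction the same semantics, so that rules with a job code in the consequent can be selected after mining.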
This approach helps the candidates to find suitable employment options based on their qualifications in the form of association rules.
Thereafter, the different sets of qualifications required for a job are identified from the large number of discovered rules by grouping the rules for the same job.
Candidates with a specific set of qualifications may also be eligible for different jobs. To capture this, the association rules with the same qualifications in the antecedent and different jobs in the consequent are identified and grouped.
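The two groupings described above can be sketched as follows, with hypothetical rules represented as (antecedent, consequent) pairs:

```python
from collections import defaultdict

# Hypothetical discovered rules: (qualification set, job) pairs.
rules = [
    (frozenset({"BSc", "MSc"}), "Lecturer"),
    (frozenset({"BTech"}), "Software Engineer"),
    (frozenset({"BSc", "MSc"}), "Research Assistant"),
    (frozenset({"BTech"}), "Network Engineer"),
]

def group_by_job(rules):
    """Qualification sets accepted for each job (rules grouped on consequent)."""
    groups = defaultdict(list)
    for antecedent, job in rules:
        groups[job].append(antecedent)
    return dict(groups)

def group_by_qualifications(rules):
    """Jobs reachable from each qualification set (rules grouped on antecedent)."""
    groups = defaultdict(list)
    for antecedent, job in rules:
        groups[antecedent].append(job)
    return dict(groups)

print(group_by_qualifications(rules)[frozenset({"BTech"})])
# ['Software Engineer', 'Network Engineer']
```

The first grouping answers "which qualification sets lead to this job?"; the second answers "which jobs are open to this qualification set?".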

IV. THE A PRIORI ALGORITHM
The Apriori algorithm uses the downward closure property, which states that any subset of a frequent itemset is also frequent [5] [6]. Consequently, if an itemset is found to be infrequent, there is no need to generate any of its supersets as candidates, because they will certainly be infrequent as well; this is the upward closure property, according to which any superset of an infrequent itemset is infrequent.
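The downward closure check can be sketched as follows. The contents of L3 here are illustrative (chosen so that {b, d, e} is infrequent, matching the pruning example used later in this section):

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Downward closure: if any (k-1)-subset of a k-candidate is not frequent,
    the candidate itself cannot be frequent and can be discarded without a scan."""
    k = len(candidate)
    return any(frozenset(s) not in frequent_prev
               for s in combinations(sorted(candidate), k - 1))

# Illustrative set of frequent 3-itemsets over items a..e
L3 = {frozenset(s) for s in [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
                             ("b", "c", "d"), ("c", "d", "e")]}

print(has_infrequent_subset(frozenset("bcde"), L3))  # True: {b, d, e} not in L3
print(has_infrequent_subset(frozenset("abcd"), L3))  # False: all 3-subsets in L3
```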
The algorithm applies a bottom-up search and, before reading the database at each level, prunes away many itemsets that are unlikely to be frequent. Candidate itemsets of a particular size are generated first, and the database is then scanned to count their supports and check whether they are large.
In the first scan of the database, all itemsets of size 1 are treated as candidate itemsets, denoted C1. From these, the large itemsets of size 1, denoted L1, are computed using the pre-specified threshold value of minimum support (minsup): the support counts of the candidates in C1 are obtained during the first scan, and the candidates meeting minsup form L1.
Let Ck denote the set of candidate itemsets of cardinality k. During scan k, the candidates of size k (Ck) are generated first, and the large (frequent) itemsets of size k (Lk) are then computed using the pre-specified minimum support (minsup). The large itemsets found in scan k are used to compute the candidate itemsets of size k+1 for the (k+1)th scan of the database. An itemset is considered a candidate only if all of its subsets are large.
To generate the candidates of size k, the set of frequent itemsets found in the previous pass, Lk-1, is joined with itself. A procedure called apriori-gen is used to generate the candidate itemsets for each pass after the first.
An itemset of size k can be combined with another itemset of the same size if they have k-1 items in common. At the end of the first scan, the candidate itemsets of size 2 (C2) are generated by combining every large itemset of size 1 with every other large itemset of size 1.
A subsequent pass, say pass k (k>1), consists of two phases. First, the frequent itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck using the apriori-gen procedure described below. Next, the database is scanned and the support of each candidate in Ck is counted. The set of candidate itemsets is subjected to a pruning process to ensure that all subsets of each candidate are frequent itemsets.
The candidate generation and pruning processes are discussed below. The Apriori algorithm assumes that the database is memory resident. The maximum number of database scans is one more than the cardinality of the largest frequent itemset.

Candidate Generation: Given Lk-1, the set of all frequent (k-1)-itemsets (k>1), it is required to generate the candidate k-itemsets, which are supersets of the frequent (k-1)-itemsets. It is known that if X is a large itemset then all subsets of X are also large. For example, given the set of large itemsets of size 3 (L3), candidate 4-itemsets are generated by combining the itemsets of L3, and in addition all the 3-itemset subsets of any candidate 4-itemset so generated must already be present in L3. The first condition, and part of the second, are handled by the apriori-gen candidate generation method.
Candidate sets that do not meet the second criterion are then removed by a pruning algorithm. Pruning eliminates those candidate itemsets for which not all subsets are large (frequent); that is, a candidate set is acceptable only if all of its subsets are frequent. For example, the itemset {b, c, d, e} is pruned from C4, since not all of its 3-subsets are in L3 (clearly, {b, d, e} is not in L3). The candidate generation and pruning algorithms are described below.
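The join and prune steps can be sketched together as follows. The contents of L3 are illustrative, chosen so that the candidate {b, c, d, e} is generated and then pruned as in the example above:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join step: combine (k-1)-itemsets whose union has size k;
    prune step: drop candidates having any infrequent (k-1)-subset."""
    L_prev = set(L_prev)
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            union = a | b
            if len(union) == k:
                candidates.add(frozenset(union))
    # Prune: every (k-1)-subset of a surviving candidate must itself be frequent
    return {c for c in candidates
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

# Illustrative frequent 3-itemsets over items a..e
L3 = {frozenset(s) for s in [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
                             ("b", "c", "d"), ("c", "d", "e")]}

C4 = apriori_gen(L3, 4)
print(C4)  # only {a, b, c, d} survives; {b, c, d, e} is pruned ({b, d, e} not in L3)
```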
These two functions, candidate generation and pruning, are used by the Apriori algorithm in every iteration. The algorithm moves upward in the lattice, starting from level 1 up to the level k at which no candidate set remains after pruning.
The Apriori algorithm is applied to the prepared experimental dataset and various results are obtained; these are presented in the next section.
Below, the working of the Apriori algorithm is shown on the sample dataset of Table 5 for the discovery of the association rules mining the eligibility criteria for jobs. The pre-specified minimum support is assumed to be minsup = 1/30 = 0.033 = 3.3%, i.e. a minimum support count of 1.
The attributes (items) of the sample data set for this example are drawn from tables 1 and 2 above, based on the survey carried out, and are shown in tables 3 and 4 respectively. The sample data set itself is shown in table 5 below. The Apriori algorithm proceeds as follows:

Read the database to count the support of C1 to determine L1;
L1 := {frequent 1-itemsets};
k := 2; // k represents the pass number
while (Lk-1 ≠ Φ) do
begin
    Ck := gen_candidate_itemsets(Lk-1);
    prune(Ck);
    for all transactions t ∈ T do
        increment the count of all candidates in Ck contained in t;
    Lk := all candidates in Ck with minimum support;
    k := k + 1;
end;
Answer := ∪k Lk;

Step 1: Generation of candidate 1-itemsets (C1).
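The level-wise procedure in the pseudocode above can be sketched as a runnable function. This is a minimal sketch on a toy transaction list with hypothetical item codes (job codes 101, 102), not the paper's T-tree based Java implementation:

```python
from collections import defaultdict
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise Apriori: one pass per level, counting candidate supports
    against the transaction list. Returns {frequent itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count all 1-itemsets (C1) and keep those meeting min_count (L1)
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    L = {i: c for i, c in counts.items() if c >= min_count}
    frequent = dict(L)
    k = 2
    while L:
        prev = set(L)
        # Join: unions of previous-level frequent itemsets with cardinality k
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset must be frequent (downward closure)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(L)
        k += 1
    return frequent

# Toy job-eligibility transactions (hypothetical codes, job code last)
db = [[1, 2, 101], [1, 101], [2, 102], [1, 2, 101]]
freq = apriori(db, min_count=2)
print(freq[frozenset({1, 2, 101})])  # 2
```

With a minimum support count of 2, the itemset {1, 2, 101} is frequent (it occurs in two transactions), from which the rule {1, 2} → 101 can later be formed.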

V. IMPLEMENTATION
The Apriori algorithm is implemented on a PC with a Core 2 Duo processor, 1 GB RAM and a 100 GB hard disk, running Windows. The programs are written in Java (JDK 1.2) with the T-tree as the underlying data structure.
The data are collected through a survey conducted among employed people in various organizations, and the data set is then prepared in the transaction format. The steps of the experimentation are as follows:
1. The frequent itemsets with respect to a pre-specified minimum support (minsup) are discovered. Then only those frequent itemsets having one item belonging to the list of jobs in table 2 are considered. (No transaction has two items from table 2, the list of jobs with job codes.)
2. The association rules are generated with respect to a pre-specified minimum support (minsup) and minimum confidence (minconf), and only those discovered rules are considered which have a single item in the consequent belonging to the list of jobs in table 2. These rules are analyzed to find the different sets of qualifications required for a job at the pre-specified support and confidence.
3. The computing time for the discovery of the rules is plotted for different values of support on a data set of the same size.
4. The frequent itemsets, the frequent itemsets having a job code as their last item, the association rules, and the association rules whose consequent is a job code are discovered from the job eligibility dataset by varying the pre-specified minimum support threshold (minsup) at a fixed minimum confidence (minconf) of 1%, and the corresponding graphs are plotted.
5. The same four quantities are discovered by varying the pre-specified minimum confidence threshold (minconf) at a fixed minimum support (minsup) of 3%, and the corresponding graphs are plotted.
6. The scalability of the algorithm is tested by varying the size of the dataset at specific values of pre-specified support and confidence, and computing and plotting the computational time of the implementation.
7. After mining, the association rules are classified in order to analyze the qualifications required for various jobs.
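The filtering in steps 1 and 2 above can be sketched as follows. The job-code range is hypothetical (here, codes of 100 and above are treated as job codes; the real dataset uses the 49 job codes of table 2):

```python
# Hypothetical convention: item codes >= 100 are job codes,
# lower codes are qualification codes.
JOB_CODES = set(range(100, 149))

def job_rules(frequent_itemsets):
    """From each frequent itemset containing exactly one job code, form the
    rule  qualifications -> job  (the filtering of steps 1 and 2)."""
    rules = []
    for itemset in frequent_itemsets:
        jobs = itemset & JOB_CODES
        if len(jobs) == 1:
            (job,) = jobs
            rules.append((itemset - JOB_CODES, job))
    return rules

freq = [frozenset({1, 2, 101}), frozenset({1, 2}), frozenset({3, 102})]
print(job_rules(freq))  # [(frozenset({1, 2}), 101), (frozenset({3}), 102)]
```

Itemsets without a job code, such as {1, 2} above, are discarded, since they cannot yield a rule whose consequent is a job.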

A. The Experimental Set up
The experimental set-up after the collection of data is depicted in figure 1 and has the following steps:
1. Prepare the transaction data set for computing eligibility for jobs.
2. Implement the a priori algorithm with T -Tree as the internal data structure.
3. Discover the association rules.
4. Classify the rules and conclude the discovery with interpretation.

VI. EXPERIMENTAL RESULTS
Several results are obtained from the experiments performed and these are described and analyzed in this section.

A. Experimental Result 1
Frequent itemsets and association rules are discovered from the job eligibility dataset by varying the pre-specified minimum support threshold (minsup) at a fixed value of minimum confidence (minconf). The frequent itemsets whose last item belongs to the list of jobs are considered relevant, and their number at each value of minsup is found. These frequent itemsets are then plotted against all frequent itemsets at different values of minsup for a specific value of minconf (Figure 2). Similarly, the association rules having only a job code as consequent are considered relevant; their number at each value of minsup is found and plotted against all discovered association rules (Figure 3). The execution times are also noted and plotted against the various values of minsup (Figure 4). Table 11 shows the frequent itemsets discovered for various values of pre-specified minimum support (minsup) at a fixed minimum confidence (minconf) of 1%. Only the frequent itemsets whose last item is a job code in the list of jobs are then considered, and similarly only the association rules with a job code as consequent are retained. As can be seen from table 11, the lower the value of the pre-specified minimum support, the larger the number of association rules discovered. Table 11: Frequent itemsets, frequent itemsets having a job code as last item, association rules, and association rules whose consequent is a job code, discovered from the job eligibility dataset by varying the pre-specified minimum support threshold (minsup) at a fixed confidence (minconf) of 1%.

B. Experimental Result 2
The frequent itemsets and association rules at varying pre-specified minimum confidence (minconf) and fixed pre-specified minimum support (minsup) are discovered from the job eligibility dataset. The number of frequent itemsets having a job code as last item and the number of association rules whose consequent is a job code are then obtained by varying the minconf threshold at the fixed minsup. The execution times are recorded. These results are shown in Table 12 and plotted in figures 5 and 6 respectively. Table 12: Frequent itemsets, frequent itemsets having a job code as last item, association rules, and association rules whose consequent is a job code, discovered from the job eligibility dataset by varying the pre-specified minimum confidence threshold (minconf) at a fixed minimum support (minsup) of 3%.
Figure 5: Variation of the number of association rules, and of the association rules whose consequent is a job code, with the pre-specified minimum confidence threshold (minconf) at a fixed minimum support (minsup) of 3%.
Figure 6: Execution time (seconds) with variation in the pre-specified minimum confidence threshold (minconf) at a fixed minimum support (minsup) of 3%.

C. Experimental Result 3
Table 13 shows some of the discovered association rules having only a single item in the consequent, belonging to the list of jobs in Table 2 of the job eligibility dataset, together with their meaning, support and confidence. It is not possible to show all such discovered rules for the different combinations of support and confidence; here only the association rules whose consequent is a job code are shown, at a fixed pre-specified minimum support (minsup) of 3% and a minimum confidence threshold (minconf) of 1%. Similar rules can be discovered for other values of the pre-specified minimum support. Rules containing only a single qualification in their antecedent for the job specified by the consequent are also informative to a certain extent. All such rules are discovered because they satisfy the pre-specified thresholds on minimum support and confidence.
Formally, if there are association rules X1→Y, X2→Y, …, Xn→Y and X→Y such that X1 ⊂ X, X2 ⊂ X, …, and Xn ⊂ X, then only the rule X→Y is considered, as it gives the complete information. The rules X1→Y, X2→Y, …, Xn→Y give only partial information, with their own support and confidence. The rules obtained on the basis of this criterion are shown in Table 14 below.
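The criterion above, which retains only the rule with the maximal antecedent for each job, can be sketched as follows with hypothetical rules:

```python
def maximal_rules(rules):
    """Keep, for each job Y, only rules X -> Y whose antecedent X is not a
    proper subset of another antecedent for the same Y; the subsumed rules
    X1 -> Y, ..., Xn -> Y carry only partial information."""
    keep = []
    for X, Y in rules:
        if not any(X < X2 for X2, Y2 in rules if Y2 == Y):
            keep.append((X, Y))
    return keep

# Hypothetical rules: q1 -> j1 and q2 -> j1 are subsumed by {q1, q2} -> j1
rules = [
    (frozenset({"q1"}), "j1"),
    (frozenset({"q2"}), "j1"),
    (frozenset({"q1", "q2"}), "j1"),
    (frozenset({"q3"}), "j2"),
]

print(maximal_rules(rules))
# [(frozenset({'q1', 'q2'}), 'j1'), (frozenset({'q3'}), 'j2')]
```

Only the rule with the complete qualification set survives for job j1; the rule for j2 is unaffected, since no other rule for j2 has a larger antecedent.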

D. Experimental Result 4
A classification of the association rules discovered after mining, done to analyze the background qualifications required for various jobs, is shown in Table 15.

VII. DISCUSSION AND CONCLUSION
In this paper the Apriori algorithm is implemented with the T-tree to discover the eligibility qualifications required for jobs. A survey was undertaken to collect and prepare the data about employment.
As future work, association and classification, or association and clustering, techniques may be applied in a hybrid manner to further refine the results. Rules can also be grouped by similarity using clustering.
All association rules that show the same eligibility qualifications for different jobs are grouped together under a classification scheme, and the rules can be ranked on the basis of their support and confidence. This classification of rules helps in determining the groups of jobs for which the same eligibility criteria are required.