Mining Patterns for Career path based on Innate Talents
Selecting an appropriate career path is one of the most important decisions in an individual’s life. Many people end up in a profession that they neither enjoy nor can leave, for reasons such as their financial situation, family pressure, dependence on a single source of income, the cost of education and the vast range of career options available. A student may therefore select the wrong career option, and the consequence of that decision can be job dissatisfaction. The motive behind this research is to identify the career path that best fits an individual’s personality and working environment, leading to positive outcomes such as job satisfaction, by using an appropriate data mining technique together with Holland’s validated theory, which is one of the most popular models used for career personality tests. In addition to personality, three other factors are obtained. Thus, to find the “Intersection”, four factors are considered: personality traits, interests, market trends and pay scales. The proposed system would help students to select an appropriate career path based on their personality traits by matching their “three-letter code” with the codes of employees.
In today’s world, recommending careers to college students is a herculean task.
Career awareness among students is very low. Some students do not know their own abilities. Some choose a career because a friend has chosen it, or because a guardian pushes them towards one without knowing their actual interests, strengths and abilities in that area. Some parents press their children to fulfil a dream that the parents themselves had in childhood. As a result, students can suffer for their whole lives.
To help students in such situations, career counselling organizations have been set up. They provide guidance about careers but do not analyse the abilities of the students, so students are left to choose a career on their own. The same problem then recurs: students do not know their actual interests, abilities and strengths.
To overcome this situation, this project aims to evaluate patterns by applying data mining techniques to employees’ data, which would help students to select an appropriate career path based on personality traits, interests, market trends and pay scales.
II. RELATED WORK
Various websites and web applications on the internet help students to find a suitable career path. However, most of these systems use personality traits as the only factor for predicting a career, which might give inconsistent answers. Similarly, a few sites suggest a career based only on the interests of the student. These systems do not consider market trends and pay scales, which also affect job satisfaction. None of the systems considers all four factors, namely personality traits, interests, market trends and pay scales.
Also, the course suggestions provided by these systems are very general. For example, the results of a few systems were a group of options such as data analyst, accountant and law. A student who receives such a recommendation might become confused again, because these options belong to different streams.
The paper by Elakia, Gayathri, Aarthi and Naren J suggests suitable career options for high school students based on each student’s interests, skills, likes, hobbies, etc. The authors consider “discipline” an important factor for continuing higher studies and pursuing a career; hence the chance of a student becoming violent in the future is also predicted. The main objective of the paper by Avinsh Kumar, Akshat Gawankar, Kunal Borge & Mr Nilesh M Patil is to provide an overview of the data mining algorithms that have been used to predict student profile and personality. They created an online survey system that helps students to make career choices and understand their personality traits. Another paper, by Gentaneh Berie Tarekegn & Dr. Vuda Sreenivasarao, uses data mining techniques to analyse students’ entrance exam results in order to predict their placement into departments. The paper by Nikita Gorad, Ishani Zalte, Aishwarya Nandi & Deepali Nayak recommends a career option to the student based on personality traits, interests and the capacity to take up the course. Lokesh S. Katore, Bhakti S. Ratnaparkhi & Dr. Jayant S. Umale developed a career recommendation system that recommends careers to students based on their personality traits. The paper by Ms. Roshani Ade & Dr. P.R. Deshmukh proposes an incremental ensemble of classifiers, in which the hypotheses from a number of classifiers are combined and the final result is determined by the ‘Majority voting rule’.
The basic idea of this research is to acquire data from employees and to evaluate patterns in that data. From the evaluated patterns, certain careers can be suggested to students. To evaluate patterns from the employees’ data, four factors are considered: personality traits, interests, market trends and pay scales.
Figure 1: Four factors
1. Personality traits: Holland's six personality types are used here as the personality traits. According to Holland’s theory of career choice, most people resemble one of six personality types:
- Realistic (R)
- Investigative (I)
- Artistic (A)
- Social (S)
- Enterprising (E)
- Conventional (C)
Thus, using these personality types, different careers will be classified. Here, 42 questions are asked to evaluate personality traits. The “three-letter code” is formed from the three personality types with the highest scores. This “three-letter code” is then matched against a set of predefined professions: if the respondent’s profession matches the code, “Yes” is returned in the “P-E fit” field, otherwise “No”. Thus, the first factor, named “P-E fit”, is evaluated.
2. Interest: Interest in this context means asking employees whether or not they are doing an interest-based job. If the answer is “yes”, their data are considered for pattern evaluation; if “no”, the entry is ignored. The aim is to suggest careers on the basis of employees’ data, and an employee who is not satisfied with his/her job is not a good match for it, so a job that its holder is not even interested in cannot reasonably be suggested to students. It is therefore mandatory to verify the data used for suggesting career paths to students. Thus, the second factor, named “Interest based”, is evaluated.
3. Market trend: The top trending jobs in the market are taken into consideration. The labour market is changing rapidly. No one can be sure of what will happen in the future, but some trends in the labour market do give clues about what is likely to happen. When making decisions about education or a career, it is important to understand these trends and to make good choices based on this information. For the purposes of this research, it is assumed that “Travel agent” is not a trending job, as the internet has turned vacationers into their own travel agents. Websites such as Kayak and Expedia, and web applications such as MakeMyTrip, Trivago and TripAdvisor, enable travellers to book flights, cruises and hotel rooms with ease, so the demand for travel agents is assumed to have fallen. So, if the respondent is a travel agent, “No” is returned in the “Trending job” field, otherwise “Yes”. Thus, the third factor, named “Trending job”, is evaluated.
4. Pay scale: A pay scale (also known as a salary structure) is a system that determines how much an employee is to be paid as a wage or salary, based on one or more factors such as the employee's level, rank or status within the employer's organization, the length of time that the employee has been employed, and the difficulty of the specific work performed. To evaluate the fourth factor, named “Pays well”, it is assumed that 10,000 should be the minimum salary for an employee working in any field: if the salary is less than 10,000, “No” is returned in the “Pays well” field, otherwise “Yes”.
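The factor rules described above can be sketched in Python (the study itself used MS Excel and R, so this is only an illustration). Several details are assumptions that the text does not state: each answer is a (type letter, score) pair, ties between types are broken in R-I-A-S-E-C order, a “match” means exact equality of codes, and the `PROFESSION_CODES` lookup holds made-up values. The fields “Interest based” is read directly from the survey answer, so it needs no rule here.

```python
from collections import Counter

RIASEC = "RIASEC"  # Realistic, Investigative, Artistic, Social, Enterprising, Conventional

# Hypothetical lookup of professions to Holland codes (illustrative values only).
PROFESSION_CODES = {"photographer": "ARE", "bank clerk": "CES"}
NON_TRENDING_JOBS = {"travel agent"}  # the only non-trending job assumed in the text
MIN_SALARY = 10_000  # threshold from the text; currency and pay period are not stated


def three_letter_code(answers):
    """answers: iterable of (type_letter, score) pairs, one per survey question."""
    totals = Counter({letter: 0 for letter in RIASEC})
    for letter, score in answers:
        totals[letter] += score
    # Highest total first; ties fall back to RIASEC order (an assumption).
    ranked = sorted(RIASEC, key=lambda t: (-totals[t], RIASEC.index(t)))
    return "".join(ranked[:3])


def pe_fit(code, profession):
    """"Yes" when the respondent's code equals the code listed for the profession."""
    return "Yes" if PROFESSION_CODES.get(profession) == code else "No"


def trending_job(job_title):
    return "No" if job_title.strip().lower() in NON_TRENDING_JOBS else "Yes"


def pays_well(salary):
    return "Yes" if salary >= MIN_SALARY else "No"


# Made-up respondent: strongest in Artistic, then Realistic, then Enterprising.
answers = [("A", 1)] * 6 + [("R", 1)] * 5 + [("E", 1)] * 4 + [("S", 1)] * 2
code = three_letter_code(answers)  # "ARE"
```

With these made-up answers, `pe_fit(code, "photographer")` gives “Yes”, `trending_job("Travel agent")` gives “No”, and `pays_well(12_000)` gives “Yes”.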
Figure 2: Implementation steps
Step 1. Data collection (using Google Forms and a spreadsheet): The first step of the implementation was to collect data from employees working in different fields. For this purpose, an online survey was conducted using Google Forms. The survey contained 42 questions on personality traits, plus two more questions about interest and income. The data were collected from employees working in various job sectors, such as State Bank of India (Modasa), Union Bank (Gandhinagar), Travel Infoline (Ahmedabad), Institute for Photography Excellence (Ahmedabad), inifd (Gandhinagar), District court (Gandhinagar), Rajshree Studio (Idar), Torrent Pharmaceuticals Limited (Mehsana) and Nootan Vidyalaya (Kadi). Because the survey was a Google Form, the link was also shared with friends and family members, who were asked to fill it in and forward it to their groups.
Figure 3: Google form sample
Step 2. Download as MS Excel: The responses were downloaded as an MS Excel (.xlsx) file.
Figure 4: Raw dataset
Step 3. Pre-processing (in Excel): The data obtained from the survey then had to be pre-processed and consolidated in MS Excel into the common format required by the system. Based on the answers given by the employees, a three-letter code was generated for each individual.
For example, a person with the code RIA most resembles the Realistic type, resembles the Investigative type somewhat less, and the Artistic type less still. The types that are not in the code are the types the person resembles least of all. Most people, and most jobs, are some combination of two or three of the Holland interest areas. Using this data, “P-E fit”, “Interest based”, “Trending job” and “Pays well” were determined, and then “Intersection” was calculated from all four factors: if the values of all four factors are “Yes”, the value of the “Intersection” field is “Yes”, otherwise “No”. Thus, the target attribute, named “Intersection”, is evaluated.
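The “Intersection” rule is a logical AND over the four factor fields. A minimal Python sketch follows (the study performed this step in MS Excel; the record layout and the sample records are assumptions made for illustration).

```python
FACTORS = ("P-E fit", "Interest based", "Trending job", "Pays well")


def intersection(record):
    """Return "Yes" only when every factor field of the record is "Yes"."""
    return "Yes" if all(record[f] == "Yes" for f in FACTORS) else "No"


# Made-up records for illustration; they are not taken from the survey data.
good = {"P-E fit": "Yes", "Interest based": "Yes", "Trending job": "Yes", "Pays well": "Yes"}
low_pay = {**good, "Pays well": "No"}
```

Here `intersection(good)` is “Yes”, while `intersection(low_pay)` is “No” because a single “No” is enough to fail the rule.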
Figure 5: Pre-processed dataset
Step 4. DM Tool (RStudio): RStudio is an open-source “Integrated development environment (IDE)” for R, used here to apply data mining algorithms to the data collected from the users.
R is the “Programming language”, while RStudio is a “Platform” that helps you develop programs in R.
You can use R without RStudio, but you cannot use RStudio without R, so R comes first.
Step 5. DM Algorithm: Data mining is all about extracting patterns from an organization's stored or warehoused data. These patterns can be used to gain insight into aspects of the organization's operations, and to predict outcomes for future situations as an aid to decision-making. 
A. Decision tree algorithm:
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node. 
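The structure just described can be sketched as a small Python data type. The class and function names are hypothetical, and the example tree is hand-built for illustration rather than learned from the survey data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    attribute: Optional[str] = None  # attribute tested at an internal node
    children: Dict[str, "Node"] = field(default_factory=dict)  # one branch per test outcome
    label: Optional[str] = None  # class label held by a leaf node


def classify(node, record):
    """Follow the branches from the root until a leaf is reached."""
    while node.label is None:
        node = node.children[record[node.attribute]]
    return node.label


# Hand-built tree: the root tests "Pays well", then "Interest based".
tree = Node(attribute="Pays well", children={
    "No": Node(label="No"),
    "Yes": Node(attribute="Interest based", children={
        "No": Node(label="No"),
        "Yes": Node(label="Yes"),
    }),
})
```

For example, `classify(tree, {"Pays well": "Yes", "Interest based": "Yes"})` follows two branches and ends at the leaf labelled “Yes”.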
1. ID3: In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan, used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.
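ID3 chooses, at each node, the attribute with the highest information gain, i.e. the largest reduction in entropy of the class labels. The Python sketch below shows only that selection measure, not the full recursive algorithm; the rows are made-up examples.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Made-up rows for illustration only.
rows = [
    {"Pays well": "Yes", "Trending job": "Yes", "Intersection": "Yes"},
    {"Pays well": "Yes", "Trending job": "No", "Intersection": "No"},
    {"Pays well": "No", "Trending job": "Yes", "Intersection": "No"},
    {"Pays well": "No", "Trending job": "No", "Intersection": "No"},
]
gain = information_gain(rows, "Pays well", "Intersection")  # about 0.311 bits
```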
2. C4.5: C4.5 is an algorithm developed by Ross Quinlan that is used to generate a decision tree. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. The authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date".
It became quite popular after ranking #1 in the “Top 10 Algorithms in Data Mining” paper published in the Springer journal Knowledge and Information Systems in 2008.
Improvements over the ID3 algorithm:
C4.5 made a number of improvements to ID3. Some of these are:
- Handling both continuous and discrete attributes - In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.
- Handling training data with missing attribute values - C4.5 allows attribute values to be marked as “?” for missing. Missing attribute values are simply not used in gain and entropy calculations.
- Handling attributes with differing costs.
- Pruning trees after creation - C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes.
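The threshold split for continuous attributes can be sketched as follows in Python. This is a simplification under stated assumptions: candidate thresholds are taken as midpoints between consecutive sorted values and are scored by plain information gain, whereas C4.5 itself also uses the gain ratio. The salary values are made up.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value <= threshold` with the highest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # no threshold can separate equal values
        t = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best[1]:
            best = (t, gain)
    return best


# Made-up salaries and labels for illustration only.
salaries = [8000, 9000, 12000, 15000]
labels = ["No", "No", "Yes", "Yes"]
threshold, gain = best_threshold(salaries, labels)  # threshold 10500.0 separates the classes
```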
3. C5.0: C5.0 is a widely used decision tree method. It can provide a set of rules that is easy to understand. The C5.0 algorithm is designed to cope with noise and missing data, and it addresses the problems of overfitting and error pruning. In classification, the C5.0 classifier can estimate which attributes are relevant and which are not.
Improvements in C5.0 algorithm:
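C5.0 offers a number of improvements on C4.5. Some of these are:
- Speed - C5.0 is significantly faster than C4.5
- Memory usage - C5.0 is more memory efficient than C4.5
- Smaller decision trees - C5.0 gets similar results to C4.5 with considerably smaller decision trees.
- Support for boosting - Boosting improves the trees and gives them more accuracy.
- Weighting - C5.0 allows you to weight different cases and misclassification types.
- Winnowing - a C5.0 option automatically winnows the attributes to remove those that may be unhelpful.
Adaptive boosting involves building several models that “vote” on how to classify an example. To use it, add the ‘trials’ parameter to the code. The ‘trials’ parameter sets the upper limit on the number of models that R will build if necessary.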
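The voting idea can be illustrated with a generic discrete AdaBoost over one-feature “stumps”, written in Python. This is not C5.0's own boosting procedure, which differs in detail; the function names and the toy data are assumptions, and `trials` is used here only in the same sense as in the text, as an upper limit on the number of models.

```python
import math
from itertools import product


def stump_predict(stump, x):
    """A stump tests one binary feature: (feature index, polarity)."""
    feature, polarity = stump
    return polarity if x[feature] == 1 else -polarity


def boost(X, y, trials):
    """Discrete AdaBoost over one-feature stumps; labels in y are +1 or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    stumps = [(j, p) for j in range(len(X[0])) for p in (1, -1)]
    ensemble = []  # (vote weight, stump) pairs

    def weighted_error(stump):
        return sum(w for w, x, t in zip(weights, X, y) if stump_predict(stump, x) != t)

    for _ in range(trials):  # `trials` is an upper limit; the loop may stop early
        best = min(stumps, key=weighted_error)
        err = weighted_error(best)
        if err >= 0.5:  # no stump beats chance on the reweighted cases
            break
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, best))
        # Raise the weight of misclassified cases, lower the rest, then normalise.
        weights = [w * math.exp(-alpha * t * stump_predict(best, x))
                   for w, x, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data: the label is +1 when at least two of three binary features are 1.
X = list(product((0, 1), repeat=3))
y = [1 if sum(x) >= 2 else -1 for x in X]


def accuracy(trials):
    model = boost(X, y, trials)
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(X)
```

On this toy data a single stump classifies 6 of the 8 cases correctly, while three boosted stumps vote their way to all 8, which is the effect that raising `trials` is meant to achieve.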
4. CART: Classification and Regression Trees (CART) split attributes on the values that minimize a loss function, such as the sum of squared errors.
CART is a non-parametric decision tree learning technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively.
Decision trees are formed by a collection of rules based on variables in the modelling data set:
- Rules based on variables' values are selected to get the best split to differentiate observations based on the dependent variable.
- Once a rule is selected and splits a node into two, the same process is applied to each "child" node (i.e. it is a recursive procedure).
- Splitting stops when CART detects that no further gain can be made, or when some pre-set stopping rules are met. (Alternatively, the data are split as much as possible and then the tree is later pruned.)
- Each branch of the tree ends in a terminal node. Each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules.
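A single CART-style split for a numeric target can be sketched in Python as a search for the threshold that minimizes the sum of squared errors of the two child nodes. Only one split is shown; a full tree would apply the same search recursively to each child. The variable names and values are made up for illustration.

```python
def sse(values):
    """Sum of squared errors around the mean of the values."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, loss) of the binary split on x that minimises total SSE of y."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal x values cannot be separated
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        loss = sse(left) + sse(right)
        if loss < best[1]:
            best = (t, loss)
    return best


# Made-up data: years of experience against salary (in thousands).
experience = [1, 2, 3, 10, 11, 12]
salary = [10, 11, 12, 30, 31, 32]
split = best_split(experience, salary)  # (6.5, 4.0)
```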
5. Random Forest: Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. 
Random Forest is a variation on bagging of decision trees in which the attributes available at each decision point are restricted to a random sub-sample. This increases the variation between individual trees, so more trees are required.
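The three ingredients named above (bootstrap samples, a random sub-sample of attributes, and a majority vote) can be sketched in Python. To keep the example short, each “tree” is only a one-level stump, which is a deliberate simplification; the function names and the toy rows are assumptions.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """The mode of the predicted classes."""
    return Counter(predictions).most_common(1)[0][0]


def bootstrap_sample(rows, rng):
    """Sample len(rows) rows with replacement (bagging)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(sample, candidate_features, target):
    """One-level tree: split on the candidate feature with the fewest training errors."""
    def split(feature):
        groups = {}
        for row in sample:
            groups.setdefault(row[feature], []).append(row[target])
        return {value: majority_vote(labels) for value, labels in groups.items()}

    def errors(feature):
        mapping = split(feature)
        return sum(mapping[row[feature]] != row[target] for row in sample)

    feature = min(candidate_features, key=errors)
    mapping = split(feature)
    default = majority_vote([row[target] for row in sample])
    return lambda row: mapping.get(row[feature], default)


def fit_forest(rows, features, target, n_trees, k, seed=0):
    """Each tree sees a bootstrap sample and only k randomly chosen features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        subset = rng.sample(features, k)
        forest.append(fit_stump(sample, subset, target))
    return forest


def forest_predict(forest, row):
    return majority_vote([tree(row) for tree in forest])


# Toy rows in which either feature predicts the target; illustration only.
yes_row = {"Pays well": "Yes", "Interest based": "Yes", "Intersection": "Yes"}
no_row = {"Pays well": "No", "Interest based": "No", "Intersection": "No"}
rows = [yes_row] * 4 + [no_row] * 4
forest = fit_forest(rows, ["Pays well", "Interest based"], "Intersection", n_trees=25, k=1)
```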
6. Ctree: Ctree stands for “Conditional Inference Tree”. It is a statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting. This approach results in unbiased predictor selection and does not require pruning.
Ctree is a non-parametric class of regression trees embedding tree-structured regression models into a well-defined theory of conditional inference procedures. It is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored as well as multivariate response variables and arbitrary measurement scales of the covariates. 
B. Neural Network:
An Artificial Neural Network, often just called a neural network, is a mathematical model inspired by biological neural networks. Neural networks are used to model complex relationships between inputs and outputs or to find patterns in data. 
A neural network is a model characterized by an activation function, which is used by interconnected information-processing units to transform input into output. A neural network is often compared to the human nervous system: information is passed through interconnected units, analogous to the passage of information through neurons in humans. The first layer of the neural network receives the raw input, processes it and passes the processed information to the hidden layers. The hidden layers pass the information to the last layer, which produces the output. The advantage of a neural network is that it is adaptive in nature: it learns from the information provided, i.e. it trains itself on data with a known outcome and optimizes its weights for better prediction in situations with an unknown outcome.
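The flow of information from the input layer through a hidden layer to the output can be sketched in Python as a forward pass. The sigmoid activation, the layer sizes and the weights are assumptions chosen for illustration, and training (the optimization of the weights) is not shown.

```python
import math


def sigmoid(z):
    """Activation function that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Each unit applies the activation to a weighted sum of its inputs plus a bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, out_w, out_b)


# Four inputs (e.g. the four factors encoded as 1/0), two hidden units, one output.
x = [1, 0, 1, 1]
untrained = forward(x, [[0.0] * 4, [0.0] * 4], [0.0, 0.0], [[0.0, 0.0]], [0.0])
```

With all weights at zero the network has learned nothing, so the output is 0.5, the midpoint of the sigmoid; training would move the weights so that the output approaches the known outcome.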
C. Naïve Bayes:
The Naive Bayesian classifier is based on Bayes’ theorem with an assumption of independence between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used; it can outperform more sophisticated classification methods.
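A minimal Python sketch of a Naive Bayesian classifier for categorical predictors follows. It multiplies the class prior by one conditional probability per predictor, which is exactly where the independence assumption enters. The add-one (Laplace) smoothing and the sample rows are assumptions added for illustration.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, features, target):
    """Count classes, and feature values within each class."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (feature, class) -> counts of values
    values = defaultdict(set)  # feature -> distinct values seen
    for r in rows:
        for f in features:
            value_counts[(f, r[target])][r[f]] += 1
            values[f].add(r[f])
    return class_counts, value_counts, values


def predict(model, row, features):
    """Pick the class with the largest prior times product of conditionals."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / total  # prior P(class)
        for f in features:
            # P(value | class) with add-one smoothing (an assumption of this sketch)
            p *= (value_counts[(f, c)][row[f]] + 1) / (n_c + len(values[f]))
        scores[c] = p
    return max(scores, key=scores.get)


# Made-up rows for illustration only.
features = ["Pays well", "Trending job"]
rows = [
    {"Pays well": "Yes", "Trending job": "Yes", "Intersection": "Yes"},
    {"Pays well": "Yes", "Trending job": "Yes", "Intersection": "Yes"},
    {"Pays well": "Yes", "Trending job": "No", "Intersection": "No"},
    {"Pays well": "No", "Trending job": "Yes", "Intersection": "No"},
    {"Pays well": "No", "Trending job": "No", "Intersection": "No"},
]
model = fit_naive_bayes(rows, features, "Intersection")
```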
V. RESULTS OF IMPLEMENTATION
The dataset was then used to derive the results, using the various packages available in R for generating decision trees.
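The C5.0 algorithm applied to the dataset achieved an accuracy of 100%. Such a score is plausible here because the target attribute “Intersection” is, by construction, fully determined by the four factor fields.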
The output after plotting the decision tree is shown in Figure 6. This tree is generated by considering all four factors, namely personality traits, interests, market trends and pay scales.
Figure 6: decision tree with all the four factors
We also generated trees that consider one factor at a time when applying the C5.0 algorithm to the dataset. The decision trees for “Trending job”, “P-E fit”, “Interest based” and “Pays well” are shown in Figure 7, Figure 8, Figure 9 and Figure 10 respectively.
Figure 7: decision tree with “Trending job”
Figure 8: decision tree with “P-E fit”
Figure 9: decision tree with “Interest based”
Figure 10: decision tree with “Pays well”
The following graph gives a clear comparison of the five models: the tree built with all four factors and the four trees built with one factor each.
Figure 11: accuracy with various factors
Thus, this graph shows that all four factors, namely personality traits, interests, market trends and pay scales, are important for selecting a student’s career.
This work has discussed Holland’s theory and various data mining techniques in relation to observations indicating that some students have difficulty in determining a suitable career. As this affects their performance, productivity and satisfaction, it is critically important to understand how to find a career that fits their personality. The results generated from the employees’ data can be useful for evaluating patterns in order to determine a suitable career path for students based on the four factors, namely personality traits, interests, market trends and pay scales.
Elakia, Gayathri, Aarthi and Naren J, “Application of Data Mining in Educational Database for Predicting Behavioural Patterns of the Students”, IJCSIT, 2014.
Avinsh Kumar, Akshat Gawankar, Kunal Borge & Mr Nilesh M Patil, “Student Profile & Personality Prediction using Data Mining Algorithms”, IJARIIE, 2017.
Gentaneh Berie Tarekegn & Dr. Vuda Sreenivasarao, “Application of Data Mining Techniques to Predict Students Placement in to Departments”, IJRSCSE, 2016.
Nikita Gorad, Ishani Zalte, Aishwarya Nandi & Deepali Nayak, “Career Counselling using Data Mining”, IJESC, April 2017.
Lokesh S. Katore, Bhakti S. Ratnaparkhi & Dr. Jayant S. Umale, “Novel Professional career prediction and recommendation method for individual through analytics on personal traits using C4.5 algorithm”, IEEE, 2015.
Ms. Roshani Ade & Dr. P.R. Deshmukh, “An incremental ensemble of classifiers as a technique for prediction of student’s career choice”, IEEE, 2014.
Torsten Hothorn, Kurt Hornik and Achim Zeileis, “ctree: Conditional Inference Trees”.