Recruitment Prediction Model using XGBoost Classifier

5 min readJun 16, 2020

Talent Acquisition is a critical function within HR, especially in sectors such as IT — Internet, Software, or ITES; Retail etc. where relatively higher attrition and business growth generates constant demand for new employees, thereby keeping recruiters on their toes all the time. Moreover, delays in filling up vacancies have significant business impact and directly affect the company’s revenues. So, on one hand there is this pressure on recruiters to hire candidates ASAP on getting the requisition, while on the other hand, most recruiters would say that spotting, screening and selecting ‘talent’ should never be done in a hurry, but through a thorough and reliable process.

The question is this: how do we end this tug-of-war between business and HR and speed up the entire process to suit business needs without compromising on the thoroughness and reliability of the HR processes?

In this article, I focus on the specific point of developing a predictive machine learning model for identifying candidates who are likely to clear the recruitment process based on past data. This way, the recruiters can work only on such candidates, saving countless man-hours that multiple persons from multiple departments would otherwise spend on candidates that would eventually be rejected. This way, Talent Acquisition would not just improve on efficiency, but would also be better aligned with the overall business objectives of the company.

Working of the model:

First i have imported data in Jupyter Notebook,the data contains several features such as Candidate Name, Qualification, Qualification Level, CGPA, Current Employer( Organization name), Current Experience, Current CTC, Expected CTC, Role Id, Source Id, Email Id, Last Stage, Entry Date.

Here, Last stage is the target feature , contains 8 sub values from S1 to S8, S1 means very less chances and S8 is considered as very high chances of recruitment.

Firstly import necessary libraries like pandas and numpy must needed for Data manipulation & arrays, then reading the data from excel file.

Next step is to check null values in the features and also check the shape of data i.e number of rows & columns in the dataset.

Next is to drop unnecessary features from the dataset i.e “CandidateName”,”EmailId,”EntryDate’,”RoleId”,”CurrentEmployer” & also “Qualification because “Qualification level”column is already given so no need of extra categorical feature.

Checking how many count of variables in “SourceId” column and which variable has maximum values.

It looks that it is Monotonic relationship between values in “SourceId”.

As “LastStage” is a target feature, for more easy understanding for classification, we can map into 3 categories [S1-S3] as 1(low chances of selection), [S4-S6](might be applicable for selection)as 2 & [S7-S8] as 3(High chances of selection of candidate).

In “SourceId” there are 7 values i.e LinedIn,Naukri,Indeed,Consultant1,Consultant2,Refrerral & Shine.We can map it as Indeed-1,Shine-2,Naukri-3,LinkedIn-4,Consultant1 & Consultant2–5 & Referral is given as highest priority i.e-6.

Now for placements ,CTC does not play a vital role so dropping “CurrentCTC”,”ExpectedCTC” & LastStage because it is target feature.

You can see all statistics related to the dataset.

Splitting dataset into train data & test data & importing KNN,DecisionTree,RandomForest,GaussianNB,SVC & XGBoost classifiers, also import KFold cross val technique.