---
title: "Human Resources Data Mining"
author: "Kar Ng"
date: '2022-06'
output:
  github_document:
    toc: true
    toc_depth: 4
always_allow_html: yes
---
***
![](https://github.com/raw/KAR-NG/hr/main/pic3_thumbnail.png)
***
## 1 SUMMARY
This project applies a series of data mining techniques, including clustering, principal component methods, regression, and classification algorithms, to study the inner trends hidden in the dataset. Numerous visualisations were also applied to aid each of these data mining methods. In this project, 5 analytical tasks have been completed using VAT-validated Gower-PAM clustering, correspondence analysis (CA), asymmetric biplot, multiple correspondence analysis (MCA), Chi-squared tests, regression, and predictive classification models with KNN, SVM, and Random Forest.
Outputs show that there is no statistical evidence (p-value = 0.249) to support the argument that some managers are better than others at training their employees. Instead, most managers are good at training their subordinates to reach the "Fully Meets" standard. The company actively hires employees from diverse backgrounds and has a good overall diversity level of 76%. 40% of the employees in the company are female and 40% are employees from diverse backgrounds. The company recruits employees from 8 sources; the diversity job fair is the best choice if the company is keen to hire employees from diverse backgrounds, and employee referral is the worst source for this purpose (Chi-squared test for independence: X-squared = 21.989, df = 5, p-value = 0.0005).
Inferential regression was applied to study the relationships between salary and numerous factors (variables) that could potentially relate to unequal pay, such as age, years of working, race, and gender. The result shows that the company is paying employees equally, supported by extensive visualisation and p-values higher than 0.05. Finally, this dataset provides sufficient data to train a model with great predictive power. K-Nearest Neighbours (KNN), polynomial-kernel Support Vector Machine (SVM), and Random Forest were selected as the modelling candidates. Output shows that a Random Forest model with a 0.405 probability cut-off is the best algorithm for predicting who is leaving the company. It has a reliable overall accuracy rate of 95.7%, a sensitivity rate of 93.5% (the metric we are most interested in), and a specificity rate of 96.7%.
*Highlight*
![](https://github.com/raw/KAR-NG/hr/main/pic2_highlights.png)
## 2 R PACKAGES
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
library(kableExtra)
library(lubridate)
library(skimr)
library(tidytext)
library(factoextra)
library(FactoMineR)
library(cluster) # For daisy function
library(cowplot)
library(Rtsne)
library(gplots)
library(ggrepel)
library(caret)
library(pROC)
library(ggpubr)
library(grid)
library(gridExtra)
library(corrplot)
```
## 3 INTRODUCTION
The aim of this project is to analyse a human resource dataset to answer 5 business questions:
* 1. Is there any relationship between who a person works for and their performance score?
* 2. What is the overall diversity profile of the organization?
* 3. What are our best recruiting sources if we want to ensure a diverse organization?
* 4. Are there areas of the company where pay is not equitable?
* 5. Can we predict who is going to terminate and who isn't? What level of accuracy can we achieve on this?
Different machine learning techniques will be used in this project to answer these questions and during exploratory analysis; they include:
* Applying clustering for mixed-data,
* Several appropriate principal component methods,
* Regression algorithms, and
* Classification algorithms
A quick introduction to the rarer "principal component (PC) methods": they belong to the "unsupervised" branch of the machine learning domain. There are 5 main types of principal component methods:
* Principal Component Analysis (PCA)
* Correspondence Analysis (CA)
* Multiple Correspondence Analysis (MCA)
* Factor Analysis of Mixed Data (FAMD)
* Multiple Factor Analysis (MFA)
These PC methods are designed for different types of datasets. For example, PCA is used for datasets that have only numerical variables (also known as "features" when describing explanatory variables), whereas FAMD and MFA are used for mixed datasets that have both numerical and categorical variables. I will apply the most appropriate one for the dataset used in this project, which will be decided in a later section after I explore the data.
PC methods help us identify the most important variables, the ones that contribute the most to explaining the variation in a dataset. During computation, PC methods extract all the variation in the multivariate dataset and express it in a few new variables called principal components (other interchangeable terms are "dims" or "axes"). Then, various PC plots are drawn to study the results. It is important to note that the goal of PC methods is to identify the main directions along which the variation is maximal (KASSAMBARA A 2017).
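As a minimal illustration of this variance-maximisation idea, base R's `prcomp()` (the standard PCA routine) can be run on a small simulated numeric matrix; the data here are made up purely for demonstration:

```r
# Minimal PCA sketch on simulated data (25 observations, 4 numeric variables).
set.seed(1)
toy <- matrix(rnorm(100), ncol = 4)
res <- prcomp(toy, scale. = TRUE)

# Each principal component captures a decreasing share of the total variation;
# the proportions sum to 1 across all components.
summary(res)$importance["Proportion of Variance", ]
```

The first few components typically carry most of the variation, which is why PC plots in later sections focus on the first two dimensions.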
Please expect this to be a slightly long project, as it is built for skills demonstration purposes.
## 4 DATA PREPARATION
A public dataset called "Human Resources Data Set", prepared by Dr. Rich Huebner on Kaggle.com, has been downloaded for this project. *Kaggle.com* is a popular website for the data science community to share datasets, code and projects.
### 4.1 Data import
Importing the dataset into R:
```{r}
hr <- read.csv("hr_dataset.csv",
               fileEncoding = "UTF-8-BOM",
               na.strings = c("", "NA"),  # treat blanks and "NA" as missing ("T" is not a valid na.strings value)
               header = T)
```
### 4.2 Data description
Following is the data dictionary/description of this dataset, adapted from this link: [Rpubs](https://rpubs.com/rhuebner/hrd_cb_v14), created by the author, Dr. Rich Huebner.
```{r}
Variables <- c("Employee Name",
"EmpID",
"MarriedID",
"MaritalStatusID",
"EmpStatusID",
"DeptID",
"PerfScoreID",
"FromDiversityJobFairID",
"Salary",
"Termd",
"PositionID",
"Position",
"State",
"Zip",
"DOB",
"Sex",
"MaritalDesc",
"CitizenDesc",
"HispanicLatino",
"RaceDesc",
"DateofHire",
"DateofTermination",
"TermReason",
"EmploymentStatus",
"Department",
"ManagerName",
"ManagerID",
"RecruitmentSource",
"PerformanceScore",
"EngagementSurvey",
"EmpSatisfaction",
"SpecialProjectsCount",
"LastPerformanceReviewDate",
"DaysLateLast30",
"Absences"
)
Description <- c("Employee’s full name",
"Employee ID is unique to each employee",
"Is the person married (1 or 0 for yes or no)",
"Marital status code that matches the text field MaritalDesc",
"Employment status code that matches text field EmploymentStatus",
"Department ID code that matches the department the employee works in",
"Performance Score code that matches the employee’s most recent performance score",
"Was the employee sourced from the Diversity job fair? 1 or 0 for yes or no",
"The person’s yearly salary. $ U.S. Dollars",
"Has this employee been terminated - 1 or 0",
"An integer indicating the person’s position",
"The text name/title of the position the person has",
"The state that the person lives in",
"The zip code for the employee",
"Date of Birth for the employee",
"Sex - M or F",
"The marital status of the person (divorced, single, widowed, separated, etc)",
"Label for whether the person is a Citizen or Eligible NonCitizen",
"Yes or No field for whether the employee is Hispanic/Latino",
"Description/text of the race the person identifies with",
"Date the person was hired",
"Date the person was terminated, only populated if, in fact, Termd = 1",
"A text reason / description for why the person was terminated",
"A description/category of the person’s employment status. Anyone currently working full time = Active",
"Name of the department that the person works in",
"The name of the person’s immediate manager",
"A unique identifier for each manager",
"The name of the recruitment source where the employee was recruited from",
"Performance Score text/category (Fully Meets, Partially Meets, PIP, Exceeds)",
"Results from the last engagement survey, managed by our external partner",
"A basic satisfaction score between 1 and 5, as reported on a recent employee satisfaction survey",
"The number of special projects that the employee worked on during the last 6 months",
"The most recent date of the person’s last performance review",
"The number of times that the employee was late to work during the last 30 days",
"The number of times the employee was absent from work"
)
data.frame(Variables, Description) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "bordered"))
```
### 4.3 Data exploration
There are 311 rows and 35 columns in the dataset. Following shows the variable type allocated by R to each of the columns (also known as variables or features), along with the first few values of these variables.
```{r}
glimpse(hr)
```
Randomly sample 10 rows of data from the table:
```{r}
set.seed(123)
sample_n(hr, 10)
```
The first column records employee names. I have made this column the row names (each row is also known as an observation). This is the standard format required for analysis using clustering or PC methods.
```{r}
hr <- hr %>%
column_to_rownames(var = "Employee_Name")
```
## 5 DATA CLEANING
The data may seem ready to go; however, numerous important cleaning and manipulation tasks have been identified and will be completed in this section.
### 5.1 Variable removals
I will be removing some irrelevant or duplicated variables that may not be helpful in the analysis. They are:
* EmpID
* MaritalStatusID
* GenderID
* EmpStatusID
* DeptID
* PerfScoreID
* PositionID
* Zip
* ManagerID
* LastPerformanceReview_Date
* MarriedID
* FromDiversityJobFairID
After removal of the above features, the number of columns has been reduced from 35 to 23.
```{r}
hr2 <- hr %>%
dplyr::select(-EmpID, -MaritalStatusID, -GenderID, -EmpStatusID, -DeptID, -PerfScoreID, -PositionID, -Zip, -ManagerID, -LastPerformanceReview_Date, -MarriedID, -FromDiversityJobFairID)
glimpse(hr2)
```
### 5.2 New Variable: Age
The year of writing this project is 2022, and therefore age is calculated as 2022 minus the year of birth (from DOB) plus 1. The DOB variable will be replaced with "Age".
```{r}
hr2 <- hr2 %>%
mutate(yearDOB = str_sub(DOB, -2, -1),
yearbirth = as.numeric(paste0(19, yearDOB)),
Age = (2022 - yearbirth)+1) %>%
relocate(Age, .after = State) %>%
dplyr::select(-DOB, -yearDOB, -yearbirth)
```
Now, the variable "DOB" has been replaced by "Age", and following shows the age of all employees in the dataset.
```{r}
hr2$Age
```
The distribution:
* Maximum: Located at the top tip of the boxplot
* Minimum: Located at the bottom tip of the boxplot
* Median: The horizontal line in the middle of the boxplot
```{r, fig.width=6}
set.seed(123)
ggplot(hr2, aes(y = Age, x = "")) +
geom_jitter(width = 0.1, alpha = 0.5,) +
geom_boxplot(outlier.shape = NA, width = 0.1, size = 2, color = "purple", alpha = 0.5) +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank())
```
### 5.3 New Variable: years_worked
Two variables in the dataset, DateofHire and DateofTermination, are available to synthesise a variable recording the service years of each employee.
I will compute the number of days worked by each employee as the date of termination (DateofTermination) minus the date of hire (DateofHire); for present employees I will use today's date (5-May-2022) minus the date of hire to obtain the total number of days worked.
```{r}
hr2 <- hr2 %>%
mutate(DateofHire = mdy(DateofHire),
DateofTermination = mdy(DateofTermination),
days_worked = ifelse(is.na(DateofTermination),
today() - DateofHire,
DateofTermination - DateofHire),
years_worked = round(days_worked/365, 1)) %>%
relocate(years_worked, .after = RaceDesc) %>%
dplyr::select(-DateofHire, -DateofTermination, -days_worked)
```
Following shows the number of years (to 1 decimal place) worked by each employee in the dataset.
```{r}
hr2$years_worked
```
The distribution:
* Maximum: Located at the top tip of the boxplot
* Minimum: Located at the bottom tip of the boxplot
* Median: The horizontal line in the middle of the boxplot
```{r, fig.width=6}
set.seed(123)
ggplot(hr2, aes(y = years_worked, x = "")) +
geom_jitter(width = 0.1, alpha = 0.5,) +
geom_boxplot(outlier.shape = NA, width = 0.1, size = 2, color = "blue", alpha = 0.5) +
theme_classic() +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank()) +
labs(y = "Years Worked")
```
### 5.4 Trim
This section trims the unnecessary leading and trailing white spaces of character variables in the dataset.
```{r}
hr2 <- hr2 %>%
mutate_if(is.character, trimws)
```
### 5.5 Factor conversion
Some "numeric" and "textual" features will need to be converted into "factor" because of their categorical nature. After examination, all textual features that classified as character "chr" in this dataset need to be converted into factor. Following shows all of these textual variables in the datasets.
```{r}
str(hr2 %>%
      select(where(is.character)))
```
Following codes complete the conversion task.
```{r}
hr2 <- hr2 %>%
mutate_if(is.character, as.factor)
```
* With regard to numerical features, those that need to be converted into factor type are "Termd" and "EmpSatisfaction" (the other ID-type factors were already removed in section 5.1).
```{r}
str(hr2 %>%
      select(!where(is.factor)))
```
Following codes complete the conversion task for these selected numerical features.
```{r}
hr2 <- hr2 %>%
mutate(Termd = as.factor(Termd),
EmpSatisfaction = as.factor(EmpSatisfaction))
```
After conversion, we are able to use the following code to summarise the dataset.
```{r}
summary(hr2 %>%
          dplyr::select(where(is.factor)))
```
### 5.6 CitizenDesc
There are three categories in the variable "CitizenDesc": (1) 12 "Eligible NonCitizen" employees, (2) 4 "Non-Citizen" employees, and (3) 295 "US Citizen" employees. This section merges "Eligible NonCitizen" into "Non-Citizen", because although these 12 employees are eligible non-citizens, they are not US citizens yet.
```{r}
hr2 <- hr2 %>%
mutate(CitizenDesc = fct_recode(CitizenDesc,
"Non-Citizen" = "Eligible NonCitizen"))
```
Let's check, and the conversion has been completed.
```{r}
table(hr2$CitizenDesc)
```
### 5.7 HispanicLatino
The variable "HispanicLatino" has following 4 categories.
```{r}
table(hr2$HispanicLatino)
```
The "no" and "Yes" should be a mistake and have to be converted to "No" and "Yes". Following code completes the conversion.
```{r}
hr2 <- hr2 %>%
mutate(HispanicLatino = fct_recode(HispanicLatino,
"No" = "no",
"Yes" = "yes"))
```
Let's check, and the conversion has been completed.
```{r}
table(hr2$HispanicLatino)
```
### 5.8 Missing data check
This section checks for missing data ("NA") in the dataset.
```{r}
skim_without_charts(hr2)
```
From the above output, no missing data is detected, as shown by the columns "n_missing" and "complete_rate".
Alternatively, I can count the number of missing values ("NA") in each column with the following code.
```{r}
colSums(is.na(hr2)) %>%
kbl(col.names = "Missing Value Count") %>%
kable_styling(bootstrap_options = c("striped", "bordered"), full_width = F)
```
Again, no missing data is detected.
## 6 VISUALISATION
This section will help to deliver a preliminary understanding of the data in the dataset.
### 6.1 Numerical variables
```{r, message=FALSE, warning=FALSE, fig.width=12, fig.height=6}
# df
df6.1 <- hr2 %>%
dplyr::select(!where(is.factor)) %>%
pivot_longer(1:7, names_to = "my_var", values_to = "my_values")
# graph
ggplot(df6.1, aes(x = my_values, fill = my_var)) +
geom_histogram(color = "black") +
facet_wrap(~my_var, scales = "free", nrow = 2) +
theme_bw() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
labs(title = "Visualisation of Numerical variables",
subtitle = "by Histogram",
x = "Variables",
y = "Count")
```
Insights from the summary:
* There is no obvious trend in the variable "Absences", but it ranges from a minimum of 1 day to a maximum of 20.
```{r}
summary(hr2$Absences)
```
* Most employees are aged between 30 and 55.
* Most employees are not late to work.
* Most employees are highly engaged.
* Salary is positively skewed, but the majority of employees earn about $60-80k.
* Most employees do not have special projects.
* Years worked at the company are roughly normally distributed around 8 years.
### 6.2 Categorical variables
```{r, fig.height=16, fig.width=14, message=FALSE}
# df
df6.2 <- hr2 %>%
dplyr::select(where(is.factor)) %>%
pivot_longer(1:15, names_to = "my_var", values_to = "my_values") %>%
group_by(my_var, my_values) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(label = reorder_within(x = my_values, by = count, within = my_var))
# graph
ggplot(df6.2, aes(y = label, x = count, fill = my_values)) +
geom_bar(stat = "identity") +
geom_text(aes(label = count), hjust = 1) +
facet_wrap(~my_var, scales = "free") +
theme_bw() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
scale_y_reordered() +
labs(title = "Visualisation of factor variables 1",
subtitle = "by Bar chart",
y = "Variables",
x = "Count")
```
General insights:
The dataset is dominated by employees from Massachusetts (MA), US, working in the production department. Most employees in the dataset hold the titles "Production Technician I" and "Production Technician II". There are many other departments and positions as well, but their headcounts are small compared to the technician positions in the production department. The company has a mentor system in which each employee has a direct manager to learn from and work for.
Most employees are White or Black/African American, followed by Asian and other races. There are slightly more female employees than male employees. Most employees are satisfied with their jobs, scoring 3 or above, and most of them fully meet the performance standard.
### 6.3 Distribution study of continuous variable
A number of continuous variables do not have a normal distribution and contain outliers.
```{r, fig.width=10}
# set df
df6.3 <- hr2 %>% select_if(is.numeric)
df6.3 <- df6.3 %>%
pivot_longer(c(1:7), names_to = "myvar", values_to = "myval") %>%
mutate(myvar = as.factor(myvar))
# plot
ggplot(df6.3, aes(y = myvar, x = myval, color = myvar)) +
geom_boxplot() +
facet_wrap(~myvar, scales = "free") +
theme_bw() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
legend.position = "none")
```
### 6.4 Relationship among Variables
This section explores the relationships between numerical variables.
```{r}
df6.4 <- hr2 %>% select_if(is.numeric)
mycor <- cor(df6.4)
corrplot(mycor, method = "number")
```
Only 2 relationships are detected:
* There is a positive relationship between an employee's salary and the number of special projects he or she has.
* There is a negative relationship between the engagement survey score and the number of days an employee was late in the last 30 days. A highly engaged employee may be motivated to work and unlikely to be late.
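A correlation coefficient alone does not say whether a relationship is statistically significant; base R's `cor.test()` does. A minimal sketch on simulated data (on the project data, the equivalent call would be `cor.test(hr2$Salary, hr2$SpecialProjectsCount)`):

```r
# Simulate a strongly correlated pair of variables for illustration.
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.5)

# A small p-value is evidence that the correlation is non-zero.
cor.test(x, y)$p.value
```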
After completing the several basic visualisations above, we will now dive into more advanced data mining techniques.
## 7 CLUSTERING
Clustering is a family of machine learning techniques for finding distinct groups of data within a dataset. Clustering tries to group similar observations together, and each group should be distinct from the others if the dataset is intrinsically clusterable.
### 7.1 Distance metrics
The dataset of this project is a mixed dataset, which means it contains both numerical and categorical variables. A special distance metric called "Gower distance" will be used to measure the distance between observations. For continuous variables, the Gower function uses Manhattan distance, whereas for categorical variables it uses Dice distance. Gower then combines all these distances into a single value, known as the Gower distance.
To compute gower distance between observations:
```{r}
gower.dis <- daisy(hr2,
metric = "gower",
                   type = list(logratio = c("Age", "DaysLateLast30", "Salary", "SpecialProjectsCount"))) # log-ratio treatment for these skewed numeric columns
summary(gower.dis)
```
The Gower distance metric has now been computed for 311 rows of observations of this dataset.
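For intuition, the Gower distance computed by `daisy()` above is essentially an average of per-variable dissimilarities: range-normalised absolute differences for numeric variables and 0/1 mismatches for categorical ones. A hand-worked toy sketch with made-up values:

```r
# Two hypothetical employees: ages 30 vs 40 (variable range 21-65), sexes M vs F.
age_d   <- abs(30 - 40) / (65 - 21)  # numeric: absolute difference scaled by the range
sex_d   <- as.numeric("M" != "F")    # categorical: 1 if the levels differ, 0 otherwise
gower_d <- mean(c(age_d, sex_d))     # average across variables -> one distance
gower_d                              # roughly 0.61
```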
Following shows the *most similar* pair of observations in the dataset.
```{r}
gower_mat <- as.matrix(gower.dis)
hr2[which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), # min for most similar
arr.ind = T)[1, ], ]
```
Following shows the *most distinct* pair of observations in the dataset.
```{r}
hr2[which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), # max for most different
arr.ind = T)[1, ], ]
```
### 7.2 VAT
VAT stands for "Visual Assessment of cluster Tendency", a way to see whether the dataset is clusterable. Red indicates high similarity and blue indicates low similarity. The pattern suggests the dataset may form 2 clusters.
```{r}
fviz_dist(gower.dis)
```
### 7.3 Gower with PAM
The K-means clustering method cannot be applied to the Gower distance matrix produced by daisy(), and therefore the clustering algorithms typically used with Gower distance are Partitioning Around Medoids (PAM) and hierarchical clustering.
**Determining Best K**
This section will perform PAM; however, the optimal K needs to be determined first. "K" is the number of clusters used to partition the dataset. I will use silhouette width to find the best K. Silhouette is one of the internal cluster validation metrics used to assess the quality of clustering. Its value ranges from -1 to 1; the higher the silhouette width, the better the clustering. The overall silhouette width for a given k (number of clusters) is the average of the silhouette widths of all individual observations within their clusters.
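The per-observation silhouette width being averaged here is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from observation i to the members of its own cluster and b(i) is the mean distance to the nearest other cluster. A tiny sketch with made-up distances:

```r
# Silhouette width for a single observation, from its two mean distances.
sil_width <- function(a, b) (b - a) / max(a, b)

sil_width(a = 0.2, b = 0.8)  # well clustered: close to +1 (about 0.75)
sil_width(a = 0.8, b = 0.2)  # likely misassigned: negative (about -0.75)
```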
Applying for-loop for all silhouette width of each K:
```{r}
sil_df <- c(NA)
for (i in 2:10) {
res.pam <- pam(gower.dis, diss = T, k = i)
sil_df[i] <- res.pam$silinfo$avg.width
}
sil_df
```
Plot the result:
```{r}
plot(sil_df,
xlab = "Number of Cluster (K)",
ylab = "Silhouette Width",
bty = "n")
lines(sil_df)
```
**PAM for K = 2**
The silhouette plot suggests that the dataset is best clustered into 2 groups. Therefore, I will apply PAM to cluster the dataset into 2 clusters (K = 2).
```{r}
res.pam <- pam(gower.dis, diss = T, k = 2)
```
All observations in the dataset have been clustered into 2 clusters and each cluster has following size.
```{r}
table(res.pam$clustering)
```
**Add Cluster groups to Data**
In this step, the clustering results will be added into the dataset for further analysis.
```{r}
hr2_pam <- cbind(hr2, cluster = res.pam$clustering) %>%
mutate(cluster = as.factor(cluster)) %>%
relocate(cluster, .before = Salary)
```
**Analyse the result of PAM clustering**
Following is the summary of variables from cluster 1, and this cluster mainly describes current employees.
```{r}
cluster1_stat <- hr2_pam %>%
filter(cluster == "1")
summary(cluster1_stat[, -1])
```
Following is the summary of variables from cluster 2, and this cluster mainly describes employees who have left the company.
```{r}
cluster2_stat <- hr2_pam %>%
filter(cluster == "2")
summary(cluster2_stat[, -1])
```
Following are the two medoids used to form the two clusters, which may be helpful for understanding and interpreting the result.
```{r}
hr2[res.pam$medoids, ]
```
In summary, the dataset is best clustered into 2 groups: one characterised by current employees and the other by terminated employees. Two is the best number of clusters (k) for grouping all the observations in the data, though more clusters could be selected.
Further exploration could be done to find the in-depth differences between the two clusters. However, I am not going to do that in this project because there are 5 big questions to be answered in later sections, and the numerous visualisations in those sections will analyse the data thoroughly.
## 8 BUSINESS TASKS
There are 5 business tasks, given on the Kaggle page for this dataset, to answer.
### 8.1 Is there any relationship between who a person works for and their performance score?
**1. Basic trends**
Selecting the variables "ManagerName" and "PerformanceScore" from the dataset.
```{r}
# df for related variables
df.task1 <- hr2 %>%
select(ManagerName, PerformanceScore) %>%
remove_rownames()
```
Following are the managers in this dataset:
```{r}
levels(df.task1$ManagerName)
```
Following are "performance score" categories.
```{r}
levels(df.task1$PerformanceScore)
```
"PIP" means performance improvement plan, it is a tool to train employee with performance deficiencies.
Following plots a balloon plot of the dataset:
```{r, fig.height=8, fig.width=12, message=FALSE, warning=FALSE}
# dataframe
df.task1 <- df.task1 %>%
group_by(ManagerName, PerformanceScore) %>%
summarise(count = n()) %>%
pivot_wider(names_from = PerformanceScore, values_from = count) %>%
replace_na(list(Exceeds = 0,
'Fully Meets' = 0,
'Needs Improvement' = 0,
'PIP' = 0)) %>%
column_to_rownames(var = "ManagerName")
# Contingency table format for balloon plot:
con.table <- as.table(as.matrix(df.task1))
# Balloon plot
balloonplot(t(con.table),
dotsize = 6,
dotcolor = "green",
show.margins = T,
main = "",
ylab = "",
xlab = "")
```
Basic trends:
* Brannon Miller has the most employees who worked for him and exceeded the performance requirement (the best rank, "Exceeds"), but at the same time he also has the most employees (4) who fell into the PIP category.
* Most employees are in the "Fully Meets" group. David Stanley, Elijiah Gray, Kelley Spirea, Ketsia Liebig, and Kissy Sullivan are the top managers: they have the most subordinates falling into this category, with counts of 19, 18, 18, 18, and 18. Webster Butler is the 6th best manager, with 17 employees in the "Fully Meets" category.
**2. Data mining with Correspondence Analysis**
The **Correspondence Analysis (CA)** under the branch of Principal component methods from unsupervised machine learning domain will be applied to analyse the dataset to achieve the goal.
```{r}
# Algorithm
res.ca <- CA(df.task1, graph = F)
```
The result shows that the chi-squared test has a p-value of 0.249, which means we fail to reject the null hypothesis and conclude that the association between the two variables, "Manager" and "Performance Score", is statistically insignificant.
```{r}
res.ca
```
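The chi-squared statistic reported by `CA()` is the ordinary chi-squared test of independence on the manager-by-score contingency table; on the project data, the direct equivalent would be `chisq.test(con.table)` using the table from the balloon plot step. A minimal sketch on a made-up table:

```r
# Toy 2x3 contingency table (counts invented for illustration).
toy <- matrix(c(20, 10, 6,
                24,  8, 12), nrow = 2, byrow = TRUE)

res <- chisq.test(toy)
res$p.value  # here well above 0.05: no evidence of row-column association
```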
Although there is no statistical evidence of an "overall" association, there might still be associations between individual categories of each variable; for example, a particular manager may have a high number of subordinates who perform well.
A scree plot, a symmetric biplot, and an asymmetric biplot are drawn to understand the trends inside the data.
```{r, fig.width=12, fig.height=15, warning=FALSE, warning=FALSE}
g1 <- fviz_screeplot(res.ca, addlabels = T, ylim = c(0, 60))
g2 <- fviz_ca_biplot(res.ca,
repel = T) +
labs(title = "Symmetric biplot")
g3 <- fviz_ca_biplot(res.ca,
map = "rowprincipal",
arrows = c(T,T),
repel = T) +
labs(title = "Asymmetric biplot") +
theme_get()
top <- plot_grid(g1, g2)
plot_grid(top, g3, nrow = 2)
```
**Insights**
* Scree plot is showing that the first two dimensions going to used to construct the biplot has very high capability in explaining the variation in the dataset.
* In symmetric biplot, only distance between rows, or distance between column points can really be interpreted.
* David Stanley, Kissy Sullivan, Simon Roup, and Ketsia Liebig are the closest to the point *"Fully Meets"*. They are not necessarily the managers with the most subordinates in this group, but when considering all performance-score categories, their subordinates are the most likely to fall into "Fully Meets".
* Debra Houlihan and Jennifer Zamora, by contrast, are the closest to *"Needs Improvement"*.
* The asymmetric biplot is an optional, alternative view. If the angle between a row arrow and a column arrow is acute (< 90°), there is a strong association between the corresponding row and column. For example:
* Brannon Miller has strong association with *Exceeds* and *PIP*
* Lynn Daneault has strong association with *Exceeds* and *PIP*
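The angle-based reading above can be checked numerically: the CA fit stores the row and column coordinates, cos2, and contributions that the biplots are drawn from. A sketch on a toy 3x3 table (in the analysis above, `res.ca` would be inspected instead; the table values here are illustrative only):

```r
library(FactoMineR)

# Toy manager-by-score table, large enough for two CA dimensions
tab <- as.data.frame(matrix(c(19, 8, 2,
                               5, 9, 1,
                               4, 3, 7),
                            nrow = 3, byrow = TRUE,
                            dimnames = list(c("A", "B", "C"),
                                            c("Fully Meets", "Exceeds", "PIP"))))

fit <- CA(tab, graph = FALSE)
fit$row$coord  # row principal coordinates plotted in the biplot
fit$row$cos2   # quality of representation of each row on each dimension
```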
### 8.2 What is the overall diversity profile of the organization?
Having a team of employees from diverse backgrounds can give a company a variety of perspectives, more creativity and innovation, and better employee engagement. It also helps build a better company reputation and increase profit (Lovelytics 2020).
Following Lovelytics (2020), the diversity of a company is quantified by classifying "white" "male" employees as "Non-diverse" and all remaining employees as "diverse".
I will apply this rule to our dataset: an employee who meets all of the following criteria is classified as non-diverse, and everyone else as diverse:
* Has a "Male" sex,
* "No" Hispanic or Latino,
* is "White"
Set up the dataframe and plot the graph:
```{r, message=FALSE, warning=FALSE, fig.height=10, fig.width=10}
# Set up dataframe of this section
df8.2 <- hr2 %>%
dplyr::select(Sex, HispanicLatino, RaceDesc) %>%
remove_rownames() %>%
mutate(diversity = ifelse(Sex == "M" & HispanicLatino == "No" & RaceDesc == "White",
"Non-Diverse",
"Diverse"))
# 1. Diversity
g1 <- df8.2 %>%
dplyr::select(diversity) %>% group_by(diversity) %>% summarise(count = n()) %>%
ungroup() %>%
mutate(total = sum(count),
per = round(count/total * 100)) %>%
ggplot(aes(x = "", y = per, fill = diversity)) +
geom_bar(stat = "identity", color = "black") +
coord_polar(theta = "y", start = 0, direction = -1) +
theme_minimal() +
theme(legend.position = "none",
axis.title = element_blank(),
axis.text = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")) +
geom_text(aes(label = paste0(diversity, "\n",
per, "%", "\n",
"(", count, ")")),
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = 4) + labs(title = "Overall Diversity")
# 2. Sex
g2 <- df8.2 %>%
dplyr::select(Sex) %>% group_by(Sex) %>% summarise(count = n()) %>%
ungroup() %>%
mutate(total = sum(count),
per = round(count/total * 100)) %>%
ggplot(aes(x = "", y = per, fill = Sex)) +
geom_bar(stat = "identity", color = "black") +
coord_polar(theta = "y", start = 0, direction = -1) +
theme_minimal() +
theme(legend.position = "none",
axis.title = element_blank(),
axis.text = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")) +
geom_text(aes(label = paste0(Sex, "\n",
per, "%", "\n",
"(", count, ")")),
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = 2) + labs(title = "Gender")
# 3. Hispanic / Latino Origin
g3 <- df8.2 %>%
dplyr::select(HispanicLatino) %>% group_by(HispanicLatino) %>% summarise(count = n()) %>%
ungroup() %>%
mutate(total = sum(count),
per = round(count/total * 100)) %>%
ggplot(aes(x = "", y = per, fill = HispanicLatino)) +
geom_bar(stat = "identity", color = "black") +
coord_polar(theta = "y", start = 0, direction = -1) +
theme_minimal() +
theme(legend.position = "none",
axis.title = element_blank(),
axis.text = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")) +
geom_text(aes(label = paste0(HispanicLatino, "\n",
per, "%", "\n",
"(", count, ")")),
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = 3) + labs(title = "Hispanic or Latino Origin")
# 4. Race
g4 <- df8.2 %>%
dplyr::select(RaceDesc) %>% group_by(RaceDesc) %>% summarise(count = n()) %>%
ungroup() %>%
mutate(total = sum(count),
per = round(count/total * 100)) %>%
ggplot(aes(x = "", y = per, fill = RaceDesc)) +
geom_bar(stat = "identity", color = "black") +
coord_polar(theta = "y", start = 0, direction = -1) +
theme_minimal() +
theme(legend.position = "none",
axis.title = element_blank(),
axis.text = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold")) +
geom_label_repel(aes(label = paste0(RaceDesc, "\n",
per, "%", "\n",
"(", count, ")")),
position = position_stack(vjust = 0.5)) +
scale_fill_brewer(palette = 4) + labs(title = "Race")
top <- plot_grid(g1, g2, g3, nrow = 1)
plot_grid(top, g4,
nrow = 2,
ncol = 1,
rel_heights = c(1, 2))
```
Insights:
* The overall diversity is at a good level: 76% of employees are classified as diverse.
* The gender split is close to even: 43% of employees are female and 57% are male.
* Only 9% of employees are either Hispanic or Latino.
* 60% of employees are White, followed by 26% Black or African American, 9% Asian, and less than 5% of other races.
### 8.3 What are our best recruiting sources if we want to ensure a diverse organization?
Two techniques will be used: (1) visual exploration and (2) data mining.
#### 8.3.1 Visual Exploration
The top 3 recruiting sources for diversity are:
1. Diversity Job Fair, 100% of recruited employees were diverse (total of 29)
2. Google Search, 81.6% of recruited employees were diverse (40 out of 49)
3. LinkedIn, 80.3% of recruited employees were diverse (40 out of 49)
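A minimal sketch of how these per-source diversity shares can be computed with `dplyr`; it assumes a `RecruitmentSource` column and the `diversity` flag derived in section 8.2, and uses a toy data frame so the snippet runs stand-alone.

```r
library(dplyr)

# Toy stand-in; the real analysis would start from hr2 with the
# "diversity" flag derived from Sex, HispanicLatino, and RaceDesc
toy <- data.frame(
  RecruitmentSource = c("LinkedIn", "LinkedIn", "Google Search", "Diversity Job Fair"),
  diversity         = c("Diverse", "Non-Diverse", "Diverse", "Diverse")
)

# Share of diverse hires per recruiting source, best first
toy %>%
  group_by(RecruitmentSource) %>%
  summarise(n = n(),
            diverse_pct = round(mean(diversity == "Diverse") * 100, 1)) %>%
  arrange(desc(diverse_pct))
```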
The following 3 sources had the worst diversity scores among the available options in the dataset: