Introduction
Agriculture is a key sector of the economy, but it exposes workers to various occupational risks, including prolonged exposure to pesticides, dust, and biological agents. These exposures are suspected to increase the risk of certain cancers, making it essential to study the links between agriculture and health. Agricultural occupations are diverse, ranging from field crops to livestock farming, with different exposure levels depending on practices and products used. Using factorial methods, notably Principal Component Analysis (PCA), we identify correlations between agricultural practices and exposures. Then, K-means clustering groups individuals into homogeneous profiles, facilitating the analysis of health risks specific to each group.
Population description
This study is based on the AGRICAN cohort, which aims to assess the impact of occupational agricultural exposure on cancer risk. The cohort includes around 180,000 members of the Mutualité Sociale Agricole (MSA), the French agricultural social insurance scheme, who agreed to participate out of 567,000 initially eligible people. It includes farmers, farm operators, employees, and workers from sectors linked to agriculture. All participants had been affiliated with the MSA for at least three years and lived in one of the 11 French departments with a cancer registry.
Our study focuses on the 1975 Cohort, which includes 10,463 farmers who started their careers between 1965 and 1985. Through a questionnaire, they provided detailed information on their professional history, including farm type (livestock and/or crop farming), use of phytosanitary products (fungicides, insecticides, herbicides), equipment used, as well as health and lifestyle information.
To explore these data in more depth, we applied Principal Component Analysis (PCA) to a table of activity ratios. For each activity, the ratio measures task intensity as the duration of the practice divided by the total duration of professional activity.
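As an illustration of this definition, the ratio can be sketched as follows. The function name `activity_ratio`, the bounds check, and the example farmer are hypothetical and are not taken from the AGRICAN questionnaire; the sketch only assumes that both durations are expressed in the same unit.

```python
def activity_ratio(practice_years: float, career_years: float) -> float:
    """Share of a career spent on one activity, between 0 and 1."""
    if career_years <= 0:
        raise ValueError("career_years must be positive")
    if not 0 <= practice_years <= career_years:
        raise ValueError("practice_years must lie between 0 and career_years")
    return practice_years / career_years


# Hypothetical farmer (illustration only): 30-year career,
# 12 years growing maize, 30 years with cattle, no vineyards.
durations = {"Maize": 12, "Cattle": 30, "Vineyards": 0}
ratios = {name: activity_ratio(years, 30) for name, years in durations.items()}
print(ratios)  # {'Maize': 0.4, 'Cattle': 1.0, 'Vineyards': 0.0}
```

Each row of the PCA input table would then hold one such ratio per crop or livestock type.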
Data exploration - Principal Component Analysis
To begin, we performed a Principal Component Analysis (PCA) on the 8 crop types and 5 livestock types most represented in our dataset.
Most represented crops
The 8 selected crops are:
- Grasslands
- Wheat or barley
- Maize
- Vineyards
- Rapeseed
- Sunflower
- Sugar beet
- Forage peas
Most represented livestock
The 5 selected livestock types are:
- Cattle
- Sheep/Goats
- Pigs
- Horses
- Poultry
This analysis helps identify the main trends and similarities between these crops and livestock systems, making data interpretation easier.
Axis selection
To select our axes, we aimed to retain 80% of cumulative inertia. As shown in the table below, we reach this target by keeping the first 8 principal components.
Eigenvalue table
| Component | % inertia | % cumulative inertia |
|---|---|---|
| comp 1 | 23.94 | 23.94 |
| comp 2 | 12.89 | 36.83 |
| comp 3 | 10.16 | 46.98 |
| comp 4 | 9.00 | 55.98 |
| comp 5 | 7.55 | 63.53 |
| comp 6 | 7.02 | 70.55 |
| comp 7 | 6.66 | 77.21 |
| comp 8 | 5.55 | 82.77 |
| comp 9 | 4.77 | 87.54 |
| comp 10 | 4.00 | 91.53 |
| comp 11 | 3.82 | 95.35 |
| comp 12 | 2.73 | 98.08 |
| comp 13 | 1.92 | 100.00 |
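The 80% rule can be checked directly from the eigenvalue table. The sketch below copies the inertia percentages from the table; the function name `components_for_threshold` is only illustrative.

```python
# Percentage of inertia per component, copied from the eigenvalue table.
inertia_pct = [23.94, 12.89, 10.16, 9.00, 7.55, 7.02, 6.66,
               5.55, 4.77, 4.00, 3.82, 2.73, 1.92]


def components_for_threshold(percentages, threshold=80.0):
    """Smallest number of leading components whose cumulative inertia
    reaches `threshold` percent, with the cumulative value reached."""
    cumulative = 0.0
    for rank, pct in enumerate(percentages, start=1):
        cumulative += pct
        if cumulative >= threshold:
            return rank, cumulative
    raise ValueError("threshold is never reached")


n_components, reached = components_for_threshold(inertia_pct)
print(n_components)  # 8
```

Seven components stop just short of the target (about 77%), which is why the eighth is needed.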
Variable map analysis
After identifying the 8 principal components, we plotted the variable map using the first two principal components (PC1 and PC2).
On this map, we observe that the variables "Wheat or barley", "Maize", "Grasslands", and "Cattle" are positively correlated with the first axis.
The variable "Vineyards" is negatively correlated with this axis.
The variables "Forage peas" and "Rapeseed" are positively correlated with the second axis.
Overall, these variables are well represented on the factorial plane.
By contrast, the other variables are poorly represented on this plane, notably "Pigs", "Poultry", and "Horses". They therefore contribute little to the interpretation of the first two axes.
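One way to quantify how well a variable is represented on the plane is the squared cosine (cos2). Assuming a normalized PCA, in which the coordinate of a variable on an axis equals its correlation with that axis, cos2 on the (Dim.1, Dim.2) plane is the sum of the two squared correlations. The sketch below uses a subset of the correlations reported in the correlation matrix of this study; the normalization assumption and the helper name `plane_quality` are ours.

```python
# Correlations with Dim.1 and Dim.2 for a subset of variables,
# copied from the correlation matrix reported in this study.
correlations = {
    "Grasslands": (0.74, -0.44),
    "Wheat or barley": (0.83, 0.08),
    "Cattle": (0.69, -0.51),
    "Forage peas": (0.37, 0.65),
    "Pigs": (0.15, -0.01),
    "Horses": (0.03, -0.10),
    "Poultry": (0.13, -0.05),
}


def plane_quality(r1, r2):
    """cos2 on the (Dim.1, Dim.2) plane, assuming a normalized PCA."""
    return r1 ** 2 + r2 ** 2


quality = {name: plane_quality(*r) for name, r in correlations.items()}
for name, q in sorted(quality.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<16} {q:.2f}")
```

Under this assumption, "Grasslands" and "Cattle" have a cos2 above 0.7 on the plane, whereas "Pigs", "Poultry", and "Horses" stay below 0.05.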
Variable map
Correlation analysis and identification of main axes
To obtain a broader overall view, we used the correlation matrix and focused on the first four axes.
In this table, variables highlighted in red are the most strongly correlated with their respective axis (absolute correlation above 0.60). Variables in orange have an absolute correlation of at least 0.40, and those in yellow at least 0.20.
For the first axis, the variable "Wheat or barley" is the best represented, with a correlation coefficient of 0.83, indicating a very strong association.
For the second axis, the most correlated variable is "Forage peas", with a coefficient of 0.65.
For the third factorial axis, the best represented variable is "Horses", with a coefficient of 0.48.
Finally, for the fourth axis, the most correlated variable is "Sugar beet", with a coefficient of -0.52.
Correlation matrix between variables and principal components
| Variable | Dim.1 | Dim.2 | Dim.3 | Dim.4 |
|---|---|---|---|---|
| Grasslands | 0.74 | -0.44 | -0.02 | -0.05 |
| Wheat or barley | 0.83 | 0.08 | -0.1 | -0.03 |
| Maize | 0.65 | -0.09 | -0.29 | 0.02 |
| Vineyards | -0.59 | 0.21 | -0.22 | -0.02 |
| Rapeseed | 0.43 | 0.64 | -0.14 | 0.23 |
| Sunflower | 0.36 | 0.42 | -0.42 | 0.51 |
| Sugar beet | 0.35 | 0.39 | 0.46 | -0.52 |
| Forage peas | 0.37 | 0.65 | 0.24 | -0.28 |
| Cattle | 0.69 | -0.51 | -0.07 | -0.15 |
| Sheep/Goats | 0.09 | -0.04 | 0.4 | 0.4 |
| Pigs | 0.15 | -0.01 | 0.43 | 0.21 |
| Horses | 0.03 | -0.1 | 0.48 | 0.23 |
| Poultry | 0.13 | -0.05 | 0.36 | 0.48 |
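The reading of this matrix can be reproduced with a short sketch. The values are copied from the table; comparing correlations in absolute value is our assumption (it is what makes -0.52 the strongest value on the fourth axis), and the function names are illustrative.

```python
# Variable-axis correlations (Dim.1 to Dim.4), copied from the table.
loadings = {
    "Grasslands": (0.74, -0.44, -0.02, -0.05),
    "Wheat or barley": (0.83, 0.08, -0.10, -0.03),
    "Maize": (0.65, -0.09, -0.29, 0.02),
    "Vineyards": (-0.59, 0.21, -0.22, -0.02),
    "Rapeseed": (0.43, 0.64, -0.14, 0.23),
    "Sunflower": (0.36, 0.42, -0.42, 0.51),
    "Sugar beet": (0.35, 0.39, 0.46, -0.52),
    "Forage peas": (0.37, 0.65, 0.24, -0.28),
    "Cattle": (0.69, -0.51, -0.07, -0.15),
    "Sheep/Goats": (0.09, -0.04, 0.40, 0.40),
    "Pigs": (0.15, -0.01, 0.43, 0.21),
    "Horses": (0.03, -0.10, 0.48, 0.23),
    "Poultry": (0.13, -0.05, 0.36, 0.48),
}


def strongest_variable(axis):
    """Variable with the largest absolute correlation on a 0-based axis."""
    return max(loadings, key=lambda name: abs(loadings[name][axis]))


def band(r):
    """Colour band described in the text, based on |r| (our assumption)."""
    strength = abs(r)
    if strength > 0.60:
        return "red"
    if strength >= 0.40:
        return "orange"
    if strength >= 0.20:
        return "yellow"
    return "none"


for axis in range(4):
    name = strongest_variable(axis)
    r = loadings[name][axis]
    print(f"Dim.{axis + 1}: {name} ({r:+.2f}, {band(r)})")
```

The margins are narrow on three axes: "Rapeseed" (0.64) is close to "Forage peas" on Dim.2, "Sugar beet" (0.46) to "Horses" on Dim.3, and "Sunflower" (0.51) to "Sugar beet" on Dim.4.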
Automatic classification - K-means
After running PCA, we aimed to build classes of individuals in order to understand their main characteristics. To do this, we used k-means. K-means is an unsupervised classification algorithm that partitions a dataset into k groups by minimizing within-cluster variance.
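To make the principle concrete, here is a minimal pure-Python sketch of Lloyd's algorithm, the usual way k-means is computed. It is not the implementation used in this study: the toy points are invented, and the initial centroids are passed explicitly so that the example is deterministic, whereas statistical software typically draws them at random and repeats the run several times.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init, n_iter=100):
    """Lloyd's algorithm. Returns (labels, centroids, within-cluster inertia)."""
    centroids = [tuple(c) for c in init]  # copy: do not mutate the caller's list
    k = len(centroids)
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable: the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, inertia


# Toy data (illustration only): two well-separated groups in 2D.
points = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (0.9, 1.0), (1.0, 0.9), (1.0, 1.0)]
labels, centroids, inertia = kmeans(points, init=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The returned inertia is the quantity k-means minimizes. The result depends on the starting centroids: starting both centroids inside the same group can leave the algorithm in a poorer local optimum.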
Determining the optimal number of clusters
To determine the optimal number of clusters, we used the elbow method. This approach consists of plotting within-cluster inertia as a function of the number of clusters (k). The point where inertia reduction starts to slow sharply usually corresponds to the optimal number of clusters.
On our graph, a clear elbow is visible at eight clusters. We therefore retain eight as the optimal number of clusters, since beyond this threshold each additional cluster reduces inertia only marginally.
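In this study the elbow was read from the graph. For illustration, one common numerical heuristic locates it at the k where the inertia curve bends most sharply, that is, where the second difference is largest. The sketch below assumes inertia is available for consecutive values of k; the inertia values are hypothetical and are not those of this study.

```python
def elbow(inertia_by_k):
    """Return the k where the inertia curve bends most sharply,
    measured by the largest second difference (a heuristic)."""
    ks = sorted(inertia_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = inertia_by_k[prev_k] - inertia_by_k[k]
        drop_after = inertia_by_k[k] - inertia_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical inertia values, NOT the ones observed in this study.
toy_inertia = {1: 1000, 2: 700, 3: 450, 4: 400, 5: 370, 6: 350}
print(elbow(toy_inertia))  # 3
```

Such a heuristic is best treated as a complement to the visual reading of the curve, since real inertia curves rarely show a single unambiguous bend.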
Within-cluster variance
Selecting the most representative clusters
By selecting eight clusters, we observe an uneven distribution of group sizes. For this study, we focus only on the four most represented clusters:
- Cluster 1: 857 individuals
- Cluster 2: 3,497 individuals
- Cluster 5: 710 individuals
- Cluster 6: 3,405 individuals
This selection allows us to analyze the groups with the greatest impact while avoiding those with counts too low to be meaningful.
Cluster size table
| Cluster | Count |
|---|---|
| 1 | 857 |
| 2 | 3497 |
| 3 | 655 |
| 4 | 508 |
| 5 | 710 |
| 6 | 3405 |
| 7 | 361 |
| 8 | 470 |
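The selection can be reproduced from the cluster size table. The short sketch below ranks clusters by size and computes the share of the 1975 Cohort covered by the four largest; the variable names are illustrative.

```python
# Cluster sizes copied from the cluster size table.
cluster_sizes = {1: 857, 2: 3497, 3: 655, 4: 508, 5: 710, 6: 3405, 7: 361, 8: 470}

# Keep the four largest clusters, then list them in numerical order.
largest = sorted(cluster_sizes, key=cluster_sizes.get, reverse=True)[:4]
retained = sorted(largest)

total = sum(cluster_sizes.values())  # 10463, the size of the 1975 Cohort
share = sum(cluster_sizes[c] for c in retained) / total
print(retained, f"{share:.1%}")  # [1, 2, 5, 6] 80.9%
```

The four retained clusters therefore cover about four in five farmers of the cohort.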
Table of most frequent crops
| Crop | Frequency |
|---|---|
| Grasslands | 6538 |
| Wheat or barley | 5914 |
| Maize | 5503 |
| Vineyards | 3253 |
| Rapeseed | 1602 |
| Sunflower | 1478 |
| Sugar beet | 1277 |
| Forage peas | 1024 |
Analysis of the four most represented clusters
After identifying the four most represented clusters, we analyze these groups based on five key variables: count, average age at career start, smoking prevalence, average number of packs consumed per year, and average activity duration. These indicators reveal notable differences in entry age, smoking behavior, and career stability.
Cluster 1:
Individuals start their careers at an average age of 20 years. Slightly fewer than one in two smoke (47.84%), and among smokers, consumption reaches around 14 packs per year. This group also has the longest activity duration (27.88 years on average).
Cluster 2:
This cluster is characterized by the earliest career start (19.35 years on average) and the lowest smoking prevalence (47.24%), with consumption around 12 packs per year. Activity duration is high, close to 28 years on average.
Cluster 5:
Individuals begin their careers at around 19.85 years. Smoking prevalence is high (51.83%), with consumption of around 15.6 packs per year.
Cluster 6:
This group has the latest career start (22.18 years on average), the highest smoking prevalence (59.06%), and an average consumption of 16 packs per year. Its activity duration is the shortest (25.29 years on average).
These observations highlight distinct profiles: some groups show greater tobacco-related exposure, while others display characteristics linked to earlier or later career starts. These results provide concrete guidance to adapt prevention policies to each cluster's specific profile (see Cluster description).
Cluster description
| Cluster | Cluster1 | Cluster2 | Cluster5 | Cluster6 |
|---|---|---|---|---|
| Count | 857 | 3497 | 710 | 3405 |
| Average age at career start (standard deviation) | 20.07 (6.07) | 19.35 (6.28) | 19.85 (5.45) | 22.18 (7.99) |
| Smoking prevalence | 47.84 | 47.24 | 51.83 | 59.06 |
| Average number of tobacco packs consumed per year | 14.17 | 12.47 | 15.6 | 15.99 |
| Average activity duration (standard deviation) | 27.88 (7.68) | 27.84 (9.04) | 27.47 (8.86) | 25.29 (9.68) |
Analysis of agricultural clusters
Cluster 1: Large-scale cereal and oilseed crops
This cluster stands out with very high V-tests (>10) for several crops. For each variable listed below, the values in parentheses are the cluster mean and the V-test.
- Over-represented:
  - Sunflower (0.678, V-test=77.391)
  - Rapeseed (0.515, V-test=55.533)
  - Wheat/barley (0.922, V-test=30.598)
  - Maize (0.887, V-test=23.014)
  - Poultry (0.076, V-test=15.066)
  - Forage peas (0.159, V-test=13.814)
  - Sheep/goats (0.037, V-test=10.270)
- Under-represented:
  - Vineyards (0.147, V-test=-11.395)
This cluster represents farms specialized in large-scale cereal and oilseed crops.
Cluster 2: Cattle farming with mixed cropping
- Over-represented:
  - Grasslands (0.891, V-test=62.759)
  - Cattle (0.879, V-test=60.507)
  - Wheat/barley (0.67, V-test=33.810)
  - Maize (0.677, V-test=30.247)
- Under-represented:
  - Sunflower (0.028, V-test=-18.369)
  - Rapeseed (0.023, V-test=-19.992)
  - Forage peas (0.01, V-test=-19.337)
  - Sugar beet (0.031, V-test=-16.689)
  - Vineyards (0.107, V-test=-35.422)
This cluster corresponds to a cattle farming system with grasslands and complementary cereal crops.
Cluster 5: Industrial and large-scale crops
- Over-represented:
  - Sugar beet (0.832, V-test=80.055)
  - Forage peas (0.506, V-test=57.453)
  - Wheat/barley (0.928, V-test=27.936)
  - Rapeseed (0.258, V-test=20.090)
  - Pigs (0.064, V-test=15.688)
- Under-represented:
  - Vineyards (0.057, V-test=-15.864)
This cluster is mainly oriented toward industrial and large-scale crop systems.
Cluster 6: Specialized viticulture
- Over-represented:
  - Vineyards (0.615, V-test=51.546)
- Under-represented:
  - Sunflower (0.006, V-test=-24.824)
  - Rapeseed (0.004, V-test=-25.438)
  - Wheat/barley (0.055, V-test=-68.561)
  - Maize (0.066, V-test=-48.166)
  - Poultry (0.008, V-test=-10.862)
  - Forage peas (0.002, V-test=-21.554)
  - Grasslands (0.074, V-test=-71.576)
  - Cattle (0.132, V-test=-64.021)
  - Sugar beet (0.007, V-test=-23.373)
This cluster is clearly specialized in viticulture.
Comparative synthesis
The analysis of the four most represented clusters highlights distinct agricultural systems based on dominant crops and livestock. The characteristic variables are statistically highly significant (V-test above 10 in absolute value) and reveal four well-defined agricultural systems:
- Large-scale cereal and oilseed crops (C1)
- Cattle farming with mixed cropping (C2)
- Industrial and large-scale crops (C5)
- Specialized viticulture (C6)
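For readers unfamiliar with the V-test, the sketch below shows its usual definition for a quantitative variable: the gap between the cluster mean and the overall mean, divided by the standard error of a mean computed on a sample of the same size drawn without replacement from the whole population. We assume this standard definition is the one behind the values reported here; the use of the population variance and the toy data are also assumptions.

```python
import math
import statistics


def v_test(values, in_cluster):
    """V-test of one variable for one cluster.

    values:     the variable for all N individuals
    in_cluster: booleans, True for the n_k members of the cluster
    """
    n_total = len(values)
    members = [v for v, flag in zip(values, in_cluster) if flag]
    n_k = len(members)
    if not 0 < n_k < n_total:
        raise ValueError("the cluster must be a non-empty strict subset")
    overall_mean = statistics.fmean(values)
    overall_var = statistics.pvariance(values)  # assumption: population variance
    # Standard error of the mean of n_k draws without replacement.
    std_error = math.sqrt(overall_var / n_k * (n_total - n_k) / (n_total - 1))
    return (statistics.fmean(members) - overall_mean) / std_error


# Toy activity ratios (illustration only): the first three individuals
# form a cluster with high values of the variable.
values = [1.0, 0.8, 0.9, 0.0, 0.1, 0.0, 0.2, 0.0]
in_cluster = [True, True, True, False, False, False, False, False]
v = v_test(values, in_cluster)
print(round(v, 2))
```

A positive value indicates over-representation in the cluster and a negative value under-representation. Absolute values above roughly 2 are conventionally read as significant at the 5% level, so the threshold of 10 used here is very conservative.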
Results table
| Variable | Mean c1 | v-test c1 | Mean c2 | v-test c2 | Mean c5 | v-test c5 | Mean c6 | v-test c6 | Mean |
|---|---|---|---|---|---|---|---|---|---|
| Sunflower | 0.678 | 77.391 | 0.028 | -18.369 | 0.019 | -7.665 | 0.006 | -24.824 | 0.754 |
| Rapeseed | 0.515 | 55.533 | 0.023 | -19.992 | 0.258 | 20.090 | 0.004 | -25.438 | 0.476 |
| Wheat or barley | 0.922 | 30.598 | 0.67 | 33.810 | 0.928 | 27.936 | 0.055 | -68.561 | 0.631 |
| Maize | 0.887 | 23.014 | 0.677 | 30.247 | 0.582 | 6.436 | 0.066 | -48.166 | 0.457 |
| Poultry | 0.076 | 15.066 | 0.019 | -3.067 | 0.056 | 8.271 | 0.008 | -10.862 | 0.038 |
| Forage peas | 0.159 | 13.814 | 0.01 | -19.337 | 0.506 | 57.453 | 0.002 | -21.554 | 0.052 |
| Sheep/Goats | 0.037 | 10.270 | 0.006 | -4.960 | 0.036 | 8.946 | 0.005 | -6.393 | 0.030 |
| Grasslands | 0.627 | 7.547 | 0.891 | 62.759 | 0.629 | 6.924 | 0.074 | -71.576 | 0.317 |
| Pigs | 0.032 | 5.723 | 0.013 | -2.883 | 0.064 | 15.688 | 0.005 | -9.493 | 0.162 |
| Cattle | 0.55 | 1.958 | 0.879 | 60.507 | 0.58 | 3.640 | 0.132 | -64.021 | 0.164 |
| Horses | 0.006 | 0.549 | 0.004 | -1.353 | 0.014 | 6.230 | 0.004 | -2.501 | 0.182 |
| Sugar beet | 0.017 | -8.320 | 0.031 | -16.689 | 0.832 | 80.055 | 0.007 | -23.373 | 0.029 |
| Vineyards | 0.147 | -11.395 | 0.107 | -35.422 | 0.057 | -15.864 | 0.615 | 51.546 | 0.096 |
Conclusion
Throughout their daily work and professional careers, farmers are exposed to many factors that may affect their health. These multiple and complex exposures can be viewed as a combination of physical, chemical, biological, and behavioral agents. Our study relied on AGRICAN cohort data to assess the impact of occupational agricultural exposure on cancer risk.
Analysis of these data identified distinct farmer profiles, providing a better understanding of agricultural practices, socio-demographic characteristics, and associated exposures. Principal Component Analysis (PCA) highlighted significant correlations between specific crops. The variables "Wheat or barley," "Maize," "Grasslands," and "Cattle" are strongly associated with the first axis, while "Vineyards" shows a negative correlation with this axis. The second axis is mainly structured around "Forage peas" and "Rapeseed," revealing a coherent organization of agricultural activities.
K-means classification revealed four main clusters representative of the studied agricultural population. Cluster 1 groups farmers focused on large-scale cereal and oilseed crops, with a strong presence of sunflower and rapeseed, career starts around age 20, and the longest activity duration despite moderate smoking. Cluster 2 includes cattle farmers practicing mixed cropping, characterized by an early career start at age 19, low smoking prevalence, and long activity duration of about 28 years. Cluster 5 consists of farmers specialized in industrial crops such as sugar beet and forage peas, with high smoking prevalence (51.83%) and average consumption of 15.6 packs per year. Finally, Cluster 6 includes viticulture specialists, distinguished by a later career start, the highest smoking prevalence, and the shortest activity duration, suggesting specific occupational exposure profiles.
These results highlight the heterogeneity of occupational exposure profiles in the agricultural sector. The combination of factors such as farm type, smoking habits, and activity duration generates distinct risk profiles, requiring tailored prevention strategies. These findings also help guide future epidemiological research by targeting specific populations and refining prevention approaches.