Application of K-Means and Fuzzy K-Means to Rice Dataset in Sierra Leone

K-means and fuzzy k-means are unsupervised learning techniques for clustering. We apply both techniques to agronomic data collected in 2015 to demonstrate the efficiency of fuzzy k-means over k-means for eight rice varieties in Sierra Leone. We also identify rice varieties that appear as outliers in the silhouette plot of the clusters (segments).


Introduction
K-means clustering is one of the main topics in machine learning. Machine learning is widely used in the physical and natural sciences because it helps to build an intuition about the structure and patterns in data. Clustering identifies similar or different subgroups in a given dataset (Hartigan and Wong, 1976). The data points within a cluster are homogeneous, that is, similar to one another.
K-means partitions n observations into k clusters. The method represents each cluster by a prototype (centroid) and minimizes a squared-error function (Bradley et al., 1998). It is an iterative algorithm: the data are repeatedly partitioned into clusters (subgroups) so that the data points within each cluster are as similar (homogeneous) as possible (Bradley & Fayyad, 1998).
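The iteration described above can be sketched in a few lines. This is a minimal illustration of the standard assignment and update steps on a single variable, not the procedure run in SAS or XLSTAT for this study. The function name kmeans_1d, the random initialization, and the sample values are assumptions made for the example; the values are made up and do not come from the dataset.

```python
import numpy as np

def kmeans_1d(x, k, n_iter=100, seed=0):
    """Standard k-means iteration on a 1-D array; returns (centroids, labels)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k observations drawn at random without replacement.
    centroids = rng.choice(x, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            x[labels == j].mean() if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Hypothetical days-to-maturity values, for illustration only.
days = np.array([95, 96, 98, 110, 112, 113, 130, 131, 133], dtype=float)
centroids, labels = kmeans_1d(days, k=3)

# Squared-error function that the iteration tries to reduce.
sse = np.sum((days - centroids[labels]) ** 2)
print(centroids, labels, sse)
```

The result depends on the starting centroids, so in practice the iteration is usually repeated from several random starts and the partition with the smallest squared error is kept.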
Fuzzy k-means, on the other hand, is a softer (more flexible) method than k-means because each point can belong to more than one centroid, with a different degree of membership in each (Bradley & Fayyad, 1998). Fuzzy k-means is more statistically formalized and discovers soft clusters, where a particular point can belong to more than one cluster with a certain probability. In this paper, we present an application of cluster analysis to agronomy, with eight rice varieties. The paper is divided into four parts. The next section presents the methodology, including a summary of the dataset, followed by the results and discussion. We give our conclusion in section four.
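To make the idea of soft membership concrete, the following is a minimal sketch of the common fuzzy c-means formulation on a single variable. The paper does not state which formulation or settings were used, so the fuzzifier m = 2, the function name fuzzy_kmeans_1d, the stopping tolerance, and the sample values are all assumptions for illustration; the values are made up and do not come from the dataset.

```python
import numpy as np

def fuzzy_kmeans_1d(x, k, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Fuzzy c-means on a 1-D array; returns (centroids, memberships)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    u = rng.random((x.size, k))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        w = u ** m
        # Centroids are membership-weighted means of the data.
        centroids = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        # Distance of every point to every centroid; the small constant
        # avoids division by zero when a point sits on a centroid.
        d = np.abs(x[:, None] - centroids[None, :]) + 1e-12
        # Membership update: closer centroids receive larger weights.
        inv = d ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.max(np.abs(new_u - u)) < tol
        u = new_u
        if converged:
            break
    return centroids, u

# Hypothetical days-to-maturity values, for illustration only.
days = np.array([95, 96, 98, 110, 112, 113, 130, 131, 133], dtype=float)
centroids, u = fuzzy_kmeans_1d(days, k=3)
print(np.round(u, 3))  # one row per observation, one column per cluster
```

Each row of the membership matrix sums to 1, so a point lying between two centroids is shared between them instead of being forced into one cluster, which is the flexibility that distinguishes the fuzzy method from k-means.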

Methodology
The data were collected in 2015. From this dataset, we considered only eight rice varieties, including Nerica 1, Nerica 3, Nerica 6, Rok 3, Rok 16, Rok 17 and Pa Gbonko. The dataset has nine independent variables, including panicle, tillers, plant height, number of filled grains, days to 50% flowering, days to maturity, and grain yield. We consider one of these variables, days to maturity. We chose a single variable because the number of observations, 384, is sufficient for the analysis. Statistical Analysis System (SAS), ARiS and XLSTAT were used to generate the tables and graphics.

Results and Discussion
Here we present a cluster analysis of eight rice varieties in Sierra Leone. We aim to segment these varieties into subgroups to show the similarities among them.
Table 1 shows that the widths of all the clusters are close to 1, and the mean width of 0.706 indicates a good choice of clustering, as it approaches 1. Table 2 summarizes the clusters for days to maturity. Cluster 5 has the smallest size, while cluster 4 has the largest. The minimum and maximum distances to the centroid are both zero. Table 3 shows the respective cluster centers for days to maturity. The means for days to maturity are good, as they are very close to 1. Table 4 gives summary statistics for days to maturity.
Wilks' lambda was used to test the differences between the means of the identified groups (clusters) of subjects on a combination of dependent variables (Nicola, 2000). In Table 5, 24 iterations were used for days to maturity, and Wilks' lambda is 0.018, which is significant at the 0.05 level.
A silhouette is a method of validating the consistency of clusters. It shows how well each observation matches its own cluster compared with the other clusters, and it measures the degree of separation between clusters. In this instance, we consider eight varieties taken from a large dataset from two locations in Sierra Leone (Rokupr and Bo). Figure 2 shows five clusters. Clusters 1, 2, 3 and 4 have high silhouette values but different widths. Cluster 5 has few members and one outlier. Nerica 3 is an outlier in clusters 1 and 3, while ROK 16 and Nerica 1 are outliers in clusters 2 and 5, respectively. This implies that the clustering is appropriate, because most of the varieties have high values within their clusters.
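The two validation measures discussed above can be illustrated on a single variable. This is a minimal sketch under stated assumptions, not a reproduction of the SAS or XLSTAT output: the function names silhouette_widths and wilks_lambda_1d, the sample values, and the cluster labels are hypothetical. For one variable, Wilks' lambda reduces to the ratio of the within-cluster sum of squares to the total sum of squares, so values near 0 indicate well-separated cluster means. The silhouette function assumes at least two clusters.

```python
import numpy as np

def silhouette_widths(x, labels):
    """Silhouette width of each observation for 1-D data."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    dist = np.abs(x[:, None] - x[None, :])  # pairwise distances
    s = np.zeros(x.size)
    for i in range(x.size):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # a singleton cluster is given width 0 by convention
        # a: mean distance to the other members of the same cluster
        # (dist[i, i] is 0, so dividing by size - 1 excludes the point itself).
        a = dist[i, own].sum() / (own.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

def wilks_lambda_1d(x, labels):
    """Within-cluster sum of squares divided by total sum of squares."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    ss_total = np.sum((x - x.mean()) ** 2)
    ss_within = sum(np.sum((x[labels == c] - x[labels == c].mean()) ** 2)
                    for c in np.unique(labels))
    return ss_within / ss_total

# Hypothetical days-to-maturity values and cluster labels, for illustration only.
days = np.array([95, 96, 98, 110, 112, 113, 130, 131, 133], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

s = silhouette_widths(days, labels)
lam = wilks_lambda_1d(days, labels)
print(np.round(s, 3), round(s.mean(), 3), round(lam, 4))
```

Observations with a low or negative silhouette width sit closer to another cluster than to their own, which is how outlying varieties show up in a silhouette plot.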

Conclusion
We presented both k-means and fuzzy k-means clustering for the variable days to maturity. We demonstrated that fuzzy k-means is more accurate than k-means because of its flexibility. From the silhouette, we deduce that Nerica 3 is an outlier in clusters 1 and 3, while ROK 16 and Nerica 1 are outliers in clusters 2 and 5, respectively. With 24 iterations for days to maturity, Wilks' lambda showed a significant difference at the 0.05 level.