Development of Big Data Clustering with Apache Spark

Authors

  • Bakhshali Bakhtiyarov, Azerbaijan State Oil and Industry University
  • Aynur Jabiyeva, Azerbaijan State Oil and Industry University
  • Gunay Hasanova, Azerbaijan State Oil and Industry University

DOI:

https://doi.org/10.55549/epstem.1173

Keywords:

Scalable clustering, Apache Spark, Resilient distributed dataset, MLlib, MapReduce, Big data

Abstract

Research on clustering techniques and big data information retrieval has increased significantly in recent years. This paper introduces a new approach to distributed clustering that performs adaptive density estimation. The method is implemented in Apache Spark, and its performance is tested on several well-known datasets. The first stage of the algorithm partitions the data using Bayesian LSH as one of its locality-sensitive hashing (LSH) proxies. Partitioning reduces superfluous computation and lets the method run in parallel with simple processes. The steps of the proposed algorithm operate independently, so the processing sequence does not introduce bottlenecks into the workflow. Removing outliers increases the stability of the method because local structures keep their integrity. Density is defined through the ordered weighted average (OWA) distance, which makes clusters more internally homogeneous. Local density peaks are identified by computing the density of each node. The selected peaks serve as cluster centers, and each remaining point is assigned to the nearest cluster. The proposed method is compared with previous results reported in the literature. According to the computed validity indices, it achieves higher accuracy and lower sensitivity to noise. The method offers scalability and high efficiency at reduced computational complexity. The strategy is suited to general-purpose clustering and has been applied successfully to clustering and other problems.
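
The density-peak stage described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation, and it omits the LSH partitioning and outlier removal stages. The OWA weights, the Gaussian density kernel, the `cutoff` parameter, the rule for picking centers (largest density times distance to the nearest denser point), and all function names are assumptions made for illustration only.

```python
import math

def owa_distance(a, b, weights):
    """OWA distance: sort per-coordinate absolute differences in
    descending order, then take their weighted sum (assumed form)."""
    diffs = sorted((abs(x - y) for x, y in zip(a, b)), reverse=True)
    return sum(w * d for w, d in zip(weights, diffs))

def density_peak_clusters(points, weights, cutoff, n_clusters):
    n = len(points)
    dist = [[owa_distance(points[i], points[j], weights) for j in range(n)]
            for i in range(n)]
    # Local density of each point: Gaussian kernel over OWA distances (assumption).
    rho = [sum(math.exp(-(dist[i][j] / cutoff) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    # Visit points from densest to sparsest.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n    # distance to the nearest point that ranks higher in density
    parent = [-1] * n    # index of that point
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i], parent[i] = dist[i][j], j
    # The densest point is always a center; the rest are the points with the
    # largest rho * delta, i.e. dense points far from any denser point.
    rest = sorted(order[1:], key=lambda i: -rho[i] * delta[i])
    centers = [order[0]] + rest[:n_clusters - 1]
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Each remaining point inherits the label of its nearest denser point,
    # which has already been labeled because of the visiting order.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers

# Two small, well-separated groups; indices 0 and 5 sit in the middle of each.
group_a = [(0.1, 0.1), (0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2)]
group_b = [(x + 5.0, y + 5.0) for x, y in group_a]
points = group_a + group_b

labels, centers = density_peak_clusters(points, weights=(0.7, 0.3),
                                        cutoff=0.5, n_clusters=2)
print(labels, centers)
```

In a distributed setting, a computation of this kind would run inside each partition rather than over the full pairwise distance matrix, which is where the partitioning step saves work.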

Published

2025-10-30

Section

Articles

How to Cite

Development of Big Data Clustering with Apache Spark. (2025). The Eurasia Proceedings of Science, Technology, Engineering and Mathematics, 36, 220-228. https://doi.org/10.55549/epstem.1173