Development of Big Data Clustering with Apache Spark

Bakhshali Bakhtiyarov; Aynur Jabiyeva; Gunay Hasanova

doi:10.55549/epstem.1173

Authors

Bakhshali Bakhtiyarov, Azerbaijan State Oil and Industry University
Aynur Jabiyeva, Azerbaijan State Oil and Industry University
Gunay Hasanova, Azerbaijan State Oil and Industry University

DOI:

https://doi.org/10.55549/epstem.1173

Keywords:

Scalable clustering, Apache Spark, Resilient distributed dataset, MLlib, MapReduce, Big data

Abstract

Research on clustering techniques and big data information retrieval has increased significantly in recent years. This paper introduces a new approach to distributed clustering based on adaptive density estimation. The method is implemented on Apache Spark, and its performance is tested on several well-known datasets. The first stage of the algorithm partitions the data using Bayesian LSH, one of its locality-sensitive hashing (LSH) proxies. This partitioning reduces redundant computation and is simple to parallelize. The steps of the proposed algorithm are independent of one another, and the processing sequence introduces no bottlenecks in the workflow. Removing outliers also improves the stability of the method, because local structures are preserved. Density is defined with the ordered weighted average (OWA) distance, which makes each cluster more consistent with its internal elements. Local density peaks are identified by computing the density of each node. The selected peaks serve as cluster centers, and each remaining point is assigned to the nearest group. The proposed method is compared with previous results from the literature. According to the computed validity indices, it is more accurate and less sensitive to noise. It also scales well and achieves high efficiency at reduced computational complexity. The strategy is suitable for general-purpose clustering and has been applied successfully to clustering and other problems.
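
The pipeline outlined in the abstract (LSH partitioning, OWA-based density, density peaks as centers, nearest-center assignment) can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' Spark implementation, and several parts are assumptions made only for illustration: random-hyperplane LSH stands in for Bayesian LSH, the OWA weights are a hypothetical linearly decreasing scheme over the k nearest-neighbour distances, and peaks are ranked by density times distance to the nearest denser point. All function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def lsh_partition(points, n_planes=2, seed=0):
    """Bucket point indices by the sign pattern of random-hyperplane projections.

    Assumption: random-hyperplane LSH is used here as a simple stand-in
    for the Bayesian LSH mentioned in the abstract.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(sum(a * b for a, b in zip(plane, p)) >= 0 for plane in planes)
        buckets[key].append(idx)
    return dict(buckets)


def owa_density(points, k=3):
    """Density as the inverse of an OWA aggregate of the k nearest distances.

    Assumption: weights decrease linearly, so closer neighbours count more.
    """
    raw = [k - i for i in range(k)]
    weights = [w / sum(raw) for w in raw]
    densities = []
    for i, p in enumerate(points):
        nearest = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        owa = sum(w * d for w, d in zip(weights, nearest))
        densities.append(1.0 / (owa + 1e-12))
    return densities


def density_peak_clusters(points, n_clusters=2, k=3):
    """Pick density peaks as centers, then assign each point to the nearest center."""
    rho = owa_density(points, k)
    n = len(points)
    delta = []
    for i in range(n):
        # Distance to the nearest point of strictly higher density.
        higher = [math.dist(points[i], points[j]) for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:
            # Densest points get the largest distance to any other point.
            delta.append(max(math.dist(points[i], q) for q in points))
    ranked = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    centers = ranked[:n_clusters]
    labels = [
        min(range(len(centers)), key=lambda c: math.dist(p, points[centers[c]]))
        for p in points
    ]
    return centers, labels


# Two well-separated toy blobs, each a unit square with a central point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
buckets = lsh_partition(points)
centers, labels = density_peak_clusters(points, n_clusters=2, k=3)
print(buckets)
print(centers, labels)
```

On this toy input the central point of each square is selected as a center and the two squares receive different labels. In a distributed setting, each LSH bucket would be a natural unit of parallel work, which is where the partitioning step saves pairwise distance computations.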
