Information and communication and chemical technologies

No. 3 (20) - 2023 / 2023-09-30

ANALYTICS OF SCIENTIFIC TEXTS BASED ON DISTRIBUTED PARALLEL DATA PROCESSING FRAMEWORKS

Authors

Kazakh University of Technology and Business
Esil University
L.N. Gumilyov Eurasian National University
L.N. Gumilyov Eurasian National University

Keywords

parallel analysis, scientific text, big data, unstructured data, data processing, Apache Spark, distributed computing, mathematical apparatus

Link to DOI:

https://doi.org/10.58805/kazutb.v.3.20-151

How to cite

Altynbek S., Shuitenov G., C. Santeyeva, and Turginbayeva A. “ANALYTICS OF SCIENTIFIC TEXTS BASED ON DISTRIBUTED PARALLEL DATA PROCESSING FRAMEWORKS”. Vestnik KazUTB, vol. 3, no. 20, Sept. 2023, doi:10.58805/kazutb.v.3.20-151.

Abstract

The article considers the development of an intelligent system for parallel analysis of unstructured data based on the Apache Spark distributed framework, as well as the formation of a mathematical apparatus for implementing fast algorithms for the analysis of natural-language scientific texts using methods of probability theory and statistics, information theory, and machine learning. Apache Spark is a distributed framework for big data processing and analytics. It provides a fast, general-purpose engine for large-scale data processing, allowing users to run parallel computations on distributed clusters. Unstructured data is data without a predefined structure, such as text, images, videos, and social media posts. Analyzing such data is difficult because meaningful information must be extracted from content that does not fit easily into rows and columns. Developing an intelligent system for parallel analysis of unstructured data with Apache Spark involves several stages. First, the unstructured data must be loaded into the Spark framework. This can be done from various data sources, such as the Hadoop Distributed File System (HDFS), Amazon S3, or any other storage system supported by Spark. The system can be developed in programming languages such as Scala, Java, or Python, which have Spark bindings and provide APIs for interacting with the framework. These APIs allow developers to define data processing pipelines, configure parallel processing parameters, and perform analysis tasks.
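The stages described in the abstract (load the texts, process partitions in parallel, combine the results, apply a statistical or information-theoretic measure) can be illustrated with a minimal sketch. It is not the authors' system and does not use Spark itself: plain Python with a thread pool stands in for the cluster, and the function names (`tokenize`, `map_partition`, `merge_counts`, `shannon_entropy`, `analyze`) are hypothetical. Shannon entropy of the term distribution is chosen here only as one example of an information-theoretic measure; the abstract does not say which measures the system uses. In PySpark the map and merge steps would correspond roughly to `flatMap` and `reduceByKey` on an RDD.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from math import log2


def tokenize(text):
    """Split a text into lowercase alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def map_partition(docs):
    """Map step: count term occurrences within one partition of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial term counts into one."""
    return left + right


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical term distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


def analyze(partitions):
    """Count terms per partition concurrently, merge the counts, compute entropy.

    The thread pool is only a local stand-in for distributed executors;
    it does not give real CPU parallelism in CPython.
    """
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_partition, partitions))
    total = reduce(merge_counts, partials, Counter())
    return total, shannon_entropy(total)


if __name__ == "__main__":
    # Two hypothetical partitions of short "scientific texts".
    partitions = [
        ["Spark processes big data", "Big data needs parallel processing"],
        ["Unstructured text is hard to analyze"],
    ]
    counts, entropy = analyze(partitions)
    print(counts.most_common(3))
    print(round(entropy, 3))
```

Because merging two counts is associative and commutative, the partial results can be combined in any order. This is the property that lets a framework such as Spark run the map step on separate nodes and merge the results afterwards.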