ANALYTICS OF SCIENTIFIC TEXTS BASED ON DISTRIBUTED PARALLEL DATA PROCESSING FRAMEWORKS

S.A. Altynbek; G.Zh. Shuitenov; S.A. Santeyeva; A.S. Turginbayeva

doi:10.58805/kazutb.v.3.20-151

Information and communication and chemical technologies

No. 3 (20) - 2023 / 2023-09-30

ANALYTICS OF SCIENTIFIC TEXTS BASED ON DISTRIBUTED PARALLEL DATA PROCESSING FRAMEWORKS

Authors

S.A. Altynbek
G.Zh. Shuitenov
S.A. Santeyeva
A.S. Turginbayeva

Kazakh University of Technology and Business

Esil University

L.N. Gumilyov Eurasian National University

Keywords

parallel analysis, scientific text, big data, unstructured data, data processing, Apache Spark, distributed computing, mathematical apparatus

Link to DOI:

https://doi.org/10.58805/kazutb.v.3.20-151

How to quote

Altynbek, S. A., G. Zh. Shuitenov, S. A. Santeyeva, and A. S. Turginbayeva. “ANALYTICS OF SCIENTIFIC TEXTS BASED ON DISTRIBUTED PARALLEL DATA PROCESSING FRAMEWORKS”. Vestnik KazUTB, vol. 3, no. 20, Sept. 2023, doi:10.58805/kazutb.v.3.20-151.

Abstract

The article considers the development of an intelligent system for parallel analysis of unstructured data based on the Apache Spark distributed framework, together with the formation of a mathematical apparatus for implementing fast algorithms that analyze natural-language scientific texts using methods from probability theory and statistics, information theory, and machine learning. Apache Spark is a distributed framework for big data processing and analytics. It provides a fast, general-purpose engine for large-scale data processing that lets users run parallel computations across distributed clusters. Unstructured data is data without a predefined structure, such as text, images, videos, and social media posts. Analyzing it is difficult because meaningful information must be extracted from data that does not fit naturally into rows and columns. Developing an intelligent system for parallel analysis of unstructured data with Apache Spark involves several stages. First, the unstructured data must be loaded into the Spark framework from a data source such as the Hadoop Distributed File System (HDFS), Amazon S3, or any other storage system supported by Spark. The system itself can be developed in a programming language such as Scala, Java, or Python, each of which has Spark bindings and provides APIs for interacting with the framework. These APIs allow developers to define data pipelines, configure parallel processing parameters, and perform analysis tasks.
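
To make the pipeline shape described in the abstract more concrete, the following is a minimal sketch in Python using only the standard library. It is not the authors' system and does not call Spark: a thread pool stands in for Spark's distributed executors, and the function names (`count_terms`, `merge_counts`, `shannon_entropy`, `analyze`), the tokenization rule, and the toy corpus are all assumptions made for illustration. It shows a map step (per-document term counts), a reduce step (merging partial counts), and one simple information-theoretic statistic (Shannon entropy of the term distribution).

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Assumed tokenization rule: runs of lowercase Latin letters.
TOKEN = re.compile(r"[a-z]+")


def count_terms(document: str) -> Counter:
    """Map step: tokenize one document and count its terms."""
    return Counter(TOKEN.findall(document.lower()))


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine two partial term counts."""
    return left + right


def shannon_entropy(counts: Counter) -> float:
    """Entropy, in bits, of the empirical term distribution."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def analyze(documents, workers=4):
    """Count terms per document concurrently, then merge and summarize.

    The thread pool only illustrates the partition/map/reduce structure;
    a framework such as Spark would spread the map step across a cluster.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_terms, documents))
    totals = reduce(merge_counts, partials, Counter())
    return totals, shannon_entropy(totals)


# Hypothetical toy corpus, not data from the article.
corpus = [
    "Spark processes big data in parallel",
    "Unstructured data needs parallel analysis",
]
totals, entropy = analyze(corpus)
print(totals.most_common(2), round(entropy, 3))
```

In a Spark-based version, the same three functions would typically be passed to the framework's own map and reduce operations over a dataset loaded from HDFS or S3, with the degree of parallelism set through the cluster configuration rather than a `workers` argument.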
