No. 3 (20) - 2023 / 2023-09-30
The article considers the development of an intelligent system for parallel analysis of unstructured data based on the Apache Spark distributed framework, together with the formation of a mathematical apparatus for implementing fast algorithms for analyzing natural-language scientific texts using methods of probability theory and statistics, information theory, and machine learning. Apache Spark is a distributed framework for big data processing and analytics. It provides a fast, general-purpose engine for large-scale data processing that lets users run computations in parallel across a cluster. Unstructured data is data without a predefined structure, such as text, images, videos, and social media posts. Analyzing it is difficult because meaningful information must be extracted from data that does not fit naturally into rows and columns. Developing an intelligent system for parallel analysis of unstructured data with Apache Spark involves several stages. First, the unstructured data must be loaded into Spark. This can be done from various data sources, such as the Hadoop Distributed File System (HDFS), Amazon S3, or any other storage system supported by Spark. The system itself can be written in Scala, Java, or Python, each of which has Spark APIs. These APIs allow developers to define data processing pipelines, configure parallel processing parameters, and perform analysis tasks.
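The loading and pipeline stages described above can be illustrated with a minimal PySpark sketch. It is not the system described in the article: the function names (`tokenize`, `term_counts`, `merge_counts`, `corpus_term_counts`), the application name, the tokenization rule, and the default of 8 partitions are all assumptions made for illustration. The sketch reads a directory of plain-text documents, counts terms in each document in parallel, and merges the partial counts.

```python
import re
from collections import Counter

# Assumption: a token is a run of lowercase ASCII letters or digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase a document and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def term_counts(text):
    """Count token occurrences in a single document."""
    return Counter(tokenize(text))


def merge_counts(a, b):
    """Combine two partial counts; the operation is associative and commutative."""
    return a + b


def corpus_term_counts(path, partitions=8, app_name="text-analysis-sketch"):
    """Count terms over all text files under `path` in parallel with Spark."""
    # Imported lazily so the helpers above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName(app_name)
        # Parallelism setting; the value here is illustrative only.
        .config("spark.default.parallelism", str(partitions))
        .getOrCreate()
    )
    try:
        # wholeTextFiles yields (file path, file content) pairs, one per document.
        docs = spark.sparkContext.wholeTextFiles(path, minPartitions=partitions)
        # Map each document to its term counts, then fold the partial results.
        return docs.map(lambda kv: term_counts(kv[1])).fold(Counter(), merge_counts)
    finally:
        spark.stop()
```

A call such as `corpus_term_counts("hdfs:///corpus/articles")` (a hypothetical path) would return a single `Counter` for the whole corpus. The same function should accept other storage URIs that the Spark installation is configured to read. The counting logic is kept in plain functions so that it can be checked on a single machine before being run on a cluster.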