High Performance Queries Using Compressed Bitmap Indexes
Data that contain largely unchanging records are becoming increasingly important. Many data sources, such as historical archives, sensor readings, health systems, and machine logs, do not change frequently but grow constantly. For this reason, the need to process such datasets more quickly has emerged. The bitmap index, which can benefit from multicore and multiprocessor systems, is designed to process data that grows over time but does not change frequently. It has a well-known advantage, particularly for queries on low-cardinality data. Low-cardinality attributes such as gender, age, marital status, postal code, and even date occupy an important place in datasets. Furthermore, a bitmap index using a compression algorithm can be applied efficiently even if the data has high cardinality. In this study, a bitmap index with optimal encoding is introduced to improve queries; in a performance comparison against a commonly used relational database system, it performed queries up to 20x faster on data containing infrequently changing records.
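The query mechanism this abstract builds on can be pictured with a minimal sketch: one bitmap per distinct value of a low-cardinality column, so that a conjunctive query reduces to a bitwise AND. This is only an illustration of the basic (uncompressed) bitmap index idea, not the encoding or compression scheme from the paper.

```python
# Minimal sketch of an uncompressed bitmap index over a low-cardinality
# column. Each distinct value maps to a bitmap with one bit per row.

def build_bitmap_index(values):
    """Map each distinct column value to a bitmap (bit i = row i)."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    row, rows = 0, []
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

gender = build_bitmap_index(["F", "M", "F", "F", "M"])
status = build_bitmap_index(["single", "married", "married", "single", "single"])

# Conjunctive query "gender = 'F' AND status = 'single'" is one bitwise AND.
hits = gender["F"] & status["single"]
print(rows_of(hits))  # rows 0 and 3
```

Because the AND operates on whole machine words at a time, such queries parallelize naturally across cores, which is the property the abstract refers to; compressed encodings (e.g. run-length-based schemes) extend the same idea to high-cardinality columns.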
Learning Quality Improved Word Embedding with Assessment of Hyperparameters
Beytullah Yildiz and Murat Tezgider
Deep learning practices have a large impact on many areas. Big data and key hardware developments in GPUs and TPUs are the main reasons behind deep learning's success. The recent progress in text analysis and classification using deep learning has been significant as well. The quality of word representations, which has improved considerably with methods such as Word2Vec, FastText, and GloVe, has been important in this progress. In this study, we aimed to improve the Word2Vec word representation, also called an embedding, by tuning its hyperparameters. The minimum word count, vector size, window size, and the number of iterations were used to improve word embeddings. We introduced two approaches, which are faster than grid search and random search, to set the hyperparameters. The word embeddings were created using documents with approximately 300 million words. A deep learning classification model that uses documents consisting of 10 different classes was applied to evaluate the quality of word embeddings. A 9% increase in classification success was achieved only by improving hyperparameters.
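A search cheaper than grid search, in the spirit of what the abstract describes, can be sketched as a greedy coordinate-wise pass over the four hyperparameters named above: tune one at a time while holding the others fixed. This is a generic illustration, not the study's exact procedure, and `dummy_score` is a stand-in for the real objective (training embeddings and evaluating them with the downstream classifier).

```python
# Coordinate-wise hyperparameter search sketch: improve one
# hyperparameter at a time instead of enumerating the full grid.

SEARCH_SPACE = {
    "min_count": [1, 5, 10],
    "vector_size": [100, 200, 300],
    "window": [3, 5, 10],
    "epochs": [5, 10, 20],
}

def coordinate_search(score, space, start):
    """Greedily improve one hyperparameter at a time."""
    best = dict(start)
    best_score = score(best)
    for name, candidates in space.items():
        for value in candidates:
            trial = dict(best, **{name: value})
            trial_score = score(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best, best_score

# Dummy objective standing in for "train embeddings, then classify":
# pretend quality peaks at vector_size=200 and window=5.
def dummy_score(params):
    return -abs(params["vector_size"] - 200) - abs(params["window"] - 5)

start = {"min_count": 5, "vector_size": 100, "window": 3, "epochs": 5}
best, _ = coordinate_search(dummy_score, SEARCH_SPACE, start)
print(best["vector_size"], best["window"])  # 200 5
```

With four hyperparameters of three candidates each, this evaluates 13 configurations instead of the 81 a full grid search would require, which is why such one-at-a-time schemes are attractive when each evaluation means retraining embeddings over hundreds of millions of words.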
Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs
Lauritz Thamsen, Ilya Verbitskiy, Sasho Nedelkoski, Vinh Thuy Tran, Vinícius Meyer, Miguel G. Xavier, Odej Kao, and César A. F. De Rose
Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage of these jobs also typically fluctuates considerably. Therefore, multiple jobs usually get scheduled onto the same shared resources to increase the resource utilization and throughput of clusters. However, job runtimes and the utilization of shared resources can vary significantly depending on the specific combinations of co-located jobs.
This paper presents Hugo, a cluster scheduler that continuously learns how efficiently jobs share resources, considering metrics for the resource utilization and interference among co-located jobs. The scheduler combines offline grouping of jobs with online reinforcement learning to provide a scheduling mechanism that efficiently generalizes from specific monitored job combinations yet also adapts to changes in workloads. Our evaluation of a prototype shows that the approach can reduce the runtimes of exemplary Spark jobs on a YARN cluster by up to 12.5%, while resource utilization is increased and waiting times can be bounded.
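The online-learning component described above can be pictured as a bandit over job-group pairs: given the group of a job already running on a node, choose which queued group to co-locate with it, and update an estimated reward from the observed utilization and interference. The toy epsilon-greedy sketch below is a simplified stand-in for this idea; the group labels and rewards are invented, and Hugo's actual mechanism combines offline job grouping with reinforcement learning over monitored metrics.

```python
import random

class CoLocationBandit:
    """Epsilon-greedy learner over (running group, candidate group) pairs."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.value = {}   # (running_group, candidate_group) -> average reward
        self.count = {}   # observation counts per pair

    def pick(self, running_group, queued_groups):
        """Choose which queued job group to co-locate next."""
        if random.random() < self.epsilon:
            return random.choice(queued_groups)  # explore
        # Exploit: pick the group with the best observed reward so far.
        return max(queued_groups,
                   key=lambda g: self.value.get((running_group, g), 0.0))

    def update(self, running_group, chosen_group, reward):
        """Fold an observed reward (e.g. utilization minus interference)
        into the running average for this group combination."""
        key = (running_group, chosen_group)
        n = self.count.get(key, 0) + 1
        avg = self.value.get(key, 0.0)
        self.count[key] = n
        self.value[key] = avg + (reward - avg) / n
```

For example, after observing that I/O-bound jobs share a node well with CPU-bound jobs while two CPU-bound jobs interfere, the learner will prefer the complementary pairing on subsequent scheduling decisions.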
FLY: A Domain-Specific Language for Scientific Computing on FaaS
G. Cordasco, M. D’Auria, A. Negro, V. Scarano, and C. Spagnuolo
Cloud Computing is widely recognized as a distributed computing paradigm for the next generation of dynamically scalable applications. Recently, a novel service model, called Function-as-a-Service (FaaS), has been proposed that enables users to exploit the computational power of cloud infrastructures without the need to configure and manage complex computing systems. The FaaS paradigm represents an opportunity to easily develop and execute extreme-scale applications, as it allows fine-grained decomposition of the application with much more efficient scheduling on the cloud provider's infrastructure.
We introduce FLY, a domain-specific language for designing, deploying and executing scientific computing applications by exploiting the FaaS service model on different cloud infrastructures. In this paper, we present the design and the language definition of FLY on several computing (local and FaaS) back-ends: Symmetric multiprocessing (SMP), Amazon AWS Lambda, Microsoft Azure Functions, Google Cloud Functions, and IBM Bluemix/Apache OpenWhisk. We also present the first FLY source-to-source compiler, publicly available on GitHub, which supports the SMP and AWS back-ends.
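The fine-grained decomposition FaaS enables can be pictured as mapping many small, stateless function invocations over independent chunks of data. In the sketch below a thread pool stands in for a FaaS back-end, loosely analogous to the local SMP back-end mentioned above; this does not use FLY syntax or any cloud provider API, and the chunking scheme is purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def invoke(chunk):
    """A stateless 'cloud function': partial sum of squares over one chunk."""
    return sum(x * x for x in chunk)

# Decompose one large computation into many small invocations.
data = list(range(1000))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

# A thread pool plays the role of the FaaS scheduler dispatching
# independent function invocations; a real back-end would fan these
# out to e.g. AWS Lambda and gather the results.
with ThreadPoolExecutor(max_workers=10) as pool:
    partials = list(pool.map(invoke, chunks))

print(sum(partials))  # sum of squares of 0..999
```

Because each invocation is stateless and independent, the same decomposition can be retargeted from a local SMP pool to a serverless provider, which is the portability across back-ends that a source-to-source compiler like FLY's aims to provide.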