
BigData, NoSQL, Resource Utilisation and a Few Words in Conclusion

In previous parts, I promised a closer look at the concept of BigData and at the individual kinds of non-relational databases. To this I will add a brief description of resource utilisation, and in the conclusion I will quickly summarise what I have covered in this three-part series about databases and database models. Unfortunately, due to the length of this part, I will not get to all of the topics mentioned above; the Apache Software Foundation products for easier data handling, together with the promised surprise, will therefore follow in the next post.

BigData

Although I am very interested in BigData, I have not found an exact definition of the term. I therefore prefer Gartner's definition, which describes BigData using the three Vs: high Volume, high Velocity and high Variety (source 5). As you can see, it is not just about the amount of data, but also about the speed at which the data are read, written and processed, and about the variety of structures in which they are stored. A good example of BigData is the CERN Large Hadron Collider, which should generate approximately 15 PB (1 PB = 2⁵⁰ B) of data per year (source 4). Although not all of these data are stored in databases, it is an enormous amount.

Models of non-relational databases

Development in this sector has been very rapid. For ease of reference, we can divide non-relational databases according to the database model they use. The basic models are the document-oriented model, the key-value model, the BigTable model, the graph model and the time-series model.

Document-oriented model

The basic element of this model is a document containing semi-structured data. In addition to the data themselves, the document also stores a description of their structure (metadata). Databases of this type are usually used for storing data in the XML, YAML, JSON or BSON formats. The information in the database can be indexed, and most databases do this automatically, which makes it possible to query either the key or the actual content of the document. To organise the data, documents can be grouped into logical sets (collections) and rules can be assigned to them (such as access rights). Documents are not bound to a fixed structure, so documents with different records may live in one collection. These databases also allow the user to work with only parts of a document. Document-oriented databases are a highly flexible NoSQL solution suitable for storing unstructured data, with the option of defining a structure where necessary. They are appropriate wherever the emphasis is on a universal, scalable solution with the ability to search in large texts.
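
To make the idea more concrete, the following is a minimal sketch of working with a document store. The article does not name a specific product, so MongoDB and its pymongo client are used here only as a common example; the connection string, database name and collection name are invented.

    # Illustrative sketch only - MongoDB/pymongo chosen as a common example of a
    # document store; the server address and the names below are made up.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # assumed local instance
    articles = client["blog"]["articles"]               # database "blog", collection "articles"

    # Documents in one collection may have different structures.
    articles.insert_one({"title": "NoSQL overview", "tags": ["nosql", "bigdata"], "views": 120})
    articles.insert_one({"title": "Graph databases", "author": {"name": "Petr"}, "body": "..."})

    # Index the document content so we can search inside documents, not only by key.
    articles.create_index([("body", "text")])

    # Query by content and work with only a part of each document (a projection).
    for doc in articles.find({"tags": "nosql"}, {"title": 1, "_id": 0}):
        print(doc)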

Key-value model

This model can be perceived as a dictionary, or as the equivalent of a two-column table in an RDBMS. The difference is in the approach to the data. While the relational model allows access by both the key and the value, in the non-relational model it was originally possible to query only by the key (nowadays even this rule is broken thanks to secondary indices, e.g. in Riak DB). The biggest advantages of this model are its simplicity and easy horizontal scalability. There are many databases with the key-value model. Some of them offer full support for ACID transactions and ensure consistency (e.g. HyperDex DB); in some cases they may therefore replace a relational database and add easy scalability and higher performance to the existing solution. Others, thanks to their properties, can be used, for example, as a cache (e.g. MemCached).
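
To illustrate the access pattern, and what a secondary index adds to it, here is a toy in-memory sketch in Python. It only imitates the idea; it is not the API of Riak, HyperDex or MemCached, and the keys and values are invented.

    # Toy in-memory sketch of the key-value model (no real database API).
    store = {}              # primary store: key -> value
    index_by_city = {}      # a simplified "secondary index", in the spirit of Riak's 2i

    def put(key, value, city=None):
        store[key] = value
        if city is not None:
            index_by_city.setdefault(city, set()).add(key)

    def get(key):
        # Classic key-value access: you have to know the key.
        return store.get(key)

    def keys_in_city(city):
        # Without a secondary index, finding these keys would require a full scan.
        return index_by_city.get(city, set())

    put("user:1", {"name": "Petr"}, city="Praha")
    put("user:2", {"name": "Šárka"}, city="Brno")
    print(get("user:1"))
    print(keys_in_city("Praha"))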

BigTable Model

This model is sometimes also called the column-family model. Since these databases are mainly clones of the BigTable database developed by Google, I personally prefer the name BigTable model. Databases using this model store data as multidimensional maps. Such a map can be imagined as a table in which the key consists of a "row" identifier, a "column" identifier, a timestamp (used to detect write conflicts and to expire old versions) and a group identifier, the so-called column family. BigTable databases scale extremely well and can handle very large amounts of data; another advantage is how easily new records can be added. The disadvantages are lower performance when using a small number of nodes (a minimum of 3, source 1) and the lack of indexing. Because of this design, searching without knowing the key is also very inefficient. Today, this model finds application, for example, in geolocation databases or in databases that track user behaviour on the web.
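
The multidimensional map can be sketched roughly as follows. This is only a toy model of the (row, column family:column, timestamp) key described above; it does not correspond to any real client library, and the row and column names are invented.

    import time

    # Toy sketch of the BigTable data model: a map keyed by
    # (row key, "column_family:column", timestamp).
    table = {}

    def put(row, family, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        table[(row, f"{family}:{column}", ts)] = value

    def latest(row, family, column):
        # Return the newest version of a cell; older timestamps can be kept
        # to detect write conflicts or be expired after a retention period.
        cells = [(ts, v) for (r, c, ts), v in table.items()
                 if r == row and c == f"{family}:{column}"]
        return max(cells, key=lambda c: c[0])[1] if cells else None

    put("user123", "location", "city", "Praha")
    put("user123", "location", "city", "Brno")    # a newer version of the same cell
    print(latest("user123", "location", "city"))  # -> "Brno"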

Graph model

Graph databases differ from the databases mentioned above. This model represents the structure of the data as a graph: records are nodes and links are edges, which are oriented and supplemented by attributes and an evaluation of the relationship (following graph theory). A relationship (relation) between nodes of the graph always has a direction that determines its meaning, and attributes providing useful information. The direction can be imagined as relationships between people: Petr likes Šárka, but that does not necessarily mean that Šárka likes Petr. Where required, the direction of a relation can be ignored. Since the graph model focuses on the links between data, these databases can be extremely difficult to scale horizontally. On the other hand, compared with relational databases, they handle and process heavily linked data far more efficiently. NoSQL is therefore not only about scalability and data structure. It is suitable to use this kind of database wherever we want to find shortest paths, analyse the links between objects and gain the additional benefits of graph analysis.
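
A small sketch of the graph model and of a shortest-path traversal over it follows. The third person, Jana, and the edge attributes are invented for the example, and no real graph database API is used; the point is only to show oriented, attributed edges and a traversal over them.

    from collections import deque

    # Toy directed graph: nodes are people, edges are "likes" relations with attributes.
    edges = {
        "Petr":  [("Šárka", {"since": 2019})],   # Petr likes Šárka ...
        "Šárka": [("Jana", {"since": 2021})],    # ... but Šárka need not like Petr back
        "Jana":  [],
    }

    def shortest_path(start, goal):
        # Breadth-first search along directed edges - the kind of traversal
        # graph databases are optimised for.
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for neighbour, _attrs in edges.get(path[-1], []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(path + [neighbour])
        return None

    print(shortest_path("Petr", "Jana"))   # -> ['Petr', 'Šárka', 'Jana']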

Time-series model

Time series are materially and spatially comparable observations (data) that are uniquely ordered in time from the past to the present (source 4). We can distinguish several types of time series: interval, instantaneous or derived. Data are usually identified by a name, tags and metadata. The value is usually limited to a simple data type without a complex structure. These databases typically support basic aggregate statistics: sum, average, maximum and minimum. Since the collected data mostly have a regular character, they can be aggregated and scaled very well, and given the nature of the data, it is possible to process a large number of records in near real time. Another advantage of the time-series model is that it handles all of these types of series in the same way.
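
A rough sketch of how a time series and its basic aggregations might look is below. The metric name, tags and measured values are invented for the example; real time-series databases do the same thing at a much larger scale.

    from statistics import mean

    # Toy time-series record: a metric name, tags and (timestamp, value) points.
    series = {
        "name": "temperature",
        "tags": {"room": "server-room-1"},
        "points": [(0, 21.0), (60, 21.4), (120, 22.1), (180, 21.8)],  # seconds, degrees C
    }

    def aggregate(points, window, fn):
        # Downsample the series: apply fn (sum, mean, max, min) per time window.
        buckets = {}
        for ts, value in points:
            buckets.setdefault(ts // window, []).append(value)
        return {bucket * window: fn(values) for bucket, values in sorted(buckets.items())}

    print(aggregate(series["points"], 120, mean))  # average per 2-minute window
    print(aggregate(series["points"], 120, max))   # maximum per 2-minute window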

Resource utilisation

As mentioned above, demand is growing both for the amount of data and for the speed of its processing for presentation; ideally, data are presented in real time. Speed and efficient resource utilisation are therefore crucial. When we retrieve data from a database, there is a big difference in speed between data served from main memory and data read from physical disks (solid-state or hard drives). In general, NoSQL databases try to keep as much relevant data as possible in memory. With appropriate settings and an appropriately chosen data distribution when scaling the system, retrieval can be very efficient, with most data served from memory. However, even with NoSQL databases we cannot avoid working with physically stored data (reading and writing), and this is usually the stumbling block. New demands therefore emerge for working with data on physical storage, and performance improvements could be brought by new technologies in the field of file systems. Copy-on-write file systems have been developed by Sun/Oracle for the Solaris systems (the ZFS file system) and for Linux systems (the Btrfs file system). These systems never modify the original blocks; they modify only their copies, and they treat metadata in the same way. Upon completion, they invalidate the original blocks and validate the new ones. The file system is thus always consistent and avoids the duplicate data writes typical of conventional journaling file systems. However, we will still have to wait some time to assess their actual contribution to resource utilisation.
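
The copy-on-write principle itself can be sketched very roughly as follows. This is only a toy model of the idea (write the new data into a fresh block, then switch the metadata pointer), not how ZFS or Btrfs are actually implemented; the block numbers and data are invented.

    # Toy illustration of copy-on-write: original blocks are never modified in
    # place; new blocks are written first and the metadata pointer is switched
    # only afterwards, so the visible state is always consistent.
    blocks = {1: b"old data"}            # physical blocks, addressed by block number
    metadata = {"valid_block": 1}        # which block currently holds the valid data
    next_free = 2

    def cow_write(new_data):
        global next_free
        new_id, next_free = next_free, next_free + 1
        blocks[new_id] = new_data        # 1) write the new data into a fresh block
        metadata["valid_block"] = new_id # 2) only then switch the metadata pointer
        # The old block is merely invalidated, never overwritten, so a crash
        # before step 2 would still leave the previous, consistent state intact.

    cow_write(b"new data")
    print(blocks[metadata["valid_block"]])   # -> b'new data'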

Summary

In this mini-series about databases, I went over the history of their development and attempted to outline the new possibilities that the development of this sector brings us. Not everything old is bad, and not everything new is a cure-all; with a suitable combination, however, it is possible to reach interesting results both in the speed of data processing and in resource utilisation savings. New trends are gradually heading towards using and linking the ideas of the non-relational and relational models, which, put into the context of business intelligence, can represent a competitive advantage. They can also bring savings in the use of system and hardware resources and show new possibilities in the effort to get the most out of those resources. Sticking only to the relational model, or a complete transition to the non-relational model, proves in many cases to be less effective than connecting the two. In the next article, in addition to introducing the Apache Software Foundation tools, I will describe the thought process that led my colleague and me to use a relational database combined with several types of non-relational databases, and I will support some of my statements with graphs and numbers.

Let me say goodbye by wishing you a beautiful Christmas and all the best in the coming year.

Sources:

1. Daniel G. McCreary and Ann M. Kelly, Making Sense of NoSQL: A Guide for Managers and the Rest of Us (Shelter Island: Manning, 2013, ISBN 978-1-61729-107-4).
2. Christof Strauch, NoSQL Databases, Hochschule der Medien, Stuttgart (http://www.christof-strauch.de/nosqldbs.pdf).
3. Dr. Eric A. Brewer, Towards Robust Distributed Systems (July 19, 2000, https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf).
4. www.wikipedie.org
5. www.gartner.com
6. www.oracle.com
7. btrfs.wiki.kernel.org
