Semi-structured data in big data?

Data is becoming not only easier to access but also easier for systems to interpret. Much of big data is unpredictable: it includes unruly material such as video, text and images from the web, as well as streams of sensor data. Computing tools developed in the internet era are increasingly able to extract knowledge and insight from these unpredictable sources. Artificial intelligence can be applied in many fields; Google's search and advertising businesses, for example, both rely on a large number of artificial intelligence techniques. This wealth of new data has in turn driven new computing technologies such as machine learning algorithms, big data analytics, cloud computing and business intelligence.

The volume of data being stored and used is growing exponentially, but sheer volume will soon stop being what matters. Companies are looking for more relevant and accurate data that gives them a competitive advantage, so making good use of data is the key capability they seek. With it they can reduce operational costs, track their current situation and create new metrics for the future, understand their customers better, and find new products, offerings and opportunities. Before going into data analysis in more detail, it is important to understand the different kinds of data that reach a business organisation. Data falls into three broad categories: structured data, unstructured data and semi-structured data. Each serves a different purpose for the organisation. Let's elaborate on the meaning and importance of semi-structured data.

Semi-structured data:

Semi-structured data is the third category of data, after structured and unstructured data, and is effectively a mixture of the two. Data in this category has some recognisable organisational properties but is not as strict and rigid as data in a relational database. It is a form of structured data that does not follow the tabular structure of relational databases or other data tables. Instead, it contains tags and other markers that separate semantic elements and fields within the data. For this reason semi-structured data is sometimes described as self-describing.

In semi-structured data, entities that belong to the same class may differ in their attributes even though they are grouped together. The amount of semi-structured data keeps growing, because full-text documents and databases alone are not enough to hold every kind of data. Semi-structured formats help programmers persist objects from their applications to a database. They support nested data, which simplifies data models that represent complex relationships between entities. They also support lists of objects, which avoids translating each list into separate tables of a relational database. The clear difference between semi-structured data and the other two categories is the presence of semantic tags or metadata.
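As a minimal sketch of these properties, the Python snippet below loads two records of the same class whose attributes differ and which contain nested data and lists. The customer records, field names and the helper attribute_names are invented for illustration only.

```python
import json

# Two hypothetical "customer" records: same class, different attributes.
raw = """
[
  {"name": "Asha", "email": "asha@example.com",
   "orders": [{"id": 1, "items": ["pen", "ink"]}]},
  {"name": "Ben", "phone": "555-0100",
   "address": {"city": "Pune", "zip": "411001"}}
]
"""

customers = json.loads(raw)


def attribute_names(record):
    """Return the sorted top-level keys, i.e. the tags the record carries."""
    return sorted(record.keys())


for customer in customers:
    # Each record describes itself: the keys travel with the values.
    print(customer["name"], "->", attribute_names(customer))

# Nested objects and lists are reached by following keys and indexes.
first_order_items = customers[0]["orders"][0]["items"]
print(first_order_items)
```

Because every record carries its own keys, a program can inspect what a record contains without consulting a separate table definition.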

E-mail messages are a classic example of semi-structured data. The body of an e-mail is unstructured, but the message also carries structured fields such as the sender's name and e-mail address, the time sent and other details. Another example is a digital photograph. The image itself is unstructured, but the file carries structured details such as the device ID, geotag, date and timestamp. Once stored, a photo can also be given descriptive tags such as "cat" or "pet".
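The e-mail case can be sketched with Python's standard email package. The message text below is made up for illustration; the point is only that the headers parse into named fields while the body stays free text.

```python
from email import message_from_string

# A hypothetical message: headers first, a blank line, then the body.
raw_message = """\
From: Alice <alice@example.com>
To: Bob <bob@example.com>
Subject: Lunch on Friday
Date: Fri, 01 Mar 2024 09:30:00 +0000

Hi Bob, are you free for lunch on Friday? Let me know.
"""

message = message_from_string(raw_message)

# Structured part: header fields addressed by name.
structured = {
    "from": message["From"],
    "to": message["To"],
    "subject": message["Subject"],
    "date": message["Date"],
}

# Unstructured part: the body is just text with no internal fields.
unstructured_body = message.get_payload().strip()

print(structured)
print(unstructured_body)
```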

Types of semi-structured data:

There are two common formats for semi-structured data.

  1. JSON
  2. XML

JSON and XML are text-based file formats for representing data, and they are regarded as the standard formats for semi-structured data. XML has been in use since the late 1990s. HTML is a closely related markup language, but it is not itself XML: HTML has a fixed set of tags for describing web pages, whereas XML lets authors define their own tags. XHTML is the variant of HTML that is written as well-formed XML.

XML is typically used for document-like structures, whereas JSON represents a tree in which each node is a key-value pair. The key labels a branch of the tree, and the value it holds may itself be another tree. Conceptually the two formats are close, since both describe nested, labelled data; the differences lie mainly in syntax.
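To make the comparison concrete, here is a small sketch that encodes the same record once as JSON and once as XML and reads both with Python's standard library. The record layout and the element and attribute names are illustrative choices, not a standard mapping between the two formats.

```python
import json
import xml.etree.ElementTree as ET

# The same illustrative record in both formats.
json_text = '{"book": {"title": "Dune", "year": 1965, "tags": ["sci-fi", "classic"]}}'
xml_text = """
<book year="1965">
  <title>Dune</title>
  <tags><tag>sci-fi</tag><tag>classic</tag></tags>
</book>
""".strip()

book_json = json.loads(json_text)["book"]
book_xml = ET.fromstring(xml_text)

# JSON: values keep their types (string, number, list).
json_title = book_json["title"]
json_year = book_json["year"]
json_tags = book_json["tags"]

# XML: element text and attributes are strings, so numbers need converting.
xml_title = book_xml.find("title").text
xml_year = int(book_xml.get("year"))
xml_tags = [tag.text for tag in book_xml.iter("tag")]

print(json_title, json_year, json_tags)
print(xml_title, xml_year, xml_tags)
```

Both versions yield the same values; what differs is the syntax and the fact that XML delivers everything as text.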

XML and JSON are considered semi-structured formats because both represent data as a tree, that is, in a hierarchical structure. The Document Object Model (DOM) is the most common tree representation for HTML and XML documents, and a parsed JSON document forms a comparable tree of objects and arrays. In the database context these trees are treated as documents, which is why JSON and XML are the standard examples of semi-structured data. Without an accompanying schema, however, a JSON or XML document does not declare in advance which labels or substructures it may contain. This is the most common reason people sometimes regard JSON and XML files as unstructured data.
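The tree view can be sketched as a recursive walk over a parsed JSON document. With no schema available, the walk discovers the labels as it goes. The function name walk, the path notation and the sample document are assumptions made for this example.

```python
import json


def walk(node, path=""):
    """Yield (path, leaf_value) pairs for every leaf in a parsed JSON tree."""
    if isinstance(node, dict):
        # Object: each key labels a branch.
        for key, child in node.items():
            yield from walk(child, f"{path}/{key}")
    elif isinstance(node, list):
        # Array: branches are labelled by position.
        for index, child in enumerate(node):
            yield from walk(child, f"{path}/{index}")
    else:
        # Anything else (string, number, boolean, null) is a leaf.
        yield path, node


document = json.loads(
    '{"sensor": {"id": "s1", "readings": [20.5, 21.0]}, "ok": true}'
)
leaves = list(walk(document))

for leaf_path, value in leaves:
    print(leaf_path, "=", value)
```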

Examples of semi-structured data

It would be unfair to say that semi-structured data never fits a data model or schema. Here are some classic examples of human-generated semi-structured data.

Images and videos on YouTube, Pinterest, Instagram and other photo-sharing websites.

Some classical examples of machine-generated semi-structured data are:

  • Digital surveillance: images and video footage from surveillance systems, along with oil and gas exploration data, spatial imagery, etc.
  • Sensor data: readings on traffic, weather, seismography, oceanography, etc.
  • Satellite imagery: satellite images used for safety and defence purposes.

Common Characteristics of semi-structured data:

  • Data has some structure but does not follow a rigid, fixed pattern.
  • Similar entities are grouped together and organised in a tree form.
  • Entities in the same group may or may not contain similar properties or attributes.
  • The metadata may not be sufficient for automation and management purposes.
  • The size and type of the same attribute may vary within a group.
  • Data may not be stored in rows and columns as in relational databases.
  • Semi-structured data contains tags and metadata that are used to group and describe the data.
  • It lacks a well-defined structure and therefore cannot be processed by computer programs as easily as structured data.

Sources of semi-structured data

  1. E-mails
  2. XML and other markup languages
  3. Binary executables
  4. TCP/IP packets
  5. Zipped files
  6. Integration of data from different sources
  7. Web pages

Advantages of semi-structured data

  • Working with it does not require a background in SQL.
  • It can deal easily with the heterogeneity of sources.
  • The schema can be changed easily without causing problems.
  • Data is easily portable.
  • It is sometimes possible to view structured data as semi-structured data.
  • The data is not constrained by a fixed schema.
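Two of these points, viewing structured data as semi-structured and changing the schema freely, can be sketched as follows. The column names, rows, the loyalty_tier field and the helper to_json_lines are hypothetical, chosen only to illustrate the idea.

```python
import json

# A hypothetical relational-style table: fixed columns, one tuple per row.
columns = ("id", "name", "city")
rows = [(1, "Asha", "Pune"), (2, "Ben", "Leeds")]

# View each structured row as a self-describing record.
records = [dict(zip(columns, row)) for row in rows]

# Add a new attribute to one record only; the other record is unaffected
# and no table-wide schema change is involved.
records[1]["loyalty_tier"] = "gold"


def to_json_lines(items):
    """Serialise records as one JSON object per line, a portable text form."""
    return "\n".join(json.dumps(item, sort_keys=True) for item in items)


serialized = to_json_lines(records)
restored = [json.loads(line) for line in serialized.splitlines()]

print(serialized)
```

The round trip through plain text is what makes the data portable: any system that reads JSON can rebuild the same records.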

Disadvantages of semi-structured data

  • It is harder to store because there is no fixed, rigid schema.
  • Interpretation of data is difficult because there is no separation between data and schema.
  • Queries are less efficient than queries on structured data.