Semi-structured data: definition, types, examples
In recent years, new software and data analysis techniques have made it possible to gather major business insights from qualitative data (unstructured or semi-structured sources such as emails, websites, and customer service interactions) alongside quantitative, structured data such as statistics and spreadsheets.
With qualitative data, you can go beyond what happened and explore why it happened, using techniques such as opinion mining and topic analysis. Analyzing semi-structured data is quite manageable if you have the right processes in place.
Semi-structured data is data that is not organized in a relational database and has no strict structural framework, yet still has enough organizing properties to be categorized and processed. It has some structural properties, but its organizational framework is loose.
It includes text that is organized by topic or subject in a hierarchical language (such as a markup language), while the open-ended text itself has no structure.
Email is a typical example: fields such as Subject, Date, Recipient, and Sender are structured, while the message body is free text. Machine learning can further categorize messages into folders such as Inbox, Promotions, and Spam.
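To make the email case concrete, here is a minimal sketch that separates the structured header fields of a message from its unstructured body, using Python's standard `email` module. The message text, the addresses, and the helper name `split_email` are made up for illustration.

```python
from email import message_from_string

# Hypothetical raw message, invented for this example.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Date: Mon, 06 Jan 2025 09:30:00 +0000

Hi Bob, the report is attached. Let me know what you think!
"""


def split_email(raw):
    """Return (structured header fields, unstructured body text)."""
    msg = message_from_string(raw)
    headers = dict(msg.items())   # named fields: the structured part
    body = msg.get_payload()      # free text: the unstructured part
    return headers, body


headers, body = split_email(RAW_EMAIL)
print(headers["Subject"])
print(body.strip())
```

The header fields can be queried by name like columns in a table, while the body would still need text analysis to yield anything useful.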
Meaning
Semi-structured data sits between structured and unstructured data and combines characteristics of both. It follows some consistent conventions and a loose schema, which make it easier to interpret. XML, JSON, and CSV documents are semi-structured documents. Traditional SQL databases, built around fixed table schemas, are generally not the first choice for handling semi-structured data.
Some devices generate structured, unstructured, and semi-structured data. Structured data can be processed and managed easily because of its well-defined structure, whereas unstructured and semi-structured data need data analytics tools for processing and management. IoT devices are among these devices, and the data on their networks is in motion, or in transit. For instance, email and web browsing both move files across a network.
From an industry perspective, data in motion passes through devices and may be filtered and processed by another device on the same network, or it may be sent to a data center. Data sent to the data center can be processed by real-time data analysis software, and the response is returned to the originating device.
Let’s also look at the nature of semi-structured data. It is organized into semantic entities, and similar entities are grouped together. However, entities in the same group need not have the same attributes; the order of the attributes does not matter, and some attributes may be missing altogether. Even the type and size of the same attribute may differ within a group.
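These properties are easy to see in a small JSON example. The sketch below uses invented records and a hypothetical helper, `attribute_names`, to show entities of one group with different attributes, a different attribute order, and differing attribute types.

```python
import json

# Three "person" entities in the same group (made-up data).
RAW = """
[
  {"name": "Jessica", "email": "jessica@example.com", "age": 31},
  {"name": "Lucy", "phone": ["555-0100", "555-0101"]},
  {"age": "unknown", "name": "Sam"}
]
"""

people = json.loads(RAW)


def attribute_names(records):
    """Collect every attribute name that appears in at least one record."""
    names = set()
    for record in records:
        names.update(record.keys())
    return sorted(names)


all_attributes = attribute_names(people)

# A missing attribute is simply absent; .get() supplies a default.
emails = [person.get("email") for person in people]

# The same attribute can hold different types in different records.
age_types = [type(person["age"]).__name__ for person in people if "age" in person]

print(all_attributes)
print(emails)
print(age_types)
```

No record has all four attributes, and `age` is a number in one record and a string in another, yet all three still parse as members of the same collection.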
Information can be extracted from semi-structured data in different ways. For indexing, the Object Exchange Model (OEM) and other graph-based models are commonly used. OEM lets the data be stored as a graph, which makes it easier to search and index.
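As a rough illustration of the graph idea, the sketch below stores data as labeled nodes and builds a simple label index over them. It is a simplified, tree-shaped stand-in for an OEM-style graph, not a full OEM implementation (real OEM graphs can share nodes and contain cycles); the `Node` class, the function names, and the data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One object in a simplified OEM-style graph (hypothetical class)."""
    label: str
    value: object = None                       # atomic value; None for complex objects
    children: list = field(default_factory=list)


def find_by_label(root, label):
    """Depth-first search returning every node whose label matches."""
    matches, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.label == label:
            matches.append(node)
        stack.extend(reversed(node.children))
    return matches


def build_index(root):
    """Map each label to the atomic values stored under it."""
    index, stack = {}, [root]
    while stack:
        node = stack.pop()
        if node.value is not None:
            index.setdefault(node.label, []).append(node.value)
        stack.extend(reversed(node.children))
    return index


# Two "restaurant" objects with different attributes (made-up data).
guide = Node("guide", children=[
    Node("restaurant", children=[Node("name", "Cafe A"), Node("zipcode", 92310)]),
    Node("restaurant", children=[Node("name", "Cafe B"), Node("price", "moderate")]),
])

restaurants = find_by_label(guide, "restaurant")
index = build_index(guide)
print(len(restaurants))
print(index["name"])
```

Because every value hangs off a label, a search by label or a label-to-value index works even though the two restaurants do not share the same attributes.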
Apart from these models, another option is XML, which enables the creation of hierarchies and facilitates searching and indexing. Moreover, data mining tools can be used to extract information from semi-structured data.
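The XML option can be sketched with Python's standard `xml.etree.ElementTree` module. The document below is invented for illustration; it shows a hierarchy in which one book has several authors and another has none, and how the hierarchy can still be searched by element name.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second book has no <author> elements at all.
XML_DOC = """
<library>
  <book id="b1">
    <title>Book A</title>
    <author>Author One</author>
    <author>Author Two</author>
  </book>
  <book id="b2">
    <title>Book B</title>
  </book>
</library>
"""

root = ET.fromstring(XML_DOC.strip())

# Search the hierarchy by element name.
titles = [book.findtext("title") for book in root.findall("book")]

# Build a small index from book id to its authors (possibly an empty list).
authors_by_id = {
    book.get("id"): [author.text for author in book.findall("author")]
    for book in root.iter("book")
}

print(titles)
print(authors_by_id)
```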
Handled properly, semi-structured data is not difficult to use, because it provides a means to integrate data exchanged between different systems and sources. Consider web forms: you may want to modify a form or capture different data for different users.
With semi-structured data, you can do this without changing the database schema or the code, so adding or removing data has no effect on dependencies or functionality.
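The web form scenario can be sketched as follows, assuming form submissions arrive as JSON objects. The field names and the `summarize` function are hypothetical; the point is that code written for the original form keeps working when fields are added or left out.

```python
import json


def summarize(submission):
    """Code written against the original form: it reads only name and email."""
    name = submission.get("name", "anonymous")
    email = submission.get("email", "no email")
    return f"{name} <{email}>"


# Original form, a later form with extra fields, and a partial submission.
old_form = json.loads('{"name": "Jessica", "email": "jessica@example.com"}')
new_form = json.loads(
    '{"name": "Lucy", "email": "lucy@example.com", "newsletter": true, "company": "Acme"}'
)
partial_form = json.loads('{"name": "Sam"}')

summaries = [summarize(form) for form in (old_form, new_form, partial_form)]
for line in summaries:
    print(line)
```

The extra `newsletter` and `company` fields are ignored by the existing code, and the missing `email` falls back to a default, so no schema migration is needed in this sketch.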
Types
Your next question may be how semi-structured data is created and what types it comes in. Sources of semi-structured data include XML and other markup languages, binary executables, zipped files, web pages, and data integrated from different sources.
The volume of semi-structured data is increasing with the growth of the web. Another driver is the need for a flexible format for exchanging data between dissimilar databases. Moreover, content that mixes text with structural data, such as attributes and annotations, also produces this kind of data.
Semi-structured data can be used wherever no predefined schema is required. Where a schema does exist, it may be partial, very large, descriptive, or evolving.
Example
Semi-structured data comes in a variety of formats. Some have an advanced hierarchical construction, while others are barely structured.
Some examples of semi-structured data are:
HTML
HTML, or HyperText Markup Language, is a hierarchical language similar to XML; but while HTML is used to display data, XML is used to transmit it. The web pages mentioned above are created using HTML. The semi-structure of HTML tells the browser how to display images and text on the screen, although the text and images themselves are unstructured.
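To illustrate the split between HTML's structure and its unstructured content, the sketch below uses Python's standard `html.parser` module to pull the tag hierarchy and the free text out of a small page. The page and the `OutlineParser` class are made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they do not open a new level.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class OutlineParser(HTMLParser):
    """Record the tag hierarchy (structure) separately from the text (content)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []   # (depth, tag) pairs: the structured part
        self.text = []      # free text fragments: the unstructured part

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


PAGE = (
    "<html><body><h1>Welcome</h1>"
    "<p>Some <b>free-form</b> text.</p>"
    '<img src="cat.png" alt="A cat"></body></html>'
)

parser = OutlineParser()
parser.feed(PAGE)

for depth, tag in parser.outline:
    print("  " * depth + tag)
print(parser.text)
```

The outline is regular enough to navigate programmatically, while the collected text is ordinary prose with no structure of its own.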
Electronic Data Interchange
Electronic Data Interchange, or EDI, is the electronic transmission of business documents from computer to computer. These are documents that were previously sent on paper, such as invoices, purchase orders, and inventory documents. EDI uses several standard formats, among them EDIFACT, ANSI X12, ebXML, and TRADACOMS, so businesses that communicate with each other must use the same format. EDI is also beneficial because it transmits documents quickly and at lower cost.
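As a rough sketch of why both parties need the same format, the example below splits a simplified, X12-style purchase order fragment into segments and elements. It assumes `*` as the element separator and `~` as the segment terminator; real interchanges declare their delimiters in an envelope header and carry additional envelope segments, which are omitted here. The message content and the `parse_segments` helper are invented for illustration.

```python
# Simplified X12-style fragment (made-up data, envelope segments omitted).
MESSAGE = (
    "ST*850*0001~"
    "BEG*00*SA*PO12345**20250106~"
    "PO1*1*10*EA*9.95**VP*WIDGET-1~"
    "SE*4*0001~"
)


def parse_segments(message, element_sep="*", segment_term="~"):
    """Split a message into (segment tag, list of elements) pairs."""
    segments = []
    for raw in message.split(segment_term):
        raw = raw.strip()
        if raw:
            tag, *elements = raw.split(element_sep)
            segments.append((tag, elements))
    return segments


segments = parse_segments(MESSAGE)
for tag, elements in segments:
    print(tag, elements)

# Elements are identified by position, so sender and receiver must agree
# on what each position means. Here the third element of BEG is assumed
# to be the purchase order number.
po_number = dict(segments)["BEG"][2]
print(po_number)
```

Unlike JSON or XML, the fields carry no names of their own; their meaning comes entirely from the agreed standard, which is why a shared format is essential.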
XML, CSV, and JSON
XML, CSV, and JSON are three common formats for transmitting data, for example from a web server to a client.
XML stands for ‘Extensible Markup Language’ and is designed to communicate data in a hierarchical structure.
CSV stands for ‘Comma Separated Values’: values are separated by commas, such as Jessica, Lucy. Spreadsheet programs such as Excel can open and save data in this format.
JSON stands for ‘JavaScript Object Notation’. It was introduced in 2001 as an alternative to XML: it communicates data in a similar way but is usually smaller in size.
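To compare the three formats side by side, the sketch below serializes the same two made-up records as JSON, CSV, and XML using only Python's standard library. The record fields and the helper names are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same two records (invented data) in all three formats.
people = [
    {"name": "Jessica", "city": "Paris"},
    {"name": "Lucy", "city": "Rome"},
]


def to_json(records):
    return json.dumps(records)


def to_csv(records):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "city"])
    writer.writeheader()          # column names appear once, in the first row
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(records):
    root = ET.Element("people")
    for record in records:
        person = ET.SubElement(root, "person")
        for key, value in record.items():
            ET.SubElement(person, key).text = value   # one element per field
    return ET.tostring(root, encoding="unicode")


json_text = to_json(people)
csv_text = to_csv(people)
xml_text = to_xml(people)

for label, text in (("JSON", json_text), ("CSV", csv_text), ("XML", xml_text)):
    print(label, len(text), "characters")
    print(text)
```

For these records the JSON text comes out shorter than the XML text, because XML repeats each field name in both an opening and a closing tag. CSV is the most compact of the three here, but it cannot express nesting.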