Long before the invention of modern technology, data was written down in books or memorized. As the amount of data grew, people gradually began using computers to store and process information. With the rise of technology and the internet, the quantity of data has increased exponentially, which led to the concept of big data: the analysis and processing of very large data sets.
In this article, we will discuss what big data is, its characteristics, and the future scope of big data.
When we realized that data could no longer be contained in books or physical copies, we started storing information on computers. But as data production grew, it could no longer be held in ordinary systems, so we needed ways to store the data, process it, and retrieve it on demand. Around the 1960s and 70s, the RDBMS (Relational Database Management System) was introduced, in which data could be stored, processed, and easily retrieved as needed.
In the early days, data was simple: it could be organized into columns and managed by an RDBMS. But as new types of data such as audio and video emerged, these files could not be stored in an RDBMS or other traditional tools. To store huge amounts of such complex data, the concept of big data was introduced.
Around 2005, the amount of data produced by Facebook, YouTube, and other online platforms was enormous. This led to the creation of Hadoop, an open-source framework for storing and analyzing big data sets. Around the same time, NoSQL databases also began to gain popularity.
In later years, the quantity of data skyrocketed, and open-source frameworks like Hadoop and, more recently, Spark made these gigantic data sets far simpler to work with and cheaper to store, with very little human involvement.
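Frameworks like Hadoop are built around the MapReduce model: a map step emits key-value pairs, the framework groups them by key, and a reduce step aggregates each group. As a rough single-machine illustration in plain Python (the function names and sample lines are invented for the example, not Hadoop's actual API), a word count looks like this:

```python
from collections import defaultdict

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: group pairs by key and sum the values for each key."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Invented sample input standing in for a huge distributed data set.
logs = ["big data needs big storage", "data grows fast"]
word_counts = reduce_phase(map_phase(logs))
print(word_counts["big"])   # -> 2
print(word_counts["data"])  # -> 2
```

In a real cluster the map and reduce steps run in parallel across many machines, which is what lets this simple model scale to gigantic data sets.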
Big data is data that contains a great variety of complex data sets from multiple sources. But what decides whether data can be defined as big data or not? For this, we follow the 5V rule.
Volume stands for the quantity or size of data that needs to be stored. The data being generated every second already measures in terabytes and petabytes, and as the amount of data grows in the coming years, even larger units will be needed.
Variety covers three types of data: structured, semi-structured, and unstructured. Structured data comes from sources like product databases, contact lists, and Excel sheets. Semi-structured data comes from sources like HTML code, graphs and tables, and emails. Unstructured data comes from sources such as social media, voice files, and IoT devices.
Velocity is the speed at which data is generated, processed, and collected, and at which it changes. Velocity matters because when an issue is found or an immediate requirement arises, the system should be able to act on the data quickly.
Value is the business value, or usefulness, of the data.
Veracity describes the bias, noise, and abnormality in data; incomplete data, errors, outliers, and missing values are examples. It deals with the consistency and quality of data.
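As a small illustration of a veracity check, the sketch below scans records for missing values and implausible outliers before the data is trusted for analysis (the field names, sample records, and age threshold are all invented for the example):

```python
# Invented sample records with deliberate quality problems.
records = [
    {"user": "alice", "age": 34},
    {"user": "bob",   "age": None},   # missing value
    {"user": "carol", "age": 420},    # implausible outlier
]

# Missing values: the field exists but holds no data.
missing = [r for r in records if r["age"] is None]

# Outliers: values outside a plausible range (0-120 assumed here).
outliers = [r for r in records if r["age"] is not None
            and not (0 < r["age"] < 120)]

print(len(missing), len(outliers))  # -> 1 1
```

Checks like these are typically automated in a data pipeline, since at big-data scale no one can inspect records by hand.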
There are mainly three types of big data: structured, semi-structured, and unstructured.
Structured data:
• Here, the data is precise, factual, and highly organized.
• It can be displayed in rows and columns.
• Values would be numbers, strings, or dates.
• Less storage is required.
• Easier to manage and protect.
• Analysis is quantitative.
• Storage and searching are easier (e.g., MySQL uses SQL commands to insert, select, delete, etc.)
Structured data sources include OLTP systems, spreadsheets (Excel sheets), online forms, sensors such as GPS or RFID tags, SQL databases, and network and web server logs.
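Because structured data fits into rows and columns, plain SQL is enough to store and query it. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL; the table and values are invented for the demo:

```python
import sqlite3

# An in-memory database standing in for a real MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# INSERT adds structured rows; every row has the same columns and types.
conn.execute("INSERT INTO products (name, price) VALUES (?, ?)", ("keyboard", 49.99))
conn.execute("INSERT INTO products (name, price) VALUES (?, ?)", ("mouse", 19.99))

# SELECT retrieves exactly the rows matching a structured condition.
rows = conn.execute("SELECT name FROM products WHERE price < 30").fetchall()
print(rows)  # -> [('mouse',)]
conn.close()
```

The fixed schema is what makes storage compact and searching fast, which is exactly what breaks down once the data no longer fits into rows and columns.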
Semi-structured data:
• Loosely organized into categories.
• A typical example is email, which is organized into folders such as inbox, sent, and drafts; spreadsheets are another example.
• Stored in tagged text formats rather than in the fixed tables of a relational database.
• Abstracts and figures are often used to represent semi-structured data.
Semi-structured data sources include CSV files (comma-separated values: data saved as values separated by commas), HTML, and NoSQL databases.
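A CSV file shows why this category sits in the middle: the commas and header row give it some structure, but there is no enforced schema or typing. A minimal sketch in Python (the sample data is invented):

```python
import csv
import io

# An invented CSV document: a header row names the fields,
# commas separate the values -- structure, but no enforced schema.
raw = "name,city\nAlice,Paris\nBob,Berlin\n"

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)
print(rows[0]["city"])  # -> Paris

# Every value comes back as a plain string; the "types" are not part of
# the format, which is what makes it semi- rather than fully structured.
print(type(rows[0]["city"]) is str)  # -> True
```

Tagged formats like HTML or JSON behave the same way: the tags label the pieces, but nothing enforces what those pieces must contain.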
Unstructured data:
• The data has no predefined structure and can take any form.
• It cannot be displayed in rows and columns or in relational databases.
• Examples include images, audio, video, emails, and word-processing files.
• More space is needed for storage.
• More protective measures need to be taken to store and secure the data.
• Here, the analysis is qualitative.
• Complicated measures are needed to store and process the data (e.g., Hadoop, an open-source framework for storing and processing big data sets).
• Data lakes, which support various data types, are used for storage.
Unstructured data sources include audio files, social media, emails, images, video files, etc.
The main challenges faced by big data concern the storage, quality, validation, and collection of data:
• Security issues: a system can malfunction as a result of information theft, DDoS attacks, ransomware, or other malicious activity, online or offline.
• Ethical issues: data ownership and privacy need to be ensured.
• Storage: big data is difficult to store, especially because so much of it is unstructured.
• Collection of data: data needs to be transported efficiently to the storage system without alteration.
• Unintentional misuse.
Big data is an ever-progressing domain whose importance will only grow as more and more data is created and used. Unstructured data production in particular is increasing enormously, and social media platforms such as Facebook (Meta) depend on big data to store their massive data sets.
Big data is involved in almost every vertical, including retail, banking, e-commerce, healthcare, and telecom. This data not only needs to be stored; it also needs to be processed, and quick retrieval must be possible whenever required.
The importance of big data will increase day by day, and it is important to find newer methods to collect and process data. Demand will keep rising, making it one of the most important domains of our time.
I hope this article helped you understand what big data is, its characteristics, and its future scope.
Aroma is a cybersecurity professional with more than four years of experience in the industry. She has a strong background in detecting and defending cyber-attacks and possesses multiple global certifications like eCTHPv2, CEH, and CTIA. She is a pet lover and, in her free time, enjoys spending time with her cat, cooking, and traveling. You can connect with her on LinkedIn.