In this post, we are diving into one of the most crucial components of Splunk: the 'Indexer', where the magic of event indexing happens. If you recall the Data Input, Data Storage, and Data Search phases of the data pipeline, or the Event Processing Tiers of Splunk, we will be focusing on the Data Storage phase, where parsing and indexing of input data take place.
Throughout this article, we will answer fundamental questions about the indexing process, such as:
What exactly is an Index?
How does Splunk store indexes?
How do you create and delete indexes using the Web UI and CLI?
How do you add and remove data from indexes?
How do you take backups of indexes?
By the end of this article, you will have a solid grasp of Splunk indexes and how to manage them effectively to optimize your Splunk environment. So, let's get started on this exciting journey of exploring Splunk indexes!
In Splunk, an index is a repository for the data stored on an indexer. A Splunk instance can be configured to index local and remote data, which can then be searched through a search app. Indexes live under the $SPLUNK_HOME/var/lib/splunk directory by default.
Splunk comes with multiple preconfigured indexes. The "main" index is the first index you should know about, because it is the default index where Splunk stores incoming data. However, you have the flexibility to create and specify other indexes for different data inputs, allowing you to organize your data more effectively. Scroll down if you want to know how to create a custom index. We have covered both Web UI and CLI procedures.
A few more indexes come preconfigured along with the "main" index:
_internal: This index stores the logs resulting from the internal processing of the Splunk instance. For example, Splunk daemon logs are stored under _internal.
_audit: This index stores the audit trail logs and any other optional auditing information.
_introspection: This index is used for system performance tracking, such as Splunk resource usage and data parsing performance. It contains information related to CPU and memory usage.
_thefishbucket: This index contains checkpoint information for all the files that are monitored for data input. This is relevant for both forwarders and indexers.
main: The main index is the default index for data input and is located under the defaultdb directory.
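You can inspect any of these simply by searching the index by name. As a minimal illustration (sourcetype=splunkd is the sourcetype Splunk assigns to its daemon logs), the following returns the ten most recent internal events:

index=_internal sourcetype=splunkd | head 10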
You might wonder why we would want to create many indexes instead of storing all data in a single index. The answer lies in the following advantages:
Fast search retrieval: By segregating data types into different indexes, you can avoid putting an extra load on the Splunk instance to search through all the data when you only need to search for a specific data type, such as firewall logs.
Retention management: Based on the customer's requirements and data importance, you can set different retention periods for each index. For example, you may want to keep security logs for 12 months, while web logs might only be kept for 3 months.
Access control: You can control who has access to what data by setting permissions at the index level. For instance, you can grant the security group access to the security logs index while restricting access for other departments.
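To illustrate the access control point, index-level permissions are granted through roles. Here is a minimal, hypothetical authorize.conf sketch (the role name and index name are our examples; srchIndexesAllowed lists the indexes a role may search):

[role_security_team]
# Restrict this role to searching only the security logs index
srchIndexesAllowed = security_logs
# Search this index by default when no index is specified
srchIndexesDefault = security_logs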
Now that you know what indexes are, let's dive into how Splunk actually stores these indexes. When data is fed into Splunk, the indexer processes it and stores it in a series of directories and files, collectively known as an index.
Under the hood, each index is composed of a set of directories called "buckets." These buckets are organized by age, with the most recently indexed data stored in "hot" buckets, slightly older data in "warm" buckets, and the oldest data in "cold" buckets. This age-based organization helps Splunk efficiently manage the data lifecycle and optimize performance.
Here's a quick breakdown of the bucket types:
Hot Buckets: When new data arrives, it is first written to hot buckets. These buckets are always open for writing and are optimized for fast data ingestion. Hot buckets have the naming format hot_v1_<id>, where <id> is a local identifier for the bucket. Hot buckets are stored in the $SPLUNK_HOME/var/lib/splunk/[index_name]/db/ directory.
Warm Buckets: As hot buckets reach certain thresholds (e.g., maximum size, age, or number of buckets), data is rolled over to warm buckets. Warm buckets are read-only and are optimized for searching. The naming format for warm buckets is db_<newest_time>_<oldest_time>_<id>, where the two epoch timestamps represent the latest and earliest events within the bucket and <id> is a local identifier. Warm buckets share the same location as hot buckets in the directory hierarchy.
Cold Buckets: When warm buckets reach their respective thresholds, data is moved to cold buckets. Cold buckets are also read-only and are typically stored on slower, less expensive storage media. The naming format for cold buckets is the same as warm buckets, but they are stored in a separate directory called colddb. Path: $SPLUNK_HOME/var/lib/splunk/[index_name]/colddb
Thawed Buckets: When data needs to be restored from a frozen archive, it is moved back into thawed buckets, which live under $SPLUNK_HOME/var/lib/splunk/[index_name]/thaweddb.
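Putting the locations together, the on-disk layout of a single index looks roughly like this (a hypothetical listing for an index named security_logs; bucket names and epoch timestamps are illustrative):

$SPLUNK_HOME/var/lib/splunk/security_logs/
  db/                                # hot and warm buckets
    hot_v1_0/                        # hot bucket, still open for writing
    db_1714003200_1713916800_1/      # warm bucket (newest/oldest epoch times)
  colddb/                            # cold buckets
    db_1711324800_1711238400_0/
  thaweddb/                          # buckets restored from a frozen archive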
Within each bucket, Splunk stores the indexed data in a compressed format, along with associated index and metadata files that help facilitate fast searching and retrieval. These files include:
rawdata journal: Contains the raw event data in compressed form
.tsidx files: Contain time-series indexes that point to the raw data, enabling fast searching
.data files: Contain metadata summaries for the bucket, such as its hosts, sources, and source types
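For a concrete picture, listing the contents of a warm bucket typically shows something like the following (file names are illustrative and the exact contents vary by Splunk version):

$ ls db_1714003200_1713916800_1/
Hosts.data  Sources.data  SourceTypes.data    # metadata summaries
1714003200-1713916800-123456789.tsidx         # time-series index
rawdata/                                      # compressed raw events (journal.gz)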
By default, Splunk will automatically manage the movement of data through the bucket lifecycle based on configurable policies. However, you can also manually manage buckets through settings in the indexes.conf file.
Let's look at the key criteria and thresholds that influence data movement:
Hot to Warm: Data rolls from hot to warm buckets when the maximum size, age, or number of hot buckets is reached.
Warm to Cold: Data moves from warm to cold buckets when the maximum number of warm buckets is reached.
Cold to Frozen: Data transitions from cold to frozen buckets when the cold bucket storage limit is exceeded, or the retention period is reached.
The specific thresholds and criteria for data movement are defined using the following settings in indexes.conf:
maxHotBuckets: The maximum number of hot buckets.
maxHotSpanSecs: The maximum age of hot buckets in seconds.
maxHotIdleSecs: The maximum idle time for hot buckets before rolling to warm.
maxWarmDBCount: The maximum number of warm buckets before the oldest rolls to cold.
maxDataSize: The maximum size of a bucket.
maxTotalDataSizeMB: The maximum total size of an index, in megabytes.
frozenTimePeriodInSecs: The retention period for the entire index, in seconds.
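As a worked example, here is a hypothetical indexes.conf stanza that combines these settings (the index name and values are ours, chosen to illustrate a 90-day retention policy):

[web_logs]
homePath = $SPLUNK_DB/web_logs/db
coldPath = $SPLUNK_DB/web_logs/colddb
thawedPath = $SPLUNK_DB/web_logs/thaweddb
# Roll a hot bucket to warm at 750 MB or after a day of inactivity
maxDataSize = 750
maxHotIdleSecs = 86400
# Keep at most 300 warm buckets before the oldest rolls to cold
maxWarmDBCount = 300
# Cap the entire index at 100 GB
maxTotalDataSizeMB = 102400
# Freeze (archive or delete) events older than 90 days
frozenTimePeriodInSecs = 7776000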
Now that you have a solid understanding of what indexes are and how Splunk stores them, let's explore how you can create and delete indexes using the Splunk Web UI. The Web UI provides a user-friendly interface for managing indexes, making it easy for you to organize your data and keep your Splunk environment tidy.
1. Log in to your Splunk Web UI and navigate to "Settings" > "Indexes".
2. Click on the "New Index" button.
3. Enter a name for your new index (e.g., "security_logs").
(Optional) Specify the index type as "Events" or "Metrics". For most use cases, "Events" is the default selection.
(Optional) Set the "App" field to specify which app this index should be associated with. This helps with organizing and managing indexes.
(Optional) Configure the following settings based on your requirements:
"Max Size": The maximum size of the entire index.
"Max Hot Buckets": The maximum number of hot buckets.
"Max Warm Buckets": The maximum number of warm buckets.
"Max Cold Buckets": The maximum number of cold buckets.
"Retention Period": The time period for which data should be retained in the index.
4. (Optional) Specify the "Home Path", "Cold Path", and "Thawed Path" if you want to store the index data in custom locations.
5. Click on the "Save" button to create the new index.
Your new index is now created and ready to receive data. You can start forwarding data to this index by configuring data inputs or forwarders.
Note: If you don't see your index in the list, go to https://splunkhost:8000/en-US/debug/refresh to reload the configuration.
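To point a data input at the new index, name it in the input's configuration. A minimal, hypothetical inputs.conf monitor stanza (the monitored file path is an example) would look like this:

[monitor:///var/log/secure]
# Send events from this file to the new index
index = security_logs
sourcetype = linux_secure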
Log in to your Splunk Web UI and navigate to "Settings" > "Indexes".
Locate the index you want to delete from the list of indexes.
Click on the "Delete" button next to the index you want to remove.
A confirmation dialog will appear. Click "Delete" to confirm the action.
Please note that deleting an index will permanently remove all data associated with that index. Make sure you have backed up any important data before proceeding with the deletion.
While the Splunk Web UI provides a convenient way to manage indexes, sometimes you may need to work with indexes from the command line. Whether you're automating index management tasks or simply prefer the flexibility of the CLI, Splunk has got you covered. In this section, we'll walk through the steps to create and delete indexes using the Splunk CLI on Linux and Mac operating systems.
1. Connect to your Splunk instance using SSH or a terminal.
2. Navigate to the $SPLUNK_HOME/etc/apps/<app_name>/local/ directory where the indexes.conf file exists. By default, $SPLUNK_HOME is /opt/splunk on Linux and /Applications/splunk on Mac.
3. Navigate to the application under which you need to create the index. Applications are stored under $SPLUNK_HOME/etc/apps/. We are going to create an index inside the "search" app.
cd /Applications/splunk/etc/apps/search/local/
If you want to create an index under a new application, create a new directory under $SPLUNK_HOME/etc/apps/ and then create a local directory inside it.
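For example, creating a hypothetical app directory named my_indexes_app (the name is ours) takes a single command:

mkdir -p $SPLUNK_HOME/etc/apps/my_indexes_app/local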
Note: Read our article How to Create Apps and Add-Ons in Splunk for the details.
Edit the file named indexes.conf, or create a new one under $SPLUNK_HOME/etc/apps/<app_name>/local/, using a text editor (e.g., vi or nano):
The content of indexes.conf looks something like the picture above. Each stanza represents an index. We have two indexes, macbookpro and security_logo, in the indexes.conf file. Let's create another one.
[audit_logs]
coldPath = $SPLUNK_DB/audit_logs/colddb
homePath = $SPLUNK_DB/audit_logs/db
maxTotalDataSizeMB = 51200
thawedPath = $SPLUNK_DB/audit_logs/thaweddb
We created another index named "audit_logs" under the "search" application. Your new index is now created and ready to receive data. You can start forwarding data to this index by configuring data inputs or forwarders.
Restart Splunk or reload the configuration at https://splunkhost:8000/en-US/debug/refresh. If everything goes well, your index should be listed in the web console.
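Alternatively, the Splunk CLI can create an index without hand-editing indexes.conf and can list the indexes it knows about. Both commands below are standard Splunk CLI; the index name is our example:

# Create the index (Splunk writes the stanza for you)
./splunk add index audit_logs
# Verify that the index exists
./splunk list index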
Connect to your Splunk instance using SSH or a terminal.
Navigate to the $SPLUNK_HOME/bin directory.
Stop the Splunk instance: ./splunk stop
Navigate to the $SPLUNK_HOME/etc/apps/<app_name>/local/ directory where the indexes.conf file exists. By default, $SPLUNK_HOME is /opt/splunk on Linux and /Applications/splunk on Mac.
Remove the index stanza from the indexes.conf file.
Start the Splunk instance: ./splunk start
The index definition is now removed from your Splunk instance. Note, however, that removing the stanza alone leaves the index's data directories on disk; see the sketch below for purging the data too.
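If you need to erase the data as well, here is a sketch of the full teardown for a hypothetical index named audit_logs (the clean command is irreversible, so double-check the index name; it prompts for confirmation unless you pass the -f flag):

cd /opt/splunk/bin
./splunk stop
# Permanently erase all indexed events for this index
./splunk clean eventdata -index audit_logs
# Remove the [audit_logs] stanza from indexes.conf, then restart
./splunk start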
Splunk provides various methods to add and remove data from indexes. In this section, we will explore how to add data to an index using the Web UI and remove data from an index using the CLI.
Log in to your Splunk Web UI.
Navigate to "Settings" > "Add Data".
Choose the method you want to use to add data, such as "Upload", "Monitor", "Forward", or "TCP/UDP". For this example, let's select "Upload". We will cover the other methods in separate articles.
Click on "Select File" and choose the file you want to upload.
Configure the "Source Type" and "Host" fields according to your data.
In the "Index" dropdown, select the index where you want to store the uploaded data.
Click on the "Review" button to review your settings.
If everything looks correct, click on "Submit" to start the data upload.
Splunk will now process and index the uploaded data, making it available for searching and analysis.
1. Connect to your Splunk instance using SSH or a terminal.
2. To remove events from an index, run a search from the CLI and pipe it to the delete command (this requires the can_delete role, covered below). Note that delete marks matching events as unsearchable; it does not free disk space.
3. For example, to remove all events from the "my_index" index whose source matches "/path/to/file.log":
./splunk search 'index=my_index source="/path/to/file.log" | delete'
You can also use time-based searches to remove events within a specific time range. For example, this removes events from the "my_index" index that are older than 30 days:
./splunk search 'index=my_index earliest=0 latest=-30d@d | delete'
4. When the delete command completes, Splunk returns a summary of how many events were removed from the index.
5. To wipe an entire index and actually reclaim its disk space, use the clean eventdata sequence shown in the previous section while Splunk is stopped.
Please note that removing data from an index is a permanent action and cannot be undone. Make sure to carefully review the data being removed before confirming the operation.
Perform a search that matches the data you want to remove.
In the search results, click on the "Event Actions" dropdown and select "Delete Events".
Confirm the deletion by clicking "Delete" in the pop-up window.
In this demo, we have logs from two hosts: Linux & Mac. Let's delete the events of Linux by filtering the Linux logs.
This method is suitable for removing smaller subsets of data based on specific search criteria.
It's important to exercise caution when removing data from indexes, as it can impact search results and any associated reports or dashboards. Always ensure that you have appropriate backups and consider the implications of data removal before proceeding.
By default, no account, including admin, has permission to delete events. You must assign the "can_delete" role to a user before they can delete events from the search app.
Go to Settings -> Users -> Edit -> Assign Role -> can_delete
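Once the role is assigned, you can also delete events by piping a search to the delete command. For example, to delete the Linux host's events from the demo above (the host value is illustrative):

index="main" host="linux" | delete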
Taking regular backups of your Splunk indexes is crucial to ensure data protection and recoverability in case of system failures, data corruption, or accidental deletions. In this section, we will walk you through the step-by-step process of taking backups of your Splunk indexes.
Make note of the index names and their corresponding paths (e.g., $SPLUNK_DB/index_name/db). By default, $SPLUNK_DB is /opt/splunk/var/lib/splunk on Linux and /Applications/splunk/var/lib/splunk on Mac, so an index's data lives under paths like /opt/splunk/var/lib/splunk/[index_name]/db/.
1. Connect to your Splunk instance using SSH or a terminal.
2. Navigate to the $SPLUNK_HOME/bin directory.
3. Stop the Splunk instance: Stopping the Splunk instance ensures that no data is being actively written to the indexes during the backup process.
./splunk stop
1. Use the cp command to copy the index directories to a backup location. Replace index_name with the actual name of the index you want to back up and /backup/location/ with the path where you want to store the backup:
cp -R ../var/lib/splunk/index_name /backup/location/
2. Repeat the above step for each index you want to back up.
After completing the backup, start the Splunk instance: This will resume normal operation of your Splunk instance.
./splunk start
Navigate to the backup location where you copied the index directories.
Verify that the index directories and their contents are present and complete.
You can also compare the size and timestamp of the backed-up directories with the original index directories to ensure the backup was successful.
We recommend automating the backup process by scheduling it with cron or at, as sketched below.
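A simple sketch of that automation (the paths, index name, and schedule are all assumptions to adapt to your environment):

#!/bin/sh
# backup_splunk_index.sh -- cold backup of a single Splunk index
SPLUNK_HOME=/opt/splunk
INDEX=security_logs
DEST=/backup/location
"$SPLUNK_HOME/bin/splunk" stop
cp -R "$SPLUNK_HOME/var/lib/splunk/$INDEX" "$DEST/${INDEX}_$(date +%F)"
"$SPLUNK_HOME/bin/splunk" start

# Example cron entry: run every Sunday at 2 AM
# 0 2 * * 0 /usr/local/bin/backup_splunk_index.sh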
We hope this article helps you understand the fundamentals of Splunk indexes, including what they are, how they store data, and how to manage them effectively. We covered creating and deleting indexes using both the Web UI and CLI, adding and removing data from indexes, and taking backups to ensure data protection. By understanding and implementing these concepts, you can optimize your Splunk deployment, ensure data accessibility, and maintain a robust data management strategy. Happy Splunking!
That's all for this article; we will cover more about Splunk in upcoming articles. Please keep visiting thesecmaster.com for more such technical information. Visit our social media pages on Facebook, Instagram, LinkedIn, Twitter, Telegram, Tumblr, & Medium and subscribe to receive updates like this.