
April 22, 2020 | Chas Clawson

Best Practices for Data Tagging, Data Classification & Data Enrichment

Overview

Introduction: Data Classification and Metadata Tagging

Data classification can be broadly defined as the process of organizing and tagging data by categories so that collected data may be used and protected in the most efficient way possible. Sumo Logic is an analytics platform that can ingest almost any type of machine data. This data can be structured or unstructured, come in the form of event logs and messages or as time-series metrics. Once the data has been ingested into the platform, it can be leveraged for a wide variety of use cases. As such, different teams can use real-time search, visualizations and alerting to address different challenges.

An operations team may use event and metric data to monitor and troubleshoot issues and outages, monitor end-to-end service levels and detect system anomalies. CPU, memory usage, system errors, and network traffic can all be cross-correlated for quick root cause analysis without having to log into individual systems or tools. With efficient data collection and data tagging, searching and correlation across these diverse datasets can reduce outages from days to hours or minutes.
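With shared tags in place, one query can follow an error across every host in a service. As a sketch (the Source Category and service name here are hypothetical, not a prescribed convention):

_sourceCategory=Prod/MyService/* error
| timeslice 15m
| count by _timeslice, _sourceHost

A spike in one host's time bucket points the investigation at that system without ever logging into it.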

A development team may leverage application and infrastructure logs while designing, building, testing and deploying new features that will be delivered to customers or web-based applications hosted in the cloud. With Sumo Logic, front-end developers can access real-time business analytics to assess the impact of GUI changes on user behavior. Back-end developers can monitor the latency, volume and overall performance of requests to the application's back end, ensuring that code is optimized to deliver the best user experience. Proper data tagging, such as defining which stage of the application lifecycle a system belongs to (dev, QA, test or prod), helps developers and operations teams track changes easily.

A security team can use Sumo Logic as a cloud security intelligence platform to improve the organization's security posture, risk management and threat hunting capabilities. Its ideal use cases include compliance, security, and configuration for modern cloud architectures. Mean-time-to-response (MTTR) and mean-time-to-detection (MTTD) of security incidents can be significantly reduced when leveraging solutions that provide threat detection and incident response for modern IT environments such as hybrid, multi-cloud, and microservices. Proper normalization and tagging of data, either at collection time (through metadata tags and Field Extraction Rules) or at search time using Sumo Logic's query language, can greatly improve the SecOps team's ability to correlate and alert on security events of interest.
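As a simple illustration, a search-time normalization over firewall logs might look like the following sketch (the Source Category, log format and field name are illustrative assumptions, not a prescribed schema):

_sourceCategory=Networking/Firewall/*
| parse "action=* " as action
| where action = "DENY"
| count by _sourceHost

The same scoped query could then back a scheduled search that alerts the SecOps team when deny counts cross a threshold.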

Regardless of the primary use of Sumo Logic, when developing your data collection strategy, it’s important to properly classify, tag and store this data based on clearly defined requirements, policies and business objectives. This is generally known as data classification and data enrichment. Proper classification can improve data security, reduce storage and backup costs and significantly speed searching through large data stores. Implementing a data classification or enrichment strategy in Sumo Logic is best done by applying metadata tags to fields at the time of collection as data is brought into the platform. These metadata fields with their tag values are then used to determine who can access the data, how long it should be retained and how it can most efficiently be retrieved through search queries and dashboards. In other words, within the Sumo Logic platform, there is often a need to limit the “Scope” of particular rules and policies. This scope is defined through metadata fields such as Source Category. This document covers some considerations and best practices of data classification and tagging as it pertains to Sumo Logic.

Field Names

Historically, customers leveraged the reserved "Source Category" field exclusively. The current platform also allows the creation of custom metadata fields as shown below. Customers can choose to continue using Source Category, adopt a completely custom field approach, or take the recommended hybrid model, in which a mix of both is used: the more general categorization is done with Source Categories, and custom tags carry detailed information on a per-use-case basis.

Reserved Field Names

The following are built-in metadata fields applied to data at the time of ingest:

Most commonly used fields:

  • Collector (_collector): The name of the Collector entered at activation time.
  • Source (_source): The name of the Source entered when the Source was created.
  • Source Category (_sourceCategory): A completely open tag determined by your entry in the Category field when you configure a Source. Maximum of 1,024 characters.
  • Source Host (_sourceHost): For Remote and Syslog Sources, a fixed value determined by the hostname you enter in the Hostname field (your actual system values for hosts). For a Local File Source, you can overwrite the host system value with a new value of your choice. Maximum of 128 characters.
  • Source Name (_sourceName): A fixed value determined by the path you enter in the "File" field when configuring a Source. This metadata tag cannot be changed.

Other built-in or generated fields:

  • Message Count (_messageCount): A sequence number (per Source) added by the Collector when the message was received.
  • Message Time (_messageTime)
  • Count (_count)
  • Approximate Count (_approxcount)
  • Raw Message (_raw): The raw log message.
  • Receipt Time (_receiptTime): The time the Collector received the message, in milliseconds.
  • Size (_size): The size of the log message in bytes.
  • _asocfoward: Used to tag events that should be forwarded to Cloud SIEM Enterprise.
  • _siemmessage: Used for extracting data from raw messages into a cleaner format before passing to Cloud SIEM Enterprise.
  • Format (_format): The pattern used for parsing the timestamp.
  • Time Slice (_timeslice)

In the Messages tab, each message displays its metadata tags.

Custom Field Names

Customers may also create their own key-value metadata fields. These custom fields are automatically extracted and made available for searching, querying and graphing, which lets you view results for intuitively defined subsets of data that are not traditionally tagged as source categories.

Both the custom field (key) and its value are defined at the collector or source level. By default, an org can create up to 200 custom fields; this limit can be extended by working with Sumo Logic Support.

As an example, data may be tagged based on one or more of the following categories:

  • Development Stage: Dev → Test → Staging → Production, etc.
  • Application Source: Apache, CloudTrail, Windows OS, etc.
  • Application Message Type: Firewall, WebApp, Proxy, etc.
  • Geographic Location / Region: LA_HQ, US_East_1, London, etc.
  • Business Unit: HR, Marketing, Point_of_sale, etc.
  • Data Sensitivity: Public, Confidential, Secret, PII, etc.

* See metadata naming conventions for more info.

Keep in mind that these fields are user defined and are not limited to any specific category.
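Once populated, custom fields can scope a search just like the reserved fields do. A minimal sketch, assuming the hypothetical _dataSensitivity and _businessUnit fields from the table above:

_dataSensitivity=Confidential _businessUnit=HR
| count by _sourceCategory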

In some cases, it may make sense to use one field, such as Source Category, to classify the data across multiple categories. For example, "_sourceCategory=dev/App1/Apache/Access" aggregates multiple categories into a single value. In other cases, it would be best to create independent custom fields/values for these classifications: _devStage=dev, _appSource=App1, _appType=Access.

When opting to merge tags into a single metadata field, it is recommended to go from least descriptive to most descriptive.

Example 1:

_sourceCategory = Prod/MyApp1/Apache/Access
_location = US_East_1

Example 2:

_sourceCategory = US_East_1/MyApp1
_stage = Prod
_appMessageType = Apache/Access

The most mature organizations adopt a common schema or taxonomy so that all users of Sumo Logic use the same field names. This allows for easier cross-correlation of data sets and for combining different data streams into single queries or dashboards, assuming the field names match. Both the custom and the built-in field names can be viewed by administrators under the "Fields" tab.
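For instance, if two application teams both populate a shared custom field, a single query can span their otherwise separate streams (the category and field names below are hypothetical):

(_sourceCategory=Prod/App1/* OR _sourceCategory=Prod/App2/*) AND _appStage=prod
| timeslice 5m
| count by _timeslice, _sourceCategory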

Collector Considerations

Syslog Source Categories

Because meta-tags are first defined at the collector, there are some considerations to make when designing your collection architecture. It may make sense to have like data sent to a shared collector so that the data streams inherit the same tags. For some data, such as Syslog, you may wish to set up multiple syslog sources on a collector, listening on different ports, to make tagging easier. Note that the "Source Host" field will be unique for each sending syslog system, but all messages will share the collector's Source Category, which may not be ideal. If multiple syslog sources or collectors are not an option, the Source Category can be overwritten within a Field Extraction Rule (FER).

In other words, in cases where a global meta-tag value is applied to a collector at ingest time, you can still override that value under conditions defined in the parse expression of an FER. In this example, a GCP collector applies _sourceCategory="Labs/GCP/appengine"; however, when the JSON field severity is "INFO", the FER overwrites _sourceCategory with the value "Labs/GCP/appengine2" and otherwise leaves it unchanged (accomplished through the if operator).
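A minimal sketch of such an FER parse expression, assuming the GCP messages are JSON with a top-level severity key, might look like this:

json field=_raw "severity" as severity
| if(severity = "INFO", "Labs/GCP/appengine2", _sourceCategory) as _sourceCategory

Because the if operator falls back to the existing _sourceCategory value, messages with any other severity keep their original tag.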

Windows Source Categories

Collector versions 19.216-22 and later allow you to define Source Category and Source Host metadata values with system environment variables from the host machine. When configuring your Source, specify the system environment variables by prepending sys. and wrapping them in double curly brackets {{}} in this form {{sys.VAR_NAME}} where VAR_NAME is an environment variable name.

For example: {{sys.PATH}}-{{sys.YourEnvVar}}.

Data Storage & Retention

Data Partitions

Data Partitions allow for the segregation of data into smaller, logical sets with their own indexes. A fundamental principle of search efficiency with big data analytics is that the smaller the index or partition, the quicker the results will return. Partitions are most useful when the data is commonly searched in isolation from other partitions. They can be set based on the Source Category or other custom metadata tags defined in the messages. The downside of multiple partitions or indexes is that users then have to know, and declare, which index the data resides in to take advantage of the efficiency gains. In some cases, Sumo Logic is able to optimize searches automatically when your query conditions match a predefined partition; this is done transparently in the background.

Best practices for creating data partitions include:

  • No data overlap
  • Group data that is searched together most often and have the same retention
  • Keep the number of partitions to fewer than 20
  • Ideally between 1% and 30% of total volume

For example, in one scenario, company X has a large volume of data flowing into the Sumo Logic platform. A small subset of this data (5%) belongs to a couple of mission-critical apps whose data should be retained longer than the rest. In this case, two partitions can be created for _sourceCategory=Prod/CriticalApp1* and _sourceCategory=Prod/CriticalApp2* to improve search efficiency and support custom retention policies.

When defining a new partition, the metadata expression that selects the data is referred to as the "Routing Expression".
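Once the partition exists, searches can declare it in the scope. A quick sketch, using prod_criticalapp1 as a hypothetical partition name:

_index=prod_criticalapp1 error
| count by _sourceHost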

Note: Scheduled Views are also a powerful way to segregate data: a defined query (that uses metadata tags) runs automatically in the background, and its aggregate results are saved to a "view". These smaller, aggregated data buckets can greatly improve efficiency when accessing the data, and, similar to partitions, views can have customized retention. Partitions, by contrast, are not aggregated and contain all messages.
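As an illustration, a scheduled view might pre-aggregate hourly request counts for one tagged stream. In this sketch, the Source Category and the status_code log format are assumptions:

_sourceCategory=Prod/MyApp1/Apache/Access
| parse "status_code=*," as status_code
| timeslice 1h
| count by _timeslice, status_code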

By properly tagging all data flowing into the platform, it becomes very easy for organizations to track their data consumption and usage. For example, a prebuilt app from the catalog called Data Volume has a dashboard called Data Volume (Logs) by various metadata fields, which visualizes exactly what the name suggests. (In order for data volume tracking metrics for your account to be written to the Data Volume Index, enable tracking under Administration > Account > Data Management.)
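The Data Volume Index can also be queried directly. The sketch below follows the pattern of Sumo Logic's documented Data Volume examples, and assumes volume messages are JSON objects pairing each source category with sizeInBytes and count values:

_index=sumologic_volume _sourceCategory=sourcecategory_volume
| parse regex "\"(?<sourcecategory>[^\"]+)\"\:\{\"sizeInBytes\"\:(?<bytes>\d+),\"count\"\:(?<count>\d+)\}" multi
| bytes/1024/1024/1024 as gbytes
| sum(gbytes) as gbytes by sourcecategory
| sort by gbytes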

Data Search & Data Forwarding


Using metadata tags helps you easily and effectively define the scope of your search. Tags can also be used with wildcards in order to find the subset of data you need without adding any Boolean logic (OR).

For example, either of the following _sourceCategory values could be used:

_sourceCategory=Networking/Firewall/* (all firewall data)
_sourceCategory=Networking/*/Cisco/* (all Cisco data)

There is almost an art to query optimization, but the most significant efficiency gains come from using defined metadata fields and pre-extracted or parsed fields (through field extraction rules). The queries below are listed from least optimized to most optimized.

Inefficient (keyword only):

valueX

Acceptable (meta tag scope, parsing at search time):

_sourceCategory=prod/foo
| parse "vendorfield *" as somefield
| where somefield="valueX"

Good (meta tag scope plus a keyword):

_sourceCategory=prod/foo and valueX
| parse "vendorfield *" as somefield
| where somefield="valueX"

Better* (meta tag scope plus a pre-extracted FER field):

_sourceCategory=prod/foo and fielda=valueX

Best (partition, meta tag and pre-extracted FER field):

_index=prod AND _sourceCategory=prod/foo AND fielda=valueX

* Note: If partitions were defined using "_sourceCategory=prod/…" as the scope, Sumo Logic's behind-the-scenes Query Rewriting is intelligent enough to understand that the data being queried is included within _index=prod; therefore, at runtime, it will rewrite the query as _index=prod AND _sourceCategory=prod/foo.

See Optimizing Your Search with Partitions for more information.

Data Analytic Tiers

Once data is properly classified, tagged and/or parsed, organizations can achieve significant cost savings by routing their data into the different analytic tiers available (Continuous, Frequent and Infrequent).

Data Access Controls

Sumo Logic fully supports role-based access controls (RBAC). Once roles have been created, a search filter is applied to control what log data a user with that role can access. You can define a search filter using keywords, wildcards, metadata fields, and logical operators. At search time, Sumo Logic transparently uses boolean logic to include or exclude data as appropriate to the role. In cases where a user is assigned to multiple roles, the conditions are cumulative (OR), as shown below.

Role A filter:

_source="GCP Audit" AND _collector="GCP"

Role B filter:

_sourceCategory=Prod/MyApp1*

Cumulative query condition:

((_source="GCP Audit" AND _collector="GCP") OR _sourceCategory=Prod/MyApp1*)
AND <users-query>

As you can see, proper tagging of data makes it very easy to map data to internal data sensitivity controls, and then in practice, control how users of the platform access the data.

There are some search filter limitations to be aware of:

  • Role filters cannot include vertical pipes (|).
  • Scheduled Views and Partitions are not supported in role filters, due to potentially conflicting field names and value types.
  • Role filters apply to log searches, not metric searches.
  • If one or more of your FERs override the out-of-the-box metadata tags you use in your search filters for a role, LiveTail can still provide access to data outside of the scope intended in your search filter. You should either avoid overriding out-of-the-box metadata tags in your FERs or avoid overridden tags in your search filters.

Getting Started

Implementing best practices can help you manage and structure your data efficiently to optimize its searchability and value. Tagging strategies should not be left to individual engineers' discretion. Prior to implementation, an assessment should be done to determine the sensitivity and criticality of the data being collected, and how access to this data should map to platform users. In addition, a determination should be made on retention requirements and what cost savings can be achieved.

Below is a set of possible steps you may consider:

  1. Identify and classify the data your organization collects
  2. Know which legal requirements apply
  3. Determine if any data masking or RBAC controls need to be implemented
  4. Determine data retention requirements (Hot, Warm, Cold) and map them to partitions and Sumo Logic Data Tiers
  5. Work with system owners, users and analysts to determine what tagging will provide the most efficiency gains.
  6. Create and share internally a schema or set of field names for consistent content creation.

Summary

Tagging used to be the exclusive domain of urban graffiti artists; now geeks can be cool too and learn the art of metadata tagging in Sumo Logic.


Chas Clawson

Field CTO, Security

As a technologist interested in disruptive cloud technologies, Chas joined Sumo Logic's cyber security team with over 15 years in the field, consulting with many federal agencies on how to secure modern workloads. In the federal space, he spent time as an architect designing the Department of Commerce ESOC SIEM solution. He also worked at the NSA as a civilian conducting Red Team assessments and within the office of compliance and policy. Commercially, he has worked with MSSP practices and security consulting services for various Fortune 500 companies. Chas also enjoys teaching Networking & Cyber Security courses as a Professor at the University of Maryland Global College.
