Bucket Inventory Reports

Overview

Starting with RELEASE.2025-12-20T04-58-37Z, AIStor supports creating inventory reports on objects and related metadata in a bucket. The AIStor inventory feature provides a fully integrated solution with utility equivalent to the previously announced AIStor Catalog, while expanding functionality with improved scheduling, filtering, and integration options.

Each inventory configuration is bucket-scoped and user-defined, allowing you to create multiple inventory jobs per bucket with different filters, schedules, and output formats. You can schedule jobs to run once, hourly, daily, weekly, monthly, or yearly, with each execution creating a timestamped output folder.

Inventory reports include object metadata such as size, last modified date, storage class, encryption status, tags, and user metadata. You can filter objects by prefix, age, size, name patterns, tags, or custom metadata to generate targeted reports. The inventory system supports CSV, JSON, and Parquet output formats with optional compression, making it suitable for compliance reporting, data analytics, and integration with downstream systems.

Use the mc inventory commands to create and manage inventory jobs.

Quick start

Before creating an inventory job, ensure you have the following:

  • s3:PutInventoryConfiguration permission on the source bucket
  • s3:GetInventoryConfiguration permission on the source bucket
  • s3:ListBucket permission on the source bucket
  • Write permissions on the destination bucket
  • An alias configured for your AIStor deployment (for example, myaistor)

  1. Generate configuration template

    Use mc inventory generate to create a YAML configuration template that serves as the starting point for your inventory job.

    mc inventory generate ALIAS/SOURCE_BUCKET INVENTORY_JOB_ID > inventory-config.yaml
    

    Replace ALIAS with your AIStor alias, SOURCE_BUCKET with the bucket to inventory, and INVENTORY_JOB_ID with a unique identifier for this job.

    The command creates a YAML file with all available configuration options and comments explaining each field. See the configuration reference for more complete documentation on available fields.

  2. Edit the job configuration

    Open the generated inventory-config.yaml file and configure the job to reflect your desired outcome. The following example configures a daily job that reports on the current version of every object in the SOURCE_BUCKET and writes compressed CSV output to the inventory-reports bucket under the documents-inventory/ prefix:

    apiVersion: v1
    id: daily-report
    destination:
      bucket: inventory-reports
      prefix: documents-inventory/
      format: csv
      compression: on
    schedule: daily
    mode: fast
    versions: current
    

    Save your changes to the configuration file.

  3. Create the inventory job

    Use [mc inventory put](/enterprise/aistor-object-store/reference/cli/mc-inventory/mc-inventory-put/) to upload the configuration and add the inventory job to your AIStor deployment.

    mc inventory put ALIAS/SOURCE_BUCKET inventory-config.yaml
    

    AIStor validates the configuration and schedules the job according to your specified schedule. For one-time jobs (the default schedule), the job begins execution immediately. For recurring jobs, the first execution starts based on the schedule type.

  4. Monitor job status

    Use mc inventory status to track progress and completion of the job:

    mc inventory status ALIAS/SOURCE_BUCKET INVENTORY_JOB_ID
    

    The status output includes:

    • Job state
    • Objects scanned
    • Records written
    • Execution time
    • Errors encountered

    Add the --watch flag to the command to monitor job progress in real time.

Processing job output

Inventory jobs write output to the destination bucket in a structured folder hierarchy:

DESTINATION_BUCKET/
  PREFIX/
    SOURCE_BUCKET/
      INVENTORY_JOB_ID/
        YYYY-MM-DDTHH-MMZ/
          files/
            file-001.csv.zst
            file-002.csv.zst
          manifest.json

Each execution creates a timestamped folder containing data files and a manifest. The timestamp reflects when the job started. The manifest provides metadata that gives downstream consumers visibility into the inventory run.
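
Because AIStor exposes an S3-compatible API, you can discover executions programmatically with any S3 SDK. The following Python sketch lists the timestamped execution folders for a job using boto3; the endpoint URL, credentials, and bucket and prefix names are placeholder assumptions, not AIStor defaults.

# A minimal sketch of discovering execution folders for an inventory job.
# Endpoint, credentials, and names below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://aistor.example.com",   # your AIStor endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# List the timestamped execution folders under a job's output prefix.
resp = s3.list_objects_v2(
    Bucket="inventory-reports",
    Prefix="documents-inventory/my-bucket/daily-report/",
    Delimiter="/",
)

# Each CommonPrefix corresponds to one execution, e.g. .../2025-01-15T10-30Z/
for folder in resp.get("CommonPrefixes", []):
    print(folder["Prefix"])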

Manifest file structure

The manifest file provides metadata about the inventory execution and lists all data files produced. It uses the AWS S3 Inventory manifest format with a MinIO extension that includes job status, objects scanned, and objects matched by filters. The manifest also includes MD5 checksums for data files.

{
  "sourceBucket": "my-bucket",
  "destinationBucket": "dest-bucket",
  "version": "2016-11-30",
  "creationTimestamp": "1736943600",
  "fileFormat": "CSV (ZSTD compressed)",
  "fileSchema": "Bucket,Key,Size,LastModifiedDate,...",
  "files": [
    {"key": "prefix/bucket/job-id/2025-01-15T10-30Z/files/file-001.csv.zst", "size": 1024, "MD5checksum": "abc123"}
  ],
  "minioExtension": {
    "status": "completed",
    "scannedObjects": 12500,
    "matchedObjects": 8300,
    "partialResultsAvailable": false
  }
}

Use the manifest to programmatically discover and validate inventory output files.
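
As an example, the following Python sketch reads a downloaded manifest, prints the extension fields, and verifies each data file against its recorded checksum. It assumes the manifest and data files were copied locally first (for example, with mc cp) and that MD5checksum is hex-encoded; both are assumptions for illustration.

# A minimal sketch of validating inventory output against its manifest.
# Assumes manifest.json and the data files exist in the working directory.
import hashlib
import json

with open("manifest.json") as f:
    manifest = json.load(f)

ext = manifest.get("minioExtension", {})
print(f"status={ext.get('status')}, "
      f"scanned={ext.get('scannedObjects')}, matched={ext.get('matchedObjects')}")

# Verify each data file against the checksum recorded in the manifest
# (assuming a hex-encoded MD5 digest).
for entry in manifest["files"]:
    local_path = entry["key"].rsplit("/", 1)[-1]     # e.g. file-001.csv.zst
    with open(local_path, "rb") as data:
        digest = hashlib.md5(data.read()).hexdigest()
    status = "OK" if digest == entry["MD5checksum"] else "MISMATCH"
    print(f"{local_path}: {status} ({entry['size']} bytes expected)")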

The minioExtension object provides additional details about the inventory execution:

| Key | Value |
|-----|-------|
| status | Indicates the final state of the job. Possible values: "completed" (the job finished successfully), "canceled" (the job was canceled with mc inventory cancel), or "suspended" (the job was suspended with mc inventory suspend). |
| scannedObjects | Total count of objects examined by the inventory job. |
| matchedObjects | Count of objects matching the configured filters and included in the output. |
| partialResultsAvailable | Indicates whether the output files contain complete results. true indicates partial results from a canceled or suspended job; false indicates the complete results of a completed job. |

Working with data files

Data files contain one row per object for CSV and JSON formats, or use columnar storage for Parquet. AIStor compresses the files by default using ZSTD compression. CSV files include field names in the first row. JSON files use JSON Lines format with one object per line.

Data files follow the naming pattern file-NNN.{format}.{compression}, where:

  • NNN - a zero-padded sequence number (001, 002, etc.)
  • {format} - the configured output format (csv, json, or parquet)
  • {compression} - the compression method (zst for ZSTD compression; omitted when compression is disabled)

For example, file-001.csv.zst indicates a compressed CSV file, while file-002.parquet indicates an uncompressed Parquet file.

Process the data files using standard tools for the chosen format. For example, use pandas for CSV and Parquet in Python, jq for JSON, or load the files into a data warehouse or analytics platform. Parquet-formatted files typically integrate directly with tools like Apache Spark, Presto, and Trino.
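
As a sketch of the pandas path, the snippet below loads a compressed CSV data file and aggregates object sizes by bucket. The file names are placeholders matching the naming pattern above, and the column names assume the default CSV field schema; reading .zst files requires the zstandard package, and Parquet requires pyarrow or fastparquet.

# A minimal sketch of loading inventory data files with pandas.
import pandas as pd

# pandas decompresses ZSTD transparently when the zstandard package is installed.
df = pd.read_csv("file-001.csv.zst", compression="zstd")

# Parquet files load directly (requires pyarrow or fastparquet):
# df = pd.read_parquet("file-002.parquet")

# Example: total size of matched objects per bucket, assuming the default
# field schema (Bucket, Key, Size, LastModifiedDate, ...).
print(df.groupby("Bucket")["Size"].sum())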

The default fields in the data files include bucket name, object key, object size, and last modified date. Some default fields only populate on buckets with versioning enabled. You can modify the report to include optional fields. See the output field list for more complete documentation.

Scheduling jobs

The scheduler process manages all job states, including planning subsequent runs for recurring jobs, rescheduling failed jobs, and cleaning up after job execution. The scheduler runs on at most one node in the cluster at a time, but maintains its state within the cluster so that another node can resume the process if the scheduler node fails.

AIStor schedules jobs based on the schedule field in the job configuration. One-time jobs with schedule: once run once and either complete or fail, with no further runs. Repeating jobs with any other schedule value run repeatedly until suspended or deleted.

The scheduler calculates the next run time for repeating jobs based on when the previous execution completed:

  • hourly: One hour after the last completion
  • daily: Next midnight (00:00 UTC) after the last completion
  • weekly: Next Sunday at midnight (00:00 UTC) after the last completion
  • monthly: First Sunday of the next month at midnight (00:00 UTC) after the last completion
  • yearly: First Sunday of the next year at midnight (00:00 UTC) after the last completion

The scheduler processes each job once regardless of the number of missed runs between the last and current run. For example, suspending a daily job on Monday and resuming it on Friday does not result in AIStor running the missed Tuesday-Thursday runs.
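
To make the rules above concrete, the following Python sketch computes the next run time for the hourly, daily, and weekly schedules. It illustrates the documented behavior only and is not the scheduler's actual implementation.

# An illustrative sketch of the documented next-run rules; times are UTC.
from datetime import datetime, timedelta, timezone

def next_run(schedule: str, completed_at: datetime) -> datetime:
    midnight = completed_at.replace(hour=0, minute=0, second=0, microsecond=0)
    if schedule == "hourly":
        return completed_at + timedelta(hours=1)       # one hour after completion
    if schedule == "daily":
        return midnight + timedelta(days=1)            # next midnight UTC
    if schedule == "weekly":
        days = (6 - completed_at.weekday()) % 7 or 7   # next Sunday (Mon=0 ... Sun=6)
        return midnight + timedelta(days=days)
    raise ValueError(f"unhandled schedule: {schedule}")

done = datetime(2025, 1, 15, 10, 45, tzinfo=timezone.utc)  # a Wednesday
print(next_run("daily", done))    # 2025-01-16 00:00:00+00:00
print(next_run("weekly", done))   # 2025-01-19 00:00:00+00:00 (Sunday)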

Job states

Each inventory job state indicates its current phase in the scheduling and execution lifecycle. The scheduler transitions jobs between states based on execution progress, scheduling requirements, and control operations.

| State | Description |
|-------|-------------|
| Pending | Waiting for processing by an executor. |
| Running | Actively scanning objects and writing output files. The job remains in this state until completion, failure, or a suspend or cancel operation. |
| Sleeping | Waiting for the next scheduled run. This state only applies to repeating jobs. The scheduler transitions the job to Pending at the scheduled time. |
| Completed | Completed without errors. Terminal state for one-time jobs. Repeating jobs transition through Completed and return to Sleeping. |
| Failed | Exceeded the maximum number of retry attempts (3) after encountering errors. Terminal state for one-time jobs. |
| Errored | A run failed with retry attempts remaining. The scheduler marks the job for retry after a 10-minute delay. |
| Canceled | Stopped with the mc inventory cancel command. Terminal state for one-time jobs. For repeating jobs, the scheduler transitions the job to Pending at the next scheduled time. |
| Suspended | Paused with the mc inventory suspend command. The job remains in this state until explicitly resumed. |

Use mc inventory status to check a job’s current state and execution details.
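
For reference, the lifecycle can be summarized as a small state machine. The following Python sketch encodes the transitions described in the table above; it is illustrative only, and the transitions marked as assumed (such as how a retried or resumed job re-enters execution) are not confirmed internals.

# An illustrative encoding of the documented job lifecycle; not the
# scheduler's actual implementation.
ALLOWED_TRANSITIONS = {
    "Pending":   {"Running"},
    "Running":   {"Completed", "Errored", "Failed", "Canceled", "Suspended"},
    "Sleeping":  {"Pending"},            # at the next scheduled time
    "Completed": {"Sleeping"},           # repeating jobs only
    "Errored":   {"Pending"},            # assumed: retried after the 10-minute delay
    "Canceled":  {"Pending"},            # repeating jobs only
    "Suspended": {"Pending"},            # assumed: after an explicit resume
}

def check_transition(current: str, target: str) -> bool:
    """Return True if the documented lifecycle permits current -> target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())

assert check_transition("Running", "Suspended")
assert not check_transition("Failed", "Running")   # terminal for one-time jobs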

Executing jobs

The executor process manages jobs that the scheduler marks as ready to run. The executor runs on every node in the cluster, enabling parallel processing of jobs to maximize throughput and minimize execution time.

The executor typically runs jobs within 15-30 minutes of the scheduler marking the job as pending.

A failed job may successfully write partial output prior to encountering the error. Similarly, canceling or suspending a job mid-run may produce partial output. Check the partialResultsAvailable field in the job manifest to determine whether the output contains incomplete data.
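
A minimal Python sketch of that check, assuming the manifest has been downloaded locally (the path is a placeholder):

# Guard downstream processing on complete results; a sketch, not a
# supported client.
import json

with open("manifest.json") as f:
    manifest = json.load(f)

ext = manifest.get("minioExtension", {})
if ext.get("status") != "completed" or ext.get("partialResultsAvailable"):
    # Canceled, suspended, or failed runs may have written only partial output.
    raise SystemExit(f"skipping partial inventory output (status={ext.get('status')})")

data_files = [entry["key"] for entry in manifest["files"]]
print(f"processing {len(data_files)} complete data files")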