Object Storage Usage

About S3

S3, or Simple Storage Service, is a cloud storage service developed by Amazon Web Services (AWS). It allows users to store and retrieve any amount of data at any time, from anywhere on the web.

Here are some key points:

  • Object Storage: S3 uses an object storage model, where each file is treated as an object and stored in a "bucket" (a storage container).
  • Scalability: S3 is designed to scale easily, allowing you to store anything from small amounts to very large quantities of data without any issues.

To authenticate, a user needs an ACCESS_ID and a SECRET_ID. The Mésocentre administration team will provide you with this information.

Objects

Object storage does not store files in a global metadata hierarchy. For example, creating a directory makes no sense on an object storage infrastructure, as there is no data associated with it.

Instead, every file carries its own metadata. One piece of metadata is the object name, which may contain / characters to map a classic hierarchical view. On the datastore this is not treated as a directory, but some CLIs use it to simulate a hierarchy.

Buckets

Buckets play a key role in object storage data management. A bucket is a container which stores related data. That relation can be:

  • an experiment run name under which you put all data related to that specific run, for example s3://run-20241001 for the run of October 1st, 2024
  • a service under which you put all data related to that specific service, for example s3://website.domain.org for all data related to the website http://website.domain.org

It is up to you to choose your bucket organisation and how many buckets you want to have.

s5cmd

s5cmd is a powerful S3 CLI which provides basic tools to manage your data, plus some advanced features to build a dataflow.

environment variable credentials

s5cmd can read credentials from a file, but this is not very convenient: you have to pass the --credentials-file option, and S3 endpoints cannot be stored in that file. We recommend using environment variables instead.

$ cat > ceph-meso.sh <<EOF
export AWS_ACCESS_KEY_ID='<your-access-key-id>'
export AWS_SECRET_ACCESS_KEY='<your-secret-access-key>'
export S3_ENDPOINT_URL='<s3-endpoint-url>'
EOF
$ source ./ceph-meso.sh
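s5cmd can fail with a cryptic error when these variables are missing, so a small guard in your scripts may save time. This is a hypothetical helper (check_s3_env is not part of s5cmd); the variable names match the ceph-meso.sh file above.

```shell
# check_s3_env: verify the credentials from ceph-meso.sh are loaded;
# prints the first missing variable and returns 1 otherwise
check_s3_env() {
    for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY S3_ENDPOINT_URL; do
        eval "val=\$$var"                     # indirect lookup, POSIX sh compatible
        if [ -z "$val" ]; then
            echo "missing $var, did you source ceph-meso.sh?" >&2
            return 1
        fi
    done
    return 0
}
```

Usage: `check_s3_env && s5cmd ls` runs the listing only when all three variables are set.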

create a bucket

Let's say for this tutorial that you want one bucket per experiment. Suppose your lab runs two experiments, foo and bar; you will create two buckets:

$ s5cmd mb s3://foo
mb s3://foo
$ s5cmd mb s3://bar
mb s3://bar

list available buckets

$ s5cmd ls  
2024/10/09 11:28:31  s3://bar
2024/10/09 11:28:28  s3://foo

upload a file

As described above, object storage is different from POSIX storage: files are just dropped into a bucket with their own metadata. Fortunately, s5cmd manages a POSIX-like hierarchy by interpreting / in object names as a directory separator.

For the tutorial, imagine you want two directory-like prefixes in your bucket: one for input parameters, the other for output data. Input could be the scientific platform parameters used to run the experiment; output could be the results of your experiment.

Let's upload the input and output of today's run for experiment bar:

$ touch input.txt
$ s5cmd cp input.txt s3://bar/input/$(date +%Y%m%d).txt
cp input.txt s3://bar/input/20241009.txt

$ touch output.txt
$ s5cmd cp output.txt s3://bar/output/$(date +%Y%m%d).txt
cp output.txt s3://bar/output/20241009.txt

Note

As already said, and it is important to understand: the file name is input/20241009.txt. You do not have to create a directory.
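Building the dated key by hand each time makes it easy to mix up %m (month) and %M (minutes) in the date format. A small helper can compute the destination key once; dated_key is a hypothetical function name introduced here, only the s5cmd cp call in the usage line is real s5cmd syntax.

```shell
# dated_key BUCKET PREFIX FILE: print the destination key s3://BUCKET/PREFIX/YYYYMMDD.ext
dated_key() {
    day=$(date +%Y%m%d)                      # %m = month; %M would be minutes!
    printf 's3://%s/%s/%s.%s\n' "$1" "$2" "$day" "${3##*.}"
}
```

Usage: `s5cmd cp input.txt "$(dated_key bar input input.txt)"` uploads to a key like s3://bar/input/20241009.txt.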

listing the content of a bucket

Well, now let's see what's inside your bucket

$ s5cmd ls s3://bar/
                                  DIR  input/
                                  DIR  output/

WHAT !! Directories ??

s5cmd gives you a POSIX-like representation of your bucket. The directories exist only because at least one file is named input/xxx and another is named output/xxx. Of course, you can list what's inside the input directory with the command s5cmd ls s3://bar/input/.

You can also list the whole content of the bucket with

$ s5cmd ls 's3://bar/*'

removing a file

Ok, now what happens if you remove a file?

$ s5cmd rm s3://bar/input/20241009.txt
rm s3://bar/input/20241009.txt
$ s5cmd ls s3://bar/
                                  DIR  output/

As you can see, the input directory has disappeared; that's simply because there is no longer any object named input/xxx.

restic

Restic is a simple backup tool. Its S3 backend support gives users a simple way to back up local data to Ceph@Mésocentre.

env variable

restic uses environment variables to manage its configuration. Here is an example of a basic configuration that you can source every time you want to use restic.

By default, restic only reads one file at a time. If you have a lot of small files, your filesystem will probably not provide enough bandwidth to minimize the backup duration; you can improve it with the RESTIC_READ_CONCURRENCY option, which makes restic read multiple files at the same time.

$ cat > ~/.config/restic << EOF
export AWS_ACCESS_KEY_ID='<your-access-key-id>'
export AWS_SECRET_ACCESS_KEY='<your-secret-access-key>'
export RESTIC_REPOSITORY=s3:<s3-endpoint>/restic
export RESTIC_READ_CONCURRENCY=20
EOF
$ source ~/.config/restic

Note

You can add the line export RESTIC_PASSWORD='<password>' to your configuration file :warning: if you feel secure about it :warning:.

Initialize bucket

To use restic you have to create a specific bucket for each directory you want to back up. restic will create a secure bucket to store your backups.

Note

The bucket will NOT be human-readable: restic uses its own format to organize your data.

$ restic init
enter password for new repository:
enter password again:
created restic repository 84acbf7a6f at s3:http://<endpoint>/restic

Please note that knowledge of your password is required to access
the repository. Losing your password means that your data is
irrecoverably lost.

Backup

Starting a backup is just a one-line command, as follows. If you have registered RESTIC_PASSWORD, you can run it from a cron job to avoid starting it manually.

$ restic backup ${directory}
enter password for repository:
repository 84acbf7a opened (version 2, compression level auto)
created new cache in ${directory}
no parent snapshot found, will read all files
[0:00]          0 index files loaded

Files:          89 new,     0 changed,     0 unmodified
Dirs:           57 new,     0 changed,     0 unmodified
Added to the repository: 190.764 KiB (109.085 KiB stored)

processed 89 files, 124.248 KiB in 0:00
snapshot 85d90fdb saved
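When RESTIC_PASSWORD is stored in ~/.config/restic as described above, the backup can be scheduled with a crontab entry (crontab -e) like this sketch; the schedule, backed-up directory and log file are examples to adapt:

```shell
# every night at 02:00: load the restic configuration, then back up $HOME/data
0 2 * * * . "$HOME/.config/restic" && restic backup "$HOME/data" >> "$HOME/restic-backup.log" 2>&1
```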

Snapshots

Every time you start a backup, restic creates a unique snapshot. A snapshot references all files that existed at the time of the snapshot, even those that were not modified.

listing snapshots

$ restic snapshots
enter password for repository:
repository 84acbf7a opened (version 2, compression level auto)
ID        Time    Host      Tags        Paths  Size
----------------------------------------------------------
85d90fdb  <date>  <hostname>            <dir>  124.248 KiB
----------------------------------------------------------
1 snapshots

listing data in a snapshot

$ restic ls 85d90fdb
enter password for repository:
repository 84acbf7a opened (version 2, compression level auto)
[0:00] 100.00%  1 / 1 index files loaded
snapshot 85d90fdb of [<dir>] at [...]:
[...]

comparing snapshots

$ restic diff 6937cc80 9f69a047
comparing snapshot 6937cc80 to 9f69a047:

M    /restic/cmd_diff.go
+    /restic/foo
M    /restic/restic

Files:           0 new,     0 removed,     2 changed
Dirs:            1 new,     0 removed
Others:          0 new,     0 removed
Data Blobs:     14 new,    15 removed
Tree Blobs:      2 new,     1 removed
  Added:   16.403 MiB
  Removed: 16.402 MiB

You can also compare the content of a specific directory between snapshots:

$ restic diff 6937cc80:/AIA/2024/11/ 9f69a047:/AIA/2024/11/
comparing snapshot 6937cc80 to 9f69a047:

[...]
+    /22/94/2024_11_22__08_59_11_122__SDO_AIA_AIA_94.jp2
+    /22/94/2024_11_22__08_59_47_122__SDO_AIA_AIA_94.jp2
+    /22/94/2024_11_22__09_00_23_122__SDO_AIA_AIA_94.jp2
+    /22/94/2024_11_22__09_00_59_122__SDO_AIA_AIA_94.jp2

Files:        7172 new,     0 removed,     0 changed
Dirs:            0 new,     0 removed
Others:          0 new,     0 removed
Data Blobs:  10071 new,     0 removed
Tree Blobs:     12 new,    12 removed
  Added:   7.202 GiB
  Removed: 275.983 KiB

restoring data from a snapshot

$ restic restore <tag> --target <destination>
enter password for repository:
repository 84acbf7a opened (version 2, compression level auto)
[0:00] 100.00%  1 / 1 index files loaded
snapshot 85d90fdb of [<dir>] at [...]:
[...]

<tag> can be the ID of a snapshot, or latest for the most recent snapshot.

Removing old snapshots

restic does NOT delete snapshots on its own. To avoid storage saturation, you can remove old snapshots by asking restic to keep only the last <x> snapshots. Only data that is no longer referenced by any remaining snapshot will be deleted.

$ restic forget --keep-last <nb> --prune
[...]

<nb> is the number of snapshots you want to keep.
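Besides --keep-last, restic supports time-based retention policies that can be combined in a single forget call (these --keep-* flags are standard restic options); for example:

```shell
# keep 7 daily, 4 weekly and 12 monthly snapshots, then delete unreferenced data
$ restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
```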