Azure Storage Backup and Delete Tool - Tutorial

Backing Up an Azure Storage Account

I always wanted to try my hand at writing a programming tutorial, so this is my first attempt at one. If you’re strictly interested in the code, you can find it on my Gitlab repo. Before you download and run it, keep in mind that it does contain destructive commands and bulk operations. These can cause data loss and incur high transfer charges depending on the location of the storage accounts you’re transferring to and from. It has no error handling, and unless you know what the code does, it can yield strange results. However, for me it doubled as the learning ground for Azure Storage (AS). If you’re interested in an overview of the code itself and some thoughts on the current state of the Azure libraries made by Microsoft, keep scrolling.

[Image: Seagate Momentus Thin internal hard drive]

The curse of choice

On a recent project I had to work with AS and I found some of the concepts confusing early on. The biggest issue I had was with the libraries used to interact with AS. An AS account isn’t strictly just for storage and it provides a multitude of ways to interact with its contents.

This complexity is daunting and because it was my first larger project working with it, I felt lost at times. As of this moment, if you search for ‘azure storage’ on nuget.org you will get 1,215 results back, a number that can add to the confusion if you are starting out. Which of these packages are useful, which will help you solve the task at hand and which of these are still updated and maintained?

One of the reasons for the multitude of packages is that there are a number of .Net frameworks that need to be supported. The project I worked on was using Microsoft.WindowsAzure.Storage, which is currently deprecated, but worked fine for the .Net 4.6 solution it was used in. I was writing my solution in .Net Core 3.1 and I hoped to use the latest packages, so that if I do expand on my code later on, the associated packages won’t already be deprecated by then.

If you are new to .Net development, the large number of results can still be quite confusing when you look for the replacement packages for WindowsAzure.Storage. You have Microsoft.Azure.Storage.Blob to work with blobs and Microsoft.Azure.CosmosDB.Table to work with tables. Why isn’t this second one also part of the Storage package? And what’s CosmosDB? This inconsistency in naming raises questions as I search for the latest libraries to use - why is one Azure.Storage and the other Azure.CosmosDB? Are these the two packages I should use to work with tables and blobs and they just happen to have different namespaces? If that’s the case, why are the version numbers so different - shouldn’t similar packages evolve at roughly the same rate? Besides that, the notes on the CosmosDB.Table package say that this library is “in maintenance mode and it will be deprecated soon. Please upgrade to the new .Net Standard Library Microsoft.Azure.Cosmos.Table”.

Lost? Me too.

At this point I thought I would use these two packages - Azure.Cosmos.Table and Azure.Storage.Blob. I looked at the release versions: Microsoft.Azure.Storage.Blob is currently on v11.2.3 and Microsoft.Azure.Cosmos.Table is on v1.0.8. Has one been around a lot longer and the other one just got added? Not really - they were both released in 2018, but the version numbers start off differently (I’m sure there is a guide or article explaining this, but my point is that this version and naming mismatch causes a lot of confusion, especially among us newer .Net developers).

At this point you might say “I should just use AzCopy”. It seems to do what I wanted, which is to back up an AS account to another account, and it also eliminates the issues around packages. The problem is that if you want to copy Azure Tables you need to make sure you use version 7.3 of AzCopy, but if you click on the link Microsoft provide in the AzCopy documentation you are taken to a generic Azure Tables article. So can AzCopy be used to back up Azure Tables? Apparently it can, but according to this response on Stack Overflow it can’t do a direct table to table copy. You first need to export the source table(s) to local disk or blob storage, according to Zhaoxing Lu, the person who provided the accepted answer on the thread. What now?

I went back to nuget.org, searched for ‘azure storage’ again and started the whole process from scratch. This time I found another package, Azure.Storage.Blobs. This is different from the previous ones, but it seems to be the most up to date. We have progress! But what will I do with the tables? I spent some time looking for Azure.Storage.Tables, but this doesn’t exist. What I should have looked for instead was Azure.Data.Tables. Although this quest spans only a few paragraphs in this article, it cost a few hours of my time trying to understand which packages to use to interact with the storage accounts. I’m writing this in the hope that someone stumbles across the post and it saves them a bit of time.

Out of the two packages, Azure.Data.Tables is in prerelease and shouldn’t be used for production applications. When I started this project I used version Beta 7, and because of changes made to methods within the package, my code broke. The changelog did not mention those changes, and after raising an issue on the project’s Github page, Microsoft were quick to fix it and add more items to the changelog. This is worth keeping in mind if you read this after Beta 9 comes out or after the final production version is released, as some methods might have changed again.

Putting things together

This is the brief story of how I ended up using a library with 86,000 downloads and another one with 15,500,000 downloads. Microsoft might have prioritized development for the Blob storage library, but if anything, they should have mentioned that another library is currently in development and what the recommended library for interacting with Table storage is. Maybe they did, but due to the sheer number of active projects, this is very difficult to find.

I understand that people might not start development with the same freedom I did, and you might be forced to use one library over another because your project is on .Net 4.6, for example. That still doesn’t justify the confusion caused by all the libraries which don’t share a naming convention and target different versions and builds of .Net, yet address the same problem context - AS.

With the two libraries selected it’s time to understand the structure of Azure Tables and Blobs and how to interact with them through the two packages.

The Azure.Data.Tables package is used to access the content inside Azure Table storage and the Azure.Storage.Blobs package is used to access the content inside Azure Blob storage.

The two nuget packages are installed with dotnet add package Azure.Storage.Blobs and dotnet add package Azure.Data.Tables --prerelease. The --prerelease flag needs to be used since the Tables package is still in beta (v8 right now), and if you do not use it when adding the package you will get a prompt telling you to.

Both have similar key concepts, as described on their respective documentation pages for Azure Tables and Azure Blobs.
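
As a quick taste of the two packages before the main program, here is a minimal sketch (not part of the repo) that creates a service client from each library and lists what the account contains. The connection string placeholder is something you would fill in from the Azure portal.

using System;
using Azure.Data.Tables;          // TableServiceClient, TableItem
using Azure.Storage.Blobs;        // BlobServiceClient
using Azure.Storage.Blobs.Models; // BlobContainerItem

class StorageQuickTour
{
    static void Main()
    {
        // fill this in with a connection string from the Azure portal
        string connectionString = "<storage account connection string>";

        // Azure.Storage.Blobs: list every blob container in the account
        BlobServiceClient blobServiceClient = new BlobServiceClient(connectionString);
        foreach (BlobContainerItem container in blobServiceClient.GetBlobContainers())
        {
            Console.WriteLine($"Container: {container.Name}");
        }

        // Azure.Data.Tables: list every table in the account
        TableServiceClient tableServiceClient = new TableServiceClient(connectionString);
        foreach (TableItem table in tableServiceClient.Query())
        {
            Console.WriteLine($"Table: {table.Name}");
        }
    }
}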

The Main Class

With the background out of the way, it’s time to dive into the code itself. This is split into three files - Program.cs, BlobOperations.cs and TableOperations.cs. The main method is inside Program.cs and triggers the start of the program with the other two files holding operations to download, upload, transfer and delete blobs and tables.

Some of the code below is taken from a couple of the samples provided by Microsoft on the Azure Data Tables Github repo and from the Azure Storage Blob samples for mass upload and download. Learning more about batch operations and async uploads and downloads is an entire topic of its own that deserves a separate article, but the links provided can help shed light on the actual process and methods used.

Below is the full code from Program.cs and the flow of the application goes like this: START -> delete any blobs/tables in the destination -> wait 15 seconds for the delete operation to finish -> download blobs from origin to local storage -> upload blobs to destination storage -> transfer tables -> END.

using System;
using System.Threading.Tasks;
using System.Threading;

namespace AzureStorageBackup
{
    class Program
    {
        static async Task Main(string[] args)
        {
            string originConnectionString = "";

            string destinationConnectionString = "";

            // clear both blob and table storage
            // blob delete operation can still be ongoing despite the command to delete them being
            // sent over! as such, wait 15 seconds before attempting to create/upload anything
            await BlobOperations.DeleteBlobs(destinationConnectionString);
            await TableOperations.DeleteTables(destinationConnectionString);

            TimeSpan ts = new TimeSpan(0, 0, 15);
            Console.WriteLine("Waiting 15 seconds for delete blob operation to finish."
                + " Create blob operations will fail if run right after a delete!");
            Thread.Sleep(ts);

            // download blobs async to a local folder
            await BlobOperations.DownloadBlobs(originConnectionString);

            // upload downloaded blobs async to destination account
            await BlobOperations.UploadBlobs(destinationConnectionString);

            // transfer tables from an origin account to a destination account
            await TableOperations.TransferTables(originConnectionString,
                destinationConnectionString);
            
        }
    }
}

The class starts off with two connection strings being defined. These can be hard coded, but there is no reason why they cannot be moved into a settings file or even read from the console when the application starts. For the purpose of simplicity I did not do that, and everything is just kept in one big file. Next, Blob Containers and their contents are deleted from the destination storage account and Azure Tables are cleared from the same destination account. This ensures that the account is empty and no data is updated/overwritten. Again, the purpose of the code is to create an exact backup of another storage account.
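
As a small illustration of the settings/console idea above, a hedged sketch of how the hard coded strings could be replaced is below. The environment variable names are made up for the example and this helper is not part of the repo code.

// Hypothetical helper that could live in Program.cs - reads a connection string from an
// environment variable (AZ_BACKUP_ORIGIN / AZ_BACKUP_DESTINATION are made-up names)
// and falls back to asking on the console if the variable is not set.
static string GetConnectionString(string environmentVariable, string prompt)
{
    string value = Environment.GetEnvironmentVariable(environmentVariable);
    if (string.IsNullOrWhiteSpace(value))
    {
        Console.Write(prompt);
        value = Console.ReadLine();
    }
    return value;
}

// usage inside Main:
// string originConnectionString =
//     GetConnectionString("AZ_BACKUP_ORIGIN", "Origin connection string: ");
// string destinationConnectionString =
//     GetConnectionString("AZ_BACKUP_DESTINATION", "Destination connection string: ");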

Due to the nature of Azure Storage, if you do end up deleting anything in the destination account, you will get an error if you try to recreate or write to it too soon - the delete operation keeps running in the background even after the call has returned. A simple Thread.Sleep waits 15 seconds before the next round begins.
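
If a fixed sleep feels too fragile, an alternative (again, not in the repo code) is to retry the create call while the service still reports the old container as being deleted. The helper below is only a sketch - the retry count, delay and method name are arbitrary - but “ContainerBeingDeleted” is the error code the Blob service returns in this situation.

// Sketch only - a possible replacement for the fixed wait in Program.cs.
// Requires: using System; using System.Threading.Tasks; using Azure; using Azure.Storage.Blobs;
public static async Task CreateContainerWithRetryAsync(
    BlobServiceClient serviceClient, string containerName)
{
    for (int attempt = 0; attempt < 10; attempt++)
    {
        try
        {
            await serviceClient.CreateBlobContainerAsync(containerName);
            return;
        }
        catch (RequestFailedException ex) when (ex.ErrorCode == "ContainerBeingDeleted")
        {
            // the previous delete has not completed yet - wait a bit and try again
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
    throw new InvalidOperationException(
        $"Could not create container '{containerName}' after repeated retries.");
}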

Next, the blobs are downloaded. This also ensures that there is another copy kept locally, and it is needed because there isn’t a method in this package to copy the contents of one storage account straight into another. That can be done with AzCopy though, so if that’s your use case, AzCopy will do the trick.

Once the blobs are downloaded, they get uploaded asynchronously, and once everything is done the tables are transferred in batches of up to 100 table entities, which is the limit for batch operations on tables. Keep in mind that for a batch operation to work, the entities in a batch need to share the same partition key, otherwise the code will fail.

The first portion of code that I will look at is the pair of delete methods for Blobs and Tables.

/// <summary>
/// Method to delete all blob containers in the destination - this ensures that only a clean
/// copy of the origin is made to an empty storage account
/// </summary>
/// <param name="connectionString"></param>
/// <returns></returns>
public static async Task DeleteBlobs(string connectionString)
{
    // Timer that measures how long the entire operation takes
    Stopwatch timer = Stopwatch.StartNew();
    int blobCount = 0;
    BlobServiceClient destinationBlobServClient =
        new BlobServiceClient(connectionString);
    Pageable<BlobContainerItem> destinationBlobContainers =
        destinationBlobServClient.GetBlobContainers();
    Console.WriteLine($"Deleting blobs from {destinationBlobServClient.AccountName}");

    foreach (BlobContainerItem bci in destinationBlobContainers)
    {
        await destinationBlobServClient.DeleteBlobContainerAsync(bci.Name);
        blobCount++;
    }
    timer.Stop();
    Console.WriteLine($"Deleted {blobCount} blobs from "
        + $"{destinationBlobServClient.AccountName} in {timer.Elapsed.Seconds} seconds");
    }
}

The method inside the BlobOperations class starts a timer, sets the blob count to 0, creates a Blob Service Client (which is used to complete container level operations) and retrieves a list of all the containers available in the given storage account. Then, for each Blob Container inside this storage account, the code deletes the Blob Container using its name as reference. As such, a container named Bills would have the Name property set to Bills and the blob service client deletes the container with this name.

A counter returns how many containers were deleted and prints this to the console along with the time it took to delete them.

The table operations are similar as can be seen in the method body:

/// <summary>
/// Method clears a storage account of all the tables and table entities given the
/// connection string to that storage account
/// </summary>
/// <param name="connectionString"></param>
public static async Task DeleteTables(string connectionString)
{
    // Timer that measures how long the entire operation takes
    Stopwatch timer = Stopwatch.StartNew();
    int tableCount = 0;

    // Delete all tables in the destination
    TableServiceClient tableServiceClient = new TableServiceClient(connectionString);

    Pageable<TableItem> destinationTables = tableServiceClient.Query();
    System.Console.WriteLine($"Deleting tables from {tableServiceClient.AccountName}");
    foreach (TableItem ti in destinationTables)
    {
        await tableServiceClient.DeleteTableAsync(ti.Name);
        tableCount++;
    }
    timer.Stop();
    Console.WriteLine($"Deleted {tableCount} tables in "
        + $"{timer.Elapsed.TotalSeconds} seconds");
}

Just like the Blob variant of the method, the Table Service Client is used to complete table level operations. A list of all the tables is returned and the name of each table is then passed to the delete call. For a table named Bills, the Name property would have the value Bills. This is used by the DeleteTableAsync method, which removes that table from the destination storage account. A counter is incremented and a timer returns how long the operation took.

Blob Operations - Download and Upload

The BlobOperations class (Blob\BlobOperations.cs) is broken down into three methods - one to download blobs, one to upload blobs and one to delete blobs.

public static async Task DownloadBlobs(string connectionString)
{
    // Timer that measures how long the entire operation takes
    Stopwatch timer = Stopwatch.StartNew();

    // Path to where all the blobs will be downloaded
    string downloadPath = Directory.GetCurrentDirectory() + "\\download\\";

    // Specify the StorageTransferOptions - need to read up on these :) 
    var options = new StorageTransferOptions
    {
        // Set the maximum number of workers that 
        // may be used in a parallel transfer.
        MaximumConcurrency = 8,
        // Set the maximum length of a transfer to 50MB.
        MaximumTransferSize = 50 * 1024 * 1024
    };

    // the service clients allow working at the Azure Storage level with Tables and Blobs
    BlobServiceClient originBlobServClient =
        new BlobServiceClient(connectionString);

    // To work with individual blob containers, I need to retrieve the BlobContainerItems
    Pageable<BlobContainerItem> originBlobContainers =
        originBlobServClient.GetBlobContainers();
    Console.WriteLine($"Downloading blobs from {originBlobServClient.AccountName}.");

    // Create a queue of tasks and each task downloads one file
    var tasks = new Queue<Task<Response>>();
    int downloadedBlobCount = 0;
    foreach (BlobContainerItem bci in originBlobContainers)
    {
        if ((bci.Name.Contains("-") || bci.Name.Contains("$")) == false)
        {
            Directory.CreateDirectory(downloadPath + bci.Name);
        }
    }

    // go through all Blob containers in the origin
    foreach (BlobContainerItem bci in originBlobContainers)
    {
        if ((bci.Name.Contains("-") || bci.Name.Contains("$")) == false)
        {
            System.Console.WriteLine("Container name: " + bci.Name);
            BlobContainerClient originBlobContainerClient =
                originBlobServClient.GetBlobContainerClient(bci.Name);

            Pageable<BlobItem> originBlobsInContainer =
                originBlobContainerClient.GetBlobs();

            Console.WriteLine($"Downloading blobs from {bci.Name} container.");
            foreach (BlobItem bi in originBlobsInContainer)
            {
                //add each BlobItem to the queue to be downloaded
                string fileName = downloadPath + bci.Name + "\\" + bi.Name;
                BlobClient blob = originBlobContainerClient.GetBlobClient(bi.Name);
                tasks.Enqueue(blob.DownloadToAsync(fileName, default, options));
                downloadedBlobCount++;
            }
        }
    }
    // run all downloads async
    await Task.WhenAll(tasks);
    timer.Stop();
    Console.WriteLine($"Downloaded {downloadedBlobCount} in"
        + $" {timer.Elapsed.TotalSeconds} seconds");
}

A path is set up to where the blobs will be downloaded; this is a \download folder under the application’s working directory. The StorageTransferOptions provide properties to configure the concurrency of parallel transfers and the maximum transfer size. These are left at the values used in the Microsoft sample. A BlobServiceClient is set up to perform operations on the blob storage account itself.

Next, all blob containers are loaded and a queue of tasks is created. The queue will hold a collection of tasks, with each task being an async download. For each container inside the account, I check whether the container name contains a dash or a dollar sign. These tend to be containers that hold logs and other metrics, and I want to ignore them for the purpose of the backup. A folder is created for each remaining container, so that for a container named Bills, all its blobs will be downloaded to download\Bills.

Next, the code goes through each container and retrieves the list of BlobItems inside it. Each blob is saved inside the folder that matches the container it came from, in a file with the same name as the blob. A blob named Bill1.csv located inside the Bills container will be downloaded to download\Bills\Bill1.csv.

After all the individual blob downloads are added to the queue, the tasks are run. The code waits until all the downloads are completed and then the timer is stopped and a message is printed to the screen.

The upload operation is largely the same, with the difference that I now create containers in the destination storage account and upload the contents asynchronously.

/// <summary>
/// Method uploads all the blobs for a given storage account from the `download` folder
/// and names each container according to the name of the folder containing the blob
/// EX: \download\container1\1 will create a container "container1" and upload a 
/// blob named "1"
/// </summary>
/// <param name="connectionString"></param>
/// <returns></returns>
public static async Task UploadBlobs(string connectionString)
{
    // Timer that measures how long the entire operation takes
    Stopwatch timer = Stopwatch.StartNew();
    // Path where all the blobs were downloaded to
    string downloadPath = Directory.GetCurrentDirectory() + "\\download\\";
    BlobServiceClient destinationBlobServClient =
        new BlobServiceClient(connectionString);

    // pattern to retrieve the blob name, which in my case is just a number
    string blobPattern = @"\\[0-9]{1,}";
    // Specify the StorageTransferOptions
    BlobUploadOptions uploadOptions = new BlobUploadOptions
    {
        TransferOptions = new StorageTransferOptions
        {
            // Set the maximum number of workers that 
            // may be used in a parallel transfer.
            MaximumConcurrency = 8,

            // Set the maximum length of a transfer to 50MB.
            MaximumTransferSize = 50 * 1024 * 1024
        }
    };
    var uploadTasks = new Queue<Task<Response<BlobContentInfo>>>();
    int uploadedBlobCounts = 0;

    string[] directories = Directory.GetDirectories(downloadPath);
    // create the blob containers to match the origin blob containers
    // based on the folder names downloaded in the download step!
    foreach (string s in directories)
    {
        DirectoryInfo di = new DirectoryInfo(s);
        destinationBlobServClient.CreateBlobContainer(di.Name);
    }

    Pageable<BlobContainerItem> destinationBlobContainers =
        destinationBlobServClient.GetBlobContainers();

    // the upload is slowed down, since I will run each folder at a time
    foreach (BlobContainerItem destBci in destinationBlobContainers)
    {
        Console.WriteLine($"Uploading blobs to {destBci.Name} container.");
        BlobContainerClient destinationBlobContainerClient =
            destinationBlobServClient.GetBlobContainerClient(destBci.Name);
        string[] filePaths = Directory.GetFiles(downloadPath + destBci.Name);
        foreach (string s in filePaths)
        {
            Match blobName = Regex.Match(s, blobPattern);
            string blobString = blobName.Value.Substring(1);
            BlobClient blob = destinationBlobContainerClient.GetBlobClient(blobString);
            uploadTasks.Enqueue(blob.UploadAsync(s, uploadOptions));
            uploadedBlobCounts++;
        }
    }
    // run all uploads async
    await Task.WhenAll(uploadTasks);
    timer.Stop();
    Console.WriteLine($"Uploaded {uploadedBlobCounts} files in "
        + $"{timer.Elapsed.Seconds} seconds");
}

Table Operations - Table Transfer

At this point, the blobs have been downloaded locally and uploaded to the destination account. The next stage happens in one go, as the code cycles through all the tables and the entities within them and transfers them over to the new account.

/// <summary>
/// Method transfers all tables from an origin Azure Storage account to a destination
/// Azure Storage account. If tables exist, they are left intact and if they are missing
/// they are created and populated in batch operations of 100 entities each
/// </summary>
/// <param name="originConnectionString"></param>
/// <param name="destinationConnectionString"></param>
public static async Task TransferTables(string originConnectionString,
    string destinationConnectionString)
{
    // Timer that measures how long the entire operation takes
    Stopwatch timer = Stopwatch.StartNew();

    TableServiceClient originTableServClient =
        new TableServiceClient(originConnectionString);
    TableServiceClient destinationTableServClient =
        new TableServiceClient(destinationConnectionString);

    Pageable<TableItem> originTables = originTableServClient.Query();
    Pageable<TableItem> destinationTables = destinationTableServClient.Query();
    int entityCounter = 0;
    Console.WriteLine($"Copying tables from {originTableServClient.AccountName} to "
        + $"{destinationTableServClient.AccountName}.");
    foreach (var ti in originTables)
    {
        System.Console.WriteLine("Table Item name - " + ti.Name);
        if ((ti.Name.Contains("-") || ti.Name.Contains("$")) == false)
        {
            // for each table that is in the origin create the same table in the destination
            // (CreateTableIfNotExistsAsync leaves the table intact if it already exists)
            await destinationTableServClient.CreateTableIfNotExistsAsync(ti.Name);

            // create table clients to interact with origin and destination tables
            TableClient originTableClient =
                originTableServClient.GetTableClient(ti.Name);
            TableClient destinationTableClient =
                destinationTableServClient.GetTableClient(ti.Name);

            List<TableEntity> originEntities = 
                originTableClient.Query<TableEntity>().ToList();

            // Note: a table transaction can contain at most 100 actions and all entities
            // in a batch must share the same partition key
            if (originEntities.Count > 0 && originEntities.Count <= 100)
            {
                List<TableTransactionAction> addEntitiesBatch = 
                    new List<TableTransactionAction>();
                addEntitiesBatch.AddRange(originEntities.Select(oe =>
                    new TableTransactionAction(TableTransactionActionType.Add, oe)));
                Response<IReadOnlyList<Response>> response = await
                    destinationTableClient.SubmitTransactionAsync(addEntitiesBatch).ConfigureAwait(false);
            }
            else if (originEntities.Count > 100)
            {
                for (int i = 0; i < originEntities.Count; i = i + 100)
                {
                    List<TableTransactionAction> addEntitiesBatch = new List<TableTransactionAction>();
                    List<TableEntity> originEntitiesBatch =
                        originEntities.Skip(i).Take(100).ToList();
                    addEntitiesBatch.AddRange(originEntitiesBatch.Select(oe =>
                        new TableTransactionAction(TableTransactionActionType.Add, oe)));
                    Response<IReadOnlyList<Response>> response = await
                        destinationTableClient.SubmitTransactionAsync(addEntitiesBatch).ConfigureAwait(false);
                }
            }
            System.Console.WriteLine($"Copied {originEntities.Count} entities to " +
                $"{ti.Name}.");
            entityCounter += originEntities.Count;
        }
    }
    timer.Stop();
    Console.WriteLine($"Transfered {entityCounter} entities in "
        + $"{timer.Elapsed.TotalSeconds} seconds.");
}

The method starts off by setting up two TableServiceClients. A list of TableItems is made based on the contents of the origin storage account and the destination storage account. After this, I cycle through each of the TableItems in the origin tables. If the name of the table contains - or $, the table is ignored, since these are tables that we don’t want copied across - they can be logs or metrics tables. If you do want those copied across, the code would need to be changed at this point.

Using the destinationTableServClient I create tables in the destination based on the name of the table from the origin. Using the previous example, for a table named Bills in the origin, a new one with the same name is created in the destination storage account. To work with the individual tables, we need access to a TableClient. As previously described, the service client works at the Tables level and the table client works with an individual table. Next up, I create a list of TableEntity objects containing all the entities for the given table from the origin.

If there are 100 entities or fewer, I create a single batch of TableTransactionActions and submit all the transactions through the destination table client. If there are more than 100 entities, the list is processed in batches of 100, with each batch copying its entries over. At the end, the counter returns the total number of transferred entities to make it easier to compare what was taken from the origin with what ended up in the destination.
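
One thing the transfer code does not guard against is the partition key rule mentioned earlier: a single transaction can only contain entities from one partition. If your tables mix partition keys, a more defensive version would group the entities first. The method below is only a sketch of that idea and is not in the repo; it uses the same Azure.Data.Tables types as the code above.

// Sketch only - batch the entities of one table per partition key, 100 at a time.
// Requires: using System.Collections.Generic; using System.Linq;
// using System.Threading.Tasks; using Azure.Data.Tables;
public static async Task CopyEntitiesInBatchesAsync(
    List<TableEntity> originEntities, TableClient destinationTableClient)
{
    // all actions in one transaction must target the same partition
    foreach (var partition in originEntities.GroupBy(e => e.PartitionKey))
    {
        List<TableEntity> partitionEntities = partition.ToList();
        for (int i = 0; i < partitionEntities.Count; i += 100)
        {
            List<TableTransactionAction> batch = partitionEntities
                .Skip(i).Take(100)
                .Select(e => new TableTransactionAction(TableTransactionActionType.Add, e))
                .ToList();
            await destinationTableClient.SubmitTransactionAsync(batch);
        }
    }
}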

Wrap Up

It feels a bit strange that Microsoft did not include a library (or libraries) to achieve these transfers from one account to another, and the process seems unnecessarily complicated at times. While there are tools to achieve some of these tasks (AzCopy, for example), it’s the lack of consistency that is slightly annoying. The confusion around packages, and the fact that the documentation seems disjointed at times, adds complexity and additional difficulty for new developers.

A discussion could be had within Microsoft to decide on a way forward and to simplify the overall structure of their packages (which grows each year with all the new features added) to aid new developers. It isn’t easy for experienced developers either, especially when working on projects that might have different components using different .Net runtimes and thus requiring different solutions to work with AS.

I hope that this will improve with time, but in the meantime it is essential for guides to be written and for Microsoft to update its documentation. They need to create roadmaps that clarify these things for junior and senior developers alike.

Catalin