In the world of network automation, we rarely are so fortunate as to have only a single authoritative source of data to work with. More often, we are responsible for and responsive to data in multiple distinct systems. When these systems overlap in their domains of responsibility, or in the information that they manage, we find it necessary to compare data between them and resolve any differences. Doing so manually is tedious and error-prone — what’s needed here is automation! This is what has led us at Network to Code to develop the Python library, DiffSync.
A few examples of this kind of problem:
These and countless other real-world data comparison and synchronization problems are what led us to develop DiffSync. In short, DiffSync is a generic utility library that can be used to compare (“diff”) and synchronize (“sync”) different data sets. It’s free and open source, and we’ve been using it for some time now to save development time on projects both internal and external. DiffSync is the engine underlying tools such as Network Importer and the Nautobot NetBox Importer plugin for Nautobot.
This will be the first in a short series of blog posts about DiffSync — in this post I’ll introduce what DiffSync is and the basics of how it works, and walk you through a short example of writing a Python script that will use DiffSync to identify, report, then resolve the differences between a pair of JSON files.
DiffSync is at its most useful when you have multiple sources or sets of data to compare and/or synchronize, and especially if any of the following are true:
DiffSync acts as an intermediate translation layer between all of the data sets you are diffing and/or syncing. In practical terms, this means that to use DiffSync, you will define a set of data models as well as the “adapters” needed to translate between each base data source and the data model. In Python terms, the adapters will be subclasses of the DiffSync class, and each data model class will be a subclass of the DiffSyncModel class.
Once you have used each adapter to load each data source into a collection of data model records, you can then ask DiffSync to “diff” the two data sets, and it will produce a structured representation of the difference between them. In Python, this is accomplished by calling the diff_to() or diff_from() method on one adapter and passing the other adapter as a parameter.
You can also ask DiffSync to “sync” one data set onto the other, and it will instruct your adapter as to the steps it needs to take to make sure that its data set accurately reflects the other. In Python, this is accomplished by calling the sync_to() or sync_from() method on one adapter and passing the other adapter as a parameter.
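As a rough mental model of what a diff and a sync do with record identifiers, the following plain-Python sketch may help. It does not use or reproduce DiffSync's implementation; the function names diff_ids and sync_ids are hypothetical, and the "+" and "-" keys only mirror the notation that appears in the dictionary output later in this post.

```python
def diff_ids(source_ids, dest_ids):
    """Return identifiers to add to ('+') or remove from ('-') the destination."""
    source, dest = set(source_ids), set(dest_ids)
    return {"+": sorted(source - dest), "-": sorted(dest - source)}


def sync_ids(source_ids, dest_ids):
    """Apply the diff so that the destination ends up matching the source."""
    diff = diff_ids(source_ids, dest_ids)
    result = set(dest_ids)
    result.update(diff["+"])             # create what the destination lacks
    result.difference_update(diff["-"])  # delete what the source does not have
    return sorted(result)


file1 = [1, 2, 3, 5]
file2 = [2, 3, 4]

forward = diff_ids(file1, file2)   # file1 is the source, file2 the destination
backward = diff_ids(file2, file1)  # roles swapped: the same differences, mirrored
synced = sync_ids(file1, file2)

print(forward)
print(backward)
print(synced)
```

Swapping which data set plays the source mirrors the result, which is the intuition behind having both a "to" and a "from" variant of each method.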
Let’s start with an intentionally basic example. Our “data sets” will be two JSON files, each of which contains a random subset of the numbers 1 through 50. We can generate these files with a little Python:
>>> import json
>>> import random
>>>
>>> data1 = [i for i in range(1, 51) if random.randint(0, 1) == 1]
>>> with open("file1.json", "w") as file_handle:
...     json.dump(data1, file_handle)
...
>>> data2 = [i for i in range(1, 51) if random.randint(0, 1) == 1]
>>> with open("file2.json", "w") as file_handle:
...     json.dump(data2, file_handle)
...
>>>
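If you would like the two files to come out the same on every run (handy when comparing your output against someone else's), one option is to seed a private random generator. This is only a sketch: the helper name random_subset and the seed values are my own choices for illustration, not part of DiffSync or of the session above.

```python
import random


def random_subset(seed, low=1, high=50):
    """Return a sorted, repeatable random subset of low..high (both inclusive)."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return [i for i in range(low, high + 1) if rng.randint(0, 1) == 1]


data1 = random_subset(seed=1)
data2 = random_subset(seed=2)
print(data1)
print(data2)
```

Each list can then be written out with json.dump() exactly as in the session above.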
DiffSync is available on PyPI, so all that’s needed is to create a Python virtual environment and install DiffSync into it:
$ python3.6 -m venv diffsync_virtualenv
$ source diffsync_virtualenv/bin/activate
(diffsync_virtualenv) $ pip install --upgrade pip
(diffsync_virtualenv) $ pip install diffsync
Note that DiffSync requires Python 3.6 or later!
When creating a new data model (subclass of DiffSyncModel) there are a few metadata properties, prefixed with _, that you may need to define for it to work properly. For this simple data model, the only two we need to be concerned about are _modelname (a descriptive label for this model) and _identifiers (a tuple of data fields that uniquely identify a single record of this model).
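The "tuple" part is easy to get wrong when there is only one identifier field, so here is a quick plain-Python illustration of the pitfall. It is independent of DiffSync, and the variable names are just for demonstration.

```python
not_a_tuple = ("value")   # parentheses alone do nothing: this is the string "value"
one_tuple = ("value",)    # the trailing comma is what makes a one-element tuple

print(type(not_a_tuple).__name__, type(one_tuple).__name__)
print(list(not_a_tuple))  # iterating the string yields single characters
print(list(one_tuple))    # iterating the tuple yields the field name
```

Any code that loops over the identifiers would see five one-letter "field names" in the first case, which is why the trailing comma matters.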
Here we’re defining a model called Number, which has one attribute, value, that also serves as the unique identifier of each instance of this model.
from diffsync import DiffSyncModel


class Number(DiffSyncModel):
    """Simple data model, storing only a number."""

    # DiffSync metadata fields
    _modelname = "number"
    _identifiers = ("value",)  # must be a tuple, not a single stand-alone value!

    # Data attributes on each instance of this model, including the above-listed identifier(s):
    value: int
In case you’re unfamiliar with the value: int syntax, this is the syntax used in Python 3.6 and later for variable type annotations. DiffSync is based on a library called Pydantic, which uses these type annotations to actually construct a data model that enforces the data types you’ve declared.
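To see why annotations can drive that kind of enforcement at all, here is a small standard-library sketch showing that they are available at runtime. It is not how Pydantic is implemented (Pydantic does considerably more, for example it can also convert compatible input values); the class PlainNumber and the helper check_fields are hypothetical names used only for this illustration.

```python
from typing import get_type_hints


class PlainNumber:
    """Stand-in class with the same annotation as the Number model."""

    value: int


def check_fields(cls, **fields):
    """Raise TypeError if a supplied field does not match its annotation."""
    for name, expected_type in get_type_hints(cls).items():
        if not isinstance(fields[name], expected_type):
            raise TypeError(f"{name} must be {expected_type.__name__}, got {fields[name]!r}")
    return fields


print(get_type_hints(PlainNumber))
check_fields(PlainNumber, value=7)  # matches the annotation, so nothing is raised
try:
    check_fields(PlainNumber, value="seven")
except TypeError as error:
    print(error)
```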
In this case, the two “data sets” we are going to compare share the same underlying “data source”, a simple JSON file. So here we actually only need to create a single adapter class that we can use for both data sets. In a more complex example involving differing databases or systems, you would need to create a separate adapter class for each one, but the same concepts apply.
Here we’re defining an adapter called NumbersJSONAdapter, specifying that it knows about Number data records, and implementing logic for it to load these from the specified JSON file.
import json

from diffsync import DiffSync


class NumbersJSONAdapter(DiffSync):

    # Tell DiffSync that when we refer to a model called "number", we will use the Number class
    number = Number

    # Tell DiffSync which base/parent model(s) it should start with when diffing or syncing
    top_level = ["number"]

    def __init__(self, filename, *args, **kwargs):
        super().__init__(*args, **kwargs)
        with open(filename, "r") as source_file:
            data = json.load(source_file)
        for input_value in data:
            # Create a Number record representing this value
            record = Number(value=input_value)
            # Add this record to our internal data store
            self.add(record)
Let’s put it all together into a single self-contained Python script:
# numbers_script.py
import json
import pprint

from diffsync import DiffSync, DiffSyncModel
from diffsync.logging import enable_console_logging


class Number(DiffSyncModel):
    """Simple data model, storing only a number."""

    # DiffSync metadata fields
    _modelname = "number"
    _identifiers = ("value",)  # must be a tuple, not a single standalone value!

    # Data attributes on each instance of this model, including the above-listed identifier(s):
    value: int


class NumbersJSONAdapter(DiffSync):

    # Tell DiffSync that when we refer to a model called "number", we will use the Number class
    number = Number

    # Tell DiffSync which base/parent model(s) it should start with when diffing or syncing
    top_level = ["number"]

    def __init__(self, filename, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.filename = filename
        with open(filename, "r") as source_file:
            data = json.load(source_file)
        for input_value in data:
            # Create a Number record representing this value
            record = Number(value=input_value)
            # Add this record to our internal data store
            self.add(record)


def load_data():
    """Load both data sets and return the populated DiffSync adapter objects."""
    data1 = NumbersJSONAdapter("file1.json", name="file1")
    data2 = NumbersJSONAdapter("file2.json", name="file2")
    return (data1, data2)


def diff_data():
    """Generate and print the diff between two data sets."""
    data1, data2 = load_data()
    diff = data1.diff_to(data2)
    print(diff.str())
    pprint.pprint(diff.dict())


if __name__ == "__main__":
    enable_console_logging(verbosity=1)
    diff_data()
If you save this script and execute it, you should get output similar to the following:
(diffsync_virtualenv) $ python numbers_script.py
2021-05-03 13:32.47 [info ] Beginning diff calculation [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
2021-05-03 13:32.47 [info ] Diff calculation complete [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
number
  number: 9 MISSING in file2
  number: 11 MISSING in file2
  number: 12 MISSING in file2
  number: 13 MISSING in file2
  number: 20 MISSING in file2
  number: 22 MISSING in file2
  number: 32 MISSING in file2
  number: 38 MISSING in file2
  number: 44 MISSING in file2
  number: 48 MISSING in file2
  number: 49 MISSING in file2
  number: 1 MISSING in file1
  number: 2 MISSING in file1
  number: 4 MISSING in file1
  number: 5 MISSING in file1
  number: 8 MISSING in file1
  number: 15 MISSING in file1
  number: 18 MISSING in file1
  number: 21 MISSING in file1
  number: 23 MISSING in file1
  number: 30 MISSING in file1
  number: 34 MISSING in file1
  number: 39 MISSING in file1
  number: 42 MISSING in file1
  number: 45 MISSING in file1
  number: 47 MISSING in file1
{'number': {'1': {'-': {}},
            '11': {'+': {}},
            '12': {'+': {}},
            '13': {'+': {}},
            '15': {'-': {}},
            '18': {'-': {}},
            '2': {'-': {}},
            '20': {'+': {}},
            '21': {'-': {}},
            '22': {'+': {}},
            '23': {'-': {}},
            '30': {'-': {}},
            '32': {'+': {}},
            '34': {'-': {}},
            '38': {'+': {}},
            '39': {'-': {}},
            '4': {'-': {}},
            '42': {'-': {}},
            '44': {'+': {}},
            '45': {'-': {}},
            '47': {'-': {}},
            '48': {'+': {}},
            '49': {'+': {}},
            '5': {'-': {}},
            '8': {'-': {}},
            '9': {'+': {}}}}
The exact numbers reported will, of course, be different since we randomly generated these two files!
As you can see, DiffSync has identified the numbers that differ between the two files as a DiffSync Diff object, which can be converted to a string representation or a dictionary representation as needed.
DiffSync’s built-in logging uses structlog to generate log messages with structured data attached. By using diffsync.logging.enable_console_logging() in our script, we’re converting those log messages to standard Python log messages, but it’s possible in more advanced integrations to access the structured logs directly, allowing you to do various forms of log processing without needing to parse free-text log messages.
For reporting and human-directed analysis, the above diff generation might be all that you need. But for automation, the next step is to be able to automatically resolve the diff and bring the two data sets into sync. Doing this as a dry run (without actually making any changes to the data sets) is as simple as calling sync_to() instead of diff_to():
# numbers_script.py
# ...

def sync_data():
    """Sync the changes from data1 onto data2."""
    data1, data2 = load_data()
    data1.sync_to(data2)
    # Show that *within DiffSync* there is now no longer any diff between the two data sets!
    print(data1.diff_to(data2).str())


if __name__ == "__main__":
    enable_console_logging(verbosity=1)
    sync_data()
(diffsync_virtualenv) $ python numbers_script.py
2021-05-03 13:49.35 [info ] Beginning diff calculation [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
2021-05-03 13:49.35 [info ] Diff calculation complete [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
2021-05-03 13:49.35 [info ] Beginning sync [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=9
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=11
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=12
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=13
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=20
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=22
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=32
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=38
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=44
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=48
2021-05-03 13:49.35 [info ] Created successfully [diffsync.helpers] action=create diffs={'+': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=49
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=1
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=2
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=4
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=5
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=8
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=15
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=18
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=21
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=23
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=30
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=34
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=39
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=42
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=45
2021-05-03 13:49.35 [info ] Deleted successfully [diffsync.helpers] action=delete diffs={'-': {}} dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> model=number src=<NumbersJSONAdapter "file1"> status=success unique_id=47
2021-05-03 13:49.35 [info ] Sync complete [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
2021-05-03 13:49.35 [info ] Beginning diff calculation [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
2021-05-03 13:49.35 [info ] Diff calculation complete [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
(no diffs)
Note again that this is essentially a dry run — because we haven’t yet written any code telling DiffSync how to actually write changes back to the dataset, the “Created successfully” and “Deleted successfully” log messages are referring only to successful changes within DiffSync’s own representation of the data! We can see at the end that, at least within DiffSync itself, there are no longer any diffs between the two data set adapters.
Note also that because DiffSync identifies the diff set before performing a sync, only the numbers that differ between the two data sets are being modified — numbers that exist in both data sets are left completely untouched! This is a key feature of DiffSync, as particularly on subsequent re-syncs between data sets, existing and fully synchronized data should not be modified unnecessarily.
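The "untouched" behavior can be pictured with a small plain-Python sketch. This is only an illustration of the idea and not DiffSync's code; the function name plan_sync and its dictionary keys are hypothetical.

```python
def plan_sync(source_ids, dest_ids):
    """Split identifiers into what a sync would create, delete, or leave alone."""
    source, dest = set(source_ids), set(dest_ids)
    return {
        "create": sorted(source - dest),
        "delete": sorted(dest - source),
        "untouched": sorted(source & dest),
    }


# First sync: the destination differs from the source.
first_pass = plan_sync([3, 9, 11, 12], [1, 2, 3])
# Re-sync after the destination has caught up: there is nothing left to do.
second_pass = plan_sync([3, 9, 11, 12], [3, 9, 11, 12])

print(first_pass)
print(second_pass)
```

On the second pass both the create and delete lists are empty, so a repeated sync of already-matching data would issue no changes at all.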
To make this more useful, of course, we need to write some code to actually output back to disk the changed data set. In this example, since our data set is a flat file rather than a collection of individual database records, we won’t implement individual create(), update(), and delete() methods on the Number class. We will instead implement sync_complete(), which is a callback function that DiffSync automatically calls at the end of a successful sync operation, providing us the opportunity to perform a bulk write of the entire, fully updated data set back to disk:
# numbers_script.py
from diffsync import Diff, DiffSync, DiffSyncModel, DiffSyncFlags

# ...

class NumbersJSONAdapter(DiffSync):
    # ...

    def sync_complete(self, source: DiffSync, diff: Diff, flags=DiffSyncFlags.NONE, logger=None):
        """Callback after a sync has completed, updating the model data of this instance.

        Note: this callback is **only** triggered if the sync actually resulted in data changes.
        """
        numbers = self.get_all("number")
        data = sorted([number.value for number in numbers])
        target_filename = f"{self.filename}.new"
        logger.info("Creating new output file", filename=target_filename)
        with open(target_filename, "w") as target_file:
            json.dump(data, target_file)
        logger.info("Output file created successfully", filename=target_filename)
After adding the above callback function, if you run the script again, you’ll see a few more lines of output from the new logging calls you added, and can then confirm that your new data file was successfully created and is identical to file1.json:
(diffsync_virtualenv) $ python numbers_script.py
2021-05-03 14:00.07 [info ] Beginning diff calculation [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
...
2021-05-03 14:00.07 [info ] Sync complete [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
2021-05-03 14:00.07 [info ] Creating new output file [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> filename=file2.json.new flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
2021-05-03 14:00.07 [info ] Output file created successfully [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> filename=file2.json.new flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
2021-05-03 14:00.07 [info ] Beginning diff calculation [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
2021-05-03 14:00.07 [info ] Diff calculation complete [diffsync.helpers] dst=<NumbersJSONAdapter "file2"> flags=<DiffSyncFlags.NONE: 0> src=<NumbersJSONAdapter "file1">
(no diffs)
(diffsync_virtualenv) $ diff file1.json file2.json.new
(diffsync_virtualenv) $
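The diff command above compares the files byte for byte. If you would rather compare the parsed contents, so that whitespace differences do not matter, a check along these lines should work. It is a sketch using only the standard library; the helper name same_json_data and the throwaway file names are hypothetical.

```python
import json
import os
import tempfile


def same_json_data(path_a, path_b):
    """Return True if both files parse to equal JSON values."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return json.load(file_a) == json.load(file_b)


# Demonstration on throwaway files; in the walkthrough you would pass
# "file1.json" and "file2.json.new" instead.
with tempfile.TemporaryDirectory() as workdir:
    path_a = os.path.join(workdir, "a.json")
    path_b = os.path.join(workdir, "b.json")
    with open(path_a, "w") as handle:
        handle.write("[1, 2, 3]")
    with open(path_b, "w") as handle:
        handle.write("[1,2,3]\n")  # same data, different whitespace
    matches = same_json_data(path_a, path_b)

print(matches)
```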
I hope this has been a helpful introduction to why and how to use DiffSync. In a future blog post we’ll dive into more advanced examples, including how to handle data sets consisting of multiple distinct database records or files, how to handle hierarchical or tree-like data sets, and how to handle data sets for which not all data attributes apply to all data sets.
-Glenn