How to Format all JSON Files in a Git Repository

by Christoph Schiessl on DevOps, Git, and Python

I'm working on a project that depends on many JSON files for configuration. These files are deployed with the application and, therefore, are kept in the project's Git repository. Some JSON files are automatically generated by internal tools, while others are manually maintained and updated. Suffice it to say that there is no consistent formatting across these files, but consistent formatting would be a win for developer productivity. At the very least, it would make diffs easier to read. Long story short, I decided to take matters into my own hands, clean up the formatting, and also update the project's CI to ensure that it stays consistent. Several steps were needed to make this happen ...

Detect all JSON files

Since I'm working in a Git repository, it's a natural choice to use the git ls-files command to get a listing of all JSON files. The official documentation explains this Git command as follows:

Show information about files in the index and the working tree.

By default, git ls-files returns only files that Git already tracks. However, I want to handle tracked and untracked files. To accomplish this, I had to use the --cached option to request tracked files and the --others option to request untracked files.

git ls-files --cached --others

This command has a problem because it also returns files excluded from tracking, which is not what I want. For instance, the directory containing my Python virtual environment is listed in a .gitignore file and, therefore, excluded from tracking. To tell git ls-files to apply the .gitignore rules as usual, we must add the --exclude-standard option.

git ls-files --cached --others --exclude-standard

All right, we now get a list of tracked and untracked files that are not excluded by a .gitignore file. The last missing piece is to limit the listing to files with a .json extension. Fortunately, filtering is another built-in feature of git ls-files, meaning we can append a pathspec like -- '*.json' to achieve the desired effect.

git ls-files --cached --others --exclude-standard -- '*.json'

If I run this command in my test repository, it detects four JSON files — two that are tracked, two that are untracked, and none of the ones that are excluded by .gitignore files (e.g., in a node_modules directory).

$ git ls-files --cached --others --exclude-standard -- '*.json'
subdirectory/untracked.json
untracked.json
subdirectory/tracked.json
tracked.json

Iterate over detected files

Next, we need to iterate over all detected files to process each one. This could be done with a shell script, but it's easier in Python. The interesting part is that we need a way to execute our git ls-files command and capture its standard output. Luckily, the built-in subprocess module provides the check_output() function, which is exactly what we need. This function executes the given command, captures its standard output, and then returns the captured output. By default, the command is not run through a shell, but this can easily be enabled by setting the shell parameter to True.

from pathlib import Path
from subprocess import check_output

COMMAND = "git ls-files --cached --others --exclude-standard -- '*.json'"

listing: str = check_output(COMMAND, shell=True).decode()
for json_file in map(Path, listing.splitlines()):
    print(f"Processing {json_file}")

Note that we have to decode() the subprocess's output to convert it from bytes to str (UTF-8 by default). Next, I'm using splitlines() to separate it into individual lines without trailing line-break characters. Finally, I'm mapping all file paths to Path objects, which will be convenient in the next step. When you run this script, you get the expected output.

$ python json-formatter.py
Processing subdirectory/untracked.json
Processing untracked.json
Processing subdirectory/tracked.json
Processing tracked.json
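As a side note, splitlines() is a safer choice here than split("\n"), because the captured output usually ends with a trailing newline. A quick sketch (with made-up file names) illustrates the difference:

```python
# Hypothetical captured output; subprocess output typically ends with a newline.
listing = "subdirectory/untracked.json\nuntracked.json\n"

# split("\n") leaves a stray empty string at the end ...
print(listing.split("\n"))   # ['subdirectory/untracked.json', 'untracked.json', '']

# ... while splitlines() drops the trailing line break cleanly.
print(listing.splitlines())  # ['subdirectory/untracked.json', 'untracked.json']
```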

Format with the json module

To re-format a file, we need to read the file's content, parse the content as JSON, and then serialize the JSON again with consistent indentation and so on. To read the file, we can use the read_text() function of the Path class. This is why I mapped the strings representing the file paths to proper Path objects in the previous step. In any case, once we have the unformatted content in a str variable, we can pass it along to json.loads() for parsing into more meaningful data structures like, for instance, a dict object. Lastly, we take this data structure and serialize it back to a plain str using the json.dumps() function.

import json
from pathlib import Path
from subprocess import check_output

COMMAND = "git ls-files --cached --others --exclude-standard -- '*.json'"

listing: str = check_output(COMMAND, shell=True).decode()
for json_file in map(Path, listing.splitlines()):
    print(f"Processing {json_file}", end="")
    unformatted_content: str = json_file.read_text()
    formatted_content: str = json.dumps(json.loads(unformatted_content), indent=4)
    if unformatted_content == formatted_content:
        print(" => correctly formatted.")
    else:
        print(" => incorrectly formatted.")

The json.dumps() function takes a parameter called indent to control the number of spaces used to indent nested structures in the str output. I'm using four spaces, but any positive integer will do ...
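To illustrate, here's a minimal round trip through json.loads() and json.dumps() with indent=4 (the input string is a made-up example):

```python
import json

# Made-up example input with inconsistent formatting ...
unformatted = '{"name": "demo", "tags": ["a", "b"]}'

# Parse and re-serialize with four spaces of indentation.
print(json.dumps(json.loads(unformatted), indent=4))
# {
#     "name": "demo",
#     "tags": [
#         "a",
#         "b"
#     ]
# }
```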

Check-Only feature for CI integration

If the script is used in a CI environment, it shouldn't attempt to format any files on disk because these changes would be lost when the CI job finishes. Instead, the script should only fail (i.e., terminate with a non-zero exit status) if it detects incorrectly formatted files. To support this, I'm adding a --check-only option to trigger this behavior when running in CI. I'm deliberately not using a library to parse the command line arguments because I would rather keep things as simple as possible; instead, I'm looking at sys.argv directly.

import json
import sys
from pathlib import Path
from subprocess import check_output

COMMAND = "git ls-files --cached --others --exclude-standard -- '*.json'"
CHECK_ONLY = len(sys.argv) >= 2 and sys.argv[1] == "--check-only"
exit_with_failure = False

listing: str = check_output(COMMAND, shell=True).decode()
for json_file in map(Path, listing.splitlines()):
    print(f"Processing {json_file}", end="")
    unformatted_content: str = json_file.read_text()
    formatted_content: str = json.dumps(json.loads(unformatted_content), indent=4)
    if unformatted_content == formatted_content:
        print(" => correctly formatted.")
    else:
        print(" => incorrectly formatted.")
        if CHECK_ONLY:
            exit_with_failure = True

if exit_with_failure:
    sys.exit(1)

If the --check-only option has been given, I set the boolean variable exit_with_failure to True whenever the file currently being processed is incorrectly formatted. Then, once all files are processed, I check this variable to determine if there were any incorrectly formatted files. If so, I use sys.exit() to terminate the script with a non-zero exit code and thereby make the whole script fail. This works well with most CI systems because their jobs usually fail when one of the steps defining them fails, and our script will be one such step.

In any case, if you rerun the script with --check-only, you'll first see that it tells you whether the formatting is correct for each JSON file. Secondly, it fails with exit status 1 because there are multiple incorrectly formatted files.

$ python json-formatter.py --check-only
Processing subdirectory/untracked.json => correctly formatted.
Processing untracked.json => correctly formatted.
Processing subdirectory/tracked.json => incorrectly formatted.
Processing tracked.json => incorrectly formatted.
$ echo $? # print exit status of the previous command ...
1

Write formatted JSON back to file

Lastly, we have to write the formatted JSON back to disk if the script has been started in normal mode — without the --check-only option. This is pretty easy, though, because the Path class also has a counterpart for read_text(), which is unsurprisingly called write_text(). We need only two extra lines of code to accomplish this ...

import json
import sys
from pathlib import Path
from subprocess import check_output

COMMAND = "git ls-files --cached --others --exclude-standard -- '*.json'"
CHECK_ONLY = len(sys.argv) >= 2 and sys.argv[1] == "--check-only"
exit_with_failure = False

listing: str = check_output(COMMAND, shell=True).decode()
for json_file in map(Path, listing.splitlines()):
    print(f"Processing {json_file}", end="")
    unformatted_content: str = json_file.read_text()
    formatted_content: str = json.dumps(json.loads(unformatted_content), indent=4)
    if unformatted_content == formatted_content:
        print(" => correctly formatted.")
    else:
        print(" => incorrectly formatted.")
        if CHECK_ONLY:
            exit_with_failure = True
        else:
            json_file.write_text(formatted_content)

if exit_with_failure:
    sys.exit(1)

Conclusion

Now, you can try out the finished script ...

$ python json-formatter.py --check-only
Processing subdirectory/untracked.json => correctly formatted.
Processing untracked.json => correctly formatted.
Processing subdirectory/tracked.json => incorrectly formatted.
Processing tracked.json => incorrectly formatted.
$ echo $? # command failed because files are not yet correctly formatted
1
$ python json-formatter.py
Processing subdirectory/untracked.json => correctly formatted.
Processing untracked.json => correctly formatted.
Processing subdirectory/tracked.json => incorrectly formatted.
Processing tracked.json => incorrectly formatted.
$ python json-formatter.py --check-only
Processing subdirectory/untracked.json => correctly formatted.
Processing untracked.json => correctly formatted.
Processing subdirectory/tracked.json => correctly formatted.
Processing tracked.json => correctly formatted.
$ echo $? # command succeeded because files are now correctly formatted
0

Anyway, that's everything I had to say for today. Thank you for reading, and see you soon!


