Learn how to modify Python's
Soroco is excited to write its first blog post on how we modified the Python language to make deploying Python code more secure in production.
Though Python is a powerful and productive language for building systems, deploying it in production presents a large attack surface that allows a malicious user to modify or reverse engineer potentially sensitive business logic.
At Soroco, we always place protecting our clients' business logic first. This protection must cover both malicious attackers and natural sources of corruption, such as disk-based bit rot or network transmission errors. Below, we describe how we encrypt each Python file that we ship and append an HMAC to it for these purposes.
This project is open-source and available on GitHub: https://github.com/soroco/pyce Further, you can get started with PYCE by downloading from PyPI:
pip install pyce
We also talked about this at PyCon Delhi:
Common techniques to protect code in production are binary signing, obfuscation, or encryption. But, these techniques typically assume that we are protecting either a single file (EXE), or a small set of files (EXE and DLLs).
When we work with Python programs, there isn't a single binary to sign. Unlike compiled languages, such as C, Rust, or Go, you don't get a single executable to defend in the field.
With Python, your attack surface is much larger. So we can't just sign a binary, obfuscation doesn't actually protect things, and there are typically many files that would require encryption. Not to mention the fact that the Python interpreter does not have the capability of loading signed or encrypted files.
Yes, there are static compilers for Python such as Cython, but their language features always lag the reference Python implementation. We wanted a solution that would work with the reference Python implementation to take advantage of the latest features.
Thus, our requirements were threefold:
This led us to a pure Python solution using authenticated cryptography.
On our production servers, you'd find a peculiar file extension: .pyce. You're probably used to .pyc (compiled Python bytecode) and .py (raw Python source), but you've never seen .pyce before. That's because it only exists at Soroco!
Figure 1. The .pyce on-disk format.
We use an encrypt-then-MAC construction as shown in Figure 1. This format encrypts Python bytecode using AES-256 and stores the ciphertext with an HMAC appended in plaintext. The HMAC is generated using a secret key and SHA-512 as the cryptographically secure hash function.
At runtime, we verify the HMAC before decrypting the ciphertext. This ensures the integrity—each bit is exactly as intended—of the code as well as its authenticity—it came from an entity controlling the secret key. It also ensures that we waste no time decrypting or executing even a single bit that was not intended by our engineers.
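The verify-before-decrypt order can be sketched with the standard library alone. This is a minimal illustration, not the PYCE implementation: the function names `seal` and `open_sealed` are hypothetical, and the layout (ciphertext followed by a 64-byte HMAC-SHA-512 tag) is assumed from the description of Figure 1.

```python
import hmac
from hashlib import sha512

# HMAC-SHA-512 produces a 64-byte tag, assumed here to sit at the end of the file.
TAG_LEN = sha512().digest_size


def seal(ciphertext: bytes, mac_key: bytes) -> bytes:
    """Append an HMAC over the ciphertext (encrypt-then-MAC)."""
    return ciphertext + hmac.new(mac_key, ciphertext, "sha512").digest()


def open_sealed(blob: bytes, mac_key: bytes) -> bytes:
    """Return the ciphertext only if its trailing HMAC verifies."""
    if len(blob) < TAG_LEN:
        raise ValueError("blob too short to contain an HMAC")
    ciphertext, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(mac_key, ciphertext, "sha512").digest()
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC mismatch: refusing to decrypt")
    return ciphertext


mac_key = b"\x01" * 32  # illustrative key only
blob = seal(b"opaque ciphertext bytes", mac_key)
recovered = open_sealed(blob, mac_key)

# Flip one bit and confirm the blob is rejected before any decryption.
tampered = bytes([blob[0] ^ 1]) + blob[1:]
try:
    open_sealed(tampered, mac_key)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

Only after `open_sealed` returns would the ciphertext be handed to the decryption step.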
But how do we get a reference implementation Python.exe running the
It turns out Python's interpreter has import machinery that you can hook to load whatever you want. In essence, you can customize and change the meaning of import in Python with ease!
Figure 2. We implemented loading .pyce by hooking into sys.meta_path and inheriting from pre-defined import machinery classes.
Figure 2 above illustrates how we hook into the Python import machinery. We add a special SorocoPathFinder at the beginning of the list of known module finders. sys.meta_path lists all meta path finders for the Python interpreter. The three shown above should be familiar:
BuiltinImporter: loads modules built into the language runtime
FrozenImporter: loads modules in the self-contained frozen format
PathFinder: loads modules within known file extensions on the Python path
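To see this list on your own interpreter, and what placing a finder at the front looks like, here is a small sketch. `PassThroughFinder` and `finder_name` are hypothetical names for illustration; the finder claims no modules, so every import still falls through to the stock entries.

```python
import sys
from importlib.abc import MetaPathFinder


def finder_name(finder) -> str:
    """Stock entries are classes; custom entries are usually instances."""
    return getattr(finder, "__name__", type(finder).__name__)


class PassThroughFinder(MetaPathFinder):
    """Hypothetical finder that handles nothing, so later finders still run."""

    def find_spec(self, fullname, path=None, target=None):
        return None  # None means "not mine, ask the next finder"


before = [finder_name(f) for f in sys.meta_path]
sys.meta_path.insert(0, PassThroughFinder())
after = [finder_name(f) for f in sys.meta_path]

print(before)
print(after)
```

Finders are consulted in list order, which is why a finder that must see .pyce files first goes at index 0.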
How does the code work? Primarily, we reuse the two intermediate classes for their pre-existing implementation:
In other words, we benefit from all of the engineering already implemented in the import machinery. These two classes search the Python path and only call our special decrypting loader if they find a module file with the right extension: .pyce. All of the magic happens within the SorocoFileLoader which inherits from
    path = self.get_filename(fullname)
    data = self.get_data(path)
    # It is important to normalize path case for platforms like Windows
    data = decrypt(data, KEYS[normcase(relpath(path))])
    bytes_data = _validate_bytecode_header(data, name=fullname, path=path)
    return _compile_bytecode(bytes_data, name=fullname, bytecode_path=path)
Figure 3. Code from the SorocoFileLoader class demonstrating the decrypting and loading of an encrypted file.
The code above is largely taken from the Python reference implementation, with a single line inserted calling a function called decrypt. The fact that its implementation is just 5 lines of executable code is a testament to the flexibility of the Python language.
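The same idea can be tried end to end with public importlib classes. The sketch below is an illustration under stated assumptions rather than Soroco's code: `toy_cipher` is a byte-wise XOR standing in for real authenticated decryption (it offers no security), `ToyEncryptedLoader` and `ToyEncryptedFinder` are hypothetical names, and the finder only handles top-level modules in a single directory.

```python
import importlib
import importlib.util
import py_compile
import sys
import tempfile
from importlib.abc import MetaPathFinder
from importlib.machinery import SourcelessFileLoader
from pathlib import Path


def toy_cipher(data: bytes) -> bytes:
    """Placeholder for real decryption: XOR with 0x5A. Symmetric, NOT secure."""
    return bytes(b ^ 0x5A for b in data)


class ToyEncryptedLoader(SourcelessFileLoader):
    """Decode the file contents, then reuse the stock bytecode loading."""

    def get_data(self, path):
        return toy_cipher(super().get_data(path))


class ToyEncryptedFinder(MetaPathFinder):
    """Claims top-level modules that exist as <name>.pyce in one directory."""

    def __init__(self, directory):
        self.directory = Path(directory)

    def find_spec(self, fullname, path=None, target=None):
        candidate = self.directory / (fullname + ".pyce")
        if not candidate.is_file():
            return None  # defer to the remaining finders
        loader = ToyEncryptedLoader(fullname, str(candidate))
        return importlib.util.spec_from_file_location(
            fullname, str(candidate), loader=loader
        )


# Build a demo module: compile to bytecode, encode it, and remove the originals.
demo_dir = Path(tempfile.mkdtemp())
source = demo_dir / "toy_secret_logic.py"
source.write_text("ANSWER = 6 * 7\n")
bytecode = demo_dir / "toy_secret_logic.pyc"
py_compile.compile(str(source), cfile=str(bytecode), doraise=True)
(demo_dir / "toy_secret_logic.pyce").write_bytes(toy_cipher(bytecode.read_bytes()))
source.unlink()
bytecode.unlink()

# Hook the finder in front of the stock finders, then import as usual.
sys.meta_path.insert(0, ToyEncryptedFinder(demo_dir))
module = importlib.import_module("toy_secret_logic")
print(module.ANSWER)  # 42
```

Overriding `get_data` keeps the header validation and unmarshalling in the standard library, which mirrors the "single inserted line" idea described above.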
It is precisely this powerful flexibility that Soroco leverages in its proprietary Python runtime focused on business process automation.
Across our production systems, we often deploy a lot of duplicate files. This is due to shared libraries, in some cases shared codebases, and shared resources. As an optimization, we deduplicate our storage layers to minimize our disk footprint.
But how can we deduplicate encrypted files? Encrypted data is, by design, indistinguishable from random data. Thus, if we encrypted the same Python file 10 times with a fresh random key or nonce each time, we would produce 10 unique files.
We use convergent encryption to protect sensitive business logic, but still produce the same ciphertext when presented with the same plaintext. In our scenario, attackers cannot choose the plaintext. Therefore, they can only gain information about publicly known files, such as open source libraries. Since we already reveal the open source libraries we use in production, we do not consider this a risk.
    from hashlib import sha256 as srchash
    from hmac import new as hmac
    ...
    from cryptography.hazmat.backends import default_backend
    from cryptography.hazmat.primitives.ciphers import Cipher
    from cryptography.hazmat.primitives.ciphers.algorithms import AES as CIPHER
    from cryptography.hazmat.primitives.ciphers.modes import CTR as MODE

    BACKEND = default_backend()
    ...
    HMAC_HS = 'sha512'
    ...
    with open(absolute_path, 'rb+') as openf:
        # read
        data = openf.read()

        # hash
        hashv = srchash(data)
        key = hashv.digest()

        # encrypt
        cipher = Cipher(CIPHER(key), MODE(key[0:16]), backend=BACKEND)
        encryptor = cipher.encryptor()
        ciphertext = encryptor.update(data)

        # write out
        openf.seek(0)
        openf.write(ciphertext)

        # append HMAC
        openf.write(hmac(key, ciphertext, HMAC_HS).digest())
Figure 4. Convergent encryption implementation using AES-256 and SHA-256.
In Figure 4 above, we reproduce our core convergent encryption code. Convergent encryption fundamentally depends on deterministic generation of an encryption key. We use SHA-256 over the file's contents to deterministically derive a 256-bit key for each file, and then encrypt the file under that key with AES-256 in counter mode. The nonce for counter mode is deterministically set to the first 16 bytes of the key. After encrypting, we compute an HMAC over the ciphertext with SHA-512 as its hash algorithm, keyed deterministically with the same SHA-256 hash of the plaintext.
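The deduplication property follows directly from this determinism, and it can be checked with a standard-library sketch. Note the swap: `keystream_xor` below is a toy SHA-256-based keystream used only so the example runs without third-party packages. It stands in for AES-256 in counter mode and is not a vetted cipher. The key, nonce, and HMAC derivations follow the description above; the function names are hypothetical.

```python
import hmac
from hashlib import sha256


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stand-in for AES-256-CTR: XOR data with SHA-256(key||nonce||counter)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = sha256(key + nonce + counter).digest()
        out.extend(a ^ b for a, b in zip(data[offset:offset + 32], block))
    return bytes(out)


def convergent_encrypt(plaintext: bytes):
    """Derive everything from the plaintext so equal inputs give equal outputs."""
    key = sha256(plaintext).digest()      # 256-bit key from the content itself
    nonce = key[:16]                      # nonce is the first 16 bytes of the key
    ciphertext = keystream_xor(key, nonce, plaintext)
    tag = hmac.new(key, ciphertext, "sha512").digest()
    return key, ciphertext + tag


plaintext = b"def business_logic():\n    return 42\n"
key_a, blob_a = convergent_encrypt(plaintext)
key_b, blob_b = convergent_encrypt(plaintext)
_, blob_c = convergent_encrypt(plaintext + b"# changed\n")

same_input_same_output = blob_a == blob_b
different_input_differs = blob_a != blob_c

# XOR keystreams are symmetric, so applying the keystream again decrypts.
roundtrip = keystream_xor(key_a, key_a[:16], blob_a[:-64])
```

Because identical files yield byte-identical blobs, a deduplicating storage layer can collapse them without ever seeing the plaintext.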
Naturally, we never want the keys to touch the disk in a plaintext format. We always store keys in a secure storage layer implementing a cryptographic barrier—meaning data at rest is always encrypted.
For example, HashiCorp's Vault implements a secure write barrier. When we execute on production servers, our runtime pipes in a JSON-formatted dictionary of all of the keys via stdin to the executing Python process. It then resumes normal execution after reading in the keys. Whenever an import triggers loading a .pyce file, our custom SorocoFileLoader, as shown in Figure 2, checks this cache of keys, verifies the HMAC, decrypts, and loads the bytecode into Python's in-memory cache.
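Reading such a key dictionary at startup might look like the sketch below. The wire format is an assumption made for illustration: one line of JSON that maps each file's relative path to a hex-encoded key. The name `load_keys` is hypothetical, and an in-memory stream stands in for `sys.stdin` so the example is self-contained.

```python
import io
import json
from os.path import normcase


def load_keys(stream) -> dict:
    """Read one JSON line of {relative_path: hex_key} and decode the keys."""
    raw = json.loads(stream.readline())
    # Normalize path case so lookups behave the same on platforms like Windows.
    return {normcase(path): bytes.fromhex(hex_key) for path, hex_key in raw.items()}


# Stand-in for sys.stdin: the key line first, then the program's normal input.
fake_stdin = io.StringIO(
    json.dumps({"pkg/module.pyce": "00" * 32}) + "\n" + "remaining input\n"
)

KEYS = load_keys(fake_stdin)
rest = fake_stdin.readline()  # normal input is still available afterwards

print(len(KEYS[normcase("pkg/module.pyce")]))  # 32
```

Consuming exactly one line leaves the rest of stdin untouched for the program itself.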
Our implementation adds no steady-state overhead in production. This is due to Python's in-memory cache of imported modules: decryption takes place only when a process begins execution and first imports a module. Thus, during normal operation there is no impact on the runtime of Soroco business logic!
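This one-time cost is easy to observe: Python consults `sys.modules` before it consults any finder, so a second import of the same module never reaches the import hooks. The sketch below uses a hypothetical `CountingFinder` and a throwaway plain-source module to count how often the hooks are asked.

```python
import importlib
import sys
import tempfile
from importlib.abc import MetaPathFinder
from pathlib import Path


class CountingFinder(MetaPathFinder):
    """Hypothetical finder that counts lookups for one module name."""

    def __init__(self):
        self.lookups = 0

    def find_spec(self, fullname, path=None, target=None):
        if fullname == "toy_cached_module":
            self.lookups += 1
        return None  # let the stock finders do the actual loading


demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "toy_cached_module.py").write_text("VALUE = 1\n")
sys.path.insert(0, str(demo_dir))

counter = CountingFinder()
sys.meta_path.insert(0, counter)

first = importlib.import_module("toy_cached_module")
second = importlib.import_module("toy_cached_module")  # served from sys.modules

print(counter.lookups, first is second)  # 1 True
```

A decrypting loader sits behind the same finders, so it would run once per module per process as well.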
For the very paranoid, you should pin your Python process's memory space such that it never pages out to disk. Because decrypted bytecode and the decryption keys both live in the process's memory, either could otherwise get paged out to disk. Pinning memory eliminates this risk: plaintext business logic is never persistently stored, and neither are the keys.
Working with Python's import machinery was easier than we expected. Our entire implementation is fewer than 150 lines of Python code, with no performance impact on running business logic in production. Such a compact codebase makes auditing much easier. And we are fully open sourcing it here: Soroco PYCE.