How to upgrade a Python 2 codebase to Python 3

Python 2.7 has not been supported since January 1, 2020. After its first release on July 3, 2010, Python 2.7 was actively maintained for about 10 years, and a large number of projects are written in it. Most Python 2 projects should still work properly, because Python 2.7 is very stable and most of its bugs were fixed during those 10 years. However, with Python 2.7 you cannot use many features introduced in Python 3, such as f-strings, type hints (typing), and assignment expressions. Type hints in particular have become a must-have for new Python projects because they make the code more readable. More importantly, many packages no longer support Python 2.7, which becomes annoying if you keep maintaining your codebase in Python 2.7. Therefore, as a developer, you may be tasked with upgrading your Python 2 codebase to Python 3.

The future module can be used to make Python 2 code compatible with both Python 2 and Python 3 (Py2/3). It is based on lib2to3, six and python-modernize, which provide the basic functionality for making Python 2 code work in Python 3. The future module acts as a wrapper for the fixers of these modules and is thus easier to use. The futurize script provided by the future module passes Python 2 code through the appropriate fixers to convert it into valid Python 3 code. In addition, imports from __future__ and the future package are added so that the new code stays compatible with Python 2. You may think you can jump to Python 3 directly. Technically you can, but if your codebase is large and has been running in production for a long time, you will prefer a smooth conversion to an abrupt one. You don’t want too many surprises in your logs every day.

Preparation. Before using the futurize script, we need to have both Python 2 and Python 3 installed. You may already have Python installed, but it may not be the version you expect. To maintain different versions of Python side by side, you can use conda, which is powerful and convenient.

Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. The easiest way to get conda is to install Anaconda, a platform designed for data scientists that includes most of the common data analysis packages such as NumPy, SciPy and pandas. More importantly, it includes the package management tool conda. However, Anaconda is very bulky and can be several GB in size. If you prefer to install and manage all the packages yourself, you can install the slim version, Miniconda, which includes the conda tool and a minimal number of packages.

After Miniconda or Anaconda has been installed on your computer, you can start to create virtual environments. Let’s create one virtual environment for Python 2.7 and one for Python 3.9 (the latest version at the time of writing).

conda create --name py27 python==2.7.18
conda create --name py39 python==3.9.2

We can install the future module and run the futurize script in both environments and the result will be the same. The environments with different Python versions will be used to test the converted code.

Let’s use the py27 environment and install the future module in it:

conda activate py27
pip install future
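
To confirm which interpreter an activated environment actually provides, a short check like the following can be run in each environment. This is only a minimal sketch: the function name describe_interpreter is made up for illustration, and the code sticks to the standard library so that it should behave the same under py27 and py39.

```python
import sys


def describe_interpreter(version_info=None):
    """Return a short label such as 'Python 3.9 (Python 3 line)'."""
    # Default to the interpreter that is running this script.
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    line = "Python 3 line" if major >= 3 else "Python 2 line"
    return "Python {0}.{1} ({2})".format(major, minor, line)


print(describe_interpreter())
```

Running it once after conda activate py27 and once after conda activate py39 is a quick sanity check before any conversion starts.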

Now that the environments are set up and the future module is installed, we can start to upgrade the Python 2 codebase.

With the future module, there are three stages for the conversion.

Stage 1 is called modernization. It replaces deprecated features with their Python 3 counterparts wherever a Python 3-compatible equivalent is already available in Python 2, such as the print function. This stage is for “safe” changes that modernize the code but do not break Python 2.7 compatibility or introduce a dependency on the future package. To run the modernization conversion, use this command:

futurize --stage1 --write **/*.py

With the --write option, the files are updated in place. If you want to do a dry run for some files, omit the --write option and check the output in the console. When --write is used, the original files are backed up with the .bak suffix. You can delete them once all the stages have been completed successfully. Alternatively, you can pass the --nobackups option to avoid generating backup files, but I would not recommend doing so.
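
Once every stage has passed its tests, the leftover backups can be listed and removed with a few lines of Python instead of by hand. The sketch below assumes the backups sit next to the originals and end in .bak, as described above. find_backups and remove_backups are illustrative names, not part of the future module, and the code targets Python 3 because it relies on pathlib.

```python
from pathlib import Path


def find_backups(root):
    """Return every *.bak file under root, sorted for stable output."""
    return sorted(Path(root).rglob("*.bak"))


def remove_backups(root, dry_run=True):
    """Delete the backups under root; with dry_run=True only report them."""
    backups = find_backups(root)
    if not dry_run:
        for path in backups:
            path.unlink()
    return backups
```

Calling remove_backups(".") only lists what would be deleted; remove_backups(".", dry_run=False) actually deletes the files. Defaulting to a dry run is a deliberate choice here, since the backups are the only safety net until the changes are committed.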

If recursive globbing (**) does not work in your bash session, run this command to enable it:

shopt -s globstar

# Test globbing
ls -d **/*.py
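
If shell globbing is not an option (for example in a shell without globstar), the file list can be collected in Python and handed to futurize explicitly. The following is a rough sketch that assumes futurize is on the PATH of the active environment. collect_python_files and build_futurize_command are hypothetical helper names, and the only flags used are the ones shown in this article.

```python
from pathlib import Path


def collect_python_files(root="."):
    """Return all *.py files under root as sorted strings (like **/*.py)."""
    return sorted(str(path) for path in Path(root).rglob("*.py"))


def build_futurize_command(files, stage=1, write=False):
    """Build the argument list for one futurize run without executing it."""
    command = ["futurize", "--stage{0}".format(stage)]
    if write:
        command.append("--write")
    return command + list(files)


# Example with an explicit (made-up) file name:
print(build_futurize_command(["pkg/module.py"], stage=1, write=True))
# ['futurize', '--stage1', '--write', 'pkg/module.py']
```

The returned list can be passed to subprocess.run(command, check=True) to execute the conversion; building the command separately makes it easy to print and review before anything is rewritten.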

As a demonstration, I will do a Stage 1 conversion for the following Python 2 script, which includes some common Python 2 features that need to be refactored:

# -*- coding: utf-8 -*-

# print
print 123
print 123,

# dictionary
mydict = {"name": "John Doe", "age": 30}
for key in mydict.keys():
    print key

for value in mydict.values():
    print value

for key, value in mydict.iteritems():
    print key, value

# raw_input => input
name = raw_input("What is your name: ")
print name

# StringIO
import StringIO

stream = StringIO.StringIO("A string input.")
print stream.read()

# xrange => range
for i in xrange(1, 5):
    print i,

# map, filter, reduce, zip operations
my_list = ["1", "22", "333", "4444"]
my_list_mapped = map(int, my_list)
my_list_filtered = filter(lambda x: x < 10 or x > 100, my_list_mapped)
my_list_reduced = reduce(lambda x, y: x + y, my_list_filtered)
my_list_zip = zip(my_list, xrange(4))

# Class
class MyClass:
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name

# Iterator
class MyIterator:
    def __init__(self, iterable):
        self._iter = iter(iterable)

    def next(self):
        return self._iter.next()

    def __iter__(self):
        return self

my_iterator = MyIterator([1, 2, 4, 5])
my_iterator.next()
print my_iterator

# exception
try:
    b = 1 / 2.0
    c = 1 / 0
except ZeroDivisionError, e:
    print str(e)

# Attention here. This can not be automatically done properly and
# will break your code.
# bytes and unicode
assert isinstance("a regular string", bytes)
assert isinstance(u"a unicode string", unicode)
assert isinstance("a regular string", basestring)

# string codec: encoding and decoding
my_bytes1 = "\xc3\x85, \xc3\x84, and \xc3\x96"
my_unicode1 = my_bytes1.decode("utf-8")
assert isinstance(my_bytes1, bytes)
assert isinstance(my_unicode1, unicode)

my_unicode2 = u"Å, Ä, and Ö"
my_bytes2 = my_unicode2.encode("utf-8")
assert isinstance(my_unicode2, unicode)
assert isinstance(my_bytes2, bytes)

After Stage 1 conversion, the code is converted as follows:

--- my_python2_script.py	(original)
+++ my_python2_script.py	(refactored)
@@ -1,33 +1,35 @@
 # -*- coding: utf-8 -*-
 
 # print
-print 123
-print 123,
+from __future__ import print_function
+from functools import reduce
+print(123)
+print(123, end=' ')
 
 # dictionary
 mydict = {"name": "John Doe", "age": 30}
 for key in mydict.keys():
-    print key
+    print(key)
 
 for value in mydict.values():
-    print value
+    print(value)
 
 for key, value in mydict.iteritems():
-    print key, value
+    print(key, value)
 
 # raw_input => input
 name = raw_input("What is your name: ")
-print name
+print(name)
 
 # StringIO
 import StringIO
 
 stream = StringIO.StringIO("A string input.")
-print stream.read()
+print(stream.read())
 
 # xrange => range
 for i in xrange(1, 5):
-    print i,
+    print(i, end=' ')
 
 # map, filter, reduce, zip operations
 my_list = ["1", "22", "333", "4444"]
@@ -50,21 +52,21 @@
         self._iter = iter(iterable)
 
     def next(self):
-        return self._iter.next()
+        return next(self._iter)
 
     def __iter__(self):
         return self
 
 my_iterator = MyIterator([1, 2, 4, 5])
-my_iterator.next()
-print my_iterator
+next(my_iterator)
+print(my_iterator)
 
 # exception
 try:
     b = 1 / 2.0
     c = 1 / 0
-except ZeroDivisionError, e:
-    print str(e)
+except ZeroDivisionError as e:
+    print(str(e))

# Attention here. This can not be automatically done properly and
# will break your code.
# bytes and unicode
assert isinstance("a regular string", bytes)
assert isinstance(u"a unicode string", unicode)
assert isinstance("a regular string", basestring)
    
# string codec: encoding and decoding
my_bytes1 = "\xc3\x85, \xc3\x84, and \xc3\x96"
my_unicode1 = my_bytes1.decode("utf-8")
assert isinstance(my_bytes1, bytes)
assert isinstance(my_unicode1, unicode)

my_unicode2 = u"Å, Ä, and Ö"
my_bytes2 = my_unicode2.encode("utf-8")
assert isinstance(my_unicode2, unicode)
assert isinstance(my_bytes2, bytes)

With this updated code, you should still be able to run it in Python 2.7, but not yet in Python 3.9.
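
Python 3 still rejects the Stage 1 output because names that exist only in Python 2, such as raw_input, xrange and dict.iteritems, are left untouched at this stage. A quick way to see which of these an interpreter provides is sketched below. missing_py2_names is a made-up helper, and the list of names is taken from the demo script rather than being an exhaustive inventory.

```python
import sys

# The module that holds the built-in names is called "builtins" on
# Python 3 and "__builtin__" on Python 2.
if sys.version_info[0] >= 3:
    import builtins
else:
    import __builtin__ as builtins

PY2_ONLY_BUILTINS = ["raw_input", "xrange", "unicode", "basestring"]


def missing_py2_names():
    """Return the Python 2-only names that the running interpreter lacks."""
    missing = [name for name in PY2_ONLY_BUILTINS if not hasattr(builtins, name)]
    if not hasattr(dict, "iteritems"):
        missing.append("dict.iteritems")
    return missing


print(missing_py2_names())
```

On Python 2.7 the list is empty; on Python 3 every name is reported, which is why Stage 2 is needed before the script can run there.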

You should commit the changes for Stage 1 before you start Stage 2. If any error happens during tests and you need to fix it manually, you should create a new commit for the manual fix. It is important to separate things done by a machine from those done by a human.

Stage 2 is called porting, which adds support for Python 3 while keeping compatibility with Python 2 by introducing specific workarounds and helpers. This stage adds a dependency on the future package. The goal for Stage 2 is to make further mostly safe changes to the Python 2 code to use Python 3-style code which is still able to run on Python 2 with the help of the appropriate builtins and utilities in future.

Run Stage 2 of the conversion process with:

futurize --stage2 --write --unicode-literals **/*.py

With the --unicode-literals option, from __future__ import unicode_literals is added to every module. As a result, all string literals in those modules become unicode unless explicitly marked with a b'' prefix. Opinions differ on whether to specify the --unicode-literals flag. I personally prefer to enable it because it makes string handling simpler. Besides, many known issues with the flag have been resolved in the latest version of the future module. For example, strings already marked with the u'' prefix are no longer changed.
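
The effect of the import can be checked in isolation. The snippet below is a small sketch rather than futurize output: it enables unicode_literals by hand and inspects three literals. text_type is an illustrative alias for the unicode string type (unicode on Python 2, str on Python 3).

```python
# The __future__ import must be the first statement in the module.
from __future__ import unicode_literals

# type(u"") is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")

plain = "no prefix"       # unicode on both versions thanks to the import
explicit = u"u prefix"    # unicode on both versions, with or without the import
raw_bytes = b"b prefix"   # always a bytes string

assert isinstance(plain, text_type)
assert isinstance(explicit, text_type)
assert isinstance(raw_bytes, bytes)
assert not isinstance(raw_bytes, text_type)
```

Without the import, the first assertion would fail on Python 2.7 because an unprefixed literal is a bytes string there; on Python 3 the import changes nothing.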

When Stage 2 is completed, the demo script is converted as follows:

--- my_python2_script.py	(original)
+++ my_python2_script.py	(refactored)
@@ -2,44 +2,57 @@
 
 # print
 from __future__ import print_function
+from __future__ import division
+from __future__ import unicode_literals
+from future import standard_library
+standard_library.install_aliases()
+from builtins import input
+from builtins import zip
+from builtins import map
+from builtins import str
+from builtins import next
+from builtins import range
+from past.builtins import basestring
+from past.utils import old_div
+from builtins import object
 from functools import reduce
 print(123)
 print(123, end=' ')
 
 # dictionary
 mydict = {"name": "John Doe", "age": 30}
-for key in mydict.keys():
+for key in list(mydict.keys()):
     print(key)
 
-for value in mydict.values():
+for value in list(mydict.values()):
     print(value)
 
-for key, value in mydict.iteritems():
+for key, value in mydict.items():
     print(key, value)
 
 # raw_input => input
-name = raw_input("What is your name: ")
+name = input("What is your name: ")
 print(name)
 
 # StringIO
-import StringIO
+import io
 
-stream = StringIO.StringIO("A string input.")
+stream = io.StringIO("A string input.")
 print(stream.read())
 
 # xrange => range
-for i in xrange(1, 5):
+for i in range(1, 5):
     print(i, end=' ')
 
 # map, filter, reduce, zip operations
 my_list = ["1", "22", "333", "4444"]
-my_list_mapped = map(int, my_list)
-my_list_filtered = filter(lambda x: x < 10 or x > 100, my_list_mapped)
+my_list_mapped = list(map(int, my_list))
+my_list_filtered = [x for x in my_list_mapped if x < 10 or x > 100]
 my_list_reduced = reduce(lambda x, y: x + y, my_list_filtered)
-my_list_zip = zip(my_list, xrange(4))
+my_list_zip = list(zip(my_list, range(4)))
 
 # Class
-class MyClass:
+class MyClass(object):
     def __init__(self, name):
         self.name = name
 
@@ -47,11 +60,11 @@
         return self.name
 
 # Iterator
-class MyIterator:
+class MyIterator(object):
     def __init__(self, iterable):
         self._iter = iter(iterable)
 
-    def next(self):
+    def __next__(self):
         return next(self._iter)
 
     def __iter__(self):
@@ -64,7 +77,7 @@
 # exception
 try:
     b = 1 / 2.0
-    c = 1 / 0
+    c = old_div(1, 0)
 except ZeroDivisionError as e:
     print(str(e))
 
@@ -72,17 +85,17 @@
 # will break your code.
 # bytes and unicode
 assert isinstance("a regular string", bytes)
-assert isinstance(u"a unicode string", unicode)
+assert isinstance(u"a unicode string", str)
 assert isinstance("a regular string", basestring)
 
 # string codec: encoding and decoding
 my_bytes1 = "\xc3\x85, \xc3\x84, and \xc3\x96"
 my_unicode1 = my_bytes1.decode("utf-8")
 assert isinstance(my_bytes1, bytes)
-assert isinstance(my_unicode1, unicode)
+assert isinstance(my_unicode1, str)
 
 my_unicode2 = u"Å, Ä, and Ö"
 my_bytes2 = my_unicode2.encode("utf-8")
-assert isinstance(my_unicode2, unicode)
+assert isinstance(my_unicode2, str)
 assert isinstance(my_bytes2, bytes)

Don’t be intimidated by the number of imports. Normally we won’t have so many Python 2 features that need to be converted in the same file; here I have gathered the common ones in one script just for demonstration purposes. You can check more about the Python 2 and Python 3 feature conversion in this cheatsheet, which is quite exhaustive.
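
One of the less obvious helpers in the import list above is old_div. In the diff, 1 / 0 was rewritten to old_div(1, 0) while 1 / 2.0 was left alone. The sketch below mimics the behaviour the helper is meant to preserve, namely that Python 2 division floors the result when both operands are integers. old_div_sketch is a stand-in written for this article, not the code shipped in past.utils.

```python
import numbers


def old_div_sketch(a, b):
    """Divide the way `/` worked in Python 2 without the division import."""
    if isinstance(a, numbers.Integral) and isinstance(b, numbers.Integral):
        return a // b   # integer operands: floor division
    return a / b        # otherwise: true division


print(old_div_sketch(7, 2))    # 3
print(old_div_sketch(7, 2.0))  # 3.5
print(old_div_sketch(-7, 2))   # -4, floor division rounds toward negative infinity
```

Once it is clear which behaviour a given call site really needs, the old_div call can be replaced by an explicit // or / during the manual cleanup.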

Note that this converted script won’t run on either Python 2 or Python 3 yet, because strings are treated differently in the two versions. We need a manual fix stage for this problem.

Before we start Stage 3, commit the changes in Stage 2 because we should always separate things done by a machine from those done by a human.

Stage 3 — Manually fix `bytes` string-related problems.

Handling bytes strings correctly has been one of the most difficult tasks in writing a Py2/3-compatible codebase. This is because, in Python 2, the bytes type is simply an alias for the str type.

>>> # Python 2.7
>>> bytes == str
True
>>> bytes is str
True

>>> py2_string = "A python 2 string"
>>> type(py2_string)
<type 'str'>

# In Python 2.7, a bytes string is just a regular string.
>>> py2_bytes_1 = bytes("A python 2 string")
>>> py2_bytes_2 = b"A python 2 string"
>>> type(py2_bytes_1)
<type 'str'>
>>> type(py2_bytes_2)
<type 'str'>

>>> py2_string == py2_bytes_1
True
>>> py2_string == py2_bytes_2
True

Because a plain string in Python 2.7 is technically a bytes string, you need to use the u'' prefix to create a unicode string:

>>> py2_unicode = u"Å, Ä, and Ö"
>>> type(py2_unicode)
<type 'unicode'>
>>> str == unicode
False

To convert a unicode string to a bytes string, we need to call the encode method on the unicode string. On the other hand, we use the decode method to convert a bytes string to the unicode counterpart.

>>> py2_unicode = u"Å, Ä, and Ö"
>>> py2_unicode_to_bytes = py2_unicode.encode("utf-8")
>>> py2_unicode_to_bytes
'\xc3\x85, \xc3\x84, and \xc3\x96'
>>> type(py2_unicode_to_bytes)
<type 'str'>

>>> py2_bytes = '\xc3\x85, \xc3\x84, and \xc3\x96'
>>> py2_bytes_to_unicode = py2_bytes.decode("utf-8")
>>> py2_bytes_to_unicode
u'\xc5, \xc4, and \xd6'
>>> type(py2_bytes_to_unicode)
<type 'unicode'>

In Python 3, however, the default string type is the unicode string, and we need to explicitly mark a literal with the b'' prefix to make it a bytes object. Interestingly, there is no separate unicode type in Python 3; it is just str.

>>> # Python 3
>>> str == bytes
False

>>> py3_string = "A python 3 unicode string"
>>> type(py3_string)
<class 'str'>
>>> py3_unicode = u"A python 3 unicode string"
>>> type(py3_unicode)
<class 'str'>
>>> py3_string == py3_unicode
True

>>> py3_bytes = b"A python 3 bytes string"
>>> type(py3_bytes)
<class 'bytes'>

As in Python 2, the encode method in Python 3 converts a unicode string to a bytes string, and decode converts bytes back to unicode:

# We don't need the u'' prefix to create a unicode string because
# the unicode type is the default in Python 3.
>>> py3_unicode = "Å, Ä, and Ö"
>>> py3_unicode_to_bytes = py3_unicode.encode("utf-8")
>>> py3_unicode_to_bytes
b'\xc3\x85, \xc3\x84, and \xc3\x96'
>>> type(py3_unicode_to_bytes)
<class 'bytes'>

>>> py3_bytes = b'\xc3\x85, \xc3\x84, and \xc3\x96'
>>> py3_bytes_to_unicode = py3_bytes.decode("utf-8")
>>> py3_bytes_to_unicode
'Å, Ä, and Ö'
>>> type(py3_bytes_to_unicode)
<class 'str'>

With the knowledge of the bytes string above, we can refactor the string part of the code as follows:

# Attention here. This can not be automatically done properly and
# will break your code.
# bytes and unicode
-assert isinstance("a regular string", bytes)
+assert isinstance("a regular string", str)
assert isinstance(u"a unicode string", str)
assert isinstance("a regular string", basestring)

# string codec: encoding and decoding
-my_bytes1 = "\xc3\x85, \xc3\x84, and \xc3\x96"
+my_bytes1 = b"\xc3\x85, \xc3\x84, and \xc3\x96"
my_unicode1 = my_bytes1.decode("utf-8")
assert isinstance(my_bytes1, bytes)
assert isinstance(my_unicode1, str)

my_unicode2 = u"Å, Ä, and Ö"
my_bytes2 = my_unicode2.encode("utf-8")
assert isinstance(my_unicode2, str)
assert isinstance(my_bytes2, bytes)

Now the code should be able to run successfully on both Python 2 and Python 3. Commit your changes and cheers! 🙂
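
In a larger codebase the same kind of fix tends to repeat wherever data enters or leaves the program, so it can help to funnel conversions through a pair of small helpers. The following is a minimal sketch that assumes UTF-8 data. to_text and to_bytes are hypothetical names, not functions from the future module, and the code avoids third-party imports so that it should run unchanged on Python 2.7 and Python 3.

```python
text_type = type(u"")   # `unicode` on Python 2, `str` on Python 3
binary_type = bytes     # alias of `str` on Python 2, `bytes` on Python 3


def to_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding bytes if necessary."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, binary_type):
        return value.decode(encoding)
    raise TypeError("expected text or bytes, got {0!r}".format(type(value)))


def to_bytes(value, encoding="utf-8"):
    """Return value as a bytes string, encoding text if necessary."""
    if isinstance(value, binary_type):
        return value
    if isinstance(value, text_type):
        return value.encode(encoding)
    raise TypeError("expected text or bytes, got {0!r}".format(type(value)))


assert to_text(b"\xc3\x85") == u"\xc5"
assert to_bytes(u"\xc5") == b"\xc3\x85"
```

The idea is to decode once on input, work with unicode strings internally, and encode once on output, so that the bytes/unicode distinction only matters at those boundaries.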

In this article, you have learned how to upgrade a Python 2 codebase to Python 3. There are three stages: modernization (Stage 1), porting (Stage 2) and bytes string correction (Stage 3). The first two stages are automated and normally do not need any intervention. The third stage is the one that needs most of your attention; if you are not careful there, you can easily break your code. You can read more about the caveats of the bytes string in Python 2 and Python 3 in the following articles:
