Generators and Iterators in Julia and Python
A comparison of how iterators are implemented in Julia and Python, and a look at the different roles generators play in building iterators
Generators are used a lot in Python to implement iterators. Both Julia and Python implement list comprehensions with generators. Rather than writing, say, [x*x for x in 1:4], we can put the expression from the list comprehension into a separate object:
julia> g = (x*x for x in 1:4)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##27#28"))}(getfield(Main, Symbol("##27#28"))(), 1:4)
Which, as you can see, results in a Generator object in Julia and a generator object in Python:
python> g = (x*x for x in range(1,5))
python> g
<generator object <genexpr> at 0x10bdeef48>
While seemingly similar, they are quite different. Python generators are based on coroutines, while Julia generators are built on top of iterators.
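To see what I mean by that, we can poke a little at the Generator object from before. This is just me inspecting it in the REPL; the f and iter field names are internal details, and the exact output will depend on your Julia version:
julia> g.f(3)     # the stored function, here x -> x*x
9

julia> g.iter     # the underlying iterable being wrapped
1:4

julia> collect(g)
4-element Array{Int64,1}:
  1
  4
  9
 16
So a Julia Generator is just a small struct holding a function and an iterable; all the actual work happens through iteration.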
Which brings us to iterators. We need to understand their role and implementation in both languages.
Python Iterators
I’ll do a bit of a silly example, where I’ll show you how to iterate over the coordinates of a point class.
class Point:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
This code will allow us to write:
python> p = Point(3, 8, 2)
However if we want to create an iterator for it, we need to add this method as well:
def __iter__(self):
    return PointIterator(self)
That allows us to iterate over our point in a for-loop. And just for convenience we will add this method as well; it gives a string representation of points in the REPL.
def __repr__(self):
    return f"Point({self.x}, {self.y}, {self.z})"
The real meat of the functionality we are building is in the iterator class we return.
class PointIterator:
    def __init__(self, point):
        self.point = point
        self.coords = ['z', 'y', 'x']

    def __next__(self):
        if self.coords:
            return getattr(self.point, self.coords.pop())
        else:
            raise StopIteration
The way this works is that we keep a list of the coordinates x, y, and z which we want to access in turn as we iterate over the point.
__next__ is called upon each iteration, where we pop one coordinate from the list of coordinates to fetch. With getattr we can get the attributes of Python objects by name.
With this implementation we can play around with iterators:
python> p = Point(2, 3, 8)
python> for a in p:
            print(a)
2
3
8
python> it = iter(p)
python> next(it)
2
python> next(it)
3
python> next(it)
8
Let us compare this with how iterators are made in Julia before getting into generators.
Julia Iterators
Julia has a quite different approach to implementing iterators. Abstractly speaking, one could say that both languages require implementing a sort of protocol to make something iterable. However, Julia does not do it in an object-oriented fashion. There is no class to subclass; instead one must implement new versions of the iterate function.
First let us make the Point type. There is no need to implement the repr function to show the type; it has an automatic representation.
struct Point
    x::Int
    y::Int
    z::Int
end
It also has a default constructor, so we don’t have to implement that either. A for loop in Julia looks like this:
for i in iter   # or "for i = iter"
    # body
end
Julia translates this to the following code:
next = iterate(iter)
while next !== nothing
    (i, state) = next
    # body
    next = iterate(iter, state)
end
So to iterate over the coordinates of a point, we need to implement two versions of iterate. In Julia terminology these are methods, but that can be confusing to people with an object-oriented background. Methods are all the concrete implementations of a function in Julia. So in Julia terminology, we are implementing two methods of the iterate function.
import Base: iterate   # extend Julia's own iterate function, so for-loops pick up our methods

iterate(p::Point) = p.x, [:y, :z]

function iterate(p::Point, coords)
    if isempty(coords)
        nothing
    else
        getfield(p, coords[1]), coords[2:end]
    end
end
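With these two methods defined, our Point works anywhere Julia expects an iterable. Here is a quick REPL session of my own mirroring the earlier Python example:
julia> p = Point(2, 3, 8)
Point(2, 3, 8)

julia> for a in p
           println(a)
       end
2
3
8

julia> iterate(p)    # the first call returns the value plus the remaining coordinate names
(2, [:y, :z])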
So there are some similarities with __next__ in Python. Python talks about attributes of an object, while Julia talks about fields of an object. Thus getattr and getfield basically do the same thing. getattr refers to attributes by string names, while Julia uses symbols. Symbols are quite similar to strings: "foobar" is a string, while :foobar is a symbol. If you want a better understanding of why Julia often uses symbols instead of strings, there is a great answer here.
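As a small REPL example of my own, using the p from the iteration example above, fields can be listed and fetched by symbol:
julia> fieldnames(Point)
(:x, :y, :z)

julia> getfield(p, :y)   # the rough equivalent of Python's getattr(p, "y")
3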
Another important difference is that Julia does not mutate any data while iterating. Look at this example of using enumerate in Python:
python> it = enumerate(["one", "two"])
python> list(it)
[(0, 'one'), (1, 'two')]
python> list(it)
[]
You can see that the iterator gets spent. You cannot keep creating lists from the iterator. However in Julia you can, because the state of iteration is kept separate from the iterator itself:
julia> it = enumerate(["one", "two"])
Base.Iterators.Enumerate{Array{String,1}}(["one", "two"])
julia> collect(it)
2-element Array{Tuple{Int64,String},1}:
(1, "one")
(2, "two")
julia> collect(it)
2-element Array{Tuple{Int64,String},1}:
(1, "one")
(2, "two")
Using Python Generators to Implement Iterators
As you can see, implementing iterators this way in Python is rather verbose, which is why the preferred way of implementing them is through generators.
class Point:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def __repr__(self):
        return f"({self.x}, {self.y}, {self.z})"

    def __iter__(self):
        def gen():
            yield self.x
            yield self.y
            yield self.z
        return gen()
You see, with this approach we don't have to implement a PointIterator class at all. yield is different from return in that it runs in the context of a coroutine, which is what a generator is: the function "remembers" where it was when it handed over a value, and re-enters at that point the next time it is asked for one. That is very practical when dealing with iteration. Let me give another example illustrating that.
def generator():
    for i in range(0, 6):
        yield chr(ord('A') + i)
Let me show you how to use it in the REPL to produce letters from A to F.
python> g = generator()
python> list(g)
['A', 'B', 'C', 'D', 'E', 'F']
python> g = generator()
python> next(g)
'A'
python> next(g)
'B'
python> next(g)
'C'
The neat thing here is that you don't need to write code maintaining information about where you are in the for-loop. The generator keeps track of that for you.
Python Style Generator in Julia
If you want to create a generator that functions like a Python generator, you have to utilize channels in Julia. This may seem a bit cryptic, but don't worry, I will get into more of the details of how this works towards the end.
function generator()
    Channel() do channel
        for i in 0:5
            put!(channel, Char('A' + i))
        end
    end
end
This will not return a Generator object, like the comprehension did, but a Channel object:
julia> g = generator()
Channel{Any}(sz_max:0,sz_curr:1)
We can however use this channel much like a regular iterator. E.g. we can collect the values and form an array, like we can with an iterator.
julia> collect(g)
6-element Array{Any,1}:
'A'
'B'
'C'
'D'
'E'
'F'
Or we can use the splat (unpacking) syntax to create an array.
julia> g = generator()
julia> [g...]
6-element Array{Char,1}:
'A'
'B'
'C'
'D'
'E'
'F'
Or we can unpack the chars as arguments to string, which concatenates them all into a single string.
julia> g = generator()
julia> string(g...)
"ABCDEF"
Getting into the Details about Tasks and Channels
The Channel constructor we are using in our code example is actually doing quite a lot of stuff, so you don't get to see all the details of how it works. Here is a version where what is going on is spelled out more explicitly:
function generator()
    channel = Channel(2)
    function make_letters()
        for i in 0:5
            put!(channel, Char('A' + i))
        end
    end
    task = Task(make_letters)
    schedule(task)
    channel
end
You can think of coroutines a bit like threads. Like threads, you can have many of them, and you can communicate between them through different data structures. Channels are used to send data between coroutines. However, unlike threads, only one coroutine ever runs at a time. You cannot have multiple coroutines running on different CPU cores. They work as cooperative multitasking: the scheduler only switches to another coroutine when the current one explicitly gives up control.
So for instance when I write:
put!(channel, Char('A' + i))
A letter will be put into the channel and the call returns immediately, as long as there is room in the channel's buffer. We specified our channel as having a buffer of 2.
Once the buffer is full, our current coroutine will block, and the scheduler will pick a coroutine (task) waiting for data from this particular channel.
task = Task(make_letters)
This will create a task, which is how we keep track of different coroutines. Every coroutine needs a task. The task keeps track of the execution state.
However, at this point our coroutine is not actually running. The for-loop is not putting letters into the channel. To get it to run, we need to schedule the task.
schedule(task)
This puts the task in the scheduler queue. Whenever the system is idle, it will look in the scheduler queue for tasks to run. The immediate result of this is that our task gets run, which causes ‘A’ and ‘B’ to be put into the channel. But when the loop tries to put in ‘C’, it has run out of buffer space and blocks.
The scheduler will then look for another task to run. I believe we can regard the REPL itself as a main coroutine which the scheduler can switch to. So say I saved the channel returned from generator; then we can do this:
julia> channel = generator()
julia> take!(channel)
'A'
After that, the REPL goes idle, waiting for your input. The scheduler then has a chance to pick up our old make_letters task and continue running it. One letter has been removed from the buffer, so it now has the chance to put 'C' into the channel.
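To tie the pieces of this walkthrough together, here is a minimal, self-contained sketch of my own (not part of the original example) with two tasks cooperating over a channel. With a capacity of 0, the producer has to give up control on every single put!:
# an unbuffered channel: put! blocks until another task is ready to take!
ch = Channel{Int}(0)
@async begin                # the producer task
    for i in 1:3
        put!(ch, i)         # suspends here; the scheduler switches to the consumer
    end
    close(ch)               # tells iterating consumers that no more values are coming
end
for x in ch                 # the consumer: takes values until the channel is closed
    println(x)              # prints 1, 2 and 3
end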
To avoid the boilerplate of setting up coroutine tasks and running them, there are a lot of shortcuts and macros to help us. Here is one alternative:
function generator()
    channel = Channel(2)
    @async begin
        for i in 0:5
            put!(channel, Char('A' + i))
        end
    end
    channel
end
The @async macro takes care of creating a function, wrapping it in a Task, and scheduling that task. It returns the task object, but we don't need to store it for anything.
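As a quick aside of my own, you can save that returned object to see that it really is a Task (the status and the address will of course vary from run to run):
julia> t = @async 1 + 1
Task (runnable) @0x00007fa...

julia> fetch(t)       # wait for the task and get its return value
2

julia> istaskdone(t)
true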
The Channel constructor can take either an integer giving its capacity or a function object. In the latter case it will wrap the provided function object in a Task and schedule it.
function generator()
    function make_letters(channel::Channel)
        for i in 0:5
            put!(channel, Char('A' + i))
        end
    end
    Channel(make_letters)
end
That will probably make it more obvious how the initial example worked, where we utilized the do syntax. Remember, the do block form
foobar(y, z) do x
    stuff(x)
end
is just another way of writing:
foobar(x -> stuff(x), y, z)
This is kind of the stuff that goes on under the hood in Python generators. You can actually do similar things in Python to deal with concurrency. I am going to explore that in a later story.