24. Introduction to NetworkX Graphs
Section author: Malcolm Smith <msmith.malcolmsmith@gmail.com>
[status: first draft]
24.1. Motivation, Prerequisites, Plan
Motivation
As you go about your life, you will often take for granted the vast networks you are a part of. From your local art club (or your workplace, or school, etc.) to the entire world, we all live in relation to one another. To make sense in rigorous terms of this, mathematicians use what is known as network theory. Understanding and using it is a very valuable skill to have, as it allows easy visual understanding of any sort of data or network you’re working with.
Situations in which it might arise include modeling economic markets, where interactions are recorded between people, or in A.I. research, in which a neural network (a system designed to emulate the way our brains work) is a very specialized case of a mathematical network. In general, network theory has a wide range on applications, and having a basic understanding of it will be helpful to anyone who wants to gain a deeper insight into their data.
Prerequisites
The 10-hour “serious programming” course.
The “Data files and first plots” mini-course in Section 2.
Having NetworkX and Matplotlib, a very powerful and versatile programming library, installed. You can install them with:
$ sudo apt install pip $ pip install --user --upgrade networkx[default] $ pip install --user --upgrade matplotlib
For the later parts of the course, you need wget and gzip. You may need to download or upgrade wget and gzip, and you can make sure you have the latest versions with:
$ sudo apt install wget gzip
Plan
We’re going to start off with learning what a network is with the Bridges of Konigsberg, as well as the basics of NetworkX, such as adding nodes and edges as well as graphing the results. We will then move on to directed and weighted networks, which will come into play for the final step. After learning these functions, the next thing we will do is download and extract the data from an online database. we will analyze the data next, trying to see patterns in the distribution of the data.
24.2. First Steps
Our first step is to make sure all the required libraries are installed properly.
Open a python 3 prompt and check to see if you properly downloaded NetworkX and Matplotlib.
$ python3
>>> import networkx as nx
>>> G = nx.Graph()
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
Don’t worry about what that means right now; just know it’s working if there isn’t an error message.
After getting the libraries squared away, let us explore what a mathematical network actually is, and some familiar examples. A network is really a simple object, with lines, or edges, going between points, or nodes. The famous mathematical problem known as the Bridges of Konigsberg (which prompted Euler’s early work in graph theory) is easily represented as a network. For those who don’t know of the Bridges of Konigsberg, here is a picture of it:
This can be represented as a network with 4 nodes and 7 edges. The edges, as you may have guessed, are the bridges, and the nodes are the landmasses. One thing to note about this is that there are two sets of repeat bridges. This does not disqualify it from being a network, but it isn’t a type of network we will see in this chapter. There is a section where we have edges going between the same nodes twice, but that’s in a network in which direction matters, and thus the edges are not identical. In this case, there are two sets of identical edges, a case we will not discuss in this chapter.
Next we’re going to make a very simple network using NetworkX. To do this, let us first walk through some basic NetworkX commands.
some_network = nx.Graph()
will create a network called some_network.some_network.add_node(some_node)
adds the node some_node to the network some_network.some_network.add_nodes_from([node1, node2, node3])
will add all nodes inside the array in the input to some_network. It can also be used with another network.some_network.add_edge(node1, node2)
will add an edge from node1 to node2 in some_network. Note that if node1 or node2 is not in the network the command will add them.some_network.add_edges_from([[node1, node2], [node3, node2]])
will add the dges contained in the nested array. It can also be used with another network. Like.add_edge()
, it will add any nodes it contains that are not already present in the network.some_network.nodes
will return a tuple with all the nodes in some_network in it.some_network.edges
will return an array with all the edges in some_network as tuples inside it.nx.draw(some_network)
will allow matplotlib to draw some_network in its current state.
With that out of the way, let us create a simple network. To start, we will use a network with 5 nodes and 7 edges.
>>> import networkx as nx
>>> import matplotlib.pyplot as plt
>>> G = nx.Graph()
>>> G.add_nodes_from(range(5))
>>> G.add_edges_from([[0,1],[1,2],[2,3],[3,4],[4,0],[0,3],[1,4]])
>>> nx.draw(G, with_labels=True)
>>> plt.show()
In this case, the with_labels keyword in the second to last line tells the program to draw the network with the numbers on the nodes. If you’ve done every thing correctly, an image like Figure 24.2.2 should pop up:
This doesn’t look very nice, though. To fix this, we need a bit more code:
>>> fig, ax = plt.subplots() # necessary for using a title on the graph
>>> nx.draw_circular(G, ax=ax, with_labels=True)
>>> ax.set_title('Simple network')
>>> plt.show()
This should produce the following:
24.3. Directed and Weighted Networks
Now that we understand the basics of using NetworkX, let us use it for something slightly more complex. Using a directed network is also the next step towards making the complex network at the end.
You can use the same code as before, but replace G = nx.Graph()
with G = nx.DiGraph()
. That should produce something like
Figure 24.3.1:
To add weights, we can make a new graph in a very similar manner:
>>> G.clear()
>>> G = nx.DiGraph()
>>> G.add_nodes_from(range(5))
>>> G.add_weighted_edges_from([[0,1,2],[1,2,1],[2,3,1],[3,4,3],[4,0,10],[0,3,0.5],[1,4,0.1]])
>>> fig, ax = plt.subplots()
>>> nx.draw(G, ax=ax, with_labels=True)
>>> ax.set_title('Weighted and Directed Network')
>>> plt.show()
The .add_weighted_edges_from()
function is very similar to
.add_edges_from()
, but it take an array of arrays with three
elements, rather than two. The third is the weight, which must be a
positive number.
As you can see, the high weight edge from 4 to 0 is by far the shortest in the image, followed by 0 to 1 and 3 to 4. 0 to 3 and 1 to 4 should be far and away the longest, but because the network has certain constraints placed on its shape, they end up being more or less the same length as 1 to 2 and 2 to 3.
Now, the next question is obvious. What happens when two edges going
opposite directions between the same node are weighted? For instance,
what would happen if we added an edge going from 0 to 4 with a weight
of 0.1? The code to add this is G.add_edge(0, 4, weight=0.1)
. Note
that because we are just adding one edge, we don’t have access to the
.add_weighted_edges_from()
function and have to manually specificy
that we want to add a weight. The resulting graph is pictured in
Figure 24.3.3:
While the [0,4] edge is slightly longer than before, the real change
is in all the other lengths of edges. This is because the
nx.draw()
function models the edges as springs, and now there are
two springs in that space, which affects the other springs around
it. To explain what this means, imagine a simple scenario where you
have two points (with no mass) connected by a spring. If you let it
come to a resting state, those two points will be some distance
apart. If we add a thrid point, though, and connect it to the other
two points with springs as well, something new will happen. Unless we
choose the precise parameters for the springs (in the case of NetworkX
these parameters come from the weight), no matter what at least one of
the springs will be either stretched or compressed. This is, in
effect, how NetworkX models networks. This is computationally
intensive, as you might imagine, but as we will see later it can give
valuable insights into network structure.
24.4. Using Larger Data Sets
Now that most of the basics are out of the way, let us write a script that can get useful information from large data sets with thousands of nodes. If you have a powerful computer, you should be able to graph the whole network with little trouble. However, if your computer is less powerful, you should use the optional code that samples a smaller version of the data set.
To start, we need a dataset. The one we will be using is a network of bitcoin users rating how trustworthy they find another person. It’s directed and weighted, with some edges that go both directions. The page where the data is downloadable is here: https://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html , but don’t feel the need to download the data right now as we will be doing that in a later step.
First, you’re going to want to make a directory to house the data. To do this open a new terminal and run:
$ mkdir NetworkX-intro
$ cd NetworkX-intro
Of course, feel free to call your directory whatever you want. However, in the future, we will reference as NetworkX-intro/ , so make sure you change that if you have a different directory name. From now on, make sure you save your python files to this directory as well. Then, download and extract the data with:
$ wget https://snap.stanford.edu/data/soc-sign-bitcoinalpha.csv.gz
$ gunzip soc-sign-bitcoinalpha.csv.gz
If, for whatever reason, Stanford ever takes down their database of datasets, the same dataset can be downloaded (to be put in the wget command) from the wayback machine archive at this link .
If you run ls
now your directiory should contain the file
soc-sign-bitcoinalpha.csv
. Now open your prefered programming editor
and make a new python file. we will call it graph_bitcoin_data.py. At
the top, import the necessary libraries with:
import networkx as nx
import matplotlib.pyplot as plt
Now we can attempt to extract the data from the .csv file. To do this, we can use the following code:
import networkx as nx
import matplotlib.pyplot as plt
with open('soc-sign-bitcoinalpha.csv') as bitcoin_data_unformatted:
for row in bitcoin_data_unformatted:
print(row)
This should output a huge list of arrays with strings inside them,
something like ['1234,4321,1,1234567890']
. Now, let us put this in
a network called bitcoin_network
with that third column as
weights:
with open('soc-sign-bitcoinalpha.csv') as bitcoin_data_unformatted:
for line in bitcoin_data_unformatted:
words = line.split(',')
row = []
for i in range(len(words)):
row.append(int(words[i].strip()))
bitcoin_network.add_edge(row[0], row[1], weight=row[2])
If you run the program now, it should produce a huge list of arrays
containing numbers, something like [1234, 4321, 1,
1234567890]
. Now that we have it properply formatted, let us talk
about what this data actually is. The first and second numbers
represent users of the bitcoin network, and that’s going to be an edge
in our network later on. The third number is an integer rating
from -10 to 10 of how much the first person trusts the second. This
will be used as a weight in our network. The fourth is just a
timestamp that we will ignore. Now, let us put all this data into a
network.
Before moving on to the final program, we should address the size of the file. 21,000 edges is a LOT. This means it’s hard for your computer to render (unless you happen to have a high end gaming computer or something similar). This means some of the edges have to be removed. We could iterate through them and remove them at random, but a more efficient method would be to remove some nodes. This is because removing a node will also remove all the edges attached to it, and there are a lot less node to run through. Some example code below illustrates how we might remove 1500 nodes from a network:
import random
nodes = list(bitcoin_network.nodes)
random.shuffle(nodes)
for i in range(1500):
network.remove_node(nodes[i])
Combining all of this, plus a main()
function to tie it all
together, we get listing Listing 24.4.1:
import networkx as nx
import matplotlib.pyplot as plt
import random # for remove_nodes()
correct_negative_weights = 11
# because the trusts in the bitcoin data range from -10 to 10, to make all of them
# positive, we can add 11. The reason we don't add 10 is that weights of 0 result in
# division by 0, which causes problems.
def main():
bitcoin_network = convert_data_to_network('soc-sign-bitcoinalpha.csv') # make sure .csv file is in the same directory as this .py file
bitcoin_network = remove_nodes(3000, bitcoin_network)
# 3000 is almost certainly more than you'll have to
# remove, but it's a good starting place.
show_data(bitcoin_network)
def convert_data_to_network(filepath):
"""this is a function that will convert data from the .csv file to a NetworkX variable
called bitcoin_network."""
bitcoin_network = nx.DiGraph()
# remember that we need a directed network because one user is rating the other.
with open(filepath) as bitcoin_data_unformatted:
for line in bitcoin_data_unformatted:
words = line.split(',')
words[3].replace('\n', '')
# the format of the data adds a \n to the end of every row
row = []
for i in range(len(words)):
row.append(int(words[i]))
row[2] = row[2] + correct_negative_weights
# weights must be positive. We'll undo this later.
bitcoin_network.add_edge(row[0], row[1], weight=row[2])
return bitcoin_network
def remove_nodes(num_to_remove, network):
"""this is a very large network, and for the sake of your computer we'll remove some
of the nodes and edges."""
nodes = list(network.nodes)
random.shuffle(nodes)
for i in range(num_to_remove):
network.remove_node(nodes[i])
return network
def show_data(network):
"""simple function that will display the bitcoin network using matplotlib."""
fig, ax = plt.subplots()
nx.draw(network, ax=ax, arrows=False, node_size=1, alpha=0.4, width=0.15)
# with this many edges arrows just muddle the graph and make it hard to read
ax.set_title('Bitcoin network with ' + str(network.number_of_nodes()) + ' nodes')
plt.savefig('bitcoin_network.png')
print('Saved bitcoin network as .png file in bitcoin_network.png')
main()
This should produce, after a couple seconds of waiting, something that looks like Figure 24.4.1:
As you may notice, not all of the nodes are connected with edges to anything. This is because as we remove the nodes they were attached to, they edges disappear but the nodes stay. This problem get less and less noticable as you remove less nodes, as can be seen below with no nodes removed in Figure 24.4.2:
As you can see, the form is fundamentally the same, so don’t worry if you have disconnected nodes. Now, let us analyze what you can learn from this figure. It resembles a hub-and-spoke diagram, with a very concentrated center and edges radiating outward (in practice, many of those “outward” lines go in both directions, but that isn’t super important for this). From what we learned about the way NetworkX graphs weighted networks, we know that the low-weight (and in this case, low-trust) edges are longer than the high-weight ones. This means that in this diagram, those that were found to be untrustworthy tended to only trade once or twice. There are exceptions- for example that node in the lower right with the thick, dark line coming out of it- but for the most part it’s a core of trustworthy people trading within themselves, and occasionally an untrustworthy newcomer will trade once or twice before not coming back. And you can see that when the members of the outer shell traded with each other, those lines are by far the longest- indicating that they mutually distrusted each other and the line got longer because of it.
24.5. Analyzing with Python
Now, let us do some analysis of this data. We started analyzing how the form of the graph showed how people trusted one another in the previous section. Now, let us use python to do that for us.
To start, we will graph the average trust of every node. Note that you do not need to remove any nodes from this dataset because the computer has to do a lot less work per node. We’re going to do this by calculating the average weight of a node’s predecessors, then sorting that list and graphing it. This can be accomplished with:
import networkx as nx
import matplotlib.pyplot as plt
correct_negative_weights = 11
# because the trusts in the bitcoin data range from -10 to 10, to make all of them
# positive, we can add 11. The reason we don't add 10 is that weights of 0 result in
# division by 0, which causes problems.
def main():
bitcoin_network = convert_data_to_network('soc-sign-bitcoinalpha.csv')
show_avg_trust(bitcoin_network)
def convert_data_to_network(filepath):
"""this is a function that will convert data from the .csv file to a NetworkX
variable called bitcoin_network."""
Data = nx.DiGraph()
# remember that we need a directed network because one user is rating the other.
with open(filepath) as data:
for row in data:
row = row.split(',')
row[3].replace('\n', '') # the format of the data adds a \n to the end of every row
for i in range(len(row)):
row[i] = int(row[i])
row[2] = row[2] + correct_negative_weights # weights must be positive. We'll undo this later.
Data.add_edge(row[0], row[1], weight=row[2])
return Data
def get_avg_trust(network, node):
"""given a node, find the average trust assigned to it by other nodes"""
raters = list(network.predecessors(node))
total_rating = 0
for rater in raters:
total_rating += network[rater][node]['weight']
if len(raters) != 0:
total_rating /= len(raters)
# when extracting the data, we added a constant, we now have to subtract it off again.
total_rating -= correct_negative_weights
else:
total_rating = None
return total_rating
def show_avg_trust(network):
"""given a network, find average trust of all nodes in the network and display the
result as a graph"""
avg_trusts = []
for node in network.nodes:
avg_trust = get_avg_trust(network, node)
if avg_trust != None:
avg_trusts.append(avg_trust)
avg_trusts.sort(reverse=True)
fig, ax = plt.subplots()
plt.plot(range(len(avg_trusts)),avg_trusts)
plt.savefig('avg_trust.png')
print('Trust graph saved to .png file called avg_trust.png')
main()
When run, this function should produce something similar to Figure 24.5.1:
Next, let us compare number of interactions with trust. This should be
fairly simple, we can create a list of numbers of interaction and then
sort it using the average trust array as a key. This can be
accomplished by changing the show_avg_trust()
function to:
def show_avg_trust(filepath):
trust_dict = {}
interactions_dict = {}
avg_trusts = []
num_interactions_list = []
Data = convert_data_to_network(filepath)
for node in Data.nodes:
avg_trust = get_avg_trust(Data, node)
if avg_trust != None:
num_interactions = len(list(Data.adj[node]))
trust_dict[node] = avg_trust
interactions_dict[node] = num_interactions
node_order = sorted(trust_dict, key=trust_dict.get, reverse=True)
for node in node_order:
avg_trusts.append(trust_dict[node])
num_interactions_list.append(interactions_dict[node])
fig, axs = plt.subplots(2)
axs[0].plot(range(len(avg_trusts)),avg_trusts)
axs[1].plot(range(len(num_interactions_list)), num_interactions_list)
plt.show()
When run, this will produce Figure 24.5.2:
There’s a lot to be learned from this figure, but something we should look out for first: Given that the flat line at the trust of 1 is perfectly flat, it’s safe to assume something isn’t normal with this group. It’s possible that if a user gave no input for trust, it simply gave a trust of one. This also explains the oddly low number of interactions in this range, as it’s unlikely that someone who traded a lot would never get rated.
With that out of the way, let us move on to the results. As one would expect, the extreme ends had very low numbers of interactions. This can be explained by the fact that it’s difficult to get a perfect 10 or -10, and even a single user giving you a different rating would change your average. If we ignore the dead zones caused by the 1 flat and other similar flats (from nodes 750 to 1000, for example), the data looks almost normally distributed. Again, upon reflection, this makes sense; the more you trade, the more likely that your going to be near the middle of the trust spectrum. Of course this is not a perfect rule, and the user with the most trade actually has a trust rating around 1.5 instead of 0. However, the general pattern is fairly easy to understand.
Moving on to our final analysis, let us make a bar graph with trust on the x axis. To do this, we will make a function that will divide the data up into some number of baskets, like so:
import networkx as nx
import matplotlib.pyplot as plt
correct_negative_weights = 11
# because the trusts in the bitcoin data range from -10 to 10, to make all of them
# positive, we can add 11. The reason we don't add 10 is that weights of 0 result in
# division by 0, which causes problems.
def main():
bitcoin_network = convert_data_to_network('soc-sign-bitcoinalpha.csv')
show_avg_trust_bar(bitcoin_network, 20)
def convert_data_to_network(filepath):
"""this is a function that will convert data from the .csv file to a NetworkX
variable called bitcoin_network."""
bitcoin_network = nx.DiGraph()
# remember that we need a directed network because one user is rating the other.
with open(filepath) as data:
for row in data:
row = row.split(',')
row[3].replace('\n', '')
# the format of the data adds a \n to the end of every row
for i in range(len(row)):
row[i] = int(row[i])
row[2] = row[2] + correct_negative_weights
# weights must be positive. We'll undo this later.
bitcoin_network.add_edge(row[0], row[1], weight=row[2])
return bitcoin_network
def get_avg_trust(network, node):
"""returns the average trust of a given node."""
raters = list(network.predecessors(node))
total_rating = 0
for rater in raters:
total_rating += network[rater][node]['weight']
if len(raters) != 0:
total_rating /= len(raters)
total_rating -= correct_negative_weights
# when extracting the data, we added a constant, we now have to subtract
# it off again
else:
total_rating = None
return total_rating
def show_avg_trust_bar(network, num_baskets):
"""divides weights in the network into a certain number of baskets, then graphs it."""
avg_trusts = []
cutoffs = list(range(-10, 10, int(round(20/num_baskets))))
baskets = []
for i in range(num_baskets):
baskets.append(0)
cutoffs.append(10)
formatted_baskets = []
for i in range(len(cutoffs) - 1):
formatted_baskets.append(str(cutoffs[i]))
for node in network.nodes:
avg_trust = get_avg_trust(network, node)
if avg_trust != None:
for i in range(len(baskets)):
if avg_trust < cutoffs[i + 1] and avg_trust > cutoffs[i]:
baskets[i] += 1
fig, ax = plt.subplots()
plt.bar(formatted_baskets,baskets)
plt.savefig('avg_trusts_bar.png')
print('Trust bar graph saved to .png file avg_trusts_bar.png')
main()
When run, this produces Figure 24.5.3:
At first glance, it may appear that this data is also skewed by the
plateau at 1. However, if you look closely, there are around 800 nodes
in the 1 - 2 trust rating basket, and almost 2000 in the plateau. The
reason for the nodes not being included in the data lies in line 15 of
show_avg_trust_bar()
; neither operator has a ‘=’ in it, meaning if
a node has exactly the value of one of the cutoffs, the set won’t
include it. This is negligible for good data, because the chances of a
trust score landing on exactly an integer are so low it doesn’t affect
the data. However, the bad data all equals exactly one. This means the
bar graph just skips over it and ignores it.
Moving on to an actual analysis of this graph, it looks like a normal distribution with a very low standard deviation, with a mean somewhere between 1 and 2. It’s not perfect, for example it trails off faster to the right than left, possibly indicating a general atmosphere of distrust on this website. This shape can be explained by considering most people on the website only traded a couple times, and it’s hard to get a reputation as trustworthy or untrustworthy that fast. Of course, the extreme edges are also mainly from people who traded once or twice, for the reason that someone might decide that they found someone really trustworth or untrustworthy. That’s just human nature messing with the data, though, and the outliers don’t even appear on the bar graph.
24.6. Conclusion
In this mini course, we learned to use NetworkX by making a drawing simple graphs. We then moved on to downloading and exploring a larger data set, one from an early bitcoin network. It contained directed and weighted edges, over 21,000 of them.
We then tried to develop an intuition for interpreting what the graph could indicate about the data, and then did more rigorous analysis with various matplotlib plotting routines. By the end, you hopefully have a better understanding of both NetworkX and data analysis with NetworkX.
It is important to note that in the final section we moved away from directly using NetworkX, and used it as an intermediary in the analysis steps. However, to even try to do the analysis, NetworkX is an invaluable tool. It gives a rough idea of what you can analyze, and also give insights into the dataset’s structure.