22. Introduction to NetworkX Graphs

Section author: Malcolm Smith <msmith.malcolmsmith@gmail.com>

[status: first draft]

22.1. Motivation, Prerequisites, Plan

Motivation

As you go about your life, you will often take for granted the vast networks you are a part of. From your local art club (or your workplace, or school, etc.) to the entire world, we all live in relation to one another. To make sense in rigorous terms of this, mathematicians use what is known as network theory. Understanding and using it is a very valuable skill to have, as it allows easy visual understanding of any sort of data or network you’re working with.

Situations in which it might arise include modeling economic markets, where interactions are recorded between people, or in A.I. research, in which a neural network (a system designed to emulate the way our brains work) is a very specialized case of a mathematical network. In general, network theory has a wide range on applications, and having a basic understanding of it will be helpful to anyone who wants to gain a deeper insight into their data.

Prerequisites

  • The 10-hour “serious programming” course.

  • The “Data files and first plots” mini-course in Section 2.

  • Having NetworkX and Matplotlib, a very powerful and versatile programming library, installed. You can install them with:

    $ sudo apt install pip
    $ pip install --user --upgrade networkx[default]
    $ pip install --user --upgrade matplotlib
    
  • For the later parts of the course, you need wget and gzip. You may need to download or upgrade wget and gzip, and you can make sure you have the latest versions with:

    $ sudo apt install wget gzip
    

Plan

We’re going to start off with learning what a network is with the Bridges of Konigsberg, as well as the basics of NetworkX, such as adding nodes and edges as well as graphing the results. We will then move on to directed and weighted networks, which will come into play for the final step. After learning these functions, the next thing we will do is download and extract the data from an online database. we will analyze the data next, trying to see patterns in the distribution of the data.

22.2. First Steps

Our first step is to make sure all the required libraries are installed properly.

Open a python 3 prompt and check to see if you properly downloaded NetworkX and Matplotlib.

$ python3
>>> import networkx as nx
>>> G = nx.Graph()
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()

Don’t worry about what that means right now; just know it’s working if there isn’t an error message.

After getting the libraries squared away, let us explore what a mathematical network actually is, and some familiar examples. A network is really a simple object, with lines, or edges, going between points, or nodes. The famous mathematical problem known as the Bridges of Konigsberg (which prompted Euler’s early work in graph theory) is easily represented as a network. For those who don’t know of the Bridges of Konigsberg, here is a picture of it:

../_images/bridges.png

Figure 22.2.1 The Bridges of Konigsberg.

This can be represented as a network with 4 nodes and 7 edges. The edges, as you may have guessed, are the bridges, and the nodes are the landmasses. One thing to note about this is that there are two sets of repeat bridges. This does not disqualify it from being a network, but it isn’t a type of network we will see in this chapter. There is a section where we have edges going between the same nodes twice, but that’s in a network in which direction matters, and thus the edges are not identical. In this case, there are two sets of identical edges, a case we will not discuss in this chapter.

Next we’re going to make a very simple network using NetworkX. To do this, let us first walk through some basic NetworkX commands.

  • some_network = nx.Graph() will create a network called some_network.

  • some_network.add_node(some_node) adds the node some_node to the network some_network.

  • some_network.add_nodes_from([node1, node2, node3]) will add all nodes inside the array in the input to some_network. It can also be used with another network.

  • some_network.add_edge(node1, node2) will add an edge from node1 to node2 in some_network. Note that if node1 or node2 is not in the network the command will add them.

  • some_network.add_edges_from([[node1, node2], [node3, node2]]) will add the dges contained in the nested array. It can also be used with another network. Like .add_edge(), it will add any nodes it contains that are not already present in the network.

  • some_network.nodes will return a tuple with all the nodes in some_network in it.

  • some_network.edges will return an array with all the edges in some_network as tuples inside it.

  • nx.draw(some_network) will allow matplotlib to draw some_network in its current state.

With that out of the way, let us create a simple network. To start, we will use a network with 5 nodes and 7 edges.

>>> import networkx as nx
>>> import matplotlib.pyplot as plt
>>> G = nx.Graph()
>>> G.add_nodes_from(range(5))
>>> G.add_edges_from([[0,1],[1,2],[2,3],[3,4],[4,0],[0,3],[1,4]])
>>> nx.draw(G, with_labels=True)
>>> plt.show()

In this case, the with_labels keyword in the second to last line tells the program to draw the network with the numbers on the nodes. If you’ve done every thing correctly, an image like figure 22.2.2 should pop up:

../_images/simple_network.svg

Figure 22.2.2 First visualization of network.

This doesn’t look very nice, though. To fix this, we need a bit more code:

>>> fig, ax = plt.subplots() # necessary for using a title on the graph
>>> nx.draw_circular(G, ax=ax, with_labels=True)
>>> ax.set_title('Simple network')
>>> plt.show()

This should produce the following:

../_images/circular_network.svg

Figure 22.2.3 Network with title.

22.3. Directed and Weighted Networks

Now that we understand the basics of using NetworkX, let us use it for something slightly more complex. Using a directed network is also the next step towards making the complex network at the end.

You can use the same code as before, but replace G = nx.Graph() with G = nx.DiGraph(). That should produce something like figure 22.3.1:

../_images/Directed_network.svg

Figure 22.3.1 Arrows show the direction of the edge.

To add weights, we can make a new graph in a very similar manner:

>>> G.clear()
>>> G = nx.DiGraph()
>>> G.add_nodes_from(range(5))
>>> G.add_weighted_edges_from([[0,1,2],[1,2,1],[2,3,1],[3,4,3],[4,0,10],[0,3,0.5],[1,4,0.1]])
>>> fig, ax = plt.subplots()
>>> nx.draw(G, ax=ax, with_labels=True)
>>> ax.set_title('Weighted and Directed Network')
>>> plt.show()

The .add_weighted_edges_from() function is very similar to .add_edges_from(), but it take an array of arrays with three elements, rather than two. The third is the weight, which must be a positive number.

../_images/W_and_D_network.svg

Figure 22.3.2 The higher the weight, the shorter the line between nodes.

As you can see, the high weight edge from 4 to 0 is by far the shortest in the image, follow by 0 to 1 and 3 to 4. 0 to 3 and 1 to 4 should be far and away the longest, but because the network has certain constraints placed on its shape, they end up being more or less the same length as 1 to 2 and 2 to 3.

Now, the next question is obvious. What happens when two edges going opposite directions between the same node are weighted? For instance, what would happen if we added an edge going from 0 to 4 with a weight of 0.1? The code to add this is G.add_edge(0, 4, weight=0.1). Note that because we are just adding one edge, we don’t have access to the .add_weighted_edges_from() function and have to manually specificy that we want to add a weight. The resulting graph is pictured below:

../_images/BWN.svg

Figure 22.3.3 A network with a double edge between nodes 0 and 4.

While the [0,4] edge is slightly longer than before, the real change is in all the other lengths of edges. This is because the nx.draw() function models the edges as springs, and now there are two springs in that space, which affects the other springs around it. To explain what this means, imagine a simple scenario where you have two points (with no mass) connected by a spring. If you let it come to a resting state, those two points will be some distance apart. If we add a thrid point, though, and connect it to the other two points with springs as well, something new will happen. Unless we choose the precise parameters for the springs (in the case of NetworkX these parameters come from the weight), no matter what at least one of the springs will be either stretched or compressed. This is, in effect, how NetworkX models networks. This is computationally intensive, as you might imagine, but as we will see later it can give valuable insights into network structure.

22.4. Using Larger Data Sets

Now that most of the basics are out of the way, let us write a script that can get useful information from large data sets with thousands of nodes. If you have a powerful computer, you should be able to graph the whole network with little trouble. However, if your computer is less powerful, you should use the optional code that samples a smaller version of the data set.

To start, we need a dataset. The one we will be using is a network of bitcoin users rating how trustworthy they find another person. It’s directed and weighted, with some edges that go both directions. The page where the data is downloadable is here: https://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html , but don’t feel the need to download the data right now as we will be doing that in a later step.

First, you’re going to want to make a directory to house the data. To do this open a new terminal and run:

$ mkdir NetworkX-intro
$ cd NetworkX-intro

Of course, feel free to call your directory whatever you want. However, in the future, we will reference as NetworkX-intro/ , so make sure you change that if you have a different directory name. From now on, make sure you save your python files to this directory as well. Then, download and extract the data with:

$ wget https://snap.stanford.edu/data/soc-sign-bitcoinalpha.csv.gz
$ gunzip soc-sign-bitcoinalpha.csv.gz

If, for whatever reason, Stanford ever takes down their database of datasets, the same dataset can be downloaded (to be put in the wget command) from the wayback machine archive at this link .

If you run ls now your directiory should contain the file soc-sign-bitcoinalpha.csv . Now open your prefered programming editor and make a new python file. we will call it graph_bitcoin_data.py. At the top, import the necessary libraries with:

import networkx as nx
import matplotlib.pyplot as plt

Now we can attempt to extract the data from the .csv file. To do this, we can use the following code:

import networkx as nx
import matplotlib.pyplot as plt

with open('soc-sign-bitcoinalpha.csv') as bitcoin_data_unformatted:
    for row in bitcoin_data_unformatted:
        print(row)

This should output a huge list of arrays with strings inside them, something like ['1234,4321,1,1234567890']. Now, let us put this in a network called bitcoin_network with that third column as weights:

with open('soc-sign-bitcoinalpha.csv') as bitcoin_data_unformatted:
    for line in bitcoin_data_unformatted:
        words = line.split(',')
        row = []
        for i in range(len(words)):
            row.append(int(words[i].strip()))
        bitcoin_network.add_edge(row[0], row[1], weight=row[2])

If you run the program now, it should produce a huge list of arrays containing numbers, something like [1234, 4321, 1, 1234567890]. Now that we have it properply formatted, let us talk about what this data actually is. The first and second numbers represent users of the bitcoin network, and that’s going to be an edge in our network later on. The third number is an integer rating from -10 to 10 of how much the first person trusts the second. This will be used as a weight in our network. The fourth is just a timestamp that we will ignore. Now, let us put all this data into a network.

Before moving on to the final program, we should address the size of the file. 21,000 edges is a LOT. This means it’s hard for your computer to render (unless you happen to have a high end gaming computer or something similar). This means some of the edges have to be removed. We could iterate through them and remove them at random, but a more efficient method would be to remove some nodes. This is because removing a node will also remove all the edges attached to it, and there are a lot less node to run through. Some example code below illustrates how we might remove 1500 nodes from a network:

import random

nodes = list(bitcoin_network.nodes)
random.shuffle(nodes)
for i in range(1500):
    network.remove_node(nodes[i])

Combining all of this, plus a main() function to tie it all together, we get listing Listing 22.4.1:

Listing 22.4.1 visualize_large_network.py - show data with some number of nodes removed
import networkx as nx
import matplotlib.pyplot as plt
import random # for remove_nodes()

# because the trusts in the bitcoin data range from -10 to 10, to make all of them positive, we can
# add 11. The reason we don't add 10 is that weights of 0 result in division by 0, which causes
# problems.
correct_negative_weights = 11

def main():
    # make sure .csv file is in the same directory as this .py file
    bitcoin_network = convert_data_to_network('soc-sign-bitcoinalpha.csv')
    # 3000 is almost certainly more than you'll have to remove, but it's a good starting place.
    bitcoin_network = remove_nodes(3000, bitcoin_network)
    show_data(bitcoin_network)

def convert_data_to_network(filepath):
    """this is a function that will convert data from the .csv file to a NetworkX variable called
    bitcoin_network."""
    # remember that we need a directed network because one user is rating the other.
    bitcoin_network = nx.DiGraph()
    with open(filepath) as bitcoin_data_unformatted:
        for row in bitcoin_data_unformatted:
            row = row.split(',')
            row[3].replace('\n', '') # the format of the data adds a \n to the end of every row
            for i in range(len(row)):
                row[i] = int(row[i])
            row[2] = row[2] + correct_negative_weights # weights must be positive. We'll undo this later.
            bitcoin_network.add_edge(row[0], row[1], weight=row[2])
            
    return bitcoin_network

def remove_nodes(num_to_remove, network):
    """this is a very large network, and for the sake of your computer we'll remove some of the
    nodes and edges."""
    nodes = list(network.nodes)
    random.shuffle(nodes)
    for i in range(num_to_remove):
        network.remove_node(nodes[i])

    return network

def show_data(network):
    """simple function that will display the bitcoin network using matplotlib."""
    fig, ax = plt.subplots()
    # with this many edges arrows just muddle the graph and make it hard to read
    nx.draw(network, ax=ax, arrows=False, node_size=1, alpha=0.4, width=0.15)
    ax.set_title('Bitcoin network with ' + str(network.number_of_nodes()) + ' nodes')
    plt.savefig('bitcoin_network.png')
    print('Saved bitcoin network as .png file in bitcoin_network.png')
    
main()

This should produce, after a couple seconds of waiting, something that looks like Figure 22.4.1:

../_images/simple_bitcoin_network.svg

Figure 22.4.1 A stripped-down visualization of the data.

As you may notice, not all of the nodes are connected with edges to anything. This is because as we remove the nodes they were attached to, they edges disappear but the nodes stay. This problem get less and less noticable as you remove less nodes, as can be seen below with no nodes removed in figure Figure 22.4.2:

../_images/full_bitcoin_network.svg

Figure 22.4.2 All the data, graphed.

As you can see, the form is fundamentally the same, so don’t worry if you have disconnected nodes. Now, let us analyze what you can learn from this figure. It resembles a hub-and-spoke diagram, with a very concentrated center and edges radiating outward (in practice, many of those “outward” lines go in both directions, but that isn’t super important for this). From what we learned about the way NetworkX graphs weighted networks, we know that the low-weight (and in this case, low-trust) edges are longer than the high-weight ones. This means that in this diagram, those that were found to be untrustworthy tended to only trade once or twice. There are exceptions- for example that node in the lower right with the thick, dark line coming out of it- but for the most part it’s a core of trustworthy people trading within themselves, and occasionally an untrustworthy newcomer will trade once or twice before not coming back. And you can see that when the members of the outer shell traded with each other, those lines are by far the longest- indicating that they mutually distrusted each other and the line got longer because of it.

22.5. Analyzing with Python

Now, let us do some analysis of this data. We started analyzing how the form of the graph showed how people trusted one another in the previous section. Now, let us use python to do that for us.

To start, we will graph the average trust of every node. Note that you do not need to remove any nodes from this dataset because the computer has to do a lot less work per node. We’re going to do this by calculating the average weight of a node’s predecessors, then sorting that list and graphing it. This can be accomplished with:

Listing 22.5.1 show_avg_trust.py - create a line graph of each user’s average trust.
import networkx as nx
import matplotlib.pyplot as plt

# because the trusts in the bitcoin data range from -10 to 10, to make all of them positive, we can
# add 11. The reason we don't add 10 is that weights of 0 result in division by 0, which causes
# problems.
correct_negative_weights = 11

def main():
    bitcoin_network = convert_data_to_network('soc-sign-bitcoinalpha.csv')
    show_avg_trust(bitcoin_network)

def convert_data_to_network(filepath):
    """this is a function that will convert data from the .csv file to a NetworkX variable called
    bitcoin_network."""
    Data = nx.DiGraph() # remember that we need a directed network because one user is rating the other.
    with open(filepath) as data:
        for row in data:
            row = row.split(',')
            row[3].replace('\n', '') # the format of the data adds a \n to the end of every row
            for i in range(len(row)):
                row[i] = int(row[i])
            row[2] = row[2] + correct_negative_weights # weights must be positive. We'll undo this later.
            Data.add_edge(row[0], row[1], weight=row[2])
            
    return Data

def get_avg_trust(network, node):
    raters = list(network.predecessors(node))
    total_rating = 0
    for rater in raters:
        total_rating += network[rater][node]['weight']
    if len(raters) != 0:
        total_rating /= len(raters)
        # when extracting the data, we added a constant, we now have to subtract it off again.
        total_rating -= correct_negative_weights
    else:
        total_rating = None
    return total_rating
        
def show_avg_trust(network):
    avg_trusts = []
    for node in network.nodes:
        avg_trust = get_avg_trust(network, node)
        if avg_trust != None:
            avg_trusts.append(avg_trust)
    avg_trusts.sort(reverse=True)
    fig, ax = plt.subplots()
    plt.plot(range(len(avg_trusts)),avg_trusts)
    plt.savefig('avg_trust.png')
    print('Trust graph saved to .png file called avg_trust.png')
    
        
main()

When run, this function should produce something similar to figure Figure 22.5.1:

../_images/average_trust.svg

Figure 22.5.1 A line graph of each user’s average trust.

Next, let us compare number of interactions with trust. This should be fairly simple, we can create a list of numbers of interaction and then sort it using the average trust array as a key. This can be accomplished by changing the show_avg_trust() function to:

def show_avg_trust(filepath):
    trust_dict = {}
    interactions_dict = {}
    avg_trusts = []
    num_interactions_list = []

    Data = convert_data_to_network(filepath)
    for node in Data.nodes:
        avg_trust = get_avg_trust(Data, node)
        if avg_trust != None:
        num_interactions = len(list(Data.adj[node]))
        trust_dict[node] = avg_trust
        interactions_dict[node] = num_interactions

    node_order = sorted(trust_dict, key=trust_dict.get, reverse=True)
    for node in node_order:
        avg_trusts.append(trust_dict[node])
        num_interactions_list.append(interactions_dict[node])

    fig, axs = plt.subplots(2)
    axs[0].plot(range(len(avg_trusts)),avg_trusts)
    axs[1].plot(range(len(num_interactions_list)), num_interactions_list)
    plt.show()

When run, this will produce the following figure:

../_images/trust_with_interactions.svg

Figure 22.5.2 The average trust (top) and the number of interactions that node has (bottom).

There’s a lot to be learned from this figure, but something we should look out for first: Given that the flat line at the trust of 1 is perfectly flat, it’s safe to assume something isn’t normal with this group. It’s possible that if a user gave no input for trust, it simply gave a trust of one. This also explains the oddly low number of interactions in this range, as it’s unlikely that someone who traded a lot would never get rated.

With that out of the way, let us move on to the results. As one would expect, the extreme ends had very low numbers of interactions. This can be explained by the fact that it’s difficult to get a perfect 10 or -10, and even a single user giving you a different rating would change your average. If we ignore the dead zones caused by the 1 flat and other similar flats (from nodes 750 to 1000, for example), the data looks almost normally distributed. Again, upon reflection, this makes sense; the more you trade, the more likely that your going to be near the middle of the trust spectrum. Of course this is not a perfect rule, and the user with the most trade actually has a trust rating around 1.5 instead of 0. However, the general pattern is fairly easy to understand.

Moving on to our final analysis, let us make a bar graph with trust on the x axis. To do this, we will make a function that will divide the data up into some number of baskets, like so:

Listing 22.5.2 show_avg_trust_bar.py - creates a bar graph of data with specified baskets.
import networkx as nx
import matplotlib.pyplot as plt

correct_negative_weights = 11 # because the trusts in the bitcoin data range from -10 to 10, to make all of them positive, we can add 11. The reason we don't add 10 is that weights of 0 result in division by 0, which causes problems.

def main():
    bitcoin_network = convert_data_to_network('soc-sign-bitcoinalpha.csv')
    show_avg_trust_bar(bitcoin_network, 20)

def convert_data_to_network(filepath): # this is a function that will convert data from the .csv file to a NetworkX variable called bitcoin_network.
    bitcoin_network = nx.DiGraph() # remember that we need a directed network because one user is rating the other.
    with open(filepath) as data:
        for row in data:
            row = row.split(',')
            row[3].replace('\n', '') # the format of the data adds a \n to the end of every row
            for i in range(len(row)):
                row[i] = int(row[i])
            row[2] = row[2] + correct_negative_weights # weights must be positive. We'll undo this later.
            bitcoin_network.add_edge(row[0], row[1], weight=row[2])
            
    return bitcoin_network

def get_avg_trust(network, node): # returns the average trust of a given node.
    raters = list(network.predecessors(node))
    total_rating = 0
    for rater in raters:
        total_rating += network[rater][node]['weight']
    if len(raters) != 0:
        total_rating /= len(raters)
        total_rating -= correct_negative_weights # when extracting the data, we added a constant, we now have to subtract it off again
    else:
        total_rating = None
    return total_rating
        
def show_avg_trust_bar(network, num_baskets): # divides weights in the network into a certain number of baskets, then graphs it.
    avg_trusts = []
    cutoffs = list(range(-10, 10, int(round(20/num_baskets))))
    baskets = []
    for i in range(num_baskets):
        baskets.append(0)
    cutoffs.append(10)
    formatted_baskets = []
    for i in range(len(cutoffs) - 1):
        formatted_baskets.append(str(cutoffs[i]))
    for node in network.nodes:
        avg_trust = get_avg_trust(network, node)
        if avg_trust != None:
            for i in range(len(baskets)):
                if avg_trust < cutoffs[i + 1] and avg_trust > cutoffs[i]:
                    baskets[i] += 1
    
    fig, ax = plt.subplots()
    plt.bar(formatted_baskets,baskets)
    plt.savefig('avg_trusts_bar.png')
    print('Trust bar graph saved to .png file avg_trusts_bar.png')
    
        
main()

When run, this produces figure 22.5.3:

../_images/trust_bar.svg

Figure 22.5.3 A bar graph of bitcoin trust levels with 20 baskets.

At first glance, it may appear that this data is also skewed by the plateau at 1. However, if you look closely, there are around 800 nodes in the 1 - 2 trust rating basket, and almost 2000 in the plateau. The reason for the nodes not being included in the data lies in line 15 of show_avg_trust_bar(); neither operator has a ‘=’ in it, meaning if a node has exactly the value of one of the cutoffs, the set won’t include it. This is negligible for good data, because the chances of a trust score landing on exactly an integer are so low it doesn’t affect the data. However, the bad data all equals exactly one. This means the bar graph just skips over it and ignores it.

Moving on to an actual analysis of this graph, it looks like a normal distribution with a very low standard deviation, with a mean somewhere between 1 and 2. It’s not perfect, for example it trails off faster to the right than left, possibly indicating a general atmosphere of distrust on this website. This shape can be explained by considering most people on the website only traded a couple times, and it’s hard to get a reputation as trustworthy or untrustworthy that fast. Of course, the extreme edges are also mainly from people who traded once or twice, for the reason that someone might decide that they found someone really trustworth or untrustworthy. That’s just human nature messing with the data, though, and the outliers don’t even appear on the bar graph.

22.6. Conclusion

In this mini course, we learned to use NetworkX by making a drawing simple graphs. We then moved on to downloading and exploring a larger data set, one from an early bitcoin network. It contained directed and weighted edges, over 21,000 of them.

We then tried to develop an intuition for interpreting what the graph could indicate about the data, and then did more rigorous analysis with various matplotlib plotting routines. By the end, you hopefully have a better understanding of both NetworkX and data analysis with NetworkX.

It is important to note that in the final section we moved away from directly using NetworkX, and used it as an intermediary in the analysis steps. However, to even try to do the analysis, NetworkX is an invaluable tool. It gives a rough idea of what you can analyze, and also give insights into the dataset’s structure.