greatx.utils

Progbar

A progress bar for display.

BunchDict

Container object for datasets: a dictionary-like object that exposes its keys as attributes and remembers insertion order.

CKA

Centered Kernel Alignment (CKA) metric, where the features of the networks are compared.

topk

Returns the k largest/smallest elements and corresponding indices from an array-like input.

wrapper

Wrap a function to make some arguments have the same length.

repeat

Repeat any objects and return iterable ones.

split_nodes

Randomly split a set of nodes labeled with labels.

split_nodes_by_classes

Randomly split the training data by the number of nodes per class.

setup_logger

Initialize the GreatX logger and set its verbosity level to "DEBUG".

get_logger

Get a logger for a given name.

singleton_filter

Filter edges that, if removed, would turn one or more nodes into singleton nodes.

SingletonFilter

Computes a mask for entries potentially leading to singleton nodes, i.e., entries where one of the two corresponding nodes has degree 1 and an edge exists between the two nodes.

LikelihoodFilter

Likelihood filter from the "Adversarial Attacks on Neural Networks for Graph Data" paper (KDD'18)

singleton_mask

Computes a mask for entries potentially leading to singleton nodes, i.e., entries where one of the two corresponding nodes has degree 1 and an edge exists between the two nodes.

add_edges

Add edges to the graph denoted as edge_index.

remove_edges

Remove edges from the graph denoted as edge_index.

flip_edges

Flip edges from the graph denoted as edge_index.

flip_graph

Flip edges from the graph denoted as data.

scipy_normalize

Normalize a sparse matrix according to GCN from the "Semi-supervised Classification with Graph Convolutional Networks" paper (ICLR'17)

normalize

Feature normalization function.

overlap

Compute graph overlapping according to the "Node Similarity Preserving Graph Convolutional Networks" paper (WSDM'21)

ego_graph

Returns induced subgraph of neighbors centered at node n within a given radius.

MissingFeature

Implementation of MissingFeature from the "On the Unreasonable Effectiveness of Feature propagation in Learning on Graphs with Missing Node Features" paper (ICLR'21)

mark

Marks the predicted classes and corresponding probability for a better view.

class Progbar(target: int, width: int = 20, verbose: int = 1, interval: float = 0.05, unit_name: str = 'step')[source]

A progress bar for display.

Parameters:
  • target (int) – total number of steps expected.

  • width (int, optional) – progress bar width on screen, by default 20

  • verbose (int, optional) – verbosity mode, 0 (silent), 1 (verbose), 2 (semi-verbose), by default 1

  • interval (float, optional) – minimum visual progress update interval (in seconds), by default 0.05

  • unit_name (str, optional) – display name for step counts (usually “step” or “sample”), by default ‘step’

Example

>>> from greatx.utils import Progbar
>>> pbar = Progbar(5)
>>> for i in range(5):
...     pbar.add(1, msg=f'current number {i}')
5/5 [===============] - Total: 3.22ms - 643us/step- current number 4
>>> pbar = Progbar(5)
>>> for i in range(5):
...     pbar.update(i+1, msg=f'current number {i}')
5/5 [===============] - Total: 3.22ms - 643us/step- current number 4
update(current: int, msg: Optional[Union[str, List, Tuple]] = None, finalize: Optional[bool] = None)[source]

Updates the progress bar using current value.

Parameters:
  • current (int) – index of current step

  • msg (Optional[Union[str, List, Tuple]], optional) – (name, value_for_last_step) or string messages, by default None

  • finalize (Optional[bool], optional) – whether this is the last update for the progress bar. If None, defaults to current >= self.target, by default None

Raises:

ValueError – invalid message msg for progress bar.

add(n: int, msg: Optional[Union[str, List, Tuple]] = None)[source]

Add n steps to the progress bar.

Parameters:
  • n (int) – number of steps to add to the progress bar

  • msg (Optional[Union[str, List, Tuple]], optional) – (name, value_for_last_step) or string messages, by default None

static format_num(n: int) str[source]

Intelligent scientific notation (.3g).

Parameters:

n (int or float or Numeric) – a Number.

Returns:

out – Formatted number.

Return type:

str

class BunchDict(*args, **kwargs)[source]

Container object for datasets: a dictionary-like object that exposes its keys as attributes and remembers insertion order.

Examples

>>> b = BunchDict(a=1, b=2)
>>> b
Objects in BunchDict:
╒═════════╤═══════════╕
│ Names   │   Objects │
╞═════════╪═══════════╡
│ a       │         1 │
├─────────┼───────────┤
│ b       │         2 │
╘═════════╧═══════════╛
>>> b['b']
2
>>> b.b
2
>>> b.a = 3
>>> b['a']
3
>>> b.c = 6
>>> b['c']
6
>>> # Converting objects in BunchDict to `torch.Tensor` if possible.
>>> b = BunchDict(a=[1,2,3])
>>> b.to_tensor()
Objects in BunchDict:
╒═════════╤═══════════════════════════════╕
│ Names   │ Objects                       │
╞═════════╪═══════════════════════════════╡
│ a       │ Tensor, shape=torch.Size([3]) │
│         │ tensor([1, 2, 3])             │
╘═════════╧═══════════════════════════════╛
>>> b.a
tensor([1, 2, 3])
to_tensor(device: str = 'cpu', dtype=None) BunchDict[source]

Convert objects in BunchDict to torch.Tensor

Parameters:
  • device (str, optional) – device of the converted tensors, by default ‘cpu’

  • dtype (torch.dtype, optional) – data type of the converted tensors, by default None

Returns:

the converted BunchDict

Return type:

BunchDict

class CKA(model1: Module, model2: Module, model1_name: Optional[str] = None, model2_name: Optional[str] = None, model1_layers: Optional[List[str]] = None, model2_layers: Optional[List[str]] = None, training: bool = False, device: str = 'cpu')[source]

Centered Kernel Alignment (CKA) metric, where the features of the networks are compared.

Parameters:
  • model1 (nn.Module) – model 1

  • model2 (nn.Module) – model 2

  • model1_name (str, optional) – name of model 1, by default None

  • model2_name (str, optional) – name of model 2, by default None

  • model1_layers (List[str], optional) – List of layers to extract features from, by default None

  • model2_layers (List[str], optional) – List of layers to extract features from, by default None

  • training (bool, optional) – whether to set training mode (True) or evaluation mode (False) for the models, by default False

  • device (str, optional) – device to run the models, by default ‘cpu’

Example

data = ... # get your graph
m1 = ... # get your model1
m2 = ... # get your model2
cka = CKA(m1, m2)
cka.compare(data)
cka.plot_results()

compare(data1: Data, data2: Data = None) None[source]

Computes the feature similarity between the models on the given datasets.

Parameters:
  • data1 (Data) – the dataset that model 1 runs on.

  • data2 (Data, optional) – if given, model 2 will run on this dataset, by default None

export() Dict[source]

Exports the CKA data along with the respective model layer names.

plot_results(save_path: Optional[str] = None, title: Optional[str] = None)[source]

topk(array: ndarray, k: int, largest: bool = True) topk_values_indices[source]

Returns the k largest/smallest elements and corresponding indices from an array-like input.

Parameters:
  • array (np.ndarray or list) – the array-like input

  • k (int) – the k in “top-k”

  • largest (bool, optional) – controls whether to return largest or smallest elements

Returns:

Returns the k largest/smallest elements and corresponding indices of the given array

Return type:

namedtuple[values, indices]

Example

>>> array = [5, 3, 7, 2, 1]
>>> topk(array, 2)
topk_values_indices(values=array([7, 5]), indices=array([2, 0], dtype=int64))
>>> topk(array, 2, largest=False)
topk_values_indices(values=array([1, 2]), indices=array([4, 3], dtype=int64))
>>> array = [[1, 2], [3, 4], [5, 6]]
>>> topk(array, 2)
topk_values_indices(values=array([6, 5]), indices=(array([2, 2], dtype=int64), array([1, 0], dtype=int64)))
wrapper(func: Callable) Callable[source]

Wrap a function to make some arguments have the same length. By default, the arguments to be modified are hids and acts.

Users can customize these arguments by setting:

  • includes : to include custom arguments

  • excludes : to exclude custom arguments

  • length_as : to make the length of the arguments the same as that of length_as, which is hids by default.

Parameters:

func (Callable) – a function to be wrapped.

Returns:

a wrapped function.

Return type:

Callable

Raises:

TypeError – if a required argument of the function is missing.

Example

>>> @wrapper
... def func(hids=[16], acts=None):
...     print(locals())
>>> func(100)
{'hids': [100], 'acts': [None]}
>>> func([100, 64])
{'hids': [100, 64], 'acts': [None, None]}
>>> func([100, 64], excludes=['acts'])
{'hids': [100, 64], 'acts': None}
>>> @wrapper
... def func(self, hids=[16], acts=None):
...     print(locals())
>>> func()
TypeError: The decorated function 'func' missing required argument 'self'.
>>> func('class_itself')
{'self': 'class_itself', 'hids': [16], 'acts': [None]}
>>> func('class_itself', hids=[])
{'self': 'class_itself', 'hids': [], 'acts': []}
>>> @wrapper
... def func(self, hids=[16], acts=None, heads=8):
...     print(locals())
>>> func('class_itself', hids=[100, 200])
{'self': 'class_itself', 'hids': [100, 200], 'acts': [None, None], 'heads': 8}
>>> func('class_itself', hids=[100, 200], includes=['heads'])
{'self': 'class_itself', 'hids': [100, 200], 'acts': [None, None], 'heads': [8, 8]}
repeat(src: Any, length: Optional[int] = None) Any[source]

Repeat any objects and return iterable ones.

Parameters:
  • src (Any) – any objects

  • length (Optional[int], optional) – the length to be repeated. If None, iterable objects are returned as-is and non-iterable objects are wrapped into a list of length 1, by default None

Returns:

the iterable repeated object

Return type:

Any

Example

>>> from greatx.utils import repeat
>>> # repeat for a single non-iterable object
>>> repeat(1)
[1]
>>> repeat(1, 3)
[1, 1, 1]
>>> repeat('relu', 2)
['relu', 'relu']
>>> repeat(None, 2)
[None, None]
>>> # repeat for iterable object
>>> repeat([1, 2, 3], 2)
[1, 2]
>>> repeat([1, 2, 3], 5)
[1, 2, 3, 3, 3]
split_nodes(labels: Tensor, *, train: float = 0.1, test: float = 0.8, val: float = 0.1, random_state: Optional[int] = None) BunchDict[source]

Randomly split a set of nodes labeled with labels.

Parameters:
  • labels (torch.Tensor) – the labels of the nodes.

  • train (float, optional) – the percentage of the training set, by default 0.1

  • test (float, optional) – the percentage of the test set, by default 0.8

  • val (float, optional) – the percentage of the validation set, by default 0.1

  • random_state (Optional[int], optional) – random seed for the random number generator, by default None

Returns:

  • train_nodes: torch.Tensor with Size [train * num_nodes]

    The indices of the training nodes

  • val_nodes: torch.Tensor with Size [val * num_nodes]

    The indices of the validation nodes

  • test_nodes: torch.Tensor with Size [test * num_nodes]

    The indices of the test nodes

Return type:

BunchDict with the following items
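
Example

A usage sketch with hypothetical labels; the sizes shown assume the default 0.1/0.1/0.8 split of 100 nodes rounds exactly:

>>> import torch
>>> from greatx.utils import split_nodes
>>> labels = torch.randint(0, 3, (100,))   # 100 toy node labels
>>> splits = split_nodes(labels, random_state=42)
>>> splits.train_nodes.numel(), splits.val_nodes.numel(), splits.test_nodes.numel()
(10, 10, 80)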

split_nodes_by_classes(labels: Tensor, n_per_class: int = 20, random_state: Optional[int] = None) BunchDict[source]

Randomly split the training data by the number of nodes per class.

Parameters:
  • labels (torch.Tensor [num_nodes]) – The class labels

  • n_per_class (int) – Number of samples per class

  • random_state (Optional[int]) – Random seed

Returns:

  • train_nodes: torch.Tensor with Size [n_per_class * num_classes]

    The indices of the training nodes

  • val_nodes: torch.Tensor with Size [n_per_class * num_classes]

    The indices of the validation nodes

  • test_nodes: torch.Tensor with Size [num_nodes - 2*n_per_class * num_classes]

    The indices of the test nodes

Return type:

BunchDict with the following items
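
Example

A usage sketch with hypothetical labels; it assumes every class has enough nodes for both the train and validation splits to be drawn:

>>> import torch
>>> from greatx.utils import split_nodes_by_classes
>>> labels = torch.randint(0, 7, (1000,))  # toy labels over 7 classes
>>> splits = split_nodes_by_classes(labels, n_per_class=20, random_state=42)
>>> splits.train_nodes.numel()   # 20 nodes for each of the 7 classes
140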

setup_logger(output: Optional[str] = None, name: str = 'GreatX', *, distributed_rank: int = 0, mode: str = 'w', color: bool = True, abbrev_name: Optional[str] = None) Logger[source]

Initialize the GreatX logger and set its verbosity level to “DEBUG”.

Parameters:
  • output (Optional[str], optional) – a file name or a directory to save log. If None, will not save log file. If ends with “.txt” or “.log”, assumed to be a file name. Otherwise, logs will be saved to {output}/log.txt.

  • name (str, optional) – the root module name of this logger, by default “GreatX”

  • distributed_rank (int, optional) – used for distributed training, by default 0

  • mode (str, optional) – mode for the output file (if output is given), by default ‘w’.

  • color (bool, optional) – whether to use color when printing, the termcolor package is required, by default True

  • abbrev_name (Optional[str], optional) – an abbreviation of the module, to avoid long names in logs. Set to “” to not log the root module in logs. By default, None.

Returns:

a logger

Return type:

logging.Logger

Example

>>> logger = setup_logger(name='my exp')
>>> logger.info('message')
[12/19 17:01:43 my exp]: message
>>> logger.error('message')
ERROR [12/19 17:02:22 my exp]: message
>>> logger.warning('message')
WARNING [12/19 17:02:32 my exp]: message
>>> # specify output files
>>> logger = setup_logger(output='log.txt', name='my exp')
>>> # append to the log file; by default mode='w' overwrites it
>>> logger = setup_logger(output='log.txt', name='my exp', mode='a')

>>> # once your logger is set, you can retrieve it anywhere by
>>> logger = get_logger(name='my exp')

get_logger(name: str = 'GreatX') Logger[source]

Get a logger for a given name.

Parameters:

name (str, optional) – name of the logger, by default “GreatX”

Returns:

a logger for the given name

Return type:

logging.Logger

singleton_filter(edges: ndarray, adj_matrix: csr_matrix)[source]

Filter edges that, if removed, would turn one or more nodes into singleton nodes.

Parameters:
  • edges (np.array, shape [M, 2], where M is the number of input edges.) – The candidate edges to remove.

  • adj_matrix (sp.sparse_matrix, shape [num_nodes, num_nodes]) – the input adjacency matrix from which the candidate edges are derived.

Returns:

the edges whose removal will not generate singleton nodes.

Return type:

np.array, shape [M, 2]
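
Example

A usage sketch on a hypothetical toy graph; per the description above, both candidate edges would isolate an endpoint, so the returned array is empty:

>>> import numpy as np
>>> import scipy.sparse as sp
>>> from greatx.utils import singleton_filter
>>> # a path graph 0-1-2
>>> adj_matrix = sp.csr_matrix(np.array([[0, 1, 0],
...                                      [1, 0, 1],
...                                      [0, 1, 0]]))
>>> edges = np.array([[0, 1], [1, 2]])
>>> singleton_filter(edges, adj_matrix)   # empty: both removals create singletons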

class SingletonFilter(adj_matrix: csr_matrix)[source]

Computes a mask for entries potentially leading to singleton nodes, i.e., entries where one of the two corresponding nodes has degree 1 and an edge exists between the two nodes.

Parameters:

adj_matrix (sp.csr_matrix) – the input adjacency matrix

update(u: int, v: int, edge_weight: float)[source]

class LikelihoodFilter(degree: ndarray, ll_cutoff: float = 0.004)[source]

Likelihood filter from the “Adversarial Attacks on Neural Networks for Graph Data” paper (KDD’18)

Parameters:
  • degree (np.ndarray) – the degree of the nodes in the graph

  • ll_cutoff (float, optional) – likelihood cutoff, by default 0.004

update(u: int, v: int, edge_weight: float, idx: int)[source]

Update likelihood ratio test values

static compute_alpha(n, S_d, d_min)[source]

Approximate the alpha of a power law distribution.

Parameters:
  • n (int or np.array of int) – Number of entries that are larger than or equal to d_min

  • S_d (float or np.array of float) – Sum of log degrees in the distribution that are larger than or equal to d_min

  • d_min (int) – The minimum degree of nodes to consider

Returns:

alpha – The estimated alpha of the power law distribution

Return type:

float
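
For reference, a sketch of the standard MLE approximation for the power-law exponent (Clauset et al., 2009) that this estimate follows; whether GreatX uses exactly this form is an assumption:

import numpy as np

def compute_alpha_sketch(n, S_d, d_min):
    # alpha is approximately 1 + n / (S_d - n * log(d_min - 1/2))
    return 1 + n / (S_d - n * np.log(d_min - 0.5))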

static update_Sx(S_old, n_old, d_old, d_new, d_min)[source]

Update the sum of log degrees S_d and n based on the degree distribution resulting from inserting or deleting a single edge.

Parameters:
  • S_old (float) – Sum of log degrees in the distribution that are larger than or equal to d_min.

  • n_old (int) – Number of entries in the old distribution that are larger than or equal to d_min.

  • d_old (np.array, shape [num_nodes,] dtype int) – The old degree sequence.

  • d_new (np.array, shape [num_nodes,] dtype int) – The new degree sequence

  • d_min (int) – The minimum degree of nodes to consider

Returns:

  • new_S_d (float, the updated sum of log degrees in the distribution that are larger than or equal to d_min.)

  • new_n (int, the updated number of entries in the old distribution that are larger than or equal to d_min.)

static compute_log_likelihood(n, alpha, S_d, d_min)[source]

Compute the log likelihood of the power law fit.

Parameters:
  • n (int) – Number of entries in the old distribution that are larger than or equal to d_min.

  • alpha (float) – The estimated alpha of the power law distribution

  • S_d (float) – Sum of log degrees in the distribution that are larger than or equal to d_min.

  • d_min (int) – The minimum degree of nodes to consider

Returns:

the estimated log likelihood

Return type:

float
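
A sketch of the power-law log-likelihood as computed in the Nettack reference implementation, which this helper appears to mirror (an assumption; the exact constants may differ):

import numpy as np

def log_likelihood_sketch(n, alpha, S_d, d_min):
    # log-likelihood of n degree samples >= d_min under exponent alpha
    return n * np.log(alpha) + n * alpha * np.log(d_min) - (alpha + 1) * S_d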

static filter_chisquare(ll_ratios, cutoff)[source]

singleton_mask(adj_matrix: Tensor)[source]

Computes a mask for entries potentially leading to singleton nodes, i.e., entries where one of the two corresponding nodes has degree 1 and an edge exists between the two nodes.

Parameters:

adj_matrix (Tensor, shape [N, N], where N is the number of nodes) – the input adjacency matrix to compute the mask from

Returns:

mask – a boolean mask with the same shape as adj_matrix.

Return type:

bool Tensor
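
A minimal sketch of the masking rule described above (not the library implementation; the polarity of the returned mask is an assumption):

import torch

def singleton_mask_sketch(adj_matrix: torch.Tensor) -> torch.Tensor:
    # an entry (u, v) is risky when the edge exists and either endpoint
    # has degree 1: removing that edge would leave the endpoint isolated
    degree = adj_matrix.sum(dim=1)
    deg_one = degree == 1
    risky = (adj_matrix > 0) & (deg_one.unsqueeze(0) | deg_one.unsqueeze(1))
    return ~risky  # True marks entries that are safe to flip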

add_edges(edge_index: Tensor, edges_to_add: Tensor, symmetric: bool = True, coalesce: bool = True, sort_edges: bool = True) Tensor[source]

Add edges to the graph denoted as edge_index.

Parameters:
  • edge_index (torch.Tensor) – the graph instance to which edges will be added.

  • edges_to_add (torch.Tensor) – shape [2, M], the edges to be added into the graph.

  • symmetric (bool) – whether the output graph is symmetric, if True, it will also append the reversed edges into the graph.

  • coalesce (bool) – whether to coalesce the output edges.

  • sort_edges (bool) – whether to sort the output edges.

Returns:

the graph instance edge_index with edges added.

Return type:

Tensor
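
Example

A usage sketch on a toy graph; the exact ordering shown assumes the default coalescing and row-major sorting:

>>> import torch
>>> from greatx.utils import add_edges
>>> edge_index = torch.tensor([[0, 1, 1, 2],
...                            [1, 0, 2, 1]])  # undirected path 0-1-2
>>> edges_to_add = torch.tensor([[0], [2]])    # connect nodes 0 and 2
>>> add_edges(edge_index, edges_to_add)        # the reversed edge 2->0 is added too
tensor([[0, 0, 1, 1, 2, 2],
        [1, 2, 0, 2, 0, 1]])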

remove_edges(edge_index: Tensor, edges_to_remove: Tensor, symmetric: bool = True) Tensor[source]

Remove edges from the graph denoted as edge_index.

Parameters:
  • edge_index (torch.Tensor) – the graph instance where edges will be removed from.

  • edges_to_remove (torch.Tensor) – shape [2, M], the edges to be removed in the graph.

  • symmetric (bool) – whether the output graph is symmetric, if True, it will also remove the reversed edges from the graph.

Returns:

the graph instance edge_index with edges removed.

Return type:

Tensor

flip_edges(edge_index: Tensor, edges_to_flip: Tensor, symmetric: bool = True) Tensor[source]

Flip edges from the graph denoted as edge_index.

Parameters:
  • edge_index (torch.Tensor) – the graph instance in which edges will be flipped.

  • edges_to_flip (torch.Tensor) – shape [2, M], the edges to be flipped in the graph.

  • symmetric (bool) – whether the output graph is symmetric, if True, it will also flip the reversed edges in the graph.

Returns:

the graph instance edge_index with edges flipped.

Return type:

Tensor
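
Example

A usage sketch on a toy graph; existing edges are removed and missing ones are added, and the exact output ordering is an assumption:

>>> import torch
>>> from greatx.utils import flip_edges
>>> edge_index = torch.tensor([[0, 1, 1, 2],
...                            [1, 0, 2, 1]])       # undirected path 0-1-2
>>> edges_to_flip = torch.tensor([[0, 0], [1, 2]])  # remove 0-1, add 0-2
>>> flip_edges(edge_index, edges_to_flip)
tensor([[0, 1, 2, 2],
        [2, 2, 0, 1]])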

flip_graph(data: Data, edges_to_flip: Tensor, symmetric: bool = True) Data[source]

Flip edges from the graph denoted as data.

Parameters:
  • data (Data) – the graph instance in which edges will be flipped.

  • edges_to_flip (torch.Tensor) – shape [2, M], the edges to be flipped in the graph.

  • symmetric (bool) – whether the output graph is symmetric, if True, it will also flip the reversed edges in the graph.

Returns:

the graph instance data with edges flipped.

Return type:

Data

Note

We currently do not support weighted graphs; this function will automatically set edge_weight and adj_t to None.

scipy_normalize(adj_matrix: csr_matrix, add_self_loops: bool = True) csr_matrix[source]

Normalize a sparse matrix according to GCN from the “Semi-supervised Classification with Graph Convolutional Networks” paper (ICLR’17)

Parameters:
  • adj_matrix (sp.csr_matrix) – the input sparse matrix denoting a graph.

  • add_self_loops (bool, optional) – whether to add self-loops, by default True

Returns:

the normalized adjacency matrix.

Return type:

sp.csr_matrix
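
For reference, the GCN rule computes D^(-1/2) (A + I) D^(-1/2) when add_self_loops=True. A minimal scipy sketch of that formula (a sketch, not the library implementation):

import numpy as np
import scipy.sparse as sp

def gcn_normalize_sketch(adj_matrix, add_self_loops=True):
    if add_self_loops:
        adj_matrix = adj_matrix + sp.eye(adj_matrix.shape[0], format='csr')
    deg = np.asarray(adj_matrix.sum(axis=1)).ravel().astype(float)
    deg_inv_sqrt = np.zeros_like(deg)
    deg_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    D_inv_sqrt = sp.diags(deg_inv_sqrt)
    return (D_inv_sqrt @ adj_matrix @ D_inv_sqrt).tocsr()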

normalize(feat: Tensor, norm: str = 'standardize', dim: Optional[int] = None, lim_min: float = -1.0, lim_max: float = 1.0) Tensor[source]

Feature normalization function.

Adapted from GRB: https://github.com/THUDM/grb/blob/master/grb/dataset/dataset.py#L638

Parameters:
  • feat (torch.Tensor) – node feature matrix with shape [N, D]

  • norm (Optional[str], optional) – how to normalize feature matrix, including [“linearize”, “arctan”, “tanh”, “standardize”, “none”], by default “standardize”

  • dim (None or int, optional) – Axis along which the means or standard deviations are computed. The default is to compute the mean or standard deviations of the flattened array, by default None

  • lim_min (float, optional) – minimum limit of feature, by default -1.0

  • lim_max (float, optional) – maximum limit of feature, by default 1.0

Returns:

feat – normalized feature matrix

Return type:

torch.Tensor
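
Example

A minimal usage sketch for the default 'standardize' mode; normalization preserves the input shape:

>>> import torch
>>> from greatx.utils import normalize
>>> feat = torch.rand(4, 16)                   # toy node feature matrix
>>> out = normalize(feat, norm='standardize')
>>> out.shape
torch.Size([4, 16])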

overlap(edge_index1: Tensor, edge_index2: Tensor, on: str = 'edge', symmetric: bool = False) float[source]

Compute graph overlapping according to the “Node Similarity Preserving Graph Convolutional Networks” paper (WSDM’21)

Parameters:
  • edge_index1 (torch.Tensor) – edge indices of a graph

  • edge_index2 (torch.Tensor) – edge indices of another graph

  • on (str, optional) – whether to compute the overlap on 'edge' or 'node', by default 'edge'

  • symmetric (bool, optional) – whether the computed overlap is symmetric, by default False

Returns:

overlapping of the two graphs on edge or node

Return type:

float
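
Example

A usage sketch on hypothetical toy graphs; the exact normalization of the returned ratio follows the paper:

>>> import torch
>>> from greatx.utils import overlap
>>> g1 = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])  # edges of graph 1
>>> g2 = torch.tensor([[0, 1, 0, 2], [1, 0, 2, 0]])  # edges of graph 2
>>> overlap(g1, g2)  # fraction of edges shared between the two graphs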

ego_graph(adj_matrix: csr_matrix, targets: Union[int, list], hops: int = 1) ego_graph[source]

Returns induced subgraph of neighbors centered at node n within a given radius.

Parameters:
  • adj_matrix (sp.csr_matrix,) – a Scipy CSR sparse adjacency matrix representing a graph

  • targets (Union[int, list]) – center nodes, a single node or a list of nodes

  • hops (int, optional) – include all neighbors within distance <= hops of the target nodes, by default 1

Returns:

nodes: shape [N], the nodes of the subgraph
edges: shape [2, M], the edges of the subgraph

Return type:

NamedTuple(nodes, edges)

Note

This is a faster implementation of networkx.ego_graph based on scipy sparse matrices and numba.

See also

networkx.ego_graph, torch_geometric.utils.k_hop_subgraph
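
Example

A usage sketch on a hypothetical toy graph; the node ordering of the returned array may differ:

>>> import numpy as np
>>> import scipy.sparse as sp
>>> from greatx.utils import ego_graph
>>> # undirected path graph 0-1-2-3
>>> adj_matrix = sp.csr_matrix(np.array([[0, 1, 0, 0],
...                                      [1, 0, 1, 0],
...                                      [0, 1, 0, 1],
...                                      [0, 0, 1, 0]]))
>>> subgraph = ego_graph(adj_matrix, 0, hops=2)
>>> subgraph.nodes   # nodes within 2 hops of node 0
array([0, 1, 2])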

class MissingFeature(missing_rate: float = 0.5, missing_type: str = 'uniform', missing_value: float = nan)[source]

Implementation of MissingFeature from the “On the Unreasonable Effectiveness of Feature propagation in Learning on Graphs with Missing Node Features” paper (ICLR’21)

MissingFeature generates a missing-feature mask indicating whether each feature is present or missing, according to different strategies.

Parameters:
  • missing_rate (float, optional) – ratio of missing features, by default 0.5

  • missing_type (str, optional) – strategy used to generate the missing-feature mask. If missing_type='uniform', each feature of each node is missing uniformly at random with probability missing_rate. If missing_type='structural', either all features of a node are observed or none are: each node has probability missing_rate of not observing any feature. By default 'uniform'

  • missing_value (float, optional) – value to fill missing features, by default float("nan")
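
A minimal sketch of the two masking strategies described above (a sketch of the strategies, not the library implementation):

import torch

def missing_mask_sketch(num_nodes, num_features, missing_rate=0.5,
                        missing_type='uniform'):
    # True marks a missing entry
    if missing_type == 'uniform':
        # each entry is dropped independently
        return torch.rand(num_nodes, num_features) < missing_rate
    if missing_type == 'structural':
        # all features of a node are dropped together
        return (torch.rand(num_nodes, 1) < missing_rate).expand(
            num_nodes, num_features)
    raise ValueError(f"unknown missing_type: {missing_type}")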

mark(logit: Union[ndarray, Tensor], y_true: Optional[int] = None) str[source]

Marks the predicted classes and corresponding probability for a better view.

Parameters:
  • logit (Union[np.ndarray, torch.Tensor]) – the predicted class probability

  • y_true (Optional[int]) – the ground truth label, by default None

Returns:

the formatted string

Return type:

str

Examples

>>> from greatx.utils import mark
>>> import torch
>>> pred = torch.tensor([0.5, 0.3, 0.1, 0.1])
>>> print(mark(pred))
Prediction (y=0): 0.500, 0.300, 0.100, 0.100
                   👆(0)
>>> print(mark(pred, y_true=2))
Ground Truth (y=2):                👇(2)
Prediction (pred=0): 0.500, 0.300, 0.100, 0.100
                      👆(0)
Margin: -0.400