greatx.utils

Progbar

A progress bar for display.

BunchDict

Container object for datasets: a dictionary-like object that exposes its keys as attributes and remembers insertion order.

CKA

Centered Kernel Alignment (CKA) metric, where the features of the networks are compared.

topk

Returns the k largest/smallest elements and corresponding indices from an array-like input.

wrapper

Wrap a function to make some arguments have the same length.

repeat

Repeat any objects and return iterable ones.

split_nodes

Randomly split a set of nodes labeled with labels.

split_nodes_by_classes

Randomly split the training data by the number of nodes per class.

setup_logger

Initialize the GreatX logger and set its verbosity level to "DEBUG".

get_logger

Get a logger for a given name.

singleton_filter

Filter edges that, if removed, would turn one or more nodes into singleton nodes.

SingletonFilter

Computes a mask for entries potentially leading to singleton nodes, i.e., entries where one of the two corresponding nodes has degree 1 and an edge exists between the two nodes.

LikelihoodFilter

Likelihood filter from the "Adversarial Attacks on Neural Networks for Graph Data" paper (KDD'18)

singleton_mask

Computes a mask for entries potentially leading to singleton nodes, i.e., entries where one of the two corresponding nodes has degree 1 and an edge exists between the two nodes.

add_edges

Add edges to the graph denoted as edge_index.

remove_edges

Remove edges from the graph denoted as edge_index.

flip_edges

Flip edges from the graph denoted as edge_index.

flip_graph

Flip edges from the graph denoted as data.

scipy_normalize

Normalize a sparse matrix according to GCN from the "Semi-supervised Classification with Graph Convolutional Networks" paper (ICLR'17)

normalize

Feature normalization function.

overlap

Compute graph overlapping according to the "Node Similarity Preserving Graph Convolutional Networks" paper (WSDM'21)

ego_graph

Returns induced subgraph of neighbors centered at node n within a given radius.

MissingFeature

Implementation of MissingFeature from the "On the Unreasonable Effectiveness of Feature propagation in Learning on Graphs with Missing Node Features" paper (ICLR'21)

mark

Marks the predicted classes and corresponding probability for a better view.

class Progbar(target: int, width: int = 20, verbose: int = 1, interval: float = 0.05, unit_name: str = 'step')[source]

A progress bar for display.

Parameters:
  • target (int) – total number of steps expected.

  • width (int, optional) – progress bar width on screen, by default 20

  • verbose (int, optional) – verbosity mode, 0 (silent), 1 (verbose), 2 (semi-verbose), by default 1

  • interval (float, optional) – minimum visual progress update interval (in seconds), by default 0.05

  • unit_name (str, optional) – display name for step counts (usually “step” or “sample”), by default ‘step’

Example

>>> from greatx.utils import Progbar
>>> pbar = Progbar(5)
>>> for i in range(5):
...     pbar.add(1, msg=f'current number {i}')
5/5 [===============] - Total: 3.22ms - 643us/step- current number 4
>>> pbar = Progbar(5)
>>> for i in range(5):
...     pbar.update(i+1, msg=f'current number {i}')
5/5 [===============] - Total: 3.22ms - 643us/step- current number 4
update(current: int, msg: Optional[Union[str, List, Tuple]] = None, finalize: Optional[bool] = None)[source]

Updates the progress bar using current value.

Parameters:
  • current (int) – index of current step

  • msg (Optional[Union[str, List, Tuple]], optional) – (name, value_for_last_step) or string messages, by default None

  • finalize (Optional[bool], optional) – whether this is the last update for the progress bar. If None, defaults to current >= self.target, by default None

Raises:

ValueError – invalid message msg for progress bar.

add(n: int, msg: Optional[Union[str, List, Tuple]] = None)[source]

Add n steps to the progress bar.

Parameters:
  • n (int) – number of steps to add to the progress bar

  • msg (Optional[Union[str, List, Tuple]], optional) – (name, value_for_last_step) or string messages, by default None

static format_num(n: int) str[source]

Intelligent scientific notation (.3g).

Parameters:

n (int or float or Numeric) – a Number.

Returns:

out – Formatted number.

Return type:

str

class BunchDict(*args, **kwargs)[source]

Container object for datasets: a dictionary-like object that exposes its keys as attributes and remembers insertion order.

Examples

>>> b = BunchDict(a=1, b=2)
>>> b
Objects in BunchDict:
╒═════════╤═══════════╕
│ Names   │   Objects │
╞═════════╪═══════════╡
│ a       │         1 │
├─────────┼───────────┤
│ b       │         2 │
╘═════════╧═══════════╛
>>> b['b']
2
>>> b.b
2
>>> b.a = 3
>>> b['a']
3
>>> b.c = 6
>>> b['c']
6
>>> # Converting objects in BunchDict to `torch.Tensor` if possible.
>>> b = BunchDict(a=[1,2,3])
>>> b.to_tensor()
Objects in BunchDict:
╒═════════╤═══════════════════════════════╕
│ Names   │ Objects                       │
╞═════════╪═══════════════════════════════╡
│ a       │ Tensor, shape=torch.Size([3]) │
│         │ tensor([1, 2, 3])             │
╘═════════╧═══════════════════════════════╛
>>> b.a
tensor([1, 2, 3])
to_tensor(device: str = 'cpu', dtype=None) BunchDict[source]

Convert objects in BunchDict to torch.Tensor

Parameters:
  • device (str, optional) – device of the converted tensors, by default ‘cpu’

  • dtype (torch.dtype, optional) – data type of the converted tensors, by default None

Returns:

the converted BunchDict

Return type:

BunchDict

class CKA(model1: Module, model2: Module, model1_name: Optional[str] = None, model2_name: Optional[str] = None, model1_layers: Optional[List[str]] = None, model2_layers: Optional[List[str]] = None, training: bool = False, device: str = 'cpu')[source]

Centered Kernel Alignment (CKA) metric, where the features of the networks are compared.

Parameters:
  • model1 (nn.Module) – model 1

  • model2 (nn.Module) – model 2

  • model1_name (str, optional) – name of model 1, by default None

  • model2_name (str, optional) – name of model 2, by default None

  • model1_layers (List[str], optional) – List of layers to extract features from, by default None

  • model2_layers (List[str], optional) – List of layers to extract features from, by default None

  • training (bool, optional) – whether to set training mode (True) or evaluation mode (False) for the models, by default False

  • device (str, optional) – device to run the models, by default ‘cpu’

Example

data = ... # get your graph
m1 = ... # get your model1
m2 = ... # get your model2
cka = CKA(m1, m2)
cka.compare(data)
cka.plot_results()

compare(data1: Data, data2: Data = None) None[source]

Computes the feature similarity between the models on the given datasets.

Parameters:
  • data1 (Data) – the dataset that model 1 runs on.

  • data2 (Data, optional) – if given, model 2 will run on this dataset, by default None

export() Dict[source]

Exports the CKA data along with the respective model layer names.

plot_results(save_path: Optional[str] = None, title: Optional[str] = None)[source]

topk(array: ndarray, k: int, largest: bool = True) topk_values_indices[source]

Returns the k largest/smallest elements and corresponding indices from an array-like input.

Parameters:
  • array (np.ndarray or list) – the array-like input

  • k (int) – the k in “top-k”

  • largest (bool, optional) – controls whether to return largest or smallest elements

Returns:

Returns the k largest/smallest elements and corresponding indices of the given array

Return type:

namedtuple[values, indices]

Example

>>> array = [5, 3, 7, 2, 1]
>>> topk(array, 2)
topk_values_indices(values=array([7, 5]), indices=array([2, 0], dtype=int64))
>>> topk(array, 2, largest=False)
topk_values_indices(values=array([1, 2]), indices=array([4, 3], dtype=int64))
>>> array = [[1, 2], [3, 4], [5, 6]]
>>> topk(array, 2)
topk_values_indices(values=array([6, 5]), indices=(array([2, 2], dtype=int64), array([1, 0], dtype=int64)))
wrapper(func: Callable) Callable[source]

Wrap a function to make some arguments have the same length. By default, the arguments to be modified are hids and acts.

Users can customize these arguments by setting:

  • includes : to include custom arguments

  • excludes : to exclude custom arguments

  • length_as : to make the length of the arguments the same as that of length_as, which is hids by default.

Parameters:

func (Callable) – a function to be wrapped.

Returns:

a wrapped function.

Return type:

Callable

Raises:

TypeError – if a required argument of the function is missing.

Example

>>> @wrapper
... def func(hids=[16], acts=None):
...     print(locals())
>>> func(100)
{'hids': [100], 'acts': [None]}
>>> func([100, 64])
{'hids': [100, 64], 'acts': [None, None]}
>>> func([100, 64], excludes=['acts'])
{'hids': [100, 64], 'acts': None}
>>> @wrapper
... def func(self, hids=[16], acts=None):
...     print(locals())
>>> func()
TypeError: The decorated function 'func' missing required argument 'self'.
>>> func('class_itself')
{'self': 'class_itself', 'hids': [16], 'acts': [None]}
>>> func('class_itself', hids=[])
{'self': 'class_itself', 'hids': [], 'acts': []}
>>> @wrapper
... def func(self, hids=[16], acts=None, heads=8):
...     print(locals())
>>> func('class_itself', hids=[100, 200])
{'self': 'class_itself', 'hids': [100, 200], 'acts': [None, None], 'heads': 8}
>>> func('class_itself', hids=[100, 200], includes=['heads'])
{'self': 'class_itself', 'hids': [100, 200], 'acts': [None, None], 'heads': [8, 8]}
repeat(src: Any, length: Optional[int] = None) Any[source]

Repeat any objects and return iterable ones.

Parameters:
  • src (Any) – any objects

  • length (Optional[int], optional) – the length to be repeated. If None, iterable objects are returned as-is and non-iterable objects are wrapped into a list of length 1, by default None

Returns:

the iterable repeated object

Return type:

Any

Example

>>> from greatx.utils import repeat
>>> # repeat for a single non-iterable object
>>> repeat(1)
[1]
>>> repeat(1, 3)
[1, 1, 1]
>>> repeat('relu', 2)
['relu', 'relu']
>>> repeat(None, 2)
[None, None]
>>> # repeat for iterable object
>>> repeat([1, 2, 3], 2)
[1, 2]
>>> repeat([1, 2, 3], 5)
[1, 2, 3, 3, 3]
split_nodes(labels: Tensor, *, train: float = 0.1, test: float = 0.8, val: float = 0.1, random_state: Optional[int] = None) BunchDict[source]

Randomly split a set of nodes labeled with labels.

Parameters:
  • labels (torch.Tensor) – the labels of the nodes.

  • train (float, optional) – the percentage of the training set, by default 0.1

  • test (float, optional) – the percentage of the test set, by default 0.8

  • val (float, optional) – the percentage of the validation set, by default 0.1

  • random_state (Optional[int], optional) – random seed for the random number generator, by default None

Returns:

  • train_nodes: torch.Tensor with Size [train * num_nodes]

    The indices of the training nodes

  • val_nodes: torch.Tensor with Size [val * num_nodes]

    The indices of the validation nodes

  • test_nodes: torch.Tensor with Size [test * num_nodes]

    The indices of the test nodes

Return type:

BunchDict with the following items
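
Example

A usage sketch with hypothetical labels; the sizes shown assume the default 0.1/0.1/0.8 split of 100 nodes rounds exactly:

>>> import torch
>>> from greatx.utils import split_nodes
>>> labels = torch.randint(0, 3, (100,))   # 100 toy node labels
>>> splits = split_nodes(labels, random_state=42)
>>> splits.train_nodes.numel(), splits.val_nodes.numel(), splits.test_nodes.numel()
(10, 10, 80)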

split_nodes_by_classes(labels: Tensor, n_per_class: int = 20, random_state: Optional[int] = None) BunchDict[source]

Randomly split the training data by the number of nodes per class.

Parameters:
  • labels (torch.Tensor [num_nodes]) – The class labels

  • n_per_class (int) – Number of samples per class

  • random_state (Optional[int]) – Random seed

Returns:

  • train_nodes: torch.Tensor with Size [n_per_class * num_classes]

    The indices of the training nodes

  • val_nodes: torch.Tensor with Size [n_per_class * num_classes]

    The indices of the validation nodes

  • test_nodes: torch.Tensor with Size [num_nodes - 2*n_per_class * num_classes]

    The indices of the test nodes

Return type:

BunchDict with the following items
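
Example

A usage sketch with hypothetical labels; it assumes every class has enough nodes for both the train and validation splits to be drawn:

>>> import torch
>>> from greatx.utils import split_nodes_by_classes
>>> labels = torch.randint(0, 7, (1000,))  # toy labels over 7 classes
>>> splits = split_nodes_by_classes(labels, n_per_class=20, random_state=42)
>>> splits.train_nodes.numel()   # 20 nodes for each of the 7 classes
140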

setup_logger(output: Optional[str] = None, name: str = 'GreatX', *, distributed_rank: int = 0, mode: str = 'w', color: bool = True, abbrev_name: Optional[str] = None) Logger[source]

Initialize the GreatX logger and set its verbosity level to “DEBUG”.

Parameters:
  • output (Optional[str], optional) – a file name or a directory to save log. If None, will not save log file. If ends with “.txt” or “.log”, assumed to be a file name. Otherwise, logs will be saved to {output}/log.txt.

  • name (str, optional) – the root module name of this logger, by default “GreatX”

  • distributed_rank (int, optional) – used for distributed training, by default 0

  • mode (str, optional) – mode for the output file (if output is given), by default ‘w’.

  • color (bool, optional) – whether to use color when printing, the termcolor package is required, by default True

  • abbrev_name (Optional[str], optional) – an abbreviation of the module, to avoid long names in logs. Set to “” to not log the root module in logs. By default, None.

Returns:

a logger

Return type:

logging.Logger

Example

>>> logger = setup_logger(name='my exp')
>>> logger.info('message')
[12/19 17:01:43 my exp]: message
>>> logger.error('message')
ERROR [12/19 17:02:22 my exp]: message
>>> logger.warning('message')
WARNING [12/19 17:02:32 my exp]: message
>>> # specify output files
>>> logger = setup_logger(output='log.txt', name='my exp')
>>> # append to the log file; by default mode='w' overwrites it
>>> logger = setup_logger(output='log.txt', name='my exp', mode='a')

>>> # once your logger is set, you can retrieve it anywhere by
>>> logger = get_logger(name='my exp')

get_logger(name: str = 'GreatX') Logger[source]

Get a logger for a given name.

Parameters:

name (str, optional) – name of the logger, by default “GreatX”

Returns:

a logger for the given name

Return type:

logging.Logger

singleton_filter(edges: ndarray, adj_matrix: csr_matrix)[source]

Filter edges that, if removed, would turn one or more nodes into singleton nodes.

Parameters:
  • edges (np.array, shape [M, 2], where M is the number of input edges.) – The candidate edges to remove.

  • adj_matrix (sp.sparse_matrix, shape [num_nodes, num_nodes]) – the input adjacency matrix from which the candidate edges are derived.

Returns:

the edges whose removal will not generate singleton nodes.

Return type:

np.array, shape [M, 2]
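
Example

A usage sketch on a hypothetical toy graph; per the description above, both candidate edges would isolate an endpoint, so the returned array is empty:

>>> import numpy as np
>>> import scipy.sparse as sp
>>> from greatx.utils import singleton_filter
>>> # a path graph 0-1-2
>>> adj_matrix = sp.csr_matrix(np.array([[0, 1, 0],
...                                      [1, 0, 1],
...                                      [0, 1, 0]]))
>>> edges = np.array([[0, 1], [1, 2]])
>>> singleton_filter(edges, adj_matrix)   # empty: both removals create singletons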

class SingletonFilter(adj_matrix: csr_matrix)[source]

Computes a mask for entries potentially leading to singleton nodes, i.e., entries where one of the two corresponding nodes has degree 1 and an edge exists between the two nodes.

Parameters:

adj_matrix (sp.csr_matrix) – the input adjacency matrix

update(u: int, v: int, edge_weight: float)[source]

class LikelihoodFilter(degree: ndarray, ll_cutoff: float = 0.004)[source]

Likelihood filter from the “Adversarial Attacks on Neural Networks for Graph Data” paper (KDD’18)

Parameters:
  • degree (np.ndarray) – the degree of the nodes in the graph

  • ll_cutoff (float, optional) – likelihood cutoff, by default 0.004

update(u: int, v: int, edge_weight: float, idx: int)[source]

Update likelihood ratio test values

static compute_alpha(n, S_d, d_min)[source]

Approximate the alpha of a power law distribution.

Parameters:
  • n (int or np.array of int) – Number of entries that are larger than or equal to d_min

  • S_d (float or np.array of float) – Sum of log degrees in the distribution that are larger than or equal to d_min

  • d_min (int) – The minimum degree of nodes to consider

Returns:

alpha – The estimated alpha of the power law distribution

Return type:

float
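
For reference, a sketch of the standard MLE approximation for the power-law exponent (Clauset et al., 2009) that this estimate follows; whether GreatX uses exactly this form is an assumption:

import numpy as np

def compute_alpha_sketch(n, S_d, d_min):
    # alpha is approximately 1 + n / (S_d - n * log(d_min - 1/2))
    return 1 + n / (S_d - n * np.log(d_min - 0.5))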

static update_Sx(S_old, n_old, d_old, d_new, d_min)[source]

Update the sum of log degrees S_d and n based on the degree distribution resulting from inserting or deleting a single edge.

Parameters:
  • S_old (float) – Sum of log degrees in the distribution that are larger than or equal to d_min.

  • n_old (int) – Number of entries in the old distribution that are larger than or equal to d_min.

  • d_old (np.array, shape [num_nodes,] dtype int) – The old degree sequence.

  • d_new (np.array, shape [num_nodes,] dtype int) – The new degree sequence

  • d_min (int) – The minimum degree of nodes to consider

Returns:

  • new_S_d (float, the updated sum of log degrees in the distribution that are larger than or equal to d_min.)

  • new_n (int, the updated number of entries in the old distribution that are larger than or equal to d_min.)

static compute_log_likelihood(n, alpha, S_d, d_min)[source]

Compute the log likelihood of the power law fit.

Parameters:
  • n (int) – Number of entries in the old distribution that are larger than or equal to d_min.

  • alpha (float) – The estimated alpha of the power law distribution

  • S_d (float) – Sum of log degrees in the distribution that are larger than or equal to d_min.

  • d_min (int) – The minimum degree of nodes to consider

Returns:

the estimated log likelihood

Return type:

float
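
A sketch of the power-law log-likelihood as computed in the Nettack reference implementation, which this helper appears to mirror (an assumption; the exact constants may differ):

import numpy as np

def log_likelihood_sketch(n, alpha, S_d, d_min):
    # log-likelihood of n degree samples >= d_min under exponent alpha
    return n * np.log(alpha) + n * alpha * np.log(d_min) - (alpha + 1) * S_d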

static filter_chisquare(ll_ratios, cutoff)[source]

singleton_mask(adj_matrix: Tensor)[source]

Computes a mask for entries potentially leading to singleton nodes, i.e., entries where one of the two corresponding nodes has degree 1 and an edge exists between the two nodes.

Parameters:

adj_matrix (Tensor, shape [N, N], where N is the number of nodes) – the input adjacency matrix to compute the mask from

Returns:

mask – a boolean mask with the same shape as adj_matrix.

Return type:

bool Tensor
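
A minimal sketch of the masking rule described above (not the library implementation; the polarity of the returned mask is an assumption):

import torch

def singleton_mask_sketch(adj_matrix: torch.Tensor) -> torch.Tensor:
    # an entry (u, v) is risky when the edge exists and either endpoint
    # has degree 1: removing that edge would leave the endpoint isolated
    degree = adj_matrix.sum(dim=1)
    deg_one = degree == 1
    risky = (adj_matrix > 0) & (deg_one.unsqueeze(0) | deg_one.unsqueeze(1))
    return ~risky  # True marks entries that are safe to flip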

add_edges(edge_index: Tensor, edges_to_add: Tensor, symmetric: bool = True, coalesce: bool = True, sort_edges: bool = True) Tensor[source]

Add edges to the graph denoted as edge_index.

Parameters:
  • edge_index (torch.Tensor) – the graph instance to which edges will be added.

  • edges_to_add (torch.Tensor) – shape [2, M], the edges to be added into the graph.

  • symmetric (bool) – whether the output graph is symmetric, if True, it will also append the reversed edges into the graph.

  • coalesce (bool) – whether to coalesce the output edges.

  • sort_edges (bool) – whether to sort the output edges.

Returns:

the graph instance edge_index with edges added.

Return type:

Tensor
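
Example

A usage sketch on a toy graph; the exact ordering shown assumes the default coalescing and row-major sorting:

>>> import torch
>>> from greatx.utils import add_edges
>>> edge_index = torch.tensor([[0, 1, 1, 2],
...                            [1, 0, 2, 1]])  # undirected path 0-1-2
>>> edges_to_add = torch.tensor([[0], [2]])    # connect nodes 0 and 2
>>> add_edges(edge_index, edges_to_add)        # the reversed edge 2->0 is added too
tensor([[0, 0, 1, 1, 2, 2],
        [1, 2, 0, 2, 0, 1]])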

remove_edges(edge_index: Tensor, edges_to_remove: Tensor, symmetric: bool = True) Tensor[source]

Remove edges from the graph denoted as edge_index.

Parameters:
  • edge_index (torch.Tensor) – the graph instance where edges will be removed from.

  • edges_to_remove (torch.Tensor) – shape [2, M], the edges to be removed in the graph.

  • symmetric (bool) – whether the output graph is symmetric, if True, it will also remove the reversed edges from the graph.

Returns:

the graph instance edge_index with edges removed.

Return type:

Tensor

flip_edges(edge_index: Tensor, edges_to_flip: Tensor, symmetric: bool = True) Tensor[source]

Flip edges from the graph denoted as edge_index.

Parameters:
  • edge_index (torch.Tensor) – the graph instance in which edges will be flipped.

  • edges_to_flip (torch.Tensor) – shape [2, M], the edges to be flipped in the graph.

  • symmetric (bool) – whether the output graph is symmetric, if True, it will also flip the reversed edges in the graph.

Returns:

the graph instance edge_index with edges flipped.

Return type:

Tensor
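
Example

A usage sketch on a toy graph; existing edges are removed and missing ones are added, and the exact output ordering is an assumption:

>>> import torch
>>> from greatx.utils import flip_edges
>>> edge_index = torch.tensor([[0, 1, 1, 2],
...                            [1, 0, 2, 1]])       # undirected path 0-1-2
>>> edges_to_flip = torch.tensor([[0, 0], [1, 2]])  # remove 0-1, add 0-2
>>> flip_edges(edge_index, edges_to_flip)
tensor([[0, 1, 2, 2],
        [2, 2, 0, 1]])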

flip_graph(data: Data, edges_to_flip: Tensor, symmetric: bool = True) Data[source]

Flip edges from the graph denoted as data.

Parameters:
  • data (Data) – the graph instance in which edges will be flipped.

  • edges_to_flip (torch.Tensor) – shape [2, M], the edges to be flipped in the graph.

  • symmetric (bool) – whether the output graph is symmetric, if True, it will also flip the reversed edges in the graph.

Returns:

the graph instance data with edges flipped.

Return type:

Data

Note

We currently do not support weighted graphs; this function will automatically set edge_weight and adj_t to None.

scipy_normalize(adj_matrix: csr_matrix, add_self_loops: bool = True) csr_matrix[source]

Normalize a sparse matrix according to GCN from the “Semi-supervised Classification with Graph Convolutional Networks” paper (ICLR’17)

Parameters:
  • adj_matrix (sp.csr_matrix) – the input sparse matrix denoting a graph.

  • add_self_loops (bool, optional) – whether to add self-loops, by default True

Returns:

the normalized adjacency matrix.

Return type:

sp.csr_matrix
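
For reference, the GCN rule computes D^(-1/2) (A + I) D^(-1/2) when add_self_loops=True. A minimal scipy sketch of that formula (a sketch, not the library implementation):

import numpy as np
import scipy.sparse as sp

def gcn_normalize_sketch(adj_matrix, add_self_loops=True):
    if add_self_loops:
        adj_matrix = adj_matrix + sp.eye(adj_matrix.shape[0], format='csr')
    deg = np.asarray(adj_matrix.sum(axis=1)).ravel().astype(float)
    deg_inv_sqrt = np.zeros_like(deg)
    deg_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    D_inv_sqrt = sp.diags(deg_inv_sqrt)
    return (D_inv_sqrt @ adj_matrix @ D_inv_sqrt).tocsr()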

normalize(feat: Tensor, norm: str = 'standardize', dim: Optional[int] = None, lim_min: float = -1.0, lim_max: float = 1.0) Tensor[source]

Feature normalization function.

Adapted from GRB: https://github.com/THUDM/grb/blob/master/grb/dataset/dataset.py#L638

Parameters:
  • feat (torch.Tensor) – node feature matrix with shape [N, D]

  • norm (Optional[str], optional) – how to normalize feature matrix, including [“linearize”, “arctan”, “tanh”, “standardize”, “none”], by default “standardize”

  • dim (None or int, optional) – Axis along which the means or standard deviations are computed. The default is to compute the mean or standard deviations of the flattened array, by default None

  • lim_min (float, optional) – minimum limit of feature, by default -1.0

  • lim_max (float, optional) – maximum limit of feature, by default 1.0

Returns:

feat – normalized feature matrix

Return type:

torch.Tensor
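
Example

A minimal usage sketch for the default 'standardize' mode; normalization preserves the input shape:

>>> import torch
>>> from greatx.utils import normalize
>>> feat = torch.rand(4, 16)                   # toy node feature matrix
>>> out = normalize(feat, norm='standardize')
>>> out.shape
torch.Size([4, 16])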

overlap(edge_index1: Tensor, edge_index2: Tensor, on: str = 'edge', symmetric: bool = False) float[source]

Compute graph overlapping according to the “Node Similarity Preserving Graph Convolutional Networks” paper (WSDM’21)

Parameters:
  • edge_index1 (torch.Tensor) – edge indices of a graph

  • edge_index2 (torch.Tensor) – edge indices of another graph

  • on (str, optional) – whether to compute the overlap on 'edge' or 'node', by default 'edge'

  • symmetric (bool, optional) – whether the computed overlap is symmetric, by default False

Returns:

overlapping of the two graphs on edge or node

Return type:

float
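
Example

A usage sketch on hypothetical toy graphs; the exact normalization of the returned ratio follows the paper:

>>> import torch
>>> from greatx.utils import overlap
>>> g1 = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])  # edges of graph 1
>>> g2 = torch.tensor([[0, 1, 0, 2], [1, 0, 2, 0]])  # edges of graph 2
>>> overlap(g1, g2)  # fraction of edges shared between the two graphs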

ego_graph(adj_matrix: csr_matrix, targets: Union[int, list], hops: int = 1) ego_graph[source]

Returns induced subgraph of neighbors centered at node n within a given radius.

Parameters:
  • adj_matrix (sp.csr_matrix,) – a Scipy CSR sparse adjacency matrix representing a graph

  • targets (Union[int, list]) – center nodes, a single node or a list of nodes

  • hops (int, optional) – include all neighbors within distance <= hops of the target nodes, by default 1

Returns:

nodes: shape [N], the nodes of the subgraph
edges: shape [2, M], the edges of the subgraph

Return type:

NamedTuple(nodes, edges)

Note

This is a faster implementation of networkx.ego_graph based on scipy sparse matrices and numba.

See also

networkx.ego_graph, torch_geometric.utils.k_hop_subgraph
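
Example

A usage sketch on a hypothetical toy graph; the node ordering of the returned array may differ:

>>> import numpy as np
>>> import scipy.sparse as sp
>>> from greatx.utils import ego_graph
>>> # undirected path graph 0-1-2-3
>>> adj_matrix = sp.csr_matrix(np.array([[0, 1, 0, 0],
...                                      [1, 0, 1, 0],
...                                      [0, 1, 0, 1],
...                                      [0, 0, 1, 0]]))
>>> subgraph = ego_graph(adj_matrix, 0, hops=2)
>>> subgraph.nodes   # nodes within 2 hops of node 0
array([0, 1, 2])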

class MissingFeature(missing_rate: float = 0.5, missing_type: str = 'uniform', missing_value: float = nan)[source]

Implementation of MissingFeature from the “On the Unreasonable Effectiveness of Feature propagation in Learning on Graphs with Missing Node Features” paper (ICLR’21)

MissingFeature generates a missing-feature mask indicating whether each feature is present or missing, according to different strategies.

Parameters:
  • missing_rate (float, optional) – ratio of missing features, by default 0.5

  • missing_type (str, optional) – strategy used to generate the missing-feature mask. If missing_type='uniform', each feature of each node is missing uniformly at random with probability missing_rate. If missing_type='structural', either all features of a node are observed or none are: each node has probability missing_rate of not observing any feature. By default 'uniform'

  • missing_value (float, optional) – value to fill missing features, by default float("nan")
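
A minimal sketch of the two masking strategies described above (a sketch of the strategies, not the library implementation):

import torch

def missing_mask_sketch(num_nodes, num_features, missing_rate=0.5,
                        missing_type='uniform'):
    # True marks a missing entry
    if missing_type == 'uniform':
        # each entry is dropped independently
        return torch.rand(num_nodes, num_features) < missing_rate
    if missing_type == 'structural':
        # all features of a node are dropped together
        return (torch.rand(num_nodes, 1) < missing_rate).expand(
            num_nodes, num_features)
    raise ValueError(f"unknown missing_type: {missing_type}")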

mark(logit: Union[ndarray, Tensor], y_true: Optional[int] = None) str[source]

Marks the predicted classes and corresponding probability for a better view.

Parameters:
  • logit (Union[np.ndarray, torch.Tensor]) – the predicted class probability

  • y_true (Optional[int]) – the ground truth label, by default None

Returns:

the formatted string

Return type:

str

Examples

>>> from greatx.utils import mark
>>> import torch
>>> pred = torch.tensor([0.5, 0.3, 0.1, 0.1])
>>> print(mark(pred))
Prediction (y=0): 0.500, 0.300, 0.100, 0.100
                   👆(0)
>>> print(mark(pred, y_true=2))
Ground Truth (y=2):                👇(2)
Prediction (pred=0): 0.500, 0.300, 0.100, 0.100
                      👆(0)
Margin: -0.400