Some notes that explain how PINNs are implemented in TensorFlow
NOTE: This blog post contains some useful explanations for people who are less familiar with TensorFlow; it explains a few steps in a notebook by Jan Blechschmidt
Physics Informed Neural Networks (PINNs)
The essential thing to understand is that PINNs were designed as a tool for two types of tasks: forward and inverse problems. A forward problem means solving a Partial Differential Equation (PDE) whose functional form is known, while an inverse problem means recovering coefficients of the PDE from data. The latter application is actually the more interesting one, since, as it turns out, solving PDEs (the former application) is generally more efficient with standard numerical schemes and libraries than with PINNs.
The method constructs a neural network approximation
\[u_\theta(t,x) \approx u(t,x)\]of the solution of a nonlinear PDE, where $u_\theta \colon [0,T] \times \mathcal{D} \to \mathbb{R}$ denotes a function realized by a neural network with parameters $\theta$.
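For orientation, such a network can be set up in a few lines of Keras code. The depth, width and tanh activation below are illustrative choices of mine, not necessarily the ones used later in the notebook:

import tensorflow as tf

def init_u_theta(num_hidden_layers=4, num_neurons_per_layer=20):
    # Feedforward network realizing u_theta: [0,T] x D -> R
    model = tf.keras.Sequential()
    # Two-dimensional input: the time-space coordinate (t, x)
    model.add(tf.keras.Input(shape=(2,)))
    for _ in range(num_hidden_layers):
        model.add(tf.keras.layers.Dense(num_neurons_per_layer, activation='tanh'))
    # One-dimensional output: u_theta(t, x)
    model.add(tf.keras.layers.Dense(1))
    return model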
The continuous time approach for the parabolic PDE as described in (Raissi et al., 2017 (Part I)) is based on the (strong) residual of a given neural network approximation $u_\theta \colon [0,T] \times \mathcal{D} \to \mathbb{R} $ of the solution $u$, i.e.,
\[\begin{align} r_\theta (t,x) := \partial_t u_\theta (t,x) + \mathcal{N}[u_\theta] (t,x). \end{align}\]To incorporate this PDE residual $r_\theta$ into a loss function to be minimized, PINNs require a further differentiation to evaluate the differential operators $\partial_t u_\theta$ and $\mathcal{N}[u_\theta]$. Thus the PINN term $r_\theta$ shares the same parameters as the original network $u_\theta(t,x)$, but respects the underlying “physics” of the nonlinear PDE. Both types of derivatives can be easily determined through automatic differentiation with current state-of-the-art machine learning libraries, e.g., TensorFlow or PyTorch.
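Concretely, the residual can be evaluated with nested automatic differentiation. The sketch below assumes, purely for illustration, the heat-equation operator $\mathcal{N}[u] = -\partial_{xx} u$; the function name get_residual_sketch is mine:

import tensorflow as tf

def get_residual_sketch(model, t, x):
    # t and x are tensors of shape (N, 1); model maps stacked (t, x) to u_theta(t, x)
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(t)
        tape.watch(x)
        u = model(tf.stack([t[:, 0], x[:, 0]], axis=1))
        # u_x is taken inside the tape so that u_xx can be computed afterwards
        u_x = tape.gradient(u, x)
    u_t = tape.gradient(u, t)
    u_xx = tape.gradient(u_x, x)
    del tape
    # r_theta = du/dt + N[u] with the assumed N[u] = -u_xx
    return u_t - u_xx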
The PINN approach for the solution of the initial and boundary value problem now proceeds by minimization of the loss functional
\[\begin{align} \phi_\theta(X) := \phi_\theta^r(X^r) + \phi_\theta^0(X^0) + \phi_\theta^b(X^b), \end{align}\]where $X$ denotes the collection of training data and the loss function $\phi_\theta$ contains the following terms:
the mean squared PDE residual $\phi_\theta^r$ evaluated at a set of collocation points \(X^r:=\{(t^r_i,x^r_i)\}_{i=1}^{N_r} \subset (0,T] \times \mathcal{D}\), and the mean squared misfit of the initial and boundary conditions, $\phi_\theta^0$ and $\phi_\theta^b$, evaluated in a number of points \(X^0:=\{(t^0_i,x^0_i)\}_{i=1}^{N_0} \subset \{0\} \times \mathcal{D}\) and \(X^b:=\{(t^b_i,x^b_i)\}_{i=1}^{N_b} \subset (0,T] \times \partial \mathcal{D}\), where \(u_\theta\) is the neural network approximation of the solution \(u\colon[0,T] \times \mathcal{D} \to \mathbb{R}\).
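Putting these pieces together, a minimal sketch of the total loss and of one training step could look as follows. Here get_residual_sketch is the illustrative residual from above, u_0 and u_b are tensors holding the prescribed initial and boundary values, and the Adam optimizer with this learning rate is my choice, not necessarily the notebook's:

import tensorflow as tf

def compute_loss_sketch(model, X_r, X_0, u_0, X_b, u_b):
    # Mean squared PDE residual at the collocation points X^r
    r = get_residual_sketch(model, X_r[:, 0:1], X_r[:, 1:2])
    phi_r = tf.reduce_mean(tf.square(r))
    # Mean squared misfit of the initial and boundary conditions
    phi_0 = tf.reduce_mean(tf.square(model(X_0) - u_0))
    phi_b = tf.reduce_mean(tf.square(model(X_b) - u_b))
    return phi_r + phi_0 + phi_b

optim = tf.keras.optimizers.Adam(learning_rate=1e-3)

def train_step(model, X_r, X_0, u_0, X_b, u_b):
    # One gradient-descent step on the combined loss
    with tf.GradientTape() as tape:
        loss = compute_loss_sketch(model, X_r, X_0, u_0, X_b, u_b)
    grads = tape.gradient(loss, model.trainable_variables)
    optim.apply_gradients(zip(grads, model.trainable_variables))
    return loss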
Note that the training data $X$ consists entirely of time-space coordinates.
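For example, assuming $T = 1$, $\mathcal{D} = [-1, 1]$ and uniform random sampling (these choices and the variable names are mine), such coordinate-only training data could be drawn like this:

import tensorflow as tf

DTYPE = 'float32'
tmin, tmax = 0.0, 1.0    # assumed time interval [0, T]
xmin, xmax = -1.0, 1.0   # assumed spatial domain D
N_0, N_b, N_r = 50, 50, 10000

# Initial points X^0 in {0} x D
t_0 = tf.ones((N_0, 1), dtype=DTYPE) * tmin
x_0 = tf.random.uniform((N_0, 1), xmin, xmax, dtype=DTYPE)
X_0 = tf.concat([t_0, x_0], axis=1)

# Boundary points X^b in (0, T] x dD (x is either xmin or xmax)
t_b = tf.random.uniform((N_b, 1), tmin, tmax, dtype=DTYPE)
x_b = xmin + (xmax - xmin) * tf.cast(tf.random.uniform((N_b, 1), 0, 2, dtype=tf.int32), DTYPE)
X_b = tf.concat([t_b, x_b], axis=1)

# Collocation points X^r in (0, T] x D
t_r = tf.random.uniform((N_r, 1), tmin, tmax, dtype=DTYPE)
x_r = tf.random.uniform((N_r, 1), xmin, xmax, dtype=DTYPE)
X_r = tf.concat([t_r, x_r], axis=1)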
I tested the notes given in the notebook of Jan Blechschmidt and did not find a significant increase in computation time on an M1 Apple MacBook, despite its lower specifications (on Google Colab this took noticeably longer). I am using tensorflow-macos and tensorflow-metal, which are supposedly tuned for M1 processors. The task is relatively simple, so no hiccups were observed during training: no heating or fan noise, even while running on battery. Below I just treat a very simple example that explains how the code works. We start by loading the required dependencies, the scientific computing library NumPy and the machine learning library TensorFlow, and we work with single precision.
import tensorflow as tf
import numpy as np
# Set data type
DTYPE='float32'
tf.keras.backend.set_floatx(DTYPE)
Note: Below we give a clearer explanation of how the method works by creating a simple network with just one hidden layer and an exponential activation function, so that the gradients can easily be computed by hand.
# Define a simple model with one hidden layer of two neurons
def init_model_simple(num_hidden_layers=1, num_neurons_per_layer=2):
    # Initialize a feedforward neural network
    model = tf.keras.Sequential()
    # Input is one-dimensional (a single spatial coordinate x)
    model.add(tf.keras.Input(1))
    # Append hidden layers
    for _ in range(num_hidden_layers):
        model.add(tf.keras.layers.Dense(num_neurons_per_layer,
            activation=tf.keras.activations.exponential,
            kernel_initializer='glorot_normal'))
    # Output is one-dimensional
    model.add(tf.keras.layers.Dense(1))
    return model

model_simple = init_model_simple()
To show how the algorithm works I only evaluate $du/dx$ and extract the relevant weights of the network. Notice that the biases are initialized to zero (the Keras default for Dense layers), while the Glorot normal initializer draws the kernel weights at random from a normal distribution.
def get_r_simple(model, x):
    # A tf.GradientTape is used to compute derivatives in TensorFlow
    with tf.GradientTape(persistent=True) as tape:
        # Watch the input x so that du/dx can be computed
        tape.watch(x)
        # Evaluate the network
        u = model(x)
        # Compute the gradient u_x within the GradientTape
        # (in the full PINN code second derivatives are needed as well)
        u_x = tape.gradient(u, x)
    del tape
    return u_x
w21 = model_simple.weights[2].numpy()[0]    # last layer kernel, first entry
w22 = model_simple.weights[2].numpy()[1]    # last layer kernel, second entry
w11 = model_simple.weights[0].numpy()[0,0]  # first layer kernel, first entry
w12 = model_simple.weights[0].numpy()[0,1]  # first layer kernel, second entry

# Evaluate everything at the single point x = 1
x = tf.constant([1], dtype=tf.float32)

print(f'du/dx = {get_r_simple(model_simple, x).numpy()} = '
      f'{w21*w11*np.exp(w11*x) + w22*w12*np.exp(w12*x)}')
Indeed, $du/dx$ computed by hand agrees (within single precision) with the value obtained from the network's GradientTape (run the code above to confirm).
Below we set up the loss $l := \partial_x u$ and compute its gradients with respect to the network parameters (for simplicity of the analytical manipulations we chose the exponential activation function and just one hidden layer with two neurons):
\[\begin{aligned} u = w_{21}\,\exp(w_{11}\, x + b_{11}) + w_{22}\,\exp(w_{12}\, x + b_{12}) + b_{2}\\ l := \partial_x u = w_{21}\,w_{11}\,\exp(w_{11} \, x + b_{11}) + w_{22}\,w_{12}\,\exp(w_{12}\, x + b_{12}) \end{aligned}\]where $b_2$ denotes the single bias of the output layer. As stated, we take a very simple loss that depends only on $du/dx$:
def compute_loss_simple(model, X_r):
    # The "loss" is simply l = du/dx
    return get_r_simple(model, X_r)

def get_grad_simple(model, X_r):
    # This tape records derivatives with respect to the trainable variables,
    # which are watched automatically (no tape.watch needed)
    with tf.GradientTape(persistent=True) as tape:
        loss = compute_loss_simple(model, X_r)
    g = tape.gradient(loss, model.trainable_variables)
    del tape
    return loss, g
loss, g = get_grad_simple(model_simple, x)

for gi, varsi in zip(g, model_simple.variables):
    print(f'{varsi.name} has gradient {gi}')
This generates the output shown in the following table (your results will depend on the initialization).
Layer name | neuron 1 | neuron 2 |
---|---|---|
dense_2/kernel:0 | 4.466523 | -0.13126281 |
dense_2/bias:0 | 2.0364249 | -0.00236067 |
dense_3/kernel:0 | 1.9372414 | 0.01865212 |
dense_3/bias:0 | None | |
Given that $l = w_{21}\,w_{11}\,\exp(w_{11} \, x + b_{11}) + w_{22}\,w_{12}\,\exp(w_{12}\, x + b_{12})$, these gradients have the following analytical expressions: \(\begin{aligned} \frac{\partial \, l}{\partial {b_{11}}} = w_{21}\,w_{11}\,\exp(w_{11} \, x + b_{11}) \\ \frac{\partial \, l}{\partial {w_{12}}} = w_{22}\, (1+w_{12}\,x)\,\exp(w_{12}\, x + b_{12}) \end{aligned}\)
and others…
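For completeness, differentiating $l$ with respect to the remaining parameters gives
\[\begin{aligned} \frac{\partial \, l}{\partial {w_{11}}} = w_{21}\,(1+w_{11}\,x)\,\exp(w_{11}\, x + b_{11}), \qquad \frac{\partial \, l}{\partial {b_{12}}} = w_{22}\,w_{12}\,\exp(w_{12}\, x + b_{12}),\\ \frac{\partial \, l}{\partial {w_{21}}} = w_{11}\,\exp(w_{11}\, x + b_{11}), \qquad \frac{\partial \, l}{\partial {w_{22}}} = w_{12}\,\exp(w_{12}\, x + b_{12}). \end{aligned}\]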
Note that $\partial_x u$ does not involve the bias of the last layer, $b_{2}$; this is why the gradient list printed above has None as its last element (the element corresponding to the output-layer bias). We now compare the values computed by the GradientTape to the ones evaluated by hand. You should be able to get consistent results with the lines below.
print(f"The value we expect for dl/dw11 = {w21*(1+w11**2)*np.exp(w11*x)}, dl/dw12 = {w22*(1+w12*2)*np.exp(w12*x)}")
print(f"The value we expect for dl/db11 = {w21*w11*np.exp(w11*x)}, dl/db12 = {w22*w12*np.exp(w12*x)}")
print(f"The value we expect for dl/dw21 = {w11*np.exp(w11*x)}, dl/dw22 = {w12*np.exp(w12*x)}")