Postgresql replication

PREAMBLE

It goes without saying that if you want to set up replication then you need a master and a slave postgresql server.  There is a bit of configuration to do in order to get it running.

This how to was originally written for older versions of Postgresql but has been updated to reflect the significant changes in later versions.  It has been tested against version 15.

WARNING: Before continuing with this tutorial you should ensure that you can connect to both servers using SSL as replicating over the internet without ssl is a little bit insane. See here

We are using sha-256 encryption for passwords (which is now the default) so the first thing we want to do is edit postgresql.conf on the master and check that password_encryption is set correctly.

vi /etc/postgresql/15/main/postgresql.conf

NOTE: The directory path will change depending on the version.

Look for the “password_encryption” directive and change it to “scram-sha-256”.  This line may be commented out, if it is then uncomment it.

password_encryption = scram-sha-256             # md5 or scram-sha-256

If you have already set a password for your replication user that is not sha-256 encoded then you will have to set it again or you will not be able to authenticate.  The reason quoted below is explained in a stack overflow question here.

Each user password hash is saved in the table pg_authid. It includes the hashing algorithm that is used to transform the password to its hash.

When setting the password_encryption in postgresql.conf, you are setting the default encryption, i.e. the one used when creating a user or when (re)setting your password. The table pg_authid is not updated.

When changing pg_hba.conf, you are saying to accept only passwords hashed using the given method. The table pg_authid is not updated.

MASTER

On the master the first thing we need to do is create a user for replication purposes so start a postgresql shell and add a user.

CREATE ROLE replicate LOGIN REPLICATION ENCRYPTED PASSWORD 'mySecretPassword';

While we are in the postgresql shell we also want to create a slot.  For some reason we create a slot using a SLEECT statement, not sure why, but it is unimportant in any event.

SELECT * FROM pg_create_physical_replication_slot('replicate');

The slot does not have to have the same name as the user, we just use “replicate” in our example for the sake of simplicity.  NOTE:  It is not necessary to create a slot as postgresql will create a temporary slot if one isn’t explicitly created.

Now we can exit our postgresql client shell and move on to editing our pg_hba.conf file, usually located in /etc/postgresql/11/main

# TYPE DATABASE USER ADDRESS METHOD
...
hostssl replication  replicate    xxx.xxx.xxx.xxx/xx   scram-sha-256
....

IMPORTANT: Replace xxx.xxx.xxx.xxx/xx with the ip and subnet mask of the slave.

The line highlighted above will allow the user replicate to use the built in role ‘replication’ from the slave.  The type is set to ‘hostssl‘ to force the use of ssl encryption and ‘scram-sha-256‘ is the method used to decrypt the password.

Next you need to edit the postgresql.conf file to tell the server that it is acting as a master server.

#vi /etc/postgresql/11/main/postgresql.conf

edit the file as follows, uncommenting anything that is commented out.

listen_addresses = 'localhost,xxx.xxx.xxx.xxx'
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1024MB

Note that xxx.xxx.xxx.xxx is the address of the interface you want to listen on, you can add several addresses here if you are listening on the WAN port and the LAN port for example, albeit if this is the case you may just wish to use a wildcard instead.

listen_address = '*'

Once you have done that you can restart the master and move on to configuring the slave.

#service postgresql restart

 

SLAVE

Before you configure the slave you need to create an ssl certificate using the Certificate Authority that you should have already created on the master (this is a prerequisite).

Once you have generated your certificate you need to copy the .key and .crt files that you have generated across to the slave along with the postgresCA.crt file.  These should be in the /pgsql folder.

Make sure that the postgres user owns them and that they are accessible and that the .key file has limited permissions.

chown postgres:postgres /pgsql/*.crt
chown postgres:postgres /pgsql/*.key
chmod 400 /pgsql/*.key

 

Next stop the slave server

#service postgresql stop

Now you can configure the slave in a similar manner to the master.  Firstly edit postgresql.conf.

WARNING:  If you have a recovery.conf file in your data directory you need to get rid of it or postgresql will not start.  More information provided here https://www.cybertec-postgresql.com/en/recovery-conf-is-gone-in-postgresql-v12/ (Acknowledgement).

Find the following directives in the file and set/uncomment them paying particular attention to the primary_conninfo line where you put the connection string.

NOTE that the text between the ‘ ‘ after the primary_conninfo directive is all on the same line.

You will need to change the host, password, sslcert and sslkey directives to reflect your configuration.

primary_conninfo = 'host=XXX.XXX.XXX.XXX port=5432 user=replicate password=mysecretpassword sslmode=require sslcert=/pgsql/slave.crt sslkey=/pgsql/slave.key sslrootcert=/pgsql/postgresCA.crt' 
primary_slot_name = 'replicate'         # replication slot on sending server
#promote_trigger_file = ''              # file name whose presence ends recovery
hot_standby = on                        # "off" disallows queries during recovery
                                        # (change requires restart)
#max_standby_archive_delay = 30s        # max delay before canceling queries
                                        # when reading WAL from archive;
                                        # -1 allows indefinite delay
#max_standby_streaming_delay = 30s      # max delay before canceling queries
                                        # when reading streaming WAL;
                                        # -1 allows indefinite delay
wal_receiver_create_temp_slot = on      # create temp slot if primary_slot_name
                                        # is not set
wal_receiver_status_interval = 10s      # send replies at least this often
                                        # 0 disables
hot_standby_feedback = on               # send info from standby to prevent
                                        # query conflicts
#wal_receiver_timeout = 60s             # time that receiver waits for
                                        # communication from primary
                                        # in milliseconds; 0 disables
wal_retrieve_retry_interval = 5s        # time to wait before retrying to
                                        # retrieve WAL after a failed attempt
#recovery_min_apply_delay = 0           # minimum delay for applying changes during recovery

Next we need to remove any existing data and copy the folders from the master

rm -R /pgsql/15 
pg_basebackup -v -h xxx.xxx.xxx.xxx -U replicate -P -D /pgsql/data --slot replicate 
chown -R postgres:postgres /pgsql/15

where xxx.xxx.xxx.xxx is the ip address of the master. If you have set up postgresql according to this site then the data directory will be /pgsql.  Note you may also need to set the permissions if you are not logged in as the postgres user.

You will be prompted for the ‘replicate‘ user’s password.  If all is well then the script should terminate with something like

177156/177156 kB (100%), 1/1 tablespace

The slave’s data folder should now be populated with a copy of the master.

NOTE: If you have issues performing this task then you need to resolve them before proceeding.

Now in order for our slave to know it is acting as a standby server we need to create an empty text file in the data directory.

touch /pgsql/15/standby.signal
chown postgres:postgres /pgsql/15/standby.signal

You should now be able to start the server and all should work as it should.

systemctl start postgrtesql

Troubleshooting

The log files in /var/log/postgresql on the master and the slave may help resolve problems.

On the master you can issue the following in the psql client

SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn FROM pg_stat_replication;

which should return a record.  The state should be “streaming”

Most of the problems are caused by permissions so review permissions on the data directory and for the ssl certificates.

Double check the connection string paying attention to the names of the certificates and make sure that they match the files stored in /pgsql/

On the slave from the psql client run

\x
SELECT * FROM pg_stat_wal_receiver;

Again there should be a record for the master.

Check the password for the replication user is encoded with sha-256

SELECT rolpassword from pg_authid where rolname = 'replicate';

rollpassword should begin with  SCRAM-SHA-256

If there is no result then check that the user exists.

NOTE: One of my databases has lots of roles that cause \du to bring up reams of gibberish.  For this reason I use this lookup query instead:

SELECT usename AS role_name,
  CASE 
     WHEN usesuper AND usecreatedb THEN 
       CAST('superuser, create database' AS pg_catalog.text)
     WHEN usesuper THEN 
        CAST('superuser' AS pg_catalog.text)
     WHEN usecreatedb THEN 
        CAST('create database' AS pg_catalog.text)
     ELSE 
        CAST('' AS pg_catalog.text)
  END role_attributes
FROM pg_catalog.pg_user
ORDER BY role_name desc;

 

WAL files not being deleted

For some reason the WAL files used for replication do not always get deleted and simply build up until you run out of space on the partition.

To get rid of them manually from psql -U postgres run

postgres=# 
SELECT slot_name,
   lpad((pg_control_checkpoint()).timeline_id::text, 8, '0') ||
   lpad(split_part(restart_lsn::text, '/', 1), 8, '0') ||
   lpad(substr(split_part(restart_lsn::text, '/', 2), 1, 2), 8, '0')
   AS wal_file
FROM pg_replication_slots;
slot_name |         wal_file
-----------+--------------------------
 postgres2 | 0000000100000000000000B0
(1 row)

postgres=# 
    SELECT pg_drop_replication_slot('postgres2');
 pg_drop_replication_slot
--------------------------

(1 row)

and the shortly after most of the WAL files will have disappeared.  Change “postgres2” in the second statement to whatever “slot_name” is returned in the first.

Acknowledgement: https://stackoverflow.com/questions/49539938/postgres-wal-file-not-getting-deleted