Setting Up PostgreSQL

To install PostgreSQL follow this tutorial.

Installing

sudo apt update
sudo apt install postgresql postgresql-contrib

Using Postgres

Switch to postgres (default) account on your server type:

sudo -i -u postgres

Access PostgreSQL prompt:

psql

Exit interactive Postgres session:

\q

New database

Create a new database:

sudo -i -u postgres
createdb database-name

Changing user and database

Add user:

sudo adduser user-name

Switch user and connect to the user database (after creating one with name same as user-name):

sudo -i -u user-name
psql

Changing it altogether:

psql my_database -U postgres

Check current connection info:

\conninfo

Viewing table and database

List database:

\l

List table:

\dt
\dt *.*

List table in a public schema:

\dt public.*

List schema:

\dn

Dumping database for backup

pg_dump dbname > \path\outfile

Extension used:

CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION pg_trgm;

Refer rosette and wikiversity for more info.

Create database as user postgres

In [1]:

%%bash
sudo -i -u postgres
dropdb test;
createdb test;

sudo: no tty present and no askpass program specified

In [2]:

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text

engine = create_engine('postgresql://postgres:postgres@localhost:5432/test')

/home/hectoryee/anaconda3/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
  """)

Importing data

Read excel file into pandas.

In [3]:

excel_file = "example.xlsx"
xlsx = pd.ExcelFile(excel_file)
print(xlsx.sheet_names)
df = xlsx.parse('Sheet1')

# # from multiple sheets
# excel_file = "example.xlsx"
# xlsx = pd.ExcelFile(excel_file)
# print(xlsx.sheet_names)
# excel_sheets = []
# for sheet in xlsx.sheet_names:
#     excel_sheets.append(xlsx.parse(sheet))
# ds = pd.concat(excel_sheets, ignore_index=True, sort=False)

['Sheet1']

Data Star Student

In [4]:

df.reset_index(inplace =True)
df.columns

Index(['index', 'Name', 'Gender', 'Company', 'Role'], dtype='object')

In [5]:

df.rename(columns={'Name':'name',
                   'Gender':'gender',
                   'Company':'company',
                   'Role':'role'
                  }, 
          inplace = True)

In [6]:

df.head()

	index	name	gender	company	role
0	0	Shaw Kok Hao	Male	Speedminer Sdn Bhd	DIRECTOR
1	1	Arumugam	Male	Sinar Bumi Petroleum Sdn Bhd	DATA ARCHITECT
2	2	Pravin	Male	Citi Group	BUSINESS FRAUD ANALYST
3	3	Tan Pei Ling	Female	University of Malaya	SENIOR LECTURER
4	4	Shaw Kok Hao	Male	Speedminer Sdn Bhd	ENGINEER

In [7]:

df[df.duplicated('name') == True].index

Int64Index([4], dtype='int64')

In [8]:

dup = df[df.duplicated('name') == True]
dup

	index	name	gender	company	role
4	4	Shaw Kok Hao	Male	Speedminer Sdn Bhd	ENGINEER

Individual

In [9]:

cleaned_df = df.drop_duplicates(subset=['name'], keep='last')
cleaned_df

	index	name	gender	company	role
1	1	Arumugam	Male	Sinar Bumi Petroleum Sdn Bhd	DATA ARCHITECT
2	2	Pravin	Male	Citi Group	BUSINESS FRAUD ANALYST
3	3	Tan Pei Ling	Female	University of Malaya	SENIOR LECTURER
4	4	Shaw Kok Hao	Male	Speedminer Sdn Bhd	ENGINEER

Functions

In [10]:

# get the intersection list of attributes that is to be inserted and table columns
def table(attrib):
    return df[df.columns.intersection(attrib)]

# list columns of table
def list_column(table):
    statement = "select column_name from information_schema.columns \
    where table_schema ='test' and table_name = '" + table + "';"
    df = pd.read_sql(statement, engine)
    return df['column_name'].tolist()

# select all from table
def show(table_name):
    query = 'SELECT * FROM test.' + table_name
    data = pd.read_sql(query, engine)
    return data

# append dataframe to table
def insert(df, table_name):
    df.to_sql(name=table_name, con=engine,schema = 'test', if_exists = 'append', index=False)
    return

# run sql query
def sql_query(statement):
    data = pd.read_sql(statement, engine)
    return data

# wrap up for insert method
def insert_into(table_name, attrib):
    if isinstance(attrib, pd.DataFrame):
        to_be_inserted = attrib
    else:
        to_be_inserted = table(attrib)
    insert(to_be_inserted, table_name)
    result = show(table_name)
    return result.head()

Create Database

Here I create a database from a sql script.

create_db.sql

DROP SCHEMA if exists test cascade;

CREATE SCHEMA test
    AUTHORIZATION postgres;
	
drop table if exists test.individual cascade;
create table test.individual(
	name varchar (80) primary key, 
	gender varchar (7),
	company varchar (120),
	role varchar (120),
	ts timestamp default now()
	);

In [11]:

%%bash
psql -d test
\i /home/hectoryee/project/my_blog/database/create_db.sql 
\q

DROP SCHEMA
CREATE SCHEMA
DROP TABLE
CREATE TABLE


psql:/home/hectoryee/project/my_blog/database/create_db.sql:1: NOTICE:  schema "test" does not exist, skipping
psql:/home/hectoryee/project/my_blog/database/create_db.sql:6: NOTICE:  table "individual" does not exist, skipping

Insert into

In [12]:

# columns to be inserted
attrib = ['name', 'gender', 'company', 'role']
data = cleaned_df.loc[:,attrib]
insert_into('individual', data)

	name	gender	company	role	ts
0	Arumugam	Male	Sinar Bumi Petroleum Sdn Bhd	DATA ARCHITECT	2019-02-03 15:56:14.126646
1	Pravin	Male	Citi Group	BUSINESS FRAUD ANALYST	2019-02-03 15:56:14.126646
2	Tan Pei Ling	Female	University of Malaya	SENIOR LECTURER	2019-02-03 15:56:14.126646
3	Shaw Kok Hao	Male	Speedminer Sdn Bhd	ENGINEER	2019-02-03 15:56:14.126646

Query

In [13]:

individual = 'test.individual'

In [14]:

# -- courses/modules offered by CADS
sql_query('''
select name
from test.individual;
''')

	name
0	Arumugam
1	Pravin
2	Tan Pei Ling
3	Shaw Kok Hao

In [15]:

# -- insert new individual data
sql = '''
INSERT INTO test.individual (name, gender, company, role)
VALUES (:name, :gender, :company, :role);
'''

engine.execute(text(sql), {'name' : 'John Smith', 
                          'gender' : 'trans',
                          'company': 'cads',
                          'role': 'intern'})

<sqlalchemy.engine.result.ResultProxy at 0x7f2501733080>

In [16]:

show('individual').tail()

	name	gender	company	role	ts
0	Arumugam	Male	Sinar Bumi Petroleum Sdn Bhd	DATA ARCHITECT	2019-02-03 15:56:14.126646
1	Pravin	Male	Citi Group	BUSINESS FRAUD ANALYST	2019-02-03 15:56:14.126646
2	Tan Pei Ling	Female	University of Malaya	SENIOR LECTURER	2019-02-03 15:56:14.126646
3	Shaw Kok Hao	Male	Speedminer Sdn Bhd	ENGINEER	2019-02-03 15:56:14.126646
4	John Smith	trans	cads	intern	2019-02-03 15:56:14.333465

In [17]:

# -- how many data
sql_query('''
SELECT COUNT(*)
FROM test.individual
''')

	count
0	5

In [18]:

# -- update info
sql = '''
UPDATE test.individual
SET role = :role
WHERE name = :name;
'''

engine.execute(text(sql), {'role' : 'permanent staff', 
                       'name' : 'John Smith'})

<sqlalchemy.engine.result.ResultProxy at 0x7f2500997b70>

In [19]:

show('individual').tail()

	name	gender	company	role	ts
0	Arumugam	Male	Sinar Bumi Petroleum Sdn Bhd	DATA ARCHITECT	2019-02-03 15:56:14.126646
1	Pravin	Male	Citi Group	BUSINESS FRAUD ANALYST	2019-02-03 15:56:14.126646
2	Tan Pei Ling	Female	University of Malaya	SENIOR LECTURER	2019-02-03 15:56:14.126646
3	Shaw Kok Hao	Male	Speedminer Sdn Bhd	ENGINEER	2019-02-03 15:56:14.126646
4	John Smith	trans	cads	permanent staff	2019-02-03 15:56:14.333465

In [20]:

# -- delete info
sql = '''
DELETE FROM test.individual
WHERE name = :name;
'''

engine.execute(text(sql), {'name' : 'John Smith'})

<sqlalchemy.engine.result.ResultProxy at 0x7f25009a1358>

Fuzzy string matching

This is a mini project to work with. Check similar names to ensure there’s no duplicated record.

CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION pg_trgm;

References:

Soundex

The Soundex system is a method of matching similar-sounding names by converting them to the same code.

The fuzzystrmatch module provides two functions for working with Soundex codes:

soundex(text) returns text
difference(text, text) returns int

The soundex function converts a string to its Soundex code. The difference function converts two strings to their Soundex codes and then reports the number of matching code positions. Since Soundex codes have four characters, the result ranges from zero to four, with zero being no match and four being an exact match. (Thus, the function is misnamed — similarity would have been a better name.)

In [21]:

%%bash
psql -d test
CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION pg_trgm;
\q

CREATE EXTENSION
CREATE EXTENSION

In [22]:

fuzz = "soundex(name)"

sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE {fuzz} = soundex('Pravin') 
ORDER BY {fuzz};
''')

	name	soundex
0	Pravin	P615

In [23]:

fuzz = "difference(name, 'Pravi')"

sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE {fuzz} > 2
ORDER BY {fuzz};
''')

	name	difference
0	Pravin	3

Levenshtein

This function calculates the Levenshtein distance between two strings:

levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost)
returns int
levenshtein(text source, text target) returns int
levenshtein_less_equal(text source, text target, int ins_cost, int del_cost, int
sub_cost, int max_d) returns int
levenshtein_less_equal(text source, text target, int max_d) returns int

Both source and target can be any non-null string, with a maximum of 255 characters. The cost parameters specify how much to charge for a character insertion, deletion, or substitution, respectively. You can omit the cost parameters, as in the second version of the function; in that case they all default to 1.

levenshtein_less_equal is an accelerated version of the Levenshtein function for use when only small distances are of interest. If the actual distance is less than or equal to max_d, then levenshtein_less_equal returns the correct distance; otherwise it returns some value greater than max_d. If max_d is negative then the behavior is the same as levenshtein.

In [24]:

fuzz = "levenshtein(name, 'Tan Pei Li')"

sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE {fuzz} <= 6
ORDER BY {fuzz};
''')

	name	levenshtein
0	Tan Pei Ling	2

In [25]:

fuzz = "levenshtein(name, 'Tan Pei Li',1,0,4)"

sql_query(f'''
SELECT name , {fuzz}
FROM {individual}
WHERE soundex(name) = soundex('Tan Pei Li')
ORDER BY {fuzz};
''')

	name	levenshtein
0	Tan Pei Ling	0

Metaphone

Metaphone, like Soundex, is based on the idea of constructing a representative code for an input string. Two strings are then deemed similar if they have the same codes.

This function calculates the metaphone code of an input string:

metaphone(text source, int max_output_length) returns text

source has to be a non-null string with a maximum of 255 characters. max_output_length sets the maximum length of the output metaphone code; if longer, the output is truncated to this length.

In [26]:

fuzz = "metaphone('Pravi', 4)"

sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE metaphone(name, 4) = {fuzz}
ORDER BY {fuzz};
''')

	name	metaphone

In [27]:

fuzz = "metaphone('Pravi', 3)"

sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE metaphone(name, 3) = {fuzz}
ORDER BY {fuzz};
''')

	name	metaphone
0	Pravin	PRF

Double Metaphone

The Double Metaphone system computes two “sounds like” strings for a given input string — a “primary” and an “alternate”. In most cases they are the same, but for non-English names especially they can be a bit different, depending on pronunciation. These functions compute the primary and alternate codes:

dmetaphone(text source) returns text
dmetaphone_alt(text source) returns text

There is no length limit on the input strings.

In [28]:

fuzz = "dmetaphone('Pravi')"

sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE dmetaphone(name) = {fuzz}
ORDER BY {fuzz};
''')

	name	dmetaphone

In [29]:

fuzz = "dmetaphone_alt('Pravi')"

sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE dmetaphone_alt(name) = {fuzz}
ORDER BY {fuzz};
''')

	name	dmetaphone_alt

trigram

postgesql

A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages. Note

pg_trgm ignores non-word characters (non-alphanumerics) when extracting trigrams from a string. Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string. For example, the set of trigrams in the string “cat” is “ c”, “ ca”, “cat”, and “at ”. The set of trigrams in the string “foo|bar” is “ f”, “ fo”, “foo”, “oo ”, “ b”, “ ba”, “bar”, and “ar ”.

pg_trgm Functions

Function|Returns|Description :—|:—|:— similarity(text, text) | real | Returns a number that indicates how similar the two arguments are. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the two strings are identical). show_trgm(text) | text[] | Returns an array of all the trigrams in the given string. (In practice this is seldom useful except for debugging.) word_similarity(text, text) | real | Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below. strict_word_similarity(text, text) | real | Same as word_similarity(text, text), but forces extent boundaries to match word boundaries. Since we don’t have cross-word trigrams, this function actually returns greatest similarity between first string and any continuous extent of words of the second string. show_limit() | real | Returns the current similarity threshold used by the % operator. This sets the minimum similarity between two words for them to be considered similar enough to be misspellings of each other, for example (deprecated). set_limit(real) | real | Sets the current similarity threshold that is used by the % operator. The threshold must be between 0 and 1 (default is 0.3). Returns the same value passed in (deprecated).

In [30]:

sql_query("SELECT word_similarity('word', 'two words');")

	word_similarity
0	0.8

In the first string, the set of trigrams is {“ w”,” wo”,”ord”,”wor”,”rd “}. In the second string, the ordered set of trigrams is {“ t”,” tw”,”two”,”wo “,” w”,” wo”,”wor”,”ord”,”rds”,”ds “}. The most similar extent of an ordered set of trigrams in the second string is {“ w”,” wo”,”wor”,”ord”}, and the similarity is 0.8.

Thus, the strict_word_similarity(text, text) function is useful for finding the similarity to whole words, while word_similarity(text, text) is more suitable for finding the similarity for parts of words.

postgres10 does not support strict_word_similarity

In [31]:

sql_query("SELECT show_trgm('word')")

	show_trgm
0	[ w, wo, ord, rd , wor]

In [32]:

sql_query("SELECT similarity('word', 'two words');")

	similarity
0	0.363636

pg_trgm Operators

Operator | Returns | Description — | — | — text % text | boolean | Returns true if its arguments have a similarity that is greater than the current similarity threshold set by pg_trgm.similarity_threshold. text <% text | boolean | Returns true if the similarity between the trigram set in the first argument and a continuous extent of an ordered trigram set in the second argument is greater than the current word similarity threshold set by pg_trgm.word_similarity_threshold parameter. text %> text | boolean | Commutator of the <% operator. text «% text | boolean | Returns true if its second argument has a continuous extent of an ordered trigram set that matches word boundaries, and its similarity to the trigram set of the first argument is greater than the current strict word similarity threshold set by the pg_trgm.strict_word_similarity_threshold parameter. text %» text | boolean | Commutator of the «% operator. text <-> text | real | Returns the “distance” between the arguments, that is one minus the similarity() value. text «-> text | real | Returns the “distance” between the arguments, that is one minus the word_similarity() value. text <-» text | real | Commutator of the «-> operator. text «<-> text | real | Returns the “distance” between the arguments, that is one minus the strict_word_similarity() value. text <-»> text | real | Commutator of the «<-> operator.

GUC Parameters

pg_trgm.similarity_threshold (real)

Sets the current similarity threshold that is used by the % operator. The threshold must be between 0 and 1 (default is 0.3).

pg_trgm.word_similarity_threshold (real)

Sets the current word similarity threshold that is used by <% and %> operators. The threshold must be between 0 and 1 (default is 0.6).

Index Support

CREATE TABLE test_trgm (t text);
CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);

CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);

In [33]:

sql = f'''
CREATE INDEX trgm_idx ON {individual} USING GIN (name gin_trgm_ops);
'''
engine.execute(text(sql))

<sqlalchemy.engine.result.ResultProxy at 0x7f25009972e8>

In [34]:

search_term = 'Pravi'
sql_query(f'''
SELECT name, similarity(name, '{search_term}') AS sml
FROM {individual}
WHERE name %% '{search_term}'
ORDER BY sml DESC, name
LIMIT 5;
''')

	name	sml
0	Pravin	0.625

Share on

Twitter Facebook LinkedIn

Xun Wei