Setting Up PostgreSQL
To install PostgreSQL follow this tutorial.
Installing
sudo apt update
sudo apt install postgresql postgresql-contrib
Using Postgres
Switch to postgres (default) account on your server type:
sudo -i -u postgres
Access PostgreSQL prompt:
psql
Exit interactive Postgres session:
\q
New database
Create a new database:
sudo -i -u postgres
createdb database-name
Changing user and database
Add user:
sudo adduser user-name
Switch user and connect to the user database (after creating one with name same as user-name):
sudo -i -u user-name
psql
Changing it altogether:
psql my_database -U postgres
Check current connection info:
\conninfo
Viewing table and database
List database:
\l
List table:
\dt
\dt *.*
List table in a public schema:
\dt public.*
List schema:
\dn
Dumping database for backup
pg_dump dbname > \path\outfile
Extension used:
CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION pg_trgm;
Refer rosette and wikiversity for more info.
Create database as user postgres
In [1]:
%%bash
sudo -i -u postgres
dropdb test;
createdb test;
sudo: no tty present and no askpass program specified
In [2]:
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text
engine = create_engine('postgresql://postgres:postgres@localhost:5432/test')
/home/hectoryee/anaconda3/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
Importing data
Read excel file into pandas.
In [3]:
excel_file = "example.xlsx"
xlsx = pd.ExcelFile(excel_file)
print(xlsx.sheet_names)
df = xlsx.parse('Sheet1')
# # from multiple sheets
# excel_file = "example.xlsx"
# xlsx = pd.ExcelFile(excel_file)
# print(xlsx.sheet_names)
# excel_sheets = []
# for sheet in xlsx.sheet_names:
# excel_sheets.append(xlsx.parse(sheet))
# ds = pd.concat(excel_sheets, ignore_index=True, sort=False)
['Sheet1']
Data Star Student
In [4]:
df.reset_index(inplace =True)
df.columns
Index(['index', 'Name', 'Gender', 'Company', 'Role'], dtype='object')
In [5]:
df.rename(columns={'Name':'name',
'Gender':'gender',
'Company':'company',
'Role':'role'
},
inplace = True)
In [6]:
df.head()
index | name | gender | company | role | |
---|---|---|---|---|---|
0 | 0 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | DIRECTOR |
1 | 1 | Arumugam | Male | Sinar Bumi Petroleum Sdn Bhd | DATA ARCHITECT |
2 | 2 | Pravin | Male | Citi Group | BUSINESS FRAUD ANALYST |
3 | 3 | Tan Pei Ling | Female | University of Malaya | SENIOR LECTURER |
4 | 4 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER |
In [7]:
df[df.duplicated('name') == True].index
Int64Index([4], dtype='int64')
In [8]:
dup = df[df.duplicated('name') == True]
dup
index | name | gender | company | role | |
---|---|---|---|---|---|
4 | 4 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER |
Individual
In [9]:
cleaned_df = df.drop_duplicates(subset=['name'], keep='last')
cleaned_df
index | name | gender | company | role | |
---|---|---|---|---|---|
1 | 1 | Arumugam | Male | Sinar Bumi Petroleum Sdn Bhd | DATA ARCHITECT |
2 | 2 | Pravin | Male | Citi Group | BUSINESS FRAUD ANALYST |
3 | 3 | Tan Pei Ling | Female | University of Malaya | SENIOR LECTURER |
4 | 4 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER |
Functions
In [10]:
# get the intersection list of attributes that is to be inserted and table columns
def table(attrib):
return df[df.columns.intersection(attrib)]
# list columns of table
def list_column(table):
statement = "select column_name from information_schema.columns \
where table_schema ='test' and table_name = '" + table + "';"
df = pd.read_sql(statement, engine)
return df['column_name'].tolist()
# select all from table
def show(table_name):
query = 'SELECT * FROM test.' + table_name
data = pd.read_sql(query, engine)
return data
# append dataframe to table
def insert(df, table_name):
df.to_sql(name=table_name, con=engine,schema = 'test', if_exists = 'append', index=False)
return
# run sql query
def sql_query(statement):
data = pd.read_sql(statement, engine)
return data
# wrap up for insert method
def insert_into(table_name, attrib):
if isinstance(attrib, pd.DataFrame):
to_be_inserted = attrib
else:
to_be_inserted = table(attrib)
insert(to_be_inserted, table_name)
result = show(table_name)
return result.head()
Create Database
Here I create a database from a sql script.
create_db.sql
DROP SCHEMA if exists test cascade;
CREATE SCHEMA test
AUTHORIZATION postgres;
drop table if exists test.individual cascade;
create table test.individual(
name varchar (80) primary key,
gender varchar (7),
company varchar (120),
role varchar (120),
ts timestamp default now()
);
In [11]:
%%bash
psql -d test
\i /home/hectoryee/project/my_blog/database/create_db.sql
\q
DROP SCHEMA
CREATE SCHEMA
DROP TABLE
CREATE TABLE
psql:/home/hectoryee/project/my_blog/database/create_db.sql:1: NOTICE: schema "test" does not exist, skipping
psql:/home/hectoryee/project/my_blog/database/create_db.sql:6: NOTICE: table "individual" does not exist, skipping
Insert into
In [12]:
# columns to be inserted
attrib = ['name', 'gender', 'company', 'role']
data = cleaned_df.loc[:,attrib]
insert_into('individual', data)
name | gender | company | role | ts | |
---|---|---|---|---|---|
0 | Arumugam | Male | Sinar Bumi Petroleum Sdn Bhd | DATA ARCHITECT | 2019-02-03 15:56:14.126646 |
1 | Pravin | Male | Citi Group | BUSINESS FRAUD ANALYST | 2019-02-03 15:56:14.126646 |
2 | Tan Pei Ling | Female | University of Malaya | SENIOR LECTURER | 2019-02-03 15:56:14.126646 |
3 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER | 2019-02-03 15:56:14.126646 |
Query
In [13]:
individual = 'test.individual'
In [14]:
# -- courses/modules offered by CADS
sql_query('''
select name
from test.individual;
''')
name | |
---|---|
0 | Arumugam |
1 | Pravin |
2 | Tan Pei Ling |
3 | Shaw Kok Hao |
In [15]:
# -- insert new individual data
sql = '''
INSERT INTO test.individual (name, gender, company, role)
VALUES (:name, :gender, :company, :role);
'''
engine.execute(text(sql), {'name' : 'John Smith',
'gender' : 'trans',
'company': 'cads',
'role': 'intern'})
<sqlalchemy.engine.result.ResultProxy at 0x7f2501733080>
In [16]:
show('individual').tail()
name | gender | company | role | ts | |
---|---|---|---|---|---|
0 | Arumugam | Male | Sinar Bumi Petroleum Sdn Bhd | DATA ARCHITECT | 2019-02-03 15:56:14.126646 |
1 | Pravin | Male | Citi Group | BUSINESS FRAUD ANALYST | 2019-02-03 15:56:14.126646 |
2 | Tan Pei Ling | Female | University of Malaya | SENIOR LECTURER | 2019-02-03 15:56:14.126646 |
3 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER | 2019-02-03 15:56:14.126646 |
4 | John Smith | trans | cads | intern | 2019-02-03 15:56:14.333465 |
In [17]:
# -- how many data
sql_query('''
SELECT COUNT(*)
FROM test.individual
''')
count | |
---|---|
0 | 5 |
In [18]:
# -- update info
sql = '''
UPDATE test.individual
SET role = :role
WHERE name = :name;
'''
engine.execute(text(sql), {'role' : 'permanent staff',
'name' : 'John Smith'})
<sqlalchemy.engine.result.ResultProxy at 0x7f2500997b70>
In [19]:
show('individual').tail()
name | gender | company | role | ts | |
---|---|---|---|---|---|
0 | Arumugam | Male | Sinar Bumi Petroleum Sdn Bhd | DATA ARCHITECT | 2019-02-03 15:56:14.126646 |
1 | Pravin | Male | Citi Group | BUSINESS FRAUD ANALYST | 2019-02-03 15:56:14.126646 |
2 | Tan Pei Ling | Female | University of Malaya | SENIOR LECTURER | 2019-02-03 15:56:14.126646 |
3 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER | 2019-02-03 15:56:14.126646 |
4 | John Smith | trans | cads | permanent staff | 2019-02-03 15:56:14.333465 |
In [20]:
# -- delete info
sql = '''
DELETE FROM test.individual
WHERE name = :name;
'''
engine.execute(text(sql), {'name' : 'John Smith'})
<sqlalchemy.engine.result.ResultProxy at 0x7f25009a1358>
Fuzzy string matching
This is a mini project to work with. Check similar names to ensure there’s no duplicated record.
CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION pg_trgm;
References:
Soundex
The Soundex system is a method of matching similar-sounding names by converting them to the same code.
The fuzzystrmatch module provides two functions for working with Soundex codes:
soundex(text) returns text
difference(text, text) returns int
The soundex
function converts a string to its Soundex code. The difference
function converts two strings to their Soundex codes and then reports the number
of matching code positions. Since Soundex codes have four characters, the result
ranges from zero to four, with zero being no match and four being an exact
match. (Thus, the function is misnamed — similarity
would have been a better
name.)
In [21]:
%%bash
psql -d test
CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION pg_trgm;
\q
CREATE EXTENSION
CREATE EXTENSION
In [22]:
fuzz = "soundex(name)"
sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE {fuzz} = soundex('Pravin')
ORDER BY {fuzz};
''')
name | soundex | |
---|---|---|
0 | Pravin | P615 |
In [23]:
fuzz = "difference(name, 'Pravi')"
sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE {fuzz} > 2
ORDER BY {fuzz};
''')
name | difference | |
---|---|---|
0 | Pravin | 3 |
Levenshtein
This function calculates the Levenshtein distance between two strings:
levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost)
returns int
levenshtein(text source, text target) returns int
levenshtein_less_equal(text source, text target, int ins_cost, int del_cost, int
sub_cost, int max_d) returns int
levenshtein_less_equal(text source, text target, int max_d) returns int
Both source
and target
can be any non-null string, with a maximum of 255
characters. The cost parameters specify how much to charge for a character
insertion, deletion, or substitution, respectively. You can omit the cost
parameters, as in the second version of the function; in that case they all
default to 1.
levenshtein_less_equal
is an accelerated version of the Levenshtein function
for use when only small distances are of interest. If the actual distance is
less than or equal to max_d
, then levenshtein_less_equal
returns the correct
distance; otherwise it returns some value greater than max_d
. If max_d
is
negative then the behavior is the same as levenshtein.
In [24]:
fuzz = "levenshtein(name, 'Tan Pei Li')"
sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE {fuzz} <= 6
ORDER BY {fuzz};
''')
name | levenshtein | |
---|---|---|
0 | Tan Pei Ling | 2 |
In [25]:
fuzz = "levenshtein(name, 'Tan Pei Li',1,0,4)"
sql_query(f'''
SELECT name , {fuzz}
FROM {individual}
WHERE soundex(name) = soundex('Tan Pei Li')
ORDER BY {fuzz};
''')
name | levenshtein | |
---|---|---|
0 | Tan Pei Ling | 0 |
Metaphone
Metaphone, like Soundex, is based on the idea of constructing a representative code for an input string. Two strings are then deemed similar if they have the same codes.
This function calculates the metaphone code of an input string:
metaphone(text source, int max_output_length) returns text
source
has to be a non-null string with a maximum of 255 characters.
max_output_length
sets the maximum length of the output metaphone code; if
longer, the output is truncated to this length.
In [26]:
fuzz = "metaphone('Pravi', 4)"
sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE metaphone(name, 4) = {fuzz}
ORDER BY {fuzz};
''')
name | metaphone |
---|
In [27]:
fuzz = "metaphone('Pravi', 3)"
sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE metaphone(name, 3) = {fuzz}
ORDER BY {fuzz};
''')
name | metaphone | |
---|---|---|
0 | Pravin | PRF |
Double Metaphone
The Double Metaphone system computes two “sounds like” strings for a given input string — a “primary” and an “alternate”. In most cases they are the same, but for non-English names especially they can be a bit different, depending on pronunciation. These functions compute the primary and alternate codes:
dmetaphone(text source) returns text
dmetaphone_alt(text source) returns text
There is no length limit on the input strings.
In [28]:
fuzz = "dmetaphone('Pravi')"
sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE dmetaphone(name) = {fuzz}
ORDER BY {fuzz};
''')
name | dmetaphone |
---|
In [29]:
fuzz = "dmetaphone_alt('Pravi')"
sql_query(f'''
SELECT name, {fuzz}
FROM {individual}
WHERE dmetaphone_alt(name) = {fuzz}
ORDER BY {fuzz};
''')
name | dmetaphone_alt |
---|
trigram
A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages. Note
pg_trgm ignores non-word characters (non-alphanumerics) when extracting trigrams from a string. Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string. For example, the set of trigrams in the string “cat” is “ c”, “ ca”, “cat”, and “at ”. The set of trigrams in the string “foo|bar” is “ f”, “ fo”, “foo”, “oo ”, “ b”, “ ba”, “bar”, and “ar ”.
pg_trgm Functions
Function|Returns|Description :—|:—|:— similarity(text, text) | real | Returns a number that indicates how similar the two arguments are. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the two strings are identical). show_trgm(text) | text[] | Returns an array of all the trigrams in the given string. (In practice this is seldom useful except for debugging.) word_similarity(text, text) | real | Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below. strict_word_similarity(text, text) | real | Same as word_similarity(text, text), but forces extent boundaries to match word boundaries. Since we don’t have cross-word trigrams, this function actually returns greatest similarity between first string and any continuous extent of words of the second string. show_limit() | real | Returns the current similarity threshold used by the % operator. This sets the minimum similarity between two words for them to be considered similar enough to be misspellings of each other, for example (deprecated). set_limit(real) | real | Sets the current similarity threshold that is used by the % operator. The threshold must be between 0 and 1 (default is 0.3). Returns the same value passed in (deprecated).
In [30]:
sql_query("SELECT word_similarity('word', 'two words');")
word_similarity | |
---|---|
0 | 0.8 |
In the first string, the set of trigrams is {“ w”,” wo”,”ord”,”wor”,”rd “}. In the second string, the ordered set of trigrams is {“ t”,” tw”,”two”,”wo “,” w”,” wo”,”wor”,”ord”,”rds”,”ds “}. The most similar extent of an ordered set of trigrams in the second string is {“ w”,” wo”,”wor”,”ord”}, and the similarity is 0.8.
Thus, the strict_word_similarity(text, text) function is useful for finding the similarity to whole words, while word_similarity(text, text) is more suitable for finding the similarity for parts of words.
postgres10 does not support strict_word_similarity
In [31]:
sql_query("SELECT show_trgm('word')")
show_trgm | |
---|---|
0 | [ w, wo, ord, rd , wor] |
In [32]:
sql_query("SELECT similarity('word', 'two words');")
similarity | |
---|---|
0 | 0.363636 |
pg_trgm Operators
Operator | Returns | Description — | — | — text % text | boolean | Returns true if its arguments have a similarity that is greater than the current similarity threshold set by pg_trgm.similarity_threshold. text <% text | boolean | Returns true if the similarity between the trigram set in the first argument and a continuous extent of an ordered trigram set in the second argument is greater than the current word similarity threshold set by pg_trgm.word_similarity_threshold parameter. text %> text | boolean | Commutator of the <% operator. text «% text | boolean | Returns true if its second argument has a continuous extent of an ordered trigram set that matches word boundaries, and its similarity to the trigram set of the first argument is greater than the current strict word similarity threshold set by the pg_trgm.strict_word_similarity_threshold parameter. text %» text | boolean | Commutator of the «% operator. text <-> text | real | Returns the “distance” between the arguments, that is one minus the similarity() value. text «-> text | real | Returns the “distance” between the arguments, that is one minus the word_similarity() value. text <-» text | real | Commutator of the «-> operator. text «<-> text | real | Returns the “distance” between the arguments, that is one minus the strict_word_similarity() value. text <-»> text | real | Commutator of the «<-> operator.
GUC Parameters
pg_trgm.similarity_threshold (real)
Sets the current similarity threshold that is used by the % operator. The threshold must be between 0 and 1 (default is 0.3).
pg_trgm.word_similarity_threshold (real)
Sets the current word similarity threshold that is used by <% and %> operators. The threshold must be between 0 and 1 (default is 0.6).
Index Support
CREATE TABLE test_trgm (t text);
CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);
or
CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
In [33]:
sql = f'''
CREATE INDEX trgm_idx ON {individual} USING GIN (name gin_trgm_ops);
'''
engine.execute(text(sql))
<sqlalchemy.engine.result.ResultProxy at 0x7f25009972e8>
In [34]:
search_term = 'Pravi'
sql_query(f'''
SELECT name, similarity(name, '{search_term}') AS sml
FROM {individual}
WHERE name %% '{search_term}'
ORDER BY sml DESC, name
LIMIT 5;
''')
name | sml | |
---|---|---|
0 | Pravin | 0.625 |