Setting Up PostgreSQL
To install PostgreSQL follow this tutorial.
Installing
sudo apt update
sudo apt install postgresql postgresql-contrib
Using Postgres
Switch to postgres (default) account on your server type:
sudo -i -u postgres
Access PostgreSQL prompt:
psql
Exit interactive Postgres session:
\q
New database
Create a new database:
sudo -i -u postgres
createdb database-name
Changing user and database
Add user:
sudo adduser user-name
Switch user and connect to the user database (after creating one with name same as user-name):
sudo -i -u user-name
psql
Changing it altogether:
psql my_database -U postgres
Check current connection info:
\conninfo
Viewing table and database
List database:
\l
List table:
\dt
\dt *.*
List table in a public schema:
\dt public.*
List schema:
\dn
Dumping database for backup
pg_dump dbname > \path\outfile
Extension used:
CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION pg_trgm;
Refer rosette and wikiversity for more info.
Create database as user postgres
In [1]:
sudo: no tty present and no askpass program specified
In [2]:
/home/hectoryee/anaconda3/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
Importing data
Read excel file into pandas.
In [3]:
['Sheet1']
Data Star Student
In [4]:
Index(['index', 'Name', 'Gender', 'Company', 'Role'], dtype='object')
In [5]:
In [6]:
index | name | gender | company | role | |
---|---|---|---|---|---|
0 | 0 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | DIRECTOR |
1 | 1 | Arumugam | Male | Sinar Bumi Petroleum Sdn Bhd | DATA ARCHITECT |
2 | 2 | Pravin | Male | Citi Group | BUSINESS FRAUD ANALYST |
3 | 3 | Tan Pei Ling | Female | University of Malaya | SENIOR LECTURER |
4 | 4 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER |
In [7]:
Int64Index([4], dtype='int64')
In [8]:
index | name | gender | company | role | |
---|---|---|---|---|---|
4 | 4 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER |
Individual
In [9]:
index | name | gender | company | role | |
---|---|---|---|---|---|
1 | 1 | Arumugam | Male | Sinar Bumi Petroleum Sdn Bhd | DATA ARCHITECT |
2 | 2 | Pravin | Male | Citi Group | BUSINESS FRAUD ANALYST |
3 | 3 | Tan Pei Ling | Female | University of Malaya | SENIOR LECTURER |
4 | 4 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER |
Functions
In [10]:
Create Database
Here I create a database from a sql script.
create_db.sql
DROP SCHEMA if exists test cascade;
CREATE SCHEMA test
AUTHORIZATION postgres;
drop table if exists test.individual cascade;
create table test.individual(
name varchar (80) primary key,
gender varchar (7),
company varchar (120),
role varchar (120),
ts timestamp default now()
);
In [11]:
DROP SCHEMA
CREATE SCHEMA
DROP TABLE
CREATE TABLE
psql:/home/hectoryee/project/my_blog/database/create_db.sql:1: NOTICE: schema "test" does not exist, skipping
psql:/home/hectoryee/project/my_blog/database/create_db.sql:6: NOTICE: table "individual" does not exist, skipping
Insert into
In [12]:
name | gender | company | role | ts | |
---|---|---|---|---|---|
0 | Arumugam | Male | Sinar Bumi Petroleum Sdn Bhd | DATA ARCHITECT | 2019-02-03 15:56:14.126646 |
1 | Pravin | Male | Citi Group | BUSINESS FRAUD ANALYST | 2019-02-03 15:56:14.126646 |
2 | Tan Pei Ling | Female | University of Malaya | SENIOR LECTURER | 2019-02-03 15:56:14.126646 |
3 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER | 2019-02-03 15:56:14.126646 |
Query
In [13]:
In [14]:
name | |
---|---|
0 | Arumugam |
1 | Pravin |
2 | Tan Pei Ling |
3 | Shaw Kok Hao |
In [15]:
<sqlalchemy.engine.result.ResultProxy at 0x7f2501733080>
In [16]:
name | gender | company | role | ts | |
---|---|---|---|---|---|
0 | Arumugam | Male | Sinar Bumi Petroleum Sdn Bhd | DATA ARCHITECT | 2019-02-03 15:56:14.126646 |
1 | Pravin | Male | Citi Group | BUSINESS FRAUD ANALYST | 2019-02-03 15:56:14.126646 |
2 | Tan Pei Ling | Female | University of Malaya | SENIOR LECTURER | 2019-02-03 15:56:14.126646 |
3 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER | 2019-02-03 15:56:14.126646 |
4 | John Smith | trans | cads | intern | 2019-02-03 15:56:14.333465 |
In [17]:
count | |
---|---|
0 | 5 |
In [18]:
<sqlalchemy.engine.result.ResultProxy at 0x7f2500997b70>
In [19]:
name | gender | company | role | ts | |
---|---|---|---|---|---|
0 | Arumugam | Male | Sinar Bumi Petroleum Sdn Bhd | DATA ARCHITECT | 2019-02-03 15:56:14.126646 |
1 | Pravin | Male | Citi Group | BUSINESS FRAUD ANALYST | 2019-02-03 15:56:14.126646 |
2 | Tan Pei Ling | Female | University of Malaya | SENIOR LECTURER | 2019-02-03 15:56:14.126646 |
3 | Shaw Kok Hao | Male | Speedminer Sdn Bhd | ENGINEER | 2019-02-03 15:56:14.126646 |
4 | John Smith | trans | cads | permanent staff | 2019-02-03 15:56:14.333465 |
In [20]:
<sqlalchemy.engine.result.ResultProxy at 0x7f25009a1358>
Fuzzy string matching
This is a mini project to work with. Check similar names to ensure there’s no duplicated record.
CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION pg_trgm;
References:
Soundex
The Soundex system is a method of matching similar-sounding names by converting them to the same code.
The fuzzystrmatch module provides two functions for working with Soundex codes:
soundex(text) returns text
difference(text, text) returns int
The soundex
function converts a string to its Soundex code. The difference
function converts two strings to their Soundex codes and then reports the number
of matching code positions. Since Soundex codes have four characters, the result
ranges from zero to four, with zero being no match and four being an exact
match. (Thus, the function is misnamed — similarity
would have been a better
name.)
In [21]:
CREATE EXTENSION
CREATE EXTENSION
In [22]:
name | soundex | |
---|---|---|
0 | Pravin | P615 |
In [23]:
name | difference | |
---|---|---|
0 | Pravin | 3 |
Levenshtein
This function calculates the Levenshtein distance between two strings:
levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost)
returns int
levenshtein(text source, text target) returns int
levenshtein_less_equal(text source, text target, int ins_cost, int del_cost, int
sub_cost, int max_d) returns int
levenshtein_less_equal(text source, text target, int max_d) returns int
Both source
and target
can be any non-null string, with a maximum of 255
characters. The cost parameters specify how much to charge for a character
insertion, deletion, or substitution, respectively. You can omit the cost
parameters, as in the second version of the function; in that case they all
default to 1.
levenshtein_less_equal
is an accelerated version of the Levenshtein function
for use when only small distances are of interest. If the actual distance is
less than or equal to max_d
, then levenshtein_less_equal
returns the correct
distance; otherwise it returns some value greater than max_d
. If max_d
is
negative then the behavior is the same as levenshtein.
In [24]:
name | levenshtein | |
---|---|---|
0 | Tan Pei Ling | 2 |
In [25]:
name | levenshtein | |
---|---|---|
0 | Tan Pei Ling | 0 |
Metaphone
Metaphone, like Soundex, is based on the idea of constructing a representative code for an input string. Two strings are then deemed similar if they have the same codes.
This function calculates the metaphone code of an input string:
metaphone(text source, int max_output_length) returns text
source
has to be a non-null string with a maximum of 255 characters.
max_output_length
sets the maximum length of the output metaphone code; if
longer, the output is truncated to this length.
In [26]:
name | metaphone |
---|
In [27]:
name | metaphone | |
---|---|---|
0 | Pravin | PRF |
Double Metaphone
The Double Metaphone system computes two “sounds like” strings for a given input string — a “primary” and an “alternate”. In most cases they are the same, but for non-English names especially they can be a bit different, depending on pronunciation. These functions compute the primary and alternate codes:
dmetaphone(text source) returns text
dmetaphone_alt(text source) returns text
There is no length limit on the input strings.
In [28]:
name | dmetaphone |
---|
In [29]:
name | dmetaphone_alt |
---|
trigram
A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages. Note
pg_trgm ignores non-word characters (non-alphanumerics) when extracting trigrams from a string. Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string. For example, the set of trigrams in the string “cat” is “ c”, “ ca”, “cat”, and “at ”. The set of trigrams in the string “foo|bar” is “ f”, “ fo”, “foo”, “oo ”, “ b”, “ ba”, “bar”, and “ar ”.
pg_trgm Functions
Function|Returns|Description :—|:—|:— similarity(text, text) | real | Returns a number that indicates how similar the two arguments are. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the two strings are identical). show_trgm(text) | text[] | Returns an array of all the trigrams in the given string. (In practice this is seldom useful except for debugging.) word_similarity(text, text) | real | Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below. strict_word_similarity(text, text) | real | Same as word_similarity(text, text), but forces extent boundaries to match word boundaries. Since we don’t have cross-word trigrams, this function actually returns greatest similarity between first string and any continuous extent of words of the second string. show_limit() | real | Returns the current similarity threshold used by the % operator. This sets the minimum similarity between two words for them to be considered similar enough to be misspellings of each other, for example (deprecated). set_limit(real) | real | Sets the current similarity threshold that is used by the % operator. The threshold must be between 0 and 1 (default is 0.3). Returns the same value passed in (deprecated).
In [30]:
word_similarity | |
---|---|
0 | 0.8 |
In the first string, the set of trigrams is {“ w”,” wo”,”ord”,”wor”,”rd “}. In the second string, the ordered set of trigrams is {“ t”,” tw”,”two”,”wo “,” w”,” wo”,”wor”,”ord”,”rds”,”ds “}. The most similar extent of an ordered set of trigrams in the second string is {“ w”,” wo”,”wor”,”ord”}, and the similarity is 0.8.
Thus, the strict_word_similarity(text, text) function is useful for finding the similarity to whole words, while word_similarity(text, text) is more suitable for finding the similarity for parts of words.
postgres10 does not support strict_word_similarity
In [31]:
show_trgm | |
---|---|
0 | [ w, wo, ord, rd , wor] |
In [32]:
similarity | |
---|---|
0 | 0.363636 |
pg_trgm Operators
Operator | Returns | Description — | — | — text % text | boolean | Returns true if its arguments have a similarity that is greater than the current similarity threshold set by pg_trgm.similarity_threshold. text <% text | boolean | Returns true if the similarity between the trigram set in the first argument and a continuous extent of an ordered trigram set in the second argument is greater than the current word similarity threshold set by pg_trgm.word_similarity_threshold parameter. text %> text | boolean | Commutator of the <% operator. text «% text | boolean | Returns true if its second argument has a continuous extent of an ordered trigram set that matches word boundaries, and its similarity to the trigram set of the first argument is greater than the current strict word similarity threshold set by the pg_trgm.strict_word_similarity_threshold parameter. text %» text | boolean | Commutator of the «% operator. text <-> text | real | Returns the “distance” between the arguments, that is one minus the similarity() value. text «-> text | real | Returns the “distance” between the arguments, that is one minus the word_similarity() value. text <-» text | real | Commutator of the «-> operator. text «<-> text | real | Returns the “distance” between the arguments, that is one minus the strict_word_similarity() value. text <-»> text | real | Commutator of the «<-> operator.
GUC Parameters
pg_trgm.similarity_threshold (real)
Sets the current similarity threshold that is used by the % operator. The threshold must be between 0 and 1 (default is 0.3).
pg_trgm.word_similarity_threshold (real)
Sets the current word similarity threshold that is used by <% and %> operators. The threshold must be between 0 and 1 (default is 0.6).
Index Support
CREATE TABLE test_trgm (t text);
CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);
or
CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
In [33]:
<sqlalchemy.engine.result.ResultProxy at 0x7f25009972e8>
In [34]:
name | sml | |
---|---|---|
0 | Pravin | 0.625 |