tail -f /dev/null

If you haven't had any obstacles lately, you're not challenging. be the worst.

PandasでセットしたdataframeからPostgreSQLへデータ投入したい

とてもかんたんでした。

environment

  • mac OS Mojave 10.14.2
  • psql (PostgreSQL) 11.1
  • pg_ctl (PostgreSQL) 11.1
  • PostgreSQL 11.1
  • lxml 4.3.1
  • html5lib 1.0.1
  • beautifulsoup4 4.7.1
  • SQLAlchemy 1.2.18
  • pandas 0.24.1

pandasとは

  • Python用データ分析library。
  • data frame(2次元table)形式を扱える。
  • 公式サイト

csvをtsvへ変換

test.csvを準備。

fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6

dataframeにセット。

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd

cdf = pd.read_csv("test.csv", encoding='utf-8', header=0)
pd.DataFrame(cdf).to_csv("test.tsv", sep='\t')

tdf = pd.read_csv("test.tsv", encoding='utf-8', sep='\t', header=0)

tdfの中身。

      Unnamed: 0  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  ...  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0              0            7.4             0.700         0.00             1.9      0.076  ...                  34.0  0.99780  3.51       0.56      9.4        5
1              1            7.8             0.880         0.00             2.6      0.098  ...                  67.0  0.99680  3.20       0.68      9.8        5
2              2            7.8             0.760         0.04             2.3      0.092  ...                  54.0  0.99700  3.26       0.65      9.8        5
3              3           11.2             0.280         0.56             1.9      0.075  ...                  60.0  0.99800  3.16       0.58      9.8        6
4              4            7.4             0.700         0.00             1.9      0.076  ...                  34.0  0.99780  3.51       0.56      9.4        5

scrapingしてtsvで吐き出す

requirements.txt

beautifulsoup4
pandas
lxml
html5lib
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd

BASE_URL = "https://www.baystars.co.jp/players/detail/1000002"
url = BASE_URL

df_html = pd.read_html(url)
df_td = (df_html[1].head())
pd.DataFrame(df_td).to_csv("test2.tsv", sep='\t')

test2.csv

  年度  所属球団    試合  打数  安打  二塁打   三塁打   本塁打   打点  盗塁  打率
0   2010    横浜  3.0 7.0 1.0 0.0 0.0 1.0 1.0 0.0 0.14300000000000002
1   2011    横浜  40.0    145.0   35.0    10.0    0.0 8.0 22.0    1.0 0.24100000000000002
2   2012    横浜DeNA  108.0   386.0   84.0    16.0    3.0 10.0    45.0    1.0 0.218
3   2013    横浜DeNA  23.0    51.0    11.0    1.0 0.0 1.0 3.0 0.0 0.21600000000000003
4   2014    横浜DeNA  114.0   410.0   123.0   24.0    2.0 22.0    77.0    2.0 0.3

以下のような構造になっている。

f:id:hrt0kmt:20190222224840p:plain

brew経由でPostgreSQLをinstall

brew doctor
brew update
brew install postgresql

...{ellipsis}...

To migrate existing data from a previous major version of PostgreSQL run:
  brew postgresql-upgrade-database

To have launchd start postgresql now and restart at login:
  brew services start postgresql
Or, if you don't want/need a background service you can just run:
  pg_ctl -D /usr/local/var/postgres start
==> Summary
🍺  /usr/local/Cellar/postgresql/11.1_1: 3,548 files, 40.3MB

postgresql.conf のlistenはlocalのみ可能にする。

#listen_addresses = 'localhost'    # what IP address(es) to listen on;
listen_addresses = 'localhost'

brew services start postgresql を試して起動していそうに思えるも、プロセスおらず。

brew services start postgresql
==> Successfully started `postgresql` (label: homebrew.mxcl.postgresql)

pg_ctl -D /usr/local/var/postgres start を試すも、database fileが不完全でinitial時のPostgreSQL version差異があるとのエラーが出る。

pg_ctl -D /usr/local/var/postgres start
waiting for server to start....2019-02-22 22:54:14.872 JST [4307] FATAL:  database files are incompatible with server
2019-02-22 22:54:14.872 JST [4307] DETAIL:  The data directory was initialized by PostgreSQL version 10, which is not compatible with this version 11.1.
 stopped waiting
pg_ctl: could not start server
Examine the log output.

古いdatabaseファイルを消して initdb を行う。

rm -rf /usr/local/var/postgres && initdb /usr/local/var/postgres -E utf8
The files belonging to this database system will be owned by user "xxx".
This user must also own the server process.

The database cluster will be initialized with locale "ja_JP.UTF-8".
initdb: could not find suitable text search configuration for locale "ja_JP.UTF-8"
The default text search configuration will be set to "simple".

Data page checksums are disabled.

creating directory /usr/local/var/postgres ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /usr/local/var/postgres -l logfile start

起動

pg_ctl -D /usr/local/var/postgres start

worker process forkされて正常に動いている。 接続もOK。

ps aux | grep postgres

psql -d postgres
psql (11.1)
Type "help" for help.

PostgreSQLへ格納する

requirements.txt

beautifulsoup4
pandas
lxml
html5lib
sqlalchemy

to_sql で適当に投入。

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib import request
from bs4 import BeautifulSoup
from sqlalchemy import create_engine
import pandas as pd

BASE_URL = "https://www.baystars.co.jp/players/detail/1000002"
url = BASE_URL

df_html = pd.read_html(url)
df_td = (df_html[1].head())

engine = create_engine(
    'postgresql://xxx@localhost:5432/postgres', echo=False)
df_td.to_sql('baseball', con=engine, index=False)

SELECTして格納されていることを確認。

postgres=# \d
           List of relations
 Schema |   Name   | Type  |   Owner
--------+----------+-------+-----------
 public | baseball | table | xxx
(1 row)

postgres=# select * from baseball limit 20;
 年度 | 所属球団 | 試合 | 打数 | 安打 | 二塁打 | 三塁打 | 本塁打 | 打点 | 盗塁 | 打率
------+----------+------+------+------+--------+--------+--------+------+------+-------
 2010 | 横浜     |    3 |    7 |    1 |      0 |      0 |      1 |    1 |    0 | 0.143
 2011 | 横浜     |   40 |  145 |   35 |     10 |      0 |      8 |   22 |    1 | 0.241
 2012 | 横浜DeNA |  108 |  386 |   84 |     16 |      3 |     10 |   45 |    1 | 0.218
 2013 | 横浜DeNA |   23 |   51 |   11 |      1 |      0 |      1 |    3 |    0 | 0.216
 2014 | 横浜DeNA |  114 |  410 |  123 |     24 |      2 |     22 |   77 |    2 |   0.3
(5 rows)