とてもかんたんでした。
environment
- mac OS Mojave 10.14.2
- psql (PostgreSQL) 11.1
- pg_ctl (PostgreSQL) 11.1
- PostgreSQL 11.1
- lxml 4.3.1
- html5lib 1.0.1
- beautifulsoup4 4.7.1
- SQLAlchemy 1.2.18
- pandas 0.24.1
pandasとは
- Python用データ分析library。
- data frame(2次元table)形式を扱える。
- 公式サイト
csvをtsvへ変換
test.csvを準備。
fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality 7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5 7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5 7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5 11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
dataframeにセット。
#!/usr/bin/env python # -*- coding: utf-8 -*- import pandas as pd cdf = pd.read_csv("test.csv", encoding='utf-8', header=0) pd.DataFrame(cdf).to_csv("test.tsv", sep='\t') tdf = pd.read_csv("test.tsv", encoding='utf-8', sep='\t', header=0)
tdfの中身。
Unnamed: 0 fixed acidity volatile acidity citric acid residual sugar chlorides ... total sulfur dioxide density pH sulphates alcohol quality 0 0 7.4 0.700 0.00 1.9 0.076 ... 34.0 0.99780 3.51 0.56 9.4 5 1 1 7.8 0.880 0.00 2.6 0.098 ... 67.0 0.99680 3.20 0.68 9.8 5 2 2 7.8 0.760 0.04 2.3 0.092 ... 54.0 0.99700 3.26 0.65 9.8 5 3 3 11.2 0.280 0.56 1.9 0.075 ... 60.0 0.99800 3.16 0.58 9.8 6 4 4 7.4 0.700 0.00 1.9 0.076 ... 34.0 0.99780 3.51 0.56 9.4 5
scrapingしてtsvで吐き出す
requirements.txt
beautifulsoup4 pandas lxml html5lib
#!/usr/bin/env python # -*- coding: utf-8 -*- import pandas as pd BASE_URL = "https://www.baystars.co.jp/players/detail/1000002" url = BASE_URL df_html = pd.read_html(url) df_td = (df_html[1].head()) pd.DataFrame(df_td).to_csv("test2.tsv", sep='\t')
test2.csv
年度 所属球団 試合 打数 安打 二塁打 三塁打 本塁打 打点 盗塁 打率 0 2010 横浜 3.0 7.0 1.0 0.0 0.0 1.0 1.0 0.0 0.14300000000000002 1 2011 横浜 40.0 145.0 35.0 10.0 0.0 8.0 22.0 1.0 0.24100000000000002 2 2012 横浜DeNA 108.0 386.0 84.0 16.0 3.0 10.0 45.0 1.0 0.218 3 2013 横浜DeNA 23.0 51.0 11.0 1.0 0.0 1.0 3.0 0.0 0.21600000000000003 4 2014 横浜DeNA 114.0 410.0 123.0 24.0 2.0 22.0 77.0 2.0 0.3
以下のような構造になっている。
brew経由でPostgreSQLをinstall
brew doctor brew update brew install postgresql ...{ellipsis}... To migrate existing data from a previous major version of PostgreSQL run: brew postgresql-upgrade-database To have launchd start postgresql now and restart at login: brew services start postgresql Or, if you don't want/need a background service you can just run: pg_ctl -D /usr/local/var/postgres start ==> Summary 🍺 /usr/local/Cellar/postgresql/11.1_1: 3,548 files, 40.3MB
postgresql.conf
のlistenはlocalのみ可能にする。
#listen_addresses = 'localhost' # what IP address(es) to listen on; listen_addresses = 'localhost'
brew services start postgresql
を試して起動していそうに思えるも、プロセスおらず。
brew services start postgresql ==> Successfully started `postgresql` (label: homebrew.mxcl.postgresql)
pg_ctl -D /usr/local/var/postgres start
を試すも、database fileが不完全でinitial時のPostgreSQL version差異があるとのエラーが出る。
pg_ctl -D /usr/local/var/postgres start waiting for server to start....2019-02-22 22:54:14.872 JST [4307] FATAL: database files are incompatible with server 2019-02-22 22:54:14.872 JST [4307] DETAIL: The data directory was initialized by PostgreSQL version 10, which is not compatible with this version 11.1. stopped waiting pg_ctl: could not start server Examine the log output.
古いdatabaseファイルを消して initdb
を行う。
rm -rf /usr/local/var/postgres && initdb /usr/local/var/postgres -E utf8 The files belonging to this database system will be owned by user "xxx". This user must also own the server process. The database cluster will be initialized with locale "ja_JP.UTF-8". initdb: could not find suitable text search configuration for locale "ja_JP.UTF-8" The default text search configuration will be set to "simple". Data page checksums are disabled. creating directory /usr/local/var/postgres ... ok creating subdirectories ... ok selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting dynamic shared memory implementation ... posix creating configuration files ... ok running bootstrap script ... ok performing post-bootstrap initialization ... ok syncing data to disk ... ok WARNING: enabling "trust" authentication for local connections You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb. Success. You can now start the database server using: pg_ctl -D /usr/local/var/postgres -l logfile start
起動
pg_ctl -D /usr/local/var/postgres start
worker process forkされて正常に動いている。 接続もOK。
ps aux | grep postgres psql -d postgres psql (11.1) Type "help" for help.
PostgreSQLへ格納する
requirements.txt
beautifulsoup4 pandas lxml html5lib sqlalchemy
to_sql
で適当に投入。
#!/usr/bin/env python # -*- coding: utf-8 -*- from urllib import request from bs4 import BeautifulSoup from sqlalchemy import create_engine import pandas as pd BASE_URL = "https://www.baystars.co.jp/players/detail/1000002" url = BASE_URL df_html = pd.read_html(url) df_td = (df_html[1].head()) engine = create_engine( 'postgresql://xxx@localhost:5432/postgres', echo=False) df_td.to_sql('baseball', con=engine, index=False)
SELECTして格納されていることを確認。
postgres=# \d List of relations Schema | Name | Type | Owner --------+----------+-------+----------- public | baseball | table | xxx (1 row) postgres=# select * from baseball limit 20; 年度 | 所属球団 | 試合 | 打数 | 安打 | 二塁打 | 三塁打 | 本塁打 | 打点 | 盗塁 | 打率 ------+----------+------+------+------+--------+--------+--------+------+------+------- 2010 | 横浜 | 3 | 7 | 1 | 0 | 0 | 1 | 1 | 0 | 0.143 2011 | 横浜 | 40 | 145 | 35 | 10 | 0 | 8 | 22 | 1 | 0.241 2012 | 横浜DeNA | 108 | 386 | 84 | 16 | 3 | 10 | 45 | 1 | 0.218 2013 | 横浜DeNA | 23 | 51 | 11 | 1 | 0 | 1 | 3 | 0 | 0.216 2014 | 横浜DeNA | 114 | 410 | 123 | 24 | 2 | 22 | 77 | 2 | 0.3 (5 rows)