unbeknowndict

i have an unbeknown dictionary. everything is up to me.

Trying out pandas (web scraping into a DataFrame, then TSV output)

environment

  • pandas 0.24.1
  • lxml 4.3.1
  • html5lib 1.0.1
  • beautifulsoup4 4.7.1

What is pandas

  • A data analysis library for Python.
  • Works with data in DataFrame (two-dimensional table) form.
  • Official site
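To make the "two-dimensional table" point concrete, here is a minimal sketch. The column names and values are made up for the example and have nothing to do with the data used later in this post.

```python
import pandas as pd

# Toy data (hypothetical): each dict key becomes a column, each list item a row.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "score": [1.0, 2.0, 3.0],
})

print(df.shape)            # (number of rows, number of columns)
print(df.dtypes)           # each column carries its own dtype
print(df["score"].mean())  # column-wise operations work directly
```

Rows get a default integer index (0, 1, 2, ...), which matters later when writing files.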

Converting CSV to TSV

Prepare test.csv (excerpt shown).

fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6

Load it into a DataFrame.

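#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd

# Read the CSV; the first line (header=0) supplies the column names.
cdf = pd.read_csv("test.csv", encoding='utf-8', header=0)
# Write it back out tab-separated. The row index is written too (the default).
cdf.to_csv("test.tsv", sep='\t')

# Read the TSV back in; the saved index comes back as an "Unnamed: 0" column.
tdf = pd.read_csv("test.tsv", encoding='utf-8', sep='\t', header=0)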

Contents of tdf.

      Unnamed: 0  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  ...  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0              0            7.4             0.700         0.00             1.9      0.076  ...                  34.0  0.99780  3.51       0.56      9.4        5
1              1            7.8             0.880         0.00             2.6      0.098  ...                  67.0  0.99680  3.20       0.68      9.8        5
2              2            7.8             0.760         0.04             2.3      0.092  ...                  54.0  0.99700  3.26       0.65      9.8        5
3              3           11.2             0.280         0.56             1.9      0.075  ...                  60.0  0.99800  3.16       0.58      9.8        6
4              4            7.4             0.700         0.00             1.9      0.076  ...                  34.0  0.99780  3.51       0.56      9.4        5
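The `Unnamed: 0` column in the output above is the DataFrame index, which `to_csv` writes by default and which comes back as an unnamed column on re-reading. If that column is not wanted, passing `index=False` when writing is one way to avoid it. The sketch below uses an in-memory string with a few columns from test.csv instead of real files, so the variable names here are only for illustration.

```python
from io import StringIO

import pandas as pd

# A small subset of test.csv, held in memory for the example.
CSV_TEXT = """fixed acidity,volatile acidity,quality
7.4,0.7,5
7.8,0.88,5
"""

cdf = pd.read_csv(StringIO(CSV_TEXT))

# With no path given, to_csv returns the text instead of writing a file.
with_index = cdf.to_csv(sep="\t")                  # default: index is written
without_index = cdf.to_csv(sep="\t", index=False)  # index is left out

tdf_with = pd.read_csv(StringIO(with_index), sep="\t")
tdf_without = pd.read_csv(StringIO(without_index), sep="\t")

print(list(tdf_with.columns))
print(list(tdf_without.columns))
```

Alternatively, keeping the index in the file and reading it back with `index_col=0` also avoids the extra column.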

Scraping and writing the result out as TSV

requirements.txt

beautifulsoup4
pandas
lxml
html5lib

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd

BASE_URL = "https://www.baystars.co.jp/players/detail/1000002"
url = BASE_URL

# read_html returns a list of DataFrames, one per <table> found on the page.
df_html = pd.read_html(url)
# Take the first five rows of the second table.
df_td = df_html[1].head()
df_td.to_csv("test2.tsv", sep='\t')
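The script above picks the table by position (`df_html[1]`), which depends on the page layout staying the same. As a rough sketch of an alternative, `read_html` also accepts a `match` argument that keeps only tables containing text matching a given string or regex. The HTML below is a made-up stand-in for the real page (the table contents and English headers are assumptions), wrapped in `StringIO` so nothing is fetched over the network.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second holds batting stats.
HTML = """
<table>
  <thead><tr><th>name</th><th>number</th></tr></thead>
  <tbody><tr><td>Player A</td><td>25</td></tr></tbody>
</table>
<table>
  <thead><tr><th>year</th><th>games</th><th>avg</th></tr></thead>
  <tbody>
    <tr><td>2010</td><td>3</td><td>0.143</td></tr>
    <tr><td>2011</td><td>40</td><td>0.241</td></tr>
  </tbody>
</table>
"""

# Every <table> becomes one DataFrame in the returned list.
tables = pd.read_html(StringIO(HTML))
print(len(tables))

# Keep only tables whose text matches "avg", then take the first hit.
stats = pd.read_html(StringIO(HTML), match="avg")[0]
print(stats)
```

Like the original script, this needs one of the parsers from requirements.txt (lxml, or beautifulsoup4 with html5lib) to be installed.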

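test2.tsv (column headers as scraped: season, team, games, at-bats, hits, doubles, triples, home runs, RBIs, stolen bases, batting average)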

	年度	所属球団	試合	打数	安打	二塁打	三塁打	本塁打	打点	盗塁	打率
0	2010	横浜	3.0	7.0	1.0	0.0	0.0	1.0	1.0	0.0	0.14300000000000002
1	2011	横浜	40.0	145.0	35.0	10.0	0.0	8.0	22.0	1.0	0.24100000000000002
2	2012	横浜DeNA	108.0	386.0	84.0	16.0	3.0	10.0	45.0	1.0	0.218
3	2013	横浜DeNA	23.0	51.0	11.0	1.0	0.0	1.0	3.0	0.0	0.21600000000000003
4	2014	横浜DeNA	114.0	410.0	123.0	24.0	2.0	22.0	77.0	2.0	0.3
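Values such as `0.14300000000000002` and `3.0` in the output above come from the columns being stored as floats. One possible cleanup, sketched below, is to cast the count columns to integers and pass `float_format` to `to_csv` for the rest. The DataFrame here is typed in by hand from the first three rows and only three of the columns, so it is an illustration rather than the scraped data itself.

```python
import pandas as pd

# First three rows of the table above, entered manually for the example.
df = pd.DataFrame({
    "年度": [2010, 2011, 2012],                                   # season
    "試合": [3.0, 40.0, 108.0],                                   # games
    "打率": [0.14300000000000002, 0.24100000000000002, 0.218],    # batting average
})

# Counts are whole numbers here, so cast them to int.
# (astype(int) raises if the column contains missing values.)
df["試合"] = df["試合"].astype(int)

# float_format applies only to the columns that are still floats.
tsv_text = df.to_csv(sep="\t", index=False, float_format="%.3f")
print(tsv_text)
```

Whether three decimals is the right precision is a choice made for this sketch; it matches how batting averages are usually written.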

The structure looks like this.

[image: f:id:hrt0kmt:20190222224840p:plain]