使用Pandas统计2011年CSDN泄漏的密码

Published: 2018-12-27

Tags: Python Pandas

本文总阅读量

整理网盘,发现多年前收藏的CSDN当年泄漏的密码文件,时隔七年,简单的分析统计一下,《CSDN明文口令泄露的启示》,与《CSDN 及人人网的用户密码分析》 两篇文章都分析的蛮多了,可自己动手,还是很有趣的,不是吗

想了一下,用awk等脚本还是... 最后还是用python来统计,第一次使用jupyterlab环境,真心方便,强烈推荐

统计数据共有642w行数据,数据内容如下

[dong@fedora29 test]$ wc -l csdn.csv 
6428632 csdn.csv
[dong@fedora29 test]$ tail csdn.csv 
m**ler1 # 809289527 # y**inlo@myce.net.cn
e**ashe83 # she196521 # e**ashe83@hotmail.com
f**ljt # 123456789 # c**ngyong702@126.com
x**oshazouyihui # xufeng808 # x**72209151@163.com
z**nfp # zhanping # z**n_fp@163.com
x**198 # x111123789 # x**198@hotmail.com
y**888 # 19820619 # y**888@163.com
s**ei2007 # suwei2007 # love_f**weiwei@163.com
f**gchengli # 19860601 # f**gchengli@gmail.com
j**axa # 05040603 # j**axa@hotmail.com

csdn password

最受大家喜爱的前十密码,12345678912345678真是无敌,将近7%的人使用了这两个密码

其次的11111111也很受人民欢迎,但是dearbook是什么?...

再看看2011年大家最常用的邮箱

csdn email

QQ邮箱/网易邮箱用户量还是很大的,2011年的时候,还可以快乐的使用Gmail,自从2014年后...

代码:

密码统计图

# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
from numpy import genfromtxt
import pandas as pd

data = pd.read_csv("csdn.csv", sep=" # ", engine='python', header=None, names=["username", "password", "email"])
password_field = data["password"].value_counts()[:10].index.values
count_field = data["password"].value_counts()[:10].tolist()

plt.rcParams['axes.edgecolor']='#333F4B'
plt.rcParams['axes.linewidth']=0.8
plt.rcParams['xtick.color']='#333F4B'
plt.rcParams['ytick.color']='#333F4B'


fig, ax = plt.subplots(figsize=(10, 5))
# ax.set_xlabel('Count', fontsize=12, fontweight='black', color = '#333F4B')
# ax.set_ylabel('')

ax.invert_yaxis()

# [spine.set_visible(False) for spine in ax.spines.values()]
ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')
ax.spines['left'].set_smart_bounds(True)
ax.spines['bottom'].set_smart_bounds(True)

ax.spines['bottom'].set_position(('axes', 0.01))
ax.spines['left'].set_position(('axes', -0.02))

ax.tick_params(axis='both', which='major', labelsize=12)
# plt.yticks(range(len(count_field)), password_field)

thecolor = '#c1224f'
vmax = max(count_field)
for i, value in enumerate(count_field):
    ax.text(value + vmax * 0.02, i, value, fontsize=12, va='center', color=thecolor)

plt.barh(range(len(count_field)), count_field, tick_label=password_field, color=thecolor, edgecolor=thecolor)
plt.title('CSDN Password/Count', fontsize=18, color='#333F4B')
# plt.show()
plt.savefig('password-count.png', dpi=300, bbox_inches='tight')

邮箱统计图

统计邮箱跟统计密码这个,区别只在于对邮箱列处理了一下,用@符号后的邮箱覆盖原邮箱值

df = pd.read_csv("csdn.csv", sep=" # ", engine='python', header=None, names=["username", "password", "email"])

def f(email):
    if "@" in email:
        return email.split("@")[1]
    else:
        return ""

df['email'] = df['email'].apply(f)

其它

# -*- coding: utf-8 -*-
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
import re 
# import matplotlib.spines.Spine as Spine

df = pd.read_csv("csdn_10000.csv", sep=" # ", engine='python', header=None, names=["username", "password", "email"])

def regex_phone(x):
    phone_pat = re.compile('^(13\d|14[5|7]|15\d|166|17[3|6|7]|18\d)\d{8}$')
    res = re.search(phone_pat, x)
    if res:
        return True
    else:
        return False


# 重建index,密码为手机号
df = df[df['password'].apply(regex_phone)]
df['password'] = df['password'].apply(lambda x: x[:5] + "****" + x[9:])
print(df.reset_index(drop=True))

# 密码中包含字符串
df = df[df['password'].str.contains("csdn")]
df = df.reset_index(drop=True)
print(df)

# 密码中为纯数字
df = df[df['password'].str.contains('^\d*$', regex=True)]
df = df.reset_index(drop=True)
print(df)

# 用户名与密码相等
df = df[df['username'] == df['password']]
print(df)
print(df.count())

# 用户名与邮箱名相等
def f(email):
    if "@" in email:
        return email.split("@")[0]
    else:
        return ""

df = df[df['username'] == df['email'].apply(f)]
df = df.reset_index(drop=True)
print(df)
print(df.count())

有不少人用自己的手机号当密码,这个习惯很不好,密码泄漏的话,更重要的手机号也会被泄漏

                username     password                         email
0                    aim  13691****39                  aim@sina.com
1                 hobnob  13620****32                  52k@21cn.com
2                S_Randy  13574****50           csuslf@yahoo.com.cn
3           headwolf1982  13661****84         ydtxchli_2002@163.com
4                 lzg119  13931****39                lzg119@126.com
...
189365          feiliuli  13401****69      huanglinyan1024@sina.com
189366          gueilinx  13435****40              250835671@qq.com
189367           dahaili  13791****15          dahailee@hotmail.com

[189368 rows x 3 columns]

另外,还有29w人的用户名与密码相同... =,=

参考: