整理网盘,发现多年前收藏的CSDN当年泄漏的密码文件,时隔七年,简单的分析统计一下,《CSDN明文口令泄露的启示》,与《CSDN 及人人网的用户密码分析》 两篇文章都分析的蛮多了,可自己动手,还是很有趣的,不是吗
想了一下,用awk等脚本还是... 最后还是用python来统计,第一次使用jupyterlab
环境,真心方便,强烈推荐
统计数据共有642w行数据,数据内容如下
[dong@fedora29 test]$ wc -l csdn.csv
6428632 csdn.csv
[dong@fedora29 test]$ tail csdn.csv
m**ler1 # 809289527 # y**inlo@myce.net.cn
e**ashe83 # she196521 # e**ashe83@hotmail.com
f**ljt # 123456789 # c**ngyong702@126.com
x**oshazouyihui # xufeng808 # x**72209151@163.com
z**nfp # zhanping # z**n_fp@163.com
x**198 # x111123789 # x**198@hotmail.com
y**888 # 19820619 # y**888@163.com
s**ei2007 # suwei2007 # love_f**weiwei@163.com
f**gchengli # 19860601 # f**gchengli@gmail.com
j**axa # 05040603 # j**axa@hotmail.com
最受大家喜爱的前十密码,123456789
和12345678
真是无敌,将近7%
的人使用了这两个密码
其次的11111111
也很受人民欢迎,但是dearbook
是什么?...
再看看2011年大家最常用的邮箱
QQ邮箱/网易邮箱用户量还是很大的,2011年的时候,还可以快乐的使用Gmail,自从2014年后...
代码:
密码统计图
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
from numpy import genfromtxt
import pandas as pd
data = pd.read_csv("csdn.csv", sep=" # ", engine='python', header=None, names=["username", "password", "email"])
password_field = data["password"].value_counts()[:10].index.values
count_field = data["password"].value_counts()[:10].tolist()
plt.rcParams['axes.edgecolor']='#333F4B'
plt.rcParams['axes.linewidth']=0.8
plt.rcParams['xtick.color']='#333F4B'
plt.rcParams['ytick.color']='#333F4B'
fig, ax = plt.subplots(figsize=(10, 5))
# ax.set_xlabel('Count', fontsize=12, fontweight='black', color = '#333F4B')
# ax.set_ylabel('')
ax.invert_yaxis()
# [spine.set_visible(False) for spine in ax.spines.values()]
ax.spines['top'].set_color('none')
ax.spines['right'].set_color('none')
ax.spines['left'].set_smart_bounds(True)
ax.spines['bottom'].set_smart_bounds(True)
ax.spines['bottom'].set_position(('axes', 0.01))
ax.spines['left'].set_position(('axes', -0.02))
ax.tick_params(axis='both', which='major', labelsize=12)
# plt.yticks(range(len(count_field)), password_field)
thecolor = '#c1224f'
vmax = max(count_field)
for i, value in enumerate(count_field):
ax.text(value + vmax * 0.02, i, value, fontsize=12, va='center', color=thecolor)
plt.barh(range(len(count_field)), count_field, tick_label=password_field, color=thecolor, edgecolor=thecolor)
plt.title('CSDN Password/Count', fontsize=18, color='#333F4B')
# plt.show()
plt.savefig('password-count.png', dpi=300, bbox_inches='tight')
邮箱统计图
统计邮箱跟统计密码这个,区别只在于对邮箱列处理了一下,用@符号后的邮箱覆盖原邮箱值
df = pd.read_csv("csdn.csv", sep=" # ", engine='python', header=None, names=["username", "password", "email"])
def f(email):
if "@" in email:
return email.split("@")[1]
else:
return ""
df['email'] = df['email'].apply(f)
其它
# -*- coding: utf-8 -*-
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt
import re
# import matplotlib.spines.Spine as Spine
df = pd.read_csv("csdn_10000.csv", sep=" # ", engine='python', header=None, names=["username", "password", "email"])
def regex_phone(x):
phone_pat = re.compile('^(13\d|14[5|7]|15\d|166|17[3|6|7]|18\d)\d{8}$')
res = re.search(phone_pat, x)
if res:
return True
else:
return False
# 重建index,密码为手机号
df = df[df['password'].apply(regex_phone)]
df['password'] = df['password'].apply(lambda x: x[:5] + "****" + x[9:])
print(df.reset_index(drop=True))
# 密码中包含字符串
df = df[df['password'].str.contains("csdn")]
df = df.reset_index(drop=True)
print(df)
# 密码中为纯数字
df = df[df['password'].str.contains('^\d*$', regex=True)]
df = df.reset_index(drop=True)
print(df)
# 用户名与密码相等
df = df[df['username'] == df['password']]
print(df)
print(df.count())
# 用户名与邮箱名相等
def f(email):
if "@" in email:
return email.split("@")[0]
else:
return ""
df = df[df['username'] == df['email'].apply(f)]
df = df.reset_index(drop=True)
print(df)
print(df.count())
有不少人用自己的手机号当密码,这个习惯很不好,密码泄漏的话,更重要的手机号也会被泄漏
username password email
0 aim 13691****39 aim@sina.com
1 hobnob 13620****32 52k@21cn.com
2 S_Randy 13574****50 csuslf@yahoo.com.cn
3 headwolf1982 13661****84 ydtxchli_2002@163.com
4 lzg119 13931****39 lzg119@126.com
...
189365 feiliuli 13401****69 huanglinyan1024@sina.com
189366 gueilinx 13435****40 250835671@qq.com
189367 dahaili 13791****15 dahailee@hotmail.com
[189368 rows x 3 columns]
另外,还有29w人的用户名与密码相同... =,=
参考: