# Python Web Scraping
# Flag
- https://github.com/topics/scraper
- https://github.com/topics/webscraper
- https://github.com/topics/web-scraper
- https://github.com/topics/spider
- https://github.com/topics/crawler
- https://github.com/topics/webcrawler
- https://github.com/topics/web-crawler
- https://github.com/topics/automation
- https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China
- xvfb can render the screen's image output into virtual memory
- https://github.com/scrapy/scrapy
- Kill all chromedriver processes in bulk
ps -efww|grep LOCAL=chromedriver|grep -v grep|cut -c 9-15|xargs kill -9
:: Windows
taskkill /f /im chromedriver.exe
- Web page Referer headers and ways to get past Referer-based anti-scraping checks
- ReCaptcha CAPTCHAs (most commonly Google's) https://github.com/topics/captcha-solving
- https://github.com/teal33t/captcha_bypass
- https://github.com/NotHassan/Python-Google-Recaptcha-v2-Solver
- https://github.com/balanceofprobability/decaptcha
- https://github.com/IshanManchanda/pyreCAPTCHA
- https://github.com/tafalk/recaptcha-token-resolver-function
- Making Cloudflare requests through an HTTP interceptor https://github.com/sayem314/hooman
- Hashcat cracking queue system https://github.com/f0cker/crackq
- Advanced information gathering for phone numbers https://github.com/sundowndev/PhoneInfoga
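The bulk chromedriver kill commands listed earlier in this section can be wrapped in a small cross-platform helper. This is a minimal sketch: the function names are hypothetical, and on non-Windows systems it swaps the `ps | grep | cut | xargs` pipeline for `pkill`, assuming `pkill` is installed.

```python
import platform
import subprocess


def chromedriver_kill_command(system=None):
    """Return the argv list that force-kills every chromedriver process."""
    system = system or platform.system()
    if system == "Windows":
        # Same command as the Windows line in the notes.
        return ["taskkill", "/f", "/im", "chromedriver.exe"]
    # Assumption: pkill is available (typical on Linux and macOS).
    # It matches on the process name and sends SIGKILL, like `kill -9`.
    return ["pkill", "-9", "chromedriver"]


def kill_chromedriver(system=None):
    """Run the kill command and return its exit code.

    A non-zero exit code usually just means no process matched.
    """
    return subprocess.run(chromedriver_kill_command(system), check=False).returncode
```

Calling `kill_chromedriver()` at the end of a scraping job is one way to clean up drivers left behind by crashed sessions.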
# Open-Source Scripts
- Research on scraping WeChat official accounts
- WeChat official account scraper
- https://github.com/wnma3mz/wechat_articles_spider
- WeChat Hook
- https://github.com/redtips/wechathook
- https://github.com/wwg88888888/WeChatExt
- https://github.com/TonyChen56/WeChatRobot
- https://github.com/KongKong20/WeChatPCHook
- WeChatDownload, a small tool for batch-downloading WeChat official account articles
- https://github.com/LeLe86/vWeChatCrawl
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=Mzg3NjE1MTczMQ==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzU5ODUwNzY1Nw==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzU4NzU0MDIzOQ==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzA4Mjc1NjE0Mw==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzIxMDAwMDAxMw==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzAxODcyNjEzNQ==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzI1NTI3MzEwMg==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzU4ODI1MjA3NQ==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzIyMzgyODkxMQ==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzAxNjk4ODE4OQ==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzIzNzYxNDYzNw==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzI3NzE0NjcwMg==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=Mzg2OTA0Njk0OA==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzA3MjMwMzg2Nw==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzkwMDE1MzkwNQ==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzkzODE3OTI0Ng==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzI4MTIxNDcxOQ==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzI3ODcxMzQzMw==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzA3ODQ0Mzg2OA==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzIwMTY0NDU3Nw==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzI4Njc5NjM1NQ==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzU1NTkwODE4Mw==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzIwOTE2MzU4NA==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzI0NzEyODIyOA==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzUzMzQ2MDIyMA==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=Mzg3NjIxMjA1Ng==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzI4NjExMDA4NQ==
- https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MzI0OTQ4NzY5NA==
Check-in
JD.com
- JD.com Moutai flash-sale scripts https://github.com/search?q=jd_maotai
- https://github.com/search?q=茅台
- JD.com check-in https://github.com/ruicky/jd_sign_bot
- https://github.com/ZainCheung/helper-618
- https://github.com/tychxn/jd-assistant
- https://github.com/xuess/puppeteer-sign
- https://github.com/zqjzqj/mtSecKill
- https://github.com/Yx1aoq1/jdms
- https://github.com/shylocks/Loon
- https://github.com/LXK9301/jd_scripts/tree/master
- https://github.com/NobyDa/Script/tree/master
- https://github.com/yangtingxiao/QuantumultX
- https://github.com/chavyleung/scripts
China Unicom
- https://github.com/2368693074/39shouting
- https://github.com/srcrs/UnicomTask
- https://github.com/mixool/HiCnUnicom
- https://github.com/hzys/HiCnUnicom
- https://github.com/zhangatle/XY-UNICOM
- https://github.com/ChuJian2/AutoSignMachine
# HeadlessBrowser
A headless browser is a web browser without a graphical user interface (GUI); it is usually controlled programmatically or through a command-line interface.
- https://w3c.github.io/webdriver
- Headless Browser
- https://github.com/mozilla/geckodriver
- https://github.com/topics/headless-browser
Anti-scraping
- https://github.com/intoli/intoli-article-materials
- Headless browser detection
- Hiding headless Chrome from detection
# chromedriver
To drive the Chrome browser, Selenium needs the ChromeDriver driver, and the ChromeDriver version must match the installed Chrome version.
- http://chromedriver.storage.googleapis.com/index.html
- http://npm.taobao.org/mirrors/chromedriver
- https://npm.taobao.org/mirrors/chromium-browser-snapshots
- Official headless Chrome documentation
- Official reference for capabilities and ChromeOptions
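The version-matching requirement can be checked before a session is started. The sketch below compares major version numbers taken from the text printed by `chrome --version` and `chromedriver --version`. The helper names are hypothetical. Comparing only the major number is an assumption that fits releases where ChromeDriver follows Chrome's numbering; older ChromeDriver 2.x releases used their own numbering and need a lookup table instead.

```python
import re

# Matches dotted version numbers such as 114.0.5735.90 and captures the major part.
_VERSION_RE = re.compile(r"(\d+)\.\d+(?:\.\d+)*")


def major_version(text):
    """Return the major version found in a string like 'Google Chrome 114.0.5735.90'."""
    match = _VERSION_RE.search(text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1))


def versions_match(chrome_version, chromedriver_version):
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_version) == major_version(chromedriver_version)
```

For example, `versions_match("Google Chrome 114.0.5735.90", "ChromeDriver 113.0.5672.63")` is false, which signals that a different driver build should be downloaded from one of the mirrors above.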
Parameter list
- https://peter.sh/experiments/chromium-command-line-switches/
- https://cs.chromium.org/chromium/src/content/public/common/content_switches.cc
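Parameter | Description |
---|---|
-blink-settings=imagesEnabled=false | Do not load images; applies only to a single tab |
-bookmark-menu | Add a bookmark button to the toolbar |
-no-default-browser-check | Do not check whether Chrome is the default browser |
-disable-extensions | Disable extensions |
-disable-gpu | Disable the GPU; servers usually have no graphics card |
-disable-images | Disable images; "profile.managed_default_content_settings.images":2 is recommended instead |
-disable-java | Disable Java |
-disable-javascript | Disable JavaScript |
-disable-plugins | Do not load any plugins; the effect can be checked on the about:plugins page |
-disable-popup-blocking | Disable pop-up blocking |
-disable-software-rasterizer | Disable the software rasterizer (no software fallback for 3D rendering) |
-disk-cache-dir="[PATH]" | Set the cache directory |
-disk-cache-size= | Set the cache size, in bytes |
-enable-sync | Enable bookmark sync |
-enable-udd-profiles | Enable the account-switching menu |
-first-run | Reset to the initial state, as on first run |
-headless | Run without a graphical interface |
-hide-scrollbars | Hide scrollbars; useful for some special pages |
-ignore-certificate-errors | Ignore certificate errors |
-incognito | Start in incognito mode |
-in-process-plugins | Run plugins inside the browser process instead of separate processes; a plugin crash may take down the whole page |
-lang=zh-CN | Set the language to Simplified Chinese |
-media-cache-size | Set the maximum media cache size, in bytes |
-no-first-run | Skip the first-run tasks |
-no-sandbox | Disable the sandbox; this reduces resource usage on the server but lowers its security |
-omnibox-popup-count="num" | Change the number of suggestions in the address-bar popup to num (the author uses 15) |
-process-per-site | Use a separate process per site |
-process-per-tab | Use a separate process per tab |
-proxy-pac-url | URL of the script to use when proxying through PAC |
-remote-debugging-address | Remote debugging address; 0.0.0.0 allows access from external networks but is less secure, so the default 127.0.0.1 is recommended |
-remote-debugging-port | Port for the Chrome debugging tools (the golang chromedp default is 9222; changing it is not recommended) |
-single-process | Run the browser in a single process; usually used for debugging and locating bugs |
-start-maximized | Maximize the window on startup |
-user-agent="" | Change the User-Agent string in HTTP request headers; the effect can be checked on the about:version page |
-user-data-dir="[PATH]" | Set the User Data directory |
-window-size="1600,900" | Window size, as width,height |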
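Switches from the table are usually passed to Chrome through `ChromeOptions.add_argument`. The sketch below builds the argument list from a plain dictionary so a configuration can be kept in one place. `build_chrome_args`, `make_options`, and `HEADLESS_SERVER_SWITCHES` are hypothetical names, the chosen switches are only an example, and the double-dash spelling is the one commonly seen in Chromium documentation.

```python
def build_chrome_args(switches):
    """Turn {'headless': True, 'lang': 'zh-CN'} into ['--headless', '--lang=zh-CN']."""
    args = []
    for name, value in switches.items():
        if value is False or value is None:
            continue  # switch is turned off, emit nothing
        if value is True:
            args.append(f"--{name}")  # boolean switch without a value
        else:
            args.append(f"--{name}={value}")
    return args


# Example selection for a server without a display (an assumption, adjust as needed).
HEADLESS_SERVER_SWITCHES = {
    "headless": True,
    "disable-gpu": True,
    "no-sandbox": True,
    "window-size": "1600,900",
    "lang": "zh-CN",
}


def make_options(switches):
    """Create ChromeOptions carrying the given switches."""
    # Imported lazily so build_chrome_args stays usable without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(switches):
        options.add_argument(arg)
    return options
```

The resulting object can then be handed to `webdriver.Chrome(options=make_options(HEADLESS_SERVER_SWITCHES))`.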
# selenium
- https://github.com/topics/selenium
- https://github.com/topics/testing
- https://github.com/SeleniumHQ
- https://github.com/seleniumbase
- Getting started with automated testing using Python 3 and Selenium
- python+selenium: checking whether an element exists, is clickable, or is selected
- EC: using expected_conditions to check page elements
- Selenium with Python 3: several ways to scroll to the bottom of a page
- Python Selenium tutorial - 猿人学Python
# Functions and Variables
Function or variable | Description |
---|---|
def file_detector_context(self, file_detector_class, *args, **kwargs): | |
def mobile(self): | |
def name(self): | |
def start_client(self): | |
def stop_client(self): | |
def start_session(self, capabilities, browser_profile=None): | |
def create_web_element(self, element_id): | |
def execute(self, driver_command, params=None): | |
def get(self, url): | |
def title(self): | |
def find_element_by_id(self, id_): | |
def find_elements_by_id(self, id_): | |
def find_element_by_xpath(self, xpath): | |
def find_elements_by_xpath(self, xpath): | |
def find_element_by_link_text(self, link_text): | |
def find_elements_by_link_text(self, text): | |
def find_element_by_partial_link_text(self, link_text): | |
def find_elements_by_partial_link_text(self, link_text): | |
def find_element_by_name(self, name): | |
def find_elements_by_name(self, name): | |
def find_element_by_tag_name(self, name): | |
def find_elements_by_tag_name(self, name): | |
def find_element_by_class_name(self, name): | |
def find_elements_by_class_name(self, name): | |
def find_element_by_css_selector(self, css_selector): | |
def find_elements_by_css_selector(self, css_selector): | |
def execute_script(self, script, *args): | |
def execute_async_script(self, script, *args): | |
def current_url(self): | |
def page_source(self): | |
def close(self): | |
def quit(self): | |
def current_window_handle(self): | |
def window_handles(self): | |
def maximize_window(self): | |
def fullscreen_window(self): | |
def minimize_window(self): | |
def switch_to(self): | |
def switch_to_active_element(self): | |
def switch_to_window(self, window_name): | |
def switch_to_frame(self, frame_reference): | |
def switch_to_default_content(self): | |
def switch_to_alert(self): | |
def back(self): | |
def forward(self): | |
def refresh(self): | |
def get_cookies(self): | |
def get_cookie(self, name): | |
def delete_cookie(self, name): | |
def delete_all_cookies(self): | |
def add_cookie(self, cookie_dict): | |
def implicitly_wait(self, time_to_wait): | |
def set_script_timeout(self, time_to_wait): | |
def set_page_load_timeout(self, time_to_wait): | |
def find_element(self, by=By.ID, value=None): | |
def find_elements(self, by=By.ID, value=None): | |
def desired_capabilities(self): | |
def get_screenshot_as_file(self, filename): | |
def save_screenshot(self, filename): | |
def get_screenshot_as_png(self): | |
def get_screenshot_as_base64(self): | |
def set_window_size(self, width, height, windowHandle='current'): | |
def get_window_size(self, windowHandle='current'): | |
def set_window_position(self, x, y, windowHandle='current'): | |
def get_window_position(self, windowHandle='current'): | |
def get_window_rect(self): | |
def set_window_rect(self, x=None, y=None, width=None, height=None): | |
def file_detector(self): | |
def file_detector(self, detector): | |
def orientation(self): | |
def orientation(self, value): | |
def application_cache(self): | |
def log_types(self): | |
def get_log(self, log_type): |
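The `find_element_by_*` and `find_elements_by_*` helpers in the table correspond to the generic `find_element(by, value)` and `find_elements(by, value)` calls listed alongside them, and newer Selenium 4 releases keep only the generic form. The sketch below maps the helper suffixes to the locator strings that `selenium.webdriver.common.by.By` defines. `LOCATOR_STRATEGIES` and `find_by` are hypothetical names used for illustration.

```python
# Locator strings as defined on selenium.webdriver.common.by.By,
# keyed by the suffix of the legacy find_element_by_* helper.
LOCATOR_STRATEGIES = {
    "id": "id",                                # By.ID
    "xpath": "xpath",                          # By.XPATH
    "link_text": "link text",                  # By.LINK_TEXT
    "partial_link_text": "partial link text",  # By.PARTIAL_LINK_TEXT
    "name": "name",                            # By.NAME
    "tag_name": "tag name",                    # By.TAG_NAME
    "class_name": "class name",                # By.CLASS_NAME
    "css_selector": "css selector",            # By.CSS_SELECTOR
}


def find_by(driver, strategy, value, many=False):
    """Call find_element / find_elements using a legacy helper suffix."""
    by = LOCATOR_STRATEGIES[strategy]
    if many:
        return driver.find_elements(by, value)
    return driver.find_element(by, value)
```

With this, `find_by(driver, "tag_name", "body")` plays the role of `driver.find_element_by_tag_name("body")`.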
# Browser Download Settings
- https://github.com/SeleniumHQ/selenium/issues/5722
- https://github.com/SeleniumHQ/selenium/issues/5159
# Add support for Chrome's "send_command" to Selenium WebDriver
driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
# behavior: allow = download automatically, deny = block downloads, default = browser default
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'deny', 'downloadPath': "D:\\"}}
driver.execute("send_command", params)
# Alternative: Selenium versions that provide execute_cdp_cmd can send the same command directly
driver.execute_cdp_cmd("Page.setDownloadBehavior", {'behavior': 'deny', 'downloadPath': "D:\\"})
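The `Page.setDownloadBehavior` call above can be wrapped so that the behavior value is validated before it reaches the browser. This is a minimal sketch with hypothetical helper names. It assumes a Chrome driver object exposing `execute_cdp_cmd`, and it follows the DevTools protocol description in requiring a download path for `allow`.

```python
ALLOWED_BEHAVIORS = ("allow", "deny", "default")


def download_behavior_params(behavior, download_path=None):
    """Build the params dict for the Page.setDownloadBehavior CDP command."""
    if behavior not in ALLOWED_BEHAVIORS:
        raise ValueError(f"behavior must be one of {ALLOWED_BEHAVIORS}, got {behavior!r}")
    if behavior == "allow" and not download_path:
        # The protocol documents downloadPath as required when behavior is 'allow'.
        raise ValueError("download_path is required when behavior is 'allow'")
    params = {"behavior": behavior}
    if download_path:
        params["downloadPath"] = download_path
    return params


def set_download_behavior(driver, behavior, download_path=None):
    """Send Page.setDownloadBehavior through the driver's CDP channel."""
    params = download_behavior_params(behavior, download_path)
    return driver.execute_cdp_cmd("Page.setDownloadBehavior", params)
```

For example, `set_download_behavior(driver, "allow", "D:\\downloads")` lets downloads proceed into that folder, while `set_download_behavior(driver, "deny")` blocks them.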
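# Opening a New Tab
# Get the main window handle
main_window = driver.current_window_handle
# Open a new tab via JavaScript and load the url
driver.execute_script(f"window.open('{url}')")
# Open a blank page in a new tab instead
#driver.execute_script("window.open('','_blank')")
# Get all current window handles (windows A and B) and switch to the new tab
driver.switch_to.window(driver.window_handles[-1])
# Load the url
#driver.get(url)
# Close the current window
driver.close()
# After closing the new tab, switch back to the main window; skipping this step raises an error
driver.switch_to.window(main_window)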
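The open, switch, close, switch-back sequence above is easy to get wrong when an exception interrupts it. One way to make it safer is a context manager, sketched below. `new_tab` is a hypothetical helper. Like the snippet above, it assumes the newly opened tab is the last entry in `window_handles`.

```python
from contextlib import contextmanager


@contextmanager
def new_tab(driver, url=""):
    """Open url in a new tab, yield the driver focused on it, then restore the main window."""
    main_window = driver.current_window_handle
    # Passing the url as a script argument avoids quoting problems inside the JS string.
    driver.execute_script("window.open(arguments[0], '_blank')", url)
    driver.switch_to.window(driver.window_handles[-1])
    try:
        yield driver
    finally:
        # Close the new tab and return to the main window even if the body raised.
        driver.close()
        driver.switch_to.window(main_window)
```

Usage looks like `with new_tab(driver, url) as tab: html = tab.page_source`, and the driver is focused on the main window again once the block exits.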
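Using keyboard shortcuts
This approach does not work in Chrome.
After the download-path code was added, the new tab no longer opens visibly, but two window handles are still captured and switching between them works; only the visual switching effect is missing.
On OS X, tabs are opened and closed with COMMAND + T and COMMAND + W.
On other operating systems, use CONTROL + T / CONTROL + W.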
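# Imports needed by the calls below
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# On Windows, Keys.CONTROL + 't' acts like Ctrl+T and opens a new tab
driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.CONTROL + 't')
# Send <CTRL> + <T> through an action chain instead
# ActionChains(driver).key_down(Keys.CONTROL).send_keys("t").key_up(Keys.CONTROL).perform()
# Get all current window handles (windows A and B) and switch to the new tab
driver.switch_to.window(driver.window_handles[-1])
# Load the url
driver.get(url)
# On Windows, Keys.CONTROL + 'w' acts like Ctrl+W and closes the tab
#driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.CONTROL + 'w')
# Close the current window
driver.close()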
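# Executing JavaScript
# Scroll to the bottom via JS (sets scrollTop to 10000, enough for pages shorter than that)
driver.execute_script("var q=document.documentElement.scrollTop=10000")
# Return the full page HTML via JS
driver.execute_script("return document.documentElement.outerHTML")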
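A fixed `scrollTop` value stops short on long or lazily loaded pages. A common alternative is to keep scrolling until the document height stops growing, sketched below. `scroll_to_bottom` and its `pause` and `max_rounds` parameters are hypothetical, and the defaults are arbitrary starting points.

```python
import time

_HEIGHT_JS = "return document.documentElement.scrollHeight"
_SCROLL_JS = "window.scrollTo(0, document.documentElement.scrollHeight)"


def scroll_to_bottom(driver, pause=0.5, max_rounds=20):
    """Scroll until the page height stops changing; return the final height."""
    last_height = driver.execute_script(_HEIGHT_JS)
    for _ in range(max_rounds):
        driver.execute_script(_SCROLL_JS)
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script(_HEIGHT_JS)
        if new_height == last_height:
            break  # nothing new was loaded, so the bottom was reached
        last_height = new_height
    return last_height
```

The `max_rounds` cap keeps the loop from running forever on pages with infinite scrolling.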
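# m3u8: Parsing, Downloading, Decrypting, and Merging
An M3U8 stream typically has two levels: the first level holds the stream information (EXT-X-STREAM-INF) and the links to the second-level playlists, while the second level holds the encryption information (EXT-X-KEY) and the download URLs of the ts files.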
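The two levels described above can be parsed with the standard library alone. The sketch below covers only the parsing step: it extracts second-level playlist URLs from the first level, then the key attributes and ts URLs from the second level. Function names are hypothetical. As a simplification it keeps only the last EXT-X-KEY tag it sees, and downloading, decrypting, and merging are left out.

```python
import re
from urllib.parse import urljoin

# Attribute lists look like: METHOD=AES-128,URI="key.key",IV=0x...
_ATTR_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_attributes(attr_text):
    """Parse an M3U8 attribute list into a dict, stripping surrounding quotes."""
    return {name: value.strip('"') for name, value in _ATTR_RE.findall(attr_text)}


def parse_master_playlist(text, base_url):
    """Return absolute URLs of the playlists that follow each EXT-X-STREAM-INF tag."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    urls = []
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF") and i + 1 < len(lines):
            urls.append(urljoin(base_url, lines[i + 1]))
    return urls


def parse_media_playlist(text, base_url):
    """Return (key_info, segment_urls) for a second-level playlist."""
    key_info = None
    segments = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#EXT-X-KEY:"):
            key_info = parse_attributes(line[len("#EXT-X-KEY:"):])
            if "URI" in key_info:
                key_info["URI"] = urljoin(base_url, key_info["URI"])
        elif line and not line.startswith("#"):
            segments.append(urljoin(base_url, line))  # every non-tag line is a segment
    return key_info, segments
```

Relative paths are resolved against the URL the playlist was fetched from, so `base_url` should be the playlist's own address rather than the site root.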