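Background

I had been following a work called 「助けた美少女JKが可哀そうすぎて同棲を始めるしかなかった」.

Its content is strikingly realistic and includes many cruel scenes, but through an oversight the author did not add the corresponding tags.

As a result, the comment section became sharply polarized: supporters were very supportive, while the critics said some quite harsh things about the author.

On September 28, 2020, one month after the work was completed, the author posted a chapter titled 《この作品について》.

The gist is that the work had been entered in a contest and had already passed the intermediate selection, but because of the content of the second half, the author decided to withdraw it: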

この作品は第2回ファミ通文庫大賞へ参加しており、中間選考を突破しておりました。最終審査の結果が発表される間際でございますが、この度、第2回ファミ通文庫大賞の参加を辞退することに決めました。

The work will also stay on カクヨム only until October 31, after which it will be deleted completely:

また、辞退が確定した場合、本作品の公開を10月いっぱいまでとし、10月31日をもちまして、本作のすべてのデータを完全消去することを決断いたしました。バックアップは取りません。

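I personally like this work very much, and I had even obtained the author's permission to translate and repost it. The plan is a separate article on this blog introducing the work, but as of today (2020-9-29) I have not finished writing it, so I am not linking it yet (it is hidden and unpublished).

It would be a real pity if the author deletes the work, but I assume the author has their own reasons. I will not discuss that further here.

Still, it would be a problem if the work were gone when I want to reread it in the future.

Hence this article.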

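Code

As before, the crawler is written in Python.

Update 2024-07-04

Updated to the latest class names
Added an optimized version of the code

Original version

Note: I like to run Python code in an ipynb-like way, cell by cell, so the code may look a bit messy, but the output of each cell shows whether that step worked.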

# %%
import urllib.request  # used to request pages from the site
from bs4 import BeautifulSoup  # used to parse the page content
# %%
url = "https://kakuyomu.jp/works/16818023211981083951"  # 幼馴染は、にゃあと鳴いてスカートのなか。

f = urllib.request.urlopen(url)
html = f.read().decode('utf-8')
# %%
# Save a local copy of the page for inspection; 'w' creates the file if it does not exist
with open("SecondPage.html", 'w', encoding="utf-8") as f:
    f.write(html)
# %%
soup = BeautifulSoup(html, "html.parser")
# %%
# attrs must be a dict (attribute name -> value), not a set
linkPart = soup.find_all(name='a', attrs={"class": "WorkTocSection_link__ocg9K"})
# %%
linkList = []
for item in linkPart:
    linkList.append("https://kakuyomu.jp"+item.get('href'))
# %%
linkList

# %%
urlChild = linkList[0]
childSoup = BeautifulSoup(urllib.request.urlopen(urlChild).read().decode('utf-8'),"html.parser")
MainBody = childSoup.find('div', attrs={'class': 'widget-episodeBody js-episode-body'})
# %%
# Text of the second <p> of the first episode; the loop below writes a bare newline
# for any paragraph whose text equals this value
br = MainBody.find_all('p')[1].get_text()
# %%
# d2cTable = ['一','二','三','四','五','六','七','八','九','十','十一','十二','十三','十四','十五','十六','十七','十八','十九','二十','二十一','二十','二十']
title = soup.find('h1', attrs={'class': 'Heading_heading__lQ85n Heading_left__RVp4h Heading_size-2l__rAFn3'}).a.string + '.txt'  # file name / novel title
# title = soup.find('h1',id="workTitle").a.string + '.txt' # file name / novel title
print('Writing file')
count = 0
totalNumber = str(len(linkList))
with open(title, 'a', encoding='utf-8') as txtFile:  # append mode; the file is closed automatically
    for i, childLinkItem in enumerate(linkList):
        print("Writing link ({}/{}) ".format(str(i+1), totalNumber) + childLinkItem)
        childSoup = BeautifulSoup(urllib.request.urlopen(childLinkItem).read().decode('utf-8'), "html.parser")
        # Chapter heading (level 1); not every episode page has one
        ZhangTitle = childSoup.find('p', attrs={"class": "chapterTitle level1 js-vertical-composition-item"})
        ZhangTitle = None if ZhangTitle is None else ZhangTitle.span.string
        if ZhangTitle is not None:
            print('Writing chapter: ' + ZhangTitle)
            txtFile.write('\n\n# ' + ZhangTitle + '\n\n')
        # Section heading (level 2); also optional
        JieTitle = childSoup.find('p', attrs={"class": "chapterTitle level2 js-vertical-composition-item"})
        JieTitle = None if JieTitle is None else JieTitle.span.string
        if JieTitle is not None:
            print('Writing section: ' + JieTitle)
            txtFile.write('\n\n## ' + JieTitle + '\n\n')
            count = 0
        else:
            count += 1
        # charpTitle = "\n\n第"+ str(count) + "节  "+ childSoup.find('p',attrs={"class","widget-episodeTitle js-vertical-composition-item"}).string
        # Episode title
        charpTitle = childSoup.find('p', attrs={"class": "widget-episodeTitle js-vertical-composition-item"}).string
        txtFile.write('\n\n\n### ' + charpTitle + "\n\n")
        # Episode body: one <p> per line of text
        MainBody = childSoup.find('div', attrs={'class': 'widget-episodeBody js-episode-body'})
        PTags = MainBody.find_all('p')
        for childPTag in PTags:
            text = childPTag.get_text()
            txtFile.write('\n' + text if (br != text) else '\n')

        print("Link " + childLinkItem + " written")
print('Novel "' + title + '" has been written completely')
# %%

# %%
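
The class names used above (for example `WorkTocSection_link__ocg9K`) carry a generated-looking suffix, which is presumably why the 2024-07-04 update had to refresh them. Below is a minimal sketch of a more tolerant link-extraction step, assuming the stable part of the name is the prefix `WorkTocSection_link`. It matches links by class prefix and builds absolute URLs with `urljoin`. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. `TocLinkParser`, `extract_episode_links`, and the sample HTML are hypothetical names and data for illustration, not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://kakuyomu.jp"
CLASS_PREFIX = "WorkTocSection_link"  # assumption: only the suffix after "__" changes


class TocLinkParser(HTMLParser):
    """Collect href values of <a> tags that have a class starting with CLASS_PREFIX."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and any(c.startswith(CLASS_PREFIX) for c in classes):
            self.hrefs.append(href)


def extract_episode_links(html):
    """Return absolute episode URLs found in the table-of-contents HTML."""
    parser = TocLinkParser()
    parser.feed(html)
    return [urljoin(BASE_URL, href) for href in parser.hrefs]


# Hand-written sample, not real page markup
SAMPLE_HTML = """
<a class="WorkTocSection_link__abc12" href="/works/1/episodes/10">Ep 1</a>
<a class="Other_link__zzz" href="/about">About</a>
<a class="WorkTocSection_link__abc12" href="/works/1/episodes/11">Ep 2</a>
"""

links = extract_episode_links(SAMPLE_HTML)
print(links)
```

With BeautifulSoup, the same idea could be expressed by passing a compiled regular expression as the class filter, but the prefix itself remains an assumption about the site's markup.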

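Optimized version

Note: the optimized version follows Python conventions more closely, adds error handling, wraps each step in a function, and is easier to read.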

import urllib.request  # used to request pages from the site
from bs4 import BeautifulSoup  # used to parse the page content

# Fetch a page and return its HTML, or None on failure
def fetch_url(url):
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode('utf-8')
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

# Save HTML to a file
def save_html_to_file(html, filename):
    with open(filename, 'w', encoding="utf-8") as f:
        f.write(html)

# Collect the episode links from the work's table of contents
def parse_chapter_links(soup):
    links = []
    link_elements = soup.find_all(name='a', attrs={"class": "WorkTocSection_link__ocg9K"})
    for item in link_elements:
        links.append("https://kakuyomu.jp" + item.get('href'))
    return links

# Parse one episode page and append its content to the output file
def parse_and_write_chapter(link, file):
    html = fetch_url(link)
    if not html:
        return
    soup = BeautifulSoup(html, "html.parser")

    # Chapter (level 1) and section (level 2) headings; both are optional
    zhang_title = soup.find('p', attrs={"class": "chapterTitle level1 js-vertical-composition-item"})
    zhang_title = zhang_title.span.string if zhang_title else None
    jie_title = soup.find('p', attrs={"class": "chapterTitle level2 js-vertical-composition-item"})
    jie_title = jie_title.span.string if jie_title else None

    # Write the chapter heading
    if zhang_title:
        file.write('\n\n# ' + zhang_title + '\n\n')
    # Write the section heading
    if jie_title:
        file.write('\n\n## ' + jie_title + '\n\n')

    # Write the episode title
    episode_title = soup.find('p', attrs={"class": "widget-episodeTitle js-vertical-composition-item"}).string
    file.write('\n\n\n### ' + episode_title + "\n\n")

    # Write the body text, one <p> per line
    main_body = soup.find('div', attrs={'class': 'widget-episodeBody js-episode-body'})
    p_tags = main_body.find_all('p')
    for p_tag in p_tags:
        text = p_tag.get_text()
        file.write('\n' + text)

# Entry point
def main():
    url = "https://kakuyomu.jp/works/16818023211981083951"
    html = fetch_url(url)
    if not html:
        return

    save_html_to_file(html, "SecondPage.html")

    soup = BeautifulSoup(html, "html.parser")
    chapter_links = parse_chapter_links(soup)

    title = soup.find('h1', attrs={'class': 'Heading_heading__lQ85n Heading_left__RVp4h Heading_size-2l__rAFn3'}).a.string + '.txt'
    with open(title, 'w', encoding='utf-8') as txt_file:
        total_number = len(chapter_links)
        for i, chapter_link in enumerate(chapter_links):
            print(f"Writing link ({i + 1}/{total_number}): {chapter_link}")
            parse_and_write_chapter(chapter_link, txt_file)
            print(f"Link {chapter_link} written")

    print(f'Novel "{title}" has been written completely')

if __name__ == "__main__":
    main()
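
`fetch_url` gives up after a single failed request, and an episode that fails to download is silently skipped. Below is a minimal sketch of a retrying variant, assuming transient network errors are worth a few more attempts and that a growing pause between attempts is acceptable. `fetch_with_retry`, its parameters, and the User-Agent string are illustrative and not part of the original code. The `opener` parameter exists only so the logic can be exercised without network access.

```python
import time
import urllib.request


def fetch_with_retry(url, retries=3, delay=1.0, opener=urllib.request.urlopen):
    """Return the decoded page, or None if every attempt fails."""
    # Illustrative User-Agent; whether the site requires one is an assumption
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    for attempt in range(1, retries + 1):
        try:
            with opener(request, timeout=10) as response:
                return response.read().decode("utf-8")
        except OSError as e:  # URLError, HTTPError and timeouts are all OSError subclasses
            print(f"Attempt {attempt}/{retries} failed for {url}: {e}")
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer after each failure
    return None
```

In the code above, a call such as `fetch_url(link)` could be swapped for `fetch_with_retry(link)`; the return value is the same kind of string-or-`None` result.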

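The url in the code can be replaced with the introduction page of the novel you want to crawl.

Here the link to this work is used as an example:

url = "https://kakuyomu.jp/works/1177354054894884567"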
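
To make swapping the target less error-prone, the work URL could be built from either a full link or a bare work ID. Below is a minimal sketch, assuming work pages always follow the `https://kakuyomu.jp/works/<digits>` pattern seen in the two URLs in this article. `normalize_work_url` and `WORK_URL_PATTERN` are hypothetical helpers, not part of the original code.

```python
import re

# Assumption: a work page URL is the site root, "/works/", and a numeric ID
WORK_URL_PATTERN = re.compile(r"^https://kakuyomu\.jp/works/([0-9]+)/?$")


def normalize_work_url(value):
    """Accept a work page URL or a bare numeric work ID and return the work page URL."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return f"https://kakuyomu.jp/works/{value}"
    match = WORK_URL_PATTERN.match(value)
    if match is None:
        raise ValueError(f"Not a kakuyomu work page URL: {value!r}")
    return f"https://kakuyomu.jp/works/{match.group(1)}"


url = normalize_work_url("https://kakuyomu.jp/works/1177354054894884567")
print(url)
```

An episode link or any other address raises `ValueError` early, instead of failing later when the table of contents cannot be found.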

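Disclaimer

I respect かんなづき's decision, so I will not provide any of the crawled content.

The crawled files are for my personal use only and will not be shared on the internet in any form.

As for my translation and introduction article, I have decided not to publish it for now; I will decide after discussing it with the original author.

Here is the official link: https://kakuyomu.jp/works/1177354054894884567

If you are interested, go and support the work before the original author deletes it.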
