Six Best Practices for Reusable Python Functions

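Author: 云朵君 | 2023-08-26 20:51:25 | Development, Front-end

When writing Python functions, you do not need to memorize all of these best practices. A good indicator of a function's quality is its testability: if a function is easy to test, that suggests it is modular, performs a single task, and contains no duplicated code.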

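For a data scientist working on a team with a variety of roles, writing clean code is an essential skill, because:

Clean code improves readability, making it easier for teammates to understand and contribute to the codebase.
Clean code improves maintainability, simplifying tasks such as debugging, modifying, and extending existing code.

To be maintainable, our Python functions should: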

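Be small
Do only one task
Contain no duplication
Stay at one level of abstraction
Have a descriptive name
Take fewer than four parameters

Let's start by looking at the get_data function below.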

import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

import gdown


def get_data(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
):
    # Download data from Google Drive
    zip_path = "Twitter.zip"
    gdown.download(url, zip_path, quiet=False)

    # Unzip data
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")

    # Extract texts from files in the train directory
    t_train = []
    for file_path in Path(raw_train_path).glob("*.xml"):
        list_train_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
        train_doc_1 = " ".join(t for t in list_train_doc_1)
        t_train.append(train_doc_1)
    t_train_docs = " ".join(t_train)

    # Extract texts from files in the test directory
    t_test = []
    for file_path in Path(raw_test_path).glob("*.xml"):
        list_test_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
        test_doc_1 = " ".join(t for t in list_test_doc_1)
        t_test.append(test_doc_1)
    t_test_docs = " ".join(t_test)

    # Write processed data to a train file
    with open(processed_train_path, "w") as f:
        f.write(t_train_docs)

    # Write processed data to a test file
    with open(processed_test_path, "w") as f:
        f.write(t_test_docs)


if __name__ == "__main__":
    get_data(
        url="https://drive.google.com/uc?id=1jI1cmxqnwsmC-vbl8dNY6b4aNBtBbKy3",
        zip_path="Twitter.zip",
        raw_train_path="Data/train/en",
        raw_test_path="Data/test/en",
        processed_train_path="Data/train/en.txt",
        processed_test_path="Data/test/en.txt",
    )

Although this function contains many comments, it is still hard to understand what it does, because:

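The function is long.
The function tries to accomplish several tasks.
The code inside the function sits at different levels of abstraction.
The function takes many parameters.
Several pieces of code are duplicated.
The function lacks a descriptive name.

We will refactor this code using the six practices listed at the start of the article.

Small

A function should be kept small to improve its readability. Ideally, a function should not exceed 20 lines of code, and its indentation should not go deeper than one or two levels.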

import zipfile

import gdown


def get_raw_data(url: str, zip_path: str) -> None:
    gdown.download(url, zip_path, quiet=False)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")
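The indentation guideline can be illustrated with a small sketch that is not part of the original refactoring. The function names and the word-counting task are hypothetical; the point is only that moving the inner loop and condition into a helper keeps every function at one level of nesting.

```python
from typing import Iterable


def count_long_words_nested(lines: Iterable[str], min_length: int) -> int:
    """Deeply nested version: loop, inner loop, then a condition."""
    count = 0
    for line in lines:
        for word in line.split():
            if len(word) >= min_length:
                count += 1
    return count


def count_long_words_in_line(line: str, min_length: int) -> int:
    """Count the words in one line that have at least min_length characters."""
    return sum(len(word) >= min_length for word in line.split())


def count_long_words(lines: Iterable[str], min_length: int) -> int:
    """Same result as the nested version, with the inner work delegated."""
    return sum(count_long_words_in_line(line, min_length) for line in lines)
```

Both versions return the same count; the second one is simply easier to read and to test piece by piece.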

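Do Only One Task

A function should have a single focus and perform a single task. The get_data function tries to do several things: retrieve data from Google Drive, extract text, and save the extracted text.

It should therefore be split into several smaller functions, as shown below: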

def main(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
) -> None:
    get_raw_data(url, zip_path)
    t_train, t_test = get_train_test_docs(raw_train_path, raw_test_path)
    save_train_test_docs(processed_train_path, processed_test_path, t_train, t_test)

Each of these functions should have a single purpose:

def get_raw_data(url: str, zip_path: str) -> None:
    gdown.download(url, zip_path, quiet=False)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")

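The get_raw_data function performs only one action: getting the raw data.

No Duplication

We should avoid duplication, because:

Duplicated code makes the code harder to read.
Duplicated code makes changes more complicated. A change has to be made in several places, which increases the chance of errors.

The code below contains duplication: the code that retrieves the training data and the code that retrieves the test data are almost identical.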

import xml.etree.ElementTree as ET
from pathlib import Path

# Extract texts from files in the train directory
t_train = []
for file_path in Path(raw_train_path).glob("*.xml"):
    list_train_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
    train_doc_1 = " ".join(t for t in list_train_doc_1)
    t_train.append(train_doc_1)
t_train_docs = " ".join(t_train)

# Extract texts from files in the test directory
t_test = []
for file_path in Path(raw_test_path).glob("*.xml"):
    list_test_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
    test_doc_1 = " ".join(t for t in list_test_doc_1)
    t_test.append(test_doc_1)
t_test_docs = " ".join(t_test)

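We can eliminate the duplication by merging the repeated code into a single function named extract_texts_from_multiple_files, which extracts text from multiple files in a given location.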

def extract_texts_from_multiple_files(folder_path) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
        text_in_one_file = " ".join(list_of_text_in_one_file)
        all_docs.append(text_in_one_file)
    return " ".join(all_docs)

Now you can use this function to extract text from different locations without repeating the code.

t_train = extract_texts_from_multiple_files(raw_train_path)
t_test = extract_texts_from_multiple_files(raw_test_path)
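The main function shown earlier calls get_train_test_docs and save_train_test_docs, but the article never shows their bodies. The following is one possible sketch, assuming they simply wrap the extraction function and write each string to a file. The bodies are assumptions, not the author's code, and sorted() is an addition here so that files are read in a deterministic order.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Tuple


def extract_texts_from_multiple_files(folder_path: str) -> str:
    """Join the text of every XML file in folder_path into one string."""
    all_docs = []
    # sorted() is an addition in this sketch: glob order is not guaranteed
    for file_path in sorted(Path(folder_path).glob("*.xml")):
        list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
        all_docs.append(" ".join(list_of_text_in_one_file))
    return " ".join(all_docs)


def get_train_test_docs(raw_train_path: str, raw_test_path: str) -> Tuple[str, str]:
    """Assumed body: extract the train and test texts with the shared helper."""
    t_train = extract_texts_from_multiple_files(raw_train_path)
    t_test = extract_texts_from_multiple_files(raw_test_path)
    return t_train, t_test


def save_train_test_docs(
    processed_train_path: str,
    processed_test_path: str,
    t_train: str,
    t_test: str,
) -> None:
    """Assumed body: write each extracted string to its own file."""
    Path(processed_train_path).write_text(t_train)
    Path(processed_test_path).write_text(t_test)
```

Each helper still does one thing, and neither repeats the XML-parsing loop.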

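One Level of Abstraction

The level of abstraction refers to how much of a system's detail is in view. A high level gives a more general view of the system, while a low level deals with its more specific aspects.

Keeping the same level of abstraction within a piece of code is good practice, because it makes the code easier to understand.

The following function shows what happens when levels are mixed: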

def extract_texts_from_multiple_files(folder_path) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
        text_in_one_file = " ".join(list_of_text_in_one_file)
        all_docs.append(text_in_one_file)
    return " ".join(all_docs)

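The function itself is at a high level, but the code inside the for loop deals with lower-level operations: XML parsing, text extraction, and string manipulation.

To resolve this mix of abstraction levels, we can wrap the low-level operations in an extract_texts_from_each_file function: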

def extract_texts_from_multiple_files(folder_path: str) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        text_in_one_file = extract_texts_from_each_file(file_path)
        all_docs.append(text_in_one_file)
    return " ".join(all_docs)


def extract_texts_from_each_file(file_path: str) -> str:
    list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
    return " ".join(list_of_text_in_one_file)

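This gives the text extraction process a higher level of abstraction and makes the code more readable.

Descriptive Names

A function's name should be descriptive enough that users can understand its purpose without reading its code. A longer, descriptive name is better than a vague one. For example, naming a function get_texts is less clear than naming it extract_texts_from_multiple_files.

However, if a function's name becomes too long, such as retrieve_data_extract_text_and_save_data, that is a sign the function is probably doing too much and should be split into smaller functions.

Fewer Than Four Parameters

As the number of parameters grows, it becomes harder to keep track of their order, their purpose, and how they relate to one another. This makes the function difficult for developers to understand and use.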

def main(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
) -> None:
    get_raw_data(url, zip_path)
    t_train, t_test = get_train_test_docs(raw_train_path, raw_test_path)
    save_train_test_docs(processed_train_path, processed_test_path, t_train, t_test)

To improve readability, you can group related parameters into a single data structure using a data class or a Pydantic model.

from pydantic import BaseModel


class RawLocation(BaseModel):
    url: str
    zip_path: str
    path_train: str
    path_test: str


class ProcessedLocation(BaseModel):
    path_train: str
    path_test: str


def main(raw_location: RawLocation, processed_location: ProcessedLocation) -> None:
    get_raw_data(raw_location)
    t_train, t_test = get_train_test_docs(raw_location)
    save_train_test_docs(processed_location, t_train, t_test)
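The text mentions data classes as an alternative to Pydantic but only shows the Pydantic version. Below is a minimal sketch of the same grouping with the standard library's dataclasses module. The describe_locations helper is hypothetical and exists only to show a two-parameter function reading the grouped fields; the paths and URL are the ones used in the get_data call earlier. Unlike Pydantic models, plain data classes do not validate field types at runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawLocation:
    url: str
    zip_path: str
    path_train: str
    path_test: str


@dataclass(frozen=True)
class ProcessedLocation:
    path_train: str
    path_test: str


def describe_locations(
    raw_location: RawLocation, processed_location: ProcessedLocation
) -> str:
    """Hypothetical helper: two parameters instead of six loose strings."""
    return (
        f"{raw_location.path_train} -> {processed_location.path_train}, "
        f"{raw_location.path_test} -> {processed_location.path_test}"
    )


raw_location = RawLocation(
    url="https://drive.google.com/uc?id=1jI1cmxqnwsmC-vbl8dNY6b4aNBtBbKy3",
    zip_path="Twitter.zip",
    path_train="Data/train/en",
    path_test="Data/test/en",
)
processed_location = ProcessedLocation(
    path_train="Data/train/en.txt",
    path_test="Data/test/en.txt",
)
print(describe_locations(raw_location, processed_location))
```

Setting frozen=True is a design choice in this sketch: it makes the location objects read-only, so a function cannot silently change a path that another function relies on.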

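How Do I Write Functions Like This?

When writing Python functions, you do not need to memorize all of these best practices. A good indicator of a function's quality is its testability. If a function is easy to test, that suggests it is modular, performs a single task, and contains no duplicated code.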

def save_data(processed_path: str, processed_data: str) -> None:
    with open(processed_path, "w") as f:
        f.write(processed_data)


def test_save_data(tmp_path):
    processed_path = tmp_path / "processed_data.txt"
    processed_data = "Sample processed data"

    save_data(processed_path, processed_data)

    assert processed_path.exists()
    assert processed_path.read_text() == processed_data
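The same testability check can be applied to the extraction helper from the abstraction section. The sketch below assumes an XML layout in which the root's first child holds one element per document; the tag names and sample texts are made up for illustration. As in test_save_data, tmp_path stands for pytest's temporary-directory fixture.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_texts_from_each_file(file_path: Path) -> str:
    """Join the text of every element under the root's first child."""
    list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
    return " ".join(list_of_text_in_one_file)


def test_extract_texts_from_each_file(tmp_path: Path) -> None:
    # Assumed layout: <author><documents><document>...</document></documents></author>
    xml_file = tmp_path / "sample.xml"
    xml_file.write_text(
        "<author><documents>"
        "<document>first tweet</document>"
        "<document>second tweet</document>"
        "</documents></author>"
    )

    assert extract_texts_from_each_file(xml_file) == "first tweet second tweet"
```

Because the function only parses one file and returns a string, the test needs nothing more than a small sample file, which is the kind of signal the testability heuristic is meant to give.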

References

Martin, R. C. (2009). Clean Code: A Handbook of Agile Software Craftsmanship. Upper Saddle River: Prentice Hall.

Editor: 武晓燕 | Source: 数据STUDIO | Tags: Python, functions, code
