Generative Python Transformer using the GitHub API

Question

I'm working on a project called GPyT (Generative Python Transformer), and I wrote the script below to download a large number of Python files from github.com to train my model. It works fine, but I think there is a problem: it gets slower with every file it downloads. Each new file takes longer to download, even though the Python files are all small and roughly the same size.

At first I wanted to ask this on Stack Overflow, but since the script works and there are no errors, I thought it would be better to ask here and get advice on my work, because I think the code smells like spaghetti and has a performance problem.

The part I have doubts about is this if statement:

if not os.path.isfile(f'{repos_path}/{file_content.name}'):

I think checking whether the file exists takes longer each time because the number of files keeps growing: it checks against a growing list. I also monitored memory and CPU usage while the script was running, and there was no difference from when I first started it.
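
One way to test this suspicion in isolation is to time the check directly while a directory grows. The following is only a minimal sketch, not part of my script: the helper names `time_isfile` and `measure` and the file counts are made up for illustration, and the numbers will depend on the filesystem. Note that `os.path.isfile` asks the operating system about a single path and does not walk a Python list.

```python
import os
import tempfile
import time


def time_isfile(directory, probe_name, repeats=1000):
    """Return the average seconds per os.path.isfile call for one path."""
    path = os.path.join(directory, probe_name)
    start = time.perf_counter()
    for _ in range(repeats):
        os.path.isfile(path)
    return (time.perf_counter() - start) / repeats


def measure(file_counts=(100, 1000, 5000)):
    """Grow a temporary directory and time the lookup at each size.

    file_counts must be in ascending order (hypothetical sizes).
    """
    results = {}
    with tempfile.TemporaryDirectory() as directory:
        created = 0
        for target in file_counts:
            while created < target:
                # Empty placeholder files stand in for downloaded .py files.
                open(os.path.join(directory, f'file_{created}.py'), 'w').close()
                created += 1
            # Probe a name that does not exist, like a not-yet-downloaded file.
            results[target] = time_isfile(directory, 'missing.py')
    return results


if __name__ == '__main__':
    for count, seconds in measure().items():
        print(f'{count:>6} files: {seconds * 1e6:.2f} microseconds per check')
```

If the time per check stays roughly flat as the count grows, the slowdown is probably coming from somewhere else in the loop.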

Code:

# Part one >>> cloning lots of python related repositories from **GITHUB** and put them into repos directory.

from github import Github
from colorama import Fore
import wget
import os
from requests.exceptions import HTTPError

""" There are two ways to do so: 1. Clone lots of repositories and then walk through the directories and delete every
single file except python files, 2. Get a specific content file which is the chosen way here """

""" There are many ways to download the files: 1. Using wget, 2. Implementing your own dl.py file to 
download files using requests library or even making a function named download """

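# loading the token.txt file
# TODO: token
with open('', 'r') as token_file:
    # strip() removes a trailing newline that would otherwise end up in the token
    access_token = token_file.read().strip()
github = Github(access_token)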

query = 'language:python'
res = github.search_repositories(query)

# print(res.totalCount)
# print(dir(res))

# a directory of python files
repos_path=""  # replace with your desired directory
for repo in res:
    contents = repo.get_contents('')
    while contents:
        file_content = contents.pop(0)
        if file_content.type == 'dir':
            contents.extend(repo.get_contents(file_content.path))
        else:
            # print(f'{Fore.GREEN} + URL: {file_content.download_url}')
            if not file_content.path.endswith('.py'):
                continue
            contents_url = file_content.download_url

            if not os.path.isfile(f'{repos_path}/{file_content.name}'):
                try:
                    wget.download(contents_url, out=repos_path)
                except HTTPError as http_err:
                    print(f'{Fore.RED}Error occurred: {http_err}')
                    continue
                except Exception as err:
                    print(f'{Fore.RED}Error occurred: {err}')
                else:
                    print(f'{Fore.GREEN} + {Fore.WHITE} "{file_content.name}" {Fore.GREEN} Downloaded')
            else:
                print(f'{Fore.YELLOW} - {Fore.WHITE} "{file_content.name}" {Fore.YELLOW} Exists')

# TODO: add tqdm to show the progress bar which is very helpful in case you want to know where you at
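
A possible variation on the existence check in the script above, written as a hedged sketch rather than a tested replacement: read the target directory once into a set, and build the local file name from the repository name plus the file path, so that common names such as `setup.py` from different repositories are not treated as already downloaded. The helpers `local_name` and `load_seen` are hypothetical names introduced here for illustration.

```python
import os


def local_name(repo_full_name, file_path):
    """Build a flat file name from 'owner/repo' and the path inside the repo."""
    return f'{repo_full_name}/{file_path}'.replace('/', '__')


def load_seen(directory):
    """Read the directory once so later checks are in-memory set lookups."""
    return set(os.listdir(directory))
```

Assuming PyGithub's `repo.full_name` and `file_content.path` attributes, the loop could compute `name = local_name(repo.full_name, file_content.path)`, skip the file when `name in seen`, and otherwise call `wget.download(contents_url, out=os.path.join(repos_path, name))` followed by `seen.add(name)` once the download succeeds.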
