🦾 유튜브 쇼츠메이커 개발기

“썰쇼츠를 한 번에 자동으로 만드는
파이썬 기반 영상제작기”

1. 개발 동기

기존의 n8n(무료: 월 200회)나 Make(무료: 월 1,000회)와 같은 자동화 플랫폼은 제한적인 무료 사용량과 유료 구독 비용이 부담스러웠습니다.
또한 원하는 기능을 모두 활용하기에는 한계가 있어
“직접 파이썬으로 자동화 프로그램을 개발”하게 되었습니다.

2. 개발에 필요한 도구

– 파이썬 3.10 이상
– ffmpeg (음성, 영상 처리용)
– OpenAI API (DALL·E, GPT, Whisper)
– Typecast API (음성 합성)
– moviepy, pydub, pillow 등 파이썬 라이브러리
– 무료 이미지·영상 소스(예정: 픽사베이 API)

3. 단순화한 개발 순서

1) 키워드 입력
2) GPT가 음성 대본 자동 생성
3) 타입캐스트로 TTS(음성) 파일 생성
4) DALL·E로 이미지 생성
5) Whisper로 자막 타이밍 자동 추출
6) 이미지 + 자막 + 음성을 합쳐 영상 완성!

4. 미리 설치해야할 프로그램, 라이브러리

– ffmpeg 다운로드 후 경로 등록 (예: C:/ffmpeg/ffmpeg-7.1.1-essentials_build/bin/)
– pip install openai pydub moviepy pillow requests whisper
– 윈도우: C:/Windows/Fonts/malgun.ttf (폰트 경로 확인 필수)

5. 주요 파이썬 함수와 역할

아래는 이 프로젝트의 주요 파이썬 코드들입니다.
파이썬 코드이며, 검정 바탕에 흰색 글씨로 하이라이트했습니다.
코드를 완전 상세하게 주석으로 설명해 두었으니 초보자도 쉽게 따라할 수 있습니다.

 # 1. GPT로 음성 대본 생성
 def generate_voice_script(food):
     prompt = f\”\”\”
     Write a 10-second narration for a YouTube Shorts video about ‘{food}’.
     – Do NOT include speaker names or acting cues.
     – No scene directions or sound/music notes.
     – Structure: recommend a recipe (ingredients/time), nutrition/health summary, one reference from literature/movie/real-life, 1-2 curiosity-provoking or self-questioning lines.
     – 30–40 words, all sentences conversational and naturally flowing.
     Write the entire script in natural Korean.
     \”\”\”
     response = client.chat.completions.create(
         model=\”gpt-4o-mini\”,
         messages=[{\”role\”: \”user\”, \”content\”: prompt}]
     )
     return response.choices[0].message.content.strip()
 
 # 2. Typecast API로 텍스트를 음성으로 변환(TTS)
 def generate_audio(text, food):
     os.makedirs(\”segments\”, exist_ok=True)  # 음성 저장용 폴더 생성
     clean_text = re.sub(r\”\\s+\”, \” \”, text).strip()  # 공백 정리
     payload = {
         \”actor_id\”: \”66d91c60da8dd20be59cd40b\”,  # 캐릭터(목소리) ID
         \”text\”: clean_text,
         \”lang\”: \”auto\”,
         \”tempo\”: 1,
         \”volume\”: 100,
         \”pitch\”: 0,
         \”xapi_hd\”: True,
         \”max_seconds\”: 300,
         \”model_version\”: \”latest\”,
         \”xapi_audio_format\”: \”mp3\”,
         \”emotion_tone_preset\”: \”angry-1\”
     }
     resp = requests.post(TYPECAST_URL, headers=TYPECAST_HEADERS, json=payload)
     speak_url = resp.json()[\”result\”][\”speak_url\”]
     print(\”speak_url:\”, speak_url)
 
     # 음성 파일이 준비될 때까지 최대 20회 폴링
     audio_url, extension = None, None
     for i in range(20):
         poll = requests.get(speak_url, headers=TYPECAST_HEADERS)
         result = poll.json()[\”result\”]
         if result.get(\”status\”) == \”done\” and result.get(\”audio\”) and result[\”audio\”].get(\”url\”):
             audio_url = result[\”audio\”][\”url\”]
             extension = result[\”audio\”][\”extension\”]
             print(f\”[{i+1}회차] 음성 URL:\”, audio_url)
             break
         print(f\”[{i+1}회차] 음성 파일 준비 중… 재시도\”)
         time.sleep(1)
     if not audio_url:
         raise RuntimeError(\”20초 내에 음성 파일 URL을 받지 못했습니다.\”)
 
     path = f\”segments/{food}_voice.{extension}\”
     audio_res = requests.get(audio_url, headers=TYPECAST_HEADERS)
     with open(path, \”wb\”) as f:
         f.write(audio_res.content)
     print(f\”파일 저장 완료: {path}\”)
 
     seg = AudioSegment.from_file(path, format=extension)
     duration = len(seg) / 1000.0  # 초 단위
     return path, duration
 
 # 3. DALL·E 이미지 생성 및 저장
 def download_image(url, filename):
     os.makedirs(\”images\”, exist_ok=True)
     path = os.path.join(\”images\”, filename)
     img_data = requests.get(url).content
     with open(path, \”wb\”) as f:
         f.write(img_data)
     return path
 
 def generate_images(food, voice_script, num_images=2):
     paths = []
     for i in range(num_images):
         prompt = voice_script.strip()
         print(f\”[{i+1}번째 이미지 프롬프트] {prompt}\”)
         res = client.images.generate(
             model=\”dall-e-3\”,
             prompt=prompt,
             n=1,
             size=\”1024×1792\”
         )
         url = res.data[0].url
         filename = f\”{food}_img_{i}.png\”
         img_path = download_image(url, filename)
         paths.append(img_path)
     if not paths:
         raise ValueError(\”❌ 이미지 생성 실패 (프롬프트 없음)\”)
     return paths
 
 # 4. Whisper로 자막 타임스탬프 추출
 def transcribe_audio(audio_path):
     print(\”[Whisper] 음성파일에서 자막 타이밍 추출 중…\”)
     model = whisper.load_model(\”medium\”)  # base/small/medium/large 가능
     result = model.transcribe(audio_path)
     subtitles = []
     for segment in result[\”segments\”]:
         subtitles.append({
             \”text\”: segment[\”text\”].strip(),
             \”start\”: segment[\”start\”],
             \”end\”: segment[\”end\”]
         })
     return subtitles
 
 # 5. 자막 텍스트 8자/4단어 이하로 분할 및 타이밍 재할당
 def split_whisper_line(text, max_chars=8, max_words=4):
     words = text.split()
     result = []
     chunk = []
     for word in words:
         test_chunk = chunk + [word]
         test_text = \”\”.join(test_chunk)
         if len(test_chunk) < max_words and len(test_text) <= max_chars:
             chunk.append(word)
         else:
             if chunk:
                 result.append(\” \”.join(chunk))
             chunk = [word]
     if chunk:
         result.append(\” \”.join(chunk))
     return result
 
 def postprocess_whisper_subtitles(subtitles, max_chars=8, max_words=4):
     new_subs = []
     for sub in subtitles:
         lines = split_whisper_line(sub[\”text\”], max_chars, max_words)
         seg_duration = sub[\”end\”] – sub[\”start\”]
         if len(lines) == 0:
             continue
         per_line = seg_duration / len(lines)
         for i, line in enumerate(lines):
             new_subs.append({
                 \”text\”: line,
                 \”start\”: sub[\”start\”] + i * per_line,
                 \”end\”: sub[\”start\”] + (i + 1) * per_line
             })
     return new_subs
 
 # 6. 자막용 이미지 생성 (중앙 정렬, 흰색 글씨 + 검은색 외곽선)
 def create_subtitle_image(text, size=(720, 1280), font_path=FONT_PATH_KO, fontsize=80, output_path=’subtitle_temp.png’):
     W, H = size
     img = Image.new(‘RGBA’, size, (0, 0, 0, 0))  # 투명 배경
     draw = ImageDraw.Draw(img)
     font = ImageFont.truetype(font_path, fontsize)
     clean_text = text.replace(‘\”‘, ”).replace(\”‘\”, \”\”)
     wrapped = clean_text
     text_bbox = draw.textbbox((0, 0), wrapped, font=font)
     text_w = text_bbox[2] – text_bbox[0]
     text_h = text_bbox[3] – text_bbox[1]
     pos = ((W – text_w) // 2, (H – text_h) // 2)
     outline_range = 2
     # 외곽선 효과
     for dx in range(-outline_range, outline_range + 1):
         for dy in range(-outline_range, outline_range + 1):
             if dx != 0 or dy != 0:
                 draw.text((pos[0] + dx, pos[1] + dy), wrapped, font=font, fill=\”black\”)
     draw.text(pos, wrapped, font=font, fill=\”white\”)
     img.save(output_path)
     return output_path
 
 # 7. 영상 합성 (이미지+자막+음성 싱크)
 def assemble_video_with_whisper_subtitles(image_paths, audio_file, food, subtitles, total_duration):
     W, H = 720, 1280  # 영상 해상도
     per_img = total_duration / len(image_paths)  # 이미지별 할당 시간
     img_clips = []
     for idx, img in enumerate(image_paths):
         start = idx * per_img
         duration = per_img if idx < len(image_paths) - 1 else total_duration - start
         img_clips.append(
             ImageClip(img).with_start(start).with_duration(duration).resized((W, H))
         )
     main_clip = concatenate_videoclips(img_clips, method=\”compose\”).with_duration(total_duration)
 
     subtitle_clips = []
     for i, sub in enumerate(subtitles):
         subtitle_path = create_subtitle_image(sub[\”text\”], size=(W, H), output_path=f\”subtitle_temp_{i}.png\”)
         subclip = ImageClip(subtitle_path)
             .with_start(sub[\”start\”])
             .with_duration(sub[\”end\”] – sub[\”start\”])
         subtitle_clips.append(subclip)
 
     video = CompositeVideoClip([main_clip] + subtitle_clips)
     audio = AudioFileClip(audio_file)
     final = video.with_audio(audio)
     output_filename = f\”{food}_shorts.mp4\”
     final.write_videofile(output_filename, fps=24)
 
 # 8. 메인 함수
 def main(food):
     num_images = int(input(\”생성할 이미지 개수를 입력하세요 (예: 4): \”))
 
     print(\”[1] 음성 대본 생성 중…\”)
     voice_script = generate_voice_script(food)
     print(\”== Generated Voice Script ==\”)
     print(voice_script)
     print(\”============================\”)
 
     print(\”[2] 음성 파일 생성 중…\”)
     audio_path, duration = generate_audio(voice_script, food)
     print(f\”[2] 음성 파일 저장된 경로: {audio_path}\”)
 
     print(\”[3] 이미지 생성 중…\”)
     image_paths = generate_images(food, voice_script, num_images=num_images)
 
     print(\”[4] Whisper로 자막 타임스탬프 추출 및 8자/4단어 이하 분할…\”)
     raw_subtitles = transcribe_audio(audio_path)
     subtitles = postprocess_whisper_subtitles(raw_subtitles, max_chars=8, max_words=4)
 
     print(\”[5] 영상 합성 중(정확 싱크 자막)…\”)
     assemble_video_with_whisper_subtitles(image_paths, audio_path, food, subtitles, duration)
 
     print(f\”[✅ 완료] {food}_shorts.mp4 생성 완료!\”)
 
 if __name__ == \”__main__\”:
     food = input(\”키워드를 입력하세요 (예: banana): \”)
     main(food)

6. 타입캐스트 API 사용방법

1) https://typecast.ai/api/speak 엔드포인트에
2) “API 토큰”과 “캐릭터(배우) ID”를 헤더에 포함
3) 변환할 텍스트, 음성 옵션(속도, 볼륨 등) JSON으로 전송
4) 반환된 speak_url로 음성 파일 생성 상태 확인
5) 최종 audio.url을 다운로드

예시 (파이썬)

import requests
headers = {
  “Authorization”: “Bearer API_TOKEN”,
  “Content-Type”: “application/json”
}
payload = {
  “actor_id”: “캐릭터ID”,
  “text”: “변환할 텍스트”,
  …
}
resp = requests.post(“https://typecast.ai/api/speak”, headers=headers, json=payload)

7. 최종 테스트

모든 과정을 자동화하면, 키워드 입력만으로 쇼츠 영상이 자동으로 생성됩니다.
썰, 요리, 여행, AI, 자기계발 등 다양한 주제로 활용 가능!

8. 추후 업데이트 예정 사항

– 쇼츠 템플릿 다양화 (썰/교육/감동/반전 등)
– 유튜브 자동 업로드 기능 추가
– GPT 이미지 생성(유료) → 픽사베이 API로 무료 이미지/영상 활용
– 배경음악 자동 추가 기능
– 초보자도 쉽게 따라할 수 있는 GUI 버전도 고려중

9. n8n/Make와 비교

비교 항목	n8n (Cloud Free)	Make (Free)
프로젝트 수	1개	2개
실행 횟수	200회/월	1,000 Ops/월
기능 확장성	제한적 (자체호스팅 시 무제한)	제한적
설치 옵션	자체 서버 설치로 제한 제거 가능	없음 (클라우드만 제공)

비용 부담 + 기능 한계 때문에 직접 개발!

10. 더 나은 자동화? 의견 남겨주세요!

혹시 추가되었으면 하는 기능, 궁금한 점이 있다면
댓글이나 메일로 언제든 문의해주세요 🙂

잠코딩