Python pentesting, OSINT and recon automation: 2026 tutorial

Reading time: 14 minutes · Level: intermediate · Updated: April 2026

⚠️ Disclaimer: All scripts must only be used against authorized targets (your own infrastructure, a contracted engagement with written authorization, or training platforms). Any reconnaissance against a third-party system without a legal framework is illegal.

A step-by-step tutorial for automating the reconnaissance and OSINT phase in Python: subdomain enumeration, web directory fuzzing, HTML parsing, Shodan API integration, and a complete pipeline. Everything is tested against legal targets: your own domain, OWASP Juice Shop, or an in-scope HackerOne or TryHackMe target.

Why automate recon in Python rather than rely on existing tools? Several reasons: (1) recon done manually or with standard tools is repetitive; a personal script encapsulates your workflow and keeps you from forgetting steps, (2) packaged tools (subfinder, ffuf) are written in Go with fixed CLI interfaces; a Python wrapper lets you chain them and shape their output into the format of your final report, (3) combining several sources (DNS brute force, Certificate Transparency, archive.org, Shodan) in a single command becomes trivial, (4) automatically documenting each step supports the chain of evidence and reproducibility.

A pragmatic approach. This tutorial does not rewrite every tool from scratch; it shows how to chain dnspython, aiohttp, the Shodan API and BeautifulSoup into a reproducible pipeline. Many Python recon scripts circulating on GitHub reinvent the wheel or are fragile. The value of a good recon script lies not in sophistication but in reliability, readability, and ease of modification when the target has a quirk.

See also → Python for pentesting: complete guide, Python pentesting network scripts, Python pentesting exploitation and payloads.

Contents

Async environment setup
Subdomain enumeration with dnspython
Async subdomain brute force (10K in 2 min)
Directory fuzzing (FFUF in Python)
Parsing HTML with BeautifulSoup
Polite async crawler
Shodan API integration
Complete OSINT pipeline (sub → port → tech)
Structured output (JSON + SQLite)
FAQ

1. Async environment setup

python3 -m venv ~/venvs/recon
source ~/venvs/recon/bin/activate
pip install aiohttp aiodns dnspython httpx requests beautifulsoup4 lxml \
            shodan tqdm rich

Wordlists:

sudo apt install -y seclists
ls /usr/share/seclists/Discovery/DNS/
# subdomains-top1million-5000.txt
# subdomains-top1million-110000.txt

ls /usr/share/seclists/Discovery/Web-Content/
# raft-medium-directories.txt
# raft-medium-files.txt

2. Subdomain enumeration with dnspython

# subs_basic.py
import dns.resolver

def check_subdomain(domain, sub):
    try:
        dns.resolver.resolve(f"{sub}.{domain}", "A")
        return f"{sub}.{domain}"
    except dns.resolver.NXDOMAIN:
        return None
    except Exception:
        return None

WORDLIST = ["www", "mail", "api", "dev", "staging", "admin", "blog", "shop"]

for sub in WORDLIST:
    full = check_subdomain("entreprise.sn", sub)
    if full:
        print(f"  [+] {full}")

Speed: ~10 subdomains/sec. Not enough for a 100k+ wordlist.

3. Async subdomain brute force (10K in 2 min)

# subs_async.py
import asyncio
import aiodns

async def check(resolver, domain, sub, sem):
    async with sem:
        full = f"{sub}.{domain}"
        try:
            await resolver.query(full, "A")
            return full
        except aiodns.error.DNSError:
            return None

async def main(domain, wordlist_file, max_concurrent=200):
    resolver = aiodns.DNSResolver(nameservers=["1.1.1.1", "8.8.8.8"])
    sem = asyncio.Semaphore(max_concurrent)

    with open(wordlist_file) as f:
        subs = [line.strip() for line in f if line.strip()]

    tasks = [check(resolver, domain, sub, sem) for sub in subs]
    results = await asyncio.gather(*tasks)
    return [r for r in results if r]

if __name__ == "__main__":
    import sys
    domain = sys.argv[1]
    wordlist = sys.argv[2]
    found = asyncio.run(main(domain, wordlist))
    print(f"\n[*] {len(found)} sub-domains:")
    for s in sorted(found):
        print(f"  {s}")

pip install aiodns
python3 subs_async.py entreprise.sn /usr/share/seclists/Discovery/DNS/subdomains-top1million-5000.txt

Typical output:

[*] 14 sub-domains:
  api.entreprise.sn
  blog.entreprise.sn
  dev.entreprise.sn
  mail.entreprise.sn
  staging.entreprise.sn
  ...

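Performance: around 5,000 lookups/sec on a decent connection with 200 concurrent queries, which puts a 100k wordlist at roughly 30 seconds. Throughput depends on resolver rate limits and network latency, so treat these figures as a best case; the "10K in 2 min" in the section title is the more conservative estimate.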

4. Directory fuzzing (FFUF in Python)

# dirfuzz.py
import asyncio
import aiohttp

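async def probe(session, base_url, path, sem, valid_codes=frozenset({200, 301, 302, 401, 403})):
    async with sem:
        # Normalise slashes so "http://host/" + "/admin" does not become "//admin"
        url = f"{base_url.rstrip('/')}/{path.lstrip('/')}"
        try:
            async with session.get(url, allow_redirects=False) as r:
                if r.status in valid_codes:
                    # Content-Length is absent on chunked responses; 0 then means "unknown"
                    size = int(r.headers.get('Content-Length', 0))
                    return (url, r.status, size)
        except Exception:
            pass  # unreachable host, TLS failure, timeout: treat as a miss
        return None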

async def main(base, wordlist, max_concurrent=100):
    sem = asyncio.Semaphore(max_concurrent)
    timeout = aiohttp.ClientTimeout(total=10)
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) recon-script"}

    with open(wordlist) as f:
        paths = [line.strip() for line in f if line.strip()]

    async with aiohttp.ClientSession(headers=headers, timeout=timeout) as s:
        tasks = [probe(s, base, p, sem) for p in paths]
        results = await asyncio.gather(*tasks)
        return [r for r in results if r]

if __name__ == "__main__":
    import sys
    base = sys.argv[1]
    wl = sys.argv[2]
    found = asyncio.run(main(base, wl))
    for url, code, size in sorted(found, key=lambda x: x[1]):
        print(f"  [{code}] {size:>8}b  {url}")

python3 dirfuzz.py http://target.local /usr/share/seclists/Discovery/Web-Content/common.txt

Output:

  [200]      234b  http://target.local/admin
  [200]     1024b  http://target.local/login
  [301]        0b  http://target.local/uploads
  [403]      512b  http://target.local/.git

Tip: filter by response size to ignore custom 404 pages that return a 200.
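
The size filter can be sketched as a post-processing step. This is a minimal sketch, assuming the (url, status, size) tuples returned by dirfuzz.py; the function name filter_soft_404 and the min_repeats threshold are hypothetical, and the heuristic (a 200 response size shared by many paths is probably a custom error page) is only one possible approach.

```python
from collections import Counter

def filter_soft_404(results, min_repeats=5):
    """Drop 200 responses whose body size repeats suspiciously often.

    results: iterable of (url, status, size) tuples, as produced by dirfuzz.py.
    min_repeats: hypothetical threshold; a size seen at least this many times
    among 200 responses is treated as a custom "not found" page.
    """
    results = list(results)
    size_counts = Counter(size for _, status, size in results if status == 200)
    noisy_sizes = {size for size, n in size_counts.items() if n >= min_repeats}
    return [
        (url, status, size)
        for url, status, size in results
        if not (status == 200 and size in noisy_sizes)
    ]

if __name__ == "__main__":
    sample = [(f"http://target.local/x{i}", 200, 1337) for i in range(6)]
    sample += [("http://target.local/admin", 200, 234),
               ("http://target.local/.git", 403, 1337)]
    for row in filter_soft_404(sample):
        print(row)
```

Because dirfuzz.py records a size of 0 when Content-Length is missing, chunked responses would all share the same size and be filtered together; reading the body length instead would avoid that, at the cost of downloading every page.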

5. Parsing HTML with BeautifulSoup

# parse_html.py
import re

import requests
from bs4 import BeautifulSoup, Comment

URL = "http://target.local"
r = requests.get(URL, timeout=10)
soup = BeautifulSoup(r.text, "lxml")

# All links
for a in soup.find_all("a", href=True):
    print(f"LINK: {a['href']}")

# Forms and their actions
for form in soup.find_all("form"):
    action = form.get("action", "")
    method = form.get("method", "GET").upper()
    inputs = [(inp.get("name"), inp.get("type", "text"))
              for inp in form.find_all("input")]
    print(f"FORM {method} {action} {inputs}")

# HTML comments (often forgotten secrets).
# Comment must be imported from bs4: soup.Comment would look for a <Comment> tag.
comments = soup.find_all(string=lambda x: isinstance(x, Comment))
for c in comments:
    print(f"COMMENT: {c.strip()}")

# Technology detection via meta tags and response headers
for meta in soup.find_all("meta"):
    if meta.get("name") in ("generator", "framework"):
        print(f"TECH: {meta.get('content')}")
for header in ("Server", "X-Powered-By"):
    if header in r.headers:
        print(f"TECH: {header}: {r.headers[header]}")

# Emails in the page content
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', r.text)
print(f"EMAILS: {set(emails)}")

6. Polite async crawler

# crawler.py
import asyncio
import aiohttp
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

class Crawler:
    def __init__(self, base, max_depth=2, max_concurrent=10, delay=0.2):
        self.base = base
        self.base_netloc = urlparse(base).netloc
        self.max_depth = max_depth
        self.delay = delay
        self.visited = set()
        self.sem = asyncio.Semaphore(max_concurrent)

    async def fetch(self, session, url, depth):
        if url in self.visited or depth > self.max_depth:
            return
        self.visited.add(url)

        # Hold the semaphore only for the request itself. Recursing while
        # holding it deadlocks once every slot belongs to a waiting parent.
        async with self.sem:
            await asyncio.sleep(self.delay)  # go easy on the server
            try:
                timeout = aiohttp.ClientTimeout(total=10)
                async with session.get(url, timeout=timeout) as r:
                    if r.status != 200:
                        return
                    text = await r.text()
            except Exception:
                return  # network or decoding error: skip this page
        print(f"  [{depth}] {url}")

        soup = BeautifulSoup(text, "lxml")
        links = []
        for a in soup.find_all("a", href=True):
            next_url = urljoin(url, a["href"])
            if urlparse(next_url).netloc == self.base_netloc:
                links.append(next_url)

        await asyncio.gather(*(self.fetch(session, l, depth + 1) for l in links))

    async def run(self):
        async with aiohttp.ClientSession() as s:
            await self.fetch(s, self.base, 0)
        return self.visited

if __name__ == "__main__":
    c = Crawler("http://target.local", max_depth=2)
    pages = asyncio.run(c.run())
    print(f"\n[*] {len(pages)} pages crawled")

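Politeness: each request is preceded by a 0.2 s delay and at most 10 run concurrently, which caps the crawl at roughly 50 requests per second. This lets you crawl without saturating the target; raise delay or lower max_concurrent for fragile hosts.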
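
The crawler above does not read robots.txt. Where you want it to, the standard library can evaluate the rules. This is a minimal sketch, assuming the robots.txt text has already been downloaded; build_robots_checker and the "recon-script" agent name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

def build_robots_checker(robots_txt, user_agent="recon-script"):
    """Return a function telling whether a URL may be fetched.

    robots_txt: the text of the target's robots.txt, already downloaded.
    user_agent: hypothetical agent name; use the one your crawler sends.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    allowed = build_robots_checker(rules)
    print(allowed("http://target.local/index.html"))
    print(allowed("http://target.local/private/report.pdf"))
```

One way to use it would be to call the checker at the top of Crawler.fetch and return early when it reports that the URL is disallowed.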

7. Shodan API integration

# shodan_recon.py
import shodan

API_KEY = "VOTRE_CLE"
api = shodan.Shodan(API_KEY)

# Search by hostname
results = api.search('hostname:entreprise.sn')

print(f"Total: {results['total']}")
for result in results['matches']:
    print(f"\nIP: {result['ip_str']}")
    print(f"Port: {result['port']}")
    print(f"Org: {result.get('org', 'N/A')}")
    print(f"Hostnames: {result.get('hostnames', [])}")
    print(f"Banner (excerpt): {result['data'][:200]}")

# Detailed information for one IP
host = api.host("198.51.100.42")
print(f"\nServices on {host['ip_str']}:")
for service in host['data']:
    # 'product' and 'version' are optional in Shodan banners, hence .get()
    print(f"  Port {service['port']}: {service.get('product', 'unknown')} {service.get('version', '')}")

Useful Shodan filters:
– hostname:exemple.com
– ssl:exemple.com (hosts whose certificate contains this name)
– port:80,443
– country:SN (Senegal)
– org:"Sonatel"
– product:Apache
– vuln:CVE-2024-XXXX
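
Filters like these are combined into a single query string, separated by spaces. The helper below is a minimal sketch of that composition; build_shodan_query is a hypothetical name and not part of the shodan library, and the quoting rule (wrap values containing whitespace in double quotes) is an assumption modelled on the org:"Sonatel" example.

```python
def build_shodan_query(**filters):
    """Join filter=value pairs into one Shodan query string.

    Values containing whitespace are wrapped in double quotes.
    Lists and tuples become comma-separated values, as in port:80,443.
    """
    parts = []
    for name, value in filters.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        value = str(value)
        if any(ch.isspace() for ch in value):
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    return " ".join(parts)

if __name__ == "__main__":
    print(build_shodan_query(hostname="entreprise.sn", port=[80, 443], country="SN"))
```

The resulting string can be passed to api.search() exactly like the literal query in shodan_recon.py.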

8. Complete OSINT pipeline (sub → port → tech)

# pipeline.py
import asyncio
import aiodns, aiohttp
from bs4 import BeautifulSoup

class ReconPipeline:
    def __init__(self, domain):
        self.domain = domain
        self.results = {}

    async def step1_subdomains(self, wordlist):
        resolver = aiodns.DNSResolver()
        sem = asyncio.Semaphore(200)
        with open(wordlist) as f:
            subs = [line.strip() for line in f if line.strip()]

        async def chk(sub):
            async with sem:
                try:
                    await resolver.query(f"{sub}.{self.domain}", "A")
                    return f"{sub}.{self.domain}"
                except Exception:
                    return None

        results = await asyncio.gather(*[chk(s) for s in subs])
        self.results['subdomains'] = [r for r in results if r]
        return self.results['subdomains']

    async def step2_ports(self, subs, ports=(80, 443, 8080, 8443)):
        sem = asyncio.Semaphore(500)

        async def probe(host, port):
            async with sem:
                try:
                    r, w = await asyncio.wait_for(
                        asyncio.open_connection(host, port), timeout=2)
                    w.close()
                    await w.wait_closed()
                    return (host, port)
                except Exception:
                    return None

        tasks = [probe(s, p) for s in subs for p in ports]
        results = await asyncio.gather(*tasks)
        self.results['open_ports'] = [r for r in results if r]
        return self.results['open_ports']

    async def step3_tech(self, http_targets):
        sem = asyncio.Semaphore(50)
        techs = {}

        async def detect(session, host, port):
            async with sem:
                proto = "https" if port in (443, 8443) else "http"
                url = f"{proto}://{host}:{port}"
                try:
                    async with session.get(url, timeout=10, ssl=False) as r:
                        server = r.headers.get('Server', '')
                        powered = r.headers.get('X-Powered-By', '')
                        text = await r.text()
                        soup = BeautifulSoup(text, 'lxml')
                        meta = soup.find('meta', attrs={'name': 'generator'})
                        gen = meta.get('content', '') if meta else ''
                        techs[url] = {'Server': server, 'X-Powered-By': powered,
                                       'Generator': gen}
                except Exception:
                    pass

        async with aiohttp.ClientSession() as s:
            await asyncio.gather(*[detect(s, h, p) for h, p in http_targets])
        self.results['tech'] = techs
        return techs

    async def run(self, wordlist):
        print("[*] Step 1: subdomains")
        subs = await self.step1_subdomains(wordlist)
        print(f"  Found {len(subs)} sub-domains")

        print("[*] Step 2: open ports")
        ports = await self.step2_ports(subs)
        print(f"  Found {len(ports)} open ports")

        print("[*] Step 3: tech detection")
        techs = await self.step3_tech(ports)
        print(f"  Detected on {len(techs)} services")

        return self.results

if __name__ == "__main__":
    p = ReconPipeline("entreprise.sn")
    out = asyncio.run(p.run("/usr/share/seclists/Discovery/DNS/subdomains-top1million-5000.txt"))
    import json
    with open('recon.json', 'w') as f:
        json.dump(out, f, indent=2)
    print("\n[*] Saved to recon.json")

9. Structured output (JSON + SQLite)

JSON for integration with third-party tools:

import json
with open('recon.json', 'w') as f:
    json.dump(results, f, indent=2, default=str)

SQLite for later queries:

import sqlite3

conn = sqlite3.connect('recon.db')
conn.execute('''CREATE TABLE IF NOT EXISTS subdomains
    (id INTEGER PRIMARY KEY, domain TEXT, sub TEXT,
     discovered_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)''')

for sub in results['subdomains']:
    conn.execute("INSERT INTO subdomains (domain, sub) VALUES (?, ?)",
                 ("entreprise.sn", sub))
conn.commit()
conn.close()

# Later query
conn = sqlite3.connect('recon.db')
for row in conn.execute("SELECT sub FROM subdomains WHERE domain=?",
                         ("entreprise.sn",)):
    print(row[0])
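
The pipeline also produces (host, port) pairs, which can be stored the same way. This is a minimal sketch; the open_ports table layout and the save_open_ports helper are assumptions made for illustration, not part of the scripts above.

```python
import sqlite3

def save_open_ports(conn, open_ports):
    """Store (host, port) pairs, ignoring pairs already recorded."""
    conn.execute('''CREATE TABLE IF NOT EXISTS open_ports
        (host TEXT, port INTEGER,
         discovered_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
         PRIMARY KEY (host, port))''')
    conn.executemany(
        "INSERT OR IGNORE INTO open_ports (host, port) VALUES (?, ?)",
        open_ports)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database for the demo
    save_open_ports(conn, [("api.entreprise.sn", 443), ("api.entreprise.sn", 443)])
    print(conn.execute("SELECT host, port FROM open_ports").fetchall())
    conn.close()
```

The composite primary key with INSERT OR IGNORE keeps repeated runs from duplicating rows, so the first discovered_at timestamp is preserved.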

Auto-generated Markdown for the report:

def to_markdown(results):
    md = "# Recon Report\n\n## Sub-domains\n\n"
    for s in sorted(results['subdomains']):
        md += f"- `{s}`\n"
    md += "\n## Open ports\n\n"
    for h, p in results['open_ports']:
        md += f"- {h}:{p}\n"
    return md

with open('recon.md', 'w') as f:
    f.write(to_markdown(results))

FAQ

Which subdomain wordlist should I choose?

Getting started: subdomains-top1million-5000.txt (fast, common hits). Going deeper: subdomains-top1million-110000.txt. Massive brute force: combine with mutations of the brand name (e.g. name + "dev", "staging", digits). Complementary tools: assetfinder, subfinder, amass (Go), which are often better than pure Python brute force.
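
Brand-name mutation can be sketched in a few lines. This is a minimal sketch; mutate_brand, its default suffixes and the digit range are hypothetical illustrations, not a recommended list.

```python
from itertools import product

def mutate_brand(brand, suffixes=("dev", "staging", "test"), max_digit=3):
    """Generate candidate subdomain labels from a brand name.

    suffixes and max_digit are illustrative defaults only.
    """
    candidates = {brand}
    # Combine brand and suffix in both orders, with and without a hyphen
    for suffix, sep in product(suffixes, ("", "-")):
        candidates.add(f"{brand}{sep}{suffix}")
        candidates.add(f"{suffix}{sep}{brand}")
    # Numbered variants: brand1, brand2, ...
    for n in range(1, max_digit + 1):
        candidates.add(f"{brand}{n}")
    return sorted(candidates)

if __name__ == "__main__":
    for label in mutate_brand("entreprise"):
        print(label)
```

The output can be appended to a SecLists wordlist and passed to subs_async.py unchanged.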

How do I avoid getting blacklisted?

Limit concurrency (50 to 200 at most), add a delay (sleep 0.1 to 0.5 s), use a realistic User-Agent, and rotate IPs if necessary. Respect robots.txt on targets you have no explicit authorization for; on an authorized engagement robots.txt is not binding, but sticking to the time windows agreed with the client is essential.

Is the free Shodan API enough?

Shodan's free plan is very limited (1 query/month in 2026, barring changes). The legacy Membership plan (a one-time lifetime payment, listed here at $69; check Shodan's current pricing) remains the best deal. The Small Business plan is more complete for regular professional use.

aiohttp vs httpx vs requests: which one to choose?

requests: synchronous, the most widely used, sufficient for simple scripts. httpx: modern, supports both sync and async, HTTP/2. aiohttp: natively async, performant at high volume. For recon at scale: aiohttp or httpx in async mode.

How do I integrate these scripts into a complete pipeline?

Several options: (1) a Bash glue script that chains the .py files, (2) a Makefile or Justfile, (3) Python frameworks such as prefect, luigi or airflow for complex pipelines, (4) dedicated recon tools: recon-ng (covered previously), axiom (cloud), osmedeus (orchestrator).

Does my crawler also pick up JS files and API endpoints?

Not by default. For endpoints inside JS: linkfinder (-i app.js), gospider, or a regex such as /api/[a-z]+/[a-z0-9]+/ run over the downloaded .js files. Modern SPAs require a headless browser (Playwright) to render the JS and capture the XHR calls with their parameters.
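
The regex approach can be sketched as follows. This is a minimal sketch, assuming endpoints live under /api/ and appear as quoted string literals in the JavaScript; the pattern is broader than the one quoted above and extract_api_paths is a hypothetical helper name.

```python
import re

# Assumption: endpoints live under /api/ and appear as string literals
# delimited by a double quote, single quote or backtick.
API_PATH = re.compile(r"""["'`](/api/[A-Za-z0-9_\-./]+)""")

def extract_api_paths(js_source):
    """Return the sorted, de-duplicated /api/ paths found in JavaScript text."""
    return sorted(set(API_PATH.findall(js_source)))

if __name__ == "__main__":
    js = 'fetch("/api/users/42"); axios.get(\'/api/orders/list\');'
    for path in extract_api_paths(js):
        print(path)
```

Paths assembled at runtime through string concatenation or template variables will not match; that is where a headless browser becomes necessary.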

How do I handle rate limiting by targets?

Detect HTTP 429 or 503, then apply exponential backoff. Spread requests across multiple IPs (proxychains, rotating proxies). Reduce concurrency. On an authorized engagement, agree on an acceptable window with the client.
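
Exponential backoff reduces to a small delay calculation. This is a minimal sketch; backoff_delay, should_retry and the default base and cap values are hypothetical, and the "full jitter" randomisation is a common variant rather than something the scripts above prescribe.

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0, jitter=True):
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles on each attempt and is capped at `cap`. With jitter,
    a random value between 0 and that delay is returned ("full jitter").
    """
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, delay) if jitter else delay

def should_retry(status):
    """True for the status codes treated here as rate limiting."""
    return status in (429, 503)

if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, backoff_delay(attempt, jitter=False))
```

In an async probe, this would typically be used as await asyncio.sleep(backoff_delay(attempt)) before re-issuing the request, with a maximum number of attempts.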

Does asyncio make scripts harder to debug?

Yes, slightly. Async stack traces are sometimes cryptic, and race conditions are hard to track down. Solutions: asyncio.run(..., debug=True), explicit logging, traces in each coroutine. For critical scripts, threading remains simpler to debug despite lower performance.
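
Debug mode and per-coroutine logging can be combined as follows. This is a minimal sketch; resolve_stub is a hypothetical stand-in for a real lookup and the "recon" logger name is arbitrary.

```python
import asyncio
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
log = logging.getLogger("recon")

async def resolve_stub(name):
    """Stand-in for a real lookup; logs entry and exit of the coroutine."""
    log.debug("start %s", name)
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    log.debug("done %s", name)
    return name

async def main():
    # gather returns results in the order the coroutines were passed
    return await asyncio.gather(resolve_stub("api"), resolve_stub("dev"))

if __name__ == "__main__":
    # debug=True enables asyncio's debug mode, which among other things
    # logs slow callbacks and reports coroutines that were never awaited.
    print(asyncio.run(main(), debug=True))
```

Setting the PYTHONASYNCIODEBUG=1 environment variable enables the same debug mode without touching the code.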

Related articles (Python pentesting cluster)

See also: Essential OSINT tools: tutorial, Web application pentesting: OWASP Top 10.

Article updated on 25 April 2026. To report an error or suggest an improvement, write to us.