Research Protocol
Affiliate.Careers Methodology
Open and reproducible protocol: how data is ingested, cleaned, scored, and published in the catalog and analytics.
1. Research goal
Collect affiliate / iGaming / crypto jobs market data from Telegram sources, remove noise, merge duplicates, score card quality, and publish open datasets.
Important: this is a Telegram-channels market slice, not a full labor-market model.
2. Sources and data layers
tg_messages: immutable RAW layer of messages.tg_message_parse_runs: versioned parsing and classification outputs.jobs: canonical jobs after deduplication.job_sources: job-to-source links for source posts.analytics_daily_*: daily aggregates for public analytics.analytics_mv_*: materialized views for heavy analytics queries and weekly reports.
3. Processing pipeline
- Ingest: import Telegram messages into RAW layer.
- Classify: assign message type (vacancy/promo/resume/other).
- Split: split multi-vacancy messages into individual items.
- Extract: extract fields (title, company, salary, contacts, tags).
- Dedup: merge duplicates into canonical job entity.
- AI moderation: auto-review ambiguous cases and apply high-confidence decisions.
- Publish: ship to catalog/feeds and refresh analytics rollups.
Scheduled update cycle: hourly.
4. Filtering rules
- Ukrainian language: messages with stable Ukrainian markers are excluded from publishing.
- Channel exclusions: sources without apply contacts or with hard redirects can be disabled.
- Promo/noise: ad-like messages without vacancy structure are routed to review.
5. Quality Score formula (0..100)
Final quality score is calculated as a weighted component sum:
Q = T + C + S + Sn + L + W + A + D + R + V Where: T (title quality) = 0..7 C (company present) = 0..18 S (salary raw present) = 0..18 Sn (salary numeric parsed) = 0..8 L (location present) = 0..10 W (work_format != unknown) = 0..10 A (apply contacts present) = 0..14 D (description richness) = 0..8 R (multi-source confidence) = 0..3 V (direct/verified bonus) = 0..7
Current formula version: qscore-v1.0. Scores and flags are stored in jobs.quality_score and jobs.quality_flags.
6. Public indices and formulas
Salary Transparency Index (%) = jobs_seen_with_salary_raw / jobs_seen_count * 100 Salary Numeric Share (%) = jobs_seen_with_salary_numeric / jobs_seen_count * 100 Employer Transparency Index (%)= jobs_seen_with_company / jobs_seen_count * 100 Contacts Coverage (%) = jobs_seen_with_contacts / jobs_seen_count * 100 Remote Share (%) = jobs_seen_remote / jobs_seen_count * 100 Channel Noise Rate (%) = repost_jobs_total / unique_jobs_total * 100
Indices are published on /market-pulse/ and /channels/.
7. Reproducibility: how to verify calculations
Below is the minimal checklist to independently verify outputs.
# 1) Миграции / Migrations bash scripts/db/migrate.sh # 2) Полный hourly-проход / Full hourly pass bash scripts/pipeline/run_hourly_pipeline.sh # 3) Пересчет quality + rollup за 90 дней / Rebuild quality + rollups php scripts/analytics/build_rollups.php --days=90
For historical backfills use: --from=YYYY-MM-DD, --quality_only=1, --rollup_only=1.
SQL validation (raw data vs rollups)
-- Raw -> published sanity SELECT count(*) AS raw_messages FROM tg_messages; SELECT count(*) AS active_jobs FROM jobs WHERE status='active'; -- Index quality coverage SELECT count(*) FILTER (WHERE salary_raw IS NOT NULL AND btrim(salary_raw) <> '') AS jobs_with_salary_raw, count(*) FILTER (WHERE company_raw IS NOT NULL AND btrim(company_raw) <> '') AS jobs_with_company, count(*) FILTER (WHERE work_format <> 'unknown') AS jobs_with_format FROM jobs WHERE status='active'; -- Check latest day in analytics rollups SELECT day, jobs_seen_count, jobs_seen_with_salary_raw, jobs_seen_with_company FROM analytics_daily_jobs ORDER BY day DESC LIMIT 7;
Validation via public API/feed
- /feeds/jobs.json.php?limit=50 — job entities
- /feeds/analytics.json.php?limit=30 — daily market indices
- /feeds/changelog.ndjson.php?limit=50 — job updates stream
- /feeds/product-changelog.ndjson.php?limit=20 — system updates stream
8. Limitations and risks
- Sampling bias: only selected Telegram channels are covered.
- Incomplete fields: salary/company/location may be missing in source posts.
- Extraction is not perfect: tagging/classification errors are possible.
- Deduplication and AI moderation rely on thresholds, not absolute truth.
9. Publication integrity checks
- Validate raw post to job mapping via
job_sources. - Cross-check index values against daily tables
analytics_daily_jobs/channels. - Check trend consistency and coverage using public feeds.
10. Public aggregated data and monitoring
- Daily market indices: /feeds/analytics.json.php
- Daily indices (tabular format): /feeds/analytics.csv.php
- Weekly reports: /reports/ with URL pattern
/reports/weekly-YYYY-WW/. - Pipeline freshness and quality monitoring:
/admin/analytics/
Alerting sends Telegram notifications when hourly pipeline is stale or unknown format share spikes unknown_work_format.
11. Skill tags extraction
Skills are extracted deterministically (regex v2) using RU/EN aliases dictionary from config/skills_aliases.php. This reduces noise and keeps trends reproducible without a black-box model.
- Version source:
analytics_daily_skills.source = regex_v2. - Coverage: traffic sources, trackers, ad-networks, CRM/BI tools, programming languages.
- Mechanics: normalize text → match aliases → dedupe tags → daily aggregation.