Synavistra is committed to full transparency about how our AI models are built, what data they are trained on, and how they process your documents.
Overview
Our AI document analysis tool runs entirely in your web browser. This page discloses the complete training data, model architecture, and processing methodology in compliance with the EU AI Act (Regulation (EU) 2024/1689).
Local Processing Architecture
When you use our document analysis tool, all processing happens locally on your device:
- The AI model is downloaded once to your browser and cached locally
- Your PDF documents are processed entirely in browser memory
- Extracted text, entities, and knowledge graphs never leave your device
- No analytics, tracking, cookies, or telemetry of any kind
- Exported .snv.json files are saved directly to your local filesystem
Model Information
| Property | Value |
|---|---|
| Base Model | Phi-3-mini-4k-instruct (Microsoft, 3.8B parameters) |
| Model License | MIT License (open source) |
| Fine-tuning Method | LoRA (Low-Rank Adaptation) on domain-specific legal texts |
| Quantization | INT4 (ONNX format for browser inference) |
| Inference Engine | ONNX Runtime Web with WebGPU acceleration |
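The inference setup above, together with the rule-based fallback noted under Known Limitations, implies a simple backend-selection flow: prefer WebGPU, fall back to WASM, and use rule-based extraction when the model cannot load at all. The sketch below is a hypothetical illustration (the `Capabilities` shape and `selectBackend` helper are assumptions for this page, not our shipped code):

```typescript
// Hypothetical backend-selection helper: prefers WebGPU, then WASM,
// then the rule-based fallback described under Known Limitations.
type Backend = "webgpu" | "wasm" | "rule-based";

interface Capabilities {
  webgpu: boolean;      // e.g. "gpu" in navigator, in a real browser
  wasm: boolean;        // WebAssembly support
  modelCached: boolean; // model weights already downloaded and cached?
  online: boolean;      // can we fetch the model if it is not cached?
}

function selectBackend(c: Capabilities): Backend {
  // Without weights (not cached, no network), only the rules can run.
  if (!c.modelCached && !c.online) return "rule-based";
  if (c.webgpu) return "webgpu";
  if (c.wasm) return "wasm";
  return "rule-based";
}
```

In the browser, the capability flags would come from feature detection; keeping the decision in a pure function like this makes the fallback path easy to test offline.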
Evaluation Results
We publish our model evaluation results openly. These numbers reflect honest performance on held-out test data, not cherry-picked examples:
| Task | Precision | Recall | F1 | Parse Rate |
|---|---|---|---|---|
| Named Entity Recognition (prefix 0) | 69.4% | 59.6% | 62.3% | 100% |
Evaluation on 61 held-out examples from GDPR and CCPA texts. Additional prefix evaluations will be published as they complete. These results represent a 3.8B parameter model fine-tuned on 324 examples — not a frontier model.
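As an illustration of how entity-level scores like those above are typically computed, the sketch below assumes an entity counts as correct only when both its text span and label exactly match a gold entity (the `Entity` shape and `scoreEntities` helper are assumptions, not our published evaluation harness):

```typescript
// Sketch of entity-level scoring: a prediction is a true positive only
// when both its text and label exactly match a gold entity.
interface Entity { text: string; label: string; }

function scoreEntities(gold: Entity[], predicted: Entity[]) {
  const key = (e: Entity) => `${e.label}::${e.text}`;
  const goldSet = new Set(gold.map(key));
  const tp = predicted.filter((e) => goldSet.has(key(e))).length;
  const precision = predicted.length ? tp / predicted.length : 0;
  const recall = gold.length ? tp / gold.length : 0;
  const f1 = precision + recall
    ? (2 * precision * recall) / (precision + recall)
    : 0;
  return { precision, recall, f1 };
}
```

Note that plugging the table's aggregate precision and recall into 2PR/(P+R) gives roughly 64.1%; a reported F1 of 62.3% is consistent with averaging F1 per example rather than pooling counts across the whole test set.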
Known Limitations
We believe honest disclosure of limitations is more valuable than marketing claims. This model:
- Only supports English text (German and other languages were not part of training)
- Only covers GDPR and CCPA/CPRA privacy law (no contract law, regulatory law, or other domains yet)
- Does not include case law, court decisions, or regulatory guidance (only statutory text)
- Is NOT a substitute for legal advice — outputs are AI-generated summaries that must be verified by qualified professionals
- Has a 1024-token context window — very long articles may be truncated
- NER recall of 59.6% means roughly 2 in 5 entities may be missed, and precision of 69.4% means roughly 3 in 10 extracted entities may be incorrectly classified
- Falls back to rule-based extraction when the AI model is not available (lower quality, but still functional)
Training Data Sources
The model was fine-tuned exclusively on publicly available official legal texts. Every source is documented with full provenance:
| Source | Documents | License | Jurisdiction |
|---|---|---|---|
| GDPR (Regulation (EU) 2016/679) | 99 articles | CC-BY-4.0 | EU/EEA |
| CCPA/CPRA (Cal. Civ. Code 1798) | 23 sections | Public Domain (US state law) | California |
Total: 324 training examples across 5 task types (NER extraction, text cleanup, knowledge graph extraction, query decomposition, answer synthesis). All training examples were manually created from real legal text — no synthetic or AI-generated training data.
Training Methodology
- Source texts are official legal documents downloaded from government websites (EUR-Lex, California Legislature)
- Named entities were extracted using @nlpjs/ner with a curated legal entity dictionary
- Knowledge graph relationships were manually identified and verified by domain experts
- All training input/output pairs (golden records) are archived with SHA-256 checksums for reproducibility
- Training was performed on Google Cloud TPU v6e infrastructure in the EU (europe-west4)
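The SHA-256 archiving mentioned above means anyone can re-derive a record's checksum and compare it against the published value. A minimal sketch, assuming the records are plain files and the checksums are hex strings (the `verifyRecord` helper is illustrative, not our published tooling):

```typescript
import { createHash } from "node:crypto";

// Compute the SHA-256 digest of a golden record's content as hex.
function sha256Hex(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Compare a record's content against its archived checksum,
// tolerating upper- or lower-case hex in the manifest.
function verifyRecord(content: string, expectedSha256: string): boolean {
  return sha256Hex(content) === expectedSha256.toLowerCase();
}
```

Any mismatch indicates the record was altered after archiving, which is exactly what the checksums are there to detect.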
No Synthetic Training Data
We do not use AI-generated or synthetic training examples. Every training example was created by humans working with real legal text. This ensures the model learns from authoritative sources, not from AI hallucinations or circular training patterns.
Open Golden Records
Training data for our publicly accessible tools is fully open and downloadable. These are the human-verified input/output pairs used to train and evaluate the model. Anyone can inspect, reproduce, or challenge our training methodology. All records are archived with SHA-256 checksums and available at models.synavistra.ai/training-data/.
| Pipeline Stage | GDPR Pairs | CCPA Pairs |
|---|---|---|
| Text Extraction | 47 | 14 |
| NER Extraction | 53 | 8 |
| Knowledge Graph | 54 | 7 |
| Query Decomposition | 52 | 9 |
| Answer Synthesis | 46 | 15 |
| Total | 252 | 53 |
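The totals row above can be cross-checked mechanically. The object below is simply a transcription of the table for that purpose, not a published data format:

```typescript
// Transcription of the golden-records table for a mechanical totals check.
const goldenRecords: Record<string, { gdpr: number; ccpa: number }> = {
  "Text Extraction":     { gdpr: 47, ccpa: 14 },
  "NER Extraction":      { gdpr: 53, ccpa: 8 },
  "Knowledge Graph":     { gdpr: 54, ccpa: 7 },
  "Query Decomposition": { gdpr: 52, ccpa: 9 },
  "Answer Synthesis":    { gdpr: 46, ccpa: 15 },
};

function totals(records: typeof goldenRecords) {
  let gdpr = 0;
  let ccpa = 0;
  for (const row of Object.values(records)) {
    gdpr += row.gdpr;
    ccpa += row.ccpa;
  }
  return { gdpr, ccpa };
}
```

Summing the per-stage counts reproduces the table's totals of 252 GDPR pairs and 53 CCPA pairs.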
Environmental Impact
We design for minimal environmental impact at every stage of the AI lifecycle:
Training
| Property | Value |
|---|---|
| Hardware | Google Cloud TPU v6e (single chip, ct6e-standard-1t) |
| Data Center | europe-west4 (Netherlands) — 82% carbon-free energy |
| Total Energy per Model | ~1 kWh — including all training, evaluation, scoring, ONNX export, and failed attempts. <a href="https://models.synavistra.ai/audits/phi3-legal-privacy-v1.json" rel="noopener">Detailed audit (JSON)</a>. |
Fine-tuning, rather than training from scratch, is the appropriate approach at our data scale (324 examples from 122 legal documents): it leverages the understanding Phi-3 already acquired from trillions of pre-training tokens, and it keeps the environmental footprint low.
Inference
- Browser-local inference: AI runs on the user's existing device — no cloud GPU servers required
- INT4 quantization shrinks model weights to roughly a quarter of their FP16 size, cutting memory traffic and energy use on every device
- Zero idle energy: no servers running 24/7 waiting for requests — compute only happens when a user actively runs the tool
- Model downloaded once and cached in the browser — subsequent uses require no network transfer
Licensing
All Synavistra-produced artifacts for publicly accessible tools are licensed under Apache 2.0. Third-party components retain their original licenses.
| Artifact | License | Note |
|---|---|---|
| Fine-tuned model weights | Apache-2.0 | Synavistra derivative work |
| Golden records (training data) | Apache-2.0 | Human-created by Synavistra |
| Energy audits, manifests | Apache-2.0 | Synavistra documentation |
| GDPR source text | CC-BY-4.0 | EU official document, attribution required |
| CCPA source text | Public Domain | US state law, unrestricted |
| Phi-3-mini base model | MIT | Microsoft, included per MIT terms |
Phi-3 MIT License Notice (required attribution)
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
EU AI Act Compliance
This disclosure is provided in accordance with Article 53 of the EU AI Act (Regulation (EU) 2024/1689) regarding transparency obligations for general-purpose AI models. Synavistra GmbH, Feldkirch, Austria, is the provider of this AI system. For questions about our AI practices, contact us at the address listed in our Impressum.
Independent Verification
We invite independent auditors, researchers, and regulators to verify any claim made on this page. All data is downloadable: training data registry, energy audit, and golden records for every pipeline stage. If you identify any inaccuracy or concern, please contact us.
Questions
If you have questions about our AI transparency practices, training data, or processing methodology, please contact us.