# GAIS Data Import for ClassroomCopilot

This document describes the GAIS (Get All Information from Schools) data import for ClassroomCopilot, which loads publicly available school datasets into the Neo4j database.

## Overview

The GAIS data import system imports publicly available educational data from several sources, starting with Edubase All Data. It follows Neo4j naming conventions and builds a graph of schools and the entities they relate to, such as local authorities, establishment types, and geographic areas.

## Neo4j Naming Conventions

The import system adheres to the following Neo4j naming conventions:

- **Node Labels**: PascalCase (e.g., `Establishment`, `LocalAuthority`, `EstablishmentType`)
- **Relationships**: `HAS_` prefix followed by the target node label in upper snake case (e.g., `HAS_LOCAL_AUTHORITY`, `HAS_ESTABLISHMENT_TYPE`)
- **Properties**: camelCase (e.g., `establishmentName`, `schoolCapacity`, `numberOfPupils`)

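The conventions above can be expressed as two small helpers. This is a minimal sketch, not the importer's actual code: the function names `to_property_name` and `to_relationship_type` are hypothetical, and it assumes the inputs are already PascalCase (or all-caps, as with `URN` and `MSOA`).

```python
import re


def to_property_name(column: str) -> str:
    """Convert a PascalCase identifier to a camelCase property name."""
    if not column:
        return column
    if column.isupper():
        # All-caps identifiers such as "URN" become fully lower case
        return column.lower()
    return column[0].lower() + column[1:]


def to_relationship_type(label: str) -> str:
    """Build the HAS_ relationship type for a PascalCase target label."""
    # Insert "_" at each lower-to-upper boundary, then upper-case everything
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", label)
    return f"HAS_{snake.upper()}"


print(to_property_name("EstablishmentName"))
print(to_relationship_type("EstablishmentTypeGroup"))
```

Deriving the relationship type from the label keeps the two in step, so adding a new lookup label does not require a separately maintained relationship name.
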
## Data Structure

### Main Nodes

1. **Establishment** - The primary school/educational institution node
   - Properties: URN, name, address, capacity, pupil counts, etc.
   - Unique identifier: `urn` property

2. **LocalAuthority** - Local authority governing the establishment
   - Properties: code, name
   - Relationship: `HAS_LOCAL_AUTHORITY`

3. **EstablishmentType** - Type of educational establishment
   - Properties: code, name
   - Relationship: `HAS_ESTABLISHMENT_TYPE`

4. **EstablishmentTypeGroup** - Group classification of establishment types
   - Properties: code, name
   - Relationship: `HAS_ESTABLISHMENT_TYPE_GROUP`

5. **PhaseOfEducation** - Educational phase (Primary, Secondary, etc.)
   - Properties: code, name
   - Relationship: `HAS_PHASE_OF_EDUCATION`

6. **GenderType** - Gender classification of the establishment
   - Properties: code, name
   - Relationship: `HAS_GENDER_TYPE`

7. **ReligiousCharacter** - Religious character of the establishment
   - Properties: code, name
   - Relationship: `HAS_RELIGIOUS_CHARACTER`

8. **Diocese** - Religious diocese (if applicable)
   - Properties: code, name
   - Relationship: `HAS_DIOCESE`

9. **GovernmentOfficeRegion** - Government office region
   - Properties: code, name
   - Relationship: `HAS_GOVERNMENT_OFFICE_REGION`

10. **DistrictAdministrative** - Administrative district
    - Properties: code, name
    - Relationship: `HAS_DISTRICT_ADMINISTRATIVE`

11. **MSOA** - Middle Super Output Area
    - Properties: code, name
    - Relationship: `HAS_MSOA`

12. **LSOA** - Lower Super Output Area
    - Properties: code, name
    - Relationship: `HAS_LSOA`

13. **Country** - Country of the establishment
    - Properties: name
    - Relationship: `HAS_COUNTRY`

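To make the structure concrete, the sketch below builds a Cypher statement that links one `Establishment` to one lookup node. It rests on assumptions the list does not confirm: relationships point from the establishment to the lookup node, lookup nodes are merged on `code` (or `name` for `Country`), and the helper `link_query` and parameter names `$urn` and `$key` are hypothetical.

```python
# Merge key per lookup label. Only three labels are shown; the rest follow the same pattern.
# Merging on `code` (and on `name` for Country) is an assumption.
LOOKUP_KEYS = {
    "LocalAuthority": "code",
    "EstablishmentType": "code",
    "Country": "name",
}


def link_query(label: str, relationship: str) -> str:
    """Return a Cypher statement linking one establishment to one lookup node."""
    key = LOOKUP_KEYS[label]
    return (
        "MERGE (e:Establishment {urn: $urn}) "
        f"MERGE (t:{label} {{{key}: $key}}) "
        f"MERGE (e)-[:{relationship}]->(t)"
    )


print(link_query("LocalAuthority", "HAS_LOCAL_AUTHORITY"))
```

Using `MERGE` rather than `CREATE` means re-running the statement for the same URN and code does not produce duplicate nodes or relationships.
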
## Usage

### Command Line

You can run the GAIS data import with the startup script or directly with Python:

```bash
# Import GAIS data
./start.sh gais-data

# Or directly with Python
python main.py --mode gais-data
```

### Programmatic Usage

```python
from run.initialization.gais_data import import_gais_data

# Run the import; the result is a dict describing the outcome
result = import_gais_data()

if result["success"]:
    print(f"Successfully imported {result['total_rows']} records")
    print(f"Processing time: {result['processing_time']:.2f} seconds")
    print(f"Nodes created: {result['nodes_created']}")
    print(f"Relationships created: {result['relationships_created']}")
else:
    print(f"Import failed: {result['message']}")
```

## Data Sources

### Edubase All Data

The primary data source is the Edubase All Data CSV file, which contains information about all educational establishments in England and Wales.

**File Location**: `run/initialization/import/edubasealldata20250828.csv`

**Data Volume**: Approximately 51,900 records

**Key Fields**:
- URN (Unique Reference Number)
- Establishment details (name, type, status)
- Geographic information (address, coordinates, administrative areas)
- Educational characteristics (phase, gender, religious character)
- Capacity and pupil numbers
- Contact information
- Inspection details

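For a quick look at the file before importing, a short reader like the one below may help. It is a sketch under assumptions: the `cp1252` encoding and the `URN` column header are guesses to verify against the actual file, and `read_rows` is a hypothetical helper rather than part of the importer.

```python
import csv
from pathlib import Path
from typing import Iterator

# Path taken from the "File Location" entry above
CSV_PATH = Path("run/initialization/import/edubasealldata20250828.csv")


def read_rows(path: Path, encoding: str = "cp1252") -> Iterator[dict[str, str]]:
    """Yield each CSV record as a dict keyed by column header."""
    # newline="" lets the csv module handle line breaks inside quoted fields
    with path.open(newline="", encoding=encoding) as handle:
        yield from csv.DictReader(handle)


if CSV_PATH.exists():
    first = next(read_rows(CSV_PATH))
    print(sorted(first))     # column headers found in the file
    print(first.get("URN"))  # assumed header for the unique reference number
```

Because `read_rows` is a generator, the file is streamed one record at a time instead of being loaded into memory in full.
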
## Data Processing

### Batch Processing

The import system processes data in batches to optimize performance and memory usage:

- **Batch Size**: 1,000 records per batch
- **Processing**: Nodes are created first, then relationships
- **Error Handling**: A failure on an individual record does not stop the rest of the import

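The batching and per-record error handling described above can be sketched as follows. This illustrates the pattern only: `batched` and `process_all` are hypothetical names, and `handle` stands in for whatever the importer does with a single record.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

BATCH_SIZE = 1000  # batch size stated above


def batched(records: Iterable[dict], size: int = BATCH_SIZE) -> Iterator[list[dict]]:
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch


def process_all(records: Iterable[dict], handle: Callable[[dict], None]) -> tuple[int, int]:
    """Apply `handle` to every record and return (succeeded, failed) counts."""
    succeeded = failed = 0
    for batch in batched(records):
        for record in batch:
            try:
                handle(record)
                succeeded += 1
            except Exception:
                # One bad record must not abort the whole import
                failed += 1
    return succeeded, failed
```

Only one batch is held in memory at a time, which is why reducing the batch size is the first thing to try when memory is tight.
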
### Data Validation

The system automatically handles:
- Empty/blank values (excluded from node properties)
- "Not applicable" values (treated as empty)
- Date format conversion (DD-MM-YYYY to ISO 8601, `YYYY-MM-DD`)
- Numeric value parsing
- Duplicate node prevention

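A minimal sketch of these cleaning rules is shown below. The helper names are hypothetical, and two behaviours are assumptions rather than documented facts: "Not applicable" is matched case-insensitively, and malformed dates or numbers are turned into `None` instead of raising.

```python
from datetime import datetime
from typing import Optional, Union

EMPTY_MARKERS = {"", "not applicable"}


def clean_text(raw: Optional[str]) -> Optional[str]:
    """Return the stripped value, or None for blank or "Not applicable"."""
    value = (raw or "").strip()
    return None if value.lower() in EMPTY_MARKERS else value


def parse_date(raw: Optional[str]) -> Optional[str]:
    """Convert DD-MM-YYYY to YYYY-MM-DD; None if empty or malformed."""
    value = clean_text(raw)
    if value is None:
        return None
    try:
        return datetime.strptime(value, "%d-%m-%Y").date().isoformat()
    except ValueError:
        return None


def parse_number(raw: Optional[str]) -> Union[int, float, None]:
    """Parse an int, falling back to float; None if empty or not numeric."""
    value = clean_text(raw)
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        try:
            return float(value)
        except ValueError:
            return None


def build_properties(parsed: dict) -> dict:
    """Drop None values so empty fields never become node properties."""
    return {key: value for key, value in parsed.items() if value is not None}
```

For example, `build_properties({"openDate": parse_date("Not applicable"), "schoolCapacity": parse_number("210")})` keeps only `schoolCapacity`.
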
### Relationship Creation

Relationships are created using a two-pass approach:
1. **First Pass**: Create all nodes and build a mapping of keys to node objects
2. **Second Pass**: Create relationships between nodes using the mapping

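The two passes can be illustrated in memory, without a database. This is a sketch under assumptions: plain dicts stand in for node objects, only one lookup type is shown, and the column names `URN`, `LA (code)`, and `LA (name)` are hypothetical and should be checked against the CSV header.

```python
from typing import Iterable

# Hypothetical column names for one lookup type
CODE_COLUMN, NAME_COLUMN = "LA (code)", "LA (name)"


def two_pass(rows: Iterable[dict]) -> tuple[dict, list]:
    """Return (nodes keyed by (label, key), list of relationship triples)."""
    rows = list(rows)
    nodes: dict[tuple[str, str], dict] = {}

    # First pass: create each node once and index it by (label, key)
    for row in rows:
        nodes.setdefault(("Establishment", row["URN"]), {"urn": row["URN"]})
        code = row.get(CODE_COLUMN)
        if code:
            nodes.setdefault(
                ("LocalAuthority", code),
                {"code": code, "name": row.get(NAME_COLUMN)},
            )

    # Second pass: link nodes by looking both endpoints up in the mapping
    relationships = []
    for row in rows:
        source = ("Establishment", row["URN"])
        target = ("LocalAuthority", row.get(CODE_COLUMN))
        if source in nodes and target in nodes:
            relationships.append((source, "HAS_LOCAL_AUTHORITY", target))
    return nodes, relationships
```

Keying the mapping on `(label, key)` also gives duplicate prevention for free: a local authority shared by many establishments is created once and reused for every relationship.
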
## Performance Considerations

- **Memory Usage**: Data is processed in batches to minimize memory footprint
- **Database Connections**: Uses connection pooling for efficient database access
- **Duplicate Prevention**: Tracks created nodes to avoid duplicates
- **Error Resilience**: Continues processing even if individual records fail

## Future Enhancements

The GAIS import system is designed to be extensible for additional data sources:

1. **Governance Data** - School governance and management information
2. **Links Data** - Relationships between schools and other entities
3. **Groups Data** - Multi-academy trusts and federations
4. **Additional Sources** - Other publicly available educational datasets

## Troubleshooting

### Common Issues

1. **File Not Found**: Ensure the Edubase CSV file is at `run/initialization/import/edubasealldata20250828.csv`
2. **Database Connection**: Verify Neo4j is running and accessible
3. **Memory Issues**: Reduce the batch size if processing large datasets
4. **Permission Errors**: Check that the CSV data file is readable by the user running the import

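The file-related issues above can be checked before starting an import. The sketch below is a hypothetical pre-flight helper (`check_csv` is not part of the importer); it covers only the data file, not the database connection.

```python
import os
from pathlib import Path


def check_csv(path: Path) -> list[str]:
    """Return a list of problems found with the data file (empty if none)."""
    problems = []
    if not path.is_file():
        problems.append(f"File not found: {path}")
    elif not os.access(path, os.R_OK):
        problems.append(f"File is not readable: {path}")
    elif path.stat().st_size == 0:
        problems.append(f"File is empty: {path}")
    return problems


for problem in check_csv(Path("run/initialization/import/edubasealldata20250828.csv")):
    print(problem)
```
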
### Logging

The system provides comprehensive logging:
- Import progress updates
- Error details for failed records
- Performance metrics
- Node and relationship creation counts

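As an illustration of what such logging might look like with Python's standard `logging` module, consider the sketch below. The logger name `gais_import`, the message format, and the sample URN and error text are all made up for the example and do not reflect the importer's real output.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("gais_import")  # hypothetical logger name


def progress_message(done: int, total: int, elapsed: float) -> str:
    """Format a progress line with percentage complete and throughput."""
    rate = done / elapsed if elapsed > 0 else 0.0
    percent = 100 * done / total if total else 0.0
    return f"{done}/{total} records ({percent:.1f}%), {rate:.0f} records/s"


# Illustrative values only
logger.info(progress_message(1000, 51900, 2.0))
logger.warning("Skipping record %s: %s", "100000", "invalid date")
```
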
### Testing

Use the test script to verify functionality:

```bash
python test_gais_import.py
```

## Data Quality

The import system maintains data quality by:
- Filtering out invalid or empty values
- Converting data types appropriately
- Maintaining referential integrity
- Providing detailed error reporting

## Schema Compatibility

The imported data is compatible with the existing ClassroomCopilot schema and can be integrated with:
- Calendar structures
- User management systems
- Educational content management
- Analytics and reporting tools

## Support

For issues or questions related to the GAIS data import:

1. Check the logs for detailed error information
2. Verify data file format and content
3. Ensure database connectivity and permissions
4. Review the Neo4j schema constraints and indexes

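When reviewing constraints, it can help to know what a uniqueness constraint on the identifiers described in this document would look like. The sketch below generates such statements in Neo4j 5 Cypher syntax. Whether the project defines these constraints, and under which names, is an assumption; compare against the output of `SHOW CONSTRAINTS` in your database.

```python
# Assumed unique keys: `urn` is documented as the Establishment identifier,
# while `code` for LocalAuthority is an assumption.
UNIQUE_KEYS = {
    "Establishment": "urn",
    "LocalAuthority": "code",
}


def constraint_statement(label: str, prop: str) -> str:
    """Return a Cypher statement creating a uniqueness constraint if absent."""
    name = f"{label.lower()}_{prop}_unique"  # hypothetical constraint name
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


for label, prop in UNIQUE_KEYS.items():
    print(constraint_statement(label, prop))
```
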