High-performance PHP serialize/unserialize parser written in Rust with Python bindings.
pip install phpserialize-rs
[dependencies]
php-deserialize-core = "0.1"
from php_deserialize import loads, loads_json
# Basic usage
data = b'a:2:{s:4:"name";s:5:"Alice";s:3:"age";i:30;}'
result = loads(data)
print(result) # {'name': 'Alice', 'age': 30}
# Direct JSON conversion (optimized for Databricks)
json_str = loads_json(data)
print(json_str) # {"name":"Alice","age":30}
# Handle DB-escaped strings automatically
escaped = b'"a:1:{s:4:""key"";s:5:""value"";}"'
result = loads(escaped) # Auto-unescapes
print(result) # {'key': 'value'}
# Auto-fallback for encoding mismatches (no option needed!)
# Handles data serialized with EUC-KR but stored as UTF-8
mismatch = b's:4:"\xed\x95\x9c\xea\xb8\x80";' # "한글": declared length 4 is the EUC-KR byte count, but the UTF-8 payload is 6 bytes
result = loads(mismatch) # Automatically recovers
print(result) # '한글'
# Strict mode (disable auto-fallback)
result = loads(data, strict=True) # Raises on a length mismatch instead of recovering
# Error handling options
result = loads(data, errors="replace") # Replace invalid UTF-8
result = loads(data, errors="bytes") # Return bytes for invalid UTF-8
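The auto-fallback and strict mode described above can be illustrated with a small pure-Python sketch. This is not the library's Rust implementation; `read_php_string` is a hypothetical helper, and the recovery strategy (scanning for the closing `";` when the declared length does not line up) is an assumption about how such a fallback could work. Note that this naive scan stops at the first `";`, so it would mis-split a payload that itself contains that sequence.

```python
def read_php_string(buf: bytes, strict: bool = False) -> bytes:
    """Read one `s:<len>:"<data>";` token and return the raw payload bytes.

    Hypothetical helper for illustration only; not part of php_deserialize.
    """
    if not buf.startswith(b"s:"):
        raise ValueError("not a PHP string token")
    colon = buf.index(b":", 2)          # colon that ends the length field
    declared = int(buf[2:colon])
    start = colon + 2                   # skip the `:"` before the payload
    if buf[colon + 1:start] != b'"':
        raise ValueError("missing opening quote")

    # Fast path: the declared byte length lands exactly on the closing `";`.
    end = start + declared
    if buf[end:end + 2] == b'";':
        return buf[start:end]

    if strict:
        raise ValueError(f"declared length {declared} does not match payload")

    # Fallback (assumed strategy): ignore the declared length and scan
    # forward for the first terminator.
    end = buf.find(b'";', start)
    if end == -1:
        raise ValueError("unterminated string")
    return buf[start:end]


mismatch = b's:4:"\xed\x95\x9c\xea\xb8\x80";'
print(read_php_string(mismatch).decode("utf-8"))  # 한글
```

With `strict=True` the same input raises `ValueError`, mirroring the documented behavior of failing on a length mismatch.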
from php_deserialize.spark import php_to_json
from pyspark.sql.functions import get_json_object
# Convert PHP serialize to JSON (Arrow-optimized UDF)
df = spark.table("bronze.my_table")
df = df.withColumn("data_json", php_to_json("serialized_column"))
# Extract fields from JSON
df = df.withColumn("name", get_json_object("data_json", "$.name"))
df = df.withColumn("age", get_json_object("data_json", "$.age"))
df.display()
For Databricks installation:
%pip install phpserialize-rs
use php_deserialize_core::{from_bytes, PhpValue};
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = br#"a:2:{s:4:"name";s:5:"Alice";s:3:"age";i:30;}"#;
    let value = from_bytes(data)?;
    if let PhpValue::Array(items) = value {
        for (key, val) in items {
            println!("{:?} => {:?}", key, val);
        }
    }
    Ok(())
}
| Type | PHP Format | Example |
|---|---|---|
| Null | `N;` | `N;` |
| Boolean | `b:0;` / `b:1;` | `b:1;` |
| Integer | `i:<value>;` | `i:42;` |
| Float | `d:<value>;` | `d:3.14;` |
| String | `s:<len>:"<data>";` | `s:5:"hello";` |
| Array | `a:<count>:{...}` | `a:1:{i:0;s:3:"foo";}` |
| Object | `O:<len>:"<class>":<count>:{...}` | Object with properties |
| Reference | `R:<index>;` / `r:<index>;` | Circular references |
| Enum (PHP 8.1+) | `E:<len>:"<Class:Case>";` | `E:10:"Status:Active";` |
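To make the formats in the table concrete, here is a minimal pure-Python sketch of a recursive parser for the null, boolean, integer, float, string, and array rows. It is an illustration of the wire format only, not the library's parser: the `parse` function is hypothetical, it skips objects, references, and enums, and it does no validation of delimiters.

```python
def parse(buf: bytes, pos: int = 0):
    """Parse one serialized value starting at `pos`.

    Returns (value, next_pos). Illustrative sketch; covers only
    N, b, i, d, s and a from the table above.
    """
    tag = buf[pos:pos + 1]

    if tag == b"N":                        # N;
        return None, pos + 2

    if tag in (b"b", b"i", b"d"):          # b:1;  i:42;  d:3.14;
        end = buf.index(b";", pos)
        raw = buf[pos + 2:end].decode("ascii")
        if tag == b"b":
            return raw == "1", end + 1
        return (int(raw) if tag == b"i" else float(raw)), end + 1

    if tag == b"s":                        # s:<len>:"<data>";
        colon = buf.index(b":", pos + 2)
        length = int(buf[pos + 2:colon])   # length is in bytes, not characters
        start = colon + 2                  # skip `:"`
        text = buf[start:start + length].decode("utf-8")
        return text, start + length + 2    # skip `";`

    if tag == b"a":                        # a:<count>:{<key><value>...}
        colon = buf.index(b":", pos + 2)
        count = int(buf[pos + 2:colon])
        pos = colon + 2                    # skip `:{`
        result = {}
        for _ in range(count):
            key, pos = parse(buf, pos)
            result[key], pos = parse(buf, pos)
        return result, pos + 1             # skip `}`

    raise ValueError(f"unsupported type tag {tag!r} at position {pos}")


value, _ = parse(b'a:2:{s:4:"name";s:5:"Alice";s:3:"age";i:30;}')
print(value)  # {'name': 'Alice', 'age': 30}
```

PHP arrays are ordered maps, so the sketch returns a `dict` even for list-like input such as `a:1:{i:0;s:3:"foo";}`, which yields `{0: 'foo'}`.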
Benchmarked on Apple M1 Pro:
| Operation | Throughput |
|---|---|
| Simple array | ~1.5 GB/s |
| Nested structure | ~800 MB/s |
| Large string | ~2.0 GB/s |
Compared to php2json (Python):
The library provides detailed error messages for debugging:
from php_deserialize import loads, PhpDeserializeError
try:
    loads(b"invalid data")
except PhpDeserializeError as e:
    print(f"Parse error at position {e.position}: {e.message}")
When data is exported from databases (MySQL, PostgreSQL), the serialized value is often wrapped in double quotes, with every inner double quote doubled:
Original: a:1:{s:4:"key";s:5:"value";}
DB Export: "a:1:{s:4:""key"";s:5:""value"";}"
The library automatically detects and handles this format:
# Both work identically
loads(b'a:1:{s:4:"key";s:5:"value";}')
loads(b'"a:1:{s:4:""key"";s:5:""value"";}"')
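The detection step can be sketched in a few lines of plain Python. This is an assumption about how such unescaping might be done, not the library's actual code, and `unescape_db_export` is a hypothetical name. The check relies on the fact that a serialized value always begins with a type tag (`a`, `s`, `i`, and so on), so a leading double quote can only come from the export layer.

```python
def unescape_db_export(raw: bytes) -> bytes:
    """Undo CSV-style quoting around a PHP-serialized value.

    Hypothetical helper for illustration; not part of php_deserialize.
    """
    # Only values wrapped in double quotes are treated as DB exports.
    if len(raw) >= 2 and raw.startswith(b'"') and raw.endswith(b'"'):
        # Strip the outer quotes, then collapse each doubled quote.
        return raw[1:-1].replace(b'""', b'"')
    # Anything else is assumed to be plain serialized data.
    return raw


exported = b'"a:1:{s:4:""key"";s:5:""value"";}"'
print(unescape_db_export(exported))  # b'a:1:{s:4:"key";s:5:"value";}'
```

Unescaped input passes through unchanged, which is why both call forms above can produce the same result.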
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Licensed under either of:
at your option.